cleanup #78

Open · wants to merge 1 commit into master

6 changes: 3 additions & 3 deletions content/general_advice/intro.md
@@ -23,9 +23,9 @@ Therefore, this section is intended to **review potential issues on the ML side

The General Advice chapter is divided into 3 sections. Things become logically aligned if presented from the perspective of the training procedure (the fitting/loss minimisation part). That is, the sections group the validation items according to when they need to be investigated:

* Before training
* During training
* After training
* [Before training](./before/domains.md)
* [During training](./during/overfitting.md)
* [After training](./after/after.md)

---

4 changes: 2 additions & 2 deletions content/inference/conifer.md
@@ -19,15 +19,15 @@ All L1T algorithms require bit-exact emulation for performance studies and valid
Both the conifer FPGA firmware and C++ emulation use Xilinx's arbitrary precision types for fixed-point arithmetic (the `hls` external of CMSSW). This is cheaper and faster in the FPGA fabric than floating-point types. An important part of the model preparation process is choosing the proper fixed-point data types to avoid loss of performance compared to the trained model. Input preprocessing, in particular scaling, can help constrain the input variables to a smaller numerical range, but may also have a hardware cost to implement. In C++ the arbitrary precision types are specified as `ap_fixed<width, integer, rounding mode, saturation mode>`.
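
To get a feel for what the `ap_fixed` parameters control, here is a minimal NumPy sketch of fixed-point rounding and saturation (the function name, the example values `width=18, integer=8`, and the saturation convention are illustrative assumptions; the real arithmetic is performed by the HLS `ap_fixed` types with their configured rounding and saturation modes):

```python
import numpy as np

def quantize_fixed(x, width=18, integer=8, signed=True):
    """Round x to a grid with `width` total bits and `integer` integer bits,
    i.e. (width - integer) fractional bits, saturating out-of-range values.
    Illustrative sketch only, not the exact ap_fixed behaviour."""
    frac_bits = width - integer
    scale = 2.0 ** frac_bits
    if signed:
        lo, hi = -(2.0 ** (integer - 1)), 2.0 ** (integer - 1) - 1.0 / scale
    else:
        lo, hi = 0.0, 2.0 ** integer - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

x = np.array([0.1234567, 3.9, -200.0])
print(quantize_fixed(x))  # rounding in steps of 2^-10; -200 saturates to -128
```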

Minimal preparation from Python:
```
```python
import conifer
model = conifer. ... # convert or load a conifer model
# e.g. model = conifer.converters.convert_from_xgboost(xgboost_model)
model.save('my_bdt.json')
```

CMSSW C++ user code:
```
```c++
// include the conifer emulation header file
#include "L1Trigger/Phase2L1ParticleFlow/interface/conifer.h"

8 changes: 4 additions & 4 deletions content/inference/onnx.md
@@ -175,7 +175,7 @@ Let's construct the full example.

The example assumes the following directory structure:

```
```bash
MySubsystem/MyModule/
├── plugins/
@@ -216,7 +216,7 @@ Let's construct the full example.
Under `MySubsystem/MyModule/test`, run `#!bash cmsRun my_plugin_cfg.py` to launch our module. You should see output like the following, which includes the input and output vectors used in the inference process.

??? hint "Click to see the output"
```
```bash
...
19-Jul-2022 10:50:41 CEST Successfully opened file root://xrootd-cms.infn.it//store/mc/RunIISummer20UL18MiniAODv2/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/230000/4C8619B2-D0C0-4647-B946-B33754F4ED16.root
Begin processing the 1st record. Run 1, Event 27074045, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.494 CEST
@@ -291,7 +291,7 @@ print('output ->', outputs)

Under the directory `MySubsystem/MyModule/test`, run the example with `python3 my_standalone_test.py`. We should then see the output:

```
```bash
input -> [45. 46. 47. 48. 49. 50. 51. 52. 53. 54.]
output -> [[0.9956566 0.00434343]]
```
@@ -326,7 +326,7 @@ Please find details in the following block.
```

We should see the following output:
```
```bash
processing.examples.exampleOrtModule exampleOrtModuleConstr -N 10
Loading exampleOrtModuleConstr from PhysicsTools.NanoAODTools.postprocessing.examples.exampleOrtModule
Will write selected trees to outDir
2 changes: 1 addition & 1 deletion content/inference/tensorflow2.md
@@ -299,7 +299,7 @@ delete graphDef;

The example assumes the following directory structure:

```
```bash
MySubsystem/MyModule/
├── plugins/
6 changes: 3 additions & 3 deletions content/inference/tensorflow_aot.md
@@ -140,7 +140,7 @@ The following files should have been created upon success.

??? hint "SavedModel files"

```
```bash
/path/to/saved_model
├── variables/
@@ -270,7 +270,7 @@ Upon success, all generated files can be found in `$CMSSW_BASE/tfaot/test` and s

???+ hint "Generated files"

```
```bash
${CMSSW_BASE}/tfaot/test
├── lib/
@@ -398,7 +398,7 @@ std::tie(out1, out2) = model.run<tfaot::DoubleArrays, tfaot::Int32Arrays>(

The example assumes the following directory structure:

```
```bash
MySubsystem/MyModule/
├── plugins/
6 changes: 3 additions & 3 deletions content/optimization/data_augmentation.md
@@ -113,7 +113,7 @@ RDA methods augment the existing dataset by performance some transformation on t

In [Barnard et al., 2016][1e], the authors investigate the effect of parton shower modelling in DNN jet taggers using images of hadronically decaying W bosons. They introduce a method known as zooming to study the scale invariance of these networks. This is the RDA strategy used by [Dolan & Ore, 2021][1a]. Zooming is similar to a normalization procedure in that it standardizes features in the signal data, but it aims not to create similar features in the background.

After some standard data processing steps, including jet trimming and clustering via the $k_t$ algorithm, and some further processing to remove spatial symmetries, the resulting jet image depicts the leading subjet and subleading subjet directly below. [Barnard et al., 2016][1e] notes that the separation between the leading and subleading subjets varies linearly as $2m/p_T$ where $m$ and $p_T$ are the mass and transverse momentum of the jet. Standardizing this separation, or removing the linear dependence, would allow the DNN tagger to generalize to a wide range of jet $p_T$. To this end, the authors construct a factor, $R/\DeltaR_{act}$, where $R$ is some fixed value and $\DeltaR_{act}$ is the separation between the leading and subleading subjets. To discriminate between signal and background images with this factor, the authors enlarge the jet images by a scaling factor of $\text{max}(R/s,1)$ where $s = 2m_W/p_T$ and $R$ is the original jet clustering size. This process of jet image enlargement by a linear mass and $p_T$ dependent factor to account for the distane between the leading and subleading jet is known as zooming. This process can be thought of as an RDA technique to augment the data in a domain-specific way.
After some standard data processing steps, including jet trimming and clustering via the $k_t$ algorithm, and some further processing to remove spatial symmetries, the resulting jet image depicts the leading subjet and subleading subjet directly below. [Barnard et al., 2016][1e] notes that the separation between the leading and subleading subjets varies linearly as $2m/p_T$ where $m$ and $p_T$ are the mass and transverse momentum of the jet. Standardizing this separation, or removing the linear dependence, would allow the DNN tagger to generalize to a wide range of jet $p_T$. To this end, the authors construct a factor, $R/\Delta R_{act}$, where $R$ is some fixed value and $\Delta R_{act}$ is the separation between the leading and subleading subjets. To discriminate between signal and background images with this factor, the authors enlarge the jet images by a scaling factor of $\text{max}(R/s,1)$ where $s = 2m_W/p_T$ and $R$ is the original jet clustering size. This process of jet image enlargement by a linear mass and $p_T$ dependent factor to account for the distance between the leading and subleading subjets is known as zooming. This process can be thought of as an RDA technique to augment the data in a domain-specific way.

An advantage of the zooming technique is that it makes the construction of scale-invariant taggers easier. Scale-invariant searches, which are able to interpolate between the boosted and resolved parts of phase space, have the advantage of being applicable over a broad range of masses and kinematics, allowing a single search or analysis to be effective where previously more than one may have been necessary.
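
As a rough illustration of the zooming transformation described above, the sketch below rescales a pixelated jet image by $\text{max}(R/s,1)$ and crops back to the original grid (the function name, the W-mass constant, the interpolation order and the centre crop are illustrative assumptions, not the exact pipeline of Barnard et al., 2016):

```python
import numpy as np
from scipy.ndimage import zoom

def zoom_jet_image(image, pt, m_w=80.4, R=1.0):
    """Enlarge a 2D jet image by max(R/s, 1) with s = 2*m_W/pT,
    then centre-crop back to the original pixel grid."""
    s = 2.0 * m_w / pt
    factor = max(R / s, 1.0)
    zoomed = zoom(image, factor, order=1)   # bilinear interpolation
    ny, nx = image.shape
    y0 = (zoomed.shape[0] - ny) // 2
    x0 = (zoomed.shape[1] - nx) // 2
    return zoomed[y0:y0 + ny, x0:x0 + nx]

# toy example: a 25x25 image of a jet with pT = 400 GeV
img = np.random.rand(25, 25)
print(zoom_jet_image(img, pt=400.0).shape)  # -> (25, 25)
```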

@@ -168,7 +168,7 @@ Oversampling and undersampling are essentially opposite and roughly equivalent t

It has been shown that the combination of SMOTE and undersampling performs better than undersampling the majority class alone. However, plain over- and undersampling remain popular, as each is much easier to implement on its own than as part of a more complex hybrid approach.

**Synthetic Minority Over-sampling Technique (SMOTE)**
### Synthetic Minority Over-sampling Technique (SMOTE)
*Text mostly based on [Chawla et al., 2002][2j] and in part on [He et al., 2010][2k]*

In the case of the Synthetic Minority Over-sampling Technique (SMOTE), the minority class is oversampled by creating synthetic examples along the line segments joining any or all of the $k$-nearest neighbours in the minority class.
@@ -197,7 +197,7 @@ Extend X by SYNTHETIC_SAMPLES
```
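
As a complement to the pseudocode above, here is a minimal NumPy sketch of the SMOTE interpolation step (function and variable names are illustrative; for real use, the imbalanced-learn package provides a maintained implementation):

```python
import numpy as np

def smote_sample(X_minority, n_synthetic, k=5, seed=None):
    """Create synthetic minority examples along the line segments joining
    randomly chosen minority points to one of their k nearest minority
    neighbours (sketch of the idea in Chawla et al., 2002)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        dists = np.linalg.norm(X_minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        nn = X_minority[rng.choice(neighbours)]
        gap = rng.random()                        # uniform in [0, 1)
        synthetic.append(x + gap * (nn - x))      # point on the segment x -> nn
    return np.vstack(synthetic)

X_min = np.random.rand(20, 4)                     # toy minority-class features
X_aug = np.vstack([X_min, smote_sample(X_min, n_synthetic=40)])
```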


**Adaptive synthetic sampling approach (ADASYN)**
### Adaptive synthetic sampling approach (ADASYN)
*Text mostly based on [He et al., 2010][2k]*

Adaptive synthetic sampling approach (ADASYN) is a sampling approach for learning from imbalanced datasets. The main idea is to use a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn compared to those minority examples that are easier to learn. Thus, ADASYN improves learning with respect to the data distributions by reducing the bias introduced by the class imbalance and by adaptively shifting the classification boundary toward the difficult examples.
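
A minimal sketch of ADASYN's adaptive allocation step, following the same conventions as the SMOTE sketch above (Euclidean distances, illustrative function name, not an existing API); each minority point would then receive its allocated number of SMOTE-style synthetic samples:

```python
import numpy as np

def adasyn_allocation(X_min, X_maj, n_synthetic, k=5):
    """Distribute n_synthetic new samples over minority points according to
    their learning difficulty: the fraction of majority-class points among
    their k nearest neighbours (sketch of the idea in He et al., 2010)."""
    X_all = np.vstack([X_min, X_maj])
    is_majority = np.concatenate([np.zeros(len(X_min)), np.ones(len(X_maj))])
    ratios = []
    for x in X_min:
        dists = np.linalg.norm(X_all - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        ratios.append(is_majority[neighbours].mean())
    ratios = np.asarray(ratios)
    if ratios.sum() == 0:                         # no hard minority examples
        return np.zeros(len(X_min), dtype=int)
    weights = ratios / ratios.sum()
    return np.round(weights * n_synthetic).astype(int)

X_min, X_maj = np.random.rand(10, 3), np.random.rand(100, 3)
print(adasyn_allocation(X_min, X_maj, n_synthetic=90))  # samples per minority point
```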
4 changes: 2 additions & 2 deletions content/training/MLaaS4HEP.md
@@ -31,7 +31,7 @@ Here is a list of the dependencies:

### Installation
The easiest way to install and run [MLaaS4HEP](https://cloud.docker.com/u/veknet/repository/docker/veknet/mlaas4hep) and [TFaaS](https://cloud.docker.com/u/veknet/repository/docker/veknet/tfaas) is to use the pre-built Docker images
```
```bash
# run MLaaS4HEP docker container
docker run veknet/mlaas4hep
# run TFaaS docker container
@@ -43,7 +43,7 @@ MLaaS4HEP python repository provides the `reader.py` module that defines a DataR
[uproot](https://github.com/scikit-hep/uproot) framework.

Basic usage
```
```bash
# setup the proper environment, e.g.
# export PYTHONPATH=/path/src/python # path to MLaaS4HEP python framework
# export PATH=/path/bin:$PATH # path to MLaaS4HEP binaries