ML Hyperparameter tuning #2549
thomasflynn918 started this conversation in General
Replies: 2 comments 2 replies
-
This is great, @thomasflynn918. Thanks a lot for opening this issue.
-
Launching executables: https://github.com/radical-cybertools/radical.pilot/blob/devel/examples/00_getting_started.py
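Following the pattern in that getting-started example, one rough sketch of submitting one RADICAL-Pilot task per hyperparameter configuration could look like the block below. This is only a sketch: the resource label, core counts, parameter values, and output-file naming are placeholders, and the class/attribute names follow recent RADICAL-Pilot releases rather than the linked example verbatim.

```python
# Sketch only: one RADICAL-Pilot task per hyperparameter configuration.
import itertools
import radical.pilot as rp

# Placeholder parameter space (the real values would come from params.yaml).
space = {"lr": [0.001, 0.01, 0.1], "rank_approx": [2, 4, 8], "c": [0.1, 1.0, 10.0]}

session = rp.Session()
try:
    pmgr = rp.PilotManager(session=session)
    tmgr = rp.TaskManager(session=session)

    # One pilot holds the allocation; resource label and size are placeholders.
    pilot = pmgr.submit_pilots(rp.PilotDescription(
        {"resource": "local.localhost", "cores": 8, "runtime": 60}))
    tmgr.add_pilots(pilot)

    tasks = []
    for values in itertools.product(*space.values()):
        config = dict(zip(space.keys(), values))
        cli_args = []
        for k, v in config.items():
            cli_args += [f"--{k}", str(v)]
        td = rp.TaskDescription()
        td.executable = "python3"
        # Per-task output file name is a placeholder convention.
        td.arguments = ["test.py"] + cli_args + ["--metrics", "run_%03d.pt" % len(tasks)]
        tasks.append(td)

    tmgr.submit_tasks(tasks)
    tmgr.wait_tasks()
finally:
    session.close()
```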
-
Tagging @vrpascuzzi
These are some scripts that I've been using to do ML hyperparameter tuning on Cori and Perlmutter. There are four pieces: a driver script that actually does a run at a particular configuration, a YAML file specifying the hyperparameter space, a template for an sbatch script, and a Python script, launch.py, that enumerates the configs and submits a job for each configuration.
Note that the example driver script I included is completely trivial, but ordinarily it would be a PyTorch training script and we'd run each configuration on multiple nodes. There is also a Python notebook that aggregates the outputs from each job, matching them with their configurations, which can be used for filtering, summarizing the results, etc. I am attaching the scripts here and also including the contents of the archive's README.md, which has more details.
It will be interesting to see if something similar can be achieved within RADICAL-pilot.
How to use the launching scripts
1. Make the Python script that runs a single configuration. In this example, it is `test.py`. Our `test.py` takes four arguments. Three of them (`--lr`, `--rank_approx`, `--c`) are parameters of the algorithm we want to tune. The fourth one, `--metrics`, is used to specify an output file where the results should be saved. (A sketch of such a driver is given after this list.)
2. Make the template for the job submission. In this example it is `cori-template.sh`. This just includes the queue name, account, and running time, plus the Python command to call `test.py`. In our case it's just `srun python test.py --metrics $SLURM_JOBID.pt $suffix`. Here, `$suffix` is an environment variable that will be populated by our launching script when each job is submitted, and `--metrics $SLURM_JOBID.pt` specifies that we want to save all the data generated for that configuration into `$SLURM_JOBID.pt`. (A sketch of such a template also follows the list.)
3. Specify the range of arguments in YAML format. In this example we've done this in `params.yaml`. As you can see, we indicate 3 values for each of `lr`, `rank_approx`, and `c`, for a total of 27 configurations. (An example of the format appears after the list.)
4. Launch the jobs using `launch.py`: `python launch.py --params params.yaml --script cori-template.sh --outdir`. This will enumerate all the configurations, make the appropriate arg string, and call sbatch on each one. For each configuration we capture the JOBID generated by Slurm and save the correspondence between job IDs and configurations in a file `jobs.yaml`, which is also backed up to `jobs-MM-DD-YYYY-HH-MM-SS.yaml`. For the current example, the first few rows of `jobs.yaml` simply map job IDs to their configurations. (A sketch of the launcher logic follows the list.)

The scripts are attached here: launcher-main.zip
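For reference, a minimal sketch of a driver with the argument interface described in step 1 might look like the following; the actual `test.py` in the attached archive differs, and the argument types and saved fields here are assumptions.

```python
# Hypothetical driver sketch (the real test.py in the archive differs).
import argparse
import torch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, required=True)         # learning rate to tune
    parser.add_argument("--rank_approx", type=int, required=True)  # rank of the approximation
    parser.add_argument("--c", type=float, required=True)          # algorithm constant
    parser.add_argument("--metrics", type=str, required=True)      # where to save the results
    args = parser.parse_args()

    # A real driver would train a PyTorch model here; this stand-in just
    # records the configuration and a dummy score.
    results = {"lr": args.lr, "rank_approx": args.rank_approx, "c": args.c, "score": 0.0}
    torch.save(results, args.metrics)

if __name__ == "__main__":
    main()
```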
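The job-submission template from step 2 could be structured roughly as below; the queue, account, and node count are placeholders, and the real `cori-template.sh` sets site-specific values.

```bash
#!/bin/bash
#SBATCH --qos=regular        # queue name (placeholder)
#SBATCH --account=m0000      # project account (placeholder)
#SBATCH --time=00:30:00      # running time
#SBATCH --nodes=1

# $suffix holds the per-configuration argument string, e.g. "--lr 0.01 --rank_approx 4 --c 1.0",
# and is filled in by the launching script when the job is submitted.
srun python test.py --metrics $SLURM_JOBID.pt $suffix
```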
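A `params.yaml` of the kind described in step 3 might look like this; the parameter values are placeholders and the exact layout of the file in the archive may differ.

```yaml
# Three candidate values per parameter: 3 x 3 x 3 = 27 configurations.
lr: [0.001, 0.01, 0.1]
rank_approx: [2, 4, 8]
c: [0.1, 1.0, 10.0]
```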
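The launcher logic from step 4 can be sketched as follows, assuming the parameter space is read from the YAML file and the argument string is handed to the batch script through the `suffix` variable; the real `launch.py` may pass the suffix and record the job IDs differently.

```python
# Hypothetical launcher sketch (the real launch.py in the archive may differ).
import argparse
import datetime
import itertools
import shutil
import subprocess
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--params", required=True)   # e.g. params.yaml
parser.add_argument("--script", required=True)   # e.g. cori-template.sh
parser.add_argument("--outdir", default=".")     # where to write jobs.yaml
args = parser.parse_args()

with open(args.params) as f:
    space = yaml.safe_load(f)                    # e.g. {"lr": [...], "rank_approx": [...], "c": [...]}

names = list(space.keys())
jobs = {}
for values in itertools.product(*(space[n] for n in names)):
    config = dict(zip(names, values))
    suffix = " ".join(f"--{k} {v}" for k, v in config.items())
    # Submit one job per configuration; --parsable makes sbatch print just the job ID.
    out = subprocess.run(
        ["sbatch", "--parsable", f"--export=ALL,suffix={suffix}", args.script],
        check=True, capture_output=True, text=True)
    jobid = out.stdout.strip().split(";")[0]
    jobs[jobid] = config

# Save the job-ID -> configuration mapping, plus a timestamped backup.
with open(f"{args.outdir}/jobs.yaml", "w") as f:
    yaml.safe_dump(jobs, f)
stamp = datetime.datetime.now().strftime("%m-%d-%Y-%H-%M-%S")
shutil.copy(f"{args.outdir}/jobs.yaml", f"{args.outdir}/jobs-{stamp}.yaml")
```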
Using `jobs.yaml` we can obtain the correspondence between the configurations and the results in the various `JOBID.pt` files, and we can aggregate and summarize the results. See `status-example.ipynb` for an example; a minimal sketch of that aggregation step follows.
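This sketch assumes each `JOBID.pt` was written with `torch.save` and contains a `score` entry (both assumptions); the notebook in the archive does the same kind of join with more filtering and summarizing.

```python
import pandas as pd
import torch
import yaml

with open("jobs.yaml") as f:
    jobs = yaml.safe_load(f)              # {jobid: {"lr": ..., "rank_approx": ..., "c": ...}}

rows = []
for jobid, config in jobs.items():
    metrics = torch.load(f"{jobid}.pt")   # results saved by the driver for this configuration
    rows.append({**config, **metrics})

df = pd.DataFrame(rows)
print(df.sort_values("score").head())     # e.g. list the best configurations first ("score" is assumed)
```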