ML Hyperparameter tuning #2549
thomasflynn918 started this conversation in General
Replies: 2 comments 2 replies
-
This is great, @thomasflynn918. Thanks a lot for opening this issue.
-
Launching executables: https://github.com/radical-cybertools/radical.pilot/blob/devel/examples/00_getting_started.py
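Following the pattern in that getting-started example, one rough sketch of submitting one RADICAL-Pilot task per hyperparameter configuration could look like the block below. This is only a sketch: the resource label, core counts, parameter values, and output-file naming are placeholders, and the class/attribute names follow recent RADICAL-Pilot releases rather than the linked example verbatim.

```python
# Sketch only: one RADICAL-Pilot task per hyperparameter configuration.
import itertools
import radical.pilot as rp

# Placeholder parameter space (the real values would come from params.yaml).
space = {"lr": [0.001, 0.01, 0.1], "rank_approx": [2, 4, 8], "c": [0.1, 1.0, 10.0]}

session = rp.Session()
try:
    pmgr = rp.PilotManager(session=session)
    tmgr = rp.TaskManager(session=session)

    # One pilot holds the allocation; resource label and size are placeholders.
    pilot = pmgr.submit_pilots(rp.PilotDescription(
        {"resource": "local.localhost", "cores": 8, "runtime": 60}))
    tmgr.add_pilots(pilot)

    tasks = []
    for values in itertools.product(*space.values()):
        config = dict(zip(space.keys(), values))
        cli_args = []
        for k, v in config.items():
            cli_args += [f"--{k}", str(v)]
        td = rp.TaskDescription()
        td.executable = "python3"
        # Per-task output file name is a placeholder convention.
        td.arguments = ["test.py"] + cli_args + ["--metrics", "run_%03d.pt" % len(tasks)]
        tasks.append(td)

    tmgr.submit_tasks(tasks)
    tmgr.wait_tasks()
finally:
    session.close()
```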
-
Tagging @vrpascuzzi
These are some scripts that I've been using to do ML hyperparameter tuning on Cori and Perlmutter. There are four pieces: a driver script that actually does a run at a particular configuration, a YAML file specifying the hyperparameter space, a template for an sbatch script, and a Python script, launch.py, that enumerates the configs and submits a job for each configuration.
Note that the example driver script I included is completely trivial, but ordinarily it would be a PyTorch training script and we'd run each configuration on multiple nodes. There is also a Python notebook that aggregates the outputs from each job, matching them with their configurations, which can be used for filtering, summarizing the results, etc. I am attaching the scripts here and also including the contents of the archive's README.md, which has more details.
It will be interesting to see if something similar can be achieved within RADICAL-pilot.
How to use the launching scripts
1. Make the Python script that runs a single configuration. In this example, it is `test.py`. Our `test.py` takes four arguments. Three of them (`--lr`, `--rank_approx`, `--c`) are parameters of the algorithm we want to tune. The fourth one, `--metrics`, is used to specify an output file where the results should be saved. (A sketch of such a driver is given after this list.)
2. Make the template for the job submission. In this example it is `cori-template.sh`. This just includes the queue name, account, and running time, plus the Python command to call `test.py`. In our case it's just `srun python test.py --metrics $SLURM_JOBID.pt $suffix`. Here, `$suffix` is an environment variable that will be populated by our launching script when each job is submitted, and `--metrics $SLURM_JOBID.pt` specifies that we want to save all the data generated for that configuration into `$SLURM_JOBID.pt`. (A sketch of such a template also follows the list.)
3. Specify the range of arguments in YAML format. In this example we've done this in `params.yaml`. As you can see, we indicate 3 values for each of `lr`, `rank_approx`, and `c`, for a total of 27 configurations. (An example of the format appears after the list.)
4. Launch the jobs using `launch.py`: `python launch.py --params params.yaml --script cori-template.sh --outdir`. This will enumerate all the configurations, make the appropriate arg string, and call sbatch on each one. For each configuration we capture the JOBID generated by Slurm and save the correspondence between job IDs and configurations in a file `jobs.yaml`, which is also backed up to `jobs-MM-DD-YYYY-HH-MM-SS.yaml`. For the current example, the first few rows of `jobs.yaml` simply map job IDs to their configurations. (A sketch of the launcher logic follows the list.)

The scripts are attached here: launcher-main.zip
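For reference, a minimal sketch of a driver with the argument interface described in step 1 might look like the following; the actual `test.py` in the attached archive differs, and the argument types and saved fields here are assumptions.

```python
# Hypothetical driver sketch (the real test.py in the archive differs).
import argparse
import torch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, required=True)         # learning rate to tune
    parser.add_argument("--rank_approx", type=int, required=True)  # rank of the approximation
    parser.add_argument("--c", type=float, required=True)          # algorithm constant
    parser.add_argument("--metrics", type=str, required=True)      # where to save the results
    args = parser.parse_args()

    # A real driver would train a PyTorch model here; this stand-in just
    # records the configuration and a dummy score.
    results = {"lr": args.lr, "rank_approx": args.rank_approx, "c": args.c, "score": 0.0}
    torch.save(results, args.metrics)

if __name__ == "__main__":
    main()
```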
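The job-submission template from step 2 could be structured roughly as below; the queue, account, and node count are placeholders, and the real `cori-template.sh` sets site-specific values.

```bash
#!/bin/bash
#SBATCH --qos=regular        # queue name (placeholder)
#SBATCH --account=m0000      # project account (placeholder)
#SBATCH --time=00:30:00      # running time
#SBATCH --nodes=1

# $suffix holds the per-configuration argument string, e.g. "--lr 0.01 --rank_approx 4 --c 1.0",
# and is filled in by the launching script when the job is submitted.
srun python test.py --metrics $SLURM_JOBID.pt $suffix
```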
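A `params.yaml` of the kind described in step 3 might look like this; the parameter values are placeholders and the exact layout of the file in the archive may differ.

```yaml
# Three candidate values per parameter: 3 x 3 x 3 = 27 configurations.
lr: [0.001, 0.01, 0.1]
rank_approx: [2, 4, 8]
c: [0.1, 1.0, 10.0]
```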
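The launcher logic from step 4 can be sketched as follows, assuming the parameter space is read from the YAML file and the argument string is handed to the batch script through the `suffix` variable; the real `launch.py` may pass the suffix and record the job IDs differently.

```python
# Hypothetical launcher sketch (the real launch.py in the archive may differ).
import argparse
import datetime
import itertools
import shutil
import subprocess
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--params", required=True)   # e.g. params.yaml
parser.add_argument("--script", required=True)   # e.g. cori-template.sh
parser.add_argument("--outdir", default=".")     # where to write jobs.yaml
args = parser.parse_args()

with open(args.params) as f:
    space = yaml.safe_load(f)                    # e.g. {"lr": [...], "rank_approx": [...], "c": [...]}

names = list(space.keys())
jobs = {}
for values in itertools.product(*(space[n] for n in names)):
    config = dict(zip(names, values))
    suffix = " ".join(f"--{k} {v}" for k, v in config.items())
    # Submit one job per configuration; --parsable makes sbatch print just the job ID.
    out = subprocess.run(
        ["sbatch", "--parsable", f"--export=ALL,suffix={suffix}", args.script],
        check=True, capture_output=True, text=True)
    jobid = out.stdout.strip().split(";")[0]
    jobs[jobid] = config

# Save the job-ID -> configuration mapping, plus a timestamped backup.
with open(f"{args.outdir}/jobs.yaml", "w") as f:
    yaml.safe_dump(jobs, f)
stamp = datetime.datetime.now().strftime("%m-%d-%Y-%H-%M-%S")
shutil.copy(f"{args.outdir}/jobs.yaml", f"{args.outdir}/jobs-{stamp}.yaml")
```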
Using `jobs.yaml` we can obtain the correspondence between the configurations and the results in the various `JOBID.pt` files, and we can aggregate and summarize the results. See `status-example.ipynb` for an example; a minimal sketch of that aggregation step follows.
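This sketch assumes each `JOBID.pt` was written with `torch.save` and contains a `score` entry (both assumptions); the notebook in the archive does the same kind of join with more filtering and summarizing.

```python
import pandas as pd
import torch
import yaml

with open("jobs.yaml") as f:
    jobs = yaml.safe_load(f)              # {jobid: {"lr": ..., "rank_approx": ..., "c": ...}}

rows = []
for jobid, config in jobs.items():
    metrics = torch.load(f"{jobid}.pt")   # results saved by the driver for this configuration
    rows.append({**config, **metrics})

df = pd.DataFrame(rows)
print(df.sort_values("score").head())     # e.g. list the best configurations first ("score" is assumed)
```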