
scaling issues due to prolog tagging api #34

Open
rvencu opened this issue Jul 22, 2022 · 3 comments

Comments

@rvencu
Contributor

rvencu commented Jul 22, 2022

We ran into a scaling issue with the tagging in the prolog script.

As I understand it, the prolog runs at every step, and when many nodes are involved the job fails with timeouts.

We need to find another place to do the tagging. I understand the comment tag is job-related, but some other tags could be applied only once, when the instances are created (either because of the min value in the configuration, or because Slurm created them).

I am looking at places where this could be done.

Maybe it could be done on the head node instead, in PrologSlurmctld: https://slurm.schedmd.com/prolog_epilog.html
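For reference, the head-node variant mentioned above would be wired up with something like the following slurm.conf fragment (a sketch only; /opt/slurm/etc/tag_job.sh is a hypothetical script path, not part of this project):

```
# slurm.conf sketch: run the tagging script once per job on the head
# node, instead of in the per-node Prolog.
PrologSlurmctld=/opt/slurm/etc/tag_job.sh

# Optionally, PrologFlags=Alloc runs the regular Prolog at job
# allocation only, rather than at every step launch.
PrologFlags=Alloc
```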

@rvencu
Contributor Author

rvencu commented Jul 22, 2022

Some comments on this topic from the Slurm support team:

Considering the nature of this command in that it needs to run in parallel but async from the other prologs/epilogs. I think a SPANK plugin would fit better than a PREP plugin and avoid the need to write any non-trivial code.

For instance, this is a popular plugin to use lua with SPANK:

https://github.com/stanford-rc/slurm-spank-lua

I think slurm_spank_init_post_opt() is likely the function from which to call the tagging command.

@rvencu
Contributor Author

rvencu commented Jul 23, 2022

Looking more closely, I notice the loop in the prolog script. The prolog script runs on every compute node and at every step execution, and:

  • RPC calls to the head node (with scontrol) are discouraged
  • tagging all nodes from every node makes the problem quadratic (n^2)

I think we can still keep this in the prolog: have each node find its own instance ID with curl and tag itself with a single call. That way there are only n calls to the tagging API.

Not as good as async tagging, but much better anyway, I think.

@rvencu
Contributor Author

rvencu commented Jul 24, 2022

I changed the prolog script to PrologSlurmctld, and any job larger than 30 nodes crashes.

Then I tried this approach inside prolog.sh:

host=$(curl http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 create-tags --region "${cfn_region}" --resources "${host}" --tags Key=aws-parallelcluster-username,Value=${SLURM_JOB_USER} Key=aws-parallelcluster-jobid,Value=${SLURM_JOBID} Key=aws-parallelcluster-partition,Value=${SLURM_JOB_PARTITION}

This works for 40 nodes; I will test with larger jobs too. But I could not find a way to transport the comments yet.
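A slightly hardened version of the same self-tagging idea is sketched below. It assumes IMDSv2 is enabled and the node's instance profile allows ec2:CreateTags; RUN_TAGGING is a hypothetical guard variable (not part of ParallelCluster) so the functions can be sourced and inspected without touching AWS, and cfn_region is assumed to be set by the cluster environment as in the snippet above.

```shell
#!/bin/bash
# Sketch of a per-node self-tagging prolog (not the project's official
# script). Each node tags only itself, so n nodes make n API calls
# instead of n^2.
set -euo pipefail

build_tags() {
  # Emit one tag specification per line, read from the Slurm job env.
  printf '%s\n' \
    "Key=aws-parallelcluster-username,Value=${SLURM_JOB_USER}" \
    "Key=aws-parallelcluster-jobid,Value=${SLURM_JOBID}" \
    "Key=aws-parallelcluster-partition,Value=${SLURM_JOB_PARTITION}"
}

tag_self() {
  # Fetch an IMDSv2 session token, then this node's own instance ID,
  # and issue a single CreateTags call.
  local token host
  token=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
  host=$(curl -sf -H "X-aws-ec2-metadata-token: ${token}" \
    "http://169.254.169.254/latest/meta-data/instance-id")

  local -a tags
  mapfile -t tags < <(build_tags)
  aws ec2 create-tags --region "${cfn_region}" \
    --resources "${host}" --tags "${tags[@]}"
}

# RUN_TAGGING is a hypothetical guard: only call AWS when explicitly
# enabled, e.g. when this file actually runs as the prolog.
if [ -n "${RUN_TAGGING:-}" ]; then
  tag_self
fi
```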
