No documentation for using SciLuigi with SLURM #45

Open

pietromarchesi opened this issue Feb 1, 2018 · 13 comments

@pietromarchesi
Contributor

Hi,

I am trying to switch to SciLuigi from Luigi because I am interested in having support for SLURM. However, I cannot find any docs that show how to set up SlurmTask tasks. I am reading through the code, but it's still not clear to me what the intended usage is. Do you have any examples? I'd be happy to write up and contribute a documentation page once I get the hang of it; I think it would be a nice addition to the wiki.

Cheers,

Pietro

@samuell
Member

samuell commented Feb 1, 2018

Hi @pietromarchesi and sorry for the lack of documentation on SLURM. Contributions very welcome, e.g. for the wiki.

For now, you could have a look at our use case project for the sciluigi publication:

E.g. see these lines on how to send a slurminfo object to the new_task() factory method.

(The components used are available in this accompanying repo.)

As you can see, the info needed is the typical SLURM job details, plus a runmode, which lets you switch dynamically, in Python code, between running locally and running via SLURM. (In the linked code example, this is set up at the beginning of the script, on these lines.)
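
Roughly, a minimal sketch of how this can look, assuming the SlurmInfo constructor arguments as defined in sciluigi.slurm (the task, project, and partition names below are just placeholders, so double-check against the linked example):

import sciluigi as sl

class MyTask(sl.SlurmTask):
    # A task that is run through SLURM when runmode is RUNMODE_HPC
    def out_done(self):
        return sl.TargetInfo(self, 'mytask_done.txt')

    def run(self):
        # Commands passed to self.ex() get the SLURM prefix added
        self.ex('touch {path}'.format(path=self.out_done().path))

class MyWorkflow(sl.WorkflowTask):
    def workflow(self):
        # Typical SLURM job details, plus the runmode switch
        slurminfo = sl.SlurmInfo(
            runmode=sl.RUNMODE_HPC,   # or sl.RUNMODE_LOCAL to run locally
            project='myproject',
            partition='core',
            cores='1',
            time='1:00:00',
            jobname='mytask',
            threads='1')

        # Hand the SlurmInfo object to the new_task() factory method
        mytask = self.new_task('mytask', MyTask, slurminfo=slurminfo)
        return mytask

if __name__ == '__main__':
    sl.run_local(main_task_cls=MyWorkflow)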

Hope this helps!

@pietromarchesi
Contributor Author

Brilliant. Yeah, I couldn't figure out exactly where things had to be defined; now it's clear. I could write up a draft for a doc page where I extend the example workflow from these examples with the modifications necessary to run on SLURM, if you think that would be useful!

@samuell
Member

samuell commented Feb 1, 2018

I could write up a draft for a doc page where I extend the example workflow from these examples with the modifications necessary to run on SLURM, if you think that would be useful!

That'd be great!

@pietromarchesi
Contributor Author

I wrote a draft of a potential wiki page, which is now at this gist. If you have suggestions on how it can be improved or fixed, I will incorporate them and then add it to the wiki. The example works for me, but it may be a good idea to test it as well. Cheers

@pietromarchesi
Contributor Author

I also extended the example from the previous comment into a workflow where we run several instances of the same workflow in parallel, which you can find here. It would be great if you could take a look, because I am noticing that the workflows always get shipped to the same node (instead of being sent as batch jobs to different nodes). They also don't show up in squeue. Any idea why this could be the case?

@samuell
Member

samuell commented Feb 5, 2018

I am noticing that the workflows get shipped always to the same node (instead of being sent as batch jobs to different nodes).

How many cores per node are there? I think SLURM might schedule multiple smaller jobs together on the same node, as long as there are free cores on it. Any difference if you set cores equal to the number of cores you have per node?

They also don't show up in squeue. Any idea why this could be the case?

This one is stranger. It's a bit hard to tell without testing on a concrete system ... so many things can happen with SLURM. E.g. I've been surprised a few times by how salloc and srun work together (sometimes having jobs start just locally on the login node ... sometimes only starting one job per node, despite having many cores per node, etc.).

@pietromarchesi
Contributor Author

Hi Samuel, many thanks for your reply.

I can see jobs appearing in the queue now, although for some reason only fooreplacer and the call to the shell script that replaces foo with the hostname show up (and not foowriter).

Interestingly, jobs still appear sequentially in the queue, as if SciLuigi were waiting for one workflow to complete before creating the allocation for the next.

I am on a system with effectively 256 cores (64 cores with four hardware threads each), so I tried requesting 256 cores, but not much changed. In fact, I discovered that I was already getting all 256 cores even when asking for only one, so I was getting the whole node in any case.

The output of sacct shows only fooreplacer, and as you can see all three runs of the workflow ended up on the same computing node prod-0004.

$ sacct --jobs=7208,7209,7210 --user pcp0135 --format=JobID,JobName,Partition,Account,AllocNodes,AllocCPUS,State,NodeList
       JobID    JobName  Partition    Account AllocNodes  AllocCPUS      State        NodeList 
------------ ---------- ---------- ---------- ---------- ---------- ---------- --------------- 
7208         fooreplac+        knl    pcp0135          1        256  COMPLETED       prod-0004 
7208.0       replace_w+               pcp0135          1          1  COMPLETED       prod-0004 
7209         fooreplac+        knl    pcp0135          1        256  COMPLETED       prod-0004 
7209.0       replace_w+               pcp0135          1          1  COMPLETED       prod-0004 
7210         fooreplac+        knl    pcp0135          1        256  COMPLETED       prod-0004 
7210.0       replace_w+               pcp0135          1          1  COMPLETED       prod-0004 

@samuell
Member

samuell commented Feb 5, 2018

Interestingly, jobs still appear sequentially in the queue, as if SciLuigi was waiting for one workflow to be completed before creating the allocation for the next.

Hmm, are you specifying the number of workers to Luigi? The default is 1, and as it currently works, SciLuigi needs one worker per job in the queue ... We used to run 64 workers in our workflows, as a reasonable balance between getting enough jobs into the queue at once and not starting too many Python processes.
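
For example, one way to set that when starting the workflow, assuming the standard Luigi --workers command-line option (and reusing the hypothetical MyWorkflow class from the sketch above; 64 is just the example value mentioned):

import sciluigi as sl

if __name__ == '__main__':
    # Run with 64 Luigi workers, so up to 64 SLURM jobs can sit in the
    # queue at once (SciLuigi needs one worker per queued job)
    sl.run_local(main_task_cls=MyWorkflow, cmdline_args=['--workers=64'])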

@pietromarchesi
Contributor Author

Brilliant, that was it. And I guess the reason that only fooreplacer shows up in the queue is that it is the last task in the workflow. If you agree, I will add this parallel workflows example as a second part of the wiki page on Slurm (or a separate page entirely), where it will be easy to find.

@samuell
Member

samuell commented Feb 5, 2018

If you agree, I will add this parallel workflows example as a second part of the wiki page on Slurm (or a separate page entirely), where it will be easy to find.

Absolutely. Many thanks indeed!

@pietromarchesi
Contributor Author

Cool, I'm writing it up now.

Just to be sure: as far as I understand, the only way to send tasks as batch jobs is to make a call to ex(), and any other Python code that appears in the run() methods will be executed locally, is that right? And so the best way to run Python code on the cluster is to put it in a separate script and call it with ex().

@samuell
Member

samuell commented Feb 5, 2018

Just to be sure: as far as I understand, the only way to send tasks as batch jobs is to make a call to ex(), and any other Python code that appears in the run() methods will be executed locally, is that right? And so the best way to run Python code on the cluster is to put it in a separate script and call it with ex().

Exactly!
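
As a sketch of that pattern (process_data.py and the in/out names here are hypothetical placeholders):

import sciluigi as sl

class ProcessData(sl.SlurmTask):
    # Upstream target, wired up by the workflow
    in_data = None

    def out_processed(self):
        return sl.TargetInfo(self, self.in_data().path + '.processed')

    def run(self):
        # Python code written directly here runs locally on the
        # submitting node; only the command given to self.ex() is sent
        # to SLURM. So the heavy lifting goes into a separate script:
        self.ex('python process_data.py {inp} {outp}'.format(
            inp=self.in_data().path,
            outp=self.out_processed().path))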

@pietromarchesi
Contributor Author

I added the two pages to the wiki! The only thing is, in the menu on the right that GitHub generates automatically, the order of the pages got messed up, and the Using page now shows up last. Not sure how to fix that.

I again ran into the problem of jobs not showing up in squeue but showing up in sacct, which is a bit of a mystery.
