No documentation for using SciLuigi with SLURM #45

Open

pietromarchesi opened this issue Feb 1, 2018 · 13 comments

@pietromarchesi
Contributor

Hi,

I am trying to switch to SciLuigi from Luigi because I am interested in having support for SLURM. However, I cannot find any docs that show how to set up SlurmTask tasks. I am reading through the code, but it's still not clear to me what the intended usage is. Do you have any examples? I'd be happy to write up and contribute a documentation page once I get the hang of it; I think it would be a nice addition to the wiki.

Cheers,

Pietro

@samuell
Member

samuell commented Feb 1, 2018

Hi @pietromarchesi and sorry for the lack of documentation on SLURM. Contributions very welcome, e.g. for the wiki.

For now, you could have a look at our use case project for the sciluigi publication:

E.g. see these lines on how to send a slurminfo object to the new_task() factory method.

(The components used are available in this accompanying repo.)

As you can see, the info needed is the typical SLURM job details, plus a runmode, which lets you switch dynamically, in Python code, between running locally and running via SLURM. (In the linked code example, this is set up at the beginning of the script, on these lines.)
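
Roughly, a minimal sketch of how this can look, assuming the SlurmInfo constructor arguments as defined in sciluigi.slurm (the task, project, and partition names below are just placeholders, so double-check against the linked example):

import sciluigi as sl

class MyTask(sl.SlurmTask):
    # A task that is run through SLURM when runmode is RUNMODE_HPC
    def out_done(self):
        return sl.TargetInfo(self, 'mytask_done.txt')

    def run(self):
        # Commands passed to self.ex() get the SLURM prefix added
        self.ex('touch {path}'.format(path=self.out_done().path))

class MyWorkflow(sl.WorkflowTask):
    def workflow(self):
        # Typical SLURM job details, plus the runmode switch
        slurminfo = sl.SlurmInfo(
            runmode=sl.RUNMODE_HPC,   # or sl.RUNMODE_LOCAL to run locally
            project='myproject',
            partition='core',
            cores='1',
            time='1:00:00',
            jobname='mytask',
            threads='1')

        # Hand the SlurmInfo object to the new_task() factory method
        mytask = self.new_task('mytask', MyTask, slurminfo=slurminfo)
        return mytask

if __name__ == '__main__':
    sl.run_local(main_task_cls=MyWorkflow)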

Hope this helps!

@pietromarchesi
Contributor Author

Brilliant. Yeah, I couldn't figure out exactly where things had to be defined; now it's clear. I could write up a draft for a doc page where I extend the example workflow from these examples with the modifications necessary to run on SLURM, if you think that would be useful!

@samuell
Member

samuell commented Feb 1, 2018

I could write up a draft for a doc page where I extend the example workflow from these examples with the modifications necessary to run on SLURM, if you think that would be useful!

That'd be great!

@pietromarchesi
Contributor Author

I wrote a draft of a potential wiki page, which is now at this gist. If you have suggestions on how it can be improved or fixed, I will incorporate them and then add it to the wiki. The example works for me, but it may be a good idea to test it as well. Cheers

@pietromarchesi
Contributor Author

I also extended the example from the previous comment into a workflow where we run several instances of the same workflow in parallel, which you can find here. It would be great if you could take a look, because I am noticing that the workflows always get shipped to the same node (instead of being sent as batch jobs to different nodes). They also don't show up in squeue. Any idea why this could be the case?

@samuell
Member

samuell commented Feb 5, 2018

I am noticing that the workflows get shipped always to the same node (instead of being sent as batch jobs to different nodes).

How many cores per node are there? I think SLURM might schedule multiple smaller jobs together on the same node, as long as there are free cores on it. Any difference if you set cores equal to the number of cores you have per node?

They also don't show up in squeue. Any idea why this could be the case?

This one is stranger. It's a bit hard to tell without testing on a concrete system ... so many things can happen with SLURM. E.g. I've been surprised a few times by how salloc and srun work together (sometimes having jobs start just locally on the login node ... sometimes only starting one job per node, despite having many cores per node, etc.).

@pietromarchesi
Contributor Author

Hi Samuel, many thanks for your reply.

I can see jobs appearing in the queue now, although for some reason only fooreplacer and the call to the shell script that replaces foo with the hostname show up (and not foowriter).

Interestingly, jobs still appear sequentially in the queue, as if SciLuigi were waiting for one workflow to complete before creating the allocation for the next.

I am on a system with effectively 256 cores (64 cores with four hardware threads each), so I tried requesting 256 cores, but not much changed. In fact, I discovered that I was already getting all 256 cores even when asking for only one, so I was getting the whole node in any case.

The output of sacct shows only fooreplacer, and as you can see all three runs of the workflow ended up on the same computing node prod-0004.

$ sacct --jobs=7208,7209,7210 --user pcp0135 --format=JobID,JobName,Partition,Account,AllocNodes,AllocCPUS,State,NodeList
       JobID    JobName  Partition    Account AllocNodes  AllocCPUS      State        NodeList 
------------ ---------- ---------- ---------- ---------- ---------- ---------- --------------- 
7208         fooreplac+        knl    pcp0135          1        256  COMPLETED       prod-0004 
7208.0       replace_w+               pcp0135          1          1  COMPLETED       prod-0004 
7209         fooreplac+        knl    pcp0135          1        256  COMPLETED       prod-0004 
7209.0       replace_w+               pcp0135          1          1  COMPLETED       prod-0004 
7210         fooreplac+        knl    pcp0135          1        256  COMPLETED       prod-0004 
7210.0       replace_w+               pcp0135          1          1  COMPLETED       prod-0004 

@samuell
Member

samuell commented Feb 5, 2018

Interestingly, jobs still appear sequentially in the queue, as if SciLuigi was waiting for one workflow to be completed before creating the allocation for the next.

Hmm, are you specifying the number of workers to Luigi? The default is 1, and as it currently works, SciLuigi needs one worker per job in the queue ... We used to run 64 workers in our workflows, as a reasonable balance between getting enough jobs into the queue at once and not starting too many Python processes.
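
For example, one way to set that when starting the workflow, assuming the standard Luigi --workers command-line option (and reusing the hypothetical MyWorkflow class from the sketch above; 64 is just the example value mentioned):

import sciluigi as sl

if __name__ == '__main__':
    # Run with 64 Luigi workers, so up to 64 SLURM jobs can sit in the
    # queue at once (SciLuigi needs one worker per queued job)
    sl.run_local(main_task_cls=MyWorkflow, cmdline_args=['--workers=64'])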

@pietromarchesi
Contributor Author

Brilliant, that was it. And I guess the reason that only fooreplacer shows up in the queue is that it is the last task in the workflow. If you agree, I will add this parallel workflows example as a second part of the wiki page on Slurm (or a separate page entirely), where it will be easy to find.

@samuell
Member

samuell commented Feb 5, 2018

If you agree, I will add this parallel workflows example as a second part of the wiki page on Slurm (or a separate page entirely), where it will be easy to find.

Absolutely. Many thanks indeed!

@pietromarchesi
Contributor Author

Cool, I'm writing it up now.

Just to be sure: as far as I understand, the only way to send tasks as batch jobs is to make a call to ex(), and any other Python code that appears in the run() methods will be executed locally, is that right? And so the best way to run Python code on the cluster is to put it in a separate script and call it with ex().

@samuell
Member

samuell commented Feb 5, 2018

Just to be sure: as far as I understand, the only way to send tasks as batch jobs is to make a call to ex(), and any other Python code that appears in the run() methods will be executed locally, is that right? And so the best way to run Python code on the cluster is to put it in a separate script and call it with ex().

Exactly!
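
As a sketch of that pattern (process_data.py and the in/out names here are hypothetical placeholders):

import sciluigi as sl

class ProcessData(sl.SlurmTask):
    # Upstream target, wired up by the workflow
    in_data = None

    def out_processed(self):
        return sl.TargetInfo(self, self.in_data().path + '.processed')

    def run(self):
        # Python code written directly here runs locally on the
        # submitting node; only the command given to self.ex() is sent
        # to SLURM. So the heavy lifting goes into a separate script:
        self.ex('python process_data.py {inp} {outp}'.format(
            inp=self.in_data().path,
            outp=self.out_processed().path))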

@pietromarchesi
Contributor Author

I added the two pages to the wiki! The only thing is, in the menu on the right that GitHub generates automatically, the order of the pages got messed up, and the Using page now shows up last. Not sure how to fix that.

I again ran into the problem of jobs not showing up in squeue but showing up in sacct, which is a bit of a mystery.
