
job scripts for candide #13

Open
lbaumo opened this issue Jun 30, 2023 · 7 comments


lbaumo commented Jun 30, 2023

create a sample script to submit to the batch queue and a script version of the code


lbaumo commented Aug 29, 2023

There are still memory problems. I'm getting this error for the Gaussian process regressor:
Traceback (most recent call last):
  File "/home/baumont/.conda/envs/sp-peaks/lib/python3.6/site-packages/joblib/externals/loky/backend/resource_tracker.py", line 278, in main
    registry[rtype][name] -= 1
KeyError: '/dev/shm/joblib_memmapping_folder_28332_8b7b16870bc540a78fa449e97d0c2b55_4cb00576fda144e5bb3b654cf5effdde/28332-140497104158792-9f8cc414dcd74eef9ce42f5cf4cead4a.pkl'


lbaumo commented Aug 29, 2023

The solution appears to be here; I will give it a try.
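
For context (the linked fix isn't preserved in this copy, so this is an assumption): a commonly suggested workaround for this loky `resource_tracker` error is to move joblib's memmapping folder off the small `/dev/shm` tmpfs before the workers spin up, along these lines:

```python
import os

# Hypothetical scratch location; substitute whatever scratch space
# the candide cluster provides. Must be set before joblib creates memmaps.
scratch = os.path.join("/tmp", os.environ.get("USER", "user"), "joblib_tmp")
os.makedirs(scratch, exist_ok=True)

# joblib honours this environment variable for its temporary memmap folder.
os.environ["JOBLIB_TEMP_FOLDER"] = scratch
```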


lbaumo commented Aug 29, 2023

That solution works, but now there is a different issue:

Traceback (most recent call last):
  File "/home/baumont/software/shear-pipe-peaks/example/constraints_CFIS-P3.py", line 111, in <module>
    with Pool() as pool:
  File "/home/baumont/.conda/envs/sp-peaks/lib/python3.6/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/home/baumont/.conda/envs/sp-peaks/lib/python3.6/multiprocessing/pool.py", line 174, in __init__
    self._repopulate_pool()
  File "/home/baumont/.conda/envs/sp-peaks/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
    w.start()
  File "/home/baumont/.conda/envs/sp-peaks/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/baumont/.conda/envs/sp-peaks/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/baumont/.conda/envs/sp-peaks/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/baumont/.conda/envs/sp-peaks/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

martinkilbinger commented

Are you running it on a compute node?


lbaumo commented Aug 29, 2023

I can run it on my laptop and in a notebook, so I tried the script on the login node just to see whether the chains get started. I am now trying on a compute node and have increased the memory allocation:
```
JOBID: 283632.cmaster
USER: baumont
GROUP: baumont
JOBNAME: peaks-mcmc
SESSIONID: 231638
RESOURCESLIST: mem=10gb,neednodes=1:ppn=8,nodes=1:ppn=8,walltime=10:00:00
RESOURCESUSED: cput=00:19:58,mem=5583580kb,vmem=91802740kb,walltime=00:00:55
QUEUE: batch
JOB EXIT STATUS: 1
```
Hmm, it's weird that Python gives the error before the job script output.


lbaumo commented Aug 29, 2023

Increasing the memory gives the same error, but the usage does not really get close to the maximum:

```
JOBID: 283633.cmaster
USER: baumont
GROUP: baumont
JOBNAME: peaks-mcmc
SESSIONID: 232911
RESOURCESLIST: mem=32gb,neednodes=1:ppn=24,nodes=1:ppn=24,walltime=10:00:00
RESOURCESUSED: cput=00:19:23,mem=4153168kb,vmem=90186128kb,walltime=00:00:53
QUEUE: batch
JOB EXIT STATUS: 1
```


lbaumo commented Aug 29, 2023

Ah, so apparently when using a multiprocessing.Pool, the default way to start processes is fork. The issue with fork is that the entire parent process is duplicated for each worker, and the script was using a default number of processes equal to the number of cores on the node, even when I did not allocate the entire node in the batch queue script. I made the number of processes a user input (sketched below), and now the job is running. Hopefully it works.
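
A minimal sketch of that change, assuming the pool size is exposed as an `--nproc` command-line flag (the flag name and the `run_chain` placeholder are illustrative, not the actual `constraints_CFIS-P3.py` code):

```python
import argparse
from multiprocessing import Pool


def run_chain(seed):
    """Placeholder for one MCMC chain; the real work lives in the script."""
    return seed ** 2


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--nproc", type=int, default=4,
        help="number of worker processes; match the ppn request instead of "
             "defaulting to every core on the node",
    )
    args = parser.parse_args()

    # Pool() with no argument forks os.cpu_count() workers, each duplicating
    # the parent's address space -- that is what exhausted the allocation.
    with Pool(processes=args.nproc) as pool:
        results = pool.map(run_chain, range(8))
    print(results)
```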
