Utility/Wrapper for interacting with the Computational Memory Lab's computing resources using the Dask distributed computing framework
Clone this repository (using SSH):

```bash
git clone git@github.com:pennmem/cmldask.git
```
Install using pip:

```bash
pip install -e cmldask -r cmldask/requirements.txt
```
See the included notebooks for more detailed instructions and examples, but in short:
- Initialize a new client with Dask: `client = CMLDask.new_dask_client_{sge,slurm}("job_name", "1GB")` (depending on whether you're connecting to SGE or SLURM for job scheduling).
- Optional: open the dashboard using the instructions printed when you run the first step.
- Define some function `func` that takes an argument `arg` (or many arguments) and returns a result.
- Define an iterable of `arguments` for `func` that you want to compute in parallel.
- Use the client to map these arguments to the function on different parallel workers: `futures = client.map(func, arguments)`. Dask's Futures API launches parallel jobs and starts computing results immediately, but delays collecting them from distributed memory.
- Gather the results into the memory of your current process: `result = client.gather(futures)`. A complete sketch of this workflow follows the list.
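Putting these steps together, here is a minimal end-to-end sketch. The SLURM constructor and its arguments follow the call shown above; the import path, the `square` function, and its arguments are illustrative assumptions, and `client.map`/`client.gather` are standard Dask distributed API:

```python
from cmldask import CMLDask  # assumed import path

# Start a client on the SLURM scheduler; use new_dask_client_sge on SGE
# clusters. "1GB" is the memory requested per worker.
client = CMLDask.new_dask_client_slurm("square_job", "1GB")

# An illustrative placeholder function to run in parallel.
def square(x):
    return x ** 2

arguments = range(10)

# Launch one task per argument on the parallel workers (Futures API).
futures = client.map(square, arguments)

# Collect the finished results back into this process's memory.
results = client.gather(futures)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

client.close()
```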
Refer to the Dask Futures documentation, which covers this preferred approach to parallel computing in depth.
Some cases might warrant using Dask Delayed, which evaluates functions lazily.
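As a quick illustration of the difference, here is a minimal sketch using the standard `dask.delayed` decorator; the `increment` and `add` functions are placeholders:

```python
import dask

@dask.delayed
def increment(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Nothing executes yet: each call only adds a node to the task graph.
total = add(increment(1), increment(2))

# compute() evaluates the graph, using the distributed client if one is active.
print(total.compute())  # 5
```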
Dask clients, once initialized in your kernel, will also automatically parallelize operations on a Dask DataFrame or a Dask Array. It is straightforward to convert between standard pandas, numpy, and xarray objects and their Dask counterparts for distributed computing. If your data fits comfortably in memory, though, you are unlikely to benefit much from these implementations, since they carry significant overhead.
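A minimal sketch of those conversions, using the standard `dask.dataframe.from_pandas` and `dask.array.from_array` entry points (the sample data is made up for illustration):

```python
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# pandas -> dask: split the frame into partitions that workers can process.
pdf = pd.DataFrame({"subject": ["R1001P", "R1002P"], "recall": [0.31, 0.27]})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf["recall"].mean().compute())  # .compute() returns a concrete value

# numpy -> dask: chunk the array; operations run chunk-by-chunk in parallel.
arr = np.arange(1_000_000)
darr = da.from_array(arr, chunks=100_000)
print(darr.sum().compute())
```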
Here is a good YouTube walkthrough of the Dask interactive monitoring dashboard.