-
A question brought up in the developers meeting. We should discuss, determine, and advertise some best practices for shipping user environments among the various ways that people use coffea. It would be nice if we could figure out a task-queue-independent way of presenting this interface.
-
The answer to this can easily become as complex as the answer to "how do I install coffea", which the installation documentation attempts to answer. The best answer largely depends on the environment you work in. For some cases, it is indeed as simple as `pip install coffea`, but there are often cases where that isn't the best solution. The guide tries to provide several options with the hope that one of them will feel familiar to a user, and may be the lowest barrier to entry. Here I'll try to do the same for user code.

**Shared filesystem**

If you are running coffea at an analysis facility with a shared filesystem mounted on the worker nodes of a batch farm, essentially nothing needs to be done. As long as your distributed worker processes are launched in the same working directory as when you run with a local executor, everything should "just work".

**HTCondor file transfer mechanism**

If you plan to use a condor queue to either submit pre-defined jobs using local execution on a worker, or to use X-on-condor (dask, parsl, workqueue, etc.) distributed workers, you can probably specify `transfer_input_files` in the job description for the workers. If you include all the local python files or directories you plan to access, and ensure they are unpacked in the same relative path on the worker as from your client process, the modules should be importable by the worker process. If you have installed extra modules via …

**HTCondor containerized approach**

As discussed above, the main issue with python virtual environments is the absolute path. A second consideration is that if the environment is created from scratch (i.e. not on top of LCG) then it can quickly become very large (500 MB or more) after installing all the standard packages (numpy, scipy, etc.). This can be avoided if the environment is containerized. For example, by using a pre-built image of coffea and bind-mounting a working directory to the same location in the client via

```
singularity shell -B ${PWD}:/srv /cvmfs/unpacked.cern.ch/registry.hub.docker.com/coffeateam/coffea-base:latest
python -m venv --without-pip --system-site-packages .env
```

then the worker job submission can take advantage of the HTCondor singularity support to start a worker process in the same absolute directory (… to the job submission file, along with a …

**Distributed executor native solutions**

Some executors provide a built-in solution to transfer user files. I only know the details for the dask case.

**Dask**

For example, dask provides `upload_file`. If you upload a single python file, it will be importable as a module on the workers. You can also upload an entire package egg or zipball if you have set up your user code as a proper python package. A more rudimentary solution for directories is to simply zip them. If you have a module-like directory, then after

```python
shutil.make_archive("myanalysis", "zip", base_dir="myanalysis")
dask_client.upload_file("myanalysis.zip")
```

the module … One word of caution: any data files loaded in modules from a zipball need proper treatment. E.g. if you use something to the effect of

```python
with gzip.open(os.path.join(os.path.dirname(__file__), "correctiondata.pkl.gz")) as fin:
    correctiondata = pickle.load(fin)
```

in your module, you will need to modify it to work in a zip-safe manner by using the `importlib.resources` standard library:

```python
with importlib.resources.path("myanalysis", "correctiondata.pkl.gz") as path:
    with gzip.open(path) as fin:
        correctiondata = pickle.load(fin)
```

**Cloudpickle**

If you've wondered how everything magically works with all executors when your user code is entirely in one file, this is thanks to cloudpickle. This package extends the python standard library …
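Since the paragraph above is cut off, a quick illustration of the gap cloudpickle fills may help. The standard-library `pickle` serializes functions by reference (module plus qualified name), so a worker can only unpickle them if it can import the same module; anything defined interactively or as a lambda has no importable home. A stdlib-only sketch (cloudpickle itself is not imported here):

```python
import pickle

# Plain pickle stores functions by reference, not by value: it records the
# module and qualified name and expects the receiving process to import it.
# A lambda has the qualified name "<lambda>", which cannot be looked up, so
# serialization fails outright:
square = lambda x: x * x  # noqa: E731

try:
    pickle.dumps(square)
    print("pickled by reference (importable module)")
except (pickle.PicklingError, AttributeError) as exc:
    print("plain pickle refuses:", exc)

# cloudpickle instead serializes the function's bytecode and closure, which
# is why single-file user code "just works" with every executor.
```

This is the same reason processors defined in a notebook cell ship to dask workers without any of the file-transfer machinery above.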
-
Bumping an old thread at the suggestion of @nsmith-. Here is how we handle python environments in the WQ world. We have a set of internal tools …

If the end user has no idea what their dependencies are, …

Then, use …

Finally, …

Then, there are two ways to integrate these tools with Coffea.

Diagram here: …
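The tool names are lost above, but the first step — figuring out what a user script actually depends on — can be sketched with the standard library alone. This toy scanner (a hypothetical `find_imports` helper, not the actual WQ tooling) walks a script's AST and collects the top-level module names it imports:

```python
import ast

def find_imports(source: str) -> set[str]:
    """Collect top-level module names imported by a piece of Python source."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import numpy as np" / "import os.path" -> numpy, os
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from coffea.processor import ProcessorABC" -> coffea
            # (relative imports, node.level > 0, are intra-package and skipped)
            modules.add(node.module.split(".")[0])
    return modules

script = (
    "import numpy as np\n"
    "from coffea.processor import ProcessorABC\n"
    "import os.path\n"
)
print(sorted(find_imports(script)))  # ['coffea', 'numpy', 'os']
```

A real tool would then map these module names to pip/conda package names and pin versions from the current environment, but the discovery step is essentially this.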