-
A question brought up in the developers meeting. We should discuss, determine, and advertise some best practices for shipping user environments among the various ways that people use coffea. It would be nice if we could figure out a task-queue-independent way of presenting this interface.
-
The answer to this can easily become as complex as the answer to "how do I install coffea", which the installation documentation attempts to answer. The best answer largely depends on the environment you work in. For some cases, it is indeed as simple as `pip install coffea`, but there are often cases where that isn't the best solution. The guide tries to provide several options with the hope that one of them will feel familiar to a user, and may be the lowest barrier to entry. Here I'll try to do the same for user code.

**Shared filesystem**

If you are running coffea at an analysis facility with a shared filesystem mounted on the worker nodes of a batch farm, essentially nothing needs to be done. As long as your distributed worker processes are launched in the same working directory as when you run with a local executor, everything should "just work".

**HTCondor file transfer mechanism**

If you plan to use a condor queue to either submit pre-defined jobs using local execution on a worker, or to use X-on-condor (dask, parsl, workqueue, etc.) distributed workers, you can probably specify `transfer_input_files` in the job description for the workers. If you include all the local python files or directories you plan to access, and ensure they are unpacked in the same relative path on the worker as from your client process, the modules should be importable by the worker process. If you have installed extra modules via …

**HTCondor containerized approach**

As discussed above, the main issue with python virtual environments is the absolute path. A second consideration is that if the environment is created from scratch (i.e. not on top of LCG) then it can quickly become very large (500 MB or more) after installing all the standard packages (numpy, scipy, etc.). This can be avoided if the environment is containerized. For example, by using a pre-built image of coffea and bind-mounting a working directory to the same location in the client via

```
singularity shell -B ${PWD}:/srv /cvmfs/unpacked.cern.ch/registry.hub.docker.com/coffeateam/coffea-base:latest
python -m venv --without-pip --system-site-packages .env
```

then the worker job submission can take advantage of the HTCondor singularity support to start a worker process in the same absolute directory (… to the job submission file, along with a …

**Distributed executor native solutions**

Some executors provide a built-in solution to transfer user files. I only know the details for the dask case.

**Dask**

For example, dask provides `upload_file`. If you upload a single python file, it will be importable as a module on the workers. You can also upload an entire package egg or zipball if you have set up your user code as a proper python package. A more rudimentary solution for directories is to simply zip them. If you have a module-like directory, then after

```python
shutil.make_archive("myanalysis", "zip", base_dir="myanalysis")
dask_client.upload_file("myanalysis.zip")
```

the module … One word of caution: any data files loaded in modules from a zipball need proper treatment. E.g. if you use something to the effect of

```python
with gzip.open(os.path.join(os.path.dirname(__file__), "correctiondata.pkl.gz")) as fin:
    correctiondata = pickle.load(fin)
```

in your module, you will need to modify it to work in a zip-safe manner by using the `importlib.resources` standard library:

```python
with importlib.resources.path("myanalysis", "correctiondata.pkl.gz") as path:
    with gzip.open(path) as fin:
        correctiondata = pickle.load(fin)
```

**Cloudpickle**

If you've wondered how everything magically works with all executors when your user code is entirely in one file, this is thanks to cloudpickle. This package extends the python standard library …
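Since the paragraph above is cut off, a quick illustration of the gap cloudpickle fills may help. The standard-library `pickle` serializes functions by reference (module plus qualified name), so a worker can only unpickle them if it can import the same module; anything defined interactively or as a lambda has no importable home. A stdlib-only sketch (cloudpickle itself is not imported here):

```python
import pickle

# Plain pickle stores functions by reference, not by value: it records the
# module and qualified name and expects the receiving process to import it.
# A lambda has the qualified name "<lambda>", which cannot be looked up, so
# serialization fails outright:
square = lambda x: x * x  # noqa: E731

try:
    pickle.dumps(square)
    print("pickled by reference (importable module)")
except (pickle.PicklingError, AttributeError) as exc:
    print("plain pickle refuses:", exc)

# cloudpickle instead serializes the function's bytecode and closure, which
# is why single-file user code "just works" with every executor.
```

This is the same reason processors defined in a notebook cell ship to dask workers without any of the file-transfer machinery above.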
-
Bumping an old thread at the suggestion of @nsmith-. Here is how we handle python environments in the WQ world. We have a set of internal tools …

If the end user has no idea what their dependencies are, …

Then, use …

Finally, …

Then, there are two ways to integrate these tools with Coffea.

Diagram here: …
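The tool names are lost above, but the first step — figuring out what a user script actually depends on — can be sketched with the standard library alone. This toy scanner (a hypothetical `find_imports` helper, not the actual WQ tooling) walks a script's AST and collects the top-level module names it imports:

```python
import ast

def find_imports(source: str) -> set[str]:
    """Collect top-level module names imported by a piece of Python source."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import numpy as np" / "import os.path" -> numpy, os
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from coffea.processor import ProcessorABC" -> coffea
            # (relative imports, node.level > 0, are intra-package and skipped)
            modules.add(node.module.split(".")[0])
    return modules

script = (
    "import numpy as np\n"
    "from coffea.processor import ProcessorABC\n"
    "import os.path\n"
)
print(sorted(find_imports(script)))  # ['coffea', 'numpy', 'os']
```

A real tool would then map these module names to pip/conda package names and pin versions from the current environment, but the discovery step is essentially this.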