Merge pull request #204 from CoffeaTeam/moredocs
More docs
Showing 7 changed files with 214 additions and 96 deletions.
@@ -0,0 +1,69 @@
Coffea concepts
===============

This page explains concepts and terminology used within the coffea package.
It is intended to provide a high-level overview, while details can be found in other sections of the documentation.

.. _def-columnar-analysis:

Columnar analysis
-----------------
Columnar analysis is a paradigm for writing an analysis application. It is best described
in contrast to the traditional paradigm in high-energy particle physics (HEP) of using an event loop. In an event loop, the analysis operates row-wise
on the input data (in HEP, one row usually corresponds to one reconstructed particle collision event). Each row
is a structure containing several fields, such as the properties of the visible outgoing particles
that were reconstructed in a collision event. The analysis code manipulates this structure to either output derived
quantities or summary statistics in the form of histograms. In contrast, columnar analysis operates on individual
columns of data spanning a *chunk* (partition, batch) of rows, applying `array programming <https://en.wikipedia.org/wiki/Array_programming>`_
primitives in turn to compute derived quantities and summary statistics. Array programming is widely used within
the `scientific python ecosystem <https://www.scipy.org/about.html>`_, supported by the `numpy <https://numpy.org/>`_ library.
However, although the existing scientific python stack is fully capable of analyzing rectangular arrays (i.e.
no variable-length array dimensions), HEP data is very irregular, and manipulating it can become awkward without
first generalizing array structure a bit. The `awkward <https://github.com/scikit-hep/awkward-array>`_ package does this,
extending array programming capabilities to the complexity of HEP data.
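
As a minimal sketch of the contrast described above, the following uses plain numpy on rectangular columns (one value per event). The column names ``px`` and ``py`` and both helper functions are hypothetical, chosen only for illustration, and are not part of coffea.

.. code-block:: python

    import numpy as np

    # Hypothetical input: one momentum component per event (rectangular columns).
    rng = np.random.default_rng(seed=42)
    px = rng.normal(size=1000)
    py = rng.normal(size=1000)


    def pt_event_loop(px, py):
        """Row-wise: visit one event at a time and compute its derived quantity."""
        out = []
        for x, y in zip(px, py):
            out.append((x * x + y * y) ** 0.5)
        return np.array(out)


    def pt_columnar(px, py):
        """Column-wise: a single array expression over the whole chunk."""
        return np.sqrt(px ** 2 + py ** 2)


    # The derived column is then summarized as a histogram.
    pt = pt_columnar(px, py)
    counts, edges = np.histogram(pt, bins=20, range=(0.0, 5.0))

Both functions produce the same derived column. Only the columnar form hands the loop over events to numpy. Columns with a variable number of entries per event are outside the scope of this sketch.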

.. image:: images/columnar.png
    :width: 70 %
    :align: center

.. _def-processor:

Coffea processor
----------------
In almost all HEP analyses, each row corresponds to an independent event, and it is exceptionally rare
to need to compute inter-row derived quantities. Due to this, horizontal scale-out is almost trivial:
each chunk of rows can be operated on independently. Further, if the output of an analysis is restricted
to reducible accumulators such as histograms (abstracted by `AccumulatorABC`), then outputs can even be merged via tree reduction.
The `ProcessorABC` class is an abstraction to encapsulate analysis code so that it can be easily scaled out, leaving
the delivery of input columns and reduction of output accumulators to the coffea framework.
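
The idea of independent chunks with reducible outputs can be sketched without coffea. In the example below, ``process``, ``merge`` and ``tree_reduce`` are hypothetical helpers written for illustration (they are not the `ProcessorABC` or `AccumulatorABC` interfaces), and the histogram binning is an arbitrary assumption.

.. code-block:: python

    import numpy as np

    EDGES = np.linspace(0.0, 100.0, 11)  # 10 bins, chosen only for illustration


    def process(chunk):
        """Per-chunk work: fill a histogram from one chunk of rows."""
        counts, _ = np.histogram(chunk, bins=EDGES)
        return counts


    def merge(a, b):
        """Histograms with identical binning are reducible: add bin contents."""
        return a + b


    def tree_reduce(outputs, merge):
        """Merge a non-empty list of outputs pairwise until one remains."""
        outputs = list(outputs)
        while len(outputs) > 1:
            paired = []
            for i in range(0, len(outputs) - 1, 2):
                paired.append(merge(outputs[i], outputs[i + 1]))
            if len(outputs) % 2:
                # An unpaired output is carried over to the next round.
                paired.append(outputs[-1])
            outputs = paired
        return outputs[0]


    data = np.arange(100.0)
    chunks = np.array_split(data, 5)
    total = tree_reduce([process(c) for c in chunks], merge)

Because the merge is associative, the result does not depend on how the rows were split into chunks or in which order partial outputs are combined.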

.. _def-scale-out:

Scale-out
---------
Often, the computation requirements of a HEP data analysis exceed the resources of a single thread of execution.
To facilitate parallelization and allow the user to access more compute resources, coffea employs various *executors*
to ease the transition from a local analysis on a small set of test data to a full-scale analysis.
The executors roughly fall into two categories: local and distributed.

.. _def-local-executors:

Local executors
^^^^^^^^^^^^^^^
Currently, two local executors exist: `iterative_executor` and `futures_executor`.
The iterative executor simply processes each chunk of an input dataset in turn, using the current
python thread. The futures executor employs python `multiprocessing` to spawn multiple python processes
that process chunks in parallel on the machine. Processes are used rather than threads to avoid
performance limitations due to the CPython `global interpreter lock <https://wiki.python.org/moin/GlobalInterpreterLock>`_.
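
The two local execution styles can be illustrated with the python standard library alone. This is only a sketch: ``work``, ``run_iterative`` and ``run_futures`` are hypothetical stand-ins rather than the coffea executors, and ``concurrent.futures.ProcessPoolExecutor`` is assumed here as one convenient way to spawn worker processes.

.. code-block:: python

    from concurrent.futures import ProcessPoolExecutor


    def work(chunk):
        """Stand-in for per-chunk analysis code: sum the values in the chunk."""
        return sum(chunk)


    def run_iterative(function, chunks):
        """Process each chunk in turn in the current thread."""
        return [function(chunk) for chunk in chunks]


    def run_futures(function, chunks, workers=2):
        """Process chunks in parallel using separate python processes."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(function, chunks))


    if __name__ == "__main__":
        chunks = [range(0, 10), range(10, 20), range(20, 30)]
        # Both styles give the same per-chunk results, in the same order.
        assert run_iterative(work, chunks) == run_futures(work, chunks)

The ``__main__`` guard matters for the process-based variant, since worker processes may re-import the script on platforms that do not fork.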

.. _def-distributed-executors:

Distributed executors
^^^^^^^^^^^^^^^^^^^^^
Currently, coffea supports three types of distributed executors:

- the `parsl <http://parsl-project.org/>`_ distributed executor, accessed via `parsl_executor`,
- the `dask <https://distributed.dask.org/en/latest/>`_ distributed executor, accessed via `dask_executor`,
- and the Apache `Spark <https://spark.apache.org/>`_ distributed executor, accessed via `run_spark_job`.

These executors use their respective underlying libraries to distribute processing tasks over multiple machines.
@@ -0,0 +1,140 @@
.. _installing-coffea:

Installing coffea
=================

Quick start
-----------
To try coffea now, without installing anything, you can experiment with our
`hosted tutorial notebooks <https://mybinder.org/v2/gh/CoffeaTeam/coffea/master?filepath=binder/>`_.

Platform support
----------------
Coffea is a python package distributed via `PyPI <https://pypi.org/project/coffea>`_. A python installation is required to use coffea.
Python version 3.6 or newer is preferred since it supports all features of coffea, but a subset of coffea features will run in python 2.7 or newer.

.. note:: Python 2 end-of-life is Jan. 1, 2020. All major scientific python packages will no longer provide support for python 2 by that date: https://python3statement.org/

All functional features in each supported python version are routinely tested.
You can see the python version you have installed by typing the following at the command prompt:

>>> python --version

or, in some cases, if both python 2 and 3 are available, you can find the python 3 version via:

>>> python3 --version

Coffea core functionality is routinely tested on Windows, Linux and macOS.
All :ref:`def-local-executors` are tested against all three platforms;
however, the :ref:`def-distributed-executors` are not routinely tested on Windows.

Coffea releases in the PyPI repository start from v0.5.0, because before v0.5.0 the package was hosted as `fnal-column-analysis-tools <https://pypi.org/project/fnal-column-analysis-tools/>`_. If you are still using fnal-column-analysis-tools, please move to `coffea <https://pypi.org/project/coffea/>`_!

Install coffea
--------------
To install coffea, there are several mostly-equivalent options:

- install coffea system-wide using ``pip install coffea``;
- if you do not have administrator permissions, install as local user with ``pip install --user coffea``;
- if you prefer to not place coffea in your global environment, you can set up a `virtual environment <https://docs.python.org/3/library/venv.html>`_, and use the venv-provided pip;
- if you use `Conda <https://docs.conda.io/projects/conda/en/latest/index.html>`_, simply activate the environment you wish to use and install via the conda-provided pip;
- or, if you are using a machine that has cvmfs available, see `Install via cvmfs`_ below.

To update a previously installed coffea to a newer version, use: ``pip install --upgrade coffea``.
Although not required, it is recommended to also `install Jupyter <https://jupyter.org/install>`_, as it provides a more interactive development environment.
The installation procedure is essentially the same as above: ``pip install jupyter``. (If you use conda, ``conda install jupyter`` is a better option.)

In rare cases, you may find that the ``pip`` executable in your path does not correspond to the same python installation as the ``python`` executable. This is a sign of a broken python environment. However, this can be bypassed by using the syntax ``python -m pip ...`` in place of ``pip ...``.

Install optional dependencies
-----------------------------
Coffea supports several optional components that require additional package installations.
In particular, all of the :ref:`def-distributed-executors` require additional packages.
The necessary dependencies can be installed easily via ``pip`` using the setuptools `extras <https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies>`_ facility:

- Apache `Spark <https://spark.apache.org/>`_ distributed executor: ``pip install coffea[spark]``
- `parsl <http://parsl-project.org/>`_ distributed executor: ``pip install coffea[parsl]``
- `dask <https://distributed.dask.org/en/latest/>`_ distributed executor: ``pip install coffea[dask]``

Multiple extras can be installed together via, e.g. ``pip install coffea[dask,spark]``
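
After installing, one way to see which optional backends are importable in the current environment is a small check like the one below. This is a sketch under an assumption: the mapping from each extra to a top-level module name (``pyspark``, ``parsl``, ``dask``) is a guess at what the extras provide, and ``available_extras`` is a hypothetical helper, not a coffea function.

.. code-block:: python

    from importlib.util import find_spec

    # Assumed mapping from each coffea extra to a top-level module it provides.
    EXTRA_MODULES = {"spark": "pyspark", "parsl": "parsl", "dask": "dask"}


    def available_extras(modules=EXTRA_MODULES):
        """Return {extra: True/False} according to whether its module can be found."""
        return {extra: find_spec(name) is not None for extra, name in modules.items()}


    for extra, found in available_extras().items():
        print("coffea[{}]: {}".format(extra, "available" if found else "not installed"))

``find_spec`` only locates the module without importing it, so the check is cheap and has no side effects from the backends themselves.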

Install via cvmfs
-----------------
Although the local installation can work anywhere, if the base environment does not already have most of the coffea dependencies, then the user-local package directory can become quite bloated.
An option to avoid this bloat is to use a base python environment provided via `CERN LCG <https://ep-dep-sft.web.cern.ch/document/lcg-releases>`_, which is available on any system that has the `cvmfs <https://cernvm.cern.ch/portal/filesystem>`_ directory ``/cvmfs/sft.cern.ch/`` mounted.
Simply source an LCG release (shown here: 96python3) and install:

.. code-block:: bash

    # check your platform: CC7 shown below, for SL6 it would be "x86_64-slc6-gcc8-opt"
    source /cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh  # or .csh, etc.
    pip install --user coffea


Creating a portable virtual environment
---------------------------------------
In some instances, it may be useful to have a self-contained environment that can be relocated.
One use case is for users of coffea that do not have access to a distributed compute cluster that is compatible with
one of the coffea distributed executors. Here, a fallback solution can be found by creating traditional batch jobs (e.g. condor)
which then use coffea local executors, possibly with multiple processes. In this case, often the user-local python package directory
is not available from batch workers, so a portable python environment needs to be created.
Annoyingly, python virtual environments are not portable by default due to several hardcoded paths in specific locations;
however, there are not many such locations, and some sed hacks can save the day.
Here is an example of a bash script that installs coffea on top of the LCG 96python3 software stack inside a portable virtual environment,
with the caveat that cvmfs must be visible from batch workers:

.. code-block:: bash

    #!/usr/bin/env bash
    NAME=coffeaenv
    LCG=/cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh
    source $LCG
    # following https://aarongorka.com/blog/portable-virtualenv/, an alternative is https://github.com/pantsbuild/pex
    python -m venv --copies $NAME
    source $NAME/bin/activate
    python -m pip install setuptools pip --upgrade
    python -m pip install coffea
    # replace line 40 of the activate script so VIRTUAL_ENV is derived from the script's own location
    sed -i '40s/.*/VIRTUAL_ENV="$(cd "$(dirname "$(dirname "${BASH_SOURCE[0]}" )")" \&\& pwd)"/' $NAME/bin/activate
    # rewrite hardcoded interpreter paths in shebang lines
    sed -i '1s/#!.*python$/#!\/usr\/bin\/env python/' $NAME/bin/*
    # source the LCG release whenever the environment is activated
    sed -i "2a source ${LCG}" $NAME/bin/activate
    tar -zcf ${NAME}.tar.gz ${NAME}

The resulting tarball size is about 5 MB.
An example batch job wrapper script is:

.. code-block:: bash

    #!/usr/bin/env bash
    tar -zxf coffeaenv.tar.gz
    source coffeaenv/bin/activate
    # quote "$@" so that arguments containing spaces are passed through intact
    echo "Running command:" "$@"
    time "$@" || exit $?


For Developers
--------------

1. Install development dependencies and download source:

   .. code-block:: bash

       pip install flake8 pytest coverage
       git clone https://github.com/CoffeaTeam/coffea

2. Install:

   .. code-block:: bash

       cd coffea
       pip install --editable .
       # to install extras, use .[extra]

3. Develop a cool new feature or fix some bugs

4. Test and build documentation:

   .. code-block:: bash

       python setup.py flake8
       python setup.py pytest
       source docs/build_docs.sh
This file was deleted.