Skip to content

Commit

Permalink
Merge pull request #204 from CoffeaTeam/moredocs
Browse files Browse the repository at this point in the history
More docs
  • Loading branch information
nsmith- authored Nov 18, 2019
2 parents 09b5617 + 0e1cefd commit 1a85ac4
Show file tree
Hide file tree
Showing 7 changed files with 214 additions and 96 deletions.
3 changes: 2 additions & 1 deletion README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ of a high-energy collider physics (HEP) experiment analysis using the scientific
python ecosystem. It makes use of `uproot <https://github.com/scikit-hep/uproot>`_
and `awkward-array <https://github.com/scikit-hep/awkward-array>`_ to provide an
array-based syntax for manipulating HEP event data in an efficient and numpythonic
way. There are sub-packages that implement histogramming, plotting, and look-up
way. There are sub-packages that implement histogramming, plotting, and look-up
table functionalities that are needed to convey scientific insight, apply transformations
to data, and correct for discrepancies in Monte Carlo simulations compared to data.

Expand Down Expand Up @@ -65,6 +65,7 @@ Install coffea like any other Python package:
pip install coffea
or similar (use ``sudo``, ``--user``, ``virtualenv``, or pip-in-conda if you wish).
For more details, see the `Installing coffea <https://coffeateam.github.io/coffea/installation.html>`_ section of the documentation.

Strict dependencies
===================
Expand Down
3 changes: 1 addition & 2 deletions docs/build_docs.sh
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
#!/usr/bin/env bash

pip -q install -U sphinx nbsphinx sphinx-rtd-theme sphinx-automodapi
python setup.py -q install
pip -q install -e .
pushd docs
rm -rf build
pushd source
Expand All @@ -12,4 +12,3 @@ popd
make html
touch build/html/.nojekyll
popd
pip uninstall --yes coffea
69 changes: 69 additions & 0 deletions docs/source/concepts.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
Coffea concepts
===============

This page explains concepts and terminology used within the coffea package.
It is intended to provide a high-level overview, while details can be found in other sections of the documentation.

.. _def-columnar-analysis:

Columnar analysis
-----------------
Columnar analysis is a paradigm that describes the way the user writes the analysis application that is best described
in contrast to the the traditional paradigm in high-energy particle physics (HEP) of using an event loop. In an event loop, the analysis operates row-wise
on the input data (in HEP, one row usually corresponds to one reconstructed particle collision event.) Each row
is a structure containing several fields, such as the properties of the visible outgoing particles
that were reconstructed in a collision event. The analysis code manipulates this structure to either output derived
quantities or summary statistics in the form of histograms. In contrast, columnar analysis operates on individual
columns of data spanning a *chunk* (partition, batch) of rows using `array programming <https://en.wikipedia.org/wiki/Array_programming>`_
primitives in turn, to compute derived quantities and summary statistics. Array programming is widely used within
the `scientific python ecosystem <https://www.scipy.org/about.html>`_, supported by the `numpy <https://numpy.org/>`_ library.
However, although the existing scientific python stack is fully capable of analyzing rectangular arrays (i.e.
no variable-length array dimensions), HEP data is very irregular, and manipulating it can become awkward without
first generalizing array structure a bit. The `awkward <https://github.com/scikit-hep/awkward-array>`_ package does this,
extending array programming capabilities to the complexity of HEP data.

.. image:: images/columnar.png
:width: 70 %
:align: center

.. _def-processor:

Coffea processor
----------------
In almost all HEP analyses, each row corresponds to an independent event, and it is exceptionally rare
to need to compute inter-row derived quantites. Due to this, horizontal scale-out is almost trivial:
each chunk of rows can be operated on independently. Further, if the output of an analysis is restricted
to reducible accumulators such as histograms (abstracted by `AccumulatorABC`), then outputs can even be merged via tree reduction.
The `ProcessorABC` class is an abstraction to encapsulate analysis code so that it can be easily scaled out, leaving
the delivery of input columns and reduction of output accumulators to the coffea framework.

.. _def-scale-out:

Scale-out
---------
Often, the computation requirements of a HEP data analysis exceed the resources of a single thread of execution.
To facilitate parallelization and allow the user to access more compute resources, coffea employs various *executors*
to ease the transition between a local analysis on a small set of test data to a full-scale analysis.
The executors roughly fall into two categories: local and distributed.

.. _def-local-executors:

Local executors
^^^^^^^^^^^^^^^
Currently, two local executors exist: `iterative_executor` and `futures_executor`.
The iterative executor simply processes each chunk of an input dataset in turn, using the current
python thread. The futures executor employs python `multiprocessing` to spawn multiple python processes
that process chunks in parallel on the machine. Processes are used rather than threads to avoid
performance limitations due to the CPython `global interpreter lock <https://wiki.python.org/moin/GlobalInterpreterLock>`_.

.. _def-distributed-executors:

Distributed executors
^^^^^^^^^^^^^^^^^^^^^
Currently, coffea supports three types of distributed executors:

- the `parsl <http://parsl-project.org/>`_ distributed executor, accessed via `parsl_executor`,
- the `dask <https://distributed.dask.org/en/latest/>`_ distributed executor, accessed via `dask_executor`,
- and the Apache `Spark <https://spark.apache.org/>`_ distributed executor, accessed via `run_spark_job`.

These executors use their respective underlying libraries to distribute processing tasks over multiple machines.
Binary file added docs/source/images/columnar.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
3 changes: 2 additions & 1 deletion docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,9 @@
.. include:: ../../README.rst

.. toctree::
quick_start
installation
introduction
concepts
reference
..

Expand Down
140 changes: 140 additions & 0 deletions docs/source/installation.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
.. _installing-coffea:

Installing coffea
=================

Quick start
-----------
To try coffea now, without installing anything, you can experiment with our
`hosted tutorial notebooks <https://mybinder.org/v2/gh/CoffeaTeam/coffea/master?filepath=binder/>`_.

Platform support
----------------
Coffea is a python package distributed via `PyPI <https://pypi.org/project/coffea>`_. A python installation is required to use coffea.
Python version 3.6 or newer is preferred since it supports all features of coffea, but a subset of coffea features will run in python 2.7 or newer.

.. note:: Python 2 end-of-life is Jan. 1, 2020. All major scientific python packages will no longer provide support for python 2 by that date: https://python3statement.org/

All functional features in each supported python version are routinely tested.
You can see the python version you have installed by typing the following at the command prompt:

>>> python --version

or, in some cases, if both python 2 and 3 are available, you can find the python 3 version via:

>>> python3 --version

coffea core functionality is routinely tested on Windows, Linux and MacOS.
All :ref:`def-local-executors` are tested against all three platforms,
however the :ref:`def-distributed-executors` are not routinely tested on Windows.

Coffea starts from v0.5.0 in the PyPI repository since before v0.5.0 it was hosted as `fnal-column-analysis-tools <https://pypi.org/project/fnal-column-analysis-tools/>`_. If you are still using fnal-column-analysis-tools, please move to `coffea <https://pypi.org/project/coffea/>`_!

Install coffea
--------------
To install coffea, there are several mostly-equivalent options:

- install coffea system-wide using ``pip install coffea``;
- if you do not have administrator permissions, install as local user with ``pip install --user coffea``;
- if you prefer to not place coffea in your global environment, you can set up a `virtual environment <https://docs.python.org/3/library/venv.html>`_, and use the venv-provided pip;
- if you use `Conda <https://docs.conda.io/projects/conda/en/latest/index.html>`_, simply activate the environment you wish to use and install via the conda-provided pip.
- or, if you are using a machine that has cvmfs available, see `Install via cvmfs`_ below.

To update a previously installed coffea to a newer version, use: ``pip install --upgrade coffea``
Although not required, it is recommended to also `install Jupyter <https://jupyter.org/install>`_, as it provides a more interactive development environment.
The installation procedure is essentially identical as above: ``pip install jupyter``. (If you use conda, ``conda install jupyter`` is a better option.)

In rare cases, you may find that the ``pip`` executable in your path does not correspond to the same python installation as the ``python`` executable. This is a sign of a broken python environment. However, this can be bypassed by using the syntax ``python -m pip ...`` in place of ``pip ...``.

Install optional dependencies
-----------------------------
Coffea supports several optional components that require additional package installations.
In particular, all of the :ref:`def-distributed-executors` require additional packages.
The necessary dependencies can be installed easily via ``pip`` using the setuptools `extras <https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies>`_ facility:

- Apache `Spark <https://spark.apache.org/>`_ distributed executor: ``pip install coffea[spark]``
- `parsl <http://parsl-project.org/>`_ distributed executor: ``pip install coffea[parsl]``
- `dask <https://distributed.dask.org/en/latest/>`_ distributed executor: ``pip install coffea[dask]``

Multiple extras can be installed together via, e.g. ``pip install coffea[dask,spark]``

Install via cvmfs
-----------------
Although the local installation can work anywhere, if the base environment does not already have most of the coffea dependencies, then the user-local package directory can become quite bloated.
An option to avoid this bloat is to use a base python environment provided via `CERN LCG <https://ep-dep-sft.web.cern.ch/document/lcg-releases>`_, which is available on any system that has the `cvmfs <https://cernvm.cern.ch/portal/filesystem>`_ directory ``/cvmfs/sft.cern.ch/`` mounted.
Simply source a LCG release (shown here: 96python3) and install:

.. code-block:: bash
# check your platform: CC7 shown below, for SL6 it would be "x86_64-slc6-gcc8-opt"
source /cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh # or .csh, etc.
pip install --user coffea
Creating a portable virtual environment
---------------------------------------
In some instances, it may be useful to have a self-contained environment that can be relocated.
One use case is for users of coffea that do not have access to a distributed compute cluster that is compatible with
one of the coffea distributed executors. Here, a fallback solution can be found by creating traditional batch jobs (e.g. condor)
which then use coffea local executors, possibly multi-threaded. In this case, often the user-local python package directory
is not available from batch workers, so a portable python enviroment needs to be created.
Annoyingly, python virtual environments are not portable by default due to several hardcoded paths in specific locations,
however there are not many locations and some sed hacks can save the day.
Here is an example of a bash script that installs coffea on top of the LCG 96python3 software stack inside a portable virtual environment,
with the caveat that cvmfs must be visible from batch workers:

.. code-block:: bash
#!/usr/bin/env bash
NAME=coffeaenv
LCG=/cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh
source $LCG
# following https://aarongorka.com/blog/portable-virtualenv/, an alternative is https://github.com/pantsbuild/pex
python -m venv --copies $NAME
source $NAME/bin/activate
python -m pip install setuptools pip --upgrade
python -m pip install coffea
sed -i '40s/.*/VIRTUAL_ENV="$(cd "$(dirname "$(dirname "${BASH_SOURCE[0]}" )")" \&\& pwd)"/' $NAME/bin/activate
sed -i '1s/#!.*python$/#!\/usr\/bin\/env python/' $NAME/bin/*
sed -i "2a source ${LCG}" $NAME/bin/activate
tar -zcf ${NAME}.tar.gz ${NAME}
The resulting tarball size is about 5 MB.
An example batch job wrapper script is:

.. code-block:: bash
#!/usr/bin/env bash
tar -zxf coffeaenv.tar.gz
source coffeaenv/bin/activate
echo "Running command:" $@
time $@ || exit $?
For Developers
--------------

1. Install development dependencies and download source:

.. code-block:: bash
pip install flake8 pytest coverage
git clone https://github.com/CoffeaTeam/coffea
2. Install:

.. code-block:: bash
cd coffea
pip install --editable .
// to install extras, use .[extra]
3. Develop a cool new feature or fix some bugs

4. Test and build documentation:

.. code-block:: bash
python setup.py flake8
python setup.py pytest
source docs/build_docs.sh
92 changes: 0 additions & 92 deletions docs/source/quick_start.rst

This file was deleted.

0 comments on commit 1a85ac4

Please sign in to comment.