Merge pull request #204 from CoffeaTeam/moredocs

More docs
scikit-hep · Nov 18, 2019 · 1a85ac4 · 1a85ac4
2 parents 09b5617 + 0e1cefd
commit 1a85ac4

Showing 7 changed files with 214 additions and 96 deletions.
diff --git a/README.rst b/README.rst
@@ -36,7 +36,7 @@ of a high-energy collider physics (HEP) experiment analysis using the scientific
 python ecosystem. It makes use of `uproot <https://github.com/scikit-hep/uproot>`_
 and `awkward-array <https://github.com/scikit-hep/awkward-array>`_ to provide an
 array-based syntax for manipulating HEP event data in an efficient and numpythonic
-way. There are  sub-packages that implement histogramming, plotting, and look-up
+way. There are sub-packages that implement histogramming, plotting, and look-up
 table functionalities that are needed to convey scientific insight, apply transformations
 to data, and correct for discrepancies in Monte Carlo simulations compared to data.
 
@@ -65,6 +65,7 @@ Install coffea like any other Python package:
     pip install coffea
 
 or similar (use ``sudo``, ``--user``, ``virtualenv``, or pip-in-conda if you wish).
+For more details, see the `Installing coffea <https://coffeateam.github.io/coffea/installation.html>`_ section of the documentation.
 
 Strict dependencies
 ===================

diff --git a/docs/build_docs.sh b/docs/build_docs.sh
@@ -1,7 +1,7 @@
 #!/usr/bin/env bash
 
 pip -q install -U sphinx nbsphinx sphinx-rtd-theme sphinx-automodapi
-python setup.py -q install
+pip -q install -e .
 pushd docs
 rm -rf build
 pushd source
@@ -12,4 +12,3 @@ popd
 make html
 touch build/html/.nojekyll
 popd
-pip uninstall --yes coffea 
diff --git a/docs/source/concepts.rst b/docs/source/concepts.rst
@@ -0,0 +1,69 @@
+Coffea concepts
+===============
+
+This page explains concepts and terminology used within the coffea package.
+It is intended to provide a high-level overview, while details can be found in other sections of the documentation.
+
+.. _def-columnar-analysis:
+
+Columnar analysis
+-----------------
+Columnar analysis is a paradigm for writing an analysis application that is best described
+in contrast to the traditional paradigm in high-energy particle physics (HEP) of using an event loop. In an event loop, the analysis operates row-wise
+on the input data (in HEP, one row usually corresponds to one reconstructed particle collision event). Each row
+is a structure containing several fields, such as the properties of the visible outgoing particles
+that were reconstructed in a collision event. The analysis code manipulates this structure to either output derived
+quantities or summary statistics in the form of histograms. In contrast, columnar analysis operates on individual
+columns of data spanning a *chunk* (partition, batch) of rows using `array programming <https://en.wikipedia.org/wiki/Array_programming>`_
+primitives in turn, to compute derived quantities and summary statistics. Array programming is widely used within
+the `scientific python ecosystem <https://www.scipy.org/about.html>`_, supported by the `numpy <https://numpy.org/>`_ library.
+However, although the existing scientific python stack is fully capable of analyzing rectangular arrays (i.e. 
+no variable-length array dimensions), HEP data is very irregular, and manipulating it can become awkward without
+first generalizing array structure a bit. The `awkward <https://github.com/scikit-hep/awkward-array>`_ package does this,
+extending array programming capabilities to the complexity of HEP data.
+
+.. image:: images/columnar.png
+   :width: 70 %
+   :align: center
+
+.. _def-processor:
+
+Coffea processor
+----------------
+In almost all HEP analyses, each row corresponds to an independent event, and it is exceptionally rare
+to need to compute inter-row derived quantities. Due to this, horizontal scale-out is almost trivial:
+each chunk of rows can be operated on independently. Further, if the output of an analysis is restricted
+to reducible accumulators such as histograms (abstracted by `AccumulatorABC`), then outputs can even be merged via tree reduction.
+The `ProcessorABC` class is an abstraction to encapsulate analysis code so that it can be easily scaled out, leaving
+the delivery of input columns and reduction of output accumulators to the coffea framework.
+
+.. _def-scale-out:
+
+Scale-out
+---------
+Often, the computation requirements of a HEP data analysis exceed the resources of a single thread of execution.
+To facilitate parallelization and allow the user to access more compute resources, coffea employs various *executors*
+to ease the transition from a local analysis on a small set of test data to a full-scale analysis.
+The executors roughly fall into two categories: local and distributed.
+
+.. _def-local-executors:
+
+Local executors
+^^^^^^^^^^^^^^^
+Currently, two local executors exist: `iterative_executor` and `futures_executor`.
+The iterative executor simply processes each chunk of an input dataset in turn, using the current
+python thread. The futures executor employs python `multiprocessing` to spawn multiple python processes
+that process chunks in parallel on the machine. Processes are used rather than threads to avoid
+performance limitations due to the CPython `global interpreter lock <https://wiki.python.org/moin/GlobalInterpreterLock>`_.
+
+.. _def-distributed-executors:
+
+Distributed executors
+^^^^^^^^^^^^^^^^^^^^^
+Currently, coffea supports three types of distributed executors:
+
+   - the `parsl <http://parsl-project.org/>`_ distributed executor, accessed via `parsl_executor`,
+   - the `dask <https://distributed.dask.org/en/latest/>`_ distributed executor, accessed via `dask_executor`,
+   - and the Apache `Spark <https://spark.apache.org/>`_ distributed executor, accessed via `run_spark_job`.
+
+These executors use their respective underlying libraries to distribute processing tasks over multiple machines.
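
A minimal sketch of the idea in the "Coffea processor" section of concepts.rst above, assuming nothing about the real coffea API: ``process``, ``tree_reduce``, and ``split`` are hypothetical helper names, a chunk is a plain dict of equal-length columns, and ``collections.Counter`` stands in for a mergeable histogram accumulator. It illustrates that processing chunks independently and merging the outputs pairwise gives the same result as processing all rows at once.

```python
from collections import Counter


def process(chunk):
    """Hypothetical analysis step: histogram one column of a chunk of events."""
    # The chunk is a dict of equal-length columns; only the needed column is read.
    return Counter(chunk["n_muons"])


def tree_reduce(outputs):
    """Merge accumulators pairwise until a single one remains."""
    outputs = list(outputs)
    if not outputs:
        return Counter()
    while len(outputs) > 1:
        merged = [outputs[i] + outputs[i + 1] for i in range(0, len(outputs) - 1, 2)]
        if len(outputs) % 2:  # carry an unpaired accumulator into the next round
            merged.append(outputs[-1])
        outputs = merged
    return outputs[0]


def split(columns, chunksize):
    """Partition a dict of columns into chunks of at most `chunksize` rows."""
    nrows = len(next(iter(columns.values())))
    for start in range(0, nrows, chunksize):
        yield {name: col[start:start + chunksize] for name, col in columns.items()}


# Made-up illustrative data: number of muons in each of ten events.
events = {"n_muons": [0, 2, 1, 2, 0, 1, 2, 2, 3, 0]}

chunked = tree_reduce(process(chunk) for chunk in split(events, chunksize=3))
whole = process(events)
assert chunked == whole
```

Because each call to ``process`` depends only on its own chunk, the calls could equally be dispatched to separate processes or machines, with only the merge step needing the combined outputs.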
diff --git a/docs/source/images/columnar.png b/docs/source/images/columnar.png
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -6,8 +6,9 @@
 .. include:: ../../README.rst
 
 .. toctree::
-   quick_start
+   installation
    introduction
+   concepts
    reference
 ..
 

diff --git a/docs/source/installation.rst b/docs/source/installation.rst
@@ -0,0 +1,140 @@
+.. _installing-coffea:
+
+Installing coffea
+=================
+
+Quick start
+-----------
+To try coffea now, without installing anything, you can experiment with our
+`hosted tutorial notebooks <https://mybinder.org/v2/gh/CoffeaTeam/coffea/master?filepath=binder/>`_.
+
+Platform support
+----------------
+Coffea is a python package distributed via `PyPI <https://pypi.org/project/coffea>`_. A python installation is required to use coffea.
+Python version 3.6 or newer is preferred since it supports all features of coffea, but a subset of coffea features will run in python 2.7 or newer.
+
+.. note:: Python 2 reaches end-of-life on Jan. 1, 2020. All major scientific python packages will have dropped support for python 2 by that date: https://python3statement.org/
+
+All functional features in each supported python version are routinely tested.
+You can see the python version you have installed by typing the following at the command prompt:
+
+>>> python --version
+
+or, in some cases, if both python 2 and 3 are available, you can find the python 3 version via:
+
+>>> python3 --version
+
+coffea core functionality is routinely tested on Windows, Linux and MacOS.
+All :ref:`def-local-executors` are tested against all three platforms,
+however the :ref:`def-distributed-executors` are not routinely tested on Windows.
+
+Coffea releases on PyPI start at v0.5.0, since earlier versions were published as `fnal-column-analysis-tools <https://pypi.org/project/fnal-column-analysis-tools/>`_. If you are still using fnal-column-analysis-tools, please move to `coffea <https://pypi.org/project/coffea/>`_!
+
+Install coffea
+--------------
+To install coffea, there are several mostly-equivalent options:
+
+   - install coffea system-wide using ``pip install coffea``; 
+   - if you do not have administrator permissions, install as local user with ``pip install --user coffea``;
+   - if you prefer to not place coffea in your global environment, you can set up a `virtual environment <https://docs.python.org/3/library/venv.html>`_, and use the venv-provided pip;
+   - if you use `Conda <https://docs.conda.io/projects/conda/en/latest/index.html>`_, simply activate the environment you wish to use and install via the conda-provided pip;
+   - or, if you are using a machine that has cvmfs available, see `Install via cvmfs`_ below.
+
+To update a previously installed coffea to a newer version, use: ``pip install --upgrade coffea``.
+Although not required, it is recommended to also `install Jupyter <https://jupyter.org/install>`_, as it provides a more interactive development environment.
+The installation procedure is essentially the same as above: ``pip install jupyter``. (If you use conda, ``conda install jupyter`` is a better option.)
+
+In rare cases, you may find that the ``pip`` executable in your path does not correspond to the same python installation as the ``python`` executable. This is a sign of a broken python environment. However, this can be bypassed by using the syntax ``python -m pip ...`` in place of ``pip ...``.
+
+Install optional dependencies
+-----------------------------
+Coffea supports several optional components that require additional package installations.
+In particular, all of the :ref:`def-distributed-executors` require additional packages.
+The necessary dependencies can be installed easily via ``pip`` using the setuptools `extras <https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies>`_  facility:
+
+   - Apache `Spark <https://spark.apache.org/>`_ distributed executor: ``pip install coffea[spark]``
+   - `parsl <http://parsl-project.org/>`_ distributed executor: ``pip install coffea[parsl]``
+   - `dask <https://distributed.dask.org/en/latest/>`_ distributed executor: ``pip install coffea[dask]``
+
+Multiple extras can be installed together via, e.g. ``pip install coffea[dask,spark]``
+
+Install via cvmfs
+-----------------
+Although the local installation can work anywhere, if the base environment does not already have most of the coffea dependencies, then the user-local package directory can become quite bloated.
+An option to avoid this bloat is to use a base python environment provided via `CERN LCG <https://ep-dep-sft.web.cern.ch/document/lcg-releases>`_, which is available on any system that has the `cvmfs <https://cernvm.cern.ch/portal/filesystem>`_ directory ``/cvmfs/sft.cern.ch/`` mounted.
+Simply source an LCG release (shown here: 96python3) and install:
+
+.. code-block:: bash
+
+  # check your platform: CC7 shown below, for SL6 it would be "x86_64-slc6-gcc8-opt"
+  source /cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh  # or .csh, etc.
+  pip install --user coffea
+
+Creating a portable virtual environment
+---------------------------------------
+In some instances, it may be useful to have a self-contained environment that can be relocated.
+One use case is for users of coffea that do not have access to a distributed compute cluster that is compatible with
+one of the coffea distributed executors. Here, a fallback solution can be found by creating traditional batch jobs (e.g. condor)
+which then use coffea local executors, possibly multi-threaded. In this case, often the user-local python package directory
+is not available from batch workers, so a portable python environment needs to be created.
+Annoyingly, python virtual environments are not portable by default due to several hardcoded paths in specific locations.
+However, there are not many such locations, and some sed hacks can save the day.
+Here is an example of a bash script that installs coffea on top of the LCG 96python3 software stack inside a portable virtual environment,
+with the caveat that cvmfs must be visible from batch workers:
+
+.. code-block:: bash
+
+  #!/usr/bin/env bash
+  NAME=coffeaenv
+  LCG=/cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh
+
+  source $LCG
+  # following https://aarongorka.com/blog/portable-virtualenv/, an alternative is https://github.com/pantsbuild/pex
+  python -m venv --copies $NAME
+  source $NAME/bin/activate
+  python -m pip install setuptools pip --upgrade
+  python -m pip install coffea
+  sed -i '40s/.*/VIRTUAL_ENV="$(cd "$(dirname "$(dirname "${BASH_SOURCE[0]}" )")" \&\& pwd)"/' $NAME/bin/activate
+  sed -i '1s/#!.*python$/#!\/usr\/bin\/env python/' $NAME/bin/*
+  sed -i "2a source ${LCG}" $NAME/bin/activate
+  tar -zcf ${NAME}.tar.gz ${NAME}
+
+The resulting tarball size is about 5 MB.
+An example batch job wrapper script is:
+
+.. code-block:: bash
+
+  #!/usr/bin/env bash
+  tar -zxf coffeaenv.tar.gz
+  source coffeaenv/bin/activate
+
+  echo "Running command:" "$@"
+  time "$@" || exit $?
+
+For Developers
+--------------
+
+1. Install development dependencies and download source:
+
+  .. code-block:: bash
+
+    pip install flake8 pytest coverage
+    git clone https://github.com/CoffeaTeam/coffea
+
+2. Install:
+
+  .. code-block:: bash
+
+    cd coffea
+    pip install --editable .
+    # to install extras, use .[extra]
+
+3. Develop a cool new feature or fix some bugs
+
+4. Test and build documentation:
+
+  .. code-block:: bash
+
+    python setup.py flake8
+    python setup.py pytest
+    source docs/build_docs.sh
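
The version guidance and the ``python -m pip`` advice in installation.rst above can be sketched in Python using only the standard library. The helper names ``python_is_preferred`` and ``pip_command`` are hypothetical and not part of coffea; the ``(3, 6)`` threshold is taken from the platform-support text.

```python
import sys


def python_is_preferred(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the interpreter meets the preferred minimum version."""
    return tuple(version_info[:2]) >= tuple(minimum)


def pip_command(*args):
    """Build a pip invocation tied to the running interpreter.

    Using `<python> -m pip` avoids picking up a `pip` executable on PATH
    that belongs to a different python installation.
    """
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    print("preferred python version:", python_is_preferred())
    # Print the command rather than running it, so this sketch has no side effects.
    print(" ".join(pip_command("install", "--user", "coffea")))
```

The returned list is in the form accepted by ``subprocess.run``, should one wish to execute the installation from a script rather than only print it.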
diff --git a/docs/source/quick_start.rst b/docs/source/quick_start.rst