
Developer Documentation

This page lists information for developers working on viral-ngs.

Modifying code and testing

The current dev, build, and deploy paradigm is intentionally docker-centric. This means that developers need a working Docker installation in their dev environment, and not much else (other than git and a text/code editor). The code base is also modularized and layered. In order to work on code changes, you will:

  1. Clone the git code repository for this module on your local host machine (git clone https://github.com/broadinstitute/viral-classify.git), create and checkout a branch for your development work, and edit with your favorite code/text editor (standard git flow).
  2. Run the docker image on your machine while mounting your local git checkout into the container to test live edits. There are two ways to do this (a consolidated command sketch follows this list):
    1. If you aren't making (significant) changes to the tool dependencies in requirements-conda.txt, just use the latest docker image of viral-classify (or the specific tag you want to start from) and test your code changes in there. Do the following:
      1. docker pull and run the latest viral-classify image, mounting your local git checkout into the container (docker run -it --rm -v $(pwd)/viral-classify:/opt/viral-ngs/viral-classify quay.io/broadinstitute/viral-classify)
      2. If you made any changes to requirements-conda.txt, you'll need to update the dependencies within your container: run /opt/viral-ngs/source/docker/install-conda-dependencies.sh /opt/viral-ngs/viral-classify/requirements-conda.txt. Skip this step if you're only testing non-conda changes on your branch. If this took any significant amount of time, you can snapshot the resulting container (see instructions below).
    2. If you are making significant changes to the conda tool dependencies, it might be best to start from viral-core and layer on from there. Do the following:
      1. docker pull and run the image FROM which this is built (viral-core), while mounting your local git checkout into the container (docker run -it --rm -v $(pwd)/viral-classify:/opt/viral-ngs/viral-classify quay.io/broadinstitute/viral-core)
      2. Install this module's dependencies inside the container (within the container: /opt/viral-ngs/viral-classify/docker/install-dev-layer.sh). This may take a couple of minutes.
      3. (optional) Snapshot this docker image locally if you want to continue using it and skip the install-dev-layer step in the future (docker commit <container id> local/viral-classify-dev, where the container id comes from docker container ls).
    3. Either way, whenever you exit the container's command line, the container deletes itself (because it was started with --rm). Code changes always persist in your local checkout (they live on your host machine), but other OS changes, conda changes, and temp files start fresh the next time you docker run the image (from either the original image, or a local snapshot you made with docker commit).
  3. Test code and execution interactively within the container (cd /opt/viral-ngs/viral-classify; pytest -rsxX -n auto test/unit, or whatever you'd like to do).
  4. Push changes back to GitHub (from your host machine) for automated CI testing & builds, using standard, collaborative GitHub code review processes. Note that only your host machine (not the container) should have your GitHub credentials, although both are working on the same working copy of files.
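
Putting the steps together, a consolidated sketch of the option-2 (viral-core plus dev layer) workflow might look like the following. The branch name and the local/viral-classify-dev snapshot tag are placeholders; adjust paths to your own layout.

```bash
# --- on the host machine ---
git clone https://github.com/broadinstitute/viral-classify.git
(cd viral-classify && git checkout -b my-feature-branch)   # placeholder branch name

docker pull quay.io/broadinstitute/viral-core
docker run -it --rm \
    -v $(pwd)/viral-classify:/opt/viral-ngs/viral-classify \
    quay.io/broadinstitute/viral-core

# --- inside the container ---
/opt/viral-ngs/viral-classify/docker/install-dev-layer.sh   # install this module's conda dependencies
cd /opt/viral-ngs/viral-classify
pytest -rsxX -n auto test/unit                              # run the unit tests against your live edits

# --- on the host, in a second terminal (optional) ---
docker container ls                                         # find the running container's ID
docker commit <container id> local/viral-classify-dev       # snapshot so the dev layer persists
```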

Machinery under the hood

Dependency install destinations

When Python and binary dependencies for viral-ngs are installed by conda, they can end up in several locations. The default and preferred method of installation assumes a conda environment is active in the current shell, complete with environment variables we can use to determine the path of the active environment. In this case, conda packages are installed into the active conda environment. If conda is installed and available on the path but no environment is currently active, viral-ngs dependencies are installed in isolation within viral-ngs/tools/build/conda-tools/{default} (unless this location is overridden in the CondaPackage() constructor). For tools without a conda recipe (as may be the case on certain platforms, such as Mac OSX), or where the conda install fails, custom install methods are used to download and build them.
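
To see which case applies in your shell, you can inspect the environment variables that conda sets for an active environment; a minimal sketch (the exact layout under tools/build may vary):

```bash
# name and path of the active conda environment, if any
echo "${CONDA_DEFAULT_ENV:-<no active environment>}"
echo "${CONDA_PREFIX:-<no active environment>}"

# with an active environment, packages are installed into $CONDA_PREFIX;
# with conda on the PATH but no active environment, viral-ngs installs its
# dependencies in isolation under the tools build directory instead:
ls viral-ngs/tools/build/conda-tools/ 2>/dev/null || echo "not created yet"
```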

Adding a new tool or dependency

When adding a new tool or dependency to viral-ngs, check whether a conda package is already available, either on the default channels (conda search <package_name>) or on the bioconda channel (conda search -c bioconda <package_name>). If so, it will need to be added to the conda recipe template for viral-ngs. If no recipe is available, one will first need to be contributed to a conda channel; Bioconda is used by default.
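
For example (samtools here is just a stand-in for whatever tool you are adding):

```bash
conda search samtools                 # default channels
conda search -c bioconda samtools     # bioconda channel
```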

Changing dependency versions

The viral-ngs package installed by conda install viral-ngs from the broad-viral channel depends on a conda build recipe distributed in this repository. The recipe files source the various Python and binary dependencies of viral-ngs as conda packages, including version numbers, from the requirements-*.txt files within this repository. When updating a package version in requirements-conda.txt, also update it in requirements-minimal.txt if it appears there as well.
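
A quick consistency check before committing a version bump might look like this (illustrative; samtools is again just an example, and not every package appears in requirements-minimal.txt):

```bash
# show how the package is pinned in each requirements file
grep -Hn 'samtools' requirements-conda.txt requirements-minimal.txt
```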

(Automated) testing

Travis CI performs automated unit and integration tests for viral-ngs on each branch and pull request. Unit tests are run on each new branch commit, and longer integration tests are performed on pull requests to help ensure the stability of the master branch. Pull requests are gated to ensure merging to master is allowed only if all tests pass. The Travis configuration is specified in .travis.yml, and relies on files stored within viral-ngs/travis/.

A few notes on testing:

  • Travis Linux with Py3 is the authoritative test server
  • Tox is present for convenience (local use only, not currently in use on Travis)
  • py.test is used in place of nose or the built-in unittest framework for its generative testing, fixtures, and parallelized execution of tests.
  • .flake8, .pylintrc, and .style.yapf are available in the repository root and should be used when running the associated tools (a local invocation sketch follows this list)
  • During Travis tests an encrypted tarball of third-party resources is downloaded into the build environment to provide a licensed copy of GATK and Novoalign. For security, forks of viral-ngs will not have the encrypted resources available for testing on Travis, nor will pull requests from forks.
  • Many tests for viral-ngs depend on static input files distributed with the repository. The test input files reside in viral-ngs/test/input/<TestClassName>. Within specific unit tests, class-specific test input files may be accessed via the function util.file.get_test_input_path(self). The parent directory for all static test files can be accessed via util.file.get_test_path().
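
For local runs, the same configuration files can be pointed at directly; a rough sketch (the lint/format targets are placeholders for whichever packages you are editing):

```bash
# linters/formatter, using the configs in the repository root
flake8 .                                  # reads .flake8 automatically
pylint --rcfile=.pylintrc <package_name>  # placeholder target
yapf --style=.style.yapf --diff -r .      # show formatting diffs without rewriting files

# unit tests, run the same way as on Travis (parallel; report skip/xfail reasons)
pytest -rsxX -n auto test/unit
```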

The Travis build matrix

Each commit on any branch, and any pull request, will trigger a build on Travis CI. Branch commits will test code from a specific commit hash. Pull requests will test the simulated result of merging a branch HEAD onto the target branch HEAD. For each build, the following Travis jobs are launched:

  1. Docker & WDL
    1. A docker image is built and deployed to the Docker registry at quay.io. Master branch images are pushed to quay.io/broadinstitute/viral-ngs:latest and are also given a versioned tag. Non-master branch images and pull requests are pushed to quay.io/broadinstitute/viral-ngs-dev with versioned tags. The docker build is preceded by a docker pull of the docker image associated with the previous Travis build parental to this commit, in order to utilize layer caching (a simplified sketch of this pull/build/push pattern follows this list). Note that our tool dependencies result in a very large docker image (2 GB compressed, about 10x the typical size for a docker image). The Dockerfile builds the tool dependencies before incorporating the full viral-ngs source code, which means that most docker image builds are extremely fast: usually 10-20 seconds. The docker push/deploy is similarly fast, since the Docker registry already has most of the layers and only the new source code layer needs to upload. The docker pull of the 2 GB image takes about 5 minutes, so altogether this step takes about 6 minutes on Travis. However, if your code commit alters anything in requirements-*.txt or the easy deploy script, it will rebuild the heavy conda install layer, adding another 10 minutes or so to this build. The docker push requires login credentials for a docker registry (e.g. DockerHub, Quay.io, GCP, AWS), stored as an encrypted Travis variable.
    2. After the docker image is deployed, WDL pipeline files are edited to reflect the version tag of the recently pushed docker image. A WDL validator is then run (using wdltool.jar) to ensure that all WDL files are still valid. This completes in seconds.
    3. WDL pipelines are compiled to DNAnexus workflows using dxWDL.jar. These are deployed to a DNAnexus CI project using an API token stored as an encrypted Travis variable. This completes in under a minute.
    4. A couple of DNAnexus workflows are test-executed in the CI project.
    5. WDL pipelines are executed with test data using Cromwell on the local Travis instance. This is a bit slow (roughly 5 mins for a simple test).
  2. Documentation is built automatically. It is not deployed to Read the Docs; this test only exists on Travis to bring the developer's attention to any auto-build problems. Read the Docs has its own auto-build process, separate from Travis (see the section below), but it does not notify anyone of its build failures. This usually completes in less than 1 minute.
  3. The viral-ngs conda package is built and deployed to the broad-viral channel. This requires anaconda.org credentials stored as an encrypted Travis variable. This takes about 10 minutes.
  4. py.test is run on Python 2.7. Tool dependencies are installed prior to unit tests via conda. Unit and integration tests are run with every branch commit. The Travis cache is cleared for each tagged release, invoking a full re-install of dependencies. Normally, this job completes in 10-15 minutes, about half of which is the loading of conda tool dependencies from the cache. There are some tests that are skipped on Py27 as they cannot be run on Py2.
  5. py.test is run on Python 3.6. Tool dependencies are installed prior to unit tests via conda. Integration and unit tests are run with every branch commit; note that this is the reverse of the Py27 order (unit then integration), so that errors are likely to be detected earlier in the overall build process, if they exist. The Travis cache is cleared for each tagged release, invoking a full re-install of dependencies. Normally, this job completes in 15+ minutes, about half of which is the loading of conda tool dependencies from the cache. Coverage reports are sent to coveralls.io from this Travis job only.
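
The docker build-and-push in job 1 follows the familiar pull-then-build-with-cache pattern; the following is a simplified sketch, not the repository's actual CI script (image tags and credential variable names are illustrative):

```bash
# pull the image from the parent build so its layers can serve as a cache
docker pull quay.io/broadinstitute/viral-ngs-dev:previous-tag || true

# rebuild; unchanged layers (notably the heavy conda dependency layer) are reused
docker build \
    --cache-from quay.io/broadinstitute/viral-ngs-dev:previous-tag \
    -t quay.io/broadinstitute/viral-ngs-dev:new-tag .

# push; only the new source-code layer actually uploads
# credentials come from encrypted CI variables (names here are illustrative)
echo "$QUAY_PASSWORD" | docker login -u "$QUAY_USER" --password-stdin quay.io
docker push quay.io/broadinstitute/viral-ngs-dev:new-tag
```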

Some TO DO improvements for the future:

  • DNAnexus workflow testing should check output for correctness.
  • Cromwell workflow testing should check output for correctness.
  • Utilize Travis build stages.
    • All of the sub-steps of the first Docker & WDL Travis job should be broken out as separate jobs that wait for the Docker build and deploy.
    • Unit tests for Python 3.6, and possibly the conda package build, should occur within the Docker container.
    • Second-stage jobs that pull the docker image should utilize quay.io's torrent squashed image pull to reduce the time spent pulling our Docker image (currently about 5 minutes to pull from DockerHub).
    • Alternatively, we can explore creating a minimal docker image that installs only the conda pip packages (and perhaps extremely common conda tools like samtools and Picard) and leaves the rest of the conda tools out, letting them dynamically install themselves as needed using our dynamic tool install code.

Building documentation

Documentation is built automatically for certain branches of viral-ngs by Read the Docs. The documentation template files reside within viral-ngs/docs, and are formatted in standard docutils reStructuredText format. Pandoc may be used for converting from other formats (such as Markdown) to reStructuredText. The sphinx-argparse module is used to automatically generate documentation for the argparse parsers used in viral-ngs.
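
To build the documentation locally for preview, a sphinx invocation along these lines should work (a sketch; it assumes sphinx and sphinx-argparse are installed, and the output path and file names are placeholders):

```bash
# optional: convert a Markdown draft into reStructuredText for inclusion in docs/
pandoc --from=markdown --to=rst --output=docs/new_page.rst new_page.md

# build the HTML docs into a local directory and inspect the result
pip install sphinx sphinx-argparse
sphinx-build -b html docs docs/_build/html
```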