A logical, reasonably standardized, but flexible project structure for doing and sharing data science work while developing a software tool.
A project template for data science projects that uses Cookiecutter for easy deployment. It is a direct modification of Cookiecutter Data Science to serve a use case common to research sciences (especially computational biology / bioinformatics) where a data science project is developed alongside a software development project.
It utilizes many new tools and best practices for data science:
- Integration of a software development project using git-subrepo
- Experiment development and communication using JupyterLab
- Environment reproducibility and package management using Conda and git hooks
- Python 2.7 or 3.5
- Cookiecutter Python package >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:
```
$ pip install cookiecutter
```
or
```
$ conda install -c conda-forge cookiecutter
```
To start a new project, run:
```
cookiecutter https://github.com/michael-ford/data-science-development-project-template.git
```
Just follow the prompts to populate and build the new project!
After building the project, you'll want to do the following:
Conda is used to create an independent environment, isolated from your system installations (especially python). Included is a basic python 3.7 conda environment containing standard DS tools (including everything mentioned here), saved by default to `<YOUR-PROJECT-NAME>.yml`.
- Create the environment with `conda env create --file <YOUR-ENVIRONMENT-FILE-NAME>.yml`
- Activate the environment with `conda activate <YOUR-ENV-NAME>`
- Add the environment yml file to git and commit
- Enable the git hooks by running `chmod u+x .git/hooks/pre-commit .git/hooks/post-merge` (a worked example of all four steps follows below)
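For concreteness, here is the same sequence with a hypothetical project named `my-analysis` (substitute your own project, environment, and file names):
```
conda env create --file my-analysis.yml
conda activate my-analysis
git add my-analysis.yml
git commit -m "Add conda environment definition"
chmod u+x .git/hooks/pre-commit .git/hooks/post-merge
```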
The `pre-commit` hook overwrites the environment yml file with the current state of the conda environment using the following command:
```
conda env export -n <YOUR-ENV-NAME> --from-history | grep -v "^prefix" > <YOUR-ENVIRONMENT-FILE-NAME>
```
The `--from-history` argument prevents dependencies and binary versions from being included in the export, and `grep -v "^prefix"` removes the system-specific conda prefix to ensure cross-platform compatibility.
If the environment yml file hasn't changed, the commit proceeds. If it has changed, the commit aborts and the updated environment file is added to the index, so the next commit attempt will include it.
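The hook's logic amounts to something like the following sketch (the `<...>` values are placeholders; the actual script lives in `.git/hooks/pre-commit`):
```bash
#!/bin/bash
# Sketch of the pre-commit hook logic; <...> values are placeholders.
ENV_NAME="<YOUR-ENV-NAME>"
ENV_FILE="<YOUR-ENVIRONMENT-FILE-NAME>"

# Export the current environment, dropping the system-specific prefix line
conda env export -n "$ENV_NAME" --from-history | grep -v "^prefix" > "$ENV_FILE"

if git diff --quiet -- "$ENV_FILE"; then
    exit 0                  # file unchanged: let the commit proceed
else
    git add "$ENV_FILE"     # stage the updated file for the next attempt
    echo "Environment file updated and staged; re-run your commit." >&2
    exit 1                  # abort this commit
fi
```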
The `post-merge` hook checks the merged version of the environment yml, and if it has changed it updates the conda environment with:
```
conda env update -n <YOUR-ENV-NAME> -f <YOUR-ENVIRONMENT-FILE-NAME> --prune
```
Note: the `--prune` parameter is meant to remove unneeded dependencies, but due to a bug in the current version of Conda (> 4.4), any package removed from the yml will *not* be removed from the environment.
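Conceptually, the post-merge hook does something like this sketch (again with placeholder names; see `.git/hooks/post-merge` for the real script):
```bash
#!/bin/bash
# Sketch of the post-merge hook logic; <...> values are placeholders.
ENV_NAME="<YOUR-ENV-NAME>"
ENV_FILE="<YOUR-ENVIRONMENT-FILE-NAME>"

# Did the merge that just completed change the environment file?
if git diff --name-only HEAD@{1} HEAD | grep -qx "$ENV_FILE"; then
    conda env update -n "$ENV_NAME" -f "$ENV_FILE" --prune
fi
```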
First, set up a bare repository elsewhere (e.g. on GitHub) to serve as your remote. Then run the following:
```
git remote add origin <REMOTE-REPO-URL>
git remote -v
```
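With the remote attached, you can push the initial history (assuming your default branch is `master`):
```
git push -u origin master
```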
If your DS project is supporting the development of some software project, I recommend using git-subrepo to include its repo in the project by cloning it as follows:
Install git-subrepo:
```
git clone https://github.com/ingydotnet/git-subrepo </PATH/TO/INSTALL-LOCATION>
echo 'source </PATH/TO/INSTALL-LOCATION>/git-subrepo/.rc' >> ~/.bashrc
```
Clone the software repo as a subrepo:
```
git subrepo clone <REPO-REMOTE-URL> src/<REPO-NAME> -b <DESIRED-BRANCH-NAME>
```
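For example, with a hypothetical tool hosted at `https://github.com/example/my-tool` that you want tracked on its `develop` branch:
```
git subrepo clone https://github.com/example/my-tool.git src/my-tool -b develop
```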
If you are starting a software development project from scratch, consider using PyScaffold. Follow these instructions to set up and configure the repo. Note: if you want to use a remote (e.g. a GitHub repo), set up a bare repo there first.
```
cd src/
putup <PROJECT-NAME>
cd ..
git subrepo init src/<PROJECT-NAME> -r <REMOTE-PATH>
```
Note: if you want to use a branch other than `master`, check out that branch before the `git subrepo init` and add `-b <BRANCH-NAME>` to the `git subrepo init` command (see the note about git subrepo branches below).
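For instance, assuming a hypothetical branch named `develop`:
```
git subrepo init src/<PROJECT-NAME> -r <REMOTE-PATH> -b develop
```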
`pip install -e .` will set up the `src/` directory as a python package.
Note: it is recommended to install your subrepo as a python package independent from other code in `src/`. By default your subrepo is excluded from the package install of `src/` (as defined in `setup.py`), so you have to install it separately by running `pip install -e .` in the subrepo.
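Putting that together, with the hypothetical `src/my-tool` subrepo from the example above:
```
pip install -e .       # from the project root: installs src/ (subrepo excluded)
cd src/my-tool
pip install -e .       # installs the subrepo package itself
```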
The directory structure of your new project will look something like this:
```
├── data
│   ├── external                   <- Data from third party sources.
│   │   └── SOURCE.tsv             <- Record of data sources
│   ├── interim                    <- Intermediate data that has been transformed.
│   ├── raw                        <- The original, immutable data dump.
│   └── results                    <- Final results
│
├── environment.yml                <- Conda environment definition for reproducing
│
├── experiments                    <- Experiments divided by subdirectory
│   └── 03-04-14.sample-experiment <- Sample demonstrating the structure
│       ├── data -> ../../data/    <- Softlink to data/ for code cleanliness
│       ├── notebooks              <- Jupyter notebooks for experiment
│       │   └── template.ipynb     <- Sample notebook containing useful code
│       └── results                <- Final reports for communication (e.g. html version of jupyter nb)
│           └── figures            <- Figures for reports
│
├── LICENSE
├── README.md                      <- The top-level README for developers using this project.
├── setup.py                       <- Configuration for installing src/ as python package
│
└── src                            <- Source code for use in this project.
    ├── data
    │   └── __init__.py
    └── __init__.py                <- Makes src a Python module
```
Any step in an experiment that fetches raw data, generates data, or creates results and is longer than a one-liner should be developed as a standalone script. This keeps your notebooks clean, and also helps ensure data reproducibility later (see below).
The `setup.py` script installs your `src/` directory as a package in the conda environment, so you can import any python code stored in the `src/` module. For example:
```
from src.data.clean import clean_raw_dataset
```
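In the same spirit, such a script can be driven from a notebook cell or the shell rather than inlined in the notebook; for example (the script's command-line interface here is hypothetical):
```
python src/data/clean.py data/raw/dataset.tsv data/interim/dataset.clean.tsv
```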
Jupyter Lab offers many features that make it an ideal base from which to run experiments, from data generation to results analysis. The idea is that someone could open a notebook, press 'Run all', and not only have the entire experiment reproduced, but also understand what is happening. Therefore it is important to keep in mind that the notebook's purpose is to act as a driver and communicator for the experiment - external scripts (in `src/`) should be used to keep code streamlined and minimal, and markdown text should be used liberally to explain the process. For more information on Jupyter use, see this excellent blog post.
The default conda environment includes a package called `nb_conda_kernels`. This ensures that a Jupyter session run from this environment will automatically detect any other conda environments and add them as jupyter kernels, as long as they have the `ipykernel` package installed.
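So, to make a hypothetical second environment named `other-env` show up as a kernel, it is enough to install `ipykernel` into it:
```
conda install -n other-env ipykernel
```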
The default environment includes the `nbdime` package, which allows git to play nicely with jupyter notebooks.
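If its git integration is not already configured, nbdime provides a command to enable its notebook-aware diff and merge drivers:
```
nbdime config-git --enable --global
```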
You can create new experiment directories by copying the provided example or using another cookiecutter as follows:
cd experiments
cookiecutter https://github.com/michael-ford/new-experiment-template
Git-subrepo allows you to include your software project as a separate repo, while recording the version/state of that project in your larger DS project commits. Generally it works the same as git, but there are some key things to keep in mind (a typical commit-and-push sequence is sketched after this list):
- It is not a standalone repository - the subrepo needs to have a proper repository elsewhere as a remote.
- You can only have a single branch in the project at once. To switch to another branch, you need to re-clone the subrepo using the `-f` (force) flag: `git subrepo clone -f <REPO-REMOTE-URL> src/<REPO-NAME> -b <BRANCH-NAME>`. Depending on your git workflow this might constrain how much development you do in the subrepo (but that's why you have a proper remote repository, right? :)
- You can push new commits in your subrepo to its remote using `git subrepo push src/<REPO-NAME>`.
- Any commit in the parent DS project repo that contains files from the subrepo will go into the commit history of the subrepo. This does make sense (since how else would you sync the branches between the DS repo and subrepo?), but it means you may want to commit your subrepo changes independently, and before you commit any DS repo changes.
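For instance, with the hypothetical `src/my-tool` subrepo, a safe sequence is to commit and push the subrepo's changes before committing the rest of the DS project:
```
git add src/my-tool
git commit -m "Update my-tool"
git subrepo push src/my-tool
# ...then stage and commit the DS project changes as usual
```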