Standard lab protocols, recommendations, and a collection of scripts for basic data processing, analysis, and visualization, to be modified as necessary for different projects.
This is something I'm actively optimizing and learning about, so my strategy may not be the gold standard, but it is extremely important to at least be thinking about these issues!
Primary principles:
- Keep a 'raw data' directory, where files for a given project or analysis are considered completely immutable. I waver between a system-wide raw data directory containing multiple project sub-directories and a raw data directory within each project directory. Either way, the point is that there should be a place where 'original' files belong, with READMEs explaining where the data came from. Ideally, this directory could be re-generated by unzipping a version archived somewhere such as Zenodo.
- 'Intermediate data', such as processed versions of your data (e.g. filtered sequencing reads or 'cleaned' metadata files), should be generated from the 'raw' files using scripts, which live in a separate:
- 'Scripts' directory. This directory is ideally a project-specific Git repository that tracks changes and supports collaboration through GitHub, where an archived version can be created upon manuscript submission. These scripts should ideally be cross-platform, use relative filepaths, and take command line arguments, so that someone can download the raw data, clone the repository, specify local settings and the location of an output directory, and completely re-generate the results of your study by running the scripts as described in a README.
	- Try to organize things around groups of seven or fewer: e.g. if you have 20 individual scripts, merge some of them, tie them together with seven wrapper scripts, or organize them into seven subfolders
- Order scripts by starting their filenames with a 2-digit number (e.g. 01_sequencing_qc.sh)
- Have each script place all its outputs (including log files) in a folder with the same name as the script, within the overall output directory specified by the user
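Taken together, a project laid out along these lines might look something like this (a hypothetical sketch; the exact names are up to you):

```
my_project/
├── raw_data/            # immutable originals, each with a README
│   ├── README.md
│   └── sequencing/
├── intermediate_data/   # re-generated from raw_data by the scripts
├── scripts/             # project-specific Git repository, shared on GitHub
│   ├── 01_sequencing_qc.sh
│   └── 02_clean_metadata.sh
└── output/              # overall output directory specified by the user
```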
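As a concrete sketch of the script conventions above (the filenames and the `raw_data` path are hypothetical), a numbered script can take the output directory as a command line argument, use relative paths into the raw data, and write everything, including a log file, into a subfolder named after itself:

```shell
#!/usr/bin/env bash
# 01_sequencing_qc.sh: a hypothetical example of the conventions above.
set -euo pipefail

# The user specifies the overall output directory as the first argument
# (defaulting to ./output here so the sketch runs standalone).
outdir_root="${1:-output}"

# Derive a per-script output folder from the script's own filename.
script_name="$(basename "$0" .sh)"
outdir="${outdir_root}/${script_name}"
mkdir -p "$outdir"

# Relative path into the immutable raw data directory.
raw_dir="raw_data"

# Record what ran, when, and on which inputs, in a log inside $outdir.
{
  echo "Run started: $(date)"
  echo "Raw inputs:"
  ls "$raw_dir" 2>/dev/null || echo "(no raw_data directory found)"
} > "${outdir}/${script_name}.log"

# ...actual QC/processing steps would go here, reading only from $raw_dir
# and writing only into $outdir...
```

Because the script derives its output folder from `basename "$0"`, renaming the script automatically renames its output folder, keeping the two in sync.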
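The 2-digit prefixes also mean that shell glob expansion sorts the scripts correctly (01_, 02_, ..., 10_, 11_), so the whole pipeline can be re-run with a small wrapper like this sketch (the `scripts/` directory name is an assumption):

```shell
#!/usr/bin/env bash
# run_all.sh: a hypothetical wrapper that runs every numbered script in order.
set -euo pipefail
shopt -s nullglob   # so the loop is simply skipped if no scripts match

# Overall output directory, passed through to each script.
outdir="${1:-output}"

# Glob expansion returns the 2-digit-prefixed scripts in sorted order.
for script in scripts/[0-9][0-9]_*.sh; do
  echo "Running ${script}..."
  bash "$script" "$outdir"
done
```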
Here's a list of resources my students and I have found useful for learning the science and code:
- Not just for programmers: How GitHub can accelerate collaborative and reproducible research in ecology and evolution
- An entire paper written and edited directly on GitHub
- Experimental design in ecology
- General principles and great examples
- All basic stats use linear models
- Permutation testing, bootstrapping, resampling
- Inferring Multiple Causality: The Limitations of Path Analysis
- My own explainer for setting up a computer
- https://nyu-cdsc.github.io/learningr/assets/simulation.pdf
- https://intro2r.com/
- https://www.codecademy.com/learn/learn-r