Standard lab protocols, recommendations, and a collection of scripts for basic data processing, analysis, and visualization, to be modified as necessary for different projects.
This is something I'm actively optimizing and learning about, so my strategy may not be the gold standard, but it is extremely important to at least be thinking about these issues!
Primary principles:
- Keep a 'raw data' directory, where files for a given project or analysis are considered completely immutable. I waver between a system-wide raw data directory containing multiple project sub-directories and a raw data directory within each project directory. Either way, the point is that there should be a place where 'original' files belong, with READMEs explaining where the data came from. Ideally, this directory could be re-generated by unzipping a version archived somewhere such as Zenodo.
- 'Intermediate data', such as processed versions of your data (e.g. filtered sequencing reads or 'cleaned' metadata files), should be generated from the 'raw' files using scripts, which live in a separate:
- 'Scripts' directory. This directory is ideally a project-specific Git repository that tracks changes and supports collaboration through GitHub, where an archived version can be created upon manuscript submission. These scripts should ideally be cross-platform, use relative filepaths, and take command line arguments, so that someone can download the raw data, clone the repository, specify local settings and the location of an output directory, and completely re-generate the results of your study by running the scripts as described in a README.
	- Try to organize things around groups of seven or fewer: e.g. if you have 20 individual scripts, merge some of them, tie them together with seven wrapper scripts, or organize them into seven subfolders
- Order scripts by starting their filenames with a 2-digit number (e.g. 01_sequencing_qc.sh)
- Have each script place all its outputs (including log files) in a folder with the same name as the script, within the overall output directory specified by the user
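Taken together, a project laid out along these lines might look something like this (a hypothetical sketch; the exact names are up to you):

```
my_project/
├── raw_data/            # immutable originals, each with a README
│   ├── README.md
│   └── sequencing/
├── intermediate_data/   # re-generated from raw_data by the scripts
├── scripts/             # project-specific Git repository, shared on GitHub
│   ├── 01_sequencing_qc.sh
│   └── 02_clean_metadata.sh
└── output/              # overall output directory specified by the user
```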
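As a concrete sketch of the script conventions above (the filenames and the `raw_data` path are hypothetical), a numbered script can take the output directory as a command line argument, use relative paths into the raw data, and write everything, including a log file, into a subfolder named after itself:

```shell
#!/usr/bin/env bash
# 01_sequencing_qc.sh: a hypothetical example of the conventions above.
set -euo pipefail

# The user specifies the overall output directory as the first argument
# (defaulting to ./output here so the sketch runs standalone).
outdir_root="${1:-output}"

# Derive a per-script output folder from the script's own filename.
script_name="$(basename "$0" .sh)"
outdir="${outdir_root}/${script_name}"
mkdir -p "$outdir"

# Relative path into the immutable raw data directory.
raw_dir="raw_data"

# Record what ran, when, and on which inputs, in a log inside $outdir.
{
  echo "Run started: $(date)"
  echo "Raw inputs:"
  ls "$raw_dir" 2>/dev/null || echo "(no raw_data directory found)"
} > "${outdir}/${script_name}.log"

# ...actual QC/processing steps would go here, reading only from $raw_dir
# and writing only into $outdir...
```

Because the script derives its output folder from `basename "$0"`, renaming the script automatically renames its output folder, keeping the two in sync.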
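The 2-digit prefixes also mean that shell glob expansion sorts the scripts correctly (01_, 02_, ..., 10_, 11_), so the whole pipeline can be re-run with a small wrapper like this sketch (the `scripts/` directory name is an assumption):

```shell
#!/usr/bin/env bash
# run_all.sh: a hypothetical wrapper that runs every numbered script in order.
set -euo pipefail
shopt -s nullglob   # so the loop is simply skipped if no scripts match

# Overall output directory, passed through to each script.
outdir="${1:-output}"

# Glob expansion returns the 2-digit-prefixed scripts in sorted order.
for script in scripts/[0-9][0-9]_*.sh; do
  echo "Running ${script}..."
  bash "$script" "$outdir"
done
```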
Here's a list of resources my students and I have found useful for learning the science and code:
- Not just for programmers: How GitHub can accelerate collaborative and reproducible research in ecology and evolution
- An entire paper written and edited directly on GitHub
- Experimental design in ecology
- General principles and great examples
- All basic stats use linear models
- Permutation testing, bootstrapping, resampling
- Inferring Multiple Causality: The Limitations of Path Analysis
- My own explainer for setting up a computer
- https://nyu-cdsc.github.io/learningr/assets/simulation.pdf
- https://intro2r.com/
- https://www.codecademy.com/learn/learn-r