forked from cct-datascience/group-handbook-template
-
Notifications
You must be signed in to change notification settings - Fork 0
/
research-best-practices.qmd
81 lines (59 loc) · 3.93 KB
/
research-best-practices.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
---
title: "Research Best Practices"
editor: visual
---
::: callout-important
## Group Edit
This template only includes best practices as they apply to *data and code*.
You may wish to edit this to add other best practices relevant to your group's lab and/or field work.
:::
## Research Compendia
A [research compendium](https://book.the-turing-way.org/reproducible-research/compendia.html) is a collection of all the files (data, metadata, notes, code, manuscripts, etc.) associated with a research project.
Even before the first data point is collected, some thought should be put into how you'll organize your research compendium.
This includes things like:
- Directory structure
- File naming conventions
- File formats to use
- Where the research compendium will be stored (shared drive? Box? your laptop?)
### Directory Structure
A minimal research compendium might look like this:
```
project_name/
├── README.md
├── data/
│ ├── metadata.txt
│ └── my_raw_data.csv
├── scripts/
│ ├── 01-wrangle_data.R
│ ├── 02-fit_models.R
│ └── 03-plot_results.R
└── output/
```
The `README.md` file should contain at minimum 1) a brief description of the project, 2) who is involved, and 3) instructions on how to reproduce the analysis (e.g. what software is needed to run it, any manual steps that aren't done by code, etc.).
Once the project is complete (e.g. a paper is in review) it should also contain links or citation information for any products resulting from the project.
The `data/` folder should contain the **raw** data for the project.
**This raw data should not be edited by hand**.
Files in the `data/` folder should be considered (or actually made) ***read-only!*** This folder might contain separate `data_raw/` and `data_cleaned/` folders if there is a need to save intermediate data to disk, in which case the files in `data_raw/` should be read-only.
The `data/` folder is also a good place for metadata—information about how to interpret the data and how/when/where it was collected (e.g. what the column names in `my_raw_data.csv` mean and what units the values are measured in).
::: callout-important
## Group Edit
You may wish to discuss as a group what constitutes "raw data".
For example, if you are entering data manually into Excel, at what point should that file be considered "finished" and treated as read-only?
:::
The `scripts/` folder is a place to put code to run your analysis.
For R code, this folder is often called `R/` by convention, and for other languages it might be called something like `src/`.
In this example, I've used numbered file names to indicate the order in which scripts are to be run to reproduce the results.
The `output/` folder is where any figures or summary data or tables created should be saved.
Other things a research compendium might contain, especially if it is made public (e.g. published as supplementary)
- A code of conduct
- A license file and/or citation file indicating how others can use your work
- A `docs/` folder for notes or manuscripts
- A `.bib` file containing reference information for a manuscript
### File naming conventions
As above, numbering scripts that should be run in order is always a good idea—make all the numbers the same number of digits (e.g. 09-, 10-, 11-) so they sort alphabetically in the right order.
When dates are relevant in file names, put them at the start and use the ISO format (YYYY-MM-DD) so that when sorted alphabetically they are in chronological order.
Avoid spaces and special characters that may not work on all operating systems.
Similarly, file names are case sensitive on some systems and case-insensitive on others, so avoid capital letters in file names to keep things simple.
### File formats
Preference non-proprietary and plain-text formats when possible.
E.g. use `.csv` rather than `.xlsx` when there isn't a need for multiple sheets or other Excel features.