---
editor_options:
  markdown:
    wrap: 72
---
# Open Science
*Authors: Francesco Giardina, Yunke Peng, Koen Hufkens, Benjamin
Stocker*
## Motivation
### Why should I care?
The immediate benefits of working reproducibly for the individual are
manifold. They include:
1. Resuming work after absence becomes easier.
2. As a consequence, implementing revisions of a paper becomes
    faster.
3. Writing legible code and documenting a project ensures that you can
hand it off easily to collaborators.
4. Using a project-based approach (read below for more info) retains
most data in one spot. This way, data and code can easily be moved
from system to system.
5. Movable projects allow for local development but remote (cluster)
deployment of projects.
6. Legible code and the use of version control software allow for a
team to quickly iterate over a problem if needed.
### Some history
The importance of good data management transcends the success of a
scientific project itself: It guarantees the reproducibility of results,
thus encouraging the scientific community to reuse a workflow - a key
concept for the advancement of science.
After an attempt to standardize the practice of open science in 2007, it
was not until 2016 that representatives of academia, industry, funding
agencies and publishers convened to define a set of principles that
would become known as the **FAIR Data Principles**. Later that year, at
its Hangzhou summit, the G20 endorsed the use of the FAIR principles in
research. In short, the FAIR data principles mean that data is:
- **Findable**: Data should be easy to find, both by humans and
machines. There is a strong emphasis on the use of metadata and the
assignment of data identifiers.
- **Accessible**: The access to data should be open, free and
universally implementable. If necessary, an authentication and
authorization procedure should be clearly explained.
- **Interoperable**: This refers to the ease of integration with other
    data. It should be easy to interoperate within a workflow for
    analysis, storage and processing.
- **Reusable**: The documentation of data should be clear enough to
    allow reusability, which is the ultimate goal of the FAIR principles.
### Why is it important?
Open Science refers to the effort to make scientific research available
to peers in academia and to the public, to foster the dissemination of
findings and the development of knowledge. It involves a methodical
change to the research cycle, promoting more exchange and collaboration.
The shift to an open science practice is not trivial, and often
encompasses a cultural change, but it is essential for strengthening the
link between scientific research and society. Contributing towards this
effort, the European Union is leading the way with several initiatives
([The Facilitate Open Science Training for European Research
(FOSTER)](https://www.fosteropenscience.eu/foster-taxonomy/open-science-policies),
[The European Open Science Cloud
(EOSC)](https://eosc-portal.eu/about/eosc),
[OpenAIRE](https://www.openaire.eu)).
The transition to open science over the past few years has been
motivated by a "reproducibility crisis", i.e. the failure to replicate
scientific results. It has been argued that in the field of medicine,
more than half of published findings may be unreliable
(Ioannidis, 2005). Practicing open science offers a solution to this
problem, as sharing materials and data eases the replication of original
studies by other scientists. Open science is made possible by **open
data**, i.e. when the data itself is freely accessible to other peers
and the public. Policies enabling open data allow researchers to
innovate, starting from existing knowledge. Other essential benefits for
scientists include reputational gains, increased visibility and impact,
and a broader increase in the reliability of research, as it enables the
replication and verification of scientific results. This in turn boosts
citizens' trust in science.
The revolution of open science is not limited to the academic world
alone. Society is a consumer of science - from the computer you're
reading this article on, to the vaccines that are taking the world out
of the current pandemic.
Another key advantage of scientific results being accessible without a
fee is that anyone can benefit from them, regardless of their location
or economic situation. This is especially important for the scientific
community in developing countries. In the end, open science speeds up
the circulation of new information and helps generate solutions to the
current great challenges. As Isaac Newton said in his 1675 letter: "If I
have seen further, it is by standing on the shoulders of Giants."
Last but not least, by making your code, data, and methods freely
available, you are creating more citeable items per published project.
This will grant more visibility to your research, as well as possibly
increase the total number of times your work gets referenced. It can
trigger new collaborations and strengthen the impact of your research.
### Obstacles to Open Science
Obstacles to the development of open science trace back to when journals
were still in paper format, and storing and sharing materials and data
was difficult. Luckily for us, technology is making it easier to
practice open science.
Another problem is that journal articles are basically a summary of long
months (and sometimes years) of research. This means that in the final
report, some part of the analysis must be summarized or excluded, for
the sake of brevity. Papers with a clean story and clear results tend to
be published more easily. In the end, what we are presented with in a
published article can be far from the actual trial and error steps of
the scientific method. Unfortunately, this phenomenon causes an
impoverishment of science, as other scientists could benefit from those
"failed" attempts.
Finally, another obstacle is the time constraint imposed by the current
incentive structure. In today's academic world, publications are the
easiest way for scientists to propel their careers forward. Against
this background, it is hard to dedicate time to a process which may not
bring direct benefits in terms of primary research articles. However,
considering the bigger picture described above, open science brings
advantages both to the individual researcher and to the scientific
community.
## Data resources
### Environmental data repositories
The following is a limited list of permanent data repositories for
(environmental) data.
- [Copernicus Climate Data Store](https://cds.climate.copernicus.eu/)
- [Pangaea](https://www.pangaea.de/)
- [NASA Earth Data](https://earthdata.nasa.gov/)
- [ORNL DAAC](https://daac.ornl.gov/)
### Some of our favourites
- **TRY database**: <https://www.try-db.org/TryWeb/Prop2.php> It
    includes a wide range of leaf traits (mainly from natural
    environments). To request data, open the trait list via "show
    traits" at the bottom left; each number identifies a unique trait
    (e.g. 1 means leaf area). Enter all the trait numbers you want,
    then press continue and fill in the rest of the information.
    Please note: never request too many variables at once, as it could
    make the document too large to download. ![](./figures/public1.png)
- ...
### General permanent data repositories
- **Dryad**: <https://datadryad.org/search> A public database to
    search, browse and download data; it includes many global and
    regional datasets, especially in ecology, plant sciences and
    climate. Each dataset has a DOI, so you can cite it when using it
    in your publication.
- **Zenodo**: <https://zenodo.org/> More widely used than Dryad, as it
    covers all research areas rather than just ecological data. Many
    publications use it to deposit their data and mint a unique DOI.
- **Github**: an example here: <https://github.com/forc-db/ForC> Some
    researchers also deliver their data via Github. Although Github
    cannot mint a DOI directly (it can only do so in combination with
    Zenodo), its advantage is that multiple data providers can update
    the data together, and frequent updates are convenient. This also
    has side effects: such data are difficult to cite, and there is a
    risk of using an old version of recently updated data. So always
    remember to `git pull` the repository before using this sort of
    data. ![](./figures/public2.png)
- **Other sources**: Many other public datasets used for forcing or
    modelling are already available on Euler, for example WFDEI,
    CRU... Always check Euler first to see if a dataset already exists
    there.
### Software for data interfacing
#### R
- MODISTools (see the sketch after this list)
- ...
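MODISTools, for example, provides an R interface to the MODIS Land
Products Subsets web service at the ORNL DAAC. A minimal sketch follows;
the product, band, coordinates, dates and site name are arbitrary
placeholders, not recommendations:

```{r eval = FALSE}
library(MODISTools)

# Download an 8-daily MODIS land-surface-temperature subset for one
# point; all parameter values below are illustrative
df_lst <- mt_subset(
  product   = "MOD11A2",
  band      = "LST_Day_1km",
  lat       = 47.38,
  lon       = 8.55,
  start     = "2010-01-01",
  end       = "2010-12-31",
  site_name = "example_site",
  internal  = TRUE
)

head(df_lst)
```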
#### Python
- ...
## Open science practice
The following describes principles and rules for open science practice
as performed in our research group - *Computational Ecosystem Science*
at ETH Zurich. The key concepts are general and specific rules to be
considered as examples.
### Organizing the working environment
Keeping things separate and in the right place is key. You may want to
use the same code and data across multiple computers and possibly a
remote server for High Performance Computing. It is therefore advisable
to adopt an identical directory structure on each computer. This allows
you to use identical relative paths (see
[here](https://www.geeksforgeeks.org/absolute-relative-pathnames-unix/))
in scripts that work across platforms, and enables easy syncing of files
across platforms.
It is advisable to keep all git repositories at a top level in your
working environment, e.g., in your home, e.g., (`~/my_repo1`,
`~/my_repo2`, etc.). Code for your project (containing all steps from
reading original data to final results and figures) goes into these and
is synced with github (or any other git remote host service, see Chapter
'Git in Project Management'). Note that figures produced by your code
and intermediate output should *not* be added to git.
You may want to collect any particularly useful functions that may be
used in several different projects in your own personal utility
package/library. An example is the
[*rbeni*](https://www.notion.so/TBC-rbeni-c7afcfef13cf4c9e838773d5adeeb22f)
package by Benjamin Stocker. Make sure it's well documented.
Original data, as downloaded from the web, goes into a sub-directory,
which is also sitting at a top level in your working environment, e.g.,
in your home (`~/data`).
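As a sketch, such a working environment might look like this (the
repository names are illustrative):

```
~/
├── my_repo1/        # project code, a git repository synced with Github
├── my_repo2/        # another project, its own git repository
└── data/            # original data as downloaded, *not* on git
    ├── some_dataset/
    └── ...
```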
The project directory and other utility packages are synced with Github.
Using *git*, the code of entire projects can be synced across multiple
workstations and with your personal directory on your institution's
cluster or cloud services (e.g. the HPC cluster 'Euler' at ETH Zurich).
![](./figures/folder_organization_new.png)
### Where does the data go?
Original data, as downloaded from the web, should always be in the local
data directory `~/data` and synced with the remote server. Once you have
downloaded data files from a particular source (e.g. accompanying a
paper, or a particular satellite mission, etc.), create a sub-directory
within `~/data` and place it in there, along with a `README` file (plain
text) where you specify when and how the data was obtained (contact
person, URL, etc.), what the data contains (variables, units of
variables if not provided in the files directly, etc.), what data use
policies apply, who obtained the data (your name) and how it should be
cited. If original data was processed to a different format or
re-gridded, please note it in the README and refer to respective scripts
which may be added to the same directory (e.g., `proc_this_data.sh`)
(but not any further analysis).
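As a sketch, such a `README` (all entries illustrative) might contain:

```
Dataset:     <name of the dataset>
Obtained:    2021-05-10, by <your name>, from https://... (or contact person)
Contents:    variables and their units (if not provided in the files directly)
Data policy: data-use policy and how the data should be cited
Processing:  regridded to 0.5 degrees, see proc_this_data.sh
```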
Note that intermediate outputs produced by your project's code should
not go on git. Git is only to back up code (and small plain-text data
files, on the order of \<100 MB). Still, you may want to sync these
intermediate outputs with your cluster as a backup of your analysis.
![](./figures/data_location_new.png)
### Reproducibility
Making your analysis reproducible - from the original files you
downloaded on the web to final results and figures - is our highest
standard. No excuses! Before submission of a paper, you should be able
to demonstrate reproducibility.
The key file here is a "workflow" script written in RMarkdown or a
Jupyter Notebook, which may run all steps of your analysis in the right
order (or at least contains accurate descriptions of how things are
run). Upon publication, all code and important outputs of the analysis
are to be uploaded to a permanent repository and assigned a digital
object identifier (DOI), e.g., on *Zenodo* (or any other service that
generates DOIs). Regularly *tag* your *git* project at important stages
(e.g. first submission, final submission, etc.). Tags can form a
*release* within Github, and releases can be synced with *Zenodo*, where
each release gets its own DOI (see chapter below). Refer to the code and
output DOIs in your published paper.
![](./figures/reproducibility_new.png)
### A quick note on *.bash_profile*
To ease access to a cluster, it is possible to use *.bash_profile*.
This is a configuration script located in your home directory (file
path: `~/.bash_profile`). It is used to customize your bash environment
and can contain variable specifications, login information, etc. You
can add bash functions there. It is basically a normal bash script that
is executed each time you open a bash shell (terminal window or tab).
To open and edit your bash_profile, do (here using Sublime Text's
command-line tool `subl`; any text editor works):
``` bash
subl ~/.bash_profile
```
In your *.bash_profile*, you can list aliases (shorthands), environment
variables, and functions that may be useful. Just add them to your
bash_profile.
Once you have made changes to your bash_profile, they are not
immediately effective in shell windows that are already open. To make
them effective, run
``` bash
source ~/.bash_profile
```
The bash profile can contain specifications of environment variables and
functions. The first line is (as for every bash script) `#!/bin/bash` .
Then come the environment specifications.
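A minimal sketch of what such a file might contain (the paths, the
server address and all names are illustrative):

``` bash
#!/bin/bash

# Environment variables
export PATH="$HOME/bin:$PATH"

# Aliases (shorthands); replace the username before use
alias ll="ls -lh"
alias euler="ssh username@euler.ethz.ch"

# Functions: jump to a project directory in your home and list it
proj () {
  cd ~/"$1" && ls
}
```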
### Best practice notes
- When working with git, avoid creating new files with "\_v2". Git makes
this obsolete. If you want, you can always go back to earlier
versions. If you want to keep a stable version while trying out
something new, create a new "branch", while the "master" remains the
stable version. If the "\_v2" files are the ones that are specific
to *your* development and you are afraid of messing up the original
code: don't worry! You have already forked the repo and if the
maintainer of the original code doesn't want to adopt the changes in
your fork, they won't.
- Avoid writing data into programming language-specific formats (e.g.
`.RData`, or `.rds`), unless these are temporary files used for
intermediate steps. All published data must be made accessible under
*FAIR* principles. This means that data formats should be readable
    without reliance on proprietary software. In general, write data
    frames into CSV files and geo-located raster data into NetCDF or
    GeoTIFF (see the sketch after this list).
- Make sure that important figures (e.g. the ones in published papers)
can be easily reproduced, starting from original data that were
    downloaded from (possibly open-access) sources. A good place to
    describe how to reproduce figures is an RMarkdown file or the
    README file.
- When creating `.RData` or `.rds` files, put only one variable into
    them and name the file after the variable. E.g. the file
    `df_combined.RData` contains only one variable, namely
    `df_combined`.
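A minimal sketch of these conventions (object and file names are
illustrative):

```{r eval = FALSE}
library(readr)   # write_csv()
library(terra)   # writeCDF(), writeRaster()

# Data frames go into an open, plain-text format
write_csv(df_combined, "data/df_combined.csv")

# Geo-located raster data goes into NetCDF or GeoTIFF
writeCDF(rast_gpp, "data/rast_gpp.nc")
writeRaster(rast_gpp, "data/rast_gpp.tif")

# If an RData file is used for intermediate steps: one object per
# file, with the file named after the object it contains
save(df_combined, file = "data/df_combined.RData")
```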
### In a nutshell
- For each project that you are working on, create a separate
directory in your home (e.g., `~/my_current_project/`). All scripts
that you use for the analysis go in here. Outputs from analyses or
model runs, go to a separate sub-directory `data/`. Figures may go
to `fig/`.
- Separate original data and analysis/model outputs. Original data is
what was collected in the field, obtained from collaborators, or
    downloaded from public sources. Place original data outside the
    project directory; it is *not* part of the respective git
    repository.
- Create a new git repository for each project folder and sync all
source code files in it with your github (or gitlab, or bitbucket)
account.
- To make all your analyses reproducible, place an RMarkdown (for R) or
Jupyter Notebook (for Python) file in the main directory
`~/my_current_project/` (or in a subdirectory `analysis/`) that
documents all the steps - from reading in the original data, to data
processing steps (document any decision regarding processing you
made), and to final published plots and numbers. For bigger
projects, you may keep separate analysis scripts (RMarkdown or
Jupyter Notebooks) for different parts of the analysis, and you may
add text describing which other scripts performed what part of the
analysis in what order. For such bigger projects, it is important to
document the main steps of the analysis and respective scripts in
the `README`.
- The outputs (html, PDF, ...) from RMarkdown files or Jupyter
Notebooks serve not only as a lab report for yourself, but can also
    be used to communicate your results with others, or provided as
    supplementary material accompanying papers.
## An open science workflow
The following provides an example that contains all steps of an open
science workflow:
- Obtain and document data
- Organise analysis code in a git repository (and RStudio project)
- Document analysis and communicate results
- Sync analysis outputs with a remote server
- Code versioning and publication
### Obtain data
Below is an example of obtaining public data. Generally, always read
the accompanying README carefully, especially the data-use policy, before
further analysis and publication.
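As a generic sketch (the URL and directory name are placeholders),
downloading a file into a new sub-directory of `~/data` might look
like:

```{r eval = FALSE}
# Create a sub-directory for this data source and download into it;
# remember to add a README alongside (see 'Document data' below)
dir.create("~/data/some_dataset", recursive = TRUE)
download.file(
  url      = "https://example.org/some_dataset.csv",
  destfile = "~/data/some_dataset/some_dataset.csv"
)
```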
### Document data
- After downloading data, make sure each directory stores just one
    source of data. Take the example below: it includes the original
    csv, the original paper, and a README. The README here is a note to
    yourself: record when and where you downloaded the data (email or
    public link?), which variables it includes, their units, important
    notes, or anything else you wish to remember.
    ![](./figures/document1.png)
- Then you can mark it in Notion:
    <https://www.notion.so/computationales/Site-scale-data-d1438daa123448858fd962d23b0f2185>
    so that everyone else can find it easily when browsing this data
    again. ![](./figures/document2.png)
### Organise analysis code in an RStudio project
See this link for very good guidance:
<https://alexd106.github.io/intro2R/project_setup.html>
What advantages does creating a project bring? It helps you manage the
project: all analyses belonging to the project are kept together and
never mix with other types of analysis.
Main steps:
1. Create a project:
    ![](./figures/project1.png) ![](./figures/project2.png)
2. You will then see the new Rproj file in the window:
    ![](./figures/project3.png)
3. Then you can set your project working path in RStudio, which is
    exactly the place shown in the window above:
    `setwd('/Users/yunpeng/../first_project')`
### Document and communicate analysis
- Below is an example of using the ingestr package to download daily
    climate data (e.g. average temperature). The source data are
    available on Euler. To run the analysis, you can either download
    the data from Euler to your own hard drive, or run directly on
    Euler; see the commented Euler paths in the code below.
```{r eval = FALSE}
library(ingestr)
library(tibble)
library(ggplot2)

siteinfo <- tibble(
  sitename = "site1", lon = -79.2, lat = 34.8,
  elv = 61, year_start = 1992, year_end = 1992
)

df_watch <- ingest(
  siteinfo = siteinfo,
  source   = "watch_wfdei",
  getvars  = c("temp"),
  # On Euler: /cluster/work/climate/bestocke/data/watch_wfdei/
  dir      = "/Volumes/My Passport/data/watch_wfdei/",
  settings = list(
    correct_bias = "worldclim",
    # On Euler: /cluster/work/climate/bestocke/data/worldclim/
    dir_bias = "/Volumes/My Passport/data/worldclim/"
  )
)

ggplot(as.data.frame(df_watch$data), aes(x = date, y = temp)) +
  geom_point() +
  ggtitle("Daily air temperature at example site") +
  xlab("Date") + ylab("Temperature")
```
![](./figures/ggplot.png)
Some other tips:
- Structure your code into parts that naturally go together. A code
file should not contain more than a few hundred lines of code.
- Visually highlight the structure inside your code files.
- Implement the entire workflow of your project in one or a few
RMarkdown files. RMarkdown is a great tool to give structure to your
code and make your code readable.
- If additional scripts are used outside of your main workflow
script/RMarkdown file, point to them (by file name).
- Consider a hierarchy of sections. Explain briefly what is done by
each code chunk.
- The final aim is to make it easy for someone to reproduce your
published Figure X.
- When using RMarkdown or Markdown, please follow the syntax for
    nicely readable formatting.
### Sync outputs
- After the data analysis and before publication, all data used or
    saved on your desktop should be uploaded to Euler so that other
    users can benefit from re-using it. We recommend using FileZilla:
    the left pane shows your desktop, the right pane the remote
    (Euler). You can directly move files from left to right (to
    upload) or from right to left (to download).
- Very important: we recommend keeping paths consistent between
    desktop and remote. For data saving, my desktop path is
    "/Users/yunpeng/data/" and the Euler path is
    "/cluster/work/climate/bestocke/data", but both can be written as
    `~/data/` so that the same path can be used consistently in any
    analysis.
![](./figures/filezilla.png)
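If you prefer the command line over FileZilla, `rsync` achieves the
same; a sketch, with the username and the data paths above as
placeholders:

``` bash
# Upload the local data directory to Euler
rsync -av ~/data/ username@euler.ethz.ch:/cluster/work/climate/username/data/

# Download from Euler to the local machine
rsync -av username@euler.ethz.ch:/cluster/work/climate/username/data/ ~/data/
```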
### Code versioning and publication
Git tags mark important stages of your code; within Github, tags can be
turned into releases, see
[here](https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository).
For public code, use consistent versioning, see
[here](https://r-pkgs.org/release.html#release-version).
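A minimal sketch of tagging an important stage and pushing the tag to
Github (the tag name and message are illustrative):

``` bash
# Create an annotated tag, e.g. at first submission
git tag -a v1.0 -m "first submission"

# Push the tag so Github can turn it into a release
git push origin v1.0
```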
- Share raw and processed data on a server/cluster (e.g. Euler at
    ETH; other options at Imperial). See the point above.
- Make data publicly available (Zenodo, Dryad) and cite its DOI in
    the paper.
- Make code publicly available on Github, and publish a release.
- How to make code publicly available on Github, publish a release
    and sync it with Zenodo? See here:
    <https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content>
![](./figures/zenodo1.png) ![](./figures/zenodo2.png)
## Remaining bits
### Getting public data
Getting public data from widely used portals (e.g. Copernicus data,
NASA data, etc.) using their APIs.
Here we show an example of how to ingest (download) climate data from
Euler. Here, we are interested in downloading air temperature for a
certain range of time and site.
Guidance for installing ingestr is here:
<https://github.com/computationales/ingestr>
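A common pattern for installing an R package hosted on Github is
sketched below; check the repository README for the authoritative
installation instructions:

```{r eval = FALSE}
# Install ingestr from Github (assumes the README recommends a
# Github install; adjust according to the repository's instructions)
install.packages("remotes")
remotes::install_github("computationales/ingestr")
```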
The source data are all available on Euler (see the comments in the
ingestr example above). You can either download the data from Euler to
your hard drive/desktop and run the code there, or run it directly on
Euler.
### Git Attributes
Using `.gitattributes` to keep data on git:
<https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes#Binary-Files>
![](./figures/git_info.png)
`git add`: add (update) changed files, moving the change from the
working directory to the staging area.
`git commit`: commit the staged change along with some comments,
prepared before submission.
`git push`: push the change from the local repo to the remote repo; in
other words, the change can now be shared with other users.
`git pull`: pull the latest updates of the remote repo into the current
local desktop.
`git merge`/`git checkout`: merge local and remote code together.
`git tag`: create a tag that points out an important release of the
code. See the complete steps here:
<https://stackoverflow.com/questions/18216991/create-a-tag-in-a-github-repository>
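A sketch of this basic cycle (the file name and commit message are
illustrative):

``` bash
git pull                          # fetch and merge the latest remote updates
git add analysis.Rmd              # stage a changed file
git commit -m "update analysis"   # commit the staged change with a comment
git push                          # share the commit with the remote repo
```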
#### How to create a new repo in Github?
See this link:
<https://docs.github.com/en/get-started/quickstart/create-a-repo>
First press "Create a repository" on Github, follow the instructions,
and finish. Then follow the instructions shown below: use "cd
file-path" to move to your saved path, then add files manually, commit,
and git push.
![](./figures/github_info.png)
#### How to fork a repo in Github?
See this link:
<https://docs.github.com/en/get-started/quickstart/fork-a-repo>
First, find the repo you wish to fork, then clone it to your own Github
account and your own desktop as shown below.
When using other users' code, why do we prefer to fork a repo rather
than using it directly (e.g.
`devtools::install_github("../..")`)? Because:
(1) it makes the code easier to edit. We know exactly where we saved
    it, and we can edit it, trace it (through git) and push it to our
    own Github at any time;
(2) it also helps us communicate with other users. For example, if you
    find an apparent mistake in the original branch, you can fix it in
    your own fork, push it to Github and then report it to the
    original author.
![](./figures/fork1.png) ![](./figures/fork2.png)
## Exercise: is your workflow reproducible?
1. Clone the CES project template:
    <https://github.com/computationales/R-proj-template> (see the
    command after this list)
2. Download some publicly available data (put it in ./data but not on
    git -- be creative!)
3. Do some analysis/data visualization (one .Rmd file in ./vignettes)
4. Document and explain all steps
5. Another student may clone your repo and download the data you used
6. They should be able to reproduce the analysis and explain to you
    what you did
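Step 1 can be done with a single command, for example:

``` bash
git clone https://github.com/computationales/R-proj-template.git
```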
## References and further reading
1. Allen C, Mehler DMA (2019) Open science challenges, benefits and
tips in early career and beyond. PLOS Biology 17(12): e3000587.
<https://doi.org/10.1371/journal.pbio.3000587>
2. Wilkinson MD, et al. (2016) The FAIR Guiding Principles for
    scientific data management and stewardship. *Scientific Data* 3:
    160018. <https://doi.org/10.1038/sdata.2016.18>
3. Ioannidis JPA (2005) Why Most Published Research Findings Are False.
PLoS Med 2(8): e124. <https://doi.org/10.1371/journal.pmed.0020124>
4. Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, et
al. (2014) Best Practices for Scientific Computing. *PLoS Biol*
12(1): e1001745. <https://doi.org/10.1371/journal.pbio.1001745>
5. RMarkdown resources:
[basics](https://rmarkdown.rstudio.com/authoring_basics.html), [PDF
Cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf),
[Another longer
cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)