---
title: Submit a Dataset
---
The simplest way to submit data is to use our [online submission
form](https://docs.google.com/forms/d/e/1FAIpQLScQgxgq7bFnKicoS0I2DdKGscGoJLvO5EG20_NYkKPEsOB8Yg/viewform?usp=sf_link).
Provide the requested information and select Submit; we'll review the submission
and, if it's suitable, add it to the site.

Our overall goal is to provide a collection of useful, real-world datasets from
a variety of application areas that can be used as in-class examples, homework
assignments, or course projects. The repository can always use new datasets, so
we welcome submissions. Datasets should meet a few requirements:

- The data must be publicly shareable. There should not be any licensing
restrictions that prevent us from sharing it, or any human subjects ethics
concerns or other limitations. Look for datasets marked as being in the public
domain or with licenses like the Creative Commons Attribution (CC-BY) license.
- The data should be in a standard, easy-to-use format, like CSV. If you have to
clean the data from an original form, include the R script you used to clean
it with your submission.
- There should be good motivation for interesting analyses of the data that
would be appropriate for a course.
- The data should be smaller than 100 MB, since larger files pose technical problems
and are inconvenient for students. Files larger than a few megabytes should be
compressed when practical. R (`read.csv()`) and Python (`pandas.read_csv()`)
can read `.csv.gz` files directly, so gzip compression is a good choice for
larger files.
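As a rough illustration of the size and compression guidance above, the following Python sketch gzips a CSV file, checks that the result stays under the 100 MB limit, and reads it back. It uses only the standard library so it runs without extra packages. The function names (`gzip_csv`, `read_gzipped_csv`, `under_size_limit`) and the reading of "100 MB" as 100 million bytes are assumptions made for illustration, not part of the repository's tooling.

```python
import csv
import gzip
import os
import shutil
import tempfile

# Assumption: "100 MB" is read here as 100 million bytes.
MAX_BYTES = 100 * 1000 * 1000


def gzip_csv(csv_path):
    """Write a gzip-compressed copy of csv_path next to it and return the new path."""
    gz_path = csv_path + ".gz"
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def read_gzipped_csv(gz_path):
    """Read a .csv.gz file into a list of rows (each row is a list of strings)."""
    with gzip.open(gz_path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


def under_size_limit(path, limit=MAX_BYTES):
    """Return True if the file at path is smaller than limit bytes."""
    return os.path.getsize(path) < limit


if __name__ == "__main__":
    # Build a small throwaway CSV so the example runs anywhere.
    workdir = tempfile.mkdtemp()
    csv_path = os.path.join(workdir, "example.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "height_cm"])
        writer.writerows([[i, 150 + i] for i in range(1000)])

    gz_path = gzip_csv(csv_path)
    print(os.path.getsize(csv_path), "bytes uncompressed")
    print(os.path.getsize(gz_path), "bytes compressed")
    print("under limit:", under_size_limit(gz_path))
    print("rows read back:", len(read_gzipped_csv(gz_path)))
```

With pandas installed, `pandas.read_csv()` can read the resulting `.csv.gz` file directly, so students do not need to decompress it first.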

If you're familiar with R Markdown or Quarto, you can instead prepare the data
description page yourself from a template. Read on for detailed instructions.

## Getting Started with Quarto

This website is built with [Quarto](https://quarto.org), which automatically
renders the source files into the finished site. Data description pages can be made
directly from a template file we provide.

You can get the template in two ways:

1. Copy the [`_dataset-template.qmd` file from our GitHub
repository](https://github.com/cmustatistics/data-repository/blob/main/_dataset-template.qmd)
and save it on your computer. Once you're done, you can email us the file and
the data.
2. Fork [our GitHub
repository](https://github.com/cmustatistics/data-repository) into your own
GitHub account and edit it like any other Git repository. Once you're done,
you can submit a pull request.

## The Dataset Template

The dataset template is a [Quarto](https://quarto.org/docs/guide/) file; Quarto
is much like R Markdown and is supported by recent versions of RStudio. Quarto
files can contain R code, just like R Markdown, so your data description can
include graphics and tables generated by R code embedded in the file, if you
think this would be helpful to illustrate important features of the data. As you
fill out the template, you can use RStudio's "Render" button to see a preview of
the finished page.

The template asks for:

- Basic metadata, such as the statistical methods applicable to the data, a
short title, and your name. (All submissions are credited with the name of the
submitter.)
- A description of the problem. This should be sufficient to motivate the
analysis and help students understand the setting.
- A description of the data and the variables. It should be clear what each row
of data represents and what all variables mean. Units of measure should be
included whenever possible.
- References to the original source. If the source is an academic paper and the
data is deposited in a third-party repository (such as Zenodo, Figshare, or
Dryad), include references to both the paper and the archived dataset. When
available, include the DOIs of the references.

The template is meant to prevent a common problem with course datasets: they get
passed down from instructor to instructor, and eventually all information about
the original source is lost. The dataset may be presented to students without
context or important details (like units), and the instructor may not be able to
find the original data or references to answer questions about the data. By
providing sufficient detail, we can make datasets reusable for years to come.
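When writing the variable descriptions, it can help to start from the column names in the data file so that none are missed. The Python sketch below turns a CSV header into an empty Markdown table with one row per variable. The three-column layout and the function name `variable_table_skeleton` are hypothetical; the template's own format for describing variables may differ, so treat this only as a starting point.

```python
import csv
import io


def variable_table_skeleton(csv_text):
    """Return a Markdown table with one row per column named in the CSV header.

    The Description and Units cells are left as TODO placeholders for the
    submitter to fill in by hand.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    lines = ["| Variable | Description | Units |", "|---|---|---|"]
    for name in header:
        lines.append(f"| `{name}` | TODO | TODO |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example data; the column names are purely illustrative.
    sample = "patient_id,age,systolic_bp\n1,54,132\n2,61,141\n"
    print(variable_table_skeleton(sample))
```

The output can be pasted into the description page and each TODO replaced with the meaning of the variable and its units of measure.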

## Submitting

Once you've filled in the template, you can email the template file, the data,
and any cleaning scripts to the repository editor (currently Alex Reinhart).

Alternatively, submit a pull request to [our repository on
GitHub](https://github.com/cmustatistics/data-repository). Save the `.qmd` file with a
meaningful name and place it in one of the top-level category directories (like
`medicine/` or `astronomy/`). Place the data files in `data/`. Double-check that
you've committed everything (and nothing extraneous) and open a pull request as
usual.
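Before opening a pull request, you may want to check the layout described above. The following Python sketch is one hypothetical way to do so: it checks that the page is a `.qmd` file directly inside a top-level directory, that each data file sits under `data/`, and that no data file reaches 100 MB. The function `check_submission` and the example file names are assumptions for illustration; the repository does not necessarily include such a script.

```python
import tempfile
from pathlib import Path

# Assumption: "100 MB" is read here as 100 million bytes.
MAX_BYTES = 100 * 1000 * 1000


def check_submission(repo_root, page, data_files):
    """Return a list of problems found; an empty list means none were found.

    page and data_files are paths relative to repo_root.
    """
    root = Path(repo_root)
    problems = []

    page_path = Path(page)
    if page_path.suffix != ".qmd":
        problems.append(f"{page}: expected a .qmd file")
    # A page directly inside a category directory has exactly two path parts,
    # for example medicine/my-dataset.qmd.
    if len(page_path.parts) != 2 or page_path.parts[0] == "data":
        problems.append(f"{page}: expected <category>/<name>.qmd")
    if not (root / page_path).is_file():
        problems.append(f"{page}: file not found")

    for data in data_files:
        data_path = Path(data)
        if data_path.parts[:1] != ("data",):
            problems.append(f"{data}: expected to be under data/")
        full_path = root / data_path
        if not full_path.is_file():
            problems.append(f"{data}: file not found")
        elif full_path.stat().st_size >= MAX_BYTES:
            problems.append(f"{data}: 100 MB or larger")
    return problems


if __name__ == "__main__":
    # Build a throwaway directory tree that mimics a fork of the repository.
    root = Path(tempfile.mkdtemp())
    (root / "medicine").mkdir()
    (root / "data").mkdir()
    (root / "medicine" / "example-trial.qmd").write_text("---\ntitle: Example\n---\n")
    (root / "data" / "example-trial.csv").write_text("id,dose_mg\n1,50\n")

    # A correctly placed page and data file, then a misplaced pair.
    print(check_submission(root, "medicine/example-trial.qmd", ["data/example-trial.csv"]))
    print(check_submission(root, "example-trial.qmd", ["example-trial.csv"]))
```

A check like this does not replace reviewing the commit yourself; it only catches misplaced or oversized files.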