Instructions on how to add a new dataset

Albert Y. Kim edited this page Aug 14, 2019 · 1 revision

"Forking" and "cloning" your own copy of the repository

  • Setting GitHub username and email globally on your computer:
  1. Open your terminal. Here's how:

How do I open terminal in windows?

How to use the Terminal command line in macOS

  2. In your terminal, copy each line below, replace the example name and email with your own GitHub username and the email address registered with GitHub, and hit Enter to run (one line at a time). The quotation marks are only needed when a value contains spaces:
$ git config --global user.name "Mona Lisa"
$ git config --global user.email "email@example.com"

For example, I would type and run the following:

$ git config --global user.name Starryz
$ git config --global user.email zhouyujiaa@gmail.com
  • Fork Albert's repository for FiveThirtyEight

  • Go to your forked repository, click the green Clone or download button, then click the circled button to copy the URL of the repository

  • Open RStudio > File > New Project > Version Control > Git, then paste the URL copied above to Repository URL, fill in the Project directory name (or have it automatically called FiveThirtyEight), and select the directory location under Create project as subdirectory of, then click Create Project

Processing Data - an example

  • Under the current directory, open a new R Script (not .Rmd) and name it process_data_set_yourname.R (e.g. process_data_set_starry.R) and put it in the R/ folder, and load the following packages at the beginning of the script (remember to run them!):
library(tidyverse)
library(stringr)
library(lubridate)
library(janitor)
library(usethis)

Then, run the following code to read our test dataset fruit_basket:

fruit_basket <- read_csv("data-raw/test_folder/fruit_basket.csv")

Read dataset and "tame" dataset

  • How do we "tame" a dataset into a beginner-friendly version?

Read Albert's paper The fivethirtyeight R Package: "Tame Data" Principles for Introductory Statistics and Data Science Courses

And if you are in a hurry, here is my summary

Long story short: to reach a minimum viable product, we need to:

  1. Tidy the format (see the resources above for more details);

  2. Clean the column names;

  3. Make sure the variables have reasonable classes.
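As a rough illustration of these three steps, here is a minimal sketch on a made-up data frame. The `messy` tibble and its columns (`State Name`, `2016`, `2018`, `polls`) are hypothetical and not part of any FiveThirtyEight dataset, and `pivot_longer()` from tidyr is just one common way to tidy wide data.

```r
library(dplyr)
library(tidyr)
library(janitor)

# Hypothetical messy data: one column per year, inconsistent column names
messy <- tibble(
  `State Name` = c("MA", "NY"),
  `2016` = c(10, 20),
  `2018` = c(30, 40)
)

tame <- messy %>%
  # 1. Tidy the format: one row per state-year observation
  pivot_longer(cols = c(`2016`, `2018`), names_to = "year", values_to = "polls") %>%
  # 2. Clean the column names ("State Name" becomes "state_name")
  clean_names() %>%
  # 3. Give the variables reasonable classes
  mutate(
    state_name = as.factor(state_name),
    year = as.integer(year)
  )
```

Reshaping comes before `clean_names()` here so that the year values are still plain numbers when they move from column names into the `year` column.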

  • Read Dataset

Here, we use an example to demonstrate how to read a dataset and "tame" it. Note that every dataset is different: some will need more work to tidy the format, and messier ones will need more variables converted to a different class.

We use the dataset august_senate_polls (original dataset here) as an example.

If you followed along, you should by now have copied this dataset's folder, august-senate-polls, into the data-raw folder. We read it into R with read_csv and save it under the same name:

august_senate_polls <- read_csv("data-raw/august-senate-polls/august_senate_polls.csv")

  • Process Dataset

After reading, open the environment or type View(august_senate_polls) in Console to check out what the original dataset is like:

First, we notice that the format of this dataset is tidy. Great! So we don't need to change the structure of the dataset.

Then, we see that the column names are not consistent. To clean them up, we can run the clean_names() function from the janitor package, which makes the column names consistent.

We will also take a look at the classes of each variable by running str(august_senate_polls) in Console:

Here, as we can see, the classes of some variables are not quite right. For example, state represents the state of the poll, and there are only a certain number of states. So it makes more sense to transform state into a factor than to keep it as chr, which is how R read it by default. We can make the same change to some other variables, such as senate_class, because it only has 3 levels.

In addition, start_date and end_date were originally read as Date class, which is great! So we don't have to change the format for these variables.
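One possible way to spot factor candidates is to count the distinct values in each character column. This is only a sketch: the `polls` tibble below is a made-up stand-in for a few columns of the real dataset, and the `note` column is hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for a few columns of the real dataset
polls <- tibble(
  state = c("TX", "TX", "FL", "FL"),
  senate_class = c("1", "1", "3", "3"),
  note = c("a", "b", "c", "d"),
  start_date = as.Date(c("2018-08-01", "2018-08-02", "2018-08-03", "2018-08-04"))
)

# Count distinct values in each character column; Date columns are skipped
n_levels <- polls %>%
  summarise(across(where(is.character), n_distinct))

n_levels
```

Columns with only a handful of distinct values (here `state` and `senate_class`) are reasonable candidates for `as.factor()`, while a column that is different in nearly every row (here `note`) is probably better left as character.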

In summary, to get the MVP for "taming" a dataset which is by default tidy, we run:

august_senate_polls <- read_csv("data-raw/august-senate-polls/august_senate_polls.csv") %>%
  clean_names() %>%
  mutate(
    cycle = as.numeric(cycle),
    state = as.factor(state),
    senate_class = as.factor(senate_class)
  )

Finally, we run usethis::use_data(august_senate_polls, overwrite = TRUE) to save the dataset as an .rda file in the data/ folder.

So the entire code chunk for processing this dataset looks like:
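A sketch of such a chunk, assembled from the steps above, might look like the following. Wrapping the steps in a function and guarding the call with `file.exists()` are assumptions made here so the snippet can be run anywhere; the function name `tame_august_senate_polls` is hypothetical.

```r
library(readr)
library(dplyr)
library(janitor)

# Read the raw CSV and apply the minimal "taming" steps described above
tame_august_senate_polls <- function(path) {
  read_csv(path) %>%
    clean_names() %>%
    mutate(
      cycle = as.numeric(cycle),
      state = as.factor(state),
      senate_class = as.factor(senate_class)
    )
}

csv_path <- "data-raw/august-senate-polls/august_senate_polls.csv"

# Only run when the raw file is present, i.e. from the package root
if (file.exists(csv_path)) {
  august_senate_polls <- tame_august_senate_polls(csv_path)
  # Save the tamed data as data/august_senate_polls.rda
  usethis::use_data(august_senate_polls, overwrite = TRUE)
}
```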

Once you have finished processing a dataset, don't forget to run the code!

Documenting Dataset

After processing a dataset and making sure it is in a beginner-friendly format, we can write some documentation for users to see when they look up the dataset with the ?dataset command (e.g. ?august_senate_polls). Here, we keep using august_senate_polls as an example.

  • A closer look into the dataset

To write up the documentation, let's first get some understanding of this dataset. In the folder where we got the csv file of the dataset, there should also be a README.md. Let's see if it helps us document this dataset.

OK, this gives us some basic background knowledge of the dataset: the article it came from and its link, what every row stands for, etc. Sometimes (if the author is nice enough) there will be a data dictionary explaining what each column stands for. Most of the time, however, there is not. But we need to create a dictionary of the columns for our beginner-level users, so they can have a clearer picture of what they are looking at, right?

  • Start documentation

To start the documentation, go to the directory fivethirtyeight > R and create a new R Script called data_yourname.R (e.g. data_starry.R).

To make things easier, I will just show the code I have for creating the documentation of august_senate_polls and what it ends up looking like. This should be reasonably self-explanatory.

You are more than welcome to copy and paste my code and change the corresponding content for your own dataset.

#' @format A data frame with 594 rows representing senate polls, and 11 variables:
#' \describe{
#'   \item{cycle}{the election year}
#'   \item{state}{the state of the poll}
#'   \item{senate_class}{the class of the senate}
#'   \item{start_date}{the start date of the poll}
#'   \item{end_date}{the end date of the poll}
#'   \item{dem_poll}{the percent of support for the Democrat during the poll}
#'   \item{rep_poll}{the percent of support for the Republican during the poll}
#'   \item{dem_result}{the result percent of support for the Democrat during the election}
#'   \item{rep_result}{the result percent of support for the Republican during the election}
#'   \item{error}{the difference between the percent of support of one party during the poll and the result percent of support for the same party during the election}
#'   \item{absolute_error}{the absolute value of the error value}
#'   }
#' @source Emerson College’s poll of registered voters \url{https://www.emerson.edu/sites/default/files/Files/Academics/ecp-tx-aug2018-pr.pdf}
"august_senate_polls"

Some key points:

  • Put the title of the article in the first line
  • Provide the url of the article in the corresponding line (in my example, line 24)
  • Then, explain the format of the dataset: "A data frame with X rows representing Y, and Z variables"
  • Describe the meaning of each item. Here, most items should be easy to understand, such as start_date, state, etc. Some may be a little confusing. It helps to read the article and see how the data is used, to check whether the author explained the column in the article, or to look at the numbers and words in those columns and make an educated guess. You can also go to the original 538 repository and post an issue to ask about the meaning of a certain column in a certain dataset.
  • Afterwards, include any sources mentioned in the article, along with their links. For example, here the author of the article noted using "Emerson College’s poll of registered voters" and gave its link, so we included both in the documentation to give users easy access.
  • Finally, when you finish the documentation, start a new line (without the default #' at the beginning) and write the dataset name in quotation marks (e.g. "august_senate_polls"). This is necessary to make sure the ?dataset command gets linked to this documentation.
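Putting these key points together, a documentation block could be laid out roughly as in the skeleton below. Everything in angle brackets is a placeholder to fill in, and the wording of the title and description lines is an assumption about the layout, not text copied from the package.

```r
#' <Title of the FiveThirtyEight article>
#'
#' The raw data behind the story "<title of the article>"
#' \url{<link to the article>}.
#'
#' @format A data frame with <X> rows representing <Y>, and <Z> variables:
#' \describe{
#'   \item{<column_name>}{<what the column means>}
#' }
#' @source <data source named in the article> \url{<link to the source>}
"<dataset_name>"
```

The first line becomes the title of the help page, and the final quoted name is what links the block to ?dataset.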

Build the Package!

  • When you have finished all the processing and documentation (it's a lot of work, I know, but you will get better and faster at it!), here comes the most exciting part: building the package!

Save your changes in all R files.

  • Then, in the upper right side of RStudio, under Build, click on Install and Restart. Wait for the install to finish - this may take a little while.

  • To check if the documentation worked: try ?dataset (your newly-documented dataset) and see if the documentation you just wrote shows up, and if there is anything in its content or format that you want to change.

(Screenshot from: Object documentation)

  • Then, run Check, right next to the Install and Restart button as shown in the above screenshot. Fix any errors that the Check reports.
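If you prefer the Console to the Build pane, the devtools package offers roughly equivalent functions. This is a sketch under two assumptions not stated on this page: that devtools is installed, and that your working directory is the package root (the folder containing the DESCRIPTION file).

```r
# Are we at the package root (the folder containing DESCRIPTION)?
in_package <- file.exists("DESCRIPTION")

if (in_package && requireNamespace("devtools", quietly = TRUE)) {
  devtools::document()  # regenerate the help files from the #' comments
  devtools::install()   # reinstall the package
  devtools::check()     # run R CMD check on the package
}
```

The guard simply skips all three steps when the script is run from somewhere other than the package folder.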

Push to Git & Pull Request

  • When you are satisfied with your data wrangling and documentation and ready to have them pulled for review, click on Git in the upper right side of RStudio (right next to Build).

  • Click on Commit, and select all the files you have changed and want to push to your repository. Write a commit message, then click on Commit

  • Click on Push (the green up arrow on the upper right), and wait for your push to finish.

  • Make a new pull request from your repository to Albert's. Here's how.


Special thanks to Elaine Ye @Elaineyex and Rachel Yan @RachelYan49 for their contributions to improving these instructions!