02.3-data-mani.Rmd

## Data Import/Export

[Extended Manual by R](https://cran.r-project.org/doc/manuals/r-release/R-data.html)

| Format                                                | Typical Extension       | Import Package                                                  | Export Package                                                  | Installed by Default |
|:-------------|:-------------|:---------------|:---------------|:-------------|
| Comma-separated data                                  | .csv                    | [**data.table**](https://cran.r-project.org/package=data.table) | [**data.table**](https://cran.r-project.org/package=data.table) | Yes                  |
| Pipe-separated data                                   | .psv                    | [**data.table**](https://cran.r-project.org/package=data.table) | [**data.table**](https://cran.r-project.org/package=data.table) | Yes                  |
| Tab-separated data                                    | .tsv                    | [**data.table**](https://cran.r-project.org/package=data.table) | [**data.table**](https://cran.r-project.org/package=data.table) | Yes                  |
| CSVY (CSV + YAML metadata header)                     | .csvy                   | [**data.table**](https://cran.r-project.org/package=data.table) | [**data.table**](https://cran.r-project.org/package=data.table) | Yes                  |
| SAS                                                   | .sas7bdat               | [**haven**](https://cran.r-project.org/package=haven)           | [**haven**](https://cran.r-project.org/package=haven)           | Yes                  |
| SPSS                                                  | .sav                    | [**haven**](https://cran.r-project.org/package=haven)           | [**haven**](https://cran.r-project.org/package=haven)           | Yes                  |
| SPSS (compressed)                                     | .zsav                   | [**haven**](https://cran.r-project.org/package=haven)           | [**haven**](https://cran.r-project.org/package=haven)           | Yes                  |
| Stata                                                 | .dta                    | [**haven**](https://cran.r-project.org/package=haven)           | [**haven**](https://cran.r-project.org/package=haven)           | Yes                  |
| SAS XPORT                                             | .xpt                    | [**haven**](https://cran.r-project.org/package=haven)           | [**haven**](https://cran.r-project.org/package=haven)           | Yes                  |
| SPSS Portable                                         | .por                    | [**haven**](https://cran.r-project.org/package=haven)           |                                                                 | Yes                  |
| Excel                                                 | .xls                    | [**readxl**](https://cran.r-project.org/package=readxl)         |                                                                 | Yes                  |
| Excel                                                 | .xlsx                   | [**readxl**](https://cran.r-project.org/package=readxl)         | [**openxlsx**](https://cran.r-project.org/package=openxlsx)     | Yes                  |
| R syntax                                              | .R                      | **base**                                                        | **base**                                                        | Yes                  |
| Saved R objects                                       | .RData, .rda            | **base**                                                        | **base**                                                        | Yes                  |
| Serialized R objects                                  | .rds                    | **base**                                                        | **base**                                                        | Yes                  |
| Epiinfo                                               | .rec                    | [**foreign**](https://cran.r-project.org/package=foreign)       |                                                                 | Yes                  |
| Minitab                                               | .mtp                    | [**foreign**](https://cran.r-project.org/package=foreign)       |                                                                 | Yes                  |
| Systat                                                | .syd                    | [**foreign**](https://cran.r-project.org/package=foreign)       |                                                                 | Yes                  |
| "XBASE" database files                                | .dbf                    | [**foreign**](https://cran.r-project.org/package=foreign)       | [**foreign**](https://cran.r-project.org/package=foreign)       | Yes                  |
| Weka Attribute-Relation File Format                   | .arff                   | [**foreign**](https://cran.r-project.org/package=foreign)       | [**foreign**](https://cran.r-project.org/package=foreign)       | Yes                  |
| Data Interchange Format                               | .dif                    | **utils**                                                       |                                                                 | Yes                  |
| Fortran data                                          | no recognized extension | **utils**                                                       |                                                                 | Yes                  |
| Fixed-width format data                               | .fwf                    | **utils**                                                       | **utils**                                                       | Yes                  |
| gzip comma-separated data                             | .csv.gz                 | **utils**                                                       | **utils**                                                       | Yes                  |
| Apache Arrow (Parquet)                                | .parquet                | [**arrow**](https://cran.r-project.org/package=arrow)           | [**arrow**](https://cran.r-project.org/package=arrow)           | No                   |
| EViews                                                | .wf1                    | [**hexView**](https://cran.r-project.org/package=hexView)       |                                                                 | No                   |
| Feather R/Python interchange format                   | .feather                | [**feather**](https://cran.r-project.org/package=feather)       | [**feather**](https://cran.r-project.org/package=feather)       | No                   |
| Fast Storage                                          | .fst                    | [**fst**](https://cran.r-project.org/package=fst)               | [**fst**](https://cran.r-project.org/package=fst)               | No                   |
| JSON                                                  | .json                   | [**jsonlite**](https://cran.r-project.org/package=jsonlite)     | [**jsonlite**](https://cran.r-project.org/package=jsonlite)     | No                   |
| Matlab                                                | .mat                    | [**rmatio**](https://cran.r-project.org/package=rmatio)         | [**rmatio**](https://cran.r-project.org/package=rmatio)         | No                   |
| OpenDocument Spreadsheet                              | .ods                    | [**readODS**](https://cran.r-project.org/package=readODS)       | [**readODS**](https://cran.r-project.org/package=readODS)       | No                   |
| HTML Tables                                           | .html                   | [**xml2**](https://cran.r-project.org/package=xml2)             | [**xml2**](https://cran.r-project.org/package=xml2)             | No                   |
| Shallow XML documents                                 | .xml                    | [**xml2**](https://cran.r-project.org/package=xml2)             | [**xml2**](https://cran.r-project.org/package=xml2)             | No                   |
| YAML                                                  | .yml                    | [**yaml**](https://cran.r-project.org/package=yaml)             | [**yaml**](https://cran.r-project.org/package=yaml)             | No                   |
| Clipboard                                             | default is tsv          | [**clipr**](https://cran.r-project.org/package=clipr)           | [**clipr**](https://cran.r-project.org/package=clipr)           | No                   |
| [Google Sheets](https://www.google.com/sheets/about/) | as Comma-separated data |                                                                 |                                                                 |                      |

: Table by [Rio Vignette](https://cran.r-project.org/web/packages/rio/vignettes/rio.html)

R limitations:

-   By default, R use 1 core in CPU

-   R puts data into memory (limit around 2-4 GB), while SAS uses data from files on demand

-   Categorization

    -   Medium-size file: within RAM limit, around 1-2 GB

    -   Large file: 2-10 GB, there might be some workaround solution

    -   Very large file \> 10 GB, you have to use distributed or parallel computing

Solutions:

-   buy more RAM

-   HPC packages

    -   Explicit Parallelism

    -   Implicit Parallelism

    -   Large Memory

    -   Map/Reduce

-   specify number of rows and columns, typically including command `nrow =`

-   Use packages that store data differently

    -   `bigmemory`, `biganalytics`, `bigtabulate` , `synchronicity`, `bigalgebra`, `bigvideo` use C++ to store matrices, but also support one class type

    -   For multiple class types, use `ff` package

-   Very Large datasets use

    -   `RHaddop` package
    -   `HadoopStreaming`
    -   `Rhipe`

### Medium size

```{r}
library("rio")
```

To import multiple files in a directory

```{r, eval = FALSE}
str(import_list(dir()), which = 1)
```

To export a single data file

```{r, eval = FALSE}
export(data, "data.csv")
export(data,"data.dta")
export(data,"data.txt")
export(data,"data_cyl.rds")
export(data,"data.rdata")
export(data,"data.R")
export(data,"data.csv.zip")
export(data,"list.json")
```

To export multiple data files

```{r, eval = FALSE}
export(list(mtcars = mtcars, iris = iris), "data_file_type") 
# where data_file_type should substituted with the extension listed above
```

To convert between data file types

```{r, eval = FALSE}
# convert Stata to SPSS
convert("data.dta", "data.sav")
```

### Large size

Use R on a cluster

-   Amazon Web Service (AWS): \$1/hr

Import files as chunks

```{r, eval = FALSE}
file_in    <- file("in.csv","r")
chunk_size <- 100000 # choose the best size for you
x          <- readLines(file_in, n=chunk_size)
```

`data.table` method

```{r, eval = FALSE}
require(data.table)
mydata = fread("in.csv", header = T)
```

`ff` package: this method does not allow you to pass connections

```{r, eval = FALSE}
library("ff")
x <- read.csv.ffdf(
    file = "file.csv",
    nrow = 10,
    header = TRUE,
    VERBOSE = TRUE,
    first.rows = 10000,
    next.rows = 50000,
    colClasses = NA
)
```

`bigmemory` package

```{r, eval = FALSE}
my_data <- read.big.matrix('in.csv', header = T)
```

`sqldf` package

```{r, eval = FALSE}
library(sqldf)
my_data <- read.csv.sql('in.csv')

iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where Species = 'setosa' ")
```

```{r}
library(RMySQL)
```

`RQLite` package

-   [Download SQLite](https://sqlite.org/download.html), pick "A bundle of command-line tools for managing SQLite database files" for Window 10
-   Unzip file, and open `sqlite3.exe.`
-   Type in the prompt
    -   `sqlite> .cd 'C:\Users\data'` specify path to your desired directory
    -   `sqlite> .open database_name.db` to open a database
    -   To import the CSV file into the database
        -   `sqlite> .mode csv` specify to SQLite that the next file is .csv file
        -   `sqlite> .import file_name.csv datbase_name` to import the csv file to the database
    -   `sqlite> .exit` After you're done, exit the sqlite program

```{r, eval = FALSE}
library(DBI)
library(dplyr)
library("RSQLite")
setwd("")
con <- dbConnect(RSQLite::SQLite(), "data_base.db")
tbl <- tbl(con, "data_table")
tbl %>% 
    filter() %>%
    select() %>%
    collect() # to actually pull the data into the workspace
dbDisconnect(con)
```

`arrow` package

```{r, eval = FALSE}
library("arrow")
read_csv_arrow()
```

`vroom` package

```{r, eval = FALSE}
library(vroom)
spec(vroom(file_path))
compressed <- vroom_example("mtcars.csv.zip")
vroom(compressed)
```

`data.table` package

```{r, eval = FALSE}
s = fread("sample.csv")
```

Comparisons regarding storage space

```{r, eval = FALSE}
test = ff::read.csv.ffdf(file = "")
object.size(test) # worst

test1 = data.table::fread(file = "")
object.size(test1) # best

test2 = readr::read_csv(""))
object.size(test2) # 2nd

test3 = vroom(file = "")
object.size(test3) # equal to read_csv
```

To work with big data, you can convert it to `csv.gz` , but since typically, R would require you to load the whole data then export it. With data greater than 10 GB, we have to do it sequentially. Even though `read.csv` is much slower than `readr::read_csv` , we still have to use it because it can pass connection, and it allows you to loop sequentially. On the other, because currently `readr::read_csv` does not have the `skip` function, and even if we can use the skip, we still have to read and skip lines in previous loop.

For example, say you `read_csv(, n_max = 100, skip =0)` and then `read_csv(, n_max = 200, skip = 100)` you actually have to read again the first 100 rows. However, `read.csv` without specifying anything, will continue at the 100 mark.

Notice, sometimes you might have error looking like this

"Error in (function (con, what, n = 1L, size = NA_integer\_, signed = TRUE, : can only read from a binary connection"

then you can change it instead of `r` in the connection into `rb` . Even though an author of the package suggested that `file` should be able to recognize the appropriate form, so far I did not prevail.

## Data Manipulation

```{r}
# load packages
library(tidyverse)
library(lubridate)


x <- c(1, 4, 23, 4, 45)
n <- c(1, 3, 5)
g <- c("M", "M", "F")
df <- data.frame(n, g)
df
str(df)

#Similarly
df <- tibble(n, g)
df
str(df)

# list form
lst <- list(x, n, g, df)
lst

# Or
lst2 <- list(num = x, size = n, sex = g, data = df)
lst2

# Or
lst3 <- list(x = c(1, 3, 5, 7),
             y = c(2, 2, 2, 4, 5, 5, 5, 6),
             z = c(22, 3, 3, 3, 5, 10))
lst3

# find the means of x, y, z.

# can do one at a time
mean(lst3$x)
mean(lst3$y)
mean(lst3$z)

# list apply
lapply(lst3, mean)

# OR
sapply(lst3, mean)

# Or, tidyverse function map() 
map(lst3, mean)

# The tidyverse requires a modified map function called map_dbl()
map_dbl(lst3, mean)


# Binding 
dat01 <- tibble(x = 1:5, y = 5:1)
dat01
dat02 <- tibble(x = 10:16, y = x/2)
dat02
dat03 <- tibble(z = runif(5)) # 5 random numbers from interval (0,1)
dat03

# row binding
bind_rows(dat01, dat02, dat01)

# use ".id" argument to create a new column 
# that contains an identifier for the original data.
bind_rows(dat01, dat02, .id = "id")

# with name
bind_rows("dat01" = dat01, "dat02" = dat02, .id = "id")

# bind_rows() also works on lists of data frames
list01 <- list("dat01" = dat01, "dat02" = dat02)
list01
bind_rows(list01)
bind_rows(list01, .id = "source")

# The extended example below demonstrates how this can be very handy.

# column binding
bind_cols(dat01, dat03)


# Regular expressions -----------------------------------------------------
names <- c("Ford, MS", "Jones, PhD", "Martin, Phd", "Huck, MA, MLS")

# pattern: first comma and everything after it
str_remove(names, pattern = ", [[:print:]]+")

# [[:print:]]+ = one or more printable characters


# Reshaping ---------------------------------------------------------------

# Example of a wide data frame. Notice each person has multiple test scores
# that span columns.
wide <- data.frame(name=c("Clay","Garrett","Addison"), 
                   test1=c(78, 93, 90), 
                   test2=c(87, 91, 97),
                   test3=c(88, 99, 91))
wide

# Example of a long data frame. This is the same data as above, but in long
# format. We have one row per person per test.
long <- data.frame(name=rep(c("Clay","Garrett","Addison"),each=3),
                   test=rep(1:3, 3),
                   score=c(78, 87, 88, 93, 91, 99, 90, 97, 91))
long

# mean score per student
aggregate(score ~ name, data = long, mean)
aggregate(score ~ test, data = long, mean)

# line plot of scores over test, grouped by name
ggplot(long, aes(x = factor(test), y = score, color = name, group = name)) +
  geom_point() +
  geom_line() +
  xlab("Test")


#### reshape wide to long
pivot_longer(wide, test1:test3, names_to = "test", values_to = "score")

# Or
pivot_longer(wide, -name, names_to = "test", values_to = "score")

# drop "test" from the test column with names_prefix argument
pivot_longer(wide, -name, names_to = "test", values_to = "score", 
             names_prefix = "test")

#### reshape long to wide 
pivot_wider(long, name, names_from = test, values_from = score)

# using the names_prefix argument lets us prepend text to the column names.
pivot_wider(long, name, names_from = test, values_from = score,
            names_prefix = "test")

```

The verbs of data manipulation

-   `select`: selecting (or not selecting) columns based on their names (eg: select columns Q1 through Q25)
-   `slice`: selecting (or not selecting) rows based on their position (eg: select rows 1:10)
-   `mutate`: add or derive new columns (or variables) based on existing columns (eg: create a new column that expresses measurement in cm based on existing measure in inches)
-   `rename`: rename variables or change column names (eg: change "GraduationRate100" to "grad100")
-   `filter`: selecting rows based on a condition (eg: all rows where gender = Male)
-   `arrange`: ordering rows based on variable(s) numeric or alphabetical order (eg: sort in descending order of Income)
-   `sample`: take random samples of data (eg: sample 80% of data to create a "training" set)
-   `summarize`: condense or aggregate multiple values into single summary values (eg: calculate median income by age group)
-   `group_by`: convert a tbl into a grouped tbl so that operations are performed "by group"; allows us to summarize data or apply verbs to data by groups (eg, by gender or treatment)
-   the pipe: `%>%`
    -   Use Ctrl + Shift + M (Win) or Cmd + Shift + M (Mac) to enter in RStudio

    -   The pipe takes the output of a function and "pipes" into the first argument of the next function.

    -   new pipe is `|>` It should be identical to the old one, except for certain special cases.
-   `:=` (Walrus operator): similar to `=` , but for cases where you want to use the `glue` package (i.e., dynamic changes in the variable name in the left-hand side)

Writing function in R

Tunneling

`{{` (called curly-curly) allows you to tunnel data-variables through arg-variables (i.e., function arguments)

```{r}
library(tidyverse)

get_mean <- function(data, group_var, var_to_mean){
    data %>% 
        group_by({{group_var}}) %>% 
        summarize(mean = mean({{var_to_mean}}))
}

data("mtcars")
head(mtcars)

mtcars %>% 
    get_mean(group_var = cyl, var_to_mean = mpg)

# to change the resulting variable name dynamically, 
# you can use the glue interpolation (i.e., `{{`) and Walrus operator (`:=`)
get_mean <- function(data, group_var, var_to_mean, prefix = "mean_of"){
    data %>% 
        group_by({{group_var}}) %>% 
        summarize("{prefix}_{{var_to_mean}}" := mean({{var_to_mean}}))
}

mtcars %>% 
    get_mean(group_var = cyl, var_to_mean = mpg)
```