GitHub - sndaba/RPackagesForDataCleaning: This is a presentation that I did for R-Ladies Gaborone 😀

🗓 25-01-2022

R-Ladies Brisbane hosted an event for R-Ladies Gaborone introducing R packages for data cleaning, namely naniar, janitor, Amelia and datawizard.

The presentation used the 2022 NYC housing dataset to demonstrate the R packages for data cleaning.

1. Data wrangling and Exploration

Step 1 loads the packages (install them first if they are missing), then wrangles the data with datawizard and explores it with janitor.

library(Amelia)
library(naniar)
library(data.table)
library(datawizard)
library(janitor)
library(readr)
library(ggplot2)
library(dplyr)

HP <- read_csv("https://raw.githubusercontent.com/sndaba/RPackagesForDataCleaning/main/NYC_2022.csv")
View(HP)

HP <- datawizard::data_remove(HP,c("latitude","longitude","id"))        #drop columns that are not needed
HP <- datawizard::data_reorder(HP,c("host_id","host_name","name"))      #move these columns to the front, in this order
HP <- janitor::clean_names(HP)                                          #convert column names to snake_case
HP <- datawizard::data_rename(HP,"price","house_price")                 #rename the price column
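
The datawizard calls above can be tried on a small data frame before running them on the full dataset. This is a minimal sketch: the `toy` data frame and its values are made up for illustration, and only the column names mimic the listing data.

```r
library(datawizard)

# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  id        = 1:3,
  name      = c("Loft", "Studio", "Room"),
  host_id   = c(11, 12, 13),
  host_name = c("Ana", "Ben", "Cy"),
  price     = c(100, 250, 80)
)

toy <- data_remove(toy, "id")                                # drop a column
toy <- data_reorder(toy, c("host_id", "host_name", "name"))  # move these columns to the front
toy <- data_rename(toy, "price", "house_price")              # rename price to house_price

names(toy)
```

Passing all three names to a single `data_reorder()` call gives the same column order as moving them pair by pair.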

janitor::get_dupes(HP,colnames(HP)) #checks whether there are any duplicate rows
janitor::tabyl(HP,host_name) %>% adorn_pct_formatting(digits=0,affix_sign=TRUE)  #frequency table of host_name with formatted percentages
janitor::top_levels(as.factor(HP$house_price),5) %>%    #groups the factor levels into top 5, middle and bottom 5
  adorn_pct_formatting(digits = 0, affix_sign=TRUE)
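
To see what the janitor helpers return, the same functions can be run on a tiny data frame. The sketch below assumes a made-up `toy` data frame with untidy column names and one duplicated row; it is not taken from the NYC data.

```r
library(janitor)

# Hypothetical toy data with untidy column names and one duplicated row
toy <- data.frame(
  `Host Name` = c("Ana", "Ben", "Ana", "Cy"),
  `Room.Type` = c("Private room", "Entire home", "Private room", "Private room"),
  Price       = c(100, 250, 100, 80),
  check.names = FALSE
)

toy <- clean_names(toy)   # column names become host_name, room_type, price
names(toy)

dupes <- get_dupes(toy)   # no columns given, so all columns are compared
dupes                     # the two identical "Ana" rows, with a dupe_count column

room_counts <- tabyl(toy, room_type)           # counts and proportions per room type
adorn_pct_formatting(room_counts, digits = 0)  # proportions shown as whole percentages
```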

2. Missing values and visualisation

Step 2 looks for missing values and visualises the findings using naniar and ggplot2.

naniar::any_miss(HP)           #check for NA
naniar::miss_var_summary(HP)   #NA frequency 
naniar::gg_miss_var(HP)        #NA visualization
naniar::gg_miss_upset(HP,order.by="freq")  #variable NA values relationship

ggplot2::ggplot(HP,aes(x=year_built,y=year_remod_add))+ #categorical variable
        geom_miss_point()+
        facet_wrap(~calculated_host_listings_count)+
        theme_dark()
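
The naniar summaries are easier to read on a small example first. The following sketch uses a made-up `toy` data frame with three missing values; the column names are borrowed from the listing data purely for illustration.

```r
library(naniar)

# Hypothetical toy data with NAs in two numeric columns
toy <- data.frame(
  house_price       = c(100, NA, 250, 80, NA),
  number_of_reviews = c(3, 10, NA, 0, 7),
  room_type         = c("Private room", "Entire home", "Private room",
                        "Shared room", "Entire home")
)

any_miss(toy)                   # TRUE: at least one value is missing
n_miss(toy)                     # 3: total number of missing values

smry <- miss_var_summary(toy)   # one row per variable with its NA count and percentage
smry
```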

3. Data transformation and Multiple imputation

In Step 3, the data frame is converted to a data table using data.table, and multiple imputation is applied to the missing values with Amelia.

drop_dt <- data.table::as.data.table(HP) #set data.frame to data.table
drop_col <- c('name',                    #drop column 
              'host_name',        
              'neighbourhood',
              'neighbourhood_group',
              'room_type')
col <- drop_dt[,!drop_col,with=FALSE]   #create new table
res.amelia <- Amelia::amelia(col,m=5)  #5 imputed data sets 
Amelia::compare.density(res.amelia,var="house_price")  #density plot comparing observed and imputed values
HP <- naniar::impute_mean_if(HP,.predicate = is.numeric)
naniar::any_miss(HP)         #check if there are any NA
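
For intuition about the mean-imputation step, here is a base R sketch of what replacing NAs with the column mean does to a single numeric vector. It is an illustration under assumed toy values, not how naniar implements `impute_mean_if()`.

```r
# Hypothetical numeric column with two missing values
house_price <- c(100, NA, 250, 80, NA)

# Mean of the observed values only
observed_mean <- mean(house_price, na.rm = TRUE)

# Replace each NA with that mean; observed values are left unchanged
imputed <- ifelse(is.na(house_price), observed_mean, house_price)

anyNA(imputed)                  # FALSE: no missing values remain
sd(house_price, na.rm = TRUE)   # spread of the observed values
sd(imputed)                     # smaller, because the filled-in values sit exactly at the mean
```

The shrinking spread is one reason multiple imputation is often preferred when the variability of the data matters.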

4. Filtering, group using by and Binary search using keys.

Finally, Step 4 converts the data frame to a data table, filters rows based on conditions, groups with by, and sets a key so that data.table sorts the table by that key and can subset it with binary search.

View(HP_dt <- data.table::as.data.table(HP))   #set data frame to data table
class(HP_dt)
head(HP_dt[room_type=="Private room" & house_price>181500],4) #filtering rows based on conditions
head(HP_dt[,.(host_id,host_name,name)],4)   #select given columns                                                       
head(HP_dt[neighbourhood=="Harlem",.(neighbourhood,     #select given cols by row selection
                                     number_of_reviews,
                                     availability_365
                                     )],4)
head(average <- HP_dt[,.(mean_price=mean(house_price)), #grouping using by 
                      by=neighbourhood],4)
head(chain_gang <- HP_dt[,.(.N,
                            maximum=max(house_price),
                            minimum=min(house_price) %>% round(2)),  #pipe rounds the minimum only
                         by=neighbourhood],4)

#key concept for binary search. Sorts the data table by the key
data.table::setkey(HP_dt,neighbourhood)  #setting key for the data table
data.table::key(HP_dt)   #check data table key
head(HP_dt[.("Harlem")],3)     #select rows using key
room_by_neighbourhood <- HP_dt[.("Bedford-Stuyvesant"),
                               .(neighbourhood_group,neighbourhood,
                                room_type)]
head(room_by_neighbourhood)

#group using keyby
head(ans <- HP_dt["Hell's Kitchen",.(neighbourhood,
                                    max_review=number_of_reviews),
                                    keyby=availability_365],4)
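
The key-based subsetting and `keyby` grouping above can be checked on a small table. This is a minimal sketch with a made-up `toy_dt`; the column names follow the listing data but the values are invented.

```r
library(data.table)

# Hypothetical toy table
toy_dt <- data.table(
  neighbourhood = c("Harlem", "Chelsea", "Harlem", "Astoria", "Chelsea"),
  room_type     = c("Private room", "Entire home", "Entire home",
                    "Private room", "Private room"),
  house_price   = c(90, 300, 150, 70, 120)
)

setkey(toy_dt, neighbourhood)   # sorts the rows by neighbourhood and marks it as the key
key(toy_dt)                     # "neighbourhood"

harlem <- toy_dt[.("Harlem")]                            # keyed subset, uses binary search
chelsea_mean <- toy_dt[.("Chelsea"), mean(house_price)]  # j is evaluated on the matching rows only

# keyby groups like by, then sorts the result by the grouping column
by_room <- toy_dt[, .(n = .N, mean_price = mean(house_price)), keyby = room_type]
by_room
```

Because the table is sorted on its key, `toy_dt[.("Harlem")]` does not need to scan every row, which is where the speed-up on large tables comes from.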

Useful URLs Shared during the presentation

CODATA webinar, importance of data cleaning

CRAN: Simple Tools for Examining and Cleaning Dirty Data Janitor

Data Structures, Summaries, and Visualisations for Missing Data Naniar

A Program for Missing Data Amelia

The validity of multiple-imputation-based analyses Multiple imputation

Package to easily manipulate, clean, transform, and prepare your data for analysis. DataWizard

Provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. Data.table

Folders and files

NYC_2022.csv
R Packages for Data Cleaning_2022 R Ladies Brisbane R Ladies Gaborone.pdf
README.md
Step_1.R
Step_2.R
Step_3.R
Step_4.R
