Skip to content

sndaba/DatawizardPackageTutorial

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 

Repository files navigation

The dependency free package of dplyr that is lighteight and has a different way of naming its functions with data_ before the fucntion name. The full article can be found on medium.

logo

I discovered Datawizard when researching for my talk on R packages for data cleaning. The latest version of Datawizard 0.9.1 was released on the 9th of September 2023. Datawizard is used for data transformation and statistic operations and is also part of the easystats collection.

This is a short tutorial on functions from the Data wizard package for data wrangling by using a dataset that can show us how the functions work.

1. Installing the Datawizard package

Installing and loading the Datawizard package.

install.packages("datawizard")
library(datawizard)

2. Read the dataset using data_read()

The data_read() function imports data from various file types. It is a small wrapper around haven::read_stata(), readxl::read_excel() and data.table::fread() .

#read the dataset using the data_read() function
house_price <- data_read("https://raw.githubusercontent.com/sndaba/RPackagesForDataCleaning/main/NYC_2022.csv")
View(house_price)

#output dataset sample seen below

image

sample of the dataset

3.Peek at the values and type of variables using data_peek()

The function creates a table data frame, showing all column names, variable types and the first values (as many as fit into the screen).

#data_peek shows a summary of the each variables' details
data_peek(house_price)

image

data frame summary showing the type of each variable and examples of values in a variable

4. Statistical summary using data_codebook()

data_codebook() generates codebooks from data frames, i.e. overviews of all variables and some more information about each variable (like labels, values or value range, frequencies, amount of missing values).

#generate an overview of statistics of missing, number of values, frequency of a value
(code <- data_codebook(house_price))

image

Output from codebook()

5. Replacing missing values with convert_na_to()

Replace missing values in a variable or a data frame using convert_na_to().

#missing data for numeric and characters
house_price_missing <- house_price <- convert_na_to(house_price, replace_num = 0, replace_char = "missing")

6. Searching for columns

find_columns() returns column names from a data set that match a certain search pattern, while get_columns() returns the found data.

#finding columns
find_columns(house_price_missing, starts_with("neighbourhood"))

#output shows columns at the bottom
[1] "neighbourhood_group" "neighbourhood"
#get_columns()
get_columns(house_price_missing, starts_with("neighbourhood"))

image

get_columns() output shows values of the columns

7. Look for columns based on pattern name with data_seek()

The data_seek() looks for variables in a data frame, based on patterns that either match the variable name (column name), variable labels, value labels or factor levels. Matching variable and value labels only works for “labelled” data, i.e. when the variables either have a label attribute or labels attribute.

#looks for columns even with a typo. "hot" is similar to "host" or "hood"
data_seek(house_price, "hot", fuzzy = TRUE)

image

list of columns that a close to the label “hot”

8. Remove columns with data_remove()

The data_remove() removes columns from a data frame. All functions support select-helpers that allow flexible specification of a search pattern to find matching columns, which should be reordered or removed.

#remove data.frame,column
house_price <- datawizard::data_remove(house_price, "latitude", "longitude") 
#remove data.frame,column
house_price <- datawizard::data_remove(house_price,"id")    

9. Column reordering with data_reorder()

The data_reorder() will move selected columns to the beginning of a data frame. The other column ordering function, data_relocate() (not covered in this article), will reorder columns to specific positions, indicated by before or after.

#add the names of the cols in the new order
house_price <- house_price_missing <- datawizard::data_reorder(house_price,c("host_id","name")) 
#add the names of the cols in the new order
house_price <- datawizard::data_reorder(house_price,c("host_name","name"))
#add the names of the cols in the new order
house_price <- datawizard::data_reorder(house_price,c("host_id","host_name"))

image

columns reordered

10. Rename some columns using data_rename()

#the column "price" will change to "house_price"
house_price <- datawizard::data_rename(house_price,"price","house_price")

11. Filtering and Matching with data_filter() and data_match()

Both functions return a filtered (or sliced) data frame or row indices of a data frame that match a specific condition. data_filter() works like data_match(), but works with logical expressions or row indices of a data frame to specify matching conditions.

#match rows following variable conditions with data_match()
View(data_match(house_price, data.frame(neighbourhood_group = "Brooklyn")))

image

data frame subset with rows relating to neighbourhood_group column set to “Brooklyn”.

#filtering using logical expressions
View(data_filter(house_price, room_type == "Private room" & house_price > 120000))

image

data frame subset with room_type set to “Private room” and house_price > 120000.

In Summary

The Datawizard package is an all purpose Data Science package where you can get operations for data formation, statistical summaries and data cleaning.

Further reading on Datawizard and coding at the Datawizard repository.