Chapter 4 Data Wrangling
4.1 Libraries
Up until now, we haven’t used packages - only functions available in base R. However, the power of R truly exists in the libraries, which are sets of functions that any user can create and share.
As of June 2019, there were over 14,000 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. CRAN lists all libraries here: https://cran.r-project.org/web/packages/available_packages_by_name.html
To install a library, you can either:
- Packages
-->
Install-->
Name - Use the function install.packages()
Once a package is installed, you need to load it using the library or require function.
These are the libraries we’ll be using today. Install if you don’t have:
install.packages(c("data.table","dplyr","stringr"))
Now, import the libraries:
library(data.table)
library(dplyr)
library(stringr)
4.2 Importing & Exporting Data
Before importing data, we generally want to set the working directory. This is a folder where we will be accessing or storing data and outputs:
setwd("C:/Users/avillan/Documents/Class 2022/Class_Code")
Another useful function is getwd()
. This function returns an absolute filepath representing the current working directory of the R process, or NULL
if the working directory is not available:
getwd()
To import data, we can use the read.
function
<- read.csv("Auto_MPG.csv", stringsAsFactors = FALSE) df_auto
We can check the data set:
class(df_auto)
View(df_auto)
dim(df_auto)
nrow(df_auto)
summary(df_auto)
Note that Horsepower is a character class
typeof(df_auto$Horsepower)
$Horsepower
df_autosummary(df_auto$Horsepower)
Upon inspection, it read NAs as question marks (a character), so everything became a character. We can fix this by reassigning as numeric:
$Horsepower <- as.numeric(df_auto$Horsepower)
df_autosummary(df_auto)
$Horsepower
df_autosummary(df_auto$Horsepower)
Recall if you want to fix a column name you can use the colnames() function:
colnames(df_auto)
colnames(df_auto)[8] <- "Origin"
colnames(df_auto)
If we want to export the data, we can use write.csv. The row.names = FALSE argument will stop the writer from outputting a row number:
write.csv(df_auto, file="Auto_MPG_Clean.csv", row.names = FALSE)
In addition to importing the data using code, one can also import data into RStudio with point-and-click. Go to Import Dataset
in the Environment panel and select the relevant format. From there, you can set the datatype of each column, as well as specify whether to include or to skip a column or row. RStudio automatically generates the corresponding code that you can copy and paste for later reproducibility.
4.3 dplyr and tidyverse
One of the most difficult (yet, interesting!) parts of open-source software is the many ways to achieve the same task. This can make recalling functions and syntax difficult, increasing the time to write or read code. Thus, people have started creating packages that structure coding tasks into a simple but comprehensive set of functions. You’ll see this referred to as a “grammar”. This framework attempts to simplify the majority of required functions into a simple set of “verbs” to remember.
If you’re familiar with SQL - think about how easy it is to read and write:
SELECT * FROM table WHERE column = 'some_value'
In SQL, SELECT
, FROM
, WHERE
are “verbs” that provide the framework for numerous tasks.
The packages plyr, dplyr, ggplot2, stringr (and others) work within this framework. “tidyverse” is a collection of packages that encompases all of these https://www.tidyverse.org/
4.4 Data manipulation - dplyr basics
Select: select columns by name or helper function (choose which variables you want to look at ):
select(df_auto, MPG, Cylinders)
select(df_auto, -Origin)
select(df_auto, contains("Model"))
The base R equivalents are:
$MPG
df_autoc('MPG','Cylinders')]
df_auto[,-c('Origin')]
df_auto[,grepl("Model", names(df_auto))] df_auto[,
Slice: Select rows by position (choose which rows of data you want):
$row_number <- 1:nrow(df_auto) #adding a new variable called row_number to help us see the index
df_auto
slice(df_auto,1:5)
slice(df_auto, 150:n()) #all the rows between 150 and the end
slice(df_auto, -1:-5) #removes rows between values, equivalent to slice(df_auto, 6:n())
<- select(df_auto, -row_number) #remove the row_number column df_auto
The base R equivalents are:
1:5,]
df_auto[150:nrow(df_auto),]
df_auto[-c(1:5),] df_auto[
Rename: Rename the columns of a data frame
colnames(df_auto)
<- rename(df_auto, Miles_Per_Gallon=MPG) #rename MPG to Miles_Per_Gallon
df_auto colnames(df_auto)
<- rename(df_auto, MPG=Miles_Per_Gallon) #rename back Miles_Per_Gallon to MPG
df_auto colnames(df_auto)
The base R equivalents are:
colnames(df_auto)[1] <- 'Miles_Per_Gallon'
colnames(df_auto)
colnames(df_auto)[1] <- 'MPG'
colnames(df_auto)
Filter: Extract rows that meet logical criteria (filter the data frame by one, or multiple, conditions)
filter(df_auto, MPG>20)
filter(df_auto, (MPG>20 & Acceleration>20))
filter(df_auto, MPG==max(MPG))
filter(df_auto, MPG==min(MPG))
The base R equivalents are:
$MPG>20,]
df_auto[df_auto$MPG>20 & df_auto$Acceleration>20),]
df_auto[(df_autowhich.max(df_auto$MPG),]
df_auto[which.min(df_auto$MPG),] df_auto[
Sample: Randomly select a sample of n size or a proportion
sample_n(df_auto, 5)
sample_frac(df_auto, 0.05)
The base R equivalents are:
sample(1:nrow(df_auto),5),]
df_auto[sample(1:nrow(df_auto),ceiling(0.05*nrow(df_auto))),] df_auto[
Mutate: Compute and append one or more new columns (create a new variable - returns the data frame)
mutate(df_auto, HP_Per_Cylinder=Horsepower/Cylinders)
mutate(df_auto, Avg_HP_Per_Cylinder = mean(Horsepower/Cylinders, na.rm=TRUE))
The base R equivalents are:
$HP_Per_Cylinder <- df_auto$Horsepower / df_auto$Cylinders
df_auto$Avg_HP_Per_Cylinder <- mean(df_auto$Horsepower / df_auto$Cylinders, na.rm = TRUE) df_auto
About the na.rm=TRUE argument. Some arithmatic functions sum, avg, count, etc. will NA the entire column if they encounter a single NA within that column/vector. Setting na.rm=TRUE will tell the interpreter to “remove” the NA and carry forth with the calculation.
Summarise: Summarise data into single row of values (aggregates the data into a single result. think avg, mean, min, max, etc.)
summarise(df_auto,
Avg_HP_Per_Cylinder=mean(Horsepower/Cylinders, na.rm = TRUE),
Min_HP_Per_Cylinder=min(Horsepower/Cylinders, na.rm=TRUE)
)
The base R equivalents are:
data.frame(
Avg_HP_Per_Cylinder = mean(df_auto$Horsepower/df_auto$Cylinders, na.rm = TRUE),
Min_HP_Per_Cylinder = min(df_auto$Horsepower/df_auto$Cylinders, na.rm=TRUE)
)
4.5 dplyr - pipe operator
Consider the previous operations. If we wanted to manipulate our data set in order, we traditionally have to do a reassignment each time:
<- rename(df_auto, Miles_Per_Gallon=MPG) #rename
df_auto_hold
<- mutate(df_auto_hold, HP_Per_Cylinder=Horsepower/Cylinders) #then create
df_auto_hold
<- filter(df_auto_hold, HP_Per_Cylinder>20) #then filter
df_auto_hold
<- select(df_auto_hold, Car_Name, HP_Per_Cylinder) #then select
df_auto_hold
View(df_auto_hold)
This is truely cumbersome the more complicated the data manipulation becomes. To avoid this, we can use the pipe operator %>%
. The pipe “passes along” the results of one function to the next:
<- df_auto %>% #note the pipe operator
df_auto_hold rename(Miles_Per_Gallon=MPG) %>% #note the first argument is not required, because the pipe is passing it
mutate(HP_Per_Cylinder=Horsepower/Cylinders) %>%
filter(HP_Per_Cylinder>20) %>%
select(Car_Name, HP_Per_Cylinder)
View(df_auto_hold)
4.6 stringr package
The package stringr is a similar grammar language for manipulating strings. All the functions start with str_
Consider that I want to extract the Car Make and Model from the Car Name:
View(df_auto_hold)
We note that the first word in Car_Name is the make, and the second word is the model. The str_split function will split the string into a vector of individual tokens:
<- df_auto %>%
df_auto_hold mutate(Car_Make=str_split(Car_Name," "))
View(df_auto_hold)
The str_split creates a list, which we can use to try to get the value:
<- df_auto %>%
df_auto_hold mutate(Car_Make=str_split(Car_Name," ")[[1]][1])
View(df_auto_hold)
Note, this is treating the entire column, instead of by row. One way to address this is to rowise() by Car_Name, then create the variable:
<- df_auto %>%
df_auto_hold rowwise() %>%
mutate(Car_Make=str_split(Car_Name," ")[[1]][1])
View(df_auto_hold)
An alternative to this is the group_by() function:
<- df_auto %>%
df_auto_hold group_by(Car_Name) %>%
mutate(Car_Make=str_split(Car_Name," ")[[1]][1])
View(df_auto_hold)
With the structure built, I can just pipe on more functions to get additional variables:
<- df_auto %>%
df_auto rowwise() %>%
mutate(Car_Make=str_split(Car_Name," ")[[1]][1]) %>%
mutate(Car_Model=paste(str_split(Car_Name," ")[[1]][-1], collapse=" ")) #everything but the first
View(df_auto)
Now that we have Car Make and Model, we might want to summarize some information. We saw summarize the entire column earlier, but we can additionally summarize by a group. We use the group_by() to do so:
# Average MPG by Make and Model Year
%>%
df_auto group_by(Car_Make, Model_Year) %>%
summarise(Average_MPG=mean(MPG)) %>%
arrange(Model_Year) %>%
filter(Model_Year==70) %>%
ungroup(Car_Make, Model_Year)
4.7 Combining Data - dplyr
Typically when working with data, especially relational databases, we have information stored in multiple tables. For instance, we have the Auto data set:
View(df_auto)
# But we also have data on when a brand started:
<- c("amc", "audi", "bmw", "buick", "chevrolet", "datsun", "dodge", "ford")
Car_Make <- c(1954, 1910, 1916, 1903, 1911, 1931, 1900, 1903)
First_Year
<- data.frame(Car_Make=Car_Make,
df_auto_start First_Year=First_Year,
stringsAsFactors = FALSE)
View(df_auto_start)
We are interested in the relationship between founding year and average MPG for the max year. For this, we need to get my First Year data combined with the Auto MPG Data. This is accomplished via joins.
Dplyr has different join types. The easiest way to think about this is as a Venn-Diagram.
- inner_join - only returns the data where both frames contain the join key(s), removes all non-matching rows
- left_join - maintains all of the data on base table (left/top), and joins matching data from the joining table. NAs created for joining table
- right_join - opposite of left. Maintains all of the data on the joining table and matches from the base table. NAs created for base table
- full_join - returns all data from both tables. NAs created for both tables.
Inner_join:
<- df_auto %>%
df_auto_hold inner_join(df_auto_start, by="Car_Make")
View(df_auto_hold)
# Note, we've effectively filtered a lot of data because the inner_join only keeps matching records.
dim(df_auto)
dim(df_auto_hold)
# Joins can pipe together with other functions just as we expect
<- df_auto %>%
df_auto_hold inner_join(df_auto_start, by="Car_Make") %>%
group_by(Car_Make, First_Year) %>%
summarise(Average_MPG=mean(MPG))
View(df_auto_hold)
plot(df_auto_hold$First_Year, df_auto_hold$Average_MPG)
Left_join:
<- df_auto %>%
df_auto_hold left_join(df_auto_start, by="Car_Make")
# note, we've retained all of the records but it's placed "NA" in the column where it couldn't find a matching record
dim(df_auto)
dim(df_auto_hold)
is.na(df_auto_hold$First_Year)
View(df_auto_hold)
<- df_auto %>%
df_auto_hold left_join(df_auto_start, by="Car_Make") %>%
group_by(Car_Make, First_Year) %>%
summarise(Average_MPG=mean(MPG))
View(df_auto_hold)
4.8 Cheat Sheet
We’re not going to cover every function here, but what is great about these packages is the the cheatsheets. The community has made to provide reference. I highly recommend downloading them and keeping them close:
4.9 Extra - sqldf
If you like SQL, there is a package allows you to manipulate data in a in-memory, or permanent database:
install.packages("sqldf")
library(sqldf)
<- sqldf("select * from df_auto where Car_Make='amc'")
df_auto_hold View(df_auto_hold)
We will learn a lot more about SQL in the Fall semester.