4 Introduction to Data Wrangling via the dplyr package
4.1 Introduction
Data wrangling (also known as data manipulation or munging) is central to data analysis and is often one of the most time consuming portion of an analysis. The dplyr
package contains a suite of functions to make data wrangling easier. The core functions of the dplyr
package can be thought of as verbs for data wrangling.
Verb(s) | Meaning |
---|---|
filter and slice |
pick specific observations (i.e. specific rows) |
arrange |
reorder the rows |
select |
pick variables by their names (i.e. specific columns) |
mutate |
add new calculated columns to a data frame |
summarize |
aggregate many rows into a single row |
In this example we will explore how to use each of these functions, as well as how to combine them with the group_by
function for groupwise manipulations.
To begin, let’s make sure that our data set and the dplyr
package are loaded
library(readr)
## Warning: package 'readr' was built under R version 4.2.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
<- read_csv("data/colleges2015.csv") colleges
## Rows: 1776 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): college, type, city, state, region
## dbl (10): unitid, admissionRate, ACTmath, ACTenglish, undergrads, cost, grad...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data: The file college2015.csv
contains information on predominantly bachelor’s-degree granting institutions from 2015 that might be of interest to a college applicant.
To get a feel for what data are available, look at the first couple rows. (The head
and tail
functions are a pair of tools to help you take a quick look at the data without printing all of it.)
head(colleges)
and maybe the last six rows
tail(colleges)
and the structure of the data frame.
str(colleges)
(Note that the str
function print to the console the same info you get by expanding the data in the Environment tab.)
Variable | Description |
---|---|
unitid |
A unique ID number for each school |
college |
School name |
type |
School type: public or private |
city |
City |
state |
State abbreviation |
region |
Region of the U.S. |
admissionRate |
Proportion of undergraduate applicants admitted |
ACTmath |
Median ACT math score |
ACTenglish |
Median ACT english score |
undergrads |
Undergraduate enrollment |
cost |
Total cost of attendance |
gradRate |
Proportion of students graduating within six years |
FYretention |
Proportion of first year students returning for a second year |
fedloan |
Proportion of students with federal student loans |
debt |
Median principal of student loans entering repayment |
4.1.1 0. Piping
By now, you have likely seen R code that uses the pipe %>%
to pass results from one line of code to the next. This is most useful for when we apply multiple verbs to a chunk of code like outlined below.
# Do not run this code
<- initial_data_object %>%
end_result # the . in the next line denotes where the previous lines result is placed
first_function(., function_input1, function_input2, ...) %>%
second_function(., function_input1, function_input2, ...) %>%
third_function(., function_input1, function_input2, ...) %>%
nth_function(., function_input1, function_input2, ...)
Much of this tutorial involves only using a single verb at a time, so we will only use it occasionally. You are welcome to use it if you like for any exercises within, even if they only require one verb.
4.2 1. Filtering rows
To extract the rows only for colleges and universities in a specific state we use the filter
function. For example, we can extract the colleges in New York from the colleges data set using the following code:
<- filter(colleges, state == "NY")
ny head(ny)
Redo using piping
<- colleges %>% filter(state == "NY")
ny
# or can use . to indicate where the left-hand side
# of the pipe will be placed
<- colleges %>% filter(., state == "NY") ny
Remarks
- The first argument given to
filter
is always the data frame (this is true for all the core functions indplyr
), followed by logical tests that the returned cases must pass. In our example, the test was whether the school was in New York, which is written asstate == "NY"
. - Remember: we have to use
==
to indicate equality/equivalence. (As=
is used for assignment, like->
, and, more importantly, for passing named inputs into functions.) - When testing character variables, be sure to use quotes to specify the value of the variable that you are testing.
- To specify multiple tests, use a comma to separate the tests (think of the comma as the word “and”). You can also use the
&
if you prefer. For example,
<- filter(colleges, state == "NY" & undergrads < 2000) smallNY
returns only those rows corresponding to schools in Wisconsin with fewer than 2,000 undergraduate students.
- To specify that at least one test must be passed, use the
|
character (aka “or”) instead of the comma. For example, the below test checks whether a college is in New York or Minnesota or Iowa, so it returns all of the colleges in New York, Minnesota, and Iowa.
<- filter(colleges, (state == "NY") | (state == "MN") | (state == "IA") ) NyMnIa
Alternatively, when dealing with a longer list of choices within the same variable, you can use the %in%
command. While slightly more “R specific”, this is a far better approach to take when you can multiple “or” conditions on the same variable to keep track of.
<- filter(colleges, state %in% c("NY", "MN", "IA")) NyMnIa
- You can use both
|
and,
to specify multiple tests. For example, we can return all colleges with fewer than 2,000 undergraduate students in New York, Minnesota, and Iowa.
<- filter(colleges, state == "NY" | state == "MN" | state == "IA", undergrads < 2000) smallNyMnIa
Common mistake - can’t distribute state == over
the or (|)
<- filter(colleges,
smallNyMnIa == "NY" | "MN" | "IA"),
(state < 2000) undergrads
Common comparison operators for the tests include:
>
,>=
,<
,<=
,!=
(not equal),==
(equal),%in%
, andbetween
.Another common task is to remove rows with missing values, there are several ways to do this. For example, using the R command
na.omit
,
<- na.omit(colleges) colleges
will reduce the data set to only rows with no missing values. You can the filter
function to drop rows with missing values in only a single variable. For example,
<- filter(colleges, !is.na(cost)) colleges
will eliminate only rows with NA
in the cost column.
4.2.1 Exercises
- Construct a data set with only Vermont colleges? (The abbreviation for Vermont is VT.) Feel free to try out piping if you’d like.
<- colleges %>%
vt filter(state == "VT")
- Construct a data set containing private Vermont colleges with at least 1000 undergraduates.
<-
vt_private_large %>% # full data set
colleges filter(
== "VT", # keep only Vermont schools
state == "private", # keep only private schools
type >= 1000 # keep only schools with enrollment > 1000
undergrads )
Another way
%>%
colleges filter(state == "VT") %>%
filter(type == "private") %>%
filter(undergrads >= 1000)
- Construct a data set containing all private Vermont colleges or public colleges with fewer than 2000 undergraduates.
%>% # full data
colleges filter(
# private Vermont schools or
== "VT") & (type == "private") ) |
( (state
# public schools with fewer than 2000
== "public") & (undergrads < 2000) )
( (type )
4.3 2. Slicing rows
To extract rows 10 through 16 from the colleges data frame we use the slice
function.
slice(colleges, 10:16)
Remarks
- To select consecutive rows, create a vector of the row indices by separating the first and last row numbers with a
:
. - To select non-consecutive rows, create a vector manually by concatenating the row numbers using
c()
. For example, to select the 2nd, 18th, and 168th rows useslice(colleges, c(2, 18, 168))
.
4.4 3. Arranging rows
To sort the rows by total cost, from the least expensive to the most expensive, we use the arrange
function.
<- arrange(colleges, cost)
costDF head(costDF)
Remarks
- By default,
arrange
assumes that we want the data arranged in ascending order by the specified variable(s). - To arrange the rows in descending order, wrap the variable name in the
desc
function. For example, to arrange the data frame from most to least expensive we would use the following command:
<- arrange(colleges, desc(cost))
costDF head(costDF)
- To arrange a data frame by the values of multiple variables, list the variables in a comma separated list. The order of the variables specifies the order in which the data frame will be arranged. For example,
<- arrange(colleges,
actDF desc(ACTmath), desc(ACTenglish))
reorders colleges first by the median ACT math score (in descending order) and then by the ACT english score (in descending order)
4.4.1 Exercises
- What school is most expensive?
# works if no ties
%>%
colleges arrange(desc(cost)) %>%
slice(1)
# better, in case there are ties
%>%
colleges slice_max(order_by = cost,
n = 1)
- What school has the least expensive tuition in Wyoming (WY)?
# using arrange
%>%
colleges arrange(
desc(state), # sort state abbreviations (WY is last)
# cheapest first
cost )
# not using arrange
%>%
colleges filter(state == "WY") %>% # get only WY schools
slice_min(cost, n = 1)
Suppose we wanted to find the ten most expensive schools and store the result as a new object.
Which school has the most expensive tuition in New York?
4.5 4. Selecting columns
Suppose that you are only interested in a subset of the columns in the data set—say, college
, city
, state
, undergrads
, and cost
—and want to create a data frame with only these columns. To do this, we select
the desired columns:
<- colleges %>% select(college, city, state, undergrads, cost)
lessCols head(lessCols)
Remarks
- After specifying the data frame, list the variable names to select from the data frame separated by commas.
- In some cases you may want to drop a small number of variables from a data frame. In this case, putting a negative sign before a variable name tells
select
to select all but the negated variables. For example, if we only wished to drop theunitid
variable we run the following command:
<- select(colleges, -unitid)
drop_unitid head(drop_unitid)
4.6 5. Mutating data (adding new columns)
Data sets often do not contain the exact variables we need, but contain all of the information necessary to calculate the needed variables. In this case, we can use the mutate
function to add a new column to a data frame that is calculated from other variables. For example, we may wish to report percentages rather than proportions for the admissions rate.
<- mutate(colleges, admissionPct = 100 * admissionRate)
colleges
# can move new variable to the front
# (might be useful if printing/knitting a tibble)
<- colleges %>%
colleges mutate(admissionPct = 100 * admissionRate) %>%
relocate(admissionPct)
Remarks
- After specifying the data frame, give the name of the new variable and it’s definition. Notice that we need to use
=
to assign the value of the new variable within mutate, not<-
. (This is because the new variable is not a new object, just an extra piece of information we are storing within a different object.) - To add multiple variables once, separate the list of new variables by commas. For example, we can also add percentage versions of
FYretention
andgradRate
.
<- mutate(colleges,
colleges FYretentionPct = 100 * FYretention,
gradPct = 100 * gradRate,
dropRate = FYretentionPct - gradPct
)
4.7 6. Summarizing rows
To create summary statistics for columns within the data set we must aggregate all of the rows using the summarize
command. (Note that you can also use the British spelling: summarise
.) For example, to calculate the median cost of all 1069 colleges in our data set we run the following command:
summarize(colleges,
medianCost = median(cost, na.rm = TRUE))
Remarks
- As with all of the functions we have seen, the first argument should be the name of the data frame.
- We add
na.rm = TRUE
here to remove any missing values in thecost
column before the calculation. Many functions, including this summarize function, will return an error if there are missing values (blanks,NA
s orNaN
s) in your data. summarize
returns a data frame, with one row and one column.- We can ask for multiple aggregations in one line of code by simply using a comma separated list. For example, we can calculate the five number summary of
cost
for all 1069 colleges in our data set
summarize(colleges,
min = min(cost, na.rm = TRUE),
Q1 = quantile(cost, .25, na.rm = TRUE),
median = median(cost, na.rm = TRUE),
Q3 = quantile(cost, .75, na.rm = TRUE),
max = max(cost, na.rm = TRUE))
- Notice that even when multiple statistics are calculated, the result is a data frame with one row and the number of columns correspond to the number of summary statistics.
It is worth noting that some summary statistics (e.g., the five number summary) can be calculated in dplyr
in other ways. For example,
summarize(colleges,
quantile_statistic =
quantile(x = cost, probs = seq(0, 1, 0.25), na.rm = TRUE), q = seq(0, 1, 0.25) )
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
However, it is my opinion that this is more complex for those just learning dplyr
.
4.8 7. Groupwise manipulation
Often it is of interest to manipulate data within groups. For example, we might be more interested in creating separate summaries for each state, or for private and public colleges. To do this we must first tell R what groups are of interest using the group_by
function, and then we can use any of the above functions. Most often group_by
is paired with summarise
or mutate
.
Let’s first consider comparing the cost of private and public colleges. First, we must specify that the variable type
defines the groups of interest.
<- group_by(colleges, type)
colleges_by_type colleges_by_type
Remarks
After specifying the data frame, list the categorical variable(s) defining the groups.
When we print the data frame it tells us the variables that define the groups and how many groups are in the data frame. This provides sanity checks, so be sure to pay attention to if this matches your expectation! For example, if there were any typos in the column or if just one value is capitalized (such as Public) we would be told there are more than two groups.
group_by
statements are almost always used as part of a piped chain of commands. It is rare that you would want to just group a dataset without then performing some further wrangling on it.The
ungroup
command can be used to remove a data frame of any groups constructed bygroup_by
.Multiple variables can be used to specify the groups. For example, to specify groups by state and type, we would run the following command:
<- group_by(colleges, state, type) colleges_state_type
- If using multiple
group_by
statements on the same data frame, only the last one will be applied. For example,
<- colleges %>% group_by(state) %>% group_by(type) colleges_state_type
Will only have groups based on the type of college, not the state.
4.8.1 Exercises
- Verify the first two statements by (a) printing both
colleges
andcolleges_by_type
to the console and (b) investigating each in the Environment panel.
colleges colleges_by_type
- Verify the last two statements by running the code provided in them.
4.8.2 Combining group_by
with other commands
Once we have a grouped data frame, we can obtain summaries by group via summarize
. For example, the five number summary of cost by institution type is obtained below
%>%
colleges group_by(state) %>%
summarize(min = min(cost, na.rm = TRUE),
Q1 = quantile(cost, .25, na.rm = TRUE),
median = median(cost, na.rm = TRUE),
Q3 = quantile(cost, .75, na.rm = TRUE),
max = max(cost, na.rm = TRUE)
)
We can also calculate new variables within groups, such as the standardized cost of attendance within each state:
<- group_by(colleges, state) %>%
colleges_by_state mutate(
mean.cost = mean(cost, na.rm = TRUE),
sd.cost = sd(cost, na.rm = TRUE),
z.cost = (cost - mean.cost) / sd.cost
) colleges_by_state
Remarks
mutate
allows you to use variables defined earlier to calculate a new variable. This is howz.cost
was calculated.
4.9 8. Additional Exercises
Filter the rows for colleges in Great Lakes or New England regions.
Which school in Great Lakes or New England region has the highest first-year retention rate in this reduced data set.
Which school in Great Lakes or New England region has the lowest admissions rate in this reduced data set.
Using the full data set, create a column giving the cumulative average cost of attendance, assuming that students finish in four years and that costs increase 3% per year. Name this new column
total_cost_4yrs
.Using the full data set, summarize the total cost of attendance by region by calculating the 10th, 50th, and 90th percentiles. Then arrange them by median costs (from highest to lowest).
Place your result from the previous question in a nicely formatted, HTML friendly, table.