11 Combining Data Sets with dplyr
- Basics
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
library(pander)
## Warning: package 'pander' was built under R version 4.2.2
11.1 Stacking Rows
Suppose we have the following two data sets. The first, dat1
has the variables x1
and y1
. The second, dat2
has the same variables. They represent a pair of datasets measuring the same features at two different locations. We would like to combine them into a single data frame.
<- tibble(x1 = runif(4),
dat1 y1 = rnorm(4))
<- tibble(y1 = rpois(4,1),
dat2 x1 = rgamma(4,shape = 2,scale = 3))
Notice that the order of the two variables is switched in the two tibbles. This means we can’t simply stack them without matching column headings
11.2 Merging Data using Joins
Suppose you have the following two data sets. The first, df1
has the variables id_numb
and xvar
. The second, df2
has the variables id
and yvar
. id_numb
and id
serve as identification variables, possibly with duplicates, where observations from the first data set with id_numb = 1
correspond to observations in the second data set with id = 1
.
<- tibble(id_numb = c(1, 1, 2, 4),
df1 xvar = c(16, -1, 11, 13))
<- tibble(id = c(1, 2, 2, 3),
df2 yvar = c(-1, -4, 0, -9))
11.2.0.0.1 Mutating Joins
left_join()
# join df2 to df1 (via left_join)
left_join(x = df1,
y = df2,
by = c("id_numb" = "id")
)
## Warning in left_join(x = df1, y = df2, by = c(id_numb = "id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 5 × 3
## id_numb xvar yvar
## <dbl> <dbl> <dbl>
## 1 1 16 -1
## 2 1 -1 -1
## 3 2 11 -4
## 4 2 11 0
## 5 4 13 NA
# join df1 to df2
# via left_join
left_join(
x = df2,
y = df1,
by = c("id" = "id_numb")
)
## Warning in left_join(x = df2, y = df1, by = c(id = "id_numb")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 3 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 5 × 3
## id yvar xvar
## <dbl> <dbl> <dbl>
## 1 1 -1 16
## 2 1 -1 -1
## 3 2 -4 11
## 4 2 0 11
## 5 3 -9 NA
# via right_join
right_join(x = df1,
y = df2,
by = c("id_numb" = "id")
)
## Warning in right_join(x = df1, y = df2, by = c(id_numb = "id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 5 × 3
## id_numb xvar yvar
## <dbl> <dbl> <dbl>
## 1 1 16 -1
## 2 1 -1 -1
## 3 2 11 -4
## 4 2 11 0
## 5 3 NA -9
right_join()
See above chunk
inner_join()
%>% # piped data goes into x
df1 inner_join(y = df2,
by = c("id_numb" = "id")
)
## Warning in inner_join(., y = df2, by = c(id_numb = "id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 4 × 3
## id_numb xvar yvar
## <dbl> <dbl> <dbl>
## 1 1 16 -1
## 2 1 -1 -1
## 3 2 11 -4
## 4 2 11 0
full_join()
%>%
df2 full_join(x = df1,
# use . to force the piped
# output into the y spot
y = .,
by = c("id_numb" = "id")
)
## Warning in full_join(x = df1, y = ., by = c(id_numb = "id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 6 × 3
## id_numb xvar yvar
## <dbl> <dbl> <dbl>
## 1 1 16 -1
## 2 1 -1 -1
## 3 2 11 -4
## 4 2 11 0
## 5 4 13 NA
## 6 3 NA -9
11.2.0.0.2 Filtering Joins
semi_join()
semi_join(x = df1,
y = df2,
by = c("id_numb" = "id")
)
## # A tibble: 3 × 2
## id_numb xvar
## <dbl> <dbl>
## 1 1 16
## 2 1 -1
## 3 2 11
anti_join()
anti_join(
x = df1,
y = df2,
by = c("id_numb" = "id")
)
## # A tibble: 1 × 2
## id_numb xvar
## <dbl> <dbl>
## 1 4 13
11.2.1 More Examples
- Run the following and explain why
R
does not simply stack the rows. Then, fix the issue with therename()
function.
<- tibble(xvar = c(1, 2), yvar = c(5, 1))
df_test1a <- tibble(x = c(1, 2), y = c(5, 1))
df_test1b bind_rows(df_test1a, df_test1b)
## # A tibble: 4 × 4
## xvar yvar x y
## <dbl> <dbl> <dbl> <dbl>
## 1 1 5 NA NA
## 2 2 1 NA NA
## 3 NA NA 1 5
## 4 NA NA 2 1
%>%
df_test1a rename(x = xvar, y = yvar) %>%
bind_rows(df_test1b)
## # A tibble: 4 × 2
## x y
## <dbl> <dbl>
## 1 1 5
## 2 2 1
## 3 1 5
## 4 2 1
Now try again (prior to renaming) using unite
.
bind_rows(df_test1a, df_test1b) %>%
unite( xvar, x, col = 'x', na.rm = TRUE) %>%
unite( yvar, y, col = 'y', na.rm = TRUE) %>%
mutate(
x = parse_number(x),
y = parse_number(y)
)
## # A tibble: 4 × 2
## x y
## <dbl> <dbl>
## 1 1 5
## 2 2 1
## 3 1 5
## 4 2 1
Consider the babynames
R
package, which has the following data sets:
lifetables
: cohort life tables for differentsex
and differentyear
variables, starting at the year 1900.births
: the number of births in the United States in each year, since 1909babynames
: popularity of different baby names per year and sex since the year 1880.
library(babynames)
## Warning: package 'babynames' was built under R version 4.2.3
<- babynames::lifetables
life_df <- babynames::births
birth_df <- babynames::babynames
babynames_df head(babynames)
head(births)
head(lifetables)
You can read about each data set with ?babynames
, ?births
and ?lifetables
.
- Merge/Join the births data into the babynames data.
left_join(x = babynames_df,
y = birth_df,
by = c("year" = "year")
)
## # A tibble: 1,924,665 × 6
## year sex name n prop births
## <dbl> <chr> <chr> <int> <dbl> <int>
## 1 1880 F Mary 7065 0.0724 NA
## 2 1880 F Anna 2604 0.0267 NA
## 3 1880 F Emma 2003 0.0205 NA
## 4 1880 F Elizabeth 1939 0.0199 NA
## 5 1880 F Minnie 1746 0.0179 NA
## 6 1880 F Margaret 1578 0.0162 NA
## 7 1880 F Ida 1472 0.0151 NA
## 8 1880 F Alice 1414 0.0145 NA
## 9 1880 F Bertha 1320 0.0135 NA
## 10 1880 F Sarah 1288 0.0132 NA
## # ℹ 1,924,655 more rows
- Merge/Join the babynames data into the births data.
left_join(x = birth_df,
y = babynames_df,
by = c("year" = "year")
)
## # A tibble: 1,839,952 × 6
## year births sex name n prop
## <dbl> <int> <chr> <chr> <int> <dbl>
## 1 1909 2718000 F Mary 19259 0.0523
## 2 1909 2718000 F Helen 9250 0.0251
## 3 1909 2718000 F Margaret 7359 0.0200
## 4 1909 2718000 F Ruth 6509 0.0177
## 5 1909 2718000 F Dorothy 6253 0.0170
## 6 1909 2718000 F Anna 5804 0.0158
## 7 1909 2718000 F Elizabeth 5176 0.0141
## 8 1909 2718000 F Mildred 5054 0.0137
## 9 1909 2718000 F Marie 4301 0.0117
## 10 1909 2718000 F Alice 4170 0.0113
## # ℹ 1,839,942 more rows
Examine the two joins, and explain why one resulting data set has fewer observations (rows) than the other.
Evaluate whether the following statement is true or false: an
inner_join()
will always result in a data set with the same or fewer rows than afull_join()
.Evaluate whether the following statement is true or false: an
inner_join()
will always result in a data set with the same or fewer rows than aleft_join()
.
= tibble(x1 = 1, y1 = 10)
x = tibble(x1 = c(1,1,2,2,3,3),
y y2 = c(7,8,4,6,7,3)
)inner_join(x,y, by = "x1")
## # A tibble: 2 × 3
## x1 y1 y2
## <dbl> <dbl> <dbl>
## 1 1 10 7
## 2 1 10 8
inner_join(y,x, by = "x1")
## # A tibble: 2 × 3
## x1 y2 y1
## <dbl> <dbl> <dbl>
## 1 1 7 10
## 2 1 8 10
left_join(x, y, by = "x1")
## # A tibble: 2 × 3
## x1 y1 y2
## <dbl> <dbl> <dbl>
## 1 1 10 7
## 2 1 10 8
left_join(y, x, by = "x1")
## # A tibble: 6 × 3
## x1 y2 y1
## <dbl> <dbl> <dbl>
## 1 1 7 10
## 2 1 8 10
## 3 2 4 NA
## 4 2 6 NA
## 5 3 7 NA
## 6 3 3 NA
11.3 Exercises
- Read in the gun violence data set, and suppose that you want to add a row to this data set that has the statistics on gun ownership and mortality rate in the District of Columbia (Washington D.C., which is in the NE region, has 16.7 deaths per 100,000 people, and a gun ownership rate of 8.7%). To do so, create a
tibble()
that has a single row representing D.C. and then combine your newtibble
with the overall gun violence data set. Name this new data setall_df
.
library(tidyverse)
<- read_csv("data/gun_violence_us.csv") mortality_df
- Explain why each attempt at combining the D.C. data with the overall data doesn’t work or is incorrect.
<- tibble(state = "Washington D.C.", mortality_rate = 16.7,
test1 ownership_rate = 8.7, region = "NE")
bind_rows(mortality_df, test1)
<- tibble(state = "Washington D.C.", mortality_rate = 16.7,
test2 ownership_rate = 0.087, region = NE)
## Error in eval_tidy(xs[[j]], mask): object 'NE' not found
bind_rows(mortality_df, test2)
## Error in list2(...): object 'test2' not found
<- tibble(state = "Washington D.C.", mortality_rate = "16.7",
test3 ownership_rate = "0.087", region = "NE")
bind_rows(mortality_df, test3)
## Error in `bind_rows()`:
## ! Can't combine `..1$mortality_rate` <double> and `..2$mortality_rate` <character>.
- Examine the following data sets that are in
R
’s base library on demographic statistics about the U.S. states and state abbreviations:
<- as_tibble(state.x77)
df1 <- as_tibble(state.abb)
df2 df1
## # A tibble: 50 × 8
## Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3615 3624 2.1 69.0 15.1 41.3 20 50708
## 2 365 6315 1.5 69.3 11.3 66.7 152 566432
## 3 2212 4530 1.8 70.6 7.8 58.1 15 113417
## 4 2110 3378 1.9 70.7 10.1 39.9 65 51945
## 5 21198 5114 1.1 71.7 10.3 62.6 20 156361
## 6 2541 4884 0.7 72.1 6.8 63.9 166 103766
## 7 3100 5348 1.1 72.5 3.1 56 139 4862
## 8 579 4809 0.9 70.1 6.2 54.6 103 1982
## 9 8277 4815 1.3 70.7 10.7 52.6 11 54090
## 10 4931 4091 2 68.5 13.9 40.6 60 58073
## # ℹ 40 more rows
df2
## # A tibble: 50 × 1
## value
## <chr>
## 1 AL
## 2 AK
## 3 AZ
## 4 AR
## 5 CA
## 6 CO
## 7 CT
## 8 DE
## 9 FL
## 10 GA
## # ℹ 40 more rows
Combine the two data sets with bind_cols()
. What are you assuming about the data sets in order to use this function?
<- bind_cols(df1, df2)
states_df states_df
## # A tibble: 50 × 9
## Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 3615 3624 2.1 69.0 15.1 41.3 20 50708 AL
## 2 365 6315 1.5 69.3 11.3 66.7 152 566432 AK
## 3 2212 4530 1.8 70.6 7.8 58.1 15 113417 AZ
## 4 2110 3378 1.9 70.7 10.1 39.9 65 51945 AR
## 5 21198 5114 1.1 71.7 10.3 62.6 20 156361 CA
## 6 2541 4884 0.7 72.1 6.8 63.9 166 103766 CO
## 7 3100 5348 1.1 72.5 3.1 56 139 4862 CT
## 8 579 4809 0.9 70.1 6.2 54.6 103 1982 DE
## 9 8277 4815 1.3 70.7 10.7 52.6 11 54090 FL
## 10 4931 4091 2 68.5 13.9 40.6 60 58073 GA
## # ℹ 40 more rows
- Use a join function to combine the mortality data set (
all_df
) with D.C. with the states data set fromstates_df
. For this exercise, keep the row with Washington D.C., having it take onNA
values for any variable not observed in the states data.
left_join(all_df, states_df,
by = c("state" = "value"))
## # A tibble: 51 × 12
## state mortality_rate ownership_rate region Population Income Illiteracy
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 16.7 0.489 South 3615 3624 2.1
## 2 AK 18.8 0.617 West 365 6315 1.5
## 3 AZ 13.4 0.323 West 2212 4530 1.8
## 4 AR 16.4 0.579 South 2110 3378 1.9
## 5 CA 7.4 0.201 West 21198 5114 1.1
## 6 CO 12.1 0.343 West 2541 4884 0.7
## 7 CT 4.9 0.166 NE 3100 5348 1.1
## 8 DE 11.1 0.052 NE 579 4809 0.9
## 9 FL 11.5 0.325 South 8277 4815 1.3
## 10 GA 13.7 0.316 South 4931 4091 2
## # ℹ 41 more rows
## # ℹ 5 more variables: `Life Exp` <dbl>, Murder <dbl>, `HS Grad` <dbl>,
## # Frost <dbl>, Area <dbl>
# could also use full_join here
- Repeat Exercise 4, but now drop Washington D.C. in your merging process. Practice doing this with a join function (as opposed to
slice
-ing orfilter
-ing it out explicitly).
inner_join(all_df, states_df,
by = c("state" = "value"))
## # A tibble: 50 × 12
## state mortality_rate ownership_rate region Population Income Illiteracy
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 16.7 0.489 South 3615 3624 2.1
## 2 AK 18.8 0.617 West 365 6315 1.5
## 3 AZ 13.4 0.323 West 2212 4530 1.8
## 4 AR 16.4 0.579 South 2110 3378 1.9
## 5 CA 7.4 0.201 West 21198 5114 1.1
## 6 CO 12.1 0.343 West 2541 4884 0.7
## 7 CT 4.9 0.166 NE 3100 5348 1.1
## 8 DE 11.1 0.052 NE 579 4809 0.9
## 9 FL 11.5 0.325 South 8277 4815 1.3
## 10 GA 13.7 0.316 South 4931 4091 2
## # ℹ 40 more rows
## # ℹ 5 more variables: `Life Exp` <dbl>, Murder <dbl>, `HS Grad` <dbl>,
## # Frost <dbl>, Area <dbl>
# or
right_join(all_df, states_df,
by = c("state" = "value"))
## # A tibble: 50 × 12
## state mortality_rate ownership_rate region Population Income Illiteracy
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 16.7 0.489 South 3615 3624 2.1
## 2 AK 18.8 0.617 West 365 6315 1.5
## 3 AZ 13.4 0.323 West 2212 4530 1.8
## 4 AR 16.4 0.579 South 2110 3378 1.9
## 5 CA 7.4 0.201 West 21198 5114 1.1
## 6 CO 12.1 0.343 West 2541 4884 0.7
## 7 CT 4.9 0.166 NE 3100 5348 1.1
## 8 DE 11.1 0.052 NE 579 4809 0.9
## 9 FL 11.5 0.325 South 8277 4815 1.3
## 10 GA 13.7 0.316 South 4931 4091 2
## # ℹ 40 more rows
## # ℹ 5 more variables: `Life Exp` <dbl>, Murder <dbl>, `HS Grad` <dbl>,
## # Frost <dbl>, Area <dbl>
- Use
semi_join()
to create a subset ofstates_df
that are in theNE
region. Hint: You will need to filterall_df
first to contain only states in theNE
region.
<- all_df %>% filter(region == "NE")
ne_df
semi_join(states_df, ne_df,
by = c("value" = "state"))
## # A tibble: 10 × 9
## Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 3100 5348 1.1 72.5 3.1 56 139 4862 CT
## 2 579 4809 0.9 70.1 6.2 54.6 103 1982 DE
## 3 1058 3694 0.7 70.4 2.7 54.7 161 30920 ME
## 4 4122 5299 0.9 70.2 8.5 52.3 101 9891 MD
## 5 5814 4755 1.1 71.8 3.3 58.5 103 7826 MA
## 6 812 4281 0.7 71.2 3.3 57.6 174 9027 NH
## 7 7333 5237 1.1 70.9 5.2 52.5 115 7521 NJ
## 8 18076 4903 1.4 70.6 10.9 52.7 82 47831 NY
## 9 931 4558 1.3 71.9 2.4 46.4 127 1049 RI
## 10 472 3907 0.6 71.6 5.5 57.1 168 9267 VT
# using piping
%>%
all_df filter(region == "NE") %>%
semi_join(x = states_df,
y = . , # piped output goes here
by = c("value" = "state")
)
## # A tibble: 10 × 9
## Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 3100 5348 1.1 72.5 3.1 56 139 4862 CT
## 2 579 4809 0.9 70.1 6.2 54.6 103 1982 DE
## 3 1058 3694 0.7 70.4 2.7 54.7 161 30920 ME
## 4 4122 5299 0.9 70.2 8.5 52.3 101 9891 MD
## 5 5814 4755 1.1 71.8 3.3 58.5 103 7826 MA
## 6 812 4281 0.7 71.2 3.3 57.6 174 9027 NH
## 7 7333 5237 1.1 70.9 5.2 52.5 115 7521 NJ
## 8 18076 4903 1.4 70.6 10.9 52.7 82 47831 NY
## 9 931 4558 1.3 71.9 2.4 46.4 127 1049 RI
## 10 472 3907 0.6 71.6 5.5 57.1 168 9267 VT
# using "base version of piping"
|>
all_df filter(region == "NE") |>
semi_join(x = states_df,
y = _ , # piped output goes here
by = c("value" = "state")
)
## # A tibble: 10 × 9
## Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 3100 5348 1.1 72.5 3.1 56 139 4862 CT
## 2 579 4809 0.9 70.1 6.2 54.6 103 1982 DE
## 3 1058 3694 0.7 70.4 2.7 54.7 161 30920 ME
## 4 4122 5299 0.9 70.2 8.5 52.3 101 9891 MD
## 5 5814 4755 1.1 71.8 3.3 58.5 103 7826 MA
## 6 812 4281 0.7 71.2 3.3 57.6 174 9027 NH
## 7 7333 5237 1.1 70.9 5.2 52.5 115 7521 NJ
## 8 18076 4903 1.4 70.6 10.9 52.7 82 47831 NY
## 9 931 4558 1.3 71.9 2.4 46.4 127 1049 RI
## 10 472 3907 0.6 71.6 5.5 57.1 168 9267 VT
Do the same thing as Exercise 6, but this time, use
anti_join()
. Hint: You’ll need to filterall_df
in a different way to achieve this.Examine the following data sets (the first is
df1
and the second isdf2
) and then, without running any code, answer the following questions. (You can check your answer using code, but try it “by hand” first!)
<- tibble(id = c("A", "B", "C", "E", "F"), xvar = c(1, 2, 3, 1, 2))
df1 <- tibble(id = c("A", "C", "D", "E", "G", "H"), yvar = c(2, 1, 2, 1, 1, 4)) df2
How many rows would be in the data set from
left_join(df1, df2, by = c("id" = "id"))
?How many rows would be in the data set from
left_join(df2, df1, by = c("id" = "id"))
?How many rows would be in the data set from
full_join(df1, df2, by = c("id" = "id"))
?How many rows would be in the data set from
inner_join(df1, df2, by = c("id" = "id"))
?How many rows would be in the data set from
semi_join(df1, df2, by = c("id" = "id"))
?How many rows would be in the data set from
anti_join(df1, df2, by = c("id" = "id"))
?