11 Combining Data Sets with dplyr - Basics

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
library(pander)
## Warning: package 'pander' was built under R version 4.2.2

11.1 Stacking Rows

Suppose we have the following two data sets. The first, dat1 has the variables x1 and y1. The second, dat2 has the same variables. They represent a pair of datasets measuring the same features at two different locations. We would like to combine them into a single data frame.

dat1 <- tibble(x1 = runif(4),
  y1 = rnorm(4))
dat2 <- tibble(y1 = rpois(4,1),
              x1 = rgamma(4,shape = 2,scale = 3))

Notice that the order of the two variables is switched in the two tibbles. This means we can’t simply stack them without matching column headings

11.1.0.1 bind_rows()

dat_all <-
  bind_rows(
    "LocA" = dat1, 
    "LocB" = dat2,
    .id = "Location"
    )

11.2 Merging Data using Joins

Suppose you have the following two data sets. The first, df1 has the variables id_numb and xvar. The second, df2 has the variables id and yvar. id_numb and id serve as identification variables, possibly with duplicates, where observations from the first data set with id_numb = 1 correspond to observations in the second data set with id = 1.

df1 <- tibble(id_numb = c(1, 1, 2, 4),
  xvar = c(16, -1, 11, 13))
df2 <- tibble(id = c(1, 2, 2, 3),
              yvar = c(-1, -4, 0, -9))
11.2.0.0.1 Mutating Joins
  • left_join()
# join df2 to df1 (via left_join)
left_join(x = df1,
          y = df2,
          by = c("id_numb" = "id")
          )
## Warning in left_join(x = df1, y = df2, by = c(id_numb = "id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
## # A tibble: 5 × 3
##   id_numb  xvar  yvar
##     <dbl> <dbl> <dbl>
## 1       1    16    -1
## 2       1    -1    -1
## 3       2    11    -4
## 4       2    11     0
## 5       4    13    NA
# join df1 to df2
# via left_join
left_join(
  x = df2,
  y = df1,
  by = c("id" = "id_numb")
)
## Warning in left_join(x = df2, y = df1, by = c(id = "id_numb")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 3 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
## # A tibble: 5 × 3
##      id  yvar  xvar
##   <dbl> <dbl> <dbl>
## 1     1    -1    16
## 2     1    -1    -1
## 3     2    -4    11
## 4     2     0    11
## 5     3    -9    NA
# via right_join
right_join(x = df1,
          y = df2,
          by = c("id_numb" = "id")
          )
## Warning in right_join(x = df1, y = df2, by = c(id_numb = "id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
## # A tibble: 5 × 3
##   id_numb  xvar  yvar
##     <dbl> <dbl> <dbl>
## 1       1    16    -1
## 2       1    -1    -1
## 3       2    11    -4
## 4       2    11     0
## 5       3    NA    -9
  • right_join()

See above chunk

  • inner_join()
df1 %>% # piped data goes into x
  inner_join(y = df2,
             by = c("id_numb" = "id")
             )
## Warning in inner_join(., y = df2, by = c(id_numb = "id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
## # A tibble: 4 × 3
##   id_numb  xvar  yvar
##     <dbl> <dbl> <dbl>
## 1       1    16    -1
## 2       1    -1    -1
## 3       2    11    -4
## 4       2    11     0
  • full_join()
df2 %>%
  full_join(x = df1,
            # use . to force the piped 
            # output into the y spot
            y = ., 
            by = c("id_numb" = "id")
            )
## Warning in full_join(x = df1, y = ., by = c(id_numb = "id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
## # A tibble: 6 × 3
##   id_numb  xvar  yvar
##     <dbl> <dbl> <dbl>
## 1       1    16    -1
## 2       1    -1    -1
## 3       2    11    -4
## 4       2    11     0
## 5       4    13    NA
## 6       3    NA    -9
11.2.0.0.2 Filtering Joins
  • semi_join()
semi_join(x = df1,
          y = df2,
          by = c("id_numb" = "id")
          )
## # A tibble: 3 × 2
##   id_numb  xvar
##     <dbl> <dbl>
## 1       1    16
## 2       1    -1
## 3       2    11
  • anti_join()
anti_join(
  x = df1,
  y = df2,
  by = c("id_numb" = "id")
)
## # A tibble: 1 × 2
##   id_numb  xvar
##     <dbl> <dbl>
## 1       4    13

11.2.1 More Examples

  1. Run the following and explain why R does not simply stack the rows. Then, fix the issue with the rename() function.
df_test1a <- tibble(xvar = c(1, 2), yvar = c(5, 1))
df_test1b <- tibble(x = c(1, 2), y = c(5, 1))
bind_rows(df_test1a, df_test1b)
## # A tibble: 4 × 4
##    xvar  yvar     x     y
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     5    NA    NA
## 2     2     1    NA    NA
## 3    NA    NA     1     5
## 4    NA    NA     2     1
df_test1a %>% 
  rename(x = xvar, y = yvar) %>%
  bind_rows(df_test1b)
## # A tibble: 4 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     5
## 2     2     1
## 3     1     5
## 4     2     1

Now try again (prior to renaming) using unite.

bind_rows(df_test1a, df_test1b) %>%
  unite(  xvar, x, col = 'x', na.rm = TRUE) %>%
  unite(  yvar, y, col = 'y', na.rm = TRUE) %>%
  mutate(
    x = parse_number(x),
    y = parse_number(y)
  )
## # A tibble: 4 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     5
## 2     2     1
## 3     1     5
## 4     2     1

Consider the babynames R package, which has the following data sets:

  • lifetables: cohort life tables for different sex and different year variables, starting at the year 1900.
  • births: the number of births in the United States in each year, since 1909
  • babynames: popularity of different baby names per year and sex since the year 1880.
library(babynames)
## Warning: package 'babynames' was built under R version 4.2.3
life_df <- babynames::lifetables
birth_df <- babynames::births
babynames_df <- babynames::babynames
head(babynames)
head(births)
head(lifetables)

You can read about each data set with ?babynames, ?births and ?lifetables.

  1. Merge/Join the births data into the babynames data.
left_join(x = babynames_df,
          y = birth_df,
          by = c("year" = "year")
          )
## # A tibble: 1,924,665 × 6
##     year sex   name          n   prop births
##    <dbl> <chr> <chr>     <int>  <dbl>  <int>
##  1  1880 F     Mary       7065 0.0724     NA
##  2  1880 F     Anna       2604 0.0267     NA
##  3  1880 F     Emma       2003 0.0205     NA
##  4  1880 F     Elizabeth  1939 0.0199     NA
##  5  1880 F     Minnie     1746 0.0179     NA
##  6  1880 F     Margaret   1578 0.0162     NA
##  7  1880 F     Ida        1472 0.0151     NA
##  8  1880 F     Alice      1414 0.0145     NA
##  9  1880 F     Bertha     1320 0.0135     NA
## 10  1880 F     Sarah      1288 0.0132     NA
## # ℹ 1,924,655 more rows
  1. Merge/Join the babynames data into the births data.
left_join(x = birth_df,
          y = babynames_df,
          by = c("year" = "year")
          )
## # A tibble: 1,839,952 × 6
##     year  births sex   name          n   prop
##    <dbl>   <int> <chr> <chr>     <int>  <dbl>
##  1  1909 2718000 F     Mary      19259 0.0523
##  2  1909 2718000 F     Helen      9250 0.0251
##  3  1909 2718000 F     Margaret   7359 0.0200
##  4  1909 2718000 F     Ruth       6509 0.0177
##  5  1909 2718000 F     Dorothy    6253 0.0170
##  6  1909 2718000 F     Anna       5804 0.0158
##  7  1909 2718000 F     Elizabeth  5176 0.0141
##  8  1909 2718000 F     Mildred    5054 0.0137
##  9  1909 2718000 F     Marie      4301 0.0117
## 10  1909 2718000 F     Alice      4170 0.0113
## # ℹ 1,839,942 more rows
  1. Examine the two joins, and explain why one resulting data set has fewer observations (rows) than the other.

  2. Evaluate whether the following statement is true or false: an inner_join() will always result in a data set with the same or fewer rows than a full_join().

  3. Evaluate whether the following statement is true or false: an inner_join() will always result in a data set with the same or fewer rows than a left_join().

x = tibble(x1 = 1, y1 = 10)
y = tibble(x1 = c(1,1,2,2,3,3),
           y2 = c(7,8,4,6,7,3)
           )
inner_join(x,y, by = "x1")
## # A tibble: 2 × 3
##      x1    y1    y2
##   <dbl> <dbl> <dbl>
## 1     1    10     7
## 2     1    10     8
inner_join(y,x, by = "x1")
## # A tibble: 2 × 3
##      x1    y2    y1
##   <dbl> <dbl> <dbl>
## 1     1     7    10
## 2     1     8    10
left_join(x, y, by = "x1")
## # A tibble: 2 × 3
##      x1    y1    y2
##   <dbl> <dbl> <dbl>
## 1     1    10     7
## 2     1    10     8
left_join(y, x, by = "x1")
## # A tibble: 6 × 3
##      x1    y2    y1
##   <dbl> <dbl> <dbl>
## 1     1     7    10
## 2     1     8    10
## 3     2     4    NA
## 4     2     6    NA
## 5     3     7    NA
## 6     3     3    NA

11.3 Exercises

  1. Read in the gun violence data set, and suppose that you want to add a row to this data set that has the statistics on gun ownership and mortality rate in the District of Columbia (Washington D.C., which is in the NE region, has 16.7 deaths per 100,000 people, and a gun ownership rate of 8.7%). To do so, create a tibble() that has a single row representing D.C. and then combine your new tibble with the overall gun violence data set. Name this new data set all_df.
library(tidyverse)
mortality_df <- read_csv("data/gun_violence_us.csv")
  1. Explain why each attempt at combining the D.C. data with the overall data doesn’t work or is incorrect.
test1 <- tibble(state = "Washington D.C.", mortality_rate = 16.7,
                ownership_rate = 8.7, region = "NE")
bind_rows(mortality_df, test1)

test2 <- tibble(state = "Washington D.C.", mortality_rate = 16.7,
       ownership_rate = 0.087, region = NE)
## Error in eval_tidy(xs[[j]], mask): object 'NE' not found
bind_rows(mortality_df, test2)
## Error in list2(...): object 'test2' not found
test3 <- tibble(state = "Washington D.C.", mortality_rate = "16.7",
       ownership_rate = "0.087", region = "NE")
bind_rows(mortality_df, test3)
## Error in `bind_rows()`:
## ! Can't combine `..1$mortality_rate` <double> and `..2$mortality_rate` <character>.
  1. Examine the following data sets that are in R’s base library on demographic statistics about the U.S. states and state abbreviations:
df1 <- as_tibble(state.x77)
df2 <- as_tibble(state.abb)
df1
## # A tibble: 50 × 8
##    Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost   Area
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl>  <dbl>
##  1       3615   3624        2.1       69.0   15.1      41.3    20  50708
##  2        365   6315        1.5       69.3   11.3      66.7   152 566432
##  3       2212   4530        1.8       70.6    7.8      58.1    15 113417
##  4       2110   3378        1.9       70.7   10.1      39.9    65  51945
##  5      21198   5114        1.1       71.7   10.3      62.6    20 156361
##  6       2541   4884        0.7       72.1    6.8      63.9   166 103766
##  7       3100   5348        1.1       72.5    3.1      56     139   4862
##  8        579   4809        0.9       70.1    6.2      54.6   103   1982
##  9       8277   4815        1.3       70.7   10.7      52.6    11  54090
## 10       4931   4091        2         68.5   13.9      40.6    60  58073
## # ℹ 40 more rows
df2
## # A tibble: 50 × 1
##    value
##    <chr>
##  1 AL   
##  2 AK   
##  3 AZ   
##  4 AR   
##  5 CA   
##  6 CO   
##  7 CT   
##  8 DE   
##  9 FL   
## 10 GA   
## # ℹ 40 more rows

Combine the two data sets with bind_cols(). What are you assuming about the data sets in order to use this function?

states_df <- bind_cols(df1, df2)
states_df
## # A tibble: 50 × 9
##    Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost   Area value
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl>  <dbl> <chr>
##  1       3615   3624        2.1       69.0   15.1      41.3    20  50708 AL   
##  2        365   6315        1.5       69.3   11.3      66.7   152 566432 AK   
##  3       2212   4530        1.8       70.6    7.8      58.1    15 113417 AZ   
##  4       2110   3378        1.9       70.7   10.1      39.9    65  51945 AR   
##  5      21198   5114        1.1       71.7   10.3      62.6    20 156361 CA   
##  6       2541   4884        0.7       72.1    6.8      63.9   166 103766 CO   
##  7       3100   5348        1.1       72.5    3.1      56     139   4862 CT   
##  8        579   4809        0.9       70.1    6.2      54.6   103   1982 DE   
##  9       8277   4815        1.3       70.7   10.7      52.6    11  54090 FL   
## 10       4931   4091        2         68.5   13.9      40.6    60  58073 GA   
## # ℹ 40 more rows
  1. Use a join function to combine the mortality data set (all_df) with D.C. with the states data set from states_df. For this exercise, keep the row with Washington D.C., having it take on NA values for any variable not observed in the states data.
left_join(all_df, states_df, 
          by = c("state" = "value"))
## # A tibble: 51 × 12
##    state mortality_rate ownership_rate region Population Income Illiteracy
##    <chr>          <dbl>          <dbl> <chr>       <dbl>  <dbl>      <dbl>
##  1 AL              16.7          0.489 South        3615   3624        2.1
##  2 AK              18.8          0.617 West          365   6315        1.5
##  3 AZ              13.4          0.323 West         2212   4530        1.8
##  4 AR              16.4          0.579 South        2110   3378        1.9
##  5 CA               7.4          0.201 West        21198   5114        1.1
##  6 CO              12.1          0.343 West         2541   4884        0.7
##  7 CT               4.9          0.166 NE           3100   5348        1.1
##  8 DE              11.1          0.052 NE            579   4809        0.9
##  9 FL              11.5          0.325 South        8277   4815        1.3
## 10 GA              13.7          0.316 South        4931   4091        2  
## # ℹ 41 more rows
## # ℹ 5 more variables: `Life Exp` <dbl>, Murder <dbl>, `HS Grad` <dbl>,
## #   Frost <dbl>, Area <dbl>
# could also use full_join here
  1. Repeat Exercise 4, but now drop Washington D.C. in your merging process. Practice doing this with a join function (as opposed to slice-ing or filter-ing it out explicitly).
inner_join(all_df, states_df, 
           by = c("state" = "value"))
## # A tibble: 50 × 12
##    state mortality_rate ownership_rate region Population Income Illiteracy
##    <chr>          <dbl>          <dbl> <chr>       <dbl>  <dbl>      <dbl>
##  1 AL              16.7          0.489 South        3615   3624        2.1
##  2 AK              18.8          0.617 West          365   6315        1.5
##  3 AZ              13.4          0.323 West         2212   4530        1.8
##  4 AR              16.4          0.579 South        2110   3378        1.9
##  5 CA               7.4          0.201 West        21198   5114        1.1
##  6 CO              12.1          0.343 West         2541   4884        0.7
##  7 CT               4.9          0.166 NE           3100   5348        1.1
##  8 DE              11.1          0.052 NE            579   4809        0.9
##  9 FL              11.5          0.325 South        8277   4815        1.3
## 10 GA              13.7          0.316 South        4931   4091        2  
## # ℹ 40 more rows
## # ℹ 5 more variables: `Life Exp` <dbl>, Murder <dbl>, `HS Grad` <dbl>,
## #   Frost <dbl>, Area <dbl>
# or

right_join(all_df, states_df, 
           by = c("state" = "value"))
## # A tibble: 50 × 12
##    state mortality_rate ownership_rate region Population Income Illiteracy
##    <chr>          <dbl>          <dbl> <chr>       <dbl>  <dbl>      <dbl>
##  1 AL              16.7          0.489 South        3615   3624        2.1
##  2 AK              18.8          0.617 West          365   6315        1.5
##  3 AZ              13.4          0.323 West         2212   4530        1.8
##  4 AR              16.4          0.579 South        2110   3378        1.9
##  5 CA               7.4          0.201 West        21198   5114        1.1
##  6 CO              12.1          0.343 West         2541   4884        0.7
##  7 CT               4.9          0.166 NE           3100   5348        1.1
##  8 DE              11.1          0.052 NE            579   4809        0.9
##  9 FL              11.5          0.325 South        8277   4815        1.3
## 10 GA              13.7          0.316 South        4931   4091        2  
## # ℹ 40 more rows
## # ℹ 5 more variables: `Life Exp` <dbl>, Murder <dbl>, `HS Grad` <dbl>,
## #   Frost <dbl>, Area <dbl>
  1. Use semi_join() to create a subset of states_df that are in the NE region. Hint: You will need to filter all_df first to contain only states in the NE region.
ne_df <- all_df %>% filter(region == "NE")

semi_join(states_df, ne_df, 
          by = c("value" = "state"))
## # A tibble: 10 × 9
##    Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area value
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl> <chr>
##  1       3100   5348        1.1       72.5    3.1      56     139  4862 CT   
##  2        579   4809        0.9       70.1    6.2      54.6   103  1982 DE   
##  3       1058   3694        0.7       70.4    2.7      54.7   161 30920 ME   
##  4       4122   5299        0.9       70.2    8.5      52.3   101  9891 MD   
##  5       5814   4755        1.1       71.8    3.3      58.5   103  7826 MA   
##  6        812   4281        0.7       71.2    3.3      57.6   174  9027 NH   
##  7       7333   5237        1.1       70.9    5.2      52.5   115  7521 NJ   
##  8      18076   4903        1.4       70.6   10.9      52.7    82 47831 NY   
##  9        931   4558        1.3       71.9    2.4      46.4   127  1049 RI   
## 10        472   3907        0.6       71.6    5.5      57.1   168  9267 VT
# using piping

all_df %>%
  filter(region == "NE") %>%
  semi_join(x = states_df, 
            y = . , # piped output goes here
            by = c("value" = "state")
            )
## # A tibble: 10 × 9
##    Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area value
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl> <chr>
##  1       3100   5348        1.1       72.5    3.1      56     139  4862 CT   
##  2        579   4809        0.9       70.1    6.2      54.6   103  1982 DE   
##  3       1058   3694        0.7       70.4    2.7      54.7   161 30920 ME   
##  4       4122   5299        0.9       70.2    8.5      52.3   101  9891 MD   
##  5       5814   4755        1.1       71.8    3.3      58.5   103  7826 MA   
##  6        812   4281        0.7       71.2    3.3      57.6   174  9027 NH   
##  7       7333   5237        1.1       70.9    5.2      52.5   115  7521 NJ   
##  8      18076   4903        1.4       70.6   10.9      52.7    82 47831 NY   
##  9        931   4558        1.3       71.9    2.4      46.4   127  1049 RI   
## 10        472   3907        0.6       71.6    5.5      57.1   168  9267 VT
# using "base version of piping"
all_df |>
  filter(region == "NE") |>
  semi_join(x = states_df, 
            y = _ , # piped output goes here
            by = c("value" = "state")
            )
## # A tibble: 10 × 9
##    Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area value
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl> <chr>
##  1       3100   5348        1.1       72.5    3.1      56     139  4862 CT   
##  2        579   4809        0.9       70.1    6.2      54.6   103  1982 DE   
##  3       1058   3694        0.7       70.4    2.7      54.7   161 30920 ME   
##  4       4122   5299        0.9       70.2    8.5      52.3   101  9891 MD   
##  5       5814   4755        1.1       71.8    3.3      58.5   103  7826 MA   
##  6        812   4281        0.7       71.2    3.3      57.6   174  9027 NH   
##  7       7333   5237        1.1       70.9    5.2      52.5   115  7521 NJ   
##  8      18076   4903        1.4       70.6   10.9      52.7    82 47831 NY   
##  9        931   4558        1.3       71.9    2.4      46.4   127  1049 RI   
## 10        472   3907        0.6       71.6    5.5      57.1   168  9267 VT
  1. Do the same thing as Exercise 6, but this time, use anti_join(). Hint: You’ll need to filter all_df in a different way to achieve this.

  2. Examine the following data sets (the first is df1 and the second is df2) and then, without running any code, answer the following questions. (You can check your answer using code, but try it “by hand” first!)

df1 <- tibble(id = c("A", "B", "C", "E", "F"), xvar = c(1, 2, 3, 1, 2))
df2 <- tibble(id = c("A", "C", "D", "E", "G", "H"), yvar = c(2, 1, 2, 1, 1, 4))
  1. How many rows would be in the data set from left_join(df1, df2, by = c("id" = "id"))?

  2. How many rows would be in the data set from left_join(df2, df1, by = c("id" = "id"))?

  3. How many rows would be in the data set from full_join(df1, df2, by = c("id" = "id"))?

  4. How many rows would be in the data set from inner_join(df1, df2, by = c("id" = "id"))?

  5. How many rows would be in the data set from semi_join(df1, df2, by = c("id" = "id"))?

  6. How many rows would be in the data set from anti_join(df1, df2, by = c("id" = "id"))?