14 Introduction to Web Scraping

14.1 Data Scraping with rvest

Sometimes, you might want data from a public website that isn’t provided in a file format. To obtain this data, you’ll need to use web scraping, a term that just means “getting data from a website.” The easiest way to do this in R is with the rvest package. Note that we could spend an entire semester on web scraping; here, we will focus only on websites where scraping the data is “easy” and won’t throw any major errors.

Go to the following website and suppose that you wanted to get the table of gun violence statistics into R: https://en.wikipedia.org/wiki/Gun_violence_in_the_United_States_by_state. You could try copy-pasting the table into Excel and reading the data set in with read_excel(). Depending on the format of the table, that strategy may or may not work. Another way is to scrape the table directly with rvest. Additionally, if the website continually updates (standings for a sports league, enrollment data for a school, best-selling products for a company, etc.), then scraping is much more convenient: you don’t need to copy-paste every time you want the updated data.

In the following code chunk, read_html() reads in the entire HTML file from the provided URL, while html_nodes("table") extracts only the table nodes on the page.

library(tidyverse)
library(rvest)
## provide the URL and name it something (in this case, url).
url <- "https://en.wikipedia.org/wiki/Gun_violence_in_the_United_States_by_state"

## convert the HTML code into something R can read
h <- read_html(url)

## grab all of the table nodes
tab <- h %>% html_nodes("table")
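
Before converting anything, it can help to check how many table nodes were found:

## count the table nodes that were scraped
length(tab)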

You’ll see that, for this example, there are 3 tables provided. The tables are stored in a list and we can reference the first table using [[1]], the second table using [[2]], etc. For the purposes of this class, we will figure out which of the 3 tables is the one we actually want using trial and error.

The html_table() function converts each of the table nodes into a data frame, returning the results in a list.

test <- tab %>% html_table()

head( test[[1]] )
head( test[[2]] )
head( test[[3]] )
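
Scrolling through head() output works, but another quick way to narrow things down is to compare each table’s dimensions. The following is a minimal sketch using purrr’s map() (loaded with the tidyverse); the table of state-level statistics should have roughly 50 rows, one per state.

## dimensions (rows, columns) of each scraped table
map(test, dim)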

Which of the 3 tables is the one that we would want to use for an analysis on gun violence in the United States? After determining which one to use, extract it from the list and store it as a new object. (Then double-check your Environment to see if it is in a form that you easily recognize.)

gun_violence <- test[[3]]
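
One quick way to do that double-checking from the console is glimpse(), which prints each column’s name, type, and first few values:

## confirm that the columns and types look reasonable
glimpse(gun_violence)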

As another example, consider scraping data from SLU’s athletics page. In particular, suppose we want to do an analysis on SLU’s baseball team.

Go to the following website to look at the table of data that we want to scrape: https://saintsathletics.com/sports/baseball/stats/2021.

After looking at the website, scrape the data set.

url <- "https://saintsathletics.com/sports/baseball/stats/2021"
h <- read_html(url)
tab <- h %>% html_nodes("table")
objs <- tab %>% html_table()


head(objs[[1]])
tail(objs[[1]])

# can continue (or look at object in viewer)
batting_2021 <- objs[[1]]

There are now 72 different tables! See if you can figure out where the first few tables are coming from on the website. After doing so, extract the appropriate table from the list and store it as a tibble.
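
With 72 tables, printing head() for every one is impractical. Here is one sketch of a faster scan: tabulate the row counts with purrr’s map_int() and focus on the larger tables, which are more likely to hold the full player statistics.

## how many tables did we scrape?
length(objs)

## row count of each table; larger tables likely hold the full statistics
map_int(objs, nrow)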

14.1.1 Exercises

  1. Go to https://en.wikipedia.org/wiki/Beer_measurement. Scrape the tables and join the IBU table into the SRM table. (Note that, even with some cleaning, we won’t get many rows in the IBU table that have a match in the SRM table.)
url <- "https://en.wikipedia.org/wiki/Beer_measurement"
objs <- read_html(url) %>%
  html_nodes("table") %>%
  html_table()

srm <- objs[[1]]
ibu <- objs[[2]]

## lowercase the example beer names so they can be matched to the IBU table
srm <- 
  srm %>%
    mutate(Example = str_to_lower(Example))

## repair the duplicated column names with as_tibble(), lowercase the
## style names, and keep only the IBU column (2) and Example column (4)
ibu <-
  as_tibble(objs[[2]], .name_repair = "unique") %>%
  mutate(Example = 
           str_to_lower(`IBUs of some common styles[15]...1`)
         ) %>%
  select(2, 4)
## New names:
## • `IBUs of some common styles[15]` -> `IBUs of some common styles[15]...1`
## • `IBUs of some common styles[15]` -> `IBUs of some common styles[15]...2`
## • `IBUs of some common styles[15]` -> `IBUs of some common styles[15]...3`
beers <- 
  srm %>%
    ## some SRM rows list multiple example beers, separated by commas
    separate_rows(Example, sep = ",") %>%
    ## drop the leading/trailing whitespace created by the split
    mutate(
      Example = str_trim(Example, side = "both")
    ) %>%
  ## match each example beer to its IBU range
  left_join(y = ibu,
            by = c("Example" = "Example")
            )

# can continue cleaning: e.g., rename variables,
# use parse_number() on SRM, drop blank columns, etc.

# can also save the tables
# write_csv(x = srm, file = "srm_wikipedia.csv")

# then, read it in to a new R Markdown file
# and begin the cleaning process
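
Below is a sketch of what that continued cleaning might look like. The column name SRM is an assumption based on the Wikipedia table at the time of writing; check names(beers) first and adjust accordingly.

## a sketch only: assumes beers has a column named SRM
beers_clean <- beers %>%
  rename(srm = SRM) %>%              ## rename to a lowercase variable name
  mutate(srm = parse_number(srm))    ## SRM is likely scraped as text; pull out the number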