13 Intro to Text Manipulation in R via the stringr package

Many data sets have character strings in them. For example, in a file of tweets from Twitter (which are basically just strings of characters), perhaps you want to search for occurrences of a certain word or Twitter handle. Or a character variable in a data set might be a location with a city and state abbreviation, and you want to extract those observations whose location contains “NY.”

In this tutorial, you will learn how to manipulate text data using the package stringr and how to match patterns using regular expressions. Some of the commands include:

Command          Description
str_sub          Extract substring from a given start to end position
str_detect       Detect presence/absence of a pattern in a string
str_locate       Give position (start, end) of first occurrence of substring
str_locate_all   Give positions of all occurrences of a substring
str_replace      Replace first occurrence of one substring with another

13.1 1. Extracting and locating substrings

We introduce some basic commands from stringr.

The str_sub command extracts substrings from a string (that is, a sequence of characters) given the starting and ending positions. For instance, str_sub(fruits, 2, 4) extracts the characters in the second through fourth positions of each string in the vector fruits defined below:

library(stringr)
fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")
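The second-through-fourth extraction mentioned above is not run in the text. A minimal sketch of it, assuming the same fruits vector, could look like this (the name second_to_fourth is only illustrative):

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# characters 2 through 4 of each string
# (positions are 1-based and both ends are included)
second_to_fourth <- str_sub(string = fruits,
                            start = 2,
                            end = 4)
second_to_fourth
## [1] "ppl" "ine" "ear" "ran" "eac" "ana"
```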

Question 1 What are the characters in the first through third position of each string in fruits?

str_sub(string = fruits, 
        start = 1,
        end = 3)
## [1] "app" "pin" "Pea" "ora" "pea" "ban"

The str_detect command checks to see if any instance of a pattern occurs in a string.

fruits
## [1] "apple"     "pineapple" "Pear"      "orange"    "peach"     "banana"
#any occurrence of 'p'?
str_detect(string = fruits,
           pattern = "p")
## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

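Note that pattern matching is case-sensitive, which is why “Pear” is not matched above. To ignore case, one option is to convert the strings to lower case before matching: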

fruits %>%
  str_to_lower() %>%
  str_detect(pattern = "p")
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
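Another way to ignore case, sketched here on the same fruits vector, is to leave the strings alone and wrap the pattern in stringr's regex() helper with ignore_case = TRUE. The variable name p_any_case is only illustrative.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# regex() wraps the pattern so that matching options can be set;
# ignore_case = TRUE lets "p" match both "p" and "P"
p_any_case <- str_detect(string = fruits,
                         pattern = regex("p", ignore_case = TRUE))
p_any_case
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
```

This keeps the original capitalization intact, which matters if the strings are extracted or displayed later.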

To locate the position of a pattern within a string, use str_locate:

str_locate(string = fruits, pattern = "an")
##      start end
## [1,]    NA  NA
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]     3   4
## [5,]    NA  NA
## [6,]     2   3

Only the fourth and sixth fruits contain “an.” In the case of “banana,” note that only the first occurrence of “an” is returned.

To find all instances of “an” within each string:

str_locate_all(string = fruits, pattern = "an")
## [[1]]
##      start end
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## 
## [[4]]
##      start end
## [1,]     3   4
## 
## [[5]]
##      start end
## 
## [[6]]
##      start end
## [1,]     2   3
## [2,]     4   5

Remark

The command str_locate_all returns a list.

out <- str_locate_all(fruits, "an")
data.class(out)
## [1] "list"
out[[6]] # this is the more useful way to work with a list
##      start end
## [1,]     2   3
## [2,]     4   5
unlist(out)
## [1] 3 4 2 4 3 5
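Because each element of the list is a matrix with one row per match, the list can also be summarized element by element. The sketch below counts the matches in each string with base R's vapply and compares the result with str_count, which is assumed here to be a convenient shortcut for the same question. The name n_matches is illustrative.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")
out <- str_locate_all(fruits, "an")

# one row per match, so the number of rows is the number of matches
n_matches <- vapply(out, nrow, integer(1))
n_matches
## [1] 0 0 0 1 0 2

# str_count answers the same question directly
str_count(fruits, "an")
## [1] 0 0 0 1 0 2
```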

13.2 2. Regular expressions

Now suppose we want to detect or locate words that begin with “p” or end in “e,” or that match more complex criteria. A regular expression is a sequence of characters that defines a pattern.

Let’s detect strings that begin with either “p” or “P”. The metacharacter “^” is used to indicate the beginning of the string, and “[Pp]” is used to indicate “P” or “p”.

# find fruits that start with p (or P)
fruits
## [1] "apple"     "pineapple" "Pear"      "orange"    "peach"     "banana"
str_detect(string = fruits, 
           pattern = "^[pP]")
## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE

Similarly, the metacharacter “$” is used to signify the end of a string.

fruits
## [1] "apple"     "pineapple" "Pear"      "orange"    "peach"     "banana"
# end in 'e'
str_detect(fruits, pattern = "e$")
## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE
# end in a vowel (excluding y)
str_detect(fruits, pattern = "[aeiou]$")
## [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE

The following characters are metacharacters: they have special meanings and so are reserved.

* \ + $ { } [ ] ( ) ^ ? . |

For instance, a period matches any single character:

gr.y matches gray, grey, gr9y, grEy, etc.

and * indicates 0 or more instances of the preceding character:

xy*z matches xz, xyz, xyyz, xyyyz, xyyyyz, etc.
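These two claims can be checked with str_detect. The test strings below are made up for illustration, and the second pattern adds the anchors ^ and $ (an addition to the pattern in the text) so that the whole string has to fit.

```r
library(stringr)

# "." stands for exactly one character, so "gry" does not match
dot_match <- str_detect(c("gray", "grey", "gr9y", "gry"),
                        pattern = "gr.y")
dot_match
## [1]  TRUE  TRUE  TRUE FALSE

# x, then 0 or more y's, then z; "xaz" fails because of the "a"
star_match <- str_detect(c("xz", "xyz", "xyyz", "xaz"),
                         pattern = "^xy*z$")
star_match
## [1]  TRUE  TRUE  TRUE FALSE
```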

To detect the letter “a” followed by 0 or more occurrences of “p”, type:

fruits
## [1] "apple"     "pineapple" "Pear"      "orange"    "peach"     "banana"
str_detect(string = fruits,
           pattern = "ap*" # a then 0 or more p's
           )
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

Compare this to

fruits
## [1] "apple"     "pineapple" "Pear"      "orange"    "peach"     "banana"
str_detect(string = fruits,
           pattern = "ap+")
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
The “+” after the “p” indicates that we want one or more occurrences of “p.”

Here is a more complex pattern:

fruits
## [1] "apple"     "pineapple" "Pear"      "orange"    "peach"     "banana"
str_detect(string = fruits,
           pattern = "^a(.*)e$")
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

The anchors ^ and $ indicate that we want strings that begin with the letter a and end with e. The (.*) matches 0 or more occurrences of any character. The parentheses are optional here; they group part of the pattern for readability.

Inside brackets, a leading ^ negates the set, so [^a] matches any single character other than a:

fruits
## [1] "apple"     "pineapple" "Pear"      "orange"    "peach"     "banana"
# starts with any character except 'a',
# followed by 0 or more characters of any kind,
# and ends in 'e'
str_detect(string = fruits,
           pattern = "^[^a](.*)e$")
## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE

13.3 3. Example

Suppose we want to extract 10-digit United States phone numbers from a text file.

a1 <- "Home: 507-645-5489"
a2 <- "Cell: 219.917.9871"
a3 <- "My work phone is 507-202-2332"
a4 <- "I don't have a phone"
info <- c(a1, a2, a3, a4)
info
## [1] "Home: 507-645-5489"            "Cell: 219.917.9871"           
## [3] "My work phone is 507-202-2332" "I don't have a phone"

We will now extract just the phone numbers from these strings.

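The area code must start with a 2 or higher, so we use brackets again to indicate a range: [2-9]. The next two digits can each be anything from 0 to 9, so we write [0-9]{2}, where {2} means exactly two repetitions. For the separator, we use [-.] to indicate either a dash or a period. The remaining groups of three and four digits are written the same way, as [0-9]{3} and [0-9]{4}. The complete regular expression is given below: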

phone <- "([2-9][0-9]{2})[-.]([0-9]{3})[-.]([0-9]{4})"
out <- str_detect(info, phone)
out
## [1]  TRUE  TRUE  TRUE FALSE

Again, str_detect just indicates the presence or absence of the pattern in question. To pull out the matched text itself, use str_extract, which returns NA where there is no match:

str_extract(info, phone)
## [1] "507-645-5489" "219.917.9871" "507-202-2332" NA
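The parentheses in the phone pattern also mark three groups, and stringr's str_match returns the text matched by each group alongside the full match. The sketch below uses that to pull out just the area codes. The names parts and area_code are illustrative.

```r
library(stringr)

info <- c("Home: 507-645-5489",
          "Cell: 219.917.9871",
          "My work phone is 507-202-2332",
          "I don't have a phone")
phone <- "([2-9][0-9]{2})[-.]([0-9]{3})[-.]([0-9]{4})"

# str_match returns a character matrix:
# column 1 is the full match, columns 2-4 are the three groups
parts <- str_match(info, phone)
area_code <- parts[, 2]
area_code
## [1] "507" "219" "507" NA
```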

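Let’s anonymize the phone numbers! (str_replace changes only the first match in each string; use str_replace_all if a string may contain several numbers.)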

str_replace(info, phone, "XXX-XXX-XXXX")
## [1] "Home: XXX-XXX-XXXX"            "Cell: XXX-XXX-XXXX"           
## [3] "My work phone is XXX-XXX-XXXX" "I don't have a phone"
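A partial anonymization is also possible. In the replacement string, \\1, \\2 and \\3 refer back to the text matched by the first, second and third parenthesized groups, so the sketch below hides the area code and exchange but keeps the last four digits. The name masked is illustrative.

```r
library(stringr)

info <- c("Home: 507-645-5489",
          "Cell: 219.917.9871",
          "My work phone is 507-202-2332",
          "I don't have a phone")
phone <- "([2-9][0-9]{2})[-.]([0-9]{3})[-.]([0-9]{4})"

# \\3 is replaced by whatever the third group ([0-9]{4}) matched
masked <- str_replace(info, phone, "XXX-XXX-\\3")
masked
## [1] "Home: XXX-XXX-5489"            "Cell: XXX-XXX-9871"
## [3] "My work phone is XXX-XXX-2332" "I don't have a phone"
```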

Remarks

  1. As we noted above, certain characters are reserved. To match one of them literally in a regular expression, either put it within brackets or precede it with a double backslash.
str_locate(info, "[.]")  #find first instance of period
##      start end
## [1,]    NA  NA
## [2,]    10  10
## [3,]    NA  NA
## [4,]    NA  NA
str_locate(info, "\\.")  #same
##      start end
## [1,]    NA  NA
## [2,]    10  10
## [3,]    NA  NA
## [4,]    NA  NA
str_locate(info, ".")    #first instance of any character
##      start end
## [1,]     1   1
## [2,]     1   1
## [3,]     1   1
## [4,]     1   1
  2. Metacharacters have different meanings within brackets.
str_detect(fruits, "^[Pp]")  #starts with 'P' or 'p'
## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE
str_detect(fruits, "[^Pp]")  #any character except 'P' or 'p'
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(fruits, "^[^Pp]") #start with any character except 'P' or 'p'
## [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE
  3. See the handout regexp.pdf for a summary of regular expressions.
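When the pattern is plain text rather than a regular expression, a third option for the first remark is stringr's fixed() helper, which treats every character in the pattern literally. The following is a minimal sketch using the info vector from the example. The name period_pos is illustrative.

```r
library(stringr)

info <- c("Home: 507-645-5489",
          "Cell: 219.917.9871",
          "My work phone is 507-202-2332",
          "I don't have a phone")

# fixed(".") matches a literal period, so no escaping is needed
period_pos <- str_locate(info, fixed("."))

# only the second string contains a period, at position 10
period_pos[2, ]
## start   end 
##    10    10 
```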

13.4 4. Matching brackets or HTML tags

In many cases, you may want to match brackets such as [8] or HTML tags such as <table>.

out <- c("abc[8]", "abc[9][20]", "abc[9]def[10][7]", "abc[]")
out
## [1] "abc[8]"           "abc[9][20]"       "abc[9]def[10][7]" "abc[]"

In order to better understand what regular expressions are matching here, we will replace pieces of the above strings with the character “X”.

To match the left bracket, we write \\[. Next we want to match 0 or more occurrences of any character except the right bracket, so we use [^]]*. Finally, we match the right bracket with \\].

str_replace_all(out, "\\[([^]]*)\\]", "X")
## [1] "abcX"      "abcXX"     "abcXdefXX" "abcX"

Compare this to

str_replace_all(out, "\\[(.*)\\]", "X")
## [1] "abcX" "abcX" "abcX" "abcX"

In this case, we match the first left bracket (indicated by the \\[), followed by 0 or more instances of any character (the (.*) portion). Because .* is greedy, it matches as much as it can, including right brackets, so the match runs all the way to the final right bracket \\].
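Two related sketches follow. The first uses the lazy quantifier .*? (not covered in the text above), which matches as little as possible and so stops at the first right bracket. The second applies the "anything but the closing character" idea to HTML tags. The html string is a made-up snippet for illustration, and lazy and no_tags are illustrative names.

```r
library(stringr)

out <- c("abc[8]", "abc[9][20]", "abc[9]def[10][7]", "abc[]")

# ".*?" is the lazy form of ".*": it stops at the first "]" it can
lazy <- str_replace_all(out, "\\[.*?\\]", "X")
lazy
## [1] "abcX"      "abcXX"     "abcXdefXX" "abcX"

# hypothetical HTML snippet, used only for illustration
html <- "<table><tr><td>cell</td></tr></table>"

# "<", then 0 or more characters that are not ">", then ">"
no_tags <- str_replace_all(html, "<[^>]*>", "")
no_tags
## [1] "cell"
```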

13.5 Exercises

  1. Create a vector veggies containing “carrot”, “bean”, “peas”, “cabbage”, “scallion”, “asparagus”.
library(dplyr)
veggies <- c("carrot", "bean", "peas", "cabbage", "scallion", "asparagus")
  2. Find those strings that contain the pattern “ea”.
veggies %>%
  str_detect(pattern = "ea") %>%
  bind_cols(veggies)
## # A tibble: 6 × 2
##   ...1  ...2     
##   <lgl> <chr>    
## 1 FALSE carrot   
## 2 TRUE  bean     
## 3 TRUE  peas     
## 4 FALSE cabbage  
## 5 FALSE scallion 
## 6 FALSE asparagus
  3. Find those strings that end in “s”.
str_detect(veggies, pattern = "s$")
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
  4. Find those strings that contain at least two “a”’s.
str_detect(veggies,
           pattern = "a(.*)a"
           )
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE
  5. Find those strings that begin with any letter except “c”.
str_detect(veggies, pattern = "^[^c]")
## [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE
  6. Find the starting and ending position of the pattern “ca” in each string.
str_locate(veggies, pattern = "ca")
##      start end
## [1,]     1   2
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]     1   2
## [5,]     2   3
## [6,]    NA  NA
str_locate_all(veggies, pattern = "ca")
## [[1]]
##      start end
## [1,]     1   2
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## 
## [[4]]
##      start end
## [1,]     1   2
## 
## [[5]]
##      start end
## [1,]     2   3
## 
## [[6]]
##      start end
  7. The regular expression "^[Ss](.*)(t+)(.+)(t+)" matches “scuttlebutt”, “Stetson”, and “Scattter”, but not “Scatter.” Why?
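One way to approach the last exercise is to run the pattern on the four words and then read it piece by piece. The sketch below does the first part. The names words, pattern and matches are illustrative.

```r
library(stringr)

words <- c("scuttlebutt", "Stetson", "Scattter", "Scatter")
pattern <- "^[Ss](.*)(t+)(.+)(t+)"

# (t+) needs at least one t, (.+) needs at least one character of
# any kind, and the final (t+) needs at least one more t after that
matches <- str_detect(words, pattern)
matches
## [1]  TRUE  TRUE  TRUE FALSE
```

In “Scatter” the only t's are the adjacent pair “tt”. Once the first is used by (t+) and the second by (.+), no t remains for the final (t+). In “Scattter” the third t fills that role.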

13.5.1 Additional Exercises

The file oscars.tsv is a tab-delimited file containing information on Oscar-nominated films from 2006 to 2014. We will use this file to practice text/string manipulation in R via the stringr package.

library(tidyverse)
  1. Read in the data using the readr package.
#oscars <- read_tsv("data/oscars.tsv")
oscars <- read_delim("data/oscars.tsv", 
                     delim = "\t")

# In the Viewer, notice that numerous rows are missing most columns.
# Remove rows that aren't linked to a movie
# (i.e., rows with a blank FilmName).

oscars <-
  oscars %>%
  filter( !is.na(FilmName) )

# could also use drop_na
# oscars <- 
#  oscars %>%
#    drop_na(FilmName)
  2. What proportion of movies were dramas?
oscars %>%
  # create a new variable to
  # determine if a movie is a drama
  mutate(
    Drama = str_detect(GenreName, "[Dd]rama")
  ) %>% 
  summarise(
    prop_drama = mean(Drama, na.rm = TRUE)
  )
## # A tibble: 1 × 1
##   prop_drama
##        <dbl>
## 1      0.866
  3. How many movies have the word “the” at least once in their name?
oscars %>%
  mutate(
    the_in_title = FilmName %>% 
                    str_to_lower() %>%
                    str_detect("\\bthe\\b")
  ) %>%
  summarise(sum(the_in_title))
## # A tibble: 1 × 1
##   `sum(the_in_title)`
##                 <int>
## 1                  17
  4. How many characters is the longest movie title?
oscars %>%
  mutate(
    n_characters = str_length(FilmName)
  ) %>%
  slice_max(n_characters, n = 1) %>%
  select(FilmName, n_characters)
## # A tibble: 2 × 2
##   FilmName                            n_characters
##   <chr>                                      <int>
## 1 The Curious Case of Benjamin Button           35
## 2 Extremely Loud and Incredibly Close           35
  5. Replace USA with United States in the appropriate movies’ CountryName.
oscars %>%
  mutate(CountryName2 = str_replace_all(CountryName,
                                 pattern = "USA",
                                 replacement = "United States")
         ) %>%
  select(CountryName, CountryName2)
## # A tibble: 67 × 2
##    CountryName            CountryName2                    
##    <chr>                  <chr>                           
##  1 USA, Germany           United States, Germany          
##  2 USA, Canada            United States, Canada           
##  3 USA, Canada            United States, Canada           
##  4 USA, UK, France, Japan United States, UK, France, Japan
##  5 USA, Canada, France    United States, Canada, France   
##  6 USA, Hong Kong         United States, Hong Kong        
##  7 USA, France, Mexico    United States, France, Mexico   
##  8 USA                    United States                   
##  9 USA                    United States                   
## 10 USA, UK, France, Italy United States, UK, France, Italy
## # ℹ 57 more rows
  6. Create a new variable indicating whether or not a movie was a Romance. (Notice that there are similar pre-existing variables for Drama and Biography.)
oscars %>%
  # create a 1/0 indicator variable for
  # whether a movie is a romance
  mutate(
    Genre_Romance = 
      if_else(
        str_detect(GenreName, "[Rr]omance"),
        1, 0
      )
  )
## # A tibble: 67 × 52
##    FilmName    OscarYear Duration Rating DirectorName DirectorGender OscarWinner
##    <chr>           <dbl>    <dbl>  <dbl> <chr>                 <dbl>       <dbl>
##  1 Crash            2006      113      4 Haggis                    0           1
##  2 Brokeback …      2006      134      4 Lee                       0           0
##  3 Capote           2006      114      4 Miller                    0           0
##  4 Good Night…      2006       93      2 Clooney                   0           0
##  5 Munich           2006      164      4 Spielberg                 0           0
##  6 The Depart…      2007      151      4 Scorsese                  0           1
##  7 Babel            2007      143      4 Inarritu                  0           0
##  8 Letters fr…      2007      141      4 Eastwood                  0           0
##  9 Little Mis…      2007      110      4 Dayton AND …              1           0
## 10 The Queen        2007      103      3 Frears                    0           0
## # ℹ 57 more rows
## # ℹ 45 more variables: GenreName <chr>, Genre_Drama <dbl>, Genre_Bio <dbl>,
## #   CountryName <chr>, ForeignandUSA <dbl>, ProductionName <chr>,
## #   ProductionCompany <dbl>, BudgetRevised <chr>, Budget <chr>,
## #   DomesticBoxOffice <dbl>, WorldwideRevised <dbl>, WorldwideBoxOffice <dbl>,
## #   DomesticPercent <dbl>, LimitedOpeningWnd <dbl>, LimitedTheaters <dbl>,
## #   LimitedAveragePThtr <dbl>, WideOpeningWkd <dbl>, WideTheaters <dbl>, …
  7. Make a table counting how often each Genre appears.
oscars %>%
  separate_rows(GenreName, sep = ", ") %>%
  group_by(GenreName) %>%
  summarize(n = n())
## # A tibble: 18 × 2
##    GenreName           n
##    <chr>           <int>
##  1 Action              3
##  2 Adventure          10
##  3 Animation           2
##  4 Biography          16
##  5 Comedy             11
##  6 Crime               6
##  7 Drama              58
##  8 Family              1
##  9 Fantasy             6
## 10 History             9
## 11 Musical             1
## 12 Mystery             4
## 13 Romance            13
## 14 Science Fiction     4
## 15 Sport               3
## 16 Thriller            9
## 17 War                 3
## 18 Western             2
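To see what the splitting step in the last exercise does without the data file, the sketch below applies the same idea with str_split and base R's table. The genres vector is made up for illustration and is not taken from oscars.tsv. The name genre_counts is also illustrative.

```r
library(stringr)

# hypothetical genre strings, NOT taken from oscars.tsv
genres <- c("Drama, Romance", "Drama", "Comedy, Drama")

# str_split returns one character vector per film;
# unlist pools them into a single vector before counting
genre_counts <- table(unlist(str_split(genres, pattern = ", ")))
genre_counts
```

Here “Drama” is counted three times and the other two genres once each, which mirrors how separate_rows puts each genre on its own row before the counts are taken.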