3 Reading Tabular Data in R

3.1 Introduction

Data sets are stored in tabular format and there are many possible ways to organize tabular data. Some organizational schemes are designed to be easily read on the page (or screen), while others are designed to be easily used in analysis. In this tutorial, we focus on how a dataset stored in a tabular format can be read in to R.

3.2 Common file extensions

Here is a small list of common file extensions you may come across.

Format Typical Extension Ramler Recommended Package
Comma-separated data .csv readr (read_csv)
Tab-separated data .tsv readr
Semi-colon separated data .csv readr
Pipe (“|”) separated data .psv readr
Excel .xls or .xlsx readxl …or just resave it as a csv
R syntax .R base R
Saved R objects .RData, .rda base R
Google Sheets URL googlesheets4 or resave it as a comma-separated data
JSON (a possible topic of the course) .json jsonlite
SQL Database (a possible topic of the course) varies dbplyr or one that is specific to the variety of database used (e.g., Postgres or MySQL)

Note that the “pipe separated data” is different than the R pipe (%>%) we will learn about this semester.

3.3 Moving on to the tutorial

Make sure that the following packages are loaded

library(readr) # for reading in data 

3.3.1 Manually creating data

At times it may be necessary to create small datasets manually within R. We will learn two different ways to create small datasets.

3.3.1.1 Vectors

The first type of data we will consider is when you need to make the equivalent of a single variable of values. This is most easily done by create a vector in R. Here are a few examples.

x1 <- c(5,1,8,2)
x2 <- 1:12
x3 <- 5:-6
x4 <- c("we","don't", "talk", "about","Bruno")
x5 <- c(x1, x3)
x6 <- c(x1, x4) # converts x1 into a chr and combined with x4

If we want to extract individual components out of a vector we do so using commands like the following. Before running each, try to guess what the result will be. Then check your guess by running the code.

x1[1]
x1[4]
x4[1:4]
x4[c(1,3,4,5)]
x1[-1]
x1[5]
x1[0]

3.3.1.2 Tibbles/Data Frames

We can also create a data set directly within R with the tibble() function in the tibble package. This is most useful when we want to make a small reproducible example so that someone else may help with our code. (Note that the tibble is automatically loaded when we load the readr package/ We will rarely need to load it by itself except in situations where it is the only thing that we are using.)

A reproducible example, or reprex, is a chunk of code that we can give to someone else that runs without any outside data. These are used often on StackExchange. The following code chunk is not a reprex because people would not necessarily have the data set parsedf.csv.

## Hello! How do I get rid of the units from the values in
## my variable `x`? Thanks!
library(readr)
test_df <- read_csv("data/parsedf.csv")
head(test_df)

We want to post on StackExchange for someone to help us convert a variable from a character vector with units to a numeric vector without units. We want to be able to give any possible helpers a small example data set to work with. For this, we can create our own tiny data set with tibble():

## Hello! How do I get rid of the units from the values in
## my variable `xvar`? Thanks!
library(tibble)
test_df2 <- tibble(xvar = c("20,000 dollars", "40 dollars"),
                   yvar = c(1, 2))
test_df2

3.3.2 Extracting values from a tibble/dataframe

# extract a variable by name
test_df2$xvar

# extract a variable by location
test_df2[ ,1] # blank in rows spot means take all rows

# extract an observation by location
test_df2[2,]

3.3.3 Importing Files

As we saw in previous lessons, tabular data stored in a CSV files can be read in to R using the read_csv function from the readr package. If the data is in a clean format already, this is a fairly straightforward process. We will start to investigate how to read in and work with tabular data that is not a simple CSV file.

First, load in the example data sets

UBSprices <- read_csv("data/UBSprices.csv")
airlines <- read_csv("data/airline-safety.csv")

Let’s take a closer look at the airlines data. Look in your environment panel and expand the airlines object. Notice that it gives you a small amount of information about the variables? Also notice that some of the variables’ names have spaces in them. This is generally a bad practice, but is commonplace enough that the R allows a special way to access these variables. Note, you can also check the names of the variables using the names function in R.

To access variables with spaces in their names we need to use a backtick mark, `, around them. This key is found in the top left corner of the keyboard. Tick marks work a lot like quotes insofar as they open and close and wrap all elements into a single object. The difference is tick marks assume the marked object references a symbol, not a string of characters.

Lets try this out. First use the ticks to extract just the fatalities 1985_1999 variable.

airlines$`fatalities 1985_1999`

Note that in some cases, R allows you to be sloppy and you can use quotes instead of ticks. However, it is good practice to try to always use the tick marks when dealing with variable names that contain spaces.

Side note: This can cause some annoyances when sharing code with people that aren’t familiar with the tick structure. One way around this is use the name.repair option with make.names to “fix” unfriendly names in data.frames. (Using name_repair = "universal" is also a useful way to fix these.)

airlines <- read_csv("data/airline-safety.csv", 
                     name_repair = make.names)

(Note that the tick mark structure is still useful in other places. We will likely see examples where we will make use of it.)

3.3.4 Importing tabular files from the web

Suppose you know of a tabular (e.g., csv) file located on the web. Instead of downloading it and then reading it in, you can often just provide the URL as the file. For example, try to import the slc_towns_active_cases.csv from https://github.com/iramler/stlawu_covid

url = "https://raw.githubusercontent.com/iramler/stlawu_covid/main/slc_towns_active_cases.csv"

active_cases <- read_csv(url)

3.3.5 Exporting Files

Occasionally you may want to save a data set. (e.g., after you have cleaned a messy and/or subsetted a larger dataset) One simple way is to use the write_csv function.

Try this now by saving a copy of test_df2 to your data folder.

# write_csv(x = test_df2,
#          file = "data/test_df2.csv"
#          )

Next, lets try a series of exercises to learn about some of the different options available to us in the readr (and associated) packages.

3.3.6 Exercises

  1. What are some advantages and disadvantages of using the Import Dataset tool in R Studio?

Pros - generates code and previews some common options

Cons - unless you copy/paste the code into your R Markdown file, you do not have a fully reproducible document (but be sure to delete the View piece)

beyonce_lyrics <- read_csv("data/beyonce_lyrics.csv")

head(beyonce_lyrics) # can use this instead of View
  1. Read in the “polls” data using read_csv and inspect it. Fix the issue by using the appropriate option in read_csv
url <- "https://raw.githubusercontent.com/iramler/stat234/main/notes/data/rcp-polls-messy.csv"
polls <- read_csv(url, 
                  na = c("","NA","--"),
                  col_types = 
                    list(Sample = col_number()
                         ),
                  name_repair = make.names
                  )
  1. Each of the files, data1.csv, data2.csv, and data3.xlsx, have some problems with them when using read_csv. For each file,
  1. import them with read_csv,
  2. inspect them to see if you can tell what is wrong, and
  3. without directly editing the file, try to read them in properly.
# doesn't work b/c we use the 
# wrong delimiter
data1 <- read_csv("data/data1.csv")

# Fix 1
data1 <- read_csv2("data/data1.csv")

# Fix 2 (more general style)
data1 <- read_delim("data/data1.csv",
                    delim = ";"
                    )

# Try this!
# Not a good way of renaming variables
# data1 <- read_delim("data/data1.csv",
#                    delim = ";",
#                    col_names = c("a",'b','c')
#                    )

# one way to rename variables
names(data1) <- c('a','b','c')
# data2.csv
# First attempt - wrong variable names (header)
data2 <- read_csv("data/data2.csv")

# Fix 1
data2 <- read_csv("data/data2.csv", 
                  col_names = FALSE)

# Fix 2
data2 <- read_csv("data/data2.csv",
                  col_names = c("V1","V2","V3")
                  )
# data3.xlsx
data3 <- read_csv("data/data3.xlsx")

data3 <- readxl::read_excel("data/data3.xlsx",
                            skip = 2,
                            col_names = FALSE
                            )

# last (and hopefully correct) attempt
data3 <- readxl::read_excel("data/data3.xlsx",
                            skip = 1,
                            n_max = 16 #b/c 16 rows of actual data
                            )