3 Reading Tabular Data in R
3.1 Introduction
Data sets are stored in tabular format and there are many possible ways to organize tabular data. Some organizational schemes are designed to be easily read on the page (or screen), while others are designed to be easily used in analysis. In this tutorial, we focus on how a dataset stored in a tabular format can be read in to R.
3.2 Common file extensions
Here is a small list of common file extensions you may come across.
Format | Typical Extension | Ramler Recommended Package |
---|---|---|
Comma-separated data | .csv | readr (read_csv) |
Tab-separated data | .tsv | readr |
Semi-colon separated data | .csv | readr |
Pipe (“|”) separated data | .psv | readr |
Excel | .xls or .xlsx | readxl …or just resave it as a csv |
R syntax | .R | base R |
Saved R objects | .RData, .rda | base R |
Google Sheets | URL | googlesheets4 or resave it as a comma-separated data |
JSON (a possible topic of the course) | .json | jsonlite |
SQL Database (a possible topic of the course) | varies | dbplyr or one that is specific to the variety of database used (e.g., Postgres or MySQL) |
Note that the “pipe separated data” is different than the R pipe (%>%) we will learn about this semester.
3.3 Moving on to the tutorial
Make sure that the following packages are loaded
library(readr) # for reading in data
3.3.1 Manually creating data
At times it may be necessary to create small datasets manually within R. We will learn two different ways to create small datasets.
3.3.1.1 Vectors
The first type of data we will consider is when you need to make the equivalent of a single variable of values. This is most easily done by create a vector in R. Here are a few examples.
<- c(5,1,8,2)
x1 <- 1:12
x2 <- 5:-6
x3 <- c("we","don't", "talk", "about","Bruno")
x4 <- c(x1, x3)
x5 <- c(x1, x4) # converts x1 into a chr and combined with x4 x6
If we want to extract individual components out of a vector we do so using commands like the following. Before running each, try to guess what the result will be. Then check your guess by running the code.
1]
x1[4]
x1[1:4]
x4[c(1,3,4,5)]
x4[-1]
x1[5]
x1[0] x1[
3.3.1.2 Tibbles/Data Frames
We can also create a data set directly within R
with the tibble()
function in the tibble
package. This is most useful when we want to make a small reproducible example so that someone else may help with our code. (Note that the tibble
is automatically loaded when we load the readr
package/ We will rarely need to load it by itself except in situations where it is the only thing that we are using.)
A reproducible example, or reprex, is a chunk of code that we can give to someone else that runs without any outside data. These are used often on StackExchange. The following code chunk is not a reprex because people would not necessarily have the data set
parsedf.csv
.
## Hello! How do I get rid of the units from the values in
## my variable `x`? Thanks!
library(readr)
<- read_csv("data/parsedf.csv")
test_df head(test_df)
We want to post on StackExchange for someone to help us convert a variable from a character vector with units to a numeric vector without units. We want to be able to give any possible helpers a small example data set to work with. For this, we can create our own tiny data set with tibble()
:
## Hello! How do I get rid of the units from the values in
## my variable `xvar`? Thanks!
library(tibble)
<- tibble(xvar = c("20,000 dollars", "40 dollars"),
test_df2 yvar = c(1, 2))
test_df2
3.3.2 Extracting values from a tibble/dataframe
# extract a variable by name
$xvar
test_df2
# extract a variable by location
1] # blank in rows spot means take all rows
test_df2[ ,
# extract an observation by location
2,] test_df2[
3.3.3 Importing Files
As we saw in previous lessons, tabular data stored in a CSV files can be read in to R using the read_csv
function from the readr
package. If the data is in a clean format already, this is a fairly straightforward process. We will start to investigate how to read in and work with tabular data that is not a simple CSV file.
First, load in the example data sets
<- read_csv("data/UBSprices.csv")
UBSprices <- read_csv("data/airline-safety.csv") airlines
Let’s take a closer look at the airlines data. Look in your environment panel and expand the airlines object. Notice that it gives you a small amount of information about the variables? Also notice that some of the variables’ names have spaces in them. This is generally a bad practice, but is commonplace enough that the R allows a special way to access these variables. Note, you can also check the names of the variables using the names
function in R.
To access variables with spaces in their names we need to use a backtick mark, `, around them. This key is found in the top left corner of the keyboard. Tick marks work a lot like quotes insofar as they open and close and wrap all elements into a single object. The difference is tick marks assume the marked object references a symbol, not a string of characters.
Lets try this out. First use the ticks to extract just the fatalities 1985_1999
variable.
$`fatalities 1985_1999` airlines
Note that in some cases, R allows you to be sloppy and you can use quotes instead of ticks. However, it is good practice to try to always use the tick marks when dealing with variable names that contain spaces.
Side note: This can cause some annoyances when sharing code with people that aren’t familiar with the tick structure. One way around this is use the name.repair
option with make.names
to “fix” unfriendly names in data.frames. (Using name_repair = "universal"
is also a useful way to fix these.)
<- read_csv("data/airline-safety.csv",
airlines name_repair = make.names)
(Note that the tick mark structure is still useful in other places. We will likely see examples where we will make use of it.)
3.3.4 Importing tabular files from the web
Suppose you know of a tabular (e.g., csv) file located on the web. Instead of downloading it and then reading it in, you can often just provide the URL as the file. For example, try to import the slc_towns_active_cases.csv
from https://github.com/iramler/stlawu_covid
= "https://raw.githubusercontent.com/iramler/stlawu_covid/main/slc_towns_active_cases.csv"
url
<- read_csv(url) active_cases
3.3.5 Exporting Files
Occasionally you may want to save a data set. (e.g., after you have cleaned a messy and/or subsetted a larger dataset) One simple way is to use the write_csv
function.
Try this now by saving a copy of test_df2
to your data
folder.
# write_csv(x = test_df2,
# file = "data/test_df2.csv"
# )
Next, lets try a series of exercises to learn about some of the different options available to us in the readr
(and associated) packages.
3.3.6 Exercises
- What are some advantages and disadvantages of using the
Import Dataset
tool in R Studio?
Pros - generates code and previews some common options
Cons - unless you copy/paste the code into your R Markdown file, you do not have a fully reproducible document (but be sure to delete the View piece)
<- read_csv("data/beyonce_lyrics.csv")
beyonce_lyrics
head(beyonce_lyrics) # can use this instead of View
- Read in the “polls” data using read_csv and inspect it. Fix the issue by using the appropriate option in
read_csv
<- "https://raw.githubusercontent.com/iramler/stat234/main/notes/data/rcp-polls-messy.csv"
url <- read_csv(url,
polls na = c("","NA","--"),
col_types =
list(Sample = col_number()
),name_repair = make.names
)
- Each of the files,
data1.csv
,data2.csv
, anddata3.xlsx
, have some problems with them when using read_csv. For each file,
- import them with
read_csv
, - inspect them to see if you can tell what is wrong, and
- without directly editing the file, try to read them in properly.
# doesn't work b/c we use the
# wrong delimiter
<- read_csv("data/data1.csv")
data1
# Fix 1
<- read_csv2("data/data1.csv")
data1
# Fix 2 (more general style)
<- read_delim("data/data1.csv",
data1 delim = ";"
)
# Try this!
# Not a good way of renaming variables
# data1 <- read_delim("data/data1.csv",
# delim = ";",
# col_names = c("a",'b','c')
# )
# one way to rename variables
names(data1) <- c('a','b','c')
# data2.csv
# First attempt - wrong variable names (header)
<- read_csv("data/data2.csv")
data2
# Fix 1
<- read_csv("data/data2.csv",
data2 col_names = FALSE)
# Fix 2
<- read_csv("data/data2.csv",
data2 col_names = c("V1","V2","V3")
)
# data3.xlsx
<- read_csv("data/data3.xlsx")
data3
<- readxl::read_excel("data/data3.xlsx",
data3 skip = 2,
col_names = FALSE
)
# last (and hopefully correct) attempt
<- readxl::read_excel("data/data3.xlsx",
data3 skip = 1,
n_max = 16 #b/c 16 rows of actual data
)