16 Introduction to Working with Dates in R
16.1 Introduction
In many data files, the date or time of day will be an important variable. In this introductory tutorial, we will learn some basics of how to handle dates.
A Reminder: Why do <date> objects even matter? Compare the following two plots: one made where the date is in <chr> form and the other where the date is in its appropriate <date> form.
library(tidyverse)
library(lubridate)
animal_crossing <- read_csv("data/animal_crossing_holidays.csv")
animal_crossing

animal_crossing %>%
  ggplot(data = ., aes(x = Date1, y = Holiday)) +
  geom_point()

animal_crossing %>%
  mutate(Date_test_plot = dmy(Date1)) %>%
  ggplot(data = ., aes(x = Date_test_plot, y = Holiday)) +
  geom_point()
In which plot does the ordering on the x-axis make more sense?
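To see what is going on behind the scenes, we can check the class of each version of the date variable. Here is a minimal sketch, assuming Date1 was read in from the CSV as text:

class(animal_crossing$Date1)       # "character" -- plotted as categories, ordered alphabetically
class(dmy(animal_crossing$Date1))  # "Date" -- plotted on a true time scale, in chronological order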
16.2 Dates with lubridate
Goals:

- use lubridate functions to convert a character variable to a <date> variable.
- use lubridate functions to extract useful information from a <date> variable, including the year, month, day of the week, and day of the year.
16.2.1 Converting Variables to <date>
The lubridate package is built to work easily with Date objects and DateTime objects.
To begin, here are two basic functions: today(), which prints today's date, and now(), which prints today's date and time.
today()
now()
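Note that the two functions return different kinds of objects: today() gives a Date, while now() gives a date-time (POSIXct). A quick sketch to confirm:

class(today())  # "Date"
class(now())    # "POSIXct" "POSIXt"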
There are a number of built-in functions to convert character strings to Dates and Times.
16.2.1.1 Parsing Dates and Times
- ymd(): Parses dates in the format of “year-month-day” and returns a date object.
- dmy(): Parses dates in the format of “day-month-year” and returns a date object.
- mdy(): Parses dates in the format of “month-day-year” and returns a date object.
- hm(): Parses times in the format of “hour-minute” and returns a period (time span) object.
- hms(): Parses times in the format of “hour-minute-second” and returns a period (time span) object.
Here is a quick example showing what they do.
# dates in different formats
<- "2023-04-19"
d1 <- "19-04-2023"
d2 <- "04-19-2023"
d3
ymd(d1)
dmy(d2)
mdy(d3)
mdy(d2) # fails to parse b/c no month 19
# Parse time in different formats
<- hm("10:15")
time_hm <- hms("10:15:30") time_hms
As seen before, these also work on variables within data frames (tibbles).
animal_crossing %>%
  mutate(Date1_v2 = dmy(Date1)) %>%
  relocate(Date1_v2)
16.2.1.2 year(), month(), and mday()
The functions year(), month(), and mday() can grab the year, month, and day of the month, respectively, from a <date> variable. Like the forcats functions, these will almost always be paired with a mutate() statement because they create a new variable.
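As a quick standalone sketch (using an arbitrary example date), the extraction functions work like this:

d <- ymd("2023-04-19")
year(d)   # 2023
month(d)  # 4
mday(d)   # 19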
Notice that the animal crossing data contain a number of variables related to these components. Here is how they were created.
# starting fresh
animal_crossing2 <- animal_crossing %>% select(Holiday, Date1)

# recreate initial variables
animal_crossing2 %>%
  mutate(
    Date = dmy(Date1),
    Month = month(Date),
    Year = year(Date),
    Day = mday(Date),
    Month2 = month(Date, label = TRUE, abbr = FALSE),
    # a few extras
    Day_in_year = yday(Date),
    Day_of_week = wday(Date, label = TRUE, abbr = TRUE),
    week_of_year = week(Date)
  )
16.3 Using parse_date from the readr package
Another common way to work with dates is to use the parse_date() function from readr (or the parse_date_time() function from lubridate). These usually require us to supply a format string describing the date (or date-time) structure.
For example, if you have a date-time string such as “2023-04-18 09:19:59”, you can use the following format string to parse it with the parse_date() function:
<- "2023-04-18 09:19:59"
date_string <- parse_date(date_string,
date format = "%Y-%m-%d %H:%M:%S"
) date
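For comparison, here is a minimal sketch of lubridate's parse_date_time(), which takes a more forgiving orders specification instead of a full format string (the example string is made up):

parse_date_time("18/04/2023 09:19", orders = "dmy HM")  # returns a date-time (POSIXct) object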
Below is a table of common formats.
| Format | Description |
|--------|-------------|
| %d | Day of the month as a number (01-31). |
| %m | Month as a number (01-12). |
| %Y | Year with century (as a four-digit number). |
| %y | Year without century (00-99). |
| %H | Hour (24-hour clock) as a decimal number (00-23). |
| %I | Hour (12-hour clock) as a decimal number (01-12). |
| %M | Minute as a decimal number (00-59). |
| %S | Second as a decimal number (00-59). |
| %z | Time zone offset from UTC (e.g., “-0800”). |
| %Z | Time zone name. |
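For instance, a date stored with a two-digit year could be parsed as in this small sketch (the string is made up):

parse_date("18/04/23", format = "%d/%m/%y")  # returns the Date 2023-04-18; readr treats two-digit years 00-69 as 20xx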
16.4 Another Example
We have data on flights originating from New York airports in November 2022.
<- "https://raw.githubusercontent.com/iramler/stat234/main/notes/data/ny_airports_nov2022.csv"
url <- read_csv(url) ny_airports
%>% mutate(FL_DATE = str_remove(FL_DATE, " 12:00:00 AM")) -> ny_airports ny_airports
First, let's reduce these data to just the four airports in Albany, Buffalo, Rochester, and Syracuse.
airports_to_use <- paste(c('Albany', 'Buffalo', 'Rochester', 'Syracuse'), "NY", sep = ", ")

airports_to_use

upstate_airports <-
  ny_airports %>%
  filter(ORIGIN_CITY_NAME %in% airports_to_use)
Now convert the FL_DATE variable from a <chr> to a <date>.

upstate_airports <-
  upstate_airports %>%
  mutate(Flight_Date = mdy(FL_DATE))
- Calculate the average delay time for each airport.
- Calculate the proportion of flights delayed for each airport.
upstate_airports %>%
  group_by(ORIGIN_CITY_NAME) %>%
  summarise(
    avgDelay = mean(DEP_DELAY, na.rm = TRUE),
    propDelay = mean(DEP_DELAY > 0, na.rm = TRUE)
  )
- Which day of the week has the most flights?
upstate_airports %>%
  mutate(day.of.week = wday(Flight_Date,
    label = TRUE,
    abbr = TRUE
  )) %>%
  group_by(day.of.week) %>%
  summarise(
    n_flights = n()
  ) %>%
  slice_max(n_flights, n = 1)
Try out this plot.
library(ggTimeSeries)
upstate_airports %>%
  group_by(Flight_Date, ORIGIN_CITY_NAME) %>%
  summarise(avg_daily_delay = mean(DEP_DELAY, na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot_calendar_heatmap(
    "Flight_Date",
    "avg_daily_delay"
  ) +
  facet_wrap(~ORIGIN_CITY_NAME) +
  theme(legend.position = "top") +
  scale_fill_continuous(low = 'green', high = 'red') +
  labs(y = "Day of Week", x = "Month", fill = "Average Delay (min)") +
  coord_flip()
16.5 Another Fun Example
https://trends.google.com/trends/explore?geo=US&hl=en
library(gtrendsR)
<- c("pumpkin spice","cold brew")
search_terms
<- gtrends(
mysearch keyword = search_terms, onlyInterest = TRUE,
#time = "all", # this will let us get the data from 2004 - present
time = "2013-04-18 2023-04-18"
)
<- mysearch[[1]]
coffee_df
head(coffee_df)
tail(coffee_df)
- Make a plot of your Popularity variables through time.
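Here is a minimal sketch of one way to do this, assuming the interest-over-time data frame returned by gtrends() has date, hits, and keyword columns (hits may come back as character when values like "<1" appear):

coffee_df %>%
  mutate(hits = as.numeric(ifelse(hits == "<1", 0, hits))) %>%  # treat "<1" as 0 before converting
  ggplot(aes(x = date, y = hits, colour = keyword)) +
  geom_line() +
  labs(x = "Date", y = "Search popularity", colour = "Search term")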