16 Confidence Intervals via Resampling (aka the Bootstrap)
16.1 Review of Confidence Intervals
Recall that a confidence interval is a statement about the plausible range of values of an unknown population parameter (e.g., a population mean), or function of parameters (e.g., a difference in two means), that is supported by a random sample of data.
In Stat 113, regardless of which instructor taught it, we focus on parameters for which Normal distributions can be used to model the sample-to-sample variability of the corresponding sample statistic. This distribution is commonly called the sampling distribution of the sample statistic.
For example, for a sufficiently large random sample of size \(n\), the distribution of sample proportions \(\hat{p}\) can be modeled using
\(\hat{p} \sim Normal\left(p, \sqrt{\frac{p(1-p)}{n}}\right)\)
Then, for a given sample of size \(n\), we can construct a \(C\%\) confidence interval using
\(\hat{p} \pm z^* \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
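As a quick refresher, this calculation is easy to carry out directly in R. (The counts below are made up purely for illustration.)

```r
# hypothetical sample: 120 "successes" out of n = 300
n <- 300
p_hat <- 120 / n
# z* for a 95% confidence interval
z_star <- qnorm(0.975)
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * z_star * se
round(ci, 3)  # approximately 0.345 and 0.455
```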
16.2 Complications
What if we are interested in a more complex parameter, or function of parameter(s), whose corresponding sample statistics do not follow a Normal distribution?
Well, that depends on a few things:
- If you have a small sample size, you need to determine the “exact” distribution of the statistic. (I recommend taking STAT 326 - Mathematical Statistics to learn some strategies for doing so.) If the exact distribution can’t be determined, you are pretty much out of luck.
- If you have a very large sample size, you can approximate the exact sampling distribution with a Normal distribution, using a technique called the Delta Method. This technique requires a bit of calculus.
- For moderately sized samples, you can use a method called the Bootstrap to construct approximate confidence intervals.
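As an aside, the Delta Method mentioned above boils down to a linear approximation: if a statistic \(\hat{\theta}\) is approximately \(Normal(\theta, SE)\) and \(g\) is a smooth function, then, approximately,

\(g(\hat{\theta}) \sim Normal\left(g(\theta), \left|g^\prime(\theta)\right| \times SE\right)\)

This is only a sketch of the idea; the details (and the calculus) are covered in courses like STAT 326.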
16.3 Overview of the Bootstrap
For confidence intervals, the bootstrap is a relatively simple but powerful technique used to approximate the sampling distribution. FYI, the name is derived from the saying “pull oneself up by one’s bootstraps,” which essentially means to succeed only by one’s own efforts or abilities. In this case, it refers to the idea that the sample data itself can be used to derive the sampling distribution of the statistic.
16.3.1 How it works
The sample is resampled (with replacement) multiple times and, for each resample, the statistic of interest is computed. These bootstrapped statistics approximate the sampling distribution and can then be used to determine the confidence interval.
16.3.1.1 The Simple Percentile Bootstrap Interval
One of the easiest bootstrap intervals to construct is known as the percentile method. Given a set of bootstrapped statistics, we can construct a \(C\%\) confidence interval by finding the bounding values that correspond to the middle \(C\%\) of this distribution.
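To make “the middle \(C\%\)” concrete: if we already had a collection of bootstrapped statistics, the `quantile()` function finds the bounding values for us. (The statistics below are simulated stand-ins, not real bootstrap output.)

```r
set.seed(113)
# pretend these are B = 1000 bootstrapped statistics
fake_boot_stats <- rnorm(1000, mean = 25, sd = 2)
# middle 95% = the 2.5th and 97.5th percentiles
quantile(fake_boot_stats, probs = c(0.025, 0.975))
```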
16.3.1.2 Steps to Constructing a Percentile Bootstrap Interval
Here we will investigate the procedure for two styles of problems.
16.3.1.2.1 C% Bootstrap Interval for a Single Population
Given a sample of size n from a population,

1. Draw a resample of size n with replacement from the sample. Compute a statistic that estimates the parameter of interest.
2. Repeat this resampling process B times (where B is as large as reasonably possible given computing resources).
3. Construct the bootstrap distribution of the statistic.
4. Find the two values that contain the middle C% of the bootstrapped statistics.
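The steps above can be sketched compactly in base R, using a small made-up sample `x` and the `replicate()` shortcut to repeat the resample-and-compute step B times:

```r
set.seed(113)
x <- c(12, 0, 25, 40, 15, 60, 22, 18, 35, 10)  # hypothetical sample, n = 10
B <- 1000
# resample with replacement and compute the mean, repeated B times
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
# find the two values bounding the middle 95% of the bootstrap distribution
quantile(boot_means, probs = c(0.025, 0.975))
```

Later in this chapter we will build the same kind of loop “by hand,” which makes each step easier to see.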
16.3.1.2.2 C% Bootstrap Interval for Comparing Two Populations
Given independent samples of sizes m and n from two populations,

1. Draw a resample of size m with replacement from the first sample and a separate resample of size n with replacement from the second sample. Compute the statistic that represents the comparison of interest for the two groups.
2. Repeat this process B times.
3. Construct the bootstrap distribution of the statistic.
4. Find the two values that contain the middle C% of the bootstrapped statistics.
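Similarly, the two-sample version can be sketched with hypothetical vectors `g1` and `g2` standing in for the two samples:

```r
set.seed(113)
g1 <- c(45, 60, 30, 80, 55, 40, 65)  # hypothetical sample from population 1 (m = 7)
g2 <- c(15, 20, 0, 25, 18, 30)       # hypothetical sample from population 2 (n = 6)
B <- 1000
# resample each group separately and compute the difference in means, B times
boot_diffs <- replicate(
  B,
  mean(sample(g1, replace = TRUE)) - mean(sample(g2, replace = TRUE))
)
# the middle 95% of the bootstrap distribution
quantile(boot_diffs, probs = c(0.025, 0.975))
```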
16.3.2 Implementing a Bootstrap in R
While there are a number of packages that help automate the construction of bootstrap distributions, we will focus on manually constructing them using a programming concept known as iteration. To do so, we’ll first need to learn how to do basic iteration programming in R.
16.3.3 Side Quest: Iteration - the loop
In computer programming, a loop is a sequence of commands that is repeated until a certain condition is reached. Loops are a core component of most (if not all) programming languages. Further, many of the functions we work with in this course are simply “wrappers” around an underlying batch of code that loops through our data. We won’t go into details about loops in this class, but will instead focus on how they can be used to build bootstrap intervals.

Additional note: we won’t be making any real attempt to make our code “efficient.” If, in practice, you have a data set large enough that computation time is a potential issue, then I suggest learning more advanced computational techniques, e.g., parallel computing. Disclaimer: please don’t use the RStudio server for parallel tasks. We have a high-performance computing cluster available for larger projects.
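(For context only: R also has a `while` loop that keeps repeating until its condition fails, which matches the “repeated until a certain condition is reached” description above. We won’t need it for the bootstrap.)

```r
# count how many times 100 can be halved before dropping below 1
value <- 100
n_halvings <- 0
while (value >= 1) {
  value <- value / 2
  n_halvings <- n_halvings + 1
}
n_halvings  # 7
```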
16.3.3.0.1 The for loop
The for loop is used when you have a fixed number of iterations:
for (index_counter in iteration_vector) {
  Statements
}

# print the first ten "squares"
for (i in 1:10) {
  print(c(i, i^2))
}

For example, if we wish to loop through a data set to find the maximum value of a specific variable, we could do the following. (Of course, we already know the max() function, so this is a silly example. We’ll just use it to illustrate how a for-loop is constructed in R…and to introduce a small amount of other programming concepts.)
library(tidyverse)
stat113 <- read_csv("data/stat113.csv")
stat113_clean <- drop_na(stat113, Haircut)

maxHaircut <- -Inf # need to "initialize" the value we plan to track
for (r in 1:nrow(stat113_clean)) {
  # extract the current row's value
  currentHaircut <- stat113_clean$Haircut[r]
  # check if it is larger than the current "max value"
  # and replace it if it is
  if (currentHaircut > maxHaircut) {
    maxHaircut <- currentHaircut
  }
  # this will continue until all rows of the data are checked, one at a time
}
maxHaircut

16.3.4 Implementing a Bootstrap in R - part 2
Example 1: Suppose we want to construct a 95% confidence interval for the average price of Stat 113 students’ “back-to-school” haircuts.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut)

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_mean_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  resampled_data <- slice_sample(stat113_clean, prop = 1, replace = TRUE)
  # 2. Calculate statistic
  one_mean_boot <- mean(resampled_data$Haircut)
  # 3. Store results
  all_mean_boots[i] <- one_mean_boot
}
# Back to 1. Resample Data, repeat B times

# 4. Calculate CI
# investigate sampling distribution
summary(all_mean_boots)
qplot(all_mean_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.95
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_mean_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(2)

Example 2: Suppose we want to calculate a 95% confidence interval for the difference in average haircut prices for male and female students.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut, Sex)

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_mean_diff_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  # Do so by male vs female
  resampled_data <- stat113_clean %>%
    group_by(Sex) %>%
    slice_sample(prop = 1, replace = TRUE)
  # 2. Calculate statistic
  group_boot <- resampled_data %>%
    group_by(Sex) %>%
    summarise(avgHaircut = mean(Haircut))
  # do F - M
  one_mean_diff_boot <- group_boot$avgHaircut[1] - group_boot$avgHaircut[2]
  # 3. Store results
  all_mean_diff_boots[i] <- one_mean_diff_boot
}
# Back to 1. Resample Data, repeat B times

# 4. Calculate CI
# investigate sampling distribution
summary(all_mean_diff_boots)
qplot(all_mean_diff_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.95
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_mean_diff_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(2)

16.3.4.1 Exercises
- Construct a 98% confidence interval for the standard deviation of haircut prices.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut)

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_sd_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  resampled_data <- slice_sample(stat113_clean, prop = 1, replace = TRUE)
  # 2. Calculate statistic
  one_sd_boot <- sd(resampled_data$Haircut)
  # 3. Store results
  all_sd_boots[i] <- one_sd_boot
}
# Back to 1. Resample Data, repeat B times

# 4. Calculate CI
# investigate sampling distribution
summary(all_sd_boots)
qplot(all_sd_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.98
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_sd_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(2)

- Construct a 90% confidence interval for the proportion of students with a haircut of at least $50.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut)

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_prop50_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  # prep resampled data
  # to get above/below $50
  resampled_data <- slice_sample(stat113_clean,
                                 prop = 1,
                                 replace = TRUE) %>%
    mutate(
      above50 = if_else(Haircut >= 50, TRUE, FALSE)
    )
  # 2. Calculate statistic
  one_prop50_boot <- mean(resampled_data$above50)
  # 3. Store results
  all_prop50_boots[i] <- one_prop50_boot
}
# Back to 1. Resample Data, repeat B times

# 4. Calculate CI
# investigate sampling distribution
summary(all_prop50_boots)
qplot(all_prop50_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.90
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_prop50_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(3)

- Construct a 95% confidence interval for the difference in median haircut prices for athletes vs non-athletes.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut, Sport) %>%
  select(Haircut, Sport) # simply to reduce size of data
# double check levels of Sport variable
count(stat113_clean, Sport) # Hooray, Yes and No values only!

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_median_diff_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  # Do so by non-athlete (No) vs athlete (Yes)
  resampled_data <- stat113_clean %>%
    group_by(Sport) %>%
    slice_sample(prop = 1,
                 replace = TRUE)
  # 2. Calculate statistic
  group_boot <- resampled_data %>%
    group_by(Sport) %>%
    summarise(medHaircut = median(Haircut))
  # do No - Yes
  one_median_diff_boot <- group_boot$medHaircut[1] - group_boot$medHaircut[2]
  # 3. Store results
  all_median_diff_boots[i] <- one_median_diff_boot
}

# 4. Calculate CI
# investigate sampling distribution
summary(all_median_diff_boots)
qplot(all_median_diff_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.95
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_median_diff_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(2)

- EXTRA: For those curious about writing a function to do this, here is a basic one that works for number 3 (the difference in medians). You DO NOT need to know this for the exam.
# function for one iteration of the bootstrap
one_bootstrap_sample <- function(cleaned_data, resp_var, grp_var) {
  # resample data
  resampled_data <- cleaned_data %>%
    slice_sample(prop = 1, replace = TRUE)
  # calculate difference in medians
  one_boot_stat <- resampled_data %>%
    group_by(!!sym(grp_var)) %>%
    summarise(medDiff = median(!!sym(resp_var))) %>%
    pull(medDiff) %>%
    diff()
  return(one_boot_stat)
}

# For those extra curious, the !!sym() is a way to pass
# a quoted string into a dplyr function (or other functions)
# as a variable in a dataset.
# The sym() creates a "symbol" for the string,
# and the !! unquotes it.

# function to do B iterations and calculate the CI
bootstrap_ci <- function(org_data, resp_var, grp_var,
                         B = 1000, conf.level = 0.95) {
  # clean up the data
  cleaned_data <- org_data %>%
    select(!!sym(resp_var), !!sym(grp_var)) %>% # see the !!sym() note above
    drop_na(!!sym(resp_var), !!sym(grp_var))
  # Step 0. Set up objects
  # a blank vector to store the bootstrap statistics
  all_boots <- rep(NA, B)
  # start loop
  for (i in 1:B) {
    # call one iteration function and assign the output to the
    # i-th spot in all_boots
    all_boots[i] <- one_bootstrap_sample(cleaned_data, resp_var, grp_var)
  }
  # calculate CI
  # determine confidence level and quantile cutoff
  alpha <- 1 - conf.level
  # use quantile function to get percentile CI
  ci <- quantile(all_boots, probs = c(alpha/2, 1 - alpha/2))
  return(list(boot_dist = all_boots, ci = ci))
}

# call function and get 95% CI
ci_result <- bootstrap_ci(stat113, "Haircut", "Sport", B = 1000, conf.level = 0.95)
qplot(ci_result$boot_dist)
ci_result$ci
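One final practical note: because the bootstrap is built on random resampling, the interval will change slightly every time the code is re-run. Calling `set.seed()` with any fixed integer before the loop (or before `bootstrap_ci()`) makes the results reproducible:

```r
# the same seed produces the same "random" resamples
set.seed(113)
a <- sample(1:10, size = 5, replace = TRUE)
set.seed(113)
b <- sample(1:10, size = 5, replace = TRUE)
identical(a, b)  # TRUE
```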