16 Confidence Intervals via Resampling (aka the Bootstrap)
16.1 Review of Confidence Intervals
Recall that a confidence interval is a statement about the plausible range of values of an unknown population parameter (e.g., a population mean), or function of parameters (e.g., a difference in two means), that is supported by a random sample of data.
In Stat 113, regardless of which instructor taught it, we focus on parameters for which Normal distributions can be used to model the sample-to-sample variability of the corresponding sample statistic. This distribution is commonly called the sampling distribution of the sample statistic.
For example, for a sufficiently large random sample of size \(n\), the distribution of sample proportions \(\hat{p}\) can be modeled using
\(\hat{p} \sim Normal\left(p, \sqrt{\frac{p(1-p)}{n}}\right)\)
Then, for a given sample of size \(n\), we can construct a \(C\%\) confidence interval using
\(\hat{p} \pm z^* \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
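As a quick refresher, this calculation is easy to carry out directly in R. (The counts below are made up purely for illustration.)

```r
# hypothetical sample: 120 "successes" out of n = 300
n <- 300
p_hat <- 120 / n
# z* for a 95% confidence interval
z_star <- qnorm(0.975)
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * z_star * se
round(ci, 3)  # approximately 0.345 and 0.455
```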
16.2 Complications
What if we are interested in a more complex parameter, or function of parameter(s), whose corresponding sample statistics do not follow a Normal distribution?
Well, that depends on a few things:
- If you have a small sample size, you need to determine the “exact” distribution of the statistic. (I recommend taking STAT 326 - Mathematical Statistics to learn some strategies for doing so.) If the exact distribution can’t be determined, you are pretty much out of luck.
- If you have a very large sample size, you can approximate the exact sampling distribution with a Normal distribution, using a technique called the Delta Method. This technique requires a bit of calculus.
- For moderately sized samples, you can use a method called the Bootstrap to construct approximate confidence intervals.
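As an aside, the Delta Method mentioned above boils down to a linear approximation: if a statistic \(\hat{\theta}\) is approximately \(Normal(\theta, SE)\) and \(g\) is a smooth function, then, approximately,

\(g(\hat{\theta}) \sim Normal\left(g(\theta), \left|g^\prime(\theta)\right| \times SE\right)\)

This is only a sketch of the idea; the details (and the calculus) are covered in courses like STAT 326.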
16.3 Overview of the Bootstrap
For confidence intervals, the bootstrap is a relatively simple but powerful technique used to approximate the sampling distribution. FYI, the name is derived from the saying “pull oneself up by one’s bootstraps,” which essentially means to succeed only by one’s own efforts or abilities. In this case, it refers to the idea that the sample data itself can be used to derive the sampling distribution of the statistic.
16.3.1 How it works
The sample is resampled (with replacement) multiple times and, for each resample, the statistic of interest is computed. These bootstrapped statistics approximate the sampling distribution and can then be used to determine the confidence interval.
16.3.1.1 The Simple Percentile Bootstrap Interval
One of the easiest bootstrap intervals to construct is known as the percentile method. Given a set of bootstrapped statistics, we can construct a \(C\%\) confidence interval by finding the bounding values that correspond to the middle \(C\%\) of this distribution.
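To make “the middle \(C\%\)” concrete: if we already had a collection of bootstrapped statistics, the `quantile()` function finds the bounding values for us. (The statistics below are simulated stand-ins, not real bootstrap output.)

```r
set.seed(113)
# pretend these are B = 1000 bootstrapped statistics
fake_boot_stats <- rnorm(1000, mean = 25, sd = 2)
# middle 95% = the 2.5th and 97.5th percentiles
quantile(fake_boot_stats, probs = c(0.025, 0.975))
```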
16.3.1.2 Steps to Constructing a Percentile Bootstrap Interval
Here we will investigate the procedure for two styles of problems.
16.3.1.2.1 C% Bootstrap Interval for a Single Population
Given a sample of size n from a population,

1. Draw a resample of size n with replacement from the sample. Compute a statistic that estimates the parameter of interest.
2. Repeat this resampling process B times (where B is as large as reasonably possible given computing resources).
3. Construct the bootstrap distribution of the statistic.
4. Find the two values that contain the middle C% of the bootstrapped statistics.
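The steps above can be sketched compactly in base R, using a small made-up sample `x` and the `replicate()` shortcut to repeat the resample-and-compute step B times:

```r
set.seed(113)
x <- c(12, 0, 25, 40, 15, 60, 22, 18, 35, 10)  # hypothetical sample, n = 10
B <- 1000
# resample with replacement and compute the mean, repeated B times
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
# find the two values bounding the middle 95% of the bootstrap distribution
quantile(boot_means, probs = c(0.025, 0.975))
```

Later in this chapter we will build the same kind of loop “by hand,” which makes each step easier to see.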
16.3.1.2.2 C% Bootstrap Interval for Comparing Two Populations
Given independent samples of sizes m and n from two populations,

1. Draw a resample of size m with replacement from the first sample and a separate resample of size n with replacement from the second sample. Compute the statistic that represents the comparison of interest for the two groups.
2. Repeat this process B times.
3. Construct the bootstrap distribution of the statistic.
4. Find the two values that contain the middle C% of the bootstrapped statistics.
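Similarly, the two-sample version can be sketched with hypothetical vectors `g1` and `g2` standing in for the two samples:

```r
set.seed(113)
g1 <- c(45, 60, 30, 80, 55, 40, 65)  # hypothetical sample from population 1 (m = 7)
g2 <- c(15, 20, 0, 25, 18, 30)       # hypothetical sample from population 2 (n = 6)
B <- 1000
# resample each group separately and compute the difference in means, B times
boot_diffs <- replicate(
  B,
  mean(sample(g1, replace = TRUE)) - mean(sample(g2, replace = TRUE))
)
# the middle 95% of the bootstrap distribution
quantile(boot_diffs, probs = c(0.025, 0.975))
```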
16.3.2 Implementing a Bootstrap in R
While there are a number of packages that help automate the construction of bootstrap distributions, we will focus on manually constructing them using a programming concept known as iteration. To do so, we’ll first need to learn how to do basic iteration programming in R.
16.3.3 Side Quest: Iteration - the loop
In computer programming, a loop is a sequence of commands that is repeated until a certain condition is reached. Loops are a core component of most (if not all) programming languages. Further, many of the functions we work with in this course are simply “wrappers” around an underlying batch of code that loops through our data. We won’t go into details about loops in this class, but will instead focus on how they can be used to build bootstrap intervals.

Additional note: we won’t be making any real attempt to make our code “efficient.” If, in practice, you have a data set large enough that computation time is a potential issue, then I suggest learning more advanced computational techniques, e.g., parallel computing. Disclaimer: please don’t use the RStudio server for parallel tasks. We have a high-performance computing cluster available for larger projects.
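(For context only: R also has a `while` loop that keeps repeating until its condition fails, which matches the “repeated until a certain condition is reached” description above. We won’t need it for the bootstrap.)

```r
# count how many times 100 can be halved before dropping below 1
value <- 100
n_halvings <- 0
while (value >= 1) {
  value <- value / 2
  n_halvings <- n_halvings + 1
}
n_halvings  # 7
```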
16.3.3.0.1 The for loop
The for loop is used when you have a fixed number of iterations:
for (index_counter in iteration_vector) {
  Statements
}

# print the first ten "squares"
for (i in 1:10) {
  print(c(i, i^2))
}

For example, if we wish to loop through a data set to find the maximum value of a specific variable, we could do the following. (Of course, we already know the max() function, so this is a silly example. We’ll just use it to illustrate how a for-loop is constructed in R…and to introduce a small amount of other programming concepts.)
library(tidyverse)
stat113 <- read_csv("data/stat113.csv")
stat113_clean <- drop_na(stat113, Haircut)

maxHaircut <- -Inf # need to "initialize" the value we plan to track
for (r in 1:nrow(stat113_clean)) {
  # extract the current row's value
  currentHaircut <- stat113_clean$Haircut[r]
  # check if it is larger than the current "max value"
  # and replace it if it is
  if (currentHaircut > maxHaircut) {
    maxHaircut <- currentHaircut
  }
  # this will continue until all rows of the data are checked, one at a time
}
maxHaircut

16.3.4 Implementing a Bootstrap in R - part 2
Example 1: Suppose we want to construct a 95% confidence interval for the average price of Stat 113 students’ “back-to-school” haircuts.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut)

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_mean_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  resampled_data <- slice_sample(stat113_clean, prop = 1, replace = TRUE)
  # 2. Calculate statistic
  one_mean_boot <- mean(resampled_data$Haircut)
  # 3. Store results
  all_mean_boots[i] <- one_mean_boot
}
# Back to 1. Resample Data, repeat B times

# 4. Calculate CI
# investigate sampling distribution
summary(all_mean_boots)
qplot(all_mean_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.95
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_mean_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(2)

Example 2: Suppose we want to calculate a 95% confidence interval for the difference in average haircut prices for male and female students.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut, Sex)

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_mean_diff_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  # Do so by male vs female
  resampled_data <- stat113_clean %>%
    group_by(Sex) %>%
    slice_sample(prop = 1, replace = TRUE)
  # 2. Calculate statistic
  group_boot <- resampled_data %>%
    group_by(Sex) %>%
    summarise(avgHaircut = mean(Haircut))
  # do F - M
  one_mean_diff_boot <- group_boot$avgHaircut[1] - group_boot$avgHaircut[2]
  # 3. Store results
  all_mean_diff_boots[i] <- one_mean_diff_boot
}
# Back to 1. Resample Data, repeat B times

# 4. Calculate CI
# investigate sampling distribution
summary(all_mean_diff_boots)
qplot(all_mean_diff_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.95
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_mean_diff_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(2)

16.3.4.1 Exercises
- Construct a 98% confidence interval for the standard deviation of haircut prices.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut)

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_sd_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  resampled_data <- slice_sample(stat113_clean, prop = 1, replace = TRUE)
  # 2. Calculate statistic
  one_sd_boot <- sd(resampled_data$Haircut)
  # 3. Store results
  all_sd_boots[i] <- one_sd_boot
}
# Back to 1. Resample Data, repeat B times

# 4. Calculate CI
# investigate sampling distribution
summary(all_sd_boots)
qplot(all_sd_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.98
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_sd_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(2)

- Construct a 90% confidence interval for the proportion of students with a haircut of at least $50.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut)

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_prop50_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  # prep resampled data
  # to get above/below $50
  resampled_data <- slice_sample(stat113_clean,
                                 prop = 1,
                                 replace = TRUE) %>%
    mutate(
      above50 = if_else(Haircut >= 50, TRUE, FALSE)
    )
  # 2. Calculate statistic
  one_prop50_boot <- mean(resampled_data$above50)
  # 3. Store results
  all_prop50_boots[i] <- one_prop50_boot
}
# Back to 1. Resample Data, repeat B times

# 4. Calculate CI
# investigate sampling distribution
summary(all_prop50_boots)
qplot(all_prop50_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.90
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_prop50_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(3)

- Construct a 95% confidence interval for the difference in median haircut prices for athletes vs non-athletes.
# prep data by cleaning it up
stat113_clean <- drop_na(stat113, Haircut, Sport) %>%
  select(Haircut, Sport) # simply to reduce size of data
# double check levels of Sport variable
count(stat113_clean, Sport) # Hooray, Yes and No values only!

# Step 0. Set up objects
# number of bootstrap samples
B <- 1000
# a blank vector to store the bootstrap statistics
all_median_diff_boots <- rep(NA, B)

# set up loop
for (i in 1:B) {
  # 1. Resample Data
  # Do so by non-athlete (No) vs athlete (Yes)
  resampled_data <- stat113_clean %>%
    group_by(Sport) %>%
    slice_sample(prop = 1,
                 replace = TRUE)
  # 2. Calculate statistic
  group_boot <- resampled_data %>%
    group_by(Sport) %>%
    summarise(medHaircut = median(Haircut))
  # do No - Yes
  one_median_diff_boot <- group_boot$medHaircut[1] - group_boot$medHaircut[2]
  # 3. Store results
  all_median_diff_boots[i] <- one_median_diff_boot
}

# 4. Calculate CI
# investigate sampling distribution
summary(all_median_diff_boots)
qplot(all_median_diff_boots)

# determine confidence level and quantile cutoff
conf.level <- 0.95
alpha <- 1 - conf.level
# use quantile function to get percentile CI, rounding it
# to a sufficient number of places
ci <- quantile(all_median_diff_boots, probs = c(alpha/2, 1 - alpha/2))
ci %>% round(2)

- EXTRA: For those curious about writing a function to do this, here is a basic one that works for number 3 (the difference in medians). You DO NOT need to know this for the exam.
# function for one iteration of the bootstrap
one_bootstrap_sample <- function(cleaned_data, resp_var, grp_var) {
  # resample data
  resampled_data <- cleaned_data %>%
    slice_sample(prop = 1, replace = TRUE)
  # calculate difference in medians
  one_boot_stat <- resampled_data %>%
    group_by(!!sym(grp_var)) %>%
    summarise(medDiff = median(!!sym(resp_var))) %>%
    pull(medDiff) %>%
    diff()
  return(one_boot_stat)
}

# For those extra curious, the !!sym() is a way to pass
# a quoted string into a dplyr function (or other functions)
# as a variable in a dataset.
# The sym() creates a "symbol" for the string,
# and the !! unquotes it.

# function to do B iterations and calculate the CI
bootstrap_ci <- function(org_data, resp_var, grp_var,
                         B = 1000, conf.level = 0.95) {
  # clean up the data
  cleaned_data <- org_data %>%
    select(!!sym(resp_var), !!sym(grp_var)) %>% # see the !!sym() note above
    drop_na(!!sym(resp_var), !!sym(grp_var))
  # Step 0. Set up objects
  # a blank vector to store the bootstrap statistics
  all_boots <- rep(NA, B)
  # start loop
  for (i in 1:B) {
    # call one iteration function and assign the output to the
    # i-th spot in all_boots
    all_boots[i] <- one_bootstrap_sample(cleaned_data, resp_var, grp_var)
  }
  # calculate CI
  # determine confidence level and quantile cutoff
  alpha <- 1 - conf.level
  # use quantile function to get percentile CI
  ci <- quantile(all_boots, probs = c(alpha/2, 1 - alpha/2))
  return(list(boot_dist = all_boots, ci = ci))
}

# call function and get 95% CI
ci_result <- bootstrap_ci(stat113, "Haircut", "Sport", B = 1000, conf.level = 0.95)
qplot(ci_result$boot_dist)
ci_result$ci
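One final practical note: because the bootstrap is built on random resampling, the interval will change slightly every time the code is re-run. Calling `set.seed()` with any fixed integer before the loop (or before `bootstrap_ci()`) makes the results reproducible:

```r
# the same seed produces the same "random" resamples
set.seed(113)
a <- sample(1:10, size = 5, replace = TRUE)
set.seed(113)
b <- sample(1:10, size = 5, replace = TRUE)
identical(a, b)  # TRUE
```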