5 Unusual Observations

Getting Started

As always, insert a chunk for the R packages we will need.

library(tidyverse)
library(ggiraphExtra)

The dataset mammals.csv contains information on the brain weight (in grams) and gestation period (in days) for a sample of 50 mammals. We are interested in using brain weight to predict gestation period.

mammals <- read_csv("data/mammals.CSV")
  1. Propose a simple linear model (for the entire population of mammals) that uses brain weight to explain gestation period.

  2. Fit and report the least squares regression model for predicting gestation period from brain weight. What do you notice immediately from the scatterplot of the data and residual plots?

ggplot(data = mammals, aes(y = Gestation_days, 
                           x = BrainWgt_g)) +
  geom_point() #+
#  geom_smooth(se = FALSE) +
#  geom_smooth(method = "lm", se = FALSE, color = "red")
mod1 <- lm(Gestation_days ~ BrainWgt_g, data = mammals)
summary(mod1)
plot(mod1)
  1. One observation has a considerably larger brain weight than the rest of the mammals in the dataset. Which observation is it? Report the brain weight and gestation period for this mammal.
mammals[25, ]  # row 25 is the mammal with the largest brain weight
  1. An observation that is influential has a “large effect” on the regression equation. Remove this mammal from the dataset and refit the regression model. Does this mammal seem to be influential?
# Remove data point
bigbrain <- which.max(mammals$BrainWgt_g)
nohuman <- mammals[-bigbrain,]

mammal_mod <- lm(Gestation_days ~ BrainWgt_g, data = mammals)
nohuman_mod <- lm(Gestation_days ~ BrainWgt_g, data = nohuman)

ggplot(data = mammals, aes(y = Gestation_days, 
                           x = BrainWgt_g)) +
  geom_point() +
  geom_abline( # regression equation for full dataset
    intercept = mammal_mod$coefficients[1],
    slope = mammal_mod$coefficients[2],
    color = "blue"
    )  +
  geom_abline( # regression equation for no humans data
    intercept = nohuman_mod$coefficients[1],
    slope = nohuman_mod$coefficients[2],
    color = "red",
    linetype = 2 # dashed line
    )
summary(nohuman_mod)
  1. Calculate the leverage (by hand) for this mammal. Does this point have high leverage? Hint: You will need the summary statistics (specifically the mean and standard deviation) for brain weight.
nrow(mammals) # number of rows (sample size)
mean(mammals$BrainWgt_g)
sd(mammals$BrainWgt_g)
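49 * 216.3585^2               # (n - 1) * s^2: sum of squared deviations of brain weight

(1320 - 107.2524)^2           # squared deviation of this mammal's brain weight from the mean

(1/50) + (1470757/2293739)    # leverage: 1/n + squared deviation / sum of squared deviations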
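The arithmetic above follows the simple-regression leverage formula, h_i = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2). The following is a minimal sketch of that calculation on a small made-up data frame (`toy`, with hypothetical columns `x` and `y`; it is not the mammals data), checked against `hatvalues()`.

```r
# Toy data, made up for illustration (not the mammals data)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 20),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 25.0))
fit <- lm(y ~ x, data = toy)

n    <- nrow(toy)
xbar <- mean(toy$x)
ssx  <- (n - 1) * sd(toy$x)^2    # sum of squared deviations of x

i <- which.max(toy$x)            # the point farthest out in x
h_by_hand <- 1 / n + (toy$x[i] - xbar)^2 / ssx

h_by_hand
hatvalues(fit)[i]                # should agree with the by-hand value
```

The same steps should carry over to the mammals data by swapping in `BrainWgt_g` for `x`.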
  1. Use R to find the leverage for all data points. Identify any unusual observations.
mammals$hat <- hatvalues(mammal_mod)
filter(mammals, hat > 3 * (1 + 1) / 50)  # flag leverage above 3(k + 1)/n, with k = 1 predictor and n = 50
  1. Calculate the residual (by hand) for humans.
mammal_mod

mammal_mod$residuals[25]
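A by-hand residual is the observed response minus the value predicted from the fitted coefficients. Here is a minimal sketch of that calculation, again on a made-up data frame (`toy`, hypothetical columns `x` and `y`) rather than the mammals data, compared with what `residuals()` returns.

```r
# Toy data, made up for illustration (not the mammals data)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 20),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 25.0))
fit <- lm(y ~ x, data = toy)

i <- which.max(toy$x)              # observation of interest
b <- unname(coef(fit))             # b[1] = intercept, b[2] = slope

y_hat     <- b[1] + b[2] * toy$x[i]  # predicted response
e_by_hand <- toy$y[i] - y_hat        # residual = observed - predicted

e_by_hand
residuals(fit)[i]                  # should agree with the by-hand value
```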
  1. Compute the standardized residual (by hand) for humans. Would this observation be considered unusual?
summary(mammal_mod)
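The standardized residual divides the residual by its estimated standard deviation, using the residual standard error reported by `summary()` and the point's leverage: r_i = e_i / (s * sqrt(1 - h_i)). A minimal sketch, using a made-up data frame (`toy`, hypothetical columns `x` and `y`) in place of the mammals data:

```r
# Toy data, made up for illustration (not the mammals data)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 20),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 25.0))
fit <- lm(y ~ x, data = toy)
i   <- which.max(toy$x)

e_i   <- unname(residuals(fit)[i])   # residual
h_i   <- unname(hatvalues(fit)[i])   # leverage
s_hat <- summary(fit)$sigma          # residual standard error

r_by_hand <- e_i / (s_hat * sqrt(1 - h_i))

r_by_hand
rstandard(fit)[i]                    # should agree with the by-hand value
```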
  1. Use R to find the standardized residuals for all data points. Identify any unusual observations.
mammals$rstandard <- round( rstandard(mammal_mod) , 3)

filter(mammals, abs(rstandard) > 2)
  1. Compute the studentized residual (by hand) for humans. Would this observation be considered unusual?
summary(nohuman_mod)
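The studentized residual has the same form as the standardized residual, but the residual standard error comes from the model fit without the observation in question, which is why the summary of the reduced model is needed. The residual and leverage still come from the full model. A minimal sketch on made-up data (`toy`, hypothetical columns `x` and `y`; not the mammals data):

```r
# Toy data, made up for illustration (not the mammals data)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 20),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 25.0))
fit <- lm(y ~ x, data = toy)
i   <- which.max(toy$x)

fit_del <- lm(y ~ x, data = toy[-i, ])  # refit without observation i

e_i   <- unname(residuals(fit)[i])      # residual from the full model
h_i   <- unname(hatvalues(fit)[i])      # leverage from the full model
s_del <- summary(fit_del)$sigma         # residual standard error without i

t_by_hand <- e_i / (s_del * sqrt(1 - h_i))

t_by_hand
rstudent(fit)[i]                        # should agree with the by-hand value
```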
  1. Use R to find the studentized residuals for all data points. Identify any unusual observations.
mammals$rstudent <- round( rstudent(mammal_mod), 3)

filter(mammals, abs(rstudent) > 2)
  1. Compute Cook’s D (by hand) for humans. Would this observation be considered influential?

  2. Use R to find the Cook’s D for all data points. Identify any influential observations.

mammals$cooks.d <- round( cooks.distance(mammal_mod), 3)

filter(mammals, cooks.d >= 0.5)

# dfbetas(mammal_mod) # another measure of influence - not used in Stat 213
plot(mammal_mod)
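For the by-hand Cook's D, one common form combines the standardized residual and the leverage: D_i = (r_i^2 / (k + 1)) * h_i / (1 - h_i), where k is the number of predictors. A minimal sketch on made-up data (`toy`, hypothetical columns `x` and `y`; not the mammals data), checked against `cooks.distance()`:

```r
# Toy data, made up for illustration (not the mammals data)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 20),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 25.0))
fit <- lm(y ~ x, data = toy)
i   <- which.max(toy$x)

k   <- 1                                # number of predictors
r_i <- unname(rstandard(fit)[i])        # standardized residual
h_i <- unname(hatvalues(fit)[i])        # leverage

d_by_hand <- (r_i^2 / (k + 1)) * h_i / (1 - h_i)

d_by_hand
cooks.distance(fit)[i]                  # should agree with the by-hand value
```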
  1. Transformations can sometimes diminish the effects of outliers. Investigate transformations of brain weight, gestation, or both to find a better model for these data. Report the model and identify any observations with unusually large standardized residuals. If there are observations with large standardized residuals, are those points influential?
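One possible starting point is a log transformation of both variables, followed by the same diagnostics as before. The sketch below assumes the log-log model is among the candidates worth trying (the exercise does not prescribe it) and uses a made-up data frame (`toy`, hypothetical columns `x` and `y`) rather than the mammals data.

```r
# Toy data, made up for illustration (not the mammals data)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 20),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 25.0))

# Log-log model: one candidate transformation among several
log_fit <- lm(log(y) ~ log(x), data = toy)

# Diagnostics for the transformed model
toy$rstandard_log <- rstandard(log_fit)
toy$cooks_log     <- cooks.distance(log_fit)

subset(toy, abs(rstandard_log) > 2)   # unusually large standardized residuals
subset(toy, cooks_log >= 0.5)         # potentially influential points
```

Transforming only one of the two variables, or using a square root, follows the same pattern with a different model formula.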