12 Indicators for Multi-level Categorical Variables

library(palmerpenguins) # for the data
library(ggplot2) # for the plot

To showcase how to work with categorical variables that have 3 or more levels, we will investigate the relationship between Flipper Length and Body Mass for three species of penguins: Adelie, Chinstrap, and Gentoo.

  1. Start by loading the data
data(penguins)
  2. Make a scatterplot of Flipper Length (y) vs Body Mass (x), colored by species, with a least-squares line for each species.
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
  3. Create an indicator variable for each species.
penguins$Ind_Adelie <- ifelse(penguins$species == "Adelie", 1, 0)
penguins$Ind_Chinstrap <- ifelse(penguins$species == "Chinstrap", 1, 0)
penguins$Ind_Gentoo <- ifelse(penguins$species == "Gentoo", 1, 0)
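A quick sanity check on the indicators can also hint at why only two of them are needed later on. The sketch below assumes the `palmerpenguins` package is installed; `indicator_sum` is a name introduced here for illustration only.

```r
# Assumption: palmerpenguins is installed; take a fresh copy of the data
penguins <- palmerpenguins::penguins

# Recreate the three indicators as in the step above
penguins$Ind_Adelie    <- ifelse(penguins$species == "Adelie", 1, 0)
penguins$Ind_Chinstrap <- ifelse(penguins$species == "Chinstrap", 1, 0)
penguins$Ind_Gentoo    <- ifelse(penguins$species == "Gentoo", 1, 0)

# Cross-tabulate species against one indicator:
# only the Adelie row should have counts in the "1" column
table(penguins$species, penguins$Ind_Adelie)

# Each penguin belongs to exactly one species, so the three indicators
# always add up to 1 (hypothetical helper name: indicator_sum)
indicator_sum <- penguins$Ind_Adelie + penguins$Ind_Chinstrap + penguins$Ind_Gentoo
all(indicator_sum == 1)
```

Because the three columns always sum to 1, the third indicator carries no information beyond the other two plus the intercept.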
  4. Propose a model that allows for different intercepts for each of the three species, but a common slope. Explain why we only need 2 of the 3 indicator variables.

  5. Propose a model that allows for completely different lines (i.e., different intercepts and different slopes).

  6. Fit the models from #4 and #5 in R.

mod4 <- lm(flipper_length_mm ~ body_mass_g +  Ind_Adelie + Ind_Chinstrap,
           data = penguins)
summary(mod4)
mod5 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap +
             body_mass_g:Ind_Adelie +
             body_mass_g:Ind_Chinstrap,
           data = penguins)
summary(mod5)
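One way to write down the two models fit above is sketched here. The symbols $\beta_0, \dots, \beta_5$ and the shorthand names Flipper and Mass are notation assumed for this sketch, not taken from the original; Gentoo is the baseline species because it is the one without an indicator in the `lm()` calls.

```latex
\begin{aligned}
\text{Model \#4:}\quad \text{Flipper} &= \beta_0 + \beta_1\,\text{Mass}
  + \beta_2\,\text{Ind\_Adelie} + \beta_3\,\text{Ind\_Chinstrap} + \varepsilon \\[4pt]
\text{Model \#5:}\quad \text{Flipper} &= \beta_0 + \beta_1\,\text{Mass}
  + \beta_2\,\text{Ind\_Adelie} + \beta_3\,\text{Ind\_Chinstrap} \\
  &\quad + \beta_4\,(\text{Mass}\times\text{Ind\_Adelie})
  + \beta_5\,(\text{Mass}\times\text{Ind\_Chinstrap}) + \varepsilon \\[4pt]
\text{Lines implied by Model \#5:}\quad
\text{Gentoo:}\ & \beta_0 + \beta_1\,\text{Mass} \\
\text{Adelie:}\ & (\beta_0 + \beta_2) + (\beta_1 + \beta_4)\,\text{Mass} \\
\text{Chinstrap:}\ & (\beta_0 + \beta_3) + (\beta_1 + \beta_5)\,\text{Mass}
\end{aligned}
```

Under this parameterization, Model #4 is the special case of Model #5 with $\beta_4 = \beta_5 = 0$. Adding a third indicator would make the indicator columns sum to the intercept column, so its coefficient could not be estimated separately.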
  7. How far apart are the estimated intercepts between the two species for which you used the indicator variables? Use Model #4.
summary(mod4)
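Rather than reading the numbers off the summary table by hand, the intercept gap can be pulled from the coefficient vector. This is a minimal sketch, assuming `palmerpenguins` is installed and Gentoo is the baseline as in `mod4`; the names `b`, `intercepts`, and `gap` are introduced here for illustration.

```r
# Assumption: palmerpenguins is installed; rebuild the data and Model #4
penguins <- palmerpenguins::penguins
penguins$Ind_Adelie    <- ifelse(penguins$species == "Adelie", 1, 0)
penguins$Ind_Chinstrap <- ifelse(penguins$species == "Chinstrap", 1, 0)

mod4 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap,
           data = penguins)
b <- coef(mod4)

# Estimated intercept for each species (Gentoo is the baseline)
intercepts <- c(
  Gentoo    = b[["(Intercept)"]],
  Adelie    = b[["(Intercept)"]] + b[["Ind_Adelie"]],
  Chinstrap = b[["(Intercept)"]] + b[["Ind_Chinstrap"]]
)
intercepts

# The baseline intercept cancels, so the Adelie vs Chinstrap gap is just
# the difference between the two indicator coefficients
gap <- b[["Ind_Adelie"]] - b[["Ind_Chinstrap"]]
gap
```

Since the slope is shared in Model #4, this gap is the estimated vertical distance between the Adelie and Chinstrap lines at every body mass.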
  8. Suppose we wish to determine if the model from #5 is better than the model from #4. Propose two metrics that could be used.
summary(mod4)
summary(mod5)
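Two candidate metrics, among several reasonable choices, are adjusted R-squared and AIC; both penalize the extra interaction terms in Model #5. The sketch below assumes `palmerpenguins` is installed and uses the base R functions `summary()` and `AIC()`; the names `adj_r2` and `aic` are introduced here for illustration.

```r
# Assumption: palmerpenguins is installed; rebuild the data and both models
penguins <- palmerpenguins::penguins
penguins$Ind_Adelie    <- ifelse(penguins$species == "Adelie", 1, 0)
penguins$Ind_Chinstrap <- ifelse(penguins$species == "Chinstrap", 1, 0)

mod4 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap,
           data = penguins)
mod5 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap +
             body_mass_g:Ind_Adelie + body_mass_g:Ind_Chinstrap,
           data = penguins)

# Metric 1: adjusted R-squared (higher is better)
adj_r2 <- c(mod4 = summary(mod4)$adj.r.squared,
            mod5 = summary(mod5)$adj.r.squared)
adj_r2

# Metric 2: AIC (lower is better); returns a data frame with one row per model
aic <- AIC(mod4, mod5)
aic
```

Both models drop the same rows with missing measurements, so they are fit to the same observations and the two metrics are directly comparable.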
  9. Suppose we wish to determine whether there is statistically significant evidence that model #5 is better than model #4. Write out the null and alternative hypotheses that would allow us to do this in a single test. (Note: The rest of this test, called a Nested F-Test, will be covered after break.)
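One possible way to state the hypotheses is sketched below. The notation is assumed for this sketch: $\beta_4$ and $\beta_5$ stand for the coefficients on the two interaction terms in model #5 (body mass times the Adelie indicator, and body mass times the Chinstrap indicator).

```latex
\begin{aligned}
H_0 &: \beta_4 = \beta_5 = 0
  \quad \text{(all three species share a common slope, so model \#4 is adequate)} \\
H_a &: \text{at least one of } \beta_4,\ \beta_5 \text{ is not } 0
  \quad \text{(at least one species has a different slope)}
\end{aligned}
```

Both coefficients are tested at once because model #4 is what remains of model #5 when both interaction terms are removed.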

Preview for the Nested F-Test: compare SSModel for model #5 to SSModel for model #4.

source("~/rstudioshared/IRamler/scripts/slunova.R")

slunova(mod5)
print('------------------')
slunova(mod4)
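The sourced `slunova.R` script is not shown here, so the following does not try to reproduce its output. As a base R sketch of the same comparison, `anova()` applied to two nested `lm` fits reports the change in residual sum of squares and the corresponding F-test. It assumes `palmerpenguins` is installed; `nested_test` is a name introduced for illustration.

```r
# Assumption: palmerpenguins is installed; rebuild the data and both models
penguins <- palmerpenguins::penguins
penguins$Ind_Adelie    <- ifelse(penguins$species == "Adelie", 1, 0)
penguins$Ind_Chinstrap <- ifelse(penguins$species == "Chinstrap", 1, 0)

mod4 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap,
           data = penguins)
mod5 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap +
             body_mass_g:Ind_Adelie + body_mass_g:Ind_Chinstrap,
           data = penguins)

# Reduced model first, full model second.
# Row 2 shows the drop in residual SS, its degrees of freedom (the two
# interaction terms), the F statistic, and the p-value.
nested_test <- anova(mod4, mod5)
nested_test
```

The drop in residual sum of squares from model #4 to model #5 equals the gain in SSModel, which is the quantity the preview asks you to compare.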
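# Shortcut: body_mass_g*species expands to both main effects plus their
# interaction, and lm() builds the species indicators itself (baseline: Adelie).
model_shortcut <- lm(flipper_length_mm ~ body_mass_g*species,
                     data = penguins)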
summary(model_shortcut)
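The shortcut model and the hand-built model #5 should describe the same three lines, even though their coefficient tables look different: with a factor, `lm()` uses the first level (Adelie) as the baseline, while the hand-built indicators leave Gentoo as the baseline. A minimal check, assuming `palmerpenguins` is installed (`same_fit` is a name introduced for illustration):

```r
# Assumption: palmerpenguins is installed; rebuild the data and both models
penguins <- palmerpenguins::penguins
penguins$Ind_Adelie    <- ifelse(penguins$species == "Adelie", 1, 0)
penguins$Ind_Chinstrap <- ifelse(penguins$species == "Chinstrap", 1, 0)

mod5 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap +
             body_mass_g:Ind_Adelie + body_mass_g:Ind_Chinstrap,
           data = penguins)
model_shortcut <- lm(flipper_length_mm ~ body_mass_g * species,
                     data = penguins)

# Same number of coefficients, different baseline species
length(coef(mod5))
length(coef(model_shortcut))

# The fitted values agree up to numerical tolerance
same_fit <- all.equal(unname(fitted(mod5)), unname(fitted(model_shortcut)))
same_fit
```

If the fitted values match, the choice between the two versions is a matter of which baseline makes the coefficients easiest to interpret.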