12 Indicators for Multi-level Categorical Variables
library(palmerpenguins) # for the data
library(ggplot2) # for the plot
To showcase how to work with categorical variables with three or more levels, we will investigate the relationship between flipper length and body mass for three species of penguins: Adelie, Chinstrap, and Gentoo.
- Start by loading the data
data(penguins)
- Make a scatterplot of Flipper Length (y) vs Body Mass (x) colored by species.
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm, color = species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
- Create indicator variables to indicate each species.
penguins$Ind_Adelie = ifelse(penguins$species == "Adelie", 1, 0)
penguins$Ind_Chinstrap = ifelse(penguins$species == "Chinstrap", 1, 0)
penguins$Ind_Gentoo = ifelse(penguins$species == "Gentoo", 1, 0)
- Propose a model that allows for different intercepts for each of the three species, but a common slope. Explain why we only need 2 of the 3 indicator variables.
- Propose a model that allows for completely different lines (i.e., different intercepts and different slopes).
- Fit the models from #4 and #5 in R.
mod4 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap,
           data = penguins)
summary(mod4)
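One way to write the common-slope model that `mod4` fits is sketched below. The symbols $\beta_0, \dots, \beta_3$ and the variable labels are notation chosen here for illustration, not notation fixed by the handout.

```latex
\[
\text{Flipper} = \beta_0 + \beta_1\,\text{BodyMass}
  + \beta_2\,\text{Ind\_Adelie} + \beta_3\,\text{Ind\_Chinstrap} + \varepsilon
\]
% Implied intercepts (the slope is \beta_1 for every species):
%   Gentoo:    \beta_0
%   Adelie:    \beta_0 + \beta_2
%   Chinstrap: \beta_0 + \beta_3
% Redundancy of the third indicator:
\[
\text{Ind\_Gentoo} = 1 - \text{Ind\_Adelie} - \text{Ind\_Chinstrap}
\]
```

Because the Gentoo indicator is an exact linear combination of the intercept column and the other two indicators, including it would add no new information, and Gentoo serves as the baseline species.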
mod5 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap +
             body_mass_g*Ind_Adelie +
             body_mass_g*Ind_Chinstrap,
           data = penguins)
summary(mod5)
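The separate-lines model that `mod5` fits can be written, as a sketch, with two interaction terms added to the common-slope model. The symbols $\beta_0, \dots, \beta_5$ are notation assumed here for illustration.

```latex
\[
\begin{aligned}
\text{Flipper} ={}& \beta_0 + \beta_1\,\text{BodyMass}
  + \beta_2\,\text{Ind\_Adelie} + \beta_3\,\text{Ind\_Chinstrap} \\
 &+ \beta_4\,(\text{BodyMass} \times \text{Ind\_Adelie})
  + \beta_5\,(\text{BodyMass} \times \text{Ind\_Chinstrap}) + \varepsilon
\end{aligned}
\]
% Implied lines:
%   Gentoo:    intercept \beta_0,            slope \beta_1
%   Adelie:    intercept \beta_0 + \beta_2,  slope \beta_1 + \beta_4
%   Chinstrap: intercept \beta_0 + \beta_3,  slope \beta_1 + \beta_5
```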
- How far apart are the estimated intercepts for the two species for which you used indicator variables? Use model #4.
summary(mod4)
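As a sketch of how the intercept gap can be read off the fitted coefficients: in the common-slope model, each indicator coefficient is that species' intercept minus the Gentoo (baseline) intercept, so the Adelie versus Chinstrap gap is the difference of the two indicator coefficients. The names `peng`, `intercepts`, and `adelie_vs_chinstrap` are introduced here only for illustration, and the block refits the model on a local copy of the data so that it runs on its own.

```r
library(palmerpenguins)  # assumed installed, as in the setup at the top

# Local copy with the two indicators used in model #4
peng <- palmerpenguins::penguins
peng$Ind_Adelie    <- ifelse(peng$species == "Adelie", 1, 0)
peng$Ind_Chinstrap <- ifelse(peng$species == "Chinstrap", 1, 0)

mod4 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap,
           data = peng)
b <- coef(mod4)

# Intercept implied for each species (Gentoo is the baseline)
intercepts <- c(
  Gentoo    = unname(b["(Intercept)"]),
  Adelie    = unname(b["(Intercept)"] + b["Ind_Adelie"]),
  Chinstrap = unname(b["(Intercept)"] + b["Ind_Chinstrap"])
)
intercepts

# Gap between the two species that have indicators (in mm of flipper length)
adelie_vs_chinstrap <- unname(b["Ind_Adelie"] - b["Ind_Chinstrap"])
adelie_vs_chinstrap
```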
- Suppose we wish to determine if the model from #5 is better than the model from #4. Propose two metrics that could be used.
summary(mod4)
summary(mod5)
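Two candidate metrics that both appear in the `summary()` output are adjusted R-squared and the residual standard error; other choices are possible. The sketch below pulls them into one small table so the two models can be compared side by side. The names `peng` and `compare` are illustrative, and the models are refit on a local copy of the data so the block runs on its own.

```r
library(palmerpenguins)  # assumed installed, as in the setup at the top

peng <- palmerpenguins::penguins
peng$Ind_Adelie    <- ifelse(peng$species == "Adelie", 1, 0)
peng$Ind_Chinstrap <- ifelse(peng$species == "Chinstrap", 1, 0)

mod4 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap,
           data = peng)
mod5 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap +
             body_mass_g:Ind_Adelie + body_mass_g:Ind_Chinstrap,
           data = peng)

# Higher adjusted R-squared and lower residual standard error favor a model
compare <- data.frame(
  model    = c("mod4", "mod5"),
  adj_r2   = c(summary(mod4)$adj.r.squared, summary(mod5)$adj.r.squared),
  resid_se = c(sigma(mod4), sigma(mod5))
)
compare
```

Plain R-squared is less useful here because it can never decrease when terms are added to a nested model, whereas adjusted R-squared penalizes the extra coefficients.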
- Suppose we wish to determine if there is statistically significant evidence that model #5 is better than model #4. Write out the null and alternative hypotheses that would allow us to do this in a single test. (Note: The rest of this test, called a Nested F-Test, will be covered after break.)
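One possible way to state the hypotheses is sketched below, writing $\beta_4$ and $\beta_5$ for the coefficients on the two interaction terms in model #5 (body mass with the Adelie indicator and body mass with the Chinstrap indicator). This labeling is an assumption made here for illustration.

```latex
\[
H_0:\ \beta_4 = \beta_5 = 0
\qquad \text{versus} \qquad
H_a:\ \text{at least one of } \beta_4,\ \beta_5 \text{ is nonzero}
\]
```

Under the null hypothesis all three species share a slope, so model #5 reduces to model #4.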
Preview for the Nested F-Test: compare SSModel for model #5 to SSModel for model #4.
source("~/rstudioshared/IRamler/scripts/slunova.R")
slunova(mod5)
print('------------------')
slunova(mod4)
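For readers without access to the course's `slunova()` script, the comparison of the two SSModel values can be sketched in base R. This is an illustration under stated assumptions, not the course's procedure: the sums of squares are computed by hand, and base R's `anova()` is used as a cross-check. The names `peng`, `ss_model4`, `ss_model5`, `sse5`, `extra_terms`, `f_stat`, `p_value`, and `nested` are introduced here for illustration.

```r
library(palmerpenguins)  # assumed installed, as in the setup at the top

peng <- palmerpenguins::penguins
peng$Ind_Adelie    <- ifelse(peng$species == "Adelie", 1, 0)
peng$Ind_Chinstrap <- ifelse(peng$species == "Chinstrap", 1, 0)

mod4 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap,
           data = peng)
mod5 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap +
             body_mass_g:Ind_Adelie + body_mass_g:Ind_Chinstrap,
           data = peng)

# Response values actually used in the fit (rows with missing data dropped)
y <- model.response(model.frame(mod5))

# Model sums of squares for each fit, and error sum of squares for model #5
ss_model4 <- sum((fitted(mod4) - mean(y))^2)
ss_model5 <- sum((fitted(mod5) - mean(y))^2)
sse5      <- sum(resid(mod5)^2)

# Number of extra coefficients in model #5 (the two interaction terms)
extra_terms <- length(coef(mod5)) - length(coef(mod4))

# F statistic: gain in SSModel per extra term, relative to model #5's MSE
f_stat  <- ((ss_model5 - ss_model4) / extra_terms) / (sse5 / df.residual(mod5))
p_value <- pf(f_stat, extra_terms, df.residual(mod5), lower.tail = FALSE)
c(F = f_stat, p = p_value)

# Cross-check with base R's built-in nested model comparison
nested <- anova(mod4, mod5)
nested
```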
model_shortcut <- lm(flipper_length_mm ~ body_mass_g*species,
                     data = penguins)
summary(model_shortcut)
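The shortcut model and the hand-built indicator model describe the same three lines, even though their coefficient tables look different. With R's default contrasts the first factor level (Adelie) is the baseline in the shortcut model, whereas the indicator version uses Gentoo. The sketch below checks that the fitted values agree; `peng` and `same_fit` are illustrative names, and the models are refit on a local copy of the data so the block runs on its own.

```r
library(palmerpenguins)  # assumed installed, as in the setup at the top

peng <- palmerpenguins::penguins
peng$Ind_Adelie    <- ifelse(peng$species == "Adelie", 1, 0)
peng$Ind_Chinstrap <- ifelse(peng$species == "Chinstrap", 1, 0)

# Hand-built indicators: Gentoo is the baseline species
mod5 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap +
             body_mass_g:Ind_Adelie + body_mass_g:Ind_Chinstrap,
           data = peng)

# Factor shortcut: R builds the indicators, with Adelie as the baseline
model_shortcut <- lm(flipper_length_mm ~ body_mass_g * species, data = peng)

levels(peng$species)  # the first level listed is the shortcut's baseline

# Same number of coefficients and the same fitted values
length(coef(mod5)) == length(coef(model_shortcut))
same_fit <- all.equal(fitted(mod5), fitted(model_shortcut))
same_fit
```

To make the shortcut's coefficients line up with the indicator version, the baseline could be changed with `relevel(peng$species, ref = "Gentoo")` before fitting.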