12 Indicators for Multi-level Categorical Variables
library(palmerpenguins) # for the data
library(ggplot2) # for the plot
To showcase how to work with categorical variables with 3 or more levels, we will investigate the relationship between Flipper Length and Body Mass for three species of penguins: Adelie, Chinstrap, and Gentoo.
- Start by loading the data
data(penguins)
- Make a scatterplot of Flipper Length (y) vs Body Mass (x) colored by species.
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
- Create indicator variables to indicate each species.
penguins$Ind_Adelie = ifelse(penguins$species == "Adelie", 1, 0)
penguins$Ind_Chinstrap = ifelse(penguins$species == "Chinstrap", 1, 0)
penguins$Ind_Gentoo = ifelse(penguins$species == "Gentoo", 1, 0)
- Propose a model that allows for different intercepts for each of the three species, but a common slope. Explain why we only need 2 of the 3 indicator variables.
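One way such a model could be written is sketched below. It assumes Gentoo is the reference species, which matches the two indicators used in mod4; any of the three species could serve as the reference instead. The coefficient labels are an assumption made for illustration.

```latex
\[
\text{FlipperLength} = \beta_0 + \beta_1\,\text{BodyMass}
  + \beta_2\,\text{Ind\_Adelie} + \beta_3\,\text{Ind\_Chinstrap} + \varepsilon
\]
\[
\text{Intercepts:}\quad
\text{Gentoo: } \beta_0, \qquad
\text{Adelie: } \beta_0 + \beta_2, \qquad
\text{Chinstrap: } \beta_0 + \beta_3
\]
\[
\text{Ind\_Adelie} + \text{Ind\_Chinstrap} + \text{Ind\_Gentoo} = 1
\quad \text{for every penguin}
\]
```

The last line is the reason two indicators suffice: the three indicators always sum to 1, which duplicates the intercept column, so a third indicator would add no new information and its coefficient could not be estimated separately.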
- Propose a model that allows for completely different lines (i.e., different intercepts and different slopes).
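A sketch of one such model follows, again assuming Gentoo as the reference species and using coefficient labels chosen for illustration. Each indicator enters twice: once on its own, to shift the intercept, and once multiplied by body mass, to shift the slope.

```latex
\[
\text{FlipperLength} = \beta_0 + \beta_1\,\text{BodyMass}
  + \beta_2\,\text{Ind\_Adelie} + \beta_3\,\text{Ind\_Chinstrap}
  + \beta_4\,\text{BodyMass}\cdot\text{Ind\_Adelie}
  + \beta_5\,\text{BodyMass}\cdot\text{Ind\_Chinstrap} + \varepsilon
\]
\[
\text{Slopes:}\quad
\text{Gentoo: } \beta_1, \qquad
\text{Adelie: } \beta_1 + \beta_4, \qquad
\text{Chinstrap: } \beta_1 + \beta_5
\]
```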
- Fit the models from #4 and #5 in R.
mod4 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap,
           data = penguins)
summary(mod4)
mod5 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap +
             body_mass_g*Ind_Adelie +
             body_mass_g*Ind_Chinstrap,
           data = penguins)
summary(mod5)
- How far apart are the estimated intercepts for the two species for which you used indicator variables? Use the model from #4.
summary(mod4)
- Suppose we wish to determine if the model from #5 is better than the model from #4. Propose two metrics that could be used.
summary(mod4)
summary(mod5)
- Suppose we wish to determine if there is statistically significant evidence that model #5 is better than model #4. Write out the null and alternative hypotheses that would allow us to do this in a single test. (Note: The rest of this test, called a Nested F-Test, will be covered after break.)
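As a sketch, the hypotheses might be written as follows. The labels are an assumption: here the two coefficients denote the body_mass_g by Ind_Adelie and body_mass_g by Ind_Chinstrap interaction terms in mod5, the only terms that mod5 has and mod4 lacks.

```latex
\[
H_0:\ \beta_{\text{BodyMass}\times\text{Adelie}}
    = \beta_{\text{BodyMass}\times\text{Chinstrap}} = 0
\]
\[
H_a:\ \text{at least one of }
\beta_{\text{BodyMass}\times\text{Adelie}},\
\beta_{\text{BodyMass}\times\text{Chinstrap}}
\text{ is not } 0
\]
```

Under the null hypothesis, model #5 reduces to model #4 (a common slope for all three species), which is what makes the two models nested.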
Preview for Nested F-Test: Compare SSModel5 to SSModel4
source("~/rstudioshared/IRamler/scripts/slunova.R")
slunova(mod5)
print('------------------')
slunova(mod4)
model_shortcut <- lm(flipper_length_mm ~ body_mass_g*species,
                     data = penguins)
summary(model_shortcut)
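The sketch below ties the shortcut model back to mod5 and previews the nested comparison using base R's anova() in place of the sourced slunova script, whose contents are not shown here. It refits the models so that it runs on its own. One thing to keep in mind when reading the summaries: with species entered as a factor, R uses the first level (Adelie) as the reference, whereas mod5 uses Gentoo, so the individual coefficients differ even though the fitted lines should be the same.

```r
library(palmerpenguins)  # for the data

data(penguins)

# Indicator variables for two of the three species (Gentoo is the reference).
penguins$Ind_Adelie    <- ifelse(penguins$species == "Adelie", 1, 0)
penguins$Ind_Chinstrap <- ifelse(penguins$species == "Chinstrap", 1, 0)

# Reduced model: different intercepts, common slope.
mod4 <- lm(flipper_length_mm ~ body_mass_g + Ind_Adelie + Ind_Chinstrap,
           data = penguins)

# Full model: different intercepts and different slopes.
mod5 <- lm(flipper_length_mm ~ body_mass_g * Ind_Adelie +
             body_mass_g * Ind_Chinstrap,
           data = penguins)

# Shortcut: R builds the indicators from the factor (Adelie is the reference).
model_shortcut <- lm(flipper_length_mm ~ body_mass_g * species,
                     data = penguins)

# Check whether the two parameterizations give the same fitted values.
same_fit <- isTRUE(all.equal(unname(fitted(mod5)),
                             unname(fitted(model_shortcut))))
print(same_fit)

# Nested F-test: reduced model first, full model second.
nested_test <- anova(mod4, mod5)
print(nested_test)
```

The second row of the anova() table reports the drop in residual sum of squares from adding the two interaction terms, together with the F statistic and p-value for testing both of them at once.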