13 Polynomial Regression 1

library(tidyverse)
source("~/rstudioshared/IRamler/scripts/slunova.R") # for ANOVA test

13.1 Example 1: Mercury in Bluegills

We are interested in modeling the relationship between the age of bluegills (in years) and mercury concentration (in parts per million). The data are available in bluegill.CSV.

bluegill <- read_csv("data/bluegill.CSV")
ggplot(bluegill, aes(x=Age, y=Mercury)) + 
  geom_point() + 
  geom_smooth(method="lm", se=FALSE, col="red") + 
  labs(x="Age (in years)", y = "Mercury Concentration (in ppm)")
  1. Fit a linear model that uses age to predict the amount of mercury in the fish. Be sure to examine plots of the residuals to check the model assumptions. Do you notice any problems?
mod1 = lm(  Mercury ~ Age, data=bluegill)
plot(mod1)
  1. Plot the data again, but this time allow for a second degree polynomial (quadratic) fit.
ggplot(bluegill, aes(x=Age, y=Mercury)) + 
  geom_point() + 
  geom_smooth(method="lm", se=FALSE, col="red", 
              formula = y~poly(x, 2)
              )
  1. One solution would be to use a quadratic polynomial to model the relationship. The form of this model is:

  2. To fit this model, you must “create” a new variable (age squared) and refit the model using age and age squared as predictors.

What is the fitted prediction equation?

mod2 = lm(Mercury ~ Age + I(Age^2) , data=bluegill)
summary(mod2)

4.b. An alternative way of fitting a quadratic regression model.

mod3 = lm(Mercury ~ poly(Age, 2) , data=bluegill)
summary(mod3)
  1. Are the model assumptions satisfied?
plot(mod2)
  1. Is there evidence that this model is useful?
summary(mod2)
slunova(mod2)
  1. Is there evidence that the quadratic term is necessary?
summary(mod2)
  1. Would a cubic model be better?
mod8 = lm(Mercury ~ Age + I(Age^2) + I(Age^3) , data=bluegill)
summary(mod8)
  1. Would a log-transformation be better?

13.2 Example 2: State SAT Scores

Motivation StateSAT0405.csv includes information on average SAT scores (Math, Verbal, and Combined) for the 50 states from the 2004-2005 school year, as well as the percent of high school seniors taking the exam that year.

SATdata = read_csv("data/StateSAT0405.CSV")

A scatterplot of the relationship between Average Combined SAT score (for a state) and percent of students (in a state) taking the exam is provided. Discuss with your neighbors any trends that you can identify.

g = ggplot(SATdata, aes(x=PercentTakers, y=AverageCombinedSAT)) + 
  geom_point() + 
  geom_smooth(method="loess") + 
  labs(x="Percent of Students (in a state) Taking SAT Exam", y="Average Combined SAT Score (for takers in a state)") 

g

Model 1 Propose a quadratic regression model to explain Average Combined SAT score from the percent of students taking the SAT exam. Fit that model, write the estimated equation below. Also, examine residual plots to determine if you have any concerns about model fit, as well as report the adjusted \(R^2\) for the model and the size of a typical prediction error (residual).

Model 2 In some states, it is more common for students to take the SAT while in others it is more common for students to take the ACT (https://www.collegeraptor.com/getting-in/articles/act-sat/preference-act-sat-state-infographic/). Given that, it might be useful to consider the possibility that the relationship between Average Combined SAT score and percent of students taking the SAT exam differs for the types of SAT (those with ACT preference and those with SAT preference).

g +
  geom_point(aes(color=TestPref)) + 
  geom_smooth(method="lm",se=FALSE,aes(color=TestPref))

Your first step will be to create an indicator variable for the inclusion of type of state in the model.

Propose a model that allows there to be different lines (with different intercepts and slopes) relating Average Combined SAT score to percent of takers for the two types of states (SAT versus ACT). Fit that model, write the estimated equation below. Also, examine residual plots to determine if you have any concerns about model fit, as well as report the adjusted \(R^2\) for the model and the size of a typical prediction error (residual).

Discussion Which model do you like better and why?