4 Transformations

Getting Started

As always, insert a chunk for the R packages we will need.

library(tidyverse)
library(ggiraphExtra)

4.1 Guided Example: Length and Width of Butter Clams

Upload the file ButterClams.CSV from the T drive to the data folder (in your notes project). Run the following chunk to import the data into R.

clams <- read_csv("data/ButterClams.CSV")

Our dataset contains information on the length and width (both in centimeters) of 88 Puget Sound butter clams. We want to use the width of the clams to model (predict) the length of the clams. Fill in the missing pieces of the code to produce a scatterplot of length versus width, with a smoother. Does a linear trend seem reasonable?

ggplot(clams, aes(x=Width, y=Length))    +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x="Width (cm)", y = "Length (cm)")

Propose a simple linear model (for the entire population of butter clams) that uses width to explain length.
Fit the least squares regression equation for modeling length from width. Write down the equation below.

clams_mod1 <- lm(Length ~ Width, data = clams)
summary(clams_mod1)

Fill in the missing pieces of the code to add the least squares regression line to the scatterplot of length versus width for our sample of butter clams. Do not put the standard error on the display.

ggplot(clams, aes(x=Width, y=Length)) + 
  geom_point() + 
  geom_smooth(se = FALSE, method = 'lm') +
  labs(x="Width (cm)", y = "Length (cm)")

Produce plots to check the assumptions for the linear model we have used in this application. Do they seem reasonably met?

plot(clams_mod1)
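Calling plot() on a fitted lm object draws four diagnostic plots by default: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The sketch below shows one way to view all four at once. It uses R's built-in `cars` data as a stand-in so that it runs without the clam file; `demo_mod`, `demo_resid`, and `demo_fitted` are illustrative names, not part of the original notes.

```r
# Stand-in model on R's built-in `cars` data (NOT the clam data)
demo_mod <- lm(dist ~ speed, data = cars)

# Arrange the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(demo_mod)
par(old_par)  # restore the previous plotting layout

# The residuals and fitted values behind those plots
demo_resid  <- resid(demo_mod)
demo_fitted <- fitted(demo_mod)
```

Curvature in the Residuals vs Fitted panel points to a linearity problem, while a fan shape there or a trend in the Scale-Location panel points to non-constant variance.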

Run the following code to create a new, transformed variable in your dataset. Write down a description of what this code does.

clams$logLength = log( clams$Length )
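As a quick check on what log() does in R: it computes the natural logarithm (base e), and exp() reverses it. The small sketch below uses made-up numbers purely for illustration; `x`, `log_x`, `back`, and `log10_x` are not part of the clam dataset.

```r
# Illustrative values only (not from the clam data)
x <- c(1, 2.5, 10)

log_x <- log(x)      # natural log (base e), the transformation used above
back  <- exp(log_x)  # exp() undoes log(), recovering the original values

# For comparison, base-10 logs need a different function
log10_x <- log10(x)
```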

Produce a scatterplot (with smoother) using the transformed variable as the response variable, propose a new simple linear model using the transformed response variable, refit the model using this transformation, and examine the residuals. What problem do you see?

ggplot(clams, aes(x=Width, y=logLength))    +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x="Width (cm)", y = "ln(Length)")

clams_mod2 <- lm(logLength ~ Width, data = clams)
summary(clams_mod2)

plot(clams_mod2)

What transformation should we try to “fix” this? Why? Insert a chunk to create another transformed variable in your dataset.

# make logWidth
clams$logWidth = log(clams$Width)

# y ~ log(x) model
clams_mod3 <- lm(Length ~ logWidth, data = clams)
summary(clams_mod3)
plot(clams_mod3)

# log(y) ~ log(x) model
clams_mod4 <- lm(logLength ~ logWidth, data = clams)
summary(clams_mod4)
plot(clams_mod4)

Produce a scatterplot (with smoother) using both of the transformed variables, propose a new simple linear model using both transformed variables, refit the new model, and examine the residuals. How do things look now?
Using the model to make a prediction: Write down the final fitted regression model. Then use the fitted line to predict the length of a clam that is 2.5 cm wide.

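# Predicted ln(Length) for a clam 2.5 cm wide, using the fitted log-log coefficients
0.3051 + 0.9615*log(2.5)
# Back-transform from ln(cm) to cm
exp(1.186114)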
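The same prediction can be written so that the intermediate value does not have to be copied by hand. This is a minimal sketch that assumes 0.3051 and 0.9615 are the intercept and slope reported for the model of logLength on logWidth, as the arithmetic above suggests; the variable names are illustrative.

```r
# Coefficients as used in the calculation above (assumed to come from the log-log fit)
b0 <- 0.3051
b1 <- 0.9615
width <- 2.5  # cm

# Prediction on the log scale, then back-transformed to centimeters
log_length_hat <- b0 + b1 * log(width)
length_hat <- exp(log_length_hat)

# Equivalent power-law form of the fitted model: Length = exp(b0) * Width^b1
length_hat_power <- exp(b0) * width^b1

round(length_hat, 2)
```

Both forms give a predicted length of roughly 3.27 cm. Because the model was fit on the log scale, the exp() step is what returns the prediction to the original units.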

4.2 Your Turn: Penguins

Motivation: Emperor penguins are the most accomplished divers among birds, making routine dives of 5 – 12 minutes, with the longest dive ever recorded being over 27 minutes. Since air-breathing animals like penguins must hold their breath while submerged, the duration of any given dive depends on such things as how much oxygen is in the bird’s body at the beginning of the dive and how quickly that oxygen gets used. The rate of oxygen depletion is primarily determined by the penguin’s heart rate. Studying the heart rate of these birds can help us to understand how these animals regulate their oxygen consumption in order to make such impressive dives.

Upload the Penguins data from the T drive and place it in the data folder. Run the following chunk to import the data into R.

penguins = read_csv("data/Penguins.CSV")

Produce a scatterplot of dive heart rate versus the duration of the dive (in minutes), with a smoother. Does a linear trend seem reasonable?

ggplot(data = penguins, aes(x = Duration_mins, y = DiveHeartRate)) +
  geom_point() +
  geom_smooth(se = FALSE)

Regardless of your answer to #2, propose a simple linear model (for the entire population of penguins) that uses the duration of the dive (in minutes) to explain dive heart rate.

$\text{Dive Heart Rate} = \beta_0 + \beta_1 \, \text{Duration(mins)} + \varepsilon$

Fit the least squares regression equation for modeling dive heart rate from the duration of the dive (in minutes). Examine the appropriate residual plots. Is the linear model appropriate?

penguin_mod1 <- lm(DiveHeartRate ~ Duration_mins, 
                   data = penguins)
penguin_mod1
plot(penguin_mod1)

Investigate various transformations until you find a model that seems appropriate (you might want to keep some notes for yourself so you remember what you did later!). Write down the final estimated model.

penguins$logDiveHeartRate = log(penguins$DiveHeartRate)
penguins$logDuration_mins = log(penguins$Duration_mins)

penguin_mod2 <- lm(data = penguins,
   logDiveHeartRate ~ Duration_mins)
plot(penguin_mod2)

penguin_mod3 <- lm(data = penguins,
   DiveHeartRate ~ logDuration_mins)
plot(penguin_mod3)

penguin_mod4 <- lm(data = penguins,
   logDiveHeartRate ~ logDuration_mins)
plot(penguin_mod4)

Use the final estimated model to predict the dive heart rate for a dive lasting 10 minutes.

Final form: DiveHeartRate ~ log(Duration_mins), because it fixes the linearity issue (even though some non-constant variance remains).

summary(penguin_mod3)
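One way to get the prediction without typing coefficients by hand is predict() with a newdata data frame. The sketch below shows the pattern on R's built-in `cars` data as a stand-in, since the penguin file is not available here; `demo`, `demo_mod`, `new_obs`, `pred`, and `by_hand` are illustrative names. The key detail is that the model was fit on a precomputed log column, so newdata must supply that column on the log scale.

```r
# Stand-in data: R's built-in `cars` (NOT the penguin data)
demo <- cars
demo$logSpeed <- log(demo$speed)  # transformed predictor stored as its own column

# Y ~ log(X) model, mirroring the structure of penguin_mod3
demo_mod <- lm(dist ~ logSpeed, data = demo)

# newdata must use the column name from the formula, so supply log(10), not 10
new_obs <- data.frame(logSpeed = log(10))
pred <- predict(demo_mod, newdata = new_obs)

# Same value computed by hand from the fitted coefficients
by_hand <- coef(demo_mod)[["(Intercept)"]] +
  coef(demo_mod)[["logSpeed"]] * log(10)
```

For the penguins, the analogous call would be predict(penguin_mod3, newdata = data.frame(logDuration_mins = log(10))). Since the response in that model is untransformed, the result is already in heart rate units and needs no exp() step.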