5 Data Wrangling Examples
5.2 Overview of Data
We will be using data from the (in)famous first-day survey administered each semester in Stat 113. To familiarize yourself with the data, check out the form here.
We will be using data from the Fall 2021 class. It can be found in the file Stat113Fall2021.csv, which is located in the data subdirectory of the notes folder.
Load the appropriate packages.
Read in the Stat 113 data.
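A minimal sketch of these two steps, assuming that dplyr and readr are the "appropriate" packages and that the working directory is the notes folder. The object names stat113_path and stat113 are illustrative choices, not names from the notes.

```r
library(dplyr)  # data wrangling verbs and the %>% pipe
library(readr)  # read_csv()

# Path assembled from the file name and "data" subdirectory mentioned above;
# assumes the working directory is the notes folder.
stat113_path <- file.path("data", "Stat113Fall2021.csv")

# Only read the file if it is actually present at that location.
if (file.exists(stat113_path)) {
  stat113 <- read_csv(stat113_path)
  glimpse(stat113)  # inspect the real column names before wrangling
}
```

Checking the column names with glimpse() first is worthwhile, because the sketches for the later exercises have to guess at them.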
For each section, find the number of students, the mean GPA, the proportion of students who did not submit a GPA, and the number of first-year students. Use piping. Store the result. Print the result to the console.
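One possible approach to the per-section summary, shown on a small made-up data frame so that it runs on its own. The column names Section, GPA, and Year, and the "FirstYear" label, are assumptions; check them against the real file.

```r
library(dplyr)

# Made-up rows for illustration only; column names are assumed.
toy <- tibble(
  Section = c("A", "A", "A", "B", "B"),
  GPA     = c(3.2, NA, 3.8, 2.9, 3.5),
  Year    = c("FirstYear", "Sophomore", "FirstYear", "Junior", "FirstYear")
)

section_summary <- toy %>%
  group_by(Section) %>%
  summarise(
    n_students   = n(),                                  # students per section
    mean_gpa     = mean(GPA, na.rm = TRUE),              # ignore missing GPAs
    prop_no_gpa  = mean(is.na(GPA)),                     # share with no GPA
    n_first_year = sum(Year == "FirstYear", na.rm = TRUE)
  )

section_summary  # print to the console
```

Taking the mean of a logical vector such as is.na(GPA) gives the proportion of TRUE values, which is why no separate division is needed.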
Construct a dataset that contains only the GPAs of the students, their class year, and their section. Keep only those that have a valid GPA.
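A sketch of this step on made-up data, assuming the columns are named GPA, Year, and Section and that a "valid" GPA simply means a non-missing one. If the real data contain impossible values, an extra range condition may be needed.

```r
library(dplyr)

# Made-up rows for illustration only; column names are assumed.
toy <- tibble(
  Section = c("A", "A", "A", "B", "B"),
  GPA     = c(3.2, NA, 3.8, 2.9, 3.5),
  Year    = c("FirstYear", "Sophomore", "FirstYear", "Junior", "FirstYear"),
  Tattoo  = c("No", "Yes", "No", "No", "Yes")  # extra column to be dropped
)

gpa_data <- toy %>%
  select(GPA, Year, Section) %>%  # keep only the three requested columns
  filter(!is.na(GPA))             # drop students with a missing GPA

gpa_data
```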
Construct a dataset that contains only the social media related variables. Tip: Check out the “helper” section of the dplyr cheat sheet for a useful shortcut when selecting variables with similar names.
Using the previous dataset, for those who have a Twitter account, construct a table containing the number of people in each “Favorite Social Media” category and their average number of Facebook friends. Sort the resulting table by the average number of Facebook friends (from largest to smallest).
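The selection and the Twitter table could be sketched as follows. The shared SM_ prefix and the column names SM_Favorite, SM_Facebook, and SM_Twitter are hypothetical and chosen only to illustrate a selection helper. The real variables may share a different pattern, in which case contains() or ends_with() may fit better than starts_with().

```r
library(dplyr)

# Made-up rows for illustration only; all column names are hypothetical.
toy <- tibble(
  GPA         = c(3.1, 3.6, 2.8, 3.9, 3.3),
  SM_Favorite = c("Instagram", "Snapchat", "Instagram", "Snapchat", "Instagram"),
  SM_Facebook = c(200, 500, 100, 300, NA),   # number of Facebook friends
  SM_Twitter  = c("Yes", "Yes", "Yes", "No", "Yes")
)

# Selection helper: keep every column whose name starts with "SM_".
social <- toy %>%
  select(starts_with("SM_"))

# Twitter users only, summarised by favorite social media platform.
twitter_table <- social %>%
  filter(SM_Twitter == "Yes") %>%
  group_by(SM_Favorite) %>%
  summarise(
    n_people        = n(),
    mean_fb_friends = mean(SM_Facebook, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_fb_friends))  # largest average first

twitter_table
```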
Instead of considering only those with Twitter accounts, construct a table similar to the one in the previous part, but with rows for both those with and those without Twitter.
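One way to get rows for both groups is to replace the filter with a second grouping variable. This is a sketch on made-up data, and the column names SM_Favorite, SM_Facebook, and SM_Twitter are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only; all column names are hypothetical.
social <- tibble(
  SM_Favorite = c("Instagram", "Snapchat", "Instagram", "Snapchat", "Instagram"),
  SM_Facebook = c(200, 500, 100, 300, NA),
  SM_Twitter  = c("Yes", "Yes", "Yes", "No", "Yes")
)

both_table <- social %>%
  group_by(SM_Twitter, SM_Favorite) %>%  # one row per Twitter/favorite combination
  summarise(
    n_people        = n(),
    mean_fb_friends = mean(SM_Facebook, na.rm = TRUE),
    .groups         = "drop"             # return an ungrouped table
  ) %>%
  arrange(desc(mean_fb_friends))

both_table
```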
Construct a dataset that contains the 10 students with the most piercings. Keep only the number of piercings and the gender of the student.
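A sketch using slice_max() from dplyr (version 1.0.0 or later), with made-up data and the assumed column names Piercings and Gender. Setting with_ties = FALSE returns exactly 10 rows. Whether tied students should instead all be kept is a judgment call the exercise leaves open.

```r
library(dplyr)

# Made-up rows for illustration only; column names are assumed.
toy <- tibble(
  Piercings = c(3, 0, 7, 1, 11, 5, 2, 9, 4, 10, 6, 8),
  Gender    = rep(c("F", "M"), times = 6),
  GPA       = seq(2.8, 3.9, by = 0.1)  # extra column to be dropped
)

top_pierced <- toy %>%
  slice_max(Piercings, n = 10, with_ties = FALSE) %>%  # 10 largest values
  select(Piercings, Gender)

top_pierced
```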
Count how many students are missing both their height and weight values.
Count how many students are missing at least one of their height and weight values.
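Both counts can be computed in one summarise() call. The sketch below uses made-up data and assumes the height and weight columns are named Hgt and Wgt.

```r
library(dplyr)

# Made-up rows for illustration only; column names are assumed.
toy <- tibble(
  Hgt = c(70, NA, 65, NA, 72),
  Wgt = c(160, NA, NA, 150, 180)
)

missing_counts <- toy %>%
  summarise(
    n_missing_both = sum(is.na(Hgt) & is.na(Wgt)),  # both values missing
    n_missing_any  = sum(is.na(Hgt) | is.na(Wgt))   # at least one missing
  )

missing_counts
```

The only difference between the two counts is the logical operator: & requires both conditions, while | requires at least one.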
Calculate BMI for students. Create a new object called statBMI to store this new variable along with the rest of the data. Be sure to retain missing values for the students who didn’t provide the necessary information. (Tip: When looking up the formula for BMI, recall that the units for height and weight in this data are inches and pounds, respectively.)
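With height in inches and weight in pounds, the usual imperial formula is BMI = 703 × weight / height². A sketch on made-up data, assuming the columns are named Hgt and Wgt:

```r
library(dplyr)

# Made-up rows for illustration only; column names are assumed.
toy <- tibble(
  Hgt   = c(70, NA, 65, NA, 72),
  Wgt   = c(160, NA, NA, 150, 180),
  Sport = c("Yes", "No", NA, "Yes", "No")
)

# mutate() keeps every row and every existing column; arithmetic on NA
# yields NA, so students with a missing height or weight get a missing BMI.
statBMI <- toy %>%
  mutate(BMI = 703 * Wgt / Hgt^2)

statBMI
```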
Keep only the BMI and sport question columns. Call this new object sportBMI.
Keep only rows where people answered the sport question and replace sportBMI with this cleaned data.
Now, we will redo the previous three parts as one long string of piped commands. Do so by starting with the initial data set.
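The three steps can be chained into a single pipeline along these lines. The data are made up, and the column names Hgt, Wgt, and Sport are assumptions.

```r
library(dplyr)

# Made-up rows for illustration only; column names are assumed.
toy <- tibble(
  Hgt   = c(70, NA, 65, NA, 72),
  Wgt   = c(160, NA, NA, 150, 180),
  Sport = c("Yes", "No", NA, "Yes", "No"),
  GPA   = c(3.2, 3.5, 2.9, 3.8, 3.1)
)

sportBMI <- toy %>%
  mutate(BMI = 703 * Wgt / Hgt^2) %>%  # step 1: compute BMI (NA is retained)
  select(BMI, Sport) %>%               # step 2: keep only BMI and sport
  filter(!is.na(Sport))                # step 3: keep answered sport question

sportBMI
```

Note that rows with a missing BMI remain in the result, since only the sport question is used for filtering.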
Compare BMI for athletes vs non-athletes. Do so using numerical summary statistics that you learned about in Stat 113.
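A grouped summary is one way to make this comparison. The sketch below uses made-up BMI values and assumes the sport question is stored in a column named Sport with the answers "Yes" and "No".

```r
library(dplyr)

# Made-up values for illustration only; column names are assumed.
sportBMI <- tibble(
  BMI   = c(22.1, 24.3, 23.5, 25.0, 21.8, 26.2, 27.1, 24.9),
  Sport = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No")
)

bmi_summary <- sportBMI %>%
  group_by(Sport) %>%
  summarise(
    n          = sum(!is.na(BMI)),          # students with a usable BMI
    mean_bmi   = mean(BMI, na.rm = TRUE),
    sd_bmi     = sd(BMI, na.rm = TRUE),
    median_bmi = median(BMI, na.rm = TRUE)
  )

bmi_summary
```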
Assuming that the Stat 113 students in this data represent a random (or at least representative) sample of SLU students, is there a statistically significant difference in average BMI values between athletes and non-athletes?
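One common tool for this question is a two-sample t-test, though the notes do not say which test is intended. The sketch below runs base R's t.test(), which by default performs Welch's test without assuming equal variances. The BMI values are made up, so the result says nothing about the real data, and the column names are assumptions.

```r
library(dplyr)

# Made-up values for illustration only; column names are assumed.
sportBMI <- tibble(
  BMI   = c(22.1, 24.3, 23.5, 25.0, 21.8, 26.2, 27.1, 24.9),
  Sport = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No")
)

# Formula interface: compare mean BMI between the two Sport groups.
# Rows with a missing BMI or Sport value are dropped by the test itself.
bmi_test <- t.test(BMI ~ Sport, data = sportBMI)

bmi_test          # full printout: test statistic, df, p-value, interval
bmi_test$p.value  # p-value on its own
```

A small p-value would suggest a difference in mean BMI between the groups, subject to the random-sample assumption stated in the question.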