DATA/STAT 234: Intro to Data Science
Last modified: 2023-05-05
Syllabus
0.1 Course Information
0.1.1 Course Overview
Much of this course will focus on the “Data Analysis Life Cycle” with an emphasis on becoming familiar with different types of data.
More specifically, we will mainly be using the R Programming Language and the suite of packages commonly referred to as the Tidyverse.
0.1.3 Instructor
- Dr. Ivan Ramler
- 124 Bewkes Hall
- e: iramler@stlawu.edu
- Office Hours:
- MW 1:00 - 2:30 or by appointment
- Zoom Option
0.1.4 Course Goals
Some (but not necessarily all) of the goals of this course are:
- Become familiar with the R environment
- Strengthen introductory statistics topics by revisiting them in R. (e.g., descriptive statistics, plotting, inference)
- Learn basic data science skills (data cleaning/wrangling and visualization)
- Set you on the path to be able to teach yourself! College isn’t just about giving us the exact skill set we’ll need for the rest of our lives. It’s about training us to be adaptable and familiar with learning new skills while under pressure and a time constraint.
0.1.5 Course Materials
R Studio: As mentioned, we will mainly be using R for this course, together with the IDE R Studio. While R Studio is free to download, SLU also has an R Studio server available at http://rstudio.stlawu.local:8787. This server can be accessed from off-campus through VPN.
Canvas: Canvas will be used mainly for displaying aggregate grades, submitting exercises and code related to other assignments, and as a repository for useful links.
Data and Code outlines will be stored on GitHub at https://github.com/iramler/stat234.
- It is highly recommended that you install GitHub Desktop on your own machine.
Text
- Course Notes - after topics are completed, this site will contain code that I write in class.
- Tidyverse Online Documentation: Online help menu for about half of the course’s material
0.1.5.1 Additional Resources
- R for Data Science by Grolemund and Wickham, found here in a free online version.
- Modern Data Science with R by Baumer, Kaplan, and Horton, found here in a free online version.
- ChatGPT…ha, ha, just kidding. ChatGPT has been shown to be quite bad at generating correct solutions for programming (even though they look good). In fact, the leading source of programming help, Stack Overflow, has already banned ChatGPT-generated text from their site. (In other words, don’t trust it to provide much help. Instead, learn to sift through online examples on your own.)
0.1.6 Prerequisite
No prior programming experience is expected. However, knowledge of basic statistics (such as that covered in STAT 113) is required. Students who have had STAT 213 or CS 140 (or both) may have an advantage on some topics, but will still have the opportunity to learn plenty of new material.
0.1.7 Attendance
Unless ill or otherwise instructed by a health official, students are expected to attend class. The material for each class builds on the previous day’s material, and it becomes increasingly difficult to catch up as you fall further behind. In the event you are ill (for whatever reason) and cannot attend class, I do appreciate those who send a brief email letting me know beforehand. Regardless of your reason, if you do miss a class, it is your responsibility to get the information you missed before the next class.
0.2 Assignments
0.2.1 Exams
We will have two exams (an evening midterm and a final). Further details on topics and study materials will be forthcoming, but you should expect them to be “handwritten” instead of on the computer. Note that the nature of the course implies that the final exam will be somewhat cumulative.
Midterm Exam will be held during the week of March 6 - 10. The exact date will be voted on by the class. The exam will be held in the evening from 7 - 9 pm.
Final Exam is scheduled (by the Registrar) for Thursday, May 11 from 1:30 - 4:30 pm.
0.2.2 Quizzes
Most Mondays there will be in-class quizzes. They will require a small amount of programming by hand. You will be allowed access to the “R Studio Cheatsheets” for these quizzes. At the end of the semester, you may drop your lowest quiz.
During approximately the first half of the semester, longer take-home quizzes will be assigned on either Thursdays or Fridays and will be due by the beginning of class the following Tuesday. You are allowed to use any course materials for these and will submit your code. Details on the submission process and additional collaboration rules will be forthcoming. Please be aware that these are typically more challenging and time consuming than the in-class quizzes. Take-home quizzes will typically be worth 15 - 20 QP each.
0.2.3 Exercises
Homework exercises will be assigned each week and are intended to help you practice the material. Exercises are graded for completion only. Details on deadlines for exercises will be forthcoming.
0.2.4 Projects
There will be several projects throughout the semester. More details will be given as they are assigned, but early projects will typically have some pre-defined tasks and questions that you will investigate. As you develop more as a data scientist, projects will also have portions where you answer your own questions relating to a data set.
Projects are also where you will have a chance to practice your communication skills. For at least one project, you will be required to record a short presentation and watch several of your classmates’ presentations. More details will be given when this portion is assigned, but these oral presentations are intended to be fairly low stakes and to give you a chance to gain some practice and confidence presenting to an outside audience.
0.3 Grading
The percentage grade for the course will be determined by your performance in each of the assignment categories, combined using the following weighted average.
- Midterm Exam: 25% or 20%*
- Final Exam: 25% or 30%*
- Quizzes: 20%
- Exercises: 10%
- Projects: 20%
* If your final exam score is higher than your midterm exam, the final will be worth 30% and the midterm 20%. If not, both exams are given the same weight (25%).
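As a rough illustration of how the weighting works, the calculation can be sketched in R. The function name `course_percentage` and the example scores are made up for this sketch (they are not part of the course materials), and all scores are assumed to be proportions between 0 and 1.

```r
# Hypothetical helper (not part of the course materials): computes the
# weighted course percentage from category scores given as proportions (0 to 1).
course_percentage <- function(midterm, final, quizzes, exercises, projects) {
  # If the final exam score is higher than the midterm score, the final
  # counts 30% and the midterm 20%; otherwise both exams count 25%.
  if (final > midterm) {
    w_midterm <- 0.20
    w_final <- 0.30
  } else {
    w_midterm <- 0.25
    w_final <- 0.25
  }
  w_midterm * midterm + w_final * final +
    0.20 * quizzes + 0.10 * exercises + 0.20 * projects
}

# Made-up scores, for illustration only
course_percentage(midterm = 0.78, final = 0.88, quizzes = 0.85,
                  exercises = 1.00, projects = 0.90)
# 0.87
```

In this example the final (0.88) beats the midterm (0.78), so the exams are weighted 30% and 20%. With the scores swapped, both exams would count 25%.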
0.3.1 Tentative Grade Scale
Score | Grade |
---|---|
0.95 | 4.00 |
0.92 | 3.75 |
0.89 | 3.50 |
0.86 | 3.25 |
0.83 | 3.00 |
0.80 | 2.75 |
0.77 | 2.50 |
0.74 | 2.25 |
0.71 | 2.00 |
0.68 | 1.75 |
0.65 | 1.50 |
0.62 | 1.25 |
0.60 | 1.00 |
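For those curious, a lookup against this table can be sketched in R with `findInterval()`. This sketch assumes each Score is the minimum needed for the corresponding Grade, which the table does not state outright. The function name `score_to_grade` is made up, and because the table lists no grade for scores below 0.60, the sketch returns `NA` there.

```r
# Cutoffs and grades copied from the tentative grade scale, in increasing order.
# Assumption: each cutoff is the minimum score for the corresponding grade.
cutoffs <- c(0.60, 0.62, 0.65, 0.68, 0.71, 0.74, 0.77,
             0.80, 0.83, 0.86, 0.89, 0.92, 0.95)
grades <- c(1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
            2.75, 3.00, 3.25, 3.50, 3.75, 4.00)

# Hypothetical helper: map a course score (a proportion) to a grade.
score_to_grade <- function(score) {
  # findInterval() counts how many cutoffs are less than or equal to the score
  idx <- findInterval(score, cutoffs)
  # idx is 0 for scores below the lowest cutoff; the table gives no grade
  # for that case, so NA is returned
  c(NA, grades)[idx + 1]
}

score_to_grade(0.87)
# 3.25
```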
0.3.2 Skill Mastery by Grade (Approximate Guidelines)
The following table gives a rough guideline as to the skills I expect students to have in order to achieve a given grade.
Grade | Skills |
---|---|
2.0 | Given full access to notes, internet resources, and instantaneous feedback from the computer (e.g., R console), student should be able to reliably complete most major tasks. |
3.0 | In addition to the above, student should be able to reliably write code for major tasks without feedback from the R console and only minimal notes. |
3.75+ | In addition to the above, student should only make trivial errors when writing code either by hand or on a computer. Additionally, student needs only minimal feedback from the computer to add details to analyses and can do so in written form with minimal notes. |
Again, these should be considered rough guidelines that help you understand how different aspects of the course lead to different overall grades.
0.3.3 Pass/Fail
Pass/Fail is available to eligible students in this course. A passing grade is equivalent to a 1.0 or higher. To be considered eligible, you must not be a declared major or minor in Data Science, Statistics, Mathematics, or any of the Math-Combined majors that allow Statistics electives.
0.4 Tentative List of Topics
- Getting started with R and R Studio
- Basic Statistical Inference in R
- Basic data types and importing tabular data
- Data Wrangling and Transformations with dplyr
- Graphics with ggplot2
- Factors with forcats
- Data Tidying with tidyr
- Improved communication with R Markdown and ggplot2
- Merging data tables
- Strings with stringr
- Importing other types of data files
- Additional topics (time permitting)
  - R Scripts and Basic Coding in R
  - Dates and Times
  - Simple Machine Learning (e.g., decision trees, hierarchical clustering)