Overview: The aim of this course is to introduce the most common data analysis techniques used for analyzing real-world data that do not conform to the assumptions of the Linear Model. We will be analyzing data that display non-linear patterns, frequency data, count data, and longitudinal data. Students will get practice with exploratory data analysis (data visualization, model selection, formulating a hypothesis) and with statistical inference for regression models. Data analysis will be done in R and reproducible assignment reports will be authored using R Markdown.
Prerequisites: I will assume that students are familiar with linear regression, have used a statistical package such as R for linear regression, and have a reasonable degree of facility with mathematical reasoning about statistical models (at the level of STA302).
Instructor: Michael Guerzhoy. Office: BA5244, Email: guerzhoy at cs.toronto.edu (please include STA303/STA1002 in the subject, and please ask questions on Piazza if they are relevant to everyone.)
TAs: Tiffany Fitzpatrick, Luhui (Luke) Gan
Michael's office hours: Thursday 6-7PM, Friday 3-4PM. Or email for an appointment (Thursday and Friday afternoon/evening strongly preferred). Or drop by to see if I'm in. Feel free to chat with me after lecture.
Course forum on Piazza
There is no perfect textbook that fits the syllabus of STA303/STA1002. The following are good starting points:
- Michael Kutner, Christopher Nachtsheim, John Neter, Applied Linear Regression Models
- Howard J. Seltman, Experimental Design and Analysis — a more elementary book than what we need (just discusses the techniques while sometimes omitting the intuition/rationale/theory), but covers t-tests and ANOVA.
- Alan Agresti, Introduction to Categorical Data Analysis — covers most of what we need, but unfortunately not t-tests, ANOVA, and multiple comparisons (available on the web via the UofT library)
- Fred Ramsey and Daniel Schafer, The Statistical Sleuth: A Course in Methods of Data Analysis (see also The Statistical Sleuth (3rd Edition) In R) — an excellent book that can sometimes be sparse on details.
- Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View — a wonderful book about modern data analysis techniques. Some chapters are very relevant (although not directly covered), and others are too advanced.
- Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models — an excellent book on multilevel/hierarchical models, and data analysis in general.
- Kieran Healy, Soc 880: Data Visualization — an excellent short course on data visualization, with excellent ggplot tutorials during Week 4 and Week 6.
We will be using RStudio to author reproducible data analysis reports using R and R Markdown.
Project 1 (10%): ANOVA, multiple comparisons, and simulation. Due: Thursday Jul. 14 11PM. Some R tips for P1: Part 1, Part 2. Solutions: writeup (source, grants.csv). Bonus solutions (source.)
Project 2 (15%): Classification, prediction, and multilevel models. Due: Friday Aug. 5 11PM (bonus due Aug. 8 5pm). Some R tips for P2: Ordinal variables (R code, source.) Interactions tutorial: here. Prediction cost tutorial: Part 1, Part 2, Part 3, Part 4; prediction cost tutorial. Project 2 handout source. Solutions: German Credit, Shaquille O'Neal's free throws and multilevel models
Lateness policy: 10% per 24 hours, rounded up. Late projects are only accepted for 48 hours after the deadline.
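As a rough illustration of the lateness arithmetic above, here is a minimal sketch. It assumes that "rounded up" applies to the number of 24-hour periods (so a project that is one minute late already loses 10%) and that the penalty is expressed in percentage points of the project mark. The function name `late_penalty` is hypothetical, and Python is used only for illustration; the course itself works in R.

```python
import math
from typing import Optional


def late_penalty(hours_late: float) -> Optional[float]:
    """Return the penalty in percent, or None if the project is not accepted.

    Assumption: any started 24-hour period counts as a full period.
    """
    if hours_late <= 0:
        # Submitted on time: no penalty.
        return 0.0
    if hours_late > 48:
        # More than 48 hours after the deadline: not accepted at all.
        return None
    # 10% per 24 hours, with the number of periods rounded up.
    return 10.0 * math.ceil(hours_late / 24)


if __name__ == "__main__":
    for hours in (0, 0.5, 24, 25, 48, 49):
        print(hours, late_penalty(hours))
```

Under these assumptions, a project submitted 25 hours late would lose 20%, and one submitted 49 hours late would not be accepted.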
Projects are to be submitted on MarkUs. You can log in using your UTORid.
Monday Jul. 18 in EX300, 6:50PM-9PM. Worth: 25%. Midterm paper. Solutions (source).
Summer 2016 exam paper
Aug. 2016 exam timetable. Worth: 50%
Reference sheet to be handed out with the exam
Conceptual problems: Study Guide. You can add your solutions, and read other people's solutions, here.
One-Way ANOVA and t-tests: Problems. Supplementary data and analysis: drug trial analysis from Kleinbaum (source), Spock dataset (source). Solutions.
Two-Way ANOVA: Problems. Solutions.
Logistic Regression: Problems. Supplementary data and analysis: Donner Party (source), counterfeit banknotes (source), new cars (source). Solutions.
Logistic Regression, Part 2: Problems. Supplementary data and analysis: Krunnit (source), bottle deposits (source). Solutions.
Logistic Regression, Part 3: Problems. Supplementary data and analysis: Classification (source). Solutions.
Log-Linear Models: Problems (Q7_R.txt). Solutions (Q7_R_Full.txt).
Old tests and exams: here.
Unadapted practice problems are available here.
At students' request, I am posting relevant reading. You are only responsible for what's in the lectures, but of course it's always good to read a textbook as well. I do not expect that everyone consults all the readings I post, only that people make sure that they thoroughly understand the lectures.
Reading: Seltman Ch. 6 ("t-test"). Ramsey Ch. 2, 3 ("Inference Using t-Distributions", "A Closer Look at Assumptions")
Just for fun: the American Statistical Association's statement on p-values; more advanced (and slightly sarcastic) post from Andrew Gelman: "I've never in my professional life made a Type I error or a Type II error"
Reading: Seltman Ch. 7 ("One-way ANOVA"). Ramsey Ch. 3 ("A Closer Look at Assumptions"), Ramsey Ch. 5 ("Comparisons Among Several Means").
Reading: the appropriate chapters from Kutner (different depending on the edition); the Two-Way ANOVA chapter in Seltman
Reading: Agresti Chapters 4-5 (not all sections).
Also: more fish! Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon. More Brains! Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates.
Just for fun: the Titanic was actually atypical. Typically, more men than women survive: M. Elinder and O. Erixson, Gender, social norms, and survival in maritime disasters, PNAS vol. 109 no. 33, 2012.
Just for fun: FiveThirtyEight's p-value clip.
Just for fun: the Dunning-Kruger effect study.
Simulation reading: Shalizi Chapter 5
Lecture 6: The midterm, cross validation (R code, source), Issues in logistic regression (R code -- perfect separation, source, R code -- extrabinomial, source).
Reading: Shalizi Ch. 3 on cross-validation.
Lecture 7: the midterm; Binomial and Poisson Distributions: review (R code, source); logistic regression with count data (R code, source.) Intro to Poisson Regression (R code, source).
Reading: On GLMs/Logistic/Poisson Regression, read the GLMs/Logistic/Poisson chapters in Kutner. Ch. 12 of Shalizi presents a nice summary of GLMs. Ramsey Ch. 20-22 is also good.
Lecture 9: a quick review of overdispersion and binomial family GLM in R (source); Intro to Hierarchical/Multilevel Models (R code, source, data)
Reading: Agresti Ch. 10 or Gelman Ch. 12.
Lecture 10: The exam! Using glmnet for ridge logistic regression and visualizing coefficients (source, image files). A connection between Ridge Logistic Regression and Partial Pooling. Predicting elections with Partial Pooling (R code, source, polls.dta). Project 2 discussion.
Just for fun: the polling data is for the 1988 US presidential election.
Just for fun (relevant to the part of Project 2 dealing with Shaquille O'Neal's free throws): Hack-a-Shaq.
Reading: (for the polling example) the beginning of Gelman Ch. 12.
Lecture 11: Project 2 — German Credit, Project 2 — Shaq, Project 2 — bonus. predictive modelling (source, german.data).
See you around!