STA414S/2104S: Statistical Methods for Data Mining and Machine Learning

January - April, 2010


Meets Tuesday 12-2, Thursday 12-1.
SS 2105

Course Information
This course will consider topics in statistics that have played a role in the development of techniques for data mining and machine learning. We will cover linear methods for regression and classification, nonparametric regression and classification methods, generalized additive models, aspects of model inference and model selection, model averaging, and tree-based methods.

Prerequisite: Either STA 302H (regression) or CSC 411H (machine learning). CSC108H was recently added as a prerequisite; this is not strictly enforced, but you must be willing to use a statistical computing environment such as R or Matlab.

Office Hours: Tuesdays, 3-4; Thursdays, 2-3; or by appointment.

Textbook: Hastie, Tibshirani and Friedman. The Elements of Statistical Learning. Springer-Verlag.

Book web page, including a link to the online pdf of the book.

Course evaluation:

  • Homework 1 due February 11: 20%,
  • Homework 2 due March 4: 20%,
  • Midterm exam, March 16: 20%,
  • Final project due April 16: 40% (See January 5 handout for Project information).


Final Project: Due April 16, before 2 pm. You may email your project to me, but make sure you get an email reply before the deadline. If you do not hear back from me, I did not receive your project, and you will need to bring a hard copy.

I will not hold office hours April 5 to 9, but will be available by email.

Midterm: to be returned on April 6, 12 pm; room SS 6004

Homework 2: Some comments

Homework 1: Finally! Sketch of solutions

Material from lectures

Mar 30

  • Slides, March 30. Overview of the course and the Netflix Prize solutions
  • All the papers I used are available through the Netflix website; see the 2nd slide

Mar 23, 25

Mar 16

  • No class on March 18, but office hours at the usual time
  • Slides
  • Handout with R code
  • Random Forests page, maintained by Adele Cutler
  • Electronic Textbook on Classification and Regression Trees
  • I forgot to tell you about all the cool stuff happening soon at the Fields Institute:
    • April 15 -- Robert C. Merton, Harvard, Nobel-Prize winning economist
    • April 21-23 -- Darrell Duffie, Stanford
    • April 29 -- Statistics Grad Students' Research Day
    • April 30, May 1 -- DASF III: Workshop: Data Analysis and Statistical Foundations
    • May 3, 4 -- Jianqing Fan, Princeton: Distinguished Lecture Series in Statistics

Mar 9, 11

Mar 2, 4

February 23, 25

February 9

February 2

  • Slides
  • February 4: Li Li has office hours in SS 2005, 1-2 pm, and SS 6027a, 2-3 pm. My office hour is cancelled on Thursday.

January 26, 28

  • Wine data for HW 1
  • Slides for both Tuesday and Thursday lectures (amended on Thursday, Jan 28)
  • HW 1: mistake in Q1. The expression for W in part (b) should have $\hat\beta_{(0)}$ instead of 0 in the second log-likelihood term. The posted version below has been corrected.

January 19, 21

January 12

January 5, 7

Computing

I will refer to, and provide explanations for, the R computing environment. You are welcome to use some other package if you prefer. There are many online resources for R, including:

Download R to your laptop from CRAN (the Comprehensive R Archive Network).

A menu-driven interface called R Commander is also available.
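
If you want to try R Commander, a minimal sketch of the usual commands is below (run from within R while online; the package name on CRAN is Rcmdr):

      install.packages("Rcmdr")   ## fetch R Commander from CRAN
      library(Rcmdr)              ## load it; the menu-driven interface opens in its own window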

I recently found a very nice short introduction to R basics from Charlotte Wickham, UC Berkeley.

Questions and clarifications re Midterm

  • Mar 16: There is a typo in the expression for $\hat f$.
    • It has been fixed in the currently posted version.
  • Mar 17: In Question 1, is each of the x's a scalar? I.e., should we treat x_1, ..., x_N as each being a single number, or a vector? The issue I see is that if it is a single number, then the class of estimators would seem to have no intercept (since all of the terms are multiplied by y_i and we have no column of 1's involved).
    • Yes, each of the x_i is a scalar; the exercise in the text has more specific details. With a suitable choice of $\ell_i$, you can write $\hat f(x_0)$ as shown, for the case of linear regression (the only one I've checked); the part that involves the constant term $\hat\beta_0$ becomes part of the weight (see the sketch below).
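
      To make that last point concrete, here is a sketch of the simple (one-predictor) linear regression case, which is my reading of the case referred to above:

      $$ \hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0 = \bar y + \hat\beta_1 (x_0 - \bar x) = \sum_{i=1}^{N} \left[ \frac{1}{N} + \frac{(x_0 - \bar x)(x_i - \bar x)}{\sum_{j=1}^{N} (x_j - \bar x)^2} \right] y_i = \sum_{i=1}^{N} \ell_i(x_0)\, y_i , $$

      so the $1/N$ piece of each weight $\ell_i(x_0)$ is exactly where the intercept ends up.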

Some questions/answers re HW 1

  • Do you have any hints for Q2?
    • As pointed out by Hala, the columns of two matrices span the same subspace if one matrix is a (full-rank) linear transformation of the other. This should help; a small numerical illustration follows.
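
      A tiny numerical illustration of that fact (not the homework itself; the matrices here are just random examples):

      set.seed(1)
      X <- matrix(rnorm(20), 5, 4)                ## a 5 x 4 matrix
      A <- matrix(rnorm(16), 4, 4)                ## a full-rank 4 x 4 transformation (with probability 1)
      XA <- X %*% A                               ## has the same column space as X
      H1 <- X %*% solve(t(X) %*% X) %*% t(X)      ## projection onto col(X)
      H2 <- XA %*% solve(t(XA) %*% XA) %*% t(XA)  ## projection onto col(XA)
      max(abs(H1 - H2))                           ## essentially zero: the two projections agree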

  • How do I create a training data set using the sample function in R?
    • Try ?sample; this will show you how to get a random sample of integers. You can use these as row labels.

      myrows = sample( ... )         ## I'll let you figure this part out
      mytrain = winedata[myrows, ]   ## choose these rows of the data.frame
      mytest = winedata[-myrows, ]   ## choose all the other rows of the data.frame
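
      For illustration only (these are not the homework numbers), the call below draws 5 of the integers 1 to 10 without replacement, the kind of result you would then use as row labels:

      sample(1:10, 5)   ## five distinct integers between 1 and 10, in random order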

  • I have a specific question here: in the course slides there are lines like "> library(ElemStatLearn)". I'm just wondering where I can obtain this library package.
    • Within R there is an option to install packages from CRAN. On my Mac it's a menu item: highlight "Package Installer", and a new window opens with "Get List". Once you have the list (you need to be online), search for "ElemStatLearn" and then click "Install Selected". Once all that is done, from within R you can load the package by typing "library(ElemStatLearn)". Sometimes the download and install doesn't work the first (or second...) time, so keep trying; a command-line sketch is given below.

      But you can also get any of the book data sets into R 'by hand': go to the book website, download the data file to your computer, and use read.table or read.csv; you don't have to use the ElemStatLearn package. Note that the 'manual' for ElemStatLearn (on my web page) can be used to duplicate many of the analyses in the book.
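
      If the menus give you trouble, here is a minimal command-line sketch of both routes; "prostate.data" is just an example of one of the book's data files, so adjust the file name (and read.table versus read.csv) to match whatever you actually download:

      install.packages("ElemStatLearn")                        ## command-line equivalent of the Package Installer menu
      library(ElemStatLearn)                                   ## load the package once it is installed
      prostate <- read.table("prostate.data", header = TRUE)   ## or read a file downloaded 'by hand' from the book website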

  • Specifically, the way I understand it using lm.ridge: 1. the "smallest GCV value" (output from select(ridge_regression)) gives the optimal value of lambda. And 2. different sequences of lambdas passed to lm.ridge(model, data, lambda=sequence) will produce different optimal values of lambda (or smallest GCV), depending on (obviously) the start and end of the sequence and (not so obvious to me...) how dense you make the sequence...
    • Yes, both of these are correct, although one would hope that the values don't depend too heavily on how dense the sequence is. I don't know whether the difficulty is numerical, in the way things are implemented, or inherent in the GCV criterion; I haven't seen any discussion of this. (A short lm.ridge sketch is given at the end of this section.)
  • So, in particular to the wine quality problem, how does one choose the optimal lambda value to calculate the regression coefficients with?
    • hmmm
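
For what it's worth, here is a minimal lm.ridge/GCV sketch of the kind of search discussed above; the formula, the winedata data frame, and the grid endpoints and spacing are placeholders you will need to adapt:

      library(MASS)                                    ## lm.ridge and select live here
      lambdas <- seq(0, 100, by = 0.1)                 ## a fairly dense grid of candidate lambdas
      fit <- lm.ridge(quality ~ ., data = winedata, lambda = lambdas)
      select(fit)                                      ## reports, among other things, the lambda with smallest GCV
      plot(fit$lambda, fit$GCV, type = "l",
           xlab = "lambda", ylab = "GCV")              ## look at the whole GCV curve, not just its minimum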