
STA414S/2104S: Statistical Methods for Data Mining and Machine Learning, January–April 2010
Meets Tuesday 12–2, Thursday 12–1, in SS 2105.
Course Information
This course will consider topics in statistics that have played a role in
the development of techniques for data mining and machine learning. We will
cover linear methods for regression and classification, nonparametric
regression and classification methods, generalized additive models, aspects
of model inference and model selection, model averaging, and tree-based
methods.
Prerequisite: Either STA 302H (regression) or CSC 411H (machine learning). CSC 108H was recently added; this is not strictly enforced, but you must be willing to use a statistical computing environment such as R or Matlab.
Office Hours: Tuesdays, 3–4; Thursdays, 2–3; or by appointment.
Textbook: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning.
Springer-Verlag.
Book web page, including a link to the online PDF of the book.
Course evaluation:
- Homework 1, due February 11: 20%
- Homework 2, due March 4: 20%
- Midterm exam, March 16: 20%
- Final project, due April 16: 40%
(See January 5 handout for Project information).
Final Project: Due April 16, before 2 pm. You may email your project to me, but make sure you get an email reply before the deadline. If you do not hear back from me, I didn't get your project, and you'll need to bring a hard copy.
I will not hold office hours April 5 to 9, but will be available by email.
Midterm: to be returned on April 6, 12 pm; room SS 6004
Material from lectures
Mar 30
- Slides, March 30: overview of the course and the Netflix Prize solutions.
- All the papers I used are available through the Netflix website; see the 2nd slide.
Mar 23, 25
Mar 16
- No class on March 18, but office hours at the usual time.
- Slides
- Handout with R code
- Random Forests page, maintained by Adele Cutler
- Electronic Textbook on Classification and Regression Trees
- I forgot to tell you about all the cool stuff happening soon at the Fields Institute:
  - April 15: Robert C. Merton, Harvard, Nobel Prize-winning economist
  - April 21–23: Darrell Duffie, Stanford
  - April 29: Statistics Grad Students' Research Day
  - April 30 and May 1: DASF III Workshop: Data Analysis and Statistical Foundations
  - May 3–4: Jianqing Fan, Princeton, Distinguished Lecture Series in Statistics
Mar 9, 11
Mar 2, 4
February 23, 25
February 9
February 2
- Slides
- February 4: Li Li has office hours in SS 2005, 1–2 pm, and SS 6027a, 2–3 pm. My office hour is cancelled on Thursday.
January 26, 28
- Wine data for HW 1
- Slides for both Tuesday and Thursday lectures (amended on Thursday, Jan 28)
- HW 1: mistake in Q1. The expression for W in part (b) should have $\hat\beta_{(0)}$ instead of 0 in the second log-likelihood term. The posted version below has been corrected.
January 19, 21
January 12
January 5, 7
Computing
I will refer to, and provide explanations for, the R computing environment. You are welcome to use some other package if you prefer. There are many online resources for R, including:
Download R to your laptop from CRAN.
A menu-driven interface called R Commander is available.
I recently found a very nice short introduction to R basics from Charlotte Wickham, UC Berkeley.
Questions and clarifications re Midterm
- Mar 16: There is a typo in the expression for $\hat f$; it has been fixed in the currently posted version.
- Mar 17: In Question 1, is each of the x's a scalar? I.e., should we treat $x_1, \ldots, x_N$ each as a single number, or as a vector? The issue I see is that if each is a single number, then the class of estimators would seem to have no intercept (since all of the terms are multiplied by $y_i$ and we have no column of 1's involved).
  - Yes, each of the $x_i$ is a scalar; the exercise in the text has more specific details. With a suitable choice of $\ell_i$, you can write $\hat f(x_0)$ as shown, for the case of linear regression (the only one I've checked); the bit that includes the constant term $\hat\beta_0$ becomes part of the weight.
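For concreteness, here is the linear-regression case sketched out (my own derivation, under the usual simple least-squares setup; it is not from the posted solutions). With $\bar x = \frac{1}{N}\sum_i x_i$ and $S_{xx} = \sum_j (x_j - \bar x)^2$,
$$\hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0 = \bar y + \hat\beta_1(x_0 - \bar x) = \sum_{i=1}^N \left[\frac{1}{N} + \frac{(x_i - \bar x)(x_0 - \bar x)}{S_{xx}}\right] y_i,$$
so $\ell_i(x_0) = 1/N + (x_i - \bar x)(x_0 - \bar x)/S_{xx}$, and the intercept contributes the $1/N$ part of each weight.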
Some questions/answers re HW 1
- Do you have any hints for Q2?
  - As pointed out by Hala, two matrices span the same subspace if one is a (full-rank) linear transformation of the other. This should help...
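A quick numerical illustration of the hint (a toy example of my own, not part of the solution): if A is invertible, then X and X %*% A have the same column space, so their projection ("hat") matrices coincide.
set.seed(414)  ## reproducibility; all names here are mine
X = matrix(rnorm(12), 4, 3)  ## a 4 x 3 matrix, full column rank (w.p. 1)
A = matrix(c(2, 1, 0, 0, 1, 3, 1, 0, 1), 3, 3)  ## an invertible 3 x 3
P = X %*% solve(crossprod(X)) %*% t(X)  ## projection onto col(X)
XA = X %*% A
PA = XA %*% solve(crossprod(XA)) %*% t(XA)  ## projection onto col(XA)
all.equal(P, PA)  ## TRUE: the two column spaces are the same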
- How do I create a training data set using the sample function in R?
  - Try ?sample; this will show you how to get a random sample of integers. You use these as row labels.
myrows = sample(...)  ## I'll let you figure this part out
mytrain = winedata[myrows,]  ## choose these rows of the data.frame
mytest = winedata[-myrows,]  ## choose all the other rows of the data.frame
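For anyone who wants to see the whole pattern end to end, here is a minimal sketch on R's built-in iris data frame (the 2/3 split and all the names are my own choices, not the required ones):
set.seed(414)  ## so the split is reproducible
n = nrow(iris)
myrows = sample(1:n, size = round(2 * n / 3))  ## random 2/3 of the row labels
mytrain = iris[myrows,]  ## training rows
mytest = iris[-myrows,]  ## the remaining third for testing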
- I have a specific question here: in the course slides there are lines like "> library(ElemStatLearn)". I'm just wondering where I can obtain this library package.
  - Within R there is an option to install packages from CRAN. On my Mac it's a menu item: you highlight "Package Installer", and a new window opens with "Get List". Once you have the list (you need to be online), you search for "ElemStatLearn" and then click "Install Selected". Once all that is done, from within R you can load the package by typing "library(ElemStatLearn)". Sometimes the download and install doesn't work the first (or second...) time, so keep trying. But you can also get any of the book's data sets into R 'by hand', by going to the book website, downloading the data file to your computer, and using read.table or read.csv; you don't have to use the package ElemStatLearn. Note that the 'manual' for ElemStatLearn (on my web page) can be used to duplicate many of the analyses in the book.
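The same thing can also be done from the R prompt on any platform (a sketch; the file name in the last line is just an illustration):
install.packages("ElemStatLearn")  ## fetch the package from CRAN (you need to be online)
library(ElemStatLearn)  ## load it into the session
data(prostate)  ## e.g., one of the book's data sets
## or, 'by hand', after downloading a data file from the book website:
## mydata = read.csv("somefile.csv")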
- Specifically, the way I understand it is, using lm.ridge:
  1. the "smallest GCV value" (output from select(ridge_regression)) is the optimal value of lambda; and
  2. for different sequences of lambdas passed to lm.ridge(model, data, lambda=sequence), it will produce different values for the optimal lambda (or smallest GCV), depending on (obviously) the start and end of the sequence and (not so obvious to me...) how dense you make the sequence.
  - Yes, both of these are correct, although one would hope that the values don't depend too heavily on how dense the sequence is. I don't know if the difficulty is numerical, in the way things are implemented, or inherent in the GCV criterion; I haven't seen any discussion of this.
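To make the grid dependence concrete, here is a sketch on a built-in data set (longley); the wine data would be handled the same way, and the grid endpoints are arbitrary choices of mine:
library(MASS)  ## lm.ridge and select live here
lams = seq(0, 10, by = 0.01)  ## a fairly dense grid of lambdas
fit = lm.ridge(Employed ~ ., data = longley, lambda = lams)
select(fit)  ## reports, among others, the lambda with smallest GCV
lams[which.min(fit$GCV)]  ## the same lambda, extracted directly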
- So, in particular for the wine quality problem, how does one choose the optimal lambda value to calculate the regression coefficients with?