STA450S/4000S: Topics in statistics
Statistical Aspects of Data Mining


Spring, 2004
Textbook: Hastie, Tibshirani and Friedman. The Elements of Statistical Learning. Springer-Verlag.
Book web page

April 7, 2004

  • Project due before Friday April 16 5.00 pm
  • Slides

March 31, 2004

  • No class on Friday April 2
  • K-means clustering on the wine data works fine if the data are standardized first; see the slides
  • Slides
  • See help files for various R programs for details on cluster methods.
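
A minimal sketch of why standardizing matters for K-means. The data here are simulated as a stand-in (an assumption; this is not the wine data), with one informative variable on a small scale and one noise variable on a large scale. Only base R functions (kmeans, scale) are used; the helper agreement() is a hypothetical name introduced for illustration.

```r
# Simulated stand-in for the wine data (an assumption, not the course data).
set.seed(1)
n <- 100
group <- rep(1:2, each = n)

# x1 separates the two groups but is measured on a small scale;
# x2 is pure noise measured on a much larger scale.
x1 <- c(rnorm(n, mean = 0), rnorm(n, mean = 6))
x2 <- rnorm(2 * n, mean = 0, sd = 1000)
X <- cbind(x1, x2)

# K-means on the raw data: Euclidean distance is dominated by x2.
km_raw <- kmeans(X, centers = 2, nstart = 20)

# K-means after standardizing each column to mean 0 and sd 1.
km_std <- kmeans(scale(X), centers = 2, nstart = 20)

# Agreement with the true groups, allowing for label switching
# (hypothetical helper, for illustration only).
agreement <- function(cl, truth) {
  acc <- mean(cl == truth)
  max(acc, 1 - acc)
}

acc_raw <- agreement(km_raw$cluster, group)
acc_std <- agreement(km_std$cluster, group)
print(c(raw = acc_raw, standardized = acc_std))
```

On the raw data the clusters mostly follow the noise variable, because it dominates the distances; after scale() the informative variable drives the split.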

March 24, 2004

  • Homework 3 not due until Friday Mar 26
  • No tutorial on Friday, but I will be available to answer questions
  • Slides
  • Pictures to illustrate clustering are Figures 14.6, 14.12, 14.13, 14.14

March 17, 2004

  • Homework 3 not due until Friday, Mar 26
  • Slides (thanks to Ana-Maria Staicu)
  • Nice picture of regression tree (also thanks to Ana-Maria)

March 10, 2004


March 3, 2004

  • Lecture notes (hand).
  • Pictures.
  • Handout on fitting GAMs in R.
  • Plot from GAM in R on the heart data, showing the smooth curves for each covariate, smoothing chosen automatically.
  • Plot from GAM in R on the heart data, showing the smooth curves for each covariate, smoothing set to be equivalent to 4 df for each covariate. (This is closer to the plot in the book; see Figures 5.4 and 6.12).
  • Homework 3 due March 25, 2004.

February 25, 2004

  • Slides from today.
  • On p.9, there are 4 plots, and the code on the preceding page is missing two lines. First, the 4 plots to a page were obtained by
     par(mfrow=c(2,2))
    and then the top left plot is just the underlying function. The top right shows the results from loess. The bottom left (I described this incorrectly in class) compares loess with the kernel smooth on p.6, and you can see the more 'linear' behaviour of loess at the endpoints. Finally the bottom right shows the loess fit using local linear regression (red) compared to the loess fit using local linear regression with span 0.4 (green) and local quadratic regression with span 0.4 (purple).
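
A rough sketch of how a 2 x 2 page of plots like this could be produced in R. It is not the code from the slides: the data are simulated, the underlying function sin(4x), the noise level, and the kernel bandwidth are all assumptions, and the colours simply follow the description above.

```r
# Simulated data (an assumption; not the data used in the slides).
set.seed(2)
x <- sort(runif(100, 0, 1))
f <- function(x) sin(4 * x)          # assumed underlying function
y <- f(x) + rnorm(100, sd = 0.3)

par(mfrow = c(2, 2))                 # 4 plots to a page

# Top left: the underlying function.
plot(x, f(x), type = "l", main = "underlying function")

# Top right: loess with local linear fitting and the default span.
fit_red <- loess(y ~ x, degree = 1)
plot(x, y, main = "loess")
lines(x, fitted(fit_red), col = "red")

# Bottom left: loess compared with a kernel smooth
# (bandwidth 0.2 is an arbitrary choice for illustration).
ks <- ksmooth(x, y, kernel = "normal", bandwidth = 0.2, x.points = x)
plot(x, y, main = "loess vs kernel smooth")
lines(x, fitted(fit_red), col = "red")
lines(ks$x, ks$y, col = "blue")

# Bottom right: local linear and local quadratic fits with span 0.4.
fit_green  <- loess(y ~ x, span = 0.4, degree = 1)
fit_purple <- loess(y ~ x, span = 0.4, degree = 2)
plot(x, y, main = "span and degree")
lines(x, fitted(fit_red), col = "red")
lines(x, fitted(fit_green), col = "green")
lines(x, fitted(fit_purple), col = "purple")
```

Because x is sorted before fitting, fitted() returns values in plotting order, so lines() can be used directly.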

February 11, 2004

  • No tutorial class on February 13.
  • Homework #1 is due on February 13 at 4 pm, in SS 6002a or SS 6018.
  • Homework 2 is due on March 3.
  • Slides from today.
  • Have a nice midterm break!

February 6, 2004


February 5, 2004

    Typo on question two of HW 1 (in the expression for the prior density for beta, sigma should be replaced by tau). The online version has been corrected. (See below for link.)

February 4, 2004


January 30, 2004

  • Tutorial cancelled today. You can now run R directly on any of the Cquest workstations by typing "R" in a command window. It may be on the menu soon. If you are logging in remotely, it is still easier to use "/u/radford/R".

January 28, 2004: Classification

  • News: No office hour Friday at 2 (January 30). Guest lecturer (Rafal Kustra) February 4. (Chapter 5.1ff)
  • Re Homework 1: All computer code used to reach conclusions should be submitted, but as *appendices* only. The answers to the questions should not include computer code, although relevant graphs may be included.
  • I meant to mention: Logistic regression is discussed in Chapter 14 of the 302 textbook. Also Section 4.2 of the book discusses linear regression of y on x, when y is a 0-1 variable. It is essentially the same as lda, but doesn't generalize well to more than two classes.
  • Slides

January 23, 2004: Ridge regression in R

  • Here is a copy of an R session I did on cquest. It shows ridge regression. (It's pretty raw!) You can also do all possible subsets by first typing library("leaps",lib.loc="/u/reid"); see help(leaps) for more info.
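
For those who want something tidier than the raw session, here is a minimal sketch of ridge regression computed directly from its closed form. This is not the session linked above: it uses the built-in mtcars data rather than the course data, the lambda values are arbitrary, and ridge_coef() is a hypothetical helper name. The predictors are standardized and the response centred, so no intercept is penalized.

```r
# Ridge regression by hand on a built-in data set
# (mtcars is an assumption; it is not the data used in class).
X <- scale(as.matrix(mtcars[, c("cyl", "disp", "hp", "wt")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Closed-form ridge solution: (X'X + lambda * I)^{-1} X'y
# (hypothetical helper, for illustration only).
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

# Arbitrary grid of penalties; lambda = 0 gives ordinary least squares.
lambdas <- c(0, 1, 10, 100)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)
print(round(coefs, 3))
```

Each column holds the coefficients for one value of lambda; the coefficients shrink toward zero as lambda grows.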

January 21, 2004: Model Selection; Intro to Classification

Sections 3.4.0-3.4.3, 2.2, 4.4.1
  • copy of slides
  • Homework 1 due February 11. The data ("abalone.data") is in "/u/reid" on Cquest. A description of the data ("abalone.desc") is in the same place, and provided here for those of you who are not using Cquest. You can download the data from the UCI Machine Learning Site as well.
  • The teaching assistant for the course is Ana-Maria Staicu. She has an office hour on Thursday from 12 to 1 in the Stat Aid centre.

January 16, 2004: Using R for linear regression

  • My code for fitting the linear regression model in the text.
  • Help text for the "lm" routine.

January 14, 2004: Linear Regression

We covered 3.1-3.3 and a little of 3.4. Next week we will finish Chapter 3 and start Chapter 4.

January 7, 2004

If you missed the class and are interested in taking it please email me asap.
Undergraduates have the choice of downloading R to their own computer, or using R or Splus on cquest. You are also welcome to use Matlab. To get an account on Cquest go to the Cquest home page and request an account.
There is a wealth of information about the NMMAPS study on Francesca Dominici's web site.

January 9, 2004: Trying R

Once you have a Cquest account, or have downloaded R onto your computer, try running R (/u/radford/R on Cquest) and having a look at the prostate cancer data set. This is available on the book web page, but I have also put it into /u/reid on Cquest, so the following *should* work:

pr <- read.table("/u/reid/prostate.data", header=TRUE)

Then you can try things like dim(pr) and names(pr) and so on. Here is a set of annotated commands that you can try in R or Splus.

January 8, 2004: announcement

This announcement was also sent by email to everyone whose email address I have. If you know someone not on this list who wants to get class emails, please tell them to email me asap.

The hour from 1-2 each Friday will be used as a tutorial, particularly with regard to computing. It is fine with me if you attend STA 410 at that time instead; this will provide you with a lot of computing expertise and you probably won't need my help. Our office staff is checking to see if ROSI will let students enrol for both courses; any feedback you have for me on this would be helpful. Note also that I have an office hour on Friday from 2-3, so if you are attending 410 and have some particular computing/course issues to discuss with me, you could ask me during that hour instead. I am trying to find a room in SS for that hour but for the moment my office will be the location.

This Friday, come to 2111 at 1.10 (if you are not attending 410). We are still sorting out the details of the graduate portion of the class, so the grad students will be there to discuss various possible meeting times with Professor Kustra. I will accompany anyone from 450 who is interested to the Cquest Lab in Sid Smith, to provide any needed help in getting an account, etc.

The web pages for STA 410/2102 and STA 2201 have introductory material on R. The first has instructions for running on Cquest, and the latter assumes you are running it on fisher/utstat.

