
STA 2201S: Applied Statistics II Spring 2015
Final Project due April 15 11.59pm
The project report should be between three and five pages, and be a nontechnical summary of your analysis. It should not include any code, but it may include tables and plots. Make sure to include an introduction, to provide a detailed reference for the source of the data, and to state the scientific problem(s) of interest and your conclusions.
In a statistical appendix, describe the main statistical methods used and give a summary of the statistical results, including what models were considered, what models formed the basis for the report above, and why. In this appendix you can include code excerpts, additional plots, and tables, as needed.
Finally, an executable file (an R script, an R Markdown file, or a knitr file) is required that will enable me to reproduce the results used in your report. This file should include the data frame that you constructed from your dataset, so that I don't need to use read.table or read.csv.
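One way to include the data frame in the script itself is to paste in the output of dput(). The sketch below uses a small made-up data frame (project_data and its columns are hypothetical names, not part of the assignment) and checks that the deparsed text rebuilds the same object.

```r
# Hypothetical small data frame standing in for the project data
project_data <- data.frame(
  id    = 1:3,
  group = c("a", "b", "a"),
  y     = c(2.5, 3.1, 4.0)
)

# dput() prints R code that recreates the object;
# its output can be pasted directly into the submitted script
dput(project_data)

# Round-trip check: the deparsed text rebuilds an equal data frame
rebuilt <- eval(parse(text = paste(deparse(project_data), collapse = "\n")))
all.equal(project_data, rebuilt)
```

For a large dataset the pasted output can be long; that is expected, since the point is that the script runs without any external file.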
Homework 3
Due April 1, 11.59 pm on Blackboard. On the Blackboard web page you can find the
assignment under "Course Materials".
Homework 2
Due March 6, 11.59 pm on Blackboard. On the Blackboard web page you can find the
assignment under "Course Materials".

Here is a paper Archer found that discusses choosing between quasi-Poisson and negative binomial. If you use the ideas in this paper for your homework, be sure to include a reference.
 Q2(d). Q: Can we choose between quasi-Poisson and negative binomial using AIC? A: I don't think you can use AIC for the quasi-Poisson, because there is not a genuine log-likelihood. I would rely on plots and on a study of the mean-variance relationship.
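As a rough illustration of both points, the sketch below uses simulated overdispersed counts (not the homework data; all variable names are made up). It shows that AIC() is not available for a quasi-Poisson fit, and tabulates group means against group variances as one simple way to look at the mean-variance relationship.

```r
library(MASS)  # for glm.nb(); MASS is a recommended package shipped with R

set.seed(1)
n  <- 200
x  <- runif(n)
mu <- exp(1 + 2 * x)
y  <- rnbinom(n, mu = mu, size = 2)  # simulated overdispersed counts

fit_p  <- glm(y ~ x, family = poisson)
fit_qp <- glm(y ~ x, family = quasipoisson)
fit_nb <- glm.nb(y ~ x)

AIC(fit_qp)  # NA: the quasi-Poisson family has no log-likelihood
AIC(fit_nb)  # finite, since the negative binomial is a genuine likelihood

# Mean-variance check: split observations into 5 groups by fitted mean,
# then compare the sample variance with the sample mean in each group
grp <- cut(fitted(fit_p),
           breaks = quantile(fitted(fit_p), 0:5 / 5),
           include.lowest = TRUE)
m <- as.vector(tapply(y, grp, mean))
v <- as.vector(tapply(y, grp, var))
print(data.frame(mean = m, variance = v, ratio = v / m))
```

A roughly constant variance/mean ratio points towards the quasi-Poisson form (variance proportional to the mean), while a ratio that grows with the mean points towards the negative binomial form (variance quadratic in the mean).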
Q: If (ii) indicates that there is an association in one city but not in another, why would we be interested in (iii)? A: I *think* it could be the case in principle that you could have enough noise in the data that (iii) and (ii) could be compatible.
Archer found this resource, which is very clear. In particular, you might find it easier to think about the answers to the 3 parts by fitting sequences of Poisson GLMs of the form (D = disease; B = blood group; C = city):
D + B + C, DC + B, DB + C, D + BC, DB + BC, etc.
and figuring out how these submodels link with the 3 parts of the question.
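The sequence of models above could be fit along the following lines. This is only a sketch: the counts are invented for illustration and the level names are placeholders, so the deviances mean nothing for the actual homework data.

```r
# Hypothetical 2 x 2 x 3 table of counts (invented numbers, placeholder levels)
tab <- expand.grid(D = c("yes", "no"),
                   B = c("O", "A"),
                   C = c("city1", "city2", "city3"))
tab$count <- c(30, 120, 25, 140, 40, 200, 30, 190, 15, 90, 20, 100)

# Log-linear models as Poisson GLMs; D*C expands to D + C + D:C, and so on
fits <- list(
  "D + B + C" = glm(count ~ D + B + C,     family = poisson, data = tab),
  "DC + B"    = glm(count ~ D * C + B,     family = poisson, data = tab),
  "DB + C"    = glm(count ~ D * B + C,     family = poisson, data = tab),
  "D + BC"    = glm(count ~ D + B * C,     family = poisson, data = tab),
  "DB + BC"   = glm(count ~ D * B + B * C, family = poisson, data = tab),
  "DBC"       = glm(count ~ D * B * C,     family = poisson, data = tab)
)

# Residual deviance and degrees of freedom for each model
summary_tab <- data.frame(
  deviance = sapply(fits, deviance),
  df       = sapply(fits, df.residual)
)
print(summary_tab)

# Nested models can be compared with a likelihood ratio (deviance) test
anova(fits[["D + B + C"]], fits[["DB + C"]], test = "Chisq")
```

Each interaction term that is present or absent corresponds to a statement about association or (conditional) independence among D, B and C, which is the link to the parts of the question.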
 Q2(a): You will want to refer to the AOAS paper for answering this question. It is not a standard generalized linear model of the type I described in class, unless \(\nu\) is considered fixed. So you can assume this for putting it in the GLM form. It is however a two-parameter exponential family, so if you interpret \(\theta = (\log\lambda, \nu)\), then the question can be answered as stated. Either version is fine.
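For reference, here is a sketch of the two readings, using the usual form of the Conway-Maxwell-Poisson probability function. The symbols \(Z\), \(s\), \(b\) and \(c\) are notation introduced here for illustration and may not match the notation used in class or in the paper.

\[
P(Y = y \mid \lambda, \nu) = \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda,\nu)}, \qquad
Z(\lambda,\nu) = \sum_{s=0}^{\infty} \frac{\lambda^{s}}{(s!)^{\nu}}, \qquad y = 0, 1, 2, \dots
\]

\[
\log P(Y = y \mid \lambda, \nu) = y \log\lambda \;-\; \nu \log(y!) \;-\; \log Z(\lambda,\nu).
\]

With \(\nu\) fixed, this has the one-parameter form \(y\theta - b(\theta) + c(y)\) with \(\theta = \log\lambda\), \(b(\theta) = \log Z(e^{\theta}, \nu)\) and \(c(y) = -\nu\log(y!)\). With both parameters free, it is a two-parameter exponential family with \(\theta = (\log\lambda, \nu)\) and statistics \((y, -\log(y!))\).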
 Q2(d): Thanks to AlexAntoine for pointing out that the CMP model cannot be estimated using the Galapagos Island data.
I've revised the question, suggesting that you try the negative binomial model instead, which can be fit.
It's possible that a rate model is better for this data, if we think that the number of species might be proportional to the area of the island. Bonus marks for exploring this.
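A rate model can be fit by including log(area) as an offset, which forces its coefficient to be 1. The sketch below uses simulated data as a stand-in (it is not the Galapagos data, and species, area and elev are made-up variable names) and compares the offset model with one where the coefficient of log(area) is estimated.

```r
library(MASS)  # for glm.nb(); MASS is a recommended package shipped with R

set.seed(2)
n       <- 30
area    <- exp(rnorm(n, mean = 2, sd = 1.5))  # hypothetical island areas
elev    <- runif(n, 50, 1500)                 # hypothetical elevations
mu      <- area * exp(-1 + 0.001 * elev)      # mean proportional to area
species <- rnbinom(n, mu = mu, size = 3)      # simulated species counts

# Count model: coefficient of log(area) is estimated freely
fit_count <- glm.nb(species ~ log(area) + elev)

# Rate model: log(area) enters as an offset, i.e. its coefficient is fixed at 1
fit_rate <- glm.nb(species ~ elev + offset(log(area)))

# If the rate model is reasonable, the free coefficient should be close to 1
coef(fit_count)["log(area)"]

AIC(fit_count, fit_rate)
```

Since the rate model is the count model with one coefficient constrained, the two fits can also be compared informally through the estimate and standard error of the log(area) coefficient in summary(fit_count).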
 In Q1, the notation \(\underline y\) means the vector of all the observations \((y_{111}, \dots, y_{JKL})\)
 Homework Questions Feb 18: Q2(d) changed; Feb 13: Typos corrected
Latex source
 Jager & Leek, for Q3
 Sellers & Shmueli, for Q2. This paper on generalized linear models with the Conway-Maxwell-Poisson distribution appeared in the Annals of Applied Statistics in 2010.
Homework 1
 Marking Scheme
 Corrections and clarifications:
 On Jan.27, Q2 (b) and (d) were updated.
 Q3: Several students have asked: "What is the meaning of the main analysis of this endpoint?"
A: "Main analysis" means "what statistical analysis did they use to study this response". Often there is more than one analysis, but one in particular leads to the result emphasized in the abstract and conclusions. If there is more than one, just say so.
 Q2(d): The HW sheet was changed on Jan. 27. As of today (Feb 3) you only need to show the first part (with p's all equal).
 Q2(b): Use the result \(\sum y_i = \sum n_i \hat p_i\), which is true as long as the design matrix has a column of 1's.
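A sketch of where this identity comes from, assuming a binomial model with the canonical logit link (the notation \(x_{ij}\), \(\beta_j\) and \(\ell\) is introduced here and may differ from the homework sheet):

\[
\ell(\beta) = \sum_i \left\{ y_i \log p_i + (n_i - y_i)\log(1 - p_i) \right\} + \text{const}, \qquad
\log\frac{p_i}{1 - p_i} = \sum_j x_{ij}\beta_j ,
\]

\[
\frac{\partial \ell}{\partial \beta_j} = \sum_i x_{ij}\,(y_i - n_i p_i) = 0 \quad \text{at } \hat\beta .
\]

If column \(j\) of the design matrix is a column of 1's, then \(x_{ij} = 1\) for all \(i\), and the corresponding score equation reads \(\sum_i y_i = \sum_i n_i \hat p_i\).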
 Homework Questions Updated Jan 27
 Latex for Homework Questions
 Reference paper for Q1
April 1
March 25
March 18
March 11
 Slides
 R script
 Jenny Bryan, again, this time with a Shiny App illustrating a catalogue of graphics and the R code to draw them
March 4
February 25
February 11
 Slides (updated Feb 16, using photos of blackboard)
 Measles web pages
February 4
 Slides
 iPad version
 Data Scientist: "the sexiest job of the 21st century", Harvard Business Review
 Yihui Xie's web page for knitr
 More or Less podcasts on the BBC. "WS Global Wealth 24 Jan 15" discusses the Oxfam report. "WS Bad Luck and Cancer 10 Jan 15" reviews the Science article.
January 28
January 21
January 14
January 7
Text
Extending the Linear Model with R by J. Faraway.
Recommended
Statistical Models by A.C. Davison.
Principles of Applied Statistics by D.R. Cox and C.A. Donnelly
Computing
You are welcome to use the statistical computing package of your choice, but I will refer exclusively to the R computing package. Some online resources that I've found helpful are:
