Introduction

In the first part of this project, you will download data, process it to be readable in R, and then use ANOVA and t-tests to analyze it. In the second part of the project, you will use simulation to evaluate how well ANOVA and t-tests work in practice, and how robust they are.

Your analysis for this project is required to be reproducible: you will submit a PDF file containing your report, as well as an R Markdown file that the TAs can use to generate your report using Knitr. You will also submit the CSV file that you created. That way, the TAs should be able to regenerate your report simply by running Knitr.

Your report should be well-written and readable: you will lose marks if your report is hard to understand.

The Dataset

The City of Toronto makes a large number of datasets available via its Open Data portal. We will be using the 2015 Community Grants Allocation dataset (also available here.)

Problem 1: Processing the File and Reading it into R (5%)

Your first task is to process the file and read it into R. One option is to use the read.xls function (from the gdata package), but it is better (and required for the project) to edit the file in Microsoft Excel (or a free alternative, such as LibreOffice Calc), convert it to CSV, and then read the CSV into R. An example of how to do this is here:

Create a CSV file, which you will then work with in R for the rest of the first part of the project.
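A minimal sketch of the reading step follows. It writes a small toy CSV to a temporary folder so that it runs on its own; in the project you would read your own grants.csv instead. The column names Ward, Service.Area and Amount are hypothetical stand-ins, and the real file's columns may differ.

```r
# Hypothetical toy data written to a temporary CSV so the example is
# self-contained; replace csv_path with the path to your own file.
toy <- data.frame(
  Ward = c(1, 2, 2),
  Service.Area = c("Local", "Local", "City-wide"),
  Amount = c(15000, 22000, 250000)
)
csv_path <- file.path(tempdir(), "grants_toy.csv")
write.csv(toy, csv_path, row.names = FALSE)

# stringsAsFactors = FALSE keeps text columns as character vectors.
grants <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick checks that the columns were read with the expected types.
str(grants)
head(grants)
```

If the amounts are exported with currency symbols or thousands separators, read.csv will treat that column as text rather than numbers, so it is worth checking the output of str() before going further.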

Analyzing the Data

Problem 2: Preliminary Analysis (10%)

The City of Toronto is subdivided into 44 Wards, and data is available for each of the wards. Suppose that we are interested in knowing whether there is evidence that some wards receive larger grants, on average. (Note that this is a different question from whether some wards receive larger total amounts of money.)

Display the grant amounts for the different wards using a boxplot. In your report, explain why it is not reasonable to use ANOVA on the entire dataset. Refer both to the boxplot, and to the Service Area column.
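As a rough illustration of the plotting call only, the sketch below draws one box per ward for made-up data. The data frame, its column names (Ward, Amount) and the simulated amounts are all assumptions; they are not the Toronto dataset.

```r
set.seed(1)
# Hypothetical stand-in for the grants data: 4 wards with skewed amounts
# and unequal numbers of grants per ward.
grants <- data.frame(
  Ward = factor(rep(1:4, times = c(5, 8, 1, 6))),
  Amount = round(rlnorm(20, meanlog = 10, sdlog = 1))
)

# One box per ward; the formula reads "Amount grouped by Ward".
bp <- boxplot(Amount ~ Ward, data = grants,
              xlab = "Ward", ylab = "Grant amount ($)",
              main = "Grant amounts by ward (toy data)")

# bp$n holds the number of observations behind each box.
bp$n
```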

Problem 3: F-test (10%)

Using R (do not edit the CSV file), remove the lines in the dataset for which the Service Area is City-wide, and the lines where the amount is over $200,000. For the rest of the dataset (the dataset of small grants), display and briefly analyze the appropriate diagnostic plots, and then perform an F-test to test the hypothesis that the Wards don’t differ from each other in terms of the average size of the grants given to them. What can be concluded from the data?
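One way the filtering, diagnostics and F-test could fit together is sketched below on made-up data. The column names and the "City-wide" label are assumptions about how the CSV was prepared; adjust them to match your own file.

```r
set.seed(2)
# Hypothetical stand-in data; real column names and labels may differ.
grants <- data.frame(
  Ward = factor(sample(1:5, 60, replace = TRUE)),
  Service.Area = sample(c("Local", "City-wide"), 60,
                        replace = TRUE, prob = c(0.8, 0.2)),
  Amount = round(rlnorm(60, meanlog = 10.5, sdlog = 1))
)

# Keep rows that are not City-wide and whose amount is at most $200,000.
small <- subset(grants, Service.Area != "City-wide" & Amount <= 200000)
small$Ward <- droplevels(small$Ward)  # drop wards with no rows left

# One-way ANOVA of grant amount on ward.
fit <- aov(Amount ~ Ward, data = small)

# Diagnostic plots: residuals vs fitted values, and a normal Q-Q plot.
par(mfrow = c(1, 2))
plot(fit, which = 1:2)

# ANOVA table with the F statistic and its p-value.
summary(fit)
```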

Problem 4: Pairwise t-tests (20%)

For the dataset from Problem 3, perform pairwise t-tests for differences of mean grant sizes between the different wards. Perform the t-tests:

Without using any adjustments;
Using Bonferroni adjustments;
Using Tukey’s HSD.

Briefly describe the results, and the differences that you observe between the three different procedures.
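The three procedures can be sketched with base R functions as follows. The toy data (four groups labelled A to D with equal true means) is hypothetical and only there to make the calls runnable.

```r
set.seed(3)
# Hypothetical toy data: 4 groups of 10 observations with equal true means.
toy <- data.frame(
  Ward = factor(rep(c("A", "B", "C", "D"), each = 10)),
  Amount = rnorm(40, mean = 50, sd = 10)
)

# Unadjusted pairwise t-tests (a pooled SD is used by default).
p_none <- pairwise.t.test(toy$Amount, toy$Ward, p.adjust.method = "none")

# Bonferroni: each p-value is multiplied by the number of comparisons
# and capped at 1.
p_bonf <- pairwise.t.test(toy$Amount, toy$Ward,
                          p.adjust.method = "bonferroni")

# Tukey's HSD works from a fitted one-way ANOVA model.
tukey <- TukeyHSD(aov(Amount ~ Ward, data = toy))

p_none$p.value
p_bonf$p.value
tukey$Ward  # columns: diff, lwr, upr, p adj
```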

(Answer this regardless of the results that you actually obtain) If you do not observe significant differences when using Bonferroni adjustment/Tukey’s HSD, is it appropriate to conclude that there are no significant differences between any of the means? Briefly explain your answer.

You will note that the R functions for performing pairwise t-tests will not work without modifying the dataset by removing all the lines which contain information about Wards for which only one datapoint is available. Explain why that is.

I am supplying you with a function that you can use in order to remove such lines from a dataframe:

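# Keep only the rows whose Ward appears more than once in dat.
get_only_multiple_occ <- function(dat){
  # Start from an empty data frame with the same columns as dat.
  new_dat <- data.frame(matrix(ncol = ncol(dat), nrow = 0))
  colnames(new_dat) <- colnames(dat)

  # seq_len() avoids iterating over 1:0 when dat has no rows.
  for (i in seq_len(nrow(dat))){
    # Count the rows that share row i's Ward; keep row i if there are several.
    if (sum(dat[i, "Ward"] == dat[, "Ward"]) > 1){
      new_dat <- rbind(new_dat, dat[i, ])
    }
  }
  return(new_dat)
}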

If you include it in your Rmd file, you can call dat <- get_only_multiple_occ(dat) to obtain a dataframe dat that only contains data for wards for which there are multiple datapoints.
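For comparison, the same kind of filtering can be sketched without a loop by counting rows per ward with table(). This is an alternative illustration on hypothetical toy data, not a replacement for the supplied function.

```r
# Hypothetical toy data: wards 1 and 3 appear once, wards 2 and 4 repeat.
toy <- data.frame(
  Ward = c(1, 2, 2, 3, 4, 4, 4),
  Amount = c(10, 20, 30, 40, 50, 60, 70)
)

# Number of rows for each ward.
ward_counts <- table(toy$Ward)

# Names of the wards that occur more than once.
keep <- names(ward_counts)[ward_counts > 1]

# Keep only the rows belonging to those wards.
multi <- toy[as.character(toy$Ward) %in% keep, ]
multi
```

If Ward is stored as a factor, the unused levels remain after filtering; droplevels() removes them.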

Simulation in the Context of Multiple Comparisons

In this part of the project, we will be analyzing the power and robustness of the procedures we used in the first part. Since it is easier to handle, we will be using the Platyfish dataset from Lecture 2 as a starting point.

Briefly, the dataset consists of data on the percentage of time that each of the 84 females spent with the yellow-swordtailed male. There were 6 pairs of males (one with a yellow swordtail and one with a transparent swordtail in each pair), and for each pair, there were 14 females for which the percentage was measured.

Problem 5 (10%)

Run a simulation to determine how much of the time at least one significant difference will be detected if all pairwise t-tests are performed every time at 95% confidence, in a situation where all the means are actually the same. To do that, repeatedly simulate a dataset that is similar to the Platyfish dataset, with all the means exactly the same, and with the ANOVA assumptions satisfied. Clearly state what results you obtain.

The following piece of code should help you get started – make sure you understand it first, and then use it to start solving the question.

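# Empty data frame that will accumulate the simulated observations.
dat <- data.frame(matrix(ncol=2, nrow=0))
colnames(dat) <- c("Pair", "Percentage")

# One-row template that is refilled and appended on each iteration.
new_dat <- data.frame(matrix(ncol=2, nrow=1))
colnames(new_dat) <- c("Pair", "Percentage")

# Simulate the 14 females for the first pair of males.
pair <- "Pair1"
for (i in 1:14){
  percentage <- rnorm(1)   # one draw from a standard normal distribution
  new_dat["Pair"] <- pair
  new_dat["Percentage"] <- percentage
  dat <- rbind(dat, new_dat)
}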
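One possible way to extend this to all six pairs and to repeat the experiment is sketched below. The helper names simulate_null and any_significant are hypothetical, standard normal draws are used as in the starter code, and the number of repetitions is kept small so that the sketch runs quickly; a real answer would likely use more.

```r
set.seed(4)
n_pairs <- 6       # pairs of males
n_per_pair <- 14   # females measured per pair

# One simulated dataset in which all pairs share the same mean and variance.
simulate_null <- function() {
  data.frame(
    Pair = factor(rep(paste0("Pair", 1:n_pairs), each = n_per_pair)),
    Percentage = rnorm(n_pairs * n_per_pair)
  )
}

# TRUE if any pairwise t-test (with the given p-value adjustment)
# is significant at level alpha.
any_significant <- function(dat, adjust = "none", alpha = 0.05) {
  p <- pairwise.t.test(dat$Percentage, dat$Pair,
                       p.adjust.method = adjust)$p.value
  any(p < alpha, na.rm = TRUE)
}

# Proportion of simulated datasets with at least one "significant" pair.
n_sim <- 200
hits <- replicate(n_sim, any_significant(simulate_null()))
mean(hits)
```

Because every observation is drawn from the same distribution, any significant difference found here is a false positive.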

Problem 6 (10%)

Run a simulation to determine how much of the time at least one significant difference will be detected if all pairwise t-tests are performed every time at 95% confidence, in a situation where all the means are actually the same, when the Bonferroni adjustment is used. Clearly state what results you obtain, and compare your results to what you got in Problem 5.
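The arithmetic behind the Bonferroni adjustment for six pairs can be sketched as follows. The independence calculation is only a rough reference point: the pairwise tests share data and are therefore not independent, so a simulation may give a different rate.

```r
alpha <- 0.05
m <- choose(6, 2)   # number of pairwise comparisons among 6 pairs

# Bonferroni tests each comparison at level alpha / m, so the chance of
# at least one false positive is at most m * (alpha / m) = alpha.
alpha_per_test <- alpha / m

# Rough reference only: if the m unadjusted tests were independent
# (they are not), the chance of at least one false positive would be
fwer_indep <- 1 - (1 - alpha)^m

c(m = m, alpha_per_test = alpha_per_test, fwer_indep = fwer_indep)
```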

Problem 7 (5%)

Recall that, for the actual data, the F-test did not provide evidence that the pairs differed from each other. Describe a plausible scenario where the pair means are actually different from each other. (That is, tell a plausible story explaining why the true means could be different.)

Problem 8 (15%)

Find values for the means (and variances) of the different pairs for which the F-test would find that at least one of the means is significantly different from another about half of the time. State the values, and include code for a simulation (as well as the output of the simulation) which confirms your results.
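A simulation of this kind could be organised around a small helper that estimates how often the F-test rejects for a given set of means. The function name estimate_power and the example inputs below are hypothetical; they only exercise the function and are not values that give a rejection rate of one half.

```r
set.seed(5)
# Estimate the rejection rate of the one-way ANOVA F-test by simulation.
# means: one true mean per pair; sd: common standard deviation;
# n: females per pair; n_sim: number of simulated datasets.
estimate_power <- function(means, sd = 1, n = 14, n_sim = 500, alpha = 0.05) {
  groups <- factor(rep(seq_along(means), each = n))
  rejections <- replicate(n_sim, {
    # Normal data with the requested group means and a common SD.
    y <- rnorm(length(groups), mean = rep(means, each = n), sd = sd)
    # p-value of the overall F-test.
    p <- anova(lm(y ~ groups))[["Pr(>F)"]][1]
    p < alpha
  })
  mean(rejections)
}

# Hypothetical inputs chosen only to exercise the function.
power_null <- estimate_power(rep(0, 6))            # all means equal
power_far <- estimate_power(c(0, 0, 0, 0, 0, 5))   # one mean far away
c(null = power_null, far = power_far)
```

Intermediate choices of the means can then be tried until the estimated rate is close to the target.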

Problem 9 (15%)

Find a situation where one of the ANOVA assumptions is violated, and where this leads to the F-test not behaving as expected. Precisely state how you generate the data in such a situation, and how this situation actually might come about (i.e., describe a scenario in the Platyfish experiment which would cause the data to behave like it does in your simulation). Include a simulation (and the outputs of a simulation) that confirms your findings.

Problem 10 (5% bonus – submit separately)

The Platyfish data may not conform exactly to the assumptions of the ANOVA model. Determine whether this makes the F-test more or less conservative, confirm your claims with simulations, and clearly state what the implications are for the conclusion that the pairs differ from each other (i.e., how likely is that conclusion to be wrong?)

What to Submit

Submit the file anova.Rmd, which includes your write-up and all the code that you wrote. Submit the file anova.pdf generated from the Rmd file. Submit the CSV file that you made, named grants.csv. We should be able to generate your PDF file from the Rmd and CSV files.

STA303/STA1002 Project 1