Problem 1: processing the file and reading it into R

We read in the CSV file, making sure to treat the \(\verb;Ward;\) variable as categorical and the \(\verb;Total.Allocations;\) variable as numerical.

dat <- read.csv("grants.csv")
dat$Ward <- factor(dat$Ward)
dat$Alloc <- as.numeric(dat$Total.Allocations)

Problem 2: Preliminary Analysis

We now produce box-plots for the whole dataset.

require(lattice)
bwplot(Alloc~Ward, data=dat)

From the figure, it can be seen that the grant amounts are heavily right-skewed and have outliers. Therefore the normality assumption in ANOVA is violated.

Service Area is a confounder. It also makes no sense to look at grants that are meant for projects that benefit the entire city if we’re interested in money allocated to individual wards.
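
For reference, the Service Area categories (including the two spellings of the city-wide label that are filtered out below) can be listed directly; this is a quick check rather than part of the required analysis.

# Counts of grants in each Service Area category
table(dat$Service.Area)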

Problem 3: F-test

Let’s now remove the lines in the dataset for which the Service Area is City-wide or the amount is $200,000 or more. We also drop grants with missing amounts and keep only wards that appear more than once (this is needed for the pairwise t-tests in Problem 4).

# Keep only the rows whose Ward value occurs more than once in the dataset
get_only_multiple_occ <- function(dat){
    # Start with an empty data frame that has the same columns as dat
    new_dat <- data.frame(matrix(ncol = dim(dat)[2], nrow=0))
    colnames(new_dat) <- colnames(dat)
    
    for (i in 1:dim(dat)[1]){
        # Keep row i only if its Ward appears at least twice overall
        if (sum(dat[i, "Ward"]==dat[, "Ward"]) > 1){
            new_dat <- rbind(new_dat, dat[i,])
        }
    }
    return(new_dat)
}


dat <- dat[dat$Service.Area!="City-wide" & dat$Service.Area!="City wide",]
dat <- dat[!is.na(dat$Alloc),]
dat <- dat[dat$Alloc<200000,]
dat <- get_only_multiple_occ(dat)
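
As a quick sanity check (not part of the required analysis), we can confirm that no city-wide rows remain and that every remaining ward has at least two grants:

# Should be 0: no city-wide rows left
sum(dat$Service.Area %in% c("City-wide", "City wide"))
# Should be TRUE: every remaining ward occurs at least twice
all(table(droplevels(dat$Ward)) >= 2)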

Now, let’s display the boxplot for the new dataset.

bwplot(Alloc~Ward, data=dat)

It can be seen that the grant amounts are less right-skewed and contain far fewer outliers. We should expect that: we’ve made it so that the largest possible grant is just under $200,000. We’ve justified that by saying that we’re interested in the average small grant. It’s likely that there are such categories as small and large grants, but simply removing grants of $200,000 or more is a crude approximation, since it leaves in some medium-sized grants and leaves out others. We can display a histogram of the grant sizes to explore this:

library(ggplot2)
dat_temp <- read.csv("grants.csv")
dat_temp$Alloc <- as.numeric(dat_temp$Total.Allocations)
qplot(dat_temp$Alloc)

library(tigerstats)
model1 <- lm(Alloc~Ward, data=dat)
densityplot(model1$residuals)

The figure above is the density plot for the residuals of the linear model (assuming that the mean grant amounts are different for different wards). On that plot, we can’t really see that the residuals are not normal, but we can see that the variances differ substantially.
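
To look at the differing variances more directly, one option (not required by the problem) is to compute the per-ward standard deviations of the grant amounts:

# Standard deviation of the grant amounts within each ward
tapply(dat$Alloc, droplevels(dat$Ward), sd)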

Let’s now plot the residuals vs. the fitted values.

plot(model1, 1)

We see that the distribution of the residuals is right-skewed, and the variance appears to not be exactly constant.

The QQ plot confirms that the residuals are not exactly normally distributed (we’d expect the plot to be approximately a straight line with slope 1 for normally distributed residuals).

plot(model1, 2)

We cannot really conclude much from the F-test because the ANOVA assumptions are violated (Note on grading: no marks will be taken off if you somewhat reasonably argue the contrary.)

anova_result <- anova(lm(Alloc~Ward, data=dat))
print(anova_result)
## Analysis of Variance Table
## 
## Response: Alloc
##            Df     Sum Sq    Mean Sq F value  Pr(>F)  
## Ward       37 1.0115e+11 2733806621  1.4472 0.05575 .
## Residuals 220 4.1558e+11 1888986810                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, the p-value of the F-test is 0.0558. Even if it were reasonable to use the F-test here, we would not conclude that there is strong evidence for the Wards being different, although there is weak evidence for that. In plain language, we could say (if the ANOVA assumptions held) that there is only weak evidence that different wards receive different grant amounts.
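
The quoted p-value can also be extracted from the ANOVA table programmatically; the same Pr(>F) column is used in the simulation code below.

anova_result$`Pr(>F)`[1]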

Note, however, that not all is lost. It is still possible to run simulations in order to check whether the statistical tests remain usable despite the violated assumptions.

Problem 4: pairwise t-tests

Here are the results without any adjustment.

no_adj <- with(dat, pairwise.t.test(Alloc, Ward, p.adj = "none"))

Here are the results with the Bonferroni adjustment.

bonf_adj <- with(dat, pairwise.t.test(Alloc, Ward, p.adj = "bonf"))

Tukey’s HSD (results are omitted).

tukey_adj <- TukeyHSD(aov(lm(Alloc ~ Ward, data= dat)), conf.level = 0.95)

Without any adjustment, the minimum p-value is 0.001, which is highly significant.

Using the Bonferroni adjustment, the minimum p-value becomes 0.7011. Using Tukey’s HSD, the minimum p-value is 0.2572. In both cases, none of the pairwise t-tests gives a significant result.
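
For completeness, here is one way to pull the minimum p-values quoted above out of the three result objects (a small sketch, using the objects defined above):

min(no_adj$p.value, na.rm = TRUE)              # smallest unadjusted p-value
min(bonf_adj$p.value, na.rm = TRUE)            # smallest Bonferroni-adjusted p-value
min(tukey_adj$Ward[, "p adj"], na.rm = TRUE)   # smallest Tukey HSD adjusted p-value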

The no-adjustment procedure produces the most significant p-values. Both the Bonferroni adjustment and Tukey’s HSD adjust the p-values to be less significant. The Bonferroni adjustment is more conservative than Tukey’s HSD, as it yields higher adjusted p-values.

It is not appropriate to conclude that there is no significant difference between any of the means. First, without further analysis, failing to reject the hypothesis that the means are equal does not allow us to conclude that they are equal.

Second, the appropriate test to use here is the F-test. Both the Bonferroni adjustment and Tukey’s HSD are quite conservative (i.e., they may fail to reject even when there are fairly large differences); they are meant for identifying which particular pairs differ.

Why do we need to remove the Wards for which there is just one data point?

In order to perform a two-sample t-test the variance needs to be estimated for both groups. It’s impossible to estimate the variance if there is just one data point.
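
As a quick illustration (not part of the original analysis), R returns NA for the sample variance of a single observation, so there is nothing to plug into a t-statistic for a ward with only one grant:

var(c(150000))  # the sample variance of a single value is NA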

Problem 5: simulating the no-adjustment procedure

Here is the most straightforward way to write the simulation code:

set.seed(0)
p <- c()

sig_t_test <- c()

n_sims <- 1000
mu <- 60
sigma <- 15

for (i in 1:n_sims){
  dat <- data.frame(matrix(ncol=2, nrow=0))
  colnames(dat) <- c("Pair", "Percentage")
  
  new_dat <- data.frame(matrix(ncol=2, nrow=1))
  colnames(new_dat) <- c("Pair", "Percentage")
  
  
  pair <- "Pair1"
  for (i in 1:14){
    percentage <- rnorm(1, mean=mu, sd=sigma)
    new_dat["Pair"] <- pair
    new_dat["Percentage"] <- percentage
    dat <- rbind(dat, new_dat)
  }
  
  pair <- "Pair2"
  for (i in 1:14){
    percentage <- rnorm(1, mean=mu, sd=sigma)
    new_dat["Pair"] <- pair
    new_dat["Percentage"] <- percentage
    dat <- rbind(dat, new_dat)
  }
  
  pair <- "Pair3"
  for (i in 1:14){
    percentage <- rnorm(1, mean=mu, sd=sigma)
    new_dat["Pair"] <- pair
    new_dat["Percentage"] <- percentage
    dat <- rbind(dat, new_dat)
  }
  
  pair <- "Pair4"
  for (i in 1:14){
    percentage <- rnorm(1, mean=mu, sd=sigma)
    new_dat["Pair"] <- pair
    new_dat["Percentage"] <- percentage
    dat <- rbind(dat, new_dat)
  }
  
  pair <- "Pair5"
  for (i in 1:14){
    percentage <- rnorm(1, mean=mu, sd=sigma)
    new_dat["Pair"] <- pair
    new_dat["Percentage"] <- percentage
    dat <- rbind(dat, new_dat)
  }
  
  pair <- "Pair6"
  for (i in 1:14){
    percentage <- rnorm(1, mean=mu, sd=sigma)
    new_dat["Pair"] <- pair
    new_dat["Percentage"] <- percentage
    dat <- rbind(dat, new_dat)
  }
  
  
  p<- c(p, anova(lm(Percentage~Pair, data=dat))$`Pr(>F)`[1])

  ttestpvals <-with(dat, pairwise.t.test(Percentage, Pair, p.adj = "none"))$p.value
  sig_t_test <- c(sig_t_test, sum(ttestpvals<0.05, na.rm=TRUE)>0)
}

result <- mean(sig_t_test)*100

From the simulation results, 34.8% of the time there is at least one significant result among the pairwise t-tests. This is much higher than 5%.
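
For comparison, here is a rough back-of-the-envelope calculation (illustrative only; the 15 pairwise tests are not independent, since they share data and a pooled variance estimate). If they were independent, the chance of at least one false rejection at the 5% level would be about 54%:

1 - 0.95^choose(6, 2)  # probability of at least one rejection among 15 independent 5%-level tests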

While in this course we are not concerned with the efficiency of the code we write, note that, in general, explicit for-loops are discouraged in R. Here is a way to write the code using replicate instead.

sim <- function(n_pairs, n_obs, mu, sigma){
  dat <- data.frame(Pair = paste('Pair', rep(1:n_pairs, each = n_obs), sep = ''),
                    Percentage = rnorm(n_pairs*n_obs, mu, sigma))
  ttestpvals <-with(dat, pairwise.t.test(Percentage, Pair, p.adj = "none"))$p.value
  sig_t_tests <- any(ttestpvals < 0.05, na.rm = TRUE)
  return(sig_t_tests)
}

n_sims <- 1000
n_pairs <- 6
n_obs <- 14
mu <- 60
sigma <- 15
t_test_rej <- replicate(n_sims, sim(n_pairs, n_obs, mu, sigma))
mean(t_test_rej)*100

Problem 6: simulating the Bonferroni adjustment

set.seed(0)
n_sims <- 1000
n_pairs <- 6
n_obs <- 14
mu <- 60
sigma <- 15

p_values <- rep(NA, times = n_sims)
sig_t_test <- rep(NA, times = n_sims)


for (i in 1:n_sims) {
    dat <- data.frame(Pair = paste('Pair', rep(1:n_pairs, each = n_obs), sep = ''),
                      Percentage = rnorm(n_pairs*n_obs, mean=mu, sd=sigma))
    ttestpvals <-with(dat, pairwise.t.test(Percentage, Pair, p.adj = "bonf"))$p.value
    p_values[i] <- anova(lm(Percentage ~ Pair, data = dat))$`Pr(>F)`[1]
    sig_t_test[i] <- any(ttestpvals < 0.05, na.rm = TRUE)
}

result <- mean(sig_t_test)*100

From the simulation results, 3.3% of the time there is at least one significant result among the pairwise t-tests. This rate is below 5%, as we would expect: the Bonferroni adjustment controls the familywise error rate at (or below) the 5% level.

Problem 7

Think about what it means for the means of the pairs to be different – it means, for example, that for Pair 1, females spent 80% of the time with the male with the yellow swordtail, while for Pair 2, females spent 20% of the time with the male with the yellow swordtail.

A straightforward explanation here is that in Pair 1, the male with the yellow swordtail is more attractive, and in Pair 2, the male with the transparent swordtail was more attractive. This could happen for reasons that have nothing to do with swordtails. Perhaps the colour of the fins is what matters for attraction in platyfish, and the yellow-swordtailed male in Pair 1 and the transparent-tailed male in Pair 2 had the right colour fins.

Problem 8

One possible choice of means is c(72, 60, 60, 60, 60, 60), with the standard deviation equal to 15 for all pairs.

set.seed(0)
n_sims <- 1000
n_pairs <- 6
n_obs <- 14
p_values <- rep(NA, times = n_sims)
sig_t_test <- rep(NA, times = n_sims)
mean_vec <- rep(c(72, 60, 60, 60, 60, 60), each = n_obs)

for (i in 1:n_sims) {
    dat <- data.frame(Pair = paste('Pair', rep(1:n_pairs, each = n_obs), sep = ''),
                      Percentage = rnorm(n_pairs*n_obs, mean = mean_vec, sd=15))
    ttestpvals <-with(dat, pairwise.t.test(Percentage, Pair, p.adj = "none"))$p.value
    p_values[i] <- anova(lm(Percentage~Pair, data=dat))$`Pr(>F)`[1]
    sig_t_test[i] <- any(ttestpvals<0.05, na.rm=TRUE)
}

mean(p_values < 0.05)*100
## [1] 48.6

With this choice of means, the F-test rejects the null hypothesis of equal means in about half (48.6%) of the simulated datasets.

Problem 9

ANOVA requires that the variances be the same for all pairs. If we set the standard deviation for the first pair to be 5 times larger than for the others, the proportion of significant results is much higher than 5%, even though all the means are equal. This is a problem: the test rejects a true null hypothesis more often than the nominal 5%, so Type I errors occur more often than advertised.

set.seed(0)
n_sims <- 1000
n_pairs <- 6
n_obs <- 14
p_values <- rep(NA, times = n_sims)
sig_t_test <- rep(NA, times = n_sims)
sd_est <- c(20, 4, 4, 4, 4, 4)
sd_vector <- rep(sd_est, each = n_obs)

for (i in 1:n_sims) {
    percentage <- rnorm(n_pairs*n_obs, mean=60, sd = sd_vector)
    dat <- data.frame(Pair = paste('Pair', rep(1:n_pairs, each = n_obs), sep = ''),
                      Percentage = percentage)
    ttestpvals <-with(dat, pairwise.t.test(Percentage, Pair, p.adj = "none"))$p.value
    p_values[i] <- anova(lm(Percentage~Pair, data=dat))$`Pr(>F)`[1]
    sig_t_test[i] <- any(ttestpvals<0.05, na.rm=TRUE)
}

mean(p_values < 0.05)
## [1] 0.116

How might such a situation come about? Imagine that one of the male fish in Pair 1 looks a little odd – perhaps he is an unusual colour. This makes him very unattractive to some females, and very attractive to other females. This will make the variance of the measurements for Pair 1 higher than for the other pairs.