Supplement to Chapter 2 of Ramsey and Schafer’s *Statistical Sleuth* 3ed for POL346: Applied Quantitative Analysis.

Is humerus length related to whether the bird would survive or perish? That’s the question being addressed by Case Study 2.1 in the *Statistical Sleuth*.

*Motivation*: We will segment the population of birds into those that survived and those that perished. We can then ask whether these two subsets of the population differ *significantly* on specific parameters. In this case, we ask whether the two groups differ by humerus length (a type of bone in the wing).

We begin by reading the data and summarizing the variables.

```
# Loading dataset
bumpus <- Sleuth2::case0201 # always use meaningful name for your data
bumpus[1:2, ] # look first two rows of at data
```

```
Humerus Status
1 659 Perished
2 689 Perished
```

```
# Using janitor package to clean the variable names (here, to lower case)
library(janitor)
bumpus <- bumpus %>%
clean_names()
names(bumpus)
```

```
[1] "humerus" "status"
```

Before we apply any statistical tests, it’s important to look at the data. Boxplots and density plots provide can visually present how different groups are from one-another.

```
# base R boxplot
boxplot(data = bumpus,
humerus ~ status,
main = "Boxplot of survived and perished birds")
```

`ggplot2`

allows users to create more aesthetically pleasing graphs. Coding with `ggplot2`

can be a bit more involved but is straightforward once you get the basic ideas of aesthetics and layers. The ‘gg’ in the name refer to Leland Wilkinson’s 2005 book *The Grammar of Graphics* that developed a way of understanding the underlying structure of almost all data visualizations. For more, see ggplot2.

```
library(ggplot2)
# ggplot2 boxplot
bumpus %>%
ggplot() +
aes(x = status,
y = humerus) +
geom_boxplot()
```

A density plot presents the relative frequency of a particular observation compared to all observations. Thus, its \(y\)-axis is a fraction between 0 and 1.

```
# Coding density plots in ggplot2
bumpus %>%
ggplot() +
aes(x = humerus,
color = status) +
geom_density()
```

**Tip**: The `summary()`

command is a good base `R`

option for getting an overview of a vector. The `skimr`

package also has a useful `skim`

function for an overview of a whole data frame. `Janitor`

also has a useful function `tabyl()`

for frequency tables.

```
# Base R approach
survived <- which(bumpus$status == "Survived")
perished <- which(bumpus$status == "Perished")
summary(bumpus$humerus[survived])
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
687 728 736 738 752 780
```

```
summary(bumpus$humerus[perished])
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
659 718 734 728 743 765
```

```
# Also see skimr::skim function
library(dplyr) # for group_by and pipes
library(skimr)
bumpus %>%
group_by(status) %>%
skim()
```

Name | Piped data |

Number of rows | 59 |

Number of columns | 2 |

_______________________ | |

Column type frequency: | |

numeric | 1 |

________________________ | |

Group variables | status |

**Variable type: numeric**

skim_variable | status | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|

humerus | Perished | 0 | 1 | 728 | 23.5 | 659 | 718 | 734 | 743 | 765 | ▁▁▅▇▅ |

humerus | Survived | 0 | 1 | 738 | 19.8 | 687 | 728 | 736 | 752 | 780 | ▁▂▇▃▂ |

Researchers are interested in asking “Is Group X different from Group Y?”. Here, we want to know if birds that survive or perish vary significantly along humerus lengths.

The density plot shows that the mean humerus length of “Survived” is greater than the mean humerus length of “Perished.” We can confirm this by calling the mean() function as well.

```
ys <- mean(bumpus$humerus[survived])
ys
```

```
[1] 738
```

```
yp <- mean(bumpus$humerus[perished])
yp
```

```
[1] 728
```

A way to check if samples are significantly different is to perform a \(t\)-test. Note the \(t\)-ratio formula:

\[ t-ratio = \dfrac{(Estimate) − (Parameter)}{SE(Estimate)} \]

Here, we can think of it in terms of hypotheses.

\[ H_0 : \overline{Y}_s - \overline{Y}_p = 0 \\ H_1 : \overline{Y}_s - \overline{Y}_p \neq 0 \]

So our “Estimate” would be \(\overline{Y}_d = \overline{Y}_s - \overline{Y}_p\)

```
yd <- ys - yp
yd
```

```
[1] 10.1
```

Our “Parameter” is the Null Hypothesis. Here: \(\overline{Y}_d = 0\).

For our *SE(Estimate)*, we need to calculate a pooled standard deviation of the samples.

**NOTE!** We can only perform a two-sample pooled \(t\)-test if the standard deviations of the two samples are similar (or assumed to be similar).

We note that the distributions of “Survived” and “Perished” are similar, and we can assume \[\sigma_s = \sigma_p = \sigma \]

The equation for pooled standard deviation (\(\sigma\)) is:

\[ \sigma = \sqrt{\dfrac{(n_s −1)\sigma_s^2 + (n_p −1)\sigma_p^2}{n_s + n_p − 2}} \]

```
# Count number data points in Survived and Perished groups
ns <- length(bumpus$humerus[bumpus$status == "Survived"])
ns
```

```
[1] 35
```

```
np <- length(bumpus$humerus[bumpus$status == "Perished"])
np
```

```
[1] 24
```

```
# Store standard deviations of each group
ss <- sd(bumpus$humerus[bumpus$status == "Survived"])
ss
```

```
[1] 19.8
```

```
sp <- sd(bumpus$humerus[bumpus$status == "Perished"])
sp
```

```
[1] 23.5
```

```
# Calculate pooled sd
s <- sqrt(((ns - 1) * ss^2 + (np - 1) * sp^2)/(ns + np - 2))
s
```

```
[1] 21.4
```

So the *SE(Estimate)* can be calculated using the following formula:

\[ SE(\overline{Y}_d) = \sigma\sqrt{\dfrac{1}{n_s} + \dfrac{1}{n_p}} \]

```
se <- s * sqrt((1/ns) + (1/np))
se
```

```
[1] 5.67
```

So now the \(t\)-ratio is:

```
t <- (yd - 0)/se
t
```

```
[1] 1.78
```

The \(p\)-value associated with this t-ratio is:

```
# calculate area as or more extreme than t under left tail
# in general, we can't be sure if t is positive or negative
# so, for left tail, we calculate abs() and multiply by -1
pnorm( -abs(t) )
```

```
[1] 0.0378
```

```
# calculate area as or more extreme than t under right tail
1 - pnorm( abs(t) )
```

```
[1] 0.0378
```

```
# two-tailed test
p_two <- pnorm( -abs(t) ) * 2 # left tail x 2
p_two
```

```
[1] 0.0756
```

A \(p\)-value of 0.076 means we would expect to see a difference in means as extreme or more extreme than \(\bar{Y}_d = 10.083\) approximately a 7.557 percent of the time, even if the null hypothesis were true. At a 5% level, this result is *not* statistically significant. We *fail* to reject the null hypothesis.

A simpler way to do this process would be to call upon the \(t\)-test function in R.

```
t.test(data = bumpus,
humerus ~ status,
var.equal = TRUE)
```

```
Two Sample t-test
data: humerus by status
t = -2, df = 57, p-value = 0.08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-21.45 1.28
sample estimates:
mean in group Perished mean in group Survived
728 738
```

Using `infer`

to run a permutation test:

```
library(infer)
bumpus_permutations <- bumpus %>%
specify(humerus ~ status) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Survived", "Perished"))
head(bumpus_permutations)
```

```
# A tibble: 6 x 2
replicate stat
<int> <dbl>
1 1 1.87
2 2 5.17
3 3 3.2
4 4 1.37
5 5 -1.44
6 6 -1.37
```

```
bumpus_permutations %>%
ggplot() +
aes(x = stat) +
geom_histogram() +
geom_vline(xintercept = c(-yd, yd),
color = "red")
```

Researchers want to know if there are any physiological indicators of schizophrenia. Early postmortem analysis suggests that certain parts of the brain might be sized differently in individuals who suffer from schizophrenia versus others. Researchers decided to control for exogenous factors that may contribute to variation in hippocampus size by using twins.

```
# Loading and cleaning dataset
twins <- Sleuth2::case0202 %>%
clean_names() # convert column names to lowercase
head(twins)
```

```
unaffect affected
1 1.94 1.27
2 1.44 1.63
3 1.56 1.47
4 1.58 1.39
5 2.06 1.93
6 1.66 1.26
```

```
twins %>% skim()
```

Name | Piped data |

Number of rows | 15 |

Number of columns | 2 |

_______________________ | |

Column type frequency: | |

numeric | 2 |

________________________ | |

Group variables |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|

unaffect | 0 | 1 | 1.76 | 0.24 | 1.25 | 1.60 | 1.77 | 1.94 | 2.08 | ▁▃▂▃▇ |

affected | 0 | 1 | 1.56 | 0.30 | 1.02 | 1.31 | 1.59 | 1.78 | 2.02 | ▂▇▃▅▆ |

There are 15 pairs of twins, within each pair, one is affected by schizophrenia, and the other is not. The difference in area of left hippocampus of these pairs of twins is:

```
diff <- twins$unaffect - twins$affected
head(diff)
```

```
[1] 0.67 -0.19 0.09 0.19 0.13 0.40
```

We can now plot the distribution of these differences:

```
hist(diff)
```

Now we will conduct a paired \(t\)-test, and construct a 95% confidence interval:

**Note**: Remember the formula for SE here is the following:

$$

\[\begin{split} SE = \frac{\sigma}{\sqrt{n}} \end{split}\]$$

```
diff_mean <- mean(diff)
diff_mean
```

```
[1] 0.199
```

```
diff_sd <- sd(diff)
diff_sd
```

```
[1] 0.238
```

```
diff_SE <- diff_sd/sqrt(15)
diff_SE
```

```
[1] 0.0615
```

```
t_val <- (diff_mean - 0)/diff_SE
t_val
```

```
[1] 3.23
```

```
# p-value of a two-sided t-test
two_p <- pnorm(-abs(t_val))*2
two_p
```

```
[1] 0.00124
```

```
# construct a 95% confidence interval
q <- qt(0.975, 15-1)
q
```

```
[1] 2.14
```

```
lower <- diff_mean - q*diff_SE
lower
```

```
[1] 0.0667
```

```
upper <- diff_mean + q*diff_SE
upper
```

```
[1] 0.331
```

And so, our \(p\)-value is: 0.001, which is less than a 5% level of significance, meaning the difference *is* significant.

And, our 95% confidence interval is (0.067, 0.331).

**Tip**: If the confidence interval straddles zero, the difference-in-means cannot be statistically distinguished from zero and is *not* statistically significant.

Note, that the `twins`

data is structured in a way that important information about each case is not actually stored as data. Data about the condition, that is affected or unaffected, is stored in the column names, not in the cells of the data frame. Similarly, data about pairs of twins is implied by each row but the data is not explicit. For certain kinds of statistical analysis or data visualization, this additional data would need to be stored in the cells.

We can use the `tidyr`

package to reshape the data to a tidy format in which each variable is a column, each observation is a row. For more on tiday data, see Hadley Wickham’s 2014 article, Tidy data.(Wickham 2014)

```
library(tidyr)
# use tidyr package to `gather` to create `key value` pair
tidy_twins <- twins %>%
mutate(twin_pair = row_number() %>% # create pair # and convert to categorical
as.factor()) %>%
tidyr::gather('affected', 'unaffect',
key = 'condition',
value = 'area') %>%
arrange(twin_pair)
head(tidy_twins)
```

```
twin_pair condition area
1 1 affected 1.27
2 1 unaffect 1.94
3 2 affected 1.63
4 2 unaffect 1.44
5 3 affected 1.47
6 3 unaffect 1.56
```

```
tail(tidy_twins)
```

```
twin_pair condition area
25 13 affected 2.02
26 13 unaffect 2.04
27 14 affected 1.59
28 14 unaffect 1.62
29 15 affected 1.97
30 15 unaffect 2.08
```

A good example of why tidy data has advantages over untidy data is with plotting. To plot the relationship between each twin pair, we need to be able to pass `ggplot`

values for an \(x\)-axis, a \(y\)-axis and a group. The plot below would be considerably harder to do with the untidy version of the data. With the tidy version, it is straightforward.

```
tidy_twins %>%
ggplot() +
aes(x = condition, y = area,
color = twin_pair,
group = twin_pair) +
geom_point() +
geom_line()
```

```
#theme(legend.position="none") # optional hide legend
```

For some functions, such as a \(t\)-test, the tidy version is not much different but the code can be more readable. Here we’re contrasting hippocampus `area`

by schizophrenia `condition`

.

```
tidy_twins %>%
# t.test command is base R and not updated for pipes (%>%)
# so we use data = . to pass data.frame to data argument in a pipe
t.test(data = .,
area ~ condition,
var.equal = TRUE)
```

```
Two Sample t-test
data: area by condition
t = -2, df = 28, p-value = 0.06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.40318 0.00585
sample estimates:
mean in group affected mean in group unaffect
1.56 1.76
```

R is a functional programing language which means it has many tools for creating and manipulating functions. Often, it will be helpful to write your own functions to avoid repeating code and making errors. You can think of functions as *verbs* in that they act on our data objects (which can be thought of as *nouns*).

```
# Simple example function
cube <- function(x) {
x^3
}
cube(5)
```

```
[1] 125
```

```
# Example function using 'return()'
celsify <- function(fahren_temp) {
# important to put parentheses around whole equation
# otherwise two values are passed by pipe to round()
cel_temp <- ((fahren_temp - 32) * 5/9) %>%
round(1)
return(cel_temp)
}
celsify(32)
```

```
[1] 0
```

```
celsify(100)
```

```
[1] 37.8
```

```
# Example using two input values
raise <- function(x, y) {
x^y
}
raise(3, 2)
```

```
[1] 9
```

This supplement was put together by Omar Wasow, Sukrit S. Puri, and Blaykyi Kenyah. It builds on prior work by Project Mosaic, an ``NSF-funded initiative to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.’’ We are grateful to Nicholas Horton, Linda Loi, Kate Aloisio, and Ruobing Zhang for their effort developing The Statistical Sleuth in R.(Horton et al. 2013)

Horton, Nicholas, Linda Loi, Kate Aloisio, and Ruobing Zhang. 2013. “The Statistical Sleuth in R.” http://www.math.smith.edu/~nhorton/sleuth3/.

Wickham, Hadley. 2014. “Tidy Data.” *The Journal of Statistical Software* 59 (10). http://vita.had.co.nz/papers/tidy-data.html.