# Supplement to Chapter 2

Supplement to Chapter 2 of Ramsey and Schafer’s Statistical Sleuth 3ed for POL346: Applied Quantitative Analysis.

Team pol346 pol346.com (Princeton University, Department of Politics)princeton.edu/politics
2020-05-04

# Bumpus’s Data on Natural Selection

Is humerus length related to whether the bird would survive or perish? That’s the question being addressed by Case Study 2.1 in the Statistical Sleuth.

Motivation: We will segment the population of birds into those that survived and those that perished. We can then ask whether these two subsets of the population differ significantly on specific parameters. In this case, we ask whether the two groups differ by humerus length (a type of bone in the wing).

## Statistical summary and graphical display

We begin by reading the data and summarizing the variables.


bumpus <- Sleuth2::case0201 # always use meaningful name for your data
bumpus[1:2, ] # look first two rows of at data

Humerus   Status
1     659 Perished
2     689 Perished

# Using janitor package to clean the variable names (here, to lower case)
library(janitor)
bumpus <- bumpus %>%
clean_names()
names(bumpus)

[1] "humerus" "status" 

Before we apply any statistical tests, it’s important to look at the data. Boxplots and density plots provide can visually present how different groups are from one-another.


# base R boxplot
boxplot(data = bumpus,
humerus ~ status,
main = "Boxplot of survived and perished birds")

ggplot2 allows users to create more aesthetically pleasing graphs. Coding with ggplot2 can be a bit more involved but is straightforward once you get the basic ideas of aesthetics and layers. The ‘gg’ in the name refer to Leland Wilkinson’s 2005 book The Grammar of Graphics that developed a way of understanding the underlying structure of almost all data visualizations. For more, see ggplot2.


library(ggplot2)

# ggplot2 boxplot
bumpus %>%
ggplot() +
aes(x = status,
y = humerus) +
geom_boxplot()

A density plot presents the relative frequency of a particular observation compared to all observations. Thus, its $$y$$-axis is a fraction between 0 and 1.


# Coding density plots in ggplot2
bumpus %>%
ggplot() +
aes(x = humerus,
color = status) +
geom_density() 

Tip: The summary() command is a good base R option for getting an overview of a vector. The skimr package also has a useful skim function for an overview of a whole data frame. Janitor also has a useful function tabyl() for frequency tables.


# Base R approach
survived <- which(bumpus$status == "Survived") perished <- which(bumpus$status == "Perished")

summary(bumpus$humerus[survived])  Min. 1st Qu. Median Mean 3rd Qu. Max. 687 728 736 738 752 780   summary(bumpus$humerus[perished])

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
659     718     734     728     743     765 

# Also see skimr::skim function
library(dplyr) # for group_by and pipes
library(skimr)

bumpus %>%
group_by(status) %>%
skim()
 Name Piped data Number of rows 59 Number of columns 2 _______________________ Column type frequency: numeric 1 ________________________ Group variables status

Variable type: numeric

skim_variable status n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
humerus Perished 0 1 728 23.5 659 718 734 743 765 ▁▁▅▇▅
humerus Survived 0 1 738 19.8 687 728 736 752 780 ▁▂▇▃▂

## Inferential procedure (two sample $$t$$-test)

Researchers are interested in asking “Is Group X different from Group Y?”. Here, we want to know if birds that survive or perish vary significantly along humerus lengths.

The density plot shows that the mean humerus length of “Survived” is greater than the mean humerus length of “Perished.” We can confirm this by calling the mean() function as well.


ys <- mean(bumpus$humerus[survived]) ys  [1] 738  yp <- mean(bumpus$humerus[perished])
yp

[1] 728

A way to check if samples are significantly different is to perform a $$t$$-test. Note the $$t$$-ratio formula:

$t-ratio = \dfrac{(Estimate) − (Parameter)}{SE(Estimate)}$

Here, we can think of it in terms of hypotheses.

$H_0 : \overline{Y}_s - \overline{Y}_p = 0 \\ H_1 : \overline{Y}_s - \overline{Y}_p \neq 0$

So our “Estimate” would be $$\overline{Y}_d = \overline{Y}_s - \overline{Y}_p$$


yd <- ys - yp
yd

[1] 10.1

Our “Parameter” is the Null Hypothesis. Here: $$\overline{Y}_d = 0$$.

For our SE(Estimate), we need to calculate a pooled standard deviation of the samples.

NOTE! We can only perform a two-sample pooled $$t$$-test if the standard deviations of the two samples are similar (or assumed to be similar).

We note that the distributions of “Survived” and “Perished” are similar, and we can assume $\sigma_s = \sigma_p = \sigma$

The equation for pooled standard deviation ($$\sigma$$) is:

$\sigma = \sqrt{\dfrac{(n_s −1)\sigma_s^2 + (n_p −1)\sigma_p^2}{n_s + n_p − 2}}$


# Count number data points in Survived and Perished groups
ns <- length(bumpus$humerus[bumpus$status == "Survived"])
ns

[1] 35

np <- length(bumpus$humerus[bumpus$status == "Perished"])
np

[1] 24

# Store standard deviations of each group
ss <- sd(bumpus$humerus[bumpus$status == "Survived"])
ss

[1] 19.8

sp <- sd(bumpus$humerus[bumpus$status == "Perished"])
sp

[1] 23.5

# Calculate pooled sd
s <- sqrt(((ns - 1) * ss^2 + (np - 1) * sp^2)/(ns + np - 2))
s

[1] 21.4

So the SE(Estimate) can be calculated using the following formula:

$SE(\overline{Y}_d) = \sigma\sqrt{\dfrac{1}{n_s} + \dfrac{1}{n_p}}$


se <- s * sqrt((1/ns) + (1/np))
se

[1] 5.67

So now the $$t$$-ratio is:


t <- (yd - 0)/se
t

[1] 1.78

The $$p$$-value associated with this t-ratio is:


# calculate area as or more extreme than t under left tail
# in general, we can't be sure if t is positive or negative
# so, for left tail, we calculate abs() and multiply by -1
pnorm( -abs(t) ) 

[1] 0.0378

# calculate area as or more extreme than t under right tail
1 - pnorm( abs(t) )

[1] 0.0378

# two-tailed test
p_two <- pnorm( -abs(t) ) * 2  # left tail x 2
p_two

[1] 0.0756

A $$p$$-value of 0.076 means we would expect to see a difference in means as extreme or more extreme than $$\bar{Y}_d = 10.083$$ approximately a 7.557 percent of the time, even if the null hypothesis were true. At a 5% level, this result is not statistically significant. We fail to reject the null hypothesis.

A simpler way to do this process would be to call upon the $$t$$-test function in R.


t.test(data = bumpus,
humerus ~ status,
var.equal = TRUE)

Two Sample t-test

data:  humerus by status
t = -2, df = 57, p-value = 0.08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-21.45   1.28
sample estimates:
mean in group Perished mean in group Survived
728                    738 

Using infer to run a permutation test:


library(infer)
bumpus_permutations <- bumpus %>%
specify(humerus ~ status) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Survived", "Perished"))

head(bumpus_permutations)

# A tibble: 6 x 2
replicate  stat
<int> <dbl>
1         1  1.87
2         2  5.17
3         3  3.2
4         4  1.37
5         5 -1.44
6         6 -1.37

bumpus_permutations %>%
ggplot() +
aes(x = stat) +
geom_histogram() +
geom_vline(xintercept = c(-yd, yd),
color = "red")

# Anatomical Abnormalities Associated with Schizophrenia

Researchers want to know if there are any physiological indicators of schizophrenia. Early postmortem analysis suggests that certain parts of the brain might be sized differently in individuals who suffer from schizophrenia versus others. Researchers decided to control for exogenous factors that may contribute to variation in hippocampus size by using twins.


twins <- Sleuth2::case0202  %>%
clean_names() # convert column names to lowercase

head(twins)

unaffect affected
1     1.94     1.27
2     1.44     1.63
3     1.56     1.47
4     1.58     1.39
5     2.06     1.93
6     1.66     1.26

twins %>% skim()
 Name Piped data Number of rows 15 Number of columns 2 _______________________ Column type frequency: numeric 2 ________________________ Group variables

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
unaffect 0 1 1.76 0.24 1.25 1.60 1.77 1.94 2.08 ▁▃▂▃▇
affected 0 1 1.56 0.30 1.02 1.31 1.59 1.78 2.02 ▂▇▃▅▆

There are 15 pairs of twins, within each pair, one is affected by schizophrenia, and the other is not. The difference in area of left hippocampus of these pairs of twins is:


diff <- twins$unaffect - twins$affected
head(diff)

[1]  0.67 -0.19  0.09  0.19  0.13  0.40

We can now plot the distribution of these differences:


hist(diff)

Now we will conduct a paired $$t$$-test, and construct a 95% confidence interval:

Note: Remember the formula for SE here is the following:

$$$\begin{split} SE = \frac{\sigma}{\sqrt{n}} \end{split}$$$


diff_mean <- mean(diff)
diff_mean

[1] 0.199

diff_sd <- sd(diff)
diff_sd

[1] 0.238

diff_SE <- diff_sd/sqrt(15)
diff_SE

[1] 0.0615

t_val <- (diff_mean - 0)/diff_SE
t_val

[1] 3.23

# p-value of a two-sided t-test
two_p <- pnorm(-abs(t_val))*2
two_p

[1] 0.00124

# construct a 95% confidence interval
q <- qt(0.975, 15-1)
q

[1] 2.14

lower <- diff_mean - q*diff_SE
lower

[1] 0.0667

upper <- diff_mean + q*diff_SE
upper

[1] 0.331

And so, our $$p$$-value is: 0.001, which is less than a 5% level of significance, meaning the difference is significant.

And, our 95% confidence interval is (0.067, 0.331).

Tip: If the confidence interval straddles zero, the difference-in-means cannot be statistically distinguished from zero and is not statistically significant.

# Tidy data

Note, that the twins data is structured in a way that important information about each case is not actually stored as data. Data about the condition, that is affected or unaffected, is stored in the column names, not in the cells of the data frame. Similarly, data about pairs of twins is implied by each row but the data is not explicit. For certain kinds of statistical analysis or data visualization, this additional data would need to be stored in the cells.

We can use the tidyr package to reshape the data to a tidy format in which each variable is a column, each observation is a row. For more on tiday data, see Hadley Wickham’s 2014 article, Tidy data.(Wickham 2014)


library(tidyr)

# use tidyr package to gather to create key value pair
tidy_twins <- twins %>%
mutate(twin_pair = row_number() %>% # create pair # and convert to categorical
as.factor()) %>%
tidyr::gather('affected', 'unaffect',
key = 'condition',
value = 'area') %>%
arrange(twin_pair)

head(tidy_twins)

twin_pair condition area
1         1  affected 1.27
2         1  unaffect 1.94
3         2  affected 1.63
4         2  unaffect 1.44
5         3  affected 1.47
6         3  unaffect 1.56

tail(tidy_twins)

twin_pair condition area
25        13  affected 2.02
26        13  unaffect 2.04
27        14  affected 1.59
28        14  unaffect 1.62
29        15  affected 1.97
30        15  unaffect 2.08

A good example of why tidy data has advantages over untidy data is with plotting. To plot the relationship between each twin pair, we need to be able to pass ggplot values for an $$x$$-axis, a $$y$$-axis and a group. The plot below would be considerably harder to do with the untidy version of the data. With the tidy version, it is straightforward.


tidy_twins %>%
ggplot() +
aes(x = condition, y = area,
color = twin_pair,
group = twin_pair) +
geom_point() +
geom_line()  


#theme(legend.position="none") # optional hide legend

For some functions, such as a $$t$$-test, the tidy version is not much different but the code can be more readable. Here we’re contrasting hippocampus area by schizophrenia condition.


tidy_twins %>%
# t.test command is base R and not updated for pipes (%>%)
# so we use data = . to pass data.frame to data argument in a pipe
t.test(data = .,
area ~ condition,
var.equal = TRUE)

Two Sample t-test

data:  area by condition
t = -2, df = 28, p-value = 0.06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.40318  0.00585
sample estimates:
mean in group affected mean in group unaffect
1.56                   1.76 

# Writing a function

R is a functional programing language which means it has many tools for creating and manipulating functions. Often, it will be helpful to write your own functions to avoid repeating code and making errors. You can think of functions as verbs in that they act on our data objects (which can be thought of as nouns).


# Simple example function
cube <- function(x) {
x^3
}

cube(5)

[1] 125

# Example function using 'return()'
celsify <- function(fahren_temp) {
# important to put parentheses around whole equation
# otherwise two values are passed by pipe to round()
cel_temp <- ((fahren_temp - 32) * 5/9) %>%
round(1)
return(cel_temp)
}

celsify(32)

[1] 0

celsify(100)

[1] 37.8

# Example using two input values
raise <- function(x, y) {
x^y
}

raise(3, 2)

[1] 9

This supplement was put together by Omar Wasow, Sukrit S. Puri, and Blaykyi Kenyah. It builds on prior work by Project Mosaic, an NSF-funded initiative to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.’’ We are grateful to Nicholas Horton, Linda Loi, Kate Aloisio, and Ruobing Zhang for their effort developing The Statistical Sleuth in R.(Horton et al. 2013)

Horton, Nicholas, Linda Loi, Kate Aloisio, and Ruobing Zhang. 2013. “The Statistical Sleuth in R.” http://www.math.smith.edu/~nhorton/sleuth3/.

Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59 (10). http://vita.had.co.nz/papers/tidy-data.html.