Supplement to Chapter 2 of Ramsey and Schafer’s *Statistical Sleuth* 3ed for POL346: Applied Quantitative Analysis.

```
knitr::opts_chunk$set(echo = TRUE)
options(digits = 3)
```

Is humerus length related to whether the bird would survive or perish? That’s the question being addressed by Case Study 2.1 in the *Statistical Sleuth*.

*Motivation*: We will segment the population of birds into those that survived and those that perished. We can then ask whether these two subsets of the population differ *significantly* on specific parameters. In this case, we ask whether the two groups differ by humerus length (a type of bone in the wing).

We begin by reading the data and summarizing the variables.

```
# Loading dataset
bumpus <- Sleuth2::case0201 # always use meaningful name for your data
bumpus[1:2, ] # look first two rows of at data
```

```
Humerus Status
1 659 Perished
2 689 Perished
```

```
# Using janitor package to clean the variable names (here, to lower case)
library(janitor)
bumpus <- bumpus %>%
clean_names()
names(bumpus)
```

`[1] "humerus" "status" `

Before we apply any statistical tests, it’s important to look at the data. Boxplots and density plots provide can visually present how different groups are from one-another.

```
# base R boxplot
boxplot(data = bumpus,
humerus ~ status,
main = "Boxplot of survived and perished birds")
```

`ggplot2`

allows users to create more aesthetically pleasing graphs. Coding with `ggplot2`

can be a bit more involved but is straightforward once you get the basic ideas of aesthetics and layers. The ‘gg’ in the name refer to Leland Wilkinson’s 2005 book *The Grammar of Graphics* that developed a way of understanding the underlying structure of almost all data visualizations. For more, see ggplot2.

```
library(ggplot2)
# ggplot2 boxplot
bumpus %>%
ggplot() +
aes(x = status,
y = humerus) +
geom_boxplot()
```

A density plot presents the relative frequency of a particular observation compared to all observations. Thus, its \(y\)-axis is a fraction between 0 and 1.

```
# Coding density plots in ggplot2
bumpus %>%
ggplot() +
aes(x = humerus,
color = status) +
geom_density()
```

**Tip**: The `summary()`

command is a good base `R`

option for getting an overview of a vector. The `skimr`

package also has a useful `skim`

function for an overview of a whole data frame. `Janitor`

also has a useful function `tabyl()`

for frequency tables.

```
# Base R approach
survived <- which(bumpus$status == "Survived")
perished <- which(bumpus$status == "Perished")
summary(bumpus$humerus[survived])
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
687 728 736 738 752 780
```

```
summary(bumpus$humerus[perished])
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
659 718 734 728 743 765
```

```
# Also see skimr::skim function
library(dplyr) # for group_by and pipes
library(skimr)
bumpus %>%
group_by(status) %>%
skim_without_charts()
```

Name | Piped data |

Number of rows | 59 |

Number of columns | 2 |

_______________________ | |

Column type frequency: | |

numeric | 1 |

________________________ | |

Group variables | status |

**Variable type: numeric**

skim_variable | status | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|---|

humerus | Perished | 0 | 1 | 728 | 23.5 | 659 | 718 | 734 | 743 | 765 |

humerus | Survived | 0 | 1 | 738 | 19.8 | 687 | 728 | 736 | 752 | 780 |

Researchers are interested in asking “Is Group X different from Group Y?” Here, we want to know if birds that survive or perish vary significantly along humerus lengths.

The density plot shows that the mean humerus length of “Survived” is greater than the mean humerus length of “Perished.” We can confirm this by calling the mean() function as well.

A way to check if samples are significantly different is to perform a \(t\)-test. Note the \(t\)-ratio formula:

\[ t-ratio = \dfrac{(Estimate) − (Parameter)}{SE(Estimate)} \]

Here, we can think of it in terms of hypotheses.

\[ H_0 : \overline{Y}_s - \overline{Y}_p = 0 \\ H_1 : \overline{Y}_s - \overline{Y}_p \neq 0 \]

So our “Estimate” would be \(\overline{Y}_d = \overline{Y}_s - \overline{Y}_p\)

```
yd <- ys - yp
yd
```

`[1] 10.1`

Our “Parameter” is the Null Hypothesis. Here: \(\overline{Y}_d = 0\).

For our *SE(Estimate)*, we need to calculate a pooled standard deviation of the samples.

**NOTE!** We can only perform a two-sample pooled \(t\)-test if the standard deviations of the two samples are similar (or assumed to be similar).

We note that the distributions of “Survived” and “Perished” are similar, and we can assume \[\sigma_s = \sigma_p = \sigma \]

The equation for pooled standard deviation (\(\sigma\)) is:

\[ \sigma = \sqrt{\dfrac{(n_s −1)\sigma_s^2 + (n_p −1)\sigma_p^2}{n_s + n_p − 2}} \]

```
# Count number data points in Survived and Perished groups
ns <- length(bumpus$humerus[bumpus$status == "Survived"])
ns
```

`[1] 35`

```
np <- length(bumpus$humerus[bumpus$status == "Perished"])
np
```

`[1] 24`

```
# Store standard deviations of each group
ss <- sd(bumpus$humerus[bumpus$status == "Survived"])
ss
```

`[1] 19.8`

```
sp <- sd(bumpus$humerus[bumpus$status == "Perished"])
sp
```

`[1] 23.5`

```
# Calculate pooled sd
s <- sqrt(((ns - 1) * ss^2 + (np - 1) * sp^2)/(ns + np - 2))
s
```

`[1] 21.4`

So the *SE(Estimate)* can be calculated using the following formula:

\[ SE(\overline{Y}_d) = \sigma\sqrt{\dfrac{1}{n_s} + \dfrac{1}{n_p}} \]

```
se <- s * sqrt((1/ns) + (1/np))
se
```

`[1] 5.67`

So now the \(t\)-ratio is:

```
t <- (yd - 0)/se
t
```

`[1] 1.78`

The \(p\)-value associated with this t-ratio is:

```
# calculate area as or more extreme than t under left tail
# in general, we can't be sure if t is positive or negative
# so, for left tail, we calculate abs() and multiply by -1
pnorm( -abs(t) )
```

`[1] 0.0378`

`[1] 0.0378`

`[1] 0.0756`

A \(p\)-value of 0.076 means we would expect to see a difference in means as extreme or more extreme than \(\bar{Y}_d = 10.083\) approximately a 7.557 percent of the time, even if the null hypothesis were true. At a 5% level, this result is *not* statistically significant. We *fail* to reject the null hypothesis.

A simpler way to do this process would be to call upon the \(t\)-test function in R.

```
t.test(data = bumpus,
humerus ~ status,
var.equal = TRUE)
```

```
Two Sample t-test
data: humerus by status
t = -2, df = 57, p-value = 0.08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-21.45 1.28
sample estimates:
mean in group Perished mean in group Survived
728 738
```

Using `infer`

to run a permutation test:

```
library(infer)
bumpus_permutations <- bumpus %>%
specify(humerus ~ status) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Survived", "Perished"))
head(bumpus_permutations)
```

```
# A tibble: 6 x 2
replicate stat
<int> <dbl>
1 1 3.69
2 2 5.24
3 3 3.55
4 4 -4.67
5 5 -14.4
6 6 -0.874
```

```
bumpus_permutations %>%
ggplot() +
aes(x = stat) +
geom_histogram() +
geom_vline(xintercept = c(-yd, yd),
color = "red")
```

Researchers want to know if there are any physiological indicators of schizophrenia. Early postmortem analysis suggests that certain parts of the brain might be sized differently in individuals who suffer from schizophrenia versus others. Researchers decided to control for exogenous factors that may contribute to variation in hippocampus size by using twins.

```
# Loading and cleaning dataset
twins <- Sleuth3::case0202 %>%
janitor::clean_names() # convert column names to lowercase
head(twins)
```

```
unaffected affected
1 1.94 1.27
2 1.44 1.63
3 1.56 1.47
4 1.58 1.39
5 2.06 1.93
6 1.66 1.26
```

```
twins %>%
skim_without_charts()
```

Name | Piped data |

Number of rows | 15 |

Number of columns | 2 |

_______________________ | |

Column type frequency: | |

numeric | 2 |

________________________ | |

Group variables | None |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|

unaffected | 0 | 1 | 1.76 | 0.24 | 1.25 | 1.60 | 1.77 | 1.94 | 2.08 |

affected | 0 | 1 | 1.56 | 0.30 | 1.02 | 1.31 | 1.59 | 1.78 | 2.02 |

There are 15 pairs of twins, within each pair, one is affected by schizophrenia, and the other is not. The difference in area of left hippocampus of these pairs of twins is:

```
diff <- twins$unaffected - twins$affected
head(diff)
```

`[1] 0.67 -0.19 0.09 0.19 0.13 0.40`

We can now plot the distribution of these differences:

```
hist(diff)
```

Now we will conduct a paired \(t\)-test, and construct a 95% confidence interval:

**Note**: Remember the formula for SE here is the following:

$$

\[\begin{split} SE = \frac{\sigma}{\sqrt{n}} \end{split}\]$$

```
diff_mean <- mean(diff)
diff_mean
```

`[1] 0.199`

```
diff_sd <- sd(diff)
diff_sd
```

`[1] 0.238`

```
diff_SE <- diff_sd/sqrt(15)
diff_SE
```

`[1] 0.0615`

```
t_val <- (diff_mean - 0)/diff_SE
t_val
```

`[1] 3.23`

`[1] 0.00124`

```
# construct a 95% confidence interval
q <- qt(0.975, 15-1)
q
```

`[1] 2.14`

```
lower <- diff_mean - q*diff_SE
lower
```

`[1] 0.0667`

```
upper <- diff_mean + q*diff_SE
upper
```

`[1] 0.331`

And so, our \(p\)-value is: 0.001, which is less than a 5% level of significance, meaning the difference *is* significant.

And, our 95% confidence interval is (0.067, 0.331).

**Tip**: If the confidence interval straddles zero, the difference-in-means cannot be statistically distinguished from zero and is *not* statistically significant.

Note, that the `twins`

data is structured in a way that important information about each case is not actually stored as data. Data about the condition, that is affected or unaffected, is stored in the column names, not in the cells of the data frame. Similarly, data about pairs of twins is implied by each row but the data is not explicit. For certain kinds of statistical analysis or data visualization, this additional data would need to be stored in the cells.

We can use the `tidyr`

package to reshape the data to a tidy format in which each variable is a column, each observation is a row. For more on tiday data, see Hadley Wickham’s 2014 article, Tidy data.(Wickham 2014)

```
suppressMessages(library(dplyr))
library(tidyr)
# First, create column to indicate units (in this case twin pairs)
tidy_twins <- twins %>%
mutate(
# create pair number but treat as a category, not numeric
# (e.g., twin pair 1 isn't higher ranking than twin pair 2)
twin_pair = row_number() %>% as.factor()
)
head(tidy_twins, 3)
```

```
unaffected affected twin_pair
1 1.94 1.27 1
2 1.44 1.63 2
3 1.56 1.47 3
```

```
# Second, use tidyr package to `pivot_longer` to create `key value` pair
tidy_twins <- tidy_twins %>%
tidyr::pivot_longer(
cols = c('affected', 'unaffected'),
names_to = 'condition',
values_to = 'area'
) %>%
arrange(twin_pair) # sort data
head(tidy_twins)
```

```
# A tibble: 6 x 3
twin_pair condition area
<fct> <chr> <dbl>
1 1 affected 1.27
2 1 unaffected 1.94
3 2 affected 1.63
4 2 unaffected 1.44
5 3 affected 1.47
6 3 unaffected 1.56
```

```
tail(tidy_twins)
```

```
# A tibble: 6 x 3
twin_pair condition area
<fct> <chr> <dbl>
1 13 affected 2.02
2 13 unaffected 2.04
3 14 affected 1.59
4 14 unaffected 1.62
5 15 affected 1.97
6 15 unaffected 2.08
```

The diagram above shows how `pivot_longer()`

transformed the data.frame.

- In the original data, the twin pair information was not stored as data in the data.frame. So, our first step was creating a column that included each row number, a way to assign a value to each twin pair.
- In the original data, each twin pair is represented by a single row. In the tidy data, each twin is represnted by their own separate row.
- In the original data, the data about the condition of the twin is not stored as data in the data.frame but, rather as descriptive information in the column name. In the tidy data, the twin condition is stored as data in a column called
`condition`

. This column name information is what pivot_longer calls the`key`

. - In the original data, the hippocampus volume values for both twins is stored in a single row. In the tidy data, each individual twin’s hippocampus area value is in the
`area`

column. This data inside the data.frame what`pivot_longer`

calls the`values`

.

In short, we move from the unit being a twin pair to the unit being an individual twin. And, we move from important values stored in the row numbers and column names, to those data being stored inside the data.frame.

A good example of why tidy data has advantages over untidy data is with plotting. If we plot the two available columns in the ‘messy’ data, we can see a relationship in hippocampus valume between the affected and the unaffected twin but it’s hard to evaluate our question of interest which is whether there’s a systematic difference in the volumes within twin pairs.

With the tidy data, however, the differences within pairs are much clearer. To plot the relationship between each twin pair, we need to be able to pass `ggplot`

values for an \(x\)-axis, a \(y\)-axis and a group. The plot below would be considerably harder to do with the untidy version of the data. With the tidy data, it is straightforward.

```
tidy_twins %>%
ggplot() +
aes(x = condition, y = area,
color = twin_pair,
group = twin_pair) +
geom_point() +
geom_line()
```

```
#theme(legend.position="none") # optional hide legend
```

For some older, Base R functions such as a \(t\)-test, using the tidy data is not much different but the code can be more readable. Here we’re testing whether hippocampus `area`

differs by schizophrenia `condition`

.

```
# traditional syntax
t.test(area ~ condition, data = tidy_twins, var.equal = TRUE)
```

```
Two Sample t-test
data: area by condition
t = -2, df = 28, p-value = 0.06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.40318 0.00585
sample estimates:
mean in group affected mean in group unaffected
1.56 1.76
```

```
# piped syntax
tidy_twins %>%
# t.test command is base R and not updated for pipes (%>%)
# so we use data = . to pass data.frame to `data =` argument in a pipe
t.test(data = .,
area ~ condition,
var.equal = TRUE)
```

```
Two Sample t-test
data: area by condition
t = -2, df = 28, p-value = 0.06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.40318 0.00585
sample estimates:
mean in group affected mean in group unaffected
1.56 1.76
```

R is a functional programing language which means it has many tools for creating and manipulating functions. Often, it will be helpful to write your own functions to avoid repeating code and making errors. You can think of functions as *verbs* in that they act on our data objects (which can be thought of as *nouns*).

```
# Simple example function
cube <- function(x) {
x^3
}
cube(5)
```

`[1] 125`

```
# Example function using 'return()'
celsify <- function(fahren_temp) {
# important to put parentheses around whole equation
# otherwise two values are passed by pipe to round()
cel_temp <- ((fahren_temp - 32) * 5/9) %>%
round(1)
return(cel_temp)
}
celsify(32)
```

`[1] 0`

```
celsify(100)
```

`[1] 37.8`

```
# Example using two input values
raise <- function(x, y) {
x^y
}
raise(3, 2)
```

`[1] 9`

This supplement was put together by Omar Wasow, Sukrit S. Puri, and Blaykyi Kenyah. It builds on prior work by Project Mosaic, an ``NSF-funded initiative to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.’’ We are grateful to Nicholas Horton, Linda Loi, Kate Aloisio, and Ruobing Zhang for their effort developing The Statistical Sleuth in R.(Horton et al. 2013)

Horton, Nicholas, Linda Loi, Kate Aloisio, and Ruobing Zhang. 2013. “The Statistical Sleuth in r.” http://www.math.smith.edu/~nhorton/sleuth3/.

Wickham, Hadley. 2014. “Tidy Data.” *The Journal of Statistical Software* 59. http://vita.had.co.nz/papers/tidy-data.html.