# Supplement to Chapter 1

Supplement to Chapter 1 of Ramsey and Schafer’s *Statistical Sleuth* (3rd ed.) for POL346: Applied Quantitative Analysis.

Team pol346 (pol346.com), Princeton University, Department of Politics (princeton.edu/politics)

2020-03-31

# Assign Random Numbers

There are many ways to draw random numbers. Here our hypothetical task is to randomly assign a number from 1 to 20 to each of 20 subjects.


```r
# from a vector of 1 to 20, draw 20 times at random without replacement,
# which means a number cannot be drawn twice
random_numbers <- sample(x = 1:20, size = 20, replace = FALSE)

# view random numbers
random_numbers
```

```
[1] 14  5  3  7 12  8 20 16  6 18  2  9  1 10 17 19 13 11  4 15
```
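Because `sample` draws at random, the numbers above will differ each time the code is run. Setting a seed first makes the draw reproducible, which matters when others need to verify your random assignment. A small sketch (the seed value 346 is arbitrary; any integer works):

```r
# fix the random number generator so the same draw can be reproduced later
set.seed(346) # arbitrary seed value

random_numbers <- sample(x = 1:20, size = 20, replace = FALSE)

# every value from 1 to 20 appears exactly once
random_numbers
```

Running this chunk again from `set.seed` produces an identical vector.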

# Order a Data Set

We can use the `order` function to sort a data set in increasing order. To start, we’ll use the pre-loaded data set `cars`, which records the speed and stopping distance of cars (in the 1920s). We are going to order the data by the distance column.


```r
# view first six rows of cars data
head(cars)
```

```
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
```

```r
# with base R, sort by distance
order_dist <- order(cars$dist) # returns an index of dist from low to high

# view first six index values
head(order_dist)
```

```
[1]  1  3  2  6 12  5
```
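A related base R function is `sort`, which returns the sorted values themselves rather than an index. The two are connected: indexing the original vector by its `order` reproduces `sort`. A quick illustration with the same `cars` data:

```r
# sort() returns the values themselves, in increasing order
head(sort(cars$dist))
# [1]  2  4 10 10 14 16

# indexing the original vector by order() gives the same result
head(cars$dist[order(cars$dist)])
# [1]  2  4 10 10 14 16
```

`order` is the right tool here because we want to rearrange whole rows of a data frame, not just one column.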

```r
# create data sorted by distance
sorted_data <- cars[order_dist, ] # sort rows of cars by order_dist

# view sorted data (also, note how row numbers correspond to order_dist above)
head(sorted_data)
```

```
   speed dist
1      4    2
3      7    4
2      4   10
6      9   10
12    12   14
5      8   16
```

```r
# load dplyr, suppressing its startup messages
suppressMessages(library(dplyr))

# with dplyr, sort by distance
cars_by_dist <- cars %>%
  arrange(dist)

head(cars_by_dist)
```

```
  speed dist
1     4    2
2     7    4
3     4   10
4     9   10
5    12   14
6     8   16
```
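Both approaches can also sort in decreasing order, which is often useful for finding the largest values first. A short sketch using the same `cars` data:

```r
library(dplyr)

# base R: decreasing = TRUE reverses the sort
head(cars[order(cars$dist, decreasing = TRUE), ])

# dplyr: wrapping the column in desc() does the same
head(arrange(cars, desc(dist)))
```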

# Create a Large Number of Simulations


```r
# set up some fake data
treated <- c(1.31, 1.45, 1.12, 1.16, 1.3, 1.5, 1.2, 1.22, 1.42,
             1.14, 1.23, 1.59, 1.11, 1.1, 1.53, 1.52, 1.17, 1.49, 1.62)

control <- c(1.13, 1.71, 1.39, 1.15, 1.33, 1, 1.03, 1.68, 1.76,
             1.55, 1.34, 1.47, 1.74, 1.74, 1.19, 1.15, 1.2, 1.59, 1.47)

diff_observed <- mean(treated) - mean(control)

# set up data in a 'tidy' format (one column for conditions and one for values)
# create a vector with 19 'treated' and 19 'control' labels
condition <- rep(c("treated", "control"), each = 19) %>%
  as.factor()

combined <- c(treated, control) # combine two vectors into 'combined'

# combine the two conditions in a data frame
dataset <- data.frame(condition = condition,
                      values = combined)

# view data
head(dataset)
```

```
  condition values
1   treated    1.3
2   treated    1.4
3   treated    1.1
4   treated    1.2
5   treated    1.3
6   treated    1.5
```

```r
# example of shuffling values of dataset with the sample command
dataset %>%
  mutate(values = sample(values)) %>% # shuffle values column
  head()
```

```
  condition values
1   treated    1.1
2   treated    1.2
3   treated    1.6
4   treated    1.5
5   treated    1.5
6   treated    1.6
```

```r
# permute the data set multiple times
number_of_permutations <- 1000
diff_in_means_sim <- numeric(number_of_permutations) # preallocate result vector

for (i in 1:number_of_permutations) {

  # shuffle data
  data_shuffled <- dataset %>%
    mutate(values = sample(values))

  # calculate means by condition
  mean_treated <- data_shuffled %>%
    filter(condition == "treated") %>%
    pull(values) %>% # pull single column & convert from data.frame to vector
    mean()

  mean_control <- data_shuffled %>%
    filter(condition == "control") %>%
    pull(values) %>% # pull single column & convert from data.frame to vector
    mean()

  # under the null hypothesis, estimate the permuted difference
  diff_in_means_sim[i] <- mean_treated - mean_control
}
```

```r
# plot a histogram of the randomization distribution
hist(diff_in_means_sim)

# plot vertical lines in the left and right tails at the observed difference in means
abline(v = diff_observed, col = "red")
abline(v = -diff_observed, col = "red")
```

Our test statistic is the observed difference in means. If the null hypothesis is true, our test statistic should be “close to zero.” If the alternative hypothesis is true, the test statistic should be “far from zero.” To estimate whether the test statistic is “close to zero” or “far from zero,” we calculate a $$p$$-value.

From the textbook (p. 13, 3e):

> In a randomized experiment the $$p$$-value is the probability that randomization alone leads to a test statistic as extreme as or more extreme than the one observed. The smaller the $$p$$-value, the more unlikely it is that chance assignment is responsible for the discrepancy between groups, and the greater the evidence that the null hypothesis is incorrect.


```r
# calculate the number of simulations in which the difference in means is as
# extreme or more extreme than the observed test statistic, in both tails
right_tail <- sum(diff_in_means_sim >= abs(diff_observed))
left_tail <- sum(diff_in_means_sim <= -abs(diff_observed))

# calculate two-sided p-value
pvalue <- (right_tail + left_tail) / number_of_permutations
pvalue
```

```
[1] 0.33
```
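The same simulation can be written more compactly with `replicate`, which evaluates an expression repeatedly and collects the results in a vector. A sketch equivalent to the loop above, reusing `combined`, `condition`, and `diff_observed` from the earlier chunks (exact results will vary from run to run because the shuffles are random):

```r
# one permuted difference in means: shuffle the values, then compare group means
permute_once <- function() {
  shuffled <- sample(combined) # shuffle all 38 outcome values
  mean(shuffled[condition == "treated"]) - mean(shuffled[condition == "control"])
}

diff_in_means_sim <- replicate(1000, permute_once())

# two-sided p-value: proportion of permuted differences at least as extreme
# as the observed difference
mean(abs(diff_in_means_sim) >= abs(diff_observed))
```

Note that `mean` of a logical vector is just the proportion of `TRUE` values, which is why the last line gives the two-sided $$p$$-value in one step.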

# Using the infer package

The above simulation can also be run using an R package called infer that aims “to perform inference using an expressive statistical grammar.” For more about infer, see *Modern Dive: An Introduction to Statistical and Data Sciences via R* by Chester Ismay and Albert Y. Kim (Ismay and Kim 2018). In short, infer can be used to walk more intuitively and explicitly through the individual steps of many statistical inference methods.

For our purposes, we are going to walk through how to carry out the same analysis we just performed on our data set and calculate its $$p$$-value under the null hypothesis that $$\mu_{t} - \mu_{c} = 0$$.


```r
library(infer)

# Data must be in a 'tidy' format, with one column for treatment condition and
# one column for values. In previous code we concatenated 19 treated and 19
# control values. Here we create a vector identifying the 19 'treated' and 19
# 'control' observations.
condition <- rep(c("treated", "control"), each = 19) %>%
  as.factor()

# We then combine the two in a data frame.
dataset <- data.frame(condition = condition,
                      values = combined)

# Pipe new data to infer functions
inferdata <- dataset %>%

  # Specify the relationship between an outcome and an explanatory variable
  infer::specify(values ~ condition) %>%

  # Declare a null hypothesis of no relationship between treatment
  # assignment and outcome
  infer::hypothesize(null = "independence") %>%

  # Permute the data 1000 times. This randomly reassigns our outcomes to either
  # treatment/control (like the previous for loop)
  infer::generate(reps = 1000, type = "permute") %>%

  # Calculate the difference in means for each permuted dataset
  infer::calculate(stat = "diff in means",
                   order = c("treated", "control"))
```

```r
# view inferdata
head(inferdata)
```

```
# A tibble: 6 x 2
  replicate    stat
      <int>   <dbl>
1         1  0.0379
2         2 -0.0895
3         3  0.0968
4         4  0.0905
5         5  0.0979
6         6  0.0432
```

The result is a new data set of 1,000 simulated differences in means, one from each permutation. If the null hypothesis is true, our original result will not be extreme compared to these simulated results. That is, our observed difference in means will be closer to the center of the new distribution than to the tails. How do we find out whether it is extreme? As before, we can use both calculations and visualizations. Below is a plot to see whether the observed difference appears extreme. We can use `infer::visualize` for this.


```r
library(ggplot2)

infer::visualize(inferdata) +
  geom_vline(xintercept = c(-diff_observed, diff_observed),
             col = "red")
```

The `infer::visualize` command uses the powerful visualization package ggplot2. We can also visualize the data directly with ggplot2.


```r
# a density plot
plot_density <- ggplot(data = inferdata) +
  aes(x = stat) +
  geom_density() +
  geom_vline(xintercept = c(-diff_observed, diff_observed),
             col = "red")

# or, alternatively, a histogram
plot_hist <- ggplot(data = inferdata) +
  aes(x = stat) +
  geom_histogram() +
  geom_vline(xintercept = c(-diff_observed, diff_observed),
             col = "red")

# load library to put two plots side-by-side
# (set fig.height = 2.5 in the chunk header to adjust the height of the figures)
library(gridExtra)

grid.arrange(plot_density, plot_hist, ncol = 2)
```

The plots confirm our earlier finding that the observed difference of -0.08 is not extreme and, therefore, not statistically significant.
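infer can also compute the $$p$$-value directly from the permutation results with its `get_p_value` function, reusing `inferdata` and `diff_observed` from above. A sketch (because the permutations are random, the result will be close to, but not exactly, the loop-based value):

```r
library(infer)

# two-sided p-value: proportion of permuted statistics at least as extreme
# as the observed difference in means
inferdata %>%
  infer::get_p_value(obs_stat = diff_observed, direction = "two-sided")
```

The result is a one-row tibble with a `p_value` column, which makes it easy to pipe into further dplyr steps.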

The infer package is quite powerful and can be used for many other statistical inference methods.

# Sample Problem

#### Statistical Sleuth 3e, Chapter 1, Modified Problem 12: Fish Oil and Blood Pressure.

Researchers used 7 red and 7 black playing cards to randomly assign 14 volunteer males with high blood pressure to one of two diets for four weeks: a fish oil diet and a standard oil diet. The reductions in diastolic blood pressure are shown below.

Fish oil diet: 8 12 10 14 2 0 0
Regular oil diet: -6 0 1 2 -3 -4 2

##### Questions
1. Estimate the difference in mean diastolic blood pressure for the two groups.
2. Using a randomization test with 1000 draws, estimate how extreme the observed difference in means is to the randomization distribution under the assumption that there is no effect of the fish oil diet as compared to the regular diet. HINT: You can use the infer package to do this!
3. Plot the randomization distribution and the observed difference in means.
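To get started, the data can be entered in the same ‘tidy’ format used above. A setup sketch only, not a full solution (the object names `fish_oil`, `regular_oil`, and `bp_data` are our own choices):

```r
fish_oil <- c(8, 12, 10, 14, 2, 0, 0)
regular_oil <- c(-6, 0, 1, 2, -3, -4, 2)

# one column for diet condition and one for the reduction in blood pressure
bp_data <- data.frame(diet = rep(c("fish", "regular"), each = 7),
                      reduction = c(fish_oil, regular_oil))
```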

(Based on a study by H. R. Knapp and G. A. FitzGerald, “The Antihypertensive Effects of Fish Oil,” New England Journal of Medicine 320 (1989): 1037-43.)

This supplement was put together by Omar Wasow, Sukrit S. Puri, and Blaykyi Kenyah. It builds on prior work by Project Mosaic, “an NSF-funded initiative to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.” We are grateful to Nicholas Horton, Linda Loi, Kate Aloisio, and Ruobing Zhang for their effort developing The Statistical Sleuth in R (Horton et al. 2013).

Horton, Nicholas, Linda Loi, Kate Aloisio, and Ruobing Zhang. 2013. “The Statistical Sleuth in R.” http://www.math.smith.edu/~nhorton/sleuth3/.

Ismay, Chester, and Albert Y. Kim. 2018. Modern Dive: An Introduction to Statistical and Data Sciences via R. https://moderndive.com/.