# Supplement to Chapter 20

Sample R code for Ramsey and Schafer’s Statistical Sleuth 3ed for POL346: Applied Quantitative Analysis.

Team 346 pol346.com (Department of Politics, Princeton University)http://princeton.edu/politics
2019-05-12

# Logistic Regression for Binary Response Variables

## Survival in the Donner Party–An Observational Study

From Statistical Sleuth, Chapter 20 (p 602, 3e):

In 1846 the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July, the Donner Party, as it became known, reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested route to the Sacramento Valley. Having reached its full size of 87 people and 20 wagons, the party was delayed by a difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the last survivor was rescued on April 21, 1847, 40 of the 87 members had died from famine and exposure to extreme cold.

These data were used by an anthropologist to study the theory that females are better able to withstand harsh conditions than are males. (Data from D. K. Grayson, “Donner Party Deaths: A Demographic Assessment,” Journal of Anthropological Research 46 (1990): 223-42.) For any given age, were the odds of survival greater for women than for men?

### Statistical Conclusion

While it generally appears that older adults were less likely than younger adults to survive, the data provide highly suggestive evidence that the age effect is more pronounced for females than for males (the two-sided p-value for interaction of gender and age is 0.05). Ignoring the interaction leads to an informal conclusion about the overall difference between men and women: The data provide highly suggestive evidence that the survival probability was higher for women (in the Donner Party) than for the men (two-sided p-value = 0.02 from a drop-in-deviance test). The odds of survival for women are estimated to be 5 times the odds of survival for similarly aged men (95% confidence interval: 1.2 times to 25.2 times, from a drop-in-deviance-based confidence interval).

### Scope of Inference

Since the data are observational, the result cannot be used as proof that women were more apt to survive than men; the possibility of confounding variables cannot be excluded. Furthermore, since the 45 individuals were not drawn at random from any population, inference to a broader population is not justified.


library(Sleuth3)
library(janitor)

library(xtable)
library(stargazer)

suppressMessages(library(dplyr))
library(forcats)
library(ggplot2)
library(cowplot)

library(margins)
library(sjPlot)
library(zeligverse)

theme_set(theme_bw())

donner <- Sleuth3::case2001 %>%
clean_names()

# convert Survived/Died to ones/zeros
donner <- donner %>%
mutate(survived = as.numeric(status == "Survived"),
sex = fct_relevel(sex, c("Male", "Female"))) # male = reference category

# view data
head(donner)

age    sex   status survived
1  23   Male     Died        0
2  40 Female Survived        1
3  40   Male Survived        1
4  30   Male     Died        0
5  28   Male     Died        0
6  40   Male     Died        0

# run logistic regression
glm_sex <- glm(survived ~ sex,       family = binomial, data = donner)
glm_age <- glm(survived ~ age,       family = binomial, data = donner)
glm_sa  <- glm(survived ~ sex + age, family = binomial, data = donner)

# create regression table
stargazer(glm_sex, glm_age, glm_sa,
type = 'html',
header = FALSE)
 Dependent variable: survived (1) (2) (3) sexFemale 1.386** 1.597** (0.671) (0.755) age -0.066** -0.078** (0.032) (0.037) Constant -0.693* 1.819* 1.633 (0.387) (0.999) (1.110) Observations 45 45 45 Log Likelihood -28.643 -28.145 -25.628 Akaike Inf. Crit. 61.286 60.291 57.256 Note: p<0.1; p<0.05; p<0.01

### Explaining results

From Statistical Sleuth, (p 608, 3e)

The fit of a logistic regression model to the Donner Party data, where $$\pi$$ represents survival probability, gives $logit(\hat{\pi}) = 1.63 - 0.078 \times age + 1.60 \times female$ For comparing women 50 years old (A) with women 20 years old (B), the odds ratio is estimated to be exp[-0.078(50 - 20)] = 0.096, or about 1/10. So the odds of a 20-year-old woman surviving were about 10 times the odds of a 50-year-old woman surviving. Comparing a woman ($$female$$ = 1 = A) with a man ($$female$$ = 0 = B) of the same age, the estimated odds ratio is exp(1.60[1 - 0) = 4.95—a woman’s odds were about 5 times the odds of survival of a man of the same age.

### Comparing models


# for ANOVA with logistic regression, need to specify test = "Chisq"
anova(glm_sex, glm_sa, test = "Chisq") %>%
xtable() %>%
print(type = "html",
comment = FALSE) # turn off message about package
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 43 57.29
2 42 51.26 1 6.03 0.0141

### Visualizing models

Look at survival vs age with base R.


# plot with base R
x_age <- seq(15, 65, by = 1)
y_status <- predict(glm_age, list(age = x_age), type="response")

plot(data = donner,
survived ~ age)
lines(x_age, y_status)
points(donner$age, fitted(glm_age), pch=20, col = "darkgrey")   # plot with base R, by sex x_age <- seq(15, 65, by = 1) x_male <- rep("Male", length(x_age)) x_female <- rep("Female", length(x_age)) y_pred_male <- predict(glm_sa, list(age = x_age, sex = x_male), type="response") y_pred_female <- predict(glm_sa, list(age = x_age, sex = x_female), type="response") plot(survived ~ age, data = donner) lines(x_age, y_pred_male, col = "red") lines(x_age, y_pred_female, col = "blue") ### Marginal effects With simple linear regression using two variables, plotting the relationship between some $$Y$$ against some $$X$$ is trivial. With multiple regression, however, it is typically difficult to plot multiple dimensions in 2D. Similarly, more complicated models, such as those with interaction terms or non-linear relationships as in logistic regression can be harder to interpret and plot. One solution is to estimate marginal or partial effects that tell us how an outcome variable changes when an explanatory variable changes while other explanatory variables are held constant. In R there are a growing number of ways to estimate and plot marginal effects. To start, we’ll use the margins package with the Donner data and a multivariable model.  # using library(margins) margins(glm_sa) %>% summary()  factor AME SE z p lower upper age -0.0153 0.0060 -2.5504 0.0108 -0.0270 -0.0035 sexFemale 0.3258 0.1356 2.4028 0.0163 0.0601 0.5916  margins(glm_sa) %>% plot() ### Marginal effects with sjPlot The sjPlot package offers a plot_model command which offers a number of options to easily plot predicted probabilities from logistic regression models.  # with sjPlot package # https://cran.r-project.org/web/packages/sjPlot/vignettes/plot_marginal_effects.html plot_model(glm_sa, type = "pred", terms = "sex") + ggtitle("Predicted probability of survival by sex") + ylab("Predicted probability of survival") + xlab("Sex")  # adding second variable plot_model(glm_sa, type = "pred", terms = c("sex", "age")) + ggtitle("Predicted probability of survival by sex and age") + ylab("Predicted probability of survival") + xlab("Sex")  # change x-axis plot_model(glm_sa, type = "pred", terms = c("age [15, 25, 35, 45, 55, 65]", "sex")) + ggtitle("Predicted probability of survival by age and sex") + ylab("Predicted probability of survival") + xlab("Age") ### Marginal effects with zelig  ##################################### # Logit on binary DV ##################################### logit_out <- donner %>% zelig(survived ~ age + sex, model = "logit", data = ., cite = FALSE ) sim_out <- logit_out %>% setx(age = seq(15, 65, by = 5)) %>% sim() %>% zelig_qi_to_df() slim_out <- qi_slimmer(sim_out) p <- slim_out %>% ggplot() + aes(x = age, y = qi_ci_median) + geom_line() + geom_ribbon(aes(ymin = qi_ci_min, ymax = qi_ci_max), alpha = 0.3, color = NA) + ylab("Survived") + xlab("Age") + theme_cowplot() + scale_y_continuous(labels = scales::percent) p  ##################################### # Logit on binary DV ##################################### logit_out_female <- donner %>% filter(sex == "Female") %>% zelig(survived ~ age, model = "logit", data = ., cite = FALSE ) logit_out_male <- donner %>% filter(sex == "Male") %>% zelig(survived ~ age, model = "logit", data = ., cite = FALSE ) sim_out_male <- logit_out_male %>% setx(age = seq(15, 65, by = 1)) %>% sim() %>% zelig_qi_to_df() sim_out_female <- logit_out_female %>% setx(age = seq(15, 65, by = 1)) %>% sim() %>% zelig_qi_to_df() slim_out_male <- qi_slimmer(sim_out_male) slim_out_female <- qi_slimmer(sim_out_female) names(slim_out_male)[2] <- "age" names(slim_out_female)[2] <- "age" survived_comb <- rbind( slim_out_male[ , c("age", "qi_ci_min", "qi_ci_median", "qi_ci_max") ], slim_out_female[ , c("age", "qi_ci_min", "qi_ci_median", "qi_ci_max") ] ) num_rows <- nrow(slim_out_male) survived_comb$sex <- rep( c("Male", "Female"), each = num_rows ) # "combined"

p <- survived_comb %>%
ggplot() +
aes(x = age, y = qi_ci_median, group = sex) +
geom_line() +
geom_ribbon(aes(ymin = qi_ci_min, ymax = qi_ci_max, fill = sex),
alpha = 0.3, color = NA) +
ylab("Survived") +
xlab("Age") +
scale_y_continuous(labels = scales::percent) +
theme_cowplot()

p

### Summary of findings

While it generally appears that older adults were less likely than younger adults to survive, the data provide highly suggestive evidence that the age effect is more pronounced for females than for males (the two-sided $$p$$-value for interaction of gender and age is 0.05). Ignoring the interaction leads to an informal conclusion about the overall difference between men and women: The data provide highly suggestive evidence that the survival probability was higher for women (in the Donner Party) than for the men (two-sided $$p$$-value = 0.02 from a drop-in-deviance test). The odds of survival for women are estimated to be 5 times the odds of survival for similarly aged men (95% confidence interval: 1.2 times to 25.2 times, from a drop-in-deviance-based confidence interval).

Since the data are observational, the result cannot be used as proof that women were more apt to survive than men; the possibility of confounding variables cannot be excluded. Furthermore, since the 45 individuals were not drawn at random from any population, inference to a broader population is not justified.

## Birdkeeping and Lung Cancer–A Retrospective Observational Study

From Statistical Sleuth, Chapter 20, (p 602-3, 3e)

A 1972-1981 health survey in The Hague, Netherlands, discovered an association between keeping pet birds and increased risk of lung cancer. To investigate bird- keeping as a risk factor, researchers conducted a case-control study of patients in 1985 at four hospitals in The Hague (population 450,000). They identified 49 cases of lung cancer among patients who were registered with a general practice, who were age 65 or younger, and who had resided in the city since 1965. They also selected 98 controls from a population of residents having the same general age structure. (Data based on P.A. Holst, D. Kromhout, and R. Brand, "For Debate: Pet Birds as an Independent Risk Factor for Lung Cancer,’’ British Medical Journal 297 (1988): 13-21.)

Age and smoking history are both known to be associated with lung cancer incidence. After age, socioeconomic status, and smoking have been controlled for, is an additional risk associated with birdkeeping?

### Statistical Conclusion

The odds of lung cancer among birdkeepers are estimated to be 3.8 times as large as the odds of lung cancer among nonbirdkeepers, after accounting for the effects of smoking, sex, age, and socioeconomic status. An approximate 95% confidence interval for this odds ratio is 1.7 to 8.5. The data provide convincing evidence that increased odds of lung cancer were associated with keeping pet birds, even after accounting for the effect of smoking (one-sided $$p$$-value = 0.0008).

### Scope of Inference

Inference extends to the population of lung cancer patients and unaffected individu- als in The Hague in 1985. Statistical analysis of these observational data cannot be used as evidence that birdkeeping causes the excess cancer incidence among bird- keepers, although the data are consistent with that theory. (As further evidence in support of the causal theory, the researchers investigated other potential confound- ing variables [beta-carotene intake, vitamin C intake, and alcohol consumption]. They also cited medical rationale supporting the statistical associations: People who keep birds inhale and expectorate excess allergens and dust particles, increasing the likelihood of dysfunction of lung macrophages, which in turn can lead to diminished immune system response.)

### Code book

• FM = Sex(1 = F, 0 = M)
• AG = Age, in years
• SS = Socioeconomic status (1 = High, 0 = Low), determined by occupation of the household’s principal wage earner
• YR = Years of smoking prior to diagnosis or examination
• CD = Average rate of smoking, in cigarettes per day
• BK = Indicator of birdkeeping (caged birds in the home for more than 6 consecutive months from 5 to 14 years before diagnosis (cases) or examination (controls)

birds <- Sleuth3::case2002 %>%
clean_names()

birds <- birds %>%
mutate(lung_cancer = as.numeric(lc == "LungCancer")) %>%
rename(birdkeeper = bk,
years_smoking = yr)

head(birds)

lc   fm   ss birdkeeper ag years_smoking cd lung_cancer
1 LungCancer Male  Low       Bird 37            19 12           1
2 LungCancer Male  Low       Bird 41            22 15           1
3 LungCancer Male High     NoBird 43            19 15           1
4 LungCancer Male  Low       Bird 46            24 15           1
5 LungCancer Male  Low       Bird 49            31 20           1
6 LungCancer Male High     NoBird 51            24 15           1

glm_bk <- glm(lung_cancer ~ birdkeeper, family = binomial, data = birds)
glm_bk2 <- glm(lung_cancer ~ birdkeeper + ag + years_smoking + cd + ss,
family = binomial, data = birds)

stargazer(glm_bk, glm_bk2,
type = 'html',
header = FALSE)
 Dependent variable: lung_cancer (1) (2) birdkeeperNoBird -1.356*** -1.408*** (0.371) (0.408) ag -0.041 (0.035) years_smoking 0.066*** (0.025) cd 0.024 (0.025) ssLow -0.003 (0.457) Constant -0.030 -0.071 (0.244) (1.728) Observations 147 147 Log Likelihood -86.466 -77.658 Akaike Inf. Crit. 176.931 167.317 Note: p<0.1; p<0.05; p<0.01

# for ANOVA with logistic regression, need to specify test = "Chisq"
anova(glm_bk, glm_bk2, test = "Chisq") %>%
xtable() %>%
print(type = "html",
comment = FALSE) # turn off message about package
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 145 172.93
2 141 155.32 4 17.61 0.0015

### Marginal effects with sjPlot


# with sjPlot package
# https://cran.r-project.org/web/packages/sjPlot/vignettes/plot_marginal_effects.html
plot_model(glm_bk2, type = "pred", terms = "birdkeeper")


plot_model(glm_bk2, type = "pred", terms = c("birdkeeper", "years_smoking"))

plot_model(glm_bk2, type = "pred", terms = c("years_smoking", "birdkeeper"))