Report 2 Guide (draft)

Some helpful code for Report 2.

Team 346 pol346.com (Princeton Univeristy Department of Politics)princeton.edu/politics
2021-04-19

Introduction

Imagine if we were interested in how attitudes about civil unrest in the 1970s influenced political views. To investigate, we could use the American National Election Study Timeseries to model and plot how something like the relationship between the 1972 Unrest Scale relates to feelings toward the Republican presidential candidate.

# tidyverse packages
library(tidyverse)
library(janitor)

# file path management (requires use of R Projects)
library(here)

# data translation
library(foreign)
library(haven)

# plotting
library(ggridges)
library(sjPlot)

# with sjPlot, for changing labels
library(sjlabelled)

# tables
library(stargazer)

Load data

# here() manages local path to data (requires use of R Projects)
anes <- haven::read_dta(here("anes_timeseries_cdf_dta/anes_timeseries_cdf_stata13.dta")) 

NOTE: The haven package has a useful feature when importing the formats .dta or .sav which is that it keeps each columns label information that is stored with those data. For example we might look at the variable V201600 and we can see the @ label that tells us the question description. This is called metadata (data about our data). We can also see that the type of data is dbl+lbl. This is short for double — typical of numeric data that’s not integers — and label.

sjlabelled::get_label(anes[37])  # sex
              VCF0104 
"Respondent - Gender" 
sjlabelled::get_label(anes[313]) # thermometer GOP candidate
                                          VCF0426 
"Thermometer - Republican Presidential Candidate" 
str(anes$VCF0104)
 dbl+lbl [1:59944] 1, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1...
 @ label       : chr "Respondent - Gender"
 @ format.stata: chr "%16.0g"
 @ labels      : Named num [1:4] 0 1 2 3
  ..- attr(*, "names")= chr [1:4] "0. NA; no Pre IW" "1. Male" "2. Female" "3. Other (2016)"
str(anes$VCF0426)
 dbl+lbl [1:59944] NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
 @ label       : chr "Thermometer - Republican Presidential Candidate"
 @ format.stata: chr "%51.0g"
 @ labels      : Named num [1:3] 97 98 99
  ..- attr(*, "names")= chr [1:3] "97. 97-100 Degrees" "98. DK (exc.1968); don't recognize (1980 and later)" "99. NA; no Post IW"
# also get_labels WITH S useful for seeing responses
sjlabelled::get_labels(anes$VCF0105a)
[1] "1. White non-Hispanic (1948-2012)"                           
[2] "2. Black non-Hispanic (1948-2012)"                           
[3] "3. Asian or Pacific Islander, non-Hispanic (1966-2012)"      
[4] "4. American Indian or Alaska Native non-Hispanic (1966-2012)"
[5] "5. Hispanic (1966-2012)"                                     
[6] "6. Other or multiple races, non-Hispanic (1968-2012)"        
[7] "7. Non-white and non-black (1948-1964)"                      
[8] "9. Missing"                                                  

Many functions like lm() work just fine with labeled data. sjPlot::plot_model() however will break on labeled data.

# in general DO NOT use raw variables you haven't recoded as they'll include values like 998 for missing data
lm1 <- lm(VCF0426 ~ VCF0104, data = anes)

# this generates an error due to labeled data
sjPlot::plot_model(lm1, type = "pred", terms = "VCF0104")

# Error: Can't combine `..1` <character> and `..2` <double>.
# Run `rlang::last_error()` to see where the error occurred.

Consequently, it’s important to strip the labels before those variables get passed to lm(). You can do that easily with commands like as.numeric() or as.character() or as.factor(). See sample code below:

anes <- anes %>% 
  mutate(
    feel_trump = VCF9057 %>% as.numeric(),
    male_fct   = case_when(
      VCF0104 == 1 ~ "Male",
      VCF0104 == 2 ~ "Female",
    ) %>% as.factor()
  )

# Now, sjPlot::plot_model() works
lm1 <- lm(feel_trump ~ male_fct, data = anes)
sjPlot::plot_model(lm1, type = "pred", terms = "male_fct")

Recode data

anes <- anes %>% 
  mutate(
    # this is a time series version of ANES with multiple years
    year       = VCF0004 %>% as.factor(),

    # create column for age
    age        = VCF0102 %>% as.numeric(), 
    
    # convert sex to binary 
    male_bin   = case_when(
                   VCF0104 == 1 ~ 1, 
                   TRUE         ~ 0) %>% as.numeric(),
    
    # convert sex to factor 
    male_fct   = case_when(
                   VCF0104 == 1 ~ "Male",
                   VCF0104 == 2 ~ "Female"
                  ) %>% as.factor(),
    
    # create race numeric
    race_num   = VCF0105a %>% as.numeric(),
      
    # create race factor
    race_fct   = case_when(
      VCF0105a == 1 ~ "White",
      VCF0105a == 2 ~ "Black",
      VCF0105a == 3 ~ "AAPI",
      VCF0105a == 4 ~ "Native American",
      VCF0105a == 5 ~ "Hispanic",
      VCF0105a == 6 ~ "Other",
      VCF0105a == 7 ~ "Other", # combine two categories
      VCF0105a == 9 ~ NA_character_  
    ) %>% as.factor(),

    # create smaller race factor
    race5_fct   = case_when(
      VCF0105a == 1 ~ "White",
      VCF0105a == 2 ~ "Black",
      VCF0105a == 3 ~ "AAPI",
      VCF0105a == 5 ~ "Hispanic",
      TRUE          ~ "Other"
    ) %>% as.factor(),
    
    # example: variable with just Whites and Blacks
    race_wb_fct    = case_when(
                       race_num == 1 ~ "White",
                       race_num == 2 ~ "Black",
                       TRUE      ~ NA_character_
                       ) %>% as.factor(),
    
    # example: create race dummy (0/1) variables
    race_white_bin  = (race_num == 1) %>% as.numeric(),
    race_black_bin  = (race_num == 2) %>% as.numeric(),
    race_aapi_bin   = (race_num == 3) %>% as.numeric(),
    race_native_bin = (race_num == 4) %>% as.numeric(),
    race_hisp_bin   = (race_num == 5) %>% as.numeric(),
    race_other_bin  = (race_num == 6 & race_num == 7) %>% as.numeric(),

    # main explanatory and outcome vars
    unrest_scale   = VCF0528 %>% as.numeric(),
    therm_rep_pres = VCF0426 %>% as.numeric()
  )

# subset this version of ANES for the year 1972
a72 <- anes %>% filter(year == 1972)

Model data

Typically we might control for something like 5-7 variables. In this example, we just use a handful to illustrate.

Run two models: bivariate and multiple terms

# Thermometers Republican Candidate vs Unrest Scale 
reduced_model <- lm(formula = therm_rep_pres ~ unrest_scale,
          data    = a72) 

summary(reduced_model)

Call:
lm(formula = therm_rep_pres ~ unrest_scale, data = a72)

Residuals:
   Min     1Q Median     3Q    Max 
-68.20 -14.75   5.25  20.25  34.56 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)     61.29       2.28   26.93   <2e-16 ***
unrest_scale     1.15       0.67    1.72    0.086 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.4 on 732 degrees of freedom
  (1971 observations deleted due to missingness)
Multiple R-squared:  0.00402,   Adjusted R-squared:  0.00266 
F-statistic: 2.95 on 1 and 732 DF,  p-value: 0.086
full_model    <- lm(formula = therm_rep_pres ~ unrest_scale + male_fct + race_wb_fct + age,
          data    = a72) 

summary(full_model)

Call:
lm(formula = therm_rep_pres ~ unrest_scale + male_fct + race_wb_fct + 
    age, data = a72)

Residuals:
   Min     1Q Median     3Q    Max 
-73.06 -13.98   4.39  18.92  50.25 

Coefficients:
                 Estimate Std. Error t value  Pr(>|t|)    
(Intercept)        45.008      4.045   11.13   < 2e-16 ***
unrest_scale        1.079      0.662    1.63      0.10    
male_fctMale       -0.899      1.944   -0.46      0.64    
race_wb_fctWhite   16.092      3.313    4.86 0.0000015 ***
age                 0.784      0.586    1.34      0.18    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.9 on 714 degrees of freedom
  (1986 observations deleted due to missingness)
Multiple R-squared:  0.0405,    Adjusted R-squared:  0.0352 
F-statistic: 7.54 on 4 and 714 DF,  p-value: 0.00000596

Publish data in table with stargazer

# stargazer can take multiple models
stargazer(reduced_model, full_model, header = FALSE, type = "html")
Dependent variable:
therm_rep_pres
(1) (2)
unrest_scale 1.150* 1.080
(0.670) (0.662)
male_fctMale -0.899
(1.940)
race_wb_fctWhite 16.100***
(3.310)
age 0.784
(0.586)
Constant 61.300*** 45.000***
(2.280) (4.040)
Observations 734 719
R2 0.004 0.041
Adjusted R2 0.003 0.035
Residual Std. Error 26.400 (df = 732) 25.900 (df = 714)
F Statistic 2.960* (df = 1; 732) 7.540*** (df = 4; 714)
Note: p<0.1; p<0.05; p<0.01

Plot models with plot_model

# plot_model from sjPlot package
plot_model(
  model  = full_model,    
  type   = "pred",           # plot predicted value
  terms  = c("unrest_scale") # plot unrest_scale as predictor (x-axis)
) +
  theme_minimal() +
  ggtitle("Thermometer GOP Candidate vs Unrest Scale") +
  labs(y = "Predicted Therm-GOP Pres Cand")

Plot model with moderator term

We can also look at how race moderates the main relationship. The slope of the relationship between the GOP Candidate Feeling Thermometer and the Urban Unrest Scale is the same for both groups (white and black) but whites, on average, feel more warmly towards the GOP candidate and blacks, on average, feel more cooly toward towards the GOP candidate (an intercept shift).

plot_model(
  model  = full_model,    
  type   = "pred",                          # plot predicted value
  terms  = c("unrest_scale", "race_wb_fct") # plot unrest_scale as predictor
                                            # plot race_wb_fct as moderator
) +
  theme_minimal() +
  ggtitle("Thermometer GOP Candidate vs Unrest Scale") +
  labs(y     = "Predicted Therm-GOP Pres Cand",
       color = 'Race') # change legend title


This supplement was put together by Omar Wasow.