Intro to Summary Tables

Supplement to POL346

Team POL346 pol346.com (Department of Politics, Princeton University)http://princeton.edu/politics
2020-05-17

Table of Contents


Introduction

The first table in almost every published study we read gives descriptive statistics. Depending on the field, type of project, and publication, there will be some variation in the design of these tables. Broadly, however, they serve the same purpose: to help readers get a sense of the distribution of the variables used in the study. Measures included will also vary; some authors will include median and IQR, while others will use mean and standard deviation, and others will include all 4. Generally, the number of observations for each variable will be included; this helps get a sense of where there is missingnessness. Summary tables also help us, the researchers to identify issues in how variables are coded and to interpret our own coefficients.

In this document, we will be looking at a selection of variables from the 2016 American National Election Survey. To make our tables, we will use two packages: arsenal and knitr. You will by now be familiar with the latter’s kable() function. When paired with arsenal::tableby(), it can be used to make informative, well-formatted summary tables with ease.

For more details on the tableby() function, consult this great resource.

Loading packages


# set global options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

library(tidyverse)
library(knitr)
library(foreign)
library(stringr)
library(MatchIt)
library(naniar)

# install.packages("arsenal")
library(arsenal)

# NOTE library(kableExtra) may cause output issues with library(arsenal)

Loading and cleaning the data

The hardest part of assembling a summary table is to have the variables you’ll use exactly as you want them. Here, I selected a series of variables from the 2016 ANES and cleaned them.


anes <- read.dta("anes_timeseries_2016_Stata12.dta")

anes <- anes %>% 
  mutate(
    race                  = V161310x,
    age                   = V161267,
    death_penalty         = V161233x,
    vaccines              = V162162x,
    scientists            = V162112,
    obedient_self_reliant = V162241,
    considerate_behaved   = V162242,
    stocks                = V161350,
    pid                   = V161158x,
    educ                  = V161270,
    married               = V161269
    )

anes1 <- anes %>% 
  dplyr::select(race, age, death_penalty, vaccines, scientists, 
         obedient_self_reliant,
         stocks, pid, educ, married) # keep selected variables

anes1 <- anes1 %>% 
  # mutate_if conducts a logical test (.predicate) like is.factor(), 
  # if TRUE, runs function (.funs), here convert to character
  mutate_if(.predicate = is.factor, .funs = as.character) 

# remove negative values -- those are just NAs
anes1 <- anes1 %>% 
  # mutate_all applies a function to all columns
  # here we run a case_when function on each column, one at a time 
  # str_sub function checks if first character == "-", assign NA,
  # otherwise return original data in as.character() format
  mutate_all(
    .funs = function(x)
      case_when(
        str_sub(x, start = 1, end = 1) == "-" ~ NA_character_, 
        TRUE ~ as.character(x)
      )
    )

# I want to remove the numbers at the beginning of strings
# make vaccines, death_penalty, age, scientists thermometer numeric
# and make educ and pid indicator variables

anes1 <- anes1 %>% 
  # mutate_at applies a function to specfic columns
  # str_sub extracts and replaces substrings from a character vector
  mutate_at(.vars = c("vaccines", "death_penalty"), 
            .funs = function(x) str_sub(x, start = 1, end = 2)) %>% 
  mutate_at(.vars = c("age", "scientists", "vaccines", "death_penalty"),
            .funs = as.numeric) %>% 
  mutate_at(.vars = c("stocks", "married", "race", "obedient_self_reliant"), 
            .funs = function(x) str_sub(x, start = 4))


anes1 <- anes1 %>% 
  mutate(
    educ_BA    = ifelse(str_sub(educ, start = 1, end = 2) %in% c("13", "14", "15", "16"), 
                          "BA or Higher", "Less than BA"),
    republican = ifelse(pid %in% c("7. Strong Republican", 
                                   "6. Not very strong Republican", 
                                   "5. Independent-Republican"), 
                                    "Yes", "No") %>% as.factor(),
    race       = ifelse(str_sub(race, start = 1, end = 5) == "Other", "Other", race),
    scientists = ifelse(scientists > 100, NA, scientists)
    )

anes1 <- anes1 %>% 
  dplyr::select(age, race, educ_BA, obedient_self_reliant,
         republican, stocks, scientists, vaccines, death_penalty)

After cleaning, the data looks like this. Note that some variables are numeric (such as age and scientists, the feeling thermometer towards scientists). All others are character types. As of now, none of these are factor variables.


head(anes1, 4)

  age                race      educ_BA obedient_self_reliant
1  29 White, non-Hispanic Less than BA             Obedience
2  26 White, non-Hispanic BA or Higher         Self-reliance
3  23 White, non-Hispanic Less than BA         Self-reliance
4  58 White, non-Hispanic Less than BA         Self-reliance
  republican stocks scientists vaccines death_penalty
1        Yes     No         70        4             1
2        Yes     No         70        1             4
3         No     No         70        5             1
4        Yes    Yes         50        4             1

Simple summary table

Now let’s look at a very simple summary table using tableby. You’ll note that tableby uses the same syntax as lm(). Anything to the left of the tilde (~) will be a column, and anything on the right will be a row. In this first example, I leave the left side blank. This means that summary statistics will be presented for the full data, not separating by group.


tableby( ~ republican + race + age +  educ_BA + 
           stocks + vaccines + scientists + obedient_self_reliant, 
         data   = anes1,
         digits = 2) %>% 
  summary() %>% 
  kable()
Overall (N=4270)
republican
   No 2541 (59.5%)
   Yes 1729 (40.5%)
race
   N-Miss 33
   Asian, native Hawaiian or other Pacif Islr,non-Hispanic 148 (3.5%)
   Black, non-Hispanic 397 (9.4%)
   Hispanic 450 (10.6%)
   Native American or Alaska Native, non-Hispanic 27 (0.6%)
   Other 177 (4.2%)
   White, non-Hispanic 3038 (71.7%)
age
   N-Miss 121
   Mean (SD) 49.58 (17.58)
   Range 18.00 - 90.00
educ_BA
   BA or Higher 1635 (38.3%)
   Less than BA 2635 (61.7%)
stocks
   N-Miss 102
   No 2218 (53.2%)
   Yes 1950 (46.8%)
vaccines
   N-Miss 671
   Mean (SD) 2.33 (1.67)
   Range 1.00 - 7.00
scientists
   N-Miss 655
   Mean (SD) 76.70 (19.44)
   Range 0.00 - 100.00
obedient_self_reliant
   N-Miss 646
   Both 70 (1.9%)
   Obedience 1684 (46.5%)
   Self-reliance 1870 (51.6%)

For numeric variables, like age and vaccines, we see the mean and sd. Character types, by contrast, are treated as categorical variables. For them, we receive a count and proportion.

Subgroups

If you want to break down your explanatory and control variables by subgroups, doing so is extremely easy. Just put the variable you want to group by before the tilde (~).


tableby(republican ~ race + age +  educ_BA + stocks + 
           vaccines + scientists + obedient_self_reliant, 
         data   = anes1, 
         test   = FALSE, # do not do t-tests or give p-values
         digits = 2) %>% 
  summary() %>% 
  kable()
No (N=2541) Yes (N=1729) Total (N=4270)
race
   N-Miss 22 11 33
   Asian, native Hawaiian or other Pacif Islr,non-Hispanic 93 (3.7%) 55 (3.2%) 148 (3.5%)
   Black, non-Hispanic 380 (15.1%) 17 (1.0%) 397 (9.4%)
   Hispanic 347 (13.8%) 103 (6.0%) 450 (10.6%)
   Native American or Alaska Native, non-Hispanic 19 (0.8%) 8 (0.5%) 27 (0.6%)
   Other 115 (4.6%) 62 (3.6%) 177 (4.2%)
   White, non-Hispanic 1565 (62.1%) 1473 (85.7%) 3038 (71.7%)
age
   N-Miss 70 51 121
   Mean (SD) 48.13 (17.46) 51.73 (17.54) 49.58 (17.58)
   Range 18.00 - 90.00 18.00 - 90.00 18.00 - 90.00
educ_BA
   BA or Higher 952 (37.5%) 683 (39.5%) 1635 (38.3%)
   Less than BA 1589 (62.5%) 1046 (60.5%) 2635 (61.7%)
stocks
   N-Miss 69 33 102
   No 1479 (59.8%) 739 (43.6%) 2218 (53.2%)
   Yes 993 (40.2%) 957 (56.4%) 1950 (46.8%)
vaccines
   N-Miss 405 266 671
   Mean (SD) 2.39 (1.70) 2.25 (1.61) 2.33 (1.67)
   Range 1.00 - 7.00 1.00 - 7.00 1.00 - 7.00
scientists
   N-Miss 391 264 655
   Mean (SD) 79.32 (19.27) 72.84 (19.06) 76.70 (19.44)
   Range 0.00 - 100.00 0.00 - 100.00 0.00 - 100.00
obedient_self_reliant
   N-Miss 389 257 646
   Both 38 (1.8%) 32 (2.2%) 70 (1.9%)
   Obedience 903 (42.0%) 781 (53.1%) 1684 (46.5%)
   Self-reliance 1211 (56.3%) 659 (44.8%) 1870 (51.6%)

Added features


# change variable names:
var_names <- list(
  republican    = "Republican",
  race          = "Race",
  stocks        = "Stock Market Participation",
  death_penalty = "Support/Oppose Death Penalty"
)

# and add title (always do this):
tableby(republican ~ race + stocks + death_penalty, 
         data   = anes1, 
         test   = FALSE, 
         digits = 2) %>% 
  summary(labelTranslations = var_names) %>%   # variable names
  kable(caption = "Selected Variables by Republican Party ID") # title
Table 1: Selected Variables by Republican Party ID
No (N=2541) Yes (N=1729) Total (N=4270)
Race
   N-Miss 22 11 33
   Asian, native Hawaiian or other Pacif Islr,non-Hispanic 93 (3.7%) 55 (3.2%) 148 (3.5%)
   Black, non-Hispanic 380 (15.1%) 17 (1.0%) 397 (9.4%)
   Hispanic 347 (13.8%) 103 (6.0%) 450 (10.6%)
   Native American or Alaska Native, non-Hispanic 19 (0.8%) 8 (0.5%) 27 (0.6%)
   Other 115 (4.6%) 62 (3.6%) 177 (4.2%)
   White, non-Hispanic 1565 (62.1%) 1473 (85.7%) 3038 (71.7%)
Stock Market Participation
   N-Miss 69 33 102
   No 1479 (59.8%) 739 (43.6%) 2218 (53.2%)
   Yes 993 (40.2%) 957 (56.4%) 1950 (46.8%)
Support/Oppose Death Penalty
   N-Miss 69 25 94
   Mean (SD) 2.23 (1.23) 1.54 (0.93) 1.95 (1.17)
   Range 1.00 - 4.00 1.00 - 4.00 1.00 - 4.00

# change which summary statistics are displayed
tableby(republican ~ race + vaccines + scientists,
        numeric.stats = c("N", "median", "q1q3"), # give us count, median, and IQR for numeric
        data   = anes1, 
        test   = FALSE, 
        digits = 2) %>% 
  summary() %>% 
  kable(caption = "Selected Variables by Republican Party ID") # title
Table 2: Selected Variables by Republican Party ID
No (N=2541) Yes (N=1729) Total (N=4270)
race
   N-Miss 22 11 33
   Asian, native Hawaiian or other Pacif Islr,non-Hispanic 93 (3.7%) 55 (3.2%) 148 (3.5%)
   Black, non-Hispanic 380 (15.1%) 17 (1.0%) 397 (9.4%)
   Hispanic 347 (13.8%) 103 (6.0%) 450 (10.6%)
   Native American or Alaska Native, non-Hispanic 19 (0.8%) 8 (0.5%) 27 (0.6%)
   Other 115 (4.6%) 62 (3.6%) 177 (4.2%)
   White, non-Hispanic 1565 (62.1%) 1473 (85.7%) 3038 (71.7%)
vaccines
   N 2136 1463 3599
   Median 2.00 2.00 2.00
   Q1, Q3 1.00, 4.00 1.00, 3.00 1.00, 4.00
scientists
   N 2150 1465 3615
   Median 85.00 70.00 85.00
   Q1, Q3 70.00, 99.00 60.00, 85.00 60.00, 93.00

Significance tests

There is also a way to conduct t.tests and other significance tests. To do so, set test = TRUE. For this option, I also recommend using results = "asis" in the chunk header instead of using kable(), and setting pfootnote in the summary function to TRUE. This gives details of what kind of test was used to establish significance. Finally, to set a title on this table without kable, use title = "something" in the summary() function.


# is there a significant difference between groups? Chi-square and t-test options
tableby(republican ~ age + death_penalty + vaccines + scientists,
        numeric.stats = c("N", "mean", "sd"),
        data          = anes1, 
        test          = TRUE, # set test to true (and remember to pair with mean/sd)
        numeric.test  = "kwt", cat.test = "chisq", # set test types (linear ANOVA default)
        digits        = 2) %>% 
  summary(title = "Selected Variables by Republican Party ID",
          pfootnote = TRUE) # display methods used
Selected Variables by Republican Party ID
No (N=2541) Yes (N=1729) Total (N=4270) p value
age < 0.0011
   N 2471 1678 4149
   Mean 48.13 51.73 49.58
   SD 17.46 17.54 17.58
death_penalty < 0.0011
   N 2472 1704 4176
   Mean 2.23 1.54 1.95
   SD 1.23 0.93 1.17
vaccines 0.1391
   N 2136 1463 3599
   Mean 2.39 2.25 2.33
   SD 1.70 1.61 1.67
scientists < 0.0011
   N 2150 1465 3615
   Mean 79.32 72.84 76.70
   SD 19.27 19.06 19.44
  1. Kruskal-Wallis rank sum test

Showing balance on covariates after matching:

Finally, after conducting matching, it is possible to use tableby() to show that your covariates are now balanced. In this example, I match the treatment, republican, on two variables, age and stocks. Stocks, here, I am using as a proxy for status/wealth. Notice that I must do some data cleaning before matching. I had to (1) change the character vectors to dummy variables, and (b) omit any NAs from this subsample of the data. I conducted 3 t-tests and noted a statisically significant difference for stocks and age on probability of being republican, but not on educ_BA.


anes1 <- anes1 %>% 
  mutate(
    educ_BA_numeric = ifelse(educ_BA == "BA or Higher", 1, 0),
    republican      = ifelse(republican == "Yes", 1, 0),
    stocks          = ifelse(stocks == "Yes", 1, 0),
    id              = row_number()
    ) %>% 
  na.omit()

with(anes1, t.test(age ~ republican))

    Welch Two Sample t-test

data:  age by republican
t = -5.845, df = 2919.7, p-value = 5.625e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.792802 -2.384948
sample estimates:
mean in group 0 mean in group 1 
       47.63609        51.22496 

with(anes1, t.test(stocks ~ republican))

    Welch Two Sample t-test

data:  stocks by republican
t = -8.7854, df = 2946.3, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1863508 -0.1183466
sample estimates:
mean in group 0 mean in group 1 
      0.4107863       0.5631350 

with(anes1, t.test(educ_BA_numeric ~ republican)) # not statistically significant

    Welch Two Sample t-test

data:  educ_BA_numeric by republican
t = -1.0687, df = 2947.3, p-value = 0.2853
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05210572  0.01534282
sample estimates:
mean in group 0 mean in group 1 
      0.3916331       0.4100145 

# another way to show this:
# summary(glm(republican ~ educ_BA_numeric + stocks + age, data = anes1, family = binomial()))

matched <- matchit(republican ~ age + stocks + educ_BA_numeric,
                   data   = anes1,
                   method = "nearest")

anes_matched <- match.data(matched)

Now, I only need to feed the matched data into our previous tableby() output:


# using tableby
tableby(republican ~ age + stocks + educ_BA_numeric,
        numeric.stats    = "meansd", # only one summary stat
        numeric.simplify = TRUE,     # put on the same row as variable name
        numeric.test     = "kwt",    # manually set test types
        cat.test         = "chisq",  # manually set test types
        data             = anes_matched, 
        test             = TRUE,     # set test to true (remember to pair w/ mean/sd)
        digits           = 2) %>% 
  summary(title = "Covariate Balance on Matched Characteristics",
          pfootnote = TRUE) 
Covariate Balance on Matched Characteristics
0 (N=1378) 1 (N=1378) Total (N=2756) p value
age 50.79 (17.03) 51.22 (17.67) 51.01 (17.35) 0.6071
stocks 0.56 (0.50) 0.56 (0.50) 0.56 (0.50) 0.9691
educ_BA_numeric 0.44 (0.50) 0.41 (0.49) 0.42 (0.49) 0.1331
  1. Kruskal-Wallis rank sum test

This supplement was put together by Beatriz Barros. Please email any questions, corrections or concerns to .