Introduction to Time Trends (draft)

Developing intuition for time trends with solar radiation and skin cancer data

Team 346 pol346.com (Department of Politics, Princeton University)http://princeton.edu/politics
2020-05-04

Introduction

Data contains yearly skin cancer rates (per 100,000 people) in Connecticut from 1938 to 1972 with a code indicating those years that came two years after higher than average sunspot activity and those years that came two years after lower than average sunspot activity.

Setup


library(dplyr)
library(janitor)

library(ggplot2)
library(sjPlot)    # plot_model command
library(patchwork) # combine plots
library(cowplot)   # combine plots

library(stargazer)

theme_set(theme_minimal(base_size = 16))


solar <- Sleuth3::ex0323 %>% clean_names()

dim(solar)

[1] 35  3

head(solar)

year cancer_rate sunspot_activity
1 1938         0.8              Low
2 1939         1.3             High
3 1940         1.4             High
4 1941         1.2             High
5 1942         1.7              Low
6 1943         1.8              Low

Plot data


solar %>%
ggplot() +
aes(y = cancer_rate, x = year, color = sunspot_activity) +
geom_point() 

Run models


# single mean model (one intercept)
lm_mean    <- solar %>% lm(cancer_rate ~ 1, data = .)

# two mean model (two intercepts)
lm_sunspot <- solar %>% lm(cancer_rate ~ sunspot_activity, data = .)

# single conditional model (one slope, one intercept)
lm_year    <- solar %>% lm(cancer_rate ~ year, data = .)

# single conditional mean + mean shift (one slope, two intercepts)
lm_sunspot_and_year <- solar %>% lm(cancer_rate ~ sunspot_activity + year, data = .)

Look at table of results


stargazer(
lm_mean, lm_sunspot, lm_year, lm_sunspot_and_year,
type = "html", header = FALSE, digits = 2, omit.stat = c("f", "ser")
)
 Dependent variable: cancer_rate (1) (2) (3) (4) sunspot_activityLow -0.44 -0.44*** (0.40) (0.09) year 0.11*** 0.11*** (0.01) (0.004) Constant 2.92*** 3.17*** -212.00*** -212.00*** (0.20) (0.30) (11.30) (8.68) Observations 35 35 35 35 R2 0.00 0.04 0.92 0.95 Adjusted R2 0.00 0.01 0.91 0.95 Note: p<0.1; p<0.05; p<0.01

Cancer rate modeled from single mean


cancer_rate_mean <- mean(solar$cancer_rate) solar$predicted_with_mean <- predict(lm_mean)   # Save the predicted values
solar$residuals_with_year_no_sunspot <- residuals(lm_year) # Save the residual values plot_year_no_sunspot <- solar %>% ggplot() + aes(x = year, y = cancer_rate, color = sunspot_activity) + geom_point() + geom_abline(aes(intercept = lm_year$coefficients[1], slope = lm_year$coefficients[2]), alpha = .5) + geom_segment(aes(xend = year, yend = predicted_with_year_no_sunspot), alpha = .3) + # alpha to fade lines geom_point(aes(y = predicted_with_year_no_sunspot), shape = 1, size = 1.5, alpha = .75) plot_year_no_sunspot Cancer as a function of sunspot_activity with year  solar$predicted_with_year <- predict(lm_sunspot_and_year)   # Save the predicted values
solar\$residuals_with_year <- residuals(lm_sunspot_and_year) # Save the residual values

plot_sunspot_and_year <- solar %>%
ggplot() +
aes(x = year, y = cancer_rate, color = sunspot_activity) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, size = .5) +
geom_segment(aes(xend = year, yend = predicted_with_year), alpha = .3) +  # alpha to fade lines
geom_point(aes(y = predicted_with_year), shape = 1, size = 1.5, alpha = .75)

plot_sunspot_and_year

Plot models


p1 <-
plot_model(lm_sunspot, type = "pred", terms = "sunspot_activity") +
scale_y_continuous(limits = c(0, 4)) +
ggtitle("Without year trend")

p2 <-
plot_model(lm_sunspot_and_year, type = "pred", terms = c("sunspot_activity")) +
scale_y_continuous(limits = c(0, 4)) +
ggtitle("With year trend")

# combine plots
cowplot::plot_grid(p1, p2, labels = c("A", "B"))

In pane (A), we see the model that does not control for year trends. There is a lot of unexplained variation and the estimates have wide confidence intervals. In pane (B), the model does control for year and we are better able to isolate an effect of sunspot activity.

In this example, we remove the “year effect” by removing the conditional mean. In other words, we have a model for a predicted value of cancer rates by year cancer ~ year. Once we account for that, we can then ask is there a significant difference between the residuals for high sunspot years vs low sunspot years?
Put another way, if we know the year and have a model, we can make a good prediction about skin cancer rates in Connecticut. That predicted value is our conditional mean. Now imagine we just calculate the deviation from the one conditional mean to estimate how much rates vary in high vs low sunspot years. From that we’d have two groups of residuals and we can see if those two groups are significantly different. Once we’ve accounted for the year-over-year trend, we could simply run a t-test on the residuals in low years vs the residuals in high years. That essentially what cancer ~ sunspot_activity + year` is doing in pane (B).