Does the price for a pack of cigarettes influence how much people smoke? The Cigarette Consumption Panel Data Set has relevant data across 48 states from 1985 to 1995. In this assignment, we’ll explore the relationship between cigarette consumption and average price per pack.

THE CIGARETTE CONSUMPTION PANEL DATA SET

The data set consists of annual data for the 48 continental U.S. states from 1985 – 1995. Quantity consumed is measured by annual per capita cigarette sales in packs per fiscal year, as derived from state tax collection data. The price is the average retail cigarette price per pack during the fiscal year, including taxes. Income is per capita income. The general sales tax is the average tax, in cents per pack, due to the broad- based state sales tax applied to all consumption goods. The cigarette-specific tax is the tax applied to cigarettes only. All prices, income, and taxes used in the regressions in this chapter are deflated by the Consumer Price Index and thus are in constant (real) dollars. We are grateful to Professor Jonathan Gruber of MIT for providing us with these data.

  1. Either go to https://vincentarelbundock.github.io/Rdatasets/datasets.html and download ‘The Cigarette Consumption Panel Data Set’ with the name Cigarette.csv or use the command below to load it directly from a URL.
# from internet with shortened URL
cig <- read.csv(url("http://bit.ly/346cig")) %>% clean_names()
dim(cig)
## [1] 528  10
names(cig)
##  [1] "x"      "state"  "year"   "cpi"    "pop"    "packpc" "income" "tax"   
##  [9] "avgprs" "taxs"
  1. The documentation for cigarette.csv tells us it is a panel data set of 48 observations from 1985 to 1995 with 528 observations from the United States. The variables are:

    • state: state
    • year: year
    • cpi: consumer price index
    • pop: state population
    • packpc: number of packs per capita
    • income: state personal income (total, nominal)
    • tax: average state, federal, and average local excise taxes for fiscal year
    • avgprs: average price during fiscal year, including sales taxes
    • taxs: average excise taxes for fiscal year, including sales taxes

(Source: Professor Jonathan Gruber, MIT. References: Stock, James H. and Mark W. Watson (2003) Introduction to Econometrics, Addison-Wesley Educational Publishers, chapter 10.)

  1. Explore the data briefly (see sample code below).
    • Which states have the smallest and the largest populations?
    • As measured by income, which state is the poorest and which the richest?
    • As measured by packpc, which states appear to smoke the least and the most?
    • Look at packpc for all 48 states. Does there appear to be significant variation across states?
# sample code (change chunk header to eval = TRUE)
cig %>% 
  group_by(state) %>% 
  summarize(mean = mean(taxs, na.rm = TRUE)) %>%
  arrange(mean) %>%
  print(n = 5)
  1. Visualize data
    • Plot packpc vs avgprs. What pattern do you observe?
    • In your plot, are there any apparent outlying observations?
    • How could you identify those outliers?
    • Create a scatter plot of packpc vs avgprs with text labels set to state for each observation rather than a point (see sample code below). Which states appear to be outliers?
    • Create a scatter plot of avgprs vs year with text labels for each observation rather than a point. What patterns do you observe?
# sample code (change chunk header to eval = TRUE)
iris %>% 
  ggplot() +
    aes(x = Petal.Length, y = Petal.Width) +
    geom_text(aes(label = Species), size = 3)
  1. Recoding and transformation:
    • What is the distribution of income?
    • What is the distribution of population?
    • What transformations might be helpful for these data?
    • Which state appeared to consume unsually small quantities of cigarettes in your plot?
    • Use dplyr::case_when to create a categorical variable outliers for three outlying states (two “high”, one “low”) and a category “med” for non-outlying states.
    • Create a scatter plot of packpc vs avgprs with outlying states marked with a different color or symbol
cig <- cig %>%
    mutate(
        price_low_med_high = case_when(
            # syntax is LOGICAL on left ~ ASSIGNED VALUE on right
            avgprs <= 122 ~ "low",  # if price <= 122, assign "low"
            avgprs >= 173 ~ "high", # if price >= 173, assign "high"
            TRUE ~ "med"            # for all others, assign "med"
        ),
        price_low_med_high = fct_relevel(price_low_med_high, 
                                         c("low", "med", "high"))
    )

# check recoding worked
plot(packpc ~ price_low_med_high, cig)
  1. Create a “naive,” bivariate model in which you regress number of packs per capita packpc on average price during fiscal year avgprs. Look at the codebook and add some appropriate control variables to the right-hand side of your regression. Create a total of three regressions: the simple bivariate model, a model that controls for population and income and another model that controls for population, income and tax the average state, federal, and average local excise taxes (for now, do not use year or state as control variables).

  2. With stargazer create a single regression table with all three models from above.

How does the relationship between cigarette consumption and price change across the models? Consider some of things we care about in estimating relationships such as sign, magnitude, substance, and uncertainty. Is the relationship seem robust to inclusion of controls?

  1. Run a forth model include state as right hand side control variable. In R, look at a summary() of your regression results. Notice, because state is a factor with 48 levels, we get an unwieldy 47 rows of results in our summary for that one variable. Create a regression table with stargazer using the same model but use the omit command in stargazer to suppress the state fixed effects:
# sample code (change chunk header to eval = TRUE)
stargazer(
  lm1, lm2, lm3, lm4, 
  omit = c("state"),  # remove any row with text state in it
  type = 'text'       # for results in console
  )     
  1. Now, run a fifth model with year as a right hand side control (exclude state). Look at the summary() of your model. In your regression, is year being treated as a numeric or factor variable? Look at your regression output. You can also look at the results of the str command above or use the class() command. What should it be? Rerun your fifth regression with the correct specification. Look at the model using summary(). Has it changed?

  2. Create a sixth model with state and year fixed effects. Using stargazer, print three models: one with controls but no state or year fixed effects (model 3 above), one with controls and state fixed effects (model 4 above), and one with controls plus state and year fixed effects (model 6). Again, consider how the relationship between cigarette consumption and price changes across the models. How do the measures of magnitude, sign and uncertainty change?

# sample code (change chunk header to eval = TRUE)
stargazer(
  lm3, lm4, lm6, 
  omit      = c("state|year"), # remove state & year rows
  omit.stat = c("f", "ser"),   # drop two model stats, narrows columns
  type      = 'text'           # for results in console
  )
  1. If you haven’t already, install the packages sjPlot, jtools and interactions. Now let’s plot some of the models. With sjPlot, run plot_model on your bivariate regression object (model 1). For more about these packages, see:
# sample code (change chunk header to eval = TRUE)
library(sjPlot)
# plot regression line and smoothed running average
plot_model(lm1)

# plot regression line, data and running average of data
plot_model(lm1, show.data = TRUE)
  1. With sjPlot now run plot_model on some of your other regression output. How do these plots differ from the bivariate case? Why are they different?
# sample code (change chunk header to eval = TRUE)

# create coefficient plot with sjPlot
plot_model(lm2)

Just for comparison, use jtlools and run plot_coefs on some of your multivariate models.

# sample code (change chunk header to eval = TRUE)

library(jtools)
# create coefficient plot with jtools
plot_coefs(lm2)
  1. Another way to plot multivariate models is to just plot the marginal effects. That is, to plot the marginal effect (a one unit change) in one variable (\(x\)) on the predicted value of another (\(y\)) holding remaining variables constant. With sjPlots::plot_model plot your third model from above (e.g., model with controls but not state and year fixed effects). Still using model 3, add a second variable tax to the terms option.
# sample code (change chunk header to eval = TRUE)

plot_model(lm3, type = "pred", terms = "avgprs", show.data = TRUE)

plot_model(lm3, type = "pred", terms = c("avgprs", "tax"))

How does tax moderate the marginal effect of avgprs on packpc?

  1. Run a seventh model estimating packpc vs avgprs with controls for population, tax and income, year fixed effects and add a new control outliers (this variable should have been created above). Do not include state fixed effects. Look at the summary() of your model. Use the interactions package to plot packpc vs avgprs while accounting for how outliers moderate the effect.
# sample code (change chunk header to eval = TRUE)

library(interactions)
# set predictor to avgprs and show how outliers acts as moderater
interact_plot(lm7, pred = "avgprs", modx = "outliers", plot.points = TRUE)
  1. Using the package plm (for panel linear models) re-run models 3, 4, 5 and 6. Compare the results of each lm() command with a similar call in plm using stargazer.
# sample code (change chunk header to eval = TRUE)

library(plm)

# No state or year fixed effects, everything in one 'pool' 
plm1 <- plm(formula = packpc ~ avgprs + log_pop + log_inc + tax, 
            data = cig, 
            index = c("state", "year"), # tell plm unit and time vars
            model = "pooling")          # tell plm to pool all data

stargazer(lm3, plm1, type = 'text')
# sample code (change chunk header to eval = TRUE)

plm2 <- plm(formula = packpc ~ avgprs + log_pop + log_inc + tax, 
            data = cig, 
            index = c("state", "year"), # tell plm unit and time vars
            effect = "individual",      # tell plm only unit fixed effects 
            model = "within")

stargazer(lm4, plm2, type = 'text', omit = "state")
# sample code (change chunk header to eval = TRUE)

plm3 <- plm(formula = packpc ~ avgprs + log_pop + log_inc + tax, 
            data = cig, 
            index = c("state", "year"), # tell plm unit and time vars
            effect = "time",            # tell plm only time fixed effects 
            model = "within")

stargazer(lm5, plm3, type = 'text', omit = "year")
# sample code (change chunk header to eval = TRUE)

plm4 <- plm(formula = packpc ~ avgprs + log_pop + log_inc + tax, 
            data = cig, 
            index = c("state", "year"), # tell plm unit and time vars
            effect = "twoways",         # tell plm unit & time fixed effects 
            model = "within")

stargazer(lm6, plm4, type = 'text', omit = "state|year")