Does the price for a pack of cigarettes influence how much people smoke? The Cigarette Consumption Panel Data Set has relevant data across 48 states from 1985 to 1995. In this assignment, we’ll explore the relationship between cigarette consumption and average price per pack.

The data set consists of annual data for the 48 continental U.S. states from 1985 – 1995. Quantity consumed is measured by annual per capita cigarette sales in packs per fiscal year, as derived from state tax collection data. The price is the average retail cigarette price per pack during the fiscal year, including taxes. Income is per capita income. The general sales tax is the average tax, in cents per pack, due to the broad- based state sales tax applied to all consumption goods. The cigarette-specific tax is the tax applied to cigarettes only. All prices, income, and taxes used in the regressions in this chapter are deflated by the Consumer Price Index and thus are in constant (real) dollars. We are grateful to Professor Jonathan Gruber of MIT for providing us with these data.

- Either go to https://vincentarelbundock.github.io/Rdatasets/datasets.html and download ‘The Cigarette Consumption Panel Data Set’ with the name Cigarette.csv or use the command below to load it directly from a URL.

```
# from internet with shortened URL
cig <- read.csv(url("http://bit.ly/346cig")) %>% clean_names()
dim(cig)
```

`## [1] 528 10`

`names(cig)`

```
## [1] "x" "state" "year" "cpi" "pop" "packpc" "income" "tax"
## [9] "avgprs" "taxs"
```

The documentation for

`cigarette.csv`

tells us it is a panel data set of 48 observations from 1985 to 1995 with 528 observations from the United States. The variables are:- state: state
- year: year
- cpi: consumer price index
- pop: state population
- packpc: number of packs per capita
- income: state personal income (total, nominal)
- tax: average state, federal, and average local excise taxes for fiscal year
- avgprs: average price during fiscal year, including sales taxes
- taxs: average excise taxes for fiscal year, including sales taxes

(Source: Professor Jonathan Gruber, MIT. References: Stock, James H. and Mark W. Watson (2003) *Introduction to Econometrics*, Addison-Wesley Educational Publishers, chapter 10.)

- Explore the data briefly (see sample code below).
- Which states have the smallest and the largest populations?
- As measured by
`income`

, which state is the poorest and which the richest? - As measured by
`packpc`

, which states appear to smoke the least and the most? - Look at
`packpc`

for all 48 states. Does there appear to be significant variation across states?

```
# sample code (change chunk header to eval = TRUE)
cig %>%
group_by(state) %>%
summarize(mean = mean(taxs, na.rm = TRUE)) %>%
arrange(mean) %>%
print(n = 5)
```

- Visualize data
- Plot
`packpc`

vs`avgprs`

. What pattern do you observe? - In your plot, are there any apparent outlying observations?
- How could you identify those outliers?
- Create a scatter plot of
`packpc`

vs`avgprs`

with text labels set to`state`

for each observation rather than a point (see sample code below). Which states appear to be outliers? - Create a scatter plot of
`avgprs`

vs`year`

with text labels for each observation rather than a point. What patterns do you observe?

- Plot

```
# sample code (change chunk header to eval = TRUE)
iris %>%
ggplot() +
aes(x = Petal.Length, y = Petal.Width) +
geom_text(aes(label = Species), size = 3)
```

- Recoding and transformation:
- What is the distribution of income?
- What is the distribution of population?
- What transformations might be helpful for these data?
- Which state appeared to consume unsually small quantities of cigarettes in your plot?
- Use
`dplyr::case_when`

to create a categorical variable`outliers`

for three outlying states (two “high”, one “low”) and a category “med” for non-outlying states. - Create a scatter plot of
`packpc`

vs`avgprs`

with outlying states marked with a different color or symbol

```
cig <- cig %>%
mutate(
price_low_med_high = case_when(
# syntax is LOGICAL on left ~ ASSIGNED VALUE on right
avgprs <= 122 ~ "low", # if price <= 122, assign "low"
avgprs >= 173 ~ "high", # if price >= 173, assign "high"
TRUE ~ "med" # for all others, assign "med"
),
price_low_med_high = fct_relevel(price_low_med_high,
c("low", "med", "high"))
)
# check recoding worked
plot(packpc ~ price_low_med_high, cig)
```

Create a “naive,” bivariate model in which you regress number of packs per capita

`packpc`

on average price during fiscal year`avgprs`

. Look at the codebook and add some appropriate control variables to the right-hand side of your regression. Create a total of three regressions: the simple bivariate model, a model that controls for population and income and another model that controls for population, income and`tax`

the average state, federal, and average local excise taxes (for now, do not use`year`

or`state`

as control variables).With

`stargazer`

create a single regression table with all three models from above.

How does the relationship between cigarette consumption and price change across the models? Consider some of things we care about in estimating relationships such as sign, magnitude, substance, and uncertainty. Is the relationship seem robust to inclusion of controls?

- Run a forth model include
`state`

as right hand side control variable. In`R`

, look at a`summary()`

of your regression results. Notice, because`state`

is a factor with 48 levels, we get an unwieldy 47 rows of results in our summary for that one variable. Create a regression table with`stargazer`

using the same model but use the`omit`

command in`stargazer`

to suppress the state fixed effects:

```
# sample code (change chunk header to eval = TRUE)
stargazer(
lm1, lm2, lm3, lm4,
omit = c("state"), # remove any row with text state in it
type = 'text' # for results in console
)
```

Now, run a fifth model with

`year`

as a right hand side control (exclude`state`

). Look at the`summary()`

of your model. In your regression, is`year`

being treated as a numeric or factor variable? Look at your regression output. You can also look at the results of the`str`

command above or use the`class()`

command. What should it be? Rerun your fifth regression with the correct specification. Look at the model using`summary()`

. Has it changed?Create a sixth model with

`state`

and`year`

fixed effects. Using`stargazer`

, print three models: one with controls but no state or year fixed effects (model 3 above), one with controls and state fixed effects (model 4 above), and one with controls plus state and year fixed effects (model 6). Again, consider how the relationship between cigarette consumption and price changes across the models. How do the measures of magnitude, sign and uncertainty change?

```
# sample code (change chunk header to eval = TRUE)
stargazer(
lm3, lm4, lm6,
omit = c("state|year"), # remove state & year rows
omit.stat = c("f", "ser"), # drop two model stats, narrows columns
type = 'text' # for results in console
)
```

- If you haven’t already, install the packages
`sjPlot`

,`jtools`

and`interactions`

. Now let’s plot some of the models. With`sjPlot`

, run`plot_model`

on your bivariate regression object (model 1). For more about these packages, see:

- https://cran.r-project.org/web/packages/sjPlot/vignettes/plot_model_estimates.html
- https://jtools.jacob-long.com

```
# sample code (change chunk header to eval = TRUE)
library(sjPlot)
# plot regression line and smoothed running average
plot_model(lm1)
# plot regression line, data and running average of data
plot_model(lm1, show.data = TRUE)
```

- With
`sjPlot`

now run`plot_model`

on some of your other regression output. How do these plots differ from the bivariate case? Why are they different?

```
# sample code (change chunk header to eval = TRUE)
# create coefficient plot with sjPlot
plot_model(lm2)
```

Just for comparison, use `jtlools`

and run `plot_coefs`

on some of your multivariate models.

```
# sample code (change chunk header to eval = TRUE)
library(jtools)
# create coefficient plot with jtools
plot_coefs(lm2)
```

- Another way to plot multivariate models is to just plot the marginal effects. That is, to plot the marginal effect (a one unit change) in one variable (\(x\)) on the predicted value of another (\(y\)) holding remaining variables constant. With
`sjPlots::plot_model`

plot your third model from above (e.g., model with controls but*not*state and year fixed effects). Still using model 3, add a second variable`tax`

to the`terms`

option.

```
# sample code (change chunk header to eval = TRUE)
plot_model(lm3, type = "pred", terms = "avgprs", show.data = TRUE)
plot_model(lm3, type = "pred", terms = c("avgprs", "tax"))
```

How does `tax`

moderate the marginal effect of `avgprs`

on `packpc`

?

- Run a seventh model estimating
`packpc`

vs`avgprs`

with controls for population, tax and income, year fixed effects and add a new control`outliers`

(this variable should have been created above). Do*not*include state fixed effects. Look at the`summary()`

of your model. Use the`interactions`

package to plot`packpc`

vs`avgprs`

while accounting for how`outliers`

moderate the effect.

```
# sample code (change chunk header to eval = TRUE)
library(interactions)
# set predictor to avgprs and show how outliers acts as moderater
interact_plot(lm7, pred = "avgprs", modx = "outliers", plot.points = TRUE)
```

- Using the package
`plm`

(for panel linear models) re-run models 3, 4, 5 and 6. Compare the results of each`lm()`

command with a similar call in`plm`

using`stargazer`

.

```
# sample code (change chunk header to eval = TRUE)
library(plm)
# No state or year fixed effects, everything in one 'pool'
plm1 <- plm(formula = packpc ~ avgprs + log_pop + log_inc + tax,
data = cig,
index = c("state", "year"), # tell plm unit and time vars
model = "pooling") # tell plm to pool all data
stargazer(lm3, plm1, type = 'text')
```

```
# sample code (change chunk header to eval = TRUE)
plm2 <- plm(formula = packpc ~ avgprs + log_pop + log_inc + tax,
data = cig,
index = c("state", "year"), # tell plm unit and time vars
effect = "individual", # tell plm only unit fixed effects
model = "within")
stargazer(lm4, plm2, type = 'text', omit = "state")
```

```
# sample code (change chunk header to eval = TRUE)
plm3 <- plm(formula = packpc ~ avgprs + log_pop + log_inc + tax,
data = cig,
index = c("state", "year"), # tell plm unit and time vars
effect = "time", # tell plm only time fixed effects
model = "within")
stargazer(lm5, plm3, type = 'text', omit = "year")
```

```
# sample code (change chunk header to eval = TRUE)
plm4 <- plm(formula = packpc ~ avgprs + log_pop + log_inc + tax,
data = cig,
index = c("state", "year"), # tell plm unit and time vars
effect = "twoways", # tell plm unit & time fixed effects
model = "within")
stargazer(lm6, plm4, type = 'text', omit = "state|year")
```