11 N Parameters

This chapter is still a DRAFT. Check back in a few weeks.

Having created models with one parameter in Chapter 6, two parameters in Chapter 7, three parameters in Chapter 8, four parameters in Chapter 9 and five parameters in Chapter 10, you are now ready to make the jump to \(N\) parameters.

In this chapter, we will consider models with many parameters and the complexities that arise therefrom.

11.1 More on candidate longevity

Let’s continue our examination of the longevity of gubernatorial candidates which we began in Chapter 10. Recall the packages we need and the tibble we created:

library(tidyverse)
library(primer.data)
library(rstanarm)

ch11 <- governors %>% 
  select(last_name, year, state, sex, lived_after, election_age) %>% 
  drop_na()

11.1.1 state

Consider a model in which years lived after the election is a function of the state in which the candidate lives. This assumption generates a model with 51 parameters, an average for each state and the residual variance. Mathematics:

\[ lived\_after_i = \beta_1 x_{AL,i} + \beta_2 x_{AK,i} + ... + \beta_{49} x_{WI,i} + \beta_{50} x_{WY,i} + \epsilon_i\] As usual, \(\epsilon_i \sim N(0, \sigma^2)\). Conceptually, this model is identical to several of the three parameter models we explored in Chapter 8. For an individual candidate, the years to live after the election are function of the average for her state and of an error term. None of the other coefficients matter because the indicator variables are zero for all the other states. For example, the model for a candidate from Alaska simplifies to:

\[ lived\_after_i = \beta_1 x_{AL,i} + \beta_2 x_{AK,i} + ... + \beta_{49} x_{WI,i} + \beta_{50} x_{WY,i} + \epsilon_i \] \[ lived\_after_i = \beta_1 x 0 + \beta_2 1 + ... + \beta_{49} 0 + \beta_{50} 0 + \epsilon_i \] \[ lived\_after_i = \beta_2 + \epsilon_i \] The interpretation is the same as when we modeled age as a function of party or att_end as a function of treatment. In all these cases, an individual can only be in one category. She can be a Democrat or a Republican. She can received the treatment or not. She can live in Alabama or Alaska or . . . . The only wrinkle we have added is that there are 50 categories under consideration, not just two.

The code is exactly the same as in those Chapter 8 exmaples.

fit_1 <- stan_glm(formula = lived_after ~ state - 1,
                  data = ch11,
                  refresh = 0,
                  seed = 76)

We do not need to, explicitly, tell R that there are 50 categories. It determines that from state, just as it determined that there were two possibilities for treatment. The difficulty arises, however, when we try to examine the parameter values:

fit_1
## stan_glm
##  family:       gaussian [identity]
##  formula:      lived_after ~ state - 1
##  observations: 1092
##  predictors:   50
## ------
##                     Median MAD_SD
## stateAlabama        24.0    3.1  
## stateAlaska         27.4    3.5  
## stateArizona        31.2    2.3  
## stateArkansas       31.7    2.4  
## stateCalifornia     26.3    2.9  
## stateColorado       32.7    2.7  
## stateConnecticut    27.1    3.1  
## stateDelaware       34.5    3.0  
## stateFlorida        22.5    3.5  
## stateGeorgia        32.8    3.7  
## stateHawaii         25.3    4.2  
## stateIdaho          31.4    2.9  
## stateIllinois       24.1    3.3  
## stateIndiana        26.1    2.8  
## stateIowa           32.3    2.4  
## stateKansas         28.1    2.5  
## stateKentucky       29.6    3.0  
## stateLouisiana      26.8    3.7  
## stateMaine          32.3    3.3  
## stateMaryland       25.2    3.0  
## stateMassachusetts  25.5    2.5  
## stateMichigan       25.3    2.7  
## stateMinnesota      32.3    2.6  
## stateMississippi    26.6    3.1  
## stateMissouri       26.3    3.3  
## stateMontana        25.9    2.9  
## stateNebraska       26.1    2.6  
## stateNevada         27.7    3.3  
## stateNew Hampshire  30.1    2.5  
## stateNew Jersey     26.1    3.2  
## stateNew Mexico     29.1    2.3  
## stateNew York       23.3    2.6  
## stateNorth Carolina 22.7    3.8  
## stateNorth Dakota   29.4    2.6  
## stateOhio           29.1    2.5  
## stateOklahoma       26.0    3.2  
## stateOregon         25.2    3.0  
## statePennsylvania   26.8    3.0  
## stateRhode Island   32.1    2.0  
## stateSouth Carolina 33.0    3.3  
## stateSouth Dakota   23.5    2.5  
## stateTennessee      21.8    2.9  
## stateTexas          28.5    2.3  
## stateUtah           26.9    3.0  
## stateVermont        28.1    2.4  
## stateVirginia       30.3    3.7  
## stateWashington     31.1    3.0  
## stateWest Virginia  32.5    3.0  
## stateWisconsin      34.5    2.4  
## stateWyoming        20.7    2.9  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 13.2    0.3  
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

That is a lot to take in! Examining and comparing two parameters — the average attitude toward immigration of the treated and control, for example, is easy. Doing the same for 50 parameters is hard. Use tidy() from the broom.mixed package to create a tibble with all the parameters.

## Registered S3 method overwritten by 'broom.mixed':
##   method      from 
##   tidy.gamlss broom
fit_1_params <- tidy(fit_1)

fit_1_params
## # A tibble: 50 x 3
##    term             estimate std.error
##    <chr>               <dbl>     <dbl>
##  1 stateAlabama         24.0      3.07
##  2 stateAlaska          27.4      3.52
##  3 stateArizona         31.2      2.27
##  4 stateArkansas        31.7      2.43
##  5 stateCalifornia      26.3      2.85
##  6 stateColorado        32.7      2.71
##  7 stateConnecticut     27.1      3.06
##  8 stateDelaware        34.5      2.97
##  9 stateFlorida         22.5      3.53
## 10 stateGeorgia         32.8      3.69
## # … with 40 more rows

Putting all the parameters into a tibble makes everything much easier. For example, we can look at the states with the longest and shortest lived candidates:

fit_1_params %>% 
  arrange(desc(estimate)) %>% 
  slice(1:5, 46:50)
## # A tibble: 10 x 3
##    term                estimate std.error
##    <chr>                  <dbl>     <dbl>
##  1 stateDelaware           34.5      2.97
##  2 stateWisconsin          34.5      2.38
##  3 stateSouth Carolina     33.0      3.34
##  4 stateGeorgia            32.8      3.69
##  5 stateColorado           32.7      2.71
##  6 stateNew York           23.3      2.63
##  7 stateNorth Carolina     22.7      3.85
##  8 stateFlorida            22.5      3.53
##  9 stateTennessee          21.8      2.93
## 10 stateWyoming            20.7      2.94

A graphical display is the best way to examine numerous parameters at the same time.

fit_1_params %>% 
  mutate(state = str_replace(term, "state", "")) %>% 
  ggplot(aes(x = fct_reorder(state, estimate),
             y = estimate)) +
    geom_pointrange(aes(ymin = estimate - 2 * std.error,
                        ymax = estimate + 2 * std.error)) +
    coord_flip() +
    labs(title = "Expected Years Lived Post Election",
         y = "Years",
         x = NULL)

There are 51 parameters.

11.1.2 state, election_age, sex and election_age*sex

Let’s consider one more model, one which adds state to the model we used at the end of the previous chapter.

Math:

\[ lived\_after_i = \beta_0 + \beta_1 x_{AK,i} + \beta_2 x_{AR,i} + ... \beta_{49} x_{WY,i} + \\ \beta_{50} male_i + \beta_{51} election\_age_i+ \beta_{52} male_i * election\_age_i + \epsilon_i\]

  • \(lived\_after_i\) is the outcome variable. \(\beta_1\), \(\beta_2\), \(\beta_3\)… all the way to \(\beta_{49}\) correspond to different parameters. These values are specific to each state. In order to find the female intercept (i.e. lived_after for females) for a specific state, you must add that state’s beta value to the value of \(\beta_0\) to find intercept for the lived_after outcome for females candidates in that specific state. For example, if you look at the model above, \(\beta_1\) is the intercept value for Arkansas (note the AK subscript). You would take the appropriate value, which appears next to Arkansas in the printed model, and add it to the value for \(\beta_0\).

  • \(\beta_0\) will be the intercept for one state alone. When the model is printed, you will see that (Intercept) takes on the lived_after value for female candidates in Alabama.

  • \(\beta_{50}\) is the coefficient for the explanatory variable \(male_i\). When we are trying to find the intercepts for female candidates, the explanatory variable will take on the value of 0. However, \(male_i\) = 1 when we are trying to find intercept values for male candidates. Therefore, you add the \(\beta_{50}\) value to the female intercept value to find the value of lived_after for male candidates. \(\beta_{51}\) is the coefficient for the explanatory variable \(election\_age_i\). It it just the slope for women. \(\beta_{52}\) is difficult to interpret. However, it gains meaning when it is added to \(\beta_{51}\), which results in the slope for men.

fit_gov_1 <- stan_glm(data = ch11,
                      formula = lived_after ~ state + sex*election_age,
                      refresh = 0,
                      iter = 10000,
                      seed = 98)

Note it takes awhile to run. We are dealing with 55 parameters here.

print(fit_gov_1, detail = FALSE)
##                      Median MAD_SD
## (Intercept)          16.6   21.2  
## stateAlaska           3.2    3.9  
## stateArizona          8.2    3.2  
## stateArkansas         1.6    3.2  
## stateCalifornia       5.7    3.5  
## stateColorado         4.2    3.3  
## stateConnecticut      2.9    3.5  
## stateDelaware         8.7    3.6  
## stateFlorida         -4.1    3.7  
## stateGeorgia          5.1    4.0  
## stateHawaii           4.4    4.3  
## stateIdaho            4.8    3.4  
## stateIllinois        -0.3    3.7  
## stateIndiana          3.5    3.6  
## stateIowa             2.7    3.2  
## stateKansas          -0.3    3.2  
## stateKentucky         4.2    3.5  
## stateLouisiana        4.0    4.0  
## stateMaine            5.9    3.7  
## stateMaryland         4.8    3.5  
## stateMassachusetts    0.4    3.2  
## stateMichigan        -3.0    3.4  
## stateMinnesota        4.6    3.3  
## stateMississippi      2.3    3.7  
## stateMissouri         1.5    3.7  
## stateMontana          2.5    3.5  
## stateNebraska         2.4    3.3  
## stateNevada           4.3    3.7  
## stateNew Hampshire    3.7    3.2  
## stateNew Jersey       1.2    3.6  
## stateNew Mexico       2.9    3.1  
## stateNew York         2.6    3.4  
## stateNorth Carolina  -3.9    3.9  
## stateNorth Dakota     6.0    3.4  
## stateOhio             6.1    3.3  
## stateOklahoma        -1.4    3.7  
## stateOregon           0.7    3.5  
## statePennsylvania     3.6    3.5  
## stateRhode Island     5.4    3.0  
## stateSouth Carolina   6.8    3.7  
## stateSouth Dakota    -4.9    3.2  
## stateTennessee       -1.1    3.6  
## stateTexas            3.4    3.2  
## stateUtah             3.4    3.5  
## stateVermont          1.7    3.3  
## stateVirginia         7.1    3.9  
## stateWashington       7.7    3.6  
## stateWest Virginia    6.8    3.5  
## stateWisconsin        6.9    3.2  
## stateWyoming         -1.3    3.5  
## sexMale              54.6   21.5  
## election_age         -0.1    0.4  
## sexMale:election_age -0.8    0.4  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 10.8    0.2

The (Intercept) refers to the average number of years lived after the election for women from Alabama, who on the day of election, have been alive the average number of years of all female candidates! That’s a mouthful. It really is not different than what we have been doing all along, however. You can see here that (Intercept) value is around 17 years.

However, we have only talked about women. What about for the men? Well, if you look at the bottom of the list, you will see the sexMale value of around 55. When you wish to estimate lived_after for a specific state, you must add this value to the intercept, and then also add the intercept of the desired state.

\(\beta_{51}\), our coefficient value for \(election\_age_i\), is -0.06. When comparing female candidates who differ by one year in their current ages, there is not much of a difference in how many years we expect them to live after the election.

\(\beta_{51} + \beta_{52}\), our coefficients for \(election\_age_i\) and sexMale:election_age, results in the value of around -0.9. This value is our slope for men.

Posterior distribution for female candidates in Washington and South Dakota.

fit_gov_1 %>% 
  as_tibble() %>% 
  mutate(Washington_females = `(Intercept)` + stateWashington) %>% 
  mutate (SD_females = `(Intercept)` + `stateSouth Dakota`) %>% 
  select(Washington_females, SD_females) %>% 
  pivot_longer(cols = Washington_females:SD_females, 
               names_to = "parameters",
               values_to = "years") %>% 
  ggplot(aes(years, fill = parameters)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior Probability Distribution",
         subtitle = "for women that live in Washington and South Dakota",
         x = "Average Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

The posterior above shows that female candidates in Washington live longer after the election than female candidates in South Dakota. Interesting huh? Our knowledge about averages is always more precise than our knowledge about individuals.

11.2 Wisdom

Wisdom

FIGURE 11.1: Wisdom

11.2.1 EDA of shaming

Imagine you are running for Governor and want to do a better job of getting your voters to vote. You recently read about a large-scale experiment showing the effect of sending out a voting reminder that “shames” citizens who do not vote. You are considering sending out a “shaming” voting reminder yourself. What will happen if you do? Will more voters show up to the polls? Additionally, on the day of the election a female citizen is randomly selected. What is the probability she will vote?

Consider a new data set, shaming, corresponding to an experiment carried out by Gerber, Green, and Larimer (2008) titled “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” This experiment used several hundred thousand registered voters and a series of mailings to determine the effect of social pressure on voter turnout.

Let’s now do another EDA, starting off by running glimpse().

library(primer.data)
library(tidyverse)
library(skimr)
glimpse(shaming)
## Rows: 344,084
## Columns: 15
## $ cluster       <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", …
## $ primary_06    <int> 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1…
## $ treatment     <fct> Civic Duty, Civic Duty, Hawthorne, Hawthorne, Hawthorne…
## $ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "…
## $ age           <int> 65, 59, 55, 56, 24, 25, 47, 50, 38, 39, 65, 61, 57, 37,…
## $ primary_00    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
## $ general_00    <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", …
## $ primary_02    <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", …
## $ general_02    <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", …
## $ primary_04    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
## $ general_04    <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",…
## $ hh_size       <int> 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1…
## $ hh_primary_04 <dbl> 0.095, 0.095, 0.048, 0.048, 0.048, 0.048, 0.048, 0.048,…
## $ hh_general_04 <dbl> 0.86, 0.86, 0.86, 0.86, 0.86, 0.90, 0.90, 0.90, 0.90, 0…
## $ neighbors     <int> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,…

Here we see that glimpse() gives us a look at the raw data contained within the shaming data set. At the very top of the output, we can see the number of rows and columns, or observations and variables respectively. We see that there are 344,084 observations, with each row corresponding to a unique respondent. The “Columns: 10” tells us that there are 10 variables within this data set. Below this, we see a cutoff version of the entire data set that has the variables on the left as rows and the observations as a list separated by commas, as compared to the tibble output that presents with the variables as columns and the observations as rows running horizontally.

From this summary, we get an idea of some of the variables we will be working with. Variables of particular interest to us are sex, hh_size, and primary_06. The variable hh_size tells us the size of the respondent’s household, sex tells us the sex of the respondent, and primary_06 tells us whether or not the respondent voted in the 2006 Primary election.

There are a few things to note while exploring this data set. You may – or may not – have noticed that the only response to the general_04 variable is “Yes.” In their published article, the authors note that “Only registered voters who voted in November 2004 were selected for our sample” (Gerber, Green, Larimer, 2008). After this, the authors found their history then sent out the mailings.

It is also important to identify the dependent variable and its meaning. In this shaming experiment, the dependent variable is primary_06, which is a variable coded either 0 or 1 for whether or not the respondent voted in the 2006 primary election. This is the dependent variable because the authors are trying to measure the effect that the treatments have on the proportion of people who vote in the 2006 general election.

The voting results from other years, such as 2002 and 2004, are of less interest to us and can be removed from the abbreviated data set. In addition to removing general_04, primary_02, general_02, or primary_04, we also will not be taking particular interest in age, or no_of_names within this chapter.

By narrowing down the set of variables we are looking at and investigating, we will find more meaningful relationships among them. However, we have not yet discussed the most important variable of them all: treatment. The treatment variable is a factor variable with 5 levels, including the control. Since we are curious as to how sending mailings affects voter turnout, the treatment variable will tell us about the impact each type of mailing can make. Let’s start off by taking a broad look at the different treatments.

shaming %>%
  count(treatment)
## # A tibble: 5 x 2
##   treatment       n
##   <fct>       <int>
## 1 Control    191243
## 2 Civic Duty  38218
## 3 Hawthorne   38204
## 4 Self        38218
## 5 Neighbors   38201

Four types of treatments were used in the experiment, with voters receiving one of the four types of mailing. All of the mailing treatments carried the message, “DO YOUR CIVIC DUTY - VOTE!”

The first treatment, Civic Duty, also read, “Remember your rights and responsibilities as a citizen. Remember to vote.” This message acted as a baseline for the other treatments, since it carried a message very similar to the one displayed on all the mailings.

In the second treatment, Hawthorne, households received a mailing which told the voters that they were being studied and their voting behavior would be examined through public records. This adds a small amount of social pressure to the households receiving this mailing.

In the third treatment, Self, the mailing includes the recent voting record of each member of the household, placing the word “Voted” next to their name if they did in fact vote in the 2004 election or a blank space next to the name if they did not. In this mailing, the households were also told, “we intend to mail an updated chart” with the voting record of the household members after the 2006 primary. By emphasizing the public nature of voting records, this type of mailing exerts more social pressure on voting than the Hawthorne treatment.

The fourth treatment, Neighbors, provides the household members’ voting records, as well as the voting records of those who live nearby. This mailing also told recipients, “we intend to mail an updated chart” of who voted in the 2006 election to the entire neighborhood.

For now, let’s focus on a subset of the data. We will sample just 10,000 rows because otherwise stan_glm() takes an annoyingly large amount of time to work. Nothing substantive changes.

set.seed(9)
ch9_sham <- shaming %>% 
  filter(treatment %in% c("Control", "Neighbors")) %>% 
  droplevels() %>% 
  mutate(solo = ifelse(hh_size == 1, TRUE, FALSE)) %>% 
  select(primary_06, treatment, solo, sex, age) %>% 
  slice_sample(n = 10000, replace = FALSE)

We create the variable solo, which is TRUE for voters who live alone and FALSE for those that do not. We are curious to see if the treatment effect, if any, is the same for voters who live alone as it is for those who do not. We have also focused in on only two “treatments”: Control and Neighbors. This is for the sake of simplification. We want to know if social pressure impacts voting behavior, so it makes sense to look at the treatment that provides the most social pressure.

ch9_sham %>% 
  skim()
TABLE 11.1: Data summary
Name Piped data
Number of rows 10000
Number of columns 5
_______________________
Column type frequency:
character 1
factor 1
logical 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sex 0 1 4 6 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
treatment 0 1 FALSE 2 Con: 8365, Nei: 1635

Variable type: logical

skim_variable n_missing complete_rate mean count
solo 0 1 0.14 FAL: 8560, TRU: 1440

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
primary_06 0 1 0.31 0.46 0 0 0 1 1 ▇▁▁▁▃
age 0 1 49.58 14.49 20 41 50 59 97 ▃▇▇▂▁

Let’s focus on a few observations that may be relevant to our analysis. First, note that each treatment has approximately 38,000 respondents. The control group, denoted by Con, has approximately 190 thousand respondents. For the logical variable solo, we see that approximately 47 thousand of the total respondents live alone (TRUE), while approximately 296 thousand live in households greater than 1 (FALSE). It may also be important to note that the average age of the respondents is 49.8 years with a standard deviation of 14.4 years.

To get a better sense of some respondents’ information, let’s use slice_sample() to gather a random sample of n observations from the data set.

ch9_sham %>% 
  slice_sample(n = 10)
## # A tibble: 10 x 5
##    primary_06 treatment solo  sex      age
##         <int> <fct>     <lgl> <chr>  <int>
##  1          0 Control   FALSE Male      52
##  2          0 Control   FALSE Female    20
##  3          1 Control   FALSE Female    76
##  4          0 Control   FALSE Male      39
##  5          0 Neighbors FALSE Female    35
##  6          1 Control   TRUE  Male      84
##  7          0 Control   FALSE Male      61
##  8          1 Control   FALSE Male      73
##  9          1 Neighbors FALSE Male      46
## 10          0 Neighbors FALSE Male      38

Now we have a table with 5 random observations and the respondents’ information in a regular table output. By taking a few random samples, we may start to see some patterns within the data. Do you notice anything in particular about the variable treatment?

One other helpful summarizing technique we can use is skim(). To make the information it contains simpler, we will only be looking at three variables: primary_06, treatment, and sex.

shaming %>% 
  select(primary_06, treatment, sex) %>% 
  skim()
TABLE 11.2: Data summary
Name Piped data
Number of rows 344084
Number of columns 3
_______________________
Column type frequency:
character 1
factor 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sex 0 1 4 6 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
treatment 0 1 FALSE 5 Con: 191243, Civ: 38218, Sel: 38218, Haw: 38204

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
primary_06 0 1 0.32 0.46 0 0 0 1 1 ▇▁▁▁▃

Running the skim() command gives us a summary of the data set as a whole, as well as the types of variables and individual variable summaries. At the top, we see the number of columns and rows within the selected data set. Below this we are given a list with the different types of variables, or columns, and how often they appear within the data we are skimming. The variables are then separated by their column type, and we are given individual summaries based on the type.

Having created models with one parameter in Chapter 6 and two parameters in Chapter 7, you are now ready to make the jump to \(N\) parameters. The more parameters we include in our models, the more flexible they can become. But we must be careful of overfitting, of making models which are inaccurate because they don’t have enough data to accurately estimate those parameters. The tension between overfitting and underfitting is central to the practice of data science.

Let’s consider one of the most important virtues of data science: wisdom. The map from our data to our question. Recall that our mission here is to increase our voter turnout while we are running for Governor.

To investigate this, we are given a dataset in which respondents were encouraged to vote under four treatments. This was accomplished by sending a letter to citizens that voted in the previous primary election with varying degrees of social pressure. The remainder of the respondents fall under a control group, which received no such mailings. The dataset offers a number of details about each respondent, including their age, sex, treatment type, and voting outcome.

What we truly want to know is how to make citizens vote. One immediate problem with our dataset is that, due to our study population, we are only studying people that voted in the previous primary election. In other words, if someone did not vote in the previous primary election, they were not included. This would be a large problem, since that means we can only figure out how to make citizens that have already voted vote. Though we can’t be sure, it is reasonable to assume that it is easier to encourage citizens to vote in the next primary election if they have a history of recently voting in primary elections.

Does this mean our data is unhelpful? Of course not! With four treatments (and therefore four different methods of encouraging voting), we can gain quite a bit of knowledge. Mostly, we will know the most effective way to incentivize people with a history of voting to vote again. We will also know if no method of persuasion (the control) is the best option. We will further be able to tell if certain methods of persuasion work better for certain groups of people, according to factors such as age, sex, or household size. This can help tremendously in our election.

That being said, the map from our question to our data is almost never perfect. In data science, we often have to look at our data, understand its limitations, and try our best to make inferences that help our cause.

11.2.2 primary_06 and (treatment + age)

Justice

FIGURE 11.2: Justice

Courage

FIGURE 11.3: Courage

Because we will be going through a series of models in this chapter, it is useful to combine the virtues of Justice and Courage. To begin, let’s model primary_06, which represents whether a citizen voted or not, against age and treatment to see if there is a connection.

Let’s look at the relationship between primary voting and treatment + age.

model_3 <- stan_glm(data = ch9_sham, 
                 formula = primary_06 ~ treatment + age, 
                 refresh = 0)
print(model_3, digits = 3)
## stan_glm
##  family:       gaussian [identity]
##  formula:      primary_06 ~ treatment + age
##  observations: 10000
##  predictors:   3
## ------
##                    Median MAD_SD
## (Intercept)        0.084  0.016 
## treatmentNeighbors 0.079  0.012 
## age                0.004  0.000 
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 0.457  0.003 
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

The (Intercept) here has two key details. First, since Control comes alphabetically before Neighbors, the Control group is our baseline for comparison. This holds similarly for age. This is the slope for age of only those participants in the Control group.

Second, remember that for this data, the (Intercept) does have a mathematical interpretation, but it does not have a practical interpretation. Why is this? Because the slope for age starts at zero. This is nonsensical for our purposes, as no voter can be of zero age.

Therefore, this model shows that, within the control group, the percent voting is 0.084 = 8.4%. How do we then calculate our percent voting in the Neighbors group? Recall that the treatmentNeighbors median is not giving a standalone figure for this group, but rather represents the offset between the Control and Neighbors groups. To find the Neighbors value, we must add the offset to the original value: 0.084 + 0.079 = .163 = 16.3%. This is nearly double the rate in the Control group!

Let’s turn to our age median. Begin by grouping our observations by age and counting by primary_06, which gives us counts for 1 (yes) or 0 (no) for number voting in each age category.

age <- ch9_sham %>% 
  group_by(age) %>% 
  count(primary_06) 

age
## # A tibble: 149 x 3
## # Groups:   age [77]
##      age primary_06     n
##    <int>      <int> <int>
##  1    20          0   119
##  2    20          1    15
##  3    21          0   120
##  4    21          1    15
##  5    22          0   150
##  6    22          1    24
##  7    23          0   119
##  8    23          1    19
##  9    24          0   102
## 10    24          1    21
## # … with 139 more rows

To explore this relationship visually, let’s create a graph. We are coercing primary_06 into a character variable as it more closely represents “yes” or “no” as opposed to a numeric value.

age %>% 
  mutate(primary_06 = as.character(primary_06)) %>% 
  ggplot(aes(x = age, y = n, color = primary_06)) +
  geom_point() +
  labs(
    title = "Relationship Between Age and Voting",
    subtitle = "In the 2006 Primary Elections",
    x = "Age",
    y = "Count"
  )

There are some interesting takeaways here. - First, in almost every age bracket (other than above 90), the majority participants did not vote. - The spike between ages 40 and 60 illustrates that most participants exist in this age bracket. - The differences between voters and non-voters narrows greatly after age 60.

Let’s now look at another graph that aims to show the same phenomena, but also includes a formula using lm. This more clearly shows the upward trend in voting as a participants age increases. We can also see that the highest concentrations in the “Voted” row exist from ages 45-50, whereas the highest concentrations for the “Did Not Vote” row exist in the 18-25 and 30-60 age groups. Again, we see that, for almost all ages, the partcipants are more likely to not vote than vote. This is illustrated by the darker concentration of dots in the “Did Not Vote” row. The slope of our regression line, however, shows a clear picture: the older you are, the more likely you are to vote.

shaming %>% 
  ggplot(aes(age, primary_06)) + 
  geom_jitter(alpha = 0.005, height = 0.1) + 
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE) + 
  scale_y_continuous(breaks = c(0, 1), labels = c("Did Not Vote", "Voted")) + 
  labs(title = "Age and Voting in 2012 Michigan Primary Election", 
       subtitle = "Older people are more likely to vote", 
       x = "Age", 
       y = NULL, 
       caption = "Data from Gerber, Green, and Larimer (2008)") 

Note that the median for age is 0.004. Age is therefore positively correlated with voting in the primary election. What does that mean? It means that, for every year that a participant’s age increases, their odds of voting in the primary increases by 0.004. Now, this might not seem like a huge difference. However, think of it like this: for every decade older that a participant is, their odds of voting increase .04 = 4%! This makes sense considering that we just learned that older citizens are more likely to vote.

Now, let’s return to our voting difference between the Control and Neighbors groups. Let’s model the posterior probability distribution for the rates of voting.

# In progress. Modify x-axis labels *10. 

model_3 %>% 
  as_tibble() %>% 
  mutate(Neighbors = `(Intercept)` + `treatmentNeighbors`) %>% 
  mutate(Control = `(Intercept)`) %>% 
  select(Neighbors, Control) %>% 
  pivot_longer(cols = Neighbors:Control,
               names_to = "parameters",
               values_to = "percent_voting") %>% 
  ggplot(aes(percent_voting, fill = parameters)) +
  geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior Probability Distribution",
         subtitle = "for Control versus Neighbors voting rates",
         x = "% of group voting",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

11.2.3 primary_06 & (age + solo + treatment)

Now that we have analyzed the impact of our various treatments on voting behavior, let’s turn to three different variables together: sex, solo (living alone), and treatment.

model_4 <- stan_glm(data = ch9_sham, 
                 formula = primary_06 ~ treatment + age + solo, 
                 refresh = 0)
print(model_4, digits = 3)
## stan_glm
##  family:       gaussian [identity]
##  formula:      primary_06 ~ treatment + age + solo
##  observations: 10000
##  predictors:   4
## ------
##                    Median MAD_SD
## (Intercept)        0.084  0.016 
## treatmentNeighbors 0.079  0.013 
## age                0.004  0.000 
## soloTRUE           0.015  0.013 
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 0.457  0.003 
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

First, let’s look at which variables are included on our left hand side. Besides our (Intercept), we are given treatmentNeighbors, age, and soloTRUE. What does that mean our baseline for comparison, or (Intercept), is? Look to what we are not given: treatmentControl and soloFALSE. Therefore, our baseline for comparison are the participants in the Control group who do not live in single person households. As we learned before, age calculates starting at age 0 and is therefore unhelpful practially. For this group, the percentage voting is .085 = 8.5%. From this figure, we will break down what our variables and their respective medians mean.

  • treatmentNeighbors, with a median of 0.078: the offset in the Neighbors treatment is 0.078 compared with the control group. The percent voting for this group is then 0.085 (Control) + 0.078 (Neighbors) = .163 = 16.3%. This illustrates that the Neighbors treatment is positively correlated with voting.
  • age, with a median of 0.004: as before, the age represents the positively correlated increase with voting as someone ages. For every decade older a participant is, their chance of voting increases by 4%.
  • soloTRUE, with a median of 0.015: as compared with participants in households exceeding 1 persons, participants who live alone are more likely to vote by a factor of 0.015 = 1.5%. Note that this is only true for our baseline group of Control participants and does not include an analysis for those in Neighbors.
# Modify x scale

model_4 %>% 
  as_tibble() %>% 
  mutate(SoloTrue = `(Intercept)` + `soloTRUE`) %>% 
  mutate(SoloFalse = `(Intercept)`) %>% 
  select(SoloTrue, SoloFalse) %>% 
  pivot_longer(cols = SoloTrue:SoloFalse,
               names_to = "parameters",
               values_to = "percent_voting") %>% 
  ggplot(aes(percent_voting, fill = parameters)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior Probability Distribution",
         subtitle = "for those who live alone versus live with other people",
         x = "% of group voting",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

11.2.4 primary_06 & (age, solo, treatment, + interactions)

Since we’ve now studied a model with three different variables, it is time to look at interactions! Here, we will look at primary_06 as a function of age, solo, treatment, and solo*treatment. What does solo*treatment mean for us? It means we are looking at the solo and treatment variables as they correspond to one another.

model_5 <- stan_glm(data = ch9_sham, 
                 formula = primary_06 ~ age + solo + treatment + solo * treatment, 
                 refresh = 0)
print(model_5, digits = 3)
## stan_glm
##  family:       gaussian [identity]
##  formula:      primary_06 ~ age + solo + treatment + solo * treatment
##  observations: 10000
##  predictors:   5
## ------
##                             Median MAD_SD
## (Intercept)                 0.086  0.016 
## age                         0.004  0.000 
## soloTRUE                    0.009  0.014 
## treatmentNeighbors          0.073  0.013 
## soloTRUE:treatmentNeighbors 0.040  0.034 
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 0.458  0.003 
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

11.2.5 Temperance

Temperance

FIGURE 11.4: Temperance

Finally, let’s remember the virtue of Temperance. The gist of temperance is: be humble with our inferences, as our inferences are always, certainly, and unfortunately not going to match the real world. How does this apply to our shaming scenario?

Prediction uncertainty is the main culprit. No matter how hard we try, we cannot predict the future. Though we now have conclusions about how shaming impacted voters in the 2006 primary elections, we do not have the confidence to say that what worked or didn’t work then would work now.

For instance, perhaps the impact of your neighbors knowing your voting history is greater in the midst of a pandemic, where you may be locked inside with few interactions outside of your immediate proximity. Perhaps the opposite is true. These unknown unknowns cannot be accounted for in our models. We cannot predict a pandemic, nor can we predict how this will change the way that people vote or respond to flyers.

There is also the issue of representitiveness. Do the voters of the 2006 primary election (who have already demonstrated a willingness to vote in the 2004 primary election) truly represent the people voting in our gubernatorial election?

These complications are why we must make inferences with a grain of salt. That is not to say that all data science is unhelpful! On the contrary, acknowledging our deficits will only make our inferences (and the actions we take because of them) stronger.

11.3 Summary