10 Five Parameters

This chapter is still a DRAFT. Check back in a few weeks.

The next step in our model building education is to learn about interactions. The effect of a treatment relative to a control is almost never uniform. The effect might be bigger in women than in men, smaller in the rich relative to the poor. The technical term for such effects is “heterogeneous,” which is just Ph.D.’ese for “different.” With enough data, all effects are heterogeneous. Causal effects, at least in the social science, always vary across units. To model this reality, we rely on “interactions,” on allowing the effect size to differ. To so so, we need models with at least 5 parameters.

10.1 EDA of governors

Packages:

The primer.data package includes the governors data set which features the demographic information about the candidates for governor in the United States. Barfort, Klemmensen, and Larsen (2020) gathered this data concluded that winning a gubernatorial election increases a candidate’s lifespan.

glimpse(governors)
## Rows: 1,092
## Columns: 14
## $ state        <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "…
## $ year         <int> 1946, 1946, 1950, 1954, 1954, 1958, 1962, 1966, 1966, 19…
## $ first_name   <chr> "James", "Lyman", "Gordon", "Tom", "James", "William", "…
## $ last_name    <chr> "Folsom", "Ward", "Persons", "Abernethy", "Folsom", "Lon…
## $ party        <chr> "Democrat", "Republican", "Democrat", "Republican", "Dem…
## $ sex          <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", …
## $ died         <date> 1987-11-21, 1948-12-17, 1965-05-29, 1968-03-07, 1987-11…
## $ status       <chr> "Challenger", "Challenger", "Challenger", "Challenger", …
## $ win_margin   <dbl> 77.3, -77.3, 82.2, -46.7, 46.7, -77.5, 100.0, -34.3, 34.…
## $ region       <chr> "South", "South", "South", "South", "South", "South", "S…
## $ population   <dbl> 2906000, 2906000, 3058000, 3014000, 3014000, 3163000, 33…
## $ election_age <dbl> 38, 79, 49, 47, 46, 33, 43, 48, 40, 51, 68, 55, 45, 52, …
## $ death_age    <dbl> 79, 81, 63, 60, 79, 88, 79, 99, 42, 79, 75, 79, 76, 81, …
## $ lived_after  <dbl> 41.0, 2.1, 14.6, 13.3, 33.0, 54.5, 35.9, 51.0, 1.5, 27.9…

There are 14 variables and 1,092 observations. In this Chapter, we will only be looking at the variables last_name, year, state, sex, lived_after, and election_age.

ch10 <- governors %>% 
  select(last_name, year, state, sex, lived_after, election_age)

election_age and lived_after are how many years a candidate lived before and after the election, respectively. As a consequence, only politicians who are already deceased are included in this data set. This means that there are only a handful of observations from elections in the last 20 years. Most candidates from that time period are still alive and are, therefore, excluded.

slice_sample(ch10, n = 5)
## # A tibble: 5 x 6
##   last_name  year state        sex    lived_after election_age
##   <chr>     <int> <chr>        <chr>        <dbl>        <dbl>
## 1 Ristine    1964 Indiana      Male          44.6         44.8
## 2 Sundlun    1986 Rhode Island Male          24.7         66.8
## 3 Richards   1990 Texas        Female        15.9         57.2
## 4 Turner     1946 Oklahoma     Male          26.6         52.0
## 5 Williams   1948 Michigan     Male          39.2         37.7

sex is most often “Male,” as we might expect.

skim(ch10)
TABLE 10.1: Data summary
Name ch10
Number of rows 1092
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
last_name 0 1 3 11 0 615 0
state 0 1 4 14 0 50 0
sex 0 1 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1965 13.4 1945.00 1954 1962 1974 2011 ▇▆▃▂▁
lived_after 0 1 28 13.4 0.13 18 30 39 60 ▃▆▇▆▂
election_age 0 1 52 8.7 31.35 45 51 57 84 ▂▇▆▂▁

skim() groups the variables together by type. We are given histograms of the numeric data. In looking at the histogram for year, we see that it is skewed right — meaning that most of the data is bunched to the left and that there is a smaller tail to the right — with half of the observations from election years between 1945 and 1962. This makes sense logically, because we are only looking at deceased candidates, and candidates from more recent elections are more likely to still be alive.

In using this data set, our left-side variable will be lived_after. We are trying to understand/predict how many years a candidate will live after the election.

ch10 %>%
  ggplot(aes(x = year, y = lived_after)) +
  geom_point() +
  labs(title = "US Gubernatorial Candidate Years Lived Post-Election",
       subtitle = "Candidates who died more recently can't have lived for long post-election",
       caption = "Data: Barfort, Klemmensen and Larsen (2019)",
       x = "Year",
       y = "Years Lived After Election") +
  scale_y_continuous(labels = scales::label_number()) +
  theme_classic() 

Starting with the relationship between lived_after and year, there is a rough line above which there are no observations. There are no data points in the top right portion of the graph because it is not possible to have run in 2011 and then live 50 years after the election took place. This edge of the data represents, approximately, the most years a candidate could have possibly lived, and still have died, given the year of the election. The reason the data is slanted downward is because the maximum value for this scenario is greater in earlier years. That is, those candidates who ran for governor in earlier years could live a long time after the election and still have died prior to the data set creation, giving them higher lived_after values than those who ran for office in more recent years. The edge of the scatter plot is not perfectly straight because, for many election years, no candidate had the decency to die just before data collection.

There are fewer observations in later years because fewer recent candidates have died recently.

ch10 %>%
  ggplot(aes(x = sex, y = lived_after)) +
  geom_boxplot() +
  labs(title = "US Gubernatorial Candidate Years Lived Post-Election",
       subtitle = "Male candidates live much longer after the election",
       caption = "Data: Barfort, Klemmensen and Larsen (2019)",
       x = "Gender",
       y = "Years Lived After Election") +
  scale_y_continuous(labels = scales::label_number()) +
  theme_classic() 

This plot shows that men live much longer, on average, than women after the election. Does that make sense to you?

10.2 Wisdom

Wisdom

FIGURE 10.1: Wisdom

The concept of the “population” is subtle and important. The population is not the set of candidates for which we have data. That is the data set. The population is the larger — potentially much larger — set of individuals about whom we want to make inferences. The parameters in our models refer to the population, not to the data set.

Consider a simple example. Define \(\mu\) as the average number of years lived by candidates for governor after Election Day. Can we calculate \(\mu\) from our data? No! There are many candidates for governor who are still alive, who are not included in our data even though they are part of the “population” we want to study. \(\mu\) can not be calculated. It can only be estimated.

Note, also, that there are many different populations, each with its own \(\mu\), in which we might be interested.

  • The population of all candidates for governor in the US from 1945 to 2012. This is the period covered in the paper.

  • The population of all candidates for governor in the US from 1900 to 2012. A priori, we would not expect a major difference between candidates who run in 1946 and those who run in 1942. How different could they be? It is therefore not unreasonable to extend the population of interest back in time, even if we have no data for earlier periods.

  • The population of all candidates for governor in the US from 1945 to 2030. We are often interested in the future. We want to make predictions about what will happen to candidates, even to candidates who have not yet run for office.

  • The population of candidates for governor around the world. Other countries have governors also! We want to understand their longevity as well.

  • The population of candidates for all political offices in the US. We might expect candidates for Senator to have similar lifespans to candidates for Governor.

  • And so on. There are as many possible populations as there are questions we might ask.

All of these populations are different, so each has a different \(\mu\). Which \(\mu\) we are interested in depends on the problem we are trying to solve. It is a judgment call, a matter of Wisdom, as to whether or not that data we have is “close enough” to the population we are interested in to justify making a model.

10.3 Justice and Courage

Justice

FIGURE 10.2: Justice

Courage

FIGURE 10.3: Courage

Because we will be going through a series of models in this chapter, it is useful to combine the virtues of Justice and Courage.

10.3.1 election_age

To begin, let’s model candidate lifespan after the election as a function of candidate lifespan prior to the election. The data:

ch10 %>% 
  ggplot(aes(x = election_age, y = lived_after)) +
    geom_point() +
    labs(title = "Longevity of Gubernatorial Candidates",
         subtitle = "Younger candidates live longer", 
         caption = "Data Source: Barfort, Klemmensen and Larsen (2019)",
         x = "Age in Years on Election Day",
         y = "Years Lived After Election") +
    scale_x_continuous(labels = scales::label_number()) +
    scale_y_continuous(labels = scales::label_number()) +
    theme_classic()

The math is fairly simple:

\[ lived\_after_i = \beta_0 + \beta_1 election\_age_i + \epsilon_i \]

with \(\epsilon_i \sim N(0, \sigma^2)\). \(lived\_after_i\) is the number of years lived after the election for candidate \(i\). \(election\_age_i\) is the number of years lived before the election for candidate \(i\). \(\epsilon_i\) is the “error term,” the difference between the actual years-lived for candidate \(i\) and the modeled years-lived. \(\epsilon_i\) is normally distributed with a mean of 0 and a standard deviation of \(\sigma\). The key distinction is between:

  • Variables, always subscripted with \(i\), whose values (potentially) vary across individuals.

  • Parameters, never subscripted with \(i\), whose values are constant across individuals.

Why do we use \(lived\_after_i\) in this formula instead of \(y_i\)? The more often we remind ourselves about the variables actual substance, the better. But there is another common convention: to always use \(y_i\) as the symbol for the dependent variable. It would not be usual to describe this model as:

\[ y_i = \beta_0 + \beta_1 election\_age_i + \epsilon_i \] Both mean the same thing.

Either way, \(\beta_0\) is the “intercept” of the regression, the average value for the population of \(lived\_after\), among those for whom \(election\_age = 0\). \(\beta_1\) is the “coefficient” of \(election\_age\). When comparing two individuals, the first with an \(election\_age\) 1 year older than the second, we expect the first to have a \(lived\_after\) value \(\beta_1\) different from the second. In other words, we expect the older to have fewer years remaining, because \(\beta_1\) is negative. Again, this is the value for the population from which our data is drawn.

There are three unknown parameters — \(\beta_0\), \(\beta_1\) and \(\sigma\) — just as with the models we used in Chapter 8.

You may recall from middle school algebra that the equation of a line is \(y = m x + b\). There are two parameters: \(m\) and \(b\). The intercept \(b\) is the value of \(y\) when \(x = 0\). The slope coefficient \(m\) for \(x\) is the increase in \(y\) for every increase of one in \(x\). When defining a regression line, we use slightly different notation but the fundamental relationship is the same.

We can implement this model with stan_glm().

fit_gov_1 <- stan_glm(data = ch10,
                      formula = lived_after ~ election_age,
                      refresh = 0,
                      seed = 9)

As we discussed in Chapter 8, the most common term for a model like this is a “regression.” We have “regressed” lived_after, our dependent variable, on election_age, our (only) independent variable.

The parameter values:

print(fit_gov_1, detail = FALSE)
##              Median MAD_SD
## (Intercept)  72.7    2.1  
## election_age -0.9    0.0  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 11.1    0.2

As is almost always the case, \(\sigma\) is a nuisance parameter, somethings whose value we are not interested in. This is why stan_glm() refers to it as an “Auxiliary” parameter.

The posterior distributions of \(\beta_0\) (the intercept) and \(\beta_1\) (the coefficient of \(election\_age_i\)), on the other hand, are important. Before looking at the posteriors themselves, let’s examine the fitted values:

ch10 %>% 
  ggplot(aes(x = election_age, y = lived_after)) +
    geom_point() +
    geom_line(aes(y = fitted(fit_gov_1)), color = "blue") +
    labs(title = "Longevity of Gubernatorial Candidates",
         subtitle = "Blue line shows fitted values", 
         caption = "Data Source: Barfort, Klemmensen and Larsen (2019)",
         x = "Age in Years",
         y = "Years Lived After Election") +
    scale_x_continuous(labels = scales::label_number()) +
    scale_y_continuous(labels = scales::label_number()) +
    theme_classic()

Consider someone who is about 40 years old on Election Day. We have a score or more data points for candidates around that age. Two died soon after the election. Some of them lived for 50 or more years after the election. Variation fills the world. But the fitted line tells us that, on average, we would expect a candidate that age to live for a little less than 40 years after the election.

This model, with a continuous independent (or “predictor”) variable, has an infinite number of fitted values, one for each possible value of election_age.

We can create a formula for the fitted values by placing the median values of the parameters into the mathematical formula:

\[ lived\_after_i = 72.7 - 0.9 election\_age_i\]

Consider the intercept. Since our independent variable is \(election\_age_i\), the intercept is the \(lived\_after_i\) value when \(election\_age_i\) is zero. Here, we would interpret this intercept as the average lifespan of a gubernatorial candidate after the election, if the candidate was alive for zero years prior to the election.

This is, of course, substantively nonsense. No one runs for office on the day they are born. In the next model, we will explore ways of making the intercept more interpretable. In the meantime, the math is the math.

Consider the coefficient for \(election\_age_i\), \(\beta_1\). The median of the posterior, -0.9, represents the slope of the model. For every unit increase in our independent variable, our dependent variable will change by this amount. For every additional day a candidate is alive before an election, their lifespan after the election will be 0.9 years lower, on average. If we are given the number of years a candidate lived before the election and want to estimate how long they will live for after, we will multiply the years they were alive prior by this beta of -0.9, and then combine that with the intercept.

This is a descriptive model, not a causal model. Remember our motto from Chapter 4: No causation without manipulation. There is no way, for person \(i\), to change the years that she has been alive on Election Day. On the day of this election, she is X years old. There is no way to change that. So, there are not two (or more) potential outcomes. Without more than one potential outcome, there can not be a causal effect.

Given that, it is important to monitor our language. We do not believe that that changes in election_age “cause” changes in lived_after. That is obvious. But there are some words and phrases — like “associated with” and “change by” — which are too close to causal. (And which we are guilty of using just a few paragraphs ago!) Be wary of their use. Always think in terms of comparisons when using a predictive model. We can’t change election_age from 40 to 50 for an individual candidate. We can only compare two candidates (or two groups of candidates), one with election_age equal to 40 and the other with election_age equal to 50. If our model is correct, such candidates will, on average, differ in lived_after by \(\beta_1\) times the difference between 40 and 50.

Let’s look at the posterior of \(\beta_1\), the coefficient of \(election\_age_i\):

fit_gov_1 %>% 
  as_tibble() %>% 
  ggplot(aes(election_age)) + 
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior Distribution of the Coefficient of `election_age`",
         y = "Probability",
         x = "Coefficient of `election_age`") + 
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

We center variables when the model’s intercept does not make substantive sense. To center a model, we pick a constant value, usually the mean of the independent variable, and subtract that constant from every value of the independent variable. Another option would be to pick a reasonable reference value, like 50 years-old. Either way, the meaning of election_age changes from the raw number of years a candidate has been alive to the number of years relative to that reference value.

In this example, we want to center the value for election_age, the independent variable. First, we must pick the value that we will center by. Here, we will use the mean of election_age, which is about 52 years old. Once we find this number, we will subtract it from every election_age value, creating a new variable: c_election_age, where the “c” stand for centered. c_election_age is the number of years a candidate has been alive, as of Election Day, relative to the average years alive of all candidates on their respective election days.

ch10 <- ch10 %>% 
  mutate(c_election_age = election_age - mean(election_age))
fit_gov_1_centered <- stan_glm(data = ch10,
                      formula = lived_after ~ c_election_age,
                      refresh = 0,
                      seed = 11)

print(fit_gov_1_centered, detail = FALSE)
##                Median MAD_SD
## (Intercept)    28.2    0.3  
## c_election_age -0.9    0.0  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 11.1    0.2

We can see that the intercept has decreased while the slope has stayed the same. When we interpret this model, we only have to change the definition of the intercept. Rather than the intercept representing the remaining years of a candidate who was alive for zero years before running for governor, it now represents the post-election lifespan of a gubernatorial candidate who was alive for the mean number of years before running. If a candidate was alive for about 52 years before running for governor, they are expected to live for about 28 years after the election, on average.

Think hard about your parameters. What do they mean? Which population do they represent? What ideal Preceptor Table would make their calculation easy? In this case, \(\beta_0\) is the average number of years which gubernatorial candidates live after election day, but just for the subset of candidates who are the average age on Election Day. If we had the ideal Preceptor Table, this would be trivial to calculate. Just take the average for that subset! No estimation required. But, our actual Preceptor Table has lots of missing values. In particular, many gubernatorial candidates have not . . . uh . . . died. (How inconsiderate!) So, we can’t know how many years they will live. All we can do is, first, define \(\beta_0\) and, second, estimate a posterior probability distribution for it.

10.3.2 sex

Let’s now regress lived_after on sex to see how candidates’ post-election lifespans differ by sex.

fit_gov_2 <- stan_glm(data = ch10,
                      formula = lived_after ~ sex - 1,
                      refresh = 0,
                      seed = 16)

Note this workflow. Try one model. Interpret it. Try another model. And then another. There is no one “true” model. There is an infinite space of possible models. Good data science involves an intelligent tour of that space.

print(fit_gov_2, detail = FALSE)
##           Median MAD_SD
## sexFemale 16.0    2.9  
## sexMale   28.5    0.4  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 13.3    0.3

In this regression, we use the -1 in the formula to make the output more straightforward, with no intercept to interpret. The math of this model is the same as those we saw in Chapter 8:

\[ lived\_after_i = \beta_1 female_i + \beta_2 male_i + \epsilon_i\]

where \[female_i, male_i \in \{0,1\}\] \[female_i + male_i = 1\] \[\epsilon_i \sim N(0, \sigma^2)\]

The meaning of \(lived\_after_i\) and of \(\epsilon_i\) is the same as in the first model. Indeed, they are the same throughout these exercises. \(female_i\) and \(male_i\) are 0/1 variables, just like last chapter. They are variables whose values vary across individuals.

The important parameters are \(\beta_1\) and \(\beta_2\). They are the average years-lived post-election for, respectively, women and men. Again, this is not the “average” for the data we have. That is easy to calculate! No estimation required. \(\beta_1\) and \(\beta_2\) are averages for the entire “population,” however we have chosen to define that term. Those averages can not be calculated directly. They can only be estimated, by creating a posterior probability distribution.

Looking back to the regression model we just created, we see that there is no intercept. Instead of having a \(\beta_0\) value, we have \(\beta_1\) and \(\beta_2\) for female and male. This makes things easier to interpret. Without having to add or subtract anything from an intercept, this regression tells us that, on average, women are expected to live about 16 years after running for governor, and men are expected to live about 28 years.

This is a strange result. Why would men live almost twice as long as women after the election? One explanation for this might be that women don’t run for governor until later in life, and therefore are not expected to live as long.

Now that we have interpreted the model using a -1 in the formula to estimate \(\beta_1\) and \(\beta_2\), let’s take away the -1 and regress lived_after on an intercept and sex to see how our equation changes.

fit_gov_2a <- stan_glm(data = ch10,
                       formula = lived_after ~ sex,
                       refresh = 0,
                       seed = 76)
print(fit_gov_2a, detail = FALSE)
##             Median MAD_SD
## (Intercept) 16.1    2.9  
## sexMale     12.3    2.9  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 13.3    0.3

We no longer have a value for female. However we do have an intercept. In this regression our mathematical formula is:

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \epsilon_i\]

\(\beta_0\) is our intercept, around 16 years. This result is very similar to the female value from before. In this type of model, our intercept represents the the variable which is not represented in the model. \(\beta_1\) only affects the outcome when the candidate is male. (If the candidate is female, then \(male_i = 0\). Therefore, the intercept value represents those who are not male, i.e., females.)

When the candidate is a male, we add the coefficient for male to the intercept value, which gives us the average lifespan of a male gubernatorial candidate after an election. As we can see from adding \(\beta_0\) and \(\beta_1\), this value is the same as what we got for males in the previous model.

Be careful with notation! \(\beta_1\) in the no-intercept model is different from \(\beta_1\) in the model with an intercept! Notation varies. We must pay attention each time we make a model.

The posterior distribution for \(\beta_0 + \beta_1\) can be constructed via simple addition.

fit_gov_2a %>% 
  as_tibble() %>% 
  mutate(male_intercept = `(Intercept)` + sexMale) %>% 
  ggplot(aes(male_intercept)) + 
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior Distribution of Average Male Candidate Years Left`",
         y = "Probability",
         x = "Years To Live After the Election") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

But is that actually so “simple?” Not really! Manipulating parameters directly is bothersome. Dealing with variables named (Intercept) is error-prone. Nothing about this approach scales to realistic cases involving numerous predictor variables. Using posterior_epred() and associated functions is much easier.

posterior_epred(fit_gov_2a,
                tibble(sex = "Male")) %>% 
  as_tibble() %>% 
  ggplot(aes(`1`)) + 
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior Distribution of Average Male Candidate Yeas Le  ft`",
         y = "Probability",
         x = "Years To Live After the Election") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

The interpretation of this parameter is the same as we have seen before. There is a true average, across the entire population, of the number of years that male candidates live after the election. We can never know what that true average is. But, it seems very likely that the true average is somewhere between 27.5 and 29.5 years.

10.3.3 c_election_age and sex

Our outcome variable continues to be lived_after, but now we will have two different explanatory variables: c_election_age and sex. Note that sex is a categorical explanatory variable and c_election_age is a continuous explanatory variable. This is the same type of model — parallel slopes — as we saw in Chapter 9

Math:

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \beta_2 election\_age\_c_i + \epsilon_i \]

But wait! The variable name is sex not male. Where does male come from?

The answer is that male is an indicator variable, meaning a 0/1 variable. male takes a value of one if the candidate is “Male” and zero otherwise. This is the same as the \(male_i\) variable used in the previous two examples.

  • The outcome variable is \(lived\_after_i\), the number of years a person is alive after the election. \(male_i\) is one of our explanatory variables. If we are predicting the number of years a male candidate lives after the election, this value will be 1. When we are making this prediction for female candidates, this value will be 0. \(election\_age\_c_i\) is our other explanatory variable. It is the number of years a candidate has lived before the election.

  • \(\beta_0\) is the average number of years lived after the election for women, who on the day of election, have been alive the average number of years of all candidates (i.e. both male and female). \(\beta_0\) is also the intercept of the equation. In other words, \(\beta_0\) is the expected value of \(lived\_after_i\), if \(male_i = 0\) and \(election\_age\_c_i = 0\).

  • \(\beta_1\) is almost meaningless by itself. The only time it has meaning is when its value is connected to our intercept (i.e. \(\beta_0 + \beta_1\)). When the two are added together, you get the average number of years lived after the election for men, who on the day of election, have been alive the average number of years for all candidates.

  • \(\beta_2\) is, for the entire population, the average difference in \(lived\_after_i\) between two individuals, one of whom has an \(election\_age\_c_i\) value of 1 greater than the other.

Let’s translate the model into code.

fit_gov_3 <- stan_glm(data = ch10,
                      formula = lived_after ~ sex + c_election_age,
                      refresh = 0,
                      seed = 12)
print(fit_gov_3, detail = FALSE)
##                Median MAD_SD
## (Intercept)    22.2    2.4  
## sexMale         6.1    2.4  
## c_election_age -0.8    0.0  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 11.1    0.2

Looking at our results, you can see that our intercept value is around 22. The average female candidate, who had been alive the average number of years of all candidates, would live another 22 years or so after the election.

Note that sexMale is around 6. This is our coefficient, \(\beta_1\). We need to connect this value to our intercept value to get something meaningful. Using the formula \(\beta_0 + \beta_1\), we find out that the number of years the average male candidate — who, on the day of election, is the average age of all candidates — would live is around 28 years.

Now take a look at the coefficient for \(election\_age\_c_i\), \(\beta_2\). The median of the posterior, -0.8, represents the slope of the model. When comparing two candates who differ by one year in c_election_age, we expect that they will differ by -0.8 years in lived_after. It makes sense that this value is negative. The more years a candidate has lived, the fewer years the candidate has left to live. So, for every extra year a candidate is alive before an election, their lifespan after the election will be 0.8 years lower, on average.

Let’s now look at some posteriors.

fit_gov_3 %>% 
  as_tibble() %>% 
  mutate(male_years = `(Intercept)` + sexMale) %>% 
  rename(female_years = `(Intercept)`) %>% 
  select(female_years, male_years) %>% 
  pivot_longer(cols = female_years:male_years, 
               names_to = "parameters",
               values_to = "years") %>% 
  ggplot(aes(years, fill = parameters)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
     labs(title = "Posterior Probability Distribution",
         subtitle = "Men live longer",
         x = "Average Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_classic()

The graph above displays the posterior probability distributions for \(\beta_0\) and for $_0 + _1 $. The two distributions are fairly different; the distribution for females is more spread out than that for males. There were a greater number of men than women in the data. The more data we have, the more precise we can be.

Let’s now take a look at a posterior distribution for \(\beta_2\), the coefficient of \(election\_age\_c_i\).

fit_gov_3 %>% 
  as_tibble() %>% 
  ggplot(aes(c_election_age)) + 
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior Distribution of the Coefficient of `c_election_age`",
         y = "Probability",
         x = "Coefficient of `c_election_age`") + 
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

10.3.4 c_election_age, sex and c_election_age*sex

Let’s create another model. This time, however, the numeric outcome variable of lived_after is a function of the two explanatory variables we used above, c_election_age and sex, and of their interaction. To look at interactions, we need 5 parameters, which is why we needed to wait until this chapter to introduce the concept.

Math:

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \beta_2 election\_age\_c_i + \\ \beta_3 male_i * election\_age\_c_i + \epsilon_i \]

  • Our outcome variable is still \(lived\_after_i\). We want to know how many years a candidate will live after an election. Our explanatory variables as the same as before. \(male_i\) is zero for male candidates and one for female candidates. \(election\_age\_c_i\) the number of years a candidate has lived before the election, relative to the average value for all candidates. In this model, we have a third predictor variable: the interaction between \(male_i\) and \(election\_age\_c_i\).

  • \(\beta_0\) is the average number of years lived after the election for women, who on the day of election, have been alive the average number of years of all candidates. In a sense, this is the same meaning as in the previous model, without an interaction term. But, always remember that the meaning of a parameter is conditional on the model in which it is embedded. Even if a parameter is called \(\beta_0\) in two different regressions does necessitate that it means the same thing in both regressions. Parameter names are arbitrary, or at least simply a matter of convention.

  • \(\beta_1\) does not have a simple interpretation as a stand-alone parameter. It is a measure of how different women are from men. However, \(\beta_0 + \beta_1\) has a straightforward meaning exactly analogous to the meaning of \(\beta_0\). The sum is the average number of years lived after the election for men, who on the day of election, have been alive the average number of years of all candidates.

  • \(\beta_2\) is the coefficient of \(election\_age\_c_i\). It it just the slope for women. It is the average difference in \(lived\_after_i\) between two women, one of whom has an \(election\_age\_c_i\) value of 1 greater than the other. In our last example, \(\beta_2\) was the slope for the whole population. Now we are different slopes for different genders.

  • \(\beta_3\) alone is difficult to interpret. However, when it is added to \(\beta_2\), the result in the slope for men.

Code:

fit_gov_4 <- stan_glm(data = ch10,
                      formula = lived_after ~ sex*c_election_age,
                      refresh = 0,
                      seed = 13)
print(fit_gov_4, detail = FALSE)
##                        Median MAD_SD
## (Intercept)            16.5    3.6  
## sexMale                11.9    3.6  
## c_election_age         -0.1    0.4  
## sexMale:c_election_age -0.8    0.4  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 11.1    0.2

The intercept has increased. \(\beta_0\) is around 17. This is the intercept for females. It still means the average number of years lived after the election for women is 17 or so. Our sexMale coefficient, \(\beta_1\), refers to the value that must be added to the intercept in order to get the average for males. When calculated, the result is 28. Keep in mind, however, that these values only apply if \(election\_age\_c_i = 0\), if, that is, candidate \(i\) is around 52 years old.

The coefficient for \(election\_age\_c_i\), \(\beta_2\), is -0.1. What does this mean? It is the slope for females. So, when comparing two female candidates who differ by one year in age, we expect that the older candidate will live 0.1 years less. Now direct your attention below at the coefficient of sexMale:c_election_age, \(\beta_3\), which is -0.8. This is the value that must be added to the coefficient of \(election\_age\_c_i\) (recall \(\beta_2 + \beta_3\)) in order to find the slope for males. When the two are added together, this value, or slope, is about -0.9. When comparing two male candidates who differ in age by one year, we expect the older candidate to live about 0.9 years less.

Key point: The interpretation of the intercepts only apply to candidates for whom \(election\_age\_c_i = 0\). Candidates who are not 52 years-old will have a different expected number of years to live. The interpretation of the slope applies to everyone. In other words, the relationship between \(lived\_after_i\) and \(election\_age\_c_i\) is the same, regardless of your gender or how old you are.

The posterior:

fit_gov_4 %>% 
  as_tibble() %>% 
  mutate(male_years = `(Intercept)` + sexMale) %>% 
  rename(female_years = `(Intercept)`) %>% 
  select(female_years, male_years) %>% 
  pivot_longer(cols = female_years:male_years, 
               names_to = "parameters",
               values_to = "years") %>% 
  ggplot(aes(years, fill = parameters)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Years Lived After the Election",
         subtitle = "Men live longer",
         x = "Average Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

Again, we do not recommend working directly with parameters. The above analysis would be much easier with posterior_epred(), as we will see in the Temperance Section.

Male candidates live longer on average than female candidates. Note, also, that the average years to live after the election for females is about 17 with this model. With the previous model, it was 22 years. Why the difference? The interpretation of “average” is different! In the previous model, it was the average for all women. In this model, it is the average for all 52 years-old women. Those are different things, so we should hardly be surprised by different posteriors.

Slope posteriors:

fit_gov_4 %>% 
  as_tibble() %>% 
  mutate(slope_men = c_election_age + `sexMale:c_election_age`) %>% 
  rename(slope_women = c_election_age) %>% 
  select(slope_women, slope_men) %>% 
  pivot_longer(cols = slope_women:slope_men, 
               names_to = "parameters",
               values_to = "slope") %>% 
  ggplot(aes(slope, fill = parameters)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Slope of Years-Lived on Years-to_Live",
         subtitle = "Men have a steeper slope",
         x = "slope",
         y = "Probability") + 
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic() 

This posterior distribution shows the average slope values for men and women. You can see that men have a steeper slope, while the slope for women is practically 0! If you are trying to forecast the number of years that a women will live after the election, you may ignore the number of years that she has already lived. This is definitely not true for men. Why the difference?

10.3.5 What is \(\beta_0\)?

Attentive readers will have noticed that the meaning of \(\beta_0\) is identical for the last two models and, at the same time, our estimate for its posterior are very different. How can that be?

Recall the math:

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \beta_2 election\_age\_c_i + \epsilon_i \]

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \beta_2 election\_age\_c_i + \\ \beta_3 male_i * election\_age\_c_i + \epsilon_i\] In both cases, \(\beta_0\) has the same meaning. It is the average number of years lived after the election for women, who on the day of election, have been alive the average number of years of all candidates. Plug in 0 for \(male_i\) and 0 for \(election\_age\_c_i\) and both equations simplify to:

\[ lived\_after_i = \beta_0 + \epsilon_i \] So far, so good. But, despite the fact that we are using the same data, the posterior probability distributions for \(\beta_0\) are meaningfully different between the two models:

m_3 <- fit_gov_3 %>% 
  as_tibble() %>%
  select(`(Intercept)`) %>% 
  rename(`No Interaction Term` = `(Intercept)`)

m_4 <- fit_gov_4 %>% 
  as_tibble() %>%
  select(`(Intercept)`) %>% 
  rename(`With Interaction Term` = `(Intercept)`)

bind_cols(m_3, m_4) %>% 
  select(`No Interaction Term`, `With Interaction Term`) %>% 
  pivot_longer(cols = `No Interaction Term`:`With Interaction Term`, 
               names_to = "Model",
               values_to = "years") %>% 
  ggplot(aes(years, fill = Model)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior Probability Distributions",
         subtitle = "For intercept in two different models",
         x = "Additional Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

There is enough overlap between the two distributions that they aren’t completely incompatible. If, in truth, \(\beta_0 = 20\), neither posterior distribution is “wrong.”

The meaning of any parameter, including \(\beta_0\), is conditional on two assumptions: the population for which we seek to make inferences and the structure of the model. In this case, we assume that the population of interest is the same. But the models are different, albeit only by one interaction term. But that is enough of a difference to make \(\beta_0\) in the first model a different “thing” that \(\beta_0\) in the second model, even if one “interpretation” of what they mean is the same.

The trick is that an individual interpretation of the meaning of a parameter just captures one characteristic of the meaning of that parameter. It is just a single view of the “thing.” That view is fine, as far as it goes, but it can’t show us all aspects of the parameter. A picture of the full moon tells us little about what appears on its dark side.

Consider a different question: What is the expected value for male candidates who are 1 year older than average:

  • Without an interaction term: \(\beta_0 + \beta_1 + \beta_2 1\).

  • With an interaction term: \(\beta_0 + \beta_1 + \beta_2 1 + \beta_3 1\).

We shouldn’t be surprised that different models generate different formulas for this question. Yet note how \(\beta_0\) appears in both. In other words, \(\beta_0\) is not only the average number of years lived after the election for women, who on the day of election, have been alive the average number of years of all candidates. It is also part of a multitude of other estimates.

The only way to understand all of what is wrapped up in \(\beta_0\) is to see it in the context of the full mathematical formula. Since we have two different models, defined by two different formulas, it is hardly surprising that the posterior probability distributions for \(beta_0\) will differ, even though one specific interpretation — one out of many — is the same.

Our advice: Spend less time thinking about what parameters mean and more time using posterior_epred() and posterior_predict() to examine the implications of your models. Those posteriors are the real substance of your model.

10.4 Temperance

Temperance

FIGURE 10.4: Temperance

Consider:

  • How long will two political candidates — one male and one female, both 10 years older than the average candidate — live after the election? How different will their lifespans be?

These questions are, purposely, less precise than the ones we tackled in Chapters 7 and 8, written more in a conversational style. This is how normal people talk.

However, as data scientists, our job is to bring precision to these questions. There are two commonsense interpretations. First, we could be curious about the expected values for these questions. If we averaged the data for a thousand candidates like these, what would the answer be? Second, we could be curious about two specific individuals. How long will they live? Averages involve questions about parameters. The fates of individuals require predictions. Those are general claims, violated too often to be firm rules. Yet, they highlight a key point: expected values are less variable than individual predictions.

To calculate expected values, use posterior_epred(). To forecast for individuals, use posterior_predict().

10.4.1 Expected values

Consider the “on average” interpretation first. The answer begins with the posterior distributions of the parameters in fit_gov_4.

newobs = tibble(sex = c("Male", "Female"), 
                 c_election_age = 10)

pe <- posterior_epred(object = fit_gov_4, 
                      newdata = newobs) %>% 
  as_tibble() %>% 
  rename("Male" = `1`,
         "Female" = `2`)


pe %>% 
 pivot_longer(cols = Male:Female, 
               names_to = "Gender",
               values_to = "years") %>% 
  ggplot(aes(years, fill = Gender)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Expected Years Lived Post-Election",
         subtitle = "Male candidates live longer",
         x = "Years",
         y = "Probability") + 
    scale_x_continuous(labels = 
                         scales::number_format(accuracy = 1)) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

Looking at our posterior probability distributions above, we can see that male candidates are expected to live longer. But how much longer? As in previous chapters, we can manipulate distributions in, more or less, the same way that we manipulate simple numbers. If we want to know the difference between two posterior distributions, we can simply subtract.

pe <- posterior_epred(object = fit_gov_4, 
                      newdata = newobs) %>% 
  as_tibble() %>% 
  mutate(diff = `1` - `2`)


pe %>% 
  ggplot(aes(diff)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Expected Additional Male Years Lived",
         subtitle = "Male candidates live about 4 years longer",
         x = "Expected Additional Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_classic()

The average value of the difference in years-to-live is probably positive, with the most likely value being around 4 years. But there still a 6% chance the true value is less than zero, i.e., that we should expect female candidates to live longer.

Instead of using posterior_epred(), we could have answered these questions by using the posterior probability distributions for the parameters in the model, along with some simple math. Don’t do this! First, you are much more likely to make a mistake. Second, this approach does not generalize well to complex models with scores of parameters and their interactions.

10.4.2 Individual predictions

If, instead, we interpret the question as asking for a prediction for a small number of individuals, then we need to use posterior_predict().

Use posterior_predict() to create draws from the posterior probability distribution for our prediction for these cases. posterior_predict() takes two arguments: the model for which the simulations should be run, and a tibble indicating the covariate values for the individual(s) we want to predict. In this case, we are using the fit_gov_4 model and the tibble is the one we just created above. In other words, the inputs for posterior_predict() and posterior_epred() are identical.

pp <- posterior_predict(object = fit_gov_4, 
                        newdata = newobs) %>%
  as_tibble() %>% 
  mutate_all(as.numeric) %>% 
  rename("Male" = `1`,
         "Female" = `2`)

pp  
## # A tibble: 4,000 x 2
##     Male Female
##    <dbl>  <dbl>
##  1  5.03  24.6 
##  2 24.1    6.62
##  3 15.6   30.6 
##  4  4.16  23.0 
##  5 13.7   31.3 
##  6 25.4   18.3 
##  7 23.8   10.6 
##  8 27.2   41.8 
##  9 -1.71   1.90
## 10 17.5   25.3 
## # … with 3,990 more rows

The resulting tibble has 2 columns, the first for a male candidate and the second for female candidate. Both columns are draws from the posterior predictive distributions. In both cases, the forecasts depend on the values of all the covariates. That is, we would provide a different forecast if the candidates were younger or older.

Why do we need the weird mutate_all(as.numeric) incantation? The reason is that posterior_epred() returns a simple matrix, which is easy to transform into a tibble. posterior_predict(), on the other hand, returns a special sort of matrix which is much harder to work with. So, we need a little hackery to make the next steps easier.

Let’s look at the posterior predictive distribution for each candidate.

pp %>% 
  pivot_longer(cols = Male:Female, 
               names_to = "Gender",
               values_to = "years") %>% 
  ggplot(aes(years, fill = Gender)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for a Candidate's Years Lived Post-Election",
         subtitle = "Individual lifespans have a great deal of variation",
         x = "Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

There is a big overlap in the predictions for individuals while, at the same time, there is much less overlap in the averages. Random stuff happens to an individual all the time. Random stuff cancels out when you take the average for many individuals. Consider the difference in the posterior predictive distributions for the two individuals.

pp %>% 
  mutate(diff = Male - Female) %>% 
  ggplot(aes(diff)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for a Male Candidate's Extra Years Lived", 
         subtitle = "Any random male candidate may die before a random female candidate",
         x = "Years",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

In words, we would predict that the male candidate would live longer than the female candidate. By how much? Well, that number is an unknown parameter. By looking at our posterior above, our best estimate is about 3.8 years. However, it is quite possible that, for any given male/female candidates, the female will live longer.

pp %>% 
  mutate(diff = Male - Female) %>% 
  summarize(f_live_longer = sum(diff < 0),
            total = n(),
            f_live_longer / total)
## # A tibble: 1 x 3
##   f_live_longer total `f_live_longer/total`
##           <int> <int>                 <dbl>
## 1          1617  4000                 0.404

In fact, there is a 4 in 10 chance that the female candidate lives longer.

Note what is the same and what is different when we move from a question about averages to a question about individuals. In both cases, the most likely value is about the same. That is, the average behavior is the same as our expected value for any given individual. But the uncertainty is much greater for an individual prediction. The chance of the true average for male candidates being less than that for female candidates is low. Yet, for any individual pair of candidates, it would not even be slightly surprising for the female candidate to outlive the male candidate. Individuals vary. Averages never tell the whole story.

10.4.3 Expectation versus individual variation

Let’s compare the results from posterior_epred() and posterior_predict() for this scenario directly. Most of this code is the same as what we have shown you above, but we think it is useful to look at everything together.

newobs <- tibble(sex = c("Male", "Female"),
                  c_election_age = 10)

pe <- posterior_epred(fit_gov_4, 
                      newdata = newobs) %>%
  as_tibble() %>% 
  mutate(diff = `1` - `2`)

pp <- posterior_predict(fit_gov_4, 
                        newdata = newobs) %>% 
  as_tibble() %>% 
  mutate_all(as.numeric) %>% 
  mutate(diff = `1` - `2`)

tibble(Expectation = pe$diff,
       Prediction = pp$diff) %>% 
  pivot_longer(cols = Expectation:Prediction, 
               names_to = "Type",
               values_to = "years") %>% 
  ggplot(aes(years, fill = Type)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Expected and Individual Male Advantage",
         subtitle = "Expected male advantage is much more precisely estimated",
         x = "Additional Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

Expected values vary much less than predictions. The above chart makes that easy to see. We are somewhat sure that the true underlying average for the numbers of years that male candidates live post-election is more than female candidates. But, for any two individual candidates, there is a good chance that that the female candidate will live longer. We can not ignore \(\epsilon\) when predicting the outcome for individuals. When estimating expected values or long-run averages, the \(\epsilon\)’s cancel out.

10.4.4 Testing

In a normal statistics book, we would have already discussed the concept of “testing” extensively. The terminology varies by field. “Tests,” “testing,” “hypothesis tests,” “tests of significance,” and “null hypothesis significance testing” all refer to the same concept. We will refer to this collection of approaches as NHST, a common abbreviation derived from the initials of the last phrase. Wikpedia provides an overview.

Our view: Amateurs test. Professionals summarize.

Consider the question of the expected difference in lifespans for our South Dakato and Washington candidates. Is that difference “significant?” Can we “reject the null hypothesis?” The convential answer is Yes. Anytime that zero is outside of the 95% confidence interval, we can declare the result significant. But why should we? What does that gain us? We already have the full posterior probability distribution. That is knowledge. A Yes/No question throws away too much information to (almost) ever be useful. There is no reason to test when you can summarize by providing the full posterior probability distribution.

The same arguments apply in the case of “insignificant” results, with “\(p > 0.5\), when we can’t”reject" the null hypothesis. Instead of expected values, consider the case of two candidates, one from South Dakota and one from Washington. Is the difference in their predicted lifespans “significant?” Who cares!? We have the full posterior probability distribution for that prediction — also known as the posterior predictive distribution — as graphed above. The 95% confidence interval includes zero. Does that mean we should throw it away? No! That would be nonsense. Yes, there is a 20% chance that the South Dakota candidate lives longer, so we can hardly be surprised when that happens. But we still think that it is much more likely that the Washington candidate lives longer. Indeed, we would consider 4-to-1 odds as fair. The fact that the difference is not “significant” has no relevance to how we use the posterior to make decisions.

The same reasoning applies to every parameter we estimate, to every prediction we make. Never test — unless your boss demands a test. Use your judgment, make your models, summarize your knowledge of the world, and use that summary to make decisions.

10.5 Summary

Avoid answering questions by working with parameters directly. Use posterior_epred() instead.

Good data science involves an intelligent tour of the space of possible models.

Always think in terms of comparisons when using a predictive model.