10 Five Parameters

Last chapter, we studied four parameters: models in which we included multiple right-hand side variables at once. The next step in our model building education is to learn about interactions. The effect of a treatment relative to a control is almost never uniform. The effect might be bigger in women than in men, smaller in the rich than in the poor. The technical term for such effects is “heterogeneous,” which is just Ph.D.’ese for “different.” With enough data, all effects are heterogeneous. Causal effects, at least in the social sciences, always vary across units. To model this reality, we rely on interactions, which allow the effect size to differ across groups. The same applies to predictive models. The relationship between our outcome variable \(y\) and a predictor variable \(x\) is rarely constant; it varies based on the values of other variables. To take account of interactions, we need models with at least five parameters.

Consider the following questions:

How many years would we expect two political candidates — one male and one female, both 10 years older than the average candidate — to live after the election? How different will their lifespans be? More broadly, how long do candidates, in general, live after the election? Does winning the election affect their longevity?

Note how far we have come in the Primer. These are difficult questions, involving issues of both prediction and causation. Yet, if we follow the Cardinal Virtues, we can provide sophisticated answers.

10.1 Wisdom

Always start by looking at the data.

Wisdom

FIGURE 10.1: Wisdom

10.1.1 EDA of governors

Packages:

The primer.data package includes the governors data set, which features demographic information about candidates for governor in the United States. Barfort, Klemmensen, and Larsen (2020) gathered this data and concluded that winning a gubernatorial election increases a candidate’s lifespan.

glimpse(governors)
## Rows: 1,092
## Columns: 14
## $ state        <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
## $ year         <int> 1946, 1946, 1950, 1954, 1954, 1958, 1962, 1966, 1966, 197…
## $ first_name   <chr> "James", "Lyman", "Gordon", "Tom", "James", "William", "G…
## $ last_name    <chr> "Folsom", "Ward", "Persons", "Abernethy", "Folsom", "Long…
## $ party        <chr> "Democrat", "Republican", "Democrat", "Republican", "Demo…
## $ sex          <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "…
## $ died         <date> 1987-11-21, 1948-12-17, 1965-05-29, 1968-03-07, 1987-11-…
## $ status       <chr> "Challenger", "Challenger", "Challenger", "Challenger", "…
## $ win_margin   <dbl> 77.3, -77.3, 82.2, -46.7, 46.7, -77.5, 100.0, -34.3, 34.3…
## $ region       <chr> "South", "South", "South", "South", "South", "South", "So…
## $ population   <dbl> 2906000, 2906000, 3058000, 3014000, 3014000, 3163000, 332…
## $ election_age <dbl> 38, 79, 49, 47, 46, 33, 43, 48, 40, 51, 68, 55, 45, 52, 6…
## $ death_age    <dbl> 79, 81, 63, 60, 79, 88, 79, 99, 42, 79, 75, 79, 76, 81, 7…
## $ lived_after  <dbl> 41.0, 2.1, 14.6, 13.3, 33.0, 54.5, 35.9, 51.0, 1.5, 27.9,…

There are 14 variables and 1,092 observations. In this chapter, we will only use the variables last_name, year, state, sex, lived_after, and election_age, along with two variables, won and close_race, which we create below.

ch10 <- governors %>% 
  mutate(won = ifelse(win_margin > 0, TRUE, FALSE)) %>% 
  mutate(close_race = ifelse(abs(win_margin) < 5, TRUE, FALSE)) %>% 
  select(last_name, year, state, sex, lived_after, election_age, won, close_race)

election_age and lived_after are how many years a candidate lived before and after the election, respectively. As a consequence, only politicians who are already deceased are included in this data set. This means that there are only a handful of observations from elections in the last 20 years; most candidates from that period are still alive and are, therefore, excluded. We created the won variable to indicate whether or not the candidate won the election. We define close_race to be TRUE if the absolute value of the winning margin was less than 5 percentage points.

One subtle issue: Should the same candidate be included multiple times? For example:

ch10 %>% 
  filter(last_name == "Cuomo")
## # A tibble: 4 x 8
##   last_name  year state    sex   lived_after election_age won   close_race
##   <chr>     <int> <chr>    <chr>       <dbl>        <dbl> <lgl> <lgl>     
## 1 Cuomo      1982 New York Male         32.2         50.4 TRUE  TRUE      
## 2 Cuomo      1986 New York Male         28.2         54.4 TRUE  FALSE     
## 3 Cuomo      1990 New York Male         24.2         58.4 TRUE  FALSE     
## 4 Cuomo      1994 New York Male         20.2         62.4 FALSE TRUE

For now, we leave in multiple observations for a single person.

Let’s look at a random sample from our data set.

ch10 %>% 
  slice_sample(n = 5)
## # A tibble: 5 x 8
##   last_name  year state        sex    lived_after election_age won   close_race
##   <chr>     <int> <chr>        <chr>        <dbl>        <dbl> <lgl> <lgl>     
## 1 Ristine    1964 Indiana      Male          44.6         44.8 FALSE FALSE     
## 2 Sundlun    1986 Rhode Island Male          24.7         66.8 FALSE FALSE     
## 3 Richards   1990 Texas        Female        15.9         57.2 TRUE  TRUE      
## 4 Turner     1946 Oklahoma     Male          26.6         52.0 TRUE  FALSE     
## 5 Williams   1948 Michigan     Male          39.2         37.7 TRUE  FALSE

As we might expect, sex is most often “Male.” For a more thorough inspection of our data, let’s skim() the data set.

skim(ch10)
TABLE 10.1: Data summary
Name ch10
Number of rows 1092
Number of columns 8
_______________________
Column type frequency:
character 3
logical 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
last_name 0 1 3 11 0 615 0
state 0 1 4 14 0 50 0
sex 0 1 4 6 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
won 0 1 0.53 TRU: 576, FAL: 516
close_race 0 1 0.23 FAL: 838, TRU: 254

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1965 13.4 1945.00 1954 1962 1974 2011 ▇▆▃▂▁
lived_after 0 1 28 13.4 0.13 18 30 39 60 ▃▆▇▆▂
election_age 0 1 52 8.7 31.35 45 51 57 84 ▂▇▆▂▁

skim() groups the variables together by type and provides summary statistics for each variable. We are also given histograms of the numeric variables.

Looking at the histogram for year, we see that it is skewed right — meaning that most of the data is bunched to the left and that there is a smaller tail to the right — with half of the observations from election years between 1945 and 1962. This makes sense logically, because we are only looking at deceased candidates, and candidates from more recent elections are more likely to still be alive.

In using this data set, our left-side variable will be lived_after. We are trying to understand/predict how many years a candidate will live after the election.

ch10 %>%
  ggplot(aes(x = year, y = lived_after)) +
  geom_point() +
  labs(title = "US Gubernatorial Candidate Years Lived Post-Election",
       subtitle = "Candidates who died more recently can't have lived for long post-election",
       caption = "Data: Barfort, Klemmensen and Larsen (2020)",
       x = "Year Elected",
       y = "Years Lived After Election") +
  scale_y_continuous(labels = scales::label_number()) +
  theme_classic() 

Note that there is a rough line above which we see no observations. Why might this be? The upper-right region of the plot is empty because a candidate elected after 2000 could not have lived more than about 21 years after the election and still have died before the data were collected. Simply put: this “edge” of the data represents, approximately, the most years a candidate could have lived, and still have died, given the year that they were elected.

The edge slants downward because this maximum possible value is greater in earlier years. That is, candidates who ran for governor in earlier years could live a long time after the election and still have died prior to the creation of the data set, giving them higher lived_after values than those who ran for office more recently. The edge of the scatterplot is not perfectly straight because, for many election years, no candidate had the decency to die just before data collection. The reason for so few observations in later years is that fewer recent candidates have died.

To begin visualizing our lived_after data, we will inspect the difference in years lived post election between male and female candidates.

ch10 %>%
  ggplot(aes(x = sex, y = lived_after)) +
  geom_boxplot() +
  labs(title = "US Gubernatorial Candidate Years Lived Post-Election",
       subtitle = "Male candidates live much longer after the election",
       caption = "Data: Barfort, Klemmensen and Larsen (2020)",
       x = "Gender",
       y = "Years Lived After Election") +
  scale_y_continuous(labels = scales::label_number()) +
  theme_classic() 

This plot shows that men live much longer, on average, than women after the election. Does that make sense to you? Is there an intuitive explanation for why this might be?

10.1.2 Population

The concept of the “population” is subtle and important. The population is not the set of candidates for which we have data. That is the dataset. The population is the larger — potentially much larger — set of individuals about whom we want to make inferences. The parameters in our models refer to the population, not to the dataset.

Consider a simple example. Define \(\mu\) as the average number of years lived by candidates for governor after Election Day. Can we calculate \(\mu\) from our data? No! There are many candidates for governor who are still alive, who are not included in our data even though they are part of the “population” we want to study. \(\mu\) can not be calculated. It can only be estimated.

Note, also, that there are many different populations, each with its own \(\mu\), in which we might be interested.

  • The population of all candidates for governor in the US from 1945 to 2012. This is the period covered in the paper.

  • The population of all candidates for governor in the US from 1900 to 2012. A priori, we would not expect a major difference between candidates who run in 1946 and those who run in 1942. How different could they be? It is therefore not unreasonable to extend the population of interest back in time, even if we have no data for earlier periods.

  • The population of all candidates for governor in the US from 1945 to 2030. We are often interested in the future. We want to make predictions about what will happen to candidates, including candidates who have not yet run for office.

  • The population of candidates for governor around the world. Other countries have governors! We want to understand their longevity as well.

  • The population of candidates for all political offices in the US. We might expect candidates for Senator to have similar lifespans to candidates for Governor.

  • And so on. There are as many possible populations as there are questions we might ask.

All of these populations are different, so each has a different \(\mu\). Which \(\mu\) we are interested in depends on the problem we are trying to solve. It is a judgment call, a matter of Wisdom, as to whether or not the data we have is “close enough” to the population we are interested in to justify making a model.

Even though the original question is about “political candidates” in general, and does not specifically refer to either the United States or to elections for governor, we will assume that the data we have for US governors is, uh, representative enough of the population we are interested in (global politicians) that the exercise is useful. If we did not believe that, then we should stop right now. The major part of Wisdom is deciding what questions you can’t answer because of the data you just don’t have.

10.2 Justice and Courage

Justice

FIGURE 10.2: Justice

Courage

FIGURE 10.3: Courage

Because we will be going through a series of models in this chapter, it is useful to combine the virtues of Justice and Courage.

10.2.1 election_age

To begin, let’s model candidate lifespan after the election as a function of candidate lifespan prior to the election. The data:

ch10 %>% 
  ggplot(aes(x = election_age, y = lived_after)) +
    geom_point() +
    labs(title = "Longevity of Gubernatorial Candidates",
         subtitle = "Younger candidates live longer", 
         caption = "Data Source: Barfort, Klemmensen and Larsen (2020)",
         x = "Age in Years on Election Day",
         y = "Years Lived After Election") +
    scale_x_continuous(labels = scales::label_number()) +
    scale_y_continuous(labels = scales::label_number()) +
    theme_classic()

The math is fairly simple:

\[ lived\_after_i = \beta_0 + \beta_1 election\_age_i + \epsilon_i \]

with \(\epsilon_i \sim N(0, \sigma^2)\), where:

  • \(lived\_after_i\) is the number of years lived after the election for candidate \(i\).

  • \(election\_age_i\) is the number of years lived before the election for candidate \(i\).

  • \(\epsilon_i\) is the “error term,” the difference between the actual years-lived for candidate \(i\) and the modeled years-lived. \(\epsilon_i\) is normally distributed with a mean of 0 and a standard deviation of \(\sigma\).

The key distinction is between:

  • Variables, always subscripted with \(i\), whose values (potentially) vary across individuals.

  • Parameters, never subscripted with \(i\), whose values are constant across individuals.

Why do we use \(lived\_after_i\) in this formula instead of \(y_i\)? The more often we remind ourselves about the variable’s actual substance, the better. But there is another common convention: to always use \(y_i\) as the symbol for the dependent variable. It would not be unusual to describe this model as:

\[ y_i = \beta_0 + \beta_1 election\_age_i + \epsilon_i\]

Both mean the same thing.

Either way, \(\beta_0\) is the “intercept” of the regression, the average value for the population of \(lived\_after\), among those for whom \(election\_age = 0\).

\(\beta_1\) is the “coefficient” of \(election\_age\). When comparing two individuals, the first with an \(election\_age\) one year older than the second, we expect the first to have a \(lived\_after\) value \(\beta_1\) different from the second. In other words, we expect the older to have fewer years remaining, because \(\beta_1\) is negative. Again, this is the value for the population from which our data is drawn.

There are three unknown parameters — \(\beta_0\), \(\beta_1\) and \(\sigma\) — just as with the models we used in Chapter 8. Before we get to the five parameter case, it is useful to review this earlier material.

You may recall from middle school algebra that the equation of a line is \(y = m x + b\). There are two parameters: \(m\) and \(b\). The intercept \(b\) is the value of \(y\) when \(x = 0\). The slope coefficient \(m\) for \(x\) is the increase in \(y\) for every one unit increase in \(x\). When defining a regression line, we use slightly different notation but the fundamental relationship is the same.
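To make the correspondence explicit, \(b\) plays the role of \(\beta_0\) and \(m\) plays the role of \(\beta_1\); the regression simply adds the error term:

\[ y = m x + b \quad \longleftrightarrow \quad lived\_after_i = \beta_0 + \beta_1 election\_age_i + \epsilon_i \]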

We can implement this model with stan_glm().

fit_1 <- stan_glm(data = ch10,
                      formula = lived_after ~ election_age,
                      refresh = 0,
                      seed = 9)

As we discussed in Chapter 8, the most common term for a model like this is a “regression.” We have “regressed” lived_after, our dependent variable, on election_age, our (only) independent variable.

The parameter values:

print(fit_1, detail = FALSE)
##              Median MAD_SD
## (Intercept)  72.7    2.1  
## election_age -0.9    0.0  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 11.1    0.2

As is almost always the case, \(\sigma\) is a nuisance parameter, a value we are not interested in. This is why stan_glm() refers to it as an “Auxiliary” parameter.

The posterior distributions of \(\beta_0\) (the intercept) and \(\beta_1\) (the coefficient of \(election\_age_i\)), on the other hand, are important. Before looking at the posteriors themselves, let’s examine the fitted values:

ch10 %>% 
  ggplot(aes(x = election_age, y = lived_after)) +
    geom_point() +
    geom_line(aes(y = fitted(fit_1)), color = "blue") +
  
    # Add red box to highlight the slice of data we are discussing below using
    # geom_rect().
  
    geom_rect(aes(xmin = 38.5 , xmax = 41.5, ymin = 0, ymax = 60),
               fill = "transparent", color = "red", size = .5) +
    labs(title = "Longevity of Gubernatorial Candidates",
         subtitle = "Blue line shows fitted values", 
         caption = "Data Source: Barfort, Klemmensen and Larsen (2020)",
         x = "Age in Years at Time of Election",
         y = "Years Lived After Election") +
    scale_x_continuous(labels = scales::label_number()) +
    scale_y_continuous(labels = scales::label_number()) +
    theme_classic()

Consider someone who is about 40 years old on Election Day. We have a score or more data points for candidates around that age. This area is highlighted by the red box on our plot. As we can see, two died soon after the election. Some of them lived for 50 or more years after the election. Variation fills the world. However, the fitted line tells us that, on average, we would expect a candidate that age to live about 37 years after the election.

This model, with a continuous independent (or “predictor”) variable, has an infinite number of fitted values, one for each possible value of election_age.

We can create a formula for the fitted values by placing the median values of the posterior distribution of the parameters into the mathematical formula:

\[lived\_after_i = 72.7 - 0.9 election\_age_i\]

Consider the intercept. Since our independent variable is \(election\_age_i\), the intercept is the \(lived\_after_i\) value when \(election\_age_i\) is zero. Here, we would interpret this intercept as the average lifespan of a gubernatorial candidate after the election, if the candidate was alive for zero years prior to the election.

This is, of course, substantively nonsense. No one runs for office in the year they are born. In the next model, we will explore ways of making the intercept more interpretable. In the meantime, the math is the math.

Consider the coefficient for \(election\_age_i\), \(\beta_1\). The median of the posterior, -0.9, represents the slope of the model: when comparing two candidates whose values of the independent variable differ by one unit, we expect their values of the dependent variable to differ by this amount. For every additional year a candidate has been alive before an election, their lifespan after the election will be about 0.9 years lower, on average. Given the number of years a candidate lived before the election, we estimate how long they will live after it by multiplying those years by -0.9 and adding the intercept.
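As a quick check, plug a candidate who is 40 years old on Election Day into the fitted formula:

\[ 72.7 - 0.9 \times 40 = 36.7 \]

which matches the roughly 37 years we read off the fitted line earlier.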

This is a descriptive model, not a causal model. Remember our motto from Chapter 4: No causation without manipulation. There is no way, for person \(i\), to change the years that she has been alive on Election Day. On the day of this election, she is X years old. There is no way to change that. So, there are not two (or more) potential outcomes. Without more than one potential outcome, there can not be a causal effect.

Given that, it is important to monitor our language. We do not believe that changes in election_age “cause” changes in lived_after. That is obvious. But there are some words and phrases — like “associated with” and “change by” — which are too close to causal. (And which we are guilty of using just a few paragraphs ago!) Be wary of their use. Always think in terms of comparisons when using a predictive model. We can’t change election_age from 40 to 50 for an individual candidate. We can only compare two candidates (or two groups of candidates), one with election_age equal to 40 and the other with election_age equal to 50. If our model is correct, such candidates will, on average, differ in lived_after by \(\beta_1\) times the difference between 40 and 50.

Let’s look at the posterior of \(\beta_1\), the coefficient of \(election\_age_i\):

fit_1 %>% 
  as_tibble() %>% 
  ggplot(aes(election_age)) + 
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior Distribution of the Coefficient of `election_age`",
         y = "Probability",
         x = "Coefficient of `election_age`") + 
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

As discussed in Chapter 9, we center variables when the model’s intercept does not make substantive sense. To center a variable, we pick a constant value, usually the mean of the independent variable, and subtract that constant from every value of the independent variable. Another option would be to pick a reasonable reference value, like 50 years old. Either way, the meaning of election_age changes from the raw number of years a candidate has been alive to the number of years relative to that reference value.

In this example, we want to center the value for election_age, the independent variable. First, we must pick the value that we will center by. Here, we will use the mean of election_age, which is about 52 years old. Once we find this number, we will subtract it from every election_age value, creating a new variable: c_election_age, where the “c” stands for centered. c_election_age is the number of years a candidate has been alive, as of Election Day, relative to the average years alive of all candidates on their respective election days.

ch10 <- ch10 %>% 
  mutate(c_election_age = election_age - mean(election_age))
fit_1_centered <- stan_glm(data = ch10,
                      formula = lived_after ~ c_election_age,
                      refresh = 0,
                      seed = 11)

print(fit_1_centered, detail = FALSE)
##                Median MAD_SD
## (Intercept)    28.2    0.3  
## c_election_age -0.9    0.0  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 11.1    0.2

We can see that the intercept has decreased while the slope has stayed the same. When we interpret this model, we only have to change the definition of the intercept. Rather than the intercept representing the remaining years of a candidate who was alive for zero years before running for governor, it now represents the post-election lifespan of a gubernatorial candidate who was alive for the mean number of years before running. If a candidate was alive for about 52 years before running for governor, they are expected to live for about 28 years after the election, on average.

Think hard about your parameters. What do they mean? Which population do they represent? What ideal Preceptor Table would make their calculation easy? In this case, \(\beta_0\) is the average number of years which gubernatorial candidates live after election day, but just for the subset of candidates who are the average age on Election Day. If we had the ideal Preceptor Table, this would be trivial to calculate. Just take the average for that subset! No estimation required. But, our actual Preceptor Table has lots of missing values. In particular, many gubernatorial candidates have not . . . uh . . . died. (How inconsiderate!) So, we can’t know how many years they will live. All we can do is, first, define \(\beta_0\) and, second, estimate a posterior probability distribution for it.

10.2.2 sex

Let’s now regress lived_after on sex to see how candidates’ post-election lifespans differ by sex.

fit_2 <- stan_glm(data = ch10,
                      formula = lived_after ~ sex - 1,
                      refresh = 0,
                      seed = 16)

Note this workflow. Try one model. Interpret it. Try another model. And then another. There is no one “true” model. There is an infinite number of possible models. Good data science involves an intelligent tour of that space.

print(fit_2, detail = FALSE)
##           Median MAD_SD
## sexFemale 16.0    2.9  
## sexMale   28.5    0.4  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 13.3    0.3

In this regression, we use the -1 in the formula to make the output more straightforward, with no intercept to interpret. The math of this model is the same as those we saw in Chapter 8:

\[ lived\_after_i = \beta_1 female_i + \beta_2 male_i + \epsilon_i\]

where \[female_i, male_i \in \{0,1\}\] \[female_i + male_i = 1\] \[\epsilon_i \sim N(0, \sigma^2)\]

The meaning of \(lived\_after_i\) and of \(\epsilon_i\) is the same as in the first model. Indeed, they are the same throughout these exercises. \(female_i\) and \(male_i\) are 0/1 variables, just like last chapter. They are variables whose values vary across individuals.

The important parameters are \(\beta_1\) and \(\beta_2\). They are the average years-lived post-election for, respectively, women and men. Again, this is not the “average” for the data we have. That is easy to calculate! No estimation required. \(\beta_1\) and \(\beta_2\) are averages for the entire “population,” however we have chosen to define that term. Those averages can not be calculated directly. They can only be estimated, by creating a posterior probability distribution.

Looking back to the regression model we just created, we see that there is no intercept. Instead of having a \(\beta_0\) value, we have \(\beta_1\) and \(\beta_2\) for female and male. This makes things easier to interpret. Without having to add or subtract anything from an intercept, this regression tells us that, on average, women are expected to live about 16 years after running for governor, and men are expected to live about 28 years.

This is a strange result. Why would men live almost twice as long as women after the election? One explanation for this might be that women don’t run for governor until later in life, and therefore are not expected to live as long.
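Because stan_glm() uses weak default priors and we have more than a thousand observations, the posterior medians should land very close to the raw sample averages, which we can compute directly from ch10. This is just a sanity check, not a replacement for the model:

# Sanity check: with weak priors and ~1,100 observations, the posterior
# medians for sexFemale and sexMale should sit very close to the raw
# sample means of lived_after within each sex.
ch10 %>% 
  group_by(sex) %>% 
  summarize(avg_lived_after = mean(lived_after))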

Now that we have interpreted the model using a -1 in the formula to estimate \(\beta_1\) and \(\beta_2\), let’s take away the -1 and regress lived_after on an intercept and sex to see how our equation changes.

fit_2a <- stan_glm(data = ch10,
                       formula = lived_after ~ sex,
                       refresh = 0,
                       seed = 76)
print(fit_2a, detail = FALSE)
##             Median MAD_SD
## (Intercept) 16.1    2.9  
## sexMale     12.3    2.9  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 13.3    0.3

We no longer have a value for female. However, we do have an intercept. In this regression, our mathematical formula is:

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \epsilon_i\]

\(\beta_0\) is our intercept, around 16 years. This result is very similar to the female value from before. In this type of model, the intercept represents the category which is not explicitly included in the model. \(\beta_1\) only affects the outcome when the candidate is male. (If the candidate is female, then \(male_i = 0\). Therefore, the intercept value represents those who are not male, i.e., females.)

When the candidate is a male, we add the coefficient for male to the intercept value, which gives us the average lifespan of a male gubernatorial candidate after an election. As we can see from adding \(\beta_0\) and \(\beta_1\), this value is the same as what we got for males in the previous model.
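Using the posterior medians from the printed output:

\[ \beta_0 + \beta_1 \approx 16.1 + 12.3 = 28.4 \]

which, up to rounding and simulation error, matches the 28.5 estimate for sexMale from the no-intercept model.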

Be careful with notation! \(\beta_1\) in the no-intercept model is different from \(\beta_1\) in the model with an intercept! Notation varies. We must pay attention each time we make a model.

The posterior distribution for \(\beta_0 + \beta_1\) can be constructed via simple addition.

fit_2a %>% 
  as_tibble() %>% 
  mutate(male_intercept = `(Intercept)` + sexMale) %>% 
  ggplot(aes(male_intercept)) + 
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior Distribution of Average Male Candidate Years Left",
         y = "Probability",
         x = "Years To Live After the Election") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

But is that actually so “simple?” Not really! Manipulating parameters directly is bothersome. Dealing with variables named (Intercept) is error-prone. Nothing about this approach scales to realistic cases involving numerous predictor variables. Using posterior_epred() and associated functions is much easier.

posterior_epred(fit_2a,
                tibble(sex = "Male")) %>% 
  as_tibble() %>% 
  ggplot(aes(`1`)) + 
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior Distribution of Average Male Candidate Years Left",
         y = "Probability",
         x = "Years To Live After the Election") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

The interpretation of this parameter is the same as we have seen before. There is a true average, across the entire population, of the number of years that male candidates live after the election. We can never know what that true average is. But, it seems very likely that the true average is somewhere between 27.5 and 29.5 years.

10.2.3 c_election_age and sex

In this model, our outcome variable continues to be lived_after, but now we will have two different explanatory variables: c_election_age and sex. Note that sex is a categorical explanatory variable and c_election_age is a continuous explanatory variable. This is the same type of model — parallel slopes — as we saw in Chapter 9.

Math:

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \beta_2 c\_election\_age_i + \epsilon_i \]

But wait! The variable name is sex, not male. Where does male come from?

The answer is that male is an indicator variable, meaning a 0/1 variable. male takes a value of one if the candidate is “Male” and zero otherwise. This is the same as the \(male_i\) variable used in the previous two examples.

  • The outcome variable is \(lived\_after_i\), the number of years a person is alive after the election. \(male_i\) is one of our explanatory variables. If we are predicting the number of years a male candidate lives after the election, this value will be 1. When we are making this prediction for female candidates, this value will be 0. \(c\_election\_age_i\) is our other explanatory variable. It is the number of years a candidate has lived before the election, centered by subtracting the average number of years lived by all candidates.

  • \(\beta_0\) is the average number of years lived after the election for women who, on the day of the election, have been alive the average number of years of all candidates (i.e., both male and female). \(\beta_0\) is also the intercept of the equation. In other words, \(\beta_0\) is the expected value of \(lived\_after_i\), if \(male_i = 0\) and \(c\_election\_age_i = 0\).

  • \(\beta_1\) is almost meaningless by itself. The only time it has meaning is when its value is combined with our intercept (i.e., \(\beta_0 + \beta_1\)). When the two are added together, you get the average number of years lived after the election for men who, on the day of the election, have been alive the average number of years of all candidates.

  • \(\beta_2\) is, for the entire population, the average difference in \(lived\_after_i\) between two individuals, one of whom has a \(c\_election\_age_i\) value one greater than the other.

Let’s translate the model into code.

fit_3 <- stan_glm(data = ch10,
                      formula = lived_after ~ sex + c_election_age,
                      refresh = 0,
                      seed = 12)
print(fit_3, detail = FALSE)
##                Median MAD_SD
## (Intercept)    22.2    2.4  
## sexMale         6.1    2.4  
## c_election_age -0.8    0.0  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 11.1    0.2

Looking at our results, you can see that our intercept value is around 22. The average female candidate, who had been alive the average number of years of all candidates, would live another 22 years or so after the election.

Note that sexMale is around 6. This is our coefficient, \(\beta_1\). We need to connect this value to our intercept value to get something meaningful. Using the formula \(\beta_0 + \beta_1\), we find out that the number of years the average male candidate — who, on the day of election, is the average age of all candidates — would live is around 28 years.

Now take a look at the coefficient for \(c\_election\_age_i\), \(\beta_2\). The median of the posterior, -0.8, represents the slope of the model. When comparing two candidates who differ by one year in c_election_age, we expect them to differ by -0.8 years in lived_after. It makes sense that this value is negative. The more years a candidate has lived, the fewer years the candidate has left to live. So, for every extra year a candidate is alive before an election, their lifespan after the election will be about 0.8 years shorter, on average.
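To make these interpretations concrete, we can combine the point estimates by hand. A minimal sketch, hard-coding the medians printed above rather than pulling them from fit_3:

```r
# Medians from the printed fit_3 output, hard-coded for illustration.
b0 <- 22.2   # intercept: average for women of average candidate age
b1 <- 6.1    # sexMale: offset for men
b2 <- -0.8   # c_election_age: slope

# Expected years lived after the election for a male candidate of
# average age is beta_0 + beta_1.
b0 + b1
## [1] 28.3

# For a male candidate 10 years older than average, add the slope term.
b0 + b1 + b2 * 10
## [1] 20.3
```

This is back-of-the-envelope arithmetic only; as we discuss below, the honest way to answer such questions is with posterior_epred(), which accounts for our uncertainty in all the parameters at once.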

Let’s now look at some posteriors.

fit_3 %>% 
  as_tibble() %>% 
  mutate(male_years = `(Intercept)` + sexMale) %>% 
  rename(female_years = `(Intercept)`) %>% 
  select(female_years, male_years) %>% 
  pivot_longer(cols = female_years:male_years, 
               names_to = "parameters",
               values_to = "years") %>% 
  ggplot(aes(years, fill = parameters)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
     labs(title = "Posterior Probability Distribution",
         subtitle = "Men live longer",
         x = "Average Years Lived Post Election",
         y = "Probability",
         fill = "Parameters") + 
    scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_classic()

The graph above displays the posterior probability distributions for \(\beta_0\) and for \(\beta_0 + \beta_1\). The two distributions differ in spread as well as location: the distribution for females is much wider than that for males. The reason is sample size. There are far more men than women in the data set, and the more data we have, the more precise we can be.

Let’s now take a look at a posterior distribution for \(\beta_2\), the coefficient of \(c\_election\_age_i\).

fit_3 %>% 
  as_tibble() %>% 
  ggplot(aes(c_election_age)) + 
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior Distribution of the Coefficient of `c_election_age`",
         y = "Probability",
         x = "Coefficient of `c_election_age`") + 
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

10.2.4 c_election_age, sex and c_election_age*sex

Let’s create another model. This time, however, the numeric outcome variable of lived_after is a function of the two explanatory variables we used above, c_election_age and sex, and of their interaction. To look at interactions, we need 5 parameters, which is why we needed to wait until this chapter to introduce the concept.

Math:

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \beta_2 c\_election\_age_i + \\ \beta_3 male_i * c\_election\_age_i + \epsilon_i \]

  • Our outcome variable is still \(lived\_after_i\). We want to know how many years a candidate will live after an election. Our explanatory variables are the same as before. \(male_i\) is one for male candidates and zero for female candidates. \(c\_election\_age_i\) is the number of years a candidate has lived before the election, relative to the average value for all candidates. In this model, we have a third predictor variable: the interaction between \(male_i\) and \(c\_election\_age_i\).

  • \(\beta_0\) is the average number of years lived after the election for women who, on the day of the election, have been alive the average number of years of all candidates. In a sense, this is the same meaning as in the previous model, without an interaction term. But always remember that the meaning of a parameter is conditional on the model in which it is embedded. That a parameter is called \(\beta_0\) in two different regressions does not mean that it is the same thing in both regressions. Parameter names are arbitrary, or at least simply a matter of convention.

  • \(\beta_1\) does not have a simple interpretation as a stand-alone parameter. It is a measure of how different women are from men. However, \(\beta_0 + \beta_1\) has a straightforward meaning, exactly analogous to the meaning of \(\beta_0\). The sum is the average number of years lived after the election for men who, on the day of the election, have been alive the average number of years of all candidates.

  • \(\beta_2\) is the coefficient of \(c\_election\_age_i\). It is just the slope for women: the average difference in \(lived\_after_i\) between two women, one of whom has a \(c\_election\_age_i\) value 1 greater than the other. In our last example, \(\beta_2\) was the slope for the whole population. Now we allow different slopes for different genders.

  • \(\beta_3\) alone is difficult to interpret. However, when it is added to \(\beta_2\), the result is the slope for men.

Code:

fit_4 <- stan_glm(data = ch10,
                      formula = lived_after ~ sex*c_election_age,
                      refresh = 0,
                      seed = 13)
print(fit_4, detail = FALSE)
##                        Median MAD_SD
## (Intercept)            16.5    3.6  
## sexMale                11.9    3.6  
## c_election_age         -0.1    0.4  
## sexMale:c_election_age -0.8    0.4  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 11.1    0.2

The intercept has changed. \(\beta_0\) has fallen to around 17. This is the intercept for females: the average number of years lived after the election for women of average candidate age is now estimated at 17 or so. Our sexMale coefficient, \(\beta_1\), is the value that must be added to the intercept in order to get the average for males. When calculated, the result is about 28. Keep in mind, however, that these values only apply if \(c\_election\_age_i = 0\), that is, if candidate \(i\) is around 52 years old.

The coefficient for \(c\_election\_age_i\), \(\beta_2\), is -0.1. What does this mean? It is the slope for females. When comparing two female candidates who differ by one year in age, we expect the older candidate to live 0.1 years less, on average. Now direct your attention to the coefficient of sexMale:c_election_age, \(\beta_3\), which is -0.8. This is the value that must be added to the coefficient of \(c\_election\_age_i\) (recall \(\beta_2 + \beta_3\)) in order to find the slope for males. When the two are added together, the result is a slope of about -0.9. When comparing two male candidates who differ in age by one year, we expect the older candidate to live about 0.9 years less.
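Again, a minimal sketch of the arithmetic, hard-coding the medians printed above for illustration:

```r
# Medians from the printed fit_4 output, hard-coded for illustration.
b0 <- 16.5   # intercept for women of average candidate age
b1 <- 11.9   # sexMale: offset for the male intercept
b2 <- -0.1   # c_election_age: slope for women
b3 <- -0.8   # sexMale:c_election_age: offset for the male slope

# Intercept for men of average candidate age:
b0 + b1
## [1] 28.4

# Slope for men:
b2 + b3
## [1] -0.9
```

With an interaction term, men and women get their own intercepts and their own slopes.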

Key point: The interpretation of the intercepts applies only to candidates for whom \(c\_election\_age_i = 0\). Candidates who are not 52 years old will have a different expected number of years to live. The interpretation of the slopes, by contrast, applies at every age: within each sex, the relationship between \(lived\_after_i\) and \(c\_election\_age_i\) is the same no matter how old you are. But note that, with an interaction term, the slope itself now differs between men and women.

The posterior:

fit_4 %>% 
  as_tibble() %>% 
  mutate(male_years = `(Intercept)` + sexMale) %>% 
  rename(female_years = `(Intercept)`) %>% 
  select(female_years, male_years) %>% 
  pivot_longer(cols = female_years:male_years, 
               names_to = "parameters",
               values_to = "years") %>% 
  ggplot(aes(years, fill = parameters)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Years Lived After the Election",
         subtitle = "Men live longer",
         x = "Average Years Lived Post Election",
         y = "Probability", 
         fill = "Parameters") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

Again, we do not recommend working directly with parameters. The above analysis would be much easier with posterior_epred(), as we will see in the Temperance Section.

Male candidates live longer on average than female candidates. Note, also, that the average years to live after the election for females is about 17 with this model. With the previous model, it was 22 years. Why the difference? The two models are different! Without an interaction term, the model is forced to use one slope for everyone, and that constraint changes the estimated intercept for women. We explore this puzzle in the next section.

Slope posteriors:

fit_4 %>% 
  as_tibble() %>% 
  mutate(slope_men = c_election_age + `sexMale:c_election_age`) %>% 
  rename(slope_women = c_election_age) %>% 
  select(slope_women, slope_men) %>% 
  pivot_longer(cols = slope_women:slope_men, 
               names_to = "parameters",
               values_to = "slope") %>% 
  ggplot(aes(slope, fill = parameters)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Slope of Years-Lived on Years-to-Live",
         subtitle = "Men have a steeper slope",
         x = "Slope",
         y = "Probability") + 
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic() 

These posterior distributions show the average slope values for men and women. You can see that men have a steeper slope, while the slope for women is practically 0! If you are trying to forecast the number of years that a woman will live after the election, you may ignore the number of years that she has already lived. This is definitely not true for men. Why the difference?

10.2.5 What is \(\beta_0\)?

Attentive readers will have noticed that the meaning of \(\beta_0\) is identical for the last two models and, at the same time, our posterior estimates for it are very different. How can that be?

Recall the math:

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \beta_2 c\_election\_age_i + \epsilon_i \]

\[ lived\_after_i = \beta_0 + \beta_1 male_i + \beta_2 c\_election\_age_i + \\ \beta_3 male_i * c\_election\_age_i + \epsilon_i\] In both cases, \(\beta_0\) has the same meaning. It is the average number of years lived after the election for women who, on the day of the election, have been alive the average number of years of all candidates. Plug in 0 for \(male_i\) and 0 for \(c\_election\_age_i\) and both equations simplify to:

\[ lived\_after_i = \beta_0 + \epsilon_i \] So far, so good. But, despite the fact that we are using the same data, the posterior probability distributions for \(\beta_0\) are meaningfully different between the two models:

m_3 <- fit_3 %>% 
  as_tibble() %>%
  select(`(Intercept)`) %>% 
  rename(`No Interaction Term` = `(Intercept)`)

m_4 <- fit_4 %>% 
  as_tibble() %>%
  select(`(Intercept)`) %>% 
  rename(`With Interaction Term` = `(Intercept)`)

bind_cols(m_3, m_4) %>% 
  select(`No Interaction Term`, `With Interaction Term`) %>% 
  pivot_longer(cols = `No Interaction Term`:`With Interaction Term`, 
               names_to = "Model",
               values_to = "years") %>% 
  ggplot(aes(years, fill = Model)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior Probability Distributions",
         subtitle = "For intercept in two different models",
         x = "Additional Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

There is enough overlap between the two distributions that they aren’t completely incompatible. If, in truth, \(\beta_0 = 20\), neither posterior distribution is “wrong.”

The meaning of any parameter, including \(\beta_0\), is conditional on two assumptions: the population for which we seek to make inferences and the structure of the model. In this case, we assume that the population of interest is the same. But the models are different, albeit only by one interaction term. That is enough of a difference to make \(\beta_0\) in the first model a different “thing” than \(\beta_0\) in the second model, even if one “interpretation” of what they mean is the same.

The catch is that any single interpretation of a parameter captures just one characteristic of its meaning. It is just a single view of the “thing.” That view is fine, as far as it goes, but it can’t show us all aspects of the parameter. A picture of the full moon tells us little about what appears on its dark side.

Consider a different question: What is the expected value for male candidates who are 1 year older than average:

  • Without an interaction term: \(\beta_0 + \beta_1 + \beta_2 \cdot 1\).

  • With an interaction term: \(\beta_0 + \beta_1 + \beta_2 \cdot 1 + \beta_3 \cdot 1\).

We shouldn’t be surprised that different models generate different formulas for this question. Yet note how \(\beta_0\) appears in both. In other words, \(\beta_0\) is not only the average number of years lived after the election for women who, on the day of the election, have been alive the average number of years of all candidates. It is also part of a multitude of other estimates.
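Plugging the point estimates from the two fitted models into these formulas (a sketch only, using the hard-coded medians from the printed output above) shows that the two models give nearly identical answers for this candidate, even though the individual parameters differ sharply:

```r
# Expected value for a male candidate 1 year older than average,
# without an interaction term (medians from fit_3):
22.2 + 6.1 + (-0.8) * 1
## [1] 27.5

# The same quantity, with an interaction term (medians from fit_4):
16.5 + 11.9 + (-0.1) * 1 + (-0.8) * 1
## [1] 27.5
```

The parameters disagree, yet the implied estimates agree. This is why we care about the implications of a model, not its individual parameters.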

The only way to understand all of what is wrapped up in \(\beta_0\) is to see it in the context of the full mathematical formula. Since we have two different models, defined by two different formulas, it is hardly surprising that the posterior probability distributions for \(\beta_0\) will differ, even though one specific interpretation — one out of many — is the same.

Our advice: Spend less time thinking about what parameters mean and more time using posterior_epred() and posterior_predict() to examine the implications of your models. Those posteriors are the real substance of your model.

10.2.6 Interaction model

Recall the parallel slopes model that we created in Chapter 9. Another visualization we can create, one that also uses slopes and intercepts for our model, is the interaction model. In this model, the slopes for our two groups are different, creating a non-parallel visualization.

The process for creating the interaction model is similar to creating the parallel slopes model. Let us begin the same way — by tidying our data and inspecting it.

# First, we will tidy the data from our model and select the term and estimate.
# This allows us to create our regression lines more easily. We store the
# result as tidy_fit so that we do not shadow the tidy() function itself.

tidy_fit <- fit_4 %>% 
  tidy() %>% 
  select(term, estimate)

tidy_fit
## # A tibble: 4 x 2
##   term                   estimate
##   <chr>                     <dbl>
## 1 (Intercept)             16.5   
## 2 sexMale                 11.9   
## 3 c_election_age          -0.0744
## 4 sexMale:c_election_age  -0.782

After tidying our data, we will extract values and assign sensible names for later use. Note that this is identical to the process from Chapter 9, with the addition of a fourth term (the interaction term):

# Extract and name the columns of our tidy_fit object. By calling
# tidy_fit$estimate[1], we are telling R to extract the first value from the
# estimate column of tidy_fit.

intercept <- tidy_fit$estimate[1]
sex_male <- tidy_fit$estimate[2]
c_election_age <- tidy_fit$estimate[3]
interaction_term <- tidy_fit$estimate[4]

Now that we have extracted our values, we will create the intercept and slope values for our two different groups, females and males. Recall the following details about finding slopes and intercepts in an interaction model:

  • The intercept is the intercept for females. It represents the average number of years lived after the election for females.
  • Our sexMale coefficient refers to the value that must be added to the intercept in order to get the average years lived post-election for males.
  • The coefficient for \(c\_election\_age_i\) is the slope for females.
  • The coefficient of sexMale:c_election_age is the value that must be added to the coefficient of \(c\_election\_age_i\) in order to find the slope for males.

# Recall that the intercept and the estimate for c_election_age act as the
# estimates for female candidates only. Accordingly, we have assigned those
# values (from the previous code chunk) to more sensible names: female_intercept
# and female_slope.

female_intercept <- intercept
female_slope <- c_election_age

# To find the male intercept, we must add the estimate for sex_male to the
# intercept. To find the male slope, we must add the interaction term
# estimate to c_election_age.

male_intercept <- intercept + sex_male
male_slope <- c_election_age + interaction_term

After creating objects for our different intercepts and slopes, we will now create the interaction model using geom_abline() for a male and female line.

# From the ch10 data, create a ggplot object with c_election_age as the x-axis
# and lived_after as the y-axis. We will use color = sex.

ggplot(ch10, aes(x = c_election_age, y = lived_after, color = sex)) +
  
  # Use geom_point to show the datapoints. 
  
  geom_point() +
  
  # Create a geom_abline object for the female intercept and slope. Set the
  # intercept equal to our previously created female_intercept, while setting
  # slope equal to our previously created female_slope. The color call is for
  # coral, to match the colors used by tidyverse for geom_point().
  
  geom_abline(intercept = female_intercept,
              slope = female_slope, 
              color = "#F8766D", 
              size = 1) +
  
  # Create a geom_abline object for the male values. Set the intercept equal to
  # our previously created male_intercept, while setting slope equal to our
  # previously created male_slope. The color call is for teal, to match the
  # colors used by tidyverse for geom_point().

  geom_abline(intercept = male_intercept,
              slope = male_slope,
              color = "#00BFC4", 
              size = 1) +
  
  # Add the appropriate titles and axis labels. 
  
  labs(title = "Interaction Model",
       subtitle = "Comparing post election lifespan across sex",
       x = "Age at Election, Relative to the Average Candidate", 
       y = "Years Lived Post-Election", 
       color = "Sex") +
  theme_classic()

This is our final interaction model! There are some interesting takeaways. First, note that there are far fewer data points for female candidates, a concern we previously mentioned. With so little data, the estimated slope for women is much less precise. We also see that most female candidates run when they are older, as compared with male candidates. This might explain why our intercept for years lived post-election is lower for female candidates.

The male line seems more sensible, as we might expect with far more data points. For male candidates, we see a clear (and logical) pattern: the older candidates are at the time of the election, the fewer years post-election they live. This makes sense, as we are limited by the human lifespan.

10.3 Temperance

Temperance

FIGURE 10.4: Temperance

Recall the questions with which we began the chapter:

How long will two political candidates — one male and one female, both 10 years older than the average candidate — live after the election? How different will their lifespans be?

These questions are, purposely, less precise than the ones we tackled in Chapters 7 and 8, written more in a conversational style. This is how normal people talk.

However, as data scientists, our job is to bring precision to these questions. There are two commonsense interpretations. First, we could be curious about the expected values for these questions. If we averaged the data for a thousand candidates like these, what would the answer be? Second, we could be curious about two specific individuals. How long will they live? Averages involve questions about parameters. The fates of individuals require predictions. Those are general claims, violated too often to be firm rules. Yet, they highlight a key point: expected values are less variable than individual predictions.
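The claim that expected values are less variable than individual predictions can be checked with a small simulation, using nothing but base R. The numbers here are illustrative stand-ins, with a standard deviation of 11 chosen to roughly match the sigma our models estimate:

```r
set.seed(9)

# Simulate post-election lifespans for individual candidates: a true
# average of 25 years plus individual-level noise with sd = 11.
draws <- rnorm(10000, mean = 25, sd = 11)

# The variability of a single individual's outcome is about 11 years.
sd(draws)

# The variability of an *average* over 1,000 such candidates is only
# about 11 / sqrt(1000), roughly a third of a year.
sd(replicate(2000, mean(rnorm(1000, mean = 25, sd = 11))))
```

The individual-level noise averages out across many candidates, which is why posterior_epred() will give us much tighter distributions than posterior_predict().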

To calculate expected values, use posterior_epred(). To forecast for individuals, use posterior_predict().

10.3.1 Expected values

Consider the “on average” interpretation first. The answer begins with the posterior distributions of the parameters in fit_4.

newobs <- tibble(sex = c("Male", "Female"), 
                 c_election_age = 10)

pe <- posterior_epred(object = fit_4, 
                      newdata = newobs) %>% 
  as_tibble() %>% 
  rename("Male" = `1`,
         "Female" = `2`)


pe %>% 
 pivot_longer(cols = Male:Female, 
               names_to = "Gender",
               values_to = "years") %>% 
  ggplot(aes(years, fill = Gender)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Expected Years Lived Post-Election",
         subtitle = "Male candidates live longer",
         x = "Years",
         y = "Probability") + 
    scale_x_continuous(labels = 
                         scales::number_format(accuracy = 1)) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

Looking at our posterior probability distributions above, we can see that male candidates are expected to live longer. But how much longer? As in previous chapters, we can manipulate distributions in, more or less, the same way that we manipulate simple numbers. If we want to know the difference between two posterior distributions, we can simply subtract.

pe <- posterior_epred(object = fit_4, 
                      newdata = newobs) %>% 
  as_tibble() %>% 
  mutate(diff = `1` - `2`)


pe %>% 
  ggplot(aes(diff)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Expected Additional Male Years Lived",
         subtitle = "Male candidates live about 4 years longer",
         x = "Expected Additional Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_classic()

The average value of the difference in years-to-live is probably positive, with the most likely value being around 4 years. But there is still a 6% chance that the true value is less than zero, i.e., that we should expect female candidates to live longer.

Instead of using posterior_epred(), we could have answered these questions by using the posterior probability distributions for the parameters in the model, along with some simple math. Don’t do this! First, you are much more likely to make a mistake. Second, this approach does not generalize well to complex models with scores of parameters and their interactions.

10.3.2 Individual predictions

If, instead, we interpret the question as asking for a prediction for a small number of individuals, then we need to use posterior_predict().

Use posterior_predict() to create draws from the posterior probability distribution for our prediction for these cases. posterior_predict() takes two arguments: the model for which the simulations should be run, and a tibble indicating the covariate values for the individual(s) we want to predict. In this case, we are using the fit_4 model and the tibble is the one we just created above. In other words, the inputs for posterior_predict() and posterior_epred() are identical.

pp <- posterior_predict(object = fit_4, 
                        newdata = newobs) %>%
  as_tibble() %>% 
  mutate_all(as.numeric) %>% 
  rename("Male" = `1`,
         "Female" = `2`)

pp  
## # A tibble: 4,000 x 2
##     Male Female
##    <dbl>  <dbl>
##  1  5.03  24.6 
##  2 24.1    6.62
##  3 15.6   30.6 
##  4  4.16  23.0 
##  5 13.7   31.3 
##  6 25.4   18.3 
##  7 23.8   10.6 
##  8 27.2   41.8 
##  9 -1.71   1.90
## 10 17.5   25.3 
## # … with 3,990 more rows

The resulting tibble has 2 columns, the first for the male candidate and the second for the female candidate. Both columns are draws from the posterior predictive distributions. In both cases, the forecasts depend on the values of all the covariates. That is, we would provide a different forecast if the candidates were younger or older.

Why do we need the weird mutate_all(as.numeric) incantation? The reason is that posterior_epred() returns a simple matrix, which is easy to transform into a tibble. posterior_predict(), on the other hand, returns a special sort of matrix which is much harder to work with. So, we need a little hackery to make the next steps easier. (In current versions of dplyr, mutate(across(everything(), as.numeric)) is the preferred way to write the same transformation.)

Let’s look at the posterior predictive distribution for each candidate.

pp %>% 
  pivot_longer(cols = Male:Female, 
               names_to = "Gender",
               values_to = "years") %>% 
  ggplot(aes(years, fill = Gender)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for a Candidate's Years Lived Post-Election",
         subtitle = "Individual lifespans have a great deal of variation",
         x = "Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

There is a big overlap in the predictions for individuals while, at the same time, there is much less overlap in the averages. Random stuff happens to an individual all the time. Random stuff cancels out when you take the average for many individuals. Consider the difference in the posterior predictive distributions for the two individuals.

pp %>% 
  mutate(diff = Male - Female) %>% 
  ggplot(aes(diff)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for a Male Candidate's Extra Years Lived", 
         subtitle = "Any random male candidate may die before a random female candidate",
         x = "Years",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

In words, we would predict that the male candidate will live longer than the female candidate. By how much? Looking at our posterior above, our best estimate is about 3.8 years. However, it is quite possible that, for any given male/female pair of candidates, the female will live longer.

pp %>% 
  mutate(diff = Male - Female) %>% 
  summarize(f_live_longer = sum(diff < 0),
            total = n(),
            f_live_longer / total)
## # A tibble: 1 x 3
##   f_live_longer total `f_live_longer/total`
##           <int> <int>                 <dbl>
## 1          1617  4000                 0.404

In fact, there is a 4 in 10 chance that the female candidate lives longer.

Note what is the same and what is different when we move from a question about averages to a question about individuals. In both cases, the most likely value is about the same. That is, the average behavior is the same as our expected value for any given individual. But the uncertainty is much greater for an individual prediction. The chance of the true average for male candidates being less than that for female candidates is low. Yet, for any individual pair of candidates, it would not even be slightly surprising for the female candidate to outlive the male candidate. Individuals vary. Averages never tell the whole story.

10.3.3 Expectation versus individual variation

Let’s compare the results from posterior_epred() and posterior_predict() for this scenario directly. Most of this code is the same as what we have shown you above, but we think it is useful to look at everything together.

newobs <- tibble(sex = c("Male", "Female"),
                  c_election_age = 10)

pe <- posterior_epred(fit_4, 
                      newdata = newobs) %>%
  as_tibble() %>% 
  mutate(diff = `1` - `2`)

pp <- posterior_predict(fit_4, 
                        newdata = newobs) %>% 
  as_tibble() %>% 
  mutate_all(as.numeric) %>% 
  mutate(diff = `1` - `2`)

tibble(Expectation = pe$diff,
       Prediction = pp$diff) %>% 
  pivot_longer(cols = Expectation:Prediction, 
               names_to = "Type",
               values_to = "years") %>% 
  ggplot(aes(years, fill = Type)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Expected and Individual Male Advantage",
         subtitle = "Expected male advantage is much more precisely estimated",
         x = "Additional Years Lived Post Election",
         y = "Probability") + 
    scale_x_continuous(labels = scales::number_format()) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

Expected values vary much less than predictions. The above chart makes that easy to see. We are fairly sure that the true underlying average number of years that male candidates live post-election is greater than that for female candidates. But, for any two individual candidates, there is a good chance that the female candidate will live longer. We cannot ignore \(\epsilon\) when predicting the outcome for individuals. When estimating expected values or long-run averages, the \(\epsilon\)’s cancel out.

10.3.4 Testing

In a normal statistics book, we would have already discussed the concept of “testing” extensively. The terminology varies by field. “Tests,” “testing,” “hypothesis tests,” “tests of significance,” and “null hypothesis significance testing” all refer to the same concept. We will refer to this collection of approaches as NHST, a common abbreviation derived from the initials of the last phrase. Wikipedia provides an overview.

Our view: Amateurs test. Professionals summarize.

Consider the question of the expected difference in lifespans between our male and female candidates. Is that difference “significant?” Can we “reject the null hypothesis?” The conventional approach answers Yes whenever zero lies outside the 95% confidence interval. But why should we care? How does that help us? We already have the full posterior probability distribution. That is knowledge. A Yes/No question throws away too much information to (almost) ever be useful. There is no reason to test when you can summarize by providing the full posterior probability distribution.

The same arguments apply in the case of “insignificant” results, with \(p > 0.05\), when we can’t “reject” the null hypothesis. Instead of expected values, consider the case of two individual candidates, one male and one female. Is the difference in their predicted lifespans “significant?” Who cares!? We have the full posterior probability distribution for that prediction — also known as the posterior predictive distribution — as graphed above. The 95% interval includes zero. Does that mean we should throw it away? No! That would be nonsense. Yes, there is about a 40% chance that the female candidate lives longer, so we can hardly be surprised when that happens. But we still think it is more likely that the male candidate lives longer. Indeed, we would consider 3-to-2 odds as fair. The fact that the difference is not “significant” has no relevance to how we use the posterior to make decisions.

The same reasoning applies to every parameter we estimate, to every prediction we make. Never test — unless your boss demands a test. Use your judgment, make your models, summarize your knowledge of the world, and use that summary to make decisions.

10.4 Summary

The major part of Wisdom is deciding what questions you can’t answer because of the data you don’t have.

Avoid answering questions by working with parameters directly. Use posterior_epred() instead.

Good data science involves an intelligent tour of the space of possible models.

Always think in terms of comparisons when using a predictive model.

Spend less time thinking about what parameters mean and more time using posterior_epred() and posterior_predict() to examine the implications of your models.