8 Three Parameters

Models have parameters. In Chapter 6 we created models with a single parameter \(p\), the proportion of red beads in an urn. In Chapter 7, we used models with two parameters: \(\mu\) (the average height in the population, generically known as a model “intercept”) and \(\sigma\) (the variation in height in the population). Here — can you guess where this is going? — we will build models with three parameters: \(\sigma\) (which serves the same role throughout the book) and two coefficients. In models which relate a continuous predictor to the outcome, those two parameters will be labeled \(\beta_0\) and \(\beta_1\). In models which estimate two averages, the parameters will be \(\beta_1\) and \(\beta_2\). All this notation is confusing, not least because different academic fields use inconsistent schemes. Follow the Cardinal Virtues and tackle your problem step by step.

8.1 age ~ party

We want to build a model and then use that model to make claims about the world.

What is the expected age of a Democrat at the train station?

In a group of three Democrats and three Republicans, what will the age difference be between the oldest Democrat and the youngest Republican?

We can answer these and similar questions by creating a model that uses party affiliation to predict age

8.1.1 Wisdom

Wisdom

FIGURE 8.1: Wisdom

Wisdom begins with a look at your data.

8.1.1.1 EDA for trains

Always explore your data. To demonstrate modeling with three parameters, we will use the trains data set from the primer.data package. Recall the discussion from Chapter 4. Enos (2014) randomly placed Spanish-speaking confederates on nine train platforms around Boston, Massachusetts. Exposure to Spanish-speakers – the treatment – influenced attitudes toward immigration. These reactions were measured through changes in answers to three survey questions. Load the necessary libraries and look at the data.

glimpse(trains)
## Rows: 115
## Columns: 15
## $ treatment           <fct> Treated, Treated, Treated, Treated, Control, Trea…
## $ att_start           <dbl> 11, 9, 3, 11, 8, 13, 13, 10, 12, 9, 10, 11, 13, 6…
## $ att_end             <dbl> 11, 10, 5, 11, 5, 13, 13, 11, 12, 10, 9, 9, 13, 7…
## $ gender              <chr> "Female", "Female", "Male", "Male", "Male", "Fema…
## $ race                <chr> "White", "White", "White", "White", "White", "Whi…
## $ liberal             <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FA…
## $ party               <chr> "Democrat", "Republican", "Democrat", "Democrat",…
## $ age                 <int> 31, 34, 63, 45, 55, 37, 53, 36, 54, 42, 33, 50, 2…
## $ income              <dbl> 135000, 105000, 135000, 300000, 135000, 87500, 87…
## $ line                <chr> "Framingham", "Framingham", "Framingham", "Framin…
## $ station             <chr> "Grafton", "Southborough", "Grafton", "Grafton", …
## $ hisp_perc           <dbl> 0.026, 0.015, 0.019, 0.019, 0.019, 0.023, 0.030, …
## $ ideology_start      <int> 3, 4, 1, 4, 2, 5, 5, 4, 4, 4, 3, 5, 4, 1, 2, 2, 3…
## $ ideology_end        <int> 3, 4, 2, 4, 2, 5, 5, 4, 3, 4, 3, 4, 4, 1, 2, 3, 3…
## $ income_in_thousands <dbl> 135, 105, 135, 300, 135, 88, 88, 135, 105, 135, 1…

The data include information about each respondent’s gender, political affiliations, age, income and so on. treatment indicates whether a subject was in the control or treatment group. The key outcomes are their attitudes toward immigration both before (att_start) and after (att_end) the experiment. We also have a measure of their ideology before and after the experiment, but there is little change. Type ?trains to read the help page for more information about each variable. Let’s restrict attention to a subset of the variables.

ch8 <- trains %>% 
  select(age, att_end, party, income, treatment, liberal)

It is always smart to look at a some random samples of the data:

ch8 %>% 
  slice_sample(n = 5)
## # A tibble: 5 x 6
##     age att_end party      income treatment liberal
##   <int>   <dbl> <chr>       <dbl> <fct>     <lgl>  
## 1    44       9 Republican 250000 Control   FALSE  
## 2    68      10 Democrat    62500 Control   TRUE   
## 3    37       7 Republican 105000 Control   FALSE  
## 4    30      10 Democrat   135000 Control   TRUE   
## 5    29      15 Democrat    42500 Treated   TRUE

att_end is a measure of person’s attitude toward immigration. A higher number means more conservative, i.e., a more exclusionary stance toward immigration into the United States.

ch8 %>% 
  glimpse()
## Rows: 115
## Columns: 6
## $ age       <int> 31, 34, 63, 45, 55, 37, 53, 36, 54, 42, 33, 50, 24, 40, 53,…
## $ att_end   <dbl> 11, 10, 5, 11, 5, 13, 13, 11, 12, 10, 9, 9, 13, 7, 8, 13, 8…
## $ party     <chr> "Democrat", "Republican", "Democrat", "Democrat", "Democrat…
## $ income    <dbl> 135000, 105000, 135000, 300000, 135000, 87500, 87500, 13500…
## $ treatment <fct> Treated, Treated, Treated, Treated, Control, Treated, Contr…
## $ liberal   <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE…

Pay attention to the variable types. Do they make sense? Perhaps. But there are certainly grounds for suspicion. Why is att_end a double rather than an integer? All the values in the data appear to be integers, so there is no benefit to having these variables be doubles. Why is party a character variable and treatment a factor variable? It could be that these are intentional choices made by the creator of the tibble, i.e., us. Or, these could be mistakes. Most likely, these choices are a mixture of sensible and arbitrary. Regardless, it is your responsibility to notice them. You can’t make a good model without looking closely at the data which you are using.

ch8 %>% 
  skim()
TABLE 8.1: Data summary
Name Piped data
Number of rows 115
Number of columns 6
_______________________
Column type frequency:
character 1
factor 1
logical 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
party 0 1 8 10 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
treatment 0 1 FALSE 2 Con: 64, Tre: 51

Variable type: logical

skim_variable n_missing complete_rate mean count
liberal 0 1 0.44 FAL: 64, TRU: 51

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 4.2e+01 12.2 20 33 43 52 68 ▆▇▇▇▃
att_end 0 1 9.1e+00 2.9 3 7 9 11 15 ▂▃▇▃▃
income 0 1 1.4e+05 74476.6 23500 87500 135000 135000 300000 ▅▇▇▁▆

skim() shows us what the different values of treatment are because it is a factor. Unfortunately, it does not do the same for character variables like party. The ranges for age and att_end seem reasonable. Recall that participants were asked three questions about immigration issues, each of which allowed for an answer indicated strength of agreement on a scale form 1 to 5, with higher values indicating more agreement with conservative viewpoints. att_end is the sum of the responses to the three questions, so the most liberal possible value is 3 and the most conservative is 15.

Always plot your data.

ch8 %>%
  ggplot(aes(x = party, y = age)) + 
  geom_jitter(width = 0.1, height = 0) + 
  labs(title = "Age by Party Affiliation in Trains Dataset",
       subtitle = "Where are the old Republicans?",
       x = "Party",
       y = "Age")

There are more Democrats in this dataset than Republicans. Democrats also span a wider range of ages than Republicans.

In data science, there are two main aspects to Wisdom: representativeness and ethics. The data that we have are very limited. There are only 115 observations, all from 2012 and involving train commuters to Boston. How useful will this data be today, or for other populations around Boston, or for other cities in the US? Only your judgment, along with advice from your colleagues, can guide you.

The key concept is the idea of a “population.” From which larger population is the data we have being (conceptually) drawn? If we were only interested in the age of individuals in our data set, we would have no need for inference. We know everyone’s ages already. We only need tools like stan_glm() if we seek to understand individuals not in our data. Only if our data is “representative,” at least to some extent, of that larger population can we use our data to answer our questions.

The ethical issues are, fortunately, not fraught. Estimating someone’s age is (universally?) viewed as OK. Estimating someone’s health or income or criminal record are far dicier.

8.1.2 Justice

Justice

FIGURE 8.2: Justice

Justice, in data science, consists of three topics: predictive versus causal modeling, the Preceptor Table, and a mathematical formulation of the model.

Any model with age as its dependent variable will be predictive, not causal, for the simple reason that nothing, other than time, can change your age. You are X years old. It would not matter if you changed your party registration from Democrat to Republican or vice versa. Your age is your age.

There is only one potential outcome, i.e., one outcome. There is not a potential outcome if you are Democrat and a different potential outcome if you are a Republican.

When dealing with a non-causal model, the focus is on predicting things. The underlying mechanism which connects age with party is less important than the brute statistical fact that there is a connection. Predictive models care little about causality.

A good way at looking at this is with a Preceptor Table, as seen below. Unlike the previous table in Chapter 7, we now have two columns in addition to ID. Since our data does not include all Republicans and Democrats in the world, not every row is filled in.

ID Predictor Outcome
Political Party Age
1 D 31
2 ? ?
... ... ...
473 D 58
474 ? ?
... ... ...
3,258 ? ?
3,259 R 49
... ... ...
N ? ?

 

We now know that we are working with a predictive model. Recall:

\[\text{outcome} = \text{model} + \text{not in the model}\]

In words, any event depends on our explicitly described model as well as on influences unknown to us. Everything that happens in the world is the result of various factors, and we can only consider some of them in our model (because we do not know about some influences, or because we have no data about them).

The mathematics:

\[ y_i = \beta_1 republican_i + \beta_2 democrat_i + \epsilon_i\]

where \[republican_i, democrat_i \in \{0,1\}\] \[republican_i + democrat_i = 1\] \[\epsilon_i \sim N(0, \sigma^2)\]

Don’t panic dear poets and philosophers, the whole thing is easier than it looks.

  • On the left-hand side we have the outcome, \(y_i\), which is the variable to be explained. In our case, this is the age of an individual in the population.

  • On the right-hand side we first have the part contained in the model: two similar terms added together. Each term consists of a parameter and a data point. The betas are our two parameters: \(\beta_1\) is the average age of Republicans in the population and \(\beta_2\) is the average age of Democrats in the population. \(republican_i\) and \(democrat_i\) are our explanatory variables and take the values 1 or 0. If person \(i\) is a Republican we have \(republican_i = 1\) and \(democrat_i = 0\). If person \(i\) is a Democrat we have \(republican_i = 0\) and \(democrat_i = 1\). In other words, their values are mutually exclusive – if you are a Democrat, you cannot also be a Republican.

  • The last part, \(\epsilon_i\) (“epsilon”), represents the unexplained part of the outcome and is called the error term. It is simply the difference between the outcome and our model predictions. This includes all factors that have an influence on someone’s age but are not connected to party affiliation. We assume that this error follows a normal distribution with an expected value of 0 (meaning it is 0 on average).

  • The small \(i\)’s are an index to number the observations. It is equivalent to the “ID” column in our Preceptor Table and simply states that the outcome for person \(i\) is explained by the modeled and non-modeled factors for person \(i\).

  • The model is a claim about how the world works, not just for the 115 individuals for which we have data but for the all the people in the population for which we seek to draw inferences.

  • Although terminology differs across academic fields, the most common term to describe a model like this is a “regression.” We are “regressing” age on party in order to see if they are associated with each other. The formula above is a “regression formula,” and the model is a “regression model.” This terminology would also apply to our model of height in Chapter 7.

  • The model in Chapter 7 is sometimes called “intercept-only” because the only (interesting) parameter is the intercept. Here we have a “two intercept” model because, instead of estimating an average for the whole population, we are estimating two averages.

8.1.3 Courage

Courage

FIGURE 8.3: Courage

Courage allows us to translate math to code.

To get posterior distributions for our three parameters, we will again use stan_glm(), just as we did in Chapter 7.

fit_1 <- stan_glm(age ~ party - 1, 
                    data = trains, 
                    seed = 17,
                    refresh = 0)
  • The variable before the tilde, age, is our outcome.

  • The only explanatory variable is party. This variable has only two values, ‘Democrat’ and ‘Republican.’

  • We have also added -1 at the end of the equation, indicating that we do not want an intercept, which would otherwise be added by default.

The resulting output:

fit_1
## stan_glm
##  family:       gaussian [identity]
##  formula:      age ~ party - 1
##  observations: 115
##  predictors:   2
## ------
##                 Median MAD_SD
## partyDemocrat   42.6    1.2  
## partyRepublican 41.2    2.7  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 12.3    0.8  
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

partyDemocrat corresponds to \(\beta_1\), the average age of Democrats in the population. partyRepublican corresponds to \(\beta_2\), the average age of Republicans in the population. Since we don’t really care about the posterior distribution for \(\sigma\), we won’t discuss it here. Graphically:

fit_1 %>% 
  as_tibble() %>% 
  select(-sigma) %>% 
  mutate(Democrat = partyDemocrat, Republican = partyRepublican) %>%
  pivot_longer(cols = Democrat:Republican,
               names_to = "parameter",
               values_to = "age") %>% 
  ggplot(aes(x = age, fill = parameter)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Average Age",
         subtitle = "More data allows for a more precise posterior for Democrats",
         x = "Age",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

The unknown parameters \(\beta_1\) (partyDemocrat) and \(\beta_2\) (partyRepublican) are still unknown. We can never know the true average age of all Democrats in the population. But we can calculate a posterior probability distribution for each parameter. Comments:

  • Democrats seem slightly older than Republicans. That was true in the sample and so, almost (but not quite!) by definition, it will be true in our the posterior probability distributions.

  • Our estimate for the average age of Democrats in the population is much more precise than that for Republicans because we have five times as many Democrats as Republicans in our sample. A central lesson from Chapter 6 is that the more data you have related to a parameter, the narrower your posterior distribution will be.

  • There is a great deal of overlap between the two distributions. Would we be surprised if, in truth, the average age of Republicans in the population was greater than that for Democrats? Not really. We don’t have enough data to be sure either way.

  • The phrase “in the population” is doing a great deal of work because we have not said what, precisely, we mean by the “population.” Is it the set of people on those commuter platforms on those days in 2012 when the experiment was done? Is it the set of people on all platforms, including ones never visited? Is it the set of all Boston commuter? All Massachusetts residents? All US residents? Does it include people today, or can we only draw inferences for 2012? We will explore these questions in every model we create.

  • The parameters \(\beta_1\) and \(\beta_2\) can be interpreted in two ways. First, like all parameters, they are a part of the model. We need to estimate them. But, in many cases, we don’t really care what the value of the parameter is. The exact value of \(\sigma\), for example, does not really matter. Second, some parameters have a substantive interpretaion, as with \(\beta_1\) and \(\beta_2\) being the average ages in the population. But this will often not be the case! Fortunately, with such models, we can use functions like posterior_epred() and posterior_predict() to answer our questions.

Consider a table which shows a sample of 8 individuals.

8 Observations from Trains Dataset
Age Party Fitted Residual
31 Democrat 42.59 -11.6
34 Republican 41.19 -7.2
63 Democrat 42.59 20.4
45 Democrat 42.59 2.4
55 Democrat 42.59 12.4
37 Democrat 42.59 -5.6
53 Republican 41.19 11.8
36 Democrat 42.59 -6.6

 

The fitted values are the same for all Republicans and for all Democrats, as the model produces one fitted value for each condition. This table shows how just a sample of 8 individuals captures a wide range of residuals, making it difficult to predict the age of a new individual. We can get a better picture of the unmodeled variation in our sample if we plot these three variables for all the individuals in our data.

The following three histograms show the actual outcomes, fitted values, and residuals of all people in trains:

outcome <- ch8 %>% 
  ggplot(aes(age)) +
    geom_histogram(bins = 100) +
    labs(x = "Age",
         y = "Count") 

fitted <- tibble(age = fitted(fit_1)) %>% 
  ggplot(aes(age)) +
    geom_bar() +
    labs(x = "Fitted Values",
         y = NULL) +
    scale_x_continuous(limits = c(20, 70)) 

res <- tibble(resids = residuals(fit_1)) %>% 
  ggplot(aes(resids)) +
    geom_histogram(bins = 100) +
    labs(x = "Residuals",
         y = NULL) 
  

outcome + fitted + res +
  plot_annotation(title = "Decomposition of Height into Fitted Values and Residuals")

The three plots are structured like our equation and table above. A value in the left plot is the sum of one value from the middle plot plus one from the right plot.

  • The actual age distribution looks like a normal distribution. It is centered around 43, and it has a standard deviation of about 12 years.

  • The middle plot for the fitted values shows only two adjacent spikes, which represent the estimates for Democrats and Republicans.

  • Since the residuals plot represents the difference between the other two plots, its distribution looks like the first plot.

8.1.4 Temperance

Temperance

FIGURE 8.4: Temperance

Recall the first questions with which we began this section:

  • What is the probability that, if a Democrat shows up at the train station, he will be over 50 years old?

So far we have only tried our model on people from our data set whose real age we already knew. This is helpful to understand the model, but our ultimate goal is to understand more about the real world, about people we don’t yet know much about. Temperance guides us to make meaningful predictions and to become aware of their known and unknown limitations.

Start with a simple question, what are the chances that a random Democrat is over 50 years old? First, we create a tibble with the desired input for our model. In our case the tibble has a variable named “party” which contains a single observation with the value “Democrat.” This is a bit different than Chapter 7.

new_obs <- tibble(party = "Democrat")

Use posterior_predict() to create draws from the posterior for this scenario. Note that we have a new posterior distribution under consideration here. The unknown parameter, call it \(D_{age}\), is the age of a Democrat. This could be the age of a randomly selected Democrat from the population or of the next Democrat we meet or of the next Democrat we interview on the train platform. The definition of “population” determines the appropriate interpretation. Yet, regardless, \(D_{age}\) is an unknown parameter. But it is not one — like \(\beta_1\), \(\beta_2\), or \(\sigma\) — for which we have already created a posterior probability distribution. That is why we need posterior_predict().

posterior_predict() takes two arguments: the model for which the simulations should be run, and a tibble indicating for which and how many parameters we want to run these simulations. In this case, the model is the one from Courage and the tibble is the one we just created.

pp <- posterior_predict(fit_1, newdata = new_obs) %>%
    as_tibble() %>%
    mutate_all(as.numeric)

head(pp, 10)
## # A tibble: 10 x 1
##      `1`
##    <dbl>
##  1  47.2
##  2  49.5
##  3  36.3
##  4  36.4
##  5  41.0
##  6  59.5
##  7  55.5
##  8  37.5
##  9  47.7
## 10  31.9

We might expect that we can use as_tibble() directly after the object returned posterior_predict(). Sadly, for obscure technical reasons, that won’t quite work. So, we need the incantation mutate_all(as.numeric) to make sure that the resulting tibble is well-behaved. This command ensures that every column in the tibble is a simple numeric vector, which is what we want.

The result are draws from the posterior distribution of the age of a Democrat. It is important to understand that this is not a concrete person from the trains dataset - the algorithm in posterior_predict() simply uses the existing data from trains to estimate this posterior distribution.

pp %>% 
  ggplot(aes(x = `1`)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for a Random Democrat's Age",
         subtitle = "Individual predictions are always more variable than expected values",
         x = "Age",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

Once we have the posterior distribution, we can answer (almost) any reasonable question. In this case, the probability that the next Democrat will be over 50 is around 28%.

sum(pp$`1` > 50) / nrow(pp)
## [1] 0.26

Recall the second question:

  • In a group of three Democrats and three Republicans, what will the age difference be between the oldest Democrat and the youngest Republican?

As before we start by creating a tibble with the desired input. Note that the name of the column (“party”) and the observations (“Democrat,” “Republican”) must always be exactly as they are in the original data set. This tibble as well as our model can then be used as arguments for posterior_predict():

newobs <- tibble(party = c("Democrat", "Democrat", "Democrat", 
                        "Republican", "Republican","Republican"))

posterior_predict(fit_1, newdata = newobs) %>%
    as_tibble() %>%
    mutate_all(as.numeric)
## # A tibble: 4,000 x 6
##      `1`   `2`   `3`   `4`   `5`   `6`
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  38.5  27.6  44.9 31.9   47.4  40.4
##  2  44.0  36.2  50.4 60.0   47.3  46.0
##  3  34.7  54.8  44.5 32.0   45.3  49.9
##  4  51.7  41.3  26.8 22.7   48.6  18.3
##  5  41.9  33.2  42.2 46.4   27.6  41.0
##  6  29.5  29.5  43.8  9.26  39.8  34.0
##  7  42.3  48.8  47.5 46.4   31.3  26.3
##  8  30.9  51.6  43.0 48.0   31.5  49.8
##  9  43.0  33.9  30.3 10.2   34.9  53.3
## 10  22.3  49.7  30.3 32.6   62.7  39.8
## # … with 3,990 more rows

We have 6 columns: one for each person. posterior_predict() does not name the columns, but they are arranged in the same order in which we specified the persons in newobs: D, D, D, R, R, R. To determine the expected age difference, we add code which works with these posterior draws:

pp <- posterior_predict(fit_1, newdata = newobs) %>%
    as_tibble() %>%
    mutate_all(as.numeric) %>% 
  
    # We don't need to rename the columns, but doing so makes the subsequest
    # code much easier to understand. We could just have worked with columns 1,
    # 2, 3 and so on. Either way, the key is to ensure that you correctly map
    # the covariates in newobs to the columns in the posterior_predict object.
  
    set_names(c("dem_1", "dem_2", "dem_3", 
                "rep_1", "rep_2", "rep_3")) %>% 
    rowwise() %>% 
  
  # Creating three new columns. The first two are the highest age among
  # Democrats and the lowest age among Republicans, respectively. The third one
  # is the difference between the first two.
  
  mutate(dems_oldest = max(c_across(dem_1:dem_3)),
         reps_youngest = min(c_across(rep_1:rep_3)),
         age_diff = dems_oldest - reps_youngest)

pp
## # A tibble: 4,000 x 9
## # Rowwise: 
##    dem_1 dem_2 dem_3 rep_1 rep_2 rep_3 dems_oldest reps_youngest age_diff
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>         <dbl>    <dbl>
##  1  54.9  28.4  51.8  53.4  63.4  60.6        54.9          53.4     1.46
##  2  34.5  66.0  35.3  35.1  41.5  46.5        66.0          35.1    30.9 
##  3  38.4  44.1  31.2  56.4  42.1  67.5        44.1          42.1     2.03
##  4  72.8  35.1  36.7  30.2  29.4  35.3        72.8          29.4    43.4 
##  5  46.0  44.3  32.5  26.5  55.2  30.2        46.0          26.5    19.6 
##  6  21.8  21.8  62.9  24.8  45.9  57.6        62.9          24.8    38.0 
##  7  34.5  57.3  39.3  39.0  36.9  32.2        57.3          32.2    25.1 
##  8  48.3  32.5  40.8  44.6  55.3  25.1        48.3          25.1    23.2 
##  9  55.8  29.6  64.8  70.5  43.1  52.8        64.8          43.1    21.8 
## 10  19.8  44.4  41.3  30.1  45.9  46.6        44.4          30.1    14.3 
## # … with 3,990 more rows

The plotting code is similar to what we have seen before:

pp %>%  
  ggplot(aes(x = age_diff)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior for Age Difference",
         subtitle = "Oldest of three Democrats compared to youngest of three Republicans",
         x = "Age",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

In words, we would expect the oldest Democrat to be about 22 years older than the youngest Republican, but we would not be too surprised if the oldest Democrat was actually younger than the youngest Republican in a group of 6.

8.1.5 Addendum

Instead of parameterizing this model without an intercept, we could have used one. In that case, the math would be:

\[ y_i = \beta_0 + \beta_1 democratic_i + \epsilon_i\] The interpretations of the parameters are different from the prior model. \(\beta_0\) is now the average age of Republicans. This is the same interpretation as \(\beta_1\) in the original set up. \(\beta_1\) is now the difference between the the average age of Republicans and that of Democrats.

To fit this model, we use the exact same code as before, except without the -1 in the formula argument.

stan_glm(age ~ party, 
         data = trains, 
         seed = 98,
         refresh = 0)
## stan_glm
##  family:       gaussian [identity]
##  formula:      age ~ party
##  observations: 115
##  predictors:   2
## ------
##                 Median MAD_SD
## (Intercept)     42.6    1.3  
## partyRepublican -1.5    3.1  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 12.3    0.8  
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

The intercept, 42.6, is the same as the partyDemocrat estimate in the first model. The partyRepublican estimate, which was previously 41.0, is now -1.5, meaning it is the difference (allowing for rounding) between the average age of Democrats and Republicans.

Little else about the models will be different. They will have the same fitted values and residuals. posterior_predict() will generate the same posterior predictive probabilty distributions. Which parameterization we use does not matter much. But you should be able to interpret the meaning of the coefficients in both.

8.2 att_end ~ treatment

Above, we created a predictive model: with someone’s party affiliation, we can make a better guess as to what their age is than we could have in the absence of information about their party. There was nothing causal about that model. Changing someone’s party registration can not change their age. In this example, we build a causal model. Consider these two questions:

What is the average treatment effect, of exposing people to Spanish-speakers, on their attitudes toward immigration?

What is the largest causal effect which still has a 1 in 10 chance of occurring?

Models help us answer questions better than we could without those models, but only if we follow the Cardinal Virtues.

8.2.1 Wisdom

Wisdom

FIGURE 8.5: Wisdom

Look at your data! Before making a model which seeks to explain att_end, we should plot att_end and the variable we think is connected to it.

ch8 %>%
  ggplot(aes(x = treatment, y = att_end)) + 
  geom_jitter() + 
  labs(title = "Attitude End and Treatment in Enos (2014)",
       subtitle = "Did the treatment make people more conservative?",
       x = "Treatment",
       y = "Attitude After Experiment")

Treated individuals seem to have a higher values for att_end than control individuals, although there is a great deal of variation in both.

We are still using the data from Enos (2014). Yet the world is a very different place today! Is data from 2012, four years before Donald Trump’s election as president, still relevant? Can we generalize data from Boston commuters to other people in Massachusetts, to other people in the US? There are no obvious answers to these questions. The data we have is never a perfect match to the problem we face because data is always old. The world is constantly changing. To use old data, we need to make assumptions about the stability of our model of the world and of the parameters we are estimating. Whether those assumptions are reasonable is a difficult question, one that requires knowledge about the world as it was and the world as it now is. Math won’t save us.

The issue is always: Do the data we have and the data we want to have come from the same population? If there is no connection between the two, progress is impossible. But, in this case, if we are willing to consider the population to be US residents over the last decade (including today), and if we are willing to assume that our data set of 115 individuals is representative of that population, then we can use our data to create a model to answer our question.

Ethical issues are tricky, or at least trickier than they were in the context of models in which the dependent variable was height or age. (And the ethical issues in conducting the experiment in the first place are non-trivial.) Assume we make a good model. What is to prevent someone from using that information to, say, influence voting and donations? Imagine a Republican Senate candidate who hires Spanish-speakers to ride commuter trains in order to shift voters’ attitudes toward immigration in a conservative direction. She believes (correctly?) that doing so will increase her odds of winning the election. Is that the sort of knowledge we seek to create? We can make such models. Indeed, the purpose of this chapter is to show you how to do so! But should you make those models?

8.2.2 Justice

Justice

FIGURE 8.6: Justice

The two elements of Justice in a data science project remain the same: the Preceptor Table and the mathematical formula.

A Preceptor Table for this model would look very similar to the one in the first half of this chapter, except that we have now possible outcomes for each individual: att_end if they are exposed to treatment and att_end if they are exposed to control.

The math for this model is exactly the same as the math for the predictive model in the first part of this chapter, although we change the notation a bit for clarity.

\[ y_i = \beta_1 treatment_i + \beta_2 control_i + \epsilon_i\]

where \[treatment_i, control_i \in \{0,1\}\] \[treatment_i + control_i = 1\] \[\epsilon_i \sim N(0, \sigma^2)\]

Nothing has changed, except for the meaning of the data items and the interpretations of the parameters.

  • On the left-hand side we have the outcome, \(y_i\), which is the variable to be explained. In our case, this is a person’s attitude toward immigration after the experiment is complete. \(y_i\) takes on integer values between 3 and 15 inclusive.

  • On the right-hand side we first have the part contained in the model, consisting of two similar terms. The two terms stand for Treated and Control and work as follows. Each term consists of a parameter and a data point. \(\beta_1\) is the average attitude toward immigration for treated individuals — those exposed to Spanish-speakers — in the population. \(\beta_2\) is the average attitude toward immigration for control individuals — those not exposed to Spanish-speakers — in the population. The \(x\)’s are our explanatory variables and take the values 1 or 0. If person \(i\) is Treated, \(treatment_i = 1\) and \(control_i = 0\). If person \(i\) is Control, \(treatment_i = 0\) and \(control_i = 1\). In other words, these are binary variables and are mutually exclusive – if you are Treated, you cannot also be Control.

  • Note that the formula applies to everyone in the population, not just the 115 people for whom we have data. The index \(i\) does not just go from 1 through 115. It goes from 1 through \(N\), where \(N\) is the number of individuals in the population. Conceptually, everyone has an att_end under treatment and under control.

  • The last part, \(\epsilon_i\) (“epsilon”), represents the unexplained part and is called the error term. It is simply the difference between the outcome and our model predictions. In our particular case, this includes all factors that have an influence on someone’s attitude toward immigration but are not explained by treatment status. We assume that this error follows a normal distribution with an expected value of 0.

  • The small \(i\)’s are an index for the data set. It is equivalent to the “ID” column in our Preceptor Table and simply states that the outcome for person \(i\) is explained by the predictor variables (\(treatment\) and \(control\)) for person \(i\), along with an error term.

8.2.3 Courage

Courage

FIGURE 8.7: Courage

With Justice satisfied, we gather our Courage and fit the model. Note that, except for the change in variable names, the code is exactly the same as it was above, in our predictive model for age. Predictive models and causal models use the same math and the same code. The differences, and they are very important, lie in the interpretation of the results, not in their creation.

fit_2 <- stan_glm(att_end ~ treatment - 1, 
                      data = trains, 
                      seed = 45,
                      refresh = 0)

fit_2
## stan_glm
##  family:       gaussian [identity]
##  formula:      att_end ~ treatment - 1
##  observations: 115
##  predictors:   2
## ------
##                  Median MAD_SD
## treatmentTreated 10.0    0.4  
## treatmentControl  8.5    0.3  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 2.8    0.2   
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

treatmentTreated corresponds to \(\beta_1\). As always, R has, behind the scenes, estimated the entire posterior probability distribution for \(\beta_1\). We will graph that distribution in the next section. But the basic print method for these objects can’t show the entire distribution, so it gives us summary numbers: the median and the MAD SD. Speaking roughly, we would expect about 95% of the values in the posterior to be within two MAD SD’s of the median. In other words, we are 95% confident that the true, but unknowable, average attitude toward immigration among the Treated in the population to be between 9.2 and 10.8.

treatmentControl corresponds to \(\beta_2\). The same analysis applies. We are about 95% confident that the true value for the average attitude toward immigration for Control in the population is between 7.9 and 9.1.

Up until now, we have used the Bayesian interpretation of “confidence interval.” This is also the intuitive meaning which, outside of academia, is almost universal. There is a truth out there. We don’t know, and sometimes can’t know, the truth. A confidence interval, and its associated confidence level, tells us how likely the truth is to lie within a specific range. If your boss asks you for a confidence interval, she almost certainly is using this interpretation.

But, in contemporary academic research, the phrase “confidence interval” is usually given a “Frequentist” interpretation. (The biggest divide in statistics is between Bayesians and Frequentist interpretations. The Frequentist approach, also known as “Classical” statistics, has been dominant for 100 years. Its power is fading, which is why this textbook uses the Bayesian approach.) For a Frequentist, a 95% confidence interval means that, if we were to apply the procedure we used in an infinite number of future situations like this, we would expect the true value to fall within the calculated confidence intervals 95% of the time. In academia, a distinction is sometimes made between confidence intervals (which use the Frequentist interpretation) and credible intervals (which use the Bayesian interpretation). We won’t worry about that difference in this Primer.

Let’s look at the full posteriors for both \(\beta_1\) and \(\beta_2\).

fit_2 %>% 
  as_tibble() %>% 
  select(-sigma) %>% 
  pivot_longer(cols = treatmentTreated:treatmentControl,
               names_to = "Parameter",
               values_to = "attitude") %>% 
  ggplot(aes(x = attitude, fill = Parameter)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Expected Attitude Toward Immigration",
         subtitle = "Treated individuals are more conservative",
         x = "Attitude",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) + 
    theme_classic()

It appears that the affect of the treatment is to change people’s attitudes to be more conservative about immigration issues. Which is somewhat surprising!

We can decompose the the dependent variable, att_end into two parts: the fitted values and the residuals. There are only two possible fitted values, one for the Treated and one for the Control. The residuals, as always, are simply the difference between the outcomes and the fitted values.

The smaller the spread of the residuals, the better a job the model is doing of explaining the outcomes.

8.2.4 Temperance

Temperance

FIGURE 8.8: Temperance

Recall the first question with which we began this section:

  • What is the average treatment effect, of exposing people to Spanish-speakers, on their attitudes toward immigration?

Chapter 4 defined the average treatment effect. One simple estimator of the average treatment effect is the difference between \(\beta_1\) and \(\beta_2\). After all, the definition of \(\beta_1\) is the average attitude toward immigration, of the population, for anyone, under exposure to the treatment. So, \(\beta_1 - \beta_2\) is the average treatment effect for the population, roughly 1.5. However, estimating the posterior probability distribution for this parameter is tricky, unless we make use of the posterior distributions of \(\beta_1\) and \(\beta_2\). With that information, the problem is simple:

fit_2 %>% 
  as_tibble() %>% 
  mutate(ate = treatmentTreated - treatmentControl) %>% 
  ggplot(aes(x = ate)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior for Average Treatment Effect",
         subtitle = "Exposure to Spanish-speakers shifts immigration attitudes rightward",
         x = "Difference in Attitude",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

Could the true value of the average treatment effect be as much as 2 or as little as 1? Of course! The most likely value is around 1.5, but the variation in the data and the smallness of our sample cause the estimate to be imprecise. However, it is quite unlikely that the true average treatment effect is below zero.

We can use posterior_epred() to answer this question. Create a tibble and use it as we have done before:

newobs <- tibble(treatment = c("Treated", "Control"))

pe <- posterior_epred(fit_2, newobs) %>% 
    as_tibble() %>% 
    mutate(ate = `1` - `2`)

pe
## # A tibble: 4,000 x 3
##      `1`   `2`   ate
##    <dbl> <dbl> <dbl>
##  1 10.0   8.56 1.47 
##  2  9.94  8.33 1.61 
##  3 10.0   8.48 1.56 
##  4 10.6   8.28 2.36 
##  5 10.6   8.29 2.32 
##  6  9.26  8.81 0.449
##  7 10.4   8.42 1.96 
##  8 10.5   8.28 2.26 
##  9 10.8   8.51 2.25 
## 10 10.3   8.37 1.88 
## # … with 3,990 more rows

The posterior probability distribution created with posterior_epred() is the same as the one produced by manipulating the parameters directly.

pe %>% 
  ggplot(aes(x = ate)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior for Average Treatment Effect",
         subtitle = "Exposure to Spanish-speakers shifts immigration attitudes rightward",
         x = "Difference in Attitude",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

Our second question:

  • What is the largest effect size which still has a 1 in 10 chance of occurring?

Create a tibble which we can pass to posterior_predict(). The variables in the tibble which will be passed in as newdata. Fortunately, the tibble we created above is just what we need for this question also.

Consider the result of posterior_predict() for two people, one treated and one control. Take the difference.

pp <- posterior_predict(fit_2, 
                        newdata = newobs) %>%
    as_tibble() %>%
    mutate_all(as.numeric) %>% 
    mutate(te = `1` - `2`)
  
pp
## # A tibble: 4,000 x 3
##      `1`   `2`    te
##    <dbl> <dbl> <dbl>
##  1  8.66  6.21  2.44
##  2  9.20  3.94  5.25
##  3 10.9   6.47  4.48
##  4 14.7   8.00  6.71
##  5 12.6   5.47  7.08
##  6  9.77  7.16  2.61
##  7 11.9   7.95  3.95
##  8  8.38  4.86  3.53
##  9 13.8   6.82  6.96
## 10 10.4   8.32  2.11
## # … with 3,990 more rows

Create a graphic:

pp %>% 
  ggplot(aes(x = te)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for Treatment Effect for One Person",
         subtitle = "Causal effects are more variable for indvduals",
         x = "Difference in Attitude",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

In this case, we are looking at the distribution of the treatment effect for a single individual. This is very different than the average treatment effect. In particular, it is much more variable. We are looking at one row in the Preceptor Table. For a single individual, att_end can be anywhere from 3 to 15, both under treatment and under control. The causal effect — the difference between the two potential outcomes can, in theory, be anywhere from -12 to +12. Such extreme values are rare, but not impossible.

The question, however, was interested in the value at the 90th percentile.

quantile(pp$te, prob = 0.9)
## 90% 
## 6.5

We would not expect a treatment effect of this magnitude to be common, but, at the same time, effects this big and bigger will occur about 10% of the time.

8.3 income ~ age

So far, we have only created models in which the predictor variable is discrete, with two possible values. party is either “Democrat” or “Republican.” treatment is either “Treated” or “Control.” Often times, however, the predictor variable will be continuous. Fortunately, the exact same approach works in this case. Consider:

  • What would you expect the income to be for a 40-year old?

8.3.1 Wisdom

Wisdom

FIGURE 8.9: Wisdom

Before making a model which seeks to explain income, we should plot it and the variable we think is connected to it.

ch8 %>%
  ggplot(aes(x = age, y = income)) + 
  geom_point() + 
  labs(title = "Income and Age",
       subtitle = "Older people make more money. Maybe?",
       x = "Age",
       y = "Income") +
  scale_y_continuous(labels = scales::dollar_format(accuracy = 1))

Is that data believable? Maybe? One could imagine that income was not recorded exactly and that, therefore, it is not surprising that many people had exactly the same income. Rounding makes different numbers the same, after all. But what about the (almost) complete lack of people reporting income between $150,000 and $250,000? In a real data science project, this puzzle would require further investigation. For now, we ignore the issue and blithely press on.

8.3.2 Justice

Justice

FIGURE 8.10: Justice

The mathematics for a continuous predictor is unchanged from the intercept-including example we explored in Section 8.1.5:

\[y_i = \beta_0 + \beta_1 age_i + \epsilon_i\]

When comparing two people (persons 1 and 2), the first one year older than the second, \(\beta_1\) is the expected difference in their incomes. The algebra is simple. Start with the two individuals.

\[y_1 = \beta_0 + \beta_1 age_1\] \[y_2 = \beta_0 + \beta_1 age_2\] We want the difference between them, so we subtract the second from the first, performing that subtraction on both sides of the equals sign.

\[y_1 - y_2 = \beta_0 + \beta_1 age_1 - \beta_0 - \beta_1 age_2\\ y_1 - y_2 = \beta_1 age_1 - \beta_1 age_2\\ y_1 - y_2 = \beta_1 (age_1 - age_2)\]

So, if person 1 is one year older than person 2, we have:

\[y_1 - y_2 = \beta_1 (age_1 - age_2)\\ y_1 - y_2 = \beta_1 (1)\\ y_1 - y_2 = \beta_1\]

The algebra demonstrates that \(\beta_1\) is the same for all ages. The difference in expected income between two people aged 23 and 24 is the same as the difference between two people aged 80 and 81. Is that plausible? Maybe. The algebra does not lie. When we create a model like this, this is the assumption we are making.

Note how careful we are not to imply that increasing age by one year “causes” an increase in income. That is nonsense! No causation without manipulation. Since it is impossible to change someone’s age, there is only one potential outcome. With only one potential outcome, a causal effect is not defined.

8.3.3 Courage

Courage

FIGURE 8.11: Courage

The use of stan_glm() is the same as usual.

fit_3 <- stan_glm(income ~ age, 
                  data = trains, 
                  seed = 28,
                  refresh = 0)

print(fit_3, details = FALSE)
## stan_glm
##  family:       gaussian [identity]
##  formula:      income ~ age
##  observations: 115
##  predictors:   2
## ------
##             Median   MAD_SD  
## (Intercept) 104783.8  25577.1
## age            877.7    578.5
## 
## Auxiliary parameter(s):
##       Median  MAD_SD 
## sigma 74186.1  4820.3
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

When comparing two individuals, one 30 years old and one 40, we expect the older to earn about $9,000 more. But we are far from certain: the 95% confidence interval ranges from -$3,000 to $20,000.

The above is a good summary of the models.

  • It is brief! No one wants to listen to too much of your prattle. One sentence gives a number of interest. The second sentence provides a confidence interval.

  • It rounds appropriately. No one wants to hear a bunch of decimals. Use sensible units.

  • It does not just blindly repeat numbers in the printed display. A one year difference in age, which is associated with a $900 difference in income, is awkward. (We think.) A decade comparison is more sensible.

  • “When comparing” is a great phrase to start the summary of any non-causal model. Avoid language like “associated with” or “leads to” or “implies” or anything which even hints at a causal claim.

Consider our usual decomposition of the outcome into two parts: the model and the error term.

There are scores of different fitted values. Indeed, there are a greater number of different fitted values than there are different outcome values! This is often true for models which have continuous predictor variables, we have here with age.

8.3.4 Temperance

Temperance

FIGURE 8.12: Temperance

Recall our question:

What would you expect the income to be for a random 40 year old?

Given that we are looking for an expected value, we use posterior_epred().

newobs <- tibble(age = 40)

pe <- posterior_epred(fit_3, newdata = newobs) %>% 
  as_tibble() 

pe
## # A tibble: 4,000 x 1
##        `1`
##      <dbl>
##  1 131988.
##  2 132900.
##  3 143766.
##  4 149269.
##  5 138030.
##  6 135142.
##  7 145487.
##  8 147361.
##  9 131460.
## 10 147589.
## # … with 3,990 more rows

Plotting is the same as always.

pe %>% 
  ggplot(aes(x = `1`)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for Expected Income",
         subtitle = "A 40-years old commuter earns around $140,000",
         x = "Income",
         y = "Probability") +
    scale_x_continuous(labels = scales::dollar_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

8.4 liberal ~ income

So far in this chapter, we have only considered continuous outcome variables. age, att_end and income all take on a variety of values. None of them are, truly, continuous, of course. age is only reported as an integer value. att_end can only, by definition, take on 13 distinct values. However, from a modeling perspective, what matters is that they have more than 2 possible values.

liberal, however, only takes on two values: TRUE and FALSE. In order to model it, we must use the binomial family. We begin, as always, with some questions:

Among all people who have an income $100,000, what proportion are liberal?

Assume we have a group of eight people, two of whom make $100,000, two $200,000, two $300,000 and two $400,000. How many will be liberal?

8.4.1 Wisdom

Look at our data.

ch8 %>%
  ggplot(aes(x = income, y = liberal)) + 
  geom_jitter(width = 0, height = 0.1) + 
  labs(title = "Liberal Status by Income in Trains Dataset",
       subtitle = "Where the gap in the income distribution?",
       x = "Income",
       y = "Liberal") +
  scale_x_continuous(labels = scales::dollar_format())

Why is there only one observation with an income between $135,000 and $250,000? We don’t know. The important thing is to notice it, to explore the variables you are using in the model.

The second part of Wisdom is examining whether the data we have is representative of the overall population about which we seek to draw inferences. Our data is from 2012 and includes only Boston-area commuters. Our questions are, apparently, about today. Are they for the world, the US, the Boston-area, or just commuters at these select train stations? We don’t know! The questions are not very precise.

One of the key jobs of a data scientist is to guide people — our colleagues, bosses and clients — toward asking precise questions. We must help them to translate English into inference. That translation requires precision. In this case, let’s assume that the questions reference Boston-area commuters in 2021.

So, is the data we have from 2012 representative of the data from 2021 about which we seek to draw inferences? Maybe? There is rarely a correct answer to this question. Some things have changed in Boston and some things have not. Our job is to think about these changes and adjust our confidence in our final answers accordingly. Or, in the worst case scenario, if there is no way to understand our actual data and our desired data as coming from the same population, we must stop the project right there.

8.4.2 Justice

Recall the discussion in Section 7.6 about the logistic regression model which we use whenever the outcome or dependent variable is binary/logical. The math is there, if you care about math. We don’t, at least not too much. Reminder:

\[p(\text{Liberal}_i = \text{TRUE}) = \frac{\text{exp}(\beta_0 + \beta_1 \text{income}_i)}{1 + \text{exp}(\beta_0 + \beta_1 \text{income}_i)}\]

This model only has two parameters, \(\beta_0\) and \(\beta_1\). But these parameters do not have simple interpretations, unlike the parameters in a linear (or gaussian) model.

Recall the fundamental structure of all data science problems:

\[\text{outcome} = \text{model} + \text{what is not in the model}\] The exact mathematics of the model — the parameters, their interpretations — are all just dross in the foundry of our inferences: unavoidable but not worth too much of our time.

Even if the math is ignorable, the causal versus predictive nature of the model is not. Is this a causal model or a predictive model? It depends! It could be causal if you assume that we can manipulate someone’s income, if, that is, there are at least two potential outcomes: person \(i\)’s liberal status if she makes X dollars and person \(i\)’s liberal status if she makes Y dollars. Remember: No causation without manipulation. The definition of a causal effect is the difference between two potential outcomes. If you only have one outcome, then your model can not be causal.

In many circumstances, we don’t really care if a model is causal or not. We might only want to forecast/predict/explain the outcome variable. In that case, whether we can interpret the influence of a variable as causal is irrelevant to our use of that variable.

8.4.3 Courage

Fitting a logistic model is easy. We use all the same arguments as usual, but with family = binomial added.

fit_4 <- stan_glm(data = ch8,
                  formula = liberal ~ income,
                  family = binomial,
                  refresh = 0,
                  seed = 365)

Having fit the model, we can look at a printed summary. Note the use of the digits argument to display more digits in the printout.

print(fit_4, digits = 6)
## stan_glm
##  family:       binomial [logit]
##  formula:      liberal ~ income
##  observations: 115
##  predictors:   2
## ------
##             Median   MAD_SD  
## (Intercept)  5.6e-01  4.3e-01
## income      -6.0e-06  3.0e-06
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

Fitted models tell us about the posterior distributions for the parameters in the formula which defines the model we have estimated. We are assuming that the model is true. And, as discussed in Chapter 6, that assumption is always false! Our model is never a perfectly accurate representation of reality. But, if it were perfect, then the posterior distributions which we have created for \(\beta_0\), \(\beta_1\), and so on would be perfect as well.

When working with a linear model, we will often interpret the meaning of the parameters, as we have already done in the first three sections of this chapter. Such interpretations are much harder with logistic models because the math is much less convenient. So, we won’t even bother to try to understand the meaning of these parameters. However, we can note that \(\beta_1\) is negative, suggesting that people with higher incomes are less likely to be liberal.

8.4.4 Temperance

Among all people who have an income $100,000, what proportion are liberal?

Although our model is now logistic, all the steps in answering a question like this are the same as with a linear/guassian model.

newobs <- tibble(income = 100000)

pe <- posterior_epred(fit_4, 
                      newdata = newobs) %>% 
  as_tibble()

pe is a tibble with a single vector. That vector is 4,000 draws from the posterior distribution of proportion of people, among those who make $100,000, who are liberal. The population proportion is the same thing as the probability for any single individual.

pe %>% 
  ggplot(aes(x = `1`)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for Proportion Liberal Among $100,000 Earners",
         subtitle = "The population proportion is the same as the probability for any individual",
         x = "Income",
         y = "Probability of Being Liberal") +
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_classic()

Assume we have a group of eight people, two of whom make $100,000, two $200,000, two $300,000 and two $400,000. How many will be liberal?

Because we are trying to predict the outcome for a small number of units, we use posterior_predict(). The more complex the questions we ask, the more care we need to devote to making the newobs tibble. We use the same rowwise() and c_across() tricks as earlier in the chapter.

newobs <- tibble(income = c(rep(100000, 2),
                            rep(200000, 2),
                            rep(300000, 2),
                            rep(400000, 2)))
                 

pp <- posterior_predict(fit_4, 
                        newdata = newobs) %>% 
  as_tibble() %>% 
  mutate_all(as.numeric) %>% 
  rowwise() %>% 
  mutate(total = sum(c_across()))

pp
## # A tibble: 4,000 x 9
## # Rowwise: 
##      `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8` total
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     0     1     1     0     0     1     0     1     4
##  2     1     1     0     0     0     0     0     0     2
##  3     0     0     0     0     0     0     1     0     1
##  4     1     1     0     0     0     0     1     0     3
##  5     1     1     0     1     0     1     0     0     4
##  6     1     0     1     0     0     0     0     0     2
##  7     1     1     1     1     1     0     1     0     6
##  8     0     1     0     0     0     0     0     0     1
##  9     1     1     0     0     0     0     0     0     2
## 10     1     1     0     0     0     0     0     0     2
## # … with 3,990 more rows

Study the pp tibble. Understand its component parts. The first column, for example, is 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $100,000. Note how all the draws are zeroes or ones. That is very different from the draws we have seen before! But it also makes sense. We are making a prediction about a binary variable, a variable which only have two possible values: zero or one. So, any (reasonable!) predictions will only be zero or one.

The second column is the same thing as the first column. Both are 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $100,000. Yet they also have different values. They are both the same thing and different things, in the same way that rnorm(10) and rnorm(10) are the same thing — both are 10 draws from the standard normal distribution — and different things in that the values vary.

The third and fourth columns are different from the first two columns. They are both 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $200,000. And so on for later columns. We can answer very difficult questions by putting together simple building blocks, each of them a set of draws from a posterior distribution. Recall the discussion in Section 2.8.

The total column is simply the sum of the first eight columns. Having created the building blocks with 8 columns of draws from four different posterior distributions, we can switch our focus to each row. Consider row 2. It has a vector of 8 numbers: 1 1 1 0 0 1 0 0. We can treat that vector as a unit of analysis. This is what might happen with our 8 people. The first three might be liberal, the fourth not liberal and so on. This row is just one example of what might happen, one draw from the posterior distribution of possible outcomes for groups of eight people with these incomes.

We can simplify this draw by taking the sum, or doing anything else which might answer the question with which we are confronted. Posterior distributions are as flexible as individual numbers. We can, more or less, just use algebra to work with them.

Graphically we have:

pp %>% 
  ggplot(aes(x = total)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for Number of Liberals in Group with Varied Incomes",
         subtitle = "Two is the most likely number, but values from 0 to 5 are plausible",
         x = "Number of Liberals",
         y = "Probability") +
    scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_classic()

As always, there is some truth. If, tomorrow, we were to meet 8 new people, with the specified incomes, a certain number of them would be liberal. If we had the ideal Preceptor Table, we could just look up that number. No data science required. Alas, we don’t know the truth. The bets we can do is to create a posterior distribution for that unknown value, as we have done here. We then need to translate that posterior into English — “The most likely number of liberals is 2 or 3, but a total as low as zero or as high as 5 is also plausible. Having 6 liberals would be really surprising. Having 7 or 8 is almost impossible.”

Are these two posterior probability distributions perfect? No! This is the central message of the virtue of Temperance. We must demonstrate our humility when we use our models. Recall the distinction between the unknown true distribution and the estimated distribution. The first is the posterior distribution we would create of we understood every detail of the process and could accurately model it. We would still not know the true unknown number, but our posterior distribution for that number would be perfect. Yet, our model is never perfect. We are making all sorts of assumptions behind the scenes. Some of those assumptions are plausible. Others are less so. Either way, the estimated distribution is what we have graphed above.

The central lesson of Temperance is: Don’t confuse the estimated posterior (which is what you have) with the true posterior (which is what you want). Recognize the unavoidable imperfections in the process. You can still use your estimated posterior — what choice do you have? — but be cautious and humble in doing so. The more that you suspect that your estimated posterior differs from the true posterior, the more humble and cautious you should be.

8.5 Summary

In this chapter, we explored relationships between different variables in the trains data set. We built three predictive models and one causal model.

Similar to previous chapters, our first task is to use Wisdom . We judge how relevant our data is to the questions we ask. Is it reasonable to consider the data we have (e.g., income and age data from Boston commuters in 2012) as being drawn from the same population as the data we want to have (e.g., income and age data from today for the entire US)? Probably?

Justice is necessary to decide the best way to represent the models we make. A little math won’t kill you. We use Courage to translate our models into code. Our goal is to understand, generate posterior distributions for the parameters, and interpret their meaning. Temperance leads us to the final stage, using our models to answer our questions.

Key commands:

  • Create a model with stan_glm().

  • Use posterior_epred() to estimate expected values. The e in epred stands for expected.

  • Use posterior_predict() to make forecasts for individuals. The variable in predictions is always greater than the variability in expectations because predictions can’t pretend that \(\epsilon_i\) is zero.

Once we have draws from a posterior distribution for our outcome variable — whether that be an expectation or a prediction — we can manipulate those draws to answer our question.

Remember:

  • Always explore your data.

  • Predictive models care little about causality.

  • Predictive models and causal models use the same math and the same code.

  • “When comparing” is a great phrase to start the summary of any non-causal model.

  • Don’t confuse the estimated posterior (which is what you have) with the true posterior (which is what you want). Be humble and cautious in your use of the posterior.