8 Three Parameters

Even if the two outcome distributions are the same, then a second problem could be that the parameters of the model are different for different ranges of the independent variable. (Grumpy people versus non-grumpy people.)

Models have parameters. In Chapter 6 we created models with a single parameter \(p\), the proportion of red beads in an urn. In Chapter 7, we used models with two parameters: \(\mu\) (the average height in the population, generically known as a model “intercept”) and \(\sigma\) (the variation in height in the population). Here — can you guess where this is going? — we will build models with three parameters: \(\sigma\) (which serves the same role throughout the book) and two coefficients. In models which relate a continuous predictor to the outcome, those two parameters will be labeled \(\beta_0\) and \(\beta_1\). In models which estimate two averages, the parameters will be \(\beta_1\) and \(\beta_2\). All this notation is confusing, not least because different academic fields use inconsistent schemes. Follow the Cardinal Virtues and tackle your problem step by step.

8.1 Wisdom

8.1.1 Preceptor Table

Wisdom begins with considering the questions we desire to answer and the data set we are given. In this chapter, we are going to ask a series of questions involving train commuters’ ages, party affiliations, incomes, and political ideology, as well as the causal effect of exposure to Spanish-speakers on their attitude on immigration. These questions will pertain to all train commuters in the US today. Given these types of questions, the Preceptor Table to answer every question would appear as the following:

ID Age Party Income Liberal Control Ending Attitude Treated Ending Attitude

Commuter 1

23

Democrat

50000

Liberal

3

8

Commuter 2

18

Republican

150000

Not Liberal

7

7

...

...

...

...

...

...

...

Commuter 1000

49

Republican

100000

Liberal

4

8

Commuter 1001

38

Democrat

200000

Liberal

9

7

...

...

...

...

...

...

...

Recall: a Preceptor Table is the smallest possible table such that, if there is no missing data, all questions are easy to answer. So, in our scenario in which we are looking at train commuters in the US now, this table would ideally include all of this information for a random sample from this group of people.

Additionally, notice that since we are looking at the causal effect of being exposed to Spanish-speakers on immigration attitudes, we have two columns for attitude: one for those who receive the treatment of being exposed to Spanish-speakers and one for the control group. This is because the ending attitude would then be the dependent variable in a causal model so there will be two potential outcomes.

Now that we have determined the data we would ideally have to answer our questions, we have to take a look at the data set we have. To demonstrate modeling with three parameters, we will use the trains data set from the primer.data package.

8.1.2 EDA for trains

Always explore your data. Recall the discussion from Chapter 4. Enos (2014) randomly placed Spanish-speaking confederates on nine train platforms around Boston, Massachusetts. Exposure to Spanish-speakers – the treatment – influenced attitudes toward immigration. These reactions were measured through changes in answers to three survey questions. Load the necessary libraries and look at the data.

glimpse(trains)
## Rows: 115
## Columns: 15
## $ treatment           <fct> Treated, Treated, Treated, Treated, Control, Treat…
## $ att_start           <dbl> 11, 9, 3, 11, 8, 13, 13, 10, 12, 9, 10, 11, 13, 6,…
## $ att_end             <dbl> 11, 10, 5, 11, 5, 13, 13, 11, 12, 10, 9, 9, 13, 7,…
## $ gender              <chr> "Female", "Female", "Male", "Male", "Male", "Femal…
## $ race                <chr> "White", "White", "White", "White", "White", "Whit…
## $ liberal             <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FAL…
## $ party               <chr> "Democrat", "Republican", "Democrat", "Democrat", …
## $ age                 <int> 31, 34, 63, 45, 55, 37, 53, 36, 54, 42, 33, 50, 24…
## $ income              <dbl> 135000, 105000, 135000, 300000, 135000, 87500, 875…
## $ line                <chr> "Framingham", "Framingham", "Framingham", "Framing…
## $ station             <chr> "Grafton", "Southborough", "Grafton", "Grafton", "…
## $ hisp_perc           <dbl> 0.026, 0.015, 0.019, 0.019, 0.019, 0.023, 0.030, 0…
## $ ideology_start      <int> 3, 4, 1, 4, 2, 5, 5, 4, 4, 4, 3, 5, 4, 1, 2, 2, 3,…
## $ ideology_end        <int> 3, 4, 2, 4, 2, 5, 5, 4, 3, 4, 3, 4, 4, 1, 2, 3, 3,…
## $ income_in_thousands <dbl> 135, 105, 135, 300, 135, 88, 88, 135, 105, 135, 10…

The data include information about each respondent’s gender, political affiliations, age, income and so on. treatment indicates whether a subject was in the control or treatment group. The key outcomes are their attitudes toward immigration both before (att_start) and after (att_end) the experiment. We also have a measure of their ideology before and after the experiment, but there is little change. Type ?trains to read the help page for more information about each variable.

Let’s restrict attention to a subset of the variables by recalling the Preceptor Table we made. By that, we can determine that the variables age, party, liberal and income are relevant to the questions we seek to answer. This is because age is the age of the respondent, party is their political party affiliation, liberal is whether they are liberal or not, and income is the income of the respondent. Additionally, since treatment tells us whether the respondent was given the treatment of being exposed to Spanish-speakers in the train station and att_end tell us about the respondent’s attitude on immigration after the treatment, these are also variables of interest to us since they will be useful in answering some questions.

ch8 <- trains %>% 
  select(age, att_end, party, income, treatment, liberal)

It is always smart to look at a some random samples of the data:

ch8 %>% 
  slice_sample(n = 5)
## # A tibble: 5 x 6
##     age att_end party    income treatment liberal
##   <int>   <dbl> <chr>     <dbl> <fct>     <lgl>  
## 1    38      11 Democrat  87500 Treated   FALSE  
## 2    31       9 Democrat  87500 Control   TRUE   
## 3    45      11 Democrat 300000 Treated   FALSE  
## 4    52      13 Democrat 135000 Treated   FALSE  
## 5    67       9 Democrat 135000 Control   TRUE

att_end is a measure of person’s attitude toward immigration. A higher number means more conservative, i.e., a more exclusionary stance toward immigration into the United States.

ch8 %>% 
  glimpse()
## Rows: 115
## Columns: 6
## $ age       <int> 31, 34, 63, 45, 55, 37, 53, 36, 54, 42, 33, 50, 24, 40, 53, …
## $ att_end   <dbl> 11, 10, 5, 11, 5, 13, 13, 11, 12, 10, 9, 9, 13, 7, 8, 13, 8,…
## $ party     <chr> "Democrat", "Republican", "Democrat", "Democrat", "Democrat"…
## $ income    <dbl> 135000, 105000, 135000, 300000, 135000, 87500, 87500, 135000…
## $ treatment <fct> Treated, Treated, Treated, Treated, Control, Treated, Contro…
## $ liberal   <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…

Pay attention to the variable types. Do they make sense? Perhaps. But there are certainly grounds for suspicion. Why is att_end a double rather than an integer? All the values in the data appear to be integers, so there is no benefit to having these variables be doubles. Why is party a character variable and treatment a factor variable? It could be that these are intentional choices made by the creator of the tibble, i.e., us. Or, these could be mistakes. Most likely, these choices are a mixture of sensible and arbitrary. Regardless, it is your responsibility to notice them. You can’t make a good model without looking closely at the data which you are using.

ch8 %>% 
  skim()
TABLE 8.1: Data summary
Name Piped data
Number of rows 115
Number of columns 6
_______________________
Column type frequency:
character 1
factor 1
logical 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
party 0 1 8 10 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
treatment 0 1 FALSE 2 Con: 64, Tre: 51

Variable type: logical

skim_variable n_missing complete_rate mean count
liberal 0 1 0.44 FAL: 64, TRU: 51

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 4.2e+01 12.2 20 33 43 52 68 ▆▇▇▇▃
att_end 0 1 9.1e+00 2.9 3 7 9 11 15 ▂▃▇▃▃
income 0 1 1.4e+05 74476.6 23500 87500 135000 135000 300000 ▅▇▇▁▆

skim() shows us what the different values of treatment are because it is a factor. Unfortunately, it does not do the same for character variables like party. The ranges for age and att_end seem reasonable. Recall that participants were asked three questions about immigration issues, each of which allowed for an answer indicated strength of agreement on a scale form 1 to 5, with higher values indicating more agreement with conservative viewpoints. att_end is the sum of the responses to the three questions, so the most liberal possible value is 3 and the most conservative is 15.

By taking our first glance at the data in the above manners, we gain greater understanding of the data we have to work with and whether it will be capable of answering our questions. However, we can’t make models and use the data for our analysis without also understanding the population that the data is from and whether it can be applied to the population we would like to work with.

8.1.3 Population

The data that we have are very limited. There are only 115 observations, all from 2012 and involving train commuters to Boston. What me must ask ourselves is if the data we have and the data in our Preceptor Table are drawn from the same population. We therefore must consider how useful this data will be today, or for other populations around Boston, or for other cities in the US. Only your judgment, along with advice from your colleagues, can guide you.

For our current questions, the population we are answering questions for is the general population of train commuters today. However, the data we have involves train commuters in the Boston area in 2012. So, what we must ask ourselves is if this data from Boston in 2012 can be drawn from the same population as all train commuters in the US today. If we believe that the data is representative enough of today’s population of train commuters and this can be considered to be true, then we can continue with the process of answering our questions. We will assume that the data is drawn from the larger population of train commuters and that there isn’t very much change overtime within this population so we can apply our data to the population of train commuters today.

The key concept is the idea of a “population.” From which larger population is the data we have being (conceptually) drawn? If we were only interested in the age of individuals in our data set, we would have no need for inference. We know everyone’s ages already. We only need tools like stan_glm() if we seek to understand individuals not in our data.

The issue is always: Do the data we have and the data we want to have come from the same population? If there is no connection between the two, progress is impossible. But, in this case, if we are willing to consider the population to be US residents over the last decade (including today), and if we are willing to assume that there is a single population from which the Preceptor Table and our data set is drawn, then we can use our data to create a model to answer our question.

8.1.4 Population Table

Having determined that the Preceptor Table and the data were drawn from the same population, we can produce a Population Table.

The Population Table shows rows from three sources: the Preceptor Table, the actual data, and the population (outside of the data).

Our Preceptor Table rows contain the information that we would want to know in order to answer our questions. These rows contain entries for our covariates (city and year) but they do not contain any outcome results. We are trying to answer questions about the train commuter population in 2021, so our city entries for these rows will vary and our year entries of these rows will read “2021.”

Our actual data rows contain the information that we do know. These rows contain entries for both our covariates and the outcomes. In this case, the actual data comes from a study conducted on train commuters around Boston, MA in 2012, so our city entries for these rows will read “Boston, MA” and our year entries of these rows will read “2012.”

Our population rows contain no data. These are subjects which fall under our desired population, but for which we have no data. As such, all rows are missing. However, this may include, for example, data from a range of years before and after 2012, since we considered little change overtime, so we would also consider this data to be representative and drawn from the larger population of train commuters in years near 2012.

The population table includes commuters from our Preceptor Table with the information that we would ideally have to answer the questions and those from the data we have that is specific to Boston, MA. Additionally, the population table also includes the groups of people within the population from which both the data we have and the Preceptor table is drawn from that we don’t have.

tibble(source = c("Preceptor Table", "Preceptor Table", "Preceptor Table",  "...", "Data", "Data", "Data", "...", "Other", "Other", "Other"),
       city = c("Chicago, IL", "Seattle, WA", "Atlanta, GA", "...",
                  "Boston, MA", "Boston, MA", "Boston, MA", "...",
                  "?", "?", "?"),
       year = c("2021", "2021", "2021", "...",
                "2012", "2012", "2012", "...",
                "?", "?", "?"),
       age = c("?", "?", "?", "...",
                 "43", "52", "28", "...",
                 "?", "?", "?"),
       party = c("?", "?", "?", "...", 
                  "Democrat", "Republican", "Republican", "...",
                  "?", "?", "?"),
       income = c("?", "?", "?", "...", 
                    "150000", "50000", "200000", "...",
                    "?", "?", "?"),
       liberal = c("?", "?", "?", "...", 
                 "Liberal", "Liberal", "Not Liberal", "...",
                 "?", "?", "?"),
       att_treat = c("?", "?", "?", "...", 
                 "6", "8", "4", "...",
                 "?", "?", "?"),
       att_control = c("?", "?", "?", "...", 
                 "4", "2", "5", "...",
                 "?", "?", "?")) %>%
  
  # Then, we use the gt function to make it pretty
  
  gt() %>%
  cols_label(source = md("Source"),
             city = md("City"),
             year = md("Year"),
             age = md("Age"),
             party = md("Party"),
             income = md("Income"),
             liberal = md("Liberal"),
             att_treat = md("Treatment"),
             att_control = md("Control")) %>%
  tab_style(cell_borders(sides = "right"),
            location = cells_body(columns = c(source))) %>%
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(source))) %>%
  cols_align(align = "center", columns = everything()) %>%
  cols_align(align = "left", columns = c(source)) %>%
  fmt_markdown(columns = everything())  
Source City Year Age Party Income Liberal Treatment Control

Preceptor Table

Chicago, IL

2021

?

?

?

?

?

?

Preceptor Table

Seattle, WA

2021

?

?

?

?

?

?

Preceptor Table

Atlanta, GA

2021

?

?

?

?

?

?

...

...

...

...

...

...

...

...

...

Data

Boston, MA

2012

43

Democrat

150000

Liberal

6

4

Data

Boston, MA

2012

52

Republican

50000

Liberal

8

2

Data

Boston, MA

2012

28

Republican

200000

Not Liberal

4

5

...

...

...

...

...

...

...

...

...

Other

?

?

?

?

?

?

?

?

Other

?

?

?

?

?

?

?

?

Other

?

?

?

?

?

?

?

?

8.2 age ~ party

We want to build a model and then use that model to make claims about the world. Our questions about the relationship between age and party are the following:

What is the expected age of a Democrat at the train station?

In a group of three Democrats and three Republicans, what will the age difference be between the oldest Democrat and the youngest Republican?

We can answer these and similar questions by creating a model that uses party affiliation to predict age

8.2.1 Justice

Justice

FIGURE 8.1: Justice

Justice consists of four topics: validity, stability, representativeness, and the data generating mechanism (DGM).

When considering validity, we must look back upon our population table and consider if we are comfortable combining the data from our Preceptor Table and our data in the population table. For party, this could mean considering if the meaning of aligning as a “Democrat” or “Republican” has changed over the years and is it a valid assumption for us to consider party as a column from both. For example, if instead of comparing 2012 and 2021, we were comparing 2021 to 1850, we would need to recognize that the meaning of being a Democrat and Republican has drastically changed and we wouldn’t feel comfortable making party one column for these years. In our scenario, we would consider this for 2012 and 2021, in that the meaning of party affiliation my have also changed between then and now. As we know, the political parties have become more polarized over the past few decades so is it valid to consider party to have the same meaning for 2012 data and our 2021 Preceptor Table? While the parties may have been more polarized, in our instance, we will consider this to be valid because the core beliefs have not changed too significantly.

We then must consider the stability of our model. We are still using the data from Enos (2014). Yet the world is a very different place today! The data we have is never a perfect match to the problem we face because data is always old. The world is constantly changing. To use old data, we need to make assumptions about the stability of our model of the world and of the parameters we are estimating. In other words, we must assume that the DGM doesn’t change over time. So, in this scenario, we must assume that the relationship between age and party has not changed over time. Whether those assumptions are reasonable is a difficult question, one that requires knowledge about the world as it was and the world as it now is. For instance, with our knowledge about the world, we might know that Democrats have been getting younger, so our relationship between age and party now may not be the same as that in 2012. So is making the assumption that the relationship between age and party the same in 2012 and now reasonable? We will say yes, but this is a question you must consider.

Next, let’s consider representativeness. In order to do so, we must determine if the data is representative of the time period which the data is from. Essentially, can we consider the data of 2012 Boston train commuters representative of 2012 US train commuters? Well, no data will every be truly representative since random sampling will also never really be perfect, so what we are truly determining is if we can consider the data of 2012 Boston train commuters to be representative enough of train commuters in the US today. While we may feel we can more confidently consider the data of the Boston train commuters to be representative of train commuters from a city similar to Boston, we might feel less confident that the data is representative of the general population of train commuters in the US due to the variance among many different locations in the US. However, issues with representativeness are most concerning if we believe that they are correlated with a variable we are looking at. For this reason, we will consider the data to be representative enough.

Now, let’s consider the type of DGM. Any DGM with age as its dependent variable will be predictive, not causal, for the simple reason that nothing, other than time, can change your age. You are X years old. It would not matter if you changed your party registration from Democrat to Republican or vice versa. Your age is your age.

There is only one potential outcome, i.e., one outcome. There is not a potential outcome if you are Democrat and a different potential outcome if you are a Republican.

When dealing with a non-causal DGM, the focus is on predicting things. The underlying mechanism which connects age with party is less important than the brute statistical fact that there is a connection. Predictive models care little about causality.

A good way at looking at this is with a Preceptor Table, as seen below. Unlike the previous table in Chapter 7, we now have two columns in addition to ID. Since our data does not include all Republicans and Democrats in the world, not every row is filled in.

ID Predictor Outcome
Political Party Age
1 D 31
2 ? ?
... ... ...
473 D 58
474 ? ?
... ... ...
3,258 ? ?
3,259 R 49
... ... ...
N ? ?

 

We now know that we are working with a predictive DGM. Recall:

\[\text{outcome} = \text{model} + \text{not in the model}\]

In words, any event depends on our explicitly described model as well as on influences unknown to us. Everything that happens in the world is the result of various factors, and we can only consider some of them in our model (because we do not know about some influences, or because we have no data about them).

The mathematics:

\[ y_i = \beta_1 republican_i + \beta_2 democrat_i + \epsilon_i\]

where \[republican_i, democrat_i \in \{0,1\}\] \[republican_i + democrat_i = 1\] \[\epsilon_i \sim N(0, \sigma^2)\]

Don’t panic dear poets and philosophers, the whole thing is easier than it looks.

  • On the left-hand side we have the outcome, \(y_i\), which is the variable to be explained. In our case, this is the age of an individual in the population.

  • On the right-hand side we first have the part contained in the model: two similar terms added together. Each term consists of a parameter and a data point. The betas are our two parameters: \(\beta_1\) is the average age of Republicans in the population and \(\beta_2\) is the average age of Democrats in the population. \(republican_i\) and \(democrat_i\) are our explanatory variables and take the values 1 or 0. If person \(i\) is a Republican we have \(republican_i = 1\) and \(democrat_i = 0\). If person \(i\) is a Democrat we have \(republican_i = 0\) and \(democrat_i = 1\). In other words, their values are mutually exclusive – if you are a Democrat, you cannot also be a Republican.

  • The last part, \(\epsilon_i\) (“epsilon”), represents the unexplained part of the outcome and is called the error term. It is simply the difference between the outcome and our model predictions. This includes all factors that have an influence on someone’s age but are not connected to party affiliation. We assume that this error follows a normal distribution with an expected value of 0 (meaning it is 0 on average).

  • The small \(i\)’s are an index to number the observations. It is equivalent to the “ID” column in our Preceptor Table and simply states that the outcome for person \(i\) is explained by the modeled and non-modeled factors for person \(i\).

  • The model is a claim about how the world works, not just for the 115 individuals for which we have data but for the all the people in the population for which we seek to draw inferences.

  • Although terminology differs across academic fields, the most common term to describe a model like this is a “regression.” We are “regressing” age on party in order to see if they are associated with each other. The formula above is a “regression formula,” and the model is a “regression model.” This terminology would also apply to our model of height in Chapter 7.

  • The model in Chapter 7 is sometimes called “intercept-only” because the only (interesting) parameter is the intercept. Here we have a “two intercept” model because, instead of estimating an average for the whole population, we are estimating two averages.

8.2.2 Courage

Courage

FIGURE 8.2: Courage

Courage allows us to translate math to code.

To get posterior distributions for our three parameters, we will again use stan_glm(), just as we did in Chapter 7.

fit_1 <- stan_glm(age ~ party - 1, 
                    data = trains, 
                    seed = 17,
                    refresh = 0)
  • The variable before the tilde, age, is our outcome.

  • The only explanatory variable is party. This variable has only two values, ‘Democrat’ and ‘Republican.’

  • We have also added -1 at the end of the equation, indicating that we do not want an intercept, which would otherwise be added by default.

The resulting output:

fit_1
## stan_glm
##  family:       gaussian [identity]
##  formula:      age ~ party - 1
##  observations: 115
##  predictors:   2
## ------
##                 Median MAD_SD
## partyDemocrat   42.6    1.2  
## partyRepublican 41.2    2.7  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 12.3    0.8  
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

partyDemocrat corresponds to \(\beta_1\), the average age of Democrats in the population. partyRepublican corresponds to \(\beta_2\), the average age of Republicans in the population. Since we don’t really care about the posterior distribution for \(\sigma\), we won’t discuss it here. Graphically:

fit_1 %>% 
  as_tibble() %>% 
  select(-sigma) %>% 
  mutate(Democrat = partyDemocrat, Republican = partyRepublican) %>%
  pivot_longer(cols = Democrat:Republican,
               names_to = "parameter",
               values_to = "age") %>% 
  ggplot(aes(x = age, fill = parameter)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Average Age",
         subtitle = "More data allows for a more precise posterior for Democrats",
         x = "Age",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

The unknown parameters \(\beta_1\) (partyDemocrat) and \(\beta_2\) (partyRepublican) are still unknown. We can never know the true average age of all Democrats in the population. But we can calculate a posterior probability distribution for each parameter. Comments:

  • Democrats seem slightly older than Republicans. That was true in the sample and so, almost (but not quite!) by definition, it will be true in our the posterior probability distributions.

  • Our estimate for the average age of Democrats in the population is much more precise than that for Republicans because we have five times as many Democrats as Republicans in our sample. A central lesson from Chapter 6 is that the more data you have related to a parameter, the narrower your posterior distribution will be.

  • There is a great deal of overlap between the two distributions. Would we be surprised if, in truth, the average age of Republicans in the population was greater than that for Democrats? Not really. We don’t have enough data to be sure either way.

  • The phrase “in the population” is doing a great deal of work because we have not said what, precisely, we mean by the “population.” Is it the set of people on those commuter platforms on those days in 2012 when the experiment was done? Is it the set of people on all platforms, including ones never visited? Is it the set of all Boston commuter? All Massachusetts residents? All US residents? Does it include people today, or can we only draw inferences for 2012? We will explore these questions in every model we create.

  • The parameters \(\beta_1\) and \(\beta_2\) can be interpreted in two ways. First, like all parameters, they are a part of the model. We need to estimate them. But, in many cases, we don’t really care what the value of the parameter is. The exact value of \(\sigma\), for example, does not really matter. Second, some parameters have a substantive interpretaion, as with \(\beta_1\) and \(\beta_2\) being the average ages in the population. But this will often not be the case! Fortunately, with such models, we can use functions like posterior_epred() and posterior_predict() to answer our questions.

Consider a table which shows a sample of 8 individuals.

8 Observations from Trains Dataset
Age Party Fitted Residual
31 Democrat 42.59 -11.6
34 Republican 41.19 -7.2
63 Democrat 42.59 20.4
45 Democrat 42.59 2.4
55 Democrat 42.59 12.4
37 Democrat 42.59 -5.6
53 Republican 41.19 11.8
36 Democrat 42.59 -6.6

 

The fitted values are the same for all Republicans and for all Democrats, as the model produces one fitted value for each condition. This table shows how just a sample of 8 individuals captures a wide range of residuals, making it difficult to predict the age of a new individual. We can get a better picture of the unmodeled variation in our sample if we plot these three variables for all the individuals in our data.

The following three histograms show the actual outcomes, fitted values, and residuals of all people in trains:

outcome <- ch8 %>% 
  ggplot(aes(age)) +
    geom_histogram(bins = 100) +
    labs(x = "Age",
         y = "Count") 

fitted <- tibble(age = fitted(fit_1)) %>% 
  ggplot(aes(age)) +
    geom_bar() +
    labs(x = "Fitted Values",
         y = NULL) +
    scale_x_continuous(limits = c(20, 70)) 

res <- tibble(resids = residuals(fit_1)) %>% 
  ggplot(aes(resids)) +
    geom_histogram(bins = 100) +
    labs(x = "Residuals",
         y = NULL) 
  

outcome + fitted + res +
  plot_annotation(title = "Decomposition of Height into Fitted Values and Residuals")

The three plots are structured like our equation and table above. A value in the left plot is the sum of one value from the middle plot plus one from the right plot.

  • The actual age distribution looks like a normal distribution. It is centered around 43, and it has a standard deviation of about 12 years.

  • The middle plot for the fitted values shows only two adjacent spikes, which represent the estimates for Democrats and Republicans.

  • Since the residuals plot represents the difference between the other two plots, its distribution looks like the first plot.

8.2.3 Temperance

Temperance

FIGURE 8.3: Temperance

Recall the first questions with which we began this section:

  • What is the probability that, if a Democrat shows up at the train station, he will be over 50 years old?

So far we have only tried our model on people from our data set whose real age we already knew. This is helpful to understand the model, but our ultimate goal is to understand more about the real world, about people we don’t yet know much about. Temperance guides us to make meaningful predictions and to become aware of their known and unknown limitations.

Start with a simple question, what are the chances that a random Democrat is over 50 years old? First, we create a tibble with the desired input for our model. In our case the tibble has a variable named “party” which contains a single observation with the value “Democrat.” This is a bit different than Chapter 7.

new_obs <- tibble(party = "Democrat")

Use posterior_predict() to create draws from the posterior for this scenario. Note that we have a new posterior distribution under consideration here. The unknown parameter, call it \(D_{age}\), is the age of a Democrat. This could be the age of a randomly selected Democrat from the population or of the next Democrat we meet or of the next Democrat we interview on the train platform. The definition of “population” determines the appropriate interpretation. Yet, regardless, \(D_{age}\) is an unknown parameter. But it is not one — like \(\beta_1\), \(\beta_2\), or \(\sigma\) — for which we have already created a posterior probability distribution. That is why we need posterior_predict().

posterior_predict() takes two arguments: the model for which the simulations should be run, and a tibble indicating for which and how many parameters we want to run these simulations. In this case, the model is the one from Courage and the tibble is the one we just created.

pp <- posterior_predict(fit_1, newdata = new_obs) %>%
    as_tibble() %>%
    mutate_all(as.numeric)

head(pp, 10)
## # A tibble: 10 x 1
##      `1`
##    <dbl>
##  1  53.3
##  2  40.6
##  3  37.0
##  4  39.6
##  5  47.0
##  6  43.8
##  7  53.1
##  8  49.6
##  9  51.9
## 10  44.0

We might expect that we can use as_tibble() directly after the object returned posterior_predict(). Sadly, for obscure technical reasons, that won’t quite work. So, we need the incantation mutate_all(as.numeric) to make sure that the resulting tibble is well-behaved. This command ensures that every column in the tibble is a simple numeric vector, which is what we want.

The result are draws from the posterior distribution of the age of a Democrat. It is important to understand that this is not a concrete person from the trains dataset - the algorithm in posterior_predict() simply uses the existing data from trains to estimate this posterior distribution.

pp %>% 
  ggplot(aes(x = `1`)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for a Random Democrat's Age",
         subtitle = "Individual predictions are always more variable than expected values",
         x = "Age",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

Once we have the posterior distribution, we can answer (almost) any reasonable question. In this case, the probability that the next Democrat will be over 50 is around 28%.

sum(pp$`1` > 50) / nrow(pp)
## [1] 0.27

Recall the second question:

  • In a group of three Democrats and three Republicans, what will the age difference be between the oldest Democrat and the youngest Republican?

As before we start by creating a tibble with the desired input. Note that the name of the column (“party”) and the observations (“Democrat,” “Republican”) must always be exactly as they are in the original data set. This tibble as well as our model can then be used as arguments for posterior_predict():

newobs <- tibble(party = c("Democrat", "Democrat", "Democrat", 
                        "Republican", "Republican","Republican"))

posterior_predict(fit_1, newdata = newobs) %>%
    as_tibble() %>%
    mutate_all(as.numeric)
## # A tibble: 4,000 x 6
##      `1`   `2`   `3`   `4`   `5`   `6`
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  41.9  58.1  60.5  42.9  40.6  49.7
##  2  66.0  53.5  27.3  43.0  39.8  33.9
##  3  13.3  41.1  35.1  26.5  42.4  27.6
##  4  54.5  42.4  21.4  38.7  63.1  35.4
##  5  13.9  17.5  37.0  44.9  41.3  55.6
##  6  49.0  37.9  48.6  32.2  32.8  45.1
##  7  51.3  48.6  36.3  20.8  53.0  26.6
##  8  27.6  27.9  35.0  43.3  49.9  41.1
##  9  38.4  26.1  33.8  41.5  58.5  28.9
## 10  57.4  36.2  42.5  52.2  51.2  40.6
## # … with 3,990 more rows

We have 6 columns: one for each person. posterior_predict() does not name the columns, but they are arranged in the same order in which we specified the persons in newobs: D, D, D, R, R, R. To determine the expected age difference, we add code which works with these posterior draws:

pp <- posterior_predict(fit_1, newdata = newobs) %>%
    as_tibble() %>%
    mutate_all(as.numeric) %>% 
  
    # We don't need to rename the columns, but doing so makes the subsequest
    # code much easier to understand. We could just have worked with columns 1,
    # 2, 3 and so on. Either way, the key is to ensure that you correctly map
    # the covariates in newobs to the columns in the posterior_predict object.
  
    set_names(c("dem_1", "dem_2", "dem_3", 
                "rep_1", "rep_2", "rep_3")) %>% 
    rowwise() %>% 
  
  # Creating three new columns. The first two are the highest age among
  # Democrats and the lowest age among Republicans, respectively. The third one
  # is the difference between the first two.
  
  mutate(dems_oldest = max(c_across(dem_1:dem_3)),
         reps_youngest = min(c_across(rep_1:rep_3)),
         age_diff = dems_oldest - reps_youngest)

pp
## # A tibble: 4,000 x 9
## # Rowwise: 
##    dem_1 dem_2 dem_3 rep_1 rep_2 rep_3 dems_oldest reps_youngest age_diff
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>         <dbl>    <dbl>
##  1  81.0  30.7  40.6  61.6  23.1  30.6        81.0          23.1     57.8
##  2  59.8  22.1  55.5  18.8  45.5  56.8        59.8          18.8     41.0
##  3  29.6  38.1  45.7  42.4  23.2  42.4        45.7          23.2     22.5
##  4  58.8  57.4  72.0  34.7  15.2  23.6        72.0          15.2     56.8
##  5  25.8  34.0  30.2  52.5  44.1  47.1        34.0          44.1    -10.0
##  6  40.6  35.7  41.2  56.1  28.3  28.2        41.2          28.2     13.0
##  7  34.9  51.6  28.1  37.5  38.8  30.2        51.6          30.2     21.4
##  8  47.9  45.3  46.9  30.5  40.8  39.4        47.9          30.5     17.4
##  9  76.8  42.5  55.1  27.2  53.8  40.7        76.8          27.2     49.6
## 10  24.5  35.4  64.9  37.2  33.5  58.2        64.9          33.5     31.4
## # … with 3,990 more rows

The plotting code is similar to what we have seen before:

pp %>%  
  ggplot(aes(x = age_diff)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior for Age Difference",
         subtitle = "Oldest of three Democrats compared to youngest of three Republicans",
         x = "Age",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

In words, we would expect the oldest Democrat to be about 22 years older than the youngest Republican, but we would not be too surprised if the oldest Democrat was actually younger than the youngest Republican in a group of 6.

8.2.4 Addendum

Instead of parameterizing this model without an intercept, we could have used one. In that case, the math would be:

\[ y_i = \beta_0 + \beta_1 democratic_i + \epsilon_i\] The interpretations of the parameters are different from the prior model. \(\beta_0\) is now the average age of Republicans. This is the same interpretation as \(\beta_1\) in the original set up. \(\beta_1\) is now the difference between the the average age of Republicans and that of Democrats.

To fit this model, we use the exact same code as before, except without the -1 in the formula argument.

stan_glm(age ~ party, 
         data = trains, 
         seed = 98,
         refresh = 0)
## stan_glm
##  family:       gaussian [identity]
##  formula:      age ~ party
##  observations: 115
##  predictors:   2
## ------
##                 Median MAD_SD
## (Intercept)     42.6    1.3  
## partyRepublican -1.5    3.1  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 12.3    0.8  
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

The intercept, 42.6, is the same as the partyDemocrat estimate in the first model. The partyRepublican estimate, which was previously 41.0, is now -1.5, meaning it is the difference (allowing for rounding) between the average age of Democrats and Republicans.

Little else about the models will be different. They will have the same fitted values and residuals. posterior_predict() will generate the same posterior predictive probabilty distributions. Which parameterization we use does not matter much. But you should be able to interpret the meaning of the coefficients in both.

8.3 att_end ~ treatment

Above, we created a predictive model: with someone’s party affiliation, we can make a better guess as to what their age is than we could have in the absence of information about their party. There was nothing causal about that model. Changing someone’s party registration can not change their age. In this example, we build a causal model. Consider these two questions:

What is the average treatment effect, of exposing people to Spanish-speakers, on their attitudes toward immigration?

What is the largest causal effect which still has a 1 in 10 chance of occurring?

Models help us answer questions better than we could without those models, but only if we follow the Cardinal Virtues.

8.3.1 Justice

Justice

FIGURE 8.4: Justice

The four elements of Justice in a data science project remain the same: validity, stability, representativeness, and the model.

First, we must consider the validity of the model, and to do so, let’s look at our treatment variable. What does it mean for somebody to receive the treatment? Let’s say somebody was running late on their way to the train in the morning. Well, this person may have heard the Spanish-speakers for a shorter amount of time than a commuter who arrived earlier. Similarly, some Spanish-speakers may have been speaking louder than others or some commuters may have been closer to the speakers than other commuters. Therefore, these commuters would hear the treatment of being exposed to Spanish-speakers less loudly than the other commuters. Additionally, the Spanish-speakers weren’t the same for every train station in our 2012 data and if we were to get data for now, we wouldn’t be able to hire the exact same Spanish-speakers. When considering validity with reference to treatments, what we must determine is if we can assume that the treatments of being exposed to Spanish-speakers that each commuter may experience a bit differently can be assumed to be the same.

Next, we must consider the stability of our model for the relationship between att_end and treatment between 2012 and 2021. Is this relationship from 2012, four years before Donald Trump’s election as president, still the same? While we might not know for sure, we have to consider this in order to continue and make assumptions with our data. For our purposes, we will consider the relationship to be stable. Even though we know that there may have been some changes, we will consider the model to be the same in both years.

Let’s now once again consider whether the data from 2012 train commuters around Boston is representative of 2012 train commuters in the US. In our last model, we discussed the issue about how Boston may be different from other cities and therefore not representative of the US, and we will now consider the issue of random sampling that may lead to representativeness issues.

Let’s say that even though Boston is different from other US cities, we considered Boston to be perfectly representative of the US. Great, but this 2012 data could still not be representative. This is because there could be bias within those who are chosen to give the survey, in that the commuters who are approached and receive the surveys may not be random or representative. What if the individuals giving out the surveys were younger and also tended to choose people to approach with a survey that were similar in age to them? A scenario like this could end up overestimating younger train commuters in the population, which could influence our answers to any of our questions. Specifically, when considering the relationship between att_end and treatment, this could influence the results of the model because younger individuals may have similar attitudes on immigration.

The math for this model is exactly the same as the math for the predictive model in the first part of this chapter, although we change the notation a bit for clarity.

\[ y_i = \beta_1 treatment_i + \beta_2 control_i + \epsilon_i\]

where \[treatment_i, control_i \in \{0,1\}\] \[treatment_i + control_i = 1\] \[\epsilon_i \sim N(0, \sigma^2)\]

Nothing has changed, except for the meaning of the data items and the interpretations of the parameters.

  • On the left-hand side we have the outcome, \(y_i\), which is the variable to be explained. In our case, this is a person’s attitude toward immigration after the experiment is complete. \(y_i\) takes on integer values between 3 and 15 inclusive.

  • On the right-hand side we first have the part contained in the model, consisting of two similar terms. The two terms stand for Treated and Control and work as follows. Each term consists of a parameter and a data point. \(\beta_1\) is the average attitude toward immigration for treated individuals — those exposed to Spanish-speakers — in the population. \(\beta_2\) is the average attitude toward immigration for control individuals — those not exposed to Spanish-speakers — in the population. The \(x\)’s are our explanatory variables and take the values 1 or 0. If person \(i\) is Treated, \(treatment_i = 1\) and \(control_i = 0\). If person \(i\) is Control, \(treatment_i = 0\) and \(control_i = 1\). In other words, these are binary variables and are mutually exclusive – if you are Treated, you cannot also be Control.

  • Note that the formula applies to everyone in the population, not just the 115 people for whom we have data. The index \(i\) does not just go from 1 through 115. It goes from 1 through \(N\), where \(N\) is the number of individuals in the population. Conceptually, everyone has an att_end under treatment and under control.

  • The last part, \(\epsilon_i\) (“epsilon”), represents the unexplained part and is called the error term. It is simply the difference between the outcome and our model predictions. In our particular case, this includes all factors that have an influence on someone’s attitude toward immigration but are not explained by treatment status. We assume that this error follows a normal distribution with an expected value of 0.

  • The small \(i\)’s are an index for the data set. It is equivalent to the “ID” column in our Preceptor Table and simply states that the outcome for person \(i\) is explained by the predictor variables (\(treatment\) and \(control\)) for person \(i\), along with an error term.

8.3.2 Courage

Courage

FIGURE 8.5: Courage

With Justice satisfied, we gather our Courage and fit the model. Note that, except for the change in variable names, the code is exactly the same as it was above, in our predictive model for age. Predictive models and causal models use the same math and the same code. The differences, and they are very important, lie in the interpretation of the results, not in their creation.

fit_2 <- stan_glm(att_end ~ treatment - 1, 
                      data = trains, 
                      seed = 45,
                      refresh = 0)

fit_2
## stan_glm
##  family:       gaussian [identity]
##  formula:      att_end ~ treatment - 1
##  observations: 115
##  predictors:   2
## ------
##                  Median MAD_SD
## treatmentTreated 10.0    0.4  
## treatmentControl  8.5    0.3  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 2.8    0.2   
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

treatmentTreated corresponds to \(\beta_1\). As always, R has, behind the scenes, estimated the entire posterior probability distribution for \(\beta_1\). We will graph that distribution in the next section. But the basic print method for these objects can’t show the entire distribution, so it gives us summary numbers: the median and the MAD SD. Speaking roughly, we would expect about 95% of the values in the posterior to be within two MAD SD’s of the median. In other words, we are 95% confident that the true, but unknowable, average attitude toward immigration among the Treated in the population to be between 9.2 and 10.8.

treatmentControl corresponds to \(\beta_2\). The same analysis applies. We are about 95% confident that the true value for the average attitude toward immigration for Control in the population is between 7.9 and 9.1.

Up until now, we have used the Bayesian interpretation of “confidence interval.” This is also the intuitive meaning which, outside of academia, is almost universal. There is a truth out there. We don’t know, and sometimes can’t know, the truth. A confidence interval, and its associated confidence level, tells us how likely the truth is to lie within a specific range. If your boss asks you for a confidence interval, she almost certainly is using this interpretation.

But, in contemporary academic research, the phrase “confidence interval” is usually given a “Frequentist” interpretation. (The biggest divide in statistics is between Bayesians and Frequentist interpretations. The Frequentist approach, also known as “Classical” statistics, has been dominant for 100 years. Its power is fading, which is why this textbook uses the Bayesian approach.) For a Frequentist, a 95% confidence interval means that, if we were to apply the procedure we used in an infinite number of future situations like this, we would expect the true value to fall within the calculated confidence intervals 95% of the time. In academia, a distinction is sometimes made between confidence intervals (which use the Frequentist interpretation) and credible intervals (which use the Bayesian interpretation). We won’t worry about that difference in this Primer.

Let’s look at the full posteriors for both \(\beta_1\) and \(\beta_2\).

fit_2 %>% 
  as_tibble() %>% 
  select(-sigma) %>% 
  pivot_longer(cols = treatmentTreated:treatmentControl,
               names_to = "Parameter",
               values_to = "attitude") %>% 
  ggplot(aes(x = attitude, fill = Parameter)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5, 
                   bins = 100, 
                   position = "identity") +
    labs(title = "Posterior for Expected Attitude Toward Immigration",
         subtitle = "Treated individuals are more conservative",
         x = "Attitude",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) + 
    theme_classic()

It appears that the affect of the treatment is to change people’s attitudes to be more conservative about immigration issues. Which is somewhat surprising!

We can decompose the the dependent variable, att_end into two parts: the fitted values and the residuals. There are only two possible fitted values, one for the Treated and one for the Control. The residuals, as always, are simply the difference between the outcomes and the fitted values.

The smaller the spread of the residuals, the better a job the model is doing of explaining the outcomes.

8.3.3 Temperance

Temperance

FIGURE 8.6: Temperance

Recall the first question with which we began this section:

  • What is the average treatment effect, of exposing people to Spanish-speakers, on their attitudes toward immigration?

Chapter 4 defined the average treatment effect. One simple estimator of the average treatment effect is the difference between \(\beta_1\) and \(\beta_2\). After all, the definition of \(\beta_1\) is the average attitude toward immigration, of the population, for anyone, under exposure to the treatment. So, \(\beta_1 - \beta_2\) is the average treatment effect for the population, roughly 1.5. However, estimating the posterior probability distribution for this parameter is tricky, unless we make use of the posterior distributions of \(\beta_1\) and \(\beta_2\). With that information, the problem is simple:

fit_2 %>% 
  as_tibble() %>% 
  mutate(ate = treatmentTreated - treatmentControl) %>% 
  ggplot(aes(x = ate)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior for Average Treatment Effect",
         subtitle = "Exposure to Spanish-speakers shifts immigration attitudes rightward",
         x = "Difference in Attitude",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

Could the true value of the average treatment effect be as much as 2 or as little as 1? Of course! The most likely value is around 1.5, but the variation in the data and the smallness of our sample cause the estimate to be imprecise. However, it is quite unlikely that the true average treatment effect is below zero.

We can use posterior_epred() to answer this question. Create a tibble and use it as we have done before:

newobs <- tibble(treatment = c("Treated", "Control"))

pe <- posterior_epred(fit_2, newobs) %>% 
    as_tibble() %>% 
    mutate(ate = `1` - `2`)

pe
## # A tibble: 4,000 x 3
##      `1`   `2`   ate
##    <dbl> <dbl> <dbl>
##  1 10.0   8.56 1.47 
##  2  9.94  8.33 1.61 
##  3 10.0   8.48 1.56 
##  4 10.6   8.28 2.36 
##  5 10.6   8.29 2.32 
##  6  9.26  8.81 0.449
##  7 10.4   8.42 1.96 
##  8 10.5   8.28 2.26 
##  9 10.8   8.51 2.25 
## 10 10.3   8.37 1.88 
## # … with 3,990 more rows

The posterior probability distribution created with posterior_epred() is the same as the one produced by manipulating the parameters directly.

pe %>% 
  ggplot(aes(x = ate)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100) +
    labs(title = "Posterior for Average Treatment Effect",
         subtitle = "Exposure to Spanish-speakers shifts immigration attitudes rightward",
         x = "Difference in Attitude",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

Our second question:

  • What is the largest effect size which still has a 1 in 10 chance of occurring?

Create a tibble which we can pass to posterior_predict(). The variables in the tibble which will be passed in as newdata. Fortunately, the tibble we created above is just what we need for this question also.

Consider the result of posterior_predict() for two people, one treated and one control. Take the difference.

pp <- posterior_predict(fit_2, 
                        newdata = newobs) %>%
    as_tibble() %>%
    mutate_all(as.numeric) %>% 
    mutate(te = `1` - `2`)
  
pp
## # A tibble: 4,000 x 3
##      `1`   `2`       te
##    <dbl> <dbl>    <dbl>
##  1  9.80  4.31  5.49   
##  2  7.34  6.84  0.505  
##  3 12.0   7.13  4.89   
##  4 14.7   9.67  5.04   
##  5 13.5   7.54  5.91   
##  6  9.65 14.4  -4.77   
##  7  9.28  6.42  2.86   
##  8 12.7   5.88  6.82   
##  9 10.4  10.4  -0.00962
## 10  9.68  9.59  0.0967 
## # … with 3,990 more rows

Create a graphic:

pp %>% 
  ggplot(aes(x = te)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for Treatment Effect for One Person",
         subtitle = "Causal effects are more variable for indvduals",
         x = "Difference in Attitude",
         y = "Probability") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

In this case, we are looking at the distribution of the treatment effect for a single individual. This is very different than the average treatment effect. In particular, it is much more variable. We are looking at one row in the Preceptor Table. For a single individual, att_end can be anywhere from 3 to 15, both under treatment and under control. The causal effect — the difference between the two potential outcomes can, in theory, be anywhere from -12 to +12. Such extreme values are rare, but not impossible.

The question, however, was interested in the value at the 90th percentile.

quantile(pp$te, prob = 0.9)
## 90% 
## 6.5

We would not expect a treatment effect of this magnitude to be common, but, at the same time, effects this big and bigger will occur about 10% of the time.

8.4 income ~ age

So far, we have only created models in which the predictor variable is discrete, with two possible values. party is either “Democrat” or “Republican.” treatment is either “Treated” or “Control.” Often times, however, the predictor variable will be continuous. Fortunately, the exact same approach works in this case. Consider:

What would you expect the income to be for a 40-year old?

8.4.1 Justice

Once again, in Justice, we must consider validity, stability, representativeness, and the data generating mechanism.

Let’s look at the validity of age and income. Recall that validity refers to whether or not we can consider the columns age and income to have the same meaning in our data set of 2012 Boston train commuters and in our Preceptor Table. While age doesn’t really change meaning over time, income can be impacted by inflation. After all, $100,000 in 2012 doesn’t have the same worth as $100,000 now due to inflation and now would have less purchasing power. This would result in income being underestimated within our model. However, since there hasn’t been drastic inflation that dramatically changed the buying power, we will consider income to be valid. If there had been like 300% inflation, however, our conclusion would probably be different.

Now, let’s consider the stability of our model and if we believe that our relationship between age and income has changed between 2012 and now. Once again, let’s consider inflation and how that could impact income. If incomes were to increase at the rate of inflation, then the income distribution would be different than that of 2012. However, wages don’t tend to change as quickly as inflation does, so they likely did not change significantly and we can consider this model to be stable.

Next, let’s consider another issue that we may have with representativeness. What if we were now to assume that Boston train commuters are perfectly representative of US train commuters AND those who were approached to respond to the survey were also random? Even if this were true, we could still not assume representativeness because those who actually complete and submit the survey are not random. Instead of having more young people chosen to respond by those handing out surveys like we discussed in our last model, what if were to assume that when the surveys are handed out randomly, younger people tended to fill out and submit them more than those who are older? Well, this would still skew the age distribution and overestimate younger people in the population, and if younger people also tend to have a lower income than older people, this could also alter our answers to our current questions.

If we had reason to believe this is true, one way that we could fix this issue of representativeness is to alter our population to be train commuters in the US who would respond to the survey. In doing so, our population would then accommodate for the skewed age distribution under the assumption that younger individuals tend to respond to surveys at higher rates than older people.

Justice

FIGURE 8.7: Justice

The mathematics for a continuous predictor is unchanged from the intercept-including example we explored in Section 8.2.4:

\[y_i = \beta_0 + \beta_1 age_i + \epsilon_i\]

When comparing two people (persons 1 and 2), the first one year older than the second, \(\beta_1\) is the expected difference in their incomes. The algebra is simple. Start with the two individuals.

\[y_1 = \beta_0 + \beta_1 age_1\] \[y_2 = \beta_0 + \beta_1 age_2\] We want the difference between them, so we subtract the second from the first, performing that subtraction on both sides of the equals sign.

\[y_1 - y_2 = \beta_0 + \beta_1 age_1 - \beta_0 - \beta_1 age_2\\ y_1 - y_2 = \beta_1 age_1 - \beta_1 age_2\\ y_1 - y_2 = \beta_1 (age_1 - age_2)\]

So, if person 1 is one year older than person 2, we have:

\[y_1 - y_2 = \beta_1 (age_1 - age_2)\\ y_1 - y_2 = \beta_1 (1)\\ y_1 - y_2 = \beta_1\]

The algebra demonstrates that \(\beta_1\) is the same for all ages. The difference in expected income between two people aged 23 and 24 is the same as the difference between two people aged 80 and 81. Is that plausible? Maybe. The algebra does not lie. When we create a model like this, this is the assumption we are making.

Note how careful we are not to imply that increasing age by one year “causes” an increase in income. That is nonsense! No causation without manipulation. Since it is impossible to change someone’s age, there is only one potential outcome. With only one potential outcome, a causal effect is not defined.

8.4.2 Courage

Courage

FIGURE 8.8: Courage

The use of stan_glm() is the same as usual.

fit_3 <- stan_glm(income ~ age, 
                  data = trains, 
                  seed = 28,
                  refresh = 0)

print(fit_3, details = FALSE)
## stan_glm
##  family:       gaussian [identity]
##  formula:      income ~ age
##  observations: 115
##  predictors:   2
## ------
##             Median   MAD_SD  
## (Intercept) 104783.8  25577.1
## age            877.7    578.5
## 
## Auxiliary parameter(s):
##       Median  MAD_SD 
## sigma 74186.1  4820.3
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

When comparing two individuals, one 30 years old and one 40, we expect the older to earn about $9,000 more. But we are far from certain: the 95% confidence interval ranges from -$3,000 to $20,000.

The above is a good summary of the models.

  • It is brief! No one wants to listen to too much of your prattle. One sentence gives a number of interest. The second sentence provides a confidence interval.

  • It rounds appropriately. No one wants to hear a bunch of decimals. Use sensible units.

  • It does not just blindly repeat numbers in the printed display. A one year difference in age, which is associated with a $900 difference in income, is awkward. (We think.) A decade comparison is more sensible.

  • “When comparing” is a great phrase to start the summary of any non-causal model. Avoid language like “associated with” or “leads to” or “implies” or anything which even hints at a causal claim.

Consider our usual decomposition of the outcome into two parts: the model and the error term.

There are scores of different fitted values. Indeed, there are a greater number of different fitted values than there are different outcome values! This is often true for models which have continuous predictor variables, we have here with age.

8.4.3 Temperance

Temperance

FIGURE 8.9: Temperance

Recall our question:

What would you expect the income to be for a random 40 year old?

Given that we are looking for an expected value, we use posterior_epred().

newobs <- tibble(age = 40)

pe <- posterior_epred(fit_3, newdata = newobs) %>% 
  as_tibble() 

pe
## # A tibble: 4,000 x 1
##        `1`
##      <dbl>
##  1 131988.
##  2 132900.
##  3 143766.
##  4 149269.
##  5 138030.
##  6 135142.
##  7 145487.
##  8 147361.
##  9 131460.
## 10 147589.
## # … with 3,990 more rows

Plotting is the same as always.

pe %>% 
  ggplot(aes(x = `1`)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for Expected Income",
         subtitle = "A 40-years old commuter earns around $140,000",
         x = "Income",
         y = "Probability") +
    scale_x_continuous(labels = scales::dollar_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()

8.5 liberal ~ income

So far in this chapter, we have only considered continuous outcome variables. age, att_end and income all take on a variety of values. None of them are, truly, continuous, of course. age is only reported as an integer value. att_end can only, by definition, take on 13 distinct values. However, from a modeling perspective, what matters is that they have more than 2 possible values.

liberal, however, only takes on two values: TRUE and FALSE. In order to model it, we must use the binomial family. We begin, as always, with some questions:

Among all people who have an income $100,000, what proportion are liberal?

Assume we have a group of eight people, two of whom make $100,000, two $200,000, two $300,000 and two $400,000. How many will be liberal?

8.5.1 Justice

Let’s now consider Justice for our relationship between liberal and income.

First, let’s look at validity, especially pertaining to liberal. As we know, this column has the values “Liberal” and “Not Liberal” to convey the political ideology of the train commuters. What we must determine if the column liberal and therefore the meaning of being liberal is the same in Boston in 2012 and Boston in 2021. If we can determine these to be the same, then we can assume validity, and in this case, because the core beliefs of being liberal have not changed very much between 2012 and 2021, we can determine that the data is valid.

Now, let’s consider stability. To do so, we must look at the relationship between liberal and income and determine whether or not we believe that this relationship has changed between 2012 and 2021. With our knowledge of the world, do we have any reason to believe that this has changed? What if between 2012 and 2021, income increased for those who are liberal? Well, then our model for the the relationship between income and liberal would have changed over the years, and therefore so would the model. However, since with our knowledge of the world we have no reason to believe that something of the sorts happened to affect this relationship, we can consider our model to be stable.

In our past three models, we have considered 3 possible issues that we could have with the representativeness of our model, such as the difference between Boston and other cities, problems with random sampling, and bias in those who respond, and we will now consider how in this survey there were surveys given before and after treatment, and some people may have filled out only one of the surveys. People who only filled out one of the surveys could affect the representativeness of the data because they could not be included in the data and if those who only filled out one survey tended to be liberal, then this would affect our data because it would underestimate the amount of liberals in the survey. This is something we must consider when looking at representativeness, since we could otherwise not determine if this data from train commuters in Boston in 2012 is representative enough of train commuters in the US now to continue using our data.

Recall the discussion in Section 7.5 about the logistic regression model which we use whenever the outcome or dependent variable is binary/logical. The math is there, if you care about math. We don’t, at least not too much. Reminder:

\[p(\text{Liberal}_i = \text{TRUE}) = \frac{\text{exp}(\beta_0 + \beta_1 \text{income}_i)}{1 + \text{exp}(\beta_0 + \beta_1 \text{income}_i)}\]

This model only has two parameters, \(\beta_0\) and \(\beta_1\). But these parameters do not have simple interpretations, unlike the parameters in a linear (or gaussian) model.

Recall the fundamental structure of all data science problems:

\[\text{outcome} = \text{model} + \text{what is not in the model}\] The exact mathematics of the model — the parameters, their interpretations — are all just dross in the foundry of our inferences: unavoidable but not worth too much of our time.

Even if the math is ignorable, the causal versus predictive nature of the model is not. Is this a causal model or a predictive model? It depends! It could be causal if you assume that we can manipulate someone’s income, if, that is, there are at least two potential outcomes: person \(i\)’s liberal status if she makes X dollars and person \(i\)’s liberal status if she makes Y dollars. Remember: No causation without manipulation. The definition of a causal effect is the difference between two potential outcomes. If you only have one outcome, then your model can not be causal.

In many circumstances, we don’t really care if a model is causal or not. We might only want to forecast/predict/explain the outcome variable. In that case, whether we can interpret the influence of a variable as causal is irrelevant to our use of that variable.

8.5.2 Courage

Fitting a logistic model is easy. We use all the same arguments as usual, but with family = binomial added.

fit_4 <- stan_glm(data = ch8,
                  formula = liberal ~ income,
                  family = binomial,
                  refresh = 0,
                  seed = 365)

Having fit the model, we can look at a printed summary. Note the use of the digits argument to display more digits in the printout.

print(fit_4, digits = 6)
## stan_glm
##  family:       binomial [logit]
##  formula:      liberal ~ income
##  observations: 115
##  predictors:   2
## ------
##             Median   MAD_SD  
## (Intercept)  5.6e-01  4.3e-01
## income      -6.0e-06  3.0e-06
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

Fitted models tell us about the posterior distributions for the parameters in the formula which defines the model we have estimated. We are assuming that the model is true. And, as discussed in Chapter 6, that assumption is always false! Our model is never a perfectly accurate representation of reality. But, if it were perfect, then the posterior distributions which we have created for \(\beta_0\), \(\beta_1\), and so on would be perfect as well.

When working with a linear model, we will often interpret the meaning of the parameters, as we have already done in the first three sections of this chapter. Such interpretations are much harder with logistic models because the math is much less convenient. So, we won’t even bother to try to understand the meaning of these parameters. However, we can note that \(\beta_1\) is negative, suggesting that people with higher incomes are less likely to be liberal.

8.5.3 Temperance

Among all people who have an income $100,000, what proportion are liberal?

Although our model is now logistic, all the steps in answering a question like this are the same as with a linear/guassian model.

newobs <- tibble(income = 100000)

pe <- posterior_epred(fit_4, 
                      newdata = newobs) %>% 
  as_tibble()

pe is a tibble with a single vector. That vector is 4,000 draws from the posterior distribution of proportion of people, among those who make $100,000, who are liberal. The population proportion is the same thing as the probability for any single individual.

pe %>% 
  ggplot(aes(x = `1`)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for Proportion Liberal Among $100,000 Earners",
         subtitle = "The population proportion is the same as the probability for any individual",
         x = "Income",
         y = "Probability of Being Liberal") +
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_classic()

Assume we have a group of eight people, two of whom make $100,000, two $200,000, two $300,000 and two $400,000. How many will be liberal?

Because we are trying to predict the outcome for a small number of units, we use posterior_predict(). The more complex the questions we ask, the more care we need to devote to making the newobs tibble. We use the same rowwise() and c_across() tricks as earlier in the chapter.

newobs <- tibble(income = c(rep(100000, 2),
                            rep(200000, 2),
                            rep(300000, 2),
                            rep(400000, 2)))
                 

pp <- posterior_predict(fit_4, 
                        newdata = newobs) %>% 
  as_tibble() %>% 
  mutate_all(as.numeric) %>% 
  rowwise() %>% 
  mutate(total = sum(c_across()))

pp
## # A tibble: 4,000 x 9
## # Rowwise: 
##      `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8` total
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     0     1     1     0     0     0     0     0     2
##  2     0     0     0     1     0     0     1     0     2
##  3     0     0     0     1     1     0     0     0     2
##  4     1     1     0     0     1     0     0     0     3
##  5     0     1     0     0     0     0     0     0     1
##  6     0     0     0     0     0     0     0     1     1
##  7     0     1     0     0     0     1     0     0     2
##  8     0     1     1     0     0     0     0     0     2
##  9     0     1     1     0     0     0     0     0     2
## 10     0     0     0     0     1     0     1     0     2
## # … with 3,990 more rows

Study the pp tibble. Understand its component parts. The first column, for example, is 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $100,000. Note how all the draws are zeroes or ones. That is very different from the draws we have seen before! But it also makes sense. We are making a prediction about a binary variable, a variable which only have two possible values: zero or one. So, any (reasonable!) predictions will only be zero or one.

The second column is the same thing as the first column. Both are 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $100,000. Yet they also have different values. They are both the same thing and different things, in the same way that rnorm(10) and rnorm(10) are the same thing — both are 10 draws from the standard normal distribution — and different things in that the values vary.

The third and fourth columns are different from the first two columns. They are both 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $200,000. And so on for later columns. We can answer very difficult questions by putting together simple building blocks, each of them a set of draws from a posterior distribution. Recall the discussion in Section 2.8.

The total column is simply the sum of the first eight columns. Having created the building blocks with 8 columns of draws from four different posterior distributions, we can switch our focus to each row. Consider row 2. It has a vector of 8 numbers: 1 1 1 0 0 1 0 0. We can treat that vector as a unit of analysis. This is what might happen with our 8 people. The first three might be liberal, the fourth not liberal and so on. This row is just one example of what might happen, one draw from the posterior distribution of possible outcomes for groups of eight people with these incomes.

We can simplify this draw by taking the sum, or doing anything else which might answer the question with which we are confronted. Posterior distributions are as flexible as individual numbers. We can, more or less, just use algebra to work with them.

Graphically we have:

pp %>% 
  ggplot(aes(x = total)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   bins = 100)  +
    labs(title = "Posterior for Number of Liberals in Group with Varied Incomes",
         subtitle = "Two is the most likely number, but values from 0 to 5 are plausible",
         x = "Number of Liberals",
         y = "Probability") +
    scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_classic()

As always, there is some truth. If, tomorrow, we were to meet 8 new people, with the specified incomes, a certain number of them would be liberal. If we had the ideal Preceptor Table, we could just look up that number. No data science required. Alas, we don’t know the truth. The bets we can do is to create a posterior distribution for that unknown value, as we have done here. We then need to translate that posterior into English — “The most likely number of liberals is 2 or 3, but a total as low as zero or as high as 5 is also plausible. Having 6 liberals would be really surprising. Having 7 or 8 is almost impossible.”

Are these two posterior probability distributions perfect? No! This is the central message of the virtue of Temperance. We must demonstrate our humility when we use our models. Recall the distinction between the unknown true distribution and the estimated distribution. The first is the posterior distribution we would create of we understood every detail of the process and could accurately model it. We would still not know the true unknown number, but our posterior distribution for that number would be perfect. Yet, our model is never perfect. We are making all sorts of assumptions behind the scenes. Some of those assumptions are plausible. Others are less so. Either way, the estimated distribution is what we have graphed above.

The central lesson of Temperance is: Don’t confuse the estimated posterior (which is what you have) with the true posterior (which is what you want). Recognize the unavoidable imperfections in the process. You can still use your estimated posterior — what choice do you have? — but be cautious and humble in doing so. The more that you suspect that your estimated posterior differs from the true posterior, the more humble and cautious you should be.

8.6 Summary

In this chapter, we explored relationships between different variables in the trains data set. We built three predictive models and one causal model.

Similar to previous chapters, our first task is to use Wisdom . We judge how relevant our data is to the questions we ask. Is it reasonable to consider the data we have (e.g., income and age data from Boston commuters in 2012) as being drawn from the same population as the data we want to have (e.g., income and age data from today for the entire US)? Probably?

Justice is necessary to decide the best way to represent the models we make. A little math won’t kill you. We use Courage to translate our models into code. Our goal is to understand, generate posterior distributions for the parameters, and interpret their meaning. Temperance leads us to the final stage, using our models to answer our questions.

Key commands:

  • Create a model with stan_glm().

  • Use posterior_epred() to estimate expected values. The e in epred stands for expected.

  • Use posterior_predict() to make forecasts for individuals. The variable in predictions is always greater than the variability in expectations because predictions can’t pretend that \(\epsilon_i\) is zero.

Once we have draws from a posterior distribution for our outcome variable — whether that be an expectation or a prediction — we can manipulate those draws to answer our question.

Remember:

  • Always explore your data.

  • Predictive models care little about causality.

  • Predictive models and causal models use the same math and the same code.

  • “When comparing” is a great phrase to start the summary of any non-causal model.

  • Don’t confuse the estimated posterior (which is what you have) with the true posterior (which is what you want). Be humble and cautious in your use of the posterior.