8 Three Parameters
Models have parameters. In Chapter 6 we created models with a single parameter \(p\), the proportion of red beads in an urn. In Chapter 7, we used models with two parameters: \(\mu\) (the average height in the population, generically known as a model “intercept”) and \(\sigma\) (the variation in height in the population). Here — can you guess where this is going? — we will build models with three parameters: \(\sigma\) (which serves the same role throughout the book) and two “coefficients.” In models which relate a continuous predictor to the outcome, those two parameters will be labeled \(\beta_0\) and \(\beta_1\). In models which estimate two averages, the parameters will be \(\beta_1\) and \(\beta_2\). All this notation is confusing, not least because different academic fields use inconsistent schemes. Follow the Cardinal Virtues and tackle your problem step by step.
8.1 Wisdom
8.1.1 Preceptor Table
Wisdom begins with considering the questions we desire to answer and the data set we are given. In this chapter, we are going to ask a series of questions involving train commuters’ ages, party affiliations, incomes, and political ideology, as well as the causal effect of exposure to Spanish-speakers on their attitude toward immigration. These questions will pertain to all train commuters in the US today. Given these types of questions, the Preceptor Table would be:
Preceptor Table | ||||||
ID | Age | Party | Income | Liberal | Control Ending Attitude | Treated Ending Attitude |
---|---|---|---|---|---|---|
Commuter 1 |
23 |
Democrat |
50000 |
Liberal |
3 |
8 |
Commuter 2 |
18 |
Republican |
150000 |
Not Liberal |
7 |
7 |
... |
... |
... |
... |
... |
... |
... |
Commuter 1000 |
49 |
Republican |
100000 |
Liberal |
4 |
8 |
Commuter 1001 |
38 |
Democrat |
200000 |
Liberal |
9 |
7 |
... |
... |
... |
... |
... |
... |
... |
Recall: a Preceptor Table is the smallest possible table such that, if there is no missing data, all our questions are easy to answer. To answer questions — like “What is, today, the average age of train commuters in the US?” — we need a row for every train commuter.
Notice that one of our questions is about a causal effect: What change in immigration attitudes is caused by being exposed to Spanish-speakers? Answering causal questions requires (at least) two potential outcomes: immigration attitude for those who receive the treatment of being exposed to Spanish-speakers and for those who do not.
Having created the Preceptor Table, we now look at the data we have: the trains
data set from the primer.data package.
8.1.2 EDA for trains
Always explore your data. Recall the discussion from Chapter 4. Enos (2014) randomly placed Spanish-speaking confederates on nine train platforms around Boston, Massachusetts. Exposure to Spanish-speakers – the treatment
– influenced attitudes toward immigration. These reactions were measured through changes in answers to three survey questions. Load the necessary libraries and look at the data.
glimpse(trains)
Rows: 115
Columns: 14
$ treatment <fct> Treated, Treated, Treated, Treated, Control, Treated, C…
$ att_start <dbl> 11, 9, 3, 11, 8, 13, 13, 10, 12, 9, 10, 11, 13, 6, 8, 1…
$ att_end <dbl> 11, 10, 5, 11, 5, 13, 13, 11, 12, 10, 9, 9, 13, 7, 8, 1…
$ gender <chr> "Female", "Female", "Male", "Male", "Male", "Female", "…
$ race <chr> "White", "White", "White", "White", "White", "White", "…
$ liberal <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, F…
$ party <chr> "Democrat", "Republican", "Democrat", "Democrat", "Demo…
$ age <int> 31, 34, 63, 45, 55, 37, 53, 36, 54, 42, 33, 50, 24, 40,…
$ income <dbl> 135000, 105000, 135000, 300000, 135000, 87500, 87500, 1…
$ line <chr> "Framingham", "Framingham", "Framingham", "Framingham",…
$ station <chr> "Grafton", "Southborough", "Grafton", "Grafton", "Graft…
$ hisp_perc <dbl> 0.0264, 0.0154, 0.0191, 0.0191, 0.0191, 0.0231, 0.0304,…
$ ideology_start <int> 3, 4, 1, 4, 2, 5, 5, 4, 4, 4, 3, 5, 4, 1, 2, 2, 3, 4, 2…
$ ideology_end <int> 3, 4, 2, 4, 2, 5, 5, 4, 3, 4, 3, 4, 4, 1, 2, 3, 3, 1, 2…
The data include information about each respondent’s gender, political affiliations, age, income and so on. treatment
indicates whether a subject was in the control or treatment group. The key outcomes are their attitudes toward immigration both before (att_start
) and after (att_end
) the experiment. Type ?trains
to read the help page for more information about each variable.
Let’s restrict attention to a subset of the variables we need to answer our questions, as specified in the Preceptor Table. age
is the age of the respondent; party
is their political party affiliation; liberal
is whether they are liberal or not; income
is the income of the respondent; treatment
specifies whether the respondent was given the treatment of being exposed to Spanish-speakers in the train station; and att_end
tell us about the respondent’s attitude on immigration after the treatment, .
ch8 <- trains |>
select(age, att_end, party, income, treatment, liberal)
It is always smart to look at a some random slices of the data:
ch8 |>
slice_sample(n = 5)
# A tibble: 5 × 6
age att_end party income treatment liberal
<int> <dbl> <chr> <dbl> <fct> <lgl>
1 36 11 Democrat 135000 Treated FALSE
2 42 10 Democrat 62500 Treated TRUE
3 44 6 Democrat 62500 Control TRUE
4 22 11 Democrat 135000 Control FALSE
5 50 7 Democrat 135000 Treated TRUE
att_end
is a measure of person’s attitude toward immigration. A higher number means more conservative, i.e., a more exclusionary stance toward immigration into the United States.
ch8 |>
glimpse()
Rows: 115
Columns: 6
$ age <int> 31, 34, 63, 45, 55, 37, 53, 36, 54, 42, 33, 50, 24, 40, 53, …
$ att_end <dbl> 11, 10, 5, 11, 5, 13, 13, 11, 12, 10, 9, 9, 13, 7, 8, 13, 8,…
$ party <chr> "Democrat", "Republican", "Democrat", "Democrat", "Democrat"…
$ income <dbl> 135000, 105000, 135000, 300000, 135000, 87500, 87500, 135000…
$ treatment <fct> Treated, Treated, Treated, Treated, Control, Treated, Contro…
$ liberal <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
Pay attention to the variable types. Do they make sense? Perhaps. But there are certainly grounds for suspicion. Why is att_end
a double rather than an integer? All the values in the data appear to be integers, so there is no benefit to having these variables be doubles. Why is party
a character variable and treatment
a factor variable? It could be that these are intentional choices made by the creator of the tibble, i.e., us. Or, these could be mistakes. Most likely, these choices are a mixture of sensible and arbitrary. Regardless, it is your responsibility to notice them. You can’t make a good model without looking closely at the data which you are using.
ch8 |>
skim()
Name | ch8 |
Number of rows | 115 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 1 |
factor | 1 |
logical | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
party | 0 | 1 | 8 | 10 | 0 | 2 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
treatment | 0 | 1 | FALSE | 2 | Con: 64, Tre: 51 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
liberal | 0 | 1 | 0.44 | FAL: 64, TRU: 51 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 42.37 | 12.20 | 20 | 33 | 43 | 52 | 68 | ▆▇▇▇▃ |
att_end | 0 | 1 | 9.14 | 2.87 | 3 | 7 | 9 | 11 | 15 | ▂▃▇▃▃ |
income | 0 | 1 | 141813.04 | 74476.64 | 23500 | 87500 | 135000 | 135000 | 300000 | ▅▇▇▁▆ |
skim()
shows us what the different values of treatment
are because it is a factor. Unfortunately, it does not do the same for character variables like party
. The ranges for age
and att_end
seem reasonable. Recall that participants were asked three questions about immigration issues, each of which allowed for an answer indicated strength of agreement on a scale form 1 to 5, with higher values indicating more agreement with conservative viewpoints. att_end
is the sum of the responses to the three questions, so the most liberal possible value is 3 and the most conservative is 15.
8.1.3 Population
The data that we have are very limited. There are only 115 observations, all from 2012 and involving train commuters to Boston. Can we assume that the the data we have and the data in our Preceptor Table are drawn from the same population? Only your judgment, along with advice from your colleagues, can guide you.
There is no truth here. The data is real enough, but you created the Preceptor Table. Whether or not there is a population from which you can assume both the data and the Preceptor Table might have been drawn is not a TRUE/FALSE question.
The Preceptor Table is all US train commuters today. However, the data we have involves train commuters in the Boston area in 2012. So, what we must ask ourselves is if this data from Boston in 2012 can be drawn from the same population as all train commuters in the US today.
The key concept is the idea of a “population.” From which larger population is the data we have being (conceptually) drawn? If we were only interested in the age of individuals in our data set, we would have no need for inference. We know everyone’s ages already. We only need tools like stan_glm()
if we seek to understand individuals not in our data.
The issue is always: Do the data we have and the data we want to have come from the same population? If there is no connection between the two, progress is impossible. But, in this case, if we are willing to consider the population to be US residents over the last decade (including today), and if we are willing to assume that there is a single population from which the Preceptor Table and our data set is drawn, then we can use our data to create a model to answer our questions.
8.1.4 Population Table
Having determined that the Preceptor Table and the data were drawn from the same population, we can produce a Population Table.
The Population Table includes rows from three sources: the Preceptor Table, the actual data, and all other members of the population.
The rows in the Preceptor Table contain the information that we would want to know in order to answer our questions. These rows contain entries for our covariates (city and year) but they do not contain any outcome results. We are trying to answer questions about the train commuter population in 2021, so our city entries for these rows will vary and our year entries of these rows will read “2021”.
Our actual data rows contain the information that we do know. These rows contain entries for both our covariates and the outcomes. In this case, the actual data comes from a study conducted on train commuters around Boston, MA in 2012, so our city entries for these rows will read “Boston, MA” and our year entries of these rows will read “2012”.
Our other rows contain no data. These are subjects which are members of our desired population, but for which we have no data. As such, all outcomes and covariates are missing. The “Other” rows his may include, for example, data from a range of years before and after 2012, since we considered little change overtime, so we would also consider this data to be representative and drawn from the larger population of train commuters in years near 2012.
The population table includes commuters from our Preceptor Table with the information that we would ideally have to answer the questions and those from the data we have that is specific to Boston, MA. Additionally, the population table also includes the groups of people within the population from which both the data we have and the Preceptor table is drawn from that we don’t have.
Population Table | ||||||||
Source | City | Year | Age | Party | Income | Liberal | Treatment | Control |
---|---|---|---|---|---|---|---|---|
Other |
? |
2008 |
? |
? |
? |
? |
? |
? |
Other |
? |
2009 |
? |
? |
? |
? |
? |
? |
... |
... |
... |
... |
... |
... |
... |
... |
... |
Data |
Boston, MA |
2012 |
43 |
Democrat |
150000 |
Liberal |
6 |
? |
Data |
Boston, MA |
2012 |
52 |
Republican |
50000 |
Not Liberal |
? |
2 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
Other |
? |
2013 |
? |
? |
? |
? |
? |
? |
Other |
? |
2014 |
? |
? |
? |
? |
? |
? |
... |
... |
... |
... |
... |
... |
... |
... |
... |
Preceptor Table |
Wilmington, DE |
2021 |
45 |
Republican |
45000 |
Not Liberal |
? |
? |
Preceptor Table |
Atlanta, GA |
2021 |
65 |
Republican |
15000 |
Not Liberal |
? |
? |
... |
... |
... |
... |
... |
... |
... |
... |
... |
Other |
? |
2025 |
? |
? |
? |
? |
? |
? |
Other |
? |
2027 |
? |
? |
? |
? |
? |
? |
Our Preceptor Table rows are chronologically ordered. The “other” rows our greater population for which we are making assumptions — this is why the year at the start is earlier than our data and the year at the end is later than our Preceptor Table. This shows the more expansive population that we are inferring about.
8.2 age ~ party
We want to build a model and then use that model to make claims about the world. Our questions about the relationship between age
and party
are the following:
What is the expected age of a Democrat at the train station?
In a group of three Democrats and three Republicans, what will the age difference be between the oldest Democrat and the youngest Republican?
We can answer these and similar questions by creating a model that uses party affiliation to predict age
8.2.1 Justice
Justice consists of four topics: validity, stability, representativeness, and the data generating mechanism (DGM).
To understand validity in regards to the Population Table, we must first recognize an inherent flaw in any experiment design: no two units receive exactly the same treatment. If this doesn’t ring true, consider our Spanish speaking train experiment. The units on the Spanish-speaking platform received the same treatment, right? No, actually! Every unit heard Spanish at a different decibel level. Every unit also heard Spanish for a different amount of time, depending on when they arrived to the station. Thus, two units can have the same treatment — that is, hearing Spanish on the platform — while having very different versions of that treatment. This is why it is crucial to define one’s estimand precisely: if we are interested in the difference in potential outcomes between Spanish being spoken for 10 minutes at a 60 dB versus control, we can ignore all the other possible columns in the Population Table.
We then must consider the stability of our model. Stability means that the relationship between the columns is the same for three categories of rows: the data, the Preceptor table, and the larger population from which both are drawn. With something like height, it is much easier to assume stability over a greater period of time. Changes in global height occur extremely slowly, so height being stable across a span of 20 years is reasonable to assume. With something like political ideology, it is much harder to make the assertion that data collected in 2010 would be stable to data collected in 2025. When we are confronted with this uncertainty, we can consider making our timeframe smaller. However, we would still need to assume stability from 2014 (time of data collection) to today. Stability allows us to ignore the issue of time.
Next, let’s consider representativeness. Representativeness has to do with how well our sample represents the larger population we are interested in generalizing to. Does the train experiment allow us to calculate a causal effect for people who commute by cars? Can we calculate the causal effect for people in New York City? Before we generalize to broader populations we have to consider if our experimental estimates are applicable beyond our experiment. Generally: if there was no chance that a certain type of person would have been in this experiment, we cannot make an assumption for that person.
Now let’s discuss the kind of DGM we will be using. For our purposes, we will be choosing between linear or logistic, however, there are several different types of distributions which we will not be dealing with at the moment. Let’s recall the difference between linear and logistic models, which is dependent upon the outcome variable. If the outcome variable is continuous, we will use the a linear model, and when the outcome variable is binary, or only has two options, we will use a logistic model. In this case, our outcome variable is age which is continous, so it is a linear model.
We can also consider the type of DGM. Any DGM with age as its dependent variable will be predictive, not causal, for the simple reason that nothing, other than time, can change your age. You are X years old. It would not matter if you changed your party registration from Democrat to Republican or vice versa. Your age is your age. When dealing with a non-causal DGM, the focus is on predicting things. The underlying mechanism which connects age with party is less important than the brute statistical fact that there is a connection. Predictive models care little about causality.
A good way at looking at this is with a Preceptor Table, as seen below. Unlike the previous table in Chapter 7, we now have two columns in addition to ID. Since our data does not include all Republicans and Democrats in the world, not every row is filled in.
ID | Predictor | Outcome |
---|---|---|
Political Party | Age | |
1 | D | 31 |
2 | ? | ? |
... | ... | ... |
473 | D | 58 |
474 | ? | ? |
... | ... | ... |
3,258 | ? | ? |
3,259 | R | 49 |
... | ... | ... |
N | ? | ? |
We now know that we are working with a predictive DGM. Recall:
\[\text{outcome} = \text{model} + \text{not in the model}\]
In words, any event depends on our explicitly described model as well as on influences unknown to us. Everything that happens in the world is the result of various factors, and we can only consider some of them in our model (because we do not know about some influences, or because we have no data about them).
Let’s make our DGM. The mathematics:
\[ y_i = \beta_1 republican_i + \beta_2 democrat_i + \epsilon_i\]
where \[republican_i, democrat_i \in \{0,1\}\] \[republican_i + democrat_i = 1\] \[\epsilon_i \sim N(0, \sigma^2)\]
Don’t panic dear poets and philosophers, the whole thing is easier than it looks. This will follow the form above in which the outcome is the result of what is in our model and not in our model.
On the left-hand side we have the outcome, \(y_i\), which is the variable to be explained. In our case, this is the age of an individual in the population.
The right-hand side contains two parts, that which is contained within the model, and that which isn’t.
- First, we have the part contained in the model, which consists of the parameter and the data points. The betas are our two parameters: \(\beta_1\) is the average age of Republicans in the population and \(\beta_2\) is the average age of Democrats in the population. \(republican_i\) and \(democrat_i\) are our explanatory variables and take the values 1 or 0. As shown in our model, \(\beta_1 republican_i\) and \(\beta_2 democrat_i\) are two similar terms which are added together to make our model and each term consists of a parameter and a data point. If person \(i\) is a Republican we have \(republican_i = 1\) and \(democrat_i = 0\). If person \(i\) is a Democrat we have \(republican_i = 0\) and \(democrat_i = 1\). In other words, their values are mutually exclusive – if you are a Democrat, you cannot also be a Republican.
- The second part in the right-hand side, \(\epsilon_i\) (“epsilon”), represents the unexplained part of the outcome and is called the error term. This includes all factors that have an influence on someone’s age but are not related to party affiliation. In other words, \(\epsilon_i\) is what influence age that is not factored into our model. We assume that this error follows a normal distribution with an expected value of 0 (meaning it is 0 on average) and it is simply the difference between the outcome and our model predictions.
Some things to note about our model:
The small \(i\)’s are an index to number the observations. It is equivalent to the “ID” column in our Preceptor Table and simply states that the outcome for person \(i\) is explained by the modeled and non-modeled factors for person \(i\).
The model is a claim about how the world works, not just for the 115 individuals for which we have data but for the all the people in the population for which we seek to draw inferences.
Although terminology differs across academic fields, the most common term to describe a model like this is a “regression.” We are “regressing”
age
onparty
in order to see if they are associated with each other. The formula above is a “regression formula”, and the model is a “regression model.” This terminology would also apply to our model ofheight
in Chapter 7.The model in Chapter 7 is sometimes called “intercept-only” because the only (interesting) parameter is the intercept. Here we have a “two intercept” model because, instead of estimating an average for the whole population, we are estimating two averages.
8.2.2 Courage
Courage allows us to translate math to code.
To get posterior distributions for our three parameters, we will again use stan_glm()
, just as we did in Chapter 7.
fit_1 <- stan_glm(age ~ party - 1,
data = trains,
family = gaussian,
seed = 17,
refresh = 0)
The variable before the tilde,
age
, is our outcome.The only explanatory variable is
party
. This variable has only two values, ‘Democrat’ and ‘Republican’.Recall that our model is linear. Since we are using a linear model, the
family
we use will begaussian
.We have also added
-1
at the end of the equation, indicating that we do not want an intercept, which would otherwise be added by default.
The resulting output:
fit_1
stan_glm
family: gaussian [identity]
formula: age ~ party - 1
observations: 115
predictors: 2
------
Median MAD_SD
partyDemocrat 42.6 1.2
partyRepublican 41.2 2.7
Auxiliary parameter(s):
Median MAD_SD
sigma 12.3 0.8
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
partyDemocrat
corresponds to \(\beta_1\), the average age of Democrats in the population. partyRepublican
corresponds to \(\beta_2\), the average age of Republicans in the population. Since we don’t really care about the posterior distribution for \(\sigma\), we won’t discuss it here. Graphically:
fit_1 |>
as_tibble() |>
select(-sigma) |>
mutate(Democrat = partyDemocrat, Republican = partyRepublican) |>
pivot_longer(cols = Democrat:Republican,
names_to = "parameter",
values_to = "age") |>
ggplot(aes(x = age, fill = parameter)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
alpha = 0.5,
bins = 100,
position = "identity") +
labs(title = "Posterior for Average Age",
subtitle = "More data allows for a more precise posterior for Democrats",
x = "Age",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
The unknown parameters \(\beta_1\) (partyDemocrat) and \(\beta_2\) (partyRepublican) are still unknown. We can never know the true average age of all Democrats in the population. But we can calculate a posterior probability distribution for each parameter. Comments:
Democrats seem slightly older than Republicans. That was true in the sample and so, almost (but not quite!) by definition, it will be true in our the posterior probability distributions.
Our estimate for the average age of Democrats in the population is much more precise than that for Republicans because we have five times as many Democrats as Republicans in our sample. A central lesson from Chapter 6 is that the more data you have related to a parameter, the narrower your posterior distribution will be.
There is a great deal of overlap between the two distributions. Would we be surprised if, in truth, the average age of Republicans in the population was greater than that for Democrats? Not really. We don’t have enough data to be sure either way.
The phrase “in the population” is doing a great deal of work because we have not said what, precisely, we mean by the “population.” Is it the set of people on those commuter platforms on those days in 2012 when the experiment was done? Is it the set of people on all platforms, including ones never visited? Is it the set of all Boston commuter? All Massachusetts residents? All US residents? Does it include people today, or can we only draw inferences for 2012? We will explore these questions in every model we create.
The parameters \(\beta_1\) and \(\beta_2\) can be interpreted in two ways. First, like all parameters, they are a part of the model. We need to estimate them. But, in many cases, we don’t really care what the value of the parameter is. The exact value of \(\sigma\), for example, does not really matter. Second, some parameters have a substantive interpretaion, as with \(\beta_1\) and \(\beta_2\) being the average ages in the population. But this will often not be the case! Fortunately, with such models, we can use functions like
posterior_epred()
andposterior_predict()
to answer our questions.
Consider a table which shows a sample of 8 individuals.
8 Observations from Trains Dataset | |||
Age | Party | Fitted | Residual |
---|---|---|---|
31 | Democrat | 42.59 | -11.593526 |
34 | Republican | 41.19 | -7.186270 |
63 | Democrat | 42.59 | 20.406474 |
45 | Democrat | 42.59 | 2.406474 |
55 | Democrat | 42.59 | 12.406474 |
37 | Democrat | 42.59 | -5.593526 |
53 | Republican | 41.19 | 11.813730 |
36 | Democrat | 42.59 | -6.593526 |
The fitted values are the same for all Republicans and for all Democrats, as the model produces one fitted value for each condition. This table shows how just a sample of 8 individuals captures a wide range of residuals, making it difficult to predict the age of a new individual. We can get a better picture of the unmodeled variation in our sample if we plot these three variables for all the individuals in our data.
The following three histograms show the actual outcomes, fitted values, and residuals of all people in trains
:
outcome <- ch8 |>
ggplot(aes(age)) +
geom_histogram(bins = 100) +
labs(x = "Age",
y = "Count")
fitted <- tibble(age = fitted(fit_1)) |>
ggplot(aes(age)) +
geom_bar() +
labs(x = "Fitted Values",
y = NULL) +
scale_x_continuous(limits = c(20, 70))
res <- tibble(resids = residuals(fit_1)) |>
ggplot(aes(resids)) +
geom_histogram(bins = 100) +
labs(x = "Residuals",
y = NULL)
outcome + fitted + res +
plot_annotation(title = "Decomposition of Height into Fitted Values and Residuals")
The three plots are structured like our equation and table above. A value in the left plot is the sum of one value from the middle plot plus one from the right plot.
The actual age distribution looks like a normal distribution. It is centered around 43, and it has a standard deviation of about 12 years.
The middle plot for the fitted values shows only two adjacent spikes, which represent the estimates for Democrats and Republicans.
Since the residuals plot represents the difference between the other two plots, its distribution looks like the first plot.
8.2.3 Temperance
Recall the first questions with which we began this section:
- What is the probability that, if a Democrat shows up at the train station, he will be over 50 years old?
So far we have only tried our model on people from our data set whose real age we already knew. This is helpful to understand the model, but our ultimate goal is to understand more about the real world, about people we don’t yet know much about. Temperance guides us to make meaningful predictions and to become aware of their known and unknown limitations.
Start with a simple question, what are the chances that a random Democrat is over 50 years old? First, we create a tibble with the desired input for our model. In our case the tibble has a variable named “party” which contains a single observation with the value “Democrat”. This is a bit different than Chapter 7.
new_obs <- tibble(party = "Democrat")
Use posterior_predict()
to create draws from the posterior for this scenario. Note that we have a new posterior distribution under consideration here. The unknown parameter, call it \(D_{age}\), is the age of a Democrat. This could be the age of a randomly selected Democrat from the population or of the next Democrat we meet or of the next Democrat we interview on the train platform. The definition of “population” determines the appropriate interpretation. Yet, regardless, \(D_{age}\) is an unknown parameter. But it is not one — like \(\beta_1\), \(\beta_2\), or \(\sigma\) — for which we have already created a posterior probability distribution. That is why we need posterior_predict()
.
posterior_predict()
takes two arguments: the model for which the simulations should be run, and a tibble indicating for which and how many parameters we want to run these simulations. In this case, the model is the one from Courage and the tibble is the one we just created.
pp <- posterior_predict(fit_1, newdata = new_obs) |>
as_tibble()
head(pp, 10)
# A tibble: 10 × 1
`1`
<dbl>
1 46.7
2 49.8
3 43.9
4 55.7
5 32.5
6 37.0
7 44.4
8 52.4
9 48.5
10 34.9
The result are draws from the posterior distribution of the age of a Democrat. It is important to understand that this is not a concrete person from the trains
dataset - the algorithm in posterior_predict()
simply uses the existing data from trains
to estimate this posterior distribution.
pp |>
ggplot(aes(x = `1`)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for a Random Democrat's Age",
subtitle = "Individual predictions are always more variable than expected values",
x = "Age",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
Once we have the posterior distribution, we can answer (almost) any reasonable question. In this case, the probability that the next Democrat will be over 50 is around 28%.
Recall the second question:
- In a group of three Democrats and three Republicans, what will the age difference be between the oldest Democrat and the youngest Republican?
As before we start by creating a tibble with the desired input. Note that the name of the column (“party”) and the observations (“Democrat”, “Republican”) must always be exactly as they are in the original data set. This tibble as well as our model can then be used as arguments for posterior_predict()
:
newobs <- tibble(party = c("Democrat", "Democrat", "Democrat",
"Republican", "Republican","Republican"))
posterior_predict(fit_1, newdata = newobs) |>
as_tibble()
# A tibble: 4,000 × 6
`1` `2` `3` `4` `5` `6`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.2 38.3 43.4 48.8 47.2 48.1
2 25.7 51.4 33.8 39.3 42.3 37.4
3 47.4 76.3 71.0 27.5 36.7 21.0
4 41.2 43.7 35.5 22.2 45.7 40.0
5 44.8 53.0 35.7 13.1 22.6 39.5
6 60.4 29.9 40.5 34.5 57.6 59.6
7 45.9 58.9 34.3 46.5 21.0 40.7
8 29.2 39.8 39.0 21.6 40.2 69.1
9 72.4 45.4 48.6 29.6 46.3 46.0
10 59.5 41.2 36.1 54.2 67.0 39.7
# … with 3,990 more rows
We have 6 columns: one for each person. posterior_predict()
does not name the columns, but they are arranged in the same order in which we specified the persons in newobs
: D, D, D, R, R, R. To determine the expected age difference, we add code which works with these posterior draws:
pp <- posterior_predict(fit_1, newdata = newobs) |>
as_tibble() |>
# We don't need to rename the columns, but doing so makes the subsequest
# code much easier to understand. We could just have worked with columns 1,
# 2, 3 and so on. Either way, the key is to ensure that you correctly map
# the covariates in newobs to the columns in the posterior_predict object.
set_names(c("dem_1", "dem_2", "dem_3",
"rep_1", "rep_2", "rep_3")) |>
rowwise() |>
# Creating three new columns. The first two are the highest age among
# Democrats and the lowest age among Republicans, respectively. The third one
# is the difference between the first two.
mutate(dems_oldest = max(c_across(dem_1:dem_3)),
reps_youngest = min(c_across(rep_1:rep_3)),
age_diff = dems_oldest - reps_youngest)
pp
# A tibble: 4,000 × 9
# Rowwise:
dem_1 dem_2 dem_3 rep_1 rep_2 rep_3 dems_oldest reps_youngest age_diff
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 44.0 41.7 70.1 52.7 26.4 38.1 70.1 26.4 43.7
2 22.2 38.2 24.9 39.5 60.4 17.3 38.2 17.3 20.8
3 50.9 44.1 56.7 19.9 64.5 45.2 56.7 19.9 36.8
4 26.1 57.1 65.2 50.3 45.9 41.4 65.2 41.4 23.9
5 57.3 40.8 42.2 41.3 36.0 35.2 57.3 35.2 22.1
6 47.1 35.4 45.5 48.2 38.9 26.1 47.1 26.1 21.0
7 33.8 23.5 56.7 37.8 23.6 45.4 56.7 23.6 33.2
8 52.0 40.6 55.5 39.2 31.7 35.9 55.5 31.7 23.8
9 46.9 32.2 27.1 52.0 46.2 40.9 46.9 40.9 5.99
10 31.6 36.9 37.3 67.8 48.5 45.9 37.3 45.9 -8.59
# … with 3,990 more rows
The plotting code is similar to what we have seen before:
pp |>
ggplot(aes(x = age_diff)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Age Difference",
subtitle = "Oldest of three Democrats compared to youngest of three Republicans",
x = "Age",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
In words, we would expect the oldest Democrat to be about 22 years older than the youngest Republican, but we would not be too surprised if the oldest Democrat was actually younger than the youngest Republican in a group of 6.
8.2.4 Addendum
Instead of parameterizing this model without an intercept, we could have used one. In that case, the math would be:
\[ y_i = \beta_0 + \beta_1 democratic_i + \epsilon_i\] The interpretations of the parameters are different from the prior model. \(\beta_0\) is now the average age of Republicans. This is the same interpretation as \(\beta_1\) in the original set up. \(\beta_1\) is now the difference between the the average age of Republicans and that of Democrats.
To fit this model, we use the exact same code as before, except without the -1
in the formula
argument.
stan_glm(age ~ party,
data = trains,
seed = 98,
refresh = 0)
stan_glm
family: gaussian [identity]
formula: age ~ party
observations: 115
predictors: 2
------
Median MAD_SD
(Intercept) 42.6 1.3
partyRepublican -1.5 3.1
Auxiliary parameter(s):
Median MAD_SD
sigma 12.3 0.8
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
The intercept, 42.6, is the same as the partyDemocrat
estimate in the first model. The partyRepublican
estimate, which was previously 41.0, is now -1.5, meaning it is the difference (allowing for rounding) between the average age of Democrats and Republicans.
Little else about the models will be different. They will have the same fitted values and residuals. posterior_predict()
will generate the same posterior predictive probabilty distributions. Which parameterization we use does not matter much. But you should be able to interpret the meaning of the coefficients in both.
8.3 att_end ~ treatment
Above, we created a predictive model: with someone’s party affiliation, we can make a better guess as to what their age is than we could have in the absence of information about their party. There was nothing causal about that model. Changing someone’s party registration can not change their age. In this example, we build a causal model. Consider these two questions:
What is the average treatment effect, of exposing people to Spanish-speakers, on their attitudes toward immigration?
What is the largest causal effect which still has a 1 in 10 chance of occurring?
Models help us answer questions better than we could without those models, but only if we follow the Cardinal Virtues.
8.3.1 Justice
The four elements of Justice in a data science project remain the same: validity, stability, representativeness, and the model.
First, we must consider the validity of the model, and to do so, let’s look at our treatment variable. What does it mean for somebody to receive the treatment? Let’s say somebody was running late on their way to the train in the morning. Well, this person may have heard the Spanish-speakers for a shorter amount of time than a commuter who arrived earlier. Similarly, some Spanish-speakers may have been speaking louder than others or some commuters may have been closer to the speakers than other commuters. Therefore, these commuters would hear the treatment of being exposed to Spanish-speakers less loudly than the other commuters. Additionally, the Spanish-speakers weren’t the same for every train station in our 2012 data and if we were to get data for now, we wouldn’t be able to hire the exact same Spanish-speakers. When considering validity with reference to treatments, what we must determine is if we can assume that the treatments of being exposed to Spanish-speakers that each commuter may experience a bit differently can be assumed to be the same.
Next, we must consider the stability of our model for the relationship between att_end
and treatment
between 2012 and 2021. Is this relationship from 2012, four years before Donald Trump’s election as president, still the same? While we might not know for sure, we have to consider this in order to continue and make assumptions with our data. For our purposes, we will consider the relationship to be stable. Even though we know that there may have been some changes, we will consider the model to be the same in both years.
Let’s now once again consider whether the data from 2012 train commuters around Boston is representative of 2012 train commuters in the US. In our last model, we discussed the issue about how Boston may be different from other cities and therefore not representative of the US, and we will now consider the issue of random sampling that may lead to representativeness issues.
Let’s say that even though Boston is different from other US cities, we considered Boston to be perfectly representative of the US. Great, but this 2012 data could still not be representative. This is because there could be bias within those who are chosen to give the survey, in that the commuters who are approached and receive the surveys may not be random or representative. What if the individuals giving out the surveys were younger and also tended to choose people to approach with a survey that were similar in age to them? A scenario like this could end up overestimating younger train commuters in the population, which could influence our answers to any of our questions. Specifically, when considering the relationship between att_end
and treatment
, this could influence the results of the model because younger individuals may have similar attitudes on immigration.
Now, let’s determine whether our DGM will be linear or logistic. Since our outcome variable att_end
is a continuous variable since it has a range of possible values, we will use a linear model.
The math for this model is exactly the same as the math for the predictive model in the first part of this chapter, although we change the notation a bit for clarity.
\[ y_i = \beta_1 treatment_i + \beta_2 control_i + \epsilon_i\]
where \[treatment_i, control_i \in \{0,1\}\] \[treatment_i + control_i = 1\] \[\epsilon_i \sim N(0, \sigma^2)\]
Nothing has changed, except for the meaning of the data items and the interpretations of the parameters.
On the left-hand side we still have the outcome, \(y_i\), however in this case, this is a person’s attitude toward immigration after the experiment is complete. \(y_i\) takes on integer values between 3 and 15 inclusive.
On the right-hand side, the part contained in the model will consist of the terms \(\beta_1 treatment_i\) and \(\beta_2 control_i\). These two terms stand for Treated and Control and as before, each term consists of a parameter and a data point. \(\beta_1\) is the average attitude toward immigration for treated individuals — those exposed to Spanish-speakers — in the population. \(\beta_2\) is the average attitude toward immigration for control individuals — those not exposed to Spanish-speakers — in the population.These are both our parameters. The \(x\)’s are our explanatory variables and take the values 1 or 0. If person \(i\) is Treated, \(treatment_i = 1\) and \(control_i = 0\). If person \(i\) is Control, \(treatment_i = 0\) and \(control_i = 1\). In other words, these are binary variables and are mutually exclusive – if you are Treated, you cannot also be Control.
The last part, \(\epsilon_i\) (“epsilon”), represents the part that is not explained by our model and is called the error term. It is simply the difference between the outcome and our model predictions. In our particular case, this includes all factors that have an influence on someone’s attitude toward immigration but are not explained by treatment status. We assume that this error follows a normal distribution with an expected value of 0.
- Note that the formula applies to everyone in the population, not just the 115 people for whom we have data. The index \(i\) does not just go from 1 through 115. It goes from 1 through \(N\), where \(N\) is the number of individuals in the population. Conceptually, everyone has an
att_end
under treatment and under control.
- The small \(i\)’s are an index for the data set. It is equivalent to the “ID” column in our Preceptor Table and simply states that the outcome for person \(i\) is explained by the predictor variables (\(treatment\) and \(control\)) for person \(i\), along with an error term.
8.3.2 Courage
With Justice satisfied, we gather our Courage and fit the model. Note that, except for the change in variable names, the code is exactly the same as it was above, in our predictive model for age. Predictive models and causal models use the same math and the same code. The differences, and they are very important, lie in the interpretation of the results, not in their creation.
fit_2 <- stan_glm(att_end ~ treatment - 1,
data = trains,
seed = 45,
refresh = 0)
fit_2
stan_glm
family: gaussian [identity]
formula: att_end ~ treatment - 1
observations: 115
predictors: 2
------
Median MAD_SD
treatmentTreated 10.0 0.4
treatmentControl 8.5 0.3
Auxiliary parameter(s):
Median MAD_SD
sigma 2.8 0.2
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Note that once again, since we are using a linear model, we set the family
argument to “gaussian”.
treatmentTreated
corresponds to \(\beta_1\). As always, R has, behind the scenes, estimated the entire posterior probability distribution for \(\beta_1\). We will graph that distribution in the next section. But the basic print method for these objects can’t show the entire distribution, so it gives us summary numbers: the median and the MAD SD. Speaking roughly, we would expect about 95% of the values in the posterior to be within two MAD SD’s of the median. In other words, we are 95% confident that the true, but unknowable, average attitude toward immigration among the Treated in the population to be between 9.2 and 10.8.
treatmentControl
corresponds to \(\beta_2\). The same analysis applies. We are about 95% confident that the true value for the average attitude toward immigration for Control in the population is between 7.9 and 9.1.
Up until now, we have used the Bayesian interpretation of “confidence interval.” This is also the intuitive meaning which, outside of academia, is almost universal. There is a truth out there. We don’t know, and sometimes can’t know, the truth. A confidence interval, and its associated confidence level, tells us how likely the truth is to lie within a specific range. If your boss asks you for a confidence interval, she almost certainly is using this interpretation.
But, in contemporary academic research, the phrase “confidence interval” is usually given a “Frequentist” interpretation. (The biggest divide in statistics is between Bayesians and Frequentist interpretations. The Frequentist approach, also known as “Classical” statistics, has been dominant for 100 years. Its power is fading, which is why this textbook uses the Bayesian approach.) For a Frequentist, a 95% confidence interval means that, if we were to apply the procedure we used in an infinite number of future situations like this, we would expect the true value to fall within the calculated confidence intervals 95% of the time. In academia, a distinction is sometimes made between confidence intervals (which use the Frequentist interpretation) and credible intervals (which use the Bayesian interpretation). We won’t worry about that difference in this Primer.
Let’s look at the full posteriors for both \(\beta_1\) and \(\beta_2\).
fit_2 |>
as_tibble() |>
select(-sigma) |>
pivot_longer(cols = treatmentTreated:treatmentControl,
names_to = "Parameter",
values_to = "attitude") |>
ggplot(aes(x = attitude, fill = Parameter)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
alpha = 0.5,
bins = 100,
position = "identity") +
labs(title = "Posterior for Expected Attitude Toward Immigration",
subtitle = "Treated individuals are more conservative",
x = "Attitude",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
It appears that the affect of the treatment is to change people’s attitudes to be more conservative about immigration issues. Which is somewhat surprising!
We can decompose the the dependent variable, att_end
into two parts: the fitted values and the residuals. There are only two possible fitted values, one for the Treated and one for the Control. The residuals, as always, are simply the difference between the outcomes and the fitted values.
The smaller the spread of the residuals, the better a job the model is doing of explaining the outcomes.
8.3.3 Temperance
Recall the first question with which we began this section:
- What is the average treatment effect, of exposing people to Spanish-speakers, on their attitudes toward immigration?
Chapter Chapter 4 defined the average treatment effect. One simple estimator of the average treatment effect is the difference between \(\beta_1\) and \(\beta_2\). After all, the definition of \(\beta_1\) is the average attitude toward immigration, of the population, for anyone, under exposure to the treatment. So, \(\beta_1 - \beta_2\) is the average treatment effect for the population, roughly 1.5. However, estimating the posterior probability distribution for this parameter is tricky, unless we make use of the posterior distributions of \(\beta_1\) and \(\beta_2\). With that information, the problem is simple:
fit_2 |>
as_tibble() |>
mutate(ate = treatmentTreated - treatmentControl) |>
ggplot(aes(x = ate)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Average Treatment Effect",
subtitle = "Exposure to Spanish-speakers shifts immigration attitudes rightward",
x = "Difference in Attitude",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
Could the true value of the average treatment effect be as much as 2 or as little as 1? Of course! The most likely value is around 1.5, but the variation in the data and the smallness of our sample cause the estimate to be imprecise. However, it is quite unlikely that the true average treatment effect is below zero.
We can use posterior_epred()
to answer this question. Create a tibble and use it as we have done before:
newobs <- tibble(treatment = c("Treated", "Control"))
pe <- posterior_epred(fit_2, newobs) |>
as_tibble() |>
mutate(ate = `1` - `2`)
pe
# A tibble: 4,000 × 3
`1` `2` ate
<dbl> <dbl> <dbl>
1 10.0 8.56 1.47
2 9.94 8.33 1.61
3 10.0 8.48 1.56
4 10.6 8.28 2.36
5 10.6 8.29 2.32
6 9.26 8.81 0.449
7 10.4 8.42 1.96
8 10.5 8.28 2.26
9 10.8 8.51 2.25
10 10.3 8.37 1.88
# … with 3,990 more rows
The posterior probability distribution created with posterior_epred()
is the same as the one produced by manipulating the parameters directly.
pe |>
ggplot(aes(x = ate)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Average Treatment Effect",
subtitle = "Exposure to Spanish-speakers shifts immigration attitudes rightward",
x = "Difference in Attitude",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
Our second question:
- What is the largest effect size which still has a 1 in 10 chance of occurring?
Create a tibble which we can pass to posterior_predict()
. The variables in the tibble which will be passed in as newdata
. Fortunately, the tibble we created above is just what we need for this question also.
Consider the result of posterior_predict()
for two people, one treated and one control. Take the difference.
pp <- posterior_predict(fit_2,
newdata = newobs) |>
as_tibble() |>
mutate(te = `1` - `2`)
pp
# A tibble: 4,000 × 3
`1` `2` te
<dbl> <dbl> <dbl>
1 9.98 8.22 1.76
2 8.93 7.59 1.35
3 10.2 3.18 7.07
4 12.9 9.91 2.95
5 10.2 3.93 6.24
6 10.2 7.20 2.97
7 8.24 11.2 -2.94
8 8.91 10.7 -1.82
9 14.7 8.02 6.71
10 15.9 11.5 4.32
# … with 3,990 more rows
Create a graphic:
pp |>
ggplot(aes(x = te)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Treatment Effect for One Person",
subtitle = "Causal effects are more variable for indvduals",
x = "Difference in Attitude",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
In this case, we are looking at the distribution of the treatment effect for a single individual. This is very different than the average treatment effect. In particular, it is much more variable. We are looking at one row in the Preceptor Table. For a single individual, att_end
can be anywhere from 3 to 15, both under treatment and under control. The causal effect — the difference between the two potential outcomes can, in theory, be anywhere from -12 to +12. Such extreme values are rare, but not impossible.
The question, however, was interested in the value at the 90th percentile.
quantile(pp$te, prob = 0.9)
90%
6.731598
We would not expect a treatment effect of this magnitude to be common, but, at the same time, effects this big and bigger will occur about 10% of the time.
8.4 income ~ age
So far, we have only created models in which the predictor variable is discrete, with two possible values. party
is either “Democrat” or “Republican”. treatment
is either “Treated” or “Control”. Often times, however, the predictor variable will be continuous. Fortunately, the exact same approach works in this case. Consider:
What would you expect the income to be for a 40-year old?
8.4.1 Justice
Once again, in Justice, we must consider validity, stability, representativeness, and the data generating mechanism.
Let’s look at the validity of age
and income
. Recall that validity refers to whether or not we can consider the columns age
and income
to have the same meaning in our data set of 2012 Boston train commuters and in our Preceptor Table. While age
doesn’t really change meaning over time, income
can be impacted by inflation. After all, $100,000 in 2012 doesn’t have the same worth as $100,000 now due to inflation and now would have less purchasing power. This would result in income being underestimated within our model. However, since there hasn’t been drastic inflation that dramatically changed the buying power, we will consider income to be valid. If there had been like 300% inflation, however, our conclusion would probably be different.
Now, let’s consider the stability of our model and if we believe that our relationship between age and income has changed between 2012 and now. Once again, let’s consider inflation and how that could impact income. If incomes were to increase at the rate of inflation, then the income distribution would be different than that of 2012. However, wages don’t tend to change as quickly as inflation does, so they likely did not change significantly and we can consider this model to be stable.
Next, let’s consider another issue that we may have with representativeness. What if we were now to assume that Boston train commuters are perfectly representative of US train commuters AND those who were approached to respond to the survey were also random? Even if this were true, we could still not assume representativeness because those who actually complete and submit the survey are not random. Instead of having more young people chosen to respond by those handing out surveys like we discussed in our last model, what if were to assume that when the surveys are handed out randomly, younger people tended to fill out and submit them more than those who are older? Well, this would still skew the age distribution and overestimate younger people in the population, and if younger people also tend to have a lower income than older people, this could also alter our answers to our current questions.
If we had reason to believe this is true, one way that we could fix this issue of representativeness is to alter our population to be train commuters in the US who would respond to the survey. In doing so, our population would then accommodate for the skewed age distribution under the assumption that younger individuals tend to respond to surveys at higher rates than older people.
The mathematics for a continuous predictor is unchanged from the intercept-including example we explored previously.
\[y_i = \beta_0 + \beta_1 age_i + \epsilon_i\]
When comparing two people (persons 1 and 2), the first one year older than the second, \(\beta_1\) is the expected difference in their incomes. The algebra is simple. Start with the two individuals.
\[y_1 = \beta_0 + \beta_1 age_1\] \[y_2 = \beta_0 + \beta_1 age_2\] We want the difference between them, so we subtract the second from the first, performing that subtraction on both sides of the equals sign.
\[y_1 - y_2 = \beta_0 + \beta_1 age_1 - \beta_0 - \beta_1 age_2\\ y_1 - y_2 = \beta_1 age_1 - \beta_1 age_2\\ y_1 - y_2 = \beta_1 (age_1 - age_2)\]
So, if person 1 is one year older than person 2, we have:
\[y_1 - y_2 = \beta_1 (age_1 - age_2)\\ y_1 - y_2 = \beta_1 (1)\\ y_1 - y_2 = \beta_1\]
The algebra demonstrates that \(\beta_1\) is the same for all ages. The difference in expected income between two people aged 23 and 24 is the same as the difference between two people aged 80 and 81. Is that plausible? Maybe. The algebra does not lie. When we create a model like this, this is the assumption we are making.
Note how careful we are not to imply that increasing age by one year “causes” an increase in income. That is nonsense! No causation without manipulation. Since it is impossible to change someone’s age, there is only one potential outcome. With only one potential outcome, a causal effect is not defined.
8.4.2 Courage
The use of stan_glm()
is the same as usual.
fit_3 <- stan_glm(income ~ age,
data = trains,
seed = 28,
refresh = 0)
print(fit_3, details = FALSE)
stan_glm
family: gaussian [identity]
formula: income ~ age
observations: 115
predictors: 2
------
Median MAD_SD
(Intercept) 103288.5 24077.6
age 907.8 535.8
Auxiliary parameter(s):
Median MAD_SD
sigma 74121.3 4897.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
When comparing two individuals, one 30 years old and one 40, we expect the older to earn about $9,000 more. But we are far from certain: the 95% confidence interval ranges from -$3,000 to $20,000.
The above is a good summary of the models.
It is brief! No one wants to listen to too much of your prattle. One sentence gives a number of interest. The second sentence provides a confidence interval.
It rounds appropriately. No one wants to hear a bunch of decimals. Use sensible units.
It does not just blindly repeat numbers in the printed display. A one year difference in age, which is associated with a $900 difference in income, is awkward. (We think.) A decade comparison is more sensible.
“When comparing” is a great phrase to start the summary of any non-causal model. Avoid language like “associated with” or “leads to” or “implies” or anything which even hints at a causal claim.
Consider our usual decomposition of the outcome into two parts: the model and the error term.
There are scores of different fitted values. Indeed, there are a greater number of different fitted values than there are different outcome values! This is often true for models which have continuous predictor variables, we have here with age
.
8.4.3 Temperance
Recall our question:
What would you expect the income to be for a random 40 year old?
Given that we are looking for an expected value, we use posterior_epred()
.
newobs <- tibble(age = 40)
pe <- posterior_epred(fit_3, newdata = newobs) |>
as_tibble()
pe
# A tibble: 4,000 × 1
`1`
<dbl>
1 142106.
2 141530.
3 141363.
4 139289.
5 139647.
6 142876.
7 139427.
8 143336.
9 145881.
10 143052.
# … with 3,990 more rows
Plotting is the same as always.
pe |>
ggplot(aes(x = `1`)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Expected Income",
subtitle = "A 40-years old commuter earns around $140,000",
x = "Income",
y = "Probability") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
8.5 liberal ~ income
So far in this chapter, we have only considered continuous outcome variables. age
, att_end
and income
all take on a variety of values. None of them are, truly, continuous, of course. age
is only reported as an integer value. att_end
can only, by definition, take on 13 distinct values. However, from a modeling perspective, what matters is that they have more than 2 possible values.
liberal
, however, only takes on two values: TRUE and FALSE. In order to model it, we must use the binomial
family. We begin, as always, with some questions:
Among all people who have an income $100,000, what proportion are liberal?
Assume we have a group of eight people, two of whom make $100,000, two $200,000, two $300,000 and two $400,000. How many will be liberal?
8.5.1 Justice
Let’s now consider Justice for our relationship between liberal
and income
.
First, let’s look at validity, especially pertaining to liberal
. As we know, this column has the values “Liberal” and “Not Liberal” to convey the political ideology of the train commuters. What we must determine if the column liberal
and therefore the meaning of being liberal is the same in Boston in 2012 and Boston in 2021. If we can determine these to be the same, then we can assume validity, and in this case, because the core beliefs of being liberal have not changed very much between 2012 and 2021, we can determine that the data is valid.
Now, let’s consider stability. To do so, we must look at the relationship between liberal
and income
and determine whether or not we believe that this relationship has changed between 2012 and 2021. With our knowledge of the world, do we have any reason to believe that this has changed? What if between 2012 and 2021, income increased for those who are liberal? Well, then our model for the the relationship between income
and liberal
would have changed over the years, and therefore so would the model. However, since with our knowledge of the world we have no reason to believe that something of the sorts happened to affect this relationship, we can consider our model to be stable.
In our past three models, we have considered 3 possible issues that we could have with the representativeness of our model, such as the difference between Boston and other cities, problems with random sampling, and bias in those who respond, and we will now consider how in this survey there were surveys given before and after treatment, and some people may have filled out only one of the surveys. People who only filled out one of the surveys could affect the representativeness of the data because they could not be included in the data and if those who only filled out one survey tended to be liberal, then this would affect our data because it would underestimate the amount of liberals in the survey. This is something we must consider when looking at representativeness, since we could otherwise not determine if this data from train commuters in Boston in 2012 is representative enough of train commuters in the US now to continue using our data.
Let’s consider whether this model is linear or logistic. Unlike our previous models this chapter, the outcome variable, liberal
, for this model, only has two options, “Liberal” or “Not Liberal”. Therefore, this will be logistic because there are only 2 possible outcomes and the outcome variable isn’t continuous.
Recall the discussion in Section 7.5 about the logistic regression model which we use whenever the outcome or dependent variable is binary/logical. The math is there, if you care about math. We don’t, at least not too much. Reminder:
\[p(\text{Liberal}_i = \text{TRUE}) = \frac{\text{exp}(\beta_0 + \beta_1 \text{income}_i)}{1 + \text{exp}(\beta_0 + \beta_1 \text{income}_i)}\]
This model only has two parameters, \(\beta_0\) and \(\beta_1\). But these parameters do not have simple interpretations, unlike the parameters in a linear (or gaussian) model.
Recall the fundamental structure of all data science problems:
\[\text{outcome} = \text{model} + \text{what is not in the model}\] The exact mathematics of the model — the parameters, their interpretations — are all just dross in the foundry of our inferences: unavoidable but not worth too much of our time.
Even if the math is ignorable, the causal versus predictive nature of the model is not. Is this a causal model or a predictive model? It depends! It could be causal if you assume that we can manipulate someone’s income, if, that is, there are at least two potential outcomes: person \(i\)’s liberal status if she makes X dollars and person \(i\)’s liberal status if she makes Y dollars. Remember: No causation without manipulation. The definition of a causal effect is the difference between two potential outcomes. If you only have one outcome, then your model can not be causal.
In many circumstances, we don’t really care if a model is causal or not. We might only want to forecast/predict/explain the outcome variable. In that case, whether we can interpret the influence of a variable as causal is irrelevant to our use of that variable.
8.5.2 Courage
Fitting a logistic model is easy. We use all the same arguments as usual, but with family = binomial
added.
fit_4 <- stan_glm(data = ch8,
formula = liberal ~ income,
family = binomial,
refresh = 0,
seed = 365)
Having fit the model, we can look at a printed summary. Note the use of the digits
argument to display more digits in the printout.
print(fit_4, digits = 6)
stan_glm
family: binomial [logit]
formula: liberal ~ income
observations: 115
predictors: 2
------
Median MAD_SD
(Intercept) 0.562837 0.426033
income -0.000006 0.000003
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Fitted models tell us about the posterior distributions for the parameters in the formula which defines the model we have estimated. We are assuming that the model is true. And, as discussed in Chapter 6, that assumption is always false! Our model is never a perfectly accurate representation of reality. But, if it were perfect, then the posterior distributions which we have created for \(\beta_0\), \(\beta_1\), and so on would be perfect as well.
When working with a linear model, we will often interpret the meaning of the parameters, as we have already done in the first three sections of this chapter. Such interpretations are much harder with logistic models because the math is much less convenient. So, we won’t even bother to try to understand the meaning of these parameters. However, we can note that \(\beta_1\) is negative, suggesting that people with higher incomes are less likely to be liberal.
8.5.3 Temperance
Among all people who have an income $100,000, what proportion are liberal?
Although our model is now logistic, all the steps in answering a question like this are the same as with a linear/guassian model.
newobs <- tibble(income = 100000)
pe <- posterior_epred(fit_4,
newdata = newobs) |>
as_tibble()
pe
is a tibble with a single vector. That vector is 4,000 draws from the posterior distribution of proportion of people, among those who make $100,000, who are liberal. The population proportion is the same thing as the probability for any single individual.
pe |>
ggplot(aes(x = `1`)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Proportion Liberal Among $100,000 Earners",
subtitle = "The population proportion is the same as the probability for any individual",
x = "Income",
y = "Probability of Being Liberal") +
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_classic()
Assume we have a group of eight people, two of whom make $100,000, two $200,000, two $300,000 and two $400,000. How many will be liberal?
Because we are trying to predict the outcome for a small number of units, we use posterior_predict()
. The more complex the questions we ask, the more care we need to devote to making the newobs
tibble. We use the same rowwise()
and c_across()
tricks as earlier in the chapter.
newobs <- tibble(income = c(rep(100000, 2),
rep(200000, 2),
rep(300000, 2),
rep(400000, 2)))
pp <- posterior_predict(fit_4,
newdata = newobs) |>
as_tibble() |>
rowwise() |>
mutate(total = sum(c_across()))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `total = sum(c_across())`.
ℹ In row 1.
Caused by warning:
! Using `c_across()` without supplying `cols` was deprecated in dplyr 1.1.0.
ℹ Please supply `cols` instead.
pp
# A tibble: 4,000 × 9
# Rowwise:
`1` `2` `3` `4` `5` `6` `7` `8` total
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 1 0 1 1 0 0 0 3
2 1 0 0 0 0 1 0 0 2
3 1 1 1 0 0 0 0 0 3
4 0 1 1 0 0 0 0 0 2
5 0 1 0 0 1 0 0 1 3
6 0 1 1 0 0 0 0 0 2
7 1 0 0 0 0 1 0 0 2
8 1 1 1 0 0 0 0 0 3
9 1 1 0 1 0 0 0 0 3
10 0 1 0 0 0 0 0 0 1
# … with 3,990 more rows
Study the pp
tibble. Understand its component parts. The first column, for example, is 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $100,000. Note how all the draws are zeroes or ones. That is very different from the draws we have seen before! But it also makes sense. We are making a prediction about a binary variable, a variable which only have two possible values: zero or one. So, any (reasonable!) predictions will only be zero or one.
The second column is the same thing as the first column. Both are 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $100,000. Yet they also have different values. They are both the same thing and different things, in the same way that rnorm(10)
and rnorm(10)
are the same thing — both are 10 draws from the standard normal distribution — and different things in that the values vary.
The third and fourth columns are different from the first two columns. They are both 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $200,000. And so on for later columns. We can answer very difficult questions by putting together simple building blocks, each of them a set of draws from a posterior distribution. Recall the discussion in Section 5.1.
The total
column is simply the sum of the first eight columns. Having created the building blocks with 8 columns of draws from four different posterior distributions, we can switch our focus to each row. Consider row 2. It has a vector of 8 numbers: 1 1 1 0 0 1 0 0
. We can treat that vector as a unit of analysis. This is what might happen with our 8 people. The first three might be liberal, the fourth not liberal and so on. This row is just one example of what might happen, one draw from the posterior distribution of possible outcomes for groups of eight people with these incomes.
We can simplify this draw by taking the sum, or doing anything else which might answer the question with which we are confronted. Posterior distributions are as flexible as individual numbers. We can, more or less, just use algebra to work with them.
Graphically we have:
pp |>
ggplot(aes(x = total)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Number of Liberals in Group with Varied Incomes",
subtitle = "Two is the most likely number, but values from 0 to 5 are plausible",
x = "Number of Liberals",
y = "Probability") +
scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_classic()
As always, there is some truth. If, tomorrow, we were to meet 8 new people, with the specified incomes, a certain number of them would be liberal. If we had the ideal Preceptor Table, we could just look up that number. No data science required. Alas, we don’t know the truth. The bets we can do is to create a posterior distribution for that unknown value, as we have done here. We then need to translate that posterior into English — “The most likely number of liberals is 2 or 3, but a total as low as zero or as high as 5 is also plausible. Having 6 liberals would be really surprising. Having 7 or 8 is almost impossible.”
Are these two posterior probability distributions perfect? No! This is the central message of the virtue of Temperance. We must demonstrate our humility when we use our models. Recall the distinction between the unknown true distribution and the estimated distribution. The first is the posterior distribution we would create of we understood every detail of the process and could accurately model it. We would still not know the true unknown number, but our posterior distribution for that number would be perfect. Yet, our model is never perfect. We are making all sorts of assumptions behind the scenes. Some of those assumptions are plausible. Others are less so. Either way, the estimated distribution is what we have graphed above.
The central lesson of Temperance is: Don’t confuse the estimated posterior (which is what you have) with the true posterior (which is what you want). Recognize the unavoidable imperfections in the process. You can still use your estimated posterior — what choice do you have? — but be cautious and humble in doing so. The more that you suspect that your estimated posterior differs from the true posterior, the more humble and cautious you should be.
8.6 Summary
In this chapter, we explored relationships between different variables in the trains
data set. We built three predictive models and one causal model.
Similar to previous chapters, our first task is to use Wisdom . We judge how relevant our data is to the questions we ask. Is it reasonable to consider the data we have (e.g., income and age data from Boston commuters in 2012) as being drawn from the same population as the data we want to have (e.g., income and age data from today for the entire US)? Probably?
Justice is necessary to decide the best way to represent the models we make. A little math won’t kill you. We use Courage to translate our models into code. Our goal is to understand, generate posterior distributions for the parameters, and interpret their meaning. Temperance leads us to the final stage, using our models to answer our questions.
Key commands:
Create a model with
stan_glm()
.Use
posterior_epred()
to estimate expected values. The e in epred stands for expected.Use
posterior_predict()
to make forecasts for individuals. The variable in predictions is always greater than the variability in expectations because predictions can’t pretend that \(\epsilon_i\) is zero.
Once we have draws from a posterior distribution for our outcome variable — whether that be an expectation or a prediction — we can manipulate those draws to answer our question.
Remember:
Always explore your data.
Predictive models care little about causality.
Predictive models and causal models use the same math and the same code.
“When comparing” is a great phrase to start the summary of any non-causal model.
Don’t confuse the estimated posterior (which is what you have) with the true posterior (which is what you want). Be humble and cautious in your use of the posterior.