5 Three Parameters
Models have parameters. In Chapter 3 we created models with a single parameter \(p\), the proportion of red beads in an urn. In Chapter 4, we used models with two parameters: \(\mu\) (the average height in the population, generically known as a model “intercept”) and \(\sigma\) (the variation in height in the population). Here — can you guess where this is going? — we will build models with three parameters: \(\sigma\) (which serves the same role throughout the book) and two “coefficients.” In models which relate a continuous predictor to the outcome, those two parameters will be labeled \(\beta_0\) and \(\beta_1\). In models which estimate two averages, the parameters will be \(\beta_1\) and \(\beta_2\). All this notation is confusing, not least because different academic fields use inconsistent schemes. Follow the Cardinal Virtues and tackle your problem step by step.
Perhaps more importantly, a focus on parameters is less relevant now than it was decades ago, when computational limitations made answering our actual questions harder. Parameters are imaginary. They don’t exist. They, in general, are not the answer to an real world question. They are tools, along with the models of which they are a part, we use to answer questions.
Packages:
We are concerned with these three packages because we are exploring the trains
dataset which exists within the primer.data package. We use the rstanarm package to explore models and the approach of posterior_predict()
. Finally, we use the tidyverse package to conduct all of our graphs and functions.
In this chapter, we are going to ask a series of questions involving train commuters’ ages, party affiliations, incomes, and political ideology, as well as the causal effect of exposure to Spanish-speakers on their attitude toward immigration. These questions will pertain to all train commuters in the US today.
5.1 age ~ party
We want to build a model and then use that model to make claims about the world. Our questions about the relationship between age
and party
are the following:
What is the probability that, if a Democrat shows up at the train station, he will be over 50 years old?
In a group of three Democrats and three Republicans, what will the age difference be between the oldest Democrat and the youngest Republican?
We can answer these and similar questions by creating a model that uses party affiliation to predict age. Let’s follow the four Cardinal Virtue: Wisdom, Justice, Courage and Temperance.
5.1.1 Wisdom
Wisdom begins with considering the questions we desire to answer and the data set we are given.
5.1.1.1 Preceptor Table
First, do we need a causal or predictive model? In our question above we have a predictive model as we only have one outcome which is the person’s age based on our predictor which is the person’s party.
Second, what is the outcome? A person’s age will be the outcome. Note that the outcome is not necessarily the same concept mentioned in the questions. For example, the outcome column is not “probability that a person is over 50” even though that is one of the questions we need to answer. The questions, however, leave unclear details. Where train stations are these people at? Which country? Also, at what moment in time is this experiment taking place? In order to dive deeper into these questions we need to engage in a conversation with the person that is asking us these questions. When this occurs, we need to discuss and dive deeper into the questions that have been asked to understand what the true question is at hand. For the questions that we have at hand we will be talking about all the adults in July 1, 2012 at the location of Chicago, Illinois.
Third, what are the units? Our units for this scenario would be individuals because the questions are about the attributes of unique people at the station.
Fourth, what are the covariates? In our case, a person’s party will be the only covariate because it is another factor that plays to the person’s age and identity. The Preceptor Table must include any covariates mentioned in the question.
Fifth, do we have a treatment? No. We will be using a predictive model as discussed above and treatments will only apply to situations with causal models. A treatment is just a covariate which we can, at least in theory, manipulate, thereby creating other potential outcomes.
Let’s look at our refined question to create our Preceptor Table:
What is the probability that, if a Democrat shows up at the train station in Chicago, Illinois in 2012, he will be over 50 years old?
Our Preceptor Table:
Preceptor Table | ||
ID | Outcome | Covariate |
---|---|---|
Age | Party | |
1 |
31 |
D |
2 |
58 |
R |
… |
… |
… |
10 |
67 |
R |
11 |
23 |
R |
… |
… |
… |
N |
44 |
D |
Recall: a Preceptor Table is smallest possible table with rows and columns such that, if there is no missing data, all our questions are easy to answer. To answer questions — like “What is the probability that, if a Democrat shows up at the train station, he will be over 50 years old?” — we need a row for every individual at the station.
5.1.1.2 Always Explore Your Data
Recall the discussion from Chapter 1. Enos (2014) randomly placed Spanish-speaking confederates on nine train platforms around Boston, Massachusetts. The data that we want to analyze consists of the age and party of each individual on these train platforms based on our variables in our Preceptor Table. These reactions were measured through changes in answers to three survey questions.
Code
trains |>
select(age, party)
# A tibble: 115 × 2
age party
<int> <chr>
1 31 Democrat
2 34 Republican
3 63 Democrat
4 45 Democrat
5 55 Democrat
6 37 Democrat
7 53 Republican
8 36 Democrat
9 54 Democrat
10 42 Republican
# ℹ 105 more rows
As specified in the Preceptor Table we only care about age
is the age of the respondent and party
is their political party affiliation. party
can be shown to have two possibilities: Democrat and Republican.
age party
Min. :20.00 Length:115
1st Qu.:33.00 Class :character
Median :43.00 Mode :character
Mean :42.37
3rd Qu.:52.00
Max. :68.00
The range of values for age
seems reasonable. Perhaps children were explicitly excluded from the study. Or perhaps there were no children at the train station on those days.
Code
trains |>
count(party)
# A tibble: 2 × 2
party n
<chr> <int>
1 Democrat 96
2 Republican 19
There are about 5 times as many Democrats as Republicans, which is perhaps not too surprising in a heavily Democratic state like Massachusetts.
Plotting your data is always a good idea.
Code
trains |>
ggplot(aes(x = age, fill = party)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
alpha = 0.5,
bins = 25,
position = "dodge") +
labs(title = "Distribution of Age and Party",
subtitle = "More data allows for a more precise probability",
x = "Age",
y = "Percent of Sample",
fill = NULL) +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_manual(values = c("#7fdfe1", "#fbbab6")) +
theme_classic()
Democrats seem slightly older than Republicans. Our estimate for the average age of Democrats in the population will be much more precise than that for Republicans because we have five times as many Democrats as Republicans in our sample.
5.1.1.3 Validity
Do the columns in the Preceptor Table mean the same thing as the columns in the data? That is the heart of validity. Just because we use the same words — age
and party
— in both tables does not prove that they are, in fact, the same things.
Validity is the assumption which allows us to consider the Preceptor Table and our data to have been drawn from the same population. There is no truth here. The data is real enough, but we created the Preceptor Table. Whether or not there is a population from which we can assume both the data and the Preceptor Table might have been drawn is not a TRUE/FALSE question, we can’t use the data to make inferences about the Preceptor Table.
Now the assumption of validity may not hold due to the possibility of what our terms mean. For example, what it means to be a “Republican” might be very different in Massachusetts (meaning in our data) than what it means in the rest of the United States (meaning the Preceptor Table). Just slapping the two data sources together does not solve that problem. Similarly, age
in the data is measured from a survey. People make mistakes. They even lie. In our Preceptor Table, we want the column age
to be each person’s actual age.
If party
or age
are too different between the Preceptor Table and our data, then the assumption of validity fails. We can’t consider both sources to have been drawn from the same population.
Overall, however, the assumption of validity seems reasonable. age
and party
are close enough between our data and The Preceptor Table that we can “stack” them on top of each other. We will assume that both are drawn from the same population.
5.1.2 Justice
Justice concerns five topics: Population Table, stability, representativeness, unconfoundedness and the mathematical structure of the data generating mechanism (DGM).
5.1.2.1 Population Table
After assuming validity, we can now create our Population Table. Recall that every row from both the Preceptor Table and the data is included in the Population Table, along with all the rows from the underlying population from which we assume that both the Preceptor Table and the data were drawn.
Population Table | ||||
Source | Outcome | Covariates | ||
---|---|---|---|---|
Age | Party | Year | City | |
… |
… |
… |
… |
… |
Data |
43 |
Democrat |
2012 |
Boston, MA |
Data |
52 |
Republican |
2012 |
Boston, MA |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
Preceptor Table |
… |
… |
2023 |
Chicago, IL |
Preceptor Table |
… |
… |
2023 |
Chicago, IL |
… |
… |
… |
… |
… |
Our year within the Population table is an example of the moment in time.
Our Preceptor Table rows are chronologically ordered. The “other” rows are our greater population for which we are making assumptions — this is why the year at the start is earlier than our data and the year at the end is later than our Preceptor Table. This shows the more expansive population about which we are making inferences.
5.1.2.2 Stability
Stability means that the relationship between the columns is the same for three categories of rows: the data, the Preceptor table, and the larger population from which both are drawn. With something like height, it is much easier to assume stability over a greater period of time. Changes in global height occur extremely slowly, so height being stable across a span of 20 years is reasonable to assume. With something like political ideology, it is much harder to make the assertion that data collected in 2010 would be stable to data collected in 2030. When we are confronted with this uncertainty, we can consider making our time frame smaller. However, we would still need to assume stability from 2014 (time of data collection) to today. Stability allows us to ignore the issue of time.
5.1.2.3 Representativeness
Representativeness has to do with how well our sample represents the larger population we are interested in generalizing to. We might run into two potential problems: Is our data representative of the population? Is our Preceptor Table representative of the population? When we look at the data being representative of the population, let’s dive deeper and take a look at a few questions at hand. Does the train experiment allow us to calculate a predictive effect for people who commute by cars? Oftentimes the way that people commute change the way that we can interpret their attitude regarding party. Can we calculate the predictive effect for people in Boston? When we deal with representativeness we deal with two primary concerns: the generic concern and the variable concern. The generic concern deals with our situation at hand and whether we can believe that our data and Preceptor Table is representative of the broader population that it is drawn from. Before we generalize to broader populations we have to consider if our experimental estimates are applicable beyond our experiment. Generally: if there was no chance that a certain type of person would have been in this experiment, we cannot make an assumption for that person. Our second level of concern deals with the variable concern which in this case deals with an individual’s age and party. Out of our sample, we aren’t able to contain all of the ages of individuals within the broader population. For example: we don’t have the age less than 20 which does not represent a large amount of our broader population. When we deal with party we see a similar result. We are only dealing with Democrat and Republican as our two main factors, however, that is not representative of our entire broader population. There are multiple parties within the nation and limiting our data to two, condenses the parties that we have - ultimately undermining representativeness within the situation.
5.1.2.4 Unconfoundedness
Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates. However, with our scenario, we have a predictive model and this assumption is only relevant for causal models.
5.1.3 Courage
Courage centers around the creation of the data generating mechanism, i.e., our fitted model.
5.1.3.1 Mathematics
The type of outcome variable is the most important issue in determining the mathematical structure of the data generating mechanism. Since our outcome is a (mostly) continuous variable, a linear structure is the most reasonable. If the outcome variable had been binary — only two options — we would use a logistic model.
Let’s interpret courage with a look at the mathematics:
\[ age_i = \beta_0 + \beta_1 democratic_i + \epsilon_i\]
where \[democrat_i \in \{0,1\}\] \[\epsilon_i \sim N(0, \sigma^2)\]
Don’t panic dear poets and philosophers, the whole thing is easier than it looks. This will follow the form above in which the outcome is the result of what is in our model and not in our model.
Let’s dive deeper into the data generating mechanism at hand. On the left-hand side we have the outcome, \(y\), which is the variable to be explained. In our case, this is the age of an individual in the population. The right-hand side contains two parts: that which is contained within the model, and that which isn’t.
As important as our mathematics is, we need to take a look at the parameters that we have at hand. Looking at our parameters allows us to create fitted model by combining the mathematics and the interpretations of the parameters. \(\beta_0\) is the difference between the the average age of Republicans and that of Democrats. \(\beta_1\) is the average age of Democrats. As shown in our model, \(\beta_0\) and \(\beta_1\) are two terms that allow us to make our model. The second part in the right-hand side, \(\epsilon\) (“epsilon”), represents the unexplained part of the outcome and is called the error term. In other words, \(\sigma\) is what plays a role with age that is not factored into our model. We assume that this error follows a normal distribution with an expected value of 0 (meaning it is 0 on average) and it is simply the difference between the outcome and our model predictions. Therefore, we have three main parameters: \(\beta_0\), \(\beta_1\), and \(\sigma\) that we will discuss within this section.
Some things we want to note about our model:
The small \(i\)’s are an index to number the observations. It is equivalent to the “ID” column in our Preceptor Table and simply states that the outcome for person \(i\) is explained by the modeled and non-modeled factors for person \(i\).
The model is a claim about how the world works, not just for the 115 individuals for which we have data but for the all the people in the population for which we seek to draw inferences.
Although terminology differs across academic fields, the most common term to describe a model like this is a “regression.” We are “regressing”
age
onparty
in order to see if they are associated with each other. The formula above is a “regression formula”, and the model is a “regression model.” This terminology would also apply to our model ofheight
in Chapter 4.The model in Chapter 4 is sometimes called “intercept-only” because the only (interesting) parameter is the intercept.
5.1.3.2 Fitted Model
With the help of Courage we can translate this math into code.
To get posterior distributions for our three parameters, we will again use stan_glm()
, just as we did in Chapter 4.
Code
fit_1 <- stan_glm(age ~ party,
data = trains,
family = gaussian,
seed = 17,
refresh = 0)
The variable before the tilde,
age
, is our outcome.The only explanatory variable is
party
. This variable has only two values, ‘Democrat’ and ‘Republican’.Recall that our model is linear. Since we are using a linear model, the
family
we use will begaussian
.
The resulting output:
Code
fit_1
stan_glm
family: gaussian [identity]
formula: age ~ party
observations: 115
predictors: 2
------
Median MAD_SD
(Intercept) 42.6 1.2
partyRepublican -1.5 3.0
Auxiliary parameter(s):
Median MAD_SD
sigma 12.2 0.8
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
The key concept is the idea of a “population.” From which larger population is the data we have being (conceptually) drawn? If we were only interested in the age of individuals in our data set, we would have no need for inference. We know everyone’s ages already. We only need tools like stan_glm()
if we seek to understand individuals not in our data.
The intercept, 42.6 estimates the age of Democrats. The median suggests that the average age that is represented by the age is roughly 42.6 with a confidence of about 1.2 (which is good). We want to have a MAD_SD that is smaller because that represents that the median will be more accurate when predicted. The partyRepublican
estimate which is -1.5 means that on average they would be 1.5 years younger than the median age for the Democrats. However, in this case the difference between the average age of Democrats and Republicans is less accurate when compared to the intercept above. The MAD_SD of partyRepublican
is higher at 3.0 which means that there is more room for error to occur.
When we are creating our fitted model we need to consider the type of model we are creating. Any model with the age as its dependent variable will be predictive, not causal, for the reason that nothing, other than time, can change your age. You are X years old. It would not matter if you changed your party registration from Democrat to Republican or vice versa. Your age is your age. The underlying model which connects age with party is less important than the brute statistical fact that there is a connection. Predictive models care nothing about causality.
5.1.3.3 Model Checks
Now that we have created our fitted model we can perform model checks to ensure that our fitted model is reasonably accurate. We attempt to the make the model as accurate as possible, however, we can only view whether it is accurate through model checks that we need to perform.
Consider a table which shows a sample of 8 individuals.
8 Observations from Trains Dataset | |||
Age | Party | Fitted | Residual |
---|---|---|---|
31 | Democrat | 42.58 | -11.579447 |
34 | Republican | 41.12 | -7.119306 |
63 | Democrat | 42.58 | 20.420553 |
45 | Democrat | 42.58 | 2.420553 |
55 | Democrat | 42.58 | 12.420553 |
37 | Democrat | 42.58 | -5.579447 |
53 | Republican | 41.12 | 11.880694 |
36 | Democrat | 42.58 | -6.579447 |
If we take a look at the the observations above we see that the model only produces one fitted value for each condition. This table just takes a sample of the 8 individuals from the entire dataset to capture the wide range of residuals. When we take a look at the residuals to the right hand side, we can view the error at hand. When we have the negative values that means that the age should be younger, and positive values means that the age should be higher. For the most part, the median age seems to be somewhat accurate as most of the values are within a reasonable range of the median age for both the Democrats and the Republicans. We can get a better picture of the unmodeled variation in our sample if we plot these three variables for all the individuals in our data.
The following three histograms show the actual outcomes, fitted values, and residuals of all people in trains
:
Code
outcome <- ch5 |>
ggplot(aes(age)) +
geom_histogram(bins = 100) +
labs(x = "Age",
y = "Count")
fitted <- tibble(age = fitted(fit_1)) |>
ggplot(aes(age)) +
geom_bar() +
labs(x = "Fitted Values",
y = NULL) +
scale_x_continuous(limits = c(20, 70))
res <- tibble(resids = residuals(fit_1)) |>
ggplot(aes(resids)) +
geom_histogram(bins = 100) +
labs(x = "Residuals",
y = NULL)
outcome + fitted + res +
plot_annotation(title = "Decomposition of Height into Fitted Values and Residuals")
The three plots are structured like our equation and table above. A value in the left plot is the sum of one value from the middle plot plus one from the right plot.
The actual age distribution looks like a normal distribution. It is centered around 43, and it has a standard deviation of about 12 years.
The middle plot for the fitted values shows only two adjacent spikes, which represent the estimates for Democrats and Republicans.
Since the residuals plot represents the difference between the other two plots, its distribution looks like the first plot.
These three plots play a strong role in understanding the error that is associated with the fitted model. With the fitted model that we have created we only have a smaller number of values when compared to the actual age. The bigger gap of residuals proves that the error that we have associated with the model makes the fitted model reasonably inadequate.
Another model check that we need to run is the posterior predictive check. With the help of the posterior predictive check we can run a “fake-data” simulation upon our fitted model that generates a distribution of the data. A posterior predictive check is targeted for the comparison between what the fitted model generates and the actual observed data. The aim is to detect if the model is inadequate to describe the data. With the amount of observations and times we run the “fake-data” simulation we can view whether our fitted model is reasonably accurate when compared to our actual data based on the distribution.
pp_check()
will create a graph that will plot the “fake-data” simulation when compared to the actual data. We are able to control how many times that we run the simulation through nreps
. Through the use of plotfun
we are able to classify what type of graph that we want to create. With the help of pp_check()
we are able to run a posterior predict check that allows us to check how accurate our data is.
Code
pp_check(fit_1, plotfun = "hist", nreps = 5, binwidth = 3)
Let’s take a deeper look at the graphs that we have above. Our graph in the darker blue represents our actual data. As we can see with the lighter blue graph, our fitted model is able to generate a distribution that is similar when compared to the actual data. One important item that we want to note is that for the actual data there is no value that is under 20 for age, however, in the fitted model we are able to see several general values for which the ages are under 20. For the most part our fitted model does a great job in generating a distribution through the “fake-data” simulation when compared to the actual data set.
5.1.3.4 Data Generating Mechanism (DGM)
Recall the Preceptor Table and the units from the beginning. We are concerned with three main ideas from the Preceptor Table: our units (individuals with unique attributes at the train station), our outcome (individual’s age), our covariates (the party membership of each individual). With the help of the data generating mechanism that we have created from our fitted model we can fill in the missing values to the Preceptor Table. We use the data generating mechanism as the last step of courage as it allows us to analyze and view our fitted model’s data.
Let’s view all of our fitted data through a pleasing table:
Code
gtsummary::tbl_regression(fit_1, intercept = TRUE) %>%
bold_labels()
Characteristic | Beta | 95% CI^{1} |
---|---|---|
(Intercept) | 43 | 40, 45 |
party | ||
Democrat | — | — |
Republican | -1.5 | -7.3, 4.3 |
^{1} CI = Credible Interval |
Comments:
Democrats seem slightly older than Republicans. That was true in the sample and so, almost (but not quite!) by definition, it will be true in our the posterior probability distributions. We can see this through the
partyRepublican
that is negative.Our estimate for the average age of Democrats in the population is much more precise than that for Republicans because we have five times as many Democrats as Republicans in our sample. A central lesson from Chapter 3 is that the more data you have related to a parameter, the narrower your posterior distribution will be.
The phrase “in the population” is doing a great deal of work because we have not said what, precisely, we mean by the “population.” Is it the set of people on those commuter platforms on those days in 2012 when the experiment was done? Is it the set of people on all platforms, including ones never visited? Is it the set of all Boston commuters? All Massachusetts residents? All US residents? Does it include people today, or can we only draw inferences for 2012? We should explore these questions in every model we create.
The parameters \(\beta_0\) and \(\beta_1\) can be interpreted in two ways. First, like all parameters, they are a part of the model. We need to estimate them. But, in many cases, we don’t really care what the value of the parameter is. The exact value of \(\sigma\), for example, does not really matter. Second, some parameters have a substantive interpretation, as with \(\beta_0\) and \(\beta_1\) being the average age and difference in the population. There are several possibilities that still exist with age and party. Each variable changes the way that we view the data because they hold several outcomes based on the way that we interpret them. All of these seemingly “minute” factors play a strong role in changing the outcome of our fitted model and how we are able to provide more accurate results with the help of our model.
In summary: we model the age of individuals at a train station as a linear function of the party membership. We find that the Republicans are about one and a half years younger than the Democrats at the train station.
5.1.4 Temperance
Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM to answer the questions we began with.
Recall the first question:
What is the probability that, if a Democrat shows up at the train station, he will be over 50 years old?
Create a tibble with the assumed input for our model. In our case the tibble has a variable named “party” which contains a single observation with the value “Democrat”. This is a bit different than Chapter 4.
Code
new_obs <- tibble(party = "Democrat")
Use posterior_predict()
to create draws from the posterior for this scenario. Note that we have a new posterior distribution under consideration here. The unknown parameter, call it \(D_{age}\), is the age of a Democrat. This could be the age of a randomly selected Democrat from the population or of the next Democrat we meet or of the next Democrat we interview on the train platform. The definition of “population” determines the appropriate interpretation. Yet, regardless, \(D_{age}\) is an unknown parameter. But it is not one — like \(\beta_0\), \(\beta_1\), or \(\sigma\) — for which we have already created a posterior probability distribution. That is why we need posterior_predict()
.
posterior_predict()
takes two arguments: the model for which the simulations should be run, and a tibble indicating for which and how many parameters we want to run these simulations. In this case, the model is the one from Courage and the tibble is the one we just created.
Code
pp <- posterior_predict(fit_1, newdata = new_obs) |>
as_tibble()
head(pp, 10)
# A tibble: 10 × 1
`1`
<dbl>
1 48.4
2 54.6
3 65.5
4 49.7
5 17.4
6 67.5
7 58.9
8 49.4
9 39.3
10 48.9
The result are draws from the posterior distribution of the age of a Democrat. It is important to understand that this is not a concrete person from the trains
dataset - the algorithm in posterior_predict()
simply uses the existing data from trains
to create the DGM, which is then used to sample from the posterior distribution of \(D_age\).
Code
pp |>
ggplot(aes(x = `1`)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for a Random Democrat's Age",
subtitle = "Individual predictions are always more variable than expected values",
x = "Age",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
Once we have the posterior distribution, we can answer (almost) any reasonable question. In this case, the probability that the next Democrat will be over 50 is around 28%.
The second question:
In a group of three Democrats and three Republicans, what will the age difference be between the oldest Democrat and the youngest Republican?
As before we start by creating a tibble with the assumed input. Note that the name of the column (“party”) and the observations (“Democrat”, “Republican”) must always be exactly as they are in the original data set. This tibble as well as our model can then be used as arguments for posterior_predict()
:
Code
newobs <- tibble(party = c("Democrat", "Democrat", "Democrat",
"Republican", "Republican","Republican"))
posterior_predict(fit_1, newdata = newobs) |>
as_tibble()
# A tibble: 4,000 × 6
`1` `2` `3` `4` `5` `6`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 40.7 33.8 42.2 31.7 33.0 42.0
2 63.9 41.7 56.6 45.0 32.1 42.0
3 51.0 39.1 12.4 41.9 32.5 56.6
4 28.4 60.2 32.0 32.2 34.2 40.3
5 59.6 57.9 53.3 48.6 24.4 41.7
6 55.1 60.0 37.4 24.4 55.6 55.1
7 45.7 45.1 30.5 33.5 42.7 32.5
8 34.4 53.6 56.1 51.0 34.7 53.7
9 40.4 25.0 28.5 36.8 38.7 43.7
10 40.7 57.7 25.6 39.5 50.8 30.5
# ℹ 3,990 more rows
We have 6 columns: one for each person. posterior_predict()
does not name the columns, but they are arranged in the same order in which we specified the persons in newobs
: D, D, D, R, R, R. To determine the expected age difference, we add code which works with these posterior draws:
Code
pp <- posterior_predict(fit_1, newdata = newobs) |>
as_tibble() |>
# We don't need to rename the columns, but doing so makes the subsequest
# code much easier to understand. We could just have worked with columns 1,
# 2, 3 and so on. Either way, the key is to ensure that you correctly map
# the covariates in newobs to the columns in the posterior_predict object.
set_names(c("dem_1", "dem_2", "dem_3",
"rep_1", "rep_2", "rep_3")) |>
rowwise() |>
# Creating three new columns. The first two are the highest age among
# Democrats and the lowest age among Republicans, respectively. The third one
# is the difference between the first two.
mutate(dems_oldest = max(c_across(dem_1:dem_3)),
reps_youngest = min(c_across(rep_1:rep_3)),
age_diff = dems_oldest - reps_youngest)
pp
# A tibble: 4,000 × 9
# Rowwise:
dem_1 dem_2 dem_3 rep_1 rep_2 rep_3 dems_oldest reps_youngest age_diff
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 32.9 52.6 39.6 37.2 34.2 38.5 52.6 34.2 18.4
2 31.9 31.1 51.9 22.0 41.0 32.2 51.9 22.0 30.0
3 59.6 56.4 43.2 51.6 37.6 45.3 59.6 37.6 22.0
4 31.9 65.9 46.2 31.6 55.2 21.9 65.9 21.9 44.0
5 8.82 34.3 70.6 39.1 41.1 41.6 70.6 39.1 31.5
6 34.1 50.7 44.1 50.3 50.2 48.7 50.7 48.7 2.04
7 49.8 26.4 43.7 44.8 36.7 18.8 49.8 18.8 31.0
8 45.0 50.4 64.9 51.5 52.7 24.5 64.9 24.5 40.4
9 47.4 46.1 61.6 42.8 28.2 57.7 61.6 28.2 33.4
10 38.1 54.5 47.9 64.8 34.4 32.3 54.5 32.3 22.2
# ℹ 3,990 more rows
The plotting code is similar to what we have seen before:
Code
pp |>
ggplot(aes(x = age_diff)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Age Difference",
subtitle = "Oldest of three Democrats compared to youngest of three Republicans",
x = "Age",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
In words, we would expect the oldest Democrat to be about 22 years older than the youngest Republican, but we would not be too surprised if the oldest Democrat was actually younger than the youngest Republican in a group of 6.
We should always be cautious about our inferences. Our assumptions are never true. We need to be cautious with our inferences when we deal with all of the assumptions that we have made within the chapter regarding validity, stability, representativeness, and model structure. When we take a look at the assumption of validity a possibility where it may not hold would be for what it means to be a “Republican” might be very different in Massachusetts (meaning in our data) than what it means in the rest of the United States (meaning the Preceptor Table). When we take a look at the assumption of stability a possibility where it may not hold would be for the time frame that we are currently considering and whether we would still need to assume stability from 2014 (time of data collection) to today. When we take a look at the assumption of representativeness a possibility where it may not hold would be with correlation between our “sampling mechanism” and other variables to be true. When we take a look at the assumption of the model structure a possibility where it may not hold would be with the flaw of the fitted model where we conducted a posterior predict check that generated a distribution of values that were under the age of 20 which is not possible with our actual data. Our stated inferences almost certainly underestimate the true uncertainty of the world.
We need to maintain humility when we are making our inferences and decisions. Stay cautious my friends.
5.2 att_end ~ treatment
Above, we created a predictive model: with someone’s party affiliation, we can make a better guess as to what their age is than we could have in the absence of information about their party. There was nothing causal about that model. Changing someone’s party registration can not change their age. In this example, we build a causal model. Consider these two questions:
What is the average treatment effect, of exposing people to Spanish-speakers, on their attitudes toward immigration?
What is the largest causal effect which still has a 1 in 10 chance of occurring?
Answering causal questions requires (at least) two potential outcomes: immigration attitudes when a person receives the treatment of being exposed to Spanish-speakers and immigration attitudes, for that same person, when they don’t receive the treatment. We can answer these and similar questions by creating a model with immigration attitude as the dependent variable and exposure to Spanish-speakers as the independent variable. Let’s follow the four Cardinal Virtues: Wisdom, Justice, Courage and Temperance.
5.2.1 Wisdom
Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of “validity,” as to whether or not we can (reasonably!) assume that the two come from the same population.
5.2.1.1 Preceptor Table
First, do we need a causal or predictive model? In this case, the model is clearly causal, so our Preceptor Table will have two columns for the potential outcomes. If all you need to know to answer the question is the outcome under one value of the treatment, then the model is predictive. Example: What is the attitude for all women if they were to get the treatment? Here we a question that implies a comparison, for a single individual unit, between two states of the world, when where they get treatment and one where they do not get treatment.
Second, what is the outcome? A person’s attitude toward immigration is the outcome. When we take a look at the questions we discover a flaw exists. The individual that asked this question did not tell us where or when we are looking at these questions. It is important to understand where and when that we are answering these questions because otherwise our data could be not accurately applied to the whole country, nation, or specific state. Therefore, by diving deeper within these questions we are able to accurately answer our question with the population that it refers to. For the questions that we have at hand we will be talking about all the adults in July 1, 2012 at the location of Chicago, Illinois which will allow us to answer the questions more accurately.
Third, what are the units? Our units for this scenario would be individuals because the questions are about the attributes of unique people at the station.
Fourth, do we have a treatment? Yes. In any causal model, there is at least one covariate which is defined as the “treatment,” something which we can manipulate so that some units receive one version and other units get a different version. A “treatment” is just a covariate in which we can manipulate and need to manipulate, at least in theory, to answer the question. In this case, the treatment is exposure to Spanish-speakers. Units can either be exposed, i.e., they receive the “treatment,” or they can not be exposed, i.e., they receive the “control.”
Let’s look at our refined question to create our Preceptor Table:
What is the average treatment effect, of exposing people to Spanish-speakers in Chicago Illinois in 2012, on their attitudes toward immigration?
Our Preceptor Table:
Preceptor Table | |||
ID | Potential Outcomes | Covariate | |
---|---|---|---|
Control Ending Attitude | Treated Ending Attitude | Treatment | |
1 |
5* |
8 |
Yes |
2 |
7 |
4* |
No |
… |
… |
… |
… |
10 |
3* |
5 |
Yes |
11 |
10* |
7 |
Yes |
… |
… |
… |
… |
N |
6 |
13* |
No |
Note: the values that have a star next to them symbolize the possible values that may exist if they were “control” instead of “treated” or vice versa.
Recall: a Preceptor Table is the smallest possible table with rows and columns such that, if there is no missing data, all our questions are easy to answer. To answer questions — like “What is the average treatment effect, of exposing people to Spanish-speakers, on their attitudes toward immigration? and What is the largest causal effect which still has a 1 in 10 chance of occurring?” — we need a row for every individual.
5.2.1.2 Exploratory Data Analysis
Recall the discussion from Chapter 1. Enos (2014) randomly placed Spanish-speaking confederates on nine train platforms around Boston, Massachusetts. The data that we want to analyze consists of the attitude toward immigration after the experiment is complete (att_end
) and the exposure to Spanish-speakers (treatment
) of each individual on these train platforms.
Code
trains |>
select(att_end, treatment)
# A tibble: 115 × 2
att_end treatment
<dbl> <fct>
1 11 Treated
2 10 Treated
3 5 Treated
4 11 Treated
5 5 Control
6 13 Treated
7 13 Control
8 11 Treated
9 12 Control
10 10 Treated
# ℹ 105 more rows
The treatment
can either be “Treated” or “Control” which are the two factors that may influence att_end
. Participants were asked three questions about immigration issues, each of which allowed for an answer indicated strength of agreement on a scale form 1 to 5, with higher values for att_end
indicating more agreement with conservative viewpoints.
att_end treatment
Min. : 3.000 Treated:51
1st Qu.: 7.000 Control:64
Median : 9.000
Mean : 9.139
3rd Qu.:11.000
Max. :15.000
The data include information about each respondent’s gender, political affiliations, age, income and so on. treatment
indicates whether a subject was in the control or treatment group. The key outcomes are their attitudes toward immigration both before (att_start
) and after (att_end
) the experiment.
summary()
shows us what the different values of att_end
and treatment
are because it is a factor. The range for att_end
seems reasonable.
Code
trains |>
ggplot(aes(x = att_end, fill = treatment)) +
geom_bar(aes(y = after_stat(count/sum(count))),
position = "dodge") +
labs(title = "Expected Attitude Toward Immigration",
subtitle = "Treated Individuals Are More Conservative",
x = "Attitude",
y = "Probability",
fill = NULL) +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
We can never know the true average attitude of all treated in the population. But we can calculate a probability distribution for each parameter. As we can see with the plot above, the treated individuals tend to be more conservative. When compared with the controlled population, the treated individuals tend to show more signs of being conservative after having the treatment applied to them.
5.2.1.3 Validity
Now the assumption of validity may not hold due to the possibility of when our data was collected. For example if the data was collected in the morning for our Preceptor Table the people that are interviewed may be more grumpy when going to work. Compared to the possibility of our other data being collected at night when the people are more relaxed coming back from work to go home.
Another case where the assumption of validity may not hold because the variables may be too different from each other within the data and the Preceptor Table as the columns are not similar enough between each other. One possible reason we may consider is that the Preceptor Table is all the people in Chicago in 2023. However, the data we have involves train commuters in the Boston area in 2012.
While the assumption can prove validity to be wrong, overall, the assumption of validity seems reasonable enough. att_end
and treatment
are similar enough between to the underlying concepts in the Preceptor Table and the data that we can “stack” them on top of each other. We can assume that both are drawn from the same population.
5.2.2 Justice
The five concerns of Justice in a data science project remain the same: Population Table, stability, representativeness, unconfoundedness and the mathematical structure of the data generating mechanism (DGM).
5.2.2.1 Population Table
After assuming validity, we can now create our Population Table. Recall that every row from both the Preceptor Table and the data is included in the Population Table, along with all the rows from the underlying population from which we assume that both the Preceptor Table and the data were drawn. The Population Table includes rows from three sources: the Preceptor Table, the actual data, and all other members of the population.
Population Table | |||||
Source | Potential Outcomes | Covariates | |||
---|---|---|---|---|---|
Controlled | Treated | Treatment | Year | City | |
… |
… |
… |
… |
… |
… |
Data |
2 |
7* |
No |
2012 |
Boston, MA |
Data |
10* |
6 |
Yes |
2012 |
Boston, MA |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
Preceptor Table |
… |
… |
… |
2023 |
Chicago, IL |
Preceptor Table |
… |
… |
… |
2023 |
Chicago, IL |
… |
… |
… |
… |
… |
… |
Note: the values that have a star next to them symbolize the possible values that may exist if they were “control” instead of “treated” or vice versa.
Our year within the Population table is an example of the moment in time.
Our actual data rows contain the information that we do know. These rows contain entries for both our covariates and the outcomes. In this case, the actual data comes from a study conducted on train commuters around Boston, MA in 2012, so our city entries for these rows will read “Boston, MA” and our year entries of these rows will read “2012”.
5.2.2.2 Stability
Consider the stability of our model for the relationship between att_end
and treatment
between 2012 and 2023. Is this relationship from 2012, four years before Donald Trump’s election as president, still the same? While we might not know for sure, we have to consider this in order to continue and make assumptions with our data. For our purposes, we will consider the relationship to be stable. Even though we know that there may have been some changes, we will consider the model to be the same in both years.
5.2.2.3 Representativeness
Representativeness has to do with how well our sample represents the larger population we are interested in generalizing to. We might run into two potential problems: Is our data representative of the population? Is our Preceptor Table representative of the population? Let’s say that even though Boston is different from Chicago, we considered Boston to be perfectly representative of Chicago. Great, but this 2012 data could still not be representative. This is because there could be bias within those who are chosen to give the survey, in that the commuters who are approached and receive the surveys may not be random or representative. When we deal with representativeness we deal with two primary concerns: the generic concern and the variable concern. In our previous section we talked about the generic concern which applies the same within this situation. Our second level of concern deals with the variable concern which in this case deals with an individual’s attitude and treatment. In our data, we aren’t able to consider all of the ages of individuals within the broader population. For example: there are no ages less than 20, meaning we do represent a large amount of our broader population. What if the individuals giving out the surveys were younger and also tended to choose people to approach with a survey that were similar in age to them? A scenario like this could end up overestimating younger train commuters in the population, which could influence our answers to any of our questions. Specifically, when considering the relationship between att_end
and treatment
, this could influence the results of the model because younger individuals may have similar attitudes on immigration.
5.2.3 Courage
Once we have the mathematical structure of the model, we use Courage to create a fitted model. In Chapter Chapter 6, we will go into much more detail about the process of created a fitted model. This process will involve transforming the variables as well as selecting the variables to include and to discard. For now, however, we keep things simple.
5.2.3.1 Mathematics
The math for this model is exactly the same as the math for the predictive model in the first part of this chapter, although we change the notation a bit for clarity.
\[ attitude_i = \beta_1 + \beta_2 control_i + \epsilon_i\]
where \[control_i \in \{0,1\}\] \[\epsilon_i \sim N(0, \sigma^2)\]
Nothing has changed, except for the meaning of the data items and the interpretations of the parameters. Let’s take a look at the parameters of the situation which plays a role alongside the mathematics to create the fitted model.
On the left-hand side we still have the outcome, \(y\), however in this case, this is a person’s attitude toward immigration after the experiment is complete. \(y\) takes on integer values between 3 and 15 inclusive. On the right-hand side, the part contained in the model will consist of the terms \(\beta_1\) and \(\beta_2\). These two terms stand for the intercept and the control and as before, each term consists of a parameter and a data point. \(\beta_1\) is the average attitude toward immigration for treated individuals. \(\beta_2\) is the difference between the attitude toward immigration for treated individuals — those exposed to Spanish-speakers — and the average attitude toward immigration for control individuals — those not exposed to Spanish-speakers. These are both our parameters. The \(x\)’s are our explanatory variables and take the values 1 or 0. In other words, these are binary variables and are mutually exclusive.The last part, \(\epsilon\) (“epsilon”), represents the part that is not explained by our model and is called the error term. It is simply the difference between the outcome and our model predictions. In our particular case, this includes all factors that have an influence on someone’s attitude toward immigration but are not explained by treatment status. We assume that this error follows a normal distribution with an expected value of 0.
Note that the formula applies to everyone in the population, not just the 115 people for whom we have data. The index \(i\) does not just go from 1 through 115. It goes from 1 through \(N\), where \(N\) is the number of individuals in the population. Conceptually, everyone has an
att_end
under treatment and under control.The small \(i\)’s are an index for the data set. It is equivalent to the “ID” column in our Preceptor Table and simply states that the outcome for person \(i\) is explained by the predictor variables (\(treatment\) and \(control\)) for person \(i\), along with an error term.
5.2.3.2 Fitted Model
With Justice satisfied, we gather our Courage and fit the model. Note that, except for the change in variable names, the code is exactly the same as it was above, in our predictive model for age. Predictive models and causal models use the same math and the same code. The differences, and they are very important, lie in the interpretation of the results, not in their creation.
Code
fit_2 <- stan_glm(att_end ~ treatment,
data = trains,
seed = 45,
refresh = 0)
fit_2
stan_glm
family: gaussian [identity]
formula: att_end ~ treatment
observations: 115
predictors: 2
------
Median MAD_SD
(Intercept) 10.0 0.4
treatmentControl -1.5 0.5
Auxiliary parameter(s):
Median MAD_SD
sigma 2.8 0.2
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
att_end
is a measure of person’s attitude toward immigration. A higher number means more conservative, i.e., a more exclusionary stance toward immigration into the United States.
Note that once again, since we are using a linear model, we set the family
argument to “gaussian”.
5.2.3.3 Model Checks
Now that we have created our fitted model we need to perform model checks so that we can ensure that the fitted model we created is reasonably accurate. We attempt to the make the model as accurate as possible, however, we can only view whether it is accurate through model checks that we perform.
Let’s decompose the the dependent variable, att_end
into two parts: the fitted values and the residuals. There are only two possible fitted values, one for the Treated and one for the Control. The residuals, as always, are simply the difference between the outcomes and the fitted values.
The smaller the spread of the residuals, the better a job the model is doing of explaining the outcomes. The residuals that we have above explains our error that we have within the model. As we can see above, our model did not do a great job of predicting the attitude toward immigration as our fitted values only present two possible values. The credit of this goes towards the residuals as the error within our graph is huge causing for the fitted values to be smaller and less precise.
Another model check that we can and should run is the posterior predictive check. With the help of the posterior predictive check we can run a “fake-data” simulation upon our fitted model that generates a distribution based on the variables at hand. With the help of this check we can view our similar our distribution is when compared with the “fake-data” simulation and the actual data.
Our graph in the darker blue represents our actual data. As we can see with the lighter blue graph, our fitted model is able to generate a distribution of the data that is similar when compared to the actual data. However, the “fake-data” produces some values for att_end
which are impossible. We know from the survey details that the lowest possible value for att_end
is 3 and the highest is 15. But, with our “fake-data” simulations, we see several values that are less than 3 and great than 15. This is a flaw in our model. Is it a serious flaw? That is tough to decide. But, for now, we lack the necessary skills to improve the model enough to remove it. In later sections, we will be able to understand new techniques to improve our fitted models that tackles these kinds of flaws.
5.2.3.4 Data Generating Mechanism (DGM)
Recall the Preceptor Table and the units from the beginning. We are concerned with three main ideas from the Preceptor Table: our units (individuals with unique attributes at the train station), our outcome (individual’s attitude toward immigration) covariates (the treatment that each individual will receive). Now, let’s determine whether our Data Generating Mechanism will be linear or logistic. Since our outcome variable att_end
is a continuous variable since it has a range of possible values, we will use a linear model. With the help of the data generating mechanism that we have created from our fitted model we can fill in the missing values to the Preceptor Tables. We use the data generating mechanism as the last step of courage as it allows us to analyze and view our fitted model’s data.
Now that we have tested our fitted model through the model checks we can view all of our data with the table that we have below:
Characteristic | Beta | 95% CI^{1} |
---|---|---|
(Intercept) | 10 | 9.2, 11 |
treatment | ||
Treated | — | — |
Control | -1.5 | -2.6, -0.55 |
^{1} CI = Credible Interval |
(Intercept)
corresponds to 10.0 which is \(\beta_1\). As always, R has, behind the scenes, estimated the entire posterior probability distribution for \(\beta_1\). But the basic print method for these objects can’t show the entire distribution, so it gives us summary numbers: the median and the MAD SD. Speaking roughly, we would expect about 95% of the values in the posterior to be within two MAD SD’s of the median. In other words, we are 95% confident that the true, but unknowable, average attitude toward immigration among the Treated in the population to be between 9.2 and 10.8. treatmentControl
corresponds to \(\beta_2\). The treatmentControl
estimate is -1.5 attitudes younger as the difference between “control” and “treated” has been shown to be younger regarding attitudes for the median values.
Up until now, we have used the Bayesian interpretation of “confidence interval.” This is also the intuitive meaning which, outside of academia, is almost universal. There is a truth out there. We don’t know, and sometimes can’t know, the truth. A confidence interval, and its associated confidence level, tells us how likely the truth is to lie within a specific range. If your boss asks you for a confidence interval, she almost certainly is using this interpretation.
But, in contemporary academic research, the phrase “confidence interval” is usually given a “Frequentist” interpretation. (The biggest divide in statistics is between Bayesians and Frequentist interpretations. The Frequentist approach, also known as “Classical” statistics, has been dominant for 100 years. Its power is fading, which is why this textbook uses the Bayesian approach.) For a Frequentist, a 95% confidence interval means that, if we were to apply the procedure we used in an infinite number of future situations like this, we would expect the true value to fall within the calculated confidence intervals 95% of the time. In academia, a distinction is sometimes made between confidence intervals (which use the Frequentist interpretation) and credible intervals (which use the Bayesian interpretation). We won’t worry about that difference in this Primer.
In summary: we model the attitude of individuals at a train station as a linear function of the treatment each individual receives. We find that the individuals that receive any Spanish speaking treatment are about half a standard deviation more in attitude than the attitude for individuals that did not receive Spanish speaking treatment meaning that the individuals that received Spanish speaking treatment ended up being more conservative.
5.2.4 Temperance
Courage gave us the fitted model. With Temperance we can create posteriors of the quantities of interest. We should be modest in the claims we make.
Recall the first question with which we began this section:
- What is the average treatment effect, of exposing people to Spanish-speakers, on their attitudes toward immigration?
We can use posterior_epred()
to answer this question. Create a tibble and use it as we have done before:
Code
newobs <- tibble(treatment = c("Treated", "Control"))
pe <- posterior_epred(fit_2, newobs) |>
as_tibble() |>
mutate(ate = `1` - `2`)
pe
# A tibble: 4,000 × 3
`1` `2` ate
<dbl> <dbl> <dbl>
1 10.0 8.54 1.49
2 9.62 8.47 1.15
3 9.42 8.56 0.861
4 10.0 9.42 0.623
5 9.96 9.49 0.462
6 9.44 8.83 0.617
7 10.3 9.24 1.01
8 10.2 9.07 1.15
9 9.89 8.88 1.01
10 9.81 9.06 0.750
# ℹ 3,990 more rows
The posterior probability distribution created with posterior_epred()
is the same as the one produced by manipulating the parameters directly.
Code
pe |>
ggplot(aes(x = ate)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Average Treatment Effect",
subtitle = "Exposure to Spanish-speakers shifts immigration attitudes rightward",
x = "Difference in Attitude",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
Our second question:
- What is the largest effect size which still has a 1 in 10 chance of occurring?
Create a tibble which we can pass to posterior_predict()
. The variables in the tibble which will be passed in as newdata
. Fortunately, the tibble we created above is just what we need for this question also.
Consider the result of posterior_predict()
for two people, one treated and one control. Take the difference.
Code
pp <- posterior_predict(fit_2,
newdata = newobs) |>
as_tibble() |>
mutate(te = `1` - `2`)
pp
# A tibble: 4,000 × 3
`1` `2` te
<dbl> <dbl> <dbl>
1 12.3 1.35 11.0
2 8.29 11.2 -2.92
3 9.18 8.72 0.458
4 10.8 9.94 0.900
5 7.33 11.1 -3.75
6 12.6 4.50 8.08
7 11.2 14.8 -3.51
8 9.84 9.83 0.00276
9 6.66 6.68 -0.0171
10 8.22 7.79 0.427
# ℹ 3,990 more rows
Create a graphic:
Code
pp |>
ggplot(aes(x = te)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Treatment Effect for One Person",
subtitle = "Causal effects are more variable for indvduals",
x = "Difference in Attitude",
y = "Probability") +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
In this case, we are looking at the distribution of the treatment effect for a single individual. This is very different than the average treatment effect. In particular, it is much more variable. We are looking at one row in the Preceptor Table. For a single individual, att_end
can be anywhere from 3 to 15, both under treatment and under control. The causal effect — the difference between the two potential outcomes can, in theory, be anywhere from -12 to +12. Such extreme values are rare, but not impossible.
The question, however, was interested in the value at the 90th percentile.
Code
quantile(pp$te, prob = 0.9)
90%
6.516448
We would not expect a treatment effect of this magnitude to be common, but, at the same time, effects this big and bigger will occur about 10% of the time.
We should always be cautious about our inferences. Our assumptions are never true. We need to be cautious with our inferences when we deal with all of the assumptions that we have made within the chapter regarding validity, stability, representativeness, and model structure. When we take a look at the assumption of validity a possibility may not hold due to when our data was collected, for example if the data was collected in the morning the people that are interviewed may be more grumpy when going to work. When we take a look at the assumption of stability a possibility where it may not hold would be the definition of our party and how the definition of conservative may change over time. When we take a look at the assumption of representativeness a possibility where it may not hold would be with correlation between our “assignment mechanism” and other variables to be true. When we take a look at the assumption of the model structure a possibility where it may not hold would be with the flaw of the fitted model where we conducted a posterior predict check that generated a distribution of values that were less than 3 and above 15 indicating that there are flaws within the model. Our stated inferences almost certainly underestimate the true uncertainty of the world.
We need to maintain humility when we are making our inferences and decisions. Stay cautious my friends.
5.3 income ~ age
So far, we have only created models in which the predictor variable is discrete, with two possible values. party
is either “Democrat” or “Republican”. treatment
is either “Treated” or “Control”. Often times, however, the predictor variable will be continuous. We can answer these and similar questions by creating a model that uses party affiliation to predict age. Let’s follow the four Cardinal Virtue: Wisdom, Justice, Courage and Temperance. Fortunately, the exact same approach works in this case. Consider:
What would you expect the income to be for a 40-year old?
5.3.1 Wisdom
Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of “validity,” as to whether or not we can (reasonably!) assume that the two come from the same population.
5.3.1.1 Preceptor Table
First, do we need a causal or predictive model? In our question above we have a predictive model as we are only considered with one outcome: a person’s income based on their age. Remember the motto: No causation without manipulation.
Second, what is the outcome? A person’s income would be our outcome. The individual that asked this question did not tell us where or when we are looking at these questions. Do we need to look at the year 2022, 2020, 2018? Do we need to look in Florida, Connecticut, Maryland? It is vital to understand where and when that we are answering these questions because otherwise our data could be not accurately applied to the whole country, nation, or specific state. Therefore, by diving deeper within these questions we are able to accurately answer our question with the population that it refers to. For the questions that we have at hand we will be talking about all the adults in July 1, 2012 at the location of the United States which will allow us to answer the questions more accurately.
Third, what are the units? Our units for this scenario would be dollars as we are concerned with dollars at the station for 40-year olds.
Fourth, what are the covariates? In this case, a person’s age is the only covariate. The Preceptor Table must include any covariates mentioned in the question.
Fifth, do we have a treatment? No. We will be using a predictive model as discussed above and treatments will only apply to situations with causal models. A treatment is just a covariate which we can, at least in theory, manipulate, thereby creating other potential outcomes.
Let’s look at our refined question to create our Preceptor Table:
What would you expect the income to be for a 40-year old in the United States in 2012?
Our Preceptor Table:
Preceptor Table | ||
ID | Outcome | Covariate |
---|---|---|
Income | Age | |
1 |
150000 |
31 |
2 |
50000 |
58 |
… |
… |
… |
10 |
65000 |
67 |
11 |
35000 |
23 |
… |
… |
… |
N |
78000 |
44 |
Recall: a Preceptor Table is table with rows and columns such that, if there is no missing data, all our questions are easy to answer. To answer questions — like “What would you expect the income to be for a 40-year old?” — we need a row for every individual at the station.
When we take a look at the questions we discover a flaw exists. The individual that asked this question did not tell us where or when we are looking at these questions. Do we need to look at the year 2022, 2020, 2018? Do we need to look in Florida, Connecticut, Maryland? It is vital to understand where and when that we are answering these questions because otherwise our data could be not accurately applied to the whole country, nation, or specific state. Therefore, by diving deeper within these questions we are able to accurately answer our question with the population that it refers to. For the questions that we have at hand we will be using the year 2023 and the location of Chicago, Illinois which will allow us to answer the questions more accurately.
5.3.1.2 Data Analysis
Recall the discussion from Chapter 1. Enos (2014) randomly placed Spanish-speaking confederates on nine train platforms around Boston, Massachusetts. The data that we want to analyze consists of the age and party of each individual on these train platforms based on our variables in our Preceptor Table. These reactions were measured through changes in answers to three survey questions.
Code
trains |>
select(age, income)
# A tibble: 115 × 2
age income
<int> <dbl>
1 31 135000
2 34 105000
3 63 135000
4 45 300000
5 55 135000
6 37 87500
7 53 87500
8 36 135000
9 54 105000
10 42 135000
# ℹ 105 more rows
As specified in the Preceptor Table we only care about age
is the age of the respondent and income
is the income of the respondent. The age
and income
are both numbers either as a double or integer which plays a role in making up an individual’s identity.
age income
Min. :20.00 Min. : 23500
1st Qu.:33.00 1st Qu.: 87500
Median :43.00 Median :135000
Mean :42.37 Mean :141813
3rd Qu.:52.00 3rd Qu.:135000
Max. :68.00 Max. :300000
summary()
shows us what the different values of age
and income
are because it is a factor. The range for age
seems reasonable. Recall that participants were asked three questions about immigration issues, each of which allowed for an answer indicated strength of agreement on a scale form 1 to 5, with higher values indicating more agreement with conservative viewpoints.
Code
trains |>
ggplot(aes(x = age, y = income)) +
geom_point() +
labs(x = "Age",
y = "Income") +
scale_y_continuous(labels = scales::dollar_format()) +
theme_classic()
We can never know the true average income of the population. But we can calculate a distribution for each parameter. As we can see with the plot above, the older individuals tend to be have more of an income. When compared with the younger individuals, the older individuals tend to exhibit a greater income because of the longer life and possibilities they have.
5.3.1.3 Validity
Recall that validity refers to whether or not we can consider the columns age
and income
to have the same meaning in our data set of 2012 Boston train commuters and in our Preceptor Table. While age
doesn’t really change meaning over time, income
can be impacted by inflation. After all, $100,000 in 2012 doesn’t have the same worth as $100,000 now due to inflation and now would have less purchasing power. This would result in income being underestimated within our model. However, since there hasn’t been drastic inflation that dramatically changed the buying power, we will consider income to be valid. If there had been like 300% inflation, however, our conclusion would probably be different.
Now the assumption of validity may not hold due to the possibility of how our age data was collected. For example, the data has been collected from filling out surveys about age without checking a driver’s license. We don’t know whether our age is accurate enough when compared to the driver’s license of an individual.
Another possible scenario that could prove the assumption of validity wrong could be the privacy of each station. The privacy of a train station differs from filling out surveys at home. The key difference that exists between the Preceptor Table and the data can exist from the difference in privacy that we have between the two sets of data.
While the assumption can prove validity to be wrong, overall, the assumption of validity seems reasonable enough. age
and income
are similar enough between to the underlying concepts in the Preceptor Table and the data that we can “stack” them on top of each other. We can assume that both are drawn from the same population.
5.3.2 Justice
Once again, in Justice, we must consider the Population Table, stability, representativeness, unconfoundedness and the mathematical structure of the data generating mechanism (DGM).
5.3.2.1 Population Table
With our assumption of validity proving to be reasonably accurate, we can now create our Population Table. We can make a Population Table with our Preceptor Table and other data based on our assumptions that we have made.
Population Table | ||||
Source | Outcome | Covariates | ||
---|---|---|---|---|
Income | Age | Year | City | |
… |
… |
… |
… |
… |
Data |
150000 |
43 |
2012 |
Boston, MA |
Data |
50000 |
52 |
2012 |
Boston, MA |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
Preceptor Table |
… |
… |
2023 |
Wilmington, DE |
Preceptor Table |
… |
… |
2023 |
Atlanta, GA |
… |
… |
… |
… |
… |
Our year within the Population table is an example of the moment in time.
The Population Table includes commuters from our Preceptor Table with the information that we would ideally have to answer the questions and those from the data we have that is specific to Boston, MA. Additionally, the Population Table also includes the groups of people within the population from which both the data we have and the Preceptor table is drawn from that we don’t have.
5.3.2.2 Stability
Consider the stability of our model and if we believe that our relationship between age and income has changed between 2012 and now. Once again, let’s consider inflation and how that could impact income. If incomes were to increase at the rate of inflation, then the income distribution would be different than that of 2012. However, wages don’t tend to change as quickly as inflation does, so they likely did not change significantly and we can consider this model to be stable.
5.3.2.3 Representativeness
Representativeness has to do with how well our sample represents the larger population we are interested in generalizing to. When we deal with representativeness we deal with two potential problems: Is our data representative of the population? Is our Preceptor Table representative of the population? For example: Can we assume that Boston train commuters are perfectly representative of Chicago train commuters AND those who were approached to respond to the survey? The location in which people commute from can change the way that we interpret their attitude regarding the treatment. For the predictive question, representativeness may not apply as the attitude regarding each individual in Boston might differ from the US as they don’t refer to the same broader population making it not representative of the entire population. When we deal with representativeness we deal with two primary concerns: the generic concern and the variable concern. In our previous section we talked about the generic concern which applies the same within this situation. Our second level of concern deals with the variable concern which in this case deals with an individual’s income based on their age. Within our sample, we don’t consider the entire broader population in the sample, we only consider the folks that are in the Boston area which means that the way that we relate the Boston individuals to the broader population may not hold. The train station environment, time of day, and initial attitude all play a strong factor in influencing the representativeness of the entire population. Another factor exists from having more young people chosen to respond by those handing out surveys like we discussed in our last model, what if were to assume that when the surveys are handed out randomly, younger people tended to fill out and submit them more than those who are older? Well, this would still skew the age distribution and overestimate younger people in the population, and if younger people also tend to have a lower income than older people, this could also alter our answers to our current questions.
If we had reason to believe this is true, one way that we could fix this issue of representativeness is to alter our population to be train commuters in the US who would respond to the survey. In doing so, our population would then accommodate for the skewed age distribution under the assumption that younger individuals tend to respond to surveys at higher rates than older people.
5.3.3 Courage
Once we have construct the mathematical structure of our model through the data generating mechanism we can use Courage to create a fitted model. The process of creating our fitted model will involve the transformation of several variables with a discussion and what variables to include and discard.
5.3.3.1 Mathematics
Let’s interpret courage with with a look at the mathematics:
The mathematics for a continuous predictor is unchanged from the intercept-including example we explored previously.
\[income_i = \beta_0 + \beta_1 age_i + \epsilon_i\]
When comparing two people (persons 1 and 2), the first one year older than the second, \(\beta_1\) is the expected difference in their incomes. The algebra is simple. Start with the two individuals.
\[income_1 = \beta_0 + \beta_1 age_1\] \[income_2 = \beta_0 + \beta_1 age_2\] We want the difference between them, so we subtract the second from the first, performing that subtraction on both sides of the equals sign.
\[income_1 - income_2 = \beta_0 + \beta_1 age_1 - \beta_0 - \beta_1 age_2\\ income_1 - income_2 = \beta_1 age_1 - \beta_1 age_2\\ income_1 - income_2 = \beta_1 (age_1 - age_2)\]
So, if person 1 is one year older than person 2, we have:
\[income_1 - income_2 = \beta_1 (age_1 - age_2)\\ income_1 - income_2 = \beta_1 (1)\\ income_1 - income_2 = \beta_1\]
The algebra demonstrates that \(\beta_1\) is the same for all ages. The difference in expected income between two people aged 23 and 24 is the same as the difference between two people aged 80 and 81. Is that plausible? Maybe. The algebra does not lie. When we create a model like this, this is the assumption we are making.
Note how careful we are not to imply that increasing age by one year “causes” an increase in income. That is nonsense! No causation without manipulation. Since it is impossible to change someone’s age, there is only one potential outcome. With only one potential outcome, a causal effect is not defined.
As important as our mathematics is, we need to take a look at the parameters at hand. Looking at our parameters allows us to create fitted model by combining the mathematics and interpretations of the parameters. \(\beta_0\) is the difference in expected income between individuals. \(\beta_1\) is the average age within all of the people in the dataset. As shown in our model, \(\beta_0\) and \(\beta_1\) are two terms that allow us to make our model. The second part in the right-hand side, \(\epsilon\) (“epsilon”), represents the unexplained part of the outcome and is called the error term. In other words, \(\sigma\) is what plays a role with age that is not factored into our model. We assume that this error follows a normal distribution with an expected value of 0 (meaning it is 0 on average) and it is simply the difference between the outcome and our model predictions. Therefore, we have three main parameters: \(\beta_0\), \(\beta_1\), and \(\sigma\) that we will use within this section.
5.3.3.2 Fitted Model
With Justice satisfied, we can use Courage to fit the model. The differences lie in how we interpret the results not with the creation of the model. The use of stan_glm()
is the same as usual.
Code
stan_glm
family: gaussian [identity]
formula: income ~ age
observations: 115
predictors: 2
------
Median MAD_SD
(Intercept) 103288.5 24077.6
age 907.8 535.8
Auxiliary parameter(s):
Median MAD_SD
sigma 74121.3 4897.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
When comparing two individuals, one 30 years old and one 40, we expect the older to earn about $9,000 more. But we are far from certain: the 95% confidence interval ranges from -$3,000 to $20,000.
The above is a good summary of the models.
It is brief! No one wants to listen to too much of your prattle. One sentence gives a number of interest. The second sentence provides a confidence interval.
It rounds appropriately. No one wants to hear a bunch of decimals. Use sensible units.
It does not just blindly repeat numbers in the printed display. A one year difference in age, which is associated with a $900 difference in income, is awkward. (We think.) A decade comparison is more sensible.
“When comparing” is a great phrase to start the summary of any non-causal model. Avoid language like “associated with” or “leads to” or “implies” or anything which even hints at a causal claim.
5.3.3.3 Model Checks
With the fitted model that we have created we are able to perform model checks. The model checks that we perform will help us understand how accurate our model is because we need to be able to ensure that our fitted model is reasonably accurate. We can view our model through model checks that interpret the residuals (error), outcome, and “fake-data” simulations.
Consider our usual decomposition of the outcome into two parts: the model and the error term.
There are scores of different fitted values. Indeed, there are a greater number of different fitted values than there are different outcome values! This is often true for models which have continuous predictor variables, we have here with age
.
Another model check that we will run is the posterior predictive check. With the help of the posterior predictive check we can run a “fake-data” simulation upon our fitted model that generates a distribution that allows us to view whether our fitted model is able to produce a distribution that is similar to the actual data distribution.
Our graph in the darker blue represents our actual data. As we can see with the lighter blue graph, our fitted model is able to generate a distribution for the values of income that very similar to the distribution of values for income for our actual data. When we take a look at the generated distribution that has been generated from our fitted model we can view that we don’t have highly trusted responses to the results we have at hand. These three graphs that we have simulated all have data that is reasonably accurate when compared to the actual data, however, the residuals present within the fitted model will change the way that graph is presented.
5.3.3.4 Data Generating Mechanism (DGM)
Recall the Preceptor Table and the units from the beginning. We are concerned with three main ideas from the Preceptor Table: our units (dollars for the 40-year olds at the train station), our outcome (individual’s income) covariates (the age of each individual). Now that we have run a posterior predict check to ensure that we have a fitted model that best suits us we can create a table like the one below to visualize the data we currently have. With the help of the data generating mechanism that we have created from our fitted model we can use this to fill in all of the missing values for our Preceptor Table. We use the data generating mechanism as the last step of courage as it allows us to analyze and view our fitted model’s data.
Characteristic | Beta | 95% CI^{1} |
---|---|---|
(Intercept) | 103,289 | 57,808, 151,719 |
age | 908 | -221, 1,936 |
^{1} CI = Credible Interval |
(Intercept)
corresponds to 103,062 which is \(\beta_0\). As always, R has, behind the scenes, estimated the entire posterior probability distribution for \(\beta_0\). We will graph that distribution in the next section. But the basic print method for these objects can’t show the entire distribution, so it gives us summary numbers: the median and the MAD SD. Speaking roughly, we would expect about 95% of the values in the posterior to be within two MAD SD’s of the median.
In summary: we model the income of individuals at a train station as a linear function of the age of each individual. We find that the average income for individuals at the train station approximates to 103,000 as well as, we can see that the older the age the more income seems to be.
5.3.4 Temperance
Courage gave us the fitted model. With Temperance the posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect.
Recall the question:
What would you expect the income to be for a random 40 year old?
Given that we are looking for an expected value, we use posterior_epred()
.
Code
newobs <- tibble(age = 40)
pe <- posterior_epred(fit_3, newdata = newobs) |>
as_tibble()
pe
# A tibble: 4,000 × 1
`1`
<dbl>
1 142106.
2 141530.
3 141363.
4 139289.
5 139647.
6 142876.
7 139427.
8 143336.
9 145881.
10 143052.
# ℹ 3,990 more rows
Plotting is the same as always.
Code
pe |>
ggplot(aes(x = `1`)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Expected Income",
subtitle = "A 40-years old commuter earns around $140,000",
x = "Income",
y = "Probability") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()
Now that we have plotted all of our models and viewed our data we want to talk about the usage of these models and how Temperance plays a strong factor in the usage of these models.
We should always be cautious about our inferences. Our assumptions are never true. We need to be cautious with our inferences when we deal with all of the assumptions that we have made within the chapter regarding validity, stability, representativeness, and model structure. When we take a look at the assumption of validity a possibility may not hold due could be the privacy of each station because the privacy of a train station differs from filling out surveys at home. When we take a look at the assumption of stability a possibility where it may not hold would be inflation and how that could impact income. When we take a look at the assumption of representativeness a possibility where it may not hold would be with correlation between our “sampling mechanism” and other variables to be true. When we take a look at the assumption of the model structure a possibility where it may not hold would be with the flaw of the fitted model where we conducted a posterior predict check that generated a distribution of values that showed the flaws within the model. Our stated inferences almost certainly underestimate the true uncertainty of the world.
We need to maintain humility when we are making our inferences and decisions. Stay cautious my friends.
5.4 liberal ~ income
So far in this chapter, we have only considered continuous outcome variables. age
, att_end
and income
all take on a variety of values. None of them are, truly, continuous, of course. age
is only reported as an integer value. att_end
can only, by definition, take on 13 distinct values. However, from a modeling perspective, what matters is that they have more than 2 possible values.
liberal
, however, only takes on two values: TRUE and FALSE. In order to model it, we must use the binomial
family. We begin, as always, with some questions:
Among all people who have an income of $100,000, what proportion are liberal?
Assume we have a group of eight people, two of whom make $100,000, two $200,000, two $300,000 and two $400,000. How many will be liberal?
We can answer these and similar questions by creating a model that uses party affiliation to predict age. Let’s follow the four Cardinal Virtues: Wisdom, Justice, Courage and Temperance.
5.4.1 Wisdom
Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of “validity,” as to whether or not we can (reasonably!) assume that the two come from the same population.
5.4.1.1 Preceptor Table
First, do we need a causal or predictive model? In our question above we have a predictive model. When we are deciding between a causal or predictive model we want to look at the question. Does it include key words such as: “cause,” “affect,” or “influence.” Often these key words indicate that we are dealing with a causal model were a comparison exists between two states of a variable.
Second, what is the outcome? A person’s attitude would be the outcome in this scenario. When we take a look at the questions we discover that there is a flaw that exists. It is vital to understand where and when that we are answering these questions because otherwise it could apply to the whole country, nation, or specific state. Therefore, by diving deeper within these questions we are able to accurately answer our question with the population that it refers to. For the questions that we have at hand we will be talking about all the adults in July 1, 2012 at the location of the United States which will allow us to answer the questions more accurately.
Third, what are the units? Our units for this scenario would be individuals because the questions are about the attributes of unique people at the station.
Fourth, do we have a treatment? No. When we deal with predictive models we don’t have any treatments. In this case, the treatment does not apply as we aren’t dealing with a covariate that we can manipulate and need to manipulate.
Let’s look at our refined question to create our Preceptor Table:
Among all people in the United States in 2012 who have an income of $100,000, what proportion are liberal?
Our Preceptor Table:
Preceptor Table | ||
ID | Outcome | Covariate |
---|---|---|
Liberal | Income | |
1 |
0 |
150000 |
2 |
0 |
50000 |
… |
… |
… |
10 |
1 |
65000 |
11 |
1 |
35000 |
… |
… |
… |
N |
1 |
78000 |
Note: the values that have a star next to them symbolize the possible values that may exist if they were “control” instead of “treated” or vice versa.
Recall: a Preceptor Table is the smallest table with rows and columns such that, if there is no missing data, all our questions are easy to answer. To answer questions — like “Among all people who have an income $100,000, what proportion are liberal? and Assume we have a group of eight people, two of whom make $100,000, two $200,000, two $300,000 and two $400,000. How many will be liberal?” — we need a row for every individual at the station.
5.4.1.2 Data Analysis with Exploratory Approach
Recall the discussion from Chapter 1. Enos (2014) randomly placed Spanish-speaking confederates on nine train platforms around Boston, Massachusetts. The data that we want to analyze consists of the age and party of each individual on these train platforms based on our variables in our Preceptor Table. These reactions were measured through changes in answers to three survey questions.
Code
trains |>
select(liberal, income)
# A tibble: 115 × 2
liberal income
<lgl> <dbl>
1 FALSE 135000
2 FALSE 105000
3 TRUE 135000
4 FALSE 300000
5 TRUE 135000
6 FALSE 87500
7 FALSE 87500
8 FALSE 135000
9 FALSE 105000
10 FALSE 135000
# ℹ 105 more rows
As specified in the Preceptor Table we only care about liberal
is whether they are liberal or not and income
is the income of the respondent. liberal
is shown to be TRUE
when the individual is liberal and FALSE
when an individual is not liberal, playing a role in shaping their attitude towards the immigrants.
liberal income
Mode :logical Min. : 23500
FALSE:64 1st Qu.: 87500
TRUE :51 Median :135000
Mean :141813
3rd Qu.:135000
Max. :300000
The range for income
seems reasonable.
Code
trains |>
ggplot(aes(x = income, y = liberal)) +
geom_point() +
labs(title = "Income Based On Liberal",
x = "Income in 1000s",
y = NULL,
fill = NULL) +
scale_x_continuous(labels = scales::dollar_format(scale = 1/1000)) +
scale_y_discrete(labels=c("TRUE" = "Liberal", "FALSE" = "Not Liberal"))
Our plot above shows a wide assortment of incomes for both liberals and non-liberals, however, there is no clear connection between liberal and income. As we can see on the higher end there are both individuals that are liberal and not liberal that are attaining this sort of income. On the lower end it is the same, as a result we can see that a strong connection between liberal and income clearly does not exist within our situation.
5.4.1.3 Validity
Now let’s take a look at the assumption of validity for our data and Preceptor Table. There is no truth here. The assumption of validity may not hold due to the possibility of how our data was collected. We asked each individual whether they were “liberal,” which means that we didn’t rely on the government database. The possibility of the data being different from the government database and the way that we asked each individual may differ across both the Preceptor Table and the data which can prove the assumption of validity to be false.
Another case where the assumption of validity may not hold would be from how our income data was collected similar to our age data collection. The data has been collected from filling out surveys about income without cross-checking with the government database. We don’t know whether our income is accurate enough when compared to the government’s database to the survey that each individual filled out.
Despite these concerns, we will assume that validity does hold between the relationship of liberal
and income
such so we can “stack” our data and the Preceptor Table on top of each other.
5.4.2 Justice
Let’s now consider Justice for the relationship between liberal
and income
. In Justice, we must consider the Population Table, stability, representativeness, unconfoundedness and the mathematical structure of the data generating mechanism (DGM).
5.4.2.1 Population Table
With the assumption of validity seeming reasonable, we can now create our Population Table. The Population Table consists of all the rows from the Preceptor Table, all the rows from our data, and all the rows from the broader population from which both are drawn.
Population Table | |||||
Source | Potential Outcomes | Covariates | |||
---|---|---|---|---|---|
Liberal Ending Attitude | Not Liberal Ending Attitude | Income | Year | City | |
… |
… |
… |
… |
… |
… |
Data |
3 |
6* |
150000 |
2012 |
Boston, MA |
Data |
4 |
9* |
50000 |
2012 |
Boston, MA |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
Preceptor Table |
… |
… |
… |
2023 |
Chicago, IL |
Preceptor Table |
… |
… |
… |
2023 |
Chicago, IL |
… |
… |
… |
… |
… |
… |
Note: the values that have a star next to them symbolize the possible values that may exist if they were “control” instead of “treated” or vice versa.
Our year within the Population table is an example of the moment in time.
Look at the validity of the liberal
variable. This column has the values “Liberal” and “Not Liberal” to convey the political ideology of the train commuters. We must determine if the column liberal
, and therefore the meaning of being liberal, is the same in Boston in 2012 as in the US in 2023. If we can determine these to be the same, then we can assume validity, and in this case, because the core beliefs of being liberal have not changed very much between 2012 and 2023, we can determine that the data is valid.
5.4.2.2 Stability
Let’s look at stability with the relationship between liberal
and income
and determine whether or not we believe that this relationship has changed between 2012 and 2023. With our knowledge of the world, do we have any reason to believe that this has changed? What if between 2012 and 2023, income increased for those who are liberal? Well, then our model for the the relationship between income
and liberal
would have changed over the years, and therefore so would the model. However, since with our knowledge of the world we have no reason to believe that something of the sorts happened to affect this relationship, we can consider our model to be stable.
5.4.2.3 Representativeness
Representativeness has to do with how well our sample represents the larger population we are interested in generalizing to. When we deal with representativeness we deal with two potential problems: Is our data representative of the population? Is our Preceptor Table representative of the population? In our past three situations, we have considered two possible issues that we could have with the representativeness of our model, such as the difference between Boston and other cities and bias in those who respond. We will now consider how in this survey there were surveys given before and after treatment and some people may have filled out only one of the surveys. People who only filled out one of the surveys could affect the representativeness of the data because they could not be included in the data and if those who only filled out one survey tended to be liberal, then this would affect our data because it would underestimate the amount of liberals in the survey. This is something we must consider when looking at representativeness, since we could otherwise not determine if this data from train commuters in Boston in 2012 is representative enough of train commuters in the US now to continue using our data. Our generic level of concern has been seen from the previous sections. When we take a look at our variable concern and whether an individual is liberal and an individual’s income, the way than an individual may define liberal can change across Chicago making the sample that we have not representative of the broader population.
5.4.3 Courage
We will use the mathematical structure that through the data generating mechanism to generate a fitted model with the help of Courage. Courage allows us to explore different models. We can avoid hypothesis tests as we check our models for consistency with the data we have.
5.4.3.1 Mathematics
Let’s consider whether this model is linear or logistic. Unlike our previous models this chapter, the outcome variable, liberal
, for this model, only has two options, “Liberal” or “Not Liberal”. Therefore, this will be logistic because there are only 2 possible outcomes and the outcome variable isn’t continuous.
Recall the discussion in Section 4.5 about the logistic regression model which we use whenever the outcome or dependent variable is binary/logical. The math is there, if you care about math. We don’t, at least not too much. Reminder:
\[p(\text{Liberal}_i = \text{TRUE}) = \frac{\text{exp}(\beta_0 + \beta_1 \text{income}_i)}{1 + \text{exp}(\beta_0 + \beta_1 \text{income}_i)}\]
This model only has two parameters, \(\beta_0\) and \(\beta_1\). But these parameters do not have simple interpretations, unlike the parameters in a linear (or gaussian) model.
Recall the fundamental structure of all data science problems:
\[\text{outcome} = \text{model} + \text{what is not in the model}\] The exact mathematics of the model — the parameters, their interpretations — are all just dross in the foundry of our inferences: unavoidable but not worth too much of our time.
Even if the math is ignorable, the causal versus predictive nature of the model is not. Is this a causal model or a predictive model? It depends! It could be causal if you assume that we can manipulate someone’s income, if, that is, there are at least two potential outcomes: person \(i\)’s liberal status if she makes X dollars and person \(i\)’s liberal status if she makes Y dollars. Remember: No causation without manipulation. The definition of a causal effect is the difference between two potential outcomes. If you only have one outcome, then your model can not be causal.
In many circumstances, we don’t really care if a model is causal or not. We might only want to forecast/predict/explain the outcome variable. In that case, whether we can interpret the influence of a variable as causal is irrelevant to our use of that variable.
5.4.3.2 Fitted Model
Fitting a logistic model is easy. We use all the same arguments as usual, but with family = binomial
added.
Code
fit_4 <- stan_glm(data = ch5,
formula = liberal ~ income,
family = binomial,
refresh = 0,
seed = 365)
Having fit the model, we can look at a printed summary. Note the use of the digits
argument to display more digits in the printout.
Code
print(fit_4, digits = 6)
stan_glm
family: binomial [logit]
formula: liberal ~ income
observations: 115
predictors: 2
------
Median MAD_SD
(Intercept) 0.562837 0.426033
income -0.000006 0.000003
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Fitted models tell us about the posterior distributions for the parameters in the formula which defines the model we have estimated. We are assuming that the model is true. And, as discussed in Chapter 3, that assumption is always false! Our model is never a perfectly accurate representation of reality. But, if it were perfect, then the posterior distributions which we have created for \(\beta_0\), \(\beta_1\), and so on would be perfect as well.
When working with a linear model, we will often interpret the meaning of the parameters, as we have already done in the first three sections of this chapter. Such interpretations are much harder with logistic models because the math is much less convenient. So, we won’t even bother to try to understand the meaning of these parameters. However, we can note that \(\beta_1\) is negative, suggesting that people with higher incomes are less likely to be liberal.
5.4.3.3 Model Checks
With the fitted model that we have created we are able to perform model checks. Model checks help us understand how accurate our model is to ensure that the fitted model that we have created is reasonably accurate when compared to the actual data. We can view our model through the posterior predictive check that simulates the data upon our fitted model to generate a distribution. With the posterior predictive check we are able to visualize how accurate our data is compared to the actual data ensuring that we have created a great fitted model.
Our graph in the darker blue represents our actual data. As we can see with the lighter blue graph, our fitted model is able to generate a distribution for liberal that is similar to the actual data distribution as shown above. For the most part the model is really accurate as we are able to see the predictions that seem to be highly accurate.
5.4.3.4 Data Generating Mechanism (DGM)
Recall the Preceptor Table and the units from the beginning. We are concerned with three main ideas from the Preceptor Table: our units (individuals with unique attributes at the train station), our outcome (individual’s attitude toward immigration) covariates (the treatment that each individual will receive). Now that we have run the necessary model checks on our fitted model to ensure that we have the best suited model, we can create a table to view our data. With the accurate fitted model and low residuals proving to be accurate we can use the data generating mechanism to finish the last step in the Courage process.
Characteristic | log(OR)^{1} | 95% CI^{1} |
---|---|---|
(Intercept) | 0.56 | -0.26, 1.4 |
income | 0.00 | 0.00, 0.00 |
^{1} OR = Odds Ratio, CI = Credible Interval |
(Intercept)
corresponds to 0.56 which is \(\beta_0\). As always, R has, behind the scenes, estimated the entire posterior probability distribution for \(\beta_0\). We will graph that distribution in the next section. But the basic print method for these objects can’t show the entire distribution, so it gives us summary numbers: the median and the MAD SD. Speaking roughly, we would expect about 95% of the values in the posterior to be within two MAD SD’s of the median.
As we can see with the able above are using log(OR) which means that the way that we interpret the data given to us plays a huge role in the way that we can predict future data. When we are dealing with income for example we can either use the log of income or deal with income straight which play different roles in changing the way that we feed and understand the data. With the different ways that we are inputting the data, the model changes because of the variables that are changing within the model.
In summary: we model the income of individuals at a train station as a linear function of the treatment each individual receives. We find that the individuals that receive any Spanish speaking treatment are about half a standard deviation more in attitude than the attitude for individuals that did not receive Spanish speaking treatment meaning that the individuals that received Spanish speaking treatment ended up being less liberal.
5.4.4 Temperance
Courage gave us the fitted model. With Temperance the decisions made with flawed posteriors are almost always better than decisions made without them.
Among all people who have an income of $100,000, what proportion are liberal?
Although our model is now logistic, all the steps in answering a question like this are the same as with a linear/guassian model.
Code
newobs <- tibble(income = 100000)
pe <- posterior_epred(fit_4,
newdata = newobs) |>
as_tibble()
pe
is a tibble with a single vector. That vector is 4,000 draws from the posterior distribution of proportion of people, among those who make $100,000, who are liberal. The population proportion is the same thing as the probability for any single individual.
Code
pe |>
ggplot(aes(x = `1`)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Proportion Liberal Among $100,000 Earners",
subtitle = "The population proportion is the same as the probability for any individual",
x = "Income",
y = "Probability of Being Liberal") +
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_classic()
Assume we have a group of eight people, two of whom make $100,000, two $200,000, two $300,000 and two $400,000. How many will be liberal?
Because we are trying to predict the outcome for a small number of units, we use posterior_predict()
. The more complex the questions we ask, the more care we need to devote to making the newobs
tibble. We use the same rowwise()
and c_across()
tricks as earlier in the chapter.
Code
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `total = sum(c_across())`.
ℹ In row 1.
Caused by warning:
! Using `c_across()` without supplying `cols` was deprecated in dplyr 1.1.0.
ℹ Please supply `cols` instead.
Code
pp
# A tibble: 4,000 × 9
# Rowwise:
`1` `2` `3` `4` `5` `6` `7` `8` total
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 1 0 1 0 0 5
2 0 0 1 0 0 0 0 0 1
3 0 0 1 1 0 0 0 0 2
4 1 1 1 0 0 0 1 0 4
5 1 1 0 0 0 0 0 0 2
6 1 0 0 1 0 1 1 0 4
7 0 1 1 0 0 0 1 0 3
8 1 0 1 0 0 0 0 0 2
9 0 1 1 0 0 0 0 0 2
10 0 1 0 1 0 1 0 0 3
# ℹ 3,990 more rows
Study the pp
tibble. Understand its component parts. The first column, for example, is 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $100,000. Note how all the draws are zeroes or ones. That is very different from the draws we have seen before! But it also makes sense. We are making a prediction about a binary variable, a variable which only have two possible values: zero or one. So, any (reasonable!) predictions will only be zero or one.
The second column is the same thing as the first column. Both are 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $100,000. Yet they also have different values. They are both the same thing and different things, in the same way that rnorm(10)
and rnorm(10)
are the same thing — both are 10 draws from the standard normal distribution — and different things in that the values vary.
The third and fourth columns are different from the first two columns. They are both 4,000 draws from the posterior distribution for the liberal status of a random person with an income of $200,000. And so on for later columns. We can answer very difficult questions by putting together simple building blocks, each of them a set of draws from a posterior distribution. Recall the discussion in Section 2.1.
The total
column is simply the sum of the first eight columns. Having created the building blocks with 8 columns of draws from four different posterior distributions, we can switch our focus to each row. Consider row 2. It has a vector of 8 numbers: 1 1 1 0 0 1 0 0
. We can treat that vector as a unit of analysis. This is what might happen with our 8 people. The first three might be liberal, the fourth not liberal and so on. This row is just one example of what might happen, one draw from the posterior distribution of possible outcomes for groups of eight people with these incomes.
We can simplify this draw by taking the sum, or doing anything else which might answer the question with which we are confronted. Posterior distributions are as flexible as individual numbers. We can, more or less, just use algebra to work with them.
Graphically we have:
Code
pp |>
ggplot(aes(x = total)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 100) +
labs(title = "Posterior for Number of Liberals in Group with Varied Incomes",
subtitle = "Two is the most likely number, but values from 0 to 5 are plausible",
x = "Number of Liberals",
y = "Probability") +
scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_classic()
As always, there is some truth. If, tomorrow, we were to meet 8 new people, with the specified incomes, a certain number of them would be liberal. If we had the ideal Preceptor Table, we could just look up that number. No data science required. Alas, we don’t know the truth. The bets we can do is to create a posterior distribution for that unknown value, as we have done here. We then need to translate that posterior into English — “The most likely number of liberals is 2 or 3, but a total as low as zero or as high as 5 is also plausible. Having 6 liberals would be really surprising. Having 7 or 8 is almost impossible.”
Are these two posterior probability distributions perfect? No! This is the central message of the virtue of Temperance. We must demonstrate our humility when we use our models. Recall the distinction between the unknown true distribution and the estimated distribution. The first is the posterior distribution we would create of we understood every detail of the process and could accurately model it. We would still not know the true unknown number, but our posterior distribution for that number would be perfect. Yet, our model is never perfect. We are making all sorts of assumptions behind the scenes. Some of those assumptions are plausible. Others are less so. Either way, the estimated distribution is what we have graphed above. When we take a look at the assumption of validity a possibility may not hold due would be from how our income data was collected similar to our age data collection. When we take a look at the assumption of stability a possibility where it may not hold would be inflation and how that could impact income. When we take a look at the assumption of representativeness a possibility where it may not hold would be with correlation between our “sampling mechanism” and other variables to be true. When we take a look at the assumption of the model structure a possibility where it may not hold would be with the flaw of the fitted model where we conducted a posterior predict check that generated a distribution of values that showed the flaws within the model.
The central lesson of Temperance is: Don’t confuse the estimated posterior (which is what you have) with the true posterior (which is what you want). Recognize the unavoidable imperfections in the process. You can still use your estimated posterior — what choice do you have? — but be cautious and humble in doing so.
The more that you suspect that your estimated posterior differs from the true posterior, the more humble and cautious you should be. Stay cautious my friends.
5.5 Summary
Throughout this chapter, we explored relationships between different variables in the trains
data set. We built two predictive models and two causal models.
Similar to previous chapters, our first task is to always use Wisdom. We want to judge how relevant our data is based on the questions we ask. Is it reasonable to consider the data we have (e.g., income and age data from Boston commuters in 2012) are being drawn from the same population as the data we want to have (e.g., income and age data from today for the entire US)? Probably? Recall that these questions will bring up the assumption of validity and whether our data will be able to “stack” up with each other.
Our next step is to take a look at Justice which helps us decide the best way to represent the models that we will make. A little math won’t kill you. Translating that math into code will be through the help of Courage. Our primary goal is to generate posterior distributions for the parameters and understand/interpret their meaning. Finally, we end off with Temperance that uses our models to answer the questions that we have asked above. Remember it is important to hone in on our questions to the most that we can because the most detailed questions will allow for the more accurate answers to prevail.
Key Lessons and Commands That Were Talked About:
Create a model using
stan_glm()
.Use
posterior_epred()
to estimate expected values. The e in epred stands for expected.Use
posterior_predict()
to make forecasts for individuals. The variable in predictions is always greater than the variability in expectations because predictions can’t pretend that \(\epsilon_i\) is zero.Once we have draws from a posterior distribution for our outcome variable — whether that be an expectation or a prediction — we can manipulate those draws to answer our question.
Always Remember the Following:
Always explore your data.
Predictive models care little about causality.
Predictive models and causal models use the same math and the same code.
“When comparing” is a great phrase to start the summary of any non-causal model.
Don’t confuse the estimated posterior (which is what you have) with the true posterior (which is what you want). Be cautious in your use of the posterior.