# 4 Rubin Causal Model

Have you ever wondered what the world would be like without you?

George Bailey, in the movie “It’s a Wonderful Life,” believes that his life has served no purpose. The movie follows George as he explores a world in which he was never born. It is clear that he had a profound impact on the lives of many people in his community. His actions mattered, more than he ever realized.

By showing what the world would have been like without George, we get an idea of the *causal effect* of his life on his town and the people who live there. This chapter covers causal effects using the framework of *potential outcomes* and the *Rubin Causal Model* (RCM).

## 4.1 Preceptor Tables

A **Preceptor Table** is a table with rows and columns for all the data that we have and all the data we would (reasonably) like to have, such that, if none of the data is missing, the thing we want to know is trivial to calculate. Preceptor Tables vary in the number of their rows and columns, as well as in the amount of missing data. We use question marks to indicate missing data in a Preceptor Table.

Assume that there are five adult brothers and you are given four of their heights. What is the average height of all five brothers? Because one height is missing, we cannot calculate the answer directly, but we can use statistics to estimate it. Consider a Preceptor Table for this problem:

ID | Outcome: Heights (cm) |
---|---|
Robert | 178 |
Andy | ? |
Beau | 172 |
Ishan | 173 |
Nicolas | 165 |

In this case, we have a row for each brother and a column for their height. We will always have an ID column in Preceptor Tables so that we can identify different units. It is always furthest to the left. In addition to the ID column, we call the column with the brothers’ heights an *outcome* column.

To estimate average height, we need to estimate Andy’s height. One guess is just the average of the other four brothers: \[\frac{(178 + 165 + 172 + 173)}{4} = 172\] Is that realistic? We should consider things like why we know the four other brothers’ heights. Were they sampled randomly? Or do we know their heights because they are the tallest in the family? In that case would it make sense to use an average to estimate Andy’s height? Probably not!
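As a minimal sketch of that guess in Python, we can fill in Andy’s height with the mean of the four observed heights and then average all five rows. The dictionary name `heights` and the use of `None` for the question mark are illustrative choices, not part of the original problem, and the result is only as good as the assumption that Andy resembles his brothers.

```python
from statistics import mean

# Heights (cm) from the Preceptor Table; None marks the missing value.
heights = {"Robert": 178, "Andy": None, "Beau": 172, "Ishan": 173, "Nicolas": 165}

# Our guess for Andy: the average of the four known heights.
observed = [h for h in heights.values() if h is not None]
andy_guess = mean(observed)

# Fill in the question mark with the guess, then average all five rows.
filled = {name: (h if h is not None else andy_guess) for name, h in heights.items()}
estimated_average = mean(filled.values())

print(andy_guess, estimated_average)
```

Because the guess for Andy is itself the mean of the other four, the estimated five-brother average is the same number. A different assumption about why Andy is missing would give a different answer.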

Keep in mind that there is a *truth* out there, a state of the world independent of our knowledge of it. Andy is a specific height. If we had a complete Preceptor Table, with no missing values, we could calculate the average height of the brothers exactly. No fancy statistics would be needed. Just arithmetic.

Consider a more complex problem. We have the heights of 100 Harvard students, and from that we want to know the average height of students in the school.

ID | Outcome: Heights (cm) |
---|---|
Student 1 | ? |
Student 2 | ? |
... | ... |
Student 473 | 172 |
Student 474 | ? |
... | ... |
Student 3,258 | ? |
Student 3,259 | 162 |
... | ... |
Student 6,700 | ? |

Again, are these 100 students randomly sampled? Could we estimate the 90th percentile of height in the student population? These questions are more complicated, and we might be less confident in our best guess. Now let’s say we are given some characteristics other than height for the 100 sampled students, i.e., sex and age.

ID | Outcome: Heights (cm) | Covariate: Age | Covariate: Sex |
---|---|---|---|
Student 1 | ? | ? | ? |
Student 2 | ? | ? | ? |
... | ... | ... | ... |
Student 473 | 172 | 19 | M |
Student 474 | ? | ? | ? |
... | ... | ... | ... |
Student 3,258 | ? | ? | ? |
Student 3,259 | 162 | 20 | F |
... | ... | ... | ... |
Student 6,700 | ? | ? | ? |

You’ll notice our Preceptor Table has a new type of column: **covariates**. Are we better able to forecast a student’s height if we are given their age and sex?

So far, we have only asked predictive questions. This chapter, however, primarily focuses on causal inference.

## 4.2 Causal effect

The Rubin Causal Model (RCM) is based on the idea of **potential outcomes.** For example, Enos (2014) measured attitudes toward immigration among Boston commuters. Individuals were exposed to one of two possible conditions, and then their attitudes towards immigrants were recorded. One condition was being on a train platform near individuals speaking Spanish. The other was being on a train platform without Spanish-speakers. To calculate the causal effect of having Spanish-speakers nearby, we need to compare the outcome for an individual in one possible state of the world (with Spanish-speakers) to another (without Spanish-speakers). It is impossible to observe both potential outcomes at once. One of the potential outcomes is always missing. This dilemma is the *Fundamental Problem of Causal Inference*.

In most circumstances, we are interested in comparing two experimental manipulations, one generally termed “treatment” and the other “control.” The difference between the potential outcome under treatment and the potential outcome under control is a “causal effect” or a “treatment effect.” The scenario that didn’t actually happen, and thus that we didn’t observe, is a “counterfactual.” According to the RCM, the **causal effect** of being on the platform with Spanish-speakers is the *difference* between what your attitude would have been under “treatment” (with Spanish-speakers) and under “control” (no Spanish-speakers).

The commuter survey consisted of three questions, each measuring agreement on a 1 to 5 integer scale, with 1 being liberal and 5 being conservative. For each person, the three answers were summed, generating an overall measure of attitude toward immigration which ranged from 3 to 15. If your attitude towards immigrants would have been a 13 with Spanish-speakers and a 9 without Spanish-speakers, then the causal effect of being on a platform with Spanish-speakers is a 4-point increase in your score.

We will use the symbol \(Y\) to represent potential outcomes, the variable we are interested in understanding and modeling. \(Y\) is called the *response* or *outcome* variable. It is the variable we want to “explain.” In our case this would be the attitude score. If we are trying to understand a causal effect, we need two symbols so that control and treated values can be represented separately. We will use the symbols \(Y_t\) and \(Y_c\).

### 4.2.1 Potential outcomes

Suppose that Yao is one of the commuters surveyed in this experiment. If we were omniscient, we would know the outcomes for Yao under both treatment (with Spanish-speakers) and control (no Spanish-speakers). We can show this using an ideal Preceptor Table. This Preceptor Table is considered *ideal* because we are not missing any data, and so calculating the number we are interested in is trivial.

ID | Outcome: Attitude if Treated | Outcome: Attitude if Control |
---|---|---|
Yao | 13 | 9 |

From this table we only know the causal effect on Yao. Everyone else in the study might have a lower attitude score (more liberal) if treated. Regardless of what the causal effect is for other subjects, the causal effect for Yao of being on the train platform with Spanish-speakers is a shift towards a more conservative attitude.

Using the response variable — the actual symbol rather than a written description — makes for a more concise Preceptor Table.

ID | Outcome: $$Y_t$$ | Outcome: $$Y_c$$ |
---|---|---|
Yao | 13 | 9 |

Recall that the “causal effect” is the difference between Yao’s potential outcomes under treatment and under control.

ID | Outcome: $$Y_t$$ | Outcome: $$Y_c$$ | Estimand: $$Y_t - Y_c$$ |
---|---|---|---|
Yao | 13 | 9 | +4 |

Remember that in an *actual* Preceptor Table we will have a bunch of missing data! We cannot use simple arithmetic to calculate the causal effect on Yao’s attitude toward immigration. Instead, we will be required to estimate it. An **estimand** is some variable in the real world that we are trying to measure. In this case, it is \(Y_{t}-Y_{c}\), not \(+4\). An estimand is not the *value* you calculated, but is rather the *unknown variable* you want to estimate.

### 4.2.2 Multiple units

Generally a study has many individuals (or, more broadly, “units”) who each have their own potential outcomes. More notation is needed to allow us to differentiate between different units.

In other words, there needs to be a distinction between \(Y_t\) for Yao, and \(Y_t\) for Emma. We use the variable \(u\) (\(u\) for “unit”) to indicate that the outcome under control and the outcome under treatment can differ for each individual unit (person).

Instead of \(Y_t\), we will use \(Y_t(u)\) to represent “Attitude if Treated.” If you want to talk about only Emma, you could say “Emma’s Attitude if Treated” or “\(Y_t(u = Emma)\)” or “the \(Y_t(u)\) for Emma,” but not just \(Y_t\). That notation is too ambiguous when there is more than one subject.

Let’s look at an ideal Preceptor Table with more subjects using our new notation:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | 9 | +4 |
Emma | 11 | 11 | 0 |
Cassidy | 11 | 10 | +1 |
Tahmid | 9 | 12 | -3 |
Diego | 6 | 4 | +2 |

From this ideal Preceptor Table, there are many possible estimands we might be interested in. Consider some examples, along with their true values:

- A potential outcome for one person, e.g., Yao’s potential outcome under treatment: \(13\).

- A causal effect for one person, such as for Emma. This is the difference between the potential outcomes: \(11 - 11 = 0\).
- The most positive causal effect: \(+4\), for Yao.
- The most negative causal effect: \(-3\), for Tahmid.
- The median causal effect: \(+1\).
- The median percentage change: \(+10.0\%\). To see this, calculate the percentage change for each person relative to their outcome under control. You’ll get 5 percentages: \(+44.4\%\), \(0.0\%\), \(+10.0\%\), \(-25.0\%\), and \(+50.0\%\).

And so on. There are a lot of things one might care about!

All of the variables we calculated above are examples of estimands we might be interested in. One estimand is important enough that it has its own name: the **average treatment effect**, often abbreviated as **ATE**. The average treatment effect is the mean of all the individual causal effects. Here, the mean is \(+0.8\).
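With an ideal Preceptor Table, each of these estimands really is just arithmetic. The following Python sketch recomputes them from the five rows above; the variable names are our own, and the percentage change is assumed to be measured relative to the outcome under control.

```python
from statistics import mean, median

# Ideal Preceptor Table: (Y_t, Y_c) for each unit, copied from the table above.
ideal = {
    "Yao": (13, 9),
    "Emma": (11, 11),
    "Cassidy": (11, 10),
    "Tahmid": (9, 12),
    "Diego": (6, 4),
}

# Individual causal effects: Y_t(u) - Y_c(u).
effects = {u: yt - yc for u, (yt, yc) in ideal.items()}

# Percentage change relative to the control outcome (an assumed convention).
pct_change = {u: 100 * (yt - yc) / yc for u, (yt, yc) in ideal.items()}

most_positive = max(effects, key=effects.get)
most_negative = min(effects, key=effects.get)
median_effect = median(effects.values())
median_pct = median(pct_change.values())
ate = mean(effects.values())  # average treatment effect

print(effects)
print(most_positive, most_negative, median_effect, round(median_pct, 1), ate)
```

None of this would be possible with an actual Preceptor Table, because at least one of the two potential outcomes is missing in every row.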

Remember what an actual Preceptor Table, riddled as it always is with question marks, looks like:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | ? | ? |
Emma | 11 | ? | ? |
Cassidy | ? | 10 | ? |
Tahmid | ? | 12 | ? |
Diego | 6 | ? | ? |

Calculating values from this table is no longer a simple math problem.

## 4.3 Simple models

How can we fill in the question marks? Because of the Fundamental Problem of Causal Inference, we can never *know* the missing values. Because we can never know the missing values, we must make assumptions. “Assumption” just means that we need a “model,” and all models have parameters.

### 4.3.1 A single value for tau

One model might be that the causal effect is the same for everyone. There is a single parameter, \(\tau\), which we then estimate. (\(\tau\) is a Greek letter, written as “tau” and rhyming with “cow.”) Once we have an estimate, we can fill in the Preceptor Table because, knowing it, we can estimate what the unobserved potential outcome is for each person. We use our assumption about \(\tau\) to estimate the counterfactual outcome for each unit.

Remember what our Preceptor Table looks like with all of the missing data:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | ? | ? |
Emma | 11 | ? | ? |
Cassidy | ? | 10 | ? |
Tahmid | ? | 12 | ? |
Diego | 6 | ? | ? |

If we assume \(\tau\) is the treatment effect for everyone, how do we fill in the table? We are using \(\tau\) as an estimate for the causal effect. By definition: \(Y_t(u) - Y_c(u) = \tau\). Using simple algebra, it is then clear that \(Y_t(u) = Y_c(u) + \tau\) and \(Y_c(u) = Y_t(u) - \tau\). In other words, you could add it to the observed value of every observation in the control group (or subtract it from the observed value of every observation in the treatment group), and thus fill in all the missing values.

Assuming there is a constant treatment effect, \(\tau\), for everyone, filling in the missing values would look like this:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - \tau$$ | $$\tau$$ |
Emma | 11 | $$11 - \tau$$ | $$\tau$$ |
Cassidy | $$10 + \tau$$ | 10 | $$\tau$$ |
Tahmid | $$12 + \tau$$ | 12 | $$\tau$$ |
Diego | 6 | $$6 - \tau$$ | $$\tau$$ |

Now we need to find an estimate for \(\tau\) in order to fill in the missing values. One approach is to subtract the average of the observed control values from the average of the observed treated values: \[\frac{13 + 11 + 6}{3} - \frac{10 + 12}{2} = 10 - 11 = -1\]

This gives us an estimate of \(-1\) for \(\tau\). Let’s fill in our missing values by adding \(\tau\) to the observed values under control and by subtracting \(\tau\) from the observed value under treatment like so:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - (-1)$$ | -1 |
Emma | 11 | $$11 - (-1)$$ | -1 |
Cassidy | $$10 + (-1)$$ | 10 | -1 |
Tahmid | $$12 + (-1)$$ | 12 | -1 |
Diego | 6 | $$6 - (-1)$$ | -1 |

Which gives us:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | 14 | -1 |
Emma | 11 | 12 | -1 |
Cassidy | 9 | 10 | -1 |
Tahmid | 11 | 12 | -1 |
Diego | 6 | 7 | -1 |
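
The whole procedure, estimating \(\tau\) from the observed group means and then filling in each missing potential outcome, can be sketched in Python. This is an illustrative sketch under the constant-effect assumption; the names `actual`, `tau_hat`, and `fill_in` are our own, and `None` stands in for a question mark.

```python
from statistics import mean

# Actual Preceptor Table: (Y_t, Y_c), with None for each unobserved potential outcome.
actual = {
    "Yao": (13, None),
    "Emma": (11, None),
    "Cassidy": (None, 10),
    "Tahmid": (None, 12),
    "Diego": (6, None),
}

treated = [yt for yt, yc in actual.values() if yt is not None]
control = [yc for yt, yc in actual.values() if yc is not None]

# Estimate tau as the difference between the observed group means.
tau_hat = mean(treated) - mean(control)

def fill_in(yt, yc, tau):
    """Return (Y_t, Y_c) with the missing value imputed under a constant effect tau."""
    if yc is None:
        return yt, yt - tau  # Y_c(u) = Y_t(u) - tau
    return yc + tau, yc      # Y_t(u) = Y_c(u) + tau

filled = {u: fill_in(yt, yc, tau_hat) for u, (yt, yc) in actual.items()}

print(tau_hat)
print(filled)
```

Every filled-in row has a difference of exactly `tau_hat`, which is a consequence of the assumption rather than something the data told us.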

### 4.3.2 Two values for tau

A second model might assume that the causal effect is different between levels of a category but the same within those levels. For example, perhaps there is a \(\tau_F\) for females and \(\tau_M\) for males where \(\tau_F \neq \tau_M\). We are making this assumption to give us a different model with which we can fill in the missing values in our Preceptor Table. The key concept is that we can’t make any progress unless we make some assumptions. That is an inescapable result of the Fundamental Problem of Causal Inference.

Consider a model in which causal effects differ based on sex:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - \tau_M$$ | $$\tau_M$$ |
Emma | 11 | $$11 - \tau_F$$ | $$\tau_F$$ |
Cassidy | $$10 + \tau_F$$ | 10 | $$\tau_F$$ |
Tahmid | $$12 + \tau_M$$ | 12 | $$\tau_M$$ |
Diego | 6 | $$6 - \tau_M$$ | $$\tau_M$$ |

We would have two different estimates for \(\tau\).

\(\tau_M\) would be \[\frac{13 + 6}{2} - 12 = -2.5\] and \(\tau_F\) would be \[11 - 10 = +1\]

Using those values, we would fill out our new table like this:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - (-2.5)$$ | -2.5 |
Emma | 11 | $$11 - (+1)$$ | +1 |
Cassidy | $$10 + (+1)$$ | 10 | +1 |
Tahmid | $$12 + (-2.5)$$ | 12 | -2.5 |
Diego | 6 | $$6 - (-2.5)$$ | -2.5 |

Which gives us:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | 15.5 | -2.5 |
Emma | 11 | 10 | +1 |
Cassidy | 11 | 10 | +1 |
Tahmid | 9.5 | 12 | -2.5 |
Diego | 6 | 8.5 | -2.5 |

We now have two different estimates for Emma (and for everyone else in the table). When we estimate \(Y_c(Emma)\) using an assumption of constant treatment effect (a single value for \(\tau\)), we get 12. When we estimate assuming treatment effect is constant for each sex, we calculate that \(Y_c(Emma) = 10\). This difference between our estimates for Emma highlights the difficulties of inference. Models drive inference. Different models will produce different inferences.
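The sex-specific version of the calculation can be sketched the same way. In this hedged Python example the sex labels are taken from the \(\tau_M\) and \(\tau_F\) subscripts in the tables above; the names `tau_for` and `filled` are our own, and `None` again stands in for a question mark.

```python
from statistics import mean

# (sex, Y_t, Y_c) per unit; None marks an unobserved potential outcome.
actual = {
    "Yao": ("M", 13, None),
    "Emma": ("F", 11, None),
    "Cassidy": ("F", None, 10),
    "Tahmid": ("M", None, 12),
    "Diego": ("M", 6, None),
}

def tau_for(sex):
    """Difference in observed means among units of one sex."""
    treated = [yt for s, yt, yc in actual.values() if s == sex and yt is not None]
    control = [yc for s, yt, yc in actual.values() if s == sex and yc is not None]
    return mean(treated) - mean(control)

tau = {sex: tau_for(sex) for sex in ("M", "F")}

# Impute each missing potential outcome with the tau for that unit's sex.
filled = {}
for unit, (sex, yt, yc) in actual.items():
    if yc is None:
        filled[unit] = (yt, yt - tau[sex])
    else:
        filled[unit] = (yc + tau[sex], yc)

print(tau)
print(filled)
```

Swapping one modeling assumption for another changes every imputed value while leaving the observed data untouched, which is the point of the comparison for Emma.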

### 4.3.3 Heterogeneous treatment effects

Is the assumption of a constant treatment effect, \(\tau\), usually true? No! It is never true. People vary. The effect of a pill on you will always be different from the effect of a pill on your friend, at least if we measure outcomes accurately enough. Treatment effects are always *heterogeneous*, meaning that they vary across individuals.

Reality looks like this:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - \tau_{yao}$$ | $$\tau_{yao}$$ |
Emma | 11 | $$11 - \tau_{emma}$$ | $$\tau_{emma}$$ |
Cassidy | $$10 + \tau_{cassidy}$$ | 10 | $$\tau_{cassidy}$$ |
Tahmid | $$12 + \tau_{tahmid}$$ | 12 | $$\tau_{tahmid}$$ |
Diego | 6 | $$6 - \tau_{diego}$$ | $$\tau_{diego}$$ |

Can we solve for \(\tau_{yao}\)? No! That is the Fundamental Problem of Causal Inference. So how can we make any progress from here if we are unwilling to assume there is at least some structure to the causal effect across different individuals? Instead of worrying about the causal effect for specific individuals, we might, instead, focus on the causal effect for the entire population.

### 4.3.4 Average treatment effect

The average treatment effect is the **average**, across all units, of the difference between the *potential* outcome under treatment and the *potential* outcome under control. Because averaging is a linear operator, the average difference is the same as the difference between the averages. The distinction between this estimand and estimands like \(\tau\), \(\tau_M\) and \(\tau_F\), is that, in this case, we do not care about using the average treatment effect to fill in missing values in each row. The average treatment effect is useful because we don’t have to assume anything about each individual’s \(\tau\), like \(\tau_{yao}\), but can still understand something about the average causal effect across the whole population.
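As a check on the linearity claim, we can work it through with the ideal Preceptor Table from the section on multiple units, where both potential outcomes are known for all five people. Averaging the individual differences gives \[\frac{1}{5}\sum_{u}\bigl(Y_t(u) - Y_c(u)\bigr) = \frac{4 + 0 + 1 - 3 + 2}{5} = 0.8\] while differencing the two column averages gives \[\frac{1}{5}\sum_{u} Y_t(u) - \frac{1}{5}\sum_{u} Y_c(u) = \frac{13 + 11 + 11 + 9 + 6}{5} - \frac{9 + 11 + 10 + 12 + 4}{5} = 10 - 9.2 = 0.8\] The two routes agree, as they must. Note that this equality concerns the full columns of potential outcomes, not the observed subsets.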

As we did before, the simplest way to estimate the ATE is to take the mean of the treated group (\(10\)) and the mean of the control group (\(11\)) and then take the difference in those means (\(-1\)). We’ll call this estimate of the average treatment effect, \(\widehat{ATE}\), pronounced “ATE-hat.”

If we already did this exact same calculation above, why are we talking about it again? Remember that we are unwilling to assume treatment effect is constant in our study population, and we cannot solve for \(\tau\) if \(\tau\) is different for different individuals. This is where \(\widehat{ATE}\) is helpful.

*Some* estimands may not require filling in all the question marks in the Preceptor Table. We can get a good estimate of the *average* treatment effect without filling in every question mark — the average treatment effect is just a single number. Rarely in a study do we care about what happens to individuals. In our case, we don’t care about what specifically would happen to Cassidy’s attitude if treated. Instead, we care generally about how our experiment impacts people’s attitudes towards immigrants. This is why an average estimate, like \(\widehat{ATE}\) can be helpful.

As we noted before, this is a popular estimand. Why?

1. There’s an obvious *estimator* for this estimand: the difference in *observed* outcomes between the treated group and the control group, that is, the mean of the observed \(Y_t(u)\) values minus the mean of the observed \(Y_c(u)\) values.

2. If treatment is *randomly assigned*, the estimator is *unbiased*: you can be fairly confident in the estimate if you have a large enough treatment and control group.

3. As we did earlier, if you are willing to assume that the causal effect is the same for everyone (a big assumption!), you can use your estimate of the ATE, \(\widehat{ATE}\), to fill in the missing individual values in your Preceptor Table.

Just because the ATE is often a useful estimand doesn’t mean that it *always* is.

Consider point #3. For example, let’s say the treatment effect does vary depending on sex. For males there is a strong negative effect (-2.5), but for females there is a smaller positive effect (+1). However, the average treatment effect for the whole sample, even if you estimate it correctly, will be a single negative number (-1), since the negative effect for males is larger than the positive effect for females.

Estimating the average treatment effect, by calculating \(\widehat{ATE}\), is easy. But is our \(\widehat{ATE}\) a good estimate of the actual ATE? After all, if we knew all the missing values in the Preceptor Table, we could calculate the ATE perfectly. But those missing values may be wildly different from the observed values. Consider this unobservable ideal Preceptor Table:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | 11 | +2 |
Emma | 11 | 9 | +2 |
Cassidy | 12 | 10 | +2 |
Tahmid | 14 | 12 | +2 |
Diego | 6 | 4 | +2 |
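
To see how far the difference in observed means can sit from the truth, a short Python sketch can compute both quantities for this table. The group labels record which potential outcome was observed for each person in the earlier tables; the variable names are our own.

```python
from statistics import mean

# Unobservable ideal table: (Y_t, Y_c, group actually assigned) per unit.
ideal = {
    "Yao": (13, 11, "treated"),
    "Emma": (11, 9, "treated"),
    "Cassidy": (12, 10, "control"),
    "Tahmid": (14, 12, "control"),
    "Diego": (6, 4, "treated"),
}

# The true ATE uses both potential outcomes for every unit (never available in practice).
true_ate = mean(yt - yc for yt, yc, _ in ideal.values())

# The estimate uses only what was observed: Y_t for treated units, Y_c for control units.
observed_treated = [yt for yt, _, g in ideal.values() if g == "treated"]
observed_control = [yc for _, yc, g in ideal.values() if g == "control"]
ate_hat = mean(observed_treated) - mean(observed_control)

print(true_ate, ate_hat)
```

With only five units, which people happen to land in each group matters a great deal, so an estimate with the wrong sign is entirely possible.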

In this example, there is indeed a constant treatment effect for everyone: \(+2\). Note that the *observed* values are all the same as before, but the unobserved values are such that our estimated ATE, \(-1\), is pretty far from the actual ATE, \(+2\). If we think we have a reasonable estimate of ATE, using that value as a constant for \(\tau\) might be our *best guess*.

## 4.4 Complications

### 4.4.1 Data problems

Assume that before you run the train experiment, you want to know the average attitude towards immigrants of *all* United States adults. At first, this seems like an easy problem — there’s nothing causal here! If you knew the true values, you could build a data set like this:

ID | Outcome: Attitude |
---|---|
Person 1 | 13 |
Person 2 | 11 |
Person 3 | 9 |
... | ... |
Person N | 10 |

Then, your answer is simply the average of all the values. But do we have that table? No! What we actually have is this:

ID | Outcome: Attitude |
---|---|
Person 1 | ? |
Person 2 | ? |
Person 3 | ? |
... | ... |
Person N | ? |

In reality, we don’t know the attitude towards immigrants of any United States adults. That is, we have a lot of *missing data*.

But maybe we could survey 1,000 people on their attitudes towards immigrants, and get a table that looks like this:

ID | Outcome: Attitude |
---|---|
Respondent 1 | 13 |
Respondent 2 | 9 |
Respondent 3 | 11 |
... | ... |
Respondent 1,000 | 10 |

By surveying 1,000 people on their attitudes towards immigrants we now have some values to work with. This, however, does not solve the missing data problem. We are likely interested in attitudes toward immigration *for the entire population*, not just for our 1,000 person sample. We’ll need to think about whether our *sample* is representative of the full population. For the vast majority of US adults we still have no value. This is the second of the two most common sources of missing data:

- For the units in our sample, we only see *one* potential outcome.
- For the units outside our sample, we see *no* potential outcomes.

There are in fact many other potential sources of missing data, as we will explore below in the discussion about the infinite Preceptor Table. This missing data problem is what creates the need for statistical inferences. If the data were not missing, inference would not be needed.

Let’s consider a new example experiment to highlight another type of data problem we might encounter. Say we want to know the causal effect of being elected governor on longevity. In most states the minimum age requirement to be elected governor is 30. People under the age of 30 have no chance of being elected governor. That means for people less than 30 there is only one possible potential outcome. In our actual Preceptor Table, this means we have some rows with 2 columns (people old enough to be elected), and some rows with only 1 column (people too young to be elected).

An actual Preceptor Table for this problem might look something like this:

ID | $$Y_t(u)$$ | $$Y_c(u)$$ |
---|---|---|
Yao | -- | ? |
Dean Khurana | ? | ? |
Cassidy | -- | ? |
Preceptor | ? | ? |
Tahmid | -- | ? |

For Yao, Cassidy and Tahmid there is no question mark in the treatment column because they have no chance of being elected governor. Often in the real world an actual Preceptor Table might look like this. Some rows have two (or more) potential outcomes, and some have fewer. So where do we go from here?

We should consider what we are actually interested in knowing. In this case, we don’t really care what the causal effect is on people who can’t possibly be elected governor. In other words, we don’t care what the causal effect is on the whole population, but rather only a subset of the population. As with the infinite Preceptor Table discussed later in this chapter, we need to throw out rows by being more specific about our problem. Instead of saying we want to know the causal effect of being elected governor, we might specify that we want to know the causal effect for the American population aged 30 or older.

Once we have a defined problem and a reasonable actual Preceptor Table, we can begin to deal with all those question marks.

### 4.4.2 Causal and predictive models

Causal inference is often compared with prediction. In prediction, we want to know an outcome, \(Y(u)\). In causal inference, we want to know a function of *potential* outcomes, such as the treatment effect: \(Y_t(u) - Y_c(u)\).

These are both missing data problems. Prediction involves getting an estimate for an outcome variable that we don’t have, and thus is missing, whether because it is in the future or because it is from data that we are unable to collect. Thus, prediction is the term for using statistical inference to fill in missing data for individual outcomes, for situations in which the concept of potential outcomes does not apply.

Causal inference, however, is the term for filling in missing data for potential outcomes. Unlike in prediction, only one of a unit’s potential outcomes can *ever* be observed, even in principle.

In both causal inference and prediction, the process by which some data is missing and some is observed is crucial. If we think that the missing data is similar to the observed data, we can make inferences more easily. If not, we have to think through the dissimilarities and consider how to model them.

Key point: In a predictive model, there is only one \(Y(u)\) value for each unit. This is very different to the RCM where there are (at least) two potential outcomes (treated and control). There is only one outcome column in a predictive model, whereas there are two or more in a causal model.

With a predictive model we cannot infer what would happen to the outcome \(Y(u)\) if we changed \(X\) *for a given unit*. We can only *compare* two units, one with one value of \(X\) and another with a different value of \(X\).

In a sense, all models are predictive. If we had more data from a stable distribution, then we could make a predictive forecast of someone’s attitude. Only a subset of models are causal, meaning that, for a given individual, you can change the value \(X\) and observe a change in outcome, \(Y(u)\), and from that calculate a causal effect.

### 4.4.3 The assignment mechanism

The “assignment mechanism” is the process by which some units received the treatment while the other units got the control.

We sidestepped the following question before: Is the difference in sample means between treated units and control units, \(\widehat{ATE}\), a good estimate of the \(ATE\)? That depends entirely on the method by which units are assigned treatment, which is called the *assignment mechanism*. That is the mechanism whereby some potential outcomes are missing and some potential outcomes are observed.

This already comes up in a non-causal context when considering *sampling*. If we are trying to estimate the average attitude towards immigrants in the US, we usually do so by taking a sample. The process by which people enter our sample is called the *sampling mechanism*. If the process by which people enter our sample is related to their attitude, even indirectly, then estimates from our sample will not be good estimates for the population.

Whenever the assignment mechanism is correlated with the potential outcomes, we say that there is **confounding**. Confounding is a problem, since it means that our simple estimate of the \(ATE\) is, potentially, biased.

Assignment mechanisms can also intentionally be biased in order to manufacture desired outcomes. Let’s consider a scenario where once again an entire platform is either treated or a control. In this case the assignment mechanism is the choice of the Spanish-speakers; they are allowed to choose which platform they want to stand on. Let’s also say that they can *perfectly* predict the attitude of people on each platform. The Spanish-speakers know that a platform with more liberal attitudes towards immigrants will be more friendly, and therefore always choose to stand on those platforms. In this case, the assignment mechanism of platforms is not random. The Spanish-speakers know these values for the platforms in the experiment:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Platform 1 | 14 | 10 | +4 |
Platform 2 | 7 | 6 | +1 |
Platform 3 | 5 | 8 | -3 |
Platform 4 | 13 | 12 | +1 |
Platform 5 | 4 | 6 | -2 |
Platform 6 | 13 | 11 | +2 |

Based on this knowledge the Spanish-speakers in the experiment would choose the following treatment assignments:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Estimand: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Platform 1 | ? | 10 | ? |
Platform 2 | 7 | ? | ? |
Platform 3 | 5 | ? | ? |
Platform 4 | ? | 12 | ? |
Platform 5 | 4 | ? | ? |
Platform 6 | ? | 11 | ? |

The assignment mechanism used distorts the averages of both \(Y_t(u)\) and \(Y_c(u)\), which in turn distorts the difference in means. The average of the treated group is shifted lower (more liberal), while the average of the control group is shifted higher (more conservative). As a result, the estimate \(\widehat{ATE}\) is negative, giving the illusion that the average treatment effect is negative. The true positive causal effect is masked by this non-random assignment mechanism.

Therefore, the difference in means is no longer a good estimate of the ATE. In fact, in this case it has the wrong sign! This is not merely a consequence of our small sample: even if there were a million platforms in the experiment, we could not get a good estimate of the ATE.
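The size of the distortion can be checked directly. This Python sketch, with variable names of our own choosing, computes the true ATE from the full table the Spanish-speakers know and the difference in means that an analyst would compute from the platforms they chose.

```python
from statistics import mean

# Potential outcomes (Y_t, Y_c) for each platform, as known to the Spanish-speakers.
platforms = {
    1: (14, 10),
    2: (7, 6),
    3: (5, 8),
    4: (13, 12),
    5: (4, 6),
    6: (13, 11),
}

# True ATE: the mean of the individual causal effects across all six platforms.
true_ate = mean(yt - yc for yt, yc in platforms.values())

# Non-random assignment from the table above: platforms 2, 3 and 5 are treated.
treated_ids = {2, 3, 5}

treated_mean = mean(yt for pid, (yt, yc) in platforms.items() if pid in treated_ids)
control_mean = mean(yc for pid, (yt, yc) in platforms.items() if pid not in treated_ids)
ate_hat = treated_mean - control_mean

print(true_ate, round(ate_hat, 2))
```

The true effect is modestly positive while the estimate is strongly negative, because the treated platforms were picked precisely for their low \(Y_t(u)\) values.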

This is an extreme example of a problem called **selection bias**. Selection bias is when the person who is assigning treatment chooses on the basis of potential outcomes. The Spanish-speakers are not choosing platforms to stand on randomly. Rather, they are making treatment decisions based directly on the potential outcomes of each platform. Remember, whenever the assignment mechanism is correlated with the potential outcomes there is confounding, which is a problem because it means that our estimate is biased. Not all examples of confounding are caused by selection bias, but when there is selection bias there is *always* confounding.

Much like how the best way to avoid making poor inferences from a sample to a population is to take a random sample of the population, the best assignment mechanism for avoiding confounding is **randomization**. For each platform we could flip a coin to determine if it is in the treatment or control group.

Random assignment guarantees that, on average, there is no correlation between treatment assignment and anything else, neither covariates nor potential outcomes.
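The phrase “on average” can be made concrete with the six platforms. The sketch below is an illustration under an assumption that differs slightly from coin flips: exactly three of the six platforms are treated, with every such choice equally likely, so that all possible assignments can be enumerated. The function name `diff_in_means` is our own, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

# Same platform values as in the table above: (Y_t, Y_c).
platforms = {1: (14, 10), 2: (7, 6), 3: (5, 8), 4: (13, 12), 5: (4, 6), 6: (13, 11)}

true_ate = Fraction(sum(yt - yc for yt, yc in platforms.values()), len(platforms))

def diff_in_means(treated_ids):
    """Difference in observed means for one particular assignment."""
    treated = [platforms[p][0] for p in platforms if p in treated_ids]
    control = [platforms[p][1] for p in platforms if p not in treated_ids]
    return Fraction(sum(treated), len(treated)) - Fraction(sum(control), len(control))

# Every way of treating exactly three of the six platforms, each equally likely.
estimates = [diff_in_means(set(ids)) for ids in combinations(platforms, 3)]
average_estimate = sum(estimates) / len(estimates)

print(len(estimates), true_ate, average_estimate)
```

Any single random assignment can still give an estimate far from the truth, including the one the Spanish-speakers chose. What randomization provides is that the estimates average out to the true ATE across all the assignments that could have happened.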

In many circumstances, however, randomized trials are not possible due to ethical or practical concerns. In such scenarios there is by necessity a non-random assignment mechanism.

For example, let’s say some of the train platforms in the experiment are so loud the Spanish-speakers might not be heard by anyone nearby. Therefore by necessity, only quieter platforms can be assigned to the treatment group. This non-random assignment may introduce confounding. Say there is some systematic difference between the people on quieter platforms compared with the people on the louder platforms. In that case, the assignment mechanism is correlated with potential outcomes, so there is confounding.

Many statistical methods have been developed for causal inference when there is a non-random assignment mechanism. Those methods, however, are beyond the scope of this book.

### 4.4.4 The infinite Preceptor Table

We have discussed ideal Preceptor Tables (no missing data) and actual Preceptor Tables (question marks representing the values we don’t know), but there is one more type of Preceptor Table. In the real world, a Preceptor Table has an infinite number of rows, and therefore an infinite amount of missing data. We call this type of Preceptor Table an infinite Preceptor Table. Such a reality is unworkable, so we make assumptions to reduce the true problem to something more manageable.

Let’s start by looking at what kinds of missing data make up the infinite Preceptor Table. For example, say we only care about the causal effect of this experiment on Yao. Do we only care about his attitude right after the experiment? No! We also care about Yao’s potential outcomes one year from now, two years from now, and so on.

So our full Preceptor Table includes people we know (Yao) and people we don’t (for example, Eliot), both now and in the future:

ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Estimand $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao this year | 13 | ? | ? |
Yao next year | ? | ? | ? |
Yao two years from now | ? | ? | ? |
Eliot this year | ? | ? | ? |
Eliot next year | ? | ? | ? |
Eliot two years from now | ? | ? | ? |
... | ? | ? | ? |
Person n at time t | ? | ? | ? |

In fact, because time is continuous, there is a row for Yao now, Yao one second from now, Yao one day from now and so on. The Preceptor Table extends downward forever. Thus, in order to estimate any causal effect, we need *assumptions*, so we aren’t dealing with an infinite table.

The most obvious way to eliminate some rows from the table is to assume the causal effect for Yao now is the same as all the ones for Yao in the future. Is that plausible? Sort of. Yao now and Yao in one second are pretty similar! Yao now and Yao in 30 years are less so. Unfortunately, there’s no magic way to get a good estimate of every missing value in the infinite Preceptor Table! But through some assumptions, we can reduce the true problem to the problem we were dealing with before.

Not only can we extend the Preceptor Table by adding people not in our sample to the rows, but we can also add additional treatments to the columns. Let’s go back to the original five people in our sample. What if we also wanted to test the causal effect of another language being spoken on the platform? We’ll call the original treatment \(t\) and the new treatment \(t'\).

ID | Outcome $$Y_t(u)$$ | Outcome $$Y_{t'}(u)$$ | Outcome $$Y_c(u)$$ |
---|---|---|---|
Yao | 13 | ? | ? |
Emma | 11 | ? | ? |
Cassidy | ? | ? | 10 |
Tahmid | ? | ? | 12 |
Diego | 6 | ? | ? |

Note that for Yao, we now have three causal effects we can estimate: the difference between the original treatment and the new treatment, the difference between the original treatment and control, and the difference between the new treatment and control.
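The three comparisons can be enumerated mechanically. The sketch below is in Python, and the helper name `pairwise_effects` and the column labels are hypothetical. It lists every pair of potential-outcome columns, each pair standing for one difference of the form \(Y_a(u) - Y_b(u)\).

```python
from itertools import combinations


def pairwise_effects(columns):
    """Return every (a, b) pair of columns, read as Y_a(u) - Y_b(u)."""
    return list(combinations(columns, 2))


# The three columns in the table above: t, t' and control.
three = pairwise_effects(["t", "t'", "c"])
for a, b in three:
    print(f"Y_{a}(u) - Y_{b}(u)")

# With k columns there are k * (k - 1) / 2 such comparisons.
print(len(pairwise_effects(["t1", "t2", "t3", "t4", "c"])))
```

The number of possible comparisons grows quickly as columns are added, which is one reason to state precisely which comparison is the estimand.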

Even if you have just one language you are testing, there still could be multiple treatments. For example, the amount of time the commuter is on the platform with the Spanish-speakers might vary across commuters. In that case, each might receive a different treatment.

ID | Outcome $$Y_{\text{1 min}}(u)$$ | Outcome $$Y_{\text{5 mins}}(u)$$ | Outcome $$Y_{\text{10 mins}}(u)$$ | Outcome $$Y_{\text{15 mins}}(u)$$ | Outcome $$Y_{\text{20 mins}}(u)$$ | Outcome $$Y_c(u)$$ |
---|---|---|---|---|---|---|
Yao | 13 | ? | ? | ? | ? | ? |
Emma | 11 | ? | ? | ? | ? | ? |
Cassidy | ? | ? | ? | ? | ? | 10 |
Tahmid | ? | ? | ? | ? | ? | 12 |
Diego | 6 | ? | ? | ? | ? | ? |

Instead of considering the treatment in terms of duration, we could also consider different volume levels, measured in decibels (dB), at which Spanish is being spoken.

ID | Outcome $$Y_{\text{58 dB}}(u)$$ | Outcome $$Y_{\text{59 dB}}(u)$$ | Outcome $$Y_{\text{60 dB}}(u)$$ | Outcome $$Y_{\text{61 dB}}(u)$$ | Outcome $$Y_c(u)$$ |
---|---|---|---|---|---|
Yao | 13 | ? | ? | ? | ? |
Emma | 11 | ? | ? | ? | ? |
Cassidy | ? | ? | ? | ? | 10 |
Tahmid | ? | ? | ? | ? | 12 |
Diego | 6 | ? | ? | ? | ? |

Indeed, there are an infinite number of possible treatments. The Preceptor Table extends to the right forever. Again, assumptions come to our rescue. Or rather, we just throw up our hands and only try to estimate a few things. This is why it is crucial to define one’s estimand precisely: if we are interested in the difference in potential outcomes between Spanish being spoken for 10 minutes at 60 dB versus control, we can ignore all the other possible columns in the infinite Preceptor Table.

Thus, whenever you are considering a causal question, the best way to think about it is to start with the infinite Preceptor Table. First we throw out the rows we think are duplicates (such as all the observations for Yao one second from now, two seconds from now, etc.) or that are outside the scope of what we are interested in for now (maybe we don’t care about outcomes 30 years in the future for this study). Second, we throw out the columns that we don’t care about, which are all the possible treatments we aren’t considering. Finally, we define precisely—in terms of potential outcomes—our estimand. It may be something simple, such as the average treatment effect, or something more complex. Once we have done these steps, we can start thinking about how to fill in the question marks. But remember that the infinite Preceptor Table is always there, and you should be conscious of which rows and columns you are throwing out!

### 4.4.5 No causation without manipulation

In order for a potential outcome to make sense, it must be possible, at least *a priori*. For example, if there is no way for Yao, under any circumstance, to ever be in the train study, then \(Y_{t}(u)\) is impossible for him. It can never happen. And if \(Y_{t}(u)\) can never be observed, even in theory, then the causal effect of treatment on Yao’s attitude is undefined.

The causal effect of the train study is well defined because it is the simple difference of two potential outcomes, both of which might happen. In this case, we (or something else) can manipulate the world, at least conceptually, so that it is possible that one thing or a different thing might happen.

This definition of causal effects becomes much more problematic if there is no way for one of the potential outcomes to happen, ever. For example, what is the causal effect of Yao’s height on his weight? It might seem we would just need to compare two potential outcomes: what would Yao’s weight be under the treatment (where treatment is defined as being 3 inches taller) and what would Yao’s weight be under the control (where control is defined as his current height).

A moment’s reflection highlights the problem: we can’t increase Yao’s height. There is no way to observe, even conceptually, what Yao’s weight would be if he were taller because there is no way to make him taller. We can’t manipulate Yao’s height, so it makes no sense to investigate the causal effect of height on weight. Hence the slogan: *No causation without manipulation.*

This then raises the question of what can and cannot be manipulated. If something cannot be manipulated, we should not consider it causal. So can race ever be considered causal? What about sex? A genetic condition like color-blindness? Can we manipulate these characteristics? In the modern world these questions are not simple.

Take color-blindness, for example. Say we are interested in how color-blindness impacts the ability to complete a jigsaw puzzle. Because color-blindness is genetic, some might argue it cannot be manipulated. But advances in technology like gene therapy might allow us to actually change someone’s genes. Could we then claim the ability to manipulate color-blindness? If yes, we could then measure the causal effect of color-blindness on the ability to complete jigsaw puzzles.

The slogan “No causation without manipulation” may at first seem straightforward, but it is clearly not so simple. Questions about race, sex, gender and genetics are very complex and should be considered with care.

### 4.4.6 Internal and external validity

Recall the two main sources of missing data:

- For the units in our sample, we only see one potential outcome
- For the units outside our sample, we see *no* potential outcomes

If we have randomized assignment and a large sample, we can be confident that we have a good estimate of the average treatment effect *in that sample*. We say that the experiment has high **internal validity**: the inferences we are making are likely to reflect the truth about that sample.

However, we may be interested in a population beyond our particular sample, the second main source of missing data. For example, let’s look at the broader context of the train experiment. You likely are not exclusively concerned about the attitudes of people who ride trains, but rather the attitudes of a larger population. Train platforms, however, are a convenient setting in which to run the experiment. Let’s say that we ran a randomized experiment with 10,000 people in Boston, and found an \(\widehat{ATE} = -1\). Should we assume this estimate would be accurate for a larger general population?

The answer to that question depends in part on the **external validity** of the study. Are the 10,000 people in the study similar to the people to whom we want to generalize the findings? Perhaps we want to generalize to train commuters in other cities. Let’s say that the 10,000 people in Boston all choose to ride trains for environmental reasons. That’s another form of **selection bias**. The sample is not randomly selected from the population in which we are interested. Why is that a problem? Those people differ systematically from other people in a way that may affect their response to the experiment. For example, their preference for public transportation for environmental reasons may be correlated with other political beliefs.

Note that this concern can be expressed in terms of the assignment mechanism. People who don’t ride the train have a 0% chance of receiving the treatment. Thus, the study *can’t* directly speak to how the treatment would impact their attitudes towards immigrants. The only way we can make such claims is by making additional assumptions, such as that train-riders reflect the same makeup of political beliefs as people who don’t ride trains.

The external validity of a study is often directly related to the **representativeness** of our sample. Representativeness has to do with how well our sample represents the larger population we are interested in generalizing to. Does the train experiment allow us to calculate a causal effect for people who commute by car? Can we calculate the causal effect for people in New York City? Before we generalize to broader populations, we have to consider whether our experimental estimates are applicable beyond our experiment. Maybe we think that commuters in Boston and New York are similar enough that our \(\widehat{ATE}\) is also a good estimate for the causal effect of our treatment in NYC. We could also conclude that people who commute by car are fundamentally different from people who commute by train. If that were true, then we could not say our estimate holds for all commuters, because our sample does not accurately represent the broader group we want to generalize to.

The circumstances of the experiment may also affect the external validity of the study. Perhaps the train study was conducted during the middle of the summer when platforms are uncomfortably hot. Then, while we have variation in one aspect of the treatment (whether there were Spanish-speakers nearby), we don’t have variation in another (temperature of train platform). It may be that attitude towards immigrants is impacted when the train platform is uncomfortably hot, but the treatment has no impact on attitude otherwise.

When dealing with human subjects, there is a particular concern regarding external validity: the **Hawthorne effect**. When human subjects know that they are part of an experiment, they may change their behavior.

In this example, the Hawthorne effect can impact attitudes expressed on surveys. Maybe respondents are more extreme in their attitude in either direction (more liberal or more conservative) because the survey is an opportunity to express an opinion.

There is a broader concept of **validity** to consider as well. That is, if the data is *valid*, then it accurately captures the concepts we care about. We do not want to use data if there is not a connection between the problem we face and the data that we have.

### 4.4.7 Correlation with potential outcomes

When considering the relationship between a treatment and an outcome, one of the most important assumptions is a lack of correlation between treatment assignment and the potential outcomes. Consider a version of the train experiment. Assume that, if a Republican is not on the platform with Spanish-speakers, they will have an attitude value of 9. If they were on a platform where they could hear Spanish, their attitude would have been 11. A Democrat would have an attitude of 9 regardless of whether or not they are on the platform with Spanish-speakers. In other words, the causal effect of being in the treatment group is +2 for Republicans, and 0 for Democrats. If we could run an experiment with random assignment, we would discover that the average causal effect is somewhere between 0 and 2, depending on the relative proportion of Republicans and Democrats.

ID | $$Y_t(u)$$ | $$Y_c(u)$$ | $$Y_t(u) - Y_c(u)$$ | Party |
---|---|---|---|---|
1 | 11 | 9 | +2 | R |
2 | 11 | 9 | +2 | R |
3 | 9 | 9 | 0 | D |
4 | 11 | 9 | +2 | R |
5 | 9 | 9 | 0 | D |
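Because this table has no missing data, the average causal effect is just arithmetic. The Python sketch below uses the five rows of the table. The helper `ate_for_share` and the half-Republican population are our own illustrative assumptions, added to show how the average depends on party shares.

```python
from statistics import mean

# The ideal Preceptor Table: (ID, Y_t, Y_c, party).
table = [
    (1, 11, 9, "R"),
    (2, 11, 9, "R"),
    (3, 9, 9, "D"),
    (4, 11, 9, "R"),
    (5, 9, 9, "D"),
]

# Average of the unit-level causal effects Y_t(u) - Y_c(u).
ate = mean(y_t - y_c for _, y_t, y_c, _ in table)


def ate_for_share(share_republican):
    """ATE when the effect is +2 for Republicans and 0 for Democrats."""
    return 2 * share_republican + 0 * (1 - share_republican)


print(ate)                   # from the five rows directly
print(ate_for_share(3 / 5))  # same quantity, computed from party shares
print(ate_for_share(0.5))    # a hypothetical half-Republican population
```

Three of the five units are Republicans, so the average effect in this table is 1.2. A population with a different mix of parties would have a different average effect, even though each individual effect is unchanged.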

Unfortunately (?), people do not choose to ride certain trains randomly. We might assume/hope that there is no correlation between which train you ride and potential outcomes. If this is true, then we would still be able to estimate the causal effect. Yet that is rarely true in general. What if, instead, all the Republicans are in the control group, and all the Democrats are in the treatment group? In that case, everyone has an attitude of 9! And it appears that the presence of Spanish-speakers on the platform “does not matter.” The correlation between treatment and potential outcomes invalidates the naive estimate of the average treatment effect.

ID | $$Y_t(u)$$ | $$Y_c(u)$$ | $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
R | ? | 9 | ? |
R | ? | 9 | ? |
D | 9 | ? | ? |
R | ? | 9 | ? |
D | 9 | ? | ? |

Keep in mind that the problem arises when there is a correlation between treatment assignment and *potential* outcome, not simply a correlation between treatment assignment and outcome. In this case, the correlation between treatment assignment and outcome is zero! Just looking at the outcomes we observe is not enough. We must make assumptions about the outcomes we don’t observe, about what would have happened.
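A short Python sketch shows how far off the naive estimate can be. It uses the potential outcomes from this example. The function name `naive_estimate` is hypothetical, and the mirror-image assignment (all Republicans treated) is our own addition for contrast.

```python
from statistics import mean

# Potential outcomes from the example: (Y_t, Y_c, party).
units = [(11, 9, "R"), (11, 9, "R"), (9, 9, "D"), (11, 9, "R"), (9, 9, "D")]

# The true average treatment effect, knowable only with the ideal table.
true_ate = mean(y_t - y_c for y_t, y_c, _ in units)


def naive_estimate(is_treated):
    """Difference in observed means under a given assignment rule."""
    treated = [y_t for y_t, _, party in units if is_treated(party)]
    control = [y_c for _, y_c, party in units if not is_treated(party)]
    return mean(treated) - mean(control)


# Assignment from the text: Democrats treated, Republicans in control.
democrats_treated = naive_estimate(lambda party: party == "D")

# The mirror image (an assumption for contrast): Republicans treated.
republicans_treated = naive_estimate(lambda party: party == "R")

print(true_ate, democrats_treated, republicans_treated)
```

With Democrats treated, the naive estimate is 0. With Republicans treated, it is 2. Neither matches the true average effect of 1.2, because in both cases assignment lines up perfectly with the potential outcomes.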

## 4.5 Summary

The fundamental components of every problem in causal inference are units, treatments and outcomes. The units are the rows in the tibble. The treatments are the columns. The outcomes are the values. Whenever you confront a problem in causal inference, start by identifying the units, treatments and outcomes.

A causal effect is the difference between one potential outcome and another. How different would your life be if you missed the train?

A Preceptor Table includes all the data that we actually have, and all the data we would like to have, to solve our problem. The *ideal* Preceptor Table involves no missing data. We know what the outcome would have been for unit \(i\) under both treatment and control. With the ideal Preceptor Table it is easy to calculate, using only algebra, any quantity of interest. The *actual* Preceptor Table is littered with missing data, represented by question marks. If we know the value of the outcome for unit \(i\) under treatment then, by definition, we cannot know the outcome for unit \(i\) under control.

The tibble in which we store our data and the Preceptor Table have one key difference in structure. The tibble has one column for the outcome variable. The Preceptor Table has one column for each potential outcome, meaning one column for each of the values of the treatment variable.
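The structural difference can be sketched in Python with plain dictionaries standing in for the tibble. The observed attitudes are the ones from the train example. The field names and the helper `to_preceptor_table` are hypothetical.

```python
# Long format, like the tibble: one row per unit, one outcome column.
data = [
    {"id": "Yao", "treatment": "t", "attitude": 13},
    {"id": "Emma", "treatment": "t", "attitude": 11},
    {"id": "Cassidy", "treatment": "c", "attitude": 10},
    {"id": "Tahmid", "treatment": "c", "attitude": 12},
    {"id": "Diego", "treatment": "t", "attitude": 6},
]


def to_preceptor_table(rows, treatments=("t", "c")):
    """One column per potential outcome; unobserved cells become '?'."""
    table = []
    for row in rows:
        wide = {"id": row["id"]}
        for treatment in treatments:
            observed = row["treatment"] == treatment
            wide[f"Y_{treatment}"] = row["attitude"] if observed else "?"
        table.append(wide)
    return table


preceptor_table = to_preceptor_table(data)
for row in preceptor_table:
    print(row)
```

Every row of the result has exactly one observed potential outcome and one question mark, which is the Fundamental Problem of Causal Inference in tabular form.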

The causal effect of a treatment on a single unit at a point in time is the difference between the value of the outcome variable with the treatment and without the treatment. We call these “potential outcomes” because, at most, we can only observe one of them. The Fundamental Problem of Causal Inference is that it is impossible to observe the causal effect on a single unit. We must make assumptions (i.e., we must make models) in order to estimate causal effects.

Random assignment of treatments to units is the best way to estimate causal effects. Other assignment mechanisms are subject to confounding. If the treatment assigned is correlated with the potential outcomes, it is very hard to estimate the true treatment effect. (As always, we use the terms “causal effects” and “treatment effects” interchangeably.) With random assignment, we can, mostly safely, estimate the average treatment effect (ATE) by looking at the difference between the average outcomes of the treated and control units.
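As a minimal sketch of that calculation in Python, take the five observed attitudes from the train example. Treating the assignment as random is an assumption made here for illustration, and the variable names are our own.

```python
from statistics import mean

# Observed data: (unit, assigned arm, observed attitude).
observed = [
    ("Yao", "t", 13),
    ("Emma", "t", 11),
    ("Cassidy", "c", 10),
    ("Tahmid", "c", 12),
    ("Diego", "t", 6),
]

treated = [y for _, arm, y in observed if arm == "t"]
control = [y for _, arm, y in observed if arm == "c"]

# Difference between the average treated and average control outcomes.
ate_hat = mean(treated) - mean(control)
print(ate_hat)
```

The treated units average 10 and the control units average 11, so the estimate is -1. With only five units, this number is very noisy even if assignment really was random.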

Be wary of claims made in situations without random assignment: Here be dragons!