# 4 Rubin Causal Model

*This chapter is being re-drafted.*

Have you ever wondered what the world would be like without you?

George Bailey, a character from the movie “It’s a Wonderful Life,” believes that his life has served no purpose. The movie follows George as he explores a world in which he was never born. It is clear that he had a profound impact on the lives of many people in his community. His actions mattered, more than he ever realized.

By showing what the world would have been like without George, we get an idea of the *causal effect* of his life on his town and the people who live there. This chapter explains causation using the framework of *potential outcomes* and the *Rubin Causal Model* (RCM).

## 4.1 Preceptor Tables

A **Preceptor Table** is a table with rows and columns such that, if none of the data is missing, the thing we want to know is trivial to calculate. Preceptor Tables vary in the number of their rows and columns. We use question marks to indicate missing data in a Preceptor Table. The rows in the Preceptor Table are the *units* — people, galaxies, oak trees — which are the subjects of interest. Even the simplest Preceptor Table will have two columns. The first is an *ID* column which serves to label each unit. The second column is the *outcome* of interest, the variable we are trying to predict/understand/change.

Assume that there are five adult brothers and you are given four of their heights. What is the average height of all five brothers? Consider a Preceptor Table for this problem:

ID | Outcome: Height (cm) |
---|---|
Robert | 178 |
Andy | ? |
Beau | 172 |
Ishan | 173 |
Nicolas | 165 |

In this case, we have a row for each brother and a column for their height. An individual *unit* is a brother. An *outcome* is that brother’s height. We will always have an ID column in Preceptor Tables so that we can identify different units. It is always furthest to the left. In addition to the ID column, we call the column with the brothers’ heights the outcome column.

To estimate average height, we need to estimate Andy’s height – our missing data. One guess is just the average of the other four brothers: \[\frac{(178 + 165 + 172 + 173)}{4} = 172\] Is that realistic? We should consider things like why we know the four other brothers’ heights. Were they sampled randomly? Or do we know their heights because they are the tallest in the family? In that case would it make sense to use an average to estimate Andy’s height? Probably not!
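The guess above can be reproduced in a few lines. This is only a sketch: filling in Andy’s height with the mean of the other four is the assumption questioned in the paragraph above, not a recommended method, and the variable names are our own.

```python
# Known heights (cm) from the Preceptor Table; None marks the question mark.
heights = {"Robert": 178, "Andy": None, "Beau": 172, "Ishan": 173, "Nicolas": 165}

# Mean of the four observed heights, used as the guess for Andy.
known = [h for h in heights.values() if h is not None]
guess_for_andy = sum(known) / len(known)

# Fill in the missing value with the guess, then average all five rows.
filled = {name: (h if h is not None else guess_for_andy) for name, h in heights.items()}
average_height = sum(filled.values()) / len(filled)

print(guess_for_andy)   # 172.0
print(average_height)   # 172.0
```

Because the guess is itself the mean of the observed values, the five-brother average equals the four-brother average. That is a property of this particular imputation rule, not evidence that the guess is right.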

Keep in mind that there is a *truth* out there, a state of the world independent of our knowledge of it. Andy is a specific height. If we had a complete Preceptor Table, with no missing values, we could calculate the average height of the brothers exactly. No fancy statistics would be needed. Just arithmetic.

### 4.1.1 Harvard Height

Consider a more complex problem. We have the heights of 100 Harvard students, and from that we want to know the average height of all students in the school.

ID | Outcome: Height (cm) |
---|---|
Student 1 | ? |
Student 2 | ? |
... | ... |
Student 473 | 172 |
Student 474 | ? |
... | ... |
Student 3,258 | ? |
Student 3,259 | 162 |
... | ... |
Student 6,700 | ? |

Again, are these 100 students randomly sampled? Could we estimate the 90th percentile of height in the student population? These questions are more complicated, and we might be less confident in our best guess. Now let’s say we are given some characteristics other than height for the 100 sampled students, i.e., sex and age.

ID | Outcome: Height (cm) | Covariate: Age | Covariate: Sex |
---|---|---|---|
Student 1 | ? | ? | ? |
Student 2 | ? | ? | ? |
... | ... | ... | ... |
Student 473 | 172 | 19 | M |
Student 474 | ? | ? | ? |
... | ... | ... | ... |
Student 3,258 | ? | ? | ? |
Student 3,259 | 162 | 20 | F |
... | ... | ... | ... |
Student 6,700 | ? | ? | ? |

You’ll notice our Preceptor Table has a new type of column: **covariates**. Are we better able to forecast a student’s height if we are given their age and sex?

In many data science problems, we can organize the rows in the Preceptor Table into three different categories: no missing data, only outcome missing, and everything missing. For example:

ID | Outcome: Height (cm) | Covariate: Age | Covariate: Sex |
---|---|---|---|
Student 1 | ? | ? | ? |
Student 2 | ? | 21 | M |
... | ... | ... | ... |
Student 473 | 172 | 19 | M |
Student 474 | ? | ? | ? |
... | ... | ... | ... |
Student 3,258 | ? | 19 | F |
Student 3,259 | 162 | 20 | F |
... | ... | ... | ... |
Student 6,700 | ? | ? | ? |

In this case, we know the covariates, sex and age, for students 2 and 3,258. But we do not know their heights, the outcome variable. We want to create a model which uses the covariates to predict the outcome.
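The three categories can be made concrete with a short sketch. It uses only the rows written out explicitly in the table above, represents each question mark as `None`, and the function name `categorize` is our own invention for illustration.

```python
# Rows shown explicitly in the Preceptor Table: (height_cm, age, sex).
# None stands for a question mark.
rows = {
    "Student 1": (None, None, None),
    "Student 2": (None, 21, "M"),
    "Student 473": (172, 19, "M"),
    "Student 474": (None, None, None),
    "Student 3,258": (None, 19, "F"),
    "Student 3,259": (162, 20, "F"),
    "Student 6,700": (None, None, None),
}


def categorize(height, age, sex):
    """Return which of the three missing-data categories a row falls into."""
    covariates_known = age is not None and sex is not None
    if height is not None and covariates_known:
        return "no missing data"
    if height is None and covariates_known:
        return "only outcome missing"
    if height is None and age is None and sex is None:
        return "everything missing"
    return "other"  # partial patterns that the text does not discuss


categories = {student: categorize(*values) for student, values in rows.items()}
for student, category in categories.items():
    print(f"{student}: {category}")
```

A predictive model would be fit on the "no missing data" rows and then applied to the "only outcome missing" rows, where the covariates are available.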

## 4.2 Population

One of the most important concepts in all of statistics is *population*, the set of all units with which we are concerned. In our first example, the population was the five brothers in one family. In the second, it was the 6,700 students at Harvard. In many cases, the *population* is a well-defined and (mostly) stable set. In other cases, the population is quite unstable, even ill-defined, and, perhaps, unknowable.

Perhaps more importantly, most real-world examples involve the future, which makes it impossible to specify the population exactly. The future is always cloudy. There are two main sources of difficulty: defining the population, and the population’s habit of changing. Let’s explore each in more depth.

### 4.2.1 Defining population

The act of defining a population may seem simple. After all, once we know who or what we want to study, it should be easy to define, right? Unfortunately, it is often more complicated than we assume! To explore this, we will look at both our Harvard population and the population of New York voters.

#### 4.2.1.1 Harvard

When defining the population of “current” Harvard students for our calculation of height, what exactly do we need to know? Surely a list of current students would help. Is that all? Not exactly. We may ask ourselves the following:

Does the population of Harvard students include prospective students, who have been admitted but have not started their first term?

Does the population of Harvard students include Student X, a student that has been kicked out by the AdBoard, but is appealing the ruling?

Does the population of Harvard students include those taking a leave of absence?

As we can see, merely defining “current” students can be complicated. When deciding on our models, it is important to consider what our population means — and whether we may need to modify our definition for exceptions.

#### 4.2.1.2 New York City voters

Now, we have the population of “New York City voters.” But, what does the phrase “New York City voters” even mean? The very language we use is imprecise. There are a number of ways we may define NYC voters, such as:

- All the people who voted in the last election
- All of the people who are eligible to vote today
- All of the people who are eligible to vote in the next election

Are you only considered a “voter” if you have voted in the past, or is the term “voter” applied to any citizen who is eligible to vote? Who knows? It is our responsibility as data scientists to be sure that we make clear the definitions we are using.

Once we have defined our population — and made said population explicit — we have yet another complication: the population, like the world, changes constantly.

### 4.2.2 Changing population

Even after considering the units which are included or excluded from our population, there is another problem: the world. New units enter our population constantly; existing units exit our population constantly. In other words, there is always going to be data that expires, and there is always going to be new data that we don’t have. What does this mean?

#### 4.2.2.1 Harvard

To begin, visualize the population of Harvard students for which we would like to estimate height. We are (presumably) predicting average height for **today**. The population of Harvard students today may sound relatively easy to conceptualize, but there are often questions we must ask in regards to a *changing* population. For the purposes of this exercise, assume that we have data for all current students at Harvard.

With all of the data, we should be able to say, with one-hundred percent certainty, the average height for Harvard students. Right? Not exactly. Consider the following complications:

- A number of students graduate off-cycle, meaning that they are technically no longer a student at the time of our calculation.
- A student has transferred to another college. This person no longer qualifies as a “Harvard student.”
- A student has died. This piece of data is no longer useful.
- A transfer student has arrived on campus, but isn’t included in our data!

These changes in our population are going to cause some error in our calculations. With larger populations, this becomes even more apparent. To cement this concept, we will explore our New York City voters population.

#### 4.2.2.2 New York City voters

For the sake of this exercise, we define NYC voters as any and all citizens who are eligible to vote today. Even with this *precise* definition, our population changes drastically. In the course of a day:

- People in NYC have died.
- Hundreds of people have just turned 18 and are now eligible to vote, who weren’t eligible yesterday!
- A number of people living in NYC *moved away*.
- A number of people *moved into* NYC.
- A number of people were convicted of a felony and can no longer vote.

The world is an ever changing place, as are our populations. As such, it is virtually impossible to have the Preceptor Tables we wrote about earlier. How can we account for this randomness?

### 4.2.3 Preceptor Table

Given this, a better Preceptor Table for the set of Harvard students would look like this:

ID | Outcome: Height (cm) | Covariate: Age | Covariate: Sex |
---|---|---|---|
Student 1 | ? | ? | ? |
Student 2 | ? | 21 | M |
... | ... | ... | ... |
Student 473 | 172 | 19 | M |
Student 474 | ? | ? | ? |
... | ... | ... | ... |
Student 3,258 | ? | 19 | F |
Student 3,259 | 162 | 20 | F |
... | ... | ... | ... |
Student N | ? | ? | ? |

The only difference is that we no longer know the exact number of students in the population. So, instead of referring to Student 6,700, we use Student \(N\). Even though we don’t know, and often can’t know, the number of units in the population, we still want to be able to label those units and to count them up. \(N\) is the variable which normally refers to the **n**umber of units in the population.

## 4.3 Population Table

We use Preceptor Tables both for predictive problems, like forecasting height on the basis of age and sex, and for causal problems, like estimating the effect of Spanish-speakers on immigration attitudes. So, our previous discussion of the *population* applies here as well. In general, we can divide units of the population into three broad categories: no missing data, all missing data and outcome missing.

So far, we have only asked predictive questions. This chapter, however, primarily focuses on causal inference.

## 4.4 Causal effect

The Rubin Causal Model (RCM) is based on the idea of **potential outcomes.** For example, Enos (2014) measured attitudes toward immigration among Boston commuters. Individuals were exposed to one of two possible conditions, and then their attitudes towards immigrants were recorded. One condition was waiting on a train platform near individuals speaking Spanish. The other was being on a train platform without Spanish-speakers. To calculate the causal effect of having Spanish-speakers nearby, we need to compare the outcome for an individual in one possible state of the world (with Spanish-speakers) to the outcome for that same individual in another state of the world (without Spanish-speakers). It is impossible to observe both potential outcomes at once. One of the potential outcomes is always missing, since a unit cannot experience both treatments. This dilemma is the *Fundamental Problem of Causal Inference*.

In most circumstances, we are interested in comparing two experimental manipulations, one generally termed “treatment” and the other “control.” The difference between the potential outcome under treatment and the potential outcome under control is a “causal effect” or a “treatment effect.” According to the RCM, the **causal effect** of being on the platform with Spanish-speakers is the *difference* between what your attitude would have been under “treatment” (with Spanish-speakers) and under “control” (no Spanish-speakers).

The commuter survey consisted of three questions, each measuring agreement on a 1 to 5 integer scale, with 1 being liberal and 5 being conservative. For each person, the three answers were summed, generating an overall measure of attitude toward immigration which ranged from 3 (very liberal) to 15 (very conservative). If your attitude towards immigrants would have been a 13 with Spanish-speakers and a 9 without Spanish-speakers, then the causal effect of being on a platform with Spanish-speakers is a 4-point increase in your score.

We will use the symbol \(Y\) to represent potential outcomes, the variable we are interested in understanding and modeling. \(Y\) is called the *response* or *outcome* variable. It is the variable we want to “explain.” In our case this would be the attitude score. If we are trying to understand a causal effect, we need two symbols so that control and treated values can be represented separately: \(Y_t\) and \(Y_c\).

### 4.4.1 Potential outcomes

Suppose that Yao is one of the commuters surveyed in this experiment. If we were omniscient, we would know the outcomes for Yao under both treatment (with Spanish-speakers) and control (no Spanish-speakers). We can show this using a Preceptor Table. Calculating the number we are interested in is trivial because none of the data is missing.

ID | Outcome: Attitude if Treated | Outcome: Attitude if Control |
---|---|---|
Yao | 13 | 9 |

From this table we only know the causal effect on Yao. Everyone else in the study might have a lower attitude score (more liberal) if treated. Regardless of what the causal effect is for other subjects, the causal effect for Yao of being on the train platform with Spanish-speakers is a shift towards a more conservative attitude.

Using the response variable — the actual symbol rather than a written description — makes for a more concise Preceptor Table.

ID | Outcome: $$Y_t$$ | Outcome: $$Y_c$$ |
---|---|---|
Yao | 13 | 9 |

Recall that the “causal effect” is the difference between Yao’s potential outcome under treatment and his potential outcome under control.

ID | Outcome: $$Y_t$$ | Outcome: $$Y_c$$ | Causal Effect: $$Y_t - Y_c$$ |
---|---|---|---|
Yao | 13 | 9 | +4 |

Remember that, in the real world, we will have a bunch of missing data! We cannot use simple arithmetic to calculate the causal effect on Yao’s attitude toward immigration. Instead, we will be required to estimate it. An **estimand** is some unknown variable in the real world that we are trying to measure. In this case, it is \(Y_{t}-Y_{c}\), not \(+4\). An estimand is not the *value* you calculated, but is rather the *unknown variable* you want to estimate.

### 4.4.2 Causal and predictive models

Causal inference is often compared with prediction. In prediction, we want to know an outcome, \(Y(u)\). In causal inference, we want to know a function of *potential* outcomes, such as the treatment effect: \(Y_t(u) - Y_c(u)\).

These are both missing data problems. Prediction involves getting an estimate for an outcome variable that we don’t have, and thus is missing, whether because it is in the future or because it is from data that we are unable to collect. Thus, prediction is the term for using statistical inference to fill in missing data for individual outcomes, for situations in which the concept of potential outcomes does not apply.

Causal inference, however, is the term for filling in missing data for potential outcomes. In prediction, the missing outcome could, at least in principle, be observed. In causal inference, only one potential outcome can *ever* be observed for a given unit, even in principle.

In both causal inference and prediction, the process by which some data is missing and some is observed is crucial. If we think that the missing data is similar to the observed data, we can make inferences more easily. If not, we have to think through the dissimilarities and consider how to model them.

Key point: In a predictive model, there is only one \(Y(u)\) value for each unit. This is very different to the RCM where there are (at least) two potential outcomes (treatment and control). There is only one outcome column in a predictive model, whereas there are two or more in a causal model.

With a predictive model we cannot infer what would happen to the outcome \(Y(u)\) if we changed \(X\) *for a given unit*. We can only *compare* two units, one with one value of \(X\) and another with a different value of \(X\).

In a sense, all models are predictive. However, only a subset of models are causal, meaning that, for a given individual, you can change the value \(X\) and observe a change in outcome, \(Y(u)\), and from that calculate a causal effect.

### 4.4.3 Multiple units

Generally, a study has many individuals (or, more broadly, “units”) who each have their own potential outcomes. More notation is needed to allow us to differentiate between different units.

In other words, there needs to be a distinction between \(Y_t\) for Yao, and \(Y_t\) for Emma. We use the variable \(u\) (\(u\) for “unit”) to indicate that the outcome under control and the outcome under treatment can differ for each individual unit (person).

Instead of \(Y_t\), we will use \(Y_t(u)\) to represent “Attitude if Treated.” If you want to talk about only Emma, you could say “Emma’s Attitude if Treated” or “\(Y_t(u = Emma)\)” or “the \(Y_t(u)\) for Emma,” but not just \(Y_t\). That notation is too ambiguous when there is more than one subject.

Let’s look at a Preceptor Table with more subjects using our new notation:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | 9 | +4 |
Emma | 14 | 11 | +3 |
Cassidy | 11 | 6 | +5 |
Tahmid | 9 | 12 | -3 |
Diego | 3 | 4 | -1 |

From this Preceptor Table, there are many possible estimands we might be interested in. Consider some examples, along with their true values:

- A potential outcome for one person, e.g., Yao’s potential outcome under treatment: \(13\).
- A causal effect for one person, such as for Emma. This is the difference between the potential outcomes: \(14 - 11 = +3\).
- The most positive causal effect: \(+5\), for Cassidy.
- The most negative causal effect: \(-3\), for Tahmid.
- The median causal effect: \(+3\).
- The median percentage change: \(+27.3\%\). To see this, calculate the percentage change for each person, relative to the outcome under control. You’ll get 5 percentages: \(+44.4\%\), \(+27.3\%\), \(+83.3\%\), \(-25.0\%\), and \(-25.0\%\).

And so on. There are a lot of things one might care about!

All of the variables calculated above are examples of estimands we might be interested in. One estimand is important enough that it has its own name: the **average treatment effect**, often abbreviated as **ATE**. The average treatment effect is the mean of all the individual causal effects. Here, the mean is \(+1.6\).
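With a complete Preceptor Table, each of these estimands is only arithmetic. The sketch below recomputes them from the five rows above, using Python’s standard `statistics` module. The variable names are ours, and the percentage change is taken relative to the outcome under control, which is our reading of the list above.

```python
from statistics import mean, median

# Potential outcomes from the complete Preceptor Table: (Y_t, Y_c).
table = {
    "Yao": (13, 9),
    "Emma": (14, 11),
    "Cassidy": (11, 6),
    "Tahmid": (9, 12),
    "Diego": (3, 4),
}

# Individual causal effects and percentage changes relative to control.
effects = {name: y_t - y_c for name, (y_t, y_c) in table.items()}
pct_changes = {name: 100 * (y_t - y_c) / y_c for name, (y_t, y_c) in table.items()}

most_positive = max(effects, key=effects.get)
most_negative = min(effects, key=effects.get)
median_effect = median(effects.values())
median_pct_change = median(pct_changes.values())
ate = mean(effects.values())  # average treatment effect

print(most_positive, effects[most_positive])   # Cassidy 5
print(most_negative, effects[most_negative])   # Tahmid -3
print(median_effect)                           # 3
print(round(median_pct_change, 1))             # 27.3
print(ate)                                     # 1.6
```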

What does our real-world Preceptor Table look like?

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | ? | ? |
Emma | 14 | ? | ? |
Cassidy | ? | 6 | ? |
Tahmid | ? | 12 | ? |
Diego | 3 | ? | ? |

Calculating values from this table is no longer a simple math problem.

## 4.5 Simple models

How can we fill in the question marks? Because of the *Fundamental Problem of Causal Inference*, we can never *know* the missing values. Because we can never know the missing values, we must make assumptions. “Assumption” just means that we need a “model,” and all models have parameters.

### 4.5.1 A single value for tau

One model might be that the causal effect is the same for everyone. There is a single parameter, \(\tau\), which we then estimate. (\(\tau\) is a Greek letter, written as “tau” and rhyming with “cow.”) Once we have an estimate, we can fill in the Preceptor Table because, knowing it, we can estimate what the unobserved potential outcome is for each person. We use our assumption about \(\tau\) to estimate the counterfactual outcome for each unit.

Remember what our Preceptor Table looks like with all of the missing data:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | ? | ? |
Emma | 14 | ? | ? |
Cassidy | ? | 6 | ? |
Tahmid | ? | 12 | ? |
Diego | 3 | ? | ? |

If we assume \(\tau\) is the treatment effect for everyone, how do we fill in the table? We are using \(\tau\) as an estimate for the causal effect. By definition: \(Y_t(u) - Y_c(u) = \tau\). Using simple algebra, it is then clear that \(Y_t(u) = Y_c(u) + \tau\) and \(Y_c(u) = Y_t(u) - \tau\). In other words, you could add it to the observed value of every observation in the control group (or subtract it from the observed value of every observation in the treatment group), and thus fill in all the missing values.

Assuming there is a constant treatment effect, \(\tau\), for everyone, filling in the missing values would look like this:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - \tau$$ | $$\tau$$ |
Emma | 14 | $$14 - \tau$$ | $$\tau$$ |
Cassidy | $$6 + \tau$$ | 6 | $$\tau$$ |
Tahmid | $$12 + \tau$$ | 12 | $$\tau$$ |
Diego | 3 | $$3 - \tau$$ | $$\tau$$ |

Now we need to find an estimate for \(\tau\) in order to fill in the missing values. One approach is to subtract the average of the observed control values from the average of the observed treated values: \[\frac{13 + 14 + 3}{3} - \frac{6 + 12}{2} = 10 - 9 = +1\]

This gives us an estimate of \(+1\) for \(\tau\). Let’s fill in our missing values by adding \(\tau\) to the observed values under control and by subtracting \(\tau\) from the observed value under treatment like so:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - (+1)$$ | +1 |
Emma | 14 | $$14 - (+1)$$ | +1 |
Cassidy | $$6 + (+1)$$ | 6 | +1 |
Tahmid | $$12 + (+1)$$ | 12 | +1 |
Diego | 3 | $$3 - (+1)$$ | +1 |

Which gives us:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | 12 | +1 |
Emma | 14 | 13 | +1 |
Cassidy | 7 | 6 | +1 |
Tahmid | 13 | 12 | +1 |
Diego | 3 | 2 | +1 |
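The whole procedure, estimating a single \(\tau\) and then using it to fill in every question mark, can be sketched as follows. The code assumes the constant-effect model described above. The helper name `fill_in` and the use of `None` for a question mark are our own choices.

```python
from statistics import mean

# Observed data: (Y_t, Y_c), with None for the unobserved potential outcome.
observed = {
    "Yao": (13, None),
    "Emma": (14, None),
    "Cassidy": (None, 6),
    "Tahmid": (None, 12),
    "Diego": (3, None),
}

# Estimate tau as the mean of observed treated values minus the mean of
# observed control values.
treated = [y_t for y_t, _ in observed.values() if y_t is not None]
control = [y_c for _, y_c in observed.values() if y_c is not None]
tau_hat = mean(treated) - mean(control)


def fill_in(y_t, y_c, tau):
    """Fill in the missing potential outcome under a constant effect tau."""
    if y_c is None:
        return y_t, y_t - tau   # treated unit: Y_c = Y_t - tau
    return y_c + tau, y_c       # control unit: Y_t = Y_c + tau


filled = {name: fill_in(y_t, y_c, tau_hat) for name, (y_t, y_c) in observed.items()}

print(tau_hat)  # 1
for name, (y_t, y_c) in filled.items():
    print(name, y_t, y_c)
```

Every filled-in row differs by exactly `tau_hat`, which is a consequence of the modeling assumption rather than something learned from the data.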

### 4.5.2 Two values for tau

A second model might assume that the causal effect is different between levels of a category but the same within those levels. For example, perhaps there is a \(\tau_F\) for females and \(\tau_M\) for males where \(\tau_F \neq \tau_M\). We are making this assumption to give us a different model with which we can fill in the missing values in our Preceptor Table. We can’t make any progress unless we make some assumptions. That is an inescapable result of the *Fundamental Problem of Causal Inference*.

Consider a model in which causal effects differ based on sex:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - \tau_M$$ | $$\tau_M$$ |
Emma | 14 | $$14 - \tau_F$$ | $$\tau_F$$ |
Cassidy | $$6 + \tau_F$$ | 6 | $$\tau_F$$ |
Tahmid | $$12 + \tau_M$$ | 12 | $$\tau_M$$ |
Diego | 3 | $$3 - \tau_M$$ | $$\tau_M$$ |

We would have two different estimates for \(\tau\).

\(\tau_M\) would be \[\frac{13 + 3}{2} - 12 = -4\] and \(\tau_F\) would be \[14 - 6 = +8\]

Using those values, we would fill out our new table like this:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - (-4)$$ | -4 |
Emma | 14 | $$14 - (+8)$$ | +8 |
Cassidy | $$6 + (+8)$$ | 6 | +8 |
Tahmid | $$12 + (-4)$$ | 12 | -4 |
Diego | 3 | $$3 - (-4)$$ | -4 |

Which gives us:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | 17 | -4 |
Emma | 14 | 6 | +8 |
Cassidy | 14 | 6 | +8 |
Tahmid | 8 | 12 | -4 |
Diego | 3 | 7 | -4 |

We now have two different estimates for Emma (and for everyone else in the table). When we estimate \(Y_c(Emma)\) using an assumption of constant treatment effect (a single value for \(\tau\)), we get \(Y_c(Emma) = 13\). When we estimate assuming treatment effect is constant for each sex, we calculate that \(Y_c(Emma) = 6\). This difference between our estimates for Emma highlights the difficulties of inference. Models drive inference. Different models will produce different inferences.
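The same fill-in logic works with one \(\tau\) per sex. In this sketch the sex labels are read off the \(\tau_M\) and \(\tau_F\) assignments in the tables above. The function name `estimate_tau` is hypothetical, and the approach simply repeats the difference in observed means within each group.

```python
from statistics import mean

# Observed data: (sex, Y_t, Y_c), with None for the unobserved outcome.
observed = {
    "Yao": ("M", 13, None),
    "Emma": ("F", 14, None),
    "Cassidy": ("F", None, 6),
    "Tahmid": ("M", None, 12),
    "Diego": ("M", 3, None),
}


def estimate_tau(sex):
    """Difference in observed means, using only units of the given sex."""
    treated = [y_t for s, y_t, _ in observed.values() if s == sex and y_t is not None]
    control = [y_c for s, _, y_c in observed.values() if s == sex and y_c is not None]
    return mean(treated) - mean(control)


tau = {sex: estimate_tau(sex) for sex in ("M", "F")}

# Fill in each missing potential outcome with the tau for that unit's sex.
filled = {}
for name, (sex, y_t, y_c) in observed.items():
    if y_c is None:
        filled[name] = (y_t, y_t - tau[sex])
    else:
        filled[name] = (y_c + tau[sex], y_c)

print(tau)
for name, (y_t, y_c) in filled.items():
    print(name, y_t, y_c)
```

With only one or two people per cell, these estimates rest on almost no data. The point of the sketch is the mechanics of the model, not the reliability of the numbers.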

### 4.5.3 Heterogeneous treatment effects

Is the assumption of a constant treatment effect, \(\tau\), usually true? No! It is never true. People vary. The effect of a pill on you will always be different from the effect of a pill on your friend, at least if we measure outcomes accurately enough. Treatment effects are always *heterogeneous*, meaning that they vary across individuals.

Reality looks like this:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | $$13 - \tau_{yao}$$ | $$\tau_{yao}$$ |
Emma | 14 | $$14 - \tau_{emma}$$ | $$\tau_{emma}$$ |
Cassidy | $$6 + \tau_{cassidy}$$ | 6 | $$\tau_{cassidy}$$ |
Tahmid | $$12 + \tau_{tahmid}$$ | 12 | $$\tau_{tahmid}$$ |
Diego | 3 | $$3 - \tau_{diego}$$ | $$\tau_{diego}$$ |

Can we solve for \(\tau_{yao}\)? No! That is the *Fundamental Problem of Causal Inference*. So how can we make any progress from here if we are unwilling to assume there is at least some structure to the causal effect across different individuals? Instead of worrying about the causal effect for specific individuals, we, instead, focus on the causal effect for the entire population.

### 4.5.4 Average treatment effect

The average treatment effect is the **average**, across all units, of the difference between each unit’s *potential* outcome under treatment and its *potential* outcome under control. Because averaging is a linear operator, the average difference is the same as the difference between the averages. The distinction between this estimand and estimands like \(\tau\), \(\tau_M\) and \(\tau_F\) is that, in this case, we do not care about using the average treatment effect to fill in missing values in each row. The average treatment effect is useful because we don’t have to assume anything about each individual’s \(\tau\), like \(\tau_{yao}\), but can still understand something about the average causal effect across the whole population.
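The linearity claim can be written out explicitly. As a sketch, with the sum running over all \(N\) units in the population:

\[\frac{1}{N}\sum_{u}\bigl(Y_t(u) - Y_c(u)\bigr) = \frac{1}{N}\sum_{u} Y_t(u) - \frac{1}{N}\sum_{u} Y_c(u)\]

Checking this against the complete five-person Preceptor Table from the section on multiple units, where no values are missing:

\[\frac{4 + 3 + 5 - 3 - 1}{5} = \frac{13 + 14 + 11 + 9 + 3}{5} - \frac{9 + 11 + 6 + 12 + 4}{5} = 10 - 8.4 = +1.6\]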

As we did before, the simplest way to estimate the ATE is to take the mean of the treated group (\(10\)) and the mean of the control group (\(9\)) and then take the difference in those means (\(1\)). We’ll call this estimate of the average treatment effect, \(\widehat{ATE}\), pronounced “ATE-hat.”

If we already did this exact same calculation above, why are we talking about it again? Remember that we are unwilling to assume treatment effect is constant in our study population, and we cannot solve for \(\tau\) if \(\tau\) is different for different individuals. This is where \(\widehat{ATE}\) is helpful.

*Some* estimands may not require filling in all the question marks in the Preceptor Table. We can get a good estimate of the *average* treatment effect without filling in every question mark — the average treatment effect is just a single number. Rarely in a study do we care about what happens to individuals. In our case, we don’t care about what specifically would happen to Cassidy’s attitude if treated. Instead, we care generally about how our experiment impacts people’s attitudes towards immigrants. This is why an average estimate, like \(\widehat{ATE}\) can be helpful.

As we noted before, this is a popular estimand. Why?

1. There’s an obvious *estimator* for this estimand: the difference between the mean *observed* outcome in the treated group and the mean *observed* outcome in the control group.
2. If treatment is *randomly assigned*, the estimator is *unbiased*: you can be fairly confident in the estimate if you have a large enough treatment and control group.
3. As we did earlier, if you are willing to assume that the causal effect is the same for everyone (a big assumption!), you can use your estimate of the ATE, \(\widehat{ATE}\), to fill in the missing individual values in your Preceptor Table.
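The claim about random assignment can be checked on a toy example. The sketch below assumes we are omniscient and know the complete five-person Preceptor Table from the section on multiple units. It further assumes that exactly three of the five people are treated and that every such assignment is equally likely. It enumerates all ten assignments and averages the resulting difference-in-means estimates.

```python
from itertools import combinations
from statistics import mean

# Complete (normally unobservable) potential outcomes: (Y_t, Y_c).
table = {
    "Yao": (13, 9),
    "Emma": (14, 11),
    "Cassidy": (11, 6),
    "Tahmid": (9, 12),
    "Diego": (3, 4),
}
names = list(table)
true_ate = mean([y_t - y_c for y_t, y_c in table.values()])


def difference_in_means(treated_names):
    """Estimate the ATE from what would be observed under one assignment."""
    treated = [table[n][0] for n in names if n in treated_names]
    control = [table[n][1] for n in names if n not in treated_names]
    return mean(treated) - mean(control)


# Every way of assigning 3 of the 5 people to treatment.
estimates = [difference_in_means(set(group)) for group in combinations(names, 3)]

print(len(estimates))                 # 10
print(min(estimates), max(estimates))  # individual estimates vary a lot
print(true_ate, mean(estimates))       # but their average matches the true ATE
```

Any single assignment can give an estimate far from the true ATE of \(+1.6\). The average over all assignments matches it, which is what unbiasedness means. With only five units the spread is wide, which is why sample size still matters.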

Just because the ATE is often a useful estimand doesn’t mean that it *always* is.

Consider point #3. For example, let’s say the treatment effect varies depending on sex. For males there is a strong negative effect (-3.5), but for females there is a smaller positive effect (+1). The average treatment effect for the whole sample, even if you estimate it correctly, will then be a single negative number (-1), since the negative effect for males is larger than the positive effect for females. Using that single number to fill in every row would be wrong for both groups.

Estimating the average treatment effect, by calculating \(\widehat{ATE}\), is easy. But is our \(\widehat{ATE}\) a good estimate of the actual ATE? After all, if we knew all the missing values in the Preceptor Table, we could calculate the ATE perfectly. But those missing values may be wildly different from the observed values. Consider this Preceptor Table:

ID | Outcome: $$Y_t(u)$$ | Outcome: $$Y_c(u)$$ | Causal Effect: $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao | 13 | 10 | +3 |
Emma | 14 | 11 | +3 |
Cassidy | 9 | 6 | +3 |
Tahmid | 15 | 12 | +3 |
Diego | 3 | 0 | +3 |

In this example, there is indeed a constant treatment effect for everyone: \(+3\). Note that the *observed* values are the same as before, but the unobserved values are such that our estimated ATE, \(+1\), is pretty far from the actual ATE, \(+3\). If we think we have a reasonable estimate of the ATE, using that value as a constant for \(\tau\) might be our *best guess*.
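The gap between the estimate and the truth in this example can be verified directly. The sketch below takes the complete table above as given and assumes the same assignment as in the earlier observed table: Yao, Emma and Diego treated, Cassidy and Tahmid in control.

```python
from statistics import mean

# Complete table for this example: (Y_t, Y_c). Every effect is +3.
full = {
    "Yao": (13, 10),
    "Emma": (14, 11),
    "Cassidy": (9, 6),
    "Tahmid": (15, 12),
    "Diego": (3, 0),
}
treated_names = {"Yao", "Emma", "Diego"}  # units observed under treatment

# The true ATE uses both potential outcomes for every unit.
true_ate = mean([y_t - y_c for y_t, y_c in full.values()])

# The estimate uses only the one outcome actually observed for each unit.
observed_treated = [full[n][0] for n in full if n in treated_names]
observed_control = [full[n][1] for n in full if n not in treated_names]
ate_hat = mean(observed_treated) - mean(observed_control)

print(true_ate)  # 3
print(ate_hat)   # 1
```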

## 4.6 Validity

#### 4.6.0.1 Internal and external validity

There are two main sources of missing data:

- For the units in our sample, we only see one potential outcome
- For the units outside our sample, we see *no* potential outcomes

If we have randomized assignment and a large sample, we can be confident that we have a good estimate of the average treatment effect *in that sample*. We say that the experiment has high **internal validity**: the inferences we are making are likely to reflect the truth about that sample.

However, we may be interested in a population beyond our particular sample, the second main source of missing data. For example, let’s look at the broader context of the train experiment. You likely are not exclusively concerned about the attitudes of people who ride trains, but rather the attitudes of a larger population. Train platforms, however, are a convenient setting in which to run the experiment. Let’s say that we ran a randomized experiment with 10,000 people in Boston, and found an \(\widehat{ATE} = -1\). Should we assume this estimate would be accurate for a larger general population?

The answer to that question depends in part on the **external validity** of the study. Are the 10,000 people in the study similar to the people to whom we want to generalize the findings? Perhaps we want to generalize to train commuters in other cities. Let’s say that the 10,000 people in Boston all choose to ride trains for environmental reasons. That’s another form of **selection bias**. The sample is not randomly selected from the population in which we are interested. Why is that a problem? Those people differ systematically from other people in a way that may affect their response to the experiment. For example, their preference for public transportation for environmental reasons may be correlated with other political beliefs.

Note that this concern can be expressed in terms of the assignment mechanism. People who don’t ride the train have a 0% chance of receiving the treatment. Thus, the study *can’t* directly speak to how the treatment would impact their attitudes towards immigrants. The only way we can make such claims is by making additional assumptions, such as that train-riders reflect the same makeup of political beliefs as people who don’t ride trains.

The external validity of a study is often directly related to the **representativeness** of our sample, that is, how well the sample represents the larger population to which we want to generalize. Does the train experiment allow us to calculate a causal effect for people who commute by car? Can we calculate the causal effect for people in New York City? Before we generalize to broader populations, we have to consider whether our experimental estimates are applicable beyond our experiment. Maybe we think that commuters in Boston and New York are similar enough that our \(\widehat{ATE}\) is also a good estimate of the causal effect of our treatment in NYC. We could also conclude that people who commute by car are fundamentally different from people who commute by train. If that were true, then we could not say our estimate holds for all commuters, because our sample does not accurately represent the broader group to which we want to generalize.

The circumstances of the experiment may also affect the external validity of the study. Perhaps the train study was conducted during the middle of the summer when platforms are uncomfortably hot. Then, while we have variation in one aspect of the treatment (whether there were Spanish-speakers nearby), we don’t have variation in another (temperature of the train platform). It may be that the treatment affects attitudes towards immigrants only when the platform is uncomfortably hot, and has no impact otherwise.

When dealing with human subjects, there is a particular concern regarding external validity: the **Hawthorne effect**. When human subjects know that they are part of an experiment, they may change their behavior.

In this example, the Hawthorne effect can impact attitudes expressed on surveys. Maybe respondents are more extreme in their attitude in either direction (more liberal or more conservative) because the survey is an opportunity to express opinion.

There is a broader concept of **validity** to consider as well. That is, if the data is *valid*, then it accurately captures the concepts we care about. We do not want to use data if there is not a connection between the problem we face and the data that we have.

#### 4.6.0.2 The infinite Preceptor Table

In reality, a Preceptor Table has an infinite number of rows, and therefore an infinite amount of missing data. We call this type of Preceptor Table an infinite Preceptor Table. Such a reality is unworkable, so we make assumptions to reduce the true problem to something more manageable.

Let’s start by looking at what kinds of missing data make up the infinite Preceptor Table. For example, say we only care about the causal effect of this experiment on Yao. Do we only care about his attitude right after the experiment? No! We also care about Yao’s potential outcomes one year from now, two years from now, and so on.

So our full Preceptor Table includes people we know (Yao) and people we don’t (for example, Eliot), both now and in the future:

ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal effect $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Yao this year | 13 | ? | ? |
Yao next year | ? | ? | ? |
Yao two years from now | ? | ? | ? |
Eliot this year | ? | ? | ? |
Eliot next year | ? | ? | ? |
Eliot two years from now | ? | ? | ? |
... | ? | ? | ? |
Person n at time t | ? | ? | ? |

In fact, because time is continuous, there is a row for Yao now, Yao one second from now, Yao one day from now and so on. The Preceptor Table extends downward forever. Thus, in order to estimate any causal effect, we need *assumptions*, so we aren’t dealing with an infinite table.

The most obvious way to eliminate some rows from the table is to assume the causal effect for Yao now is the same as all the ones for Yao in the future. Is that plausible? Sort of. Yao now and Yao in one second are pretty similar! Yao now and Yao in 30 years are less so. Unfortunately, there’s no magic way to get a good estimate of every missing value in the infinite Preceptor Table! But through some assumptions, we can reduce the true problem to the problem we were dealing with before.

Not only can we extend the Preceptor Table by adding people not in our sample to the rows, but we can also add additional treatments to the columns. Let’s go back to the original five people in our sample. What if we also wanted to test the causal effect of another language being spoken on the platform? We’ll call the original treatment \(t\) and the new treatment \(t'\).

ID | Outcome $$Y_t(u)$$ | Outcome $$Y_{t'}(u)$$ | Outcome $$Y_c(u)$$ |
---|---|---|---|
Yao | 13 | ? | ? |
Emma | 11 | ? | ? |
Cassidy | ? | ? | 10 |
Tahmid | ? | ? | 12 |
Diego | 6 | ? | ? |

Note that for Yao, we now have three causal effects we can estimate: the difference between the original treatment and the new treatment, the difference between the original treatment and control, and the difference between the new treatment and control.
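Written out in the notation of this chapter, with \(t'\) denoting the new treatment, the three effects for a single unit \(u\) are:

$$
\begin{aligned}
Y_t(u) - Y_{t'}(u) &\quad \text{(original treatment versus new treatment)} \\
Y_t(u) - Y_c(u) &\quad \text{(original treatment versus control)} \\
Y_{t'}(u) - Y_c(u) &\quad \text{(new treatment versus control)}
\end{aligned}
$$

More generally, a table with \(k\) potential outcome columns allows \(k(k-1)/2\) pairwise comparisons per unit, so the number of candidate causal effects grows quickly as treatments are added.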

Even if you have just one language you are testing, there still could be multiple treatments. For example, the amount of time the commuter is on the platform with the Spanish-speakers might vary across commuters. In that case, each might receive a different treatment.

ID | Outcome $$Y_{\text{1 min}}(u)$$ | Outcome $$Y_{\text{5 mins}}(u)$$ | Outcome $$Y_{\text{10 mins}}(u)$$ | Outcome $$Y_{\text{15 mins}}(u)$$ | Outcome $$Y_{\text{20 mins}}(u)$$ | Outcome $$Y_c(u)$$ |
---|---|---|---|---|---|---|
Yao | 13 | ? | ? | ? | ? | ? |
Emma | 11 | ? | ? | ? | ? | ? |
Cassidy | ? | ? | ? | ? | ? | 10 |
Tahmid | ? | ? | ? | ? | ? | 12 |
Diego | 6 | ? | ? | ? | ? | ? |

Instead of considering the treatment in terms of duration, we could also consider different volume levels, measured in decibels (dB), at which Spanish is being spoken.

ID | Outcome $$Y_{\text{58 dB}}(u)$$ | Outcome $$Y_{\text{59 dB}}(u)$$ | Outcome $$Y_{\text{60 dB}}(u)$$ | Outcome $$Y_{\text{61 dB}}(u)$$ | Outcome $$Y_c(u)$$ |
---|---|---|---|---|---|
Yao | 13 | ? | ? | ? | ? |
Emma | 11 | ? | ? | ? | ? |
Cassidy | ? | ? | ? | ? | 10 |
Tahmid | ? | ? | ? | ? | 12 |
Diego | 6 | ? | ? | ? | ? |

Indeed, there are an infinite number of possible treatments. The Preceptor Table extends to the right forever. Again, assumptions come to our rescue. Or rather, we just throw up our hands and only try to estimate a few things. This is why it is crucial to define one’s estimand precisely: if we are interested in the difference in potential outcomes between Spanish being spoken for 10 minutes at 60 dB versus control, we can ignore all the other possible columns in the infinite Preceptor Table.
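One way to pin such an estimand down is to write it as a formula. The expression below is only a sketch: it assumes that we care about the *average* of this difference across \(N\) units of interest, and the subscript \(\text{10 mins, 60 dB}\) is notation introduced here for illustration, not used elsewhere in the chapter.

$$
ATE = \frac{1}{N} \sum_{u=1}^{N} \left[ Y_{\text{10 mins, 60 dB}}(u) - Y_c(u) \right]
$$

Only two columns of the infinite Preceptor Table appear in this formula, which is exactly why every other column can be set aside once the estimand is fixed.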

Thus, whenever you are considering a causal question, the best way to think about it is to start with the infinite Preceptor Table. First we throw out the rows we think are duplicates (such as all the observations for Yao one second from now, two seconds from now, etc.) or that are outside the scope of what we are interested in for now (maybe we don’t care about outcomes 30 years in the future for this study). Second, we throw out the columns that we don’t care about, which are all the possible treatments we aren’t considering. Finally, we define precisely—in terms of potential outcomes—our estimand. It may be something simple, such as the average treatment effect, or something more complex. Once we have done these steps, we can start thinking about how to fill in the question marks. But remember that the infinite Preceptor Table is always there, and you should be conscious of which rows and columns you are throwing out!

### 4.6.1 The assignment mechanism

The *assignment mechanism* is the process by which some units receive the treatment while the other units receive the control. Equivalently, it is the mechanism that determines which potential outcomes are observed and which are missing.

We sidestepped the following question before: Is the difference in sample means between treated units and control units, \(\widehat{ATE}\), a good estimate of the \(ATE\)? That depends entirely on the assignment mechanism.

This already comes up in a non-causal context when considering *sampling*. If we are trying to estimate the average attitude towards immigrants in the US, we usually do so by taking a sample. The process by which people enter our sample is called the *sampling mechanism*. If the process by which people enter our sample is related to their attitude, even indirectly, then estimates from our sample will not be good estimates for the population.

Whenever the assignment mechanism is correlated with the potential outcomes, we say that there is **confounding**. Confounding is a problem, since it means that our simple estimate of the \(ATE\) is, potentially, biased.

Assignment mechanisms can also intentionally be biased in order to manufacture desired outcomes. Let’s consider a scenario where once again an entire platform is either treated or a control. In this case the assignment mechanism is the choice of the Spanish-speakers; they are allowed to choose which platform they want to stand on. Let’s also say that they can *perfectly* predict the attitude of people on each platform. The Spanish-speakers know that a platform with more liberal attitudes towards immigrants will be more friendly, and therefore always choose to stand on those platforms. In this case, the assignment mechanism of platforms is not random. The Spanish-speakers know these values for the platforms in the experiment:

ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal effect $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Platform 1 | 14 | 10 | +4 |
Platform 2 | 7 | 6 | +1 |
Platform 3 | 5 | 8 | -3 |
Platform 4 | 13 | 12 | +1 |
Platform 5 | 4 | 6 | -2 |
Platform 6 | 13 | 11 | +2 |

Based on this knowledge the Spanish-speakers in the experiment would choose the following treatment assignments:

ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal effect $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
Platform 1 | ? | 10 | ? |
Platform 2 | 7 | ? | ? |
Platform 3 | 5 | ? | ? |
Platform 4 | ? | 12 | ? |
Platform 5 | 4 | ? | ? |
Platform 6 | ? | 11 | ? |

This assignment mechanism distorts the observed averages of both \(Y_t(u)\) and \(Y_c(u)\), which in turn distorts the difference in means. The average of the treated group is shifted lower (more liberal), while the average of the control group is shifted higher (more conservative). As a result, the estimated average treatment effect, \(\widehat{ATE}\), is negative, even though the true \(ATE\) is positive. The true causal effect is masked by the non-random assignment mechanism.

Therefore, the difference in means is no longer a good estimate of the ATE. In fact, in this case it has the wrong sign! This is not merely a consequence of our small sample: even if there were a million platforms in the experiment, we could not get a good estimate of the ATE.
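The arithmetic behind this claim can be checked directly. The following is a minimal Python sketch, not part of the original analysis; the dictionary layout and variable names are chosen here for illustration, while the numbers are copied from the two platform tables above.

```python
from statistics import mean

# Potential outcomes for the six platforms, copied from the tables above:
# platform -> (Y_t, Y_c)
potential_outcomes = {
    1: (14, 10),
    2: (7, 6),
    3: (5, 8),
    4: (13, 12),
    5: (4, 6),
    6: (13, 11),
}

# Platforms the Spanish-speakers chose to stand on (the treated group).
treated = {2, 3, 5}

# True ATE: the average of Y_t - Y_c over all platforms. We can compute it
# only because this toy example reveals both potential outcomes.
true_ate = mean(y_t - y_c for y_t, y_c in potential_outcomes.values())

# Naive estimate: mean observed Y_t among treated platforms minus
# mean observed Y_c among control platforms.
treated_mean = mean(
    y_t for p, (y_t, _) in potential_outcomes.items() if p in treated
)
control_mean = mean(
    y_c for p, (_, y_c) in potential_outcomes.items() if p not in treated
)
estimated_ate = treated_mean - control_mean

print(f"true ATE      = {true_ate:.2f}")
print(f"estimated ATE = {estimated_ate:.2f}")
```

The true ATE works out to \(+0.5\), while the difference in observed means is roughly \(-5.67\): the wrong sign and more than ten times the magnitude.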

This is an extreme example of a problem called **selection bias**. Selection bias is when the person who is assigning treatment chooses on the basis of potential outcomes. The Spanish-speakers are not choosing platforms to stand on randomly. Rather, they are making treatment decisions based directly on the potential outcomes of each platform. Remember, whenever the assignment mechanism is correlated with the potential outcomes there is confounding, which is a problem because it means that our estimate is biased. Not all examples of confounding are caused by selection bias, but when there is selection bias there is *always* confounding.

Much like how the best way to avoid making poor inferences from a sample to a population is to take a random sample of the population, the best assignment mechanism for avoiding confounding is **randomization**. For each platform we could flip a coin to determine if it is in the treatment or control group.

Random assignment guarantees that, on average, there is no correlation between treatment assignment and anything else, neither covariates nor potential outcomes.
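To see what "on average" means here, we can enumerate every possible random assignment for the six platforms in the selection example above. The sketch below makes one assumption that differs from a literal coin flip per platform: it fixes the treated group at exactly three platforms (sometimes called complete randomization), so that neither group is ever empty. The function and variable names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

# Potential outcomes from the platform example: platform -> (Y_t, Y_c)
potential_outcomes = {
    1: (14, 10),
    2: (7, 6),
    3: (5, 8),
    4: (13, 12),
    5: (4, 6),
    6: (13, 11),
}
platforms = sorted(potential_outcomes)


def difference_in_means(treated):
    """Observed treated mean minus observed control mean for one assignment."""
    treated_y = [potential_outcomes[p][0] for p in platforms if p in treated]
    control_y = [potential_outcomes[p][1] for p in platforms if p not in treated]
    return Fraction(sum(treated_y), len(treated_y)) - Fraction(
        sum(control_y), len(control_y)
    )


# Every way to pick 3 of the 6 platforms for treatment, each equally likely
# under the assumed randomization scheme. Fractions keep the arithmetic exact.
estimates = [
    difference_in_means(set(combo)) for combo in combinations(platforms, 3)
]
average_estimate = sum(estimates) / len(estimates)

true_ate = Fraction(
    sum(y_t - y_c for y_t, y_c in potential_outcomes.values()), len(platforms)
)

print(f"number of possible assignments: {len(estimates)}")
print(f"average estimate across assignments: {float(average_estimate):.2f}")
print(f"true ATE: {float(true_ate):.2f}")
```

Any single assignment can still give an estimate far from the truth, especially with only six units. What randomization provides is that the estimates are centered on the true ATE, with no systematic pull in either direction.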

In many circumstances, however, randomized trials are not possible due to ethical or practical concerns. In such scenarios there is by necessity a non-random assignment mechanism.

For example, let’s say some of the train platforms in the experiment are so loud the Spanish-speakers might not be heard by anyone nearby. Therefore by necessity, only quieter platforms can be assigned to the treatment group. This non-random assignment may introduce confounding. Say there is some systematic difference between the people on quieter platforms compared with the people on the louder platforms. In that case, the assignment mechanism is correlated with potential outcomes, so there is confounding.

Many statistical methods have been developed for causal inference when there is a non-random assignment mechanism. Those methods, however, are beyond the scope of this book.

### 4.6.2 Correlation with potential outcomes

When considering the relationship between a treatment and an outcome, one of the most important assumptions is a lack of correlation between treatment assignment and the potential outcomes. Consider a version of the train experiment. Assume that, if a Republican is not on the platform with Spanish-speakers, they will have an attitude value of 9. If they were on a platform where they could hear Spanish, their attitude would have been 11. A Democrat would have an attitude 9 regardless of whether or not they are on the platform with Spanish-speakers. In other words, the causal effect of being in the treatment group is +2 for Republicans, and 0 for Democrats. If we could run an experiment with random assignment, we would discover that the average causal effect is somewhere between 0 and 2, depending on the relative proportion of the Republicans and Democrats.

ID | $$Y_t(u)$$ | $$Y_c(u)$$ | $$Y_t(u) - Y_c(u)$$ | Party |
---|---|---|---|---|
1 | 11 | 9 | +2 | R |
2 | 11 | 9 | +2 | R |
3 | 9 | 9 | 0 | D |
4 | 11 | 9 | +2 | R |
5 | 9 | 9 | 0 | D |

Unfortunately, people do not choose which trains to ride at random. We might hope that there is no correlation between which train you ride and your potential outcomes. If that were true, we would still be able to estimate the causal effect. Yet it is rarely true in general. What if, instead, all the Republicans are in the control group and all the Democrats are in the treated group? In that case, every observed attitude is 9! It appears that the presence of Spanish-speakers on the platform “does not matter.” The correlation between treatment assignment and potential outcomes invalidates the naive estimate of the average treatment effect.

ID | $$Y_t(u)$$ | $$Y_c(u)$$ | $$Y_t(u) - Y_c(u)$$ |
---|---|---|---|
R | ? | 9 | ? |
R | ? | 9 | ? |
D | 9 | ? | ? |
R | ? | 9 | ? |
D | 9 | ? | ? |

Keep in mind that the problem arises when there is a correlation between treatment assignment and *potential* outcomes, not simply a correlation between treatment assignment and the observed outcome. In this case, the observed outcome is 9 in both groups, so it shows no association with treatment assignment at all! Just looking at the outcomes we observe is not enough. We must make assumptions about the outcomes we don’t observe, about what would have happened.
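The Republican and Democrat example can be worked through in a few lines. This is a small Python sketch using the five units from the first table in this section; the tuple layout and variable names are assumptions made for illustration.

```python
from statistics import mean

# unit id -> (Y_t, Y_c, party), copied from the table of five units
units = {
    1: (11, 9, "R"),
    2: (11, 9, "R"),
    3: (9, 9, "D"),
    4: (11, 9, "R"),
    5: (9, 9, "D"),
}

# True ATE, available only because both potential outcomes are shown.
true_ate = mean(y_t - y_c for y_t, y_c, _ in units.values())

# Confounded assignment: every Democrat is treated, every Republican is control.
observed_treated = [y_t for y_t, _, party in units.values() if party == "D"]
observed_control = [y_c for _, y_c, party in units.values() if party == "R"]
naive_estimate = mean(observed_treated) - mean(observed_control)

print(f"true ATE       = {true_ate:.1f}")
print(f"naive estimate = {naive_estimate:.1f}")
```

With three Republicans and two Democrats the true ATE is \(1.2\), yet the naive difference in means is exactly \(0\), because party membership determines both who is treated and how each person would respond.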

### 4.6.3 No causation without manipulation

In order for a potential outcome to make sense, it must be possible, at least *a priori*. For example, if there is no way for Yao, under any circumstance, to ever be in the train study, then \(Y_{t}(u)\) is impossible for him. It can never happen. And if \(Y_{t}(u)\) can never be observed, even in theory, then the causal effect of treatment on Yao’s attitude is undefined.

The causal effect of the train study is well defined because it is the simple difference of two potential outcomes, both of which might happen. In this case, we (or something else) can manipulate the world, at least conceptually, so that it is possible that one thing or a different thing might happen.

This definition of causal effects becomes much more problematic if there is no way for one of the potential outcomes to happen, ever. For example, what is the causal effect of Yao’s height on his weight? It might seem we would just need to compare two potential outcomes: what would Yao’s weight be under the treatment (where treatment is defined as being 3 inches taller) and what would Yao’s weight be under the control (where control is defined as his current height).

A moment’s reflection highlights the problem: we can’t increase Yao’s height. There is no way to observe, even conceptually, what Yao’s weight would be if he were taller because there is no way to make him taller. We can’t manipulate Yao’s height, so it makes no sense to investigate the causal effect of height on weight. Hence the slogan: *No causation without manipulation.*

This then raises the question of what can and cannot be manipulated. If something cannot be manipulated, we should not consider it causal. So can race ever be considered causal? What about sex? A genetic condition like color-blindness? Can we manipulate these characteristics? In the modern world these questions are not simple.

Take color-blindness, for example. Say we are interested in how color-blindness impacts the ability to complete a jigsaw puzzle. Because color-blindness is genetic, some might argue it cannot be manipulated. But advances in technology like gene therapy might allow us to actually change someone’s genes. Could we then claim the ability to manipulate color-blindness? If yes, we could then measure the causal effect of color-blindness on the ability to complete jigsaw puzzles.

The slogan “No causation without manipulation” may at first seem straightforward, but it is clearly not so simple. Questions about race, sex, gender and genetics are very complex and should be considered with care.

## 4.7 Summary

The fundamental components of every problem in causal inference are units, treatments and outcomes. The units are the rows in the tibble. The treatments are the columns. The outcomes are the values. Whenever you confront a problem in causal inference, start by identifying the units, treatments and outcomes.

A causal effect is the difference between one potential outcome and another. How different would your life be if you missed the train?

A Preceptor Table includes all the data that we actually have, and all the data we would like to have, to solve our problem. The *ideal* Preceptor Table involves no missing data. We know what the outcome would have been for unit \(i\) under both treatment and control. With the ideal Preceptor Table it is easy to calculate, using only algebra, any quantity of interest. The *actual* Preceptor Table is littered with missing data, represented by question marks. If we know the value of the outcome for unit \(i\) under treatment then, by definition, we cannot know the outcome for unit \(i\) under control.

The tibble in which we store our data and the Preceptor Table have one key difference in structure. The tibble has one column for the outcome variable. The Preceptor Table has one column for each potential outcome, meaning one column for each of the values of the treatment variable.
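The difference in structure can be made concrete with a short sketch. It is written in Python rather than with a tibble, purely for illustration; the column names `treatment` and `attitude` and the helper `to_preceptor_table` are hypothetical, while the five observed values come from the train example earlier in the chapter.

```python
# Observed data in tidy form: one row per unit and a single outcome column.
observed = [
    {"id": "Yao", "treatment": "treated", "attitude": 13},
    {"id": "Emma", "treatment": "treated", "attitude": 11},
    {"id": "Cassidy", "treatment": "control", "attitude": 10},
    {"id": "Tahmid", "treatment": "control", "attitude": 12},
    {"id": "Diego", "treatment": "treated", "attitude": 6},
]


def to_preceptor_table(rows):
    """Spread the outcome into one column per potential outcome.

    None marks the potential outcome that was not observed.
    """
    table = []
    for row in rows:
        is_treated = row["treatment"] == "treated"
        table.append(
            {
                "id": row["id"],
                "y_t": row["attitude"] if is_treated else None,
                "y_c": None if is_treated else row["attitude"],
            }
        )
    return table


preceptor_table = to_preceptor_table(observed)

# Print the table, showing missing potential outcomes as question marks.
for row in preceptor_table:
    y_t = "?" if row["y_t"] is None else row["y_t"]
    y_c = "?" if row["y_c"] is None else row["y_c"]
    print(f'{row["id"]:<8} {y_t!s:>3} {y_c!s:>3}')
```

Every row ends up with exactly one observed value and one question mark, which is the Fundamental Problem of Causal Inference expressed as a data structure.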

The causal effect of a treatment on a single unit at a point in time is the difference between the value of the outcome variable with the treatment and without the treatment. We call these “potential outcomes” because we can observe at most one of them. The *Fundamental Problem of Causal Inference* is that it is impossible to observe the causal effect on a single unit. We must make assumptions (that is, we must make models) in order to estimate causal effects.

Random assignment of treatments to units is the best way to estimate causal effects. Other assignment mechanisms are subject to confounding. If the treatment assigned is correlated with the potential outcomes, it is very hard to estimate the true treatment effect. (As always, we use the terms “causal effects” and “treatment effects” interchangeably.) With random assignment, we can, mostly safely, estimate the average treatment effect (ATE) by looking at the difference between the average outcomes of the treated and control units.

Be wary of claims made in situations without random assignment: Here be dragons!