# 4 Rubin Causal Model

Have you ever wondered what the world would be like without you?

George Bailey, a character from the movie “It’s a Wonderful Life,” believes that his life has served no purpose. The movie follows George as he explores a world in which he was never born. It is clear that he had a profound impact on the lives of many people in his community. His actions mattered, more than he ever realized.

By showing what the world would have been like without George, we get an idea of the *causal effect* of his life on his town and the people who live there. This chapter explains causation using the framework of *potential outcomes* and the *Rubin Causal Model* (RCM).

Before we start, note that some concepts in this chapter, such as the Preceptor Table, do not appear in other textbooks; they are used only in the *Primer*. The aim of this chapter is to help you lay out all your data and judge whether a question is truly worth pursuing.

## 4.1 Preceptor Table

A **Preceptor Table** is a table with rows and columns such that, if none of the data is missing, the thing we want to know is trivial to calculate. Preceptor Tables vary in the number of their rows and columns. We use question marks to indicate missing data in a Preceptor Table. The rows in the Preceptor Table are the *units* — people, galaxies, oak trees — which are the subjects of interest. Even the simplest Preceptor Table will have two columns. The first is an *ID* column which serves to label each unit. The second column is the *outcome* of interest, the variable we are trying to predict/understand/change.

Assume that there are five adult brothers and you are given four of their heights. What is the average height of all five brothers? Consider a Preceptor Table for this problem:

Preceptor Table | |
---|---|
ID | Outcome: Height (cm) |
Robert | 178 |
Andy | ? |
Beau | 172 |
Ishan | 173 |
Nicholas | 165 |

In this case, we have a row for each brother and a column for their height. An individual *unit* is a brother. An *outcome* is that brother’s height. We will always have an ID column in Preceptor Tables so that we can identify different units. It is always furthest to the left. In addition to the ID column, we call the column with the brothers’ heights the outcome column.

To calculate the average height, we need to know Andy’s height – our missing data. Keep in mind that there is a *truth* out there, a state of the world independent of our knowledge of it. Andy is a specific height. If we had a complete Preceptor Table, with no missing values, we could calculate the average height of the brothers exactly. No fancy statistics would be needed, just arithmetic.
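As a minimal sketch of this idea in Python, a Preceptor Table can be stored with `None` standing in for each question mark. The dictionary layout and the `average_height` helper are illustrative assumptions, not something the *Primer* defines. The average is plain arithmetic, but only once no entry is missing.

```python
# Preceptor Table for the five brothers; None marks the missing height.
heights_cm = {
    "Robert": 178,
    "Andy": None,
    "Beau": 172,
    "Ishan": 173,
    "Nicholas": 165,
}


def average_height(table):
    """Return the mean height, or None if any entry is still missing."""
    values = list(table.values())
    if any(v is None for v in values):
        return None  # the truth exists, but we cannot compute it yet
    return sum(values) / len(values)


print(average_height(heights_cm))  # None, because Andy's height is unknown
```

Returning `None` rather than averaging the four known heights is a deliberate choice in this sketch: it keeps "what we know" separate from "what we would have to estimate."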

### 4.1.1 Harvard Height

Consider a more complex problem. We have the heights of 100 Harvard students, and from that we want to know the average height of all students in the school.

Preceptor Table | |
---|---|
ID | Outcome: Height (cm) |
Student 1 | ? |
Student 2 | ? |
... | ... |
Student 473 | 172 |
Student 474 | ? |
... | ... |
Student 3,258 | ? |
Student 3,259 | 162 |
... | ... |
Student 6,700 | ? |

Are these 100 students randomly sampled? Could we estimate the 90th percentile of height in the student population? These questions are more complicated than the brothers example, and we might be less confident in our best guess. Note, also, that we will sometimes not know exactly how many rows there are in the Preceptor Table. In that case, we would use “Student N” as the last ID, where N is the total number of students at Harvard.

### 4.1.2 Population Table

The Population Table is distinct from the Preceptor Table. Its aim is to illustrate the broader population we are interested in, while including the rows from our Preceptor Table and our dataset. The table has three sources of rows: the units we want to know about (the Preceptor Table), the units for which we actually have data (our actual data), and the remaining units in the population, which appear in neither the data nor the Preceptor Table.

Our Preceptor Table rows contain the information that we would want to know in order to answer our questions. These rows contain entries for our covariates (age and year) but they do not contain any outcome results (height). We are trying to answer questions about the height of Harvard students in 2021, so the age entries in these rows will fall somewhere between 15 and 27 and the year entries will read “2021.”

Our actual data rows contain the information that we do know. These rows contain entries for both our covariates and the outcomes. In this case, the actual data comes from Harvard students in 2015, so the age entries will record each student's age at that time and the year entries will read “2015.”

Our population rows contain no data. These are units which fall within our desired population, but for which we have no data. As such, every entry in these rows other than the year is missing.

Population Table | | | |
---|---|---|---|
Source | Year | Age | Height |
Population | 2010 | ? | ? |
... | ... | ... | ... |
Data | 2015 | 18 | 180 |
Data | 2015 | 23 | 163 |
... | ... | ... | ... |
Population | 2018 | ? | ? |
... | ... | ... | ... |
Preceptor Table | 2021 | 19 | ? |
... | ... | ... | ... |
Population | 2025 | ? | ? |

Implicit in the Preceptor Table is a notion of time. Our data come from 2015, but the Preceptor Table refers to 2021, so the population we consider must span a period that covers both.

As such, rows from our larger population may fall anywhere between 2010 and 2025. This is a ballpark range. Height is relatively stable over time, so widening the timeframe is easy to justify here. In other problems, the appropriate time span will depend on the circumstances.

## 4.2 Causal effect

The Rubin Causal Model (RCM) is based on the idea of **potential outcomes.** For example, Enos (2014) measured attitudes toward immigration among Boston commuters. Individuals were exposed to one of two possible conditions, and then their attitudes towards immigrants were recorded. One condition was waiting on a train platform near individuals speaking Spanish. The other was being on a train platform without Spanish-speakers. To calculate the causal effect of having Spanish-speakers nearby, we need to compare the outcome for an individual in one possible state of the world (with Spanish-speakers) to the outcome for that same individual in another state of the world (without Spanish-speakers). However, it is impossible to observe both potential outcomes at once. One of the potential outcomes is always missing, since a unit cannot travel back in time, and experience both treatments. This dilemma is the *Fundamental Problem of Causal Inference*.

In most circumstances, we are interested in comparing two experimental manipulations, one generally termed “treatment” and the other “control.” The difference between the potential outcome under treatment and the potential outcome under control is a “causal effect” or a “treatment effect.” According to the RCM, the **causal effect** of being on the platform with Spanish-speakers is the *difference* between what your attitude would have been under “treatment” (with Spanish-speakers) and under “control” (no Spanish-speakers).

The commuter survey consisted of three questions, each measuring agreement on a 1 to 5 integer scale, with 1 being liberal and 5 being conservative. For each person, the three answers were summed, generating an overall measure of attitude toward immigration which ranged from 3 (very liberal) to 15 (very conservative). If your attitude towards immigrants would have been a 13 with Spanish-speakers and a 9 without Spanish-speakers, then the causal effect of being on a platform with Spanish-speakers is a 4-point increase in your score.

We will use the symbol \(Y\) to represent potential outcomes, the variable we are interested in understanding and modeling. \(Y\) is called the *response* or *outcome* variable. It is the variable we want to “explain.” In our case this would be the attitude score. If we are trying to understand a causal effect, we need two symbols so that control and treated values can be represented separately: \(Y_t\) and \(Y_c\).

### 4.2.1 Potential outcomes

Suppose that Yao is one of the commuters surveyed in this experiment. If we were omniscient, we would know the outcomes for Yao under both treatment (with Spanish-speakers) and control (no Spanish-speakers), and we’d be able to ignore the Fundamental Problem of Causal Inference. We can show this using a Preceptor Table. Calculating the number we are interested in is trivial because none of the data is missing.

Preceptor Table | | |
---|---|---|
ID | Outcome: Attitude if Treated | Outcome: Attitude if Control |
Yao | 13 | 9 |

From this table we only know the causal effect on Yao. Everyone else in the study might have a lower (more liberal) or higher (more conservative) attitude score if treated. Regardless of what the causal effect is for other subjects, the causal effect for Yao of being on the train platform with Spanish-speakers is a shift towards a more conservative attitude.

Using the response variable — the actual symbol rather than a written description — makes for a more concise Preceptor Table.

Preceptor Table | | |
---|---|---|
ID | Outcome $$Y_t$$ | Outcome $$Y_c$$ |
Yao | 13 | 9 |

Recall that the “causal effect” is the difference between Yao’s potential outcome under treatment and his potential outcome under control.

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t$$ | Outcome $$Y_c$$ | Causal Effect $$Y_t - Y_c$$ |
Yao | 13 | 9 | +4 |

Remember that, in the real world, much of this data will be missing! We cannot use simple arithmetic to calculate the causal effect on Yao’s attitude toward immigration. Instead, we will be required to estimate it. An **estimand** is some unknown variable in the real world that we are trying to measure. In this case, it is \(Y_{t}-Y_{c}\), not \(+4\). An estimand is not the *value* you calculated, but is rather the *unknown variable* you want to estimate.

### 4.2.2 Causal and predictive models

Causal inference is often compared with prediction. In prediction, we want to know an outcome, \(Y(u)\). In causal inference, we want to know a function of *potential* outcomes, such as the treatment effect: \(Y_t(u) - Y_c(u)\).

These are both missing data problems. Prediction involves estimating an outcome variable that we don’t have, and thus is missing, whether because it is in the future or because it is from data that we are unable to collect. Thus, prediction is the term for using statistical inference to fill in missing data for *individual* outcomes. Causal inference, however, is the term for filling in missing data for *numerous* potential outcomes. In prediction, each unit has a single outcome, which we could observe in principle. In causal inference, each unit has several potential outcomes, and only one of them can *ever* be observed, even in principle.

Key point: In a predictive model, there is only one \(Y(u)\) value for each unit. This is very different to the RCM where there are (at least) two potential outcomes (treatment and control). There is only one outcome column in a predictive model, whereas there are two or more in a causal model.

With a predictive model, we cannot infer what would happen to the outcome \(Y(u)\) if we changed \(X\) *for a given unit*. We can only *compare* two units, one with one value of \(X\) and another with a different value of \(X\).

In a sense, all models are predictive. However, only a subset of models are causal, meaning that, for a given individual, you can change the value \(X\) and observe a change in outcome, \(Y(u)\), and from that calculate a causal effect.

### 4.2.3 No causation without manipulation

In order for a potential outcome to make sense, it must be possible, at least *a priori*. For example, if there is no way for Yao, under any circumstance, to ever be in the train study, then \(Y_{t}(u)\) is impossible for him. It can never happen. And if \(Y_{t}(u)\) can never be observed, even in theory, then the causal effect of treatment on Yao’s attitude is undefined.

The causal effect of the train study is well defined because it is the simple difference of two potential outcomes, both of which might happen. In this case, we (or something else) can manipulate the world, at least conceptually, so that it is possible that one thing or a different thing might happen.

This definition of causal effects becomes much more problematic if there is no way for one of the potential outcomes to happen, ever. For example, what is the causal effect of Yao’s height on his weight? It might seem we would just need to compare two potential outcomes: what would Yao’s weight be under the treatment (where treatment is defined as being 3 inches taller) and what would Yao’s weight be under the control (where control is defined as his current height).

A moment’s reflection highlights the problem: we can’t increase Yao’s height. There is no way to observe, even conceptually, what Yao’s weight would be if he were taller because there is no way to make him taller. We can’t manipulate Yao’s height, so it makes no sense to investigate the causal effect of height on weight. Hence the slogan: *No causation without manipulation.*

This then raises the question of what can and cannot be manipulated. If something cannot be manipulated, we should not consider it causal. So can race ever be considered causal? What about sex? A genetic condition like color-blindness? Can we manipulate these characteristics? In the modern world these questions are not simple.

Take color-blindness for example. Say we are interested in how color-blindness impacts ability to complete a jig-saw puzzle. Because color-blindness is genetic some might argue it cannot be manipulated. But advances in technology like gene-therapy might allow us to actually change someone’s genes. Could we then claim the ability to manipulate color-blindness? If yes, we could then measure the causal effect of color-blindness on ability to complete jig-saw puzzles.

The slogan of “No causation without manipulation” may at first seem straightforward, but it is clearly not so simple. Questions about race, sex, gender and genetics are very complex and should be considered with care.

### 4.2.4 Multiple units

Generally, a study has many individuals (or, more broadly, “units”) who each have their own potential outcomes. More notation is needed to allow us to differentiate between different units.

In other words, there needs to be a distinction between \(Y_t\) for Yao, and \(Y_t\) for Emma. We use the variable \(u\) (\(u\) for “unit”) to indicate that the outcome under control and the outcome under treatment can differ for each individual unit (person).

Instead of \(Y_t\), we will use \(Y_t(u)\) to represent “Attitude if Treated.” If you want to talk about only Emma, you could say “Emma’s Attitude if Treated” or “\(Y_t(u = Emma)\)” or “the \(Y_t(u)\) for Emma,” but not just \(Y_t\). That notation is too ambiguous when there is more than one subject.

Let’s look at a Preceptor Table with more subjects using our new notation:

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | 9 | +4 |
Emma | 14 | 11 | +3 |
Cassidy | 11 | 6 | +5 |
Tahmid | 9 | 12 | -3 |
Diego | 3 | 4 | -1 |

From this Preceptor Table, there are many possible estimands we might be interested in. Consider some examples, along with their true values:

- A potential outcome for one person, e.g., Yao’s potential outcome under treatment: \(13\).
- A causal effect for one person, such as for Emma. This is the difference between the potential outcomes: \(14 - 11 = +3\).
- The most positive causal effect: \(+5\), for Cassidy.
- The most negative causal effect: \(-3\), for Tahmid.
- The median causal effect: \(+3\).
- The median percentage change: \(+27.3\%\). To see this, calculate the percentage change for each person relative to their outcome under control. You’ll get 5 percentages: \(+44.4\%\), \(+27.3\%\), \(+83.3\%\), \(-25.0\%\), and \(-25.0\%\).
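The estimands in this list can be checked with a few lines of Python. This is only a sketch: the dictionary layout and variable names are illustrative assumptions, and the percentage change is taken relative to the outcome under control, which is the convention that reproduces the percentages listed above.

```python
from statistics import median

# Complete Preceptor Table: unit -> (Y_t, Y_c), copied from the table above.
potential_outcomes = {
    "Yao": (13, 9),
    "Emma": (14, 11),
    "Cassidy": (11, 6),
    "Tahmid": (9, 12),
    "Diego": (3, 4),
}

# Individual causal effects, Y_t(u) - Y_c(u).
effects = {u: yt - yc for u, (yt, yc) in potential_outcomes.items()}

# Percentage change relative to the outcome under control.
pct_change = {u: 100 * (yt - yc) / yc for u, (yt, yc) in potential_outcomes.items()}

most_positive = max(effects, key=effects.get)
most_negative = min(effects, key=effects.get)
median_effect = median(effects.values())
median_pct = median(pct_change.values())

print(most_positive, effects[most_positive])
print(most_negative, effects[most_negative])
print(median_effect)
print(round(median_pct, 1))
```

None of this requires statistics beyond arithmetic, because no entry in the table is missing.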

The same ideas apply to the Population Table, which, like the Preceptor Table, can answer several questions that we may be interested in.

Population Table | | | | | |
---|---|---|---|---|---|
Source | Year | ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Population | 2010 | ? | ? | ? | ? |
... | ... | ... | ... | ... | ... |
Data | 2015 | Yao | 5 | 3 | +2 |
Data | 2015 | Cassidy | -1 | 2 | -3 |
... | ... | ... | ... | ... | ... |
Population | 2018 | ? | ? | ? | ? |
... | ... | ... | ... | ... | ... |
Preceptor Table | 2021 | Yao | 13 | 9 | +4 |
... | ... | ... | ... | ... | ... |
Population | 2025 | ? | ? | ? | ? |

The Population Table gives us a much better picture of all our data, since everything is combined in one place. We can look at past data about Yao or Cassidy, including their previous outcomes and causal effects. We can also see the other units which fall within our desired population but about which we have no data, hence the question marks.

Consider these examples:

- Difference in a potential outcome for one person over time, e.g., the difference between Yao’s \(Y_t(u)\) values in 2015 and 2021: \(5 - 13 = -8\).
- Difference in causal effect for one person over time. For Yao it would be \((+2) - (+4) = -2\).

All of the variables calculated in the Preceptor and Population Tables are examples of estimands we might be interested in. One estimand is important enough that it has its own name: the **average treatment effect**, often abbreviated as **ATE**. The average treatment effect is the mean of all the individual causal effects. Here, the mean is \(+1.6\).
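Spelled out with the five causal effects from the complete Preceptor Table above, the calculation is:

\[\text{ATE} = \frac{(+4) + (+3) + (+5) + (-3) + (-1)}{5} = \frac{8}{5} = +1.6\]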

What does our real-world Preceptor Table look like?

Causal Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | ? | ? |
Emma | 14 | ? | ? |
Cassidy | ? | 6 | ? |
Tahmid | ? | 12 | ? |
Diego | 3 | ? | ? |

Predictive Preceptor Table | |
---|---|
ID | $$Y(u)$$ |
Yao | 13 |
Emma | 14 |
Cassidy | 6 |
Tahmid | 12 |
Diego | 3 |

Calculating values from this table is no longer a simple math problem. See this discussion from Harvard Professor Matt Blackwell:

## 4.3 Simple models

How can we fill in the question marks? Because of the *Fundamental Problem of Causal Inference*, we can never *know* the missing values. Because we can never know the missing values, we must make assumptions. “Assumption” just means that we need a “model,” and all models have parameters.

### 4.3.1 A single value for tau

One model might be that the causal effect is the same for everyone. There is a single parameter, \(\tau\), which we then estimate. (\(\tau\) is a Greek letter, written as “tau” and rhyming with “cow.”) Once we have an estimate, we can fill in the Preceptor Table because, knowing it, we can estimate what the unobserved potential outcome is for each person. We use our assumption about \(\tau\) to estimate the counterfactual outcome for each unit.

Remember what our Preceptor Table looks like with all of the missing data:

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | ? | ? |
Emma | 14 | ? | ? |
Cassidy | ? | 6 | ? |
Tahmid | ? | 12 | ? |
Diego | 3 | ? | ? |

If we assume \(\tau\) is the treatment effect for everyone, how do we fill in the table? We are using \(\tau\) as an estimate for the causal effect. By definition: \(Y_t(u) - Y_c(u) = \tau\). Using simple algebra, it is then clear that \(Y_t(u) = Y_c(u) + \tau\) and \(Y_c(u) = Y_t(u) - \tau\). In other words, you could add it to the observed value of every observation in the control group (or subtract it from the observed value of every observation in the treatment group), and thus fill in all the missing values.

Assuming there is a constant treatment effect, \(\tau\), for everyone, filling in the missing values would look like this:

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | $$13 - \tau$$ | $$\tau$$ |
Emma | 14 | $$14 - \tau$$ | $$\tau$$ |
Cassidy | $$6 + \tau$$ | 6 | $$\tau$$ |
Tahmid | $$12 + \tau$$ | 12 | $$\tau$$ |
Diego | 3 | $$3 - \tau$$ | $$\tau$$ |

Now we need to find an estimate for \(\tau\) in order to fill in the missing values. One approach is to subtract the average of the observed control values from the average of the observed treated values: \[\frac{13 + 14 + 3}{3} - \frac{6 + 12}{2} = 10 - 9 = +1\]

Or, in other words, we use this formula:

\[\frac{\Sigma Y_t(u)}{n_t} - \frac{\Sigma Y_c(u)}{n_c} = \widehat{ATE}\]

\(\Sigma\) represents the sum of the observed treated or control values, and \(n_t\) and \(n_c\) represent the number of values in the treated and control groups. This formula is for something called \(\widehat{ATE}\), which we will discuss in more depth in a later section.

Continuing with the example, calculating \(\widehat{ATE}\) gives us an estimate of \(+1\) for \(\tau\). Let’s fill in our missing values by adding \(\tau\) to the observed values under control and by subtracting \(\tau\) from the observed values under treatment, like so:

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | $$13 - (+1)$$ | +1 |
Emma | 14 | $$14 - (+1)$$ | +1 |
Cassidy | $$6 + (+1)$$ | 6 | +1 |
Tahmid | $$12 + (+1)$$ | 12 | +1 |
Diego | 3 | $$3 - (+1)$$ | +1 |

Which gives us:

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | 12 | +1 |
Emma | 14 | 13 | +1 |
Cassidy | 7 | 6 | +1 |
Tahmid | 13 | 12 | +1 |
Diego | 3 | 2 | +1 |

If we make the assumption that there is a single value for \(\tau\) *and* that \(1\) is a good estimate of that value, then we can determine the missing potential outcomes. The Preceptor Table no longer has any missing values, so we can use it to easily answer (almost) any conceivable question.
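The whole procedure, estimating \(\tau\) by a difference in means and then imputing each missing potential outcome, can be sketched in Python. The function names `difference_in_means` and `fill_in`, and the use of `None` for a question mark, are illustrative assumptions rather than anything the *Primer* prescribes.

```python
# Observed data: unit -> (Y_t, Y_c); None marks the unobserved potential outcome.
observed = {
    "Yao": (13, None),
    "Emma": (14, None),
    "Cassidy": (None, 6),
    "Tahmid": (None, 12),
    "Diego": (3, None),
}


def difference_in_means(table):
    """Mean of observed treated outcomes minus mean of observed control outcomes."""
    treated = [yt for yt, _ in table.values() if yt is not None]
    control = [yc for _, yc in table.values() if yc is not None]
    return sum(treated) / len(treated) - sum(control) / len(control)


def fill_in(table, tau):
    """Impute each missing potential outcome under a constant effect tau."""
    filled = {}
    for unit, (yt, yc) in table.items():
        if yc is None:
            yc = yt - tau  # treated unit: Y_c = Y_t - tau
        elif yt is None:
            yt = yc + tau  # control unit: Y_t = Y_c + tau
        filled[unit] = (yt, yc)
    return filled


tau_hat = difference_in_means(observed)
completed = fill_in(observed, tau_hat)
print(tau_hat)
print(completed)
```

The imputed values are only as good as the two assumptions behind them: that the effect really is constant, and that the estimate of it is close to the truth.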

### 4.3.2 Two values for tau

A second model might assume that the causal effect is different between levels of a category but the same within those levels. For example, perhaps there is a \(\tau_F\) for females and \(\tau_M\) for males where \(\tau_F \neq \tau_M\). We are making this assumption to give us a different model with which we can fill in the missing values in our Preceptor Table. We can’t make any progress unless we make some assumptions. That is an inescapable result of the *Fundamental Problem of Causal Inference*.

Consider a model in which causal effects differ based on sex. When we are looking at a “category” of units — for instance, gender — we call this a *covariate*. A covariate is an independent variable that can influence the outcome of a given statistical trial, but which is not of direct interest. Possible covariates include, but are not limited to, sex, age, and political party.

Preceptor Table | | | | |
---|---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Covariate: gender | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | $$13 - \tau_M$$ | Male | $$\tau_M$$ |
Emma | 14 | $$14 - \tau_F$$ | Female | $$\tau_F$$ |
Cassidy | $$6 + \tau_F$$ | 6 | Female | $$\tau_F$$ |
Tahmid | $$12 + \tau_M$$ | 12 | Male | $$\tau_M$$ |
Diego | 3 | $$3 - \tau_M$$ | Male | $$\tau_M$$ |

We would have two different estimates for \(\tau\).

\(\tau_M\) would be \[\frac{13 + 3}{2} - 12 = -4\] and \(\tau_F\) would be \[14 - 6 = +8\]

Using those values, we would fill out our new table like this:

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | $$13 - (-4)$$ | -4 |
Emma | 14 | $$14 - (+8)$$ | +8 |
Cassidy | $$6 + (+8)$$ | 6 | +8 |
Tahmid | $$12 + (-4)$$ | 12 | -4 |
Diego | 3 | $$3 - (-4)$$ | -4 |

Which gives us:

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | 17 | -4 |
Emma | 14 | 6 | +8 |
Cassidy | 14 | 6 | +8 |
Tahmid | 8 | 12 | -4 |
Diego | 3 | 7 | -4 |

We now have two different estimates for Emma (and for everyone else in the table). When we estimate \(Y_c(Emma)\) using an assumption of constant treatment effect (a single value for \(\tau\)), we get \(Y_c(Emma) = 13\). When we estimate assuming treatment effect is constant for each sex, we calculate that \(Y_c(Emma) = 6\). This difference between our estimates for Emma highlights the difficulties of inference. Models drive inference. Different models will produce different inferences.
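The two-parameter model can be sketched the same way, estimating one \(\tau\) per group and imputing with the group's own value. The helper `tau_by_group` and the tuple layout are assumptions made for illustration. The sketch also assumes every group has at least one treated and one control unit, which holds here but need not hold in general.

```python
# Observed data with a covariate: unit -> (Y_t, Y_c, gender); None is unobserved.
observed = {
    "Yao": (13, None, "Male"),
    "Emma": (14, None, "Female"),
    "Cassidy": (None, 6, "Female"),
    "Tahmid": (None, 12, "Male"),
    "Diego": (3, None, "Male"),
}


def mean(values):
    return sum(values) / len(values)


def tau_by_group(table):
    """Difference in observed means, computed separately within each group."""
    taus = {}
    for group in {g for _, _, g in table.values()}:
        treated = [yt for yt, _, g in table.values() if g == group and yt is not None]
        control = [yc for _, yc, g in table.values() if g == group and yc is not None]
        taus[group] = mean(treated) - mean(control)
    return taus


taus = tau_by_group(observed)

# Impute the missing potential outcome for each unit using its group's tau.
completed = {}
for unit, (yt, yc, group) in observed.items():
    tau = taus[group]
    completed[unit] = (
        yt if yt is not None else yc + tau,
        yc if yc is not None else yt - tau,
    )

print(taus)
print(completed["Emma"])
```

Running this next to a constant-\(\tau\) version of the same calculation makes the point of this section concrete: the data are identical, and only the model changes the imputed value for Emma.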

### 4.3.3 Heterogeneous treatment effects

Is the assumption of a constant treatment effect, \(\tau\), usually true? No! It is never true. People vary. The effect of a pill on you will always be different from the effect of a pill on your friend, at least if we measure outcomes accurately enough. Treatment effects are always *heterogeneous*, meaning that they vary across individuals.

Reality looks like this:

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | $$13 - \tau_{yao}$$ | $$\tau_{yao}$$ |
Emma | 14 | $$14 - \tau_{emma}$$ | $$\tau_{emma}$$ |
Cassidy | $$6 + \tau_{cassidy}$$ | 6 | $$\tau_{cassidy}$$ |
Tahmid | $$12 + \tau_{tahmid}$$ | 12 | $$\tau_{tahmid}$$ |
Diego | 3 | $$3 - \tau_{diego}$$ | $$\tau_{diego}$$ |

Can we solve for \(\tau_{yao}\)? No! That is the *Fundamental Problem of Causal Inference*. So how can we make any progress from here if we are unwilling to assume there is at least some structure to the causal effect across different individuals? Instead of worrying about the causal effect for specific individuals, we focus on the causal effect for the entire population.

### 4.3.4 Average treatment effect

The average treatment effect (ATE) is the **average**, taken over all units, of the difference between each unit’s *potential* outcome under treatment and its *potential* outcome under control. Because averaging is a linear operator, the average difference is the same as the difference between the averages. The distinction between this estimand and estimands like \(\tau\), \(\tau_M\) and \(\tau_F\), is that, in this case, we do not care about using the average treatment effect to fill in missing values in each row. The average treatment effect is useful because we don’t have to assume anything about each individual’s \(\tau\), like \(\tau_{yao}\), but can still understand something about the average causal effect across the whole population.

As we did before, the simplest way to estimate the ATE is to take the mean of the treated group (\(10\)) and the mean of the control group (\(9\)) and then take the difference in those means (\(1\)). If we use this method to obtain an estimate of the ATE, we’ll call it \(\widehat{ATE}\), pronounced “ATE-hat.”

If we already did this exact same calculation above, why are we talking about it again? Remember that we are unwilling to assume treatment effect is constant in our study population, and we cannot solve for \(\tau\) if \(\tau\) is different for different individuals. This is where \(\widehat{ATE}\) is helpful.

*Some* estimands may not require filling in all the question marks in the Preceptor Table. We can get a good estimate of the *average* treatment effect without filling in every question mark — the average treatment effect is just a single number. Rarely in a study do we care about what happens to individuals. In our case, we don’t care about what specifically would happen to Cassidy’s attitude if treated. Instead, we care generally about how our experiment impacts people’s attitudes towards immigrants. This is why an average estimate, like \(\widehat{ATE}\) can be helpful.

As we noted before, this is a popular estimand. Why?

1. There’s an obvious *estimator* for this estimand: the difference between the mean of the *observed* outcomes in the treated group and the mean of the *observed* outcomes in the control group, \(\frac{\Sigma Y_t(u)}{n_t} - \frac{\Sigma Y_c(u)}{n_c}\).
2. If treatment is *randomly assigned*, the estimator is *unbiased*: you can be fairly confident in the estimate if you have a large enough treatment and control group.
3. As we did earlier, if you are willing to assume that the causal effect is the same for everyone (a big assumption!), you can use your estimate of the ATE, \(\widehat{ATE}\), to fill in the missing individual values in your Preceptor Table.

Just because the ATE is often a useful estimand doesn’t mean that it *always* is.

Consider point #3. For example, let’s say the treatment effect does vary depending on sex. For males there is a small negative effect (-4), but for females there is a larger positive effect (+8). However, the average treatment effect for the whole sample, even if you estimate it correctly, will be a single positive number (+1), since the positive effect for females is larger than the negative effect for males. Using that single number to fill in every row would misstate the effect for both groups.

Estimating the average treatment effect, by calculating \(\widehat{ATE}\), is easy. But is our \(\widehat{ATE}\) a good estimate of the actual ATE? After all, if we knew all the missing values in the Preceptor Table, we could calculate the ATE perfectly. But those missing values may be wildly different from the observed values. Consider this Preceptor Table:

Preceptor Table | | | |
---|---|---|---|
ID | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Yao | 13 | 10 | +3 |
Emma | 14 | 11 | +3 |
Cassidy | 9 | 6 | +3 |
Tahmid | 15 | 12 | +3 |
Diego | 3 | 0 | +3 |

In this example, there is indeed a constant treatment effect for everyone: \(+3\). Note that the *observed* values are the same as before, but the unobserved values are such that our estimated ATE, \(+1\), is pretty far from the actual ATE, \(+3\). If we think we have a reasonable estimate of ATE, using that value as a constant for \(\tau\) might be our *best guess*. For more discussion, see more from Matt Blackwell:
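A short Python sketch can show how a single estimate can miss while the estimator is still unbiased. It takes the complete Preceptor Table above as the truth and assumes a design in which exactly three of the five commuters are treated, with every such assignment equally likely. That design is an assumption made for illustration, not something stated in this chapter, and the function name `ate_hat` is likewise hypothetical.

```python
from itertools import combinations

# Complete Preceptor Table from above: unit -> (Y_t, Y_c), constant effect of +3.
truth = {
    "Yao": (13, 10),
    "Emma": (14, 11),
    "Cassidy": (9, 6),
    "Tahmid": (15, 12),
    "Diego": (3, 0),
}


def ate_hat(treated_units):
    """Difference in observed means for one particular treatment assignment."""
    treated = [truth[u][0] for u in treated_units]
    control = [truth[u][1] for u in truth if u not in treated_units]
    return sum(treated) / len(treated) - sum(control) / len(control)


true_ate = sum(yt - yc for yt, yc in truth.values()) / len(truth)

# The assignment actually observed in this chapter.
observed_estimate = ate_hat(("Yao", "Emma", "Diego"))

# Every way of treating exactly three of the five units (an assumed design).
estimates = [ate_hat(assignment) for assignment in combinations(truth, 3)]
average_estimate = sum(estimates) / len(estimates)

print(true_ate, observed_estimate)
print(min(estimates), max(estimates), average_estimate)
```

Individual assignments give estimates that vary widely around the truth, but their average across all ten possible assignments matches the true ATE up to floating-point rounding. That is what unbiasedness means, and it is also why a single small experiment can still be far off.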

## 4.4 Assumptions

In this section, we will explore three topics: validity, stability, and representativeness. To start, we will look at a new Population Table.

### 4.4.1 Population Table

Our earlier Population Table familiarized us with the three sources of data for which we are making inferences: our data, the greater population, and the Preceptor Table.

The rows from our data have all information we desire. The rows from the Preceptor Table include covariates, but not outcomes. The rows from our greater population include no data, as we know nothing about these units.

Population Table | | | | | |
---|---|---|---|---|---|
Source | Outcome $$Y_t(u)$$ | Outcome $$Y_c(u)$$ | Year | Covariate: Sex | Causal Effect $$Y_t(u) - Y_c(u)$$ |
Population | ? | ? | 2012 | ? | ? |
Population | ? | ? | 2012 | ? | ? |
... | ... | ... | ... | ... | ... |
Data | 13 | ? | 2014 | Male | ? |
Data | ? | 9 | 2014 | Female | ? |
... | ... | ... | ... | ... | ... |
Preceptor Table | ? | ? | 2021 | Female | ? |
Preceptor Table | ? | ? | 2021 | Female | ? |
... | ... | ... | ... | ... | ... |
Population | ? | ? | 2023 | ? | ? |
Population | ? | ? | 2023 | ? | ? |

### 4.4.2 Validity

To understand validity with regard to the Population Table, we must first recognize an inherent flaw in any experiment design: *no two units receive exactly the same treatment*.

If this doesn’t ring true, consider our Spanish speaking train experiment. The units on the Spanish-speaking platform received the same treatment, right? No, actually!

Consider different volume levels, measured in decibels (dB), at which Spanish is being spoken.

ID | $$Y_{\text{58 dB}}(u)$$ | $$Y_{\text{59 dB}}(u)$$ | $$Y_{\text{60 dB}}(u)$$ | $$Y_{\text{61 dB}}(u)$$ | $$Y_c(u)$$ |
---|---|---|---|---|---|
Yao | 13 | ? | ? | ? | ? |
Emma | 11 | ? | ? | ? | ? |
Cassidy | ? | ? | ? | ? | 10 |
Tahmid | ? | ? | ? | ? | 12 |
Diego | 6 | ? | ? | ? | ? |

As we can see, certain units heard the Spanish-speakers at higher volumes than other units. And it is entirely possible that the volume of the speaking affects the outcome. There is also the issue of the time spent on the platform. Maybe Yao tends to run late, and only hears the Spanish-speakers for thirty seconds. Emma, on the other hand, plans ahead. She hears the Spanish-speakers for fifteen minutes before the train arrives! Thus, despite the fact that Emma and Yao are in the same treatment — that is, hearing Spanish on the platform — they had very different versions of that treatment.

Indeed, there are an infinite number of possible treatments. This is why it is crucial to define one’s estimand precisely: if we are interested in the difference in potential outcomes between Spanish being spoken for 10 minutes at 60 dB versus control, we can ignore all the other possible columns in the Population Table.

### 4.4.3 Stability

Stability means that the relationship between the columns is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.

In our height example, it is much easier to assume stability over a greater period of time. Changes in global height occur extremely slowly, so height being stable across a span of 20 years is reasonable to assume. Can we say the same for this example, where we are looking at attitudes on immigration?

With something like political ideology, it is much harder to assert that relationships found in data collected in 2010 would still hold in 2025. Our data, for instance, was collected in 2014. We want to make predictions for 2021. And, frankly, it may be difficult to argue that our results would be the same if we re-ran the experiment.

When we are confronted with this uncertainty, we can consider making our timeframe smaller. However, we would still need to assume stability from 2014 (time of data collection) to today. Stability allows us to ignore the issue of time.

Alternatively, if we believe that it is unlikely that our columns are stable, we have two choices. First, we can abandon the experiment. If we believe our data is useless, so is our experiment. Second, we can choose to provide a sort of warning message with our conclusions: *this is based on data from ten years ago, but that was the most recent data available to us.*

When forming our future models, we will ask ourselves:

- Are the units from X years ago likely to have similar outcomes in future year Y?
- Has the world changed drastically from year Z, when the data was collected?

When we decide that Yao from today and Yao from a week in the future are *stable*, we can collapse the two units into one unit. This is an assumption we make when forming our Population Table.

### 4.4.4 Representativeness

The external validity of a study is often directly related to the representativeness of our sample. Representativeness has to do with how well our sample represents the larger population we are interested in generalizing to.

Does the train experiment allow us to calculate a causal effect for people who commute by car? Can we calculate the causal effect for people in New York City? Before we generalize to broader populations, we have to consider whether our experimental estimates are applicable beyond our experiment. Maybe we think that commuters in Boston and New York are similar enough to generalize our findings. We could also conclude that people who commute by car are fundamentally different from people who commute by train. If that were true, then we could not say our estimate holds for all commuters, because our sample does not accurately represent the broader group we want to generalize to.

Generally: *if there was no chance that a certain type of person would have been in this experiment, we cannot make an assumption for that person*.

Representativeness also concerns whether the sample is biased. Let’s look at it through the lens of missing data mechanisms. If we have a Preceptor Table with no missing values, we have it easy. We just calculate the answer. Sadly, Preceptor Tables are (almost) always missing values, resulting in “missing data.” The process by which some data is missing is known as the “missing data mechanism.” There are two main missing data mechanisms of interest: the *assignment mechanism* and the *sampling mechanism*. The assignment mechanism is the process by which some units receive (or are “assigned”) the treatment and some units receive the control. The sampling mechanism is the process by which some units appear in our data and some do not. The sampling mechanism is discussed in this section, and the assignment mechanism in the next.

You cannot just select healthy people for a drug trial, as that might lead to an incorrect conclusion about whether the drug is valuable. For example, the study might conclude that the participants got healthier because the drug was valuable. In reality, maybe they only got better because they were already healthy to begin with. Or perhaps the drug actually is valuable, but the study might be dismissed because it only included healthy people. There isn’t a simple right or wrong in representativeness; we need to discuss and consider all sides when conducting our experiment.

A way to improve representativeness, at least with regard to a biased sample, is to use a good **sampling** mechanism. If we are trying to estimate the average attitude towards immigrants in the US, we usually do so by taking a sample. The process by which people enter our sample is the sampling mechanism. If that process is related to their attitude, even indirectly, then estimates from our sample will not be good estimates for the population.
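The effect of a sampling mechanism that is related to the outcome can be sketched with a small simulation. Everything here is hypothetical: the population of 10,000 people, the made-up 1 to 15 attitude score, and the rule that people with higher scores are more likely to enter the sample. It is a minimal illustration, not an analysis of real data.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 attitude scores on a made-up 1-15 scale.
population = [rng.randint(1, 15) for _ in range(10_000)]

# Mechanism A: every person is equally likely to enter the sample,
# so entering the sample is unrelated to attitude.
unrelated_sample = rng.sample(population, k=500)

# Mechanism B (assumed for illustration): the chance of entering the sample
# rises with the score itself, so sampling is related to the outcome.
related_sample = rng.choices(population, weights=population, k=500)

pop_mean = mean(population)
unrelated_mean = mean(unrelated_sample)
related_mean = mean(related_sample)

print(f"population mean:              {pop_mean:.2f}")
print(f"sample mean, unrelated draw:  {unrelated_mean:.2f}")
print(f"sample mean, related draw:    {related_mean:.2f}")
```

Under these assumptions, the first sample mean lands close to the population mean, while the second is pulled upward because high scorers are over-represented.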

### 4.4.5 Unconfoundedness

A fourth assumption we use when working with causal models (but not with predictive models) is **“unconfoundedness.”** If whether a unit receives treatment or control is random, we say that treatment assignment is not “confounded.” If, however, treatment assignment depends on the value of a potential outcome, then treatment assignment is confounded. Our lives are easiest if we can (reasonably!) assume unconfoundedness. In that case, we can estimate the average treatment effect by subtracting the average outcome of the control units from the average outcome of the treated units, as we do above.
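Written out, the difference-in-means estimate looks like the sketch below. The symbols \(W_u\) (equal to 1 if unit \(u\) is treated and 0 if it is in control), \(n_t\) (the number of treated units) and \(n_c\) (the number of control units) are notation assumed here for illustration; they are not used elsewhere in this chapter.

$$
\widehat{\text{ATE}} \;=\; \frac{1}{n_t} \sum_{u \,:\, W_u = 1} Y_t(u) \;-\; \frac{1}{n_c} \sum_{u \,:\, W_u = 0} Y_c(u)
$$

Each sum uses only the potential outcome we actually observe for that unit. The estimate is trustworthy only when \(W_u\) does not depend on \(Y_t(u)\) or \(Y_c(u)\).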

Consider the “Perfect Doctor” as an example of the problems caused by confounded treatment assignments. Imagine we have an omniscient doctor who knows how any patient will respond to a certain drug. She has perfect knowledge of the entire Preceptor Table. Using this information, she always assigns each patient the option with the best outcome, whether that is treatment or control. Consider:

Holy Grail of Information

| ID | Blood Pressure Outcome $$Y_t(u)$$ | Blood Pressure Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
|---|---|---|---|
| Yao | 130 | 105 | +25 |
| Emma | 120 | 140 | -20 |
| Cassidy | 100 | 170 | -70 |
| Tahmid | 115 | 125 | -10 |
| Diego | 135 | 100 | +35 |
| MEAN | 120 | 128 | -8 |

The Perfect Doctor would assign the treatment to Emma, Cassidy and Tahmid. She would assign control to Yao and Diego. And that is good! This is what the doctor should do. This is the best treatment assignment for the patients. But **it is not a good assignment mechanism for estimating the average causal effect because treatment assignment is confounded by the values of the potential outcomes.**

We, the non-Perfect Doctors, do not have access to the entire Preceptor Table. We can only see this:

Skewed Holy Grail of Information

| ID | Blood Pressure Outcome $$Y_t(u)$$ | Blood Pressure Outcome $$Y_c(u)$$ | Causal Effect $$Y_t(u) - Y_c(u)$$ |
|---|---|---|---|
| Yao | ? | 105 | ? |
| Emma | 120 | ? | ? |
| Cassidy | 100 | ? | ? |
| Tahmid | 115 | ? | ? |
| Diego | ? | 100 | ? |
| MEAN | 111.67 | 102.5 | 9.17 |

The true causal effect of the treatment, as we can see in the first table, is -8. In other words, the treatment lowers blood pressure on average. But if the Perfect Doctor performs the treatment assignment and we mistakenly assume random assignment, then, using just the data we have access to, we would conclude that the causal effect is positive, that treatment increases blood pressure.
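The arithmetic behind the two tables can be checked with a short sketch. The potential outcomes are copied from the first table; the variable names and the rule "treat whenever treatment gives the lower blood pressure" are our own way of encoding the Perfect Doctor for illustration.

```python
from statistics import mean

# Potential outcomes (blood pressure) from the first table: (Y_t, Y_c).
potential_outcomes = {
    "Yao": (130, 105),
    "Emma": (120, 140),
    "Cassidy": (100, 170),
    "Tahmid": (115, 125),
    "Diego": (135, 100),
}

# True ATE: average of Y_t - Y_c over all units. Only someone who sees
# the whole table can compute this.
true_ate = mean(y_t - y_c for y_t, y_c in potential_outcomes.values())

# Perfect Doctor: give treatment only when it yields lower blood pressure.
treated_observed = [y_t for y_t, y_c in potential_outcomes.values() if y_t < y_c]
control_observed = [y_c for y_t, y_c in potential_outcomes.values() if y_t >= y_c]

# Naive estimate: difference in observed means, as if assignment were random.
naive_estimate = mean(treated_observed) - mean(control_observed)

print(f"true ATE:       {true_ate:.2f}")
print(f"naive estimate: {naive_estimate:.2f}")
```

The true ATE is negative while the naive estimate is positive. The sign flips because the doctor's choice depends on the potential outcomes themselves.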

Confounding can also be seen if:

- There’s a pandemic and you only give healthy people drugs, and see a significant improvement in health
  - This may not be because of the drug, but because these people were already healthy to begin with.
- You’re handing out free breadsticks to Harvard students who got a 97/100 on their test, and you see that they cry more
  - They’re not sobbing because they hate breadsticks, they’re sobbing because they think they’ve brought dishonor to their family

Now that we’ve established that confounding sucks, let’s discuss how we can prevent it. All we have to do is ensure that covariates and potential outcomes don’t affect the assignment mechanism. Sounds easy!

The only way we actually care about is randomization, where you randomly assign units to control and treatment groups. Flip a coin, draw sticks, whatever. As long as you use randomization as your assignment mechanism, you’re good. Sometimes you can’t use pure randomization for ethical or practical reasons, and you are forced to use a non-random assignment mechanism. Many statistical methods have been developed for causal inference when there is a non-random assignment mechanism. Those methods, however, are beyond the scope of this book. So, we can pretend they don’t exist, and just say “try your best to randomize everything!”
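To see why randomization helps, here is a minimal sketch that reuses the Perfect Doctor potential outcomes. It assumes one particular randomized design, chosen for illustration: exactly three of the five patients are treated, and every such split is equally likely. It lists all ten splits and averages the difference-in-means estimates.

```python
from itertools import combinations
from statistics import mean

# Potential outcomes from the Perfect Doctor table.
y_t = {"Yao": 130, "Emma": 120, "Cassidy": 100, "Tahmid": 115, "Diego": 135}
y_c = {"Yao": 105, "Emma": 140, "Cassidy": 170, "Tahmid": 125, "Diego": 100}
units = list(y_t)

true_ate = mean(y_t[u] - y_c[u] for u in units)

# Every way of choosing 3 treated units out of 5 (assumed design).
estimates = []
for treated in combinations(units, 3):
    control = [u for u in units if u not in treated]
    # We observe Y_t for treated units and Y_c for control units only.
    estimate = mean(y_t[u] for u in treated) - mean(y_c[u] for u in control)
    estimates.append(estimate)

average_estimate = mean(estimates)

print(f"number of possible assignments: {len(estimates)}")
print(f"true ATE:                       {true_ate:.2f}")
print(f"average estimate:               {average_estimate:.2f}")
```

Averaged over all equally likely assignments, the estimate matches the true ATE (up to floating-point rounding). Any single assignment can still miss badly: the Perfect Doctor's own split is one of the ten. Randomization guarantees that the errors do not lean systematically in one direction.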

## 4.5 Summary

The fundamental components of every problem in causal inference are units, treatments and outcomes. The units are the rows in the tibble. The treatments are the columns. The outcomes are the values. Whenever you confront a problem in causal inference, start by identifying the units, treatments and outcomes.

A causal effect is the difference between one potential outcome and another. How different would your life be if you missed the train?

A Preceptor Table includes all the data that we would like to have to solve our problem. The ideal Preceptor Table has no missing data: we know what the outcome would have been for unit \(u\) under both treatment and control. With the ideal Preceptor Table it is easy to calculate, using only algebra, any quantity of interest.

The Population Table has three sources of data: the Preceptor Table, the dataset, and the greater population. For rows from the Preceptor Table, we only know covariates (no outcomes). Rows from the dataset include both covariates and outcomes. Rows from the greater population include no data (since we have not observed these units).

The causal effect of a treatment on a single unit at a point in time is the difference between the value of the outcome variable with the treatment and without the treatment. We call these “potential outcomes” because, at most, we can only observe one of them. The *Fundamental Problem of Causal Inference* is that it is impossible to observe the causal effect on a single unit. We must make assumptions (i.e., we must make models) in order to estimate causal effects.

Random assignment of treatments to units is the best way to estimate causal effects. Other assignment mechanisms are subject to confounding. If the treatment assigned is correlated with the potential outcomes, it is very hard to estimate the true treatment effect. (As always, we use the terms “causal effects” and “treatment effects” interchangeably.) With random assignment, we can, mostly safely, estimate the average treatment effect (ATE) by looking at the difference between the average outcomes of the treated and control units.

Be wary of claims made in situations without random assignment: Here be dragons!