1  Rubin Causal Model

Have you ever wondered what the world would be like without you?

George Bailey, the main character in “It’s a Wonderful Life,” believes that his existence has served no purpose. The movie follows George as he explores a world in which he was never born. It is clear that he had a profound impact on the lives of many people in his community. His actions mattered, more than he ever realized.

By showing what the world would have been like without George, we get an idea of the causal effect of his life on his town and the people who live there. This chapter explains causation using the framework of potential outcomes and the Rubin Causal Model (RCM).

1.1 Preceptor Table

We would not need data science if we (and our bosses, colleagues, and clients) did not have questions. Every data science project starts with a question. Examples:

What is the average height of a student at New College of Florida?

What are the chances that, out of the next four New College students we meet, one will be taller than 183 centimeters?

A Preceptor Table1 is the smallest possible table with rows and columns such that, if none of the data is missing, then the things we want to know are easy to calculate. Consider:

Preceptor Table
ID Outcome
Height (cm)
Student 1 190
Student 2 160
... ...
Student 47 172
Student 48 176
... ...
Student 325 180
Student 326 162
... ...
Student 670 185

If we had a table with the height of every New College student, then statistics like the average would be easy to calculate. It would also be straightforward, using simulation, to estimate the the chances of various scenarios, as in our second question.

Even the simplest Preceptor Table will have two columns. The first is an ID column which serves to label each unit. The second column is the outcome of interest, the variable we are trying to predict/understand/influence. The rows of the Preceptor Table are the units, the objects on which the outcome is measured.

To answer questions, we need data. Consider:

Data for New College Students
ID Outcome
Height (cm)
Student 11 160
Student 32 189
Student 47 172
Student 68 170
Student 425 175
Student 436 162
Student 570 188

The notion of time is important, both in our Preceptor Table and in our data. To what moment in time do the questions refer? At what moment in time was our data collected? Those two moments are almost always different. Our data comes from some point in the past, even if it was collected yesterday. Our questions usually refer to now or to an indeterminate moment in the future.

In order to use this data to answer our questions, we need to consider the concept of validity. Is the data we have valid for answering our questions? Height, as an outcome, seems to make this simple, but, even here, complications can arise. For example, was height measured with shoes on or off? Which measure do we want?

1.2 Population Table

We create the Population Table2 by combining the Preceptor Table and the data. The aim of the Population Table is to illustrate the broader population in which we are interested. This table has three sources of data: the data for units we want to have (the Preceptor Table), the data for units which we actually have (our data), and the data for units we do not care about (the rest of the population, not included in the data or the Preceptor Table).

We are trying to answer questions about the height of New College students in 2023, so the “year” entries for these rows will read “2023.” In this case, the actual data comes from a survey of New College students in 2015.

The “Source” column refers to the source of the rows in the Population Table. The rows with no “Source” are from the larger population from which, by assumption, both the Preceptor Table and the data are drawn. As such, all values, other than year, are missing.

Population Table
Source Year Height

2010

Data

2015

180

Data

2015

163

2018

Preceptor Table

2023

?

Preceptor Table

2023

?

2030

2030

The question marks indicate the data which we do not know but which we need in order to answer the questions.

Implicit in the Preceptor Table is a notion of time. Now that we can see our actual data compared with our greater population and our desired data, we must expand our observations. That is to say that, given our data is sourced from 2015 and our desired data is from 2023, we should include, by assumption, a greater time span in our population.

1.3 Causal effect

This study, which shows the impact of exposure to Spanish-speaking individuals on attitudes towards immigration, was conducted by Ryan Enos.

The Rubin Causal Model (RCM) is based on the idea of potential outcomes. For example, Enos (2014) measured attitudes toward immigration among Boston commuters. Individuals were exposed to one of two possible conditions, and then their attitudes towards immigrants were recorded. One condition was waiting on a train platform near individuals speaking Spanish. The other was being on a train platform without Spanish-speakers. To calculate the causal effect of having Spanish-speakers nearby, we need to compare the outcome for an individual in one possible state of the world (with Spanish-speakers) to the outcome for that same individual in another state of the world (without Spanish-speakers). However, it is impossible to observe both potential outcomes at once. One of the potential outcomes is always missing, since a unit cannot travel back in time, and experience both treatments. This dilemma is the Fundamental Problem of Causal Inference.

In most circumstances, we are interested in comparing two experimental manipulations, one generally termed “treatment” and the other “control.” The difference between the potential outcome under treatment and the potential outcome under control is a “causal effect” or a “treatment effect.” According to the RCM, the causal effect of being on the platform with Spanish-speakers is the difference between what your attitude would have been under “treatment” (with Spanish-speakers) and under “control” (no Spanish-speakers).

The commuter survey consisted of three questions, each measuring agreement on a 1 to 5 integer scale, with 1 being liberal and 5 being conservative. For each person, the three answers were summed, generating an overall measure of attitude toward immigration which ranged from 3 (very liberal) to 15 (very conservative). If your attitude towards immigrants would have been a 13 after being exposed to Spanish-speakers and a 9 with no such exposure, then the causal effect of being on a platform with Spanish-speakers is a 4-point increase in your score.

We will use the symbol \(Y\) to represent potential outcomes, the variable we are interested in understanding and modeling. \(Y\) is called the response or outcome variable. It is the variable we want to “explain.” In our case this would be the attitude score. If we are trying to understand a causal effect, we need two symbols so that control and treated values can be represented separately: \(Y_t\) and \(Y_c\).

1.3.1 Potential outcomes

Suppose that Yao is one of the commuters surveyed in this experiment. If we were omniscient, we would know the outcomes for Yao under both treatment (with Spanish-speakers) and control (no Spanish-speakers), and we’d be able to ignore the Fundamental Problem of Causal Inference. We can show this using a Preceptor Table. Calculating the number we are interested in is trivial because none of the data is missing.

Preceptor Table
ID Outcomes
Attitude if Treated Attitude if Control
Yao 13 9

Regardless of what the causal effect is for other subjects, the causal effect for Yao of being on the train platform with Spanish-speakers is a shift towards a more conservative attitude.

Using the response variable — the actual symbol rather than a written description — makes for a more concise Preceptor Table.

Preceptor Table
ID Outcomes
$$Y_t$$ $$Y_c$$

Yao

13

9

The “causal effect” is the difference between Yao’s potential outcome under treatment and his potential outcome under control.

Preceptor Table
ID Outcomes Causal Effect
$$Y_t$$ $$Y_c$$ $$Y_t - Y_c$$

Yao

13

9

+4

Remember that, in the real world, we will have a bunch of missing data! We can not use simple arithmetic to calculate the causal effect on Yao’s attitude toward immigration. Instead, we will be required to estimate it. An estimand is some unknown variable in the real world that we are trying to measure. In this case, it is \(Y_{t}-Y_{c}\), not \(+4\). An estimand is not the value you calculated, but is rather the unknown variable you want to estimate.

Don Rubin was a Professor of Statistics at Harvard.

1.3.2 Causal and predictive models

Causal inference is often compared with prediction. In prediction, we want to know an outcome, \(Y\). In causal inference, we want to know a function of potential outcomes, such as the treatment effect: \(Y_t - Y_c\).

These are both missing data problems. Prediction involves estimating an outcome variable that we don’t have, and thus is missing, whether because it is in the future or because it is from data that we are unable to collect. Thus, prediction is the term for using statistical inference to fill in missing data for individual outcomes. Causal inference, however, involves filling in missing data for more than one potential outcome. This is unlike prediction, where only one outcome can ever be observed, even in principle.

Key point: In a predictive model, there is only one \(Y\) value for each unit. This is very different to the RCM where there are (at least) two potential outcomes (treatment and control). There is only one outcome column in a predictive model, whereas there are two or more in a causal model.

With a predictive model, we cannot infer what would happen to the outcome \(Y\) if we changed \(X\) for a given unit. We can only compare two units, one with one value of \(X\) and another with a different value of \(X\).

In a sense, all models are predictive. However, only a subset of models are causal, meaning that, for a given individual, you can change the value \(X\) and observe a change in outcome, \(Y(u)\), and from that calculate a causal effect.

1.3.3 No causation without manipulation

In order for a potential outcome to make sense, it must be possible, at least a priori. For example, if there is no way for Yao, under any circumstance, to ever be in the train study, then \(Y_{t}\) is impossible for him. It can never happen. And if \(Y_{t}\) can never be observed, even in theory, then the causal effect of treatment on Yao’s attitude is undefined.

The causal effect of exposure to Spanish-speakers is well defined because it is the simple difference of two potential outcomes, both of which might happen. In this case, we (or something else) can manipulate the world, at least conceptually, so that it is possible that one thing or a different thing might happen.

This definition of causal effects becomes much more problematic if there is no way for one of the potential outcomes to happen, ever. For example, what is the causal effect of Yao’s height on his weight? It might seem we would just need to compare two potential outcomes: Yao’s weight under the treatment (where treatment is defined as being 3 inches taller) and Yao’s weight under the control (where control is defined as his current height).

A moment’s reflection highlights the problem: we can’t increase Yao’s height. There is no way to observe, even conceptually, what Yao’s weight would be if he were taller because there is no way to make him taller. We can’t manipulate Yao’s height, so it makes no sense to investigate the causal effect of height on weight. Hence the slogan: No causation without manipulation.

This then raises the question of what can and cannot be manipulated. If something cannot be manipulated, we should not consider it causal. So can race ever be considered causal? What about sex? A genetic condition like color-blindness? Can we manipulate these characteristics? In the modern world these questions are not simple.

Take color-blindness for example. Say we are interested in how color-blindness impacts ability to complete a jig-saw puzzle. Because color-blindness is genetic some might argue it cannot be manipulated. But advances in technology like gene-therapy might allow us to actually change someone’s genes. Could we then claim the ability to manipulate color-blindness? If yes, we could then measure the causal effect of color-blindness on ability to complete jig-saw puzzles.

The slogan of “No causation without manipulation” may at first seem straight-forward, but it is clearly not so simple. Questions about race, sex, gender and genetics are very complex and should be considered with care.

1.3.4 Multiple units

Generally, a study has many individuals (or, more broadly, “units”) who each have their own potential outcomes. More notation is needed to allow us to differentiate between different units.

In other words, there needs to be a distinction between \(Y_t\) for Yao, and \(Y_t\) for Emma. We use the variable \(u\) (\(u\) for “unit”) to indicate that the outcome under control and the outcome under treatment can differ for each individual unit (person).

Instead of \(Y_t\), we will use \(Y_t(u)\) to represent “Attitude if Treated.” If you want to talk about only Emma, you could say “Emma’s Attitude if Treated” or “\(Y_t(u = Emma)\)” or “the \(Y_t(u)\) for Emma”, but not just \(Y_t\). That notation is too ambiguous when there is more than one subject.

Let’s look at a Preceptor Table with more subjects using our new notation:

Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

9

+4

Emma

14

11

+3

Cassidy

11

6

+5

Tahmid

9

12

-3

Diego

3

4

-1

From this Preceptor Table, there are many possible estimands we might be interested in. Consider some examples, along with their true values:

  • A potential outcome for one person, e.g., Yao’s potential outcome under treatment: \(13\).
  • A causal effect for one person, such as for Emma. This is the difference between the potential outcomes: \(14 - 11 = +3\).
  • The most positive causal effect: \(+5\), for Cassidy.
  • The most negative causal effect: \(-3\), for Tahmid.
  • The median causal effect: \(+3\).
  • The median percentage change: \(+27.2\%\). To see this, calculate the percentage change for each person. You’ll get 5 percentages: \(+44.4\%\), \(+27.2\%\), \(+83.3\%\), \(-25.0\%\), and \(-25.0\%\).

Similar concepts can also be applied to the Population Table:

Population Table
Source Year ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

2010

?

?

?

?

Data

2015

Yao

5

3

+2

Data

2015

Cassidy

3

6

-3

2018

?

?

?

?

Preceptor Table

2023

Yao

13

9

+4

2030

?

?

?

?

For example, we get a much better picture of all our data, as it all combines into one nice looking Population Table. We can take a look a past data about Yao, or Cassidy, and their previous outcomes and causal effects. We can also see the rest of the units which fall under our desired population, but we don’t have any data about, hence the question makes.

Consider these examples:

  • Difference in potential outcome for one person, eg., the difference between Yao’s \(Y_t(u)\) values: \(-8\)
  • Difference in causal effect for one person, for Yao it would be \(-2\): \(+2- +4\)

All of the variables calculated in the Preceptor and Population Tables are examples of estimands we might be interested in. One estimand is important enough that it has its own name: the average treatment effect, often abbreviated as ATE. The average treatment effect is the mean of all the individual causal effects. Here, the mean is \(+1.6\).

What does our real-world Preceptor Table look like?

Causal Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

?

?

Emma

14

?

?

Cassidy

?

6

?

Tahmid

?

12

?

Diego

3

?

?

Predictive Preceptor Table
ID $$Y_t(u)$$

Yao

13

Emma

14

Cassidy

6

Tahmid

12

Diego

3

1.4 Assumptions

In this section, we will explore four topics: validity, stability, representativeness and unconfoundedness.

Our earlier Population Table familiarized us with the three sources of data for which we are making inferences: the Preceptor Table, our data, and the greater population from which both are drawn. Consider a new Population Table.

Population Table
Source Outcomes Year Covariates Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ Sex $$Y_t(u) - Y_c(u)$$
... ? ? 2010 ? ?
... ? ? 2010 ? ?
... ... ... ... ... ...
Data 13 ? 2012 Male ?
Data ? 9 2012 Female ?
... ... ... ... ... ...
Preceptor Table ? ? 2023 Female ?
Preceptor Table ? ? 2023 Female ?
... ... ... ... ... ...
... ? ? 2023 ? ?
... ? ? 2023 ? ?

The rows from our data have covariates and two potential outcomes. (By definition, we only know the value for one of the potential outcomes for each unit.) The rows from the Preceptor Table include covariates, but not outcomes. (We have no outcome data for 2023, but we will, in theory, be given the sex of any units about whom we need to make inferences.) The rows from our greater population include no data, as we know nothing about these units.

1.4.1 Validity

Validity is the consistency, or lack there of, in the columns of your dataset and the corresponding columns in your Preceptor Table. In order to consider the two datasets to be drawn from the same population, the columns from one must have a valid correspondence with the columns in the other. Validity, if true (or at least reasonable), allows us to construct the Population Table.

Consider the potential outcomes: attitude toward immigration under treatment and control. How is this measured? In this case, we use answers to survey questions. Our we going to use the exact same questions in 2023 as were used in our data from 2012? Probably not. Even if we used the same wording, will the meaning of the words be constant. Almost certainly not! The word “undocumented” could mean something very different in the America before the election of Donald Trump in 2016. Consider the treatment: exposure to Spanish speakers on a train platform. The speakers used in 2012 will not be available in 2023 if we were to redo the experiment.

Validity is an assumption which allows us to “stack” the data and the Preceptor Table on top of one another, into a single Population Table. The variables from each source, while not identical in meaning, are similar enough that we can treat them as if they are the same thing.

1.4.2 Stability

Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.

We will be constructing models, mathematical relationships between different variables. For example, perhaps the treatment has a greater effect on women than on men. However, we can only directly observe that relationship in the data which we have, from 2012. We must assume stability in order to use the model created from the data to make inferences about other rows in the Population Table, specifically the rows corresponding to the Preceptor Table.

Must the world be stable? No! In fact, the world is always changing! The stability assumption, in this case, is a claim that the world has not changes that much between 2012 and 2023. Stability applies, not just to the data and the Preceptor Table rows, but to the entire Population Table. Stability allows us to ignore the passage of time.

1.4.3 Representativeness

Representativeness, or the lack thereof, concerns two relationship, among the rows in the Population Table. The first is between the Preceptor Table and the other rows. The second is between our data and the other rows. Ideally, we would like both the Precepor Table and our data to be random samples from the population. Sadly, this is almost never the case.

Does the train experiment allow us to calculate a causal effect for people who commute by cars? Can we calculate the causal effect for people in New York City? Before we generalize to a broader population we have to consider if our experimental estimates are applicable beyond our experiment. Maybe we think that commuters in Boston and New York are similar enough to generalize our findings. We could also conclude that people who commute by car are fundamentally different than people who commute by train. If that was true, then we could not say our estimate is true for all commuters because our sample does not accurately represent the broader group to which we want to generalize.

1.4.4 Unconfoundedness

A fourth assumption we use when working with causal models — but not with predictive models — is “unconfoundedness.” If whether or not a unit received treatment or control is random, we write that treatment assignment is not “confounded.” If, however, treatment assignment depends on the value of a potential outcome, then treatment assignment is confounded. Our lives are easiest if we can (reasonably!) assume unconfoundedness. In that case, we can estimate the average treatment effect by subtracting the average outcome for control units from the average outcome for treated units, as we do above.

Consider the “Perfect Doctor” as an example of the problems caused by confounded treatment assignments. Imagine we have this omniscient doctor who knows how any patient will respond to a certain drug. She has perfect knowledge of the entire Preceptor Table. Using this information, she always assign each patient the treatment with the best outcome, whether that is treatment or control. Consider:

Holy Grail of Information
ID Blood Pressure Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

130

105

+25

Emma

120

140

-20

Cassidy

100

170

-70

Tahmid

115

125

-10

Diego

135

100

35

MEAN

120

128

-8

The Perfect Doctor would assign the treatment to Emma, Cassidy and Tahmid. She would assign control to Yao and Diego. And that is good! This is what the doctor should do. This is the best treatment assignment for the patients. But it is not a good assignment mechanism for estimating the average causal effect because treatment assignment is confounded by the values of the potential outcomes.

We, the non-Perfect Doctors, do not have access to the entire Preceptor Table. We can only see this:

Skewed Holy Grail of Information
ID Blood Pressure Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

?

105

?

Emma

120

?

?

Cassidy

100

?

?

Tahmid

115

?

?

Diego

?

100

?

MEAN

111.66

102.5

9.16

The true causal effect of the treatment, as we can see in the first table, is -8. In other words, the treatment lowers blood pressure on average. But, using just the data we have access to if the Perfect Doctor performs the treatment assignment, we would estimate — if we mistakenly assume random assignment — that the causal effect is positive, that treatment increases blood pressure.

The best way to ensure unconfoundedness is to randomize the treatment across units. Don’t let the doctor decide who gets the treatment and who gets the control. Randomize assignment. As long as you use randomization as your assignment mechanism, you are good. There is the possibility that you can’t use pure randomization due to ethical or practical reasons, so we are forced to use a non-random assignment mechanisms. Many statistical methods have been developed for causal inference when there is a non-random assignment mechanism. Those methods, however, are largely beyond the scope of this book.

1.5 Simple models

How can we fill in the question marks? Because of the Fundamental Problem of Causal Inference, we can never know the missing values. Because we can never know the missing values, we must make assumptions. “Assumption” just means that we need a “model,” and all models have parameters.

1.5.1 A single value for tau

One model might be that the causal effect is the same for everyone. There is a single parameter, \(\tau\), which we then estimate. (\(\tau\) is a Greek letter, written as “tau” and rhyming with “cow.”) Once we have an estimate, we can fill in the Preceptor Table because, knowing it, we can estimate what the unobserved potential outcome is for each person. We use our assumption about \(\tau\) to estimate the counterfactual outcome for each unit.

Remember what our Preceptor Table looks like with all of the missing data:

Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

?

?

Emma

14

?

?

Cassidy

?

6

?

Tahmid

?

12

?

Diego

3

?

?

If we assume \(\tau\) is the treatment effect for everyone, how do we fill in the table? We are using \(\tau\) as an estimate for the causal effect. By definition: \(Y_t(u) - Y_c(u) = \tau\). Using simple algebra, it is then clear that \(Y_t(u) = Y_c(u) + \tau\) and \(Y_c(u) = Y_t(u) - \tau\). In other words, you could add it to the observed value of every observation in the control group (or subtract it from the observed value of every observation in the treatment group), and thus fill in all the missing values.

Assuming there is a constant treatment effect, \(\tau\), for everyone, filling in the missing values would look like this:

Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

\[13 - \tau\]

\[\tau\]

Emma

14

\[14 - \tau\]

\[\tau\]

Cassidy

\[6 + \tau\]

6

\[\tau\]

Tahmid

\[12 + \tau\]

12

\[\tau\]

Diego

3

\[3 - \tau\]

\[\tau\]

Now we need to find an estimate for \(\tau\) in order to fill in the missing values. One approach is to subtract the average of the observed control values from the average of the observed treated values. \[((13 + 14 + 3) / 3) - ((6 + 12) / 2)\] \[10 - 9 = +1\]

Or, in other words, we use this formula:

\[\frac{\Sigma Y_t(u)}{n_t} + \frac{\Sigma Y_c(u)}{n_c} = \widehat{ATE}\]

\(\Sigma\) represents the sum of the treated/control values, and \(n_t\)/\(n_c\) represents the number of values within the treated and control groups. This formula is for something called \(\widehat{ATE}\), which we will discuss in more depth in a later section.

Continuing with the example, calculating the ATE or the causal effect, gives us an estimate of \(+1\) for \(\tau\). Let’s fill in our missing values by adding \(\tau\) to the observed values under control and by subtracting \(\tau\) from the observed value under treatment like so:

Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

\[13 - (+1)\]

+1

Emma

14

\[14 - (+1)\]

+1

Cassidy

\[6 + (+1)\]

6

+1

Tahmid

\[12 + (+1)\]

12

+1

Diego

3

\[3 - (+1)\]

+1

Which gives us:

Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

12

+1

Emma

14

13

+1

Cassidy

7

6

+1

Tahmid

13

12

+1

Diego

3

2

+1

If we make the assumption that there is a single value for \(\tau\) and that \(1\) is a good estimate of that value, then we can determine the missing potential outcomes. The Preceptor Table no longer has any missing values, so we can use it to easily answer (almost) any conceivable question.

1.5.2 Two values for tau

A second model might assume that the causal effect is different between levels of a category but the same within those levels. For example, perhaps there is a \(\tau_F\) for females and \(\tau_M\) for males where \(\tau_F != \tau_M\). We are making this assumption to give us a different model with which we can fill in the missing values in our Preceptor Table. We can’t make any progress unless we make some assumptions. That is an inescapable consequence of the Fundamental Problem of Causal Inference.

Consider a model in which causal effects differ based on sex. When we are looking at a “category” of units — for instance, gender — we call this a covariate. Possible covariates include, but are not limited to, sex, age, political party and almost everything else which might be associated with attitudes toward immigration.

Preceptor Table
ID Outcomes Covariate Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ Gender $$Y_t(u) - Y_c(u)$$

Yao

13

\[13 - \tau_M\]

Male

\[\tau_M\]

Emma

14

\[14 - \tau_F\]

Female

\[\tau_F\]

Cassidy

\[6 + \tau_F\]

6

Female

\[\tau_F\]

Tahmid

\[12 + \tau_M\]

12

Male

\[\tau_M\]

Diego

3

\[3 - \tau_M\]

Male

\[\tau_M\]

We would have two different estimates for \(\tau\).

\(\tau_M\) would be \[(13+3)/2 - 12 = -4\] \(\tau_F\) would be \[(14-6 = +8)\]

Using those values, we would fill out our new table like this:

Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

\[13 - (-4)\]

-4

Emma

14

\[14 - (+8)\]

+8

Cassidy

\[6 + (+8)\]

6

+8

Tahmid

\[12 + (-4)\]

12

-4

Diego

3

\[3 - (-4)\]

-4

Which gives us:

Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

17

-4

Emma

14

6

+8

Cassidy

14

6

+8

Tahmid

8

12

-4

Diego

3

7

-4

We now have two different estimates for Emma (and for everyone else in the table). When we estimate \(Y_c(Emma)\) using an assumption of constant treatment effect (a single value for \(\tau\)), we get \(Y_c(Emma) = 13\). When we estimate assuming treatment effect is constant for each sex, we calculate that \(Y_c(Emma) = 8\). This difference between our estimates for Emma highlights the difficulties of inference. Models drive inference. Different models will produce different inferences.

1.5.3 Heterogenous treatment effects

Is the assumption of a constant treatment effect, \(\tau\), usually true? No! It is never true. People vary. The effect of a pill on you will always be different from the effect of a pill on your friend, at least if we measure outcomes accurately enough. Treatment effects are always heterogeneous, meaning that they vary across individuals.

Reality looks like this:

Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

\[13 - \tau_{yao}\]

\[\tau_{yao}\]

Emma

14

\[14 - \tau_{emma}\]

\[\tau_{emma}\]

Cassidy

\[6 + \tau_{cassidy}\]

6

\[\tau_{cassidy}\]

Tahmid

\[12 + \tau_{tahmid}\]

12

\[\tau_{tahmid}\]

Diego

3

\[3 - \tau_{diego}\]

\[\tau_{diego}\]

Can we solve for \(\tau_{yao}\)? No! That is the Fundamental Problem of Causal Inference. So how can we make any progress from here if we are unwilling to assume there is at least some structure to the causal effect across different individuals? Instead of worrying about the causal effect for specific individuals, we, instead, focus on the causal effect for the entire population.

1.5.4 Average treatment effect

The average treatment effect (ATE) is the average difference in potential outcomes between the treated group and the control groups. Because averaging is a linear operator, the average difference is the same as the difference between the averages. The distinction between this estimand and estimands like \(\tau\), \(\tau_M\) and \(\tau_F\), is that, in this case, we do not care about using the average treatment effect to fill in missing values in each row. The average treatment effect is useful because we don’t have to assume anything about each individuals’ \(\tau\), like \(\tau_{yao}\), but can still understand something about the average causal effect across the whole population.

As we did before, the simplest way to estimate the ATE is to take the mean of the treated group (\(10\)) and the mean of the control group (\(9\)) and then take the difference in those means (\(1\)). If we use this method to an estimate of the ATE, we’ll call it \(\widehat{ATE}\), pronounced “ATE-hat.”

If we already did this exact same calculation above, why are we talking about it again? Remember that we are unwilling to assume treatment effect is constant in our study population, and we cannot solve for \(\tau\) if \(\tau\) is different for different individuals. This is where \(\widehat{ATE}\) is helpful.

Some estimands may not require filling in all the question marks in the Preceptor Table. We can get a good estimate of the average treatment effect without filling in every question mark — the average treatment effect is just a single number. Rarely in a study do we care about what happens to individuals. In our case, we don’t care about what specifically would happen to Cassidy’s attitude if treated. Instead, we care generally about how our experiment impacts people’s attitudes towards immigrants. This is why an average estimate, like \(\widehat{ATE}\), can be helpful.

As we noted before, this is a popular estimand. Why?

  1. There’s an obvious estimator for this estimand: the mean difference of the observed outcomes between the treated group and the control group: \(\overline{Y_t(u)} - \overline{Y_c(u)}\).

  2. If treatment is randomly assigned, the estimator is unbiased: you can be fairly confident in the estimate if you have a large enough treatment and control groups.

  3. As we did earlier, if you are willing to assume that the causal effect is the same for everyone (a big assumption!), you can use your estimate of the ATE, \(\widehat{ATE}\), to fill in the missing individual values in your Preceptor Table.

Just because the ATE is often a useful estimand doesn’t mean that it always is.

Consider point #3. For example, let’s say the treatment effect does vary dependent on sex. For males there is a small negative effect (-4), but for females there is a larger positive effect (+8). However, the average treatment effect for the whole sample, even if you estimate it correctly, will be a single positive number (+1) – since the positive effect for females is larger than the negative effect for males.

Estimating the average treatment effect, by calculating \(\widehat{ATE}\), is easy. But is our \(\widehat{ATE}\) a good estimate of the actual ATE? After all, if we knew all the missing values in the Preceptor Table, we could calculate the ATE perfectly. But those missing values may be wildly different from the observed values. Consider this Preceptor Table:

Preceptor Table
ID Outcomes Causal Effect
$$Y_t(u)$$ $$Y_c(u)$$ $$Y_t(u) - Y_c(u)$$

Yao

13

10

+3

Emma

14

11

+3

Cassidy

9

6

+3

Tahmid

15

12

+3

Diego

3

0

+3

In this example, there is indeed a constant treatment effect for everyone: \(+3\). Note that the observed values are all the same, but the unobserved values were such that our estimated ATE, \(+1\), is pretty far from the actual ATE, \(+3\). If we think we have a reasonable estimate of ATE, using that value as a constant for \(\tau\) might be our best guess.

1.6 Summary

Glossary
  • A Preceptor Table is a table that includes all rows and columns such that, if no data is missing, it is easy to calculate our quantity of interest. Unfortunately, there is always data missing.

  • A Population Table has three sources of data: the Preceptor Table, the dataset, and the greater population from which both are drawn.

  • A Potential Outcome is the outcome for an individual depending on if they receive the treatment or not.

  • A Causal Effect is the difference between two potential outcomes.

  • The Fundamental Problem of Causal Inference is that it is impossible to observe the causal effect on a single unit because we can never observe both potential outcomes for that unit. We must make assumptions in order to estimate causal effects.

  • When confronting a causal question, first identify the units, treatments and outcomes.

The fundamental components of every problem in causal inference are units, treatments and outcomes. The units are the rows in the table. The treatments are (some of) the columns. The outcomes are the values under the treatment columns. (There are also covariate columns and the values within them.) Whenever you confront a problem in causal inference, start by identifying the units, treatments and outcomes.

A causal effect is the difference between one potential outcome and another. How different would your life be if you missed the train?

A Preceptor Table includes all rows and columns such that, if no data is missing, it is easy to calculate our quantity of interest. Unfortunately, data is always missing in causal models because, at most, we can only observe one potential outcome for each unit. The causal effect of a treatment on a single unit at a point in time is the difference between the value of the outcome variable with the treatment and without the treatment. The Fundamental Problem of Causal Inference is that it is impossible to observe the causal effect on a single unit. We must make assumptions — i.e, we must make models — in order to estimate causal effects.

The Population Table has three sources of data: the Preceptor Table, the dataset, and the greater population from which both are drawn.

The assumption of validity, if met, allows us to create the Population Table because it ensures that the columns of data from the different sources can be put into a single table. If the relationships among the data are the same (or at least same’ish), over time, then we can assume stability. A model estimated on our data will also apply to our Preceptor Table. Representativeness examines the rows we have relative to the rows in the Population Table which we might have had. Unconfoundedness, which only matters in causal settings, means that either the treatment was randomly assigned or that we can act as if it was.

Random assignment of treatments to units is the best experimental set up for estimating causal effects. Other assignment mechanisms are subject to confounding. If the treatment assigned is correlated with the potential outcomes, it is very hard to estimate the true treatment effect. (As always, we use the terms “causal effects” and “treatment effects” interchangeably.) With random assignment, we can, mostly safely, estimate the average treatment effect (ATE) by looking at the difference between the average outcomes of the treated and control units.

Be wary of claims made in situations without random assignment: Here be dragons!


  1. The term Preceptor Table is, perhaps unsurprisingly, unique to this textbook. We hope you find it useful.↩︎

  2. The term Population Table is, perhaps unsurprisingly, unique to this textbook. We hope you find it useful.↩︎