Key Concepts

These are the key concepts from the Primer. You should apply them whenever you face a choice and have some data to help you make it. The world confronts us. Make decisions we must.

Quantity of Interest is the number which you want to estimate. You will almost always calculate a posterior probability distribution for your Quantity of Interest since, in the real world, you will never know your QoI precisely. Most problems involve more than one QoI.

Predictive Models and Causal Models are different because predictive models have only one outcome column. Causal models have more than one (potential) outcome column because we need more than one potential outcome in order to estimate a causal effect. The first step in a data science problem is to determine if your QoI requires a causal or a predictive model.

Units, outcomes and covariates are important parts of every data science model. Causal, but not predictive, models also include treatments, which are just covariates which we can, at least in theory, manipulate. The QoI determines the units and outcomes for your model.

Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, then it is easy to calculate the quantities of interest.

Potential Outcome is the outcome for an individual under a specific treatment.

Causal Effect is the difference between two potential outcomes.
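In potential-outcomes notation (our shorthand here, not a formula quoted from the Primer), the causal effect of a treatment on unit \(i\) is \(Y_i(\text{treatment}) - Y_i(\text{control})\): the outcome the unit would have under treatment minus the outcome it would have under control.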

Rubin Causal Model is an approach to the statistical analysis of cause and effect based on the framework of potential outcomes.

Wisdom is the first Cardinal Virtue in data science. Begin with the quantity of interest. Is that QoI a causal effect or simply a forecast? Which units and outcomes does it imply? What Preceptor Table would allow you to calculate your QoI easily? Perform an exploratory data analysis (EDA) on the data you have. Is it valid to consider the data you have and the (theoretical) data from the Preceptor Table to have arisen out of the same population? If so, you may continue. If not, your attempt to estimate your QoI ends now.

Validity is the consistency, or lack thereof, in the columns of your dataset and the corresponding columns in your Preceptor Table. In order to consider the two datasets to be drawn from the same population, the columns from one must have a valid correspondence with the columns in the other. Validity, if true (or at least reasonable), allows us to construct the Population Table, which is the first step in Justice.

Population Table can be constructed if the validity assumption is (mostly) true. It includes all the rows and columns from the Preceptor Table. It also includes the rows and columns from the data set. It generally has other rows as well, rows which represent unit/time combinations from other components of the population.

Justice is the second Cardinal Virtue in data science. Justice starts with the Population Table – the data we want to have, the data which we actually have, and all the other data from that same population. Each row of the Population Table is defined by a unique unit/time combination. We explore three key issues. First, does the relationship among the variables demonstrate stability, meaning is the model stable across different time periods? Second, are the rows associated with the data and, separately, the rows associated with the Preceptor Table representative of all the units from the population? Third, for causal models only, we consider unconfoundedness.

Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.

Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the Preceptor Table and the other rows. The second is between our data and the other rows. Ideally, we would like both the Preceptor Table and our data to be random samples from the population. Sadly, this is almost never the case.

Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates. This assumption is only relevant for causal models. We describe a model as “confounded” if this is not true. The easiest way to ensure unconfoundedness is to assign treatment randomly.

Courage is the third Cardinal Virtue in data science. Justice gives us the Population Table. Courage selects the data generating mechanism. We first specify the mathematical formula which connects the outcome variable we are interested in with the other data that we have. We explore different models. We need to decide which variables to include and to estimate the values of unknown parameters. We check our models for consistency with the data we have. We avoid hypothesis tests. We select one model.

Data Generating Mechanism (DGM) is also called the data generating model or the data generating process. The true DGM is the reality of the world, the physical process which actually generates the data which we observe. The estimated DGM is the mathematical formula we create which models the true DGM, which we can never know. We use the estimated DGM to draw inferences about our Quantities of Interest.

Temperance is the fourth Cardinal Virtue in data science. Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the questions with which we began. We create posteriors for the quantities of interest. We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.

Wisdom clarifies the question. Justice gives us the Population Table. Courage creates the data generating mechanism. Temperance answers the question.

Steps to Take in Every Data Science Problem

In the spirit of transparency, here are the guidelines which we provide to colleagues updating chapters in the Primer. This is how we think data science ought to be taught. It is also, perhaps unsurprisingly, how we think data science ought to be done, at least at this introductory level.

Question

Start with the question. This will generally be something simple, easily expressed in just a sentence, using colloquial language. Every data science problem starts with a question. Your goal is to provide an answer to the question, along with your uncertainty about that answer. The most common form for the answer is a posterior probability distribution. This PPD is either the answer itself, or the tool we use to answer the question. Example:

“What proportion of people who make $100,000 are liberal?”

Of course, every data science problem does not start with a question. It actually starts with a decision. The world confronts us. Make decisions we must. Yet, in this introductory textbook, starting with a decision to make is too hard. So, we simplify and start with a question. But we should always at least mention the sort of decision which the answer to the question might help us to make.

Note that the sophistication of these discussions increases as we go further into the book. Your discussion should be more sophisticated than the one found in the previous chapter and less sophisticated than what comes later.

Each chapter features all the same sections and sub-sections as we use below. That is, there are three sub-sections to Wisdom, four sub-sections for Justice, and so on.

Once we have our question, we can start with the Cardinal Virtues. Each section begins with a one sentence summary about the component steps of the relevant virtue. These will, obviously, be highly similar from chapter to chapter. But that is OK! We want to reinforce the steps in the path over and over again.

Wisdom

Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.

Preceptor Table

The Preceptor Table always begins with a restatement of the definition: “A Preceptor Table is the smallest possible table with rows and columns such that, if there is no missing data, our question is easy to answer.”

To create the Preceptor Table, we answer a series of questions. (Don’t ask these questions rhetorically. Just describe the answer. There is also no need to number them, although you should always use this order.)

  • Is the question causal? Look for verbs like “cause” or “affect” or “influence.” Look for a question which implies a comparison, for a single individual unit, between two states of the world, one where the unit receives treatment X and one where the unit gets treatment Y. Look for a discussion of something which we can manipulate. Remember the motto: No causation without manipulation. We look to see if the question seeks to compare two potential outcomes within the same unit, rather than the same outcome between two different units.

If none of this is present, use a predictive model. If all you need to know to answer the question is the outcome under one value of the treatment, then the model is predictive. Example: What is the att_end for all women if they were to get the treatment? This is a predictive question, not a causal one, because we do not need to know the outcome under treatment and under control for any individual woman.

  • What is the moment in time to which the question refers? Every question refers to a moment in time, even if that moment stretches a bit. The set of adults today is different from the set 10 years ago, or even yesterday. We need to refine the original question. Assume that we are referring to July 1, 2020 even though, in most cases, people are interested in now. We have changed the original question from:

“What proportion of people who make $100,000 are liberal?”

to

“On July 1, 2020, what proportion of people who make $100,000 were liberal?”

  • What are the units? The question often makes this fairly clear, at least in terms of what each row corresponds to, whether it be individuals, classrooms, countries, or whatever. But, questions often fail to make clear the total number of the rows. Our example question above does not specify the relevant population. Is it about all the people in the world? All the adults? All the adults in the United States? The purpose of this paragraph is to refine the question, to make it more specific. Assume that we are interested in all the adults in Chicago. Our question now is:

“On July 1, 2020, what proportion of the adults in Chicago who make $100,000 were liberal?”

This back-and-forth between the question and the analysis is a standard part of data science. We rarely answer the exact question we started with, especially because that question is never specific enough to answer without further qualifications. Furthermore, the data we have may not allow us to answer that question, but it may be enough to answer a related question. Is that good enough for the boss/client/colleague who asked the original question? Maybe? We won’t know till we ask.

Our job as data scientists is not to simply answer the question we have been asked, but to help the questioner determine a question which can be answered with the data we have, a question which helps them to make the decision which they face.

  • What are the outcomes? (If the model is causal, then there must be at least two potential outcomes. If you can’t figure them out, then the model is probably predictive.) If the model is predictive, then there is only one outcome. This paragraph does more than just name the relevant variable. It also starts the discussion about how exactly we might measure this variable. We consider both the underlying concept, “liberal,” and the process by which we might operationalize the concept. Perhaps we are using a written survey with a YES/NO answer. Perhaps it is an in-person interview with a 1-7 Likert scale, in which answers of 1 or 2 are coded, by us, as “liberal.” The details may or may not matter, but we at least need to discuss the issue.

  • What are the covariates? Discussing covariates in the context of the Preceptor Table is different than discussing covariates in the context of the data. Recall that the Preceptor Table is the smallest possible table, so we don’t need to include every relevant variable. We only need to discuss variables that are necessary to answer the question.

  • What are the treatments, if any? (There are no “treatments” in predictive models. There are only covariates.) A treatment is a covariate which, at least in theory, we can manipulate and the manipulation of which is necessary to answer our question.

With all the above, create the Preceptor Table. In this case, our Preceptor Table includes N rows, one for every adult in Chicago on July 1, 2020. It includes two columns: the outcome (liberal) and a single covariate (income).

Preceptor Table

    ID     Liberal (outcome)    Income (covariate)
    1      0                    150000
    2      0                    50000
    ...    ...                  ...
    10     1                    65000
    11     1                    35000
    ...    ...                  ...
    N      1                    78000

If we have the Preceptor Table, with no missing data, then it is trivial to calculate the percentage of adults (who make more than $100,000) who are liberal.

EDA

There is always a short section devoted to exploratory data analysis. You can never look at the data too much. Each EDA will include at least one textual look at the data, usually using summary(), but with skim(), glimpse(), print() and slice_sample() also available. It will also include at least one graphic, almost always with the outcome variable on the y-axis and one of the covariates on the x-axis. The data set will often include columns and rows which are irrelevant to the question. Those columns and rows are removed, creating a tibble which will be used in the Courage section. The name of that tibble will often be something convenient like ch_7.
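As an illustration, here is a minimal EDA sketch in R. The tibble raw_data and its liberal and income columns are hypothetical; the particular functions and cleaning steps will vary from chapter to chapter.

    # A minimal EDA sketch. `raw_data` is a hypothetical tibble with,
    # among other things, a logical `liberal` column and a numeric
    # `income` column.
    library(tidyverse)
    library(skimr)

    summary(raw_data)                  # one textual look at the data
    skim(raw_data)                     # another, with more detail
    raw_data |> slice_sample(n = 5)    # a few random rows

    # Remove the rows and columns which are irrelevant to the question,
    # creating the tibble used in the Courage section.
    ch_7 <- raw_data |>
      drop_na(liberal, income) |>
      select(liberal, income)

    # At least one graphic: outcome on the y-axis, covariate on the x-axis.
    ch_7 |>
      ggplot(aes(x = income, y = as.integer(liberal))) +
      geom_jitter(height = 0.1, alpha = 0.2) +
      labs(title = "Political Ideology and Income",
           x = "Income",
           y = "Liberal (1 = TRUE)")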

It also makes sense to include some discussion about where this data comes from. What are the definitions of the variables? Who chose the sample? Where is the documentation? This sort of background sets the stage for discussing validity.

Validity

Validity discussions always have one (short) paragraph about each relevant variable (the outcome and any relevant covariates), with examples of why validity might not hold. The validity discussion finishes with a brief statement along the lines of: “Despite these concerns, we will assume that validity does hold.”

These sections can be longer, of course, depending on how many details you discussed during the EDA. The central point is that we have two (potentially!) completely different things: the Preceptor Table and the data. Just because two columns have the same name does not mean that they are the same thing. Indeed, they will often be quite different! But because we control the Preceptor Table and, to a lesser extent, the original question, we can adjust those variables to be “closer” to the data that we actually have. This is another example of the iterative nature of data science. If the data is not close enough to the question, then we check with our boss/colleague/customer to see if we can modify the question in order to make the match between the data and the Preceptor Table close enough for validity to hold.

The Wisdom section finishes with a sentence or two about the question and the data. Example:

Using data from a 2012 survey of Boston-area commuters, we seek to understand the relationship between income and political ideology in Chicago and similar cities in 2020. In particular, what percentage of individuals who make more than $100,000 per year are liberal?

The idea is that, by thinking hard about the original question and the data, we have come up with a question which may be possible to answer with the data we have. Note that each Cardinal Virtue section finishes with a sentence or two summarizing what you have learned. Those sentences are combined at the end of the analysis. One of the key products of a data science project is a paragraph which summarizes the key conclusions.

Justice

Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.

Population Table

If validity holds, then we can create a Population Table.

Population Table

    Source             Year    Income (outcome)    Age    City
    ...                ...     ...                 ...    ...
    Data               2012    150000              43     Boston
    Data               2012    50000               52     Boston
    ...                ...     ...                 ...    ...
    Preceptor Table    2020    ?                   ?      Chicago
    Preceptor Table    2020    ?                   ?      Chicago
    ...                ...     ...                 ...    ...

  • The “Source” column highlights that the Population Table includes three categories of rows: The data, the Preceptor Table, and the rest of the population, from which both the data and the Preceptor Table are drawn. The ... indicates rows from the population which are not included in either the data or the Preceptor Table.

  • The “ID” column is implicit, and often not included. After all, it should be obvious that each row refers to a specific unit. If we don’t really care about the individual units, there is no need to label them.

  • There should always be a column, in this case “Year,” which indicates the moment in time at which the covariates were recorded. A given unit may appear in multiple rows, with each row providing the data at a different time. In this example, we will have a row for Sarah in 2012, when she was 43, and a row for Sarah in 2020, when she was 51, and so on. Note that Sarah might just be a member of the population, neither in the data we have nor in the Preceptor Table. Or she might be in one or the other. We are not concerned with any specific individuals.

  • Each row in the Population Table represents a unique Unit/Time combination.

  • The “Outcome” column is the variable which we are trying to understand/explain/predict. There is always an outcome column, although it will often just be labelled with the variable name, as here with “Income.”

  • The “Covariates” are all the columns other than those already discussed.

Stability

If the assumption of stability holds, then the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn. Each Stability section begins with a version of this definition. We then discuss at least two examples of why stability might not hold in this case. These examples are almost always connected to the passage of time. Whatever the relationship between political ideology and income that might have held in 2012, when we gathered our data, might not be true either before or afterwards. Provide specific speculations about what might have changed in the world.

Regardless of those concerns, we always conclude that, although the assumption of stability might not hold perfectly, the world is probably stable enough over this time period to make inference possible.

Representativeness

As with all our assumptions, we always begin by restating the definition. Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the Preceptor Table and the other rows. The second is between our data and the other rows.

We mention specific examples of two potential problems. First, is our data representative of the population? Never! Not least because it is virtually impossible to do random sampling across time. Second, are the rows associated with the Preceptor Table representative of the population? Again, No!

Provide specific examples of how a lack of representativeness might be a problem, one large enough to affect your ability to answer the question.

But, to continue the analysis, we always assume/pretend that the rows from both the data and the Preceptor Table are representative enough of the larger population from which both are drawn.

Unconfoundedness

If the model is predictive, then unconfoundedness is not a concern. Just mention that fact in a sentence at the end of the section on representativeness. But, if the model is causal, then we need a section devoted to this topic.

As always, start by restating the definition, something along the lines of: “Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates. We describe a model as ‘confounded’ if this is not true.”
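In potential-outcomes notation (our shorthand, not a formula from the Primer), the assumption can be written \(Y_i(t) \perp T_i \mid X_i\) for every treatment value \(t\), where \(T_i\) is the treatment assigned to unit \(i\) and \(X_i\) are its pre-treatment covariates.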

If treatment assignment was random, then unconfoundedness is guaranteed, although experienced researchers often worry about the exact process involved in such “random” assignment. If, however, treatment assignment was not random, then there will always be a concern that it is correlated with potential outcomes. Discuss at least two scenarios in which this might be a concern. But then, as usual, conclude that, although there might be some issues with confoundedness, they are probably small enough to not worry about.

The Justice section concludes with a sentence or two about how, despite any problems with the core assumptions of stability, representativeness and unconfoundedness, we can still proceed to the next steps because the assumptions hold well enough.

The last step may be to revisit the key sentences from the Wisdom section. Recall:

Using data from a 2012 survey of Boston-area commuters, we seek to understand the relationship between income and political ideology in Chicago and similar cities in 2020. In particular, what percentage of individuals who make more than $100,000 per year are liberal?

Are these sentences still correct, or does a serious consideration of the assumptions of stability, representativeness and unconfoundedness require us to modify them? The answer, of course, is that the assumptions are never perfect! So, we have an obligation to add a sentence or two which highlights (no more than) one or two concerns. Examples:

There is some concern that survey participants may not be perfectly representative of the underlying population.

The relationship between income and ideology may have changed over that eight year period.

There is no need to use technical terms like “stability.” However, most readers will understand what “representative” means. The key point is honesty. We have an obligation to at least mention some possible concerns. At the end, we conclude with an update as to how the concluding paragraph looks now:

Using data from a 2012 survey of Boston-area commuters, we seek to understand the relationship between income and political ideology in Chicago and similar cities in 2020. In particular, what percentage of individuals who make more than $100,000 per year are liberal? The relationship between income and ideology may have changed over that eight year period.

Courage

Justice gave us the Population Table. Courage selects the data generating mechanism. We first specify the mathematical formula which connects the outcome variable we are interested in with the other data that we have. We explore different models. We need to decide which variables to include and to estimate the values of unknown parameters. We check our models for consistency with the data we have. We avoid hypothesis tests. We select one model.

The structure of the Courage section begins with a version of this opening paragraph, followed by a discussion of the functional form we will be using. This is usually straightforward because it follows directly from the type of the outcome variable: continuous implies a linear model and binary implies logistic. We provide the mathematical formula for this model, using y and x as variables. The rest of the discussion is broken up into three sections: “Models,” “Data Generating Mechanism,” and “Tests.”
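For the running example, in which the outcome is binary, a minimal version of that formula, still in the y-and-x form used at this stage, is:

\(y_i \sim \text{Bernoulli}(\rho_i)\), where \(\rho_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}\)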

Models

When exploring different models, we need to decide which variables to include and to estimate the values of unknown parameters. We estimate the models and then print out the model results. We do not give another version of the math, or use tbl_regression() yet. The goal is to explore and interpret different models.
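As a sketch, assuming the rstanarm package and the ch_7 tibble from the EDA (both illustrative choices, not requirements):

    # Fit one candidate model. `refresh = 0` suppresses the sampler's
    # progress output; `seed` makes the fit reproducible.
    library(rstanarm)

    fit_1 <- stan_glm(liberal ~ income,
                      data = ch_7,
                      family = binomial(),
                      refresh = 0,
                      seed = 76)

    # print() shows the parameter estimates and their MAD_SD's, which
    # drive the decision about which variables to keep.
    print(fit_1, digits = 5)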

If a parameter’s estimated value is more than 2 or 3 MAD_SD’s away from zero, we generally keep that parameter (and its associated variable) in the model. This is, probably, a variable which “matters.” The main exception to this rule is a parameter whose value is so close to zero that changes in its associated variable, within the general range of that variable, can’t change the value of the outcome by much.

Depending on the chapter, we will use different tools to choose among the different possible models.

Data Generating Mechanism

We create a final model, the data generating mechanism. We provide the math for this model, using variable names instead of y and x as we did at the start of the chapter. We present the final parameter estimates nicely, using tbl_regression().
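For example, assuming the gtsummary package (a common choice for this) and the fit_1 object from the sketch above:

    # Present the final parameter estimates as a polished table.
    library(gtsummary)

    tbl_regression(fit_1, intercept = TRUE)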

The model you have made by the end of Courage is almost always too complex to answer the simple question you started with, because the question rarely specifies the values of all the covariates which are included in the model. But any covariates or treatments which are part of the initial question(s) must be included in the model, otherwise we can’t answer any questions about them.

The DGM section ends with a clear statement in English, in its own paragraph, describing the model. That is, what are the two sentences which a student would say at a presentation describing the model. The first sentence specifies the model, including making clear the units, outcome and key covariates. (No need to use the terms “units,” “outcomes,” and so on.) The second sentence tells us something about the model, generally the relationship between one of the covariates and the outcome variable. In general, there is no discussion of specific numbers or their uncertainty. First, who cares? Parameter estimates are boring and irrelevant. Second, the Temperance section is where we answer the question. Example:

We modeled liberal, a binary TRUE/FALSE variable, as a logistic function of income. Individuals with higher income were more likely to be liberal.

Update our concluding paragraph with this addition:

Using data from a 2012 survey of Boston-area commuters, we seek to understand the relationship between income and political ideology in Chicago and similar cities in 2020. In particular, what percentage of individuals who make more than $100,000 per year are liberal? The relationship between income and ideology may have changed over that eight year period. We modeled liberal, a binary TRUE/FALSE variable, as a logistic function of income. Individuals with higher income were more likely to be liberal.

Feel free to use “I” instead of “We” if the project is solo.

Tests

We check our models for consistency with the data we have using posterior predictive testing. We avoid hypothesis tests.
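A sketch of such a check, assuming the fit_1 object from above; pp_check() comes with rstanarm (via bayesplot):

    # Compare the observed outcomes to replicated outcomes drawn from
    # the fitted model. For a binary outcome, a bar plot of the 0/1
    # counts is a natural posterior predictive check.
    pp_check(fit_1, plotfun = "bars", nreps = 100)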

Temperance

Courage produced the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the questions with which we began. We create posteriors for the quantities of interest. We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.

The two sub-sections of Temperance are: Questions and Answers, and Humility.

Questions and Answers

We go back to the question with which we started the journey. We discuss how that question has evolved, in a back-and-forth process by which we try to ensure that the data we have and the question we ask are close enough to make the process plausible.

We revisit the Preceptor Table. (We might or might not reprint it, perhaps modified based on the back-and-forth research process.) We emphasize that the DGM allows us to fill in missing outcomes in the Preceptor Table, thereby allowing us to answer our questions.

We use the data generating mechanism from Courage to answer the question. This is, obviously, the core of the Temperance section.
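A sketch of that calculation, assuming the fit_1 object and the hypothetical variable names from earlier:

    # The quantity of interest: the probability that a person making
    # $100,000 is liberal. posterior_epred() returns one expected
    # probability per posterior draw.
    newobs <- data.frame(income = 100000)
    p_liberal <- posterior_epred(fit_1, newdata = newobs)[, 1]

    # The posterior median and a 95% interval summarize our answer.
    quantile(p_liberal, probs = c(0.025, 0.5, 0.975))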

The Answer section always concludes with a one sentence summary of our final conclusion. This summary does not include any technical terms. It is meant for non-statisticians. It is something which we might say in explaining our take-away conclusion to a non-statistician. It will always feature at least one number, and our uncertainty associated with that number. Example:

55% (\(\pm\) 2%) of the people who make more than $100,000 per year are liberal.

or

Of the people making $100,000 or more per year, about 55% are liberal, although the true number could be as low as 53% or as high as 57%.

Some cases, like these, feature numbers which have a natural interpretation. We know what percentages are. But, often, the key numbers require some more discussion. For example:

The causal effect of hearing Spanish-speakers is a more conservative attitude toward immigration, a change of about 1.5 (\(\pm\) 0.5) on a 15 point scale.

This is correct, as far as it goes, but we have no idea if 1.5 is a “big” or “small” change. We need some perspective:

The causal effect of hearing Spanish-speakers is a more conservative attitude toward immigration, a change of about 1.5 (\(\pm\) 0.5) on a 15 point scale. For perspective, the difference between Democrats and Republicans on that same scale is about 2.1.

Depending on the context, you might have more than one Quantity of Interest to discuss. But there must be at least one. You are now ready to provide the entire concluding paragraph.

Using data from a 2012 survey of Boston-area commuters, we seek to understand the relationship between income and political ideology in Chicago and similar cities in 2020. In particular, what percentage of individuals who make more than $100,000 per year are liberal? The relationship between income and ideology may have changed over that eight year period. We modeled liberal, a binary TRUE/FALSE variable, as a logistic function of income. Individuals with higher income were more likely to be liberal. Of the people making $100,000 or more per year, about 55% are liberal, although the true number could be as low as 53% or as high as 57%.

The final result of a data science project is a paragraph like this one. We begin with a question and some data. We produce a paragraph.

Humility

Temperance guides us in the use of the DGM to answer the questions with which we began.

The Humility section always begins with a single sentence, something along the lines of:

We can never know the truth.

Over time, we hope to collect a series of quotations along this theme.

Having answered the question, we now (quickly) review all the reasons why our answer might be wrong. Review the specific concerns we had about validity, stability, representativeness, and (if a causal model) unconfoundedness. Those concerns remain.

Review the three levels of “truth”: Knowing all the entries in the Preceptor Table, knowing the true DGM, and then knowing our estimated DGM. (This explanation can become more sophisticated as the chapters progress.)

We can never know all the entries in the Preceptor Table. That knowledge is reserved for God. If all our assumptions are correct, then our DGM is true: it accurately describes the way in which the world works. There is no better way to predict the future, or to model the past, than to use it. Sadly, this will only be the case with toy examples involving things like coins and dice. We hope that our DGM is close to the true DGM but, since our assumptions are never perfectly correct, our DGM will always be different. The estimated magnitude and importance of that difference is a matter of judgment.

The problem with our concluding paragraph is that it implies that our DGM is the truth, rather than just an imperfect approximation of the true DGM. There are two main ways in which our DGM might be wrong. First, the central portion of our estimate, 55% in this case, might be wrong. We might be biased low or high. It is hard to know what to do about that, other than to be aware.

The second way that our DGM might be wrong, relative to the true DGM, is that our uncertainty interval, the 4% from 53% to 57%, might be off. It might be too narrow or too wide. In reality, however, it is almost certainly too narrow, relative to the true DGM. Problems with our assumptions, which are inevitable, almost always make our confidence intervals too narrow.

In later chapters we should also estimate a different (plausible) DGM and show the answer it provides to our question. That answer will always be different than the one in the concluding paragraph. (Ideally, we choose a small change in the model which produces a large change in the estimates for the QoI.)
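As a sketch, assuming the earlier objects and a hypothetical age column which the data would have to contain (it is not part of the example above):

    # An alternative, equally plausible DGM which adds a covariate.
    # Comparing its answer to fit_1's shows how model-dependent our
    # conclusion is.
    fit_2 <- stan_glm(liberal ~ income + age,
                      data = ch_7,
                      family = binomial(),
                      refresh = 0,
                      seed = 76)

    quantile(posterior_epred(fit_2,
                             newdata = data.frame(income = 100000,
                                                  age = 40))[, 1],
             probs = c(0.025, 0.5, 0.975))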

This generates a final discussion about the difficulties of data science. We don’t change our concluding paragraph. We just step back and remember that, because any such paragraph will always depend on a model, and because there is always a different, equally plausible model which we could have used instead, we must demonstrate humility as we report and consider using our results. Perhaps there is even some final discussion about the sorts of decisions which this information is supposed to help us make.

Last line in every chapter is always: “The world is always more uncertain than our models would have us believe.”