5 Probability

The usual touchstone of whether what someone asserts is mere persuasion or at least a subjective conviction, i.e., firm belief, is betting. Often someone pronounces his propositions with such confident and inflexible defiance that he seems to have entirely laid aside all concern for error. A bet disconcerts him. Sometimes he reveals that he is persuaded enough for one ducat but not for ten. For he would happily bet one, but at ten he suddenly becomes aware of what he had not previously noticed, namely that it is quite possible that he has erred. -— Immanuel Kant, Critique of Pure Reason

The central tension, and opportunity, in data science is the interplay between the data and the science, between our empirical observations and the models which we use to understand them. Probability is the language we use to explore that interplay; it connects models to data, and data to models.

The only package we need in this chapter is tidyverse.

5.1 List-columns and map functions

Before learning about probability, we need to expand our collection of R tricks by understanding list-columns and map_* functions. Recall that a list is different from an atomic vector. In atomic vectors, each element of the vector has one value. Lists, however, can contain vectors, and even more complex objects, as elements.

x <- list(c(4, 16, 9), c("A", "Z"))
## [[1]]
## [1]  4 16  9
## [[2]]
## [1] "A" "Z"

x is a list with two elements. The first element is a numeric vector of length 3. The second element is a character vector of length 2. We use [[]] to extract specific elements.

## [1]  4 16  9

There are a number of built-in R functions that output lists. For example, the ggplot objects you have been making store all of the plot information in lists. Any function that returns multiple values can be used to create a list output by wrapping that returned object with list().

x <- rnorm(10)

# range() returns the min and max of the argument 

tibble(col_1 = list(range(x))) 
## # A tibble: 1 x 1
##   col_1    
##   <list>   
## 1 <dbl [2]>

Notice this is a 1-by-1 tibble with one observation, which is a list of one element. Voila! You have just created a list-column.

If a function returns multiple values as a vector, like range() does, you must use list() as a wrapper if you want to create a list-column.

A list column is a column of your data which is a list rather than an atomic vector. As with stand-alone list objects, you can pipe to str() to examine the column.

tibble(col_1 = list(range(x))) %>%
## tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
##  $ col_1:List of 1
##   ..$ : num [1:2] -1.737 0.834

We can use map_* functions to both create a list-column and then, much more importantly, work with that list-column afterwards.

tibble(col_1 = list(range(x))) %>%
  mutate(col_2 = map_dbl(col_1, ~ sum(.))) %>% 
## tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
##  $ col_1:List of 1
##   ..$ : num [1:2] -1.737 0.834
##  $ col_2: num -0.902

map_* functions, like map_dbl() in this example, take two key arguments, .x (the data which will be acted on) and .f (the function which will act on this data). Here, .x is the data in col_1, which is a list-column. .f is the function sum(). However, we can not simply write map_dbl(col_1, sum). Instead, each use of map_* functions requires the use of a tilde — a ~ — to indicate the start of the function and the use of a dot — a . — to specify where the data goes in the function.

map_* functions are a family of functions, with the suffix specifying the type of the object to be returned. map() itself returns a list. map_dbl() returns a double. map_int() returns an integer. map_chr() returns a character, and so on.

tibble(ID = 1) %>% 
  mutate(col_1 = map(ID, ~range(rnorm(10)))) %>%
  mutate(col_2 = map_dbl(col_1, ~ sum(.))) %>% 
  mutate(col_3 = map_int(col_1, ~ length(.))) %>% 
  mutate(col_4 = map_chr(col_1, ~ sum(.))) %>% 
## tibble [1 × 5] (S3: tbl_df/tbl/data.frame)
##  $ ID   : num 1
##  $ col_1:List of 1
##   ..$ : num [1:2] -1.28 1.07
##  $ col_2: num -0.216
##  $ col_3: int 2
##  $ col_4: chr "-0.216124"

Consider a more detailed example:

# This simple example demonstrates the workflow which we will often follow.
# Start by creating a tibble which will be used to store the results. (Or start
# with a tibble which already exists and to which you will be adding more
# columns.) It is often convenient to get all the code working with just a few
# rows. Once it is working, we increase the number of rows to a thousand or
# million or whatever we need.

tibble(ID = 1:3) %>% 
  # The big convenience is being able to store a list in each row of the tibble.
  # Note that we are not using the value of ID in the call to rnorm(). (That is
  # why we don't have a "." anywhere.) But we are still using ID as a way of
  # iterating through each row; ID is keeping count for us, in a sense.
  mutate(draws = map(ID, ~ rnorm(10))) %>% 
  # Each succeeding step of the pipe works with columns already in the tibble
  # while, in general, adding more columns. The next step calculates the max
  # value in each of the draw vectors. We use map_dbl() because we know that
  # max() will returns a single number.
  mutate(max = map_dbl(draws, ~ max(.))) %>% 
  # We will often need to calculate more than one item from a given column like
  # draws. For example, in addition to knowing the max value, we would like to
  # know the range. Because the range is a vector, we need to store the result
  # in a list column. map() does that for us automatically.
  mutate(min_max = map(draws, ~ range(.)))
## # A tibble: 3 x 4
##      ID draws        max min_max  
##   <int> <list>     <dbl> <list>   
## 1     1 <dbl [10]>  1.01 <dbl [2]>
## 2     2 <dbl [10]>  1.04 <dbl [2]>
## 3     3 <dbl [10]>  1.70 <dbl [2]>

This flexibility is only possible via the use of list-columns and map_* functions. This workflow is extremely common. We start with an empty tibble, using ID to specify the number of rows. With that skeleton, each step of the pipe adds a new column, working off a column which already exists.

5.2 Probability distributions

Dice and Probability.

FIGURE 5.1: Dice and Probability.

What does it mean that Trump had a 30% chance of winning re-election in the fall of 2020? That there is a 90% probability of rain today? That the dice at the casino are unfair?

Probability is about quantifying uncertainty. We can think of probability as a proportion. The probability of an event occurring is a number from 0 to 1, where 0 means that the event is impossible and 1 means that the event is 100% certain.

Let’s begin with the simplest events: coin flips and dice rolls. If the dice and the coins are fair, we can operate under the assumption that all outcomes are equally likely.

This allows us to make the following statements:

  • The probability of rolling a 1 or a 2 is 2/6, or 1/3.
  • The probability of rolling a 1, 2, 3, 4, 5, or 6 is 1.
  • The probability of flipping a coin and getting tails is 1/2.

For the purposes of this Primer, a probability distribution is a mathematical object that covers a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 inclusive. The chances must sum to 1. The set of possible outcomes — heads or tails for the coin, 1 through 6 for a single die, 2 through 12 for a pair of dice — can be either discrete or continuous. This set of outcomes is the domain of the probability distribution. There are three types of probability distributions: mathematical, empirical, and posterior.

The key difference between a distribution, as we have explored them Section 2.8, and a probability distribution is the requirement that the sum of the probabilities of the individual outcomes must be exactly 1. There is no such requirement in a distribution. But any distribution can be turned into a probability distribution by “normalizing” it, as we will explore. In this context, we will often refer to a distribution which is not (yet) a probability distribution as an “unnormalized” distribution.

Pay attention to notation. Whenever we are talking about a specific probability (represented by a single value), we will use \(\rho\) (the Greek letter “rho” but spoken aloud as “p” by us) with a subscript which specifies the exact outcome of which it is the probability. For instance, \(\rho_h = 0.5\) denotes the probability of getting heads on a coin toss when the coin is fair. \(\rho_t\) — spoken as “PT” or “P sub T” or “P tails” — denotes the probability of getting tails on a coin toss. However, when we are referring to the entire probability distribution over a set of outcomes, we will use \(\text{Prob}()\). For example, the probability distribution of a coin toss is \(\text{Prob}(\text{coin})\). That is, \(\text{Prob}(\text{coin})\) is composed of the two specific probabilities (50% and 50%) mapped from the two values in the domain (Heads and Tails). Similarly, \(\text{Prob}(\text{sum of two dice})\) is the probability distribution over the set of 11 outcomes (2 through 12) which are possible when you take the sum of two dice. \(\text{Prob}(\text{sum of two dice})\) is made up of 11 numbers — \(\rho_2\), \(\rho_3\), …, \(\rho_{12}\) — each representing the unknown probability that the sum will equal their value. That is, \(\rho_2\) is the probability of rolling a 2.

5.2.1 Flipping a coin

A mathematical distribution is based on a mathematical formula. Assuming that the coin is perfectly fair, we should, on average, get heads as often as we get tails.

An empirical distribution is based on data. You can think of this as the probability distribution created by running a simulation. In theory, if we increase the number of coins we flip in our simulation, the empirical distribution will look more and more similar to the mathematical distribution. The probability distribution is the Platonic form. The empirical distribution will often look like the mathematical probability distribution, but it will rarely be exactly the same.

In this simulation, there are 56 heads and 44 tails. The outcome will vary every time we run the simulation, but the proportion of heads to tails should not be too different if this coin is fair.

# We are flipping one fair coin a hundreds times. We need to get the same result
# each time we create this graphic because we want the results to match the
# description in the text. Using set.seed() guarantees that the random results
# are the same each time. We define 0 as tails and 1 as heads.


tibble(results = sample(c(0, 1), 100, replace = TRUE)) %>% 
  ggplot(aes(x = results)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 0.5, 
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Flipping one coin a hundred times",
         x = "Outcome\nResult of Coin Flip",
         y = "Probability") +
    scale_x_continuous(breaks = c(0, 1), 
                       labels = c("Heads", "Tails")) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +

A posterior distribution is based on beliefs and expectations. It displays your belief about things you can’t see right now. You may have posterior distributions for outcomes in the past, present, or future.

In the case of the coin toss, the posterior distribution changes depending on your beliefs. For instance, let’s say your friend brought a coin to school and asked to bet you. If the result is heads, you have to pay them $5.

This makes you suspicious. Your posterior distribution would reflect this. You might believe that \(\rho_h\) is 0.95 and \(\rho_t\) is 0.05.

The full terminology is mathematical (or empirical or posterior) probability distribution. But we will often shorten this to just mathematical (or empirical or posterior) distribution. The word “probability” is understood, even if it is not present.

5.2.2 Rolling two dice

Our mathematical distribution tells us that, with a fair dice, the probability of getting 1, 2, 3, 4, 5, and 6 are equal: there is a 1/6 chance of each. When we roll two dice at the same time and sum the numbers, the values closest to the middle are more common than values at the edge because there are more combinations of numbers that add up to the middle values.

We get an empirical distribution by rolling two dice a hundred times, either by hand or with a computer simulation. The result is not identical to the mathematical distribution because of the inherent randomness of the real world and/or of simulation.

# In the coin example, we create the vector ahead of time, and then assigned
# that vector to a tibble. There was nothing wrong with that approach. And we
# could do the same thing here. But the use of map_* functions is more powerful.


emp_dist_dice <- tibble(ID = 1:100) %>% 
  mutate(die_1 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(die_2 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(sum = die_1 + die_2) %>% 
  ggplot(aes(x = sum)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 1, 
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Sum from rolling two dice, replicated one hundred times",
         x = "Outcome\nSum of Two Die",
         y = "Probability") +
    scale_x_continuous(breaks = seq(2, 12, 1), labels = 2:12) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +


We might consider labeling the y-axis in plots of empirical distributions as “Proportion” rather than “Probability” since it is an actual proportion, calculated from real (or simulated) data. We will keep it as “Probability” since we want to emphasize the parallels between mathematical, empirical and posterior probability distributions.

The posterior distribution for rolling two dice a hundred times depends on your beliefs. If you take the dice from your Monopoly set, you have reason to believe that the assumptions underlying the mathematical distribution are true. However, if you walk into a crooked casino and a host asks you to play craps, you might be suspicious. In craps, a come-out roll of 7 and 11 is a “natural,” resulting in a win for the “shooter” and a loss for the casino. You might expect those numbers to occur less often than they would with fair dice. Meanwhile, a come-out roll of 2, 3 or 12 is a loss for the shooter. You might also expect values like 2, 3 and 12 to occur more frequently. Your posterior distribution might look like this:

Someone less suspicious of the casino would have a posterior distribution which looks more like the mathematical distribution.

5.2.3 Presidential elections

Now let’s say we are building probability distributions for political events, like a presidential election. We want to know the probability that Democratic candidate wins X electoral votes, where X comes from the range of possible outcomes: 0 to 538. (The total number of electoral votes in US elections since 1964 is 538.)

We can start with a mathematical distribution for X which assumes that the chances of the Democratic candidate winning any given state’s electoral votes is 0.5 and the results from each state are independent.

We know that campaign platforms, donations, charisma, and many other factors will contribute to a candidate’s success. Elections are more complicated than coin tosses. We also know that many presidential elections in history have resulted in much bigger victories or defeats than this distribution seems to allow for.

The empirical distribution in this case could involve looking into past elections in the United States and counting the number of electoral votes that the Democrats won in each. For the empirical distribution, we create a tibble with electoral vote results from past elections. Looking at elections since 1964, we can observe that the number of electoral votes that the Democrats received in each one is different. Given that we only have 15 entries, it is difficult to draw conclusions or make predictions based off of this empirical distribution.

However, this model is enough to suggest that the assumptions of the mathematical probability distribution above do not work for electoral votes. The model assumes that the Democrats have a 50% chance of receiving each of the 538 votes. Just looking at the mathematical probability distribution, we can observe that receiving 13 or 17 or 486 votes out of 538 would be extreme and almost impossible under this mathematical model. However, our empirical distribution tells us that those were real election results.

The posterior distribution of electoral votes is a popular topic, and an area of strong disagreement, among data scientists. Consider this posterior from FiveThirtyEight.

Here is a posterior from the FiveThirtyEight website from August 13, 2020. This was created using the same data as the above distribution, but simply displayed differently. For each electoral result, the height of the bar represents the probability that a given event will occur. However, there are no lablels y-axis telling us what the specific probability of each outcome is. And that is OK! The specific values are not that useful. If we removed the labels on our y-axes, would it matter?

Here is the posterior from The Economist, also from August 13, 2020. This looks confusing at first because they chose to merge the axes for Republican and Democratic electoral votes. We can tell that The Economist was less optimistic, relative to FiveThirtyEight, about Trump’s chances in the election.

These two models, built by smart people using similar data sources, have reached fairly different conclusions. Data science is difficult! There is not one “right” answer. Real life is not a problem set.

Watch the makers of these two models throw shade at each other on Twitter! Eliot Morris is one of the primary authors of the Economist model. Nate Silver is in charge of 538. They don't seem to be too impressed with each other's work! More smack talk [here](https://statmodeling.stat.columbia.edu/2020/08/31/more-on-that-fivethirtyeight-prediction-that-biden-might-only-get-42-of-the-vote-in-florida/) and [here](https://statmodeling.stat.columbia.edu/2020/08/31/problem-of-the-between-state-correlations-in-the-fivethirtyeight-election-forecast/).

FIGURE 5.2: Watch the makers of these two models throw shade at each other on Twitter! Eliot Morris is one of the primary authors of the Economist model. Nate Silver is in charge of 538. They don’t seem to be too impressed with each other’s work! More smack talk here and here.

There are many political science questions you could explore with posterior distributions. They can relate to the past, present, or future.

  • Past: How many electoral votes would Hilary Clinton have won if she had picked a different VP?
  • Present: What are the total campaign donations from Harvard faculty?
  • Future: How many electoral votes will the Democratic candidate for president win in 2024?

5.2.4 Continuous distributions

The three examples above are all discrete probability distributions, meaning that the outcome variable can only take on a limited set of values. A coin flip has two outcomes. The sum of a pair of dice has 11 outcomes. The total electoral votes for the Democratic candidate has 539 possible outcomes. In the limit, we can also create continuous probability distributions which have an infinite number of possible outcomes. For example, the average height for an American male could be any real number between 0 inches and 100 inches. (Of course, an average value anywhere near 0 or 100 is absurd. The point is that the average could be 68.564, 68.5643, 68.56432 68.564327, or any real number.)

All the characteristics for discrete probability distributions which we reviewed above apply just as much to continuous probability distributions. For example, we can create mathematical, empirical and posterior probability distributions for continuous outcomes just as we did for discrete outcomes. For example, a posterior probability distribution for \(p\), the percentage of US citizens who approve of Biden, might look like:


  • The truth is out there. If we asked all 300+ million Americans whether or not they approve of President Biden, we could know \(p\) exactly. Alas, we can’t do that. We use a posterior probability distribution to summarize are beliefs about the true value of \(p\), a truth we can never confirm.

  • Continuous variables are a myth. Nothing that can be represented on a computer is truly continuous. Even something which appears continuous, like \(p\), actually can only take on a (very large) set of discrete variables. In this case, there are approximately 300 million possible true values of \(p\), one for each total number of people who approve of President Biden.

  • The math of continuous probability distributions can be tricky. Read a book on mathematical probability for all the messy details. Little of that matters in applied work.

  • The most important difference is that, with discrete distributions, it makes sense to estimate the probability of a specific outcome. What is the probability of rolling a 9? With continuous distributions, this makes no sense because there are an infinite number of possible outcomes. With continuous variables, we only estimate intervals. What is the probability that Biden’s true approval percentage is between 50% and 55%?

Don’t worry about the distinctions between discrete and continuous outcomes, or between the discrete and continuous probability distributions which we will use to summarize our beliefs about those outcomes. The basic intuition is the same in both cases.

5.2.5 Working with probability distributions

Bruno de Finetti, an Italian statistician who wrote a famous treatise on the theory of probability that began with the statement "PROBABILITY DOES NOT EXIST." This is because probability only exists subjectively in our minds.

FIGURE 5.3: Bruno de Finetti, an Italian statistician who wrote a famous treatise on the theory of probability that began with the statement “PROBABILITY DOES NOT EXIST.” This is because probability only exists subjectively in our minds.

A probability distribution is not always easy to work with. It is a complex object. And, in many contexts, we don’t really care about all that complexity. So, instead of providing the full probability distribution, we often just use a summary measure, a number or two or three which captures those aspects of the entire distribution which are relevant to the matter at hand. Let’s explore these issues using the 538 posterior probability distribution, as of August 13, 2020, for the number of electoral votes which will be won by Joe Biden. Here is a tibble with 1,000,000 draws from that distribution:

## # A tibble: 1,000,000 x 2
##       ID electoral_votes
##    <int>           <int>
##  1     1             333
##  2     2             352
##  3     3             338
##  4     4             437
##  5     5             346
##  6     6             171
##  7     7             318
##  8     8             210
##  9     9             261
## 10    10             298
## # … with 999,990 more rows

A distribution and a sample of draws from that distribution are different things. But, if you squint, they are sort of the same thing, at least for our purposes. For example, if you want to know the mean of the distribution, then the mean of the draws will be a fairly good estimate, especially if the number of draws is large enough.

Recall from Chapter 2 how we can draw randomly from specified probability distributions:

##  [1]  0.030  0.732  0.091  0.263 -0.100 -1.533 -0.194  0.604 -1.913  0.345
##  [1] 0.128 0.618 0.975 0.543 0.374 0.920 0.709 0.210 0.048 0.936

The elements of these vectors are all “draws” from the specified probability distributions. In most applied situations, our tools will produce draws rather than summary objects. Fortunately, a vector of draws is very easy to work with. Start with summary statistics:

key_stats <- draws %>% 
  summarize(mn = mean(electoral_votes),
            md = median(electoral_votes),
            sd = sd(electoral_votes),
            mad = mad(electoral_votes))

## # A tibble: 1 x 4
##      mn    md    sd   mad
##   <dbl> <dbl> <dbl> <dbl>
## 1  325.   326  86.9  101.

Calculate a 95% interval directly:

quantile(draws$electoral_votes, probs = c(0.025, 0.975))
##  2.5% 97.5% 
##   172   483

Approximate the 95% interval in two ways:

c(key_stats$mn - 2 * key_stats$sd, 
  key_stats$mn + 2 * key_stats$sd)
## [1] 152 499
c(key_stats$md - 2 * key_stats$mad, 
  key_stats$md + 2 * key_stats$mad)
## [1] 124 528

In this case, using the mean and standard deviation produces a 95% interval which is closer to the true interval. In other cases, the median and scaled median absolute deviation will do better. Either approximation is generally “good enough” for most work. But, if you need to know the exact 95% interval, you must use quantile().

5.2.6 Unnormalized distributions

Remember that probability distributions are mathematical objects that cover a set of outcomes, where each outcome in the domain is mapped to a probability value between 0 and 1 inclusive and the sum of all mappings is 1. Sometimes, you may see distributions similar to probability distributions, only the y-axis displays raw counts instead of proportions. Unnormalized distributions are not probability distributions, but it is easy to convert between the two. You simply divide all the outcome counts on the y-axis by the sum of all outcome counts to “normalize” the unnormalized distribution. Unnormalized distributions are often an intermediary step; it is sometimes handy to work with counts until the very end.

For instance, we can generate the following unnormalized distribution for the sum of rolling two dice. (This uses the same code as above, but without the normalization step.)

Notice that the shape of the distribution is the same as the empirical probability distribution we generated earlier, except that the y-axis is labeled differently.

The two plots — unnormalized and normalized — have the exact same shape. In many ways, they are the same object. Yet normalization is required if we want to work with a probability distribution.

5.2.7 Joint distributions

Recall that \(\text{Prob}(\text{coin})\) is the probability distribution for the result of a coin toss. It includes two parts, the probability of heads (\(\rho_h\)) and the probability of tails (\(\rho_t\)). This is a univariate distribution because there is only one outcome, which can be heads or tails. If there is more than one outcome, then we have a joint distribution.

Joint distributions are also mathematical objects that cover a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 and the sum of all chances must equal 1. The key to a joint distribution is it measures the chance that both events A and B will occur. The notation is \(\text{Prob}(A, B)\).

Let’s say that you are rolling two six-sided dice simultaneously. Die 1 is weighted so that there is a 50% chance of rolling a 6 and a 10% chance of each of the other values. Die 2 is weighted so there is a 50% chance of rolling a 5 and a 10% chance of rolling each of the other values. Let’s roll both dice 1,000 times. In previous examples involving two dice, we cared about the sum of results and not the outcomes of the first versus the second die of each simulation. With a joint distributions, the order matters; so instead of 11 possible outcomes on the x-axis of our distribution plot (ranging from 2 to 12), we have 36. Furthermore, a 2D probability distribution is not sufficient to represent all of the variables involved, so the joint distribution for this example is displayed using a 3D plot.

## Warning in make_shadow(heightmap, shadowdepth, shadowwidth, background, :
## `magick` package required for smooth shadow--using basic shadow instead.