5 Probability

The usual touchstone of whether what someone asserts is mere persuasion or at least a subjective conviction, i.e., firm belief, is betting. Often someone pronounces his propositions with such confident and inflexible defiance that he seems to have entirely laid aside all concern for error. A bet disconcerts him. Sometimes he reveals that he is persuaded enough for one ducat but not for ten. For he would happily bet one, but at ten he suddenly becomes aware of what he had not previously noticed, namely that it is quite possible that he has erred. — Immanuel Kant, Critique of Pure Reason

The central tension, and opportunity, in data science is the interplay between the data and the science, between our empirical observations and the models which we use to understand them. Probability is the language we use to explore that interplay; it connects models to data, and data to models.

The only package students need in this chapter is tidyverse.

We also load the rayshader and rgl packages. We use them to create the 3-D plots below. You can read more about rayshader here but we will not be explaining how to use it in this Primer.
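
In code, the setup for this chapter looks like this, assuming all three packages are already installed:

library(tidyverse)

# rayshader and rgl are only needed for the 3-D joint-distribution plots at
# the end of the chapter.
library(rayshader)
library(rgl)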

5.1 Probability distributions


FIGURE 5.1: Dice and Probability.

What does it mean that Trump had a 30% chance of winning re-election in the fall of 2020? That there is a 90% probability of rain today? That the dice at the casino are unfair?

Probability is about quantifying uncertainty. We can think of probability as a proportion. The probability of an event occurring is a number from 0 to 1, where 0 means that the event is impossible and 1 means that the event is 100% certain.

Let’s begin with the simplest events: coin flips and dice rolls. If the dice and the coins are fair, we can operate under the assumption that all outcomes are equally likely.

This allows us to make the following statements:

  • The probability of rolling a 1 or a 2 is 2/6, or 1/3.
  • The probability of rolling a 1, 2, 3, 4, 5, or 6 is 1.
  • The probability of flipping a coin and getting tails is 1/2.

For the purposes of this Primer, a probability distribution is a mathematical object that covers a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 inclusive. The chances must sum to 1. The set of possible outcomes — heads or tails for the coin, 1 through 6 for a single die, 2 through 12 for a pair of dice — can be either discrete or continuous. This set of outcomes is the domain of the probability distribution. There are three types of probability distributions: mathematical, empirical, and posterior.

The key difference between a distribution, as we have explored them in earlier chapters, and a probability distribution is the requirement that the sum of the probabilities of the individual outcomes must be exactly 1. There is no such requirement in a distribution. But any distribution can be turned into a probability distribution by “normalizing” it, as we will explore. In this context, we will often refer to a distribution which is not (yet) a probability distribution as an “unnormalized” distribution.

Pay attention to notation. Whenever we are talking about a specific probability (represented by a single value), we will use \(\rho\) (the Greek letter “rho” but spoken aloud as “p” by us) with a subscript which specifies the exact outcome of which it is the probability. For instance, \(\rho_h = 0.5\) denotes the probability of getting heads on a coin toss when the coin is fair. \(\rho_t\) — spoken as “PT” or “P sub T” or “P tails” — denotes the probability of getting tails on a coin toss. However, when we are referring to the entire probability distribution over a set of outcomes, we will use \(\text{Prob}()\). For example, the probability distribution of a coin toss is \(\text{Prob}(\text{coin})\). That is, \(\text{Prob}(\text{coin})\) is composed of the two specific probabilities (50% and 50%) mapped from the two values in the domain (Heads and Tails). Similarly, \(\text{Prob}(\text{sum of two dice})\) is the probability distribution over the set of 11 outcomes (2 through 12) which are possible when you take the sum of two dice. \(\text{Prob}(\text{sum of two dice})\) is made up of 11 numbers — \(\rho_2\), \(\rho_3\), …, \(\rho_{12}\) — each representing the unknown probability that the sum will equal their value. That is, \(\rho_2\) is the probability of rolling a 2.
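
To make the notation concrete, we can write \(\text{Prob}(\text{coin})\) for a fair coin as a small tibble which maps each element of the domain to its probability. (The object name prob_coin is ours and appears only in this illustration.)

# Prob(coin) for a fair coin: each outcome in the domain is mapped to a
# probability, and the probabilities sum to 1.

prob_coin <- tibble(outcome = c("Heads", "Tails"),
                    prob    = c(0.5, 0.5))

sum(prob_coin$prob)   # Must be exactly 1 for a valid probability distribution.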

5.1.1 Flipping a coin

A mathematical distribution is based on a mathematical formula. Assuming that the coin is perfectly fair, we should, on average, get heads as often as we get tails.

An empirical distribution is based on data. You can think of this as the probability distribution created by running a simulation. In theory, if we increase the number of coins we flip in our simulation, the empirical distribution will look more and more similar to the mathematical distribution. The mathematical distribution is the Platonic form. The empirical distribution will often look like the mathematical probability distribution, but it will rarely be exactly the same.

In this simulation, there are 56 heads and 44 tails. The outcome will vary every time we run the simulation, but the proportion of heads to tails should not be too different if this coin is fair.

# We are flipping one fair coin a hundred times. We need to get the same result
# each time we create this graphic because we want the results to match the
# description in the text. Using set.seed() guarantees that the random results
# are the same each time. We define 0 as tails and 1 as heads.

set.seed(3)

tibble(results = sample(c(0, 1), 100, replace = TRUE)) %>% 
  ggplot(aes(x = results)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 0.5, 
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Flipping one coin a hundred times",
         x = "Outcome\nResult of Coin Flip",
         y = "Probability") +
    scale_x_continuous(breaks = c(0, 1), 
                       labels = c("Tails", "Heads")) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

A posterior distribution is based on beliefs and expectations. It displays your belief about things you can’t see right now. You may have posterior distributions for events in the past, present, or future.

In the case of the coin toss, the posterior distribution changes depending on your beliefs. For instance, let's say your friend brought a coin to school and offered you a bet: if the result is heads, you have to pay them $5.

This makes you suspicious; your posterior distribution would reflect this. You might believe that \(\rho_h\) is 0.95 and \(\rho_t\) is 0.05.

The full terminology is mathematical (or empirical or posterior) probability distribution. But we will often shorten this to just mathematical (or empirical or posterior) distribution. The word “probability” is understood, even if it is not present.

5.1.2 Rolling two dice

Our mathematical distribution tells us that, with a fair die, the probabilities of rolling 1, 2, 3, 4, 5, and 6 are all equal: there is a 1/6 chance of each. When we roll two dice at the same time and sum the numbers, the values closest to the middle are more common than values at the edges because there are more combinations of numbers that add up to the middle values.
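
Although we do not show the code for the mathematical distribution's plot, here is a sketch of how the distribution itself can be computed exactly (the object name math_dist_dice is ours):

# Enumerate all 36 equally likely (die_1, die_2) pairs, count how many pairs
# produce each possible sum, and divide by 36 to turn counts into
# probabilities.

math_dist_dice <- expand_grid(die_1 = 1:6, die_2 = 1:6) %>% 
  mutate(sum = die_1 + die_2) %>% 
  count(sum) %>% 
  mutate(prob = n / 36)

math_dist_dice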

We get an empirical distribution by running a simulation and rolling two dice a hundred times. The result is not identical to the mathematical distribution because of the inherent randomness of the real world.

# In the coin example, we created the vector ahead of time and then assigned
# that vector to a tibble. There was nothing wrong with that approach. And we
# could do the same thing here. But the use of map_* functions is more powerful
# and will be necessary in later chapters. Still need set.seed(), as we almost
# always will when creating random objects.

set.seed(1)

emp_dist_dice <- tibble(ID = 1:100) %>% 
  mutate(die_1 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(die_2 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(sum = die_1 + die_2) %>% 
  ggplot(aes(x = sum)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 1, 
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Sum from rolling two dice, replicated one hundred times",
         x = "Outcome\nSum of Two Die",
         y = "Probability") +
    scale_x_continuous(breaks = seq(2, 12, 1), labels = 2:12) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

emp_dist_dice

We might consider labeling the y-axis in plots of empirical distributions as “Proportion” rather than “Probability” since it is an actual proportion, calculated from real (or simulated) data. We will keep it as “Probability” since we want to emphasize the parallels between mathematical, empirical and posterior probability distributions.

The posterior distribution for rolling two dice a hundred times depends on your expectations. If you take the dice from your Monopoly set, you have reason to believe that the assumptions underlying the mathematical distribution are true. However, if you walk into a crooked casino and a host asks you to play craps, you might be suspicious. In craps, a come-out roll of 7 or 11 is a “natural,” resulting in a win for the “shooter” and a loss for the casino. You might expect those numbers to occur less often than they would with fair dice. Meanwhile, a come-out roll of 2, 3 or 12 is a loss for the shooter. You might also expect values like 2, 3 and 12 to occur more frequently. Your posterior distribution might look like this:
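
Here is one way to write such a posterior down in code. The probabilities below are invented purely for illustration; all that matters is that they sum to 1 while shifting weight away from 7 and 11 and toward 2, 3 and 12.

# A made-up posterior for the sum of two dice at a crooked casino. Compared
# to the mathematical distribution, 7 and 11 get less probability and 2, 3
# and 12 get more. The eleven probabilities still sum to 1.

crooked_posterior <- tibble(sum  = 2:12,
                            prob = c(0.05, 0.07, 0.08, 0.11, 0.14, 0.10,
                                     0.14, 0.11, 0.08, 0.04, 0.08))

sum(crooked_posterior$prob)   # Still exactly 1, as required.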

Someone less suspicious of the casino would have a posterior distribution which looks more like the mathematical distribution.

5.1.3 Presidential elections

Now let’s say we are building probability distributions for political events, like a presidential election. We want to know the probability that Democratic candidate wins X electoral votes, where X comes from the range of possible outcomes: 0 to 538. (The total number of electoral votes in US elections since 1964 is 538.)

We can start with a mathematical distribution for X which assumes that the chance of the Democratic candidate winning each of the 538 electoral votes is 0.5 and that the votes are independent of one another.
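
Under these two assumptions, X follows a binomial distribution with 538 trials and a success probability of 0.5. A sketch of that mathematical distribution (the object name is ours):

# dbinom() gives the exact probability of each possible outcome under a
# Binomial(538, 0.5) model: 538 independent electoral votes, each won by the
# Democratic candidate with probability 0.5.

math_dist_ev <- tibble(votes = 0:538,
                       prob  = dbinom(votes, size = 538, prob = 0.5))

# Almost all of the probability sits close to 269 votes; outcomes like 13,
# 17 or 486 votes are essentially impossible under this model.
math_dist_ev %>% 
  slice_max(prob, n = 3)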

We know that campaign platforms, donations, charisma, and many other factors will contribute to a candidate’s success. Elections are more complicated than coin tosses. We also know that many presidential elections in history have resulted in much bigger victories or defeats than this distribution seems to allow for.

The empirical distribution in this case could involve looking at past elections in the United States and counting the number of electoral votes that the Democrats won in each. For the empirical distribution, we create a tibble with electoral vote results from past elections. Looking at elections since 1964, we can observe that the number of electoral votes that the Democrats received in each one is different. Given that we only have 15 entries, it is difficult to draw conclusions or make predictions based on this empirical distribution.

However, this empirical distribution is enough to suggest that the assumptions of the mathematical probability distribution above do not work for electoral votes. That model assumes that the Democrats have a 50% chance of receiving each of the 538 votes. Just looking at the mathematical probability distribution, we can observe that receiving 13 or 17 or 486 votes out of 538 would be extreme and almost impossible under this mathematical model. However, our empirical distribution tells us that those were real election results.

The posterior distribution of electoral votes is a popular topic, and an area of strong disagreement, among data scientists. Consider this posterior from FiveThirtyEight.

Here is a posterior from the FiveThirtyEight website from August 13, 2020. This was created using the same data as the above distribution, but simply displayed differently. For each electoral result, the height of the bar represents the probability that the event will occur. However, there are no labels on the y-axis telling us what the specific probability of each outcome is. And that is OK! The specific values are not that useful. If we removed the labels on our y-axes, would it matter?

Here is the posterior from The Economist, also from August 13, 2020. This looks confusing at first because they chose to merge the axes for Republican and Democratic electoral votes. We can tell that The Economist was less optimistic, relative to FiveThirtyEight, about Trump’s chances in the election.

These two models, built by smart people using similar data sources, have reached fairly different conclusions. Data science is difficult! There is not one “right” answer. Real life is not a problem set.


FIGURE 5.2: Watch the makers of these two models throw shade at each other on Twitter! Elliott Morris is one of the primary authors of the Economist model. Nate Silver is in charge of 538. They don’t seem to be too impressed with each other’s work! More smack talk [here](https://statmodeling.stat.columbia.edu/2020/08/31/more-on-that-fivethirtyeight-prediction-that-biden-might-only-get-42-of-the-vote-in-florida/) and [here](https://statmodeling.stat.columbia.edu/2020/08/31/problem-of-the-between-state-correlations-in-the-fivethirtyeight-election-forecast/).

There are many political science questions you could explore with posterior distributions. They can relate to the past, present, or future.

  • Past: How many electoral votes would Hillary Clinton have won if she had picked a different VP?
  • Present: What are the total campaign donations from Harvard faculty?
  • Future: How many electoral votes will the Democratic candidate for president win in 2024?

5.1.4 Continuous distributions

The three examples above are all discrete probability distributions, meaning that the outcome variable can only take on a limited set of values. A coin flip has two outcomes. The sum of a pair of dice has 11 outcomes. The total electoral votes for the Democratic candidate has 539 possible outcomes. In the limit, we can also create continuous probability distributions which have an infinite number of possible outcomes. For example, the percentage of Americans who “approve” of the job that President Biden is doing is a continuous variable, meaning it can take on any value between 0% and 100%.

All the characteristics of discrete probability distributions which we reviewed above apply just as much to continuous probability distributions. We can create mathematical, empirical and posterior probability distributions for continuous outcomes just as we did for discrete outcomes. For instance, a posterior probability distribution for \(p\), the percentage of US citizens who approve of Biden, might look like:
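
In code, one invented stand-in for such a posterior is a Beta distribution. The parameters below are made up for illustration; they are not estimated from any poll.

# Purely illustrative: pretend our posterior beliefs about p, Biden's true
# approval percentage, follow a Beta(520, 480) distribution, which piles
# most of its probability a little above 50%.

set.seed(8)

p_draws <- rbeta(100000, shape1 = 520, shape2 = 480)

# With a continuous outcome we ask about intervals rather than exact values:
# under these made-up beliefs, the probability that p falls between 50% and
# 55% is just the proportion of draws in that range.
mean(p_draws > 0.50 & p_draws < 0.55)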

Comments:

  • The truth is out there. If we asked all 300+ million Americans whether or not they approve of President Biden, we could know \(p\) exactly. Alas, we can’t do that. We use a posterior probability distribution to summarize our beliefs about the true value of \(p\), a truth we can never confirm.

  • Continuous variables are a myth. Nothing that can be represented on a computer is truly continuous. Even something which appears continuous, like \(p\), can actually only take on a (very large) set of discrete values. In this case, there are approximately 300 million possible true values of \(p\), one for each possible total number of people who approve of President Biden.

  • The math of continuous probability distributions can be tricky. Read a book on mathematical probability for all the messy details. Little of that matters in applied work.

  • The most important difference is that, with discrete distributions, it makes sense to estimate the probability of a specific outcome. What is the probability of rolling a 9? With continuous distributions, this makes no sense because there are an infinite number of possible outcomes. With continuous variables, we only estimate intervals. What is the probability that Biden’s true approval percentage is between 50% and 55%?

Don’t worry about the distinctions between discrete and continuous outcomes, or between the discrete and continuous probability distributions which we will use to summarize our beliefs about those outcomes. The basic intuition is the same in both cases.

5.1.5 Working with probability distributions


FIGURE 5.3: Bruno de Finetti, an Italian statistician who wrote a famous treatise on the theory of probability that began with the statement “PROBABILITY DOES NOT EXIST.” This is because probability only exists subjectively in our minds.

A probability distribution is not always easy to work with. It is a complex object. And, in many contexts, we don’t really care about all that complexity. So, instead of providing the full probability distribution, we often just use a summary measure, a number or two or three which captures those aspects of the entire distribution which are relevant to the matter at hand. Let’s explore these issues using the 538 posterior probability distribution, as of August 13, 2020, for the number of electoral votes which will be won by Joe Biden. Here is a tibble with 1,000,000 draws from that distribution:

draws
## # A tibble: 1,000,000 x 2
##       ID electoral_votes
##    <int>           <int>
##  1     1             333
##  2     2             352
##  3     3             338
##  4     4             437
##  5     5             346
##  6     6             171
##  7     7             318
##  8     8             210
##  9     9             261
## 10    10             298
## # … with 999,990 more rows

A distribution and a sample of draws from that distribution are different things. But, if you squint, they are sort of the same thing, at least for our purposes. For example, if you want to know the mean of the distribution, then the mean of the draws will be a fairly good estimate, especially if the number of draws is large enough.
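
The draws object above comes from the FiveThirtyEight posterior itself, which we cannot recreate here. If you want something runnable to experiment with, here is a rough stand-in which assumes only a similar center and spread (roughly mean 325 and standard deviation 87, matching the summary statistics below); it is not the actual posterior.

# A rough stand-in for draws: one million draws from a normal distribution
# with a similar center and spread, rounded to whole electoral votes and
# clamped to the possible range of 0 to 538. This only mimics the shape of
# the FiveThirtyEight posterior.

set.seed(5)

draws_sketch <- tibble(ID = 1:1000000) %>% 
  mutate(electoral_votes = rnorm(1000000, mean = 325, sd = 87) %>% 
           round() %>% 
           pmin(538) %>% 
           pmax(0) %>% 
           as.integer())

Rename draws_sketch to draws if you want to run the chunks below yourself; your numbers will differ slightly from the ones printed here.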

Recall from Chapter 2 how we can draw randomly from specified probability distributions:

rnorm(10)
##  [1]  0.030  0.732  0.091  0.263 -0.100 -1.533 -0.194  0.604 -1.913  0.345
runif(10)
##  [1] 0.128 0.618 0.975 0.543 0.374 0.920 0.709 0.210 0.048 0.936

The elements of these vectors are all “draws” from the specified probability distributions. In most applied situations, our tools will produce draws rather than summary objects. Fortunately, a vector of draws is very easy to work with. Start with summary statistics:

key_stats <- draws %>% 
  summarize(mn = mean(electoral_votes),
            md = median(electoral_votes),
            sd = sd(electoral_votes),
            mad = mad(electoral_votes))

key_stats
## # A tibble: 1 x 4
##      mn    md    sd   mad
##   <dbl> <dbl> <dbl> <dbl>
## 1  325.   326  86.9  101.

Calculate a 95% interval directly:

quantile(draws$electoral_votes, probs = c(0.025, 0.975))
##  2.5% 97.5% 
##   172   483

Approximate the 95% interval in two ways:

c(key_stats$mn - 2 * key_stats$sd, 
  key_stats$mn + 2 * key_stats$sd)
## [1] 152 499
c(key_stats$md - 2 * key_stats$mad, 
  key_stats$md + 2 * key_stats$mad)
## [1] 124 528

In this case, using the mean and standard deviation produces a 95% interval which is closer to the true interval. In other cases, the median and scaled median absolute deviation will do better. Either approximation is generally “good enough” for most work. But, if you need to know the exact 95% interval, you must use quantile().

5.1.6 Unnormalized distributions

Remember that probability distributions are mathematical objects that cover a set of outcomes, where each outcome in the domain is mapped to a probability value between 0 and 1 inclusive and the sum of all mappings is 1. Sometimes, you may see distributions similar to probability distributions, only the y-axis displays raw counts instead of proportions. Unnormalized distributions are not probability distributions, but it is easy to convert between the two. You simply divide all the outcome counts on the y-axis by the sum of all outcome counts to “normalize” the unnormalized distribution. Unnormalized distributions are often an intermediary step; it is sometimes handy to work with counts until the very end.

For instance, we can generate the following unnormalized distribution for the sum of rolling two dice. (This uses the same code as above, but without the normalization step.)
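
Since that chunk is just the earlier simulation with the after_stat() normalization removed, it looks roughly like this:

# Same simulation as before, but geom_histogram() now plots raw counts, so
# the y-axis shows how many of the one hundred rolls produced each sum.

set.seed(1)

tibble(ID = 1:100) %>% 
  mutate(die_1 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(die_2 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(sum = die_1 + die_2) %>% 
  ggplot(aes(x = sum)) +
    geom_histogram(binwidth = 1, color = "white") +
    labs(title = "Unnormalized Distribution",
         subtitle = "Sum from rolling two dice, replicated one hundred times",
         x = "Outcome\nSum of Two Dice",
         y = "Count") +
    scale_x_continuous(breaks = seq(2, 12, 1), labels = 2:12) +
    theme_classic()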

Notice that the shape of the distribution is the same as the empirical probability distribution we generated earlier, except that the y-axis is labeled differently.

The two plots — unnormalized and normalized — have the exact same shape. In many ways, they are the same object. Yet normalization is required if we want to work with a probability distribution.

5.1.7 Joint distributions

Recall that \(\text{Prob}(\text{coin})\) is the probability distribution for the result of a coin toss. It includes two parts, the probability of heads (\(\rho_h\)) and the probability of tails (\(\rho_t\)). This is a univariate distribution because there is only one outcome, which can be heads or tails. If there is more than one outcome, then we have a joint distribution.

Joint distributions are also mathematical objects that cover a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 and the sum of all chances must equal 1. The key to a joint distribution is it measures the chance that both events A and B will occur. The notation is \(\text{Prob}(A, B)\).

Let's say that you are rolling two six-sided dice simultaneously. Die 1 is weighted so that there is a 50% chance of rolling a 6 and a 10% chance of each of the other values. Die 2 is weighted so that there is a 50% chance of rolling a 5 and a 10% chance of each of the other values. Let's roll both dice 1,000 times. In previous examples involving two dice, we cared about the sum of the results and not about the outcome of the first versus the second die in each simulation. With a joint distribution, the order matters; so instead of 11 possible outcomes on the x-axis of our distribution plot (ranging from 2 to 12), we have 36 possible (die 1, die 2) pairs. Furthermore, a 2-D plot is not sufficient to show every pair of outcomes along with its probability, so the joint distribution for this example is displayed using a 3-D plot.
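
Here is a sketch of that simulation (the object name is ours, and we omit the rayshader plotting code):

# Roll the two weighted dice 1,000 times, then count how often each of the
# 36 possible (die_1, die_2) pairs occurs. Dividing by the number of rolls
# gives the empirical joint distribution Prob(die_1, die_2).

set.seed(7)

joint_dist <- tibble(ID = 1:1000) %>% 
  mutate(die_1 = map_dbl(ID, ~ sample(1:6, size = 1, 
                                      prob = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5))),
         die_2 = map_dbl(ID, ~ sample(1:6, size = 1, 
                                      prob = c(0.1, 0.1, 0.1, 0.1, 0.5, 0.1)))) %>% 
  count(die_1, die_2) %>% 
  mutate(prob = n / sum(n))

joint_dist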