5 Probability

This chapter is a draft. There are errors.

The usual touchstone of whether what someone asserts is mere persuasion or at least a subjective conviction, i.e., firm belief, is betting. Often someone pronounces his propositions with such confident and inflexible defiance that he seems to have entirely laid aside all concern for error. A bet disconcerts him. Sometimes he reveals that he is persuaded enough for one ducat but not for ten. For he would happily bet one, but at ten he suddenly becomes aware of what he had not previously noticed, namely that it is quite possible that he has erred. - Immanuel Kant, Critique of Pure Reason

The central tension, and opportunity, in data science is the interplay between the data and the science, between our empirical observations and the models which we use to understand them. Probability is the language we use to explore that interplay; it connects models to data, and data to models.

What does it mean that Trump had a 30% chance of winning the election in the fall of 2016? That there is a 90% probability of rain today? That the dice at the casino are fair?

Probability quantifies uncertainty. Think of probability as a proportion. The probability of an event occurring is a number from 0 to 1, where 0 means that the event is impossible and 1 means that the event is 100% certain.

Begin with the simplest events: coin flips and dice rolls. The set of all outcomes is the sample space. With fair coins and dice, we know that:

  • The probability of rolling a 1 or a 2 is 2/6, or 1/3.
  • The probability of rolling a 1, 2, 3, 4, 5, or 6 is 1.
  • The probability of flipping a coin and getting tails is 1/2.
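These values can be checked by counting outcomes in the sample space. The sketch below assumes a fair six-sided die and a fair coin, so that every outcome is equally likely. The names `die`, `coin` and `prob_of()` are ours, introduced only for illustration.

```r
# Sample spaces for a fair die and a fair coin.
die  <- 1:6
coin <- c("heads", "tails")

# With equally likely outcomes, the probability of an event is the
# proportion of the sample space which the event covers.
# prob_of() is a hypothetical helper, not a function from any package.
prob_of <- function(event, sample_space) {
  mean(sample_space %in% event)
}

p_one_or_two <- prob_of(c(1, 2), die)    # 2 of the 6 outcomes
p_any_face   <- prob_of(1:6, die)        # the whole sample space
p_tails      <- prob_of("tails", coin)   # 1 of the 2 outcomes
```

This counting approach only works when every outcome in the sample space is equally likely, which is why it needs the fairness assumption.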

If the probability of an outcome is unknown, we will often refer to it as an unknown parameter, something which we might use data to estimate. We usually use Greek letters to refer to parameters. Whenever we are talking about a specific probability (represented by a single value), we will use \(\rho\) (the Greek letter “rho” but spoken aloud as “p” by us) with a subscript which specifies the exact outcome of which it is the probability. For instance, \(\rho_h = 0.5\) denotes the probability of getting heads on a coin toss when the coin is fair. \(\rho_t\) — spoken as “PT” or “P sub T” or “P tails” — denotes the probability of getting tails on a coin toss. This notation can become annoying if the outcome whose probability we seek is less concise. For example, we might write the probability of rolling a 1, 2 or 3 using a fair die as:

\[ \rho_{die\ roll\ is\ 1,\ 2\ or\ 3} = 0.5 \]

We will rarely write out the full definition of the event with the \(\rho\) symbol. Instead, we will define an event “a” as when a rolled die equals 1, 2 or 3 and, then, write

\[\rho_a = 0.5\]

A random variable is a function which produces a value from a sample space. A random variable can be either discrete (the sample space has a limited number of members, like H or T for the result of a coin flip, or 2, 3, …, 12 for the sum of two dice) or continuous (any value within a range). Probability is a claim about the value of a random variable, e.g., that you have a 50% probability of getting a 1, 2 or 3 when you roll a fair die.

We usually use capital letters for random variables. So, \(C\) might be our symbol for the random variable which is a coin toss and \(D\) might be our symbol for the random variable which is the sum of two dice. When discussing random variables in general, or when we grow tired of coming up with new symbols, we will use \(Y\).

Small letters refer to a single outcome or result from a random variable. \(c\) is the outcome from one coin toss. \(d\) is the result from one throw of the dice. The value of the outcome must come from the sample space. So, \(c\) can only take on two possible values: heads or tails. When discussing random variables in general, we use \(y\) to refer to one outcome of the random variable \(Y\). If there are multiple outcomes — if we have, for example, flipped the coin multiple times — then we use subscripts to indicate the separate outcomes: \(y_1\), \(y_2\), and so on. The symbol for an arbitrary outcome is \(y_i\), where \(i\) ranges from 1 through \(N\), the total number of events or experiments for which an outcome \(y\) was produced.

The only package we need in this chapter is tidyverse.

To understand probability more fully, we first need to understand distributions.

5.1 Distributions

A variable in a tibble is a column, a vector of values. We sometimes refer to this vector as a “distribution.” This is somewhat sloppy in that a distribution can be many things, most commonly a mathematical formula. But, strictly speaking, a “frequency distribution” or an “empirical distribution” is a list of values, so this usage is not unreasonable.

5.1.1 Scaling distributions

Consider the vector which is the result of rolling one die 10 times.

ten_rolls <- c(5, 5, 1, 5, 4, 2, 6, 2, 1, 5)

There are other ways of storing the data in this vector. Instead of reporting every observation, we could record the number of times each value appears and the percentage of the total which this number accounts for.

Distribution of Ten Rolls of a Fair Die
Counts and percentages reflect the same information

Outcome   Count   Percentage
1         2       20%
2         2       20%
4         1       10%
5         4       40%
6         1       10%

In this case, with only 10 values, it is actually less efficient to store the data like this. But what happens when we have 1,000 rolls?

Distribution of One Thousand Rolls of a Fair Die
Counts and percentages reflect the same information

Outcome   Count   Percentage
1         190     19%
2         138     14%
3         160     16%
4         173     17%
5         169     17%
6         170     17%

Instead of keeping around a vector of length 1,000, we can just keep 12 values — the 6 possible outcomes and their frequency — without losing any information.
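A table of counts and shares like the ones above can be built directly from the vector. This is a minimal sketch using count() and mutate() from the tidyverse. The object name `roll_summary` and the column names are our own choices, and the share is stored as a proportion between 0 and 1.

```r
library(tidyverse)

ten_rolls <- c(5, 5, 1, 5, 4, 2, 6, 2, 1, 5)

# One row per distinct outcome, with its count and its share of the total.
roll_summary <- tibble(outcome = ten_rolls) %>%
  count(outcome, name = "count") %>%
  mutate(proportion = count / sum(count))

roll_summary
```

Outcomes which never occur, like 3 in this vector, do not get a row.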

Two distributions can be identical even if they are of very different lengths. Let’s compare our original distribution of 10 rolls of the die with another distribution which just features 100 copies of those 10 rolls.

more_rolls <- rep(ten_rolls, 100)

The two graphs have the exact same shape because, even though the vectors are of different lengths, the relative proportions of the outcomes are identical. In some sense, both vectors are from the same distribution. Relative proportions, not the total counts, are what matter.
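The claim that the two vectors share the same relative proportions can be checked numerically. A short sketch in base R follows. The names `prop_ten` and `prop_more` are introduced here for illustration.

```r
ten_rolls  <- c(5, 5, 1, 5, 4, 2, 6, 2, 1, 5)
more_rolls <- rep(ten_rolls, 100)

# prop.table() converts a table of counts into proportions.
prop_ten  <- prop.table(table(ten_rolls))
prop_more <- prop.table(table(more_rolls))

length(ten_rolls)    # 10
length(more_rolls)   # 1000

# The lengths differ, but the proportions agree for every outcome.
same_shape <- isTRUE(all.equal(as.numeric(prop_ten), as.numeric(prop_more)))
same_shape
```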

5.1.2 Normalizing distributions

If two distributions have the same shape, then they only differ by the labels on the y-axis. There are various ways of “normalizing” distributions to put them all on the same scale. The most common scale is one in which the area under the distribution adds to 1, i.e., 100%. For example, we can transform the plots above to look like:

We sometimes refer to a distribution as “unnormalized” if the area under the curve does not add up to 1.
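For a discrete distribution stored as counts, normalizing amounts to dividing each count by the total. The sketch below uses the counts from the table of one thousand rolls. The names `counts` and `normalized` are ours.

```r
# Counts for outcomes 1 through 6 from the one thousand rolls.
# These are unnormalized: they add up to 1,000 rather than to 1.
counts <- c(`1` = 190, `2` = 138, `3` = 160, `4` = 173, `5` = 169, `6` = 170)

# Divide by the total so that the values add up to 1.
normalized <- counts / sum(counts)

sum(counts)
sum(normalized)
```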

5.1.3 Simulating distributions

There are two distinct concepts: a distribution and a set of values drawn from that distribution. But, in everyday use, we use “distribution” for both. When given a distribution (meaning a vector of numbers), we often use geom_histogram() or geom_density() to graph it. But, sometimes, we don’t want to look at the whole thing. We just want some summary measures which report the key aspects of the distribution. The two most important attributes of a distribution are its center and its variation around that center.

We use summarize() to calculate statistics for a variable, a column, a vector of values or a distribution. Note the language sloppiness. For the purposes of this book, “variable,” “column,” “vector,” and “distribution” all mean the same thing. Popular statistical functions include: mean(), median(), min(), max(), n() and sum(). Functions which may be new to you include three measures of the “spread” of a distribution: sd() (the standard deviation), mad() (the scaled median absolute deviation) and quantile(), which is used to calculate an interval which includes a specified proportion of the values.
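As an illustration of these summary functions, the sketch below draws 1,000 values and reports measures of center and spread. The stand-in distribution (rolls of a fair die), the seed and the column names are all assumptions made for the example.

```r
library(tidyverse)

set.seed(9)  # arbitrary seed, only so that the sketch is reproducible

# 1,000 draws from an assumed stand-in distribution: rolls of a fair die.
draws <- tibble(value = sample(1:6, size = 1000, replace = TRUE))

# Two measures of center, two measures of spread, and the end points of
# an interval which contains roughly the middle 95% of the values.
stats <- draws %>%
  summarize(center_mean   = mean(value),
            center_median = median(value),
            spread_sd     = sd(value),
            spread_mad    = mad(value),
            lower_95      = quantile(value, probs = 0.025),
            upper_95      = quantile(value, probs = 0.975))

stats
```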

Think of the distribution of a variable as an urn from which we can pull out, at random, values for that variable. Drawing a thousand or so values from that urn, and then looking at a histogram, can show where the values are centered and how they vary. Because people are sloppy, they will use the word distribution to refer to at least three related entities:

  1. the (imaginary!) urn from which we are drawing values.
  2. all the values in the urn
  3. all the values which we have drawn from the urn, whether that be 10 or 1,000

Sloppiness in the usage of the word distribution is universal. However, keep three distinct ideas separate:

  • The unknown true distribution which, in reality, generates the data which we see. Outside of stylized examples in which we assume that a distribution follows a simple mathematical formula, we will never have access to the unknown true distribution. We can only estimate it. This unknown true distribution is often referred to as the data generating mechanism, or DGM. It is a function or black box or urn which produces data. We can see the data. We can’t see the urn.

  • The estimated distribution which, we think, generates the data which we see. Again, we can never know the unknown true distribution. But, by making some assumptions and using the data we have, we can estimate a distribution. Our estimate may be very close to the true distribution. Or it may be far away. The main task of data science is to create and use these estimated distributions. Almost always, these distributions are instantiated in computer code. Just as there is a true data generating mechanism associated with the (unknown) true distribution, there is an estimated data generating mechanism associated with the estimated distribution.

  • A vector of numbers drawn from the estimated distribution. Both true and estimated distributions can be complex animals, difficult to describe accurately and in detail. But a vector of numbers drawn from a distribution is easy to understand and use. So, in general, we work with vectors of numbers. When someone (either a colleague or a piece of R code) creates a distribution which we want to use to answer a question, we don’t really want the distribution itself. Rather, we want a vector of “draws” from that distribution. Vectors are easy to work with! Complex computer code is not.

Again, people (including us!) will often be sloppy and use the same word, “distribution,” without making it clear whether they are talking about the true distribution, the estimated distribution, or a vector of draws from the estimated distribution. The same sloppiness applies to the use of the term data generating mechanism. Try not to be sloppy.

Much of the rest of the Primer involves learning how to work with distributions, which generally means working with the draws from those distributions. Fortunately, the usual rules of arithmetic apply. You can add/subtract/multiply/divide distributions by working with draws from those distributions, just as you can add/subtract/multiply/divide regular numbers.
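A small sketch of this kind of arithmetic: to get draws from the distribution of the sum of two dice, add a vector of draws for the first die to a vector of draws for the second. The seed, the number of draws and the object names are assumptions made for illustration.

```r
set.seed(12)  # arbitrary seed for reproducibility

n_draws <- 10000

# Draws from the distribution of each die, assumed fair.
die_1 <- sample(1:6, size = n_draws, replace = TRUE)
die_2 <- sample(1:6, size = n_draws, replace = TRUE)

# Adding the vectors element by element gives draws from the
# distribution of the sum.
sum_draws <- die_1 + die_2

mean(sum_draws)    # should be near 7, the sum of the two means of 3.5
range(sum_draws)
```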

5.2 Probability distributions

FIGURE 5.1: Bruno de Finetti, an Italian statistician who wrote a famous treatise on the theory of probability that began with the statement “PROBABILITY DOES NOT EXIST.”

For the purposes of this Primer, a probability distribution is a mathematical object which maps a set of outcomes to probabilities, where each distinct outcome has a chance of occurring between 0 and 1 inclusive. The probabilities must sum to 1. The set of possible outcomes, i.e., the sample space — heads or tails for the coin, 1 through 6 for a single die, 2 through 12 for the sum of a pair of dice — can be either discrete or continuous. Remember, discrete data can only take on certain values. Continuous data, like height and weight, can take any value within a range. The set of outcomes is the domain of the probability distribution. The range is the associated probabilities.

Assume that a probability distribution is created by a probability function, a set function which maps outcomes to probabilities. The concept of a “probability function” is often split into two categories: probability mass functions (for discrete random variables) and probability density functions (for continuous random variables). As usual, we will be a bit sloppy, using the term probability distribution for both the mapping itself and for the function which creates the mapping.

We discuss three types of probability distributions: empirical, mathematical, and posterior.

The key difference between a distribution, as we have explored them in Section 5.1, and a probability distribution is the requirement that the sum of the probabilities of the individual outcomes must be exactly 1. There is no such requirement for a distribution in general. But any distribution can be turned into a probability distribution by “normalizing” it. In this context, we will often refer to a distribution which is not (yet) a probability distribution as an “unnormalized” distribution.

Pay attention to notation. Recall that when we are talking about a specific probability (represented by a single value), we will use \(\rho\) (the Greek letter “rho”) with a subscript which specifies the exact outcome of which it is the probability. For instance, \(\rho_h = 0.5\) denotes the probability of getting heads on a coin toss when the coin is fair. \(\rho_t\) — spoken as “PT” or “P sub T” or “P tails” — denotes the probability of getting tails on a coin toss. However, when we are referring to the entire probability distribution over a set of outcomes, we will use \(P()\). For example, the probability distribution of a coin toss is \(P(\text{coin})\). That is, \(P(\text{coin})\) is composed of the two specific probabilities (50% and 50%) mapped from the two values in the domain (Heads and Tails). Similarly, \(P(\text{sum of two dice})\) is the probability distribution over the set of 11 outcomes (2 through 12) which are possible when you take the sum of two dice. \(P(\text{sum of two dice})\) is made up of 11 numbers — \(\rho_2\), \(\rho_3\), …, \(\rho_{12}\) — each representing the unknown probability that the sum will equal their value. That is, \(\rho_2\) is the probability of rolling a 2.

5.2.1 Flipping a coin

All data science problems start with a question. Example: What are the chances of getting three heads in a row when flipping a fair coin? All questions are answered with the help of probability distributions.

An empirical distribution is based on data. You can think of this as the probability distribution created by running a simulation. In theory, if we increase the number of coins we flip in our simulation, the empirical distribution will look more and more similar to the mathematical distribution. The mathematical distribution is the Platonic form. The empirical distribution will often look like the mathematical probability distribution, but it will rarely be exactly the same.

In this simulation, there are 56 heads and 44 tails. The outcome will vary every time we run the simulation, but the proportion of heads to tails should not be too different if this coin is fair.

# We are flipping one fair coin a hundred times. We need to get the same result
# each time we create this graphic because we want the results to match the
# description in the text. Using set.seed() guarantees that the random results
# are the same each time. We define 0 as tails and 1 as heads. The seed value
# below is a placeholder. The seed behind the counts quoted in the text is not
# shown, so the exact split of heads and tails may differ.

library(tidyverse)

set.seed(1)

tibble(results = sample(c(0, 1), 100, replace = TRUE)) %>% 
  ggplot(aes(x = results)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 0.5, 
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Flipping one coin a hundred times",
         x = "Outcome\nResult of Coin Flip",
         y = "Probability") +
    scale_x_continuous(breaks = c(0, 1), 
                       labels = c("Tails", "Heads")) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1))

A mathematical distribution is based on a mathematical formula. Assuming that the coin is perfectly fair, we should, on average, get heads as often as we get tails.

The distribution of a single observation is described by this formula.

\[ P(Y = y) = \begin{cases} 1/2 &\text{for }y= \text{Heads}\\ 1/2 &\text{for }y= \text{Tails} \end{cases}\]

We sometimes do not know that the probability of heads and the probability of tails both equal 50%. In that case, we might write:

\[ P(Y = y) = \begin{cases} \rho_H &\text{for }y= \text{Heads}\\ \rho_T &\text{for }y= \text{Tails} \end{cases}\]

Yet, we know that, by definition, \(\rho_H + \rho_T = 1\), so we can rewrite the above as:

\[ P(Y = y) = \begin{cases} \rho_H &\text{for }y= \text{Heads}\\ 1- \rho_H &\text{for }y= \text{Tails} \end{cases}\]

Coin flipping (and related scenarios with only two possible outcomes) is such a common problem that the notation is often simplified further, with \(\rho\) understood, by convention, to be the probability of heads. In that case, we can write the mathematical distribution in two canonical forms:

\[P(Y) = Bernoulli(\rho)\]

and

\[y_i \sim Bernoulli(\rho)\]

All five of these versions mean the same thing! The first four describe the mathematical probability distribution for a coin toss. The capital \(Y\) within the \(P()\) indicates a random variable. The fifth highlights one “draw” from that random variable, hence the lower case \(y\) and the subscript \(i\).

Most probability distributions do not have special names, which is why we will use the generic symbol \(P\) to define them. But some common probability distributions do have names, like “Bernoulli” in this case.

If the mathematical assumptions are correct, then as your sample size increases, the empirical probability distribution will look more and more like the mathematical distribution.
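This convergence can be seen by drawing Bernoulli samples of increasing size and tracking the proportion of heads. The sketch assumes a fair coin (\(\rho = 0.5\)). The seed, the sample sizes and the object names are illustrative choices.

```r
set.seed(3)  # arbitrary seed for reproducibility

sizes <- c(10, 100, 1000, 100000)

# For each sample size, draw that many Bernoulli(0.5) values (1 is heads)
# and record the proportion of heads.
prop_heads <- sapply(sizes, function(n) {
  mean(rbinom(n = n, size = 1, prob = 0.5))
})

data.frame(sample_size = sizes, prop_heads = prop_heads)
```

The proportion for the largest sample should sit much closer to 0.5 than the proportion for the smallest one typically does.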

A posterior distribution is based on beliefs and expectations. It displays your belief about things you can’t see right now. You may have posterior distributions for outcomes in the past, present, or future.

In the case of the coin toss, the posterior distribution changes depending on your beliefs. For instance, let’s say your friend brought a coin to school and asked to bet you. If the result is heads, you have to pay them $5. In that case, your posterior probability distribution might look like this:

The full terminology is mathematical (or empirical or posterior) probability distribution. But we will often shorten this to just mathematical (or empirical or posterior) distribution. The word “probability” is understood, even if it is not present.

Recall the question with which we started this section: What are the chances of getting three heads in a row when flipping a fair coin? To answer this question, we need to use a probability distribution as our data generating mechanism. Fortunately, the rbinom() function allows us to generate the results for coin flips. For example:

rbinom(n = 10, size = 1, prob = 0.5)
##  [1] 1 1 0 1 1 0 0 0 0 0

generates the results of 10 coin flips, where a result of heads is presented as 1 and tails as 0. With this tool, we can generate 1,000 draws from our experiment:

tibble(toss_1 = rbinom(n = 1000, size = 1, prob = 0.5),
       toss_2 = rbinom(n = 1000, size = 1, prob = 0.5),
       toss_3 = rbinom(n = 1000, size = 1, prob = 0.5))
## # A tibble: 1,000 x 3
##    toss_1 toss_2 toss_3
##     <int>  <int>  <int>
##  1      0      1      1
##  2      0      1      1
##  3      0      1      0
##  4      0      0      1
##  5      1      1      0
##  6      1      0      1
##  7      1      0      0
##  8      1      0      1
##  9      0      0      1
## 10      0      1      1
## # … with 990 more rows

Because the flips are independent, we can consider each row to be a draw from the experiment. Then, we simply calculate the proportion of experiments which resulted in three heads.

tibble(toss_1 = rbinom(n = 1000, size = 1, prob = 0.5),
       toss_2 = rbinom(n = 1000, size = 1, prob = 0.5),
       toss_3 = rbinom(n = 1000, size = 1, prob = 0.5)) %>% 
  mutate(three_heads = toss_1 + toss_2 + toss_3 == 3) %>% 
  summarize(chance = mean(three_heads))
## # A tibble: 1 x 1
##   chance
##    <dbl>
## 1  0.104

This is close enough to the correct answer of \(1/8\), or 0.125. If we increase the sample size, we will get closer to the truth.

This is our first example of using a data generating mechanism — meaning rbinom() — to answer a question. We will see many more in the chapters to come.
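To see how a larger simulation approaches the exact answer, the sketch below repeats the three-toss experiment 100,000 times and compares the result with the value implied by independence, \(0.5^3\). The seed, the number of simulations and the object names are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed for reproducibility

n_sims <- 100000

# Each element is one experiment: TRUE if all three tosses were heads.
three_heads <- rbinom(n = n_sims, size = 1, prob = 0.5) +
               rbinom(n = n_sims, size = 1, prob = 0.5) +
               rbinom(n = n_sims, size = 1, prob = 0.5) == 3

sim_chance <- mean(three_heads)

# Exact answer: the tosses are independent, so the probabilities multiply.
exact_chance <- 0.5^3

# The same value from the binomial probability mass function.
pmf_chance <- dbinom(3, size = 3, prob = 0.5)

c(simulated = sim_chance, exact = exact_chance, pmf = pmf_chance)
```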

5.2.2 Rolling two dice

We get an empirical distribution by rolling two dice a hundred times, either by hand or with a computer simulation. The result is not identical to the mathematical distribution because of the inherent randomness of the real world and/or of simulation.

# In the coin example, we created the vector ahead of time, and then assigned
# that vector to a tibble. There was nothing wrong with that approach. And we
# could do the same thing here. But the use of map_* functions is more powerful,
# although it requires creating the 100 rows of the tibble at the start and then
# doing things "row-by-row."


emp_dist_dice <- tibble(ID = 1:100) %>% 
  mutate(die_1 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(die_2 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(sum = die_1 + die_2) %>% 
  ggplot(aes(x = sum)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 1, 
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Sum from rolling two dice, replicated one hundred times",
         x = "Outcome\nSum of Two Dice",
         y = "Probability") +
    scale_x_continuous(breaks = seq(2, 12, 1), labels = 2:12) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1))


We might consider labeling the y-axis in plots of empirical distributions as “Proportion” rather than “Probability” since it is an actual proportion, calculated from real (or simulated) data. We will keep it as “Probability” since we want to emphasize the parallels between mathematical, empirical and posterior probability distributions.

Our mathematical distribution tells us that, with a fair die, the probabilities of rolling a 1, 2, 3, 4, 5, or 6 are equal: there is a 1/6 chance of each. When we roll two dice at the same time and sum the numbers, the values closest to the middle are more common than values at the edge because there are more combinations of numbers that add up to the middle values.

\[ P(Y = y) = \begin{cases} \dfrac{y-1}{36} &\text{for }y=2,3,4,5,6 \\ \dfrac{13-y}{36} &\text{for }y=7,8,9,10,11,12 \\ 0 &\text{otherwise} \end{cases} \]
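The formula can be checked by listing all 36 equally likely combinations of two fair dice and counting how many produce each sum. This is a sketch in base R. `formula_prob()` is a hypothetical helper which encodes the piecewise formula for sums from 2 through 12.

```r
# All 36 (die 1, die 2) combinations: entry [i, j] is i + j.
sums <- outer(1:6, 1:6, "+")

# Count how often each sum from 2 to 12 occurs and divide by 36.
# table() sorts by value, so the entries run from 2 up to 12.
enumerated <- as.numeric(table(sums)) / 36

# The piecewise formula, valid for y in 2, ..., 12.
formula_prob <- function(y) {
  ifelse(y <= 6, (y - 1) / 36, (13 - y) / 36)
}

data.frame(sum = 2:12,
           enumerated = enumerated,
           formula = formula_prob(2:12))
```

There are six combinations which add up to 7 but only one which adds up to 2 or to 12, which is why the middle values are more common.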

The posterior distribution for rolling two dice a hundred times depends on your beliefs. If you take the dice from your Monopoly set, you have reason to believe that the assumptions underlying the mathematical distribution are true. However, if you walk into a crooked casino and a host asks you to play craps, you might be suspicious, just as in the coin-flipping example. Here, “suspicious” means that you no longer trust the “population” from which the mathematical and empirical distributions draw their data. For example, in craps, a come-out roll of 7 or 11 is a “natural,” resulting in a win for the “shooter” and a loss for the casino. You might expect those numbers to occur less often than they would with fair dice. Meanwhile, a come-out roll of 2, 3 or 12 is a loss for the shooter. You might expect values like 2, 3 and 12 to occur more frequently. Your posterior distribution might look like this:

Someone less suspicious of the casino would have a posterior distribution which looks more like the mathematical distribution.

5.2.3 Presidential elections

Now let’s say we are building probability distributions for political events, like a presidential election. We want to know the probability that the Democratic candidate wins X electoral votes, where X comes from the range of possible outcomes: 0 to 538. (The total number of electoral votes in US elections since 1964 is 538.)

The empirical distribution in this case could involve looking into past elections in the United States and counting the number of electoral votes that the Democrats won in each. For the empirical distribution, we create a tibble with electoral vote results from past elections. Looking at elections since 1964, we can observe that the number of electoral votes that the Democrats received in each one is different. Given that we only have 15 entries, it is difficult to draw conclusions or make predictions based on this empirical distribution.

We can build a mathematical distribution for X which assumes that the chance of the Democratic candidate winning any given state’s electoral votes is 0.5 and that the results from each state are independent.

If our assumptions about this mathematical distribution are correct (they are not), then as the sample size increases, the empirical distribution should look more and more similar to our mathematical distribution.

However, the data from past elections is more than enough to demonstrate that the assumptions of the mathematical probability distribution above do not work for electoral votes. The model assumes that the Democrats have a 50% chance of receiving each of the 538 votes. Just looking at the mathematical probability distribution, we can observe that receiving 13 or 17 or 486 votes out of 538 would be extreme and almost impossible if the mathematical model were accurate. However, our empirical distribution shows that such extreme outcomes are quite common. Presidential elections have resulted in much bigger victories or defeats than this distribution seems to allow for.
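To put numbers on “almost impossible,” the sketch below takes the simplified description in the paragraph above at face value and treats the Democratic total as a Binomial(538, 0.5) random variable: 538 independent votes, each won with probability 0.5. That reading is our assumption, and the object names are illustrative.

```r
total_votes <- 538

# Probability of 17 or fewer votes under the assumed model.
p_17_or_fewer <- pbinom(17, size = total_votes, prob = 0.5)

# Probability of 486 or more votes: P(X > 485).
p_486_or_more <- pbinom(485, size = total_votes, prob = 0.5,
                        lower.tail = FALSE)

# The range which contains the middle 99.9% of outcomes.
middle_range <- qbinom(c(0.0005, 0.9995), size = total_votes, prob = 0.5)

p_17_or_fewer
p_486_or_more
middle_range
```

Under this model nearly all of the probability sits in a narrow band around 269 votes, which is why the lopsided totals seen in real elections contradict it.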

The posterior distribution of electoral votes is a popular topic, and an area of strong disagreement, among data scientists. Consider this posterior from FiveThirtyEight.

Here is a posterior from the FiveThirtyEight website from August 13, 2020. This was created using the same data as the above distribution, but simply displayed differently. For each electoral result, the height of the bar represents the probability that a given event will occur. However, there are no labels on the y-axis telling us what the specific probability of each outcome is. And that is OK! The specific values are not that useful. If we removed the labels on our y-axes, would it matter?

Here is the posterior from The Economist, also from August 13, 2020. This looks confusing at first because they chose to merge the axes for Republican and Democratic electoral votes. We can tell that The Economist was less optimistic, relative to FiveThirtyEight, about Trump’s chances in the election.

These two models, built by smart people using similar data sources, have reached fairly different conclusions. Data science is difficult! There is not one “right” answer. Real life is not a problem set.

FIGURE 5.2: Watch the makers of these two models throw shade at each other on Twitter! Elliott Morris is one of the primary authors of the Economist model. Nate Silver is in charge of 538. They don’t seem to be too impressed with each other’s work! More smack talk [here](https://statmodeling.stat.columbia.edu/2020/08/31/more-on-that-fivethirtyeight-prediction-that-biden-might-only-get-42-of-the-vote-in-florida/) and [here](https://statmodeling.stat.columbia.edu/2020/08/31/problem-of-the-between-state-correlations-in-the-fivethirtyeight-election-forecast/).

There are many political science questions you could explore with posterior distributions. They can relate to the past, present, or future.

  • Past: How many electoral votes would Hillary Clinton have won if she had picked a different VP?
  • Present: What are the total campaign donations from Harvard faculty?
  • Future: How many electoral votes will the Democratic candidate for president win in 2024?

5.2.4 Height

Question: What is the height of the next adult male we will meet?

The three examples above are all discrete probability distributions, meaning that the outcome variable can only take on a limited set of values. A coin flip has two outcomes. The sum of a pair of dice has 11 outcomes. The total electoral votes for the Democratic candidate has 539 possible outcomes. In the limit, we can also create continuous probability distributions which have an infinite number of possible outcomes. For example, the average height for an American male could be any real number between 0 inches and 100 inches. (Of course, an average value anywhere near 0 or 100 is absurd. The point is that the average could be 68.564, 68.5643, 68.56432, 68.564327, or any real number.)

All the characteristics for discrete probability distributions which we reviewed above apply just as much to continuous probability distributions. For example, we can create mathematical, empirical and posterior probability distributions for continuous outcomes just as we did for discrete outcomes.

The empirical distribution involves using data from the National Health and Nutrition Examination Survey (NHANES). Instead of building a model ourselves from a mathematical formula, we use actual data. That data can come from our own simulation, as in the coin-flipping and dice-rolling examples, or from someone else, as in the presidential election example and this one.

A mathematical distribution is based entirely on a mathematical formula and assumptions. In the coin-flipping section, we assumed that the coin was perfectly fair, so that heads and tails were equally likely. In this case, we assume that the average height of men is 175 cm and that the standard deviation of height is around 9 cm. Given these two values, the average (also called the mean) and the standard deviation (sd), we can create a normal distribution using the rnorm() function. The normal distribution is a good approximation for height in our scenario.

Mathematical Distribution:

Again, the normal distribution, a probability distribution which is symmetric about its mean, is described by this formula.

\[y_i \sim N(\mu, \sigma^2)\].

Each value \(y_i\) is drawn from a normal distribution with parameters \(\mu\) for the mean and \(\sigma\) for the standard deviation. If the mathematical assumptions are correct, in this case the values of the two parameters \(\mu\) and \(\sigma\), then as our sample size increases, the empirical probability distribution will look more and more like the mathematical distribution.
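A minimal sketch of drawing from this mathematical distribution with rnorm(), using the values assumed in the text (mean of 175 cm, standard deviation of 9 cm). The seed, the number of draws and the object name are illustrative. Note that rnorm() takes the standard deviation \(\sigma\) as its `sd` argument, not the variance \(\sigma^2\) which appears in the \(N(\mu, \sigma^2)\) notation.

```r
set.seed(21)  # arbitrary seed for reproducibility

# 100,000 draws of height, in cm, from the assumed normal distribution.
heights <- rnorm(n = 100000, mean = 175, sd = 9)

# With this many draws, the sample mean and standard deviation should
# be close to the parameters we supplied.
mean(heights)
sd(heights)
```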

The posterior distribution for heights depends on the context. Are we considering all the adult men in America? In that case, our posterior would probably look a lot like the empirical distribution using NHANES data. If we are being asked about the distribution of heights among players in the NBA, then our posterior might look like:


  • Continuous variables are a myth. Nothing that can be represented on a computer is truly continuous. Even something which appears continuous, like height, actually can only take on a (very large) set of discrete values.

  • The math of continuous probability distributions can be tricky. Read a book on mathematical probability for all the messy details. Little of that matters in applied work.

  • The most important difference is that, with discrete distributions, it makes sense to estimate the probability of a specific outcome. What is the probability of rolling a 9? With continuous distributions, this makes no sense because there are an infinite number of possible outcomes, so the probability of any single exact value is zero. With continuous variables, we only estimate the probability of intervals.
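The last point can be illustrated with pnorm(), which returns the probability that a normal random variable falls at or below a given value. The sketch reuses the assumed height distribution (mean of 175 cm, standard deviation of 9 cm). The interval end points and object names are our own choices.

```r
mu    <- 175
sigma <- 9

# Probability that a height falls between 170 cm and 180 cm.
p_interval <- pnorm(180, mean = mu, sd = sigma) -
              pnorm(170, mean = mu, sd = sigma)

# As the interval shrinks toward a single value, its probability
# shrinks toward zero.
p_narrow <- pnorm(175.001, mean = mu, sd = sigma) -
            pnorm(174.999, mean = mu, sd = sigma)

p_interval
p_narrow
```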

Don’t worry about the distinctions between discrete and continuous outcomes, or between the discrete and continuous probability distributions which we will use to summarize our beliefs about those outcomes. The basic intuition is the same in both cases.

5.2.5 Joint distributions

Recall that \(P(\text{coin})\) is the probability distribution for the result of a coin toss. It includes two parts, the probability of heads (\(\rho_h\)) and the probability of tails (\(\rho_t\)). This is a univariate distribution because there is only one outcome, which can be heads or tails. If there is more than one outcome, then we have a joint distribution.

Joint distributions are also mathematical objects that cover a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 and the sum of all chances must equal 1. The key to a joint distribution is that it measures the chance that both events A and B will occur. The notation is \(P(A, B)\).

Let’s say that you are rolling two six-sided dice simultaneously. Die 1 is weighted so that there is a 50% chance of rolling a 6 and a 10% chance of each of the other values. Die 2 is weighted so there is a 50% chance of rolling a 5 and a 10% chance of rolling each of the other values. Let’s roll both dice 1,000 times. In previous examples involving two dice, we cared about the sum of results and not the outcomes of the first versus the second die of each simulation. With a joint distribution, the order matters; so instead of 11 possible outcomes on the x-axis of our distribution plot (ranging from 2 to 12), we have 36. Furthermore, a 2D probability distribution is not sufficient to represent all of the variables involved, so the joint distribution for this example is displayed using a 3D plot.
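The mathematical version of this joint distribution can be written as a 6 by 6 table of probabilities. The sketch below assumes, in addition to the weights given above, that the two dice are independent, so that each joint probability is the product of the two individual probabilities. The object names are ours.

```r
faces <- 1:6

# Die 1: 50% chance of a 6, 10% for each other face.
p_die_1 <- c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5)

# Die 2: 50% chance of a 5, 10% for each other face.
p_die_2 <- c(0.1, 0.1, 0.1, 0.1, 0.5, 0.1)

# Assuming independence, P(die 1 = i, die 2 = j) is the product of the
# two probabilities. outer() computes all 36 products at once.
joint <- outer(p_die_1, p_die_2)
dimnames(joint) <- list(die_1 = faces, die_2 = faces)

joint
joint["6", "5"]   # the single most likely outcome
sum(joint)        # the 36 probabilities add up to 1
```

Rows index the first die and columns the second, so the entry for (6, 5) is a different outcome from the entry for (5, 6).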