# 6 One Parameter

This scene is from The Hunger Games, a dystopian novel in which children are selected via lottery to fight to the death. Primrose Everdeen is selected from the urn. Why does she have the misfortune of being selected? Or, as we data scientists say, sampled?

In Chapter 5, we learned about probability, the framework for quantifying uncertainty. In this chapter, we will learn about sampling, the beginning of our journey toward inference. When we sample, we take some units from a population, calculate statistics based on those units, and make inferences about unknown parameters associated with the population.

The urn below has a certain number of red and a certain number of white beads all of equal size, mixed well together. What proportion, $$p$$, of this urn’s beads are red?

One way to answer this question would be to perform an exhaustive count: remove each bead individually, count the number of red beads, count the number of white beads, and divide the number of red beads by the total number of beads. Call that ratio $$p$$, the proportion of red beads in the urn. However, this would be a long and tedious process. Therefore, we will use sampling! Consider two questions:

If we get 17 red beads in a random sample of size 50 taken from the urn, what proportion $$p$$ of the beads in the urn are red?

What is the probability, using that same urn, that we will draw more than 8 red beads if we use a shovel of size 20?

To begin this chapter, we will look at a real sampling activity: the urn. Then, we will simulate the urn example using R code. This will help us to understand the standard error and the ways in which uncertainty factors into our predictions. We then create a joint distribution of models and data for the urn example. We derive a posterior distribution from that joint distribution, and then use that posterior to answer our questions.

Use the tidyverse package.

library(tidyverse)
# Needed for the one 3-D plot below.

library(rgl)

## 6.1 Real sampling activity

### 6.1.1 Using the shovel method once

Instead of performing an exhaustive count, let’s insert a shovel into the urn and remove $$5 \cdot 10 = 50$$ beads. We are taking a sample of the total population of beads.

Observe that 17 of the 50 sampled beads are red and thus 17/50 = 0.34 = 34% of the shovel’s beads are red. We can view the proportion of beads that are red in this shovel as a guess of the proportion of beads that are red in the entire urn. While not as exact as doing an exhaustive count of all the beads in the urn, our guess of 34% took much less time and energy to make.

Recall that $$p$$ is the true value of the proportion of red beads. There is only one $$p$$. Our guesses at the proportion of red beads are known as $$\hat{p}$$, where $$\hat{p}$$ is the estimated value of $$p$$ which comes from taking a sample. There can be an infinite number of $$\hat{p}$$.

Imagine that we started this activity over from the beginning, replacing the 50 beads back into the urn and starting over. Would we remove exactly 17 red beads? Maybe?

What if we repeated this activity many times? Would our guess at the proportion of the urn’s beads that are red, $$\hat{p}$$, be exactly 34% every time? Surely not.

Let’s repeat this exercise with the help of 33 groups of friends to understand how the value varies across 33 independent trials.

### 6.1.2 Using the shovel 33 times

Each of our 33 groups of friends will do the following:

• Use the shovel to remove 50 beads each.
• Count the number of red beads and compute the proportion of the 50 beads that are red.
• Return the beads into the urn.
• Mix the contents of the urn to not let a previous group’s results influence the next group’s.

Each of our 33 groups of friends make note of their proportion of red beads from their sample collected. Each group then marks their proportion of their 50 beads that were red in the appropriate bin in a hand-drawn histogram as seen below.

Histograms allow us to visualize the distribution of a numerical variable. In particular, where the center of the values falls and how the values vary. A partially completed histogram of the first 10 out of 33 groups of friends’ results can be seen in the figure below.

Observe the following details in the histogram:

• At the low end, one group removed 50 beads from the urn with proportion red between 0.20 and 0.25.
• At the high end, another group removed 50 beads from the urn with proportion between 0.45 and 0.5 red.
• However, the most frequently occurring proportions were between 0.30 and 0.35 red, right in the middle of the distribution.
• The distribution is somewhat bell-shaped.

tactile_sample_urn saves the results from our 33 groups of friends.

## # A tibble: 33 x 4
##    <chr>             <dbl>    <dbl> <int>
##  1 Mal, Francis         17     0.34     1
##  2 Nam, Joshua          19     0.38     2
##  3 Mark, Ramses         21     0.42     3
##  4 Maeve, Josh          18     0.36     4
##  5 Morgan, Emily        21     0.42     5
##  6 Ace, Chris           18     0.36     6
##  7 Mia, James           15     0.3      7
##  8 Griffin, Mary        18     0.36     8
##  9 Yuki, Harry          21     0.42     9
## 10 Frank, Clara         21     0.42    10
## # … with 23 more rows

For each group, we are given their names, the number of red_beads they obtained, and the corresponding proportion out of 50 beads that were red, called prop_red. We also have an ID variable which gives each of the 33 groups a unique identifier. Each row can be viewed as one instance of a replicated activity: using the shovel to remove 50 beads and computing the proportion of those beads that are red.

Let’s visualize the distribution of these 33 proportions using geom_histogram() with binwidth = 0.05. This is a computerized and complete version of the partially completed hand-drawn histogram you saw earlier. Setting boundary = 0.4 indicates that we want a binning scheme such that one of the bins’ boundary is at 0.4. color = "white" modifies the color of the boundary for visual clarity.

tactile_sample_urn %>%
ggplot(aes(x = prop_red)) +
geom_histogram(binwidth = 0.05,
boundary = 0.4,
color = "white") +

# Add scale_y_continuous with breaks by 1, as the default shows the y-axis
# from 1 to 10 with breaks at .5. Breaks by 1 is better for this plot, as all
# resulting values are integers.

scale_y_continuous(breaks = seq(from = 0, to = 10, by = 1)) +

# The call expression() is used to insert a mathematical expression, like
# p-hat. The paste after expression allows us to paste text prior to said
# expression.

labs(x = expression(paste("Proportion of 50 beads that were red ", hat(p))),
y = "Count",
title = "Proportions Red in 33 Samples") 

### 6.1.3 What did we just do?

What we just demonstrated in this activity is the statistical concept of sampling. We want to know the proportion of the urn’s beads that are red. Performing an exhaustive count of the red and white beads would be time-consuming. Therefore, we extracted a sample of 50 beads using the shovel. Using this sample of 50 beads, we estimated the proportion of the urn’s beads that are red to be 34%.

Moreover, because we mixed the beads before each use of the shovel, the samples were random and independent. Because each sample was drawn at random, the samples were different from each other. Because the samples were different from each other, we obtained the different proportions red observed in the previous histogram. This is an example of sampling variation.

In Section 6.2, we’ll mimic the hands-on sampling activity we just performed on a computer. This will allow us not only to repeat the sampling exercise much more than 33 times, but it will also allow us to use shovels with different numbers of slots than just 50.

Afterwards, we’ll present you with definitions, terminology, and notation related to sampling in Section 6.3. As in many disciplines, such necessary background knowledge may seem confusing at first. However, if you truly understand the underlying concepts and practice, you’ll be able to master them.

To tie the contents of this chapter to the real world, we’ll present an example of one of the most common uses of sampling: polls. In Section 6.5.2 we’ll look at a particular case study: a 2013 poll on then U.S. President Barack Obama’s popularity among young Americans, conducted by the Kennedy School’s Institute of Politics at Harvard University.

## 6.2 Virtual sampling

We just performed a tactile sampling activity. We used a physical urn of beads and a physical shovel. We did this by hand so that we could develop our intuition about the ideas behind sampling. In this section, we mimic this physical sampling with virtual sampling, using a computer.

### 6.2.1 Using the virtual shovel once

Virtual sampling requires a virtual urn and a virtual shovel. Create a tibble named urn. The rows of urn correspond exactly to the contents of the actual urn.

# set.seed() ensures that the beads in our virtual urn are always in the same
# order. This ensures that the figures in the book match their written
# descriptions. We want 40% of the beads to be red.

set.seed(10)

urn <- tibble(color = c(rep("red", 400),
rep("white", 600))) %>%

# sample_frac() keeps all the rows in the tibble but rearranges their order.
# We don't need to do this. A virtual urn does not care about the order of the
# beads. But we find it aesthetically pleasing to mix them up.

sample_frac() %>%
mutate(ID = 1:1000) %>%
select(ID, color)

urn  
## # A tibble: 1,000 x 2
##       ID color
##    <int> <chr>
##  1     1 white
##  2     2 white
##  3     3 red
##  4     4 red
##  5     5 white
##  6     6 white
##  7     7 white
##  8     8 white
##  9     9 white
## 10    10 white
## # … with 990 more rows

Observe that urn has 1,000 rows, meaning that the urn contains 1,000 beads. The first variable ID is used as an identification variable. None of the beads in the actual urn are marked with numbers. The second variable color indicates whether a particular virtual bead is red or white.

Note that, in this chapter, we have used the variable ID in two different ways: first, to keep track of the samples drawn by the 33 individual teams and, second, to keep track of the 1,000 beads in our virtual urn. And that is OK! ID means the same concept in both cases.

Our virtual urn needs a virtual shovel. We use slice_sample() and a list-column to take a sample of 50 beads from our virtual urn.

# Define ID as 1 to look at one example of drawing fifty beads. When ID is
# called within map(), we are performing slice_sample() upon our urn once (ID =
# 1) and taking a sample of 50 beads. Note that the ID in the main tibble has
# nothing to do wkth the ID variable in urn. The former refers to samples from
# the urn. The latter refers to beads in the urn.

tibble(ID = 1) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 50)))
## # A tibble: 1 x 2
##      ID shovel
##   <dbl> <list>
## 1     1 <tibble [50 × 2]>

As usual, map functions and list-columns are powerful but confusing. The str() function is a good way to explore a tibble with a list-column.

tibble(ID = 1) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 50))) %>%
str()
## tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
##  $ID : num 1 ##$ shovel:List of 1
##   ..$: tibble [50 × 2] (S3: tbl_df/tbl/data.frame) ## .. ..$ ID   : int [1:50] 812 903 227 283 229 160 523 893 66 277 ...
##   .. ..$color: chr [1:50] "white" "white" "white" "red" ... There are two levels. There is one row in the tibble for each sample. So far, we have only drawn one sample. Within each row, there is a second level, the tibble which is the sample. That tibble has two variables: ID and color. Let’s compute the proportion of beads in our virtual sample that are red. First, add a column which indicates the number of red beads in the sample taken in the shovel. tibble(ID = 1) %>% mutate(shovel = map(ID, ~ slice_sample(urn, n = 50))) %>% # Using map_int, perform the sum where (within the shovel -- denoted by the # period) the color is equal to red. This counts the number of red beads in # the shovel. mutate(numb_red = map_int(shovel, ~ sum(.$color == "red")))
## # A tibble: 1 x 3
##      ID shovel            numb_red
##   <dbl> <list>               <int>
## 1     1 <tibble [50 × 2]>       20

How does this work? R treats TRUE like the number 1 and FALSE like the number 0. So summing the number of TRUEs and FALSEs is equivalent to summing 1’s and 0’s. In the end, this operation counts the number of beads where color equals “red.”

Second, add a column for the total number of beads. (We already “know” that this is 50, but it never hurts to make our code more general.)

tibble(ID = 1) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 50))) %>%
mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) %>% mutate(numb_beads = map_int(shovel, ~ length(.$color)))
## # A tibble: 1 x 4
##   <dbl> <list>               <int>      <int>
## 1     1 <tibble [50 × 2]>       23         50

Third, calculate the proportion red:

tibble(ID = 1) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 50))) %>%
mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) %>% mutate(numb_beads = map_int(shovel, ~ length(.$color))) %>%
mutate(prop_red = numb_red / numb_beads)
## # A tibble: 1 x 5
##      ID shovel            numb_red numb_beads prop_red
##   <dbl> <list>               <int>      <int>    <dbl>
## 1     1 <tibble [50 × 2]>       18         50     0.36

Careful readers will note that the numb_red is changing in each example above. The reason, of course, is that each block re-runs the shovel exercise, getting a (potentially) different number of red beads. If we wanted the same number in each block, we would need to use set.seed() each time, always providing the same seed each time.

Let’s now perform the virtual analog of having 33 groups of students use the sampling shovel!

### 6.2.2 Using the virtual shovel 33 times

In our tactile sampling exercise in Section 6.1, we had 33 groups of students use the shovel, yielding 33 samples of size 50 beads. We then used these 33 samples to compute 33 proportions. We can perform this repeated/replicated sampling virtually by just doing the same thing 33 times.

We’ll save these results in a data frame called virtual_samples.

set.seed(9)

virtual_samples <- tibble(ID = 1:33) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 50))) %>%
mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) %>% mutate(numb_beads = map_int(shovel, ~ length(.$color))) %>%

virtual_samples
## # A tibble: 33 x 5
##       ID shovel            numb_red numb_beads prop_red
##    <int> <list>               <int>      <int>    <dbl>
##  1     1 <tibble [50 × 2]>       21         50     0.42
##  2     2 <tibble [50 × 2]>       19         50     0.38
##  3     3 <tibble [50 × 2]>       17         50     0.34
##  4     4 <tibble [50 × 2]>       15         50     0.3
##  5     5 <tibble [50 × 2]>       17         50     0.34
##  6     6 <tibble [50 × 2]>       21         50     0.42
##  7     7 <tibble [50 × 2]>        9         50     0.18
##  8     8 <tibble [50 × 2]>       21         50     0.42
##  9     9 <tibble [50 × 2]>       16         50     0.32
## 10    10 <tibble [50 × 2]>       20         50     0.4
## # … with 23 more rows

Let’s visualize this variation in a histogram:

virtual_samples %>%
ggplot(aes(x = prop_red)) +
geom_histogram(binwidth = 0.05,
boundary = 0.4,
color = "white") +

# To use mathematical symbols in titles and labels, use the expression()
# function, as here.

labs(x = expression(hat(p)),
y = "Count",
title = "Distribution of 33 proportions red") +

# Label the y-axis in an attractive fashion. Without this code, the axis
# labels would include 2.5, which makes no sense because all the values are
# integers.

scale_y_continuous(breaks = seq(2, 10, 2))

We set the boundary and binwidth arguments. Setting boundary = 0.4 ensures a binning scheme with one bin’s boundary at 0.4. Since binwidth = 0.05, this will create bins with boundaries at 0.30, 0.35, 0.45, and so on. Recall that $$\hat{p}$$ is equal to the proportion of beads which are red in our samples.

Observe that we occasionally obtained proportions red that are less than 30%. On the other hand, we occasionally obtained proportions that are greater than 45%. However, the most frequently occurring proportions were between 35% and 45%. Why do we have these differences in proportions red? Because of sampling variation.

Compare our virtual results with our tactile results from the previous section. Observe that both histograms are somewhat similar in their center and variation, although not identical. These slight differences are again due to random sampling variation. Furthermore, observe that both distributions are somewhat bell-shaped.

This visualization allows us to see how our results differed between our tactile and virtual urn results. As we can see, there is some variation between our results. This is not a cause for concern, as there is always expected sampling variation between results.

### 6.2.3 Using the virtual shovel 1,000 times

Now say we want to study the effects of sampling variation not for 33 samples, but for a larger number of samples (1000). We have two choices at this point. We could have our groups of friends manually take 1,000 samples of 50 beads and compute the corresponding 1,000 proportions. However, this would be time-consuming. This is where computers excel: automating long and repetitive tasks while performing them quickly. At this point, we will abandon tactile sampling in favor of virtual sampling.

set.seed(9)

virtual_samples <- tibble(ID = 1:1000) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 50))) %>%
mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) %>% mutate(numb_beads = map_int(shovel, ~ length(.$color))) %>%

virtual_samples
## # A tibble: 1,000 x 5
##       ID shovel            numb_red numb_beads prop_red
##    <int> <list>               <int>      <int>    <dbl>
##  1     1 <tibble [50 × 2]>       21         50     0.42
##  2     2 <tibble [50 × 2]>       19         50     0.38
##  3     3 <tibble [50 × 2]>       17         50     0.34
##  4     4 <tibble [50 × 2]>       15         50     0.3
##  5     5 <tibble [50 × 2]>       17         50     0.34
##  6     6 <tibble [50 × 2]>       21         50     0.42
##  7     7 <tibble [50 × 2]>        9         50     0.18
##  8     8 <tibble [50 × 2]>       21         50     0.42
##  9     9 <tibble [50 × 2]>       16         50     0.32
## 10    10 <tibble [50 × 2]>       20         50     0.4
## # … with 990 more rows

Observe that we now have 1,000 replicates of prop_red, the proportion of 50 beads that are red. Using the same code as earlier, let’s now visualize the distribution of these 1,000 replicates of prop_red in a histogram:

virtual_samples %>%
ggplot(aes(x = prop_red)) +
geom_histogram(binwidth = 0.01,
boundary = 0.4,
color = "white") +
labs(x = expression(hat(p)),
y = "Count",
title = "Distribution of 1,000 proportions red") 

The most frequently occurring proportions of red beads occur, again, between 35% and 45%. Every now and then, we obtain proportions much lower or higher. These are rare, however. Observe that we now have a much more symmetric and smoother bell-shaped distribution. This shape is, in fact, well approximated by a normal distribution.

Why the empty spaces among the bars? Recall that, with only 50 beads, there are only 51 possible values for $$\hat{p}$$: 0, 0.02, 0.04, …, 0.98, 1. A value of 0.03 or 0.39 is impossible, hence the gaps.

### 6.2.4 The effect of different shovel sizes

Instead of just one shovel, imagine we have three choices of shovels to extract a sample of beads with: shovels of size 25, 50, and 100. Using our newly developed tools for virtual sampling, let’s unpack the effect of having different sample sizes. Start by virtually using the shovel 1,000 times. Then, compute the resulting 1,000 replicates of proportion red. Finally, plot the distribution using a histogram.

virtual_samples_25 <- tibble(ID = 1:1000) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 25))) %>%
mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) %>% mutate(numb_beads = map_int(shovel, ~ length(.$color))) %>%

virtual_samples_25 %>%
ggplot(aes(x = prop_red)) +
geom_histogram(binwidth = 0.01,
boundary = 0.4,
color = "white") +
labs(x = "Proportion of 25 beads that were red",
title = "25") 

We will repeat this process with a shovel size of 50.

virtual_samples_50 <- tibble(ID = 1:1000) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 50))) %>%
mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) %>% mutate(numb_beads = map_int(shovel, ~ length(.$color))) %>%

virtual_samples_50  %>%
ggplot(aes(x = prop_red)) +
geom_histogram(binwidth = 0.01,
boundary = 0.4,
color = "white") +
labs(x = "Proportion of 50 beads that were red",
title = "50")  

There are about twice as many bars with the size-50 shovel than the size-25 shovel because the former allows for many more possible values for $$\hat{p}$$.

Finally, we will perform the same process with 1000 replicates to map the histogram using a shovel size of 100.

virtual_samples_100 <- tibble(ID = 1:1000) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 100))) %>%
mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) %>% mutate(numb_beads = map_int(shovel, ~ length(.$color))) %>%

virtual_samples_100 %>%
ggplot(aes(x = prop_red)) +
geom_histogram(binwidth = 0.01,
boundary = 0.4,
color = "white") +
labs(x = "Proportion of 100 beads that were red",
title = "100") 

For easy comparison, we present the three resulting histograms in a single row with matching x and y axes:

# Use bind_rows to combine the data from our three saved virtual sampling
# objects. Use mutate() in each to clarify the n as the necessary number
# of samples taken. This makes our data easier to interpret and prevents
# duplicate elements.

virtual_prop <- bind_rows(virtual_samples_25 %>%
mutate(n = 25),
virtual_samples_50 %>%
mutate(n = 50),
virtual_samples_100 %>%
mutate(n = 100))

# Plot our new object with the x-axis showing prop_red. Add elements binwidth,
# boundary, and color for stylistic clarity. Use labs() to add an x-axis label
# and title. Facet_wrap() splits the graph into multiple plots by the variable
# (~n).

comparing_sampling_distributions <- ggplot(virtual_prop, aes(x = prop_red)) +
geom_histogram(binwidth = 0.01, boundary = 0.4, color = "white") +
labs(x = "Proportion of shovel's beads that are red",
title = "Comparing distributions of proportions red for three different shovel sizes.") +
facet_wrap(~ n)

# Inspect our new faceted graph.

comparing_sampling_distributions

Observe that as the sample size increases, the variation of the 1,000 replicates of the proportion of red decreases. In other words, as the sample size increases, there are fewer differences due to sampling variation and the distribution centers more tightly around the same value. All three histograms center around roughly 40%.

We can be numerically explicit about the amount of variation in our three sets of 1,000 values of prop_red by using the standard deviation. A standard deviation is a summary statistic that measures the amount of variation within a numerical variable. For all three sample sizes, let’s compute the standard deviation of the 1,000 proportions red by running the following data wrangling code that uses the sd() summary function.

virtual_samples_25 %>%
summarize(sd = sd(prop_red), .groups = 'drop_last')
## # A tibble: 1 x 1
##       sd
##    <dbl>
## 1 0.0987
virtual_samples_50 %>%
summarize(sd = sd(prop_red), .groups = 'drop_last')
## # A tibble: 1 x 1
##       sd
##    <dbl>
## 1 0.0673
virtual_samples_100 %>%
summarize(sd = sd(prop_red), .groups = 'drop_last')
## # A tibble: 1 x 1
##       sd
##    <dbl>
## 1 0.0453
Comparing standard deviations of proportions red for three different shovels
Number of slots in shovel Standard deviation of proportions red
25 0.099
50 0.067
100 0.045

As the sample size increases, the variation decreases. In other words, there is less variation in the 1,000 values of the proportion red. As the sample size increases, our guesses at the true proportion of the urn’s beads that are red get more precise. The larger the shovel, the more precise the result.

### 6.2.5 Functions are your friend!

Note that in the last section, we ran more or less the same code three times, but with different sizes for our shovel: 25, 50, and 100. Whenever you find yourself writing the same code three or more times, you should write a function that does the same thing. Let’s look at the code that used a shovel of size 25 and calculated the proportion of beads that were red one more time:

tibble(ID = 1:1000) %>%
mutate(shovel = map(ID, ~ slice_sample(urn, n = 25))) %>%
mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) %>% mutate(numb_beads = map_int(shovel, ~ length(.$color))) %>%
mutate(prop_red = numb_red / numb_beads)
## # A tibble: 1,000 x 5
##       ID shovel            numb_red numb_beads prop_red
##    <int> <list>               <int>      <int>    <dbl>
##  1     1 <tibble [25 × 2]>        8         25     0.32
##  2     2 <tibble [25 × 2]>       12         25     0.48
##  3     3 <tibble [25 × 2]>       14         25     0.56
##  4     4 <tibble [25 × 2]>        9         25     0.36
##  5     5 <tibble [25 × 2]>       11         25     0.44
##  6     6 <tibble [25 × 2]>       10         25     0.4
##  7     7 <tibble [25 × 2]>        9         25     0.36
##  8     8 <tibble [25 × 2]>       10         25     0.4
##  9     9 <tibble [25 × 2]>        6         25     0.24
## 10    10 <tibble [25 × 2]>        8         25     0.32
## # … with 990 more rows

Once you have a pipe which does what you want, creating a function is relatively easy. Just place the code within a function definition and “pull out” specific variables as arguments.

make_prop_red <- function(x, shovel_size, reps){
tibble(ID = 1:reps) %>%
mutate(shovel = map(ID, ~ slice_sample(x, n = shovel_size))) %>%
mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) %>% mutate(numb_beads = map_int(shovel, ~ length(.$color))) %>%
}

See how this just uses the code we had before to create virtual_prop_red_25, but generalizes it. Now we can create the same tibbles we did before, ready to plot the histograms, with only three lines of code:

virtual_prop_red_25  <- make_prop_red(x = urn, shovel_size = 25,  reps = 1000)
virtual_prop_red_50  <- make_prop_red(x = urn, shovel_size = 50,  reps = 1000)
virtual_prop_red_100 <- make_prop_red(x = urn, shovel_size = 100, reps = 1000)

But this still isn’t the best way. Note that we have three objects we need to deal with, virtual_prop_red_25, virtual_prop_red_50, and virtual_prop_red_100. Instead, let’s store our results in a single tibble. How can we do this? By using map() to create a list-column!

First, we’ll create a tibble that has a variable named shovel_size with our values (25, 50, 100):

tibble(shovel_size = c(25, 50, 100))
## # A tibble: 3 x 1
##   shovel_size
##         <dbl>
## 1          25
## 2          50
## 3         100

Next, we’ll create list column called prop_red_results that is the output of make_prop_red().

tibble(shovel_size = c(25, 50, 100)) %>%
mutate(prop_red_results = map(shovel_size,
~ make_prop_red(x = urn,
shovel_size = .x,
reps = 1000)))
## # A tibble: 3 x 2
##   shovel_size prop_red_results
##         <dbl> <list>
## 1          25 <tibble [1,000 × 5]>
## 2          50 <tibble [1,000 × 5]>
## 3         100 <tibble [1,000 × 5]>

Adding another map function will let us get the standard deviations of our estimated proportions:

shovels <- tibble(shovel_size = c(25, 50, 100)) %>%
mutate(prop_red_results = map(shovel_size,
~ make_prop_red(x = urn,
shovel_size = .x,
reps = 1000))) %>%

# Use map_dbl() to create a column that draws the standard deviations of our
# prop_red values. We can use pull after our ~ to isolate the prop_red values
# and use a pipe (as we normally do) here to draw the sd() of this column.

mutate(prop_red_sd = map_dbl(prop_red_results,
~ pull(., prop_red) %>% sd()))

glimpse(shovels)
## Rows: 3
## Columns: 3
## $shovel_size <dbl> 25, 50, 100 ##$ prop_red_results <list> [<tbl_df[1000 x 5]>, <tbl_df[1000 x 5]>, <tbl_df[10…
## $prop_red_sd <dbl> 0.099, 0.070, 0.046 Now that we have this framework, there’s no need to limit ourselves to the sizes 25, 50, and 100. Why not try all integers from 1 to 100? We can use the same code, except we’ll now set shovel_size = 1:100 when initializing the tibble. shovels_100 <- tibble(shovel_size = 1:100) %>% mutate(prop_red_results = map(shovel_size, ~ make_prop_red(x = urn, shovel_size = .x, reps = 1000))) %>% mutate(prop_red_sd = map_dbl(prop_red_results, ~ pull(., prop_red) %>% sd())) glimpse(shovels_100) ## Rows: 100 ## Columns: 3 ##$ shovel_size      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $prop_red_results <list> [<tbl_df[1000 x 5]>, <tbl_df[1000 x 5]>, <tbl_df[10… ##$ prop_red_sd      <dbl> 0.483, 0.337, 0.271, 0.250, 0.214, 0.195, 0.191, 0.1…

Now, we have the standard deviation of prop_red for all shovel sizes from 1 to 100. Let’s plot that value to see how it changes as the shovel gets larger:

The red line here represents an important statistical concept: standard error (SE). The mathematical definition of SE for our purposes is the standard deviation divided by the square root of the sample size. As the shovel size increases, and thus our sample size increases, we find that the standard error decreases. If this is confusing right now, fear not! We will delve into this explanation in our next section.

This is the power of running many analyses at once using map functions and list columns: before, we could tell that the standard deviation was decreasing as the shovel size increased, but when only looking at shovel sizes of 25, 50, and 100, it wasn’t clear how quickly it was decreasing.

## 6.3 Standard error

Standard errors (SE) quantify the effect of sampling variation on our estimates. In other words, they quantify how much we can expect the calculated proportions of a shovel’s beads that are red to vary from one sample to another sample to another sample, and so on. As a general rule, as sample size increases, the standard error decreases.

The standard error is the standard deviation of a sample statistic (aka point estimate), such as the proportion. For example, the standard error of the mean refers to the standard deviation of the distribution of sample means taken from a population.

The relationship between the standard error and the standard deviation is that, for a given sample size, the standard error equals the standard deviation divided by the square root of the sample size. Accordingly, the standard error is inversely proportional to the sample size. The larger the sample size, the smaller the standard error.

If this sounds confusing, don’t worry! It is. Before we can explain this in more depth, it is important to understand some terminology.

### 6.3.1 Terminology and notation

All of the concepts underlying this terminology, notation, and definitions tie directly to the concepts underlying our tactile and virtual sampling activities. It will simply take time and practice to master them.

First, a population is the set of relevant units. The population’s size is upper-case $$N$$. In our sampling activities, the population is the collection of $$N$$ = 1,000 identically sized red and white beads in the urn. This is about the simplest possible population. Other examples are all the adult men in the US, all the classrooms in a school, all the wheelbarrows in Massachusetts, all the values of your blood pressure, read at five minute intervals, for your entire life. Often, the population is extends over time, as with your blood pressure readings and is, therefore, more amorphous. Consider all the people who have run for governor of a US state since 1900, or all the people who will run for governor through 2050. Those are also populations.

Second, a population parameter is a numerical summary quantity about the population that is unknown, but you wish you knew. For example, when this quantity is the mean, the population parameter of interest is the population mean. This is mathematically denoted with the Greek letter $$\mu$$ pronounced “mu.” In our earlier sampling from the urn activity, however, since we were interested in the proportion of the urn’s beads that were red, the population parameter is the population proportion, denoted with $$p$$.

Third, a census is an exhaustive enumeration or counting of all $$N$$ units in the population in order to compute the population parameter’s value exactly. In our sampling activity, this would correspond to counting the number of beads out of $$N$$ = 1000 that are red and computing the population proportion $$p$$ that are red exactly. When the number $$N$$ of individuals or observations in our population is large as was the case with our urn, a census can be quite expensive in terms of time, energy, and money. A census is impossible for any populations which includes the future, like our blood pressure next year or candidates for governor in 2040. There is a truth but we could not, even in theory, calculate it.

Fourth, sampling is the act of collecting a sample from the population when we can not, or do not want to, perform a census. The sample size is lower case $$n$$, as opposed to upper case $$N$$ for the population’s size. Typically the sample size $$n$$ is much smaller than the population size $$N$$. In our sampling activities, we used shovels with varying slots to extract samples of size $$n$$ = 1 through $$n$$ = 100.

Fifth, a point estimate, also known as a sample statistic, is a measure computed from a sample that estimates an unknown population parameter. In our sampling activities, recall that the unknown population parameter was the proportion of red beads and that this is mathematically denoted with $$p$$. Our point estimate is the sample proportion: the proportion of the shovel’s beads that are red. In other words, it is our guess at the proportion of the urn’s beads that are red. The point estimate of the parameter $$p$$ is $$\hat{p}$$. The “hat” on top of the $$p$$ indicates that it is an estimate of the unknown population proportion $$p$$.

Sixth, a sample is said to be representative if it roughly looks like the population. In other words, are the sample’s characteristics a good representation of the population’s characteristics? In our sampling activity, are the samples of $$n$$ beads extracted using our shovels representative of the urn’s $$N$$ = 1000 beads?

Seventh, a sample is generalizable if any results based on the sample can generalize to the population. In our sampling activity, can we generalize the sample proportion from our shovels to the entire urn? Using our mathematical notation, this is akin to asking if $$\hat{p}$$ is a “good guess” of $$p$$?

Eighth, biased sampling occurs if certain individuals or observations in a population have a higher chance of being included in a sample than others. We say a sampling procedure is unbiased if every observation in a population had an equal chance of being sampled. Had the red beads been much smaller than the white beads, and therefore more prone to falling out of the shovel, our sample would have been biased. In our sampling activities, since we mixed all $$N = 1000$$ beads prior to each group’s sampling and since each of the equally sized beads had an equal chance of being sampled, our samples were unbiased.

Ninth, a sampling procedure is random if we sample randomly from the population in an unbiased fashion. In our sampling activities, this would correspond to sufficiently mixing the urn before each use of the shovel.

In general:

• If the sampling of a sample of size $$n$$ is done at random, then
• the sample is unbiased and representative of the population of size $$N$$, thus
• any result based on the sample can generalize to the population, thus
• the point estimate is a “good guess” of the unknown population parameter, thus
• instead of performing a census, we can draw inferences about the population using sampling.

Specific to our sampling activity:

• If we extract a sample of $$n=50$$ beads at random, in other words, we mix all of the equally sized beads before using the shovel, then
• the contents of the shovel are an unbiased representation of the contents of the urn’s 1000 beads, thus
• any result based on the shovel’s beads can generalize to the urn, thus
• the sample proportion $$\hat{p}$$ of the $$n=50$$ beads in the shovel that are red is a “good guess” of the population proportion $$p$$ of the $$N=1000$$ beads that are red, thus
• instead of manually going over all 1,000 beads in the urn, we can make inferences about the urn by using the results from the shovel.

### 6.3.2 Statistical definitions

Now, for some important statistical definitions related to sampling. As a refresher of our 1,000 repeated/replicated virtual samples of size $$n$$ = 25, $$n$$ = 50, and $$n$$ = 100 in Section 6.2, let’s display our figure showing the difference in proportions red according to different shovel sizes.

These types of distributions have a special name: sampling distributions; their visualization displays the effect of sampling variation on the distribution of a point estimate; in this case, the sample proportion $$\hat{p}$$. Using these sampling distributions, for a given sample size $$n$$, we can make statements about what values we typically expect.

For example, observe the centers of all three sampling distributions: they are all roughly centered around $$0.4 = 40\%$$. Furthermore, observe that while we are somewhat likely to observe sample proportions of red beads of $$0.2 = 20\%$$ when using the shovel with 25 slots, we will almost never observe a proportion of 20% when using the shovel with 100 slots. Observe also the effect of sample size on the sampling variation. As the sample size $$n$$ increases from 25 to 50 to 100, the variation of the sampling distribution decreases and thus the values cluster more and more tightly around the same center of around 40%. We quantified this variation using the standard deviation of our sample proportions, seeing that the standard deviation decreases with the square root of the sample size:

So as the sample size increases, the standard deviation of the proportion of red beads decreases. This type of standard deviation has another special name: standard error

### 6.3.3 What is a “standard error?”

The “standard error” (SE) is a term that measures the accuracy with which a sample distribution represents a population through the use of standard deviation. Specifically, SE is used to refer to the standard deviation of a sample statistic (aka point estimate), such as the mean or median. For example, the “standard error of the mean” refers to the standard deviation of the distribution of sample means taken from a population.

In statistics, a sample mean deviates from the actual mean of a population; this deviation is the standard error of the mean.

Many students struggle to differentiate the standard error from the standard deviation. The relationship between the standard error and the standard deviation is such that, for a given sample size, the standard error equals the standard deviation divided by the square root of the sample size. Accordingly, the standard error is inversely proportional to the sample size; the larger the sample size, the smaller the standard error because the statistic will approach the actual value.

The more data points involved in the calculations of the mean, the smaller the standard error tends to be. When the standard error is small, the data is said to be more representative of the true mean. In cases where the standard error is large, the data may have some notable irregularities. Thus, larger sample size = smaller standard error = more representative of the truth.

To help reinforce these concepts, let’s re-display our previous figure but using our new sampling terminology, notation, and definitions:

Furthermore, let’s display the graph of standard errors for $$n = 1$$ to $$n = 100$$ using our new terminology, notation, and definitions relating to sampling.

Remember the key message of this last table: that as the sample size $$n$$ goes up, the “typical” error of your point estimate will go down, as quantified by the standard error.

### 6.3.4 The moral of the story

If we could only know two pieces of information from our data, what would they be? First, you need a measure of the center of the distribution. This would include the mean or median, which shows the center of our data points. Second, we need a measure of the variability of the distribution. To understand our center, we must understand how different (or how spread) our data points are from one another. Thus, we need a measure like sd() or MAD. These are summary statistics which are necessary to understanding a distribution. Do those two figures encompass all you need to know about a distribution? No! But, if you are only allowed two numbers to keep, those are the most valuable.

The mean or median is a good estimate for the center of the posterior and the standard error or mad is a good estimate for the variability of the posterior, with +/- 2 standard errors covering 95% of the outcomes.

The standard error measures the accuracy of a sample distribution as compared to the population by using the standard deviation. Specifically, the standard deviation of our data points divided by the square root of the sample size. As such, we find that larger sample sizes = lower standard errors = more accurate and representative guesses.

To really drive home our point: standard error is just a fancy term for your uncertainty about something you don’t know. Standard error == our (uncertain) beliefs.

This hierarchy represents the knowledge we need to understand standard error (SE). At the bottom, we have math. It’s the foundation for our understanding, but it doesn’t need to be what we take away from this lesson. As we go up, we simplify the topic. The top of the pyramid are the basic levels of understanding that will help you to remember in the future.

If I know your estimate plus or minus two standard errors, I know your 95% confidence interval. This is valuable information. Standard error is really just a measure for how uncertain we are about something we do not know, the thing we are estimating. When we recall SE, we should remember that, all in all, it’s a complicated concept that can be distilled into: the way old people talk about confidence intervals.

Recall that $$\hat{p}$$ is the estimated value of p which comes from taking a sample. There can be billions and billions of $$\hat{p}$$’s. We look at a large group of $$\hat{p}$$’s, create a distribution of results to represent the possible values of p based on our findings, and then we compute a standard error to account for our own uncertainty about our predictions. Our 95% confidence interval for our prediction == our estimate plus or minus two standard errors.

In regards to the fifth layer of the hierarchy, we may wonder:

“I thought that MADs were the same thing as standard deviations. Now you say they are the same things as standard errors. Which is it?”

MADs and standard deviations are, more or less, the same thing. They are both measures of the variability of a distribution. In most cases, they have very similar values. A standard error is also a standard deviation. Specifically, it is the standard deviation of the distribution of the estimates, and that distribution of estimates is, more or less, your posterior. Therefore, we can use MAD, like standard error, to describe that distribution and the variability of that distribution.

Recall the questions we asked at the beginning of our chapter:

If we get 17 red beads in a random sample of size 50 taken from an urn, what proportion $$p$$ of the beads in the urn are red?

What is the probability, using the same urn, that we will draw more than 8 red beads if we use a shovel of size 20?

We have looked at a number of different models and their accuracy and precision. To delve into our primary questions, we must return to our discussion of distributions: posterior, joint, and marginal.

### 6.4.1 Joint distribution

Let’s start with our first question: If we get 17 red beads in a random sample of size 50 taken from an urn with 1000 (white or red) beads, how many red beads are in the urn?

In Chapter 5 we outlined the key intuition behind all inference: there is a joint distribution of data-which-we-might-see-from-our-experiment and models-which-we-are-considering. We take this joint distribution, combine it with the actual results of the experiment, and calculate a posterior distribution over the set of possible models. In other words, we start with

$\text{Prob}(\text{models}, \text{data})$

We then add the specific results of the experiment, and calculate the conditional distribution of the models, given the data that we have seen:

$\text{Prob}(\text{models} | \text{data} = \text{results of experiment})$

The rest is just details.

It is useful to define a parameter, $$p$$, which is the proportion of red beads in the urn. This also takes on exactly 1,001 possible values: $$0, 1/1000, 2/1000, ..., 999/1000, 1$$. With these assumptions, we can create an unnormalized joint distribution:

set.seed(10)

# Set values for the urn, paddle, and reps as this helps with creating our joint
# distribution later. Using more than reps = 100 causes the 3-D plot to blow up.

urn_size <- 1000
reps <- 100

# Create a tibble where red_in_urn creates a sequence from 0 to urn_size(1000)
# by a value of 1. Create p as red_in_urn divided by the total urn_size. Column
# red_in_paddle will map p using rbinom(). Call n = reps (where reps is defined
# as 100) and size as paddle_size (defined as 50). Finally, unnest!

joint_dist <-
tibble(red_in_urn = seq(from = 0, to = urn_size, by = 1)) %>%
mutate(p = red_in_urn / urn_size) %>%
mutate(red_in_paddle = map(p, ~ rbinom(n = reps,
prob = .))) %>%

# After making our joint distribution, take a sample of size 10.

joint_dist %>%
slice_sample(n = 10)
## # A tibble: 10 x 3
##         <dbl> <dbl>         <int>
##  1        974 0.974            46
##  2        957 0.957            47
##  3        199 0.199            11
##  4        860 0.86             46
##  5         36 0.036             3
##  6        221 0.221            13
##  7        862 0.862            44
##  8        117 0.117             3
##  9        664 0.664            30
## 10        944 0.944            49

After looking at our joint distribution, let’s plot it. Call the red_in_urn for our y-axis, to indicate the number of red beads in the urn. Call red_in_paddle for our x-axis, to indicate the number of red beads in each paddle. This allows us to see which possible values of red beads in our samples are most associated with certain values for the number of red beads in the urn.

joint_dist %>%
ggplot(aes(y = red_in_urn, x = red_in_paddle)) +
geom_point(alpha = 0.005) +
labs(title = "Joint Distribution of Red Beads in Paddle and in Urn",
y = "Number of Red Beads in Urn")

It is easier to define our “model” in this joint distribution in terms of the unknown parameter $$p$$ rather than, as above, in terms of the unknown number of red beads in the urn. Showing our plot in terms of $$p$$ and using the term “proportion” rather than “number of red beads” is closer to what we are trying to model.

joint_dist %>%
ggplot(aes(y = p, x = red_in_paddle)) +
geom_point(alpha = 0.005) +
labs(title = expression(paste(
hat(p))),

# Add the symbol for p-hat.

y = expression(paste("Proportion of Red Beads in Urn ", hat(p)))) +

# Highlight y-axis change in red to view change.

theme(axis.title.y = element_text(colour = "red"))

Now, we have explored the most likely possibilities according to all possible models of this scenario. Let’s create a posterior distribution by taking a slice of our joint distribution based on the number of red beads in the paddle.

### 6.4.2 Posterior distribution

When we look at our question – we are asking for the posterior probability of a certain case. This case is where we find that, using the 50 slot shovel, we get a result of 17 red beads. What does that make our posterior probability? We want:

$\text{Prob}(\text{models} | \text{data} = 17)$

The same approach works again. We take a “slice” of the joint distribution where the number of red beads in the paddle is equal to 17. After normalization, that gives us the posterior probability for $$p$$, or for the number of red beads in the urn. Recall that there is a one-to-one mapping between $$p$$ and the number of red beads.

joint_dist %>%

# Filtering for where the red_in_paddle equals 17 allows us to isolate the
# data that pertains to our question.

ggplot(aes(x = p)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 50) +
labs(title = "Posterior Probability Distribution",
subtitle = "Proportion of red beads in urn is centered around 0.34",
x = "Proportion p of Red Beads in Urn",
y = "Probability") +

# Here, the labels call allows us to format our axises by whether we would
# like them to display as numbers or percentages.

scale_x_continuous(labels = scales::number_format()) +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()

To visualize this in a 3D plot, see the slice in the figure below. There are a few things to note. First, the x-axis only extends to 600 (rather than the total 1000) due to the fact that, when the paddle only draws 17/50 = .34 = 34% red beads, it is almost impossible for the total number red beads in the total urn to exceed 600. The most likely number of red beads in the total urn, when the paddle draws 17/50 red beads, is somewhere in the 290-350 range. This is signified by the peak between these points. The probability is lower below 280 and above 350 because it is less likely that this is the number of red beads in the urn, conditional on the fact that the paddle drew 17/50 red beads. This is our posterior distribution.

For any value of red beads found in the paddle (from 0 to 50), there is a posterior distribution of possible values of red beads in the total urn. This distribution is meant to illustrate that, though our sample gives an estimate of the total proportion of red beads in the entire urn, there is much variation in what the actual proportion might be. To summarize: the distribution of the hat things (the things we are predicting) is, more or less, your posterior.

Recall the first question from the start of the Chapter:

If we get 17 red beads in a random sample of size 50 taken from an urn, what proportion $$p$$ of the beads in the urn are red?

The answer is not a single number. Our posterior distribution is just that: a distribution. From our sample, we have many different results for the proportion of red beads in the entire urn. Certain proportions, like the extremes close to 0% or 100%, are essentially impossible due to our sample value being 32%. Some sample proportions, like 40%, occur far more frequently than other sample proportions.

This means that, while we can provide a range of possibilities (and we can estimate which of those possibilities occur most frequently), we can never say that we know the total number of red beads with certainty.

The key issue with the urn paradigm is that, unlike real world data science, we can technically find the true proportion. We just have to perform that exhaustive count that we’ve been so desperately avoiding. In the real world, the “true” solution can never be known. This is why we require inference, and why tools like sampling are so helpful for drawing conclusions about the world around us.

### 6.4.3 Hypothesis tests

Recall our view on hypothesis tests: Amateurs test. Professionals summarize. Traditionally, most scientific papers are not so much interested in estimating p. They are interested in testing specific hypotheses. What do we mean by that?

Let’s look at a possible hypothesis in our urn paradigm: there are equal number of red and white beads in the urn. The null hypothesis is denoted by $$H_0$$, while the alternative hypothesis is denoted by $$H_a$$. Therefore, our hypothesis is designed as such:

$$H_0$$: There are an equal number of red and white beads in the urn. $$H_a$$: There are not an equal number of red and white beads in the urn.

Can we reject that hypothesis? Convention: if the 95% confidence interval excludes the null hypothesis, then we reject it. Here, that would mean if our posterior estimate (plus or minus 2 standard errors) excluded the possibility of the red and white beads being equal, translating into a proportion red of 50%, we can reject the null hypothesis. Otherwise, we don’t reject it. However, this does not mean that we accept it. More to the point: rejecting or not rejecting hypotheses doesn’t helps us to answer real questions. There is no reason to test when you can summarize by providing the full posterior probability distribution, as we will do below.

The same arguments apply in the case of “insignificant” results, with p > 0.5, when we can’t “reject” the null hypothesis. The fact that a difference is not “significant” has no relevance to how we use the posterior to make decisions. The same reasoning applies to every parameter we estimate, to every prediction we make. Never test — unless your boss demands a test. Use your judgment, make your models, summarize your knowledge of the world, and use that summary to make decisions. In the next section, we will create our posterior distribution and use this posterior to predict outcomes.

### 6.4.4 Using the posterior

Now that we have our posterior distribution for the proportion $$p$$ of red beads in the urn, we can use that object to forecast other outcomes, outcomes which we have not yet observed. Recall, from the start of the Chapter:

Conditional on our answer to Question 1, what is the probability, using the same urn, that we will draw more than 8 red beads if we use a shovel of size 20?

We need to create an (unnormalized) posterior distribution for the number of red beads in a new draw with a 20-slot paddle.

set.seed(17)

# Define a paddle size for our joint distributions

# To create our posterior distribution, filter for where the red in paddle is
# equal to 17. This takes our slice.

post_dist <- joint_dist %>%

# Remove the columns of red_in_urn and red_in_paddle so we are left with p.

# Create object new_reds which maps p using rbinom with size defined as our
# previously created paddle_size of 20.

mutate(new_reds = map_int(p, ~ rbinom(n = 1,
prob = .)))

# Look at our created posterior distribution.

post_dist 
## # A tibble: 1,919 x 2
##        p new_reds
##    <dbl>    <int>
##  1 0.161        2
##  2 0.171        7
##  3 0.175        3
##  4 0.176        5
##  5 0.176        3
##  6 0.186        4
##  7 0.189        2
##  8 0.197        2
##  9 0.198        5
## 10 0.2          2
## # … with 1,909 more rows

Note that we have two columns to focus on after removing the columns red_in_urn and red_in_paddle. The column p represents our probability for the number of red beads we draw with our paddle size of 20 (conditional upon a previous draw of 17 red beads). The column new_reds represents the number of red beads found in our draw of twenty beads with the corresponding value for p. Scroll through our object to look at the probabilities for various results of new_reds.

With this unnormalized distribution, it is easy to create the normalized posterior probability distribution.

post_dist %>%
ggplot(aes(x = new_reds)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 50) +
labs(title = "Posterior Probability Distribution",
subtitle = "Number of red beads in new draw of 20",
x = "Number of Red Beads",
y = "Probability") +
scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_classic()

To calculate the exact probability of drawing more than 8 reds, we can work directly with the unnormalized distribution.

sum(post_dist$new_reds > 8) / length(post_dist$new_reds)
## [1] 0.26

The answer is about 26%. To observe this visually, take a look at the post_dist object when we filter for where the number of red beads is equal to and greater than 8. The answer above measures the area under this portion of the curve as compared to the entire area under the curve. Each question requires looking at a new area under the curve. When someones asks you a question, they are doing two things. First, they are providing instructions as to the posterior your should create. Here, the results with a shovel of 20 slots. Second, they are asking a question about the area under the curve in a specific region. Here, the region where the number of red beads is greater than 8 is highlighted in red. Therefore, the area below the curve that is red is how we get our estimate.

post_dist %>%

# Create a column above_eight to identify True or False for new_reds being
# above eight.

mutate(above_eight = ifelse(new_reds > 8, "True", "False")) %>%

# Set fill as above_eight.

ggplot(aes(x = new_reds, fill = above_eight)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 50) +

# Scale_fill_manual() here is calling grey for the first color and red for the
# second color. This is going to highlight the portion of the curve that we
# need to isolate in red.

scale_fill_manual(values = c('grey50', 'red'))+
labs(title = "Posterior Probability Distribution",
subtitle = "Number of red beads in new draw of 20",
x = "Number of Red Beads",
y = "Probability",
fill = "Above Eight Beads?") +
scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_classic()

To examine the validity of our final conclusion, let’s look to our Cardinal Virtues.

### 6.4.5 Cardinal virtues

To start, recall the virtue of Wisdom. Wisdom encompasses the topics of validity, data, population, and ethics. Most importantly, we want to know if there is a decent connection between the problem we face and the data we have. Our problem is answering the following question: Conditional on our answer to Question 1, what is the probability, using the same urn, that we will draw more than 8 red beads if we use a shovel of size 20? The way in which we have approached answering this question is to, from our posterior distribution from our urn, create a joint distribution to estimate the probability of drawing more than 8 red beads with a shovel of size 20. Is this question about this particular urn, or all urns? Clearly, as it is conditional upon a result from our specific urn, our conclusions can only be applied to this particular urn; our conclusions are not generalizable to all urns.

Justice encompasses two main topics that apply to our urn: predictive versus causal model and mathematical formulas. Are we are modeling (just) for prediction or are we (also) modeling for causation? Predictive models care nothing about causation. With prediction, all we care about is forecasting Y given X on some as-yet-unseen data. There is no notion of “manipulation” in such models. As we can see, then, our model is predictive; we are not manipulating anything. The math of our model can be: $T_i \sim B(p_H, n = 20)$ The total number $$T$$ of red beads in experiment $$i$$ with 20 sampled beads, $$T_i$$, is distributed as a binomial with $$n = 20$$ and an unknown probability $$\rho_h$$ of the proportion red exceeding 8 out of 20 beads.

Courage focuses on code – the structure of our model. Code allows us to “fit” a model by estimating the values of the unknown parameters. Though we can never know the true values of these parameters, we can express our uncertain knowledge in the form of posterior probability distributions. With those distributions, we can compare the actual values of the outcome variable with the “fitted” or “predicted” results of the model. We can examine the “residuals,” the difference between the fitted and actual values. To answer our question, we have taken a slice of our posterior distribution to create a joint distribution. To find our final solution, we looked at the area under the curve that includes all values of 8 red beads or above.

When we think of Temperance, we should think of uncertainty. Asserting a posterior distribution is, conceptually, the same as opening a casino and allowing people to place wagers on outcomes, with the odds consistent with your posterior. If your posterior is wrong, they will take all your money. That would be bad! So, what do we do? Rule of thumb: the future is always more variable than our models would suggest. The entire point of the virtue of Temperance to be aware of that fact and take account of it. Courage — that bold bastard — comes along and says, “Here is the model! Let’s calculate the posterior and open the casino!” Temperance says, “Hold on! Yes, I can calculate the posterior, but we need to be cautious.” Temperance is concerned with the tails of our predictions, not the center. What is the best strategy for opening our casino, then? Limit the bets on the first few nights to make sure our posterior is OK. Temperance is concerned with the sensible management of the casino.

## 6.5 Discussion

### 6.5.1 stan_glm()

Instead of building the joint and conditional probability distributions “by hand” as above, we can just use stan_glm() from the rstanarm package. For now, we will just walk through the code for doing this very quickly. Chapter 7 will provide a step-by-step explanation. To be clear, you do not need to fully understand this section or how this code works. This is an introduction, not a formal lesson.

Assume that, as before, a shovel of size 50 drawn from the urn has returned 17 red beads and 33 white.

library(rstanarm)

fit_1 <- stan_glm(formula = red ~ 1,
data = tibble(red = c(rep(1, 17),
rep(0, 33))),
family = binomial,
refresh = 0,
seed = 2021)

This single function call has done all the work which we did on our own previously: created the joint distribution and then estimated the posterior probability distribution, conditional on the data which was passed in to the data argument. Once we have the fit_1 object, it is easy to answer two sorts of questions: the posterior probability distribution for $$p$$ and predictions for new draws from the urn. The key new functions are posterior_epred() for the former and posterior_predict() for the latter.

For the posterior probability distribution:

posterior_epred(fit_1,
newdata = tibble(constant = 1)) %>%
as_tibble() %>%
ggplot(aes(x = 1)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 50) +
labs(title = "Posterior Probability Distribution",
subtitle = "Proportion of red beads in urn is centered around 0.34",
x = "Proportion p of Red Beads in Urn",
y = "Probability") +

scale_x_continuous(labels = scales::number_format()) +
scale_y_continuous(labels = scales::percent_format()) +
theme_classic()

For the posterior probability distribution for the number of red beads drawn with a shovel of size 20:

posterior_predict(fit_1,
newdata = tibble(constant = rep(1, 20))) %>%
as_tibble() %>%
mutate(total = rowSums(across(1:20))) %>%
select(total) %>%
ggplot(aes(total)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
bins = 50) +
labs(title = "Posterior Probability Distribution",
subtitle = "Number of red beads in 20-place shovel",
x = "Number of Red Beads",
y = "Probability") +
scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_classic()

These two posteriors are very similar to the ones we created, much more laboriously, above. See Chapter 7 for a thorough discussion of the use of rstanarm. This package will be our main tool for the rest of the Primer.

### 6.5.2 Case study: Polls

Let’s now switch gears to a more realistic sampling scenario: a poll. In practice, pollsters do not take 1,000 repeated samples as we did in our previous sampling activities, but rather take only a single sample that’s as large as possible.

On December 4, 2013, National Public Radio in the US reported on a poll of President Obama’s approval rating among young Americans aged 18-29 in an article, “Poll: Support For Obama Among Young Americans Eroding.” The poll was conducted by the Kennedy School’s Institute of Politics at Harvard University. A quote from the article:

After voting for him in large numbers in 2008 and 2012, young Americans are souring on President Obama.

According to a new Harvard University Institute of Politics poll, just 41 percent of millennials — adults ages 18-29 — approve of Obama’s job performance, his lowest-ever standing among the group and an 11-point drop from April.

Let’s tie elements of the real life poll in this new article with our “tactile” and “virtual” urn activity from Sections 6.1 and 6.2 using the terminology, notations, and definitions we learned in Section 6.3. You’ll see that our sampling activity with the urn is an idealized version of what pollsters are trying to do in real life.

First, who is the (Study) Population of $$N$$ individuals or observations of interest?

• Urn: $$N$$ = 1000 identically sized red and white beads
• Obama poll: $$N$$ = ? young Americans aged 18-29

Second, what is the population parameter?

• Urn: The population proportion $$p$$ of all the beads in the urn that are red.
• Obama poll: The population proportion $$p$$ of all young Americans who approve of Obama’s job performance.

Third, what would a census look like?

• Urn: Manually going over all $$N$$ = 1000 beads and exactly computing the population proportion $$p$$ of the beads that are red.
• Obama poll: Locating all $$N$$ young Americans and asking them all if they approve of Obama’s job performance. In this case, we don’t even know what the population size $$N$$ is!

Fourth, how do you perform sampling to obtain a sample of size $$n$$?

• Urn: Using a shovel with $$n$$ slots.
• Obama poll: One method is to get a list of phone numbers of all young Americans and pick out $$n$$ phone numbers. In this poll’s case, the sample size of this poll was $$n = 2089$$ young Americans.

Fifth, what is your point estimate (AKA sample statistic) of the unknown population parameter?

• Urn: The sample proportion $$\hat{p}$$ of the beads in the shovel that were red.
• Obama poll: The sample proportion $$\hat{p}$$ of young Americans in the sample that approve of Obama’s job performance. In this poll’s case, $$\hat{p} = 0.41 = 41\%$$, the quoted percentage in the second paragraph of the article.

Sixth, is the sampling procedure representative?

• Urn: Are the contents of the shovel representative of the contents of the urn? Because we mixed the urn before sampling, we can feel confident that they are.
• Obama poll: Is the sample of $$n = 2089$$ young Americans representative of all young Americans aged 18-29? This depends on whether the sampling was random.

Seventh, are the samples generalizable to the greater population?

• Urn: Is the sample proportion $$\hat{p}$$ of the shovel’s beads that are red a “good guess” of the population proportion $$p$$ of the urn’s beads that are red? Given that the sample was representative, the answer is yes.
• Obama poll: Is the sample proportion $$\hat{p} = 0.41$$ of the sample of young Americans who supported Obama a “good guess” of the population proportion $$p$$ of all young Americans who supported Obama at this time in 2013? In other words, can we confidently say that roughly 41% of all young Americans approved of Obama at the time of the poll? Again, this depends on whether the sampling was random.

Eighth, is the sampling procedure unbiased? In other words, do all observations have an equal chance of being included in the sample?

• Urn: Since each bead was equally sized and we mixed the urn before using the shovel, each bead had an equal chance of being included in a sample and hence the sampling was unbiased.
• Obama poll: Did all young Americans have an equal chance at being represented in this poll? Again, this depends on whether the sampling was random.

Ninth and lastly, was the sampling done at random?

• Urn: As long as you mixed the urn sufficiently before sampling, your samples would be random.
• Obama poll: Was the sample conducted at random? We can’t answer this question without knowing about the sampling methodology used by Kennedy School’s Institute of Politics at Harvard University. We’ll discuss this more at the end of this section.

In other words, the poll by Kennedy School’s Institute of Politics at Harvard University can be thought of as an instance of using the shovel to sample beads from the urn. Furthermore, if another polling company conducted a similar poll of young Americans at roughly the same time, they would likely get a different estimate than 41%. This is due to sampling variation.

Let’s now revisit the sampling paradigm from Subsection 6.3.1:

In general:

• If the sampling of a sample of size $$n$$ is done at random, then
• the sample is unbiased and representative of the population of size $$N$$, thus
• any result based on the sample can generalize to the population, thus
• the point estimate is a “good guess” of the unknown population parameter, thus
• instead of performing a census, we can infer about the population using sampling.

Specific to the urn:

• If we extract a sample of $$n = 50$$ beads at random, in other words, we mix all of the equally sized beads before using the shovel, then
• the contents of the shovel are an unbiased representation of the contents of the urn’s 1000 beads, thus
• any result based on the shovel’s beads can generalize to the urn, thus
• the sample proportion $$\hat{p}$$ of the $$n = 50$$ beads in the shovel that are red is a “good guess” of the population proportion $$p$$ of the $$N = 1000$$ beads that are red, thus
• instead of manually going over all 1000 beads in the urn, we can infer about the urn using the shovel.

Specific to the Obama poll:

• If we had a way of contacting a randomly chosen sample of 2089 young Americans and polling their approval of President Obama in 2013, then
• these 2089 young Americans would be an unbiased and representative sample of all young Americans in 2013, thus
• any results based on this sample of 2089 young Americans can generalize to the entire population of all young Americans in 2013, thus
• the reported sample approval rating of 41% of these 2089 young Americans is a good guess of the true approval rating among all young Americans in 2013, thus
• instead of performing an expensive census of all young Americans in 2013, we can infer about all young Americans in 2013 using polling.

So as you can see, it was critical for the sample obtained by Kennedy School’s Institute of Politics at Harvard University to be truly random in order to infer about all young Americans’ opinions about Obama. Was their sample truly random? It’s hard to answer such questions without knowing about the sampling methodology they used.

For example, what if Kennedy School’s Institute of Politics at Harvard University conducted this poll using only mobile phone numbers? People without mobile phones would be left out and therefore not represented in the sample. This flaw is an example of censoring, the exclusion of certain datapoints due to an issue with data collection. This results in an incomplete observation and increases the prediction uncertainty of the estimand, Obama’s approval rating among young Americans. Ensuring that our samples were random was easy to do in our sampling urn exercises; however, in a real life situation like the Obama poll, this is much harder to do.

What you have visualized in this chapter was a demonstration of a famous theorem, or mathematically proven truth, called the Central Limit Theorem. It loosely states that when sample means are based on larger and larger sample sizes, the sampling distribution of these sample means becomes both more and more normally shaped and more and more narrow. In other words, the sampling distribution increasingly follows a normal distribution and the variation of these sampling distributions gets smaller, as quantified by their standard errors.

### 6.5.3 Sampling Mechanism

One of the most important aspects of sampling is the sampling mechanism: the mechanism through which we sample our population. This concept is related, but distinctly different, from the assignment mechanism that we learned about in Chapter 3.

The assignment mechanism sorts units into control and experiment groups, while the sampling mechanism is the means through which we acquire our sample. Assignment mechanisms do not have a place in our urn paradigm since we are not measuring any causal relationship or assigning beads to specific groups.

To think about the sampling mechanism further: why are certain beads sampled, while others are not? Is this completely random?

#### 6.5.3.1 Preceptor Tables

Recall that a Preceptor Table is a table with rows and columns for all the data we would (reasonably) like to have. There are two different types of Preceptor Tables that are applicable to our urn example: actual and ideal.

An actual Preceptor Table shows what we actually know. Accordingly, the table is riddled with question marks that the real world saddles us with. The ideal Preceptor Table is the Preceptor Table with no question marks, and a reasonable number of rows and columns. With an ideal Preceptor Table, there is no need for inference; the estimand is a simple matter of arithmetic.

To visualize the different Preceptor Tables and their usefulness to us data scientists, let’s revisit our urn.

As is the case in the Rubin Causal Model, we wish we knew the values for every single unit in every possible scenario. The analogous ideal here would be a table where we know the color identity of every single bead. To compare to our previous Preceptor Table, this is what our ideal Preceptor Table would look like:

ID Year
1 white
2 white
... ...
200 red
201 white
... ...
1000 red

We can create the ideal Preceptor Table by performing the exhaustive census of the entire urn. Let’s say that, after this tedious process, we find that the true and real proportion of red beads is exactly 40%. We know the color of every bead.

In this world, the estimand, the proportion of red beads in the urn, is a simple matter of arithmetic. However, as has been emphasized before, performing an exhaustive count is not the easiest way to estimate the proportion of red beads. Real life sampling is far more complex. The process is extremely prone to error. Despite that, most people overestimate the validity of conclusions drawn from sampling. To stress the unknowns in sampling, let’s look at our actual Preceptor Table.

Let’s imagine we use our shovel and sample 100 beads from the urn. After taking our sample, we find that 40% of the sampled beads are red. Let’s visualize this by looking at the entire urn after we have taken our sample. This is an actual Preceptor Table. We only know the colors of our randomly sampled 100 beads, the remaining bead colors are our missing data!

ID Color
1 ?
2 ?
... ...
... ...
200 red
201 ?
... ...
... ...
999 red
1000 ?

Where is all of our data? Well, when we take a sample of 100 beads, we only have color identifications for 100 of the total 1000 beads in the urn. The rest of the beads were not sampled and we cannot say for certain whether they are white or red.

Something else we must consider is why some beads do get sampled, while others do not. This is a consequence of the sampling mechanism. Because we are drawing the sample using a shovel with 100 slots, we only have 100 known values. Therefore, our shovel (in addition to mixing the urn beforehand) is our sampling mechanism.

Consider this: what does this information tell us, specifically, about Bead 1? Bead 2? We know the proportion of red beads in our sample is 40%. Does this mean that bead 1 has precisely a 40% chance of being red? As we learned in Chapter 5, this is not true! There is uncertainty.

We can only claim for certain that, of the 100 beads that we sampled (of the total 1000 beads in the urn), 40% were red. If we were making a prediction of the probability of one of our sampled beads being red, 40% would be the correct probability. If we were making a prediction of the probability that an unsampled bead was red, the answer of 40% would be incorrect, but nor is it likely to be extremely incorrect. The sample does tell us something.

### 6.5.4 Precision versus accuracy (or bias versus variance)

We saw in a previous section that as our sample size $$n$$ increases, our point estimates will vary less and less and be more and more concentrated around the true population parameter. This variation is quantified by the decreasing standard error. In other words, the typical error of your point estimates will decrease. In our sampling exercise, as the sample size increased, the variation of our sample proportions $$\hat{p}$$ decreased. This is also known as having a precise estimate.

So random sampling ensures our point estimates are accurate, while on the other hand having a large sample size ensures our point estimates are precise. While the terms “accuracy” and “precision” may sound like they mean the same thing, there is a subtle difference. Accuracy describes how “on target” our estimates are, whereas precision describes how “consistent” our estimates are. The image below illustrates the difference.

Now, it’s obvious that the best case scenario is the most precise and the most accurate option. However, real life sampling isn’t so easy!

What if we had the option to use a shovel with 200 slots, but it had a minor magnetic property that caused it to pick up slightly more red beads than the 100 slotted shovel? On one hand, the larger shovel gives us increased precision due to a larger sample size. On the other, its magnetic property gives us decreased accuracy due to a sampling bias. Here, as is often the case in the real world, there is a tradeoff!

### 6.5.5 El Jefe’s or Felipe’s?

To apply this to a more complex example than an urn, imagine we are conducting a poll to gather information on whether Harvard students prefer El Jefe’s Taqueria or Felipe’s Taqueria. We are given two options: 1) we can conduct the poll on 50 students walking out of Lamont Library or 2) we can conduct the poll on 300 students walking outside of El Jefe’s Taqueria.

The first scenario provides a more accurate estimate, since the students outside of Lamont Library are much less likely to be a biased sample. The second scenario, however, provides a more precise estimate due to the larger sample size. Even though the students outside of El Jefe’s might not be patrons, it is safe to assume that at least some will be patrons, making our sample somewhat biased. What is the better option? It depends on your views of the trade-offs between accuracy and precision. The more that precision matters, the more you should prefer the larger sample.

## 6.6 Summary

Key lesson:

There is a truth! There is a true value for $$p$$ which we do not know! We want to create a posterior probability distribution which summarizes our knowledge. We care about the posterior probability distribution of p. The center of that distribution is around the mean or median of the proportion in your sample. The sd (or mad) of that posterior is the standard deviation divided by the square root of our sample size! Note that this is the same thing as the standard deviation of the repeated samples.

Key lesson:

We have a journey from reality, to our predictions, to the standard error of our predictions, to the posterior probability distribution for p. This is our sequence:

p (the truth) $$\Rightarrow$$ $$\hat{p}$$ (my estimate) $$\Rightarrow$$ the standard error of $$\hat{p}$$ (black box of math mumbo jumbo and computer simulation magic) $$\Rightarrow$$ our posterior probability distribution for p (our beliefs about the truth).

This journey shows how our beliefs about the truth develop through our work. We begin with p; p is the truth, the true but unknown value we are estimating. $$\hat{p}$$ is our estimate for p. There can be millions and millions of $$\hat{p}$$’s. Next, we must take the standard error of our estimates (our $$\hat{p}$$’s) to account for our uncertainty with our predictions. Finally – the thing we need most – we create a posterior probability distribution for p. This distribution is used to answer key questions about p.

Other highlights:

### 6.6.1 Tactial sampling

• Sampling allows us to make guesses at an unknown, difficult-to-obtain value by looking at a smaller subset of data and generalizing it to the larger population.
• Sampling is preferable in the urn example because counting out 1000 beads from an urn is intensive and tedious.
• Sampling is preferable in the real-world because it is often impossible to sample “all the beads” (all the people) in a population. With sampling, you see variations in your results. This is known as sampling variation and is expected, especially if you draw samples at random (unbiased).

### 6.6.2 Virtual sampling

• By creating a virtual analog of our urn and our shovel, we were able to look at even more samples and observe the effects of sampling size on our results.
• More samples yield more even distributions which should resemble a bell.
• Larger sample sizes decrease standard deviation, meaning that the resulting proportions red are closer to one another than when the sample sizes are smaller. This means that larger samples = lower SD = more precise guesses.
• When we are writing a lot of code for something we want to perform frequently, write a function instead! This saves time and functions can be called within map() for more ease when plotting results.

### 6.6.3 Standard error

• Standard error is just a fancy term for your uncertainty about something you don’t know. Standard error $$\approx$$ our (uncertain) beliefs.
• The standard error measures the accuracy of a sample distribution as compared to the population by using the standard deviation.
• We find that larger sample sizes $$\implies$$ lower standard errors $$\implies$$ more accurate estimates.
• If we could only know two pieces of information from our data, we would need a measure of the center of the distribution (like mean or median) and a measure of the variability of the distribution (like sd or MAD).
• SE refers to the standard deviation of a sample statistic (aka point estimate), such as the mean or median. Therefore, the “standard error of the mean” refers to the standard deviation of the distribution of sample means taken from a population.

• A joint distribution allows us to explore the most likely possibilities according to all possible models of this scenario.
• The posterior distribution takes a slice of our joint distribution that is more specific to our question.
• stan_glm() can create a joint distribution and then estimate the posterior probability distribution, conditional on the data which was passed in to the data argument. This is a much easier way to create the posterior distribution, and will be explored in more detail in Chapter 7.
In this chapter, we performed both tactile and virtual sampling exercises to infer about an unknown parameter We also presented a case study of sampling in real life with polls. In each case, we used the sample proportion $$\hat{p}$$ to estimate the population proportion $$p$$. However, we are not just limited to scenarios related to proportions. In other words, we can use sampling to estimate other population parameters using other point estimates as well.