EDA of shaming
Imagine you are running for Governor and want to do a better job of getting your voters to vote. You recently read about a large-scale experiment showing the effect of sending out a voting reminder that “shames” citizens who do not vote. You are considering sending out a “shaming” voting reminder yourself. What will happen if you do? Will more voters show up to the polls? Additionally, on the day of the election a female citizen is randomly selected. What is the probability she will vote?
Consider a new data set, shaming, corresponding to an experiment carried out by Gerber, Green, and Larimer (2008) titled “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” This experiment used several hundred thousand registered voters and a series of mailings to determine the effect of social pressure on voter turnout.
Let’s now do another EDA, starting off by running glimpse().
## Rows: 344,084
## Columns: 15
## $ cluster <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", …
## $ primary_06 <int> 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1…
## $ treatment <fct> Civic Duty, Civic Duty, Hawthorne, Hawthorne, Hawthorne…
## $ sex <chr> "Male", "Female", "Male", "Female", "Female", "Male", "…
## $ age <int> 65, 59, 55, 56, 24, 25, 47, 50, 38, 39, 65, 61, 57, 37,…
## $ primary_00 <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
## $ general_00 <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", …
## $ primary_02 <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", …
## $ general_02 <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", …
## $ primary_04 <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
## $ general_04 <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",…
## $ hh_size <int> 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1…
## $ hh_primary_04 <dbl> 0.095, 0.095, 0.048, 0.048, 0.048, 0.048, 0.048, 0.048,…
## $ hh_general_04 <dbl> 0.86, 0.86, 0.86, 0.86, 0.86, 0.90, 0.90, 0.90, 0.90, 0…
## $ neighbors <int> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,…
Here we see that glimpse() gives us a look at the raw data contained within the shaming data set. At the very top of the output, we can see the number of rows and columns, or observations and variables respectively. There are 344,084 observations, with each row corresponding to a unique respondent. “Columns: 15” tells us that there are 15 variables in this data set. Below this, we see a truncated version of the data set, with the variables listed down the left side and the first few values of each variable shown as a comma-separated list. This is transposed relative to the usual tibble output, which presents the variables as columns and the observations as rows.
From this summary, we get an idea of some of the variables we will be working with. Variables of particular interest to us are sex, hh_size, and primary_06. The variable hh_size tells us the size of the respondent’s household, sex tells us the sex of the respondent, and primary_06 tells us whether or not the respondent voted in the 2006 primary election.
There are a few things to note while exploring this data set. You may, or may not, have noticed that the only response to the general_04 variable is “Yes.” In their published article, the authors note that “Only registered voters who voted in November 2004 were selected for our sample” (Gerber, Green, and Larimer 2008). After selecting the sample, the authors looked up each voter’s voting history and then sent out the mailings.
It is also important to identify the dependent variable and its meaning. In this shaming experiment, the dependent variable is primary_06, which is coded either 0 or 1 for whether or not the respondent voted in the 2006 primary election. This is the dependent variable because the authors are trying to measure the effect that the treatments have on the proportion of people who vote in the 2006 primary election.
The voting results from other years, such as 2002 and 2004, are of less interest to us and can be removed from the abbreviated data set. In addition to removing general_04, primary_02, general_02, and primary_04, we also will not be taking particular interest in no_of_names within this chapter. We do keep age, which enters the models later in the chapter.
By narrowing down the set of variables we are looking at and investigating, we will find more meaningful relationships among them. However, we have not yet discussed the most important variable of them all: treatment. The treatment variable is a factor variable with 5 levels, including the control. Since we are curious as to how sending mailings affects voter turnout, the treatment variable will tell us about the impact each type of mailing can make. Let’s start off by taking a broad look at the different treatments.
shaming %>%
  count(treatment)
## # A tibble: 5 x 2
## treatment n
## <fct> <int>
## 1 Control 191243
## 2 Civic Duty 38218
## 3 Hawthorne 38204
## 4 Self 38218
## 5 Neighbors 38201
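To put these counts in perspective, the following sketch converts them into shares of the full sample. The counts are copied by hand from the output above; the object names n_by_treatment and total are only illustrative.

```r
# Counts copied from the count(treatment) output above.
n_by_treatment <- c(
  Control      = 191243,
  `Civic Duty` = 38218,
  Hawthorne    = 38204,
  Self         = 38218,
  Neighbors    = 38201
)

# The counts should add up to the number of rows reported by glimpse().
total <- sum(n_by_treatment)

# Share of all respondents in each group, rounded to three decimals.
round(n_by_treatment / total, 3)
```

The Control group makes up a bit more than half of the respondents, and each mailing group makes up roughly one ninth.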
Four types of treatments were used in the experiment, with voters receiving one of the four types of mailing. All of the mailing treatments carried the message, “DO YOUR CIVIC DUTY - VOTE!”
The first treatment, Civic Duty, also read, “Remember your rights and responsibilities as a citizen. Remember to vote.” This message acted as a baseline for the other treatments, since it carried a message very similar to the one displayed on all the mailings.
In the second treatment, Hawthorne, households received a mailing which told the voters that they were being studied and their voting behavior would be examined through public records. This adds a small amount of social pressure to the households receiving this mailing.
In the third treatment, Self, the mailing includes the recent voting record of each member of the household, placing the word “Voted” next to their name if they did in fact vote in the 2004 election or a blank space next to the name if they did not. In this mailing, the households were also told, “we intend to mail an updated chart” with the voting record of the household members after the 2006 primary. By emphasizing the public nature of voting records, this type of mailing exerts more social pressure on voting than the Hawthorne treatment.
The fourth treatment, Neighbors, provides the household members’ voting records, as well as the voting records of those who live nearby. This mailing also told recipients, “we intend to mail an updated chart” of who voted in the 2006 election to the entire neighborhood.
For now, let’s focus on a subset of the data. We will sample just 10,000 rows because otherwise stan_glm() takes an annoyingly large amount of time to work. Nothing substantive changes.
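# Fix the seed so that the random sample is reproducible.
set.seed(9)
ch9_sham <- shaming %>%
  # Keep only the Control group and the strongest social-pressure mailing.
  filter(treatment %in% c("Control", "Neighbors")) %>%
  # Drop the three unused factor levels.
  droplevels() %>%
  # solo is TRUE for respondents who live alone.
  mutate(solo = ifelse(hh_size == 1, TRUE, FALSE)) %>%
  select(primary_06, treatment, solo, sex, age) %>%
  slice_sample(n = 10000, replace = FALSE)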
We create the variable solo, which is TRUE for voters who live alone and FALSE for those who do not. We are curious to see if the treatment effect, if any, is the same for voters who live alone as it is for those who do not. We have also focused in on only two “treatments”: Control and Neighbors. This is for the sake of simplification. We want to know if social pressure impacts voting behavior, so it makes sense to look at the treatment that provides the most social pressure.
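As a small illustration of how solo is built, the sketch below applies the same ifelse() call to the first few hh_size values shown in the glimpse() output. It also shows that the bare comparison hh_size == 1 yields the same logical vector, so either form could be used; the names solo_ifelse and solo_direct are only illustrative.

```r
# First few hh_size values, copied from the glimpse() output.
hh_size <- c(2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1, 2, 2, 1)

# The form used in the chapter.
solo_ifelse <- ifelse(hh_size == 1, TRUE, FALSE)

# The comparison already returns TRUE/FALSE on its own.
solo_direct <- hh_size == 1

# Both approaches give the same result.
identical(solo_ifelse, solo_direct)

# Number of these respondents who live alone.
sum(solo_direct)
```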
TABLE 11.1: Data summary
Name | Piped data
Number of rows | 10000
Number of columns | 5
Column type frequency: |
character | 1
factor | 1
logical | 1
numeric | 2
Group variables | None
Variable type: character
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts
treatment | 0 | 1 | FALSE | 2 | Con: 8365, Nei: 1635
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count
solo | 0 | 1 | 0.14 | FAL: 8560, TRU: 1440
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
primary_06 | 0 | 1 | 0.31 | 0.46 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃
age | 0 | 1 | 49.58 | 14.49 | 20 | 41 | 50 | 59 | 97 | ▃▇▇▂▁
Let’s focus on a few observations that may be relevant to our analysis. In the full data set, each treatment has approximately 38,000 respondents and the control group has approximately 191 thousand. In our 10,000-row sample, the control group, denoted by Con, has 8,365 respondents and the Neighbors group, denoted by Nei, has 1,635. For the logical variable solo, we see that 1,440 of the sampled respondents live alone (TRUE), while 8,560 live in households of more than one person (FALSE). It may also be important to note that the average age of the respondents is 49.58 years with a standard deviation of 14.49 years.
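The summary table reports a mean of 0.14 for solo. For a logical variable, the mean is simply the share of TRUE values, which we can check by hand from the counts in Table 11.1. This is a small sketch; solo_counts and share_solo are illustrative names.

```r
# Counts of FALSE and TRUE for solo, copied from Table 11.1.
solo_counts <- c(`FALSE` = 8560, `TRUE` = 1440)

# Share of sampled respondents who live alone.
share_solo <- solo_counts[["TRUE"]] / sum(solo_counts)
share_solo

# Rounded to two decimals, as in the skim() output.
round(share_solo, 2)
```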
To get a better sense of some respondents’ information, let’s use slice_sample() to gather a random sample of 10 observations from the data set.
ch9_sham %>%
  slice_sample(n = 10)
## # A tibble: 10 x 5
## primary_06 treatment solo sex age
## <int> <fct> <lgl> <chr> <int>
## 1 0 Control FALSE Male 52
## 2 0 Control FALSE Female 20
## 3 1 Control FALSE Female 76
## 4 0 Control FALSE Male 39
## 5 0 Neighbors FALSE Female 35
## 6 1 Control TRUE Male 84
## 7 0 Control FALSE Male 61
## 8 1 Control FALSE Male 73
## 9 1 Neighbors FALSE Male 46
## 10 0 Neighbors FALSE Male 38
Now we have a table with 10 random observations and the respondents’ information in a regular table output. By taking a few random samples, we may start to see some patterns within the data. Do you notice anything in particular about the variable treatment?
One other helpful summarizing technique we can use is skim(). To make the information it contains simpler, we will only be looking at three variables: primary_06, treatment, and sex.
shaming %>%
  select(primary_06, treatment, sex) %>%
  skim()
TABLE 11.2: Data summary
Name | Piped data
Number of rows | 344084
Number of columns | 3
Column type frequency: |
character | 1
factor | 1
numeric | 1
Group variables | None
Variable type: character
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts
treatment | 0 | 1 | FALSE | 5 | Con: 191243, Civ: 38218, Sel: 38218, Haw: 38204
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
primary_06 | 0 | 1 | 0.32 | 0.46 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃
Running the skim() command gives us a summary of the data set as a whole, as well as the types of variables and individual variable summaries. At the top, we see the number of columns and rows within the selected data set. Below this we are given a list with the different types of variables, or columns, and how often they appear within the data we are skimming. The variables are then separated by their column type, and we are given individual summaries based on the type.
Having created models with one parameter in Chapter 6 and two parameters in Chapter 7, you are now ready to make the jump to \(N\) parameters. The more parameters we include in our models, the more flexible they can become. But we must be careful of overfitting, of making models which are inaccurate because they don’t have enough data to accurately estimate those parameters. The tension between overfitting and underfitting is central to the practice of data science.
Let’s consider one of the most important virtues of data science: wisdom, the map from our question to our data. Recall that our mission here is to increase voter turnout while we are running for Governor.
To investigate this, we are given a dataset in which respondents were encouraged to vote under four treatments. This was accomplished by sending a letter, with varying degrees of social pressure, to citizens who voted in the November 2004 general election. The remainder of the respondents fall under a control group, which received no such mailings. The dataset offers a number of details about each respondent, including their age, sex, treatment type, and voting outcome.
What we truly want to know is how to make citizens vote. One immediate problem with our dataset is that, due to our study population, we are only studying people who voted in the November 2004 general election. In other words, if someone did not vote in that election, they were not included. This is a large problem, since it means we can only figure out how to encourage citizens who have already voted to vote again. Though we can’t be sure, it is reasonable to assume that it is easier to encourage citizens to vote in the next primary election if they have a history of recently voting.
Does this mean our data is unhelpful? Of course not! With four treatments (and therefore four different methods of encouraging voting), we can gain quite a bit of knowledge. Mostly, we will know the most effective way to incentivize people with a history of voting to vote again. We will also know if no method of persuasion (the control) is the best option. We will further be able to tell if certain methods of persuasion work better for certain groups of people, according to factors such as age, sex, or household size. This can help tremendously in our election.
That being said, the map from our question to our data is almost never perfect. In data science, we often have to look at our data, understand its limitations, and try our best to make inferences that help our cause.
primary_06 and (treatment + age)
Because we will be going through a series of models in this chapter, it is useful to combine the virtues of Justice and Courage. To begin, let’s model primary_06, which represents whether a citizen voted or not, against treatment and age to see if there is a connection.
model_3 <- stan_glm(data = ch9_sham,
                    formula = primary_06 ~ treatment + age,
                    refresh = 0)

print(model_3, digits = 3)
## stan_glm
## family: gaussian [identity]
## formula: primary_06 ~ treatment + age
## observations: 10000
## predictors: 3
## ------
## Median MAD_SD
## (Intercept) 0.084 0.016
## treatmentNeighbors 0.079 0.012
## age 0.004 0.000
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 0.457 0.003
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
The (Intercept) here has two key details. First, since Control is the first level of the treatment factor, the Control group is our baseline for comparison. Because the formula has no interaction term, the slope for age is the same for participants in the Control group and in the Neighbors group.
Second, remember that for this data, the (Intercept) does have a mathematical interpretation, but it does not have a practical interpretation. Why is this? Because the intercept is the predicted value when age equals zero. This is nonsensical for our purposes, as no voter can be of zero age.
Therefore, this model shows that, within the Control group, the predicted percent voting at an age of zero is 0.084 = 8.4%. How do we then calculate our percent voting in the Neighbors group? Recall that the treatmentNeighbors median is not giving a standalone figure for this group, but rather represents the offset between the Control and Neighbors groups. To find the Neighbors value, we must add the offset to the original value: 0.084 + 0.079 = 0.163 = 16.3%. Because the offset does not depend on age, the Neighbors group is predicted to vote at a rate 7.9 percentage points higher than the Control group at every age.
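To see how the three coefficients combine at a realistic age, here is a minimal sketch that plugs the printed medians into the model's linear formula. The helper predict_rate() and the choice of age 50 are hypothetical, introduced only for illustration, and the printed medians are rounded, so the results are approximate.

```r
# Posterior medians copied from the printed model_3 output (rounded).
intercept        <- 0.084
neighbors_offset <- 0.079
age_slope        <- 0.004

# Hypothetical helper: predicted voting rate for a given age and group.
predict_rate <- function(age, neighbors = FALSE) {
  intercept + neighbors_offset * as.numeric(neighbors) + age_slope * age
}

# Predicted rates for a 50-year-old in each group.
control_50   <- predict_rate(50)
neighbors_50 <- predict_rate(50, neighbors = TRUE)

c(control = control_50,
  neighbors = neighbors_50,
  gap = neighbors_50 - control_50)
```

The gap between the two groups equals the treatmentNeighbors median regardless of which age is plugged in, because the formula contains no interaction between treatment and age.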
Let’s turn to the median for age. Begin by grouping our observations by age and counting by primary_06, which gives us counts of 1 (yes) and 0 (no) for voting at each age.
age <- ch9_sham %>%
  group_by(age) %>%
  count(primary_06)

age
## # A tibble: 149 x 3
## # Groups: age [77]
## age primary_06 n
## <int> <int> <int>
## 1 20 0 119
## 2 20 1 15
## 3 21 0 120
## 4 21 1 15
## 5 22 0 150
## 6 22 1 24
## 7 23 0 119
## 8 23 1 19
## 9 24 0 102
## 10 24 1 21
## # … with 139 more rows
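Raw counts are hard to compare across ages because the number of respondents differs at each age. A rough sketch of one way to turn the counts into voting shares, using only the ten rows printed above (entered by hand), is shown here. The names age_counts and share_voting are illustrative.

```r
# The first ten rows of the output above, entered by hand.
age_counts <- data.frame(
  age        = rep(20:24, each = 2),
  primary_06 = rep(c(0, 1), times = 5),
  n          = c(119, 15, 120, 15, 150, 24, 119, 19, 102, 21)
)

# Number who voted and total respondents at each age.
voted <- with(age_counts, tapply(n * primary_06, age, sum))
total <- with(age_counts, tapply(n, age, sum))

# Share voting at each age.
share_voting <- voted / total
round(share_voting, 3)
```

Even within this narrow band of ages, the share voting is low and drifts upward, which is the pattern the plots below explore across the full age range.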
To explore this relationship visually, let’s create a graph. We are coercing primary_06 into a character variable as it more closely represents “yes” or “no” as opposed to a numeric value.
age %>%
  mutate(primary_06 = as.character(primary_06)) %>%
  ggplot(aes(x = age, y = n, color = primary_06)) +
    geom_point() +
    labs(
      title = "Relationship Between Age and Voting",
      subtitle = "In the 2006 Primary Elections",
      x = "Age",
      y = "Count"
    )
There are some interesting takeaways here.
- First, in almost every age bracket (other than above 90), the majority of participants did not vote.
- The spike between ages 40 and 60 illustrates that most participants fall in this age bracket.
- The difference between voters and non-voters narrows greatly after age 60.
Let’s now look at another graph that aims to show the same phenomenon, but also includes a regression line fit with lm. This more clearly shows the upward trend in voting as a participant’s age increases. We can also see that the highest concentrations in the “Voted” row exist from ages 45-50, whereas the highest concentrations for the “Did Not Vote” row exist in the 18-25 and 30-60 age groups. Again, we see that, for almost all ages, the participants are more likely to not vote than vote. This is illustrated by the darker concentration of dots in the “Did Not Vote” row. The slope of our regression line, however, shows a clear picture: the older you are, the more likely you are to vote.
shaming %>%
  ggplot(aes(age, primary_06)) +
    geom_jitter(alpha = 0.005, height = 0.1) +
    geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
    scale_y_continuous(breaks = c(0, 1), labels = c("Did Not Vote", "Voted")) +
    labs(title = "Age and Voting in 2006 Michigan Primary Election",
         subtitle = "Older people are more likely to vote",
         x = "Age",
         y = NULL,
         caption = "Data from Gerber, Green, and Larimer (2008)")
Note that the median for age is 0.004. Age is therefore positively associated with voting in the primary election. What does that mean? It means that, for every year that a participant’s age increases, their predicted probability of voting in the primary increases by 0.004, or 0.4 percentage points. Now, this might not seem like a huge difference. However, think of it like this: for every decade older that a participant is, their probability of voting increases by 0.04, or 4 percentage points! This makes sense considering that we just learned that older citizens are more likely to vote.
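The per-decade arithmetic can be sketched in a couple of lines. The age grid below is a hypothetical choice for illustration, and the slope is the rounded median from the printed model, so the values are approximate.

```r
# Posterior median for age, copied from the printed model_3 output (rounded).
age_slope <- 0.004

# Hypothetical grid of ages, one per decade.
ages <- seq(20, 90, by = 10)

# Change in predicted voting rate relative to a 20-year-old.
change_from_20 <- age_slope * (ages - 20)

data.frame(age = ages, change_from_20 = change_from_20)
```

Each additional decade adds the same 0.04 to the predicted rate, since the model treats the relationship between age and voting as a straight line.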
Now, let’s return to our voting difference between the Control and Neighbors groups. Let’s model the posterior probability distribution for the rates of voting.
# Each posterior draw gives one plausible value of (Intercept) and
# treatmentNeighbors. Note that (Intercept) is the Control rate at age zero.
model_3 %>%
  as_tibble() %>%
  mutate(Neighbors = `(Intercept)` + `treatmentNeighbors`) %>%
  mutate(Control = `(Intercept)`) %>%
  select(Neighbors, Control) %>%
  pivot_longer(cols = Neighbors:Control,
               names_to = "parameters",
               values_to = "percent_voting") %>%
  ggplot(aes(percent_voting, fill = parameters)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   alpha = 0.5,
                   bins = 100,
                   position = "identity") +
    labs(title = "Posterior Probability Distribution",
         subtitle = "for Control versus Neighbors voting rates",
         x = "% of group voting",
         y = "Probability") +
    # Show the x-axis as percentages to match its label.
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()
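As a rough numerical companion to the histogram, we can approximate a 95% interval for the treatment offset directly from the printed output. This sketch assumes the posterior is roughly normal, so that the median plus or minus two MAD_SD covers about 95% of the draws. It is a rule of thumb, not a substitute for computing quantiles from the posterior draws themselves.

```r
# Median and MAD_SD for treatmentNeighbors, copied from the printed output.
est <- c(median = 0.079, mad_sd = 0.012)

# Rule-of-thumb 95% interval: median plus or minus two MAD_SD.
interval <- est[["median"]] + c(-2, 2) * est[["mad_sd"]]
names(interval) <- c("lower", "upper")

interval
```

Under this approximation, the entire interval lies above zero, which is consistent with the two histograms showing little overlap.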