6  Four Parameters

Our forecasts are more uncertain than a naive use of our models would suggest.

In our haste to make progress — to get all the way through the process of building, interpreting and using models — we have given short shrift to some of the messy details of model building and evaluation. This chapter fills in those lacunae. We will also introduce models with four parameters, including parallel slopes models.

6.1 Transforming variables

It is often convenient to transform a predictor variable so that our model makes more sense.

6.1.1 Centering

Recall our model of income as a function of age.

\[ y_i = \beta_0 + \beta_1 age_i + \epsilon_i\]

We fit this using the trains data from primer.data. We will also be using a new package, broom.mixed, which allows us to tidy regression data for plotting.

Code
fit_1 <- stan_glm(formula = income ~ age, 
         data = trains, 
         refresh = 0,
         seed = 9)

print(fit_1, detail = FALSE)
            Median   MAD_SD  
(Intercept) 103927.5  23785.9
age            887.1    555.9

Auxiliary parameter(s):
      Median  MAD_SD 
sigma 74153.4  4905.9

There is nothing wrong with this model. Yet the interpretation of \(\beta_0\), the intercept in the regression, is awkward. It represents the average income for people of age zero. That is useless! There are no people of zero age in our data. And, even if there were, it would be weird to think about such people taking the commuter train into Boston and filling out our survey forms.

It is easy, however, to transform age into a variable which makes the intercept more meaningful. Consider a new variable, c_age, which is age minus the average age in the sample. Using this centered version of age does not change the predictions or residuals in the model, but it does make the intercept easier to interpret.

Code
trains_2 <- trains |> 
  mutate(c_age = age - mean(age))

fit_1_c <- stan_glm(formula = income ~ c_age, 
                    data = trains_2, 
                    refresh = 0,
                    seed = 9)

print(fit_1_c, detail = FALSE)
            Median   MAD_SD  
(Intercept) 141922.8   6829.9
c_age          915.0    559.2

Auxiliary parameter(s):
      Median  MAD_SD 
sigma 74234.5  4723.3

The intercept, 141,923, is the expected income for someone with c_age = 0, i.e., someone of an average age in the data, which is around 42.

6.1.2 Scaling

Centering — changing a variable via addition/subtraction — often makes the intercept easier to interpret. Scaling — changing a variable via multiplication/division — often makes it easier to interpret coefficients. The most common scaling method is to divide the variable by its standard deviation.

Code
trains_3 <- trains |> 
  mutate(s_age = age / sd(age))

fit_1_s <- stan_glm(formula = income ~ s_age, 
                    data = trains_3, 
                    refresh = 0,
                    seed = 9)

print(fit_1_s, detail = FALSE)
            Median   MAD_SD  
(Intercept) 103621.9  25156.0
s_age        10975.5   7115.7

Auxiliary parameter(s):
      Median  MAD_SD 
sigma 74432.7  4950.9

s_age is age scaled by its own standard deviation. A change of one unit in s_age is the same as a change of one standard deviation in age, which is about 12. The interpretation of \(\beta_1\) is now:

When comparing two people, one about 1 standard deviation worth of years older than the other, we expect the older person to earn about 11,000 dollars more.

But, because we scaled without centering, the intercept is now back to the (nonsensical) meaning of the expected income for people of age 0.

6.1.3 z-scores

The most common transformation applies both centering and scaling. The base R function scale() subtracts the mean and divides by the standard deviation. A variable so transformed is a “z-score,” meaning a variable with a mean of zero and a standard deviation of one. Using z-scores makes interpretation easier, especially when we seek to compare the importance of different predictors.

Code
trains_4 <- trains |> 
  mutate(z_age = scale(age))

fit_1_z <- stan_glm(formula = income ~ z_age, 
                    data = trains_4, 
                    refresh = 0,
                    seed = 9)

print(fit_1_z, detail = FALSE)
            Median   MAD_SD  
(Intercept) 141737.9   6877.0
z_age        11041.8   7035.7

Auxiliary parameter(s):
      Median  MAD_SD 
sigma 74306.7  5154.8

The two parameters are easy to interpret after this transformation.

The expected income of someone of average age, which is about 42 in this study, is about 142,000 dollars.

When comparing two individuals who differ in age by one standard deviation, which is about 12 years in this study, the older person is expected to earn about 11,000 dollars more than the younger.

Note that, when using z-scores, we would often phrase this comparison in terms of “sigmas.” One person is “one sigma” older than another person means that they are one standard deviation older. This is simple enough, once you get used to it, but also confusing since we are already using the word “sigma” to mean \(\sigma\), the standard deviation of \(\epsilon_i\). Alas, language is something we deal with rather than control. You will hear the same word “sigma” applied to both concepts, even in the same sentence. Determine meaning by context.
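Both of the numbers used in these interpretations, the mean and the standard deviation of age, are easy to check directly from the data. A minimal sketch (output omitted; the text above reports a mean of about 42 and a standard deviation of about 12):

Code
# Compute the sample mean and standard deviation of age in the trains data.
trains |> 
  summarize(mean_age = mean(age),
            sd_age = sd(age))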

6.1.4 Taking logs

It is often helpful to take the log of predictor variables, especially in cases in which their distribution is skewed. You should generally only take the log of variables for which all the values are strictly positive. The log of a negative number is not defined. Consider the number of registered voters (rv13) at each of the polling stations in kenya.

Code
x <- kenya |> 
  filter(rv13 > 0)

rv_p <- x |> 
  ggplot(aes(rv13)) + 
    geom_histogram(bins = 100) +
    labs(x = "Registered Voters",
         y = NULL) 

log_rv_p <- x |> 
  ggplot(aes(log(rv13))) + 
    geom_histogram(bins = 100) +
    labs(x = "Log of Registered Voters",
         y = NULL) +
    expand_limits(y = c(0, 175))

rv_p + log_rv_p +
  plot_annotation(title = 'Registered Votes In Kenya Communities',
                  subtitle = "Taking logs helps us deal with outliers")

Most experienced data scientists would use the log of rv13 rather than the raw value. Comments:

  • We do not know the “true” model. Who is to say that a model using the raw value is right or wrong?

  • Check whether or not this choice meaningfully affects the answer to your question. Much of the time, it won’t. That is, our inferences are often fairly “robust” to small changes in the model. If you get the same answer with rv13 as with log(rv13), then no one cares which you use.

  • Follow the conventions in your field. If everyone does X, then you should probably do X, unless you have a good reason not to. If you do have such a reason, explain it prominently.

  • Most professionals, when presented with data distributed like rv13, would take the log. Professionals hate (irrationally?) outliers. Any transformation which makes a distribution look more normal is generally considered a good idea.

Many of these suggestions apply to every aspect of the modeling process.

6.1.5 Adding transformed terms

Instead of simply transforming variables, we can add more terms which are transformed versions of a variable. Consider the relation of height to age in nhanes. Let’s start by dropping the missing values.

Code
no_na_nhanes <- nhanes |> 
  select(height, age) |> 
  drop_na() 

Fit and plot a simple linear model:

Code
nhanes_1 <- stan_glm(height ~ age,
                     data = no_na_nhanes,
                     refresh = 0,
                     seed = 47)

no_na_nhanes |> 
  ggplot(aes(x = age, y = height)) +
    geom_point(alpha = 0.1) +
    geom_line(aes(y = fitted(nhanes_1)), 
              color = "red") +
    labs(title = "Age and Height",
         subtitle = "Children are shorter, but a linear fit is poor",
         x = "Age",
         y = "Height (cm)",
         caption = "Data source: NHANES")

That is not a very good model, obviously.

Adding a quadratic term makes it better. (Note the need for I() in creating the squared term within the formula argument.)

Code
nhanes_2 <- stan_glm(height ~ age + I(age^2),
                     data = no_na_nhanes,
                     refresh = 0,
                     seed = 33)

no_na_nhanes |> 
  ggplot(aes(x = age, y = height)) +
    geom_point(alpha = 0.1) +
    geom_line(aes(y = fitted(nhanes_2)), 
              color = "red") +
    labs(title = "Age and Height",
         subtitle = "Quadratic fit is much better, but still poor",
         x = "Age",
         y = "Height (cm)",
         caption = "Data source: NHANES")

Still, we have not made use of our background knowledge in creating these variables. We know that people don’t get any taller after age 18 or so. Let’s create a variable which captures that break.

Code
nhanes_3 <- stan_glm(height ~ I(ifelse(age > 18, 18, age)),
                     data = no_na_nhanes,
                     refresh = 0,
                     seed = 23)

no_na_nhanes |> 
  ggplot(aes(x = age, y = height)) +
    geom_point(alpha = 0.1) +
    geom_line(aes(y = fitted(nhanes_3)), 
              color = "red") +
    labs(title = "Age and Height",
         subtitle = "Domain knowledge makes for better models",
         x = "Age",
         y = "Height (cm)",
         caption = "Data source: NHANES")

The point is that we should not take the variables we receive as given. We are the captains of our souls. We transform variables as needed.

6.1.6 Transforming the outcome variable

Transforming predictor variables is uncontroversial. It does not matter much. Change most continuous predictor variables to \(z\)-scores and you won’t go far wrong. Or keep them in their original form, and take care with your interpretations. It’s all good.

Transforming the outcome variable is a much more difficult question. Imagine that we seek to create a model which explains rv13 from the kenya tibble. Should we transform it?

  • Maybe? There are no right answers. A model with rv13 as the outcome variable is different from a model with log(rv13) as the outcome. The two are not directly comparable.

  • Much of the same advice with regard to taking logs of predictor variables applies here as well.

See Gelman, Hill, and Vehtari (2020).

6.1.7 Interpreting coefficients

When we interpret coefficients, it is important to know the difference between across unit and within unit comparisons. When we compare across unit, meaning comparing Joe and George, we are not looking at a causal relationship. Within unit discussions, where we are comparing Joe under treatment versus Joe under control, are causal. This means that within unit interpretation is only possible in causal models, where we are studying one unit under two conditions. When we talk about two potential outcomes, we are discussing the same person or unit under two conditions.

To put this in terms of the Preceptor tables, a within unit comparison is looking at one row of data: the data for Joe under control and the data for Joe under treatment. We are comparing one unit, or (in this case) one person, to itself under two conditions. An across unit comparison is looking at multiple rows of data, with a focus on differences across columns. We are looking at differences without making any causal claims. We are predicting, but we are not implying causation.

The magnitudes of the coefficients in linear models are relatively easy to understand. That is not true for logistic regressions. In that case, use the Divide-by-Four rule: Take a logistic regression coefficient (other than the constant term) and divide it by 4 to get an upper bound on the predictive difference corresponding to a unit difference in that variable. All this means is that, when evaluating whether a predictor is helpful in a logistic regression, we divide the coefficient by four.
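As a quick illustration of the arithmetic, consider a made-up logistic regression coefficient. (The value 1.2 below is hypothetical, not taken from any model in this chapter.)

Code
# A hypothetical logistic regression coefficient for some predictor.
b <- 1.2

# Divide-by-Four rule: an upper bound on the change in predicted probability
# associated with a one-unit difference in that predictor.
b / 4

A coefficient of 1.2 therefore corresponds to, at most, about a 0.3 change in predicted probability, i.e., roughly 30 percentage points.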

6.2 Selecting variables

How do we decide which variables to include in a model? There is no one right answer to this question.

6.2.1 General guidelines for selecting variables

When deciding which variables to keep or discard in our models, our advice is to keep a variable X if any of the following circumstances apply:

  • The variable has a large and well-estimated coefficient. This means, roughly, that the 95% confidence interval excludes zero. “Large” can only be defined in the context of the specific model. Speaking roughly, removing a variable with a large coefficient meaningfully changes the predictions of the model.

  • Underlying theory/observation suggests that X has a meaningful connection to the outcome variable.

  • If the variable has a small standard error relative to the size of the coefficient, it is general practice to include it in our model to improve predictions. The rule of thumb is to keep variables for which the estimated coefficient is more than two standard errors away from zero. Some of these variables won’t “matter” much to the model. That is, their coefficients, although well-estimated, are small enough that removing the variable from the model will not affect the model’s predictions very much.

  • If the standard error is large relative to the coefficient, i.e., if the magnitude of the coefficient is less than two standard errors from zero, and we find no other reason to include it in our model, we should probably remove the variable from our model.

  • The exception to this rule is if the variable is relevant to answering a question which we have. For example, if we want to know if the ending attitude toward immigration differs between men and women, we need to include gender in the model, even if its coefficient is small and closer to zero than two standard errors.

  • It is standard in your field to include X in such regressions.

  • Your boss/client/reviewer/supervisor wants to include X.

Let’s use the trains dataset to evaluate how helpful certain variables are to creating an effective model, based on the guidelines above.

6.2.2 Variables in the trains dataset

To look at our recommendations in practice, let’s focus on the trains dataset. The variables in trains include gender, liberal, party, age, income, att_start, treatment, and att_end. Which variables would be best to include in a model?

6.2.3 att_end ~ treatment + att_start

First, let’s look at a model with one left-hand side variable, att_end, and two right-hand side variables, treatment and att_start.

Code
fit_1_model <- stan_glm(att_end ~ treatment + att_start, 
                    data = trains, 
                    refresh = 0)

fit_1_model
stan_glm
 family:       gaussian [identity]
 formula:      att_end ~ treatment + att_start
 observations: 115
 predictors:   3
------
                 Median MAD_SD
(Intercept)       2.3    0.4  
treatmentControl -0.9    0.3  
att_start         0.8    0.0  

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.3    0.1   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

How do we decide which variables are useful? First, let’s interpret our coefficients.

  • The variable before the tilde, att_end, is our outcome.

  • The explanatory variables are treatment, which says whether a commuter received the treatment or control condition, and att_start, which measures attitude at the start of the study.

  • The (Intercept), about 2.3, is the estimated att_end for commuters who received the treatment (the baseline level of treatment) and whose att_start is zero. Its 95% confidence interval is roughly the coefficient plus or minus two standard errors.

  • The variable treatmentControl represents the offset in att_end from the estimate for our (Intercept). This offset is for the group of people that were in the Control group. To find the estimated att_end for those in this group, you must add the median for treatmentControl to the (Intercept) median.

  • The coefficient of att_start measures the expected difference in att_end associated with a one-unit increase in att_start.

The causal effect associated with treatmentControl is about -0.9. This means that, compared with the predicted att_end for commuters under treatment, those in the control group have a predicted attitude that is roughly one point lower. As we can see, this is a large and well-estimated coefficient. Recall that this means, roughly, that the 95% confidence interval excludes zero. To calculate the 95% confidence interval, we take the coefficient plus or minus two standard errors: roughly -1.5 to -0.3. As we can see, the 95% confidence interval does exclude zero, suggesting that treatment is a worthy variable.

In addition to being meaningful, which is enough to justify inclusion in our model, this variable satisfies a number of other qualifications:

  • The variable has a small standard error relative to the size of its coefficient.

  • The variable is an indicator variable which separates two groups of interest (treatment and control) that we would like to study.

The variable att_start, with a coefficient of about 0.8, is also meaningful. Because the MAD_SD value here rounds to zero, we do not need to compute the 95% confidence interval to know, intuitively, that this variable has a large and well-estimated coefficient.

Conclusion: keep both variables! treatment and att_start are both significant, as well as satisfying other requirements in our guidelines. They are more than worthy of inclusion in our model.
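If we prefer not to do the plus-or-minus-two-standard-errors arithmetic by hand, rstanarm can report intervals directly from the posterior. A minimal sketch (output not shown; the intervals will be close to, though not identical to, the rough calculations above):

Code
# 95% posterior intervals for the parameters of fit_1_model.
posterior_interval(fit_1_model, prob = 0.95)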

6.2.4 income ~ age + liberal

Now, we will look at income as a function of age and liberal, a proxy for political party.

Code
fit_2 <- stan_glm(income ~ age + liberal, 
                    data = trains, 
                    refresh = 0)

fit_2
stan_glm
 family:       gaussian [identity]
 formula:      income ~ age + liberal
 observations: 115
 predictors:   3
------
            Median   MAD_SD  
(Intercept) 110298.3  23690.0
age           1079.4    534.6
liberalTRUE -32490.6  13481.3

Auxiliary parameter(s):
      Median  MAD_SD 
sigma 72547.0  4894.7

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Great! We have an estimate for the income of those who fall into the category of liberalFALSE, as well as estimates for our right-hand side variables, age and liberalTRUE.

First, let’s interpret our coefficients.

  • The variable before the tilde, income, is our outcome.

  • The explanatory variables are liberal, which has a resulting value of TRUE or FALSE, and age, a numeric variable.

  • The (Intercept) estimates income where liberal == FALSE and age = 0. Therefore, it is the estimated income for commuters who are not liberal and who are of age zero. The estimate for age shows the expected difference in income associated with each additional year of age.

  • The estimate for liberalTRUE represents the offset in predicted income for commuters who are liberal. To find the estimate, we must add the coefficient to our (Intercept) value. We see that, on average, liberal commuters make less money.

It is important to note that we are not looking at a causal relationship for either of these explanatory variables. We are noting the differences between two groups, without considering causality. This is known as an across unit comparison. When we compare across unit we are not looking at a causal relationship.

When comparing liberalTRUE with our (Intercept), recall that the (Intercept) is calculating income for the case where liberal == FALSE. As we can see, then, the coefficient for liberalTRUE, -32,491, shows that the liberals in our dataset make less on average than non-liberals in our dataset. The coefficient is large relative to the (Intercept) and, with rough mental math, we see that the 95% confidence interval excludes 0. Therefore, liberal is a helpful variable!

The variable age, however, appears less useful. Its coefficient is small relative to the scale of income, and its 95% confidence interval only barely excludes zero.

Conclusion: definitely keep liberal! age is less clear. It is really a matter of preference at this point.

6.2.5 liberal ~ I(income/1e+05) + party

We will now look at liberal as a function of a transformed income and party. The reason we transform income here is that, without a transformation, the coefficient would correspond to a one dollar difference in income, a quantity far too small to be useful. To make the coefficient readable, we divide the variable by \(100000\), so that the coefficient corresponds to a difference of one hundred thousand dollars. (Note the need for I() in creating the income term within the formula argument.)

Code
fit_3 <- stan_glm(liberal ~ I(income/1e+05) + party, 
                    data = trains, 
                    refresh = 0)

fit_3
stan_glm
 family:       gaussian [identity]
 formula:      liberal ~ I(income/1e+05) + party
 observations: 115
 predictors:   3
------
                Median MAD_SD
(Intercept)      0.7    0.1  
I(income/1e+05) -0.1    0.1  
partyRepublican -0.5    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.5    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Recall that, for logistic models, the (Intercept) is nearly impossible to interpret. To evaluate the impact of variables, we will need to use the Divide-by-Four rule. This instructs us to take a logistic regression coefficient (other than the constant term) and divide it by 4 to get an upper bound on the predictive difference corresponding to a unit difference in that variable. All this means is that, when evaluating whether a predictor is helpful in a logistic regression, we divide the coefficient by four.

For partyRepublican, we take our coefficient, -0.5, and divide it by 4: -0.125. This is the upper bound on the predictive difference corresponding to a unit difference in this variable. Therefore, we see that partyRepublican is meaningful. We want this variable in our model.

The variable income, however, seems less promising. In addition to the resulting value from the Divide-by-Four rule being low (-0.023), the MAD_SD for income is also equal to the posterior median itself, indicating a standard error which is large relative to the magnitude of the coefficient. Though we may choose to include income for reasons unrelated to its impact in our model, it does not appear to be worthy of inclusion on the basis of predictive value alone.

Conclusion: the variable party is significant; we will include it in our model. Unless we have a strong reason to include income, we should exclude it.

6.2.6 Final thoughts

Now that we have looked at three cases of variables and decided whether they should be included, let’s discuss the concept of selecting variables generally.

The variables we decided to keep and discard are not necessarily the variables you would keep or discard, or the variables that any other data scientist would keep or discard. It is much easier to keep a variable than it is to build a case for discarding a variable. Variables are helpful even when not significant, particularly when they are indicator variables which may separate the data into two groups that we want to study.

The process of selecting variables – though we have guidelines – is complicated. There are many reasons to include a variable in a model. The main reasons to exclude a variable are that it is not significant or that its standard error is large relative to its coefficient.

6.3 Comparing models in theory

Deciding which variables to include in a model is a subset of the larger question: How do we decide which model, out of the set of possible models, to choose?

Consider two models which explain attitudes to immigration among Boston commuters.

Code
fit_liberal <- stan_glm(formula = att_end ~ liberal,
                  data = trains,
                  refresh = 0,
                  seed = 42)

print(fit_liberal, detail = FALSE)
            Median MAD_SD
(Intercept) 10.0    0.3  
liberalTRUE -2.0    0.5  

Auxiliary parameter(s):
      Median MAD_SD
sigma 2.7    0.2   
Code
fit_att_start <- stan_glm(formula = att_end ~ att_start,
                  data = trains,
                  refresh = 0,
                  seed = 85)

print(fit_att_start, detail = FALSE)
            Median MAD_SD
(Intercept) 1.6    0.4   
att_start   0.8    0.0   

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.4    0.1   

They both seem like good models! The results make sense. People who are liberal have more liberal attitudes about immigration, so we would expect their att_end scores to be lower. We would also expect people to provide similar answers in two surveys administered a week or two apart. It makes sense that those with higher (more conservative) values for att_start would also have higher values for att_end.

How do we choose between these models?

6.3.1 Better models make better predictions

The most obvious criterion for comparing models is the accuracy of their predictions. For example, consider the use of liberal to predict att_end.

Code
trains |> 
  mutate(pred_liberal = fitted(fit_liberal)) |> 
  ggplot(aes(x = pred_liberal, y = att_end)) +
    geom_jitter(width = 0.05, height = 0.2, alpha = 0.5) +
  
  # Add red circles where our predictions are most accurate (where the x and y
  # values are the same, which is where our predictions equal the true
  # attitudes). pch = 1 makes the circles hollow so we can see how many points
  # fall inside them.
  
  geom_point(aes(x = 8, y = 8), 
             size = 20, pch = 1, 
             color = "red") +
  geom_point(aes(x = 10, y = 10), 
             size = 20, pch = 1, 
             color = "red") +
    labs(title = "Modeling Attitude Toward Immigration",
         subtitle = "Liberals are less conservative",
         x = "Predicted Attitude",
         y = "True Attitude")

Because there are only two possible values for liberal — TRUE and FALSE — there are only two predictions which this model will make: about 10 for liberal == FALSE and about 8 for liberal == TRUE. (The points in the above plot are jittered.) For some individuals, these are perfect predictions. For others, they are poor predictions. The red circles on our plot highlight the regions where our predictions equal the true attitudes. As we can see, the model is not great at predicting att_end. (Note the two individuals who are liberal == TRUE, and who the model thinks will have att_end == 8, but who have att_end == 15. The model got them both very, very wrong.)

Consider our second model, using att_start to forecast att_end.

Code
trains |> 
  mutate(pred_att_start = fitted(fit_att_start)) |> 
  ggplot(aes(x = pred_att_start, y = att_end)) +
    geom_jitter(width = 0.05, height = 0.2, alpha = 0.5) +
  
  # Insert red line where our predictions = the truth using geom_abline with an
  # intercept, slope, and color.
  
  geom_abline(intercept = 0, slope = 1, color = "red") +
    labs(title = "Modeling Attitude Toward Immigration",
         subtitle = "Survey responses are somewhat consistent",
         x = "Predicted Attitude",
         y = "True Attitude")

Because att_start takes on 13 unique values, the model makes 13 unique predictions. Some of those predictions are perfect! But others are very wrong. The red line shows the points where our predictions match the truth. Note the individual with a predicted att_end of around 9 but with an actual value of 15. That is a big miss!

Rather than looking at individual cases, we need to look at the errors for all the predictions. Fortunately, a prediction error is the same thing as a residual, which is easy enough to calculate.

Code
trains |> 
  select(att_end, att_start, liberal) |> 
  mutate(pred_lib = fitted(fit_liberal)) |> 
  mutate(resid_lib = fitted(fit_liberal) - att_end) |> 
  mutate(pred_as = fitted(fit_att_start)) |> 
  mutate(resid_as = fitted(fit_att_start) - att_end)
# A tibble: 115 × 7
   att_end att_start liberal pred_lib resid_lib pred_as resid_as
     <dbl>     <dbl> <lgl>      <dbl>     <dbl>   <dbl>    <dbl>
 1      11        11 FALSE      10.0    -0.974    10.6    -0.399
 2      10         9 FALSE      10.0     0.0259    8.97   -1.03 
 3       5         3 TRUE        8.02    3.02      4.08   -0.916
 4      11        11 FALSE      10.0    -0.974    10.6    -0.399
 5       5         8 TRUE        8.02    3.02      8.16    3.16 
 6      13        13 FALSE      10.0    -2.97     12.2    -0.769
 7      13        13 FALSE      10.0    -2.97     12.2    -0.769
 8      11        10 FALSE      10.0    -0.974     9.79   -1.21 
 9      12        12 FALSE      10.0    -1.97     11.4    -0.584
10      10         9 FALSE      10.0     0.0259    8.97   -1.03 
# ℹ 105 more rows

Let’s look at the square root of the average squared error.

Code
trains |> 
  select(att_end, att_start, liberal) |> 
  mutate(lib_err = (fitted(fit_liberal) - att_end)^2) |> 
  mutate(as_err = (fitted(fit_att_start) - att_end)^2) |> 
  summarize(lib_sigma = sqrt(mean(lib_err)),
            as_sigma = sqrt(mean(as_err))) 
# A tibble: 1 × 2
  lib_sigma as_sigma
      <dbl>    <dbl>
1      2.68     1.35

There are many different measures of the error which we might calculate. The squared difference is the most common for historical reasons: it was the most mathematically tractable in the pre-computer age. Having calculated a squared difference for each observation, we can sum them, take their average, or take the square root of their average. All produce the same relative ranking, but the last is the most popular because it (more or less) corresponds to the estimated \(\sigma\) for a linear model. Note how these measures are essentially the same as the sigma values produced by the Bayesian models created above.

Sadly, it is not wise to simply select the model which fits the data best because doing so can be misleading. After all, you are cheating! You are using that very data to select your parameters and then, after using the data once, turning around and “checking” to see how well your model fits the data. It had better fit! You used it to pick your parameters! This is the danger of overfitting.

6.3.2 Beware overfitting

One of the biggest dangers in data science is overfitting, using a model with too many parameters which fits the data we have too well and, therefore, works poorly on data we have yet to see. Consider a simple example with 10 data points.

What happens when we fit a model with one predictor?

That is a reasonable model. It does not fit the data particularly well, but we certainly believe that higher values of x are associated with higher values of y. A linear fit is not unreasonable.

But we can also use some of the lessons from above and try a quadratic fit by adding \(x^2\) as a predictor.

Is this a better model? Maybe?
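The two fits discussed above can be reproduced with lm(), just as the nine-degree fit below is. A sketch, assuming the same ovrftng data used in the next code block:

Code
# A straight line and a quadratic fit to the same ten points used below.
one_pred <- lm(y ~ x, data = ovrftng)
two_pred <- lm(y ~ poly(x, 2), data = ovrftng)

ovrftng |> 
  ggplot(aes(x, y)) +
    geom_point() +
    geom_line(aes(y = fitted(one_pred)), color = "red") +
    geom_line(aes(y = fitted(two_pred)), color = "blue") +
    labs(title = "Linear (red) and Quadratic (blue) Fits to the Same Data")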

But why stop at adding \(x^2\) to the regression? Why not add \(x^3\), \(x^4\) and all the way to \(x^9\)? When we do so, the fit is much better.

Code
nine_pred <- lm(y ~ poly(x, 9),
                       data = ovrftng)

newdata <- tibble(x = seq(1, 10, by = 0.01),
                  y = predict(nine_pred, 
                              newdata = tibble(x = x)))

ovrftng |> 
  ggplot(aes(x, y)) +
    geom_point() +
    geom_line(data = newdata, 
              aes(x, y)) +
    labs(title = "`y` as a 9-Degree Polynomial Function of `x`") +
    scale_x_continuous(breaks = seq(2, 10, 2)) +
    scale_y_continuous(breaks = seq(2, 10, 2)) 

If the only criterion we cared about were how well the model predicts using the data on which the parameters were estimated, then a model with more parameters would always be better. But that is not what truly matters. What matters is how well the model works on data which was not used to create the model.

6.3.3 Better models make better predictions on new data

The most sensible way to test a model is to use the model to make predictions and compare those predictions to new data. After fitting the model using stan_glm, we would use posterior_predict to obtain simulations representing the predictive distribution for new cases. For instance, if we were predicting how attitudes toward immigration among Boston commuters vary with political affiliation, we would want to go out and test our theories on new Boston commuters.
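For example, here is a minimal sketch of generating such simulations for a single hypothetical new commuter; the newobs tibble is made up for illustration:

Code
# One hypothetical new commuter. The column must match the predictor used
# in fit_liberal.
newobs <- tibble(liberal = TRUE)

# Each row of the result is one draw from the posterior predictive
# distribution of att_end for this new commuter.
pp <- posterior_predict(fit_liberal, newdata = newobs)

head(pp)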

When thinking of generalization to new data, it is important to consider what counts as relevant new data in the context of the modeling problem. Some models are used to predict the future and, in those cases, we can wait, eventually observe the future, and check how good our model is for making predictions. Often, though, models are used to gain insight into some phenomenon without any immediate plan for prediction. This is the case with our Boston commuters example. In such cases, we are also interested in whether insights learned from one part of the data generalize to other parts of the data. For example, if we know how political attitudes informed future immigration stances in Boston commuters, we may want to know if those same conclusions generalize to train commuters in different locations.

Even if we had detected clear problems with our predictions, this would not necessarily mean that there is anything wrong with the model as fit to the original dataset. However, we would need to understand it further before generalizing to other commuters.

Often, we would like to evaluate and compare models without waiting for new data. One can simply evaluate predictions on the observed data. But since these data have already been used to fit the model parameters, these predictions are optimistic.

In cross validation, part of the data is used to fit the model and the rest of the data—the hold-out set—is used as a proxy for future data. When there is no natural prediction task for future data, we can think of cross validation as a way to assess generalization from one part of the data to another part.

In any form of cross validation, the model is re-fit leaving out one part of the data and then the prediction for the held-out part is evaluated. In the next section, we will look at a type of cross validation called leave-one-out (LOO) cross validation.

6.4 Comparing models in practice

To compare models without waiting for new data, we evaluate predictions on the observed data. However, due to the fact that the data has been used to fit the model parameters, our predictions are often optimistic when assessing generalization.

In cross validation, part of the data is used to fit the model, while the rest of the data is used as a proxy for future data. We can think of cross validation as a way to assess generalization from one part of the data to another part. How do we do this?

We can hold out individual observations, called leave-one-out (LOO) cross validation; or groups of observations, called leave-one-group-out cross validation; or use past data to predict future observations, called leave-future-out cross validation. When we perform cross validation, the model is re-fit leaving out one part of the data and then the prediction for the held-out part is evaluated.

For our purposes, we will be performing cross validation using leave-one-out (LOO) cross validation.

6.4.1 Cross validation using loo()

To compare models using leave-one-out (LOO) cross validation, one piece of data is excluded from our model. The model is then re-fit and makes a prediction for the missing piece of data. The difference between the predicted value and the real value is calculated. This process is repeated for every row of data in the dataset.

In essence: One piece of data is excluded from our model, the model is re-fit, the model attempts to predict the value of the missing piece, we compare the true value to the predicted value, and we assess the accuracy of our model’s prediction. This process occurs for each piece of data, allowing us to assess the model’s accuracy in making predictions.
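A conceptual sketch of that loop appears below. It is slow, it re-fits the model once for every row, and it summarizes each held-out prediction with its error rather than with the log score that loo() actually reports, so it is for intuition only:

Code
# Conceptual (and very slow) leave-one-out loop for the model att_end ~ liberal.
# In practice, loo() approximates this idea without re-fitting the model.
errors <- vector("double", nrow(trains))

for (i in seq_len(nrow(trains))) {
  
  # Re-fit the model without row i.
  
  fit_i <- stan_glm(att_end ~ liberal,
                    data = trains[-i, ],
                    refresh = 0)
  
  # Predict the held-out row and record the error.
  
  pred_i <- mean(posterior_epred(fit_i, newdata = trains[i, ]))
  errors[i] <- trains$att_end[i] - pred_i
}

sqrt(mean(errors^2))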

To perform leave-one-out (LOO) cross validation, we will use the loo() function, which comes from the loo package and works directly with models fit by rstanarm. This is how we will determine which model is superior for our purposes.

First, we will refamiliarize ourselves with our first model, fit_liberal.

Code
fit_liberal
stan_glm
 family:       gaussian [identity]
 formula:      att_end ~ liberal
 observations: 115
 predictors:   2
------
            Median MAD_SD
(Intercept) 10.0    0.3  
liberalTRUE -2.0    0.5  

Auxiliary parameter(s):
      Median MAD_SD
sigma 2.7    0.2   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Now, we will perform loo() on our model and look at the results.

Code
loo_liberal <- loo(fit_liberal)

loo_liberal

Computed from 4000 by 115 log-likelihood matrix

         Estimate   SE
elpd_loo   -279.4  7.5
p_loo         2.9  0.5
looic       558.9 15.0
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

What does any of this mean?

  • elpd_loo is the estimated log score along with a standard error representing uncertainty due to using only 115 data points.
  • p_loo is the estimated “effective number of parameters” in the model.
  • looic is the LOO information criterion, −2 elpd_loo, which we compute for comparability to deviance.

For our purposes, we mostly need to focus on elpd_loo. Let’s explain, in more depth, what this information means.

Basically, when we run loo(), we are asking: if we removed one observation from our dataset, re-estimated all the parameters, and then predicted the value of that missing observation, how close would the prediction be to the truth? (In practice, loo() uses a fast approximation rather than literally re-fitting the model once per observation.) The value of elpd_loo is based on how close those predictions are to the truth. Therefore, elpd_loo tells us how effective our model is at predicting data it has not seen before.

The higher our value for elpd_loo, the better our model performs. This means that, when comparing models, we want to select the model with the higher value for elpd_loo.

Let’s turn our attention to our second model. To begin, let’s observe the qualities of fit_att_start once again.

Code
fit_att_start
stan_glm
 family:       gaussian [identity]
 formula:      att_end ~ att_start
 observations: 115
 predictors:   2
------
            Median MAD_SD
(Intercept) 1.6    0.4   
att_start   0.8    0.0   

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.4    0.1   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Great! Now, let’s perform loo() on this model.

Code
loo_att_start <- loo(fit_att_start) 

loo_att_start

Computed from 4000 by 115 log-likelihood matrix

         Estimate   SE
elpd_loo   -201.7 12.5
p_loo         4.2  1.8
looic       403.4 25.1
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

The elpd_loo value for this model is -201.7. This is higher than the elpd_loo for fit_liberal, implying that this model is superior. However, we can’t see our estimates together. Is there a simpler way to determine which model is better?

Actually, yes! Using the function loo_compare(), we can compare the models directly.

6.4.2 Comparing models using loo_compare()

To compare the two models directly, we can use the function loo_compare with our two loo objects created above. This will calculate the difference in elpd_loo() between our models for us, making our job easier:

Code
loo_compare(loo_att_start, loo_liberal)
              elpd_diff se_diff
fit_att_start   0.0       0.0  
fit_liberal   -77.7      12.0  

The value of elpd_diff is the difference in elpd_loo between our two models. The se_diff column shows the standard error of that difference.

To interpret the results directly, it is important to note that the first row is the superior model. Its values of elpd_diff and se_diff are 0, as these columns show the offset in the estimates compared to the better model. To reiterate: when the better model is compared to itself, those values will be 0. The following row shows the offsets in elpd and its standard error for the less effective model, fit_liberal, relative to the more effective model, fit_att_start.

The better model is clear: fit_att_start. Therefore, attitude at the start of the trains study is more useful for predicting final attitude than the variable liberal, our proxy for political affiliation.

As we have seen, loo_compare is a shortcut for comparing two models. When you are deciding between two models, loo_compare() is a great way to simplify your decision.

What do we do when the difference reported by loo_compare() is small? As a general practice, differences smaller than four are hard to distinguish from noise. In other words: when the magnitude of elpd_diff is less than 4, we cannot confidently say that one model is better than the other.

6.5 Testing is nonsense

As always, it is important to look at the practices of other professionals and the reasons we may choose not to follow those practices. For instance, consider our continued problem with hypothesis testing. In hypothesis testing, we assert a null hypothesis \(H_0\) about our data and an alternative hypothesis \(H_a\).

When performing hypothesis testing, we either reject the null hypothesis or we do not reject it. We reject if the 95% confidence interval excludes the value specified by the null hypothesis. If that value is included in our 95% confidence interval, we do not reject. In the case of “insignificant” results, with p > 0.05, we also can’t “reject” the null hypothesis. However, this does not mean that we accept it.

The premise of hypothesis testing is to answer a specific question – one that may not even be particularly relevant to our understanding of the world – about our data. So, what are our problems with hypothesis testing?

  • Rejecting or not rejecting hypotheses doesn’t help us to answer real questions.

  • The fact that a difference is not “significant” has no relevance to how we use the posterior to make decisions.

  • Statistical significance is not equal to practical importance.

  • There is no reason to test when you can summarize by providing the full posterior probability distribution.

6.6 Parallel lines

To conclude this chapter, we will look at a four parameter model. The models we consider will measure att_end as a function of liberal, att_start, treatment, or some combination of these variables.

6.6.1 Wisdom

Wisdom

Before making a model which seeks to explain att_end, we should plot it and the variables we think are connected to it:

Code
ggplot(trains, aes(x = att_start, y = att_end, color = liberal)) +
  geom_point() +
  labs(title = "Attitude End Compared with Attitude Start and Liberal",
       x = "Attitude at Start of Study",
       y = "Attitude at End of Study",
       color = "Liberal?")

Is that data believable? Maybe? One could imagine that att_end would be predicted fairly well by att_start. This makes sense for most of our data points, which show not much difference between the attitudes. But what about the great disparities in attitude shown for the individual with a starting attitude of 9 and an ending attitude around 15? In a real data science project, this would require further investigation. For now, we ignore the issue and blithely press on.

Another component of Wisdom is the population. The concept of the “population” is subtle and important. The population is not the set of commuters for which we have data. That is the dataset. The population is the larger — potentially much larger — set of individuals about whom we want to make inferences. The parameters in our models refer to the population, not to the dataset.

There are many different populations, each with its own \(\mu\), in which we might be interested. For instance:

  • The population of Boston commuters on the specific train and time included in our dataset.

  • The population of all Boston commuters.

  • The population of commuters in the United States.

All of these populations are different, so each has a different \(\mu\). Which \(\mu\) we are interested in depends on the problem we are trying to solve. It is a judgment call, a matter of Wisdom, as to whether or not that data we have is “close enough” to the population we are interested in to justify making a model.

The major part of Wisdom is deciding what questions you can’t answer because of the data you don’t have.

6.6.2 Justice

Justice

Now that we have considered the connection between our data and the predictions we seek to make, we will need to consider our model.

First: is our model causal or predictive? Recall that our model measures att_end as a function of liberal and att_start. The variables liberal and att_start do not involve a control or treatment dynamic. There is no manipulation with these variables. Given that there is no causation without manipulation, this is a predictive model.

We are making inferences about groups of people according to their political affiliation and starting attitude. We are not measuring causality, but we are predicting outcomes.

When creating a parallel slopes model, we use the basic equation of a line:

\[y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i}\]

If \(y = att\_end\), \(x_1 = att\_start\), and \(x_2 = liberal\), then the equations are as follows:

If liberal = FALSE:

\[y_i = \beta_0 + \beta_1 x_{1,i}\]

Which equates to, in y = b + mx form:

\[y_i = intercept + \beta_1 att\_start_i\]

If liberal = TRUE:

\[y_i = (\beta_0 + \beta_2) + \beta_1 x_{1,i}\]

Which equates to, in y = b + mx form:

\[y_i = (intercept + liberal\_true) + \beta_1 att\_start_i\]

6.6.3 Courage

Courage

The use of stan_glm is the same as usual. Using stan_glm, we will create our model, fit_1.

Code
fit_1 <- stan_glm(formula = att_end ~ liberal + att_start,
                  data = trains,
                  refresh = 0,
                  seed = 42)

fit_1
stan_glm
 family:       gaussian [identity]
 formula:      att_end ~ liberal + att_start
 observations: 115
 predictors:   3
------
            Median MAD_SD
(Intercept)  1.9    0.5  
liberalTRUE -0.3    0.3  
att_start    0.8    0.0  

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.4    0.1   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

To remind ourselves, recall that the (Intercept) here represents the expected att_end for cases where liberal = FALSE and att_start = 0. The next row, liberalTRUE, gives a median value which represents the offset in the prediction compared with the (Intercept). In other words, the intercept for cases where liberal = TRUE is \((Intercept) + liberalTRUE\). The coefficient of att_start is the slope, which is the same for both groups.

To find our intercepts, we will need to tidy our regression using the tidy() function from the broom.mixed package. We tidy our data and extract values to create a parallel slopes model. The parallel slopes model allows us to visualize multi-variate Bayesian modeling (i.e. modeling with more than one explanatory variable). That is a complicated way of saying that we will visualize the fitted model created above in a way that allows us to see the intercepts and slopes for two different groups, liberalTRUE and liberalFALSE:

Code
# First, we will tidy the data from our model and select the term and estimate.
# This allows us to create our regression lines more easily.

tidy <- fit_1 |> 
  tidy() |> 
  select(term, estimate)

tidy
# A tibble: 3 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    1.93 
2 liberalTRUE   -0.300
3 att_start      0.799
Code
# Extract and name the columns of our tidy object. By calling tidy$estimate[1],
# we are telling R to extract the first value from the estimate column in our
# tidy object.

intercept <- tidy$estimate[1]
liberal_true <- tidy$estimate[2]
att_start <- tidy$estimate[3]

Now, we can define the following terms: liberal_false_intercept and liberal_false_att_slope; and liberal_true_intercept and liberal_true_att_slope:

Code
# Recall that the (Intercept) shows us the estimate for the case where liberal =
# FALSE. We want to extract the liberal_false_intercept to indicate where the
# intercept in our visualization should be. The slope for this case, and for the
# liberal = TRUE case, is att_start.

liberal_false_intercept <- intercept
liberal_false_att_slope <- att_start

#  When wanting the intercept for liberal = TRUE, recall that the estimate for
#  liberalTRUE is the offset from our (Intercept). Therefore, to know the true
#  intercept, we must add liberal_true to our intercept.

liberal_true_intercept <- intercept + liberal_true
liberal_true_att_slope <- att_start

All we’ve done here is extracted the values for our intercepts and slopes, and named them to be separated into two groups. This allows us to create a geom_abline object that takes a unique slope and intercept value, so we can separate the liberalTRUE and liberalFALSE observations.

Code
# From the dataset trains, use att_start for the x-axis and att_end for
# the y-axis with color as liberal. This will split our data into two color
# coordinates (one for liberal = TRUE and one for liberal = FALSE)

ggplot(trains, aes(x = att_start, y = att_end, color = liberal)) +
  
  # Use geom_point to show the datapoints. 
  
  geom_point() +
  
  # Create a geom_abline object for the liberal false values. Set the intercept
  # equal to our previously created liberal_false_intercept, while setting slope
  # equal to our previously created liberal_false_att_slope. The color call is
  # for coral, to match the colors used by tidyverse for geom_point().
  
  geom_abline(intercept = liberal_false_intercept,
              slope = liberal_false_att_slope, 
              color = "#F8766D", 
              linewidth = 1) +
  
  # Create a geom_abline object for the liberal TRUE values. Set the intercept
  # equal to our previously created liberal_true_intercept, while setting slope
  # equal to our previously created liberal_true_att_slope. The color call is
  # for teal, to match the colors used by tidyverse for geom_point().

  geom_abline(intercept = liberal_true_intercept,
              slope = liberal_true_att_slope,
              color = "#00BFC4", 
              linewidth = 1) +
  
  # Add the appropriate titles and axis labels. 
  
  labs(title = "Parallel Slopes Model",
       x = "Attitude at Start", 
       y = "Attitude at End", 
       color = "Liberal") 

This is our parallel slopes model. What we have done, essentially, is created a unique line for liberalTRUE and liberalFALSE to observe the differences in the groups as related to attitude start and attitude end.

As we can see, commuters who are not liberal tend to start with higher values for att_start than commuters who are liberal, and their fitted line sits slightly higher as well.
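As an aside, a more compact way to draw the same two parallel lines, without extracting the intercepts and slopes by hand, is to plot the fitted values from fit_1 directly. A sketch:

Code
# Plot the fitted values from fit_1, grouped and colored by liberal. Because
# the model has no interaction term, the two fitted lines are parallel.
trains |> 
  mutate(fitted_att = fitted(fit_1)) |> 
  ggplot(aes(x = att_start, color = liberal)) +
    geom_point(aes(y = att_end)) +
    geom_line(aes(y = fitted_att)) +
    labs(title = "Parallel Slopes Model",
         x = "Attitude at Start",
         y = "Attitude at End",
         color = "Liberal")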

Now, what if we want to look at another model, to judge whether we can build a superior one? Let’s create another object, using stan_glm, that looks at att_end as a function of treatment and att_start.

Code
fit_2 <- stan_glm(formula = att_end ~ treatment + att_start,
                  data = trains,
                  refresh = 0,
                  seed = 56)

fit_2
stan_glm
 family:       gaussian [identity]
 formula:      att_end ~ treatment + att_start
 observations: 115
 predictors:   3
------
                 Median MAD_SD
(Intercept)       2.4    0.4  
treatmentControl -1.0    0.2  
att_start         0.8    0.0  

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.3    0.1   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

To interpret briefly:

  • (Intercept) here represents the att_end value for cases where treatment does not equal Control and att_start = 0.

  • treatmentControl gives a median value which represents the offset in the prediction compared with the (Intercept). In other words, the true intercept for cases where treatment = Control is represented by \((Intercept) + treatmentControl\).

  • att_start represents the slope for both groups: the expected change in att_end associated with a one-unit change in att_start.

An important point here is that this model is causal. By including the variable treatment, we can measure the causal effect of a condition. This is different from our prior parallel slopes model, where we were modeling for prediction, not causation.

To see how these models compare in performance, we will perform leave-one-out (LOO) cross validation again.

Code
L1 <- loo(fit_1)

L1

Computed from 4000 by 115 log-likelihood matrix

         Estimate   SE
elpd_loo   -202.3 13.1
p_loo         5.5  2.3
looic       404.6 26.1
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     114   99.1%   2531      
 (0.5, 0.7]   (ok)         1    0.9%   157       
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Perform loo() on our second model:

Code
L2 <- loo(fit_2)

L2

Computed from 4000 by 115 log-likelihood matrix

         Estimate   SE
elpd_loo   -195.3 12.2
p_loo         5.1  1.8
looic       390.5 24.5
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     114   99.1%   1625      
 (0.5, 0.7]   (ok)         1    0.9%   206       
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Recall that the relevant quantity is elpd_loo. The estimates for elpd_loo differ somewhat. Recall that the superior model has the higher value of elpd_loo which, since these values are negative, is the one closer to zero. The standard errors (SE) for these models also differ a little. To compare the models directly, we will use loo_compare().

Code
loo_compare(L1, L2)
      elpd_diff se_diff
fit_2  0.0       0.0   
fit_1 -7.0       3.8   

Recall that, with loo_compare(), the resulting data shows the superior model first, with values of 0 for elpd_diff and se_diff, since it compares the models to the best option. The values of elpd_diff and se_diff for fit_1 show the difference in the models. As we can see, fit_2, the model which looks at treatment + att_start, is better.

But how certain can we be that it is better? Note that the difference between the two models is not quite two standard errors. So, there is a reasonable possibility that the difference is due to chance.
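A quick check of that claim, using the numbers printed by loo_compare() above:

Code
# Is the magnitude of elpd_diff smaller than two standard errors of the
# difference?
abs(-7.0) < 2 * 3.8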

6.6.4 Temperance

What we really care about is data we haven’t seen yet, mostly data from tomorrow. But what if the world changes, as it always does? If it doesn’t change much, maybe we are OK. If it changes a lot, then what good will our model be? In general, the world changes some. That means that our forecasts are more uncertain than a naive use of our model might suggest. Having created (and checked) a model, we now use the model to answer questions. Models are made for use, not for beauty. The world confronts us. Make decisions we must. Our decisions will be better ones if we use high quality models to help make them.

Preceptor’s Posterior is the posterior you would calculate if all the assumptions you made under Wisdom and Justice were correct. Sadly, they never are! So, you can never know Preceptor’s Posterior. Our posterior will, we hope, be a close-ish approximation of Preceptor’s Posterior.

6.7 Summary

In this chapter, we covered a number of topics important to effectively creating models.

Key commands:

  • Create a model using stan_glm().

  • After creating a model, we can use loo() to perform leave-one-out cross validation. This assesses how effectively our model makes predictions for data it has not seen yet.

  • The command loo_compare() allows us to compare two models, to see which one performs better in leave-one-out cross validation. The superior model makes better predictions.

Remember:

  • We can transform variables – through centering, scaling, taking logs, etc. – to make them more sensible. Consider using a transformation if the intercept is awkward. For instance, if the intercept for age represents the estimate for people of age zero, we might consider transforming age to be easier to interpret.

  • When selecting variables to include in our model, follow this rule: keep a variable if it has a large and well-estimated coefficient. This means that the 95% confidence interval excludes zero. Speaking roughly, removing a variable with a large coefficient meaningfully changes the predictions of the model.

  • When we compare across unit, meaning comparing Joe and George, we are not looking at a causal relationship. Within unit discussions, where we are comparing Joe under treatment versus Joe under control, are causal. This means that within unit interpretation is only possible in causal models, where we are studying one unit under two conditions.

  • When we talk about two potential outcomes, we are discussing the same person or unit under two conditions.