This chapter is still a DRAFT. Check back in a few weeks.
In our haste to make progress — to get all the way through the process of building, interpreting and using models — we have given short shrift to some of the messy details of model building and evaluation. This chapter fills in those lacunae. We will also introduce models with four parameters.
Another question for us data scientists: what happens when our assumptions break down? The first option is to extend our model, perhaps by addressing our concerns with validity by including error models. The second option, which we will focus on here, is to modify our data or model so that our assumptions are more reasonable. How do we accomplish this?
It is often convenient to transform a predictor variable so that our model makes more sense. Let’s explore exactly what that means.
Recall our model of income as a function of age:

\[ y_i = \beta_0 + \beta_1 age_i + \epsilon_i\]
We fit this using the
trains data from primer.data.
```
##             Median   MAD_SD 
## (Intercept) 103379.6 24657.3
## age            904.7   561.4
## 
## Auxiliary parameter(s):
##       Median  MAD_SD
## sigma 74166.4  5146.6
```
There is nothing wrong with this model. Yet the interpretation of \(\beta_0\), the intercept in the regression, is awkward. It is the expected income for people of age zero. That is useless! There are no people of age zero in our data. And, even if there were, it would be weird to think about such people taking the commuter train into Boston and filling out our survey forms.
It is easy, however, to transform
age into a variable which makes the intercept more meaningful. Consider a new variable,
c_age, which is
age minus the average age in the sample. Using this centered version of age does not change the predictions or residuals in the model, but it does make the intercept easier to interpret.
```
##             Median   MAD_SD
## (Intercept) 141804.2 7188.0
## c_age          906.7  572.9
## 
## Auxiliary parameter(s):
##       Median  MAD_SD
## sigma 74273.6 4882.0
```
The intercept, 141,804, is the expected income for someone with c_age = 0, i.e., someone of average age in the data, which is around 42.
Centering — changing a variable via addition/subtraction — often makes the intercept easier to interpret. Scaling — changing a variable via multiplication/division — often makes it easier to interpret coefficients. The most common scaling method is to divide the variable by its standard deviation.
```
##             Median   MAD_SD 
## (Intercept) 103061.4 25235.6
## s_age        11121.0  6921.6
## 
## Auxiliary parameter(s):
##       Median  MAD_SD
## sigma 74124.8 4959.5
```
s_age is age scaled by its own standard deviation. A change of one unit in s_age corresponds to a change of one standard deviation in age, which is about 12 years. The interpretation of \(\beta_1\) is now:
When comparing two people, one about 1 standard deviation worth of years older than the other, we expect the older person to earn about 11,000 more.
But, because we scaled without centering, the intercept is now back to the (nonsensical) meaning of the expected income for people of age 0.
The most common transformation applies both centering and scaling. The base R function
scale() subtracts the mean and divides by the standard deviation. A variable so transformed is a “z-score,” meaning a variable with a mean of zero and a standard deviation of one. Using z-scores makes interpretation easier, especially when we seek to compare the importance of different predictors.
```
##             Median   MAD_SD
## (Intercept) 141650.2 7101.3
## z_age        10899.1 6712.0
## 
## Auxiliary parameter(s):
##       Median  MAD_SD
## sigma 74300.4 5019.0
```
The two parameters are easy to interpret after this transformation.
The expected income of someone of average age, which is about 42 in this study, is about 142,000 dollars.
When comparing two individuals who differ in age by one standard deviation, which is about 12 years in this study, the older person is expected to earn about 11,000 more than the younger.
Note that, when using z-scores, we would often phrase this comparison in terms of “sigmas.” One person is “one sigma” older than another person means that they are one standard deviation older. This is simple enough, once you get used to it, but also confusing since we are already using the word “sigma” to mean \(\sigma\), the standard deviation of \(\epsilon_i\). Alas, language is something we deal with rather than control. You will hear the same word “sigma” applied to both concepts, even in the same sentence. Determine meaning by context.
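As a quick check on the claim that these transformations do not change predictions or residuals, here is a minimal sketch using lm() on simulated data for speed (the book's models use stan_glm(), but the invariance is the same; the toy data and variable names here are invented):

```r
# Centering and scaling a predictor changes the coefficients but not
# the fitted values. All data below is simulated for illustration.
set.seed(42)
toy <- data.frame(age = sample(20:70, 50, replace = TRUE))
toy$income <- 100000 + 900 * toy$age + rnorm(50, sd = 20000)

toy$c_age <- toy$age - mean(toy$age)       # centered
toy$z_age <- as.numeric(scale(toy$age))    # z-score: centered and scaled

fit_raw <- lm(income ~ age,   data = toy)
fit_c   <- lm(income ~ c_age, data = toy)
fit_z   <- lm(income ~ z_age, data = toy)

# Identical predictions; only the interpretation of the parameters changes.
all.equal(fitted(fit_raw), fitted(fit_c))
all.equal(fitted(fit_raw), fitted(fit_z))
```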
It is often helpful to take the log of predictor variables, especially in cases in which their distribution is skewed. You should generally only take the log of variables for which all the values are strictly positive. The log of a negative number is not defined. Consider the number of registered voters (
rv13) at each of the polling stations in the kenya tibble.
```r
x <- kenya %>% 
  filter(rv13 > 0)

rv_p <- x %>% 
  ggplot(aes(rv13)) + 
    geom_histogram(bins = 100) +
    labs(x = "Registered Voters",
         y = NULL)

log_rv_p <- x %>% 
  ggplot(aes(log(rv13))) + 
    geom_histogram(bins = 100) +
    labs(x = "Log of Registered Voters",
         y = NULL) +
    expand_limits(y = c(0, 175))

rv_p + log_rv_p +
  plot_annotation(title = 'Registered Votes In Kenya Communities',
                  subtitle = "Taking logs helps us deal with outliers")
```
Most experienced data scientists would use the log of
rv13 rather than the raw value. Comments:
- We do not know the “true” model. Who is to say that a model using the raw value is right or wrong?
- Check whether or not this choice meaningfully affects the answer to your question. Much of the time, it won’t. That is, our inferences are often fairly “robust” to small changes in the model. If you get the same answer with log_rv13, then no one cares which you use.
- Follow the conventions in your field. If everyone does X, then you should probably do X, unless you have a good reason not to. If you do have such a reason, explain it prominently.
- Most professionals, when presented with data distributed like rv13, would take the log. Professionals hate (irrationally?) outliers. Any transformation which makes a distribution look more normal is generally considered a good idea.
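To see why logging helps with skew, here is a small sketch using simulated data shaped roughly like rv13, i.e., strictly positive with a long right tail (the variable rv below is invented for illustration):

```r
# Simulate a right-skewed, strictly positive variable.
set.seed(5)
rv <- rlnorm(1000, meanlog = 6, sdlog = 1)

# In a right-skewed distribution the mean sits well above the median;
# after logging, the two nearly coincide.
mean(rv) > median(rv)
abs(mean(log(rv)) - median(log(rv)))
```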
Many of these suggestions apply to every aspect of the modeling process.
Instead of simply transforming variables, we can add more terms which are transformed versions of a variable. Consider the relationship between height and age in the nhanes data. Let’s start by dropping the missing values.
```r
no_na_nhanes <- nhanes %>% 
  select(height, age) %>% 
  drop_na()
```
Fit and plot a simple linear model:
```r
nhanes_1 <- stan_glm(height ~ age,
                     data = no_na_nhanes,
                     refresh = 0,
                     seed = 47)

no_na_nhanes %>% 
  ggplot(aes(x = age, y = height)) +
    geom_point(alpha = 0.1) +
    geom_line(aes(y = fitted(nhanes_1)), color = "red") +
    labs(title = "Age and Height",
         subtitle = "Children are shorter, but a linear fit is poor",
         x = "Age",
         y = "Height (cm)",
         caption = "Data source: NHANES")
```
That is not a very good model, obviously.
Adding a quadratic term makes it better. (Note the need for I() when creating the squared term within the formula.)
```r
nhanes_2 <- stan_glm(height ~ age + I(age^2),
                     data = no_na_nhanes,
                     refresh = 0,
                     seed = 33)

no_na_nhanes %>% 
  ggplot(aes(x = age, y = height)) +
    geom_point(alpha = 0.1) +
    geom_line(aes(y = fitted(nhanes_2)), color = "red") +
    labs(title = "Age and Height",
         subtitle = "Quadratic fit is much better, but still poor",
         x = "Age",
         y = "Height (cm)",
         caption = "Data source: NHANES")
```
Still, we have not made use of our background knowledge in creating these variables. We know that people don’t get any taller after age 18 or so. Let’s create variables which capture that break.
```r
nhanes_3 <- stan_glm(height ~ I(ifelse(age > 18, 18, age)),
                     data = no_na_nhanes,
                     refresh = 0,
                     seed = 23)

no_na_nhanes %>% 
  ggplot(aes(x = age, y = height)) +
    geom_point(alpha = 0.1) +
    geom_line(aes(y = fitted(nhanes_3)), color = "red") +
    labs(title = "Age and Height",
         subtitle = "Domain knowledge makes for better models",
         x = "Age",
         y = "Height (cm)",
         caption = "Data source: NHANES")
```
The point is that we should not take the variables we receive as given. We are the captains of our souls. We transform variables as needed.
Transforming predictor variables is uncontroversial. It does not matter much. Change most continuous predictor variables to \(z\)-scores and you won’t go far wrong. Or keep them in their original form, and take care with your interpretations. It’s all good.
Transforming the outcome variable is a much more difficult question. Imagine that we seek to create a model which explains
rv13 from the
kenya tibble. Should we transform it?
Maybe? There are no right answers. A model with rv13 as the outcome variable is different from a model with log(rv13) as the outcome. The two are not directly comparable.
Much of the same advice with regard to taking logs of predictor variables applies here as well.
How do we decide which variables to include in a model? There is no one right answer to this question. Our advice is to keep a variable X if any of the following circumstances apply:
The variable has a large and well-estimated coefficient. This means, roughly, that the 95% confidence interval excludes zero.
Underlying theory/observation suggests that X has a meaningful connection to the outcome variable.
If the variable has a small standard error, it is general practice to include it in our model to improve predictions. If the standard error of a coefficient is large and we find no other reason to include the variable, that may be a signal to remove it from the model.
If a predictor isn’t statistically significant, we would normally consider discarding it. The exception is when we want to compare two indicators in spite of a variable not appearing to be significant (for instance, comparing Democrats versus Republicans via the variable party in the trains dataset).
It is standard in your field to include X in such regressions.
Your boss/client/reviewer/supervisor wants to include X.
Deciding which variables to include in a model is a subset of the larger question: How do we decide which model, out of the set of possible models, to choose?
Consider two models which explain attitudes to immigration among Boston commuters.
```
##             Median MAD_SD
## (Intercept) 10.0   0.3   
## liberalTRUE -2.0   0.5   
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 2.7    0.2   
```
```
##             Median MAD_SD
## (Intercept) 1.6    0.4   
## att_start   0.8    0.0   
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 1.4    0.1   
```
They both seem like good models! The results make sense. People who are liberal have more liberal attitudes about immigration, so we would expect their
att_end scores to be lower. We would also expect people to provide similar answers in two surveys administered a week or two apart. It makes sense that those with higher (more conservative) values for
att_start would also have higher values for att_end.

How do we choose between these models?
The most obvious criterion for comparing models is the accuracy of their predictions. For example, consider the use of liberal to predict att_end.
```r
trains %>% 
  mutate(pred_liberal = fitted(fit_liberal)) %>% 
  ggplot(aes(x = pred_liberal, y = att_end)) +
    geom_jitter(width = 0.05, height = 0.2, alpha = 0.5) +
    labs(title = "Modeling Attitude Toward Immigration",
         subtitle = "Liberals are less conservative",
         x = "Predicted Attitude",
         y = "True Attitude")
```
Because there are only two possible values for
liberal — TRUE and FALSE — there are only two predictions which this model will make: about 10 for
liberal == FALSE and about 8 for
liberal == TRUE. (The points in the above plot are jittered.) For some individuals, these are perfect predictions. For others, they are poor predictions. (Note the two individuals who are
liberal == TRUE, and who the model thinks will have
att_end == 8, but who have
att_end == 15. The model got them both very, very wrong.)
Consider our second model, using att_start to forecast att_end.
```r
trains %>% 
  mutate(pred_att_start = fitted(fit_att_start)) %>% 
  ggplot(aes(x = pred_att_start, y = att_end)) +
    geom_jitter(width = 0.05, height = 0.2, alpha = 0.5) +
    labs(title = "Modeling Attitude Toward Immigration",
         subtitle = "Survey responses are somewhat consistent",
         x = "Predicted Attitude",
         y = "True Attitude")
```
Because att_start takes on 13 unique values, the model makes 13 unique predictions. Some of those predictions are perfect! But others are very wrong. Note the individual with a predicted
att_end of around 9 but with an actual value of
15. That is a big miss!
Rather than looking at individual cases, we need to look at the errors for all the predictions. Fortunately, a prediction error is the same thing as a residual, which is easy enough to calculate.
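As a sketch of that equivalence, here is the calculation done by hand with lm() on a few invented rows, rather than the actual trains models:

```r
# A residual is the difference between the actual and the fitted value.
# Toy data below is invented for illustration.
toy <- data.frame(att_start = c(3, 8, 11, 13),
                  att_end   = c(5, 5, 11, 13))
fit <- lm(att_end ~ att_start, data = toy)

toy$pred  <- unname(fitted(fit))
toy$resid <- toy$att_end - toy$pred

# Hand-computed residuals match what residuals() reports.
all.equal(toy$resid, unname(residuals(fit)))
```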
```
## # A tibble: 115 x 7
##    att_end att_start liberal pred_lib resid_lib pred_as resid_as
##      <dbl>     <dbl> <lgl>      <dbl>     <dbl>   <dbl>    <dbl>
##  1      11        11 FALSE      10.0    -0.974    10.6    -0.399
##  2      10         9 FALSE      10.0     0.0259    8.97   -1.03 
##  3       5         3 TRUE        8.02    3.02      4.08   -0.916
##  4      11        11 FALSE      10.0    -0.974    10.6    -0.399
##  5       5         8 TRUE        8.02    3.02      8.16    3.16 
##  6      13        13 FALSE      10.0    -2.97     12.2    -0.769
##  7      13        13 FALSE      10.0    -2.97     12.2    -0.769
##  8      11        10 FALSE      10.0    -0.974     9.79   -1.21 
##  9      12        12 FALSE      10.0    -1.97     11.4    -0.584
## 10      10         9 FALSE      10.0     0.0259    8.97   -1.03 
## # … with 105 more rows
```
Let’s look at the square root of the average squared error.
```
## # A tibble: 1 x 2
##   lib_sigma as_sigma
##       <dbl>    <dbl>
## 1      2.68     1.35
```
There are many different measures of the error which we might calculate. The squared difference is most common for historical reasons: it was the most mathematically tractable in the pre-computer age. Having calculated a squared difference for each observation, we can sum them, take their average, or take the square root of their average. All produce the same relative ranking, but the last is most popular because it (more or less) corresponds to the estimated \(\sigma\) for a linear model. Note how these measures are the same as the sigmas produced by the Bayesian models created above.
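A minimal sketch of the point about relative rankings, using two invented residual vectors:

```r
# Different summaries of squared error always agree on which model is
# better. The residual vectors below are hypothetical.
resid_a <- c(1, -2, 0.5, 1.5)    # hypothetical residuals from model A
resid_b <- c(3, -1, 2, -2.5)     # hypothetical residuals from model B

sse  <- function(r) sum(r^2)          # sum of squared errors
mse  <- function(r) mean(r^2)         # mean squared error
rmse <- function(r) sqrt(mean(r^2))   # root mean squared error

# Every measure ranks model A ahead of model B.
c(sse(resid_a)  < sse(resid_b),
  mse(resid_a)  < mse(resid_b),
  rmse(resid_a) < rmse(resid_b))
```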
Sadly, it is not wise simply to select the model which fits the data best, because doing so can be misleading. After all, you are cheating! You are using that very data to select your parameters and then, after using the data once, turning around and “checking” to see how well your model fits the data. It had better fit! You used it to pick your parameters! This is the danger of overfitting.
One of the biggest dangers in data science is overfitting, using a model with too many parameters which fits the data we have too well and, therefore, works poorly on data we have yet to see. Consider a simple example with 10 data points.
What happens when we fit a model with one predictor?
That is a reasonable model. It does not fit the data particularly well, but we certainly believe that higher values of
x are associated with higher values of
y. A linear fit is not unreasonable.
But we can also use some of the lessons from above and try a quadratic fit by adding \(x^2\) as a predictor.
Is this a better model? Maybe?
But why stop at adding \(x^2\) to the regression? Why not add \(x^3\), \(x^4\) and all the way to \(x^9\)? When we do so, the fit is much better.
```r
nine_pred <- lm(y ~ poly(x, 9),
                data = ovrftng)

newdata <- tibble(x = seq(1, 10, by = 0.01),
                  y = predict(nine_pred,
                              newdata = tibble(x = x)))

ovrftng %>% 
  ggplot(aes(x, y)) +
    geom_point() +
    geom_line(data = newdata, aes(x, y)) +
    labs(title = "`y` as a 9-Degree Polynomial Function of `x`") +
    scale_x_continuous(breaks = seq(2, 10, 2)) +
    scale_y_continuous(breaks = seq(2, 10, 2))
```
If the only criterion we cared about were how well the model predicts using the data on which the parameters were estimated, then a model with more parameters would always be better. But that is not what truly matters. What matters is how well the model works on data which was not used to create the model.
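A small sketch of the in-sample half of this claim, using simulated data in place of the ovrftng tibble (which we do not reconstruct here):

```r
# In-sample error can only shrink as we add polynomial terms.
set.seed(9)
sim <- data.frame(x = 1:10)
sim$y <- sim$x + rnorm(10)

# Residual sum of squares for polynomial fits of degree 1 through 9.
rss <- sapply(1:9, function(d) deviance(lm(y ~ poly(x, d), data = sim)))

# rss is non-increasing: more parameters never fit the training data worse.
rss
```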
The most sensible way to test a model is to use the model to make predictions and compare those predictions to new data. After fitting the model using
stan_glm, we would use
posterior_predict to obtain simulations representing the predictive distribution for new cases. For instance, if we were to predict how attitudes toward immigration change among Boston commuters based on political affiliation, we would want to go out and test our theories on new Boston commuters.
When thinking of generalization to new data, it is important to consider what counts as relevant new data in the context of the modeling problem. Some models are used to predict the future and, in those cases, we can wait, eventually observe the future, and check how good our model is at making predictions. Often, however, models are used to gain insight into some phenomenon without any immediate plan for prediction. This is the case with our Boston commuters example. In such cases, we are also interested in whether insights learned from one part of the data generalize to other parts of the data. For example, if we know how political attitudes informed immigration stances among Boston commuters, we may want to know whether those same conclusions generalize to train commuters in different locations.
Even if we had detected clear problems with our predictions, this would not necessarily mean that there is anything wrong with the model as fit to the original dataset. However, we would need to understand it further before generalizing to other commuters.
Often, we would like to evaluate and compare models without waiting for new data. One can simply evaluate predictions on the observed data. But since these data have already been used to fit the model parameters, these predictions are optimistic for assessing generalization.
In cross validation, part of the data is used to fit the model and the rest of the data—the hold-out set—is used as a proxy for future data. When there is no natural prediction task for future data, we can think of cross validation as a way to assess generalization from one part of the data to another part.
In any form of cross validation, the model is re-fit leaving out one part of the data and then the prediction for the held-out part is evaluated. Cross validation removes the overfitting problem arising from using the same data for estimation and evaluation, but at the cost of requiring the model to be fit as many times as the number of partitions. In the next section, we will look at a type of cross validation called leave-one-out (LOO) cross validation.
We can hold out individual observations, called leave-one-out (LOO) cross validation; or groups of observations, called leave-one-group-out cross validation; or use past data to predict future observations, called leave-future-out cross validation. When we perform cross validation, the model is re-fit leaving out one part of the data and then the prediction for the held-out part is evaluated.
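Leave-one-out cross validation is simple enough to sketch by hand: re-fit the model n times, each time holding out one row and predicting it. Here is a minimal version using lm() on simulated data (the models above use stan_glm(), but the logic is identical; the data is invented):

```r
# Manual leave-one-out cross validation on simulated data.
set.seed(10)
dat <- data.frame(x = 1:20)
dat$y <- dat$x + rnorm(20)

loo_sq_errors <- numeric(nrow(dat))
for (i in seq_len(nrow(dat))) {
  fit  <- lm(y ~ x, data = dat[-i, ])        # fit without row i
  pred <- predict(fit, newdata = dat[i, ])   # predict the held-out row
  loo_sq_errors[i] <- (dat$y[i] - pred)^2
}

sqrt(mean(loo_sq_errors))   # LOO estimate of out-of-sample RMSE
```

Note the cost: the model is re-fit once per observation, which is why the approximation discussed next is so useful.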
We have built an R package,
loo, containing a function,
loo, with a computational shortcut that uses probability calculations to approximate the leave-one-out evaluation, while only fitting the model once, and also accounting for posterior uncertainty in the fit. This is how we will determine which model is superior for our purposes.
First, we will refamiliarize ourselves with our first model,
```
## stan_glm
##  family:       gaussian [identity]
##  formula:      att_end ~ liberal
##  observations: 115
##  predictors:   2
## ------
##             Median MAD_SD
## (Intercept) 10.0    0.3  
## liberalTRUE -2.0    0.5  
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 2.7    0.2   
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
```
Now, we will perform
loo() on our model and look at the results.
```r
loo_liberal <- fit_liberal %>% 
  loo()

loo_liberal
```
```
## 
## Computed from 4000 by 115 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo   -279.4  7.5
## p_loo         2.9  0.5
## looic       558.9 15.0
## ------
## Monte Carlo SE of elpd_loo is 0.0.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
```
What does any of this mean?
- elpd_loo is the estimated log score, along with a standard error representing uncertainty due to using only 115 data points.
- p_loo is the estimated “effective number of parameters” in the model.
- looic is the LOO information criterion, -2 elpd_loo, which we compute for comparability to deviance.
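As a quick arithmetic check of that last definition against the loo() output above:

```r
# looic is just -2 times elpd_loo; the value is taken from the printed
# output above.
elpd_loo <- -279.4
looic    <- -2 * elpd_loo
looic    # 558.8, matching the reported 558.9 up to rounding
```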
For our purposes, we mostly need to focus on
elpd_loo. Let’s explain, in more depth, what this information means.
Conclusion: the higher our value for elpd_loo, the better our model performs. This means that, when comparing models, we want to select the model with the higher value for elpd_loo.
Let’s turn our attention to our second model. To begin, let’s observe the qualities of
fit_att_start once again.
```
## stan_glm
##  family:       gaussian [identity]
##  formula:      att_end ~ att_start
##  observations: 115
##  predictors:   2
## ------
##             Median MAD_SD
## (Intercept) 1.6    0.4   
## att_start   0.8    0.0   
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 1.4    0.1   
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
```
Great! Now, let’s perform
loo() on this model.
```r
loo_att_start <- fit_att_start %>% 
  loo()

loo_att_start
```
```
## 
## Computed from 4000 by 115 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo   -201.7 12.5
## p_loo         4.2  1.8
## looic       403.4 25.1
## ------
## Monte Carlo SE of elpd_loo is 0.1.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
```
The elpd_loo value for this model is -201.7. This is higher than the -279.4 for loo_liberal, implying that this model is superior. However, we can’t see our estimates together. Is there a simpler way to determine which model is better?
Actually, yes! To compare the two models directly, we can use the function
loo_compare with our two
loo objects created above. This will calculate the difference in
elpd_loo between our models for us, making our job easier:
```
##   elpd_diff se_diff
## .       0.0     0.0
## .     -77.7    12.0
```
The value for elpd_diff is equal to the difference in elpd_loo between our two models: elpd_loo for loo_liberal minus elpd_loo for loo_att_start = -279.4 - (-201.7) = -77.7. The
se_diff shows that the difference has a standard error of 12.0. This shows that, even when accounting for standard error, the better model is clear:
fit_att_start. Therefore, attitude at the start of the trains study is a better predictor of final attitude than the variable liberal, which serves as a proxy for political affiliation.
As we have seen,
loo_compare is a shortcut for comparing two models. When you are deciding between two models,
loo_compare() is a great way to simplify your decision.
As always, it is important to look at the practices of other professionals, and at the reasons we may choose not to follow them. Consider, for instance, our continuing objections to hypothesis testing. In hypothesis testing, we assert a null hypothesis \(H_0\) about our data and an alternative hypothesis \(H_a\).
When performing hypothesis testing, we either reject the null hypothesis or we do not reject it. We reject it if the 95% confidence interval excludes the null value; if the null value falls inside the 95% confidence interval, we do not reject it. In the case of “insignificant” results, with p > 0.05, we cannot “reject” the null hypothesis. However, this does not mean that we accept it.
The premise of hypothesis testing is to answer a specific question about our data, one that may not even be particularly relevant to our understanding of the world. So, what are our problems with hypothesis testing?

- Rejecting or not rejecting hypotheses doesn’t help us to answer real questions.
- The fact that a difference is not “significant” has no relevance to how we use the posterior to make decisions.
- Statistical significance is not equal to practical importance.
- There is no reason to test when you can summarize by providing the full posterior probability distribution.