Functions

11.7 Introduction

A function is a piece of code that is packaged in a way that makes it easy to reuse. Functions make it easy for you to filter(), arrange(), select(), and create a tibble(), as you have seen in Chapters 1 and 2. Functions also allow you to transform variables and perform mathematical calculations. We use functions like rnorm() and runif() to generate random draws from a distribution.

Every time we reference a function in this Primer, we include the parentheses. You call a function by including its parentheses and any necessary arguments within those parentheses. This is a correct call of rnorm():

rnorm(n = 1)
## [1] -0.36

If you run the function name without its parentheses, R will return the code that makes up the function.

rnorm
## function (n, mean = 0, sd = 1) 
## .Call(C_rnorm, n, mean, sd)
## <bytecode: 0x7f947761b8d8>
## <environment: namespace:stats>

Functions can do all sorts of things. sample() takes a vector of values and returns a number of values randomly selected from that vector. You can specify the number of random values with the argument size. This call is the equivalent of rolling a die.

sample(x = 1:6, size = 1)
## [1] 1

Functions can also take in other functions as arguments. For example, replicate() takes an expression and repeats it n times. What if we replicated the rolling of a die ten times?

replicate(10, sample(1:6, 1))
##  [1] 1 1 5 3 4 2 5 6 2 4

An especially useful type of function is the family of map_* functions, from the purrr package, which is automatically loaded with library(tidyverse). These functions apply some other function to every row in a tibble.

Let’s create a tibble with one variable x which takes on three values: 3, 7, and 2.

tibble(x = c(4, 16, 9))
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     4
## 2    16
## 3     9

It is easy to use mutate to create a new variable, sq_root, which is the square root of each value of x.

tibble(x = c(4, 16, 9)) %>% 
  mutate(sq_root = sqrt((x)))
## # A tibble: 3 x 2
##       x sq_root
##   <dbl>   <dbl>
## 1     4       2
## 2    16       4
## 3     9       3

map_* functions provide another approach. A map_* function takes two required arguments. First is the object over which you want to iterate. This will generally be a column in the tibble in which you are working. Second is the function which you want to run for each row in the tibble.

tibble(x = c(4, 16, 9)) %>% 
  mutate(sq_root = map_dbl(x, ~ sqrt(.)))
## # A tibble: 3 x 2
##       x sq_root
##   <dbl>   <dbl>
## 1     4       2
## 2    16       4
## 3     9       3

map_dbl() (pronounced “map-double”) took the function sqrt() and applied it to each element of x. There are two tricky parts to the use of map_* functions. First, you need to put the tilde symbol — the “~” — before the name of the function which you want to call. Without the ~, you will get an error:

tibble(x = c(4, 16, 9)) %>% 
  mutate(sq_root = map_dbl(x, sqrt(.)))
## Error: Problem with `mutate()` column `sq_root`.
## ℹ `sq_root = map_dbl(x, sqrt(.))`.
## x Can't convert a `tbl_df/tbl/data.frame` object to function

Second, you need to include a period — the “.” — in the spot where the variable goes. Using the name of the variable — x in this case — will generate an error.

tibble(x = c(4, 16, 9)) %>% 
  mutate(sq_root = map_dbl(x, ~ sqrt(x)))
## Error: Problem with `mutate()` column `sq_root`.
## ℹ `sq_root = map_dbl(x, ~sqrt(x))`.
## x Result 1 must be a single double, not a double vector of length 3

Tilde and dot (~ and .) are easy to forget.

If you know the expected output of your function, you can specify that kind of vector:

  • map(): list
  • map_lgl(): logical
  • map_int(): integer
  • map_dbl(): double (numeric)
  • map_chr(): character
  • map_df(): data frame

Since our example returns numeric output, we use map_dbl() instead of map().

The key difference between using mutate() and map_* functions is that map_* functions are designed to work well with lists, both as inputs and as outputs. mutate() is designed for atomic vectors, meaning vectors in which element is a single value.

11.8 List-columns and map functions

Recall that a list is different from an atomic vector. In atomic vectors, each element of the vector has one value. Lists, however, can contain vectors, and even more complex objects, as elements.

x <- list(c(4, 16, 9), c("A", "Z"))
x
## [[1]]
## [1]  4 16  9
## 
## [[2]]
## [1] "A" "Z"

x is a list with two elements. That element is a numeric vector of length 3. The second element is a character vector of length 2. We use [[]] to extract specific elements. Example:

x[[1]][3]
## [1] 9

The first [[]] extracts the first element form the list x. The second `[[]]`` extracts the 3rd element from the vector which is that first element.

There are a number of built-in R functions that output lists. For example, the ggplot objects you have been making store all of the plot information in lists. Any function that returns multiple values can be used to create a list output by wrapping that returned object with list().

x <- rnorm(10)

# range() returns the min and max of the argument 

range(x)
## [1] -0.64  1.34
tibble(col_1 = list(range(x))) 
## # A tibble: 1 x 1
##   col_1    
##   <list>   
## 1 <dbl [2]>

Notice this is a 1x1 tibble with one observation, which is a list of one element. Voila! You have just created a list-column.

If a function returns multiple values as a vector, like range() does, you must use list() as a wrapper if you want to create a list-column.

A list column is a column of your data which is a list rather than an atomic vector. Like with lists, you can pipe to str() to examine the column.

tibble(col_1 = list(range(x))) %>%
  str()
## tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
##  $ col_1:List of 1
##   ..$ : num [1:2] -0.636 1.341

We can use map_* functions to both create a list-column and then, much more importantly, work with that list-column afterwards. Example:

# This simple example demonstrates the workflow which we will often follow.
# Start by creating a tibble which will be used to store the results. (Or start
# with a tibble which already exists and to which you will be adding more
# columns.) It is often convenient to get all the code working with just a few
# rows. Once it is working, we increase the number of rows to a thousand or
# million or whatever we need.

tibble(ID = 1:3) %>% 
  
  # The big convenience is being able to store a list in each row of the tibble.
  # Note that we are not using the value of ID in the call to rnorm(). (That is
  # why we don't have a "." anywhere.) But we are still using ID as a way of
  # iterating through each row; ID is keeping count for us, in a sense.
  
  mutate(draws = map(ID, ~ rnorm(10))) %>% 
  
  # Each succeeding step of the pipe works with columns already in the tibble
  # while, in general, adding more columns. The next step calculates the max
  # value in each of the draw vectors. We use map_dbl() because we know that
  # max() will returns a single number.
  
  mutate(max = map_dbl(draws, ~ max(.))) %>% 
  
  # We will often need to calculate more than one item from a given column like
  # draws. For example, in addition to knowing the max value, we would like to
  # know the range. Because the range is a vector, we need to store the result
  # in a list column. map() does that for us automatically.
  
  mutate(min_max = map(draws, ~ range(.)))
## # A tibble: 3 x 4
##      ID draws        max min_max  
##   <int> <list>     <dbl> <list>   
## 1     1 <dbl [10]>  1.39 <dbl [2]>
## 2     2 <dbl [10]>  1.19 <dbl [2]>
## 3     3 <dbl [10]>  1.87 <dbl [2]>

This flexibility is only possible via the use of list-columns and map_* functions. This workflow is extremely common. We start with an empty tibble, using ID to specify the number of rows. With that skeleton, each step of the pipe adds a new column, working off a column which already exists.

Let’s practice with the nhanes dataset from the primer.data package. How could we add a column to the dataset that included the quantiles of the height variable for each gender?

library(primer.data)

Select the relevant variables, and group by gender. We are grouping because we are curious as to how height is distributed in between gender. We drop any rows with missing data.

nhanes %>%
  select(gender, height) %>%
  drop_na() %>% 
  group_by(gender)
## # A tibble: 9,647 x 2
## # Groups:   gender [2]
##    gender height
##    <chr>   <dbl>
##  1 Male     165.
##  2 Male     165.
##  3 Male     165.
##  4 Male     105.
##  5 Female   168.
##  6 Male     133.
##  7 Male     131.
##  8 Female   167.
##  9 Female   167.
## 10 Female   167.
## # … with 9,637 more rows

There are two approaches. In the first, we are happy to have an output tibble with just two rows:

nhanes %>%
  select(gender, height) %>%
  drop_na() %>% 
  group_by(gender) %>% 
  summarise(q_height = list(quantile(height)),
            .groups = "drop")
## # A tibble: 2 x 2
##   gender q_height 
##   <chr>  <list>   
## 1 Female <dbl [5]>
## 2 Male   <dbl [5]>

Note that there was no need to use map_* functions in this case. The simple dplyr approach works fine. The only “trick” is the use of list() to wrap the output of quantile(). Use str() to examine the exact values.

nhanes %>%
  select(gender, height) %>%
  drop_na() %>% 
  group_by(gender) %>% 
  summarise(q_height = list(quantile(height)),
            .groups = "drop") %>% 
  str()
## tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
##  $ gender  : chr [1:2] "Female" "Male"
##  $ q_height:List of 2
##   ..$ : Named num [1:5] 83.8 154.3 160.6 165.9 184.5
##   .. ..- attr(*, "names")= chr [1:5] "0%" "25%" "50%" "75%" ...
##   ..$ : Named num [1:5] 83.6 166.2 173.8 179.4 200.4
##   .. ..- attr(*, "names")= chr [1:5] "0%" "25%" "50%" "75%" ...

Men are taller than women throughout the distribution, but the smallest individual (child) in the data, see the 0% quantile, happens to be male.

The second case involves a scenario in which we do not want to “lose” any rows in our tibble. We want a q_height column for all the rows, even if the values included are repetitive. A common scenario is that we want to use q_height to perform a calculation for each individual. To do this, we need a map_* function.

nhanes %>%
  select(gender, height) %>%
  drop_na() %>% 
  group_by(gender) %>% 
  summarize(q_height = map(height, ~ quantile(.)),
            .groups = "drop")
## # A tibble: 9,647 x 2
##    gender q_height 
##    <chr>  <list>   
##  1 Female <dbl [5]>
##  2 Female <dbl [5]>
##  3 Female <dbl [5]>
##  4 Female <dbl [5]>
##  5 Female <dbl [5]>
##  6 Female <dbl [5]>
##  7 Female <dbl [5]>
##  8 Female <dbl [5]>
##  9 Female <dbl [5]>
## 10 Female <dbl [5]>
## # … with 9,637 more rows

The first four lines of the pipe are the same in both cases. The only difference is the use of list(quantile(height)) in the first and map(height, ~ quantile(.) in the second.

Until now, we have practiced using map_* functions with built-in R functions. Sometimes, however, there is not an R function which does what we want. When that happens, we need to create our own function.

11.9 Custom Functions

There are many built-in functions in R. A function is composed of a name, a list of arguments and a body. We create our own functions with the function() function.

11.9.1 Creating your own functions

Assume we want to create a function which adds 1 and 1 together. The first step is to write some R code which does that.

1 + 1
## [1] 2

This code will become the “body” of the function, the part in between the curly braces. We also need to a function definition, which is composed of the name of the function, a call to the function() function, and a pair of curly braces.

add_one_and_one <- function(){}

Combining the function definition and the body of the function completes the process.

add_one_and_one <- function(){
  1 + 1
}

add_one_and_one()
## [1] 2

You just created a function! This function will return 1 + 1 whenever called.

Consider a function which adds the number 6 to a value x, a value which we want to allow the user provide.

add_six_to_something <- function(x){
  x + 6
}

add_six_to_something(x = 1)
## [1] 7

You have incorporated your first formal argument. Formal arguments in functions are additional parameters that allow the user to customize the use of the function. Instead of adding 1 + 1 over and over again, your function takes in a number x that the user defines and adds 6. Consider a function with two formal arguments.

add_x_to_y <- function(x, y) {
  x + y
}

add_x_to_y(1, 2)
## [1] 3
add_x_to_y(4, 3)
## [1] 7

11.9.2 Anonymous functions with map_* functions

We can create functions that perform operations “on the fly,” without bothering to give them a name. These nameless functions are called anonymous functions.

You can use anonymous functions in conjunction with the map_* family of functions. This is probably the most common use of anonymous functions, at least in this Primer.

You can call an anonymous function using a ~ operator and then using a . to represent the current element. Consider these three approaches:

tibble(old = c(4, 16, 9)) %>% 
  mutate(new_1 = old + 6) %>% 
  mutate(new_2 = map_dbl(old, ~ add_six_to_something(.))) %>% 
  mutate(new_3 = map_dbl(old, ~ (. + 6)))
## # A tibble: 3 x 4
##     old new_1 new_2 new_3
##   <dbl> <dbl> <dbl> <dbl>
## 1     4    10    10    10
## 2    16    22    22    22
## 3     9    15    15    15

All three produce the same answer, as we would expect. Just using mutate() is best, as long as it accomplishes your goal. In complex situations, especially those involving simulation, it often won’t. Example:

tibble(ID = 1:3) %>% 
  mutate(x = rnorm(1))
## # A tibble: 3 x 2
##      ID      x
##   <int>  <dbl>
## 1     1 -0.282
## 2     2 -0.282
## 3     3 -0.282

Calling rnorm(), or any function with a random component, does not have the effect which you probably want if you do it in the context of a simple mutate(). Instead, R runs rnorm(1) once, and then copies the value generated to the remaining two rows of the tibble. To get a different value in each row, you need to explicitly tell R to do that by using a map_* function:

tibble(ID = 1:3) %>% 
  mutate(x = rnorm(1)) %>% 
  mutate(y = map_dbl(ID, ~ rnorm(1)))
## # A tibble: 3 x 3
##      ID      x      y
##   <int>  <dbl>  <dbl>
## 1     1 -0.282 -1.31 
## 2     2 -0.282  0.795
## 3     3 -0.282  0.270

Note that the parentheses in the anonymous function are not necessary. As long as everything after the ~ works as R code, the anonymous function should work, each time replacing the . with the value in the relevant row from the .x variable — which is old in this case.

tibble(old = c(4, 16, 9)) %>% 
  mutate(new = map_dbl(old, ~ . + 1))
## # A tibble: 3 x 2
##     old   new
##   <dbl> <dbl>
## 1     4     5
## 2    16    17
## 3     9    10

11.9.3 Skateboard >> perfectly formed rear-view mirror

This image — widely attributed to the Spotify development team — conveys an important point.

Build that skateboard before you build the car or some fancy car part. A limited-but-functioning thing is very useful. It also keeps spirits high.

This is related to the Telescope Rule:

It is faster to make a four-inch mirror and then a six-inch mirror than it is to make a six-inch mirror.

11.10 no_NA_sampler()

Assume that we want to sample 10 observations for height from the nhanes tibble from the primer.data package. That is easy to do with the built in function sample().

sample(nhanes$height, size = 10)
##  [1] 159 119 170 171 181 136  99 169 153 175

One problem with this approach is that it will sample missing values of height. We can avoid that by manipulating the vector inside of the call to sample().

sample(nhanes$height[! is.na(nhanes$height)], size = 10)
##  [1] 163 141 158 160 169 183 117 158  86 176

That works, but, first, it is ugly code. And, second, it is hard to extend when we have more constraints. For example, assume we only want to sample from individuals who have no missing values for any variables, not just height. To do that, we really ought to make a custom function. Call that function no_NA_sampler().

The first step in function creation is to write code in a normal pipe which does what you want the function to do. In this case, that code would look like:

nhanes %>% 
  drop_na() %>%
  slice_sample(n = 10) %>% 
  pull(height)
##  [1] 157 166 171 160 156 152 166 150 175 160

We start with nhanes, use drop_na() to remove rows with missing values for any variable, sample 10 rows at random and then pull out height. To turn this into a function, we just need to copy/paste this pipe within the body of our function definition:

no_NA_sampler <- function(){
  nhanes %>% 
    drop_na() %>%
    slice_sample(n = 10) %>% 
    pull(height)
}

no_NA_sampler()
##  [1] 162 165 159 156 159 160 157 171 164 156

Voila! A function just executes the code within its body. The first step in building a function is not to write the function. It is to write the code which you want the function to execute.

The first version, however, “hard codes” a lot of options which we might want to change. What if we want to sample 5 values of height or 500? In that case, we could hard code a new number in place of “10.” A better option would be to add an argument so that we can pass in whatever value we want.

no_NA_sampler <- function(n){
  nhanes %>% 
    drop_na() %>%
    slice_sample(n = n) %>% 
    pull(height)
}

no_NA_sampler(n = 2)
## [1] 164 170
no_NA_sampler(n = 25)
##  [1] 156 172 166 168 166 164 164 161 159 156 163 164 168 170 155 177 173 166 169
## [20] 173 174 157 159 157 159

What if we want to sample from a different variable than height or from a different tibble than nhanes? Again, the trick is to turn hard coded values into arguments. The argument tbl is a placeholder for a data set, n for the number of samples you want extracted from your data set, and var for the variable in the samples that we are studying.

no_NA_sampler <- function(tbl, var, n){
  tbl %>% 
    drop_na() %>%
    slice_sample(n = n) %>% 
    pull({{var}})
}

no_NA_sampler(tbl = nhanes, var = height, n = 2)
## [1] 164 163

R does not know how to interpret something like age when it is passed in an argument. The double curly braces around var tell R, in essence, that var is a variable in the tibble created from sampling from our input tibble tbl. We can use the order of the arguments, without naming them, with no_NA_sampler(), just as with any other R function:

no_NA_sampler(trains, age, 5)
## [1] 45 42 41 31 44

Now that we have the function doing what we want, we should add some comments and some error checking.

no_NA_sampler <- function(tbl, var, n){
  
  # Function for grabbing `n` samples from a variable `var` which lives in a
  # tibble `tbl`. 
  
  # I could not figure out how to check to see if `var` actually lives in the
  # tibble in my error checking. Also, I don't like that I need to use
  # is_double() as the check on `n` even though I want `n` to be an integer.
  
  stopifnot(is_tibble(tbl))
  stopifnot(is_double(n))

  tbl %>% 
    drop_na() %>%
    
    # What happens if n is "too large"? That is, I need to think harder about a)
    # whether or not I am sampling with or without replacement and b) which I
    # should be doing.
    
    slice_sample(n = n) %>% 
    pull({{var}})
}

Do the comments in the above code seem weird? Perhaps. But they are good comments! First, there about as many lines of comments as there are lines of code. That is a good rule of thumb. Second, the comments do not simple report what the code is doing. That is redundant! The code itself tells us what the code is doing. The comments, instead, are a discussion of issues related to the code, to things we don’t understand, to topics which we should revisit. They are like a diary. Good programmers keep good diaries.

11.11 Prediction Game

Let’s play a prediction game. Consider the kenya tibble from primer.data.

kenya
## # A tibble: 1,672 x 9
##    block poll_station treatment poverty distance pop_density mean_age reg_byrv13
##    <chr> <chr>        <fct>       <dbl>    <dbl>       <dbl>    <dbl>      <dbl>
##  1 KWAL… 007/001      control     0.247     22.0    0.00296      39.6    0.00358
##  2 KWAL… 007/004      local + …   0.329     25.1    0.000888     43.8    0.0742 
##  3 KWAL… 007/009      local       0.263     27.8    0.00184      34.7    0.00691
##  4 KWAL… 007/011      local + …   0.429     27.2    0.000270     44.6    0.26   
##  5 KWAL… 007/017      local + …   0.341     19.3    0.000544     39.0    0.0228 
##  6 KWAL… 007/018      local       0.204     24.0    0.00798      37.1    0.00243
##  7 KWAL… 007/019      SMS         0.272     25.1    0.00167      39.2    0.00487
##  8 KWAL… 007/020      control     0.316     23.8    0.000538     36.3    0      
##  9 KWAL… 007/022      canvass     0.396     20.5    0.000216     42.9    0.00575
## 10 KWAL… 007/023      local + …   0.398     14.5    0.000148     39.8    0.0360 
## # … with 1,662 more rows, and 1 more variable: rv13 <dbl>

The game is that we will pick a random value of rv13, which is the number of people who live in the vicinity of a polling station. You guess a number. I guess a number. The winner of the Prediction Game is the person whose guess is closest to the random value selected. Example:

your_guess <- 500
my_guess <- 600

sampled_value <- no_NA_sampler(kenya, rv13, n = 1) 

your_error <- abs(your_guess - sampled_value)
my_error <- abs(my_guess - sampled_value)

if(your_error < my_error) cat("You win!")
## You win!
if(your_error > my_error) cat("I win!")

Run this code in your R Console to try it out. It works! It is also sloppy and disorganized. The first step in writing good code is to write bad code.

We don’t want to play the Prediction Game just once. We want to play it thousands of times. Copy/pasting this code a thousand times would be stupid. Instead, we need a function. Just place the working code within a function definition, and Voila!

prediction_game <- function(){
  your_guess <- 500
  my_guess <- 600
  
  sampled_value <- no_NA_sampler(kenya, rv13, n = 1) 
  
  your_error <- abs(your_guess - sampled_value)
  my_error <- abs(my_guess - sampled_value)
  
  if(your_error < my_error) cat("You win!")
  if(your_error > my_error) cat("I win!")
}

Other than the function definition itself, there are no changes. Yet, by creating a function, we can now easily run this many times.

replicate(3, prediction_game())
## You win!You win!You win!
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL

The problem with this version is that we want prediction_game() to return a message about the winner. Right now, it returns nothing. It just prints the winner. Let’s change that, and also allow for guesses to be passed in as an argument, along with the tibble and variable. We can leave n hard coded as 1 since, by definition, the Prediction Game is an attempt to guess one number, at least for now. To do this, we need to use the return() function which, when executed, causes the function to finish and return whatever value is within the paratheses.

prediction_game <- function(guesses, tbl, var){
  
  # Check to make sure that guesses is a vector of doubles of length 2.
  
  stopifnot(all(is_double(guesses)))
  stopifnot(length(guesses) == 2)
  
  # This tells the function that the "guess" inputted first in the 
  # guesses is "your" guess, whereas the second input is "my" guess.
  
  your_guess <- guesses[1]
  my_guess <- guesses[2]
  
  # Use the function no_NA_sampler to draw a sample from a data set
  # of our choosing, with a {{var}} and n.
  
  sampled_value <- no_NA_sampler(tbl, {{var}}, n = 1) 
  
  # Subtract the sampled value obtained from no_NA_sampler from 
  # both of our guesses. 
  
  your_error <- abs(your_guess - sampled_value)
  my_error <- abs(my_guess - sampled_value)
  
  # If the difference between your guess and the sampled value is 
  # less than the difference between my guess and the sampled value
  # (meaning that your guess was closer to the truth), the function
  # returns the message "Guess, your_guess, wins!".
  
  if(your_error < my_error){ 
    return(paste("Guess", your_guess, "wins!"))
  }
  
  # If your error exceeds my error (meaning that your guess was
  # further than the truth than mine), the function prints the 
  # message "Guess, my_guess, wins!" 
  
  if(your_error > my_error){ 
    return(paste("Guess", my_guess, "wins!"))
  }
  
  # If we guess the same number, and our error rates are therefore
  # identical, we return the message "A tie!". 
  
  if(your_error == my_error){ 
    return("A tie!")
  }

}
replicate(5, prediction_game(guesses = c(500, 600), kenya, rv13))
## [1] "Guess 500 wins!" "Guess 500 wins!" "Guess 600 wins!" "Guess 500 wins!"
## [5] "Guess 500 wins!"

In general, we will want to store the results in a tibble, which makes later analysis and plotting easier.

tibble(ID = 1:3) %>% 
  mutate(result = map_chr(ID, ~ 
                            prediction_game(guesses = c(500, 600),
                                            kenya, 
                                            rv13)))
## # A tibble: 3 x 2
##      ID result         
##   <int> <chr>          
## 1     1 Guess 500 wins!
## 2     2 Guess 500 wins!
## 3     3 Guess 500 wins!

Who wins the game the most if we play 1,000 times?

tibble(ID = 1:1000) %>% 
  mutate(result = map_chr(ID, ~ 
                            prediction_game(guesses = c(500, 600),
                                            kenya, 
                                            rv13))) %>% 
  ggplot(aes(result)) +
    geom_bar()

It is hardly surprising that 500 wins more often than 600 since the mean of rv13 is 539.23. The mean seems like a pretty good guess! But it is not the best guess.

To test whether the mean or the median is a better guess, we will use our created prediction_game function with the guesses of 442 (the median) and 539 (the mean) and plot the results.

tibble(ID = 1:1000) %>% 
  mutate(result = map_chr(ID, 
                          ~ prediction_game(c(442, 539),
                                            kenya,
                                            rv13))) %>% 
  ggplot(aes(result)) +
    geom_bar()

The mean is not a bad prediction. But the best prediction is (surprisingly?) the median, which is 442.

11.11.0.1 Playing within a tibble

In other cases, it is more convenient to play portions of the Prediction Game within a tibble. Imagine that we are trying to guess the biggest value out of 10 random samples.

tibble(ID = 1:3, guess_1 = 800, guess_2 = 900) %>% 
  mutate(result = map(ID, ~ no_NA_sampler(kenya, rv13, 10)))
## # A tibble: 3 x 4
##      ID guess_1 guess_2 result    
##   <int>   <dbl>   <dbl> <list>    
## 1     1     800     900 <dbl [10]>
## 2     2     800     900 <dbl [10]>
## 3     3     800     900 <dbl [10]>

We can now manipulate the result column and then see which prediction did better. Using the same structure as before, we subtract our guesses from the variable we were guessing; in this case, the biggest value in 10 random samples.

tibble(ID = 1:3, guess_1 = 800, guess_2 = 900) %>% 
  mutate(result = map(ID, ~ no_NA_sampler(kenya, rv13, 10))) %>% 
  mutate(biggest = map_dbl(result, ~ max(.))) %>% 
  mutate(error_1 = abs(guess_1 - biggest)) %>% 
  mutate(error_2 = abs(guess_2 - biggest)) %>% 
  mutate(winner = case_when(error_1 < error_2 ~ "Guess one wins!",
                            error_1 > error_2 ~ "Guess two wins!",
                            TRUE ~ "A tie!"))
## # A tibble: 3 x 8
##      ID guess_1 guess_2 result     biggest error_1 error_2 winner         
##   <int>   <dbl>   <dbl> <list>       <dbl>   <dbl>   <dbl> <chr>          
## 1     1     800     900 <dbl [10]>    1382     582     482 Guess two wins!
## 2     2     800     900 <dbl [10]>    2246    1446    1346 Guess two wins!
## 3     3     800     900 <dbl [10]>    1422     622     522 Guess two wins!

Run the test 1,000 times.

tibble(ID = 1:1000, guess_1 = 800, guess_2 = 900) %>% 
  mutate(result = map(ID, ~ no_NA_sampler(kenya, rv13, 10))) %>% 
  mutate(biggest = map_dbl(result, ~ max(.))) %>% 
  mutate(error_1 = abs(guess_1 - biggest)) %>% 
  mutate(error_2 = abs(guess_2 - biggest)) %>% 
  mutate(winner = case_when(error_1 < error_2 ~ "Guess one wins!",
                            error_1 > error_2 ~ "Guess two wins!",
                            TRUE ~ "A tie!")) %>% 
  ggplot(aes(winner)) +
    geom_bar()

Empirically, we see than 900 is a much better guess than 800. Instead of calling a function to be run 1,000 times, we just performed each step within each row of a tibble with 1,000 rows. Both approaches work. The best choice depends on the context of your problem.

11.12 Summary

The first step in writing good code is to write bad code.

Tilde and dot (~ and .) are easy to forget.

The first step in building a function is not to write the function. It is to write the code which you want the function to execute.

11.12.1 Lists and list-columns

  • A list is different from an atomic vector. Atomic vectors are familiar to us: each element of the vector has one value, and thus if an atomic vector is a column in your data set, each observation gets a single value. Lists, however, can contain vectors, and other more complex objects, as elements.
  • There are various ways to create lists. The most common is to use the list() function to “wrap” some object. map() always returns a list.
  • We can take a list column and, by applying an anonymous function to it with map(), create another list column. This is similar to taking a tibble and piping it into a function, like mutate(), which returns a new tibble to work with.
  • You can also use map_* functions to take a list column as an input and return an atomic vector – a column with a single value per observation – as an output.

If a function returns multiple values as a vector, you must use list() as a wrapper if you want to create a list-column.

11.12.2 Writing functions

  • Optimize usefulness by adding more formal arguments when needed. A function that only gives an option for n may not be as helpful as a function that allows us to enter options for a data set, variable, and n value.
  • Give your arguments sensible names.
  • By default, a function returns the result of the last line of the body. Use return() to override this default.
  • When starting a function, remember that smaller steps are easier than trying to build everything in one motion. In general: start by writing the body, test the body in a basic function, and then add formal arguments.
  • Use double curly braces around vars, since R does not know how to interpret variable names when they are passed in an argument. The double curly braces tell R that var is a variable in a tibble.

11.12.3 Distributions

  • The word “distribution” can mean two things. First, it is an object — a mathematical formula, an imaginary urn — from which you can draw values. Second, it is a list of such values.
  • The two most important aspects of a distribution are its center and its variability.
  • The median is often a more stable measure of the center than the mean. The mad (scaled median absolute deviation) is often a more stable measure of variation than the standard deviation.
  • Outliers cause a lack a stability. In a distribution without outliers, the mean/median and mad/sd are so close in value that it does not matter much which ones you use.