2 Wrangling

Data science is data cleaning.

Start by loading the packages that we will need in this chapter, using the library() calls shown after this list.

  • The tidyverse package will be used in every chapter. Loading it makes the 8 packages from the “Tidyverse” available.
  • primer.data is the data package created specifically for the primer.
  • lubridate is a package for working with dates and times.
  • skimr contains functions that are useful for providing summary statistics, especially skim().
  • nycflights13 includes data associated with flights out of New York City’s three major airports.
  • gapminder has annual data for countries going back more than 50 years, which we used in Chapter 1.
  • fivethirtyeight cleans up data from the FiveThirtyEight team.
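
Running these library() calls once per session makes everything used in this chapter available:

library(tidyverse)
library(primer.data)
library(lubridate)
library(skimr)
library(nycflights13)
library(gapminder)
library(fivethirtyeight)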

2.1 Tibbles

Tibbles are a kind of data frame, useful for storing data in which we have the same number of observations for each variable. We can use the tibble() function to create tibbles. Tibbles are composed of columns, each of which is a variable, and rows, each of which is a “unit” or an “observation.” Furthermore, each column (i.e., each variable) can be of a different type: character, integer, factor, double, date and so on.

2.1.1 tibble()

tibble(a = 2.1, b = "Hello", c = TRUE, d = 9L)
## # A tibble: 1 x 4
##       a b     c         d
##   <dbl> <chr> <lgl> <int>
## 1   2.1 Hello TRUE      9

In our code, we specify the column names as a, b, c, and d, giving one value for each variable. A tibble can consist of one or more atomic vectors, the most important types of which are double, character, logical, and integer. The tibble above includes a variable of each type. The “L” in “9L” tells R that we want d to be an integer rather than the default, which would be a double. When you print out a tibble, the variable type is shown below the variable name, as indicated by <dbl>, <chr>, <lgl>, and <int>.

Variable names should not begin with a number (like 54abc) or include spaces (like my var). If you insist on using such variable names, you must include backticks around each name whenever you reference it.

tibble(`54abc` = 1, `my var` = 2, c = 3)
## # A tibble: 1 x 3
##   `54abc` `my var`     c
##     <dbl>    <dbl> <dbl>
## 1       1        2     3

If we did not include the backticks, R would give us an error.

tibble(54abc = 1, my var = 2, c = 3)
## Error: <text>:1:10: unexpected symbol
## 1: tibble(54abc
##              ^

In the real world, you may come across datasets with dirty column names such as 54abc or my var.

It is sometimes easier to use the function tribble() to create tibbles.

tribble(
  ~ var1, ~ `var 2`, ~ myvar,
  1,           3,      5,
  4,           6,      8,
)
## # A tibble: 2 x 3
##    var1 `var 2` myvar
##   <dbl>   <dbl> <dbl>
## 1     1       3     5
## 2     4       6     8

The tildes — as in ~ var1 — specify which row has the column names. The formatting makes it easier, relative to specifying raw vectors, to see which values are from the same observation.

2.2 Lists

FIGURE 2.1: Subsetting a list, visually.

Earlier, we briefly introduced lists. Lists are a type of vector that is a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with the function list():

x <- list(1, 2, 3)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3

A very useful tool for working with lists is str() because it focuses on displaying the structure, not the contents.

str(x)
## List of 3
##  $ : num 1
##  $ : num 2
##  $ : num 3
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
## List of 3
##  $ a: num 1
##  $ b: num 2
##  $ c: num 3

Unlike atomic vectors, lists can contain objects of several types.

y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
##  $ : chr "a"
##  $ : int 1
##  $ : num 1.5
##  $ : logi TRUE

Lists can even contain other lists!

z <- list(list(1, 2), list(3, 4))
str(z)
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4

2.2.1 Visualizing lists

To explain more complicated list manipulation functions, it’s helpful to have a visual representation of lists. For example, take these three lists:

x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

They are structured as follows:

There are three principles:

  1. Lists have rounded corners. Atomic vectors have square corners.

  2. Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.

  3. The orientation of the children (i.e., rows or columns) isn’t important, so we’ll pick a row or column orientation to either save space or illustrate an important property in the example.

2.2.2 Subsetting

There are three ways to subset a list, which we’ll illustrate with a list named a:

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

[ ] extracts a sub-list. The result will always be a list.

str(a[1:2])
## List of 2
##  $ a: int [1:3] 1 2 3
##  $ b: chr "a string"
str(a[4])
## List of 1
##  $ d:List of 2
##   ..$ : num -1
##   ..$ : num -5

As with vectors, you can subset with a logical, integer, or character vector.
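
For example, both of these calls keep the first and third elements of a, one by name and one by position (via a logical vector):

str(a[c("a", "c")])
## List of 2
##  $ a: int [1:3] 1 2 3
##  $ c: num 3.14
str(a[c(TRUE, FALSE, TRUE, FALSE)])
## List of 2
##  $ a: int [1:3] 1 2 3
##  $ c: num 3.14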

[[ ]] extracts a single component from a list. It removes a level of hierarchy from the list.

str(a[[1]])
##  int [1:3] 1 2 3
str(a[[4]])
## List of 2
##  $ : num -1
##  $ : num -5

$ is a shorthand for extracting named elements of a list. It works similarly to [[ ]] except that you don’t need to use quotes.

a$a
## [1] 1 2 3
a[["a"]]
## [1] 1 2 3

The distinction between [ ] and [[ ]] is really important for lists, because [[ ]] drills down into the list while [ ] returns a new, smaller list. Compare the code and output above with the visual representation.

2.3 Characters

FIGURE 2.2: Real data is nasty.

So far, our tibbles have been clean and wholesome, like gapminder and trains. Real data is nasty. You will bring data into R from the outside world and discover there are problems. We will now discuss common remedial tasks for cleaning and transforming character data, also known as strings. A string is one or more characters that are enclosed inside a pair of matching ‘single’ or “double quotes.”

We will use the fruit data, a vector with the names of different fruits, from the stringr package, which is automatically loaded when we issue library(tidyverse). Although we can manipulate character vectors directly, it is much more common, in real world situations, to work with vectors which are in a tibble.

tbl_fruit <- tibble(fruit = fruit)

2.3.1 Character vectors

tbl_fruit %>% 
  slice_sample(n = 8)
## # A tibble: 8 x 1
##   fruit     
##   <chr>     
## 1 durian    
## 2 feijoa    
## 3 cantaloupe
## 4 currant   
## 5 lemon     
## 6 jambul    
## 7 rock melon
## 8 damson

Note that slice_sample() selects a random set of rows from the tibble. The argument n, as shown above, returns that many rows. Use the argument prop instead to return that proportion of all the rows in the tibble.
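
As a quick sketch of prop (we omit the output because the rows are chosen at random):

tbl_fruit %>% 
  slice_sample(prop = 0.1)  # returns 10% of the 80 rows, i.e., 8 random fruits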

str_detect() determines if a character vector matches a pattern. It returns a logical vector that is the same length as the input. Recall logicals are either TRUE or FALSE.

Which fruit names actually include the letter “c?”

tbl_fruit %>% 
  mutate(fruit_in_name = str_detect(fruit, pattern = "c")) 
## # A tibble: 80 x 2
##    fruit        fruit_in_name
##    <chr>        <lgl>        
##  1 apple        FALSE        
##  2 apricot      TRUE         
##  3 avocado      TRUE         
##  4 banana       FALSE        
##  5 bell pepper  FALSE        
##  6 bilberry     FALSE        
##  7 blackberry   TRUE         
##  8 blackcurrant TRUE         
##  9 blood orange FALSE        
## 10 blueberry    FALSE        
## # … with 70 more rows

str_length() counts characters in your strings. Note this is different from the length() of the character vector itself.

tbl_fruit %>% 
  mutate(name_length = str_length(fruit)) 
## # A tibble: 80 x 2
##    fruit        name_length
##    <chr>              <int>
##  1 apple                  5
##  2 apricot                7
##  3 avocado                7
##  4 banana                 6
##  5 bell pepper           11
##  6 bilberry               8
##  7 blackberry            10
##  8 blackcurrant          12
##  9 blood orange          12
## 10 blueberry              9
## # … with 70 more rows

str_sub() returns a portion of a string. This demonstration will return the first three letters of each fruit with the arguments fruit, 1, and 3.

tbl_fruit %>% 
  mutate(first_three_letters = str_sub(fruit, 1, 3)) 
## # A tibble: 80 x 2
##    fruit        first_three_letters
##    <chr>        <chr>              
##  1 apple        app                
##  2 apricot      apr                
##  3 avocado      avo                
##  4 banana       ban                
##  5 bell pepper  bel                
##  6 bilberry     bil                
##  7 blackberry   bla                
##  8 blackcurrant bla                
##  9 blood orange blo                
## 10 blueberry    blu                
## # … with 70 more rows

str_c() combines two or more character vectors, element by element. This is similar to the normal c() function for creating a vector.

tbl_fruit %>% 
  mutate(name_with_s = str_c(fruit, "s")) 
## # A tibble: 80 x 2
##    fruit        name_with_s  
##    <chr>        <chr>        
##  1 apple        apples       
##  2 apricot      apricots     
##  3 avocado      avocados     
##  4 banana       bananas      
##  5 bell pepper  bell peppers 
##  6 bilberry     bilberrys    
##  7 blackberry   blackberrys  
##  8 blackcurrant blackcurrants
##  9 blood orange blood oranges
## 10 blueberry    blueberrys   
## # … with 70 more rows

str_replace() replaces a pattern within a string. Note that it replaces only the first match in each string; use str_replace_all() to replace every match.

tbl_fruit %>% 
  mutate(capital_A = str_replace(fruit, 
                                 pattern = "a", 
                                 replacement = "A")) 
## # A tibble: 80 x 2
##    fruit        capital_A   
##    <chr>        <chr>       
##  1 apple        Apple       
##  2 apricot      Apricot     
##  3 avocado      Avocado     
##  4 banana       bAnana      
##  5 bell pepper  bell pepper 
##  6 bilberry     bilberry    
##  7 blackberry   blAckberry  
##  8 blackcurrant blAckcurrant
##  9 blood orange blood orAnge
## 10 blueberry    blueberry   
## # … with 70 more rows

2.3.2 Regular expressions with stringr

Sometimes, your string tasks cannot be expressed in terms of a fixed string, but can be described in terms of a pattern. Regular expressions, also known as “regexes,” are the standard way to specify these patterns. In regexes, specific characters and constructs take on special meaning in order to match multiple strings.

To explore regular expressions, we will use the str_detect() function, which reports TRUE for any string which matches the pattern, and then filter() to see all the matches. For example, here are all the fruits which include a “w” in their name.

tbl_fruit %>%
  filter(str_detect(fruit, pattern = "w"))
## # A tibble: 4 x 1
##   fruit     
##   <chr>     
## 1 honeydew  
## 2 kiwi fruit
## 3 strawberry
## 4 watermelon

In the code below, the first metacharacter is the period . , which stands for any single character, except a newline (which, by the way, is represented by \n). The regex b.r will match all fruits that have a “b,” followed by any single character, followed by an “r.” Regexes are case sensitive.

tbl_fruit %>%
  filter(str_detect(fruit, pattern = "b.r"))
## # A tibble: 15 x 1
##    fruit      
##    <chr>      
##  1 bilberry   
##  2 blackberry 
##  3 blueberry  
##  4 boysenberry
##  5 cloudberry 
##  6 cranberry  
##  7 cucumber   
##  8 elderberry 
##  9 goji berry 
## 10 gooseberry 
## 11 huckleberry
## 12 mulberry   
## 13 raspberry  
## 14 salal berry
## 15 strawberry

Anchors can be included to express where the pattern must occur within the string. The ^ indicates the beginning of a string and $ indicates the end.

tbl_fruit %>%
  filter(str_detect(fruit, pattern = "^w"))
## # A tibble: 1 x 1
##   fruit     
##   <chr>     
## 1 watermelon
tbl_fruit %>%
  filter(str_detect(fruit, pattern = "o$"))
## # A tibble: 5 x 1
##   fruit    
##   <chr>    
## 1 avocado  
## 2 mango    
## 3 pamelo   
## 4 pomelo   
## 5 tamarillo

2.4 Factors

Factors are categorical variables that only take on a specified set of values. To manipulate factors we will use the forcats package, a core package in the Tidyverse.

It is easy to make factors with factor(), as.factor(), or parse_factor().

tibble(X = letters[1:3]) %>% 
  mutate(fac_1 = factor(X)) %>% 
  mutate(fac_2 = as.factor(X)) %>% 
  mutate(fac_3 = parse_factor(X))
## # A tibble: 3 x 4
##   X     fac_1 fac_2 fac_3
##   <chr> <fct> <fct> <fct>
## 1 a     a     a     a    
## 2 b     b     b     b    
## 3 c     c     c     c

Which of those three options is best depends on the situation. factor() is useful when you are creating a factor from nothing. as.factor() is best for simple transformations, especially of character variables, as in this example. parse_factor() is the most modern and powerful of the three.
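
As a quick illustration of that power: when given explicit levels, factor() silently converts unexpected values to NA, while parse_factor() does the same but also raises a warning.

# "d" is not among the levels, so factor() turns it into NA, silently ...
factor(c("a", "b", "d"), levels = c("a", "b", "c"))

# ... whereas parse_factor() warns us about the parsing failure.
parse_factor(c("a", "b", "d"), levels = c("a", "b", "c"))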

Let’s use gapminder$continent as an example. Note that str() is a useful function for getting detailed information about an object.

str(gapminder$continent)
##  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
nlevels(gapminder$continent)
## [1] 5
class(gapminder$continent)
## [1] "factor"

To get a frequency table as a tibble, from a tibble, use count(). To get a similar result from a free-range factor, use fct_count().

gapminder %>% 
  count(continent)
## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24
fct_count(gapminder$continent)
## # A tibble: 5 x 2
##   f            n
##   <fct>    <int>
## 1 Africa     624
## 2 Americas   300
## 3 Asia       396
## 4 Europe     360
## 5 Oceania     24

2.4.1 Dropping unused levels

Removing all the rows corresponding to a specific factor level does not remove the level itself. These unused levels can come back to haunt you later, e.g., in figure legends.

Watch what happens to the levels of country when we filter gapminder to a handful of countries.

nlevels(gapminder$country)
## [1] 142
h_gap <- gapminder %>%
  filter(country %in% c("Egypt", "Haiti", 
                        "Romania", "Thailand", 
                        "Venezuela"))
nlevels(h_gap$country)
## [1] 142

Even though h_gap only has data for a handful of countries, we are still schlepping around all levels from the original gapminder tibble.

How can you get rid of them? The base function droplevels() operates on all the factors in a data frame or on a single factor. The function fct_drop() operates on a single factor variable.

h_gap_dropped <- h_gap %>% 
  droplevels()
nlevels(h_gap_dropped$country)
## [1] 5
# Use fct_drop() on a free-range factor

h_gap$country %>%
  fct_drop() %>%
  levels()
## [1] "Egypt"     "Haiti"     "Romania"   "Thailand"  "Venezuela"

2.4.2 Change the order of the levels

By default, factor levels are ordered alphabetically.

gapminder$continent %>%
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

We can also order factors by:

  1. Frequency: Make the most common level the first and so on.
  2. Another variable: Order factor levels according to a summary statistic for another variable.

Let’s order continent by frequency using fct_infreq().

gapminder$continent %>% 
  fct_infreq() %>%
  levels()
## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"

We can also reverse the order of the levels using fct_rev().

gapminder$continent %>% 
  fct_infreq() %>%
  fct_rev() %>% 
  levels()
## [1] "Oceania"  "Americas" "Europe"   "Asia"     "Africa"

These two bar charts of frequency by continent differ only in the order of the continents. Which do you prefer? We show the code for just the second one.

gapminder %>% 
  mutate(continent = fct_infreq(continent)) %>% 
  mutate(continent = fct_rev(continent)) %>% 
  ggplot(aes(x = continent)) +
    geom_bar() +
    coord_flip()

Let’s now order country by another variable, forwards and backwards. This other variable is usually quantitative and you will order the factor according to a grouped summary. The factor is the grouping variable and the default summarizing function is median() but you can specify something else.

# Order countries by median life expectancy

fct_reorder(gapminder$country, 
            gapminder$lifeExp) %>% 
  levels() %>% 
  head()
## [1] "Sierra Leone"  "Guinea-Bissau" "Afghanistan"   "Angola"       
## [5] "Somalia"       "Guinea"
# Order according to minimum life exp instead of median

fct_reorder(gapminder$country, 
            gapminder$lifeExp, min) %>% 
  levels() %>% 
  head()
## [1] "Rwanda"       "Afghanistan"  "Gambia"       "Angola"       "Sierra Leone"
## [6] "Cambodia"
# Backwards!

fct_reorder(gapminder$country, 
            gapminder$lifeExp, 
            .desc = TRUE) %>% 
  levels() %>% 
  head()
## [1] "Iceland"     "Japan"       "Sweden"      "Switzerland" "Netherlands"
## [6] "Norway"

Why do we reorder factor levels? It often makes plots much better! When plotting a factor against a numeric variable, it is generally a good idea to order the factors by some function of the numeric variable. Alphabetic ordering is rarely best.

Compare the interpretability of these two plots of life expectancy in the Americas in 2007. The only difference is the order of the country factor. Which is better?

gapminder %>% 
  filter(year == 2007, 
         continent == "Americas") %>% 
  ggplot(aes(x = lifeExp, y = country)) + 
    geom_point()
gapminder %>% 
  filter(year == 2007, 
         continent == "Americas") %>% 
  ggplot(aes(x = lifeExp, 
             y = fct_reorder(country, lifeExp))) + 
    geom_point()

2.4.3 Recode the levels

Use fct_recode() to change the names of the levels for a factor.

i_gap <- gapminder %>% 
  filter(country %in% c("United States", "Sweden", 
                        "Australia")) %>% 
  droplevels()

i_gap$country %>% 
  levels()
## [1] "Australia"     "Sweden"        "United States"
i_gap$country %>%
  fct_recode("USA" = "United States", "Oz" = "Australia") %>% 
  levels()
## [1] "Oz"     "Sweden" "USA"

2.5 Date-Times

We will manipulate date-times using the lubridate package because lubridate makes it easier to work with dates and times in R. lubridate is not part of the core tidyverse because you only need it when you’re working with dates/times.

There are three types of date/time data that refer to an instant in time:

  • A date. Tibbles print this as <date>.

  • A time within a day. Tibbles print this as <time>.

  • A date-time is a date plus a time. It uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>.

You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.

To get the current date or date-time you can use today() or now():

## [1] "2021-09-20"
now()
## [1] "2021-09-20 11:24:42 EDT"

Otherwise, there are three ways you’re likely to create a date/time:

  1. From a string.
  2. From individual date-time components.
  3. From an existing date/time object.

They work as follows:

2.5.1 From strings

Date/time data often comes as strings. The lubridate functions automatically work out the format once you specify the order of the components. First, figure out the order in which year, month, and day appear in your dates, then arrange “y,” “m,” and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:

ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"

These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data. ymd() is short and unambiguous:

ymd(20170131)
## [1] "2017-01-31"

ymd() and friends create dates. To create a date-time, add an underscore and one or more of “h,” “m,” and “s” to the name of the parsing function:

ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"

You can also force the creation of a date-time from a date by supplying a timezone:

ymd(20170131, tz = "UTC")
## [1] "2017-01-31 UTC"

2.5.2 From individual components

Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:

flights %>% 
  select(year, month, day, hour, minute)
## # A tibble: 336,776 x 5
##     year month   day  hour minute
##    <int> <int> <int> <dbl>  <dbl>
##  1  2013     1     1     5     15
##  2  2013     1     1     5     29
##  3  2013     1     1     5     40
##  4  2013     1     1     5     45
##  5  2013     1     1     6      0
##  6  2013     1     1     5     58
##  7  2013     1     1     6      0
##  8  2013     1     1     6      0
##  9  2013     1     1     6      0
## 10  2013     1     1     6      0
## # … with 336,766 more rows

To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 x 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1     5     15 2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01 06:00:00
## # … with 336,766 more rows

2.5.3 From other types

You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():

## [1] "2021-09-20 UTC"
## [1] "2021-09-20"

Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch,” 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().

as_datetime(60 * 60 * 10)
## [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
## [1] "1980-01-01"

2.5.4 Date-time components

Now that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.

You can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second().

datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
## [1] 2016
month(datetime)
## [1] 7
mday(datetime)
## [1] 8
yday(datetime)
## [1] 190
wday(datetime)
## [1] 6

For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week. Set both label = TRUE and abbr = FALSE to return the full name.

month(datetime, label = TRUE)
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

2.5.5 Setting components

You can create a new date-time with update().

update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
## [1] "2020-02-02 02:34:56 UTC"

If values are too big, they will roll over:

ymd("2015-02-01") %>% 
  update(mday = 30)
## [1] "2015-03-02"
ymd("2015-02-01") %>% 
  update(hour = 400)
## [1] "2015-02-17 16:00:00 UTC"

2.5.6 Time zones

Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately, we don’t need to dig into all the details, as most of them are not important for data analysis. You can see the complete list of possible time zones with the function OlsonNames(). Unless otherwise specified, lubridate always uses UTC (Coordinated Universal Time).
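
For example (we omit the output because the exact entries depend on your system):

length(OlsonNames())  # roughly 600 time zone names
head(OlsonNames())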

In R, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:

(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
## [1] "2015-06-01 12:00:00 EDT"
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
## [1] "2015-06-01 18:00:00 CEST"
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
## [1] "2015-06-02 04:00:00 NZST"

2.6 Combining Data

There are many ways to bring data together.

FIGURE 2.3: Combining data is often tricky.

The bind_rows() function is used to combine all the rows from two or more tibbles.

data_1 <- tibble(x = 1:2,                
                 y = c("A", "B")) 

data_2 <- tibble(x = 3:4,
                 y = c("C", "D")) 


bind_rows(data_1, data_2)
## # A tibble: 4 x 2
##       x y    
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     3 C    
## 4     4 D

2.6.1 Joins

Consider two tibbles: superheroes and publishers.

superheroes <- tibble::tribble(
       ~name,   ~gender,     ~publisher,
   "Magneto",   "male",       "Marvel",
     "Storm",   "female",     "Marvel",
    "Batman",   "male",       "DC",
  "Catwoman",   "female",     "DC",
   "Hellboy",   "male",       "Dark Horse Comics"
  )

publishers <- tibble::tribble(
  ~publisher, ~yr_founded,
        "DC",       1934L,
    "Marvel",       1939L,
     "Image",       1992L
  )

Note how easy it is to use tribble() from the tibble package to create a tibble on the fly, using text which is organized for easy entry and reading. Recall that a double colon, ::, is how we indicate that a function comes from a specific package.

2.6.1.1 inner_join()

inner_join(x, y): Returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

FIGURE 2.4: Inner join.

inner_join(superheroes, publishers)
## Joining, by = "publisher"
## # A tibble: 4 x 4
##   name     gender publisher yr_founded
##   <chr>    <chr>  <chr>          <int>
## 1 Magneto  male   Marvel          1939
## 2 Storm    female Marvel          1939
## 3 Batman   male   DC              1934
## 4 Catwoman female DC              1934

We lose Hellboy in the join because, although he appears in x = superheroes, his publisher Dark Horse Comics does not appear in y = publishers. The join result has all variables from x = superheroes plus yr_founded, from y.

Note the message that we are ‘Joining, by = “publisher”.’ Whenever joining, R checks to see if there are variables in common between the two tibbles and, if there are, uses them to join. However, it is concerned that you may not be aware that this is what it is doing, so R tells you. Such messages are both annoying and a signal that we have not made our code as robust as we should. Fortunately, we can specify precisely which variables we want to join by. Always do this.

inner_join(superheroes, publishers, by = "publisher")

by also takes a vector of key variables if you want to merge by multiple variables.
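
As a sketch with two made-up tibbles, scores_a and scores_b, which share the key variables id and year:

scores_a <- tibble(id = c(1, 1, 2), 
                   year = c(2019, 2020, 2019), 
                   x = c(10, 11, 12))
scores_b <- tibble(id = c(1, 2), 
                   year = c(2020, 2019), 
                   y = c("u", "v"))

inner_join(scores_a, scores_b, by = c("id", "year"))
## # A tibble: 2 x 4
##      id  year     x y    
##   <dbl> <dbl> <dbl> <chr>
## 1     1  2020    11 u    
## 2     2  2019    12 v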

Now compare this result to that of using inner_join() with the two datasets in opposite positions.

inner_join(publishers, superheroes, by = "publisher")
## # A tibble: 4 x 4
##   publisher yr_founded name     gender
##   <chr>          <int> <chr>    <chr> 
## 1 DC              1934 Batman   male  
## 2 DC              1934 Catwoman female
## 3 Marvel          1939 Magneto  male  
## 4 Marvel          1939 Storm    female

In a way, this does illustrate multiple matches, if you think about it from the x = publishers direction. Every publisher that has a match in y = superheroes appears multiple times in the result, once for each match. In fact, we’re getting the same result as with inner_join(superheroes, publishers), up to variable order (which you should also never rely on in an analysis).

2.6.1.2 full_join()

full_join(x, y): Returns all rows and all columns from both x and y. Where there are no matching values, NA is returned for the missing values.

full_join(superheroes, publishers, by = "publisher")
## # A tibble: 6 x 4
##   name     gender publisher         yr_founded
##   <chr>    <chr>  <chr>                  <int>
## 1 Magneto  male   Marvel                  1939
## 2 Storm    female Marvel                  1939
## 3 Batman   male   DC                      1934
## 4 Catwoman female DC                      1934
## 5 Hellboy  male   Dark Horse Comics         NA
## 6 <NA>     <NA>   Image                   1992

We get all rows of x = superheroes plus a new row from y = publishers, containing the publisher Image. We get all variables from x = superheroes AND all variables from y = publishers. Any row that derives solely from one table or the other carries NAs in the variables found only in the other table.

Because full_join() returns all rows and all columns from both x and y, the result of full_join(x = superheroes, y = publishers) should match that of full_join(x = publishers, y = superheroes), up to row and variable order.

2.6.1.3 left_join()

left_join(x, y): Returns all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

left_join(superheroes, publishers, by = "publisher")
## # A tibble: 5 x 4
##   name     gender publisher         yr_founded
##   <chr>    <chr>  <chr>                  <int>
## 1 Magneto  male   Marvel                  1939
## 2 Storm    female Marvel                  1939
## 3 Batman   male   DC                      1934
## 4 Catwoman female DC                      1934
## 5 Hellboy  male   Dark Horse Comics         NA

We basically get x = superheroes back, but with an additional variable yr_founded, which is unique to y = publishers. Hellboy, whose publisher does not appear in y = publishers, has an NA for yr_founded.

Now compare this result to that of running left_join(x = publishers, y = superheroes). Unlike with inner_join() and full_join(), the order of the arguments has a significant effect on the resulting tibble.

left_join(publishers, superheroes, by = "publisher")
## # A tibble: 5 x 4
##   publisher yr_founded name     gender
##   <chr>          <int> <chr>    <chr> 
## 1 DC              1934 Batman   male  
## 2 DC              1934 Catwoman female
## 3 Marvel          1939 Magneto  male  
## 4 Marvel          1939 Storm    female
## 5 Image           1992 <NA>     <NA>

We get a similar result as with inner_join(), but the publisher Image survives in the join, even though no superheroes from Image appear in y = superheroes. As a result, Image has NAs for name and gender.

There is a similar function, right_join(x, y) that returns all rows from y, and all columns from x and y.
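
As a sketch, this returns the same information as left_join(publishers, superheroes) above, just with the columns from superheroes appearing first (we omit the output):

right_join(superheroes, publishers, by = "publisher")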

2.6.1.4 semi_join()

semi_join(x, y): Returns all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, whereas a semi join will never duplicate rows of x. This is a filtering join.

semi_join(superheroes, publishers, by = "publisher")
## # A tibble: 4 x 3
##   name     gender publisher
##   <chr>    <chr>  <chr>    
## 1 Magneto  male   Marvel   
## 2 Storm    female Marvel   
## 3 Batman   male   DC       
## 4 Catwoman female DC

Compare the result of switching the values of the arguments.

semi_join(x = publishers, y = superheroes, by = "publisher")
## # A tibble: 2 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 DC              1934
## 2 Marvel          1939

Now the effect of switching the x and y roles is clearer. The result resembles x = publishers, but the publisher Image is lost, since there are no observations where publisher == "Image" in y = superheroes.

2.6.1.5 anti_join()

anti_join(x, y): Returns all rows from x where there are no matching values in y, keeping just columns from x.

anti_join(superheroes, publishers, by = "publisher")
## # A tibble: 1 x 3
##   name    gender publisher        
##   <chr>   <chr>  <chr>            
## 1 Hellboy male   Dark Horse Comics

We keep only Hellboy now.

Now switch the arguments and compare the result.

anti_join(publishers, superheroes, by = "publisher")
## # A tibble: 1 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 Image           1992

We keep only publisher Image now (and the variables found in x = publishers).

2.6.2 Example

Consider the relationships among the tibbles in the nycflights13 package:

FIGURE 2.5: Relationships among nycflights tables

In both the flights and airlines data frames, the key variable we want to join/merge/match the rows by has the same name: carrier. Let’s use inner_join() to join the two data frames, where the rows will be matched by the variable carrier.

flights %>% 
  inner_join(airlines, by = "carrier")
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   name <chr>

This is our first example of using a join function with a pipe. flights is fed as the first argument to inner_join(). That is what the pipe does. The above code is equivalent to:

inner_join(flights, airlines, by = "carrier")

The airports data frame contains the airport codes for each airport:

airports
## # A tibble: 1,458 x 8
##    faa   name                       lat    lon   alt    tz dst   tzone          
##    <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>          
##  1 04G   Lansdowne Airport         41.1  -80.6  1044    -5 A     America/New_Yo…
##  2 06A   Moton Field Municipal A…  32.5  -85.7   264    -6 A     America/Chicago
##  3 06C   Schaumburg Regional       42.0  -88.1   801    -6 A     America/Chicago
##  4 06N   Randall Airport           41.4  -74.4   523    -5 A     America/New_Yo…
##  5 09J   Jekyll Island Airport     31.1  -81.4    11    -5 A     America/New_Yo…
##  6 0A9   Elizabethton Municipal …  36.4  -82.2  1593    -5 A     America/New_Yo…
##  7 0G6   Williams County Airport   41.5  -84.5   730    -5 A     America/New_Yo…
##  8 0G7   Finger Lakes Regional A…  42.9  -76.8   492    -5 A     America/New_Yo…
##  9 0P2   Shoestring Aviation Air…  39.8  -76.6  1000    -5 U     America/New_Yo…
## 10 0S9   Jefferson County Intl     48.1 -123.    108    -8 A     America/Los_An…
## # … with 1,448 more rows

However, if you look at both the airports and flights data frames, you’ll find that the airport codes are in variables that have different names. In airports the airport code is in faa, whereas in flights the airport codes are in origin and dest.

In order to join these two data frames by airport code, our inner_join() operation will use by = c("dest" = "faa") thereby allowing us to join two data frames where the key variable has a different name.

flights %>% 
  inner_join(airports, by = c("dest" = "faa"))
## # A tibble: 329,174 x 26
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      555            600        -5      913            854
##  7  2013     1     1      557            600        -3      709            723
##  8  2013     1     1      557            600        -3      838            846
##  9  2013     1     1      558            600        -2      753            745
## 10  2013     1     1      558            600        -2      849            851
## # … with 329,164 more rows, and 18 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
## #   tzone <chr>

Let’s construct the chain of pipe operators %>% that computes the number of flights from NYC to each destination, but also includes information about each destination airport:

flights %>%
  group_by(dest) %>%
  summarize(num_flights = n(),
            .groups = "drop") %>%
  arrange(desc(num_flights)) %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  rename(airport_name = name)
## # A tibble: 101 x 9
##    dest  num_flights airport_name          lat    lon   alt    tz dst   tzone   
##    <chr>       <int> <chr>               <dbl>  <dbl> <dbl> <dbl> <chr> <chr>   
##  1 ORD         17283 Chicago Ohare Intl   42.0  -87.9   668    -6 A     America…
##  2 ATL         17215 Hartsfield Jackson…  33.6  -84.4  1026    -5 A     America…
##  3 LAX         16174 Los Angeles Intl     33.9 -118.    126    -8 A     America…
##  4 BOS         15508 General Edward Law…  42.4  -71.0    19    -5 A     America…
##  5 MCO         14082 Orlando Intl         28.4  -81.3    96    -5 A     America…
##  6 CLT         14064 Charlotte Douglas …  35.2  -80.9   748    -5 A     America…
##  7 SFO         13331 San Francisco Intl   37.6 -122.     13    -8 A     America…
##  8 FLL         12055 Fort Lauderdale Ho…  26.1  -80.2     9    -5 A     America…
##  9 MIA         11728 Miami Intl           25.8  -80.3     8    -5 A     America…
## 10 DCA          9705 Ronald Reagan Wash…  38.9  -77.0    15    -5 A     America…
## # … with 91 more rows

"ORD" is the airport code of Chicago O’Hare airport and "FLL" is the code for the main airport in Fort Lauderdale, Florida, which can be seen in the airport_name variable.

2.7 Tidy data

Consider the first five rows from the drinks data frame from the fivethirtyeight package:

## # A tibble: 5 x 5
##   country    beer_servings spirit_servings wine_servings total_litres_of_pure_a…
##   <chr>              <int>           <int>         <int>                   <dbl>
## 1 Afghanist…             0               0             0                     0  
## 2 Albania               89             132            54                     4.9
## 3 Algeria               25               0            14                     0.7
## 4 Andorra              245             138           312                    12.4
## 5 Angola               217              57            45                     5.9

After reading the help file by running ?drinks, you’ll see that drinks is a data frame that contains the results from a survey of the average number of servings of beer, spirits, and wine consumed in 193 countries. This data was originally reported on FiveThirtyEight.com in Mona Chalabi’s article: “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”.

Let’s apply some of the data wrangling verbs on the drinks data frame:

  1. filter() the drinks data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia, then
  2. select() all columns except total_litres_of_pure_alcohol by using the - sign, then
  3. rename() the variables beer_servings, spirit_servings, and wine_servings to beer, spirit, and wine, respectively.

We will save the resulting data frame in drinks_smaller.

drinks_smaller <- drinks %>%
  filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>%
  select(-total_litres_of_pure_alcohol) %>%
  rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
drinks_smaller
## # A tibble: 4 x 4
##   country       beer spirit  wine
##   <chr>        <int>  <int> <int>
## 1 China           79    192     8
## 2 Italy           85     42   237
## 3 Saudi Arabia     0      5     0
## 4 USA            249    158    84

2.7.1 Definition of “tidy” data

What does it mean for your data to be “tidy?” While “tidy” has a clear English meaning of “organized,” the word “tidy” in data science using R means that your data follows a standardized format.

“Tidy” data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

2.7.2 Converting to “tidy” data

In this book so far, you’ve only seen data frames that were already in “tidy” format. Furthermore, for the rest of this book, you’ll mostly only see data frames that are already tidy as well. This is not always the case, however, with all datasets in the world. If your original data frame is in wide (non-“tidy”) format and you would like to use the ggplot2 or dplyr packages, you will first have to convert it to “tidy” format. To do so, we recommend using the pivot_longer() function in the tidyr package (Wickham 2021d).

Going back to our drinks_smaller data frame from earlier:

drinks_smaller
## # A tibble: 4 x 4
##   country       beer spirit  wine
##   <chr>        <int>  <int> <int>
## 1 China           79    192     8
## 2 Italy           85     42   237
## 3 Saudi Arabia     0      5     0
## 4 USA            249    158    84

We tidy it by using the pivot_longer() function from the tidyr package as follows:

drinks_smaller_tidy <- drinks_smaller %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = -country)
drinks_smaller_tidy
## # A tibble: 12 x 3
##    country      type   servings
##    <chr>        <chr>     <int>
##  1 China        beer         79
##  2 China        spirit      192
##  3 China        wine          8
##  4 Italy        beer         85
##  5 Italy        spirit       42
##  6 Italy        wine        237
##  7 Saudi Arabia beer          0
##  8 Saudi Arabia spirit        5
##  9 Saudi Arabia wine          0
## 10 USA          beer        249
## 11 USA          spirit      158
## 12 USA          wine         84

Let’s dissect the arguments to pivot_longer().

  1. The first argument names_to corresponds to the name of the variable in the new “tidy”/long data frame that will contain the column names of the original data. Observe how we set names_to = "type". In the resulting drinks_smaller_tidy, the column type contains the three types of alcohol: beer, spirit, and wine. Since type is a variable name that doesn’t appear in drinks_smaller, we use quotation marks around it. You’ll receive an error if you just use names_to = type here.

  2. The second argument values_to corresponds to the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we set values_to = "servings" since each numeric value in the beer, wine, and spirit columns of the drinks_smaller corresponds to a value of servings. In the resulting drinks_smaller_tidy, the column servings contains the 4 \(\times\) 3 = 12 numerical values. Note again that servings doesn’t appear as a variable in drinks_smaller so it again needs quotation marks around it for the values_to argument.

  3. The third argument cols is the columns in the drinks_smaller data frame you either want to or don’t want to “tidy.” Observe how we set this to -country indicating that we don’t want to “tidy” the country variable in drinks_smaller and rather only beer, spirit, and wine. Since country is a column that appears in drinks_smaller we don’t put quotation marks around it.

The third argument here of cols is a little nuanced, so let’s consider code that was written slightly differently but that produces the same output:

drinks_smaller %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = c(beer, spirit, wine))

Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine), instead of the columns we don’t want to “tidy” using -country. We use the c() function to create a vector of the columns in drinks_smaller that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller data frame, we can also do the following for the cols argument:

drinks_smaller %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = beer:wine)

Converting “wide” format data to “tidy” format often confuses new R users. The only way to get comfortable with the pivot_longer() function is with practice, practice, and more practice using different datasets. For example, run ?pivot_longer and look at the examples in the bottom of the help file.

If, however, you want to convert a “tidy” data frame to “wide” format, you will need to use the pivot_wider() function instead. Run ?pivot_wider and look at the examples in the bottom of the help file for examples.
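
For example, here is a sketch that reverses our earlier pivot_longer() call, restoring drinks_smaller_tidy to its original wide layout:

drinks_smaller_tidy %>% 
  pivot_wider(names_from = type, 
              values_from = servings)
## # A tibble: 4 x 4
##   country       beer spirit  wine
##   <chr>        <int>  <int> <int>
## 1 China           79    192     8
## 2 Italy           85     42   237
## 3 Saudi Arabia     0      5     0
## 4 USA            249    158    84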

You can also view examples of both pivot_longer() and pivot_wider() on the tidyverse.org webpage. That page walks through the different functions available for data tidying and includes a case study using data from the World Health Organization. Furthermore, each week the R4DS Online Learning Community posts a dataset in the weekly #TidyTuesday event that might serve as a nice place for you to find other data to explore and transform.

2.8 Other commands

Here are some topics which will prove important later in the Primer.

2.8.1 Random variables

2.8.1.1 sample()

The most common distributions you will work with are empirical or frequency distributions: the values of age in the trains tibble, the values of poverty in the kenya tibble, and so on. But we can also create our own data by making “draws” from a distribution which we have concocted.

Consider the distribution of the possible values from rolling a fair die. We can use the sample() function to make draws from this distribution; each draw is random, so repeated draws will usually (but not always) differ.

die <- c(1, 2, 3, 4, 5, 6)

sample(x = die, size = 1)
## [1] 3

This produces one “draw” from the distribution of the possible values of one roll of a fair six-sided die.

Now, suppose we wanted to roll this die 10 times. One of the arguments of the sample() function is replace. We must specify it as TRUE if values can appear more than once. Since, when rolling a die 10 times, we expect that a value like 3 can appear more than once, we need to set replace = TRUE.

sample(x = die, size = 10, replace = TRUE)
##  [1] 5 6 3 3 3 4 3 6 4 5

In other words, rolling a 1 on the first roll should not preclude you from rolling a 1 on a later roll.

What if the die is not “fair,” meaning that some sides are more likely to appear than others? The final argument of the sample() function is the prob argument. It takes a vector (of the same length as the initial vector x) that contains all of the probabilities of selecting each one of the elements of x. Suppose that the probability of rolling a 1 was 0.5, and the probability of rolling any other value is 0.1. (These probabilities should sum to 1. If they don’t sample() will automatically re-scale them.)

sample(x = die, 
       size = 10, 
       replace = TRUE, 
       prob = c(0.5, 0.1, 0.1, 0.1, 0.1, 0.1))
##  [1] 1 1 3 1 2 1 1 6 1 1

Remember: There is no real data here. We have not actually rolled a die. We have just made some assumptions about what would happen if we were to roll a die. With those assumptions we have built an urn — a data generating mechanism — from which we can draw as many values as we like. Let’s roll the unfair die 10,000 times.

tibble(result = sample(x = die, 
                       size = 10000, 
                       replace = TRUE, 
                       prob = c(0.5, rep(0.1, 5)))) %>% 
  ggplot(aes(x = result)) +
    geom_bar() +
    labs(title = "Distribution of Results of an Unfair Die",
         x = "Result of One Roll",
         y = "Count") +
    scale_x_continuous(breaks = 1:6,
                       labels = as.character(1:6)) +
    scale_y_continuous(labels = scales::comma_format())

sample() is just one of many functions for creating draws — or, more colloquially, “drawing” — from a distribution. Three of the most important functions are: runif(), rbinom(), and rnorm().

2.8.1.2 runif()

Consider a “uniform” distribution. This is the case in which every outcome in the range of possible outcomes has the same chance of occurring. The function runif() (spoken as “r-unif”) enables us to draw from a uniform distribution. runif() has three arguments: n, min, and max. runif() will produce n draws from between min and max, with each value having an equal chance of occurring.

runif(n = 10, min = 4, max = 6)
##  [1] 5.6 5.0 4.6 5.7 4.9 5.2 4.3 5.6 4.8 5.6

Mathematically, we would notate:

\[y_i \sim U(4, 6)\]

This means that each value \(y_i\) is drawn from a uniform distribution between four and six.

2.8.1.3 rbinom()

Consider the binomial distribution, the case in which the probability of some Boolean outcome (for instance, success or failure) is fixed across repeated, independent trials. One common example would be the probability of flipping a coin and landing on heads. The function rbinom() allows us to draw from a binomial distribution. This function takes three arguments: n, size, and prob.

  • n is the number of values we seek to draw.
  • size is the number of trials for each draw.
  • prob is the probability of success on each trial.

Suppose we wanted to flip a fair coin one time, and let landing on heads represent success.

rbinom(n = 1 , size = 1, prob = 0.5)
## [1] 0

Do the same thing 100 times:

tibble(heads = rbinom(n = 100, size = 1, prob = 0.5)) %>% 
  ggplot(aes(x = heads)) +
    geom_bar() +
    labs(title = "Flipping a Fair Coin 100 Times",
         x = "Result",
         y = "Count") +
    scale_x_continuous(breaks = c(0, 1),
                       labels = c("Tails", "Heads"))

In our graph above, we use the function scale_x_continuous() because our x-axis variable is numeric. The breaks argument to scale_x_continuous() converts our x-axis to having just two “tick marks.” There is a fairly even distribution of Tails and Heads. More draws would typically result in an even more equal split.

Randomness creates (inevitable) tension between distribution as a “thing” and distribution as a vector of draws from that thing. In this case, the vector of draws is not balanced between Tails and Heads. Yet, we “know” that it should be since the coin is, by definition, fair. In a sense, the mathematics require an even split. Yet, randomness means that the vector of draws will rarely match the mathematically “true” result. And that is OK! First, randomness is an intrinsic property of the real world. Second, we can make the effect of randomness be as small as we want by increasing the number of draws.

Suppose instead we wanted to simulate an unfair coin, where the probability of landing on Heads was 0.75 instead of 0.5.

tibble(heads = rbinom(n = 100, size = 1, prob = 0.75)) %>% 
  ggplot(aes(x = heads)) +
    geom_bar() +
    labs(title = "Flipping a Fair Coin 100 Times",
         x = "Result",
         y = "Count") +
    scale_x_continuous(breaks = c(0, 1),
                       labels = c("Tails", "Heads"))

The distribution — the imaginary urn — from which we draw the results of a coin flip for a fair coin is a different distribution — a different imaginary urn — from the distribution for a biased coin. In fact, there are an infinite number of distributions. Yet as long as we can draw values from a distribution, we can work with it. Mathematics:

\[y_i \sim B(n, p)\].

Each value for \(y\) is drawn from a binomial distribution with parameters \(n\) for the number of trials and \(p\) for the probability of success.

Instead of each n consisting of a single trial, we could have a situation in which we are, 10,000 times, flipping a coin 10 times and summing up, for each experiment, the number of heads. In that case:

set.seed(9)
tibble(heads = rbinom(n = 10000, size = 10, prob = 0.5)) %>% 
  ggplot(aes(x = heads)) +
    geom_bar() +
    labs(title = "Flipping a Fair Coin 10 Times",
         subtitle = "Extreme results are possible with enough experiments",
         x = "Total Number of Heads in Ten Flips",
         y = "Count") +
    scale_x_continuous(breaks = 0:10)

2.8.1.4 rnorm()

The most important distribution is the normal distribution. Mathematics:

\[y_i \sim N(\mu, \sigma^2)\].

Each value \(y_i\) is drawn from a normal distribution with parameters \(\mu\) for the mean and \(\sigma\) for the standard deviation.

This bell-shaped distribution is defined by two parameters: (1) the mean \(\mu\) (spoken as “mu”) which locates the center of the distribution and (2) the standard deviation \(\sigma\) (spoken as “sigma”) which determines the variation of values around that center. In the figure below, we plot three normal distributions where:

  1. The solid normal curve has mean \(\mu = 5\) & standard deviation \(\sigma = 2\).
  2. The dotted normal curve has mean \(\mu = 5\) & standard deviation \(\sigma = 5\).
  3. The dashed normal curve has mean \(\mu = 15\) & standard deviation \(\sigma = 2\).

FIGURE 2.6: Three normal distributions.

Notice how the solid and dotted line normal curves have the same center due to their common mean \(\mu\) = 5. However, the dotted line normal curve is wider due to its larger standard deviation of \(\sigma = 5\). On the other hand, the solid and dashed line normal curves have the same variation due to their common standard deviation \(\sigma = 2\). However, they are centered at different locations.

When the mean \(\mu = 0\) and the standard deviation \(\sigma = 1\), the normal distribution has a special name. It’s called the standard normal distribution or the \(z\)-curve.
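
The function rnorm() lets us draw from a normal distribution, just as runif() and rbinom() draw from theirs. A minimal sketch (we omit the output because the draws are random):

# Five draws from the standard normal distribution.
rnorm(n = 5, mean = 0, sd = 1)

# Five draws from a normal distribution centered at 5 with sd of 2.
rnorm(n = 5, mean = 5, sd = 2)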

Furthermore, if a variable follows a normal curve, there are three rules of thumb we can use:

  1. 68% of values will lie within \(\pm\) 1 standard deviation of the mean.
  2. 95% of values will lie within \(\pm\) 1.96 \(\approx\) 2 standard deviations of the mean.
  3. 99.7% of values will lie within \(\pm\) 3 standard deviations of the mean.

Let’s illustrate this on a standard normal curve. The dashed lines are at -3, -1.96, -1, 0, 1, 1.96, and 3. These 7 lines cut up the x-axis into 8 segments. The areas under the normal curve for each of the 8 segments are marked and add up to 100%. For example:

  1. The middle two segments represent the interval -1 to 1. The shaded area above this interval represents 34% + 34% = 68% of the area under the curve. In other words, 68% of values.
  2. The middle four segments represent the interval -1.96 to 1.96. The shaded area above this interval represents 13.5% + 34% + 34% + 13.5% = 95% of the area under the curve. In other words, 95% of values.
  3. The middle six segments represent the interval -3 to 3. The shaded area above this interval represents 2.35% + 13.5% + 34% + 34% + 13.5% + 2.35% = 99.7% of the area under the curve. In other words, 99.7% of values.
Rules of thumb about areas under normal curves.

FIGURE 2.7: Rules of thumb about areas under normal curves.
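These percentages are not mysterious; they follow from the normal curve itself. pnorm() returns the area under the curve to the left of a given value, so subtracting two such areas gives the share of values in an interval:

pnorm(1) - pnorm(-1)
## [1] 0.6826895
pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042
pnorm(3) - pnorm(-3)
## [1] 0.9973002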

2.8.2 Combinations

We often need to create a tibble with all the possible combinations of two or more variables. expand_grid() is the easiest approach.

expand_grid(x = c("A", "B"), y = c(1, 2, 3, 4))
## # A tibble: 8 x 2
##   x         y
##   <chr> <dbl>
## 1 A         1
## 2 A         2
## 3 A         3
## 4 A         4
## 5 B         1
## 6 B         2
## 7 B         3
## 8 B         4

crossing() does something similar, with one difference: it sorts its inputs and drops any duplicate values. A quick sketch:
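crossing(x = c("B", "A", "A"), y = c(1, 2))
## # A tibble: 4 x 2
##   x         y
##   <chr> <dbl>
## 1 A         1
## 2 A         2
## 3 B         1
## 4 B         2

Note how the duplicate "A" was dropped and the values sorted.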

2.8.3 Matrices

Recall that a “matrix” in R is a rectangular array of data, shaped like a data frame or tibble, but containing only one type of data, e.g., numeric. Large matrices also print out poorly. (There are other differences, none of which we care about here.) Example:

m <- matrix(c(3, 4, 8, 9, 12, 13, 0, 15, -1), ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    3    9    0
## [2,]    4   12   15
## [3,]    8   13   -1

The easiest way to pull information from a matrix is to use [ ], the subset operator. Here is how we grab the second and third columns of m:

m[, 2:3]
##      [,1] [,2]
## [1,]    9    0
## [2,]   12   15
## [3,]   13   -1

Note, however, that subsetting a matrix down to a single column (or row) “collapses” the result into a vector:

m[, 2]
## [1]  9 12 13

Tibbles, on the other hand, always maintain their rectangular shapes, even with only one column or row.
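If you want a matrix subset to keep its rectangular shape, base R's drop = FALSE argument prevents the collapse:

m[, 2, drop = FALSE]
##      [,1]
## [1,]    9
## [2,]   12
## [3,]   13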

We can turn matrices into tibbles with as_tibble().

m %>% 
  as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## # A tibble: 3 x 3
##      V1    V2    V3
##   <dbl> <dbl> <dbl>
## 1     3     9     0
## 2     4    12    15
## 3     8    13    -1

Because m does not have column names, as_tibble() creates its own variable names, using “V” for variable.
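One way to avoid both the warning and the default names is to give the matrix column names, with colnames(), before converting:

colnames(m) <- c("x", "y", "z")
m %>% 
  as_tibble()
## # A tibble: 3 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     3     9     0
## 2     4    12    15
## 3     8    13    -1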

2.8.4 Missing Values

Some values in a tibble may be missing. These are called missing values, and they are marked as NA. We can create such a tibble as follows:

tbl <- tribble(
  ~ a, ~ b, ~ c,
    2,   3,   5,
    4,  NA,   8,
   NA,   7,   9,
)

tbl
## # A tibble: 3 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     2     3     5
## 2     4    NA     8
## 3    NA     7     9

The presence of NA values can be problematic.

tbl %>% 
  summarize(avg_a = mean(a))
## # A tibble: 1 x 1
##   avg_a
##   <dbl>
## 1    NA

Fortunately, many R functions, including mean(), take an argument, na.rm, which, when set to TRUE, removes NA values from the calculation.

tbl %>% 
  summarize(avg_a = mean(a, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_a
##   <dbl>
## 1     3

Another approach is to use drop_na().

tbl %>% 
  drop_na(a)
## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     2     3     5
## 2     4    NA     8

However, be careful! If you use drop_na() without specifying a variable, you will remove all rows with a missing value for any variable in the tibble.

tbl %>% 
  drop_na()
## # A tibble: 1 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     2     3     5

A final approach is to use is.na() to explicitly determine if a value is missing.

tbl %>% 
  mutate(a_missing = is.na(a))
## # A tibble: 3 x 4
##       a     b     c a_missing
##   <dbl> <dbl> <dbl> <lgl>    
## 1     2     3     5 FALSE    
## 2     4    NA     8 FALSE    
## 3    NA     7     9 TRUE
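is.na() also combines nicely with across() to count the missing values in every column at once, a common first step with a new dataset:

tbl %>% 
  summarize(across(everything(), ~ sum(is.na(.x))))
## # A tibble: 1 x 3
##       a     b     c
##   <int> <int> <int>
## 1     1     1     0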

2.8.5 Working by Rows

Tibbles and the main Tidyverse functions are designed to work by columns. You do something to all the values for variable a. But, sometimes, we want to work across the tibble, comparing the value for a in the first row to the value of b in the first row, and so on. To do that, we need two tricks. First, we use rowwise() to inform R that the next set of commands should be executed across the rows.

tbl %>% 
  rowwise()
## # A tibble: 3 x 3
## # Rowwise: 
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     2     3     5
## 2     4    NA     8
## 3    NA     7     9

Note how “# Rowwise:” is printed out. Having set up the pipe to work across rows, we need to pass c_across() to whichever function we are using, generally specifying the variables we want to use. If we don’t provide any arguments to c_across(), it will use all the columns in the tibble.

tbl %>% 
  rowwise() %>% 
  mutate(sum_a_c = sum(c_across(c(a, c)))) %>% 
  mutate(largest = max(c_across())) %>% 
  mutate(largest_na = max(c_across(), na.rm = TRUE))
## # A tibble: 3 x 6
## # Rowwise: 
##       a     b     c sum_a_c largest largest_na
##   <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>
## 1     2     3     5       7       7          7
## 2     4    NA     8      12      NA         12
## 3    NA     7     9      NA      NA          9
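Note that rowwise() stays in effect until you remove it with ungroup(). A sketch which computes a row mean and then returns the tibble to normal, column-wise behavior:

tbl %>% 
  rowwise() %>% 
  mutate(row_mean = mean(c_across(), na.rm = TRUE)) %>% 
  ungroup()
## # A tibble: 3 x 4
##       a     b     c row_mean
##   <dbl> <dbl> <dbl>    <dbl>
## 1     2     3     5     3.33
## 2     4    NA     8     6   
## 3    NA     7     9     8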

2.8.6 Using skim()

The skimr package offers a very useful function, skim(), which provides valuable information about a dataset at a glance. It is similar to glimpse(), but a little more detailed, and it offers some preliminary analysis of the data.

Let’s try skimming the nhanes dataset.

skim(nhanes)
TABLE 2.1: Data summary

Name                     nhanes
Number of rows           10000
Number of columns        15
Column type frequency:
  character              2
  factor                 3
  numeric                10
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
gender                 0              1    4    6      0         2           0
race                   0              1    5    8      0         5           0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
education           2779           0.72  FALSE           4  Som: 2267, Col: 2098, Hig: 1517, Mid: 1339
hh_income            811           0.92  FALSE          12  mor: 2220, 750: 1084, 250: 958, 350: 863
depressed           3327           0.67  FALSE           3  Non: 5246, Sev: 1009, Mos: 418

Variable type: numeric

skim_variable   n_missing  complete_rate     mean     sd      p0   p25   p50   p75  p100  hist
survey                  0           1.00  2010.00   1.00  2009.0  2009  2010  2011  2011  ▇▁▁▁▇
age                     0           1.00    36.74  22.40     0.0    17    36    54    80  ▇▇▇▆▅
weight                 78           0.99    70.98  29.13     2.8    56    73    89   231  ▂▇▂▁▁
height                353           0.96   161.88  20.19    83.6   157   166   174   200  ▁▁▁▇▂
bmi                   366           0.96    26.66   7.38    12.9    22    26    31    81  ▇▆▁▁▁
pulse                1437           0.86    73.56  12.16    40.0    64    72    82   136  ▂▇▃▁▁
diabetes              142           0.99     0.08   0.27     0.0     0     0     0     1  ▇▁▁▁▁
general_health       2461           0.75     3.38   0.94     1.0     3     3     4     5  ▁▃▇▇▂
pregnancies          7396           0.26     3.03   1.80     1.0     2     3     4    32  ▇▁▁▁▁
sleep                2245           0.78     6.93   1.35     2.0     6     7     8    12  ▁▅▇▁▁

The skim() function provides information such as the mean, the number of unique values, and even miniature histograms of each numeric variable's distribution. This is extremely useful because it gives us an easy first look at the data and a starting point for our work. For example, there are over 7,000 missing values for the pregnancies variable, so any calculation involving it will need to handle those missing values, perhaps with drop_na().
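For example, using the counts from the table above (10,000 rows, 7,396 of them missing pregnancies), dropping those rows leaves 2,604 complete observations:

nhanes %>% 
  drop_na(pregnancies) %>% 
  nrow()
## [1] 2604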

Running skim() gives you a starting point, both for your analysis and for any graphics you create.

2.9 Summary

Data science is data cleaning.

Real data is nasty.

This chapter covered many, many commands. You should have them all memorized by now.

No! That is ridiculous. We don’t have them all memorized. Why should you? The point of this chapter was to give you a tour of what you can do in R and how to do it. With that information, you have a base from which to try to solve the problems you will encounter in the future.

The key data science concept from this chapter is, again, the idea of a “distribution.” The word distribution is used in two very different ways. First, a distribution is an invisible object that you can never see or touch. It is the imaginary urn from which you can take draws. Only in very special cases will you ever be able to “know” what the distribution is, mainly in cases where there is a physical process, like a roulette wheel, which you can inspect or in the case of an assumed mathematical formula. In almost all real world data science problems, the “distribution” is a mental creation whose reality you can never confirm.

The second way that the word distribution is used is to refer to a vector of values, a variable in an R tibble. The 115 ages in trains are a distribution, as are 1,000 draws from rnorm().

Whether “distribution” means the imaginary object or a vector of numbers drawn from that imaginary object depends on context.