2 Wrangling
Data science is data cleaning.
Start by loading the packages which we will need in this chapter.
library(tidyverse)
library(primer.data)
library(lubridate)
library(janitor)
library(skimr)
library(nycflights13)
library(gapminder)
library(fivethirtyeight)
The tidyverse package will be used in every chapter. Loading it makes the 8 packages from the “Tidyverse” available. primer.data is the data package created specifically for this Primer. lubridate is a package for working with dates and times. janitor offers functions for cleaning up dirty data, the most important of which is clean_names()
. skimr contains functions that are useful for providing summary statistics, especially skim()
. nycflights13 includes data associated with flights out of New York City’s three major airports. gapminder has annual data for countries going back more than 50 years. We used this data in Chapter 1. fivethirtyeight cleans up data from the FiveThirtyEight team.
2.1 Tibbles
Tibbles are a kind of data frame, useful for storing data in which we have the same number of observations for each variable. We can use the tibble()
function to create tibbles, which are composed of columns, each of which is a variable, and rows, each of which are a “unit” or “observation.” Each column (i.e., each variable) can be of a different type: character, integer, factor, double, date and so on.
2.1.1 tibble()
tibble(a = 2.1, b = "Hello", c = TRUE, d = 9L)
## # A tibble: 1 x 4
## a b c d
## <dbl> <chr> <lgl> <int>
## 1 2.1 Hello TRUE 9
In our code, we specify the variable names (or the names of columns) as , a
, b
, and c
. Under each variable, we give a different value. Under each variable name, the data type is specified for the data within that column. A tibble can consist of one or more atomic vectors, the most important types of which are double, character, logical, and integer. The tibble above includes a variable of each type. (The “L” in “9L” tells R that we want d
to be an integer rather than the default, which would be a double.) When you print out a tibble, the variable type is shown below the variable name.
Variables should not begin with a number (like 54abc
) or include spaces (like my var
). If you insist on using variable names like that, you must include backticks around each name when you reference it.
tibble(`54abc` = 1, `my var` = 2, c = 3)
## # A tibble: 1 x 3
## `54abc` `my var` c
## <dbl> <dbl> <dbl>
## 1 1 2 3
If we did not include the backticks, R would give us an error.
tibble(54abc = 1, my var = 2, c = 3)
## Error: <text>:1:10: unexpected symbol
## 1: tibble(54abc
## ^
It is sometimes easier to use the function tribble()
to create tibbles.
tribble(
~ var1, ~ `var 2`, ~ myvar,
1, 3, 5,
4, 6, 8,
)
## # A tibble: 2 x 3
## var1 `var 2` myvar
## <dbl> <dbl> <dbl>
## 1 1 3 5
## 2 4 6 8
The tildes — as in ~ var1
— specify which row has the column names. The formatting makes it easier, relative to specifying raw vectors, to see which values are from the same observation.
2.2 Characters
So far, our tibbles have been clean and wholesome, like gapminder
and trains
. Real data is nasty. You will bring data into R from the outside world and discover there are problems. We will now discuss common remedial tasks for cleaning and transforming character data, also known as strings. A string is one or more characters that are enclosed inside a pair of matching ‘single’ or “double quotes.”
We will use the fruit
data, a vector with the names of different fruits, from the stringr package, which is automatically loaded when we issue library(tidyverse)
. Although we can manipulate character vectors directly, it is much more common, in real world situations, to work with vectors which are in a tibble.
tbl_fruit < tibble(fruit = fruit)
2.2.1 Character vectors
tbl_fruit %>%
slice_sample(n = 8)
## # A tibble: 8 x 1
## fruit
## <chr>
## 1 clementine
## 2 ugli fruit
## 3 peach
## 4 elderberry
## 5 gooseberry
## 6 mandarine
## 7 mulberry
## 8 nectarine
Note that slice_sample()
selects a random set of rows from the tibble. If you use the argument n
, as here, you get back that many rows. Use the argue prop
to return a specific percentage of all the rows in the tibble.
str_detect()
determines if a character vector matches a pattern. It returns a logical vector that is the same length as the input. Recall logicals are either TRUE
or FALSE
.
Which fruit names actually include the letter “c?”
tbl_fruit %>%
mutate(fruit_in_name = str_detect(fruit, pattern = "c"))
## # A tibble: 80 x 2
## fruit fruit_in_name
## <chr> <lgl>
## 1 apple FALSE
## 2 apricot TRUE
## 3 avocado TRUE
## 4 banana FALSE
## 5 bell pepper FALSE
## 6 bilberry FALSE
## 7 blackberry TRUE
## 8 blackcurrant TRUE
## 9 blood orange FALSE
## 10 blueberry FALSE
## # … with 70 more rows
str_length()
counts characters in your strings. Note this is different from the length()
of the character vector itself.
tbl_fruit %>%
mutate(name_length = str_length(fruit))
## # A tibble: 80 x 2
## fruit name_length
## <chr> <int>
## 1 apple 5
## 2 apricot 7
## 3 avocado 7
## 4 banana 6
## 5 bell pepper 11
## 6 bilberry 8
## 7 blackberry 10
## 8 blackcurrant 12
## 9 blood orange 12
## 10 blueberry 9
## # … with 70 more rows
str_sub()
extracts parts of a string. The function takes start and end arguments which are vectorised.
tbl_fruit %>%
mutate(first_three_letters = str_sub(fruit, 1, 3))
## # A tibble: 80 x 2
## fruit first_three_letters
## <chr> <chr>
## 1 apple app
## 2 apricot apr
## 3 avocado avo
## 4 banana ban
## 5 bell pepper bel
## 6 bilberry bil
## 7 blackberry bla
## 8 blackcurrant bla
## 9 blood orange blo
## 10 blueberry blu
## # … with 70 more rows
str_c()
combines a character vector of length to a single string. This is similar to the normal c()
function for creating a vector.
tbl_fruit %>%
mutate(name_with_s = str_c(fruit, "s"))
## # A tibble: 80 x 2
## fruit name_with_s
## <chr> <chr>
## 1 apple apples
## 2 apricot apricots
## 3 avocado avocados
## 4 banana bananas
## 5 bell pepper bell peppers
## 6 bilberry bilberrys
## 7 blackberry blackberrys
## 8 blackcurrant blackcurrants
## 9 blood orange blood oranges
## 10 blueberry blueberrys
## # … with 70 more rows
str_replace()
replaces a pattern within a string.
tbl_fruit %>%
mutate(capital_A = str_replace(fruit,
pattern = "a",
replacement = "A"))
## # A tibble: 80 x 2
## fruit capital_A
## <chr> <chr>
## 1 apple Apple
## 2 apricot Apricot
## 3 avocado Avocado
## 4 banana bAnana
## 5 bell pepper bell pepper
## 6 bilberry bilberry
## 7 blackberry blAckberry
## 8 blackcurrant blAckcurrant
## 9 blood orange blood orAnge
## 10 blueberry blueberry
## # … with 70 more rows
2.2.2 Regular expressions with stringr
Sometimes, your string tasks cannot be expressed in terms of a fixed string, but can be described in terms of a pattern. Regular expressions, also know as “regexes,” are the standard way to specify these patterns. In regexes, specific characters and constructs take on special meaning in order to match multiple strings.
To explore regular expressions, we will use the str_detect()
function, which reports TRUE for any string which matches the pattern, and then filter()
to see all the matches. For example, here are all the fruits which include a “w” in their name.
tbl_fruit %>%
filter(str_detect(fruit, pattern = "w"))
## # A tibble: 4 x 1
## fruit
## <chr>
## 1 honeydew
## 2 kiwi fruit
## 3 strawberry
## 4 watermelon
In the code below, the first metacharacter is the period .
, which stands for any single character, except a newline (which, by the way, is represented by \n
). The regex b.r
will match all fruits that have an “b,” followed by any single character, followed by “r.” Regexes are case sensitive.
tbl_fruit %>%
filter(str_detect(fruit, pattern = "b.r"))
## # A tibble: 15 x 1
## fruit
## <chr>
## 1 bilberry
## 2 blackberry
## 3 blueberry
## 4 boysenberry
## 5 cloudberry
## 6 cranberry
## 7 cucumber
## 8 elderberry
## 9 goji berry
## 10 gooseberry
## 11 huckleberry
## 12 mulberry
## 13 raspberry
## 14 salal berry
## 15 strawberry
Anchors can be included to express where the expression must occur within the string. The ^
indicates the beginning of string and $
indicates the end.
tbl_fruit %>%
filter(str_detect(fruit, pattern = "^w"))
## # A tibble: 1 x 1
## fruit
## <chr>
## 1 watermelon
tbl_fruit %>%
filter(str_detect(fruit, pattern = "o$"))
## # A tibble: 5 x 1
## fruit
## <chr>
## 1 avocado
## 2 mango
## 3 pamelo
## 4 pomelo
## 5 tamarillo
2.2.3 Raw strings
There are certain characters with special meaning in regexes, including $ * + . ? [ ] ^ { }  ( ) \
. This makes things difficult when we want to use these characters within the strings instead of as regexes. Patterns which seek to match these characters, rather than use them as metacharacters, should make use of r"()"
, the raw string constant in R. Consider:
r"(Do you use "airquotes" much?)"
## [1] "Do you use \"airquotes\" much?"
The \
preceding each quote is what is known as escaping. It signals to the computer that the quotations are part of the string. Now, what happens if we have a backslash within the string.
r"(\)"
## [1] "\\"
As you can see, a backslash is used to escape the initial backslash.
2.3 Factors
Factors are categorical variables that only take on a specified set of values. To manipulate factors we will use the forcats package, a core package in the Tidyverse.
It is easy to make factors with either factor()
, as.factor()
or parse_factor()
.
tibble(X = letters[1:3]) %>%
mutate(fac_1 = factor(X)) %>%
mutate(fac_2 = as.factor(X)) %>%
mutate(fac_3 = parse_factor(X))
## # A tibble: 3 x 4
## X fac_1 fac_2 fac_3
## <chr> <fct> <fct> <fct>
## 1 a a a a
## 2 b b b b
## 3 c c c c
Which of those three options is best depends on the situation. factor()
is useful when you are creating a factor from nothing. as.factor()
is best for simple transformations, especially of character variables, as in this example. parse_factor()
is the most modern and powerful of the three.
Let’s use gapminder$continent
as an example. Note that str()
is a useful function for getting detailed information about an object.
str(gapminder$continent)
## Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
nlevels(gapminder$continent)
## [1] 5
class(gapminder$continent)
## [1] "factor"
To get a frequency table as a tibble, from a tibble, use count()
. To get a similar result from a freerange factor, use fct_count()
.
gapminder %>%
count(continent)
## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
fct_count(gapminder$continent)
## # A tibble: 5 x 2
## f n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
2.3.1 Dropping unused levels
Removing all the rows corresponding to a specific factor level does not remove the level itself. These unused levels can come back to haunt you later, e.g., in figure legends.
Watch what happens to the levels of country
when we filter gapminder
to a handful of countries.
nlevels(gapminder$country)
## [1] 142
h_gap < gapminder %>%
filter(country %in% c("Egypt", "Haiti",
"Romania", "Thailand",
"Venezuela"))
nlevels(h_gap$country)
## [1] 142
Even though h_gap
only has data for a handful of countries, we are still schlepping around all levels from the original gapminder
tibble.
How can you get rid of them? The base function droplevels()
operates on all the factors in a data frame or on a single factor. The function fct_drop()
operates on a single factor variable.
h_gap_dropped < h_gap %>%
droplevels()
nlevels(h_gap_dropped$country)
## [1] 5
# Use fct_drop() on a freerange factor
h_gap$country %>%
fct_drop() %>%
levels()
## [1] "Egypt" "Haiti" "Romania" "Thailand" "Venezuela"
2.3.2 Change the order of the levels
By default, factor levels are ordered alphabetically.
gapminder$continent %>%
levels()
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
We can also order factors by:
 Frequency: Make the most common level the first and so on.
 Another variable: Order factor levels according to a summary statistic for another variable.
Let’s order continent by frequency using fct_infreq()
.
gapminder$continent %>%
fct_infreq() %>%
levels()
## [1] "Africa" "Asia" "Europe" "Americas" "Oceania"
We can also have the frequency print out backwards using fct_rev()
.
gapminder$continent %>%
fct_infreq() %>%
fct_rev() %>%
levels()
## [1] "Oceania" "Americas" "Europe" "Asia" "Africa"
These two barcharts of frequency by continent differ only in the order of the continents. Which do you prefer? We show the code for just the second one.
gapminder %>%
mutate(continent = fct_infreq(continent)) %>%
mutate(continent = fct_rev(continent)) %>%
ggplot(aes(x = continent)) +
geom_bar() +
coord_flip()
Let’s now order country
by another variable, forwards and backwards. This other variable is usually quantitative and you will order the factor according to a grouped summary. The factor is the grouping variable and the default summarizing function is median()
but you can specify something else.
# Order countries by median life expectancy
fct_reorder(gapminder$country,
gapminder$lifeExp) %>%
levels() %>%
head()
## [1] "Sierra Leone" "GuineaBissau" "Afghanistan" "Angola"
## [5] "Somalia" "Guinea"
# Order according to minimum life exp instead of median
fct_reorder(gapminder$country,
gapminder$lifeExp, min) %>%
levels() %>%
head()
## [1] "Rwanda" "Afghanistan" "Gambia" "Angola" "Sierra Leone"
## [6] "Cambodia"
# Backwards!
fct_reorder(gapminder$country,
gapminder$lifeExp,
.desc = TRUE) %>%
levels() %>%
head()
## [1] "Iceland" "Japan" "Sweden" "Switzerland" "Netherlands"
## [6] "Norway"
Why do we reorder factor levels? It often makes plots much better! When plotting a factor against a numeric variable, it is generally a good idea to order the factors by some function of the numeric variable. Alphabetic ordering is rarely best.
Compare the interpretability of these two plots of life expectancy in the Aericas in 2007. The only difference is the order of the country
factor. Which is better?
gapminder %>%
filter(year == 2007,
continent == "Americas") %>%
ggplot(aes(x = lifeExp, y = country)) +
geom_point()
gapminder %>%
filter(year == 2007,
continent == "Americas") %>%
ggplot(aes(x = lifeExp,
y = fct_reorder(country, lifeExp))) +
geom_point()
2.3.3 Recode the levels
Use fct_recode()
to change the names of the levels for a factor.
i_gap < gapminder %>%
filter(country %in% c("United States", "Sweden",
"Australia")) %>%
droplevels()
i_gap$country %>%
levels()
## [1] "Australia" "Sweden" "United States"
i_gap$country %>%
fct_recode("USA" = "United States", "Oz" = "Australia") %>%
levels()
## [1] "Oz" "Sweden" "USA"
2.3.4 Grow a factor
Let’s create two tibbles, each with data from two countries and dropped unused factor levels.
df1 < gapminder %>%
filter(country %in% c("United States", "Mexico"), year > 2000) %>%
droplevels()
df2 < gapminder %>%
filter(country %in% c("France", "Germany"), year > 2000) %>%
droplevels()
The country
factors in df1
and df2
have different levels.
levels(df1$country)
## [1] "Mexico" "United States"
levels(df2$country)
## [1] "France" "Germany"
Can you just combine them?
c(df1$country, df2$country)
## [1] Mexico Mexico United States United States France
## [6] France Germany Germany
## Levels: Mexico United States France Germany
Umm, no. That is wrong on many levels! Use fct_c()
to do this.
fct_c(df1$country, df2$country)
## [1] Mexico Mexico United States United States France
## [6] France Germany Germany
## Levels: Mexico United States France Germany
2.4 Lists
Earlier, we briefly introduced lists. Lists are a type of vector that is a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or treelike structures. You create a list with the function list()
:
x < list(1, 2, 3)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
A very useful tool for working with lists is str()
because it focuses on displaying the structure, not the contents.
str(x)
## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
## List of 3
## $ a: num 1
## $ b: num 2
## $ c: num 3
Unlike atomic vectors, list()
can contain a mix of objects.
## List of 4
## $ : chr "a"
## $ : int 1
## $ : num 1.5
## $ : logi TRUE
Lists can even contain other lists!
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ :List of 2
## ..$ : num 3
## ..$ : num 4
2.4.1 Visualizing lists
To explain more complicated list manipulation functions, it’s helpful to have a visual representation of lists. For example, take these three lists:
I’ll draw them as follows:
There are three principles:
Lists have rounded corners. Atomic vectors have square corners.
Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.
The orientation of the children (i.e. rows or columns) isn’t important, so I’ll pick a row or column orientation to either save space or illustrate an important property in the example.
2.4.2 Subsetting
There are three ways to subset a list, which I’ll illustrate with a list named a
:
[ ]
extracts a sublist. The result will always be a list.
str(a[1:2])
## List of 2
## $ a: int [1:3] 1 2 3
## $ b: chr "a string"
str(a[4])
## List of 1
## $ d:List of 2
## ..$ : num 1
## ..$ : num 5
Like with vectors, you can subset with a logical, integer, or character vector.
[[ ]]
extracts a single component from a list. It removes a level of hierarchy from the list.
str(a[[1]])
## int [1:3] 1 2 3
str(a[[4]])
## List of 2
## $ : num 1
## $ : num 5
$
is a shorthand for extracting named elements of a list. It works similarly to [[ ]]
except that you don’t need to use quotes.
a$a
## [1] 1 2 3
a[["a"]]
## [1] 1 2 3
The distinction between [ ]
and [[ ]]
is really important for lists, because [[ ]]
drills down into the list while [ ]
returns a new, smaller list. Compare the code and output above with the visual representation.
2.5 DateTimes
We will manipulate datetimes using the lubridate package, which makes it easier to work with dates and times in R. lubridate is not part of the core tidyverse because you only need it when you’re working with dates/times.
There are three types of date/time data that refer to an instant in time:
A
date
. Tibbles print this as<date>
.A
time
within a day. Tibbles print this as<time>
.A
datetime
is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as<dttm>
.
You should always use the simplest possible datatype that works for your needs. That means if you can use a date instead of a datetime, you should. Datetimes are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.
To get the current date or datetime you can use today()
or now()
:
today()
## [1] "20210605"
now()
## [1] "20210605 18:22:35 EDT"
Otherwise, there are three ways you’re likely to create a date/time:
 From a string.
 From individual datetime components.
 From an existing date/time object.
They work as follows:
2.5.1 From strings
Date/time data often comes as strings. The lubridate functions automatically work out the format once you specify the order of the component. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y,” “m,” and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:
ymd("20170131")
## [1] "20170131"
mdy("January 31st, 2017")
## [1] "20170131"
dmy("31Jan2017")
## [1] "20170131"
These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data. ymd()
is short and unambiguous:
ymd(20170131)
## [1] "20170131"
ymd()
and friends create dates. To create a datetime, add an underscore and one or more of “h,” “m,” and “s” to the name of the parsing function:
ymd_hms("20170131 20:11:59")
## [1] "20170131 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "20170131 08:01:00 UTC"
You can also force the creation of a datetime from a date by supplying a timezone:
ymd(20170131, tz = "UTC")
## [1] "20170131 UTC"
2.5.2 From individual components
Instead of a single string, sometimes you’ll have the individual components of the datetime spread across multiple columns. This is what we have in the flights data:
flights %>%
select(year, month, day, hour, minute)
## # A tibble: 336,776 x 5
## year month day hour minute
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 5 15
## 2 2013 1 1 5 29
## 3 2013 1 1 5 40
## 4 2013 1 1 5 45
## 5 2013 1 1 6 0
## 6 2013 1 1 5 58
## 7 2013 1 1 6 0
## 8 2013 1 1 6 0
## 9 2013 1 1 6 0
## 10 2013 1 1 6 0
## # … with 336,766 more rows
To create a date/time from this sort of input, use make_date()
for dates, or make_datetime()
for datetimes:
flights %>%
select(year, month, day, hour, minute) %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 x 6
## year month day hour minute departure
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 5 15 20130101 05:15:00
## 2 2013 1 1 5 29 20130101 05:29:00
## 3 2013 1 1 5 40 20130101 05:40:00
## 4 2013 1 1 5 45 20130101 05:45:00
## 5 2013 1 1 6 0 20130101 06:00:00
## 6 2013 1 1 5 58 20130101 05:58:00
## 7 2013 1 1 6 0 20130101 06:00:00
## 8 2013 1 1 6 0 20130101 06:00:00
## 9 2013 1 1 6 0 20130101 06:00:00
## 10 2013 1 1 6 0 20130101 06:00:00
## # … with 336,766 more rows
2.5.3 From other types
You may want to switch between a datetime and a date. That’s the job of as_datetime()
and as_date()
:
as_datetime(today())
## [1] "20210605 UTC"
## [1] "20210605"
Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch,” 19700101. If the offset is in seconds, use as_datetime()
; if it’s in days, use as_date()
.
as_datetime(60 * 60 * 10)
## [1] "19700101 10:00:00 UTC"
as_date(365 * 10 + 2)
## [1] "19800101"
2.5.4 Datetime components
Now that you know how to get datetime data into R’s datetime data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with datetimes.
You can pull out individual parts of the date with the accessor functions year()
, month()
, mday()
(day of the month), yday()
(day of the year), wday()
(day of the week), hour()
, minute()
, and second()
.
## [1] 2016
month(datetime)
## [1] 7
mday(datetime)
## [1] 8
yday(datetime)
## [1] 190
wday(datetime)
## [1] 6
For month()
and wday()
you can set label = TRUE
to return the abbreviated name of the month or day of the week. Set abbr = FALSE
to return the full name.
month(datetime, label = TRUE)
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
2.5.5 Setting components
You can create a new datetime with update()
.
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
## [1] "20200202 02:34:56 UTC"
If values are too big, they will rollover:
## [1] "20150302"
## [1] "20150217 16:00:00 UTC"
2.5.6 Time zones
Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don’t need to dig into all the details as they’re not all that important for data analysis. You can see the complete list of possible timezones with the function OlsonNames()
. Unless otherwise specified, lubridate always uses UTC (Coordinated Universal Time).
In R, the time zone is an attribute of the datetime that only controls printing. For example, these three objects represent the same instant in time:
(x1 < ymd_hms("20150601 12:00:00", tz = "America/New_York"))
## [1] "20150601 12:00:00 EDT"
(x2 < ymd_hms("20150601 18:00:00", tz = "Europe/Copenhagen"))
## [1] "20150601 18:00:00 CEST"
(x3 < ymd_hms("20150602 04:00:00", tz = "Pacific/Auckland"))
## [1] "20150602 04:00:00 NZST"
2.6 Combining Data
There are many ways to bring data together.
The bind_rows()
function is used to combine all the rows from two or more tibbles.
data_1 < tibble(x = 1:2,
y = c("A", "B"))
data_2 < tibble(x = 3:4,
y = c("C", "D"))
bind_rows(data_1, data_2)
## # A tibble: 4 x 2
## x y
## <int> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
2.6.1 Joins
Consider two tibbles: superheroes
and publishers
.
superheroes < tibble::tribble(
~name, ~gender, ~publisher,
"Magneto", "male", "Marvel",
"Storm", "female", "Marvel",
"Batman", "male", "DC",
"Catwoman", "female", "DC",
"Hellboy", "male", "Dark Horse Comics"
)
publishers < tibble::tribble(
~publisher, ~yr_founded,
"DC", 1934L,
"Marvel", 1939L,
"Image", 1992L
)
Note how easy it is to use tribble()
, from the tibble package (part of the Tidyverse) to create a tibble on the fly using text which organized for easy entry and reading. Recall that a double colon — ::
— is how we indicate that a function comes from a specific package.
2.6.1.1 inner_join()
inner_join(x, y)
: Return all rows from x
where there are matching values in y
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combination of the matches are returned.
inner_join(superheroes, publishers)
## Joining, by = "publisher"
## # A tibble: 4 x 4
## name gender publisher yr_founded
## <chr> <chr> <chr> <int>
## 1 Magneto male Marvel 1939
## 2 Storm female Marvel 1939
## 3 Batman male DC 1934
## 4 Catwoman female DC 1934
We lose Hellboy in the join because, although he appears in x = superheroes
, his publisher Dark Horse Comics does not appear in y = publishers
. The join result has all variables from x = superheroes
plus yr_founded
, from y
.
Note the message that we are ‘Joining, by = “publisher”.’ Whenever joining, R checks to see if there are variables in common between the two tibbles and, if there are, uses them to join. However, it is concerned that you may not be aware that this is what it is doing, so R tells you. Such messages are both annoying and a signal that we have not made our code as robust as we should. Fortunately, we can specify precisely which variables we want to join by. Always do this.
inner_join(superheroes, publishers, by = "publisher")
by
also takes a vector of key variables if you want to merge by multiple variables.
Now compare this result to that of using inner_join()
with the two datasets in opposite positions.
inner_join(publishers, superheroes, by = "publisher")
## # A tibble: 4 x 4
## publisher yr_founded name gender
## <chr> <int> <chr> <chr>
## 1 DC 1934 Batman male
## 2 DC 1934 Catwoman female
## 3 Marvel 1939 Magneto male
## 4 Marvel 1939 Storm female
In a way, this does illustrate multiple matches, if you think about it from the x = publishers
direction. Every publisher that has a match in y = superheroes
appears multiple times in the result, once for each match. In fact, we’re getting the same result as with inner_join(superheroes, publishers)
, up to variable order (which you should also never rely on in an analysis).
2.6.1.2 full_join()
full_join(x, y)
: Return all rows and all columns from both x
and y
. Where there are not matching values, returns NA
for the one missing.
full_join(superheroes, publishers, by = "publisher")
## # A tibble: 6 x 4
## name gender publisher yr_founded
## <chr> <chr> <chr> <int>
## 1 Magneto male Marvel 1939
## 2 Storm female Marvel 1939
## 3 Batman male DC 1934
## 4 Catwoman female DC 1934
## 5 Hellboy male Dark Horse Comics NA
## 6 <NA> <NA> Image 1992
We get all rows of x = superheroes
plus a new row from y = publishers
, containing the publisher Image. We get all variables from x = superheroes
AND all variables from y = publishers
. Any row that derives solely from one table or the other carries NA
s in the variables found only in the other table.
Because full_join()
returns all rows and all columns from both x
and y
the result of full_join(x = superheroes, y = publishers)
should match that of full_join(x = publishers, y = superheroes)
.
2.6.1.3 left_join()
left_join(x, y)
: Return all rows from x
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combination of the matches are returned.
left_join(superheroes, publishers, by = "publisher")
## # A tibble: 5 x 4
## name gender publisher yr_founded
## <chr> <chr> <chr> <int>
## 1 Magneto male Marvel 1939
## 2 Storm female Marvel 1939
## 3 Batman male DC 1934
## 4 Catwoman female DC 1934
## 5 Hellboy male Dark Horse Comics NA
We basically get x = superheroes
back, but with the addition of variable yr_founded
, which is unique to y = publishers
. Hellboy, whose publisher does not appear in y = publishers
, has an NA
for yr_founded
.
Now compare this result to that of running left_join(x = publishers, y = superheroes)
. Unlike inner_join()
and full_join()
the order of the arguments has a significant effect on the resulting tibble.
left_join(publishers, superheroes, by = "publisher")
## # A tibble: 5 x 4
## publisher yr_founded name gender
## <chr> <int> <chr> <chr>
## 1 DC 1934 Batman male
## 2 DC 1934 Catwoman female
## 3 Marvel 1939 Magneto male
## 4 Marvel 1939 Storm female
## 5 Image 1992 <NA> <NA>
We get a similar result as with inner_join()
but the publisher Image survives in the join, even though no superheroes from Image appear in y = superheroes
. As a result, Image has NA
s for name
and gender
.
There is a similar function, right_join(x, y)
that return all rows from y
, and all columns from x
and y
.
2.6.1.4 semi_join()
semi_join(x, y)
: Return all rows from x
where there are matching values in y
, keeping just columns from x
. A semi join differs from an inner join because an inner join will return one row of x
for each matching row of y
, where a semi join will never duplicate rows of x
. This is a filtering join.
semi_join(superheroes, publishers, by = "publisher")
## # A tibble: 4 x 3
## name gender publisher
## <chr> <chr> <chr>
## 1 Magneto male Marvel
## 2 Storm female Marvel
## 3 Batman male DC
## 4 Catwoman female DC
Compare the result of switching the values of the arguments.
semi_join(x = publishers, y = superheroes, by = "publisher")
## # A tibble: 2 x 2
## publisher yr_founded
## <chr> <int>
## 1 DC 1934
## 2 Marvel 1939
Now the effects of switching the x
and y
roles is more clear. The result resembles x = publishers
, but the publisher Image is lost, because there are no observations where publisher == "Image"
in y = superheroes
.
2.6.1.5 anti_join()
anti_join(x, y)
: Return all rows from x
where there are no matching values in y
, keeping just columns from x
.
anti_join(superheroes, publishers, by = "publisher")
## # A tibble: 1 x 3
## name gender publisher
## <chr> <chr> <chr>
## 1 Hellboy male Dark Horse Comics
We keep only Hellboy now (and do not get yr_founded
).
Now switch the arguments and compare the result.
anti_join(publishers, superheroes, by = "publisher")
## # A tibble: 1 x 2
## publisher yr_founded
## <chr> <int>
## 1 Image 1992
We keep only publisher Image now (and the variables found in x = publishers
).
2.6.2 Example
Consider the relationships among the tibbles in the nycflights package:
In both the flights
and airlines
data frames, the key variable we want to join/merge/match the rows by has the same name: carrier
. Let’s use inner_join()
to join the two data frames, where the rows will be matched by the variable carrier
flights %>%
inner_join(airlines, by = "carrier")
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 1 1004 1022
## 5 2013 1 1 554 600 6 812 837
## 6 2013 1 1 554 558 4 740 728
## 7 2013 1 1 555 600 5 913 854
## 8 2013 1 1 557 600 3 709 723
## 9 2013 1 1 557 600 3 838 846
## 10 2013 1 1 558 600 2 753 745
## # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # name <chr>
This is our first example of using a join function with a pipe. flights
is fed as the first argument to inner_join()
. That is what the pipe does. The above code is equivalent to:
inner_join(flights, airlines, by = "carrier")
The airports
data frame contains the airport codes for each airport:
airports
## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 80.6 1044 5 A America/New_Yo…
## 2 06A Moton Field Municipal A… 32.5 85.7 264 6 A America/Chicago
## 3 06C Schaumburg Regional 42.0 88.1 801 6 A America/Chicago
## 4 06N Randall Airport 41.4 74.4 523 5 A America/New_Yo…
## 5 09J Jekyll Island Airport 31.1 81.4 11 5 A America/New_Yo…
## 6 0A9 Elizabethton Municipal … 36.4 82.2 1593 5 A America/New_Yo…
## 7 0G6 Williams County Airport 41.5 84.5 730 5 A America/New_Yo…
## 8 0G7 Finger Lakes Regional A… 42.9 76.8 492 5 A America/New_Yo…
## 9 0P2 Shoestring Aviation Air… 39.8 76.6 1000 5 U America/New_Yo…
## 10 0S9 Jefferson County Intl 48.1 123. 108 8 A America/Los_An…
## # … with 1,448 more rows
However, if you look at both the airports
and flights
data frames, you’ll find that the airport codes are in variables that have different names. In airports
the airport code is in faa
, whereas in flights
the airport codes are in origin
and dest
.
In order to join these two data frames by airport code, our inner_join()
operation will use by = c("dest" = "faa")
thereby allowing us to join two data frames where the key variable has a different name.
flights %>%
inner_join(airports, by = c("dest" = "faa"))
## # A tibble: 329,174 x 26
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 600 6 812 837
## 5 2013 1 1 554 558 4 740 728
## 6 2013 1 1 555 600 5 913 854
## 7 2013 1 1 557 600 3 709 723
## 8 2013 1 1 557 600 3 838 846
## 9 2013 1 1 558 600 2 753 745
## 10 2013 1 1 558 600 2 849 851
## # … with 329,164 more rows, and 18 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
## # tzone <chr>
Let’s construct the chain of pipe operators %>%
that computes the number of flights from NYC to each destination, but also includes information about each destination airport:
flights %>%
group_by(dest) %>%
summarize(num_flights = n(),
.groups = "drop") %>%
arrange(desc(num_flights)) %>%
inner_join(airports, by = c("dest" = "faa")) %>%
rename(airport_name = name)
## # A tibble: 101 x 9
## dest num_flights airport_name lat lon alt tz dst tzone
## <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ORD 17283 Chicago Ohare Intl 42.0 87.9 668 6 A America…
## 2 ATL 17215 Hartsfield Jackson… 33.6 84.4 1026 5 A America…
## 3 LAX 16174 Los Angeles Intl 33.9 118. 126 8 A America…
## 4 BOS 15508 General Edward Law… 42.4 71.0 19 5 A America…
## 5 MCO 14082 Orlando Intl 28.4 81.3 96 5 A America…
## 6 CLT 14064 Charlotte Douglas … 35.2 80.9 748 5 A America…
## 7 SFO 13331 San Francisco Intl 37.6 122. 13 8 A America…
## 8 FLL 12055 Fort Lauderdale Ho… 26.1 80.2 9 5 A America…
## 9 MIA 11728 Miami Intl 25.8 80.3 8 5 A America…
## 10 DCA 9705 Ronald Reagan Wash… 38.9 77.0 15 5 A America…
## # … with 91 more rows
"ORD"
is the airport code of Chicago O’Hare airport and "FLL"
is the code for the main airport in Fort Lauderdale, Florida, which can be seen in the airport_name
variable.
2.7 Tidy data
Consider the first five rows from the drinks
data frame from the fivethirtyeight package:
## # A tibble: 5 x 5
## country beer_servings spirit_servings wine_servings total_litres_of_pure_a…
## <chr> <int> <int> <int> <dbl>
## 1 Afghanist… 0 0 0 0
## 2 Albania 89 132 54 4.9
## 3 Algeria 25 0 14 0.7
## 4 Andorra 245 138 312 12.4
## 5 Angola 217 57 45 5.9
After reading the help file by running ?drinks
, you’ll see that drinks
is a data frame containing results from a survey of the average number of servings of beer, spirits, and wine consumed in 193 countries. This data was originally reported on FiveThirtyEight.com in Mona Chalabi’s article: “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”.
Let’s apply some of the data wrangling verbs on the drinks
data frame:

filter()
thedrinks
data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia, then 
select()
all columns excepttotal_litres_of_pure_alcohol
by using the
sign, then 
rename()
the variablesbeer_servings
,spirit_servings
, andwine_servings
tobeer
,spirit
, andwine
, respectively.
We will save the resulting data frame in drinks_smaller
.
drinks_smaller < drinks %>%
filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>%
select(total_litres_of_pure_alcohol) %>%
rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
drinks_smaller
## # A tibble: 4 x 4
## country beer spirit wine
## <chr> <int> <int> <int>
## 1 China 79 192 8
## 2 Italy 85 42 237
## 3 Saudi Arabia 0 5 0
## 4 USA 249 158 84
2.7.1 Definition of “tidy” data
What does it mean for your data to be “tidy?” While “tidy” has a clear English meaning of “organized,” the word “tidy” in data science using R means that your data follows a standardized format.
“Tidy” data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
 Each variable forms a column.
 Each observation forms a row.
 Each type of observational unit forms a table.
2.7.2 Converting to “tidy” data
In this book so far, you’ve only seen data frames that were already in “tidy” format. Furthermore, for the rest of this book, you’ll mostly only see data frames that are already in “tidy” format as well. This is not always the case however with all datasets in the world. If your original data frame is in wide (non“tidy”) format and you would like to use the ggplot2 or dplyr packages, you will first have to convert it to “tidy” format. To do so, we recommend using the pivot_longer()
function in the tidyr package (Wickham 2021d).
Going back to our drinks_smaller
data frame from earlier:
drinks_smaller
## # A tibble: 4 x 4
## country beer spirit wine
## <chr> <int> <int> <int>
## 1 China 79 192 8
## 2 Italy 85 42 237
## 3 Saudi Arabia 0 5 0
## 4 USA 249 158 84
We convert it to “tidy” format by using the pivot_longer()
function from the tidyr package as follows:
drinks_smaller_tidy < drinks_smaller %>%
pivot_longer(names_to = "type",
values_to = "servings",
cols = country)
drinks_smaller_tidy
## # A tibble: 12 x 3
## country type servings
## <chr> <chr> <int>
## 1 China beer 79
## 2 China spirit 192
## 3 China wine 8
## 4 Italy beer 85
## 5 Italy spirit 42
## 6 Italy wine 237
## 7 Saudi Arabia beer 0
## 8 Saudi Arabia spirit 5
## 9 Saudi Arabia wine 0
## 10 USA beer 249
## 11 USA spirit 158
## 12 USA wine 84
Let’s dissect the arguments to pivot_longer()
.
the first argument
names_to
corresponds to the name of the variable in the new “tidy”/long data frame that will contain the column names of the original data. Observe how we setnames_to = "type"
. In the resultingdrinks_smaller_tidy
, the columntype
contains the three types of alcoholbeer
,spirit
, andwine
. Sincetype
is a variable name that doesn’t appear indrinks_smaller
, we use quotation marks around it. You’ll receive an error if you just usenames_to = type
here.The second argument
values_to
corresponds to the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we setvalues_to = "servings"
since each of the numeric values in each of thebeer
,wine
, andspirit
columns of thedrinks_smaller
data corresponds to a value ofservings
. In the resultingdrinks_smaller_tidy
, the columnservings
contains the 4 \(\times\) 3 = 12 numerical values. Note again thatservings
doesn’t appear as a variable indrinks_smaller
so it again needs quotation marks around it for thevalues_to
argument.The third argument
cols
is the columns in thedrinks_smaller
data frame you either want to or don’t want to “tidy.” Observe how we set this tocountry
indicating that we don’t want to “tidy” thecountry
variable indrinks_smaller
and rather onlybeer
,spirit
, andwine
. Sincecountry
is a column that appears indrinks_smaller
we don’t put quotation marks around it.
The third argument here of cols
is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output:
drinks_smaller %>%
pivot_longer(names_to = "type",
values_to = "servings",
cols = c(beer, spirit, wine))
Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine)
, instead of the columns we don’t want to “tidy” using country
. We use the c()
function to create a vector of the columns in drinks_smaller
that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller
data frame, we could also do the following for the cols
argument:
drinks_smaller %>%
pivot_longer(names_to = "type",
values_to = "servings",
cols = beer:wine)
Converting “wide” format data to “tidy” format often confuses new R users. The only way to learn to get comfortable with the pivot_longer()
function is with practice, practice, and more practice using different datasets. For example, run ?pivot_longer
and look at the examples in the bottom of the help file.
If however you want to convert a “tidy” data frame to “wide” format, you will need to use the pivot_wider()
function instead. Run ?pivot_wider
and look at the examples in the bottom of the help file for examples.
You can also view examples of both pivot_longer()
and pivot_wider()
on the tidyverse.org webpage. There’s a nice example to check out the different functions available for data tidying and a case study using data from the World Health Organization on that webpage. Furthermore, each week the R4DS Online Learning Community posts a dataset in the weekly #
TidyTuesday event that might serve as a nice place for you to find other data to explore and transform.
2.8 Distributions
A distribution is a function which shows the possible values of a variable and how often they occur.
Think of the distribution of a variable as an urn from which we can pull out, at random, values for that variable. Drawing a thousand or so values from that urn, and then looking at a histogram, can show where the values are centered and how the values vary. Because people are sloppy, they will use the word distribution to refer to both the (imaginary!) urn from which we are drawing values and to the list of values we have drawn. It is better, however, to keep three distinct ideas separate:
The unknown true distribution which, in reality, generates the data which we see. Outside of stylized examples in which we assume that a distribution follows a simple mathematical formula, we will never have access to the unknown true distribution. We can only estimate it, however imperfectly. This unknown true distribution is often referred to as the data generating mechanism, or DGM. It is a function or black box or urn which produces data. We can see the data. We can’t see the urn. Later in the Primer, once we have learned about posterior distributions, we will often refer to this as Preceptor’s Posterior.
The estimated distribution which, we think, generates the data which we see. Again, we can never know the unknown true distribution. But, by making some assumptions and using the data we have, we can estimate a distribution. Our estimate may be very close to the true distribution. Or it may be far away. The main task of data science to to create and use these estimated distributions. Almost always, these distributions are instantiated in computer code.
A vector of numbers drawn from the estimated distribution. Both true and estimated distributions can be complex beasties, difficult to describe accurately and in detail. But a vector of numbers drawn from a distribution is easy to understand and use. So, in general, we work with vectors of numbers. When someone — either a colleague or a piece of R code — has created a distribution which we want to use to answer a question, we don’t really want the distribution itself. We just want a vectors of “draws” from that distribution. Vectors are easy to work with! Complex computer code is not.
Again, people (including us!) will often be sloppy and use the same word, “distribution,” without making it clear whether they are talking about the true distribution, the estimated distribution, or a vector of draws from the estimated distribution. Try not to be sloppy.
2.8.1 Scaling a distribution
Consider the vector which is the result of rolling one die 10 times.
rolls < c(5, 5, 1, 5, 4, 2, 6, 2, 1, 5)
There are other ways of storing the data in this vector. Instead of recording every draw, we could just record the number of times each value appears.
table(rolls)
## rolls
## 1 2 4 5 6
## 2 2 1 4 1
In this case, with only 10 vlaues, it is actually less efficient to store the data like this. But what happens when we have 10,000 rolls.
## more_rolls
## 1 2 4 5 6
## 2000 2000 1000 4000 1000
Instead of keeping around a vector of length 10,000, we can just keep 10 values, without losing any information.
This example also highlights the fact that, graphically, two distributions can be identical even if they are of very different lengths.
The two vectors — rolls
and more_rolls
— have the exact same shape because, even though their lengths differ, they are the same thing. The total count for each value does not matter. What matters is the relative proportions.
Since shape is what matters, we will often “normalize” distributions so that the sum of the counts equals one, meaning that the yaxis is a percentage of the total. Example:
2.8.2 sample()
The most common distributions you will work with are empirical or frequency distributions, the values of age
in the trains
tibble, the values of poverty
in the kenya
tibble, and so on. But we can also create our own data by making “draws” from a distribution which we have concocted.
Consider the distribution of the possible values from rolling a fair die. We can use the sample()
function to create draws from this distribution, meaning it will change (or sometimes stay the same) for every subsequent draw.
## [1] 3
This produces one “draw” from the distribution of the possible values of one roll of fair sixsided die.
Now, suppose we wanted to roll this die 10 times. One of the arguments of the sample()
function is replace
. We must specify it as TRUE
if values can appear more than once. Since, when rolling a die 10 times, we expect that a value like 3 can appear more than once, we need to set replace = TRUE
.
sample(x = die, size = 10, replace = TRUE)
## [1] 5 6 3 3 3 4 3 6 4 5
In other words, rolling a 1 on the first roll should not preclude you from rolling a 1 on a later roll.
What if the die is not “fair,” meaning that some sides are more likely to appear than others? The final argument of the sample()
function is the prob
argument. It takes a vector (of the same length as the initial vector x
) that contains all of the probabilities of selecting each one of the elements of x
. Suppose that the probability of rolling a 1 was 0.5, and the probability of rolling any other value is 0.1. (These probabilities should sum to 1. If they don’t sample()
will automatically rescale them.)
## [1] 1 1 3 1 2 1 1 6 1 1
Remember: There is no real data here. We have not actually rolled a die. We have just made some assumptions about what would happen if we were to roll a die. With those assumptions we have built an urn — a data generating mechanism — from which we can draw as many values as we like. Let’s roll the unfair die 10,000 times.
tibble(result = sample(x = die,
size = 10000,
replace = TRUE,
prob = c(0.5, rep(0.1, 5)))) %>%
ggplot(aes(x = result)) +
geom_bar() +
labs(title = "Distribution of Results of an Unfair Die",
x = "Result of One Roll",
y = "Count") +
scale_x_continuous(breaks = 1:6,
labels = as.character(1:6)) +
scale_y_continuous(labels = scales::comma_format())
This makes the dual nature of distributions more clear. A distribution is the “thing” you see in this plot. A mathematical object with several different parts, including a set of possible values (1 through 6, in this case) and a record of the number of times each value appears. A distribution is also the simple vector of numbers we used to create this plot.
In general, we travel backandforth between distribution as a thing and distribution as a vector of draws from the thing, depending on what we are trying to accomplish.
sample()
is just one of many functions for creating draws — or, more colloquially, “drawing” — from a distribution. Three of the most important functions are: runif()
, rbinom()
, and rnorm()
.
2.8.3 runif()
Consider a “uniform” distribution. This is the case in which every outcome in the range of possible outcomes has the same chance of occurring. The function runif()
(spoken as “runif”) enables us to draw from a uniform contribution. runif()
has three arguments: n
, min
, and max
. runif()
will produce n
draws from between min
and max
, with each value having an equal chance of occurring.
runif(n = 10, min = 4, max = 6)
## [1] 5.6 5.0 4.6 5.7 4.9 5.2 4.3 5.6 4.8 5.6
Mathematically, we would write:
\[y_i \sim U(4, 6)\],
This means that the each value for \(y\) is drawn from a uniform distribution between four and six.
2.8.4 rbinom()
Consider binomial distribution, the case in which the probability of some Boolean variable (for instance success or failure) is calculated for repeated, independent trials. One common example would be the probability of flipping a coin and landing on heads. The function rbinom()
allows us to draw from a binomial distribution. This function takes three arguments, n
, size
, and prob
.
n
is the number of values we seek to draw.
size
is the number of trials for each n
.
*prob
is the probability of success on each trial.
Suppose we wanted to flip a fair coin one time, and let landing on heads represent success.
rbinom(n = 1 , size = 1, prob = 0.5)
## [1] 0
Do the same thing 100 times:
tibble(heads = rbinom(n = 100, size = 1, prob = 0.5)) %>%
ggplot(aes(x = heads)) +
geom_bar() +
labs(title = "Flipping a Fair Coin 100 Times",
x = "Result",
y = "Count") +
scale_x_continuous(breaks = c(0, 1),
labels = c("Tails", "Heads"))
In our graph above, we use the function scale_x_continuous()
because our xaxis variable is continuous, meaning it can take on any real values. The breaks
argument to scale_x_continuous()
converts our xaxis to having two different “tick marks.” There is a fairly even distribution of Tails and Heads. More draws would typically result in an even more equal split.
Randomness creates (inevitable) tension between distribution as a “thing” and distribution as a vector of draws from that thing. In this case, the vector of draws is not balanced between Tails and Heads. Yet, we “know” that it should be since the coin is, by definition, fair. In a sense, the mathematics require an even split. Yet, randomness means that the vector of draws will rarely match the mathematically “true” result. And that is OK! First, randomness is an intrinsic property of the real world. Second, we can make the effect of randomness be as small as we want by increasing the number of draws.
Suppose instead we wanted to simulate an unfair coin, where the probability of landing on Heads was 0.75 instead of 0.25.
tibble(heads = rbinom(n = 100, size = 1, prob = 0.75)) %>%
ggplot(aes(x = heads)) +
geom_bar() +
labs(title = "Flipping a Fair Coin 100 Times",
x = "Result",
y = "Count") +
scale_x_continuous(breaks = c(0, 1),
labels = c("Tails", "Heads"))
The distribution — the imaginary urn — from which we draw the results of a coin flip for a fair coin is a different distribution — a different imaginary urn — from the distribution for a biased coin. In fact, there are an infinite number of distributions. Yet as long as we can draw values from a distribution, we can work with it. Mathematics:
\[y_i \sim B(n, p)\].
Each value for \(y\) is drawn from a binomial distribution with parameters \(n\) for the number of trials and \(p\) for the probability of success.
Instead of each n
consisting of a single trial, we could have situation in which we are, 10,000 times, flipping a coin 10 times and summing up, for each experiment, the number of heads. In that case:
set.seed(9)
tibble(heads = rbinom(n = 10000, size = 10, prob = 0.5)) %>%
ggplot(aes(x = heads)) +
geom_bar() +
labs(title = "Flipping a Fair Coin 10 Times",
subtitle = "Extreme results are possible with enough experiments",
x = "Total Number of Heads in Ten Flips",
y = "Count") +
scale_x_continuous(breaks = 0:10)
2.8.5 rnorm()
The most important distribution is the normal distribution. Mathematics:
\[y_i \sim N(\mu, \sigma^2)\].
Each value \(y_i\) is drawn from a normal distribution with parameters \(\mu\) for the mean and \(\sigma\) for the standard deviation.
This bellshaped distribution is defined by two parameters: (1) the mean \(\mu\) (spoken as “mu”) which locates the center of the distribution and (2) the standard deviation \(\sigma\) (spoken as “sigma”) which determines the variation of values around that center. In the figure below, we plot three normal distributions where:
 The solid normal curve has mean \(\mu = 5\) & standard deviation \(\sigma = 2\).
 The dotted normal curve has mean \(\mu = 5\) & standard deviation \(\sigma = 5\).
 The dashed normal curve has mean \(\mu = 15\) & standard deviation \(\sigma = 2\).
Notice how the solid and dotted line normal curves have the same center due to their common mean \(\mu\) = 5. However, the dotted line normal curve is wider due to its larger standard deviation of \(\sigma = 5\). On the other hand, the solid and dashed line normal curves have the same variation due to their common standard deviation \(\sigma = 2\). However, they are centered at different locations.
When the mean \(\mu = 0\) and the standard deviation \(\sigma = 1\), the normal distribution has a special name. It’s called the standard normal distribution or the \(z\)curve.
Furthermore, if a variable follows a normal curve, there are three rules of thumb we can use:
 68% of values will lie within \(\pm\) 1 standard deviation of the mean.
 95% of values will lie within \(\pm\) 1.96 \(\approx\) 2 standard deviations of the mean.
 99.7% of values will lie within \(\pm\) 3 standard deviations of the mean.
Let’s illustrate this on a standard normal curve. The dashed lines are at 3, 1.96, 1, 0, 1, 1.96, and 3. These 7 lines cut up the xaxis into 8 segments. The areas under the normal curve for each of the 8 segments are marked and add up to 100%. For example:
 The middle two segments represent the interval 1 to 1. The shaded area above this interval represents 34% + 34% = 68% of the area under the curve. In other words, 68% of values.
 The middle four segments represent the interval 1.96 to 1.96. The shaded area above this interval represents 13.5% + 34% + 34% + 13.5% = 95% of the area under the curve. In other words, 95% of values.
 The middle six segments represent the interval 3 to 3. The shaded area above this interval represents 2.35% + 13.5% + 34% + 34% + 13.5% + 2.35% = 99.7% of the area under the curve. In other words, 99.7% of values.
The function rnorm()
(spoken as “rnorm”) returns draws from a normal distribution. rnorm()
has three arguments: n
, mean
, and sd
. n
corresponds to the number of draws, mean
and sd
are the \(\mu\) and \(\sigma\) of the distribution from which we want to draw. Again, imagine an urn filled with beads. Each bead has a number written on it. If the distribution is standard normal, then we can draw 10 beads from the urn by running:
rnorm(10)
## [1] 0.84 1.38 1.26 0.07 1.71 0.60 0.47 0.64 0.29 0.14
These 10 draws come from a distribution with the default mean
of 0 and the default sd
of 1. What if we create a histogram of the values?
tibble(value = rnorm(10)) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 10)
As you can see, it is not as symmetrical as the one displayed above. This is not surprising! If you just draw 10 beads from the urn, you can not possibly have a very good sense of what all the numbers on all the beads in the urn look like. What if we draw 100 values? What about 100,000?
tibble(value = rnorm(100)) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 10)
tibble(value = rnorm(100000)) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 1000)
Now it’s looking a lot more similar to the “truth,” although still imperfect.
Now, let’s compare normal distributions with varying means and standard deviations, which can be set using the mean and sd arguments included with the function.
tibble(rnorm_5_1 = rnorm(n = 1000, mean = 5, sd = 1),
rnorm_0_3 = rnorm(n = 1000, mean = 0, sd = 3),
rnorm_0_1 = rnorm(n = 1000, mean = 0, sd = 1)) %>%
pivot_longer(cols = everything(),
names_to = "distribution",
values_to = "value") %>%
ggplot(aes(x = value, fill = distribution)) +
geom_density(alpha = 0.5) +
labs(title = "Comparison of Normal Distributions with Differing Mean and Standard Deviation Values",
fill = "Distribution",
x = "Value",
y = "Density")
2.8.6 Working with draws
Once we have a vector of draws, we can examine various aspects of the distribution. Examples:
draws < rnorm(100, mean = 2, sd = 1)
This is a case in which there is no distinction between the true distribution and the estimated distribution. We know, by assumption, what the truth is.
Even though we know, because we wrote the code, that the draws come from a normal distribution with a mean of 2 and a standard deviation of 1, the calculated results will not match those values exactly because the draws themselves are random.
mean(draws)
## [1] 1.9
sd(draws)
## [1] 1
Note that we are more likely to use the median and the mad to summarize a distribution. In this case, they are very similar to the mean and standard deviation.
median(draws)
## [1] 1.9
mad(draws)
## [1] 1.1
In practice, we will not know the exact distribution which generates our data. (If we did know, then estimation would not be necessary.) The inherent randomness of the world means that calculated statistics will not match the underlying truth perfectly. But the more data that we collect, the closer the match will be.
In addition to the mean and standard deviation of the draws, we will often be interested in various quantiles of the distribution, most commonly because we want to create intervals which cover a specified portion of the draws. Examples:
## 25% 75%
## 1.2 2.7
## 5% 95%
## 0.4 3.5
## 2.5% 97.5%
## 0.15 3.80
Note that these draws come from a distribution which is centered around 2 rather than 0. There is nothing intrinsically special about any of these ranges. They are mere convention, especially the 95% interval.
Note how cavalier we are in sometimes using the word “distribution” and sometimes the word “draws.” These are two different things! The distribution is the underlying reality, which we will only know for certain when we create it ourselves, as in this example. The draws are a vector of numbers which, we assume, are “drawn” from some underlying distribution which, in general, we do not know.
By assumption, we can analyze the draws to make inferences about the distribution.
Although distributions (and the draws therefrom) are complex, we can often treat them in the same way that we treat simple numbers. For example, we can add two distributions together.
n < 100000
tibble(Normal = rnorm(n, mean = 1),
Uniform = runif(n, min = 2, max = 3),
Combined = Normal + Uniform) %>%
pivot_longer(cols = everything(),
names_to = "Distribution",
values_to = "draw") %>%
ggplot(aes(x = draw, fill = Distribution)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
alpha = 0.5,
bins = 100,
position = "identity") +
labs(title = "Two Distributions and Their Sum",
subtitle = "You can sum distributions just like you sum numbers",
x = "Value",
y = "Probability")
Drawing from a distribution also allows us to answer questions via simulation. For example, imagine that A and B are both flipping fair coins. A flips the coin 3 times. B flips the coin 6 times. What is the probability that A flips more heads than B?
It is obvious that B will win this game more often than A. It is also obvious that A will win some of the time. But in order to estimate the chances of A winning, we can simply simulate playing the game 1,000 times.
set.seed(56)
games < 1000
tibble(A_heads = rbinom(n = games, size = 3, prob = 0.5),
B_heads = rbinom(n = games, size = 6, prob = 0.5)) %>%
mutate(A_wins = ifelse(A_heads > B_heads, 1, 0)) %>%
summarize(A_chances = mean(A_wins))
## # A tibble: 1 x 1
## A_chances
## <dbl>
## 1 0.091
A has about a 9% chance of winning the game.
In data science, the most important kind of distribution is a probability distribution, a concept which we will introduce in Chapter 5.
2.9 Other Commands
Here are some topics which will prove important later in the Primer.
2.9.1 Matrices
Recall that a “matrix” in R is a rectangular array of data, shaped like a data frame or tibble, but containing only one type of data, e.g., numeric. Large matrices also print out ugly. (There are other differences, none of which we care about here.) Example:
## [,1] [,2] [,3]
## [1,] 3 9 0
## [2,] 4 12 15
## [3,] 8 13 1
The easiest way to pull information from a matrix is to use [ ]
, the subset operator. Here is how we grab the second and third columns of m
:
m[, 2:3]
## [,1] [,2]
## [1,] 9 0
## [2,] 12 15
## [3,] 13 1
Note, however, that matrices with just one dimension “collapse” into single vectors:
m[, 2]
## [1] 9 12 13
Tibbles, on the other hand, always maintain their rectangular shapes, even with only one column or row.
We can turn matrices into tibbles with as_tibble()
.
m %>%
as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## # A tibble: 3 x 3
## V1 V2 V3
## <dbl> <dbl> <dbl>
## 1 3 9 0
## 2 4 12 15
## 3 8 13 1
Because m
does not have column names, as_tibble()
creates its won variables names, using “V” for variable.
2.9.2 Missing Values
Some observations in a tibble are blank. These are called missing values, and they are often marked as NA
. We can create such a tibble as follows:
tbl < tribble(
~ a, ~ b, ~ c,
2, 3, 5,
4, NA, 8,
NA, 7, 9,
)
tbl
## # A tibble: 3 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 2 3 5
## 2 4 NA 8
## 3 NA 7 9
The presence of NA
values can be problematic.
tbl %>%
summarize(avg_a = mean(a))
## # A tibble: 1 x 1
## avg_a
## <dbl>
## 1 NA
Fortunately, most R functions take an argument, na.rm
, which, when set to TRUE
, removes NA values from any calculations.
tbl %>%
summarize(avg_a = mean(a, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_a
## <dbl>
## 1 3
Another approach is to use drop_na()
.
tbl %>%
drop_na(a)
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 2 3 5
## 2 4 NA 8
Be careful, however, if you use drop_na()
without a specific variable provided. In that case, you will remove all rows with a missing value for any variable in the tibble.
tbl %>%
drop_na()
## # A tibble: 1 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 2 3 5
A final approach is to use is.na()
to explicitly determine if a value is missing.
## # A tibble: 3 x 4
## a b c a_missing
## <dbl> <dbl> <dbl> <lgl>
## 1 2 3 5 FALSE
## 2 4 NA 8 FALSE
## 3 NA 7 9 TRUE
2.9.3 Working by rows
Tibbles and the main Tidyverse functions are designed to work by columns. You do something to all the values for variable a
. But, sometimes, we want to work across
the tibble, comparing the value for a
in the first row to the value of b
in the first row, and so on. To do that, we need two tricks. First, we use rowwise()
to inform R that the next set of commands should be executed across the rows.
tbl %>%
rowwise()
## # A tibble: 3 x 3
## # Rowwise:
## a b c
## <dbl> <dbl> <dbl>
## 1 2 3 5
## 2 4 NA 8
## 3 NA 7 9
Note how “# Rowwise:” is printed out. Having set up the pipe to work across rows, we need to pass c_across()
to whichever function we are using, generally specifying the variables we want to use. If we don’t provide any arguments to c_across()
, it will use all the columns in the tibble.
tbl %>%
rowwise() %>%
mutate(sum_a_c = sum(c_across(c(a, c)))) %>%
mutate(largest = max(c_across())) %>%
mutate(largest_na = max(c_across(), na.rm = TRUE))
## # A tibble: 3 x 6
## # Rowwise:
## a b c sum_a_c largest largest_na
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 3 5 7 7 7
## 2 4 NA 8 12 NA 12
## 3 NA 7 9 NA NA 9
2.10 Summary
Data science is data cleaning.
Real data is nasty.
This chapter covered many, many commands. You should have them all memorized by now.
No! That is ridiculous. We don’t have them all memorized. Why should you? The point of this chapter was to give you a tour of what you can do in R and how to do it. With that information, you have a base from which to try to solve the problems you will encounter in the future.
The key data science concept from this chapter is, again, the idea of a “distribution.” The word distribution is used in two very different ways. First, a distribution is an invisible object that you can never use of touch. It is the imaginary urn from which you can take draws. Only in very special cases will you ever be able to “know” what the distribution is, mainly the case where there is a physical process, like a roulette wheel, which you can inspect or in the case of an assumed mathematical formula. But, in almost all real world data science problems, the “distribution” is a mental creation whose reality you can never confirm.
The second way that the word distribution is used is to refer to a vector of values, a variable in an R tibble. The 115 ages in trains
are a distribution as are 1,000 draws from rnorm()
.
Whether “distribution” means the imaginary object or a vector of numbers drawn from that imaginary object depends on context.