# 3 Data

This chapter is a draft.

## 3.1 Introduction

Start by loading the packages which we will need in this chapter.

library(tidyverse)
library(primer.data)
library(lubridate)
library(janitor)
library(skimr)
library(nycflights13)
library(gapminder)
library(fivethirtyeight)

## 3.2 Absolute and relative file paths

When you load a data set into R, you first need to tell R where those files live. The file could live on your computer (local) or somewhere on the internet (remote). In this section, we will discuss the case where the file lives on your computer.

The place where the file lives on your computer is called the “path.” You can think of the path as directions to the file. There are two kinds of paths: relative paths and absolute* paths. A relative path is where the file is with respect to where you currently are on the computer. On the other hand, an absolute path is where the file is in respect to the base (or root) folder of the computer’s filesystem.

Consider the a csv file called happiness_report.csv.

Reading happiness_report.csv using a relative path:

happiness_data <- read_csv("data/happiness_report.csv")

Reading happiness_report.csv using an absolute path:

happiness_data <- read_csv("/home/preceptor/gov-1005/worksheet_02/data/happiness_report.csv")

So which one should you use? Generally speaking, to ensure your code can be run on a different computer, you should use relative paths. An added bonus is that it’s also less typing! This is because the absolute path of a file (the names of folders between the computer’s root / and the file) isn’t usually the same across different computers. For example, suppose Fatima and Jayden are working on a project together on the happiness_report.csv data. Fatima’s file is stored at

/home/Fatima/project/data/happiness_report.csv,

while Jayden’s is stored at

/home/Jayden/project/data/happiness_report.csv.

Even though Fatima and Jayden stored their files in the same place on their computers (in their home folders), the absolute paths are different due to their different usernames. If Jayden has code that loads the happiness_report.csv data using an absolute path, the code won’t work on Fatima’s computer. But the relative path from inside the project folder (data/happiness_report.csv) is the same on both computers; any code that uses relative paths will work on both!

See this video for another explanation:

Source: Udacity course “Linux Command Line Basics”

## 3.3 Writing and Reading Files

Getting data into and out of R is a major part of any real world data science project.

### 3.3.1 CSV files

“CSV” stands for comma separated value. In other words, csv files are files whose values are separated by commas. Each comma from the csv file corresponds to a column, and the column names are taken from the first line of the file. The function then “guesses” an appropriate data type for each of the columns.

In the real word, you often want to write a data frame that has changed (either through filtering, selecting, mutating or summarizing) to a file to share it with others. We use write_csv() to save a data frame to a .csv file. write_csv() has two main arguments: x and path. The x argument is the data set that you want to save. The file argument is the file path where you want to save the file. The end of the path argument is the name that you want to use for the file.

We use read_csv() from the readr package to load the data into R, and in that call we specify the relative path to the file

Consider the following csv file “test_2.csv.”

file_1 <- "https://raw.githubusercontent.com/PPBDS/primer.tutorials/master/inst/www/test_2.csv"

read_csv(file_1)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Top two rows consist of junk which = col_character()
## )
## Warning: 3 parsing failures.
## row col  expected    actual                                                                                  file
##   2  -- 1 columns 3 columns 'https://raw.githubusercontent.com/PPBDS/primer.tutorials/master/inst/www/test_2.csv'
##   3  -- 1 columns 3 columns 'https://raw.githubusercontent.com/PPBDS/primer.tutorials/master/inst/www/test_2.csv'
##   4  -- 1 columns 3 columns 'https://raw.githubusercontent.com/PPBDS/primer.tutorials/master/inst/www/test_2.csv'
## # A tibble: 4 x 1
##   Top two rows consist of junk which
##   <chr>
## 1 we don't care about. Data starts on row 3.
## 2 a
## 3 9
## 4 4

As you can see, there is text at the top of this file. Often times information about how data was collected, or other relevant information, is included at the top of the data file. This does not allow us to correctly load the data into R, and it is not intended to be shown.

We can use the skip argument to skip the text lines.

read_csv(file = file_1,
skip = 2)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   a = col_double(),
##   b = col_double(),
##   c = col_double()
## )
## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     9     8     7
## 2     4     5     6

Do you see the column specification message? As of now, R “guesses” an appropriate data type for each of the columns. To get rid of this message, we use the col_types() argument and specify the data types.

read_csv(file = file_1,
skip = 1,
col_types = cols(a = col_double(),
b = col_double(),
c = col_double()))
## Warning: The following named parsers don't match the column names: a, b, c
## Warning: 3 parsing failures.
## row col  expected    actual                                                                                  file
##   1  -- 1 columns 3 columns 'https://raw.githubusercontent.com/PPBDS/primer.tutorials/master/inst/www/test_2.csv'
##   2  -- 1 columns 3 columns 'https://raw.githubusercontent.com/PPBDS/primer.tutorials/master/inst/www/test_2.csv'
##   3  -- 1 columns 3 columns 'https://raw.githubusercontent.com/PPBDS/primer.tutorials/master/inst/www/test_2.csv'
## # A tibble: 3 x 1
##   we don't care about. Data starts on row 3.
##   <chr>
## 1 a
## 2 9
## 3 4

When our tabular data comes in a different format, we can use the read_delim() function instead. For example, a different version of “test_2.csv” could exist that has no column names and uses tabs as the delimiter instead of commas.

With read_delim(), we specify the first argument as the path to the file (as done with read_csv). Then we provide values to the delim argument (could be a tab, which we represent by ").

### 3.3.2 Excel Files

Excel is a spreadsheet program that use tables to analyze, store, or manipulate data. The tables are composed of cells which include text, numbers, or formulas. Excel files have the filename extension .xlsx. They store additional things that you cannot store in a .csv file such as fonts, text formatting, graphics, etc.

In order to write excel files you have to install complex packages, and they are hard to create. Writing excel files is beyond the scope of Primer.

Reading Excel files is easy. To do so, we use the read_excel() function from the readxl package.

library(readxl)

# Unfortunately, it is not possible to read Excel files directly from the web.
# So we download the file by hand and then read it in from the current working
# directory. Note that the "proper" way of handling this would be to create a
# and then delete the temp directory. That way, you would not have random
# downloaed files hanging around.

destfile = "example_excel.xlsx")

read_excel(path = "example_excel.xlsx")
## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

If the .xlsx file has multiple sheets, you have to use the sheet argument to specify the sheet number or name.

### 3.3.3 RDS

rds files store a single R object within a file.

When we save to a rds file, we use the function write_rds(). Just like write_csv(), this function has two main arguments: x and file. The x argument is the data set that you want to save. The file argument is the file path where you want to save the file. The end of the path argument is the name that you want to use for the file.

read_rds() reads the file back into R. Just like read_csv() read_rds has one main argument, which is the path to the file that you are wanting to read into R.

# need example here

### 3.3.4 JSON

An increasingly common format for sharing data is JavaScript Object Notation or JSON. Because this format is very general, it is nothing like a spreadsheet. Note that JSON files are often made available via the internet. Several organizations provide a JSON API or a web service that you can connect directly to and from which you can obtain data.

The functions fromJSON() and toJSON() allow you to convert between R objects and JSON. Both functions come from the jsonlite package.

The function toJSON() converts a tibble to JSON format.

##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
##     flatten
## [
##   {
##     "name": "Miguel",
##     "student_id": 1,
##     "exam_1": 85,
##     "exam_2": 86
##   },
##   {
##     "name": "Sofia",
##     "student_id": 2,
##     "exam_1": 94,
##     "exam_2": 93
##   },
##   {
##     "name": "Aya",
##     "student_id": 3,
##     "exam_1": 87,
##     "exam_2": 88
##   },
##   {
##     "name": "Cheng",
##     "student_id": 4,
##     "exam_1": 90,
##     "exam_2": 91
##   }
## ]

The function fromJSON() converts JSON format to a tibble.

##     Name Age Occupation
## 1  Mario  32    Plumber
## 2  Peach  21   Princess
## 3   <NA>  NA       <NA>
## 4 Bowser  NA      Koopa

## 3.4 Reading data from a database

### 3.4.1 Reading data from a SQLite database

SQLite is probably the simplest relational database that one can use in combination with R. SQLite databases are self-contained and usually stored and accessed locally on one computer. Data is usually stored in a file with a .db extension. Similar to Excel files, these are not plain text files and cannot be read in a plain text editor.

The first thing you need to do to read data into R from a database is to connect to the database. We do that using dbConnect() function from the DBI (database interface) package. This does not read in the data, but simply tells R where the database is and opens up a communication channel.

Often relational databases have many tables, and their power comes from the useful ways they can be joined. Thus anytime you want to access data from a relational database, you need to know the table names. You can get the names of all the tables in the database using dbListTables().

We only get one table name returned, which tells us that there is only one table in this database. To reference a table in the database to do things like select columns and filter rows, we use the tbl() function from the dbplyr package. The package dbplyr allows us to work with data stored in databases as if they were local data frames, which is useful because we can do a lot with big datasets without actually having to bring these vast amounts of data into your computer!

Although it looks like we just got a data frame from the database, we didn’t! It’s a reference, showing us data that is still in the SQLite database (note the first two lines of the output). It does this because databases are often more efficient at selecting, filtering and joining large data sets than R. And typically, the database will not even be stored on your computer, but rather a more powerful machine somewhere on the web. So R is lazy and waits to bring this data into memory until you explicitly tell it. To do so, we use the collect() function.

Here we will filter for only rows in the Aboriginal languages category according to the 2016 Canada Census, and then use collect() to finally bring this data into R as a data frame.

Why bother to use the collect() function? The data looks pretty similar in both outputs shown above. And dbplyr provides lots of functions similar to filter() that you can use to directly feed the database reference (i.e. what tbl() gives you) into downstream analysis functions (e.g., ggplot2 for data visualization and lm for linear regression modeling). However, this does not work in every case; look what happens when we try to use nrow to count rows in a data frame:

or tail to preview the last 6 rows of a data frame:

tail(aboriginal_lang_db)
## Error: tail() is not supported by sql sources

Additionally, some operations will not work to extract columns or single values from the reference given by the tbl function. Thus, once you have finished your data wrangling of the tbl() database reference object, it is advisable to bring it into your local machine’s memory using collect() as a data frame.

Warning: Usually, databases are very big! Reading the object into your local machine may give an error or take a lot of time to run so be careful if you plan to do this!

### 3.4.2 Reading data from a PostgreSQL database

PostgreSQL (also called Postgres) is a very popular and open-source option for relational database software. Unlike SQLite, PostgreSQL uses a client–server database engine, as it was designed to be used and accessed on a network. This means that you have to provide more information to R when connecting to Postgres databases. The additional information that you need to include when you call the dbConnect() is listed below:

• dbname - the name of the database (a single PostgreSQL instance can host more than one database)
• host - the URL pointing to where the database is located
• port - the communication endpoint between R and the PostgreSQL database (this is typically 5432 for PostgreSQL)
• user - the username for accessing the database
• password - the password for accessing the database

Additionally, we must use the RPostgres package instead of RSQLite in the dbConnect() function call. Below we demonstrate how to connect to a version of the can_mov_db database, which contains information about Canadian movies (note - this is a synthetic, or artificial, database).

library(RPostgres)
can_mov_db_con <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db",
host = "r7k3-mds1.stat.ubc.ca", port = 5432,
user = "user0001", password = '################')

### 3.4.3 Interacting with a database

After opening the connection, everything looks and behaves almost identically to when we were using an SQLite database in R. For example, we can again use dbListTables() to find out what tables are in the can_mov_db database:

dbListTables(can_mov_db_con)
 [1] "themes"            "medium"           "titles"     "title_aliases"       "forms"
[6] "episodes"          "names"      "names_occupations" "occupation"       "ratings" 

We see that there are 10 tables in this database. Let’s first look at the "ratings" table to find the lowest rating that exists in the can_mov_db database:

ratings_db <- tbl(can_mov_db_con, "ratings")
ratings_db
# Source:   table<ratings> [?? x 3]
# Database: postgres [user0001@r7k3-mds1.stat.ubc.ca:5432/can_mov_db]
<chr>                    <dbl>     <int>
1 The Grand Seduction       6.6       150
2 Rhymes for Young Ghouls   6.3      1685
3 Mommy                     7.5      1060
4 Incendies                 6.1      1101
5 Bon Cop, Bad Cop          7.0       894
6 Goon                      5.5      1111
7 Monsieur Lazhar           5.6       610
8 What if                   5.3      1401
9 The Barbarian Invations   5.8        99
10 Away from Her             6.9      2311
# … with more rows

To find the lowest rating that exists in the data base, we first need to extract the average_rating column using select():

avg_rating_db <- select(ratings_db, average_rating)
avg_rating_db
# Source:   lazy query [?? x 1]
# Database: postgres [user0001@r7k3-mds1.stat.ubc.ca:5432/can_mov_db]
average_rating
<dbl>
1            6.6
2            6.3
3            7.5
4            6.1
5            7.0
6            5.5
7            5.6
8            5.3
9            5.8
10            6.9
# … with more rows

Next we use min() to find the minimum rating in that column:

min(avg_rating_db)
Error in min(avg_rating_db) : invalid 'type' (list) of argument

Instead of the minimum, we get an error! This is another example of when we need to use the collect() function to bring the data into R for further computation:

avg_rating_data <- collect(avg_rating_db)
min(avg_rating_data)
[1] 1

We see the lowest rating given to a movie is 1, indicating that it must have been a really bad movie…

Why should we bother with databases at all?

Opening a database stored in a .db file involved a lot more effort than just opening a .csv, .tsv, or any of the other plain text or Excel formats. It was a bit of a pain to use a database in that setting since we had to use dbplyr to translate tidyverse-like commands (filter, select, head, etc.) into SQL commands that the database understands. Not all tidyverse commands can currently be translated with SQLite databases. For example, we can compute a mean with an SQLite database but can’t easily compute a median. So you might be wondering why should we use databases at all?

Databases are beneficial in a large-scale setting:

• they enable storing large data sets across multiple computers with automatic redundancy and backups
• they allow multiple users to access them simultaneously and remotely without conflicts and errors
• they provide mechanisms for ensuring data integrity and validating input
• they provide security to keep data safe For example, there are billions of Google searches conducted daily.

## 3.5 Webscraping

In the first part of this chapter, we learned how to read in data from plain text files that are usually “rectangular” in shape using the tidyverse read_* functions. Sadly, not all data comes in this simple format, but we can happily use many other tools to read in more messy/wild data formats. The formal name for gathering non-rectangular data from the web and transforming it into a more useful format for data analysis is web scraping.

### 3.5.1 HTML and CSS selectors

Before we jump into scraping, let’s learn a little bit about what the “source code” of a website looks like. Say we are interested in knowing the average rental price (per square footage) of the most recently available one-bedroom apartments in Vancouver from https://vancouver.craigslist.org. When we visit the Vancouver Craigslist website and search for one-bedroom apartments, this is what we are shown:

From that page, it’s pretty easy for our human eyes to find the apartment price and square footage. But how can we do this programmatically, so we don’t have to copy and paste all these numbers? Well, we have to deal with the webpage source code, which we show a snippet of below (and link to the entire source code here):

        <span class="result-meta">
<span class="result-price">$800</span> <span class="housing"> 1br - </span> <span class="result-hood"> (13768 108th Avenue)</span> <span class="result-tags"> <span class="maptag" data-pid="6786042973">map</span> </span> <span class="banish icon icon-trash" role="button"> <span class="screen-reader-text">hide this posting</span> </span> <span class="unbanish icon icon-trash red" role="button" aria-hidden="true"></span> <a href="#" class="restore-link"> <span class="restore-narrow-text">restore</span> <span class="restore-wide-text">restore this posting</span> </a> </span> </p> </li> <li class="result-row" data-pid="6788463837"> <a href="https://vancouver.craigslist.org/nvn/apa/d/north-vancouver-luxury-1-bedroom/6788463837.html" class="result-image gallery" data-ids="1:00U0U_lLWbuS4jBYN,1:00T0T_9JYt6togdOB,1:00r0r_hlMkwxKqoeq,1:00n0n_2U8StpqVRYX,1:00M0M_e93iEG4BRAu,1:00a0a_PaOxz3JIfI,1:00o0o_4VznEcB0NC5,1:00V0V_1xyllKkwa9A,1:00G0G_lufKMygCGj6,1:00202_lutoxKbVTcP,1:00R0R_cQFYHDzGrOK,1:00000_hTXSBn1SrQN,1:00r0r_2toXdps0bT1,1:01616_dbAnv07FaE7,1:00g0g_1yOIckt0O1h,1:00m0m_a9fAvCYmO9L,1:00C0C_8EO8Yl1ELUi,1:00I0I_iL6IqV8n5MB,1:00b0b_c5e1FbpbWUZ,1:01717_6lFcmuJ2glV"> <span class="result-price">$2285</span>
</a>

<p class="result-info">
<span class="icon icon-star" role="button">
</span>

<time class="result-date" datetime="2019-01-06 12:06" title="Sun 06 Jan 12:06:01 PM">Jan  6</time>

<a href="https://vancouver.craigslist.org/nvn/apa/d/north-vancouver-luxury-1-bedroom/6788463837.html" data-id="6788463837" class="result-title hdrlnk">Luxury 1 Bedroom CentreView with View - Lonsdale</a>



This is not easy for our human eyeballs to read! However, it is easy for us to use programmatic tools to extract the data we need by specifying which HTML tags (things inside < and > in the code above). For example, if we look in the code above and search for lines with a price, we can also look at the tags that are near that price and see if there’s a common “word” we can use that is near the price but doesn’t exist on other lines that have the information we are not interested in:

<span class="result-price">$800</span> and <span class="result-price">$2285</span>

What we can see is there is a special “word” here, “result-price,” which appears only on the lines with prices and not on the other lines (that have information we are not interested in). This special word and the context in which is is used (learned from the other words inside the HTML tag) can be combined to create something called a CSS selector. The CSS selector can then be used by R’s rvest package to select the information we want (here price) from the website source code.

Many websites are quite large and complex, which means the website source code is large as well. And as you saw above, it is not easy to read and pick out the special words we want with our human eyeballs. So to make this easier, we will use the SelectorGadget tool. It is an open source tool that simplifies generating and finding CSS selectors. We recommend you use the Chrome web browser to use this tool, and install the selector gadget tool from the Chrome Web Store. Here is a short video on how to install and use the SelectorGadget tool to get a CSS selector for use in web scraping:

From installing and using the selectorgadget as shown in the video above, we get the two CSS selectors .housing and .result-price that we can use to scrape information about the square footage and the rental price, respectively. The selector gadget returns them to us as a comma separated list (here .housing , .result-price), which is exactly the format we need to provide to R if we are using more than one CSS selector.

### 3.5.2 Are you allowed to scrape that website?

“You agree not to copy/collect CL content via robots, spiders, scripts, scrapers, crawlers, or any automated or manual equivalent (e.g., by hand).”

Want to learn more about the legalities of web scraping and crawling? Read this interesting blog post titled “Web Scraping and Crawling Are Perfectly Legal, Right?” by Benoit Bernard (this is optional, not required reading).

So what to do now? Well, we shouldn’t scrape Craigslist! Let’s instead scrape some data on the population of Canadian cities from Wikipedia (who’s Terms of Service document does not explicitlty say do not scrape). In this video below we demonstrate using the selectorgadget tool to get CSS Selectors from Wikipedia’s Canada page to scrape a table that contains city names and their populations from the 2016 Canadian Census:

### 3.5.3 Using rvest

Now that we have our CSS selectors we can use the rvest R package to scrape our desired data from the website. First we start by loading the rvest package:

library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
##     guess_encoding

Next, we tell R what page we want to scrape by providing the webpage’s URL in quotations to the function read_html:

page <- read_html("https://en.wikipedia.org/wiki/Canada")

Then we send the page object to the html_nodes() function. We also provide that function with the CSS selectors we obtained from the selectorgadget tool. These should be surrounded by quotations. The html_nodes function select nodes from the HTML document using CSS selectors. Nodes are the HTML tag pairs as well as the content between the tags. For our CSS selector td:nth-child(5) and example node that would be selected would be: <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_Ontario" title="London, Ontario">London</a></td>

We will use head() here to limit the print output of these vectors to 6 lines.

population_nodes <- html_nodes(page, "td:nth-child(5) , td:nth-child(7) , .infobox:nth-child(122) td:nth-child(1) , .infobox td:nth-child(3)")
head(population_nodes)
## {xml_nodeset (6)}
## [1] <td style="text-align:right;">5,928,040</td>
## [2] <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_On ...
## [3] <td style="text-align:right;">494,069\n</td>
## [4] <td style="text-align:right;">4,098,927</td>
## [5] <td style="text-align:left;background:#f0f0f0;">\n<a href="/wiki/St._Cath ...
## [6] <td style="text-align:right;">406,074\n</td>

Next we extract the meaningful data from the HTML nodes using the html_text() function. For our example, this functions only required argument is the an html_nodes object, which we named rent_nodes. In the case of this example node: <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_Ontario" title="London, Ontario">London</a></td>, the html_text function would return London.

population_text <- html_text(population_nodes)
head(population_text)
## [1] "5,928,040"              "London"                 "494,069\n"
## [4] "4,098,927"              "St. Catharines–Niagara" "406,074\n"

Are we done? Not quite… If you look at the data closely you see that the data is not in an optimal format for data analysis. Both the city names and population are encoded as characters in a single vector instead of being in a data frame with one character column for city and one numeric column for population (think of how you would organize the data in a spreadsheet). Additionally, the populations contain commas (not useful for programmatically dealing with numbers), and some even contain a line break character at the end (\n).

## 3.6 Working with APIs

“API” stands for Application Program Interface. They allow us to access open data from government agencies, companies, and other organizations. API provides the rules for software applications to interact with one another. Open data APIs provide the rules you need to know to write R code to request and pull data from the organization’s web server into R. Usually, some of the computational burden of querying and subsetting the data is taken on by the source’s server, to create the subset of requested data to pass to your computer. In practice, this means you can often pull the subset of data you want from a very large available dataset without having to download the full dataset and load it locally into your R session.

As an overview, the basic steps for accessing and using data from a web API when working in R are:

• Figure out the API rules for HTTP requests
• Write R code to create a request in the proper format
• Send the request using GET or POST HTTP methods
• Once you get back data from the request, parse it into an easier-to-use format if necessary

To get the data from an API, you should first read the organization’s API documentation. An organization will post details on what data is available through their API(s), as well as how to set up HTTP requests to get that data. To request the data through the API, you will typically need to send the organization’s web server an HTTP request using a GET or POST method. The API documentation details will typically show an example GET or POST request for the API, including the base URL to use and the possible query parameters that can be used to customize the dataset request.

Here is an example:

The National Aeronautics and Space Administration (NASA) has an API for pulling the Astronomy Picture of the Day. In their API documentation, they specify that the base URL for the API request should be https://api.nasa.gov/planetary/apod and that you can include parameters to specify the date of the daily picture you want, whether to pull a high-resolution version of the picture, and a NOAA API key you have requested from NOAA.

Many organizations will require you to get an API key and use this key in each of your API requests. This key allows the organization to control API access, including enforcing rate limits per user. API rate limits restrict how often you can request data (such as an hourly limit of 1,000 requests per user for NASA APIs).

API keys should be kept private, so if you are writing code that includes an API key, be very careful not to include the actual key in any code that is public (even any code in public GitHub repositories). To ensure privacy, save the value of your key in a file named .Renviron in your home directory. This file should be a plain text file and must end in a blank line. Once you’ve saved your API key to a global variable in that file (e.g., with a line added to the .Renviron file like NOAA_API_KEY = “abdafjsiopnab038”), you can assign the key value to an R object in an R session using the Sys.getenv function (e.g., noaa_api_key <- Sys.getenv(“NOAA_API_KEY”)), and then use the object noaa_api_key anywhere you would otherwise have used the character string with your API key.

To find more R packages for accessing and exploring open data, check out the Open Data CRAN task view. You can also browse through the ROpenSci packages, all of which have GitHub repositories where you can further explore how each package works! ROpenSci is an organization with the mission to create open software tools for science. If you create your own package to access data relevant to scientific research through an API, consider submitting it for peer-review through ROpenSci.

The riem package, developed by Maelle Salmon and an ROpenSci package, is an excellent and straightforward example of how you can use R to pull open data through a web API. This package allows you to pull weather data from airports around the world directly from the Iowa Environmental Mesonet. To show you how to pull data into R through an API, in this section we will walk you through code in the riem package or code based closely on code in the package.

To get a certain set of weather data from the Iowa Environmental Mesonet, you can send an HTTP request specifying a base URL, https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py/, as well as some parameters describing the subset of dataset you want (e.g., date ranges, weather variables, output format). Once you know the rules for the names and possible values of these parameters (more on that below), you can submit an HTTP GET request using the functionGET() from the httr package.

When you are making an HTTP request using the GET() or POST() functions from the httr package, you can include the key-value pairs for any query parameters as a list object in the query argument of the function. For example, suppose you want to get wind speed in miles per hour (data = “sped”) for Denver, CO, (station = “DEN”) for the month of June 2016 (year1 = “2016,” month1 = “6,” etc.) in Denver’s local time zone (tz = “America/Denver”) and in a comma-separated file (format = “comma”). To get this weather dataset, you can run:

library(httr)
meso_url <- "https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py/"
denver <- GET(url = meso_url,
query = list(station = "DEN",
data = "sped",
year1 = "2016",
month1 = "6",
day1 = "1",
year2 = "2016",
month2 = "6",
day2 = "30",
tz = "America/Denver",
format = "comma")) %>%
content() %>%
read_csv(skip = 5, na = "M")

# There are 9,106 rows of data to look at! Let's just look at subset for our purposes.
denver %>%
slice(1:3)
## # A tibble: 3 x 3
##   station valid                sped
##   <chr>   <dttm>              <dbl>
## 1 DEN     2016-06-01 00:00:00   9.2
## 2 DEN     2016-06-01 00:05:00   9.2
## 3 DEN     2016-06-01 00:10:00   6.9

The content() call extracts the content from the response to the HTTP request sent by the GET() function. The Iowa Environmental Mesonet API offers the option to return the requested data in a comma-separated file (format = “comma” in the GET request), so here content and read_csv() are used to extract and read in that csv file. Usually, data will be returned in a JSON format instead.

The only tricky part of this process is figuring out the available parameter names (e.g., station) and possible values for each (e.g., “DEN” for Denver). Currently, the details you can send in an HTTP request through Iowa Environmental Mesonet’s API include:

• A four-character weather station identifier (station)
• The weather variables (e.g., temperature, wind speed) to include (data)
• Starting and ending dates describing the range for which you’d like to pull data (year1, month1, day1, year2, month2, day2)
• The time zone to use for date-times for the weather observations (tz)
• Different formatting options (e.g., delimiter to use in the resulting data file [format], whether to include longitude and latitude)

Typically, these parameter names and possible values are explained in the API documentation. In some cases, however, the documentation will be limited. In that case, you may be able to figure out possible values, especially if the API specifies a GET rather than POST method, by playing around with the website’s point-and-click interface and then looking at the url for the resulting data pages. For example, if you look at the Iowa Environmental Mesonet’s page for accessing this data, you’ll notice that the point-and-click web interface allows you the options in the list above, and if you click through to access a dataset using this interface, the web address of the data page includes these parameter names and values.

The riem package implements all these ideas in three very clean and straightforward functions. You can explore the code behind this package and see how these ideas can be incorporated into a small R package, in the /R directory of the package’s GitHub page.

R packages already exist for many open data APIs. If an R package already exists for an API, you can use functions from that package directly, rather than writing your own code using the API protocols and httr functions. Other examples of existing R packages to interact with open data APIs include: