Set Up for Working on The Primer

This document is a guide to setting up R/RStudio to work on The Primer: the book itself, PPBDS/primer, and the associated tutorial and data packages, PPBDS/primer.tutorials and PPBDS/primer.data. There are three parts:

  • The first part ensures that you have the knowledge and computer settings to be successful. With luck, you will only need to do this once.

  • The second part involves making a connection between the true repos at the PPBDS Github organization and your Github account and your computer. You may end up doing this dozens of times since, whenever something gets messed up, the easiest solution is often to just nuke it from orbit and start again. But, most weeks, you won’t do this at all.

  • The third part involves the daily workflow of merges and pull requests. You will do these steps many times each day.

The test which ensures that you have successfully completed this set up is to submit a PR for the TODO.txt file in the primer package which adds your name at the very top of the file. The PR should only change that file.

11.13 Computer Set Up

  1. Install the latest released versions of R and RStudio. Install the usethis package.

Read the Getting Started and Tools sections of The Primer and make sure your Git/Github is working. Read (and watch the videos from) Getting Used to R, RStudio, and R Markdown by Chester Ismay and Patrick C. Kennedy. Check out RStudio Essentials Videos. Most relevant for us are “Writing code in RStudio”, “Projects in RStudio” and “Github and RStudio”. The best reference for R/RStudio/Git/Github issues is always Happy Git and GitHub for the useR.

  1. Make sure that your Git/Github connections are good. If you have gone through the key chapters in Happy Git with R — as you should have — then these may already be OK. If not (or, even if you have), then you need to run usethis::git_sitrep().
> library(usethis)   
> git_sitrep()    
Git config (global)   
● Name: 'David Kane'   
● Email: 'dave.kane@gmail.com'   
● Vaccinated: FALSE   
ℹ See `?git_vaccinate` to learn more   
ℹ Defaulting to https Git protocol   
● Default Git protocol: 'https'   
GitHub   
● Default GitHub host: 'https://github.com'   
● Personal access token for 'https://github.com': '<discovered>'   
● GitHub user: 'davidkane9'   
● Token scopes: 'delete_repo, gist, notifications, repo, user, workflow'   
● Email(s): 'dave.kane@gmail.com (primary)', 'dkane@fas.harvard.edu'   
...   

I left out the end of the output.

If the first part — Git config — seems messed up, execute:

use_git_config(user.name = "David Kane", user.email = "dave.kane@gmail.com")

If the second part seems messed up, read about Github credentials and reset your personal access token. After you do, restart R and then run git_sitrep() again to make sure that things look like mine, more or less.
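A minimal sketch of one way to refresh Github credentials, assuming usethis 2.0 or later and the gitcreds package are installed. Whether these are the exact commands intended above is an assumption. Both calls are interactive: the first opens a browser page for generating a personal access token, and the second prompts you to paste it.

```r
# Assumption: usethis (>= 2.0) and gitcreds are installed.
library(usethis)

# Opens github.com in a browser with recommended token scopes pre-selected.
create_github_token()

# Prompts for the token and saves it in the Git credential store.
gitcreds::gitcreds_set()

# After restarting R, confirm that the token is discovered.
git_sitrep()
```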

  1. Install the renv package. You can read about the renv package here.

It is not critical to understand all the details of how renv works. The big picture is that it creates a set of libraries which will be used just for this project and whose versions are kept in sync between you and me.

  1. At this point, you should have all the tools you need to contribute. If you have never done a pull request, however, you will need to learn more. Start by reading the help page. Read the whole thing! Don’t just skim it. These are important concepts for professional-level workflow. The usethis package is mostly providing wrappers around the underlying git commands. If you want to understand what is happening at a lower level, read this, but doing so is optional.

Again, with luck, you will only have to do these steps once.

  1. Prove to yourself (and to me) that your set up is working by submitting a pull request to me which simply adds your name to the top of one of the TODO.txt files. (See below for how to do this.)

11.14 Project Set Up

You will need to do the steps below at least once. It is more likely, however, that you will do them dozens of times. If things are working, great! If they stop working, you can try to diagnose the problem. But, if you can’t, then you are in a nuke-it-from-orbit scenario, which means that you start by deleting the current version of the package from two places: your computer and your Github account. To delete the primer from your computer, put the RStudio project directory in the Trash, and make sure to close the RStudio session after you delete it. If for some reason you cannot completely remove it, consider running sudo rm -r dirname in a terminal, replacing “dirname” with the path to primer on your computer. sudo and rm can be extremely dangerous when used together, so double check the command and/or do additional research first. After you successfully remove it from your computer, go to the repo on your Github account and then go to Settings to delete the repo.
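As a gentler alternative to sudo rm, the local half of the clean-up can be sketched in R. The helper name remove_project_dir() is hypothetical, not part of any package. It refuses to delete a directory unless it contains an .Rproj file, which guards against pointing it at the wrong folder.

```r
# Hypothetical helper: delete a directory only if it looks like an
# RStudio project, i.e. it contains a .Rproj file.
remove_project_dir <- function(path) {
  # Fail early if the path does not exist.
  path <- normalizePath(path, mustWork = TRUE)

  rproj <- list.files(path, pattern = "\\.Rproj$")
  if (length(rproj) == 0) {
    stop("No .Rproj file found in ", path, "; refusing to delete.")
  }

  # Remove the directory and everything inside it.
  unlink(path, recursive = TRUE, force = TRUE)

  # TRUE if the directory is gone.
  !dir.exists(path)
}

# Example call (the path is an assumption; use your own project location):
# remove_project_dir("/Users/davidkane/Desktop/projects/primer")
```

Close the RStudio session for the project before running this, and note that it does nothing about the fork on your Github account, which still has to be deleted by hand.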

Key steps:

  1. Fork/download the target repo:
library(usethis)  
create_from_github("PPBDS/primer",   
                    fork = TRUE,   
                    destdir = "/Users/davidkane/Desktop/projects/",   
                    protocol = "https")  

That is the repo for working on the book. If you are working on PPBDS/primer.data or PPBDS/primer.tutorials, you need to create_from_github() using those repos. You must change destdir to be a location on your computer. Indeed, professionals will generally have several different RStudio sessions open, each working on a different R project/package, each of which is connected to its own Github repo.

For your education, it is worth reading the help page for create_from_github(). The fork and protocol arguments may not really be necessary and, obviously, you should place the project in the location on your computer in which your other projects live. The command first forks a copy of PPBDS/primer to your Github account and then clones/downloads that fork to your computer.

This may seem like overkill, but, as Pro Git explains, it is how (essentially) all large projects are organized. With luck, you only have to issue this command once. After that, you are always connected, both to your fork and to the true repos, which live at github.com/PPBDS. Also, note that, if something ever gets totally messed up on your computer, you can just delete the project folder on your computer and the repo on your Github account and then start again. (If you have made changes that you don’t want to lose, just save the files with those changes to one side and then move them back after you have recreated the project.)

Note that this command should automatically put you in a new RStudio session with the primer (or primer.tutorials or primer.data) RStudio project which resides on your computer.

  1. The next step is to get renv set up so that you are running the same package versions as everyone else. Run this once:

This will install all the packages you need in the directory of this project. (This has no effect on your main library of R packages.) Restart your R session. Again, this means that you now have two separate installations of, for example, ggplot2. One is in the default place which your R session is by default pointed to. (In a different project without an renv directory, you can run .libPaths() to see where that is.) The second place that ggplot2 is installed is in the renv directory which lives in this project.
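A plausible sketch of the renv step above, assuming the intended command is renv::restore(), which reads renv.lock and installs the package versions it records into the project's own library. Run it from inside the project, not from some other R session.

```r
# Assumption: renv is installed and the project root contains renv.lock.
# Installs the package versions recorded in renv.lock into the project library.
renv::restore()

# After restarting R inside the project, list the library paths in use.
# The first entry should point inside the project's renv directory.
.libPaths()
```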

Note that, for the most part, you won’t do anything with renv after this initial use. If you use error = TRUE in any code chunk, you will also need renv.ignore = TRUE in that code chunk, or you will get an annoying warning because renv can’t parse the code in that chunk.

However, there are three other renv commands you might issue:

renv::status() just reports if anything is messed up. It won’t hurt anything.

renv::restore() looks at the renv.lock file and installs the packages it specifies. You will need to do this when I make a change to renv.lock, e.g., if I upgrade our version of ggplot2 or add a new package.

renv::snapshot() should only be issued if you know what you are doing. This changes the renv.lock file, which is something that, usually, only I do. The most common reason to use it is that you need to add a new package to the project.
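The three commands fit together when a new package is added. The sketch below uses "janitor" purely as a hypothetical example package. One caveat: with its default settings, renv records only packages that the project's code actually uses, so the new package may not appear in renv.lock until some file loads it.

```r
# "janitor" is a placeholder; substitute the package you actually need.
install.packages("janitor")

# Report differences between the project library and renv.lock.
renv::status()

# Write the current state of the project library to renv.lock.
renv::snapshot()

# Everyone else then picks up the change with:
# renv::restore()
```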

  1. Create a branch to work from:
pr_init(branch = "chapter-9")

Make sure the branch name is sensible. Again, this is a command that you only need to issue once, at least for our current workflow. You should always be “on” this branch, never on the default (master) branch. You can check this in the upper right corner of the Git pane in RStudio.

In more professional settings, you will often work on several different branches at once. So, if you are comfortable, you should feel free to create more than one branch, use it, delete it and so on. Never work on the default branch, however. And, if you use multiple branches, be careful where you are and what you are doing.

11.15 Daily Work

  1. Pull regularly:

Issue this command all the time. This is how you make sure that your repo and your computer are updated with the latest changes that have been made in the book. The word “upstream” is associated with the repos at PPBDS. The word “origin” is associated with the fork at your Github account. But, in general, you don’t need to worry about this. Just pull every time you sit down. (Just clicking the pull button is not enough. That only pulls from your repo, to which no changes have been made. It does not pull from PPBDS/primer, et al.) You issue this command multiple times a day.
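A minimal sketch of the regular pull, assuming the command meant above is usethis::pr_merge_main(), the function this guide names elsewhere for picking up changes from PPBDS. The call to git_remotes() is optional and only shows which URL "origin" and "upstream" point to.

```r
library(usethis)

# Optional: list the configured remotes. "origin" should be your fork and
# "upstream" should be the repo at PPBDS.
git_remotes()

# Pull the latest changes from the source repo into the current branch.
pr_merge_main()
```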

  1. Make changes in the file you are editing. Knit to make sure the changes work. Commit with a message. Push to the repo on your Github account. And so on.

At some point, you will be ready to push to the PPBDS organization. However, you can’t do this directly. Instead, you must submit a pull request (PR). Because you are part of a larger project, these commands are slightly different than what you have done before, which has usually just been clicking on the pull (blue) and push (green) arrows in the Git pane in RStudio.

  1. Issue pull requests every few days, depending on how much work you have done and/or whether other people are waiting for something you have done.

This command bundles up a bunch of git commands (which you could do by hand) into one handy step. This command does everything needed to create a “pull request” (a request from you to me that I accept the changes you are proposing into the repo at PPBDS/primer) and then opens up the web page to show you. But you are not done! You must PRESS the green button on that web page, sometimes twice. Until then, the PR has not actually been created. pr_push() just does everything before that. The “pr” in pr_push() stands for pull request.
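A minimal sketch of the pull request step, assuming the command described above is usethis::pr_push(), as the paragraph itself indicates. It assumes your work is already committed on the branch you created with pr_init().

```r
library(usethis)

# Pick up the latest changes from PPBDS first, to reduce merge conflicts.
pr_merge_main()

# Push the current branch to your fork and open the Github page from which
# the pull request is created. The PR exists only after you press the
# green button on that page.
pr_push()

# Later, reopen the page for the pull request tied to the current branch.
pr_view()
```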

  1. I will leave aside for now issues associated with the back-and-forth discussions we might have around your pull request. I will probably just accept it. Your changes will go into the repos at PPBDS and then be distributed to everyone else when they run pr_merge_main().

  2. You can now continue on. There is no need to wait for me to deal with your pull request. There is no need to fork/clone/download again. You don’t need to create a new branch, although many people do, with a branch name which describes what they are working on now. You just keep editing your files, knitting, and committing. When you feel you have completed another chunk of work, just run pr_push() again.

  3. Read the usethis setup help page at least once, perhaps after a week or two of working within this framework. It has lots of good stuff!

11.16 Common Problems

  1. In the immediate aftermath of this creation process, the blue/green arrows (in the Git panel) for pulling/pushing may be grayed out. This is a sign that the connection between your computer and your forked repo has not “settled in.” (I am not sure of the cause or even if this is the right terminology.) I think that just issuing your first pr_merge_main() fixes it. If not, it always goes away. Until it does, however, you can’t pull/push to your repo. That doesn’t really matter, however, since the key commands you need are pr_merge_main() and pr_push(), both of which always work immediately.

  2. After running pr_merge_main(), you will often see a bunch of files in your Git tab in the top right corner of RStudio marked with an M (for Modified), including files which you know you did not edit. These are the files that have been updated on the “truth” — on PPBDS/primer — since your last pr_merge_main(). Since you pulled them directly from the PPBDS/primer repo, your forked repo sees all the changes other people have made and thinks that you made them. This is easily fixed, however — just commit all the changes to your forked repo. (Strangely, this seems to not always happen. If you don’t see this effect, don’t worry.)

  3. Always run pr_merge_main() before committing a file. Otherwise, you may create lots of merge conflicts. If this happens, save a copy of the file(s) you personally were editing off to the side. Then, nuke it from orbit, following the instructions above. Repeat the Project Set Up process. Then move in your file(s) by hand into the new repo, and commit/push them as normal.

  4. When you submit a pull request to merge your work with the PPBDS repo, it won’t always be smiles and sunshine — every once in a while, you’ll run into merge conflicts. When these arise, it is because two parties work on a file separately and submit conflicting changes. This makes it hard for GitHub to “merge” your version with the other version. When this happens, find multiple adjacent “>,” “<,” and “=” signs in your document — these will show you where the conflicts occur. For more background on merge conflicts, read this.

If you see the above-mentioned conflicts in your document, do not submit a pull request. This will mess things up. Instead, first, go through your document, and make sure all the weird conflict indicators (<, >, and =) are removed. Second, decide what goes in that space. It might be the stuff you wrote. It might be the other stuff. It might be some combination of the two which you decide on. Whatever happens, you are making an affirmative choice about what should appear in the file at that location. Once all the merge conflicts are fixed, run pr_push() again.
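Hunting for leftover conflict indicators by eye is error-prone, so a small check can help before running pr_push(). The helper below is a hypothetical sketch in base R, not part of usethis. It assumes Git's default markers, which are seven repeated characters at the start of a line.

```r
# Hypothetical helper: return the line numbers in a file that hold Git
# conflict markers (<<<<<<<, =======, or >>>>>>> at the start of a line).
find_conflict_markers <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Exactly seven marker characters, followed by a space or the end of line.
  grep("^(<{7}|={7}|>{7})( |$)", lines)
}

# Example call (the file name is an assumption):
# find_conflict_markers("index.Rmd")
```

An empty result means no markers were found. A line consisting of exactly seven equals signs that is not a conflict marker would be flagged too, so treat the output as a list of places to inspect, not as proof of a conflict.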

  1. pr_push() can be tricky. First, note that, if I have not accepted a (p)ull (r)equest which you have submitted, then your PR is still open. You can see it on Github. In fact, you can see all the closed/completed pull requests as well. If, while one PR is still open, you submit another pr_push(), then this will just be added to your current PR. And that is OK! We don’t need it to be separate.

But even if there is not an open PR, pr_push() can be tricky. The key thing to remember is that you must press a green button on Github for a new PR to be created. Normally, this is easy. Running pr_push() automatically (or perhaps after you run pr_view()) puts you in a browser and brings you to the correct Github page. Press the button and – presto! – you have created a PR. But, sometimes, the web page is different. It actually sends you back to an old pull request. When this happens, you need to click on the “Pull Request” tab above. This will take you to a new page, with a green button labeled “Compare & Pull Request.” Press that button.

  1. If you end up needing to install a new package, which should be rare, just install it and then type renv::status() to confirm that renv is aware of the change. Then, type renv::snapshot(). This will update the renv.lock file to include the new package. You just commit/push the new version of renv.lock, and that shares the information with everyone else on the project. Never commit/push a modified renv.lock unless you know why it has changed.

  2. Be careful of committing garbage files like “.DS_Store,” which macOS sometimes creates in project folders. Only commit changes which you understand. In the vast majority of cases your PRs will only involve one or two files.
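One way to keep .DS_Store out of commits entirely is to have Git ignore it. This is a sketch using two usethis functions; which of the two suits this project is a judgment call, not something this guide prescribes.

```r
library(usethis)

# Option 1: ignore .DS_Store in this project only. This edits the project's
# .gitignore, which will itself show up as a modified file.
use_git_ignore(".DS_Store")

# Option 2: ignore it for every repo on your computer by adding it (and a
# few other common clutter files) to your global gitignore. This is the
# "Vaccinated" entry reported by git_sitrep().
git_vaccinate()
```

The global option leaves the project's files untouched, so it cannot leak into a pull request.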

11.17 Style Guide

  • Never use just a single # after using it for the chapter title. The first subpart uses a ##. There should be 5 to 8 subparts for each chapter. Within each subpart, you may have sub-subparts, indicated with ###. There should be 3 to 10 of those. You may use #### if you like.

  • Section headings (other than Chapter titles) are in sentence case (with only the first word capitalized, unless it is something always capitalized) rather than title case (in which all words except small words like “the” and “of” are capitalized). Chapter titles are in title case. Headings do not end with a period.

  • Never hard code stuff like “A tibble with 336,776 rows and 19 columns.” What happens when you update the data? Instead, calculate all numbers on the fly, with `r scales::comma(x)` whenever x is a number in the thousands or greater. Example: “A tibble with `r scales::comma(nrow(x))` rows and `r ncol(x)` columns.”

  • “We” are writing this book.

  • Package names are in bold: **ggplot2** is a package for doing graphics. In general, we reserve bolding for package names. Use italics for emphasis in other contexts.

  • R code, anything you might type in the console, is always within backticks. Example: `mtcars` is a built-in dataset.

  • Function names always include the parentheses: we write pivot_wider(), not pivot_wider.

  • Add lots of memes and videos and cartoons.

  • Do not name code chunks. Chunk names break the book build because of limits in bookdown.

  • Make ample use of comments, placed with the handy CMD-Shift-/ shortcut. These are notes for everyone else working on the chapter, and for future you.

  • All tables should be created with the gt package.

  • All images and gifs are loaded with knitr::include_graphics().

  • The only code chunk options allowed are include = FALSE, echo = FALSE, fig.cap = “This is my cap”, and message = FALSE. Use message = FALSE when loading packages like ggplot2, since it keeps their start-up messages from printing.

  • Interim data sets should be called x or something sensible to the situation, like ch7 for a data set you are working with in Chapter 7. Do not use names like data and df, both of which are already the names of R functions.

  • Students are sometimes tentative. Don’t be! Edit aggressively. If you don’t like what is there, delete it. (If I disagree with your decision, I can always get the text back from Github.) Move things around. Make the chapter yours, while keeping to the style of the other chapters. Note that 90% of the prose here was not written by me. Cut anything you don’t like.

  • If you make an mp4, you can convert it to .gif using https://convertio.co/mp4-gif.

  • Everything is Bayesian. The confidence interval for a regression means that there is a 95% chance that the true value lies within that interval. Use Rubin Causal Model and potential outcomes to define precisely what “true value” you are talking about. And so on.
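The rule in the list above against hard-coding numbers can be checked in the console before the inline code goes into a chapter. This sketch assumes the scales package is installed and uses a made-up data frame x in place of real data.

```r
# Stand-in data: 336,776 rows and 2 columns (made up for illustration).
x <- data.frame(id = seq_len(336776), value = 0)

# The same expressions that would appear inline in the R Markdown text.
sentence <- paste0(
  "A tibble with ", scales::comma(nrow(x)),
  " rows and ", ncol(x), " columns."
)

sentence
# "A tibble with 336,776 rows and 2 columns."
```

If the data changes, the sentence updates itself the next time the chapter is knit.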

11.17.1 Stray thoughts

Every chapter 5+ begins with a problem, and the decision we must make. These are often toy, highly stylized problems. The decisions are not realistic. But, in structure, these problems parallel the real problems that people face, the actual decisions which they must make.

The problem is specified at the end of the “preamble,” the untitled part of the chapter after the title and before the first subpart. Example from Chapter 8:

A person arrives at a Boston commuter station. The only thing you know is their political party. How old are they? Two people arrive: A Democrat and a Republican. What are the odds that the Democrat is 10% older than the Republican?

A different person arrives at the station. You know nothing about them. What will their attitude toward immigration be if they are exposed to Spanish-speakers on the platform? What will it be if they are not? How certain are you?

Is this an actual problem that someone might face? No! But it is like such problems. The first requires the creation of a predictive model. The second necessitates a causal model. The rest of the chapter teaches the reader how to create such models. The end of the chapter harkens back to the questions from the beginning.

Might it be nice to put more meat on the story than that? Perhaps. In an ideal world, the “decision” you faced would be more complex than just playing the prediction game. Begin with a decision. What real world problem are you trying to solve? What are the costs and benefits of different approaches? What unknown thing are you trying to estimate? With Sampling, it might be: How many people should I call? With estimating one parameter — like vote share as the ballots come in — it might be: How much should I bet on the election outcome?

The data we have might not be directly connected to our problem. For example, we might be running a Senate campaign and trying to decide what to spend money on. The Spanish-speakers-on-a-train-platform data set is not directly related to that problem, but it isn’t unrelated. Indeed, the first theme of “validity” is directly related to this issue: Is the data we have relevant to the problem we want to solve?