When you move from short exercises to a more sustained project, a whole new set of challenges makes itself known. I’ve learned a few lessons the hard way about organizing a programmatic data-analysis project, and I’d like to try to spare you the hard parts. Here are some suggestions for how you might work with data and program code as you do your research and prepare the report.

Data management

Keep original data. Have a data folder which contains nothing but the original form of your data. Never delete or modify these “raw” sources. Instead, work on copies.
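
For instance, here is a minimal sketch in R of working on a copy; the folder and file names (data/raw, data/working) are just examples, not a required layout:

```
# Leave data/raw untouched; do all modification on a copy elsewhere.
# Folder and file names here are hypothetical.
dir.create("data/working", showWarnings = FALSE, recursive = TRUE)
file.copy("data/raw/sources.csv", "data/working/sources.csv")
sources <- read.csv("data/working/sources.csv")
```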

File what you store. Put those saved files in subfolders. Make a reasonable decision about keeping track of provenance (reasonable means: at least some information, but not so much that recording it becomes exhausting).

Keep intermediate forms. Store as much as possible. I cannot count the number of times I have wished I’d kept something I initially thought I could just throw away. Disk space is plentiful, whereas the time it takes to re-find or recreate something is not.

Keep records. Start a “lab notebook” immediately. Use an ordinary text file or word-processor file. At a minimum, note the dates you obtain data files, how you obtained them, and what you named them.
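
If it suits you, you can even append dated entries from R itself; a minimal sketch (the file name notebook.txt and the wording of the entry are just examples):

```
# Append a dated note to a plain-text lab notebook.
# The file name and the entry text are examples only.
entry <- sprintf("%s: downloaded things.csv from the archive; saved as data/raw/things.csv",
                 Sys.Date())
cat(entry, "\n", file = "notebook.txt", append = TRUE)
```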

Human procedures

Be systematic. Before you embark on any construction of data, reflect on how you can proceed in a consistent way. Many of you will have occasion to “hand-code” at some stage: labeling or categorizing texts before or after some computational transformation. Don’t do this ad hoc. Discuss with your partner or partners how you should approach it, choose categories in advance, write out a procedure, and follow it independently of one another.

Sometimes you will change your procedure partway through and have to go back over earlier work; that happens. The real problem is being unsure, later on, whether or when your procedure changed.

Code management

Use the file system to organize your work. Your project should have a single master directory, final-report or similar. There should be an RStudio project associated with this working directory, which you can open using the RStudio “Open Project” menu command. Every single file path should be relative to this directory. Every source file should be sourced or knit from the main directory.
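
As a sketch, the layout might look like this (all names hypothetical), with every path in your code written relative to the root:

```
# final-report/            <- master directory and R working directory
#   final-report.Rproj     <- RStudio project file
#   data/                  <- original data, never modified
#   prepare_analysis.R
#   final-report.Rmd
# Run from the project root, relative paths work for everyone:
things <- read.csv("data/things.csv")  # not "/Users/you/Desktop/things.csv"
```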

If you figure something out in the console, turn it into a script or an R markdown file. Then check that the script actually works when run from the top.
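
For example, if you worked out a word count at the console, collect the steps into a file (names here are hypothetical) so the whole thing can be rerun in order:

```
# count_words.R: console experiments, collected so they rerun in order.
lines <- readLines("data/sample.txt")
words <- unlist(strsplit(lines, "[[:space:]]+"))
total_words <- sum(words != "")
```

If source("count_words.R") runs without errors in a fresh session, the script works.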

Don’t put everything in one ginormous file. Supplement 1 has some remarks on how to use R source files as well as subsidiary R markdown files, but I recommend well-commented subsidiary R files you source from a master R markdown file. In fact, I think the best strategy would be to share R source files that prepare and process your data, and for each person to have their own master R markdown file with their final report on the project. Within that R markdown file, you can have chunks sourcing the shared code. Let’s say you work together on a script, prepare_analysis.R, that creates a data frame, all_the_things, with the data of interest. Then your master report file might have a stretch looking like this:

```{r include=F}
source("prepare_analysis.R")
```

The trend we found is depicted in the chart.

```{r echo=F, fig.cap="All the things"}
ggplot(all_the_things, aes(year, thing)) +
    geom_bar(stat="identity")
```

As you can see, the things are all the things...

I am introducing two chunk options that are important for writing of this kind. An include=F chunk is executed, but neither its source code nor its results appear in the final knitted PDF. The results of an echo=F chunk do appear in the PDF, but its source code does not.
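
If the same options recur in chunk after chunk, knitr also lets you set defaults once in a setup chunk near the top of your file; individual chunks can still override them:

```{r setup, include=F}
# Set default chunk options for the whole document (overridable per chunk).
knitr::opts_chunk$set(echo=F)
```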

Write code with your collaborators and your future self in mind. Choose meaningful names for your variables, comment your code amply, separate logically independent units of your program into functions, and, in the name of William Morris, format your program text consistently. The kinds of consistency I have in mind are exemplified by Wickham’s style guide for R, which you could follow or modify.
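
As a small illustration of what I mean (the function and names are made up for the example):

```
# Count how many of the texts contain the given term.
count_matching_texts <- function(texts, term) {
    matches <- grepl(term, texts, fixed=TRUE)
    sum(matches)
}
```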

Test, then share. If you are up and running with the Vagrant virtual machine, that is a very useful way to test that you have actually structured your scripts and your data files reliably. You will not be able to share code if your scripts include lines like readLines("/Users/alphonse/MySpecialData/Things/Stuff/Things/stuff.txt") or setwd("../../../project/project2/project2b"). On the other hand, if you always know you will be running the scripts from /vagrant, and you always put your data in /vagrant/data (or some other subfolder like that), then you, your collaborators, and I will always be able to reproduce your results, provided we have the same data folder. And if we don’t reproduce your results, we’ll know immediately that our code or data is out of sync.
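
The portable version of such lines is short; a sketch, assuming the shared layout just described (the file name is hypothetical):

```
# Relative to the agreed working directory (e.g. /vagrant), so the same
# line runs unchanged for anyone with the same data folder.
stuff <- readLines("data/stuff.txt")
```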

Sharing code revisions. Tricky. I really like using version-control software (I use git), but that was one extra set of complexities I decided not to ask you to take on this semester. At a minimum, record the date and time of the last revision in any source file you share. You can use Dropbox shared folders (just be careful not to use a Dropbox folder, especially a shared one, as your R working directory! R creates and deletes temporary files as it runs, and Dropbox might get a little frantic trying to keep up) or Google Drive shared folders (remember that Rutgers provides you with its very own Google Drive, ScarletApps). Both Dropbox and Google Drive remember past versions of files (“Manage Versions” in Google Drive; “Previous versions” in Dropbox), which is helpful if you want to walk back changes that you or someone else has made. Avoid the all-too-familiar situation described here.
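
The minimal date-stamp convention can be as simple as a header comment in each shared file; a sketch:

```
# prepare_analysis.R
# Last revised: YYYY-MM-DD by <your initials>
# (update this line whenever you change the shared file)
```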

See also the manual by Gentzkow and Shapiro, Code and Data for the Social Sciences: A Practitioner’s Guide.