Programs have multiple failure modes: they can go wrong in more than one way. You have all encountered two relatively friendly failure modes: the Knit PDF that includes R error messages instead of the results you were trying to get, and the failure to knit at all, with an error message in RStudio. That may not seem altogether friendly, but at least it gives you a hint of what has gone wrong and where. Other kinds of failure are more challenging. Working with text data often leads to a particularly tricky kind of glitch: failure to complete. In other words, you try some code, and R never finishes running it. If you are knitting, you see that the Knit process is happening, but in the middle of some “code chunk” the thing stalls and no PDF is forthcoming. What to do?

Let’s look at a toy example. The classical way to produce this glitch is to write an infinite loop:

x <- 1
k <- 10
while (x > 0) {
    k <- k - 1 # oops: this changes k, but the loop tests x
}

Since x is always 1, the while loop repeats forever. You need a way out. There is a series of escalating steps you can take to bail out and start over:

  1. Choose “Interrupt R” from the “Session” menu (or click the red stop sign button). This works, in the case of the infinite loop above, but it doesn’t always. If interrupting has failed, you won’t get your > prompt again, or, if you’re knitting, the knitting will still continue, hopelessly.

  2. Choose “Terminate R…” from the “Session” menu. This will erase all the variables in your workspace and your recent command history. It often works but not always. If it doesn’t work, try the same command again—and even a third time. RStudio tries progressively harder to get R to stop.

  3. If three terminations don’t work, force quit RStudio. This should almost always work. It doesn’t do any particular harm, but you’ll lose your environment, your recent command history, and any unsaved source files. Save your R markdown files often.
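One defensive habit, offered here as a sketch rather than anything R or RStudio requires, is to give a loop an explicit cap on the number of passes, so that a slip like the one above stops on its own. The names max_iter and n_iter are made up for this illustration.

```r
# Same buggy loop as above, but with a safety cap.
# max_iter and n_iter are hypothetical names chosen for this sketch.
x <- 1
k <- 10
max_iter <- 1000   # the most passes we are willing to tolerate
n_iter <- 0

while (x > 0) {
    k <- k - 1     # still the wrong variable: x never changes
    n_iter <- n_iter + 1
    if (n_iter >= max_iter) {
        message("loop hit max_iter without finishing; check the loop condition")
        break
    }
}
```

The cap does not fix the bug, but it turns a hang into a message you can act on.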

You’re not likely to write an infinite loop, but a different kind of “stalling” is subtler and more common. This is when you ask R to do a calculation that takes a really, really long time—not forever, but too long. The most common case is a vectorized computation that involves much more legwork than you suppose. Here is an example.

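library(stringr)   # for str_c and str_split

austen <- readLines("austen.txt")
# keep only the lines from the first chapter heading through "THE END"
austen_novel <- austen[match("CHAPTER 1", austen):match("THE END", austen)]
austen_single <- str_c(austen_novel, collapse=" ")   # one long string
austen_words <- str_split(austen_single, "\\W+")     # the slow step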

If you try this, you will find that it takes an annoyingly long time to obtain austen_words. Far too long, especially if you compare it to the following seemingly equivalent process:

austen_words <- strsplit(austen_single, "\\W+")

How can this be? Aren’t these functions doing the same thing? I’m sorry to tell you that because of hidden differences in the way they are implemented, str_split is very slow if you are splitting a long string in many places. I conjecture that, in the case of str_split, R makes as many copies of austen_single as there are split points (strsplit does not need to do this copying). Since the text of the novel takes about 700 KB, and it has about 250,000 words, suddenly R is trying to use about 175 GB of memory. Your computer does not have 175 GB of memory, but it does its best by “swapping” pieces of the problem onto free space on your hard drive. This is slow and tiring for your computer. (The problem with this conjecture is that R is not supposed to actually copy identical strings in a case like this, but I’m certain something like this is going on.)
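The estimate above is just a multiplication, and it can be checked in R itself. This sketch uses the approximate figures from the text and decimal units; the variable names are made up for the illustration.

```r
# Back-of-envelope check of the memory estimate (decimal units throughout).
text_size_kb <- 700       # approximate size of the novel, from the text
n_words <- 250000         # approximate number of split points, from the text

# If every split point really did copy the whole string:
total_kb <- text_size_kb * n_words
total_gb <- total_kb / 1e6    # 1 GB = 1,000,000 KB in decimal units
total_gb
```

Doing this kind of arithmetic before running a computation is a cheap way to notice that something is about to explode.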

Anyway, I missed this issue when I wrote my solution set using str_split, because in my solution set I did more or less this:

austen_words <- unlist(str_split(austen_novel, "\\W+"))

The individual lines of austen_novel are much shorter, and need to be broken up at many fewer points, than austen_single, and str_split handles them just as well as strsplit. We can say that I inadvertently took a “divide and conquer” approach to the problem, giving str_split small chunks of the novel to split into words, then sticking the list of words together using unlist.
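To see that the two routes arrive at the same place, here is a small sketch on a toy stand-in for the novel. It uses base R's strsplit and paste rather than the stringr functions, and toy_lines is invented for the illustration; it is not the real text.

```r
# Toy stand-in for a few lines of a novel (not the real austen.txt).
toy_lines <- c("It is a truth", "", "universally acknowledged, that")

# Divide and conquer: split each short line, then flatten the list of pieces.
words_by_line <- unlist(strsplit(toy_lines, "\\W+"))

# All at once: glue everything into one long string, then split it.
toy_single <- paste(toy_lines, collapse = " ")
words_at_once <- strsplit(toy_single, "\\W+")[[1]]

# Do the two approaches agree on this toy input?
identical(words_by_line, words_at_once)
```

On real text the two results can differ in small ways, for example an empty string turning up where a line begins with punctuation, so it is worth comparing them before relying on either.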

What matters here? OK, use strsplit instead of str_split. But the more general point is that when R stalls or hangs, there are things you can do. Eventually you will get a nose for problems that might involve this sort of computational explosion, and you’ll develop a repertoire of strategies (like divide and conquer) for dealing with them. In the meantime, knowing how to terminate R is a start. Then you can start using the debugging strategies to try to isolate the part of the code that is causing R to stall. Then…get in touch.
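One way to start developing that nose, sketched here under my own assumptions, is to time a suspect step on small pieces of the input before letting it loose on the whole thing. The helper time_on_prefixes and the fake text are made up for this illustration; system.time is the base R timer.

```r
# Time a function on progressively larger inputs before trusting it with everything.
# time_on_prefixes and its arguments are hypothetical names for this sketch.
time_on_prefixes <- function(f, lines, sizes) {
    elapsed <- sapply(sizes, function(n) {
        # glue the first n lines into one long string, as in the example above
        piece <- paste(head(lines, n), collapse = " ")
        system.time(f(piece))[["elapsed"]]
    })
    data.frame(n_lines = sizes, seconds = elapsed)
}

# Fake text so the sketch runs without austen.txt.
fake_lines <- rep("It is a truth universally acknowledged", 2000)

timings <- time_on_prefixes(function(s) strsplit(s, "\\W+"),
                            fake_lines, sizes = c(500, 1000, 2000))
timings
```

If doubling the input much more than doubles the time, that is a hint that the full run may never finish. To try the stringr version, pass function(s) stringr::str_split(s, "\\W+") as f instead; the timings you see will depend on your machine and your installed package versions.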