You are now experts at handling data frames in R, but we have not spent as much time on getting data from files into R (and even less on saving a data frame to disk). The most convenient formats for tabular data (and the most commonly-encountered) are CSV (comma separated values) and TSV (tab separated values).

Reading data in

The most important function is read.table. This function has many (many) parameters, which you can read about in the exhausting help(read.table) documentation. The most important parameters are:

file

the first parameter, the name of a file (paths relative to working directory). Normally, just write read.table(filename) rather than read.table(file=filename). The meaning is still clear. As in all such functions, filename can be either a literal string (read.table("data/awesome_data.tsv")) or a variable—the latter is especially common in loops:

frms <- list()
for (j in seq_along(filenames)) {
    frms[[j]] <- read.table(filenames[j], ...)
}
sep
what separates columns? For TSV, this will be sep="\t"
stringsAsFactors
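ritually set stringsAsFactors=F. (From R 4.0.0 onward this is already the default, but passing it explicitly does no harm and keeps your code working on older versions.)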
header
does the first line of the file give names of columns? Then pass header=T. If you set header=F, the columns are given dummy names (V1, V2, V3...)
quote
consider the following problem: what if you want to store a string value that itself contains a comma in a CSV file? One solution is the following: only treat a comma as a separator if it does not occur between quotation marks. To permit this, pass quote="\"". The default value for quote is "\"'", which means that read.table assumes both single and double quotes are used in this way. If you have textual data which contains quotation marks (or apostrophes) but your file doesn’t follow the rules for quoting, you’ll get errors. To forestall this assumption, turn off quote-interpretation with quote="".
comment.char
another trap. This allows read.table to read files with comments in them. By default # is a comment character, which means that if your text contains # for other reasons, read.table will nonetheless skip everything after the #. To prevent this, set comment.char="".
encoding, fileEncoding
describe the character encoding of the input: e.g., if it is UTF-8, say so. The distinction between these two parameters is confusing to me, and when I have encoding issues I have to do a bit of trial and error. Usually you will get away without either of them. When that fails, your next option is fileEncoding="UTF-8".

The upshot: your typical command for reading in a TSV file with column headers is

read.table(filename, sep="\t", header=T,
    quote="", comment.char="", stringsAsFactors=F)
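
To see why comment.char="" matters, here is a minimal sketch using a throwaway file made with tempfile(). The file contents and the variable names (tmp, safe, naive) are invented for illustration; only the read.table parameters come from the discussion above.

```r
# Build a small TSV whose text column contains an apostrophe and a '#'.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\ttxt",
             "1\tdon't panic",
             "2\tissue #42 is open"),
           tmp)

# The recommended call treats quotes and '#' as ordinary characters.
safe <- read.table(tmp, sep = "\t", header = TRUE,
                   quote = "", comment.char = "", stringsAsFactors = FALSE)
print(safe$txt)

# Leaving comment.char at its default ("#") silently truncates row 2.
naive <- read.table(tmp, sep = "\t", header = TRUE,
                    quote = "", stringsAsFactors = FALSE)
print(naive$txt)
```

Comparing the two printed vectors shows what the default comment handling throws away. Leaving quote at its default as well makes things worse, since the apostrophe in row 1 is then read as an opening quotation mark.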

read.csv is a shortcut for read.table with some of these parameters set for convenient CSV input. You can use all the same parameters as read.table. This is indicated, but rather cryptically, when you try help(read.csv) and find yourself looking at the same help page as read.table. Reading down to the function listing for read.csv, we find that it is given as

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

which tells you the variant parameter settings. Because commas turn up inside text values much more often than tabs do, you often do want to make use of quoting, so a typical read.csv call looks like

read.csv(filename, stringsAsFactors=F)

but if you don’t want quoting, pass quote="".
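
Returning to the loop over filenames shown under the file parameter: the same job can be sketched with lapply, which saves you from managing the list index yourself. The directory, file names, and data below are made up so that the example runs on its own; in real use filenames would point at your own files.

```r
# Create two small CSV files in a temporary directory (illustrative data).
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
write.csv(data.frame(word = c("a", "b"), n = c(1, 2)),
          file.path(demo_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(word = "c", n = 3),
          file.path(demo_dir, "part2.csv"), row.names = FALSE)

# Collect the file names, read each one, and stack the results.
filenames <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
frms <- lapply(filenames, read.csv, stringsAsFactors = FALSE)
combined <- do.call(rbind, frms)
print(combined)
```

The rbind step assumes every file has the same columns; if yours do not, inspect frms before combining.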

Diagnosing input problems

You often will not succeed the first time you try to read in a file with read.table or read.csv. Either the results won’t be what you expect or the functions will give up with an error message. A very common error is

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
line X did not have N elements

Notice that even though you called read.table or read.csv you are told the error was in scan. R is confessing to you that read.table is itself only a convenient wrapper around the even more elaborate function scan. scan is occasionally useful in its own right, but never mind that for now. The error arises because these functions require that every row have the same number of columns (with one exception, described in note 1). If it seems to the function that a row has a deviant number of columns, you will get the error message above, telling you what line of the file causes the problem. Occasionally there is genuinely a lacuna in the file, in which case you should make a copy in which you correct the error by hand in a text editor, then work on the copy. But normally the problem is that the quote or sep or comment.char or fileEncoding parameters are not correctly set.

If you have a line number in hand, you can examine the line in R with

ll <- readLines(filename)
ll[X + 1]   # where X is the problem line

(X + 1 because the line number from the error message doesn’t count the header line, if you set header=T.) If even readLines gives an error (other than “cannot open the connection”), you likely have a character encoding problem.
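
If you would rather find every suspicious line at once, base R's count.fields reports how many fields each line appears to have under a given sep, quote, and comment.char. The sketch below uses an invented four-line file with one deliberately broken row; tmp, n_fields, and bad are illustrative names.

```r
# A TSV in which the third line of the file has an extra tab.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\ttxt",
             "1\tfine",
             "2\tone\textra",
             "3\tfine"),
           tmp)

# Count fields per line, using the same settings you plan to read with.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Lines whose field count differs from the header's are the suspects.
bad <- which(n_fields != n_fields[1])
print(bad)
print(readLines(tmp)[bad])
```

Here the header is counted too, so the indices are line numbers in the file itself and no X + 1 adjustment is needed. That holds only if the file has no blank lines, since count.fields skips those by default.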

Saving data to disk

If you have tabular data you want to save to disk, R has complementary write.table and write.csv functions (and writeLines too, in fact). These have many parameters, which have been carefully chosen so as to be not quite identical to those for read.table. Again consult help(write.table) and prepare yourself. write.table accepts either a data frame or a matrix and writes it to disk.

sep
the field separator. Use sep="\t" for tabs. I am a big fan of the TSV file rather than the CSV, since it usually lets us avoid quoting.
row.names
whether to write rownames (if your variable does not have rownames, numbers are used). By default this is TRUE, a terrible decision on the part of R’s designers. Always pass row.names=F.
col.names
whether to add column headers. This is very useful when you are saving data frames (since otherwise you lose the column names). If you are saving a matrix, you may or may not want the column names.
quote
whether to quote string values. If you are using a TSV and don’t expect your strings to have tabs in them, you can specify quote=F. If you are using a CSV and there are commas in your values, you need quote=T.
fileEncoding
if you have any non-ASCII characters in your data, set fileEncoding="UTF-8". Then you can be sure that it can be read in with the same fileEncoding setting. (Otherwise the encoding will be whatever the default is for your system. On Macs this is normally UTF-8; on Windows the default varies.)
na
how to represent NA values in the output. The default is the two letters "NA", but sometimes an empty string is preferable (na="").
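
A quick way to get a feel for these parameters is to write a tiny data frame and look at the raw lines with readLines. This is only a sketch: the data frame and the temporary file names are invented for the purpose.

```r
frm <- data.frame(id = 1:2, txt = c("a,b", NA), stringsAsFactors = FALSE)

# Default row.names=TRUE: the header has one field fewer than the data rows.
with_rownames <- tempfile(fileext = ".tsv")
write.table(frm, with_rownames, sep = "\t")
print(readLines(with_rownames))

# CSV with quoting on (the default), no row names, NA written as empty.
csv_file <- tempfile(fileext = ".csv")
write.table(frm, csv_file, sep = ",", row.names = FALSE, na = "")
print(readLines(csv_file))

# Quoting lets the embedded comma survive the trip back into R.
back <- read.csv(csv_file, stringsAsFactors = FALSE)
print(back$txt[1])
```

The first file shows why row.names=F is worth typing every time: the extra unnamed column of row names leaves the header one field short of the data rows.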

In text analysis we often have data frames where one column contains a big blob of text. Some care is needed in saving this to disk. I’ve already alluded to the quoting problem. A second problem is that your text might contain line breaks. These will turn into aberrant extra rows when you try to read the result back into R. You have two choices. Either devise your own scheme for representing the line breaks, or replace the breaks with ordinary spaces. If lineation is important, the latter choice won’t work, but then it may also make more sense to store the data with one row per text line. A reasonable scheme for replacing line breaks might involve something like

frm <- frm %>%
    mutate(txt=str_replace_all(txt, fixed("\n"), "<br>"))

which replaces all line breaks with the four-character sequence <br>.

But often you won’t really care about line breaks—or tabs or other different kinds of spacing. In that case you can simply flatten out all that white space into ordinary spaces:

frm <- frm %>% 
    mutate(txt=str_replace_all(txt, "\\s+", " "))

This frm can then be safely written using

write.table(frm, filename, sep="\t",
    quote=F, row.names=F, fileEncoding="UTF-8")

and read back in with

read.table(filename, sep="\t", header=T,
    quote="", comment.char="", stringsAsFactors=F)
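
It is worth checking once that the round trip really is lossless. The sketch below does so on a made-up two-row data frame. To keep it runnable without extra packages, it flattens white space with base R's gsub, which is a stand-in for the str_replace_all call above, not the same function.

```r
frm <- data.frame(id = 1:2,
                  txt = c("line one\nline two", "tab\there"),
                  stringsAsFactors = FALSE)

# Base-R stand-in for str_replace_all: collapse runs of white space.
frm$txt <- gsub("\\s+", " ", frm$txt)

tmp <- tempfile(fileext = ".tsv")
write.table(frm, tmp, sep = "\t",
            quote = FALSE, row.names = FALSE, fileEncoding = "UTF-8")
back <- read.table(tmp, sep = "\t", header = TRUE,
                   quote = "", comment.char = "", stringsAsFactors = FALSE)

# all.equal compares values and column types, not just dimensions.
print(all.equal(frm, back))
```

If all.equal reports differences on your own data, the usual suspects are leftover tabs or line breaks in the text column.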

Warning. R’s file-writing functions will happily overwrite any pre-existing file of the same name, without asking you. You have to write the code to prevent this, if you don’t want it to happen:

if (file.exists(filename)) {
    stop(filename, " already exists. Don't clobber it!")
}
write.table(frm, filename, sep="\t",
    quote=F, row.names=F, fileEncoding="UTF-8")

(stop is a function for producing an error message of your own and halting execution.)
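
If you find yourself repeating this check, you can wrap it in a small function. The helper below is hypothetical (save_tsv and its overwrite argument are my own names, not part of R); it bundles the existence check with the write.table call used above.

```r
# Hypothetical helper: write a TSV, refusing to overwrite unless asked.
save_tsv <- function(frm, filename, overwrite = FALSE) {
    if (file.exists(filename) && !overwrite) {
        stop(filename, " already exists. Don't clobber it!")
    }
    write.table(frm, filename, sep = "\t",
                quote = FALSE, row.names = FALSE, fileEncoding = "UTF-8")
    invisible(filename)
}

tmp <- tempfile(fileext = ".tsv")
save_tsv(data.frame(x = 1:3), tmp)                     # first write succeeds
second <- tryCatch(save_tsv(data.frame(x = 4:6), tmp), # second is refused
                   error = function(e) "refused")
print(second)
save_tsv(data.frame(x = 4:6), tmp, overwrite = TRUE)   # explicit opt-in
```

Making overwrite default to FALSE means that clobbering a file always takes a deliberate extra argument.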

A typical CSV-writing call (for data whose values contain no commas or quotation marks) might look like:

write.csv(frm, filename, quote=F, row.names=F)

for which the results can be read in with:

read.csv(filename, quote="", stringsAsFactors=F)

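CSV is a convenient (though space-inefficient) format for storing numerical matrices. One catch: write.csv will not let you drop the header line (it ignores col.names and issues a warning), so to write a bare matrix use write.table with sep=",". If you have a matrix object, a typical call looks like:

write.table(m, filename, sep=",", row.names=F, col.names=F)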

and the results can be read back in with:

as.matrix(read.csv(filename, header=F))

(read.csv always returns a data frame, so if you want a matrix you must apply as.matrix. A more direct route uses the lower-level scan function.)
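
A minimal sketch of the scan route, assuming a purely numeric matrix stored without a header line. The matrix m and the file name are invented; count.fields is used only to recover the number of columns from the first line.

```r
m <- matrix(c(1.5, 2, 3.25, 4, 5, 6), nrow = 2)

# Write the bare numbers, one matrix row per line, no header.
tmp <- tempfile(fileext = ".csv")
write.table(m, tmp, sep = ",", row.names = FALSE, col.names = FALSE)

# scan returns one long numeric vector, read row by row;
# byrow=TRUE puts the values back in their original positions.
n_col <- count.fields(tmp, sep = ",")[1]
m2 <- matrix(scan(tmp, sep = ",", quiet = TRUE),
             ncol = n_col, byrow = TRUE)
print(all.equal(m, m2))
```

Because scan never builds a data frame, this avoids the as.matrix conversion and the dummy V1, V2, ... column names.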

More possibilities

I keep alluding to the low-level scan function. Sometimes it solves problems read.table cannot. help(scan) is quite comprehensive, not to say exhausting, and Teetor has some good remarks on this function too.

A new package has appeared on CRAN, readr, which supplies replacements for these functions which are simpler, more consistent, and quite a bit faster. But the package is new, and its documentation is rather sparse. I recommend switching over to readr’s read_tsv and friends for your future projects, but not for this one.

It’s just possible that tabular data will come to you in JSON format (JSON is really designed for more hierarchically arranged data, but it works fine for tabular data too). The jsonlite package brings JSON into R list structures very simply, with

fromJSON(filename)

You will then have to figure out how to convert the results (usually a list) into a data frame (or matrix) of the right configuration. In some cases, this is very simple. If fromJSON returns a list whose elements are vectors of the same length, then you can simply use as.data.frame(x, stringsAsFactors=F) to get a data frame.
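
Here is a small sketch of that simple case, assuming the jsonlite package is installed. The JSON content is invented. The second half shows a different layout, an array of records, which fromJSON's default simplification turns into a data frame with no further work.

```r
library(jsonlite)

# Column-oriented JSON: one equal-length array per column.
tmp <- tempfile(fileext = ".json")
writeLines('{"id": [1, 2, 3], "word": ["a", "b", "c"]}', tmp)
x <- fromJSON(tmp)
frm <- as.data.frame(x, stringsAsFactors = FALSE)
str(frm)

# Row-oriented JSON (an array of records), passed here as a string.
records <- fromJSON('[{"id": 1, "word": "a"}, {"id": 2, "word": "b"}]')
print(is.data.frame(records))
```

More deeply nested JSON will not fall into either pattern, and then you are back to picking the list apart by hand.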


  1. A tricky exception: if the first row appears to R to be one column shorter than the rest, the first column of all subsequent rows is treated as row names instead of a column in its own right. That’s rarely what you want.