The last homework generated some…interesting glitches related to one of the banes of all literary data analysis: text encoding.

Remember Petzold on ASCII? He points out that ASCII is all well and good if you’re sticking to the characters found on an American typewriter, but as soon as you think about the big world of writing systems you need something more than the 128 code points of ASCII. The standard solution, these days, is Unicode. Unicode assigns a unique number (conventionally written as “U+” followed by four to six hexadecimal digits) to each character in all the world’s writing systems (aspirationally speaking). Unicode itself can be encoded in multiple different ways: the most common encoding is UTF-8. (For more discussion, see the accessible if moderately annoying essay by software development guru Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).)
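To make the distinction between a code point and its encoding concrete, here is a quick sketch in base R (no packages needed): the character “é” is a single code point, U+00E9, but it takes two bytes in UTF-8.

```r
ch <- "\u00e9"     # "é", the single Unicode code point U+00E9
utf8ToInt(ch)      # the code point as an integer: 233 (0xE9)
charToRaw(ch)      # its UTF-8 encoding: the two bytes c3 a9
```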

But it’s the pluralism that matters, because it means there is no bullet-proof assumption you can make about the encoding of a text file: not the encoding it was saved in, not the encoding any program will assume when it reads the file, and not the encoding any program will use when it saves the file. R, designed before Unicode was in widespread use and employed mostly in the English-default world of statistical analysis, does not have the world’s most robust capabilities for dealing with these issues, either.
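You can watch this pluralism at work from the R console: the same word takes a different byte sequence under a different encoding. A sketch using base R’s iconv():

```r
x <- "caf\u00e9"                           # "café", marked as UTF-8 by R
lat <- iconv(x, from = "UTF-8", to = "latin1")

# Same characters, different bytes:
nchar(x, type = "bytes")                   # 5: the "é" takes two bytes in UTF-8
nchar(lat, type = "bytes")                 # 4: the "é" is one byte in Latin-1
```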

In any case, here’s what you need to know. You may have a character encoding problem if:

  1. You open a source code file in RStudio and see garbage characters or ?? characters.

  2. You knit a PDF and find that the PDF contains garbage characters, especially in places where you expected accented or other non-ASCII characters.

  3. You are using grep() or other regular-expression functions and getting bizarre results (again around non-ASCII characters).

  4. You attempt to knit a PDF and get an error message mentioning “Package inputenc Error.”

  5. You can’t knit a PDF file, but you can knit an HTML file from the same R markdown.
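A quick way to investigate several of these symptoms is to find exactly where the non-ASCII characters are. One base-R sketch (the sample vector here stands in for, say, the result of readLines() on your R markdown file): strip everything outside ASCII with iconv() and see which elements changed.

```r
lines <- c("plain ascii", "accented: r\u00e9sum\u00e9")

# Drop all non-ASCII characters, then see which elements changed
stripped <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
which(lines != stripped)   # 2: only the second element has non-ASCII characters
```

Base R’s tools::showNonASCII() does something similar, printing the offending lines directly.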

Here are some steps to forestall some of these problems.

  1. Use xelatex instead of pdflatex. Both xelatex and pdflatex are programs to make PDFs from your markdown. xelatex is smarter about Unicode (and fonts, incidentally). I have updated the Homework template in the litdata package so that when you Knit, xelatex will be used. It is possible that this will produce new errors if xelatex was not installed when you installed LaTeX. If so, look at the top of your R markdown file for the line reading

    latex_engine: xelatex

    and change xelatex to pdflatex.
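    For orientation, that line lives in the YAML header at the very top of the R markdown file; a typical header (the title here is just an illustration) looks something like this:

    ```yaml
    ---
    title: "Homework 5"
    output:
      pdf_document:
        latex_engine: xelatex
    ---
    ```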

  2. Explicitly specify character encodings when you read in files. For example, on hw5, make sure you load the ECCO data as follows:

    read.csv("ecco-headers.csv", as.is=TRUE, encoding="UTF-8")

    (The file I have provided you with is indeed UTF-8 encoded.)
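    As a sanity check after reading, base R’s validUTF8() will tell you whether the strings you got are valid UTF-8. A self-contained sketch (the two-row CSV written here is just a stand-in for the real ecco-headers.csv):

    ```r
    # A tiny UTF-8 CSV standing in for ecco-headers.csv
    tf <- tempfile(fileext = ".csv")
    writeLines(enc2utf8(c("title", "Trait\u00e9")), tf, useBytes = TRUE)

    d <- read.csv(tf, as.is = TRUE, encoding = "UTF-8")
    all(validUTF8(d$title))   # TRUE: the accented string came through intact
    ```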

  3. Ensure that your source files are loaded and saved as UTF-8 by RStudio. To do this, open up your hw5 file. Now go to the File menu and choose “Reopen with Encoding…” Click “UTF-8.” If it is not listed as the default (it should be, if you have a Mac; possibly not, if you’re on Windows), also check the checkbox “set as default encoding for all source files.” Finally, just to make sure, also choose the “Save with encoding…” command under File and again choose “UTF-8.”
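    If you would rather verify from the console than trust the menus, you can check whether a file’s contents are valid UTF-8; in this sketch a temporary file stands in for your hw5 source:

    ```r
    # A temporary file standing in for hw5.Rmd
    tf <- tempfile(fileext = ".Rmd")
    writeLines(enc2utf8("# Na\u00efve title"), tf, useBytes = TRUE)

    all(validUTF8(readLines(tf, warn = FALSE)))   # TRUE if the file really is UTF-8
    ```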