The subject of our readings for the week is text encoding and its theoretical implications. To help get us oriented to what this discussion is about, I’m asking you to look over some TEI markup. The sample files are in a zip archive on Sakai. Here’s what they are, and some questions to think about as you browse.

The Text Encoding Initiative defines the standard way to mark up historical documents in plain text. It is very complex; take a look at the Guidelines and their supposedly Gentle Introduction to XML to see just how complex. [Edited 4/1/15]: A better introduction to the TEI is the TEI Lite document, which describes a “basic” subset of the TEI in an accessible way (and from which you can infer much of what you need to know about XML as used in text encoding). Its appendix is also a good starting point for explanations of the most commonly used TEI elements.

The easiest way to get a feel for what TEI-XML looks like is to work from examples. I have collected sample files from three sources.


This file comes from the publicly released selection of texts from ECCO, whose metadata we have already studied.

William Congreve, Love for love: A comedy… (London: Jacob Tonson, 1704)

Modernist Journals Project

The MJP has offered some of their TEI for download: Here are issues of Blast and the Crisis.

Blast 1 (1914)
Crisis 22, no. 2 (June 1921)

Victorian Women Writers’ Project

Files from this project are available from a GitHub repository with a selection of Indiana text-encoding projects:

Mary Elizabeth Braddon, Lady Audley’s Secret (London: Tinsley Bros., 1862), vol. 1
Mary Elizabeth Braddon, Lady Audley’s Secret (London: Tinsley Bros., 1862), vol. 2
Mary Elizabeth Braddon, Lady Audley’s Secret (London: Tinsley Bros., 1862), vol. 3

How to browse

RStudio can open these files and even knows how to highlight XML tags. Alternatively, open these files in a good text editor, like TextWrangler on Mac or Notepad++ on Windows. Skim the texts rapidly with an eye to these questions:

  1. What are the rules of XML? That is, what conventions do these all follow?

  2. What range of variation seems to be permitted?

  3. What affordances does this kind of encoding give? What kinds of questions could you start to answer with this encoded text in hand?

  4. Just how would you get R to help you answer those questions? What kinds of functions and data structures will you need to capture and analyze the format of TEI markup?
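
As one possible starting point for questions 3 and 4, here is a minimal sketch of the kind of operations involved: parse the XML into a tree, count element names, and tally one structural feature. It is written in Python with only the standard library rather than R, purely to keep it self-contained; the same steps (parse, walk the tree, select by element name, tabulate) should carry over to an XML-parsing package in R. The TEI fragment and the helper names (`local_name`, `tag_counts`, `lines_per_speaker`) are invented for illustration and are not taken from the sample files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented TEI-like fragment for illustration; NOT taken from the sample files.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Hypothetical Play</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="act" n="1">
        <sp who="#a"><speaker>A.</speaker><l>First line of verse.</l><l>Second line.</l></sp>
        <sp who="#b"><speaker>B.</speaker><l>A reply.</l></sp>
      </div>
    </body>
  </text>
</TEI>"""

# Prefix-to-URI mapping used in the path queries below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def tag_counts(root):
    """Count how often each element name occurs anywhere in the tree."""
    return Counter(local_name(el.tag) for el in root.iter())


def lines_per_speaker(root):
    """Tally verse lines (<l>) inside each speech (<sp>), keyed by speaker label."""
    tally = Counter()
    for sp in root.iterfind(".//tei:sp", TEI_NS):
        speaker = sp.findtext("tei:speaker", default="?", namespaces=TEI_NS)
        tally[speaker] += len(sp.findall("tei:l", TEI_NS))
    return tally


root = ET.fromstring(SAMPLE)
print(tag_counts(root).most_common(3))
print(dict(lines_per_speaker(root)))
```

Two things are worth noticing. First, the natural data structure is a tree of elements, and most questions reduce to selecting nodes by name or attribute and then tabulating. Second, the invented fragment declares the TEI namespace, so every query has to name it; a sample file without a namespace declaration would be queried with bare element names instead.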