Literary Data: Some Approaches

Note to web visitors: This was a graduate course offered in Spring 2015. For the programming practicum in this course, I developed a series of presentations and a set of homeworks, which I leave online for anyone who might wish to make use of them (the homework solutions are also available on the same page). All this material is available for non-commercial use or modification; please attribute what you use to me. As I note on the syllabus, I drew a great deal on syllabuses by others in designing the course; I hope someone else can use what I have done in turn.
—Andrew Goldstone, June 15, 2015

Further note, June 13, 2019: This site has been moved to a new host; its URL is now http://www.sas.rutgers.edu/virtual/ag978/litdata. Old URLs to rci.rutgers.edu/~ag978/litdata should continue to work.

Course description

In the last ten years, the strange quasi-disciplinary formation known as DH or Digital Humanities has renewed the struggle over methods in literary studies. Analyses of digitized texts using computer-assisted techniques promise to transform the kinds of evidence, the methods of interpretation, and the modes of argument which matter to literary scholarship. Data is now a subject of energetic debate in literary studies: what constitutes literary data, and how should it be analyzed and interpreted? How might aggregation and quantification produce new knowledge in literary scholarship? What methods are most appropriate for grappling with the enormous, and enormously messy, world of digitized literary texts and data about literature?

This course pursues two aims in parallel: to engage with the history and current practice of literary data analysis, and to introduce the foundational skills of literary data analysis in the R programming language. Class time will be divided between seminar and practical instruction. The seminar discussions trace theoretical debates about literary data from structuralism and scientific bibliography, to experiments in computational stylistics, to contemporary scholarly controversies in and around DH. The practicum surveys the fundamentals of programming and data manipulation, with an introduction to selected numerical techniques and data visualizations. Short homework exercises supplement the in-class instruction, with an emphasis on handling actual literary data of various kinds.

There are two major assignments. A short position paper on a theoretical question about literary data and DH is due at midterm. The final assignment is to plan, carry out, and report on a small-scale project in literary data analysis. This project is to be undertaken in small groups; the report will detail methods and interpretations together with code and data.

No special technical expertise of any kind is expected; instruction begins from first principles. However, the work of programming does require willingness to experiment, patience in the face of frustration, and the nerve to ask for help as often as needed.

Bring your own laptop to class, if you have one; loaner laptops will also be available for in-class workshops. Mac OS X and Linux are the preferred operating systems for work in the course, but Windows will be accommodated as well.

Announcements

June 15, 2015 Now You Know How to Go On

For the last class, I had planned to circulate some suggestions on where you could go on next in the theory and practice of literary data analysis. But the end of term got to me. I have finally written up those suggestions, as a belated coda: Supplement 2, with some discussion and bibliographic pointers.

May 6, 2015 Reading and writing tabular data files

You are now experts at handling data frames in R, but we have not spent as much time on getting data from files into R (and even less on saving a data frame to disk). The most convenient formats for tabular data (and the most commonly encountered) are CSV (comma-separated values) and TSV (tab-separated values).
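
As a rough illustration of the round trip between a data frame and these two formats, here is a minimal sketch using only base R. The data frame `titles`, its column names, and the temporary file paths are invented for this example and are not course data.

```r
# A small, made-up data frame standing in for real literary data.
titles <- data.frame(
    title = c("Pamela", "Tom Jones", "Evelina"),
    year = c(1740, 1749, 1778),
    stringsAsFactors = FALSE
)

# Temporary files, so the sketch leaves nothing in your working directory.
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")

# Write CSV; row.names = FALSE avoids an extra column of row numbers.
write.csv(titles, csv_file, row.names = FALSE)

# Write TSV with the more general write.table().
# quote = FALSE is safe here only because no field contains a tab.
write.table(titles, tsv_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read both files back. stringsAsFactors = FALSE keeps text columns
# as character vectors rather than factors.
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
from_tsv <- read.delim(tsv_file, stringsAsFactors = FALSE)

str(from_csv)
```

Note that the values survive the round trip but the column types may shift slightly (a numeric year column may come back as integer), so it is worth checking the result with `str()` after reading a file.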

April 26, 2015 Organizing a Project

When you move from short exercises to a more sustained project, a whole new set of challenges makes itself known. I’ve learned a few lessons the hard way about organizing a programmatic data-analysis project, and I’d like to attempt to spare you the hard parts. Here are some suggestions for how you might work with data and program code as you do your research and prepare the report.

April 6, 2015 R Markdown and Figures

It seems as though a few more details about how to do figures in R Markdown would be helpful. The executive summary is: put fig_caption: true under pdf_document: in the YAML block at the start of your R Markdown files. Control whether a figure floats or not by leaving blank lines on both sides of the R code chunk that creates the figure.

March 28, 2015 Browsing the TEI Sampler

The subject of our readings for the week is text encoding and its theoretical implications. To help get us oriented to what this discussion is about, I’m asking you to look over some TEI markup. The sample files are in a zip archive on Sakai. Here’s what they are, and some questions to think about as you browse.

March 13, 2015 Deadline adjustments

A few deadline adjustments as we continue the catching-up process. The position paper is now due Wednesday, March 25 at 10 a.m. I will consider extension requests, but not past March 29. And the next homework is due Monday, March 30.

February 25, 2015 When R Stalls

Programs have multiple failure modes: they can go wrong in more than one way. You have all encountered two relatively friendly failure modes: the Knit PDF that includes R error messages instead of the results you were trying to get, or the failure to knit with an error message in RStudio. That may not seem altogether friendly, but at least it gives you a hint of what has gone wrong and where. But other kinds of failure are more challenging. Working with text data often leads to a particularly tricky kind of glitch: failure to complete. In other words, you try some code, and R never finishes running it. If you are knitting, you see that the Knit process is happening, but in the middle of some “code chunk” the thing stalls and no PDF is forthcoming. What to do?

February 22, 2015 Paper assignment

I’ve written up the assignment for the short position paper due March 23. Take a look, and we’ll discuss it in class next time.

February 20, 2015 Encoding problems: spotter's guide

The last homework generated some…interesting glitches related to one of the banes of all literary data analysis: text encoding.

February 9, 2015 We make our meek adjustments

I’ve done a little rearrangement of the syllabus, plus a new consolidated homeworks/solution sets page for you to bookmark. And as usual, there are updates to the agoldst/litdata package for you to install. For those of you feeling exploratory, a first draft of some Early Modern Titles data is online. Eventually we’ll work together on this, but if you’re champing at the bit, the main thing you’ll need to do, after reading Homework 4, is to read the R help for read.csv and Teetor, §4.7–8.

January 29, 2015 Initial remarks on programming in R

I keep meaning to note this. I wrote a little page with some remarks on learning R, which are meant as encouragement. The two key things are to work together, and to try to frame frustrations and mistakes as productive moments of learning. I’ve also added a couple of notes on the practicalities of debugging.

January 22, 2015 The DH community and digital media

I believe that DH is held together much less by shared theoretical commitments or methods than by particular genres of scholarly communication: in particular, the blog, the tweet, the “experiment” or informal publication, and the interactive project prototype website. If you so desire, this course is an opportunity for you to join that community, which is comparatively welcoming towards graduate students and novices. Twitter has been frankly indispensable during my own novitiate in DH—I met a collaborator through it!—though there are many good reasons to hesitate before you join. As with all forms of publicness, academics are well-advised to be deliberate about their engagements. I am in favor of blogging, on the other hand, but because this course already asks so much of you, I am not requiring regular blogging. I invite you either to blog on the course Wordpress site, litdata15.blogs.rutgers.edu, or to use that site to circulate links to any posts on a personal blog you wish us to see. If you’d like to learn how to make use of your own university webspace (www.rci.rutgers.edu/~netid), I’m glad to talk with you about this in office hours.

January 13, 2015 Supplementary notes on Petzold

A note on the Petzold reading for the first meeting: you are not expected to be able to follow all the details of Petzold’s little program for converting strings of ASCII characters to uppercase. However, as a sneak preview of where we’re headed, I’ve added some notes on the program to this site, including some more detailed explanation of the program, plus several translations into R. In a few weeks, you’ll be able to make sense of the R versions (if not the Intel 8080 assembly language—though that’s not so bad either). This is all strictly optional.
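
For a rough sense of what such a translation can look like, here is a minimal sketch in R of the general idea of uppercasing ASCII text by arithmetic on character codes. It is not Petzold's program, nor one of the translations in the notes mentioned above; the function name `ascii_upper` is made up for this illustration.

```r
# Convert ASCII lowercase letters to uppercase by arithmetic on character codes.
# In ASCII, "a" to "z" are codes 97 to 122 and "A" to "Z" are codes 65 to 90,
# so the uppercase form of a lowercase letter is its code minus 32.
ascii_upper <- function(s) {
    codes <- utf8ToInt(s)                     # one integer per character
    is_lower <- codes >= 97L & codes <= 122L  # TRUE only for a to z
    codes[is_lower] <- codes[is_lower] - 32L  # shift those codes to A to Z
    intToUtf8(codes)                          # reassemble a single string
}

ascii_upper("Call me Ishmael.")
```

In everyday R you would simply call the built-in `toupper()`; the point of spelling it out is to see that "uppercase" here is nothing more than a comparison and a subtraction on numbers.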

January 9, 2015 Syllabus update and preliminary readings

The readings for the first meeting are available on Sakai. I have brought the syllabus up to date. The Software Setup page has some initial notes on the software you should install. I will be updating and expanding this page before the semester starts.

September 19, 2014 Draft syllabus

A preliminary syllabus (pdf). This owes more than I can say both to collegial advice and to the many models I’ve found out there. I continue to welcome questions of any kind from interested students.

June 25, 2014 Draft course description

A first draft overview of the course, with a bibliography of possible readings: description (pdf). Subject to change. The detailed list of topics and the schedule will be on the syllabus, which I’ll post in the fall.