This homework is mostly text for you to read, so it’s rather long for the web. I have made a printable PDF version as well: hw4.pdf.

Update the course software with devtools::install_github("agoldst/litdata"). Now, for the first time, you will also have to load this software explicitly:

library("litdata")

This is done for you in the homework 4 template (also in the package update). We’ll be using stringr again too, but when you load litdata you get stringr as well.

You do not have to read any of Jockers for this homework. If you have Teetor, you may find the suggested sections useful.

Now that we have spent some time with R’s simple data types (vectors of various kinds), we are ready to think about complex types. But we need to cover one more simple type first.

Factors

A factor is yet another type of vector, representing categorical data. Character vectors would work for storing categorical data:

birth_cc <- c("FR", "DE", "NO", "FR", "PL", "IT", "UK", "DE", "DE")

These are country codes for the countries of birth of the earliest Nobel literature laureates. Yet the character vector doesn’t contain any information about the fact that country codes are chosen from a finite set of possibilities: the 9 entries on this list use only 6 distinct codes. A factor adds just this information:

birth_country <- factor(birth_cc)
birth_country
[1] FR DE NO FR PL IT UK DE DE
Levels: DE FR IT NO PL UK

Notice the absence of quotation marks and the information about the “levels” of the factor. The levels are the categories from which our categorical value is drawn. To find the levels of a factor:

levels(birth_country)
[1] "DE" "FR" "IT" "NO" "PL" "UK"

More formally, factor(xs) takes a vector xs and produces a factor of the same length as xs whose levels are the sorted unique values of xs, that is, sort(unique(xs)), and whose elements correspond to those of xs. A factor is a vector and can be subscripted in all the usual ways by [...].

birth_country[birth_country != "FR"]
[1] DE NO PL IT UK DE DE
Levels: DE FR IT NO PL UK

You rarely need to create a factor explicitly. But you will encounter them fairly frequently. In fact, you already have. Secretly, the table function requires a factor argument. On reflection, this makes sense: in order to tabulate how often each element of a vector occurs, we need to know that the elements of a vector are things that can occur more than once. This is precisely what a factor is for. If xs is a character vector, table(xs) is a shortcut for table(factor(xs)).
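To check that last claim for yourself, here is a minimal sketch that tabulates the country codes from above both ways. The names counts_chr and counts_fac are my own, introduced only for illustration.

```r
# Country codes copied from the example above
birth_cc <- c("FR", "DE", "NO", "FR", "PL", "IT", "UK", "DE", "DE")

# Tabulate the character vector directly, then the explicit factor
counts_chr <- table(birth_cc)
counts_fac <- table(factor(birth_cc))

# The categories are the factor levels
names(counts_chr)

# Same counts either way
identical(as.vector(counts_chr), as.vector(counts_fac))  # TRUE
```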

Two extra notes on factors

The last thing to know at this stage is that factors can be a source of annoying and mysterious errors. Sometimes, you just want the character vector. (That Levels: display can be annoying when you have a lot of levels. Think about making a factor from every word of The Sheik, as in fact we did.) You can test whether a variable x is a character vector with the function is.character(x). You can coerce a factor fac back into a plain character vector with as.character(fac).

Supplementary information. You can do more with the factor function. You can explicitly set the levels with factor(xs, levels=lvls), which allows you to include categories that don’t actually occur in xs (but might later). You can also identify the categories as ordered with factor(xs, ordered=T), but the consequences of this we won’t see until later. It might occur to you that once you know the levels, you don’t actually have to store the factored values as character strings: you just need the integer indices into the levels vector. This is in fact how R normally stores factors internally, as you can see by trying out the function as.numeric on a factor.
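A small sketch of both points, reusing the country codes from above. The extra level "SE" is a made-up category that does not occur in the data, and lvls and codes are illustrative names of my own.

```r
birth_cc <- c("FR", "DE", "NO", "FR", "PL", "IT", "UK", "DE", "DE")

# "SE" is a hypothetical extra category that never occurs in birth_cc
lvls <- c("DE", "FR", "IT", "NO", "PL", "SE", "UK")
birth_country <- factor(birth_cc, levels=lvls)

# The unused level appears in a tabulation with a count of zero
table(birth_country)

# The integer codes are indices into the levels vector
codes <- as.integer(birth_country)
codes

# Looking the codes up in the levels recovers the original strings
identical(levels(birth_country)[codes], birth_cc)  # TRUE
```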

Lists

At last we can move on to compound types. In R, compound types are collections of vectors. Depending on what assumptions we can make about the vectors, we will use different types.

Introduction

If we make no assumptions, we have the list. A list is an ordered collection of values of any type. We made vectors with c; we make lists with list, separating elements by commas.

sonnet_words <- list("Music", "to", "hear", "why")
sonnet_words
[[1]]
[1] "Music"

[[2]]
[1] "to"

[[3]]
[1] "hear"

[[4]]
[1] "why"

This looks almost the same as the vector c("Music", "to", "hear", "why"), except for the extra stuff in the output. To get the length of a list, use length. All the different kinds of indexing we used with vectors are also possible with lists. But when you use [...] subscripting with lists, the result is a list:

sonnet_words[2:3]
[[1]]
[1] "to"

[[2]]
[1] "hear"

If you want to get out a single element in non-list form, use [[...]] instead:

sonnet_words[[2]]
[1] "to"
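The difference between the two kinds of brackets is easy to check directly. In this sketch, one_bracket and two_brackets are names I have made up for the two results.

```r
sonnet_words <- list("Music", "to", "hear", "why")

one_bracket <- sonnet_words[2]     # still a list, of length 1
two_brackets <- sonnet_words[[2]]  # the element itself

is.list(one_bracket)        # TRUE
is.list(two_brackets)       # FALSE
is.character(two_brackets)  # TRUE
```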

Recall that when we were making vectors with c, we figured out that nesting c calls didn’t make a difference:

c("Music", c("to", "hear"))
[1] "Music" "to"    "hear" 
c("Music", "to", "hear")
[1] "Music" "to"    "hear" 

Lists, on the other hand, can capture a hierarchical structure. Any element of a list can be a vector of any length:

sonnet_words <- list(c("Music", "to", "hear"),
                     c("why", "playst", "thou", "music"))
sonnet_words
[[1]]
[1] "Music" "to"    "hear" 

[[2]]
[1] "why"    "playst" "thou"   "music" 

This is a list of two elements, whose first element is a vector of three elements:

sonnet_words[[1]]
[1] "Music" "to"    "hear" 

Lists can be iterated over in a for loop. Pay close attention to what happens here:

for (words in sonnet_words) {
    print(str_c(words, collapse=" "))
}
[1] "Music to hear"
[1] "why playst thou music"

Why are there two lines and not seven?

Lists may be heterogeneous with respect to type. This is a perfectly good list:

stuff <- list("Gaskell", 3.14159, F)

By contrast, recall that a vector must be a single type. If we try to make a vector, the non-character values will be coerced to character:

c("Gaskell", 3.14159, F)
[1] "Gaskell" "3.14159" "FALSE"  

But I didn’t want the strings "3.14159" and "FALSE", I wanted the number and the logical value. (Otherwise funny things happen; e.g., I get an error if I use logical operators like & with the string "FALSE".) The different types are preserved in the list form:

stuff[[3]]
[1] FALSE
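One way to see the contrast is to ask for the class of each element. This is only a sketch; element_types and flattened are illustrative names.

```r
stuff <- list("Gaskell", 3.14159, FALSE)

# Each list element keeps its own type
element_types <- sapply(stuff, class)
element_types  # "character" "numeric" "logical"

# The vector version has a single type for everything
flattened <- c("Gaskell", 3.14159, FALSE)
class(flattened)  # "character"

# So logical operators still work on the list element
stuff[[3]] & TRUE  # FALSE
```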

Named elements

Lists, like vectors, can also have names to refer to their elements. names(lst) is a vector of names for the elements of lst, which can be manipulated just like names(x) for a vector x. But it’s even more common to create lists using named parameters, like this:

personae <- list(lear=c("Lear", "king of Britain"),
                 edgar=c("Edgar", "son of Gloucester"),
                 edmund=c("Edmund", "bastard", "son to Gloucester"))

personae is now a three-element list, but it can be conveniently accessed using names:

personae[["edgar"]]
[1] "Edgar"             "son of Gloucester"

This is so common that R has a special shortcut, $:

personae$edgar
[1] "Edgar"             "son of Gloucester"

You can add new elements onto a list using this shortcut as well as the methods familiar from vectors:

personae$goneril <- c("Goneril", "daughter to Lear")
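Here is a short sketch of both ways of adding elements, starting from a two-element version of the list so that the growth is visible in names and length.

```r
personae <- list(lear=c("Lear", "king of Britain"),
                 edgar=c("Edgar", "son of Gloucester"))
names(personae)  # "lear" "edgar"

# Add one element with the $ shortcut...
personae$goneril <- c("Goneril", "daughter to Lear")

# ...and another with [[...]] and a name
personae[["edmund"]] <- c("Edmund", "bastard", "son to Gloucester")

names(personae)
length(personae)  # 4
```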

Uses of lists

Now we can at last demystify the results of strsplit and str_split. These functions return lists. Why? Because they accept a character vector, and they return a list of vectors, one vector of split-up words per element of the input vector.

sonnet_lines <- c("Were't aught to me I bore the canopy",
                  "With my extern the outward honoring")
str_split(sonnet_lines,
          "\\W")
[[1]]
[1] "Were"   "t"      "aught"  "to"     "me"     "I"      "bore"   "the"   
[9] "canopy"

[[2]]
[1] "With"     "my"       "extern"   "the"      "outward"  "honoring"

The result here is called a ragged list, because its elements have different lengths (nine words from the first line, six from the second). In general, str_split returns a ragged list.

unlist(lst) returns a vector with all the elements of lst one after another (“flattened”). But remember: if lst is heterogeneous, the results of unlist(lst) will be forced to be all of one type (probably character vector). You lose the hierarchical structure of the list. When we were splitting up the words of The Sheik, we didn’t care: the initial vector of lines of the text file represented a pretty arbitrary division of the text anyway. But we often will care.
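A minimal sketch of flattening, using the two sonnet lines from above. To keep it self-contained I use base R's strsplit in place of str_split; words_by_line and all_words are illustrative names.

```r
sonnet_lines <- c("Were't aught to me I bore the canopy",
                  "With my extern the outward honoring")

# strsplit (base R) stands in for str_split here
words_by_line <- strsplit(sonnet_lines, "\\W")
lengths(words_by_line)  # 9 6: a ragged list

# Flattening discards the division into lines
all_words <- unlist(words_by_line)
length(all_words)  # 15

# A heterogeneous list is forced to a single type
mixed <- unlist(list("Gaskell", 3.14159, FALSE))
class(mixed)  # "character"
```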

Exercise

If you loaded the course package with library("litdata"), you now have a variable called sheik_ll, with the lines of the Project Gutenberg text of The Sheik. This is just what you worked with on homework 2, but in order to skip the file-loading part, which caused some glitches last time around, I’ve provided the data for you in my litdata package. We’ll have plenty more practice with file loading soon enough.

The following lines, which use str_detect, a function we haven’t yet studied, derive a character vector with one element for each chapter of The Sheik.

# strip off all but the body text
body_ll <- sheik_ll[match("CHAPTER I", sheik_ll):
                    (match("THE END", sheik_ll) - 1)]
# find chapter heading indices:
# str_detect gives a logical vector telling us which lines
# match the pattern; which tells us the indices
chap_indices <- which(str_detect(body_ll, "^CHAPTER [IVX]"))

body_lc <- tolower(body_ll)

# initialize results vector
sheik_chaps <- character(length(chap_indices))

# tricky edge condition: contrive the last "chapter start"
# to be one past the last line
chap_indices <- c(chap_indices, length(body_ll) + 1)

for (j in seq_along(chap_indices[-length(chap_indices)])) {
    sheik_chaps[j] <- str_c(body_lc[chap_indices[j]:
                                   (chap_indices[j + 1] - 1)],
                           collapse=" ")
}

If you can guess what str_detect does, you can figure this code out using what you know, but it is fiddly. We will come back to it soon. For now, it suffices to say that if you run this code, sheik_chaps holds each chapter in an element of the vector.
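If it helps, the same edge-condition trick can be tried on a toy example small enough to check by eye. Everything here is made up for illustration: toy_ll is a hypothetical five-line "book", not text from The Sheik, and I use base R's grepl and paste in place of str_detect and str_c so the sketch runs without any packages.

```r
# A made-up miniature "book" (hypothetical lines)
toy_ll <- c("CHAPTER I", "first line", "second line",
            "CHAPTER II", "third line")

# Indices of the chapter headings
starts <- which(grepl("^CHAPTER [IVX]", toy_ll))  # 1 4

# Pretend one more chapter starts just past the last line
starts <- c(starts, length(toy_ll) + 1)           # 1 4 6

# One result element per real chapter
toy_chaps <- character(length(starts) - 1)
for (j in seq_along(toy_chaps)) {
    chap_lines <- toy_ll[starts[j]:(starts[j + 1] - 1)]
    toy_chaps[j] <- paste(tolower(chap_lines), collapse=" ")
}
toy_chaps
```

Each chapter runs from its own heading to the line before the next heading, and the contrived final "start" lets the last chapter follow the same rule as the others.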

Now we split this up using our friend str_split:

sheik_chaps_words <- str_split(sheik_chaps, "\\W+")

sheik_chaps_words is a list.

Now write a for loop that computes, for each chapter, the fraction of its words that are “arab” and the fraction that are “diana” (lowercase, since we converted the text with tolower above), storing the results in two vectors. You will have to use seq_along in the for statement and [[...]] indexing in the loop body. The fraction is just the number of times a word occurs in a chapter divided by the total number of words in that chapter. Division in R is notated with the forward slash, /.

arab <- numeric()
diana <- numeric()
# your for loop here

These lines will print out your results as occurrences per 10000 words:

For arab:

round(arab * 10000)

For diana:

round(diana * 10000)

I obtain the following rates per 10000 for arab:

 [1]  0 13 19  9 10 15  5 15 14  3

and for diana:

 [1] 33 52 51 35 28 26 34 48 18 22

(This is a modified version of some of what Jockers does in chapter 4.)

Data Frames

The next complex type to consider is the most important of all: the data frame. A data frame is a list, plus a single further assumption: all the elements are vectors of the same length, though not necessarily of the same type. The data frame represents tabular data—data that comes in rows and columns. Think of a spreadsheet as the paradigm for a data frame.

Here is a miniature data frame:

classes <- data.frame(class_date=c("2015-01-22", "2015-01-29", "2015-02-05"),
                      students_present=c(12, 12, 12),
                      topic=c("intro", "indexing", "loops"),
                      used_handouts=c(T, F, F)
                      )
classes
  class_date students_present    topic used_handouts
1 2015-01-22               12    intro          TRUE
2 2015-01-29               12 indexing         FALSE
3 2015-02-05               12    loops         FALSE

This is a little table with four columns of different types. The key idea here is that the rows of the table all refer to the same entity—one of our class meetings. The first row gives the date of our first meeting, the number of students present, and so on. Each column is a parallel series of simple data about the classes.

I can access the columns using my list accessor $:

classes$students_present
[1] 12 12 12

And indeed everything I can do to a list I can do to a data frame. Create a data frame with the following syntax:

data.frame(name1=col1,
           name2=col2,
           ...)

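The col1, col2... vectors should be of the same length. If they are not, the shorter ones will be recycled, provided their lengths divide the longest length evenly (this is sometimes useful); otherwise data.frame signals an error.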
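A quick sketch of recycling in data.frame, with made-up columns week and handout. The second call is wrapped in tryCatch only so that the failure can be shown without stopping the script.

```r
# Length 2 divides length 4 evenly, so handout is recycled
recycled <- data.frame(week=1:4, handout=c(TRUE, FALSE))
recycled$handout  # TRUE FALSE TRUE FALSE

# Length 2 does not divide length 3: data.frame signals an error
result <- tryCatch(data.frame(week=1:3, handout=c(TRUE, FALSE)),
                   error=function(e) "error")
result  # "error"
```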

Whereas lists and vectors have a single dimension, found with length, data frames have two dimensions: the number of rows and the number of columns. These are obtained with the nrow and ncol functions, or both at once in a vector with the dim function:

dim(classes)
[1] 3 4

Subscripting

But I can do more. Data frames can be subscripted using a two-place subscript [rows, cols]. I can pick out one element of my table:

classes[1, 2]
[1] 12
classes[1, 4]
[1] TRUE

I can use names:

classes[2, "topic"]
[1] indexing
Levels: indexing intro loops

Even more, I can leave the row or the column blank (but keep the comma):

classes[2, ]
  class_date students_present    topic used_handouts
2 2015-01-29               12 indexing         FALSE
classes[, "used_handouts"]
[1]  TRUE FALSE FALSE

Unlike [...] for lists, a two-place subscript that picks out a single column gives us a vector, not a data frame. If we pick out more than one column, as we do whenever we leave the column place blank, we get a data frame:

classes[2:3, ]
  class_date students_present    topic used_handouts
2 2015-01-29               12 indexing         FALSE
3 2015-02-05               12    loops         FALSE
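If you would rather keep a data frame even when selecting a single column, the two-place subscript accepts a drop=FALSE argument. Here is a small sketch; I rebuild classes with stringsAsFactors=FALSE so that the result is the same on any version of R, and one_col and kept are illustrative names.

```r
classes <- data.frame(class_date=c("2015-01-22", "2015-01-29", "2015-02-05"),
                      students_present=c(12, 12, 12),
                      topic=c("intro", "indexing", "loops"),
                      used_handouts=c(TRUE, FALSE, FALSE),
                      stringsAsFactors=FALSE)

# A single column comes back as a plain vector
one_col <- classes[, "topic"]
is.data.frame(one_col)  # FALSE

# drop=FALSE keeps the one-column data frame
kept <- classes[, "topic", drop=FALSE]
is.data.frame(kept)     # TRUE
dim(kept)               # 3 1
```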

The row and column specifications can even be logical vectors:

classes[c(T, T, F), ]
  class_date students_present    topic used_handouts
1 2015-01-22               12    intro          TRUE
2 2015-01-29               12 indexing         FALSE

which tends to be most useful when we form a logical expression that picks out table rows. For example, if I want all the data about the class that was about indexing, I write:

classes[classes$topic == "indexing", ]
  class_date students_present    topic used_handouts
2 2015-01-29               12 indexing         FALSE

Reflect carefully on why this works. This is an expression which is evaluated like any other.

classes$topic
[1] intro    indexing loops   
Levels: indexing intro loops
classes$topic == "indexing"
[1] FALSE  TRUE FALSE
logicals <- classes$topic == "indexing"
classes[logicals, ]
  class_date students_present    topic used_handouts
2 2015-01-29               12 indexing         FALSE

Similar operations are possible over columns, but this is less commonly used for data frames.
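For completeness, here is a sketch of a logical column subscript: it keeps only the columns of classes that are numeric or logical. The helper vector keep is my own name, and classes is rebuilt with stringsAsFactors=FALSE so the sketch behaves the same on any version of R.

```r
classes <- data.frame(class_date=c("2015-01-22", "2015-01-29", "2015-02-05"),
                      students_present=c(12, 12, 12),
                      topic=c("intro", "indexing", "loops"),
                      used_handouts=c(TRUE, FALSE, FALSE),
                      stringsAsFactors=FALSE)

# One TRUE or FALSE per column
keep <- sapply(classes, function(col) is.numeric(col) || is.logical(col))
keep

# Use the logical vector in the column place
picked <- classes[, keep]
names(picked)  # "students_present" "used_handouts"
```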

Row names

Data frames can have one further way to refer to their components. So far, we’ve seen that columns are named (and accessible with $ or [, cols]). Rows, too, can be named. This is optional. Just as we set the names of a vector with names(x) <- nms, we can set the row names of a data frame with an assignment to its rownames:

rownames(classes) <- c("intro_class", "indexing_class", "control_class")
classes
               class_date students_present    topic used_handouts
intro_class    2015-01-22               12    intro          TRUE
indexing_class 2015-01-29               12 indexing         FALSE
control_class  2015-02-05               12    loops         FALSE

(You can also get the names of the columns with colnames(classes) or just names(classes).) In this case, the row names are kind of redundant with the topic column, and it usually makes more sense to store information in a proper column than in row names (in fact, as we shall see, R idol Hadley Wickham would describe a data frame with information in row names as “untidy”). Still, this completes the suite of subscripting possibilities; just as we could have a column subscript that used column names, so too with rows:

classes["indexing_class", ]
               class_date students_present    topic used_handouts
indexing_class 2015-01-29               12 indexing         FALSE

Or both at once:

classes["indexing_class", c("class_date", "students_present")]
               class_date students_present
indexing_class 2015-01-29               12

Ubiquity

Data frames are everywhere when you work in R. Why? Consider our readings so far. Rosenberg studied the number of occurrences of the word “data” in each year’s worth of words in Google’s digital library. So he had a spreadsheet where each row had the year in question and the number of data occurrences. His figure 1.8 charts data that looks like this:

data.frame(decade_start=c(1700, 1710, 1720, ...),
           data_fraction=c(0.03, 0.02, 0.04, ...))

That’s probably more “cooked” than the actual data, which might be more like

data.frame(year=c(1700, 1701, 1702, ...),
           data_hits=c(3, 1, 8, ...),
           total_docs=c(100, 50, 200, ...))

(Unrealistic numbers.)

Or imagine that Moretti had examined each title by hand and tallied it like this:

data.frame(year=c(1749, 1813, ...),
           title=c("The History of Tom Jones...",
                   "Pride and Prejudice", ...),
           definite_article=c(T, F, ...),
           abstract_qualities=c(F, T, ...),
           indefinite_article=c(F, F, ...))

That is of course not what he did: the actual data had publication date, title, and author, and further computations derived the other qualities (like definite_article). The same goes for the yearly frequencies Moretti charts, which are found by aggregating those derived data over all the titles published in a given year. Notice, too, that no rule says the rows must be in chronological order; we’ll soon see how to sort data frames.

Data frames need not be time series at all, though in our historical studies they often will be. Consider Burrows’s tables:

data.frame(novel=factor(c("PP", "PP", "Emma", ...)),
           char=c("Elizabeth", "Darcy", "Mr. Knightley", ...),
           and=c(308, 128, 264, ...),
           the=c(360, 203, 237, ...))

Again, such a table had to be derived from the digitized and marked-up text of the novels by computer programs before it could be studied. Indeed, it might occur to you that even this very same information could be represented in a data frame in a different way:

data.frame(novel=factor(c("PP", "PP", "PP", ...)),
           char=factor(c("Eliz", "Eliz", "Dar", ...)),
           word=factor(c("and", "the", "and", ...)),
           count=c(308, 360, 128, ...))

Whereas in the former case, each row of the table represented a character, with each of their words tallied in a column, in the latter case each row represents a character-word combination, with many rows for each character (one for each word-type we tally).
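To make the two layouts concrete, here is a sketch that builds the first ("wide") form for just the two characters and two words whose counts appear above, then derives the second ("long") form from it. The names wide and long are my own labels, and this hand-rolled conversion is only one of several ways to do it.

```r
# One row per character, one column per word
wide <- data.frame(char=c("Elizabeth", "Darcy"),
                   and=c(308, 128),
                   the=c(360, 203),
                   stringsAsFactors=FALSE)

# One row per character-word combination
long <- data.frame(char=rep(wide$char, each=2),
                   word=rep(c("and", "the"), times=nrow(wide)),
                   # interleave the counts: and, the, and, the
                   count=c(rbind(wide$and, wide$the)),
                   stringsAsFactors=FALSE)
long

dim(wide)  # 2 3
dim(long)  # 4 3
```

Both data frames carry the same four counts; only the meaning of a row differs.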

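One further point. Explicitly specifying a data frame column as a factor is not actually necessary in versions of R before 4.0.0, which is what the output shown here reflects. There, by default, when you make a data frame column from a character vector, R first turns it into a factor. (From R 4.0.0 on, character columns stay character vectors unless you ask otherwise.) So I could have written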

data.frame(novel=c("PP", "PP", ...))

That’s a nice convenience…except when we don’t want a factor! Character vectors often aren’t categorical in literary data. Consider, for example, those novel titles. Most titles occur only once. We don’t want a factor with 6800 levels to encode the titles of 7000 novels. We just want the character vector. In that case, we use a special named parameter to data.frame:

data.frame(year=c(1749, 1813, ...),
           title=c("The History of Tom Jones...",
                   "Pride and Prejudice", ...),
           stringsAsFactors=F)

Now only the columns we explicitly make with factor(...) will be factors.
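The effect of the parameter can be checked by setting it explicitly both ways, which sidesteps whatever the default happens to be in your version of R. In this sketch, as_factor and as_chars are illustrative names, and the two titles are copied from the example above.

```r
titles <- c("The History of Tom Jones...", "Pride and Prejudice")

as_factor <- data.frame(year=c(1749, 1813),
                        title=titles,
                        stringsAsFactors=TRUE)
as_chars <- data.frame(year=c(1749, 1813),
                       title=titles,
                       stringsAsFactors=FALSE)

class(as_factor$title)  # "factor"
class(as_chars$title)   # "character"

# Numeric columns are unaffected either way
class(as_factor$year)   # "numeric"
```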

Terminological note: tabular things

Rectangular things, things with rows and columns—what to call them? When we see them on the printed page we call them tables. In R, a table is something else (something we’ve already encountered)—a special data type that holds tallies of how often things occur. (table is short for contingency table, for reasons we’ll make clear eventually.) So I have been trying to say tabular data when I mean data that comes in rows and columns. The R type for such data is the data frame.

There is one more tabular form you might be wondering about (especially if you read Jockers, chap. 4). That is the matrix. In R, a matrix is for rectangular data that is all of the same type. (Remember that the power of the data frame was that, like the list, it need not be homogeneous with respect to type.) Numerical matrices occur very frequently in statistical applications, and R has lots of powerful functions for working with matrices. Because we are dodging advanced mathematics in this course, we won’t have much to do with them. However, just for completeness, I’ll note that matrices can be subscripted with the same two-place indices that data frames can. You make matrices with the matrix function, which (this is a bit strange) takes a vector of all the matrix entries and parameters describing the matrix dimensions. In R it’s perfectly possible to have a character or logical matrix as well as a numerical one, and in fact we will eventually meet a few character matrices. (For utter completeness I note that R also supports arbitrary-dimensional arrays—think of Lévi-Strauss’s three-dimensional stack of cards—but I doubt we’ll be needing them.)
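For the curious, a minimal sketch of the matrix function and two-place subscripting, using made-up entries. Note that the vector of entries fills the matrix column by column.

```r
# Six entries arranged in 2 rows and 3 columns, filled by column
m <- matrix(1:6, nrow=2, ncol=3)
m
dim(m)    # 2 3

# The same [rows, cols] subscripts as for data frames
m[2, 3]   # 6
m[1, ]    # 1 3 5

# A character matrix works the same way
chars <- matrix(c("a", "b", "c", "d"), nrow=2)
chars[2, 1]  # "b"
```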

Exercises

The logic of the query

If you have loaded the course software with library("litdata"), you also have a variable, laureates, with some information about the Nobel literature laureates from nobelprize.org. Print it out in the console but be prepared for a mess. You will have to use colnames to figure out what information is stored here. Other useful exploratory functions are head and tail, which print out the first and last rows of a data frame (or the first and last elements of a vector).

Write down expressions that yield the following information by subscripting laureates.

  1. What is the surname of the unique laureate born in Portugal?

  2. What are the first names and surnames of the female laureates? (A single expression, yielding two columns of information.)

  3. What are the full names of all the laureates who are either women or born in Sweden? (A single expression. You will need logical operations as well as a string operation.)

  4. How many laureates died in a country other than the country of their birth? Derive also an expression for their names, countries of birth, and countries of death.

Sorting

Sorting tabular data is not as straightforward as sorting a vector. You have to decide what to sort by. R has an elegant way of doing this, the ordering permutation function order.

  1. Explain the relationship between a vector x and its ordering permutation order(x) by considering these three examples:

    x <- c("Morrison", "Gordimer", "Lessing")
    o_x <- order(x)
    o_x
    [1] 2 3 1
    x[o_x]
    [1] "Gordimer" "Lessing"  "Morrison"
    y <- c("Heaney", "Szymborska", "Walcott")
    order(y)
    [1] 1 2 3
    y[order(y)]
    [1] "Heaney"     "Szymborska" "Walcott"   
    order(c(0, 4, 2))
    [1] 1 3 2

    You can certainly look at the R help for order (?order).

  2. Now we are going to exploit our special data-frame subscripting syntax. So we don’t go crazy, let’s work on a smaller table:

    laur_small <- laureates[1:5, c("surname", "bornCountry", "gender", "year")]

    Write an expression to sort this table alphabetically by surname. It has the form laur_small[order(v), ] where v is the vector of names.

  3. Now sort it in reverse alphabetical order by country of birth. order takes a named logical parameter, decreasing, as in order(v, decreasing=T); the default is FALSE.

  4. Actually order is another variadic function. order(x, y) is an ordering permutation for x in which ties are broken by the ordering of corresponding elements of y. That’s an obscure way of saying something that we do all the time. Sort laur_small by gender and year of prize. You’ll need two $ expressions.

  5. Finally, just as we did “top words” for novels by subscripting our sorted frequency tables, so too we can get “top” hits of a big table by subscripting our ordering permutation before using it as a row subscript. Write an expression to show first names, surnames, birth dates, and prize years of the five most recently-born laureates. Thanks to the nice string representation of the dates of birth, order will work just fine on them.