This homework is mostly text for you to read, so it’s rather long for the web. I have made a printable PDF version as well: hw4.pdf.

Update the course software with `devtools::install_github("agoldst/litdata")`. Now, for the first time, you will also have to load this software explicitly:

``library("litdata")``

This is done for you in the homework 4 template (also in the package update). We’ll be using `stringr` again too, but when you load `litdata` you get `stringr` as well.

You do not have to read any of Jockers for this homework. If you have Teetor, you may find the suggested sections useful.

Now that we have spent some time with R’s simple data types (vectors of various kinds), we are ready to think about complex types. But we need to cover one more simple type first.

Factors

A factor is yet another type of vector, representing categorical data. Character vectors would work for storing categorical data:

``birth_cc <- c("FR", "DE", "NO", "FR", "PL", "IT", "UK", "DE", "DE")``

These are country codes for the countries of birth of the earliest Nobel literature laureates. Yet the character vector doesn’t contain any information about the fact that country codes are chosen from a finite set of possibilities: this list of 9 entries uses only 6 distinct codes. A factor adds just this information:

``````birth_country <- factor(birth_cc)
birth_country``````
``````[1] FR DE NO FR PL IT UK DE DE
Levels: DE FR IT NO PL UK``````

Notice the absence of quotation marks and the information about the “levels” of the factor. The levels are the categories from which our categorical value is drawn. To find the levels of a factor:

``levels(birth_country)``
``[1] "DE" "FR" "IT" "NO" "PL" "UK"``

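More formally, `factor(xs)` takes a vector `xs` and produces a factor of the same length as `xs` whose levels are the distinct values of `xs` in sorted order (that is, `sort(unique(xs))`) and whose elements correspond to those of `xs`. A factor is a vector and can be subscripted in all the usual ways by `[...]`. Notice that subscripting keeps all the original levels, even those that no longer occur in the result.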

``birth_country[birth_country != "FR"]``
``````[1] DE NO PL IT UK DE DE
Levels: DE FR IT NO PL UK``````

You rarely need to create a factor explicitly. But you will encounter them fairly frequently. In fact, you already have. Secretly, the `table` function requires a factor argument. On reflection, this makes sense: in order to tabulate how often each element of a vector occurs, we need to know that the elements of a vector are things that can occur more than once. This is precisely what a factor is for. If `xs` is a character vector, `table(xs)` is a shortcut for `table(factor(xs))`.
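
To convince yourself of the claim about `table`, here is a small check using the same country codes as above. This is only a sketch; the variable names `counts_chr` and `counts_fac` are mine, introduced for illustration.

```r
# Same country codes as above, repeated so this block runs on its own
birth_cc <- c("FR", "DE", "NO", "FR", "PL", "IT", "UK", "DE", "DE")

# Tabulating the character vector directly...
counts_chr <- table(birth_cc)
# ...gives the same tallies as tabulating the factor
counts_fac <- table(factor(birth_cc))

# Same counts, same labels
all(as.vector(counts_chr) == as.vector(counts_fac))
all(names(counts_chr) == names(counts_fac))

# The labels of the tally are exactly the levels of the factor
all(names(counts_chr) == levels(factor(birth_cc)))
```

All three comparisons should come out `TRUE`.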

Two extra notes on factors

The last thing to know at this stage is that factors can be a source of annoying and mysterious errors. Sometimes, you just want the character vector. (That `Levels:` display can be annoying when you have a lot of levels. Think about making a factor from every word of The Sheik, as in fact we did.) You can test whether a variable `x` is a character vector with the function `is.character(x)`. You can coerce a factor `fac` back into a plain character vector with `as.character(fac)`.

Supplementary information. You can do more with the `factor` function. You can explicitly set the levels with `factor(xs, levels=lvls)`, which allows you to include categories that don’t actually occur in `xs` (but might later). You can also identify the categories as ordered with `factor(xs, ordered=T)`, but the consequences of this we won’t see until later. It might occur to you that once you know the levels, you don’t actually have to store the factored values as character strings: you just need the integer indices into the levels vector. This is in fact how R normally stores factors internally, as you can see by trying out the function `as.numeric` on a factor.
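
If you want to see these supplementary points in action, here is a short sketch. The choice of `"SE"` as an extra level that does not occur in the data is my own, made up for illustration.

```r
birth_cc <- c("FR", "DE", "NO", "FR")

# "SE" is included as a possible level even though it does not occur
birth_country <- factor(birth_cc, levels = c("DE", "FR", "NO", "SE"))
levels(birth_country)

# table counts every level, including the one with no occurrences
table(birth_country)

# the internal codes are positions in the levels vector
codes <- as.numeric(birth_country)
codes

# looking the codes up in the levels recovers the original strings
levels(birth_country)[codes]
```

The codes here are 2, 1, 3, 2, because `"FR"` is the second level, `"DE"` the first, and `"NO"` the third.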

Lists

At last we can move on to compound types. In R, compound types are collections of vectors. Depending on what assumptions we can make about the vectors, we will use different types.

Introduction

If we make no assumptions, we have the list. A list is an ordered collection of values of any type. We made vectors with `c`; we make lists with `list`, separating elements by commas.

``````sonnet_words <- list("Music", "to", "hear", "why")
sonnet_words``````
``````[[1]]
[1] "Music"

[[2]]
[1] "to"

[[3]]
[1] "hear"

[[4]]
[1] "why"``````

This looks almost the same as the vector `c("Music", "to", "hear", "why")`, except for the extra stuff in the output. To get the length of a list, use `length`. All the different kinds of indexing we used with vectors are also possible with lists. But when you use `[...]` subscripting with lists, the result is a list:

``sonnet_words[2:3]``
``````[[1]]
[1] "to"

[[2]]
[1] "hear"``````

If you want to get out a single element in non-list form, use `[[...]]` instead:

``sonnet_words[[2]]``
``[1] "to"``
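
The difference between the two kinds of bracket is easy to check for yourself. In this sketch I use the base functions `is.list` and `is.character` to test what kind of thing each subscript gives back; the variable names are just for illustration.

```r
sonnet_words <- list("Music", "to", "hear", "why")

one_bracket <- sonnet_words[2]     # a list of length 1
two_brackets <- sonnet_words[[2]]  # the character vector "to"

is.list(one_bracket)         # TRUE
is.list(two_brackets)        # FALSE
is.character(two_brackets)   # TRUE
```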

Recall that when we were making vectors with `c`, we figured out that nesting `c` calls didn’t make a difference:

``c("Music", c("to", "hear"))``
``[1] "Music" "to"    "hear" ``
``c("Music", "to", "hear")``
``[1] "Music" "to"    "hear" ``

Lists, on the other hand, can capture a hierarchical structure. Any element of a list can be a vector of any length:

``````sonnet_words <- list(c("Music", "to", "hear"),
c("why", "playst", "thou", "music"))
sonnet_words``````
``````[[1]]
[1] "Music" "to"    "hear"

[[2]]
[1] "why"    "playst" "thou"   "music" ``````

This is a list of two elements, whose first element is a vector of three elements:

``sonnet_words[[1]]``
``[1] "Music" "to"    "hear" ``
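
To make the hierarchy concrete, here is a small sketch that asks for lengths at both levels and then reaches inside one of the vectors by chaining subscripts.

```r
sonnet_words <- list(c("Music", "to", "hear"),
                     c("why", "playst", "thou", "music"))

length(sonnet_words)        # 2: the number of list elements
length(sonnet_words[[1]])   # 3: the length of the first vector
length(sonnet_words[[2]])   # 4: the length of the second vector

# [[2]] pulls out the second vector; [3] then picks its third element
sonnet_words[[2]][3]        # "thou"
```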

Lists can be iterated over in a `for` loop. Pay close attention to what happens here:

``````for (words in sonnet_words) {
print(str_c(words, collapse=" "))
}``````
``````[1] "Music to hear"
[1] "why playst thou music"``````

Why are there two lines and not seven?

Lists may be heterogeneous with respect to type. This is a perfectly good list:

``stuff <- list("Gaskell", 3.14159, F)``

By contrast, recall that a vector must be a single type. If we try to make a vector, the non-character values will be coerced to character:

``c("Gaskell", 3.14159, F)``
``[1] "Gaskell" "3.14159" "FALSE"  ``

But I didn’t want the strings `"3.14159"` and `"FALSE"`, I wanted the number and the logical value. (Otherwise funny things happen; e.g., I get an error if I use logical operators like `&` with the string `"FALSE"`.) The different types are preserved in the list form:

``stuff[[3]]``
``[1] FALSE``
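
Here is a sketch that checks the types directly. It uses the base function `class`, which reports the type of a value; we may not have met it yet, so treat it as a preview.

```r
stuff <- list("Gaskell", 3.14159, F)

# each element keeps its own type inside the list
for (item in stuff) {
    print(class(item))
}

# so arithmetic and logical operations still work on the elements
stuff[[2]] * 2
stuff[[3]] & TRUE

# whereas the vector version is character through and through
as_vector <- c("Gaskell", 3.14159, F)
class(as_vector)
```

The loop prints `"character"`, `"numeric"`, and `"logical"` in turn, while the last line gives just `"character"`.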

Named elements

Lists, like vectors, can also have names to refer to their elements. `names(lst)` is a vector of names for the elements of `lst`, which can be manipulated just like `names(x)` for a vector `x`. But it’s even more common to create lists using named parameters, like this:

``````personae <- list(lear=c("Lear", "king of Britain"),
edgar=c("Edgar", "son of Gloucester"),
edmund=c("Edmund", "bastard", "son to Gloucester"))``````

`personae` is now a three-element list, but it can be conveniently accessed using names:

``personae[["edgar"]]``
``[1] "Edgar"             "son of Gloucester"``

This is so common that R has a special shortcut, `$`:

``personae$edgar``
``[1] "Edgar"             "son of Gloucester"``

You can add new elements onto a list using this shortcut as well as the methods familiar from vectors:

``personae$goneril <- c("Goneril", "daughter to Lear")``
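
A quick sketch of how names and the shortcut fit together, using a shortened two-character version of the list so the block runs on its own:

```r
personae <- list(lear=c("Lear", "king of Britain"),
                 edgar=c("Edgar", "son of Gloucester"))
names(personae)

# adding an element by name lengthens the list by one
personae$goneril <- c("Goneril", "daughter to Lear")
names(personae)
length(personae)

# the shortcut and the [[...]] form give the same result
identical(personae$goneril, personae[["goneril"]])
```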

Uses of lists

Now we can at last demystify the results of `strsplit` and `str_split`. These functions return lists. Why? Because they accept a character vector, and they return a list of vectors, one vector of split-up words per element of the input vector.

``````sonnet_lines <- c("Were't aught to me I bore the canopy",
"With my extern the outward honoring")
str_split(sonnet_lines,
"\\W")``````
``````[[1]]
[1] "Were"   "t"      "aught"  "to"     "me"     "I"      "bore"   "the"
[9] "canopy"

[[2]]
[1] "With"     "my"       "extern"   "the"      "outward"  "honoring"``````

The results here are called a ragged list, because the number of elements of each list element is different. `str_split` returns a ragged list.

`unlist(lst)` returns a vector with all the elements of `lst` one after another (“flattened”). But remember: if `lst` is heterogeneous, the results of `unlist(lst)` will be forced to be all of one type (probably character vector). You lose the hierarchical structure of the list. When we were splitting up the words of The Sheik, we didn’t care: the initial vector of lines of the text file represented a pretty arbitrary division of the text anyway. But we often will care.
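
Both effects of `unlist` can be seen in a small sketch. The lists here are the toy examples from earlier in this homework, retyped so the block stands alone.

```r
split_lines <- list(c("Music", "to", "hear"),
                    c("why", "playst", "thou", "music"))

# flattening: one vector of seven words, the two-line structure gone
flat <- unlist(split_lines)
flat
length(flat)

# a heterogeneous list is coerced to a single type, here character
stuff <- list("Gaskell", 3.14159, FALSE)
flat_stuff <- unlist(stuff)
flat_stuff
```

The last result is the character vector `"Gaskell" "3.14159" "FALSE"`, just as when we tried to build the vector with `c`.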

Exercise

If you loaded the course package with `library("litdata")`, you now have a variable called `sheik_ll`, with the lines of the Gutenberg The Sheik. This is just what you worked with on homework 2, but in order to skip the file-loading part, which caused some glitches last time around, I’ve provided the data for you in my `litdata` package. We’ll have plenty more practice with file loading soon enough.

The following lines, which use `str_detect`, a function we haven’t yet studied, derive a character vector with one element for each chapter of The Sheik.

``````# strip off all but the body text
body_ll <- sheik_ll[match("CHAPTER I", sheik_ll):
(match("THE END", sheik_ll) - 1)]
# str_detect gives a logical vector telling us which lines
# match the pattern; which tells us the indices
chap_indices <- which(str_detect(body_ll, "^CHAPTER [IVX]"))

body_lc <- tolower(body_ll)

# initialize results vector
sheik_chaps <- character(length(chap_indices))

# tricky edge condition: contrive the last "chapter start"
# to be one past the last line
chap_indices <- c(chap_indices, length(body_ll) + 1)

for (j in seq_along(chap_indices[-length(chap_indices)])) {
sheik_chaps[j] <- str_c(body_lc[chap_indices[j]:
(chap_indices[j + 1] - 1)],
collapse=" ")
}``````

If you can guess what `str_detect` does, you can figure this code out using what you know, but it is fiddly. We will come back to it soon. For now, it suffices to say that if you run this code, `sheik_chaps` holds each chapter in an element of the vector.

Now we split this up using our friend `str_split`:

``sheik_chaps_words <- str_split(sheik_chaps, "\\W+")``

`sheik_chaps_words` is a list.

Now write a `for` loop that counts the fraction of times “Arab” and “Diana” occur in each chapter, storing the results in two vectors. You will have to use `seq_along` in the loop condition and `[[...]]` indexing in the loop body. The fraction is just the number of times a word occurs in a chapter divided by the total number of words in that chapter. Division in R is notated with the forward slash, `/`.

``````arab <- numeric()
diana <- numeric()``````

These lines will print out your results as occurrences per 10000 words:

For `arab`:

``round(arab * 10000)``

For `diana`:

``round(diana * 10000)``

I obtain the following rates per 10000 for `arab`:

`` [1]  0 13 19  9 10 15  5 15 14  3``

and for `diana`:

`` [1] 33 52 51 35 28 26 34 48 18 22``

(This is a modified version of some of what Jockers does in chapter 4.)

Data Frames

The next complex type to consider is the most important of all: the data frame. A data frame is a list, plus a single further assumption: all the elements are vectors of the same length, though not necessarily of the same type. The data frame represents tabular data—data that comes in rows and columns. Think of a spreadsheet as the paradigm for a data frame.

Here is a miniature data frame:

``````classes <- data.frame(class_date=c("2015-01-22", "2015-01-29", "2015-02-05"),
students_present=c(12, 12, 12),
topic=c("intro", "indexing", "loops"),
used_handouts=c(T, F, F)
)
classes``````
``````  class_date students_present    topic used_handouts
1 2015-01-22               12    intro          TRUE
2 2015-01-29               12 indexing         FALSE
3 2015-02-05               12    loops         FALSE``````

This is a little table with four columns of different types. The key idea here is that the rows of the table all refer to the same entity—one of our class meetings. The first row gives the date of our first meeting, the number of students present, and so on. Each column is a parallel series of simple data about the classes.

I can access the columns using my list accessor `$`:

``classes$students_present``
``[1] 12 12 12``

And indeed everything I can do to a list I can do to a data frame. Create a data frame with the following syntax:

``````data.frame(name1=col1,
name2=col2,
...)``````

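The `col1, col2...` vectors should be of the same length. If they are not, the shorter ones will be recycled (this is sometimes useful), but only if the longest length is a whole multiple of theirs; otherwise R signals an error.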
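
Here is a sketch of recycling in `data.frame`. Recycling works when the shorter length divides the longer one evenly; otherwise R refuses. The data frame `recycled` and the logical `failed` are names I made up for illustration, and `try` is a base function that catches the error so the rest of the code can carry on.

```r
# a single value is recycled down the whole column
recycled <- data.frame(class_date=c("2015-01-22", "2015-01-29", "2015-02-05"),
                       students_present=12)
recycled$students_present
nrow(recycled)

# lengths 3 and 2 do not fit together, so this is an error
attempt <- try(data.frame(a=1:3, b=1:2), silent=TRUE)
failed <- inherits(attempt, "try-error")
failed
```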

Whereas lists and vectors have a single dimension, found with `length`, data frames have two dimensions: the number of rows and the number of columns. These are obtained with the `nrow` and `ncol` functions, or both at once in a vector with the `dim` function:

``dim(classes)``
``[1] 3 4``
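
For comparison, here is a sketch with all three functions, plus `length`. Since a data frame is a list of columns, `length` gives the number of columns.

```r
classes <- data.frame(class_date=c("2015-01-22", "2015-01-29", "2015-02-05"),
                      students_present=c(12, 12, 12),
                      topic=c("intro", "indexing", "loops"),
                      used_handouts=c(T, F, F))

nrow(classes)     # 3
ncol(classes)     # 4
dim(classes)      # 3 4
length(classes)   # 4, the same as ncol(classes)
```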

Subscripting

But I can do more. Data frames can be subscripted using a two-place subscript `[rows, cols]`. I can pick out one element of my table:

``classes[1, 2]``
``[1] 12``
``classes[1, 4]``
``[1] TRUE``

I can use names:

``classes[2, "topic"]``
``````[1] indexing
Levels: indexing intro loops``````

Even more, I can leave the row or the column blank (but keep the comma):

``classes[2, ]``
``````  class_date students_present    topic used_handouts
2 2015-01-29               12 indexing         FALSE``````
``classes[, "used_handouts"]``
``[1]  TRUE FALSE FALSE``

Unlike `[...]` for lists, a two-place subscript on a data frame does not always give back a data frame. If it picks out a single column, we get a vector; if it picks out more than one column, we get a data frame, even when there is only one row:

``classes[2:3, ]``
``````  class_date students_present    topic used_handouts
2 2015-01-29               12 indexing         FALSE
3 2015-02-05               12    loops         FALSE``````
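
You can check which kind of result you have with `is.data.frame`. A sketch, with the `classes` table retyped so the block runs on its own:

```r
classes <- data.frame(class_date=c("2015-01-22", "2015-01-29", "2015-02-05"),
                      students_present=c(12, 12, 12),
                      topic=c("intro", "indexing", "loops"),
                      used_handouts=c(T, F, F))

# one column: a plain vector
one_col <- classes[, "used_handouts"]
is.data.frame(one_col)    # FALSE
is.logical(one_col)       # TRUE

# two columns: still a data frame
two_cols <- classes[, c("students_present", "used_handouts")]
is.data.frame(two_cols)   # TRUE

# one row across all columns: also a data frame
one_row <- classes[2, ]
is.data.frame(one_row)    # TRUE
```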

The row and column specifications can even be logical vectors:

``classes[c(T, T, F), ]``
``````  class_date students_present    topic used_handouts
1 2015-01-22               12    intro          TRUE
2 2015-01-29               12 indexing         FALSE``````

which tends to be most useful when we form a logical expression that picks out table rows. For example, if I want all the data about the class that was about indexing, I write:

``classes[classes$topic == "indexing", ]``
``````  class_date students_present    topic used_handouts
2 2015-01-29               12 indexing         FALSE``````

Reflect carefully on why this works. This is an expression which is evaluated like any other.

``classes$topic``
``````[1] intro    indexing loops
Levels: indexing intro loops``````
``classes$topic == "indexing"``
``[1] FALSE  TRUE FALSE``
``````logicals <- classes$topic == "indexing"
classes[logicals, ]``````
``````  class_date students_present    topic used_handouts
2 2015-01-29               12 indexing         FALSE``````

Similar operations are possible over columns, but this is less commonly used for data frames.
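
For instance, here is a sketch of a logical expression over columns. It drops the `topic` column by comparing the column names to the one we don’t want; the variables `keep` and `no_topic` are made up for this example.

```r
classes <- data.frame(class_date=c("2015-01-22", "2015-01-29", "2015-02-05"),
                      students_present=c(12, 12, 12),
                      topic=c("intro", "indexing", "loops"),
                      used_handouts=c(T, F, F))

# a logical vector with one entry per column
keep <- colnames(classes) != "topic"
keep

# used in the column position of the subscript
no_topic <- classes[, keep]
colnames(no_topic)
```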

Row names

Data frames can have one further way to refer to their components. So far, we’ve seen that columns are named (and accessible with `$` or `[, cols]`). Rows, too, can be named. This is optional. Just as we set the names of a vector with `names(x) <- nms`, we can set the row names of a data frame with an assignment to its `rownames`:

``````rownames(classes) <- c("intro_class", "indexing_class", "control_class")
classes``````
``````               class_date students_present    topic used_handouts
intro_class    2015-01-22               12    intro          TRUE
indexing_class 2015-01-29               12 indexing         FALSE
control_class  2015-02-05               12    loops         FALSE``````

(You can also get the names of the columns with `colnames(classes)` or just `names(classes)`.) In this case, the row names are kind of redundant with the `topic` column, and it usually makes more sense to store information in a proper column than in row names (in fact, as we shall see, R idol Hadley Wickham would describe a data frame with information in row names as “untidy”). Still, this completes the suite of subscripting possibilities; just as we could have a column subscript that used column names, so too with rows:

``classes["indexing_class", ]``
``````               class_date students_present    topic used_handouts
indexing_class 2015-01-29               12 indexing         FALSE``````

Or both at once:

``classes["indexing_class", c("class_date", "students_present")]``
``````               class_date students_present
indexing_class 2015-01-29               12``````

Ubiquity

Data frames are everywhere when you work in R. Why? Consider our readings so far. Rosenberg studied the number of occurrences of the word “data” in each year’s worth of words in Google’s digital library. So he had a spreadsheet where each row had the year in question and the number of data occurrences. His figure 1.8 charts data that looks like this:

``````data.frame(decade_start=c(1700, 1710, 1720, ...),
data_fraction=c(0.03, 0.02, 0.04, ...))``````

That’s probably more “cooked” than the actual data, which might be more like

``````data.frame(year=c(1700, 1701, 1702, ...),
data_hits=c(3, 1, 8, ...),
total_docs=c(100, 50, 200, ...))``````

(Unrealistic numbers.)

Or imagine that Moretti had examined each title by hand and tallied it like this:

``````data.frame(year=c(1749, 1813, ...),
title=c("The History of Tom Jones...",
"Pride and Prejudice", ...),
definite_article=c(T, F, ...),
abstract_qualities=c(F, T, ...),
indefinite_article=c(F, F, ...))``````

Notice that there’s no rule that says that the rows are in chronological order. We’ll soon see how to sort data frames. That is of course not what he did: the actual data had publication date, title, and author. Further computations derived the other qualities (like `definite_article`). That goes also for the yearly frequencies Moretti charts, which are found by aggregating those derived data over all the titles published in a given year.

Data frames need not be time series at all, though in our historical studies they often will be. Consider Burrows’s tables:

``````data.frame(novel=factor(c("PP", "PP", "Emma", ...)),
char=c("Elizabeth", "Darcy", "Mr. Knightley", ...),
and=c(308, 128, 264, ...),
the=c(360, 203, 237, ...))``````

Again, such a table had to be derived from the digitized and marked-up text of the novels by computer programs before it could be studied. Indeed, it might occur to you that even this very same information could be represented in a data frame in a different way:

``````data.frame(novel=factor(c("PP", "PP", "PP", ...)),
char=factor(c("Eliz", "Eliz", "Dar", ...)),
word=factor(c("and", "the", "and", ...)),
count=c(308, 360, 128, ...))``````

Whereas in the former case, each row of the table represented a character, with each of their words tallied in a column, in the latter case each row represents a character-word combination, with many rows for each character (one for each word-type we tally).
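
To see that the two layouts really hold the same information, here is a runnable sketch using only the first two characters and the counts quoted above. The names `wide` and `long` are my own labels for the two layouts, not anything from Burrows.

```r
# one row per character, one column per word
wide <- data.frame(char=c("Elizabeth", "Darcy"),
                   and=c(308, 128),
                   the=c(360, 203))

# one row per character-word combination
long <- data.frame(char=c("Elizabeth", "Elizabeth", "Darcy", "Darcy"),
                   word=c("and", "the", "and", "the"),
                   count=c(308, 360, 128, 203))

# the same question, asked of each layout: how often does Darcy say "the"?
wide[wide$char == "Darcy", "the"]
long[long$char == "Darcy" & long$word == "the", "count"]
```

Both expressions give 203. In the first layout the word is chosen by the column subscript; in the second it becomes part of the logical expression that picks out rows.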

One further point. Explicitly specifying a data frame column as a factor is not actually necessary. In versions of R before 4.0.0, when you make a data frame column from a character vector, R by default first turns it into a factor (since R 4.0.0 the default is to leave character vectors as they are). So I could have written

``data.frame(novel=c("PP", "PP", ...))``

That’s a nice convenience…except when we don’t want a factor! Character vectors often aren’t categorical in literary data. Consider, for example, those novel titles. Most titles occur only once. We don’t want a factor with 6800 levels to encode the titles of 7000 novels. We just want the character vector. In that case, we use a special named parameter to `data.frame`:

``````data.frame(year=c(1749, 1813, ...),
title=c("The History of Tom Jones...",
"Pride and Prejudice", ...),
stringsAsFactors=F)``````

Now only columns we explicitly make a `factor(...)` will be a factor.
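
Here is a sketch that sets the parameter both ways, so the result doesn’t depend on what your version of R does by default. The variable names `as_factor` and `as_chars` are mine.

```r
titles <- c("The History of Tom Jones...", "Pride and Prejudice")
years <- c(1749, 1813)

# character column converted to a factor
as_factor <- data.frame(year=years, title=titles, stringsAsFactors=T)
is.factor(as_factor$title)      # TRUE

# character column left alone
as_chars <- data.frame(year=years, title=titles, stringsAsFactors=F)
is.character(as_chars$title)    # TRUE

# numeric columns are unaffected either way
is.numeric(as_factor$year)      # TRUE
```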

Terminological note: tabular things

Rectangular things, things with rows and columns—what to call them? When we see them on the printed page we call them tables. In R, a `table` is something else (something we’ve already encountered)—a special data type that holds tallies of how often things occur. (`table` is short for contingency table, for reasons we’ll make clear eventually.) So I have been trying to say tabular data when I mean data that comes in rows and columns. The R type for such data is the data frame.

There is one more tabular form you might be wondering about (especially if you read Jockers, chap. 4). That is the matrix. In R, a matrix is for rectangular data that is all of the same type. (Remember that the power of the data frame was that, like the list, it need not be homogeneous with respect to type.) Numerical matrices occur very frequently in statistical applications, and R has lots of powerful functions for working with matrices. Because we are dodging advanced mathematics in this course, we won’t have much to do with them. However, just for completeness, I’ll note that matrices can be subscripted with the same two-place indices that data frames can. You make matrices with the `matrix` function, which (this is a bit strange) takes a vector of all the matrix entries and parameters describing the matrix dimensions. In R it’s perfectly possible to have a character or logical matrix as well as a numerical one, and in fact we will eventually meet a few character matrices. (For utter completeness I note that R also supports arbitrary-dimensional arrays—think of Lévi-Strauss’s three-dimensional stack of cards—but I doubt we’ll be needing them.)
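
Just so you have seen one, here is a minimal sketch of a matrix. The entries are arbitrary numbers of my choosing. Notice that `matrix` fills in the entries column by column.

```r
# six entries arranged in 2 rows and 3 columns
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2, ncol=3)
m

dim(m)     # 2 3
m[2, 3]    # 6: second row, third column
m[1, 2]    # 3: the entries run down the columns first
m[, 1]     # 1 2: the whole first column
```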

Exercises

The logic of the query

If you have loaded the course software with `library("litdata")`, you also have a variable, `laureates`, with some information about the Nobel literature laureates from nobelprize.org. Print it out in the console but be prepared for mess. You will have to use `colnames` to figure out what information is stored here. Other useful exploratory functions are `head` and `tail`, which print out the first and last rows of a data frame (or the first and last elements of a vector).

Write down expressions that yield the following information by subscripting `laureates`.

1. What is the surname of the unique laureate born in Portugal?

2. What are the first names and surnames of the female laureates? (A single expression, yielding two columns of information.)

3. What are the full names of all the laureates who are either women or born in Sweden? (A single expression. You will need logical operations as well as a string operation.)

4. How many laureates died in a country other than the country of their birth? Derive also an expression for their names, countries of birth, and countries of death.

Sorting

Sorting tabular data is not as straightforward as sorting a vector. You have to decide what to sort by. R has an elegant way of doing this, the ordering permutation function `order`.

1. Explain the relationship between a vector `x` and its ordering permutation `order(x)` by considering these three examples:

``````x <- c("Morrison", "Gordimer", "Lessing")
o_x <- order(x)
o_x``````
``[1] 2 3 1``
``x[o_x]``
``[1] "Gordimer" "Lessing"  "Morrison"``
``````y <- c("Heaney", "Szymborska", "Walcott")
order(y)``````
``[1] 1 2 3``
``y[order(y)]``
``[1] "Heaney"     "Szymborska" "Walcott"   ``
``order(c(0, 4, 2))``
``[1] 1 3 2``

You can certainly look at the R help for order (`?order`).

2. Now we are going to exploit our special data-frame subscripting syntax. So we don’t go crazy, let’s work on a smaller table:

``laur_small <- laureates[1:5, c("surname", "bornCountry", "gender", "year")]``

Write an expression to sort this table alphabetically by surname. It has the form `laur_small[order(v), ]` where `v` is the vector of names.

3. Now sort it in reverse alphabetical order by country of birth. `order` takes a named logical parameter, `decreasing`, whose default is `FALSE`; to reverse the order, write `order(v, decreasing=T)`.

4. Actually `order` is another variadic function. `order(x, y)` is an ordering permutation for `x` in which ties are broken by the ordering of corresponding elements of `y`. That’s an obscure way of saying something that we do all the time. Sort `laur_small` by gender and year of prize. You’ll need two `$` expressions.

5. Finally, just as we did “top words” for novels by subscripting our sorted frequency tables, so too we can get “top” hits of a big table by subscripting our ordering permutation before using it as a row subscript. Write an expression to show first names, surnames, birth dates, and prize years of the five most recently-born laureates. Thanks to the nice string representation of the dates of birth, `order` will work just fine on them.