I have made a printable PDF version of this homework as well: hw7.pdf.

To produce a little breathing room, this homework will be due March 30. The next homework will be due a week after that, then we’ll try to get back on schedule.

For this homework, use the Homework template. You will also need to include a code chunk at the top of your R markdown that loads the libraries you need: litdata, dplyr, and tidyr.

```{r include=F, cache=F}
library("litdata")
library("dplyr")
library("tidyr")
```

If your chunk begins {r include=F} instead of {r}, the code will execute but will not be printed in your PDF. For technical reasons that I will explain only if you ask, chunks with library calls should also set cache=F.

This homework is mostly reading and code for you to try out. Each section heading says whether there’s an exercise or not. There is an optional exercise on higher-order functions in the next-to-last section. Please don’t include all of my explanatory text in your own homework; just include the code necessary to complete the exercises, plus any commentary you care to make on that code.

I’d like all of you to concentrate on the position paper. Don’t spend more than two hours on this homework before the paper deadline. I also am concerned about the long working times that have been reported for some of these homeworks. I don’t expect you to work to exhaustion on these exercises. Many things in graduate school are exhausting, but it is wise to save your energy for the tasks where an all-out effort will be worth it (which might include, to speak only of your scholarly work, final papers and projects in courses, a first journal publication, and your dissertation committee meetings). Make a sincere effort on each part of the homework, talk to one another and help each other when you can, and then, where you are stuck, make careful notes about what isn’t working or doesn’t make sense, and stop.

Ply the trade (exercises)

This exercise is more practice with dplyr. A crowded “data wrangling cheat sheet” that might be useful as a reference can be downloaded here: http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf. Try to ignore everything we haven’t discussed. [Edited 3/15/15:] I’ve also put some extra notes on the pipe operator on the website, which you can consult to supplement the somewhat compressed presentation of %>% in class. I emphasize that the questions below are to be answered using only the functions introduced in class and anything I explicitly note here on the homework.

Let’s continue to work with Three Percent’s spreadsheet data about literary translations in the U.S., which I’ve put on Sakai. Save the file in the same folder as your homework R markdown. As in class, let’s drop the 2015 items, since it’s only March.

# using a *relative path* here
txl <- read.csv("three-percent.csv", as.is=T, encoding="UTF-8")
txl <- txl %>% filter(Year < 2015) %>% rename(Language=Lanuage)

(rename is a convenient dplyr function. rename(x=y) is equivalent to mutate(x=y) %>% select(-y).)
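To make that equivalence concrete, here is a small sketch on a made-up data frame (the frame toy and its contents are invented for illustration; they are not the Three Percent data). The one visible difference is column order: rename leaves the column where it was, while the mutate-then-select version moves it to the end.

```r
library("dplyr")

# a made-up frame with a misspelled column name (hypothetical data)
toy <- data.frame(Lanuage=c("French", "German"),
                  Year=c(2012, 2013),
                  stringsAsFactors=FALSE)

renamed <- toy %>% rename(Language=Lanuage)
roundabout <- toy %>% mutate(Language=Lanuage) %>% select(-Lanuage)

colnames(renamed)      # "Language" "Year"
colnames(roundabout)   # "Year" "Language"

# the contents of the renamed column are the same either way
all(renamed$Language == roundabout$Language)
```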

Use dplyr expression pipelines to answer the following questions about this data set:

  1. Within the two catalogued genres, poetry and fiction, which are the three most prolific publishers of translations? When you use top_n, you also have to arrange if you want the chosen rows in sorted order.

  2. How many publishers are publishing translations in each year?

  3. Which five translators are most prolific? Create a translator_fullname column and group by that. Discard names that aren’t the names of individuals.

  4. Restricting from the last one to translators who have published more than once in this data set, what percentage have published with multiple presses?

Here are the answers I get:

Source: local data frame [6 x 3]
Groups: Genre

    Genre       Publisher count
1 Fiction  Dalkey Archive   181
2 Fiction  AmazonCrossing   121
3 Fiction Europa Editions    93
4  Poetry    Zephyr Press    28
5  Poetry      White Pine    23
6  Poetry   Ugly Duckling    20

Source: local data frame [7 x 2]

  Year n_publishers
1 2008          143
2 2009          143
3 2010          136
4 2011          146
5 2012          161
6 2013          195
7 2014          194

Source: local data frame [5 x 2]

   translator_fullname count
1       Curtis, Howard    35
2     Anderson, Alison    25
3         Bell, Anthea    25
4 Costa, Margaret Jull    24
5      Shugaar, Antony    24

Source: local data frame [1 x 1]

  percent_multi_pub
1             68.75

[Edited, 4/20/16. The denominator for the last one was slightly off in the original version. Works with multiple translators are coded with translator names "various" and "Various".]

“Relevance” (no exercises)

We are still not done counting words (will we ever be?). Just to be goofy about it, let’s practice dplyr operations for this purpose. This section does not have any exercises for you to do. If I’d had time I would have gone through this in class. If you like, before each code chunk, stop and ask yourself how you think you might do it before reading on.

Let’s add a few other texts to our old friend The Sheik. We’ll work with hand-trimmed text files of Hull’s novel and three others.

These files are on Sakai; unzip the archive into the same folder as your homework R markdown. You should get a folder called e20c-novels with four text files in it. Now we can reuse our work from last time, the featurize function:

featurize <- function (ll) {
    result <- unlist(strsplit(ll, "\\W+"))
    result <- result[result != ""]
    tolower(result)
}
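As a reminder of what featurize does, here is a quick sketch on two made-up lines of text (the input is invented; the function is copied from above). It also shows why the filtering step is there: a line that begins with punctuation would otherwise contribute an empty string.

```r
featurize <- function (ll) {
    result <- unlist(strsplit(ll, "\\W+"))
    result <- result[result != ""]
    tolower(result)
}

# two made-up "lines" of a text file (hypothetical input)
ll <- c("\"Hello,\" she said.", "The desert was silent.")

# splitting the first line alone: the leading quotation mark
# produces an empty string as the first piece
raw_pieces <- strsplit(ll[1], "\\W+")[[1]]
raw_pieces

# featurize flattens both lines, drops the empty string, and lowercases
words <- featurize(ll)
words
```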

The next step uses an idiom you haven’t seen yet. rbind(x, y, z, ...) stacks data frames x, y, z with the same columns on top of one another. do.call(rbind, lst) stacks a list of data frames lst into a single data frame. (do.call is another functional, similar to bind from class or %>%. do.call(f, lst) is equivalent to f(lst[[1]], lst[[2]], ...).) Don’t worry too much if the details of the following are vague:

# create vector of relative paths
fs <- file.path("e20c-novels", list.files("e20c-novels"))
frms <- list()
for (j in seq_along(fs)) {
    ll <- readLines(fs[j], encoding="UTF-8")
    # derive the feature vector:
    words <- featurize(ll)
    # Now create a data frame. stringsAsFactors=F ensures our feature
    # vector remains a character vector rather than a factor.
    # The title column will just have one value, the filename,
    # repeated over and over again.
    frms[[j]] <- data.frame(title=basename(fs[j]),
                            feature=words,
                            stringsAsFactors=F)
}
novels <- do.call(rbind, frms)
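If the do.call(rbind, ...) step is still vague, it may help to try it on a small scale. The sketch below uses three tiny made-up data frames (invented for illustration, with the same two columns as above) and checks that do.call gives the same result as writing out the rbind call by hand.

```r
# three tiny made-up frames with the same columns (hypothetical data)
frms_toy <- list(
    data.frame(title="a.txt", feature=c("the", "cat"),
               stringsAsFactors=FALSE),
    data.frame(title="b.txt", feature="dog",
               stringsAsFactors=FALSE),
    data.frame(title="c.txt", feature=c("a", "bird"),
               stringsAsFactors=FALSE)
)

# stack the whole list at once
stacked <- do.call(rbind, frms_toy)

# the same thing, spelled out element by element
by_hand <- rbind(frms_toy[[1]], frms_toy[[2]], frms_toy[[3]])

nrow(stacked)                  # one row per word across all three frames
all.equal(stacked, by_hand)
```

The advantage of do.call is that it works no matter how many frames are in the list, so the loop above does not need to know in advance how many files there are.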

novels is a very long data frame, with one row for each word of these four texts:

nrow(novels)
[1] 313812

The subtotals for each novel:

novels %>% group_by(title) %>%
    summarize(total_words=n())
Source: local data frame [4 x 2]

                title total_words
1     blue-lagoon.txt       63752
2           sheik.txt       88974
3     three-weeks.txt       53823
4 way-of-an-eagle.txt      107263

tf

Now we summarize this data frame into a data frame of word counts (or term frequencies) for each novel.

novel_counts <- novels %>%
    group_by(title, feature) %>%
    summarize(term_freq=n())

The most frequent words in each novel are hardly surprising, but something funny happens when we try to print them out:

novel_counts %>% group_by(title) %>%
    top_n(10, term_freq)
Source: local data frame [40 x 3]
Groups: title

             title feature term_freq
1  blue-lagoon.txt       a      1842
2  blue-lagoon.txt     and      2290
3  blue-lagoon.txt     had       743
4  blue-lagoon.txt      he      1134
5  blue-lagoon.txt      in      1077
6  blue-lagoon.txt      it       981
7  blue-lagoon.txt      of      1924
8  blue-lagoon.txt     the      5342
9  blue-lagoon.txt      to      1268
10 blue-lagoon.txt     was      1019
..             ...     ...       ...

Why hasn’t R given us all the rows? This is, as programmers say, a feature, not a bug. dplyr changes the way data frames are printed so that if you print a long data frame, it only prints the first few rows (and if there are lots of columns, it only gives you the first few). That’s very useful when you’re exploring data in the console, but when you’re making a report, you want to be able to print everything. I’m still postponing the moment when you have to learn how to produce nicely-formatted tables using xtable; for now, here’s how to override that dplyr feature. We have to add an explicit function invocation at the end of our pipelines to the base R data frame print function, which is called print.data.frame:

novel_counts %>% group_by(title) %>%
    top_n(10, term_freq) %>% # gets top 10 per group, but doesn't sort
    arrange(desc(term_freq)) %>% # sorts within groups
    print.data.frame()
                 title feature term_freq
1      blue-lagoon.txt     the      5342
2      blue-lagoon.txt     and      2290
3      blue-lagoon.txt      of      1924
4      blue-lagoon.txt       a      1842
5      blue-lagoon.txt      to      1268
6      blue-lagoon.txt      he      1134
7      blue-lagoon.txt      in      1077
8      blue-lagoon.txt     was      1019
9      blue-lagoon.txt      it       981
10     blue-lagoon.txt     had       743
11           sheik.txt     the      4976
12           sheik.txt     her      2805
13           sheik.txt     and      2735
14           sheik.txt     she      2451
15           sheik.txt      to      2214
16           sheik.txt      of      2143
17           sheik.txt       a      1860
18           sheik.txt     had      1675
19           sheik.txt     was      1603
20           sheik.txt      he      1517
21     three-weeks.txt     the      2499
22     three-weeks.txt     and      2322
23     three-weeks.txt      of      1364
24     three-weeks.txt      to      1353
25     three-weeks.txt      he      1216
26     three-weeks.txt       a      1058
27     three-weeks.txt     his       973
28     three-weeks.txt     was       917
29     three-weeks.txt      in       874
30     three-weeks.txt     her       784
31 way-of-an-eagle.txt     the      3794
32 way-of-an-eagle.txt      to      3064
33 way-of-an-eagle.txt     her      2813
34 way-of-an-eagle.txt     she      2767
35 way-of-an-eagle.txt     and      2446
36 way-of-an-eagle.txt       a      2344
37 way-of-an-eagle.txt      he      2334
38 way-of-an-eagle.txt      of      2058
39 way-of-an-eagle.txt     you      1950
40 way-of-an-eagle.txt       i      1818

idf

Document frequencies (the document frequency is the number of titles in the corpus that a term occurs in) involve a new complication. novel_counts is still grouped by title (look at the assignment expression above) but now we want to group it by feature and count titles. If you apply group_by to a data frame that already has a grouping, dplyr replaces the previous grouping. (If you want multiple levels of grouping, you write group_by(col1, col2).)

novel_counts %>%
    group_by(feature) %>%
    summarize(doc_freq=n())

But this isn’t quite what we want, because the summarize operation collapses away the titles. What we really want is to add on a document frequency column to our novel_counts data frame. Fortunately, this is as simple as changing summarize to mutate (pause for a minute to think about why this works):

novel_counts <- novel_counts %>%
    group_by(feature) %>%
    mutate(doc_freq=n())
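Here is the summarize/mutate contrast on a small scale, as a sketch with made-up counts (the frame counts and its values are invented for illustration). summarize returns one row per group; mutate keeps every row and copies the group's result onto each of them.

```r
library("dplyr")

# made-up term counts for two "novels" (hypothetical data)
counts <- data.frame(title=c("x.txt", "x.txt", "y.txt"),
                     feature=c("desert", "reef", "desert"),
                     term_freq=c(5, 2, 1),
                     stringsAsFactors=FALSE)

# summarize: one row per feature; the titles are gone
collapsed <- counts %>%
    group_by(feature) %>%
    summarize(doc_freq=n())

# mutate: still three rows, each carrying its feature's count
kept <- counts %>%
    group_by(feature) %>%
    mutate(doc_freq=n())

nrow(collapsed)   # 2
kept$doc_freq     # 2 1 2
```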

tf*idf

Now recall the tf*idf formula from Ramsay. I’ll write it as a function:

tf_idf <- function (term_freq, doc_freq, num_docs) {
    1 + term_freq * log(num_docs / doc_freq)
}

Since the arithmetic operators work on vectors, and so does log, we can get all the tf*idf scores at once:

num_docs <- length(fs) # fs: the filenames vector; a long way to say 4
novel_counts <- novel_counts %>%
    mutate(score=tf_idf(term_freq, doc_freq, num_docs))
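To see the vectorized arithmetic at work outside a pipeline, here is a sketch that calls tf_idf directly on three term-frequency/document-frequency pairs. The first two pairs are copied from the tables in this section ("dick" in blue-lagoon.txt and "desert" in sheik.txt); the third is a made-up word that occurs 100 times and appears in all four novels.

```r
tf_idf <- function (term_freq, doc_freq, num_docs) {
    1 + term_freq * log(num_docs / doc_freq)
}

term_freq <- c(314, 97, 100)   # third value is hypothetical
doc_freq <- c(1, 3, 4)
scores <- tf_idf(term_freq, doc_freq, 4)

round(scores, 3)   # 436.296  28.905   1.000
```

Note the last score: a word found in every document gets log(4 / 4), which is 0, so its score is 1 however often it occurs.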

Here are the most “relevant” words for each novel:

novel_counts %>%
    group_by(title) %>%
    top_n(10, score) %>%
    arrange(desc(score)) %>%
    print.data.frame()
                 title   feature term_freq doc_freq     score
1      blue-lagoon.txt      dick       314        1 436.29643
2      blue-lagoon.txt  emmeline       264        1 366.98171
3      blue-lagoon.txt      reef       134        1 186.76344
4      blue-lagoon.txt     paddy        97        1 135.47055
5      blue-lagoon.txt lestrange        92        1 128.53908
6      blue-lagoon.txt    island        89        1 124.38020
7      blue-lagoon.txt    dinghy        74        1 103.58578
8      blue-lagoon.txt    button       145        2 101.50634
9      blue-lagoon.txt     coral        67        1  93.88172
10     blue-lagoon.txt     cocoa        66        1  92.49543
11           sheik.txt     diana       315        1 437.68272
12           sheik.txt     sheik       214        1 297.66699
13           sheik.txt     ahmed       164        1 228.35228
14           sheik.txt    gaston       139        1 193.69492
15           sheik.txt      arab        95        1 132.69796
16           sheik.txt       ben        79        1 110.51725
17           sheik.txt    hassan        79        1 110.51725
18           sheik.txt    aubrey        74        1 103.58578
19           sheik.txt      camp        62        1  86.95025
20           sheik.txt     saint       117        2  82.09822
21     three-weeks.txt      paul       574        1 796.73296
22     three-weeks.txt    dmitry        83        1 116.06243
23     three-weeks.txt   charles        54        1  75.85990
24     three-weeks.txt  isabella        41        1  57.83807
25     three-weeks.txt   grigsby        31        1  43.97513
26     three-weeks.txt      pike        31        1  43.97513
27     three-weeks.txt henrietta        27        1  38.42995
28     three-weeks.txt   lucerne        27        1  38.42995
29     three-weeks.txt   terrace        27        1  38.42995
30     three-weeks.txt   darling        52        2  37.04365
31 way-of-an-eagle.txt      nick       646        1 896.54616
32 way-of-an-eagle.txt    muriel       523        1 726.03195
33 way-of-an-eagle.txt     daisy       301        1 418.27460
34 way-of-an-eagle.txt    grange       177        1 246.37410
35 way-of-an-eagle.txt   bassett       136        1 189.53603
36 way-of-an-eagle.txt      olga       128        1 178.44568
37 way-of-an-eagle.txt     blake       116        1 161.81015
38 way-of-an-eagle.txt ratcliffe        75        1 104.97208
39 way-of-an-eagle.txt    roscoe        49        1  68.92842
40 way-of-an-eagle.txt       jim        91        2  64.07639

Reflect on why this list isolates so many character names.

We can even do a “relevance search” for a particular keyword:

novel_counts %>%
    filter(feature=="desert") %>%
    arrange(desc(score))
Source: local data frame [3 x 5]
Groups: feature

                title feature term_freq doc_freq     score
1           sheik.txt  desert        97        3 28.905161
2 way-of-an-eagle.txt  desert         8        3  3.301457
3     three-weeks.txt  desert         2        3  1.575364

tf*idf is not the only scoring scheme, and another option, Dunning’s log-likelihood, is normally better for finding “most characteristic words.” We’ll return to that later.

Tidying (with exercises)

Hadley Wickham defines tidy data as data in which

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

(“Tidy Data,” Journal of Statistical Software 59, no. 10 (August 2014): 4.)

The concept of “observation” is not always straightforward in literary studies; “case” might be a useful alternate term. These criteria enable us to recognize that some organizations of data make it easier to carry out aggregating analyses than others.

A typical example, for our purposes, relates to historical time series. In a spreadsheet, you might often write down a series of columns for the same thing over time. Here is an example. I have provided a file of citation data on Sakai. This is a CSV file derived from the Web of Knowledge Arts and Humanities Citation Index data by exporting results from the top 500 most-cited articles from the 1990s in some journals I picked arbitrarily: boundary 2, ELH, MfS, NLH, and PMLA. It tells you how many citations have been indexed for each of these articles. This is particularly untidy data, and I have done a little preliminary cleaning for you so that you can read in the data with the following single line:

cites <- read.csv("wok90s-journals.csv", as.is=T, quote='"',
                  encoding="UTF-8")

Pare down columns (exercise)

As you can see by trying colnames(cites), this includes a bunch of columns of bibliographic information, and then a series of columns counting the number of citations to that item in each year for which there is data (1980–2015). Use select to form a new data frame, cites_items, with only the author, title, journal (called “source title”), publication year, and yearly citation columns from 1990 to 2014, which are called things like X1995. (Throw out the precalculated averages and totals.) It will simplify things to know that select lets you write “all columns from a to b” as a:b:

# one-row, six-column frame
frm <- data.frame(A=1, B=1, C=1, D=1, E=1, F=1)
frm
  A B C D E F
1 1 1 1 1 1 1
# pick four columns
frm %>% select(A, D:F)
  A D E F
1 1 1 1 1

You can verify that you have the right result using this line:

all(colnames(cites_items) == c(
    "Title",
    "Authors",
    "Source.Title",
    "Publication.Year",
    paste0("X", 1990:2014))  # base R; no extra package needed
)
[1] TRUE

Omnium gatherum (reading only)

cites_items is still untidy in Wickham’s sense, because each of those yearly citation columns is really a separate “observation.” If we wanted, for example, to take sums of citation counts for each five-year period, it would be mightily inconvenient to do with this data in the present form. But if we could have one row of the table for each citation count for each item (one row per item per year), we could make use of the group_by function to carry out the computation much more easily.

The tidyr package supplies a function to help us with this job. It is called gather. gather takes c columns of a data frame with r rows and turns them into 2 columns and rc rows (one group of r rows for each of the c columns). The new columns are called the key and the value columns; the old column names become keys. If the frame has any other columns excluded from the gather operation, the values are repeated over c times each in the result. You write it like this:

gather(frm, "keyname", "valuename", -exclude1, -exclude2, ...)

This is much easier to see in action than to explain. Here are counts of publications in three years for two writers, C.L. Moore and Leslie Stone, recorded in a three-column data frame.

frm <- data.frame(story_pubs1929=c(0, 3),
                  story_pubs1930=c(3, 2),
                  story_pubs1931=c(0, 1))
frm
  story_pubs1929 story_pubs1930 story_pubs1931
1              0              3              0
2              3              2              1

(Data cribbed by hand from ISFDB.) To gather this, we decide on names for the key and value columns:

gather(frm, "year", "count")
            year count
1 story_pubs1929     0
2 story_pubs1929     3
3 story_pubs1930     3
4 story_pubs1930     2
5 story_pubs1931     0
6 story_pubs1931     1

Now consider the case where the author names are included:

frm <- data.frame(author_last=c("Moore", "Stone"),
                  author_first=c("C.L.", "Leslie"),
                  story_pubs1929=c(0, 3),
                  story_pubs1930=c(3, 2),
                  story_pubs1931=c(0, 1))

Now we need to ensure that the author names are excluded from the gather:

gather(frm, "year", "count", -author_last, -author_first)
  author_last author_first           year count
1       Moore         C.L. story_pubs1929     0
2       Stone       Leslie story_pubs1929     3
3       Moore         C.L. story_pubs1930     3
4       Stone       Leslie story_pubs1930     2
5       Moore         C.L. story_pubs1931     0
6       Stone       Leslie story_pubs1931     1

(Try gather(frm, "year", "count") and see what happens.)

Make it tidy (exercise)

Use gather to transform cites_items into a data frame cites_counts with the following columns:

[1] "Title"            "Authors"          "Source.Title"    
[4] "Publication.Year" "year"             "citation_count"  

Then use mutate to get rid of the Xs in the year column. The analysis would begin from here, but this is enough for now. Test that you have done this correctly by comparing your results with the ones I get in the following pipeline expression:

cites_counts %>%
    rename(journal=Source.Title) %>%
    mutate(journal=gsub("-.*$", "", journal)) %>% # clean up journal names
    group_by(journal, year) %>%
    summarize(yearly_cites=sum(citation_count)) %>% # total cites per year
    summarize(hot_year=year[which.max(yearly_cites)],
              hot_cites=max(yearly_cites)) # find year when most cited
Source: local data frame [5 x 3]

                 journal hot_year hot_cites
1             BOUNDARY 2     2013        79
2                    ELH     2009       109
3 MODERN FICTION STUDIES     2013        34
4   NEW LITERARY HISTORY     2013       122
5                   PMLA     2011       124

(gather has an inverse operation, spread, which you can read about using help(spread).)
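As a sketch of that inverse relationship, the following gathers the small author frame from the reading section above and then spreads it back out. The frame is the same made-up ISFDB example; I have added stringsAsFactors=FALSE, which is my own choice here and not something the earlier code did.

```r
library("tidyr")

frm <- data.frame(author_last=c("Moore", "Stone"),
                  author_first=c("C.L.", "Leslie"),
                  story_pubs1929=c(0, 3),
                  story_pubs1930=c(3, 2),
                  story_pubs1931=c(0, 1),
                  stringsAsFactors=FALSE)

# wide to long: one row per author per year
long <- gather(frm, "year", "count", -author_last, -author_first)

# long to wide: the keys in year become column names again
wide <- spread(long, year, count)

dim(long)   # 6 rows, 4 columns
dim(wide)   # 2 rows, 5 columns
```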

One more higher-order function (optional)

Consider the function sum (built in to R). Given a vector of numbers x, sum(x) gives the sum, a single number. There is no equivalent for lists, but we could write one. Here is a higher-order way to do it. Let’s define an abstract function:

reduce <- function (f, xss, initial) {
    result <- initial
    for (xs in xss) {   # xss is made of many xs
        result <- f(result, xs)
    }
    result
}
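Before the exercises, here is a sketch of reduce applied to two jobs that have nothing to do with summing (both examples are my own inventions for illustration). In each case f takes the result so far and the next element of the list, and initial is the value to start from.

```r
reduce <- function (f, xss, initial) {
    result <- initial
    for (xs in xss) {   # xss is made of many xs
        result <- f(result, xs)
    }
    result
}

# glue a list of strings together, starting from the empty string
glued <- reduce(paste0, list("tf", "*", "idf"), "")

# find the length of the longest vector in a list, starting from 0
longest <- reduce(function (so_far, xs) max(so_far, length(xs)),
                  list(1:3, 1:10, 1:5),
                  0)

glued     # "tf*idf"
longest   # 10
```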

Reduce to the simpler case (optional exercise)

Write a function that gives the overall sum of a list of numeric vectors in terms of reduce. The whole body of the function should look like

sum_list <- function (xss) {
    reduce(...)
}

Test your function:

sum_list(list(1:10, 1:100, 1:1000))
[1] 505605

One-liner (optional exercise)

Setting aside reduce, write a one-line function that does the same thing. You’ll need one of the higher-order built-in functions from class.

Discussion (optional)

What is the relationship between reduce and the summarize function from dplyr?

Visualization introduced (no exercises)

Visualization is—this is my argument for next time—an extension into the visual domain of the operations of transformation and aggregation that you have already spent time learning. As it happens, the functions for plotting in R that we will learn are designed with this principle in mind.

We will begin with qplot, which is part of the ggplot2 package. You already installed ggplot2, and I have set things up so that it is loaded when you load litdata.

qplot is a function which returns a special value called a ggplot object; when a ggplot object is evaluated, it has the side effect of printing a graphic. Normally you can get away with thinking of qplot as the “plot command.” It is invoked as follows:

qplot(x=x_var, y=y_var, data=frm, geom=geom_name, ...)

x_var and y_var are columns in a data frame frm. You don’t quote them (this is like dplyr). They are said to be aesthetic mappings, in the sense that the data in x_var will correspond to coordinates on the x axis, those in y_var to coordinates on the y axis. Where I’ve written ... you can add other aesthetic mappings of further columns in frm to graphical attributes like color. geom_name is a string naming a “plot geom,” which corresponds pretty closely to the types of plots you might want to make (we’ll refine this understanding later).

Here is a question about the literary translations data: what is the relation between the number of titles each publisher publishes and the number of languages they publish translations from?

langs_titles <- txl %>%
    group_by(Publisher) %>%
    summarize(titles=n(), langs=n_distinct(Language))

qplot(x=titles, y=langs, data=langs_titles, geom="point")

If we want to compare the genres visually, we could make a simple bar chart:

# qplot knows to tally up frequencies for a bar chart,
# so the y variable is implicit
qplot(x=Genre, data=txl, geom="bar")

If we want to draw a line, we have to tell qplot we want the line to go through all the points. This demands an extra aesthetic mapping parameter, group. Here is a time series of the number of translations:

yearly_totals <- txl %>%
    group_by(Year) %>%
    summarize(count=n())

qplot(x=Year, y=count, data=yearly_totals, geom="line",
      group=1)

That would be the recession at work, I’d guess, in the dip.

We can also use group to put more than one line on the chart, and let’s add a color mapping while we’re at it:

yearly_genres <- txl %>%
    group_by(Year, Genre) %>%
    summarize(count=n())

qplot(x=Year, y=count, data=yearly_genres, geom="line",
      group=Genre, color=Genre)