Homework 7
I have made a printable PDF version of this homework as well: hw7.pdf.
To produce a little breathing room, this homework will be due March 30. The next homework will be due a week after that, then we’ll try to get back on schedule.
For this homework, use the Homework template. You will also need to include a code chunk at the top of your R markdown that loads the libraries you need: `litdata`, `dplyr`, and `tidyr`.
```{r include=F, cache=F}
library("litdata")
library("dplyr")
library("tidyr")
```
If your chunk begins `{r include=F}` instead of `{r}`, then the code will execute, but it will not be printed in your PDF. For technical reasons I will only explain if you ask, chunks with `library` calls should also be `cache=F`.
This homework is mostly reading and code for you to try out. Each section heading says whether there’s an exercise or not. There is an optional exercise on higher-order functions in the next-to-last section. Please don’t include all of my explanatory text in your own homework; just include the code necessary to complete the exercises, plus any commentary you care to make on that code.
I’d like all of you to concentrate on the position paper. Don’t spend more than two hours on this homework before the paper deadline. I also am concerned about the long working times that have been reported for some of these homeworks. I don’t expect you to work to exhaustion on these exercises. Many things in graduate school are exhausting, but it is wise to save your energy for the tasks where an all-out effort will be worth it (which might include, to speak only of your scholarly work, final papers and projects in courses, a first journal publication, and your dissertation committee meetings). Make a sincere effort on each part of the homework, talk to one another and help each other when you can, and then, where you are stuck, make careful notes about what isn’t working or doesn’t make sense, and stop.
Ply the trade (exercises)
This exercise is more practice with `dplyr`. A crowded “data wrangling cheat sheet” that might be useful as a reference can be downloaded here: http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf. Try to ignore everything we haven’t discussed. [Edited 3/15/15:] I’ve also put some extra notes on the pipe operator on the website, which you can use to supplement the somewhat compressed presentation of `%>%` in class. I emphasize that the questions below are to be answered using only the functions introduced in class and anything I explicitly note here on the homework.
Let’s continue to work with Three Percent’s spreadsheet data about literary translations in the U.S., which I’ve put on Sakai. Save the file in the same folder as your homework R markdown. As in class, let’s drop the 2015 items, since it’s only March.
# using a *relative path* here
txl <- read.csv("three-percent.csv", as.is=T, encoding="UTF-8")
txl <- txl %>% filter(Year < 2015) %>% rename(Language=Lanuage)
(`rename` is a convenient `dplyr` function. `rename(x=y)` is equivalent to `mutate(x=y) %>% select(-y)`, except that `rename` leaves the column in its original position.)
Use `dplyr` expression pipelines to answer the following questions about this data set:
- Within the two catalogued genres, poetry and fiction, which are the three most prolific publishers of translations? When you use `top_n`, you also have to `arrange` if you want the chosen rows in sorted order.
- How many publishers are publishing translations in each year?
- Which five translators are most prolific? Create a `translator_fullname` column and group by that. Discard names that aren’t the names of individuals.
- Among the individual translators from the previous question who have published more than once in this data set, what proportion have published with multiple presses?
Here are the answers I get:
Source: local data frame [6 x 3]
Groups: Genre

    Genre       Publisher count
1 Fiction  Dalkey Archive   181
2 Fiction  AmazonCrossing   121
3 Fiction Europa Editions    93
4  Poetry    Zephyr Press    28
5  Poetry      White Pine    23
6  Poetry   Ugly Duckling    20

Source: local data frame [7 x 2]

  Year n_publishers
1 2008          143
2 2009          143
3 2010          136
4 2011          146
5 2012          161
6 2013          195
7 2014          194

Source: local data frame [5 x 2]

   translator_fullname count
1       Curtis, Howard    35
2     Anderson, Alison    25
3         Bell, Anthea    25
4 Costa, Margaret Jull    24
5      Shugaar, Antony    24

Source: local data frame [1 x 1]

  percent_multi_pub
1             68.75
[Edited, 4/20/16. The denominator for the last one was slightly off in the original version. Works with multiple translators are coded with translator names `"various"` and `"Various"`.]
“Relevance” (no exercises)
We are still not done counting words (will we ever be?). Just to be goofy about it, let’s practice `dplyr` operations for this purpose. This section does not have any exercises for you to do. If I’d had time I would have gone through this in class. If you like, before each code chunk, stop and ask yourself how you think you might do it before reading on.
Let’s add to our old friend The Sheik a couple of other texts. Let’s work with hand-trimmed text files of Hull’s novel and three others:
- Elinor Glyn, Three Weeks (1907)
- Ethel M. Dell, The Way of an Eagle (1911)
- H. de Vere Stacpoole, The Blue Lagoon (1908)
These files are on Sakai; unzip the archive into the same folder as your homework R markdown. You should get a folder called `e20c-novels` with four text files in it. Now we can reuse our work from last time, the `featurize` function:
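featurize <- function (ll) {
    # split every line on runs of non-word characters
    result <- unlist(strsplit(ll, "\\W+"))
    # drop the empty strings that strsplit can leave behind
    result <- result[result != ""]
    tolower(result)
}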
The next step uses an idiom you haven’t seen yet. `rbind(x, y, z, ...)` stacks data frames `x`, `y`, `z` with the same columns on top of one another. `do.call(rbind, lst)` stacks a list of data frames `lst` into a single data frame. (`do.call` is another functional, similar to `bind` from class or `%>%`. `do.call(f, lst)` is equivalent to `f(lst[[1]], lst[[2]], ...)`.) Don’t worry too much if the details of the following are vague:
# create vector of relative paths
fs <- file.path("e20c-novels", list.files("e20c-novels"))
frms <- list()
for (j in seq_along(fs)) {
    ll <- readLines(fs[j], encoding="UTF-8")
    # derive the feature vector:
    words <- featurize(ll)
    # Now create a data frame. stringsAsFactors=F ensures our feature
    # vector remains a character vector rather than a factor.
    # The title column will just have one value, the filename,
    # repeated over and over again.
    frms[[j]] <- data.frame(title=basename(fs[j]),
                            feature=words,
                            stringsAsFactors=F)
}
novels <- do.call(rbind, frms)
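As a smaller illustration of the `do.call(rbind, ...)` idiom in the last line above, here is a sketch on two tiny hand-made data frames. The frames and the names `small_frms`, `stacked`, and `by_hand` are made up for the example and are not part of the homework data.

```r
# Two hypothetical data frames with the same columns
small_frms <- list(
    data.frame(title="a.txt", feature=c("the", "reef"),
               stringsAsFactors=F),
    data.frame(title="b.txt", feature=c("the", "desert", "sand"),
               stringsAsFactors=F)
)

# do.call passes the list elements to rbind as separate arguments
stacked <- do.call(rbind, small_frms)

# ...which is the same as writing the call out by hand
by_hand <- rbind(small_frms[[1]], small_frms[[2]])

nrow(stacked)               # one row per word across both frames
identical(stacked, by_hand)
```

The advantage of `do.call` is that it works however many data frames the list holds, which is why it suits a list built up in a loop.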
`novels` is a very long data frame, with one row for each word of these four texts:
[1] 313812
The subtotals for each novel:
Source: local data frame [4 x 2]
title total_words
1 blue-lagoon.txt 63752
2 sheik.txt 88974
3 three-weeks.txt 53823
4 way-of-an-eagle.txt 107263
tf
Now we summarize this data frame into a data frame of word counts (or term frequencies) for each novel.
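The summarizing code is not shown here. A minimal sketch of one way to do it, assuming the counts are produced by grouping on both `title` and `feature` and tallying rows with `n()`; the toy frame `toy_novels` is invented so that the block runs without the novel files. Summarizing peels off the last grouping level, which would leave the result grouped by `title`, as the later discussion expects.

```r
library("dplyr")

# Hypothetical stand-in for novels: one row per word occurrence
toy_novels <- data.frame(
    title=c("a.txt", "a.txt", "a.txt", "b.txt", "b.txt"),
    feature=c("the", "reef", "the", "the", "desert"),
    stringsAsFactors=F
)

# One row per (title, feature) pair, with the number of occurrences
toy_counts <- toy_novels %>%
    group_by(title, feature) %>%
    summarize(term_freq=n())

toy_counts
group_vars(toy_counts)  # the grouping that remains after summarize
```

On the real data the same pipeline would start from `novels` and the result would be assigned to `novel_counts`.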
The most frequent words in each novel are hardly surprising, but something funny happens when we try to print them out:
Source: local data frame [40 x 3]
Groups: title
title feature term_freq
1 blue-lagoon.txt a 1842
2 blue-lagoon.txt and 2290
3 blue-lagoon.txt had 743
4 blue-lagoon.txt he 1134
5 blue-lagoon.txt in 1077
6 blue-lagoon.txt it 981
7 blue-lagoon.txt of 1924
8 blue-lagoon.txt the 5342
9 blue-lagoon.txt to 1268
10 blue-lagoon.txt was 1019
.. ... ... ...
Why hasn’t R given us all the rows? This is, as programmers say, a feature, not a bug. `dplyr` changes the way data frames are printed so that if you print a long data frame, it only prints the first few rows (and if there are lots of columns, it only gives you the first few). That’s very useful when you’re exploring data in the console, but when you’re making a report, you want to be able to print everything. I’m still postponing the moment when you have to learn how to produce nicely-formatted tables using `xtable`; for now, here’s how to override that `dplyr` feature. We have to end our pipelines with an explicit call to the base R data frame print function, which is called `print.data.frame`:
novel_counts %>% group_by(title) %>%
top_n(10, term_freq) %>% # gets top 10 per group, but doesn't sort
arrange(desc(term_freq)) %>% # sorts within groups
print.data.frame()
title feature term_freq
1 blue-lagoon.txt the 5342
2 blue-lagoon.txt and 2290
3 blue-lagoon.txt of 1924
4 blue-lagoon.txt a 1842
5 blue-lagoon.txt to 1268
6 blue-lagoon.txt he 1134
7 blue-lagoon.txt in 1077
8 blue-lagoon.txt was 1019
9 blue-lagoon.txt it 981
10 blue-lagoon.txt had 743
11 sheik.txt the 4976
12 sheik.txt her 2805
13 sheik.txt and 2735
14 sheik.txt she 2451
15 sheik.txt to 2214
16 sheik.txt of 2143
17 sheik.txt a 1860
18 sheik.txt had 1675
19 sheik.txt was 1603
20 sheik.txt he 1517
21 three-weeks.txt the 2499
22 three-weeks.txt and 2322
23 three-weeks.txt of 1364
24 three-weeks.txt to 1353
25 three-weeks.txt he 1216
26 three-weeks.txt a 1058
27 three-weeks.txt his 973
28 three-weeks.txt was 917
29 three-weeks.txt in 874
30 three-weeks.txt her 784
31 way-of-an-eagle.txt the 3794
32 way-of-an-eagle.txt to 3064
33 way-of-an-eagle.txt her 2813
34 way-of-an-eagle.txt she 2767
35 way-of-an-eagle.txt and 2446
36 way-of-an-eagle.txt a 2344
37 way-of-an-eagle.txt he 2334
38 way-of-an-eagle.txt of 2058
39 way-of-an-eagle.txt you 1950
40 way-of-an-eagle.txt i 1818
idf
Document frequencies (the document frequency is the number of titles in the corpus that a term occurs in) involve a new complication. `novel_counts` is still grouped by `title` (look at the assignment expression above) but now we want to group it by feature and count titles. If you apply `group_by` to a data frame that already has a grouping, `dplyr` replaces the previous grouping. (If you want multiple levels of grouping, you write `group_by(col1, col2)`.)
But this isn’t quite what we want, because the `summarize` operation collapses away the titles. What we really want is to add on a document frequency column to our `novel_counts` data frame. Fortunately, this is as simple as changing `summarize` to `mutate` (pause for a minute to think about why this works):
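The two pipelines contrasted here are not shown. The following sketch puts them side by side on an invented frame, `toy_counts`, which stands in for `novel_counts` and is assumed to have one row per title and feature. The column name `doc_freq` follows the output printed later in this section.

```r
library("dplyr")

# Hypothetical per-title word counts
toy_counts <- data.frame(
    title=c("a.txt", "a.txt", "b.txt", "b.txt", "c.txt"),
    feature=c("the", "reef", "the", "desert", "the"),
    term_freq=c(10, 3, 12, 4, 9),
    stringsAsFactors=F
)

# summarize: one row per feature; the titles are collapsed away
doc_freqs <- toy_counts %>%
    group_by(feature) %>%
    summarize(doc_freq=n())

# mutate: the same per-group count, but every original row is kept
# and the count is repeated on each row of its group
toy_with_df <- toy_counts %>%
    group_by(feature) %>%
    mutate(doc_freq=n())

doc_freqs
toy_with_df
```

Because each feature appears at most once per title in such a frame, the number of rows in each feature group is the number of titles containing that feature.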
tf*idf
Now recall the tf*idf formula from Ramsay. I’ll write it as a function:
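The function definition is not shown here. A sketch, assuming the formula is 1 + tf · log(N / df) with the natural logarithm: this form reproduces the scores printed further down (for instance 436.29643 for a term frequency of 314, a document frequency of 1, and 4 documents). The name `tf_idf` and the argument order are taken from the call that appears below; the argument names are guesses.

```r
# tf: term frequency, df: document frequency, N: number of documents
# (assumed form, checked against the scores printed in this section)
tf_idf <- function (tf, df, N) {
    1 + tf * log(N / df)
}

tf_idf(314, 1, 4)   # "dick" in blue-lagoon.txt
tf_idf(145, 2, 4)   # "button" in blue-lagoon.txt
```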
Since the arithmetic operators work on vectors, and so does `log`, we can get all the tf*idf scores at once:
num_docs <- length(fs) # fs: the filenames vector; a long way to say 4
novel_counts <- novel_counts %>%
mutate(score=tf_idf(term_freq, doc_freq, num_docs))
Here are the most “relevant” words for each novel:
novel_counts %>%
group_by(title) %>%
top_n(10, score) %>%
arrange(desc(score)) %>%
print.data.frame()
title feature term_freq doc_freq score
1 blue-lagoon.txt dick 314 1 436.29643
2 blue-lagoon.txt emmeline 264 1 366.98171
3 blue-lagoon.txt reef 134 1 186.76344
4 blue-lagoon.txt paddy 97 1 135.47055
5 blue-lagoon.txt lestrange 92 1 128.53908
6 blue-lagoon.txt island 89 1 124.38020
7 blue-lagoon.txt dinghy 74 1 103.58578
8 blue-lagoon.txt button 145 2 101.50634
9 blue-lagoon.txt coral 67 1 93.88172
10 blue-lagoon.txt cocoa 66 1 92.49543
11 sheik.txt diana 315 1 437.68272
12 sheik.txt sheik 214 1 297.66699
13 sheik.txt ahmed 164 1 228.35228
14 sheik.txt gaston 139 1 193.69492
15 sheik.txt arab 95 1 132.69796
16 sheik.txt ben 79 1 110.51725
17 sheik.txt hassan 79 1 110.51725
18 sheik.txt aubrey 74 1 103.58578
19 sheik.txt camp 62 1 86.95025
20 sheik.txt saint 117 2 82.09822
21 three-weeks.txt paul 574 1 796.73296
22 three-weeks.txt dmitry 83 1 116.06243
23 three-weeks.txt charles 54 1 75.85990
24 three-weeks.txt isabella 41 1 57.83807
25 three-weeks.txt grigsby 31 1 43.97513
26 three-weeks.txt pike 31 1 43.97513
27 three-weeks.txt henrietta 27 1 38.42995
28 three-weeks.txt lucerne 27 1 38.42995
29 three-weeks.txt terrace 27 1 38.42995
30 three-weeks.txt darling 52 2 37.04365
31 way-of-an-eagle.txt nick 646 1 896.54616
32 way-of-an-eagle.txt muriel 523 1 726.03195
33 way-of-an-eagle.txt daisy 301 1 418.27460
34 way-of-an-eagle.txt grange 177 1 246.37410
35 way-of-an-eagle.txt bassett 136 1 189.53603
36 way-of-an-eagle.txt olga 128 1 178.44568
37 way-of-an-eagle.txt blake 116 1 161.81015
38 way-of-an-eagle.txt ratcliffe 75 1 104.97208
39 way-of-an-eagle.txt roscoe 49 1 68.92842
40 way-of-an-eagle.txt jim 91 2 64.07639
Reflect on why this list isolates so many character names.
We can even do a “relevance search” for a particular keyword:
Source: local data frame [3 x 5]
Groups: feature
title feature term_freq doc_freq score
1 sheik.txt desert 97 3 28.905161
2 way-of-an-eagle.txt desert 8 3 3.301457
3 three-weeks.txt desert 2 3 1.575364
tf*idf is not the only scoring scheme, and another option, Dunning’s log-likelihood, is normally better for finding “most characteristic words.” We’ll return to that later.
Tidying (with exercises)
Hadley Wickham defines tidy data as data in which
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
(“Tidy Data,” Journal of Statistical Software 59, no. 10 (August 2014): 4.)
The concept of “observation” is not always straightforward in literary studies; “case” might be a useful alternate term. These criteria enable us to recognize that some organizations of data make it easier to carry out aggregating analyses than others.
A typical example, for our purposes, relates to historical time series. In a spreadsheet, you might often write down a series of columns for the same thing over time. Here is an example. I have provided a file of citation data on Sakai. This is a CSV file derived from the Web of Knowledge Arts and Humanities Citation Index data by exporting results from the top 500 most-cited articles from the 1990s in some journals I picked arbitrarily, boundary 2, ELH, MfS, NLH, and PMLA. It tells you how many citations have been indexed for each of these articles. This is particularly untidy data, and I have done a little preliminary cleaning for you so that you can read in the data with the following single line:
Pare down columns (exercise)
As you can see by trying `colnames(cites)`, this includes a bunch of columns of bibliographic information, and then a series of columns counting the number of citations to that item in each year for which there is data (1980–2015). Use `select` to form a new data frame, `cites_items`, with only the author, title, journal (called “source title”), publication year, and yearly citation columns from 1990 to 2014, which are called things like `X1995`. (Throw out the precalculated averages and totals.) It will simplify things to know that `select` lets you write “all columns from `a` to `b`” as `a:b`:
  A B C D E F
1 1 1 1 1 1 1

  A D E F
1 1 1 1 1
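The code behind the two printouts above is not shown. A sketch that would produce frames of the same shape, assuming a one-row data frame with columns `A` through `F`; the names `frm` and `picked` are chosen only for the example.

```r
library("dplyr")

# A one-row frame with six columns, all equal to 1
frm <- data.frame(A=1, B=1, C=1, D=1, E=1, F=1)
frm

# Keep column A, then every column from D through F
picked <- select(frm, A, D:F)
picked
```

Individual columns and ranges can be mixed freely in a single `select` call, which is what the exercise needs.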
You can verify that you have the right result using this line:
all(colnames(cites_items) == c(
"Title",
"Authors",
"Source.Title",
"Publication.Year",
str_c("X", 1990:2014))
)
[1] TRUE
Omnium gatherum (reading only)
`cites_items` is still untidy in Wickham’s sense, because each of those yearly citation columns is really a separate “observation.” If we wanted, for example, to take sums of citation counts for each five-year period, it would be mightily inconvenient to do with this data in the present form. But if we could have one row of the table for each citation count for each item (one row per item per year), we could make use of the `group_by` function to carry out the computation much more easily.
The `tidyr` package supplies a function to help us with this job. It is called `gather`. `gather` takes c columns of a data frame with r rows and turns them into 2 columns and rc rows (one group of r rows for each of the c columns). The new columns are called the key and the value columns; the old column names become keys. If the frame has any other columns excluded from the `gather` operation, their values are repeated c times each in the result. You write it like this:
This is much easier to see in action than to explain. Here are counts of publications in three years for two writers, C.L. Moore and Leslie Stone, recorded in a three-column data frame.
story_pubs1929 story_pubs1930 story_pubs1931
1 0 3 0
2 3 2 1
(Data cribbed by hand from ISFDB.) To `gather` this, we decide on names for the key and value columns:
year count
1 story_pubs1929 0
2 story_pubs1929 3
3 story_pubs1930 3
4 story_pubs1930 2
5 story_pubs1931 0
6 story_pubs1931 1
Now consider the case where the author names are included:
frm <- data.frame(author_last=c("Moore", "Stone"),
author_first=c("C.L.", "Leslie"),
story_pubs1929=c(0, 3),
story_pubs1930=c(3, 2),
story_pubs1931=c(0, 1))
Now we need to ensure that the author names are excluded from the `gather`:
author_last author_first year count
1 Moore C.L. story_pubs1929 0
2 Stone Leslie story_pubs1929 3
3 Moore C.L. story_pubs1930 3
4 Stone Leslie story_pubs1930 2
5 Moore C.L. story_pubs1931 0
6 Stone Leslie story_pubs1931 1
(Try `gather(frm, "year", "count")` and see what happens.)
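The `gather` calls for the two examples in this section are not shown. A sketch of calls that would give results of the shape printed above, assuming the key and value columns are named `year` and `count` and that excluded columns are marked with a minus sign; the names `counts_only` and `long` are invented for the example. In more recent `tidyr` releases `gather` is superseded by `pivot_longer`, though it remains available.

```r
library("tidyr")

frm <- data.frame(author_last=c("Moore", "Stone"),
                  author_first=c("C.L.", "Leslie"),
                  story_pubs1929=c(0, 3),
                  story_pubs1930=c(3, 2),
                  story_pubs1931=c(0, 1),
                  stringsAsFactors=F)

# First case: a frame holding only the three count columns
counts_only <- gather(frm[, 3:5], "year", "count")

# Second case: keep the author columns out of the gather, so that
# their values are repeated once for each gathered column
long <- gather(frm, "year", "count", -author_last, -author_first)

counts_only
long
```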
Make it tidy (exercise)
Use `gather` to transform `cites_items` into a data frame `cites_counts` with the following columns:
[1] "Title" "Authors" "Source.Title"
[4] "Publication.Year" "year" "citation_count"
Then use `mutate` to get rid of the `X`s in the `year` column. The analysis would begin from here, but this is enough for now. Test that you have done this correctly by comparing your results with the ones I get in the following pipeline expression:
cites_counts %>%
rename(journal=Source.Title) %>%
mutate(journal=gsub("-.*$", "", journal)) %>% # clean up journal names
group_by(journal, year) %>%
summarize(yearly_cites=sum(citation_count)) %>% # total cites per year
summarize(hot_year=year[which.max(yearly_cites)],
hot_cites=max(yearly_cites)) # find year when most cited
Source: local data frame [5 x 3]
journal hot_year hot_cites
1 BOUNDARY 2 2013 79
2 ELH 2009 109
3 MODERN FICTION STUDIES 2013 34
4 NEW LITERARY HISTORY 2013 122
5 PMLA 2011 124
(`gather` has an inverse operation, `spread`, which you can read about using `help(spread)`.)
One more higher-order function (optional)
Consider the function `sum` (built in to R). Given a vector of numbers `x`, `sum(x)` gives the sum, a single number. There is no equivalent for lists, but we could write one. Here is a higher-order way to do it. Let’s define an abstract function:
reduce <- function (f, xss, initial) {
    result <- initial
    for (xs in xss) { # xss is made of many xs
        result <- f(result, xs)
    }
    result
}
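To see how `reduce` threads a running result through a list, here is a small usage sketch on a task other than summing: counting the total number of items in a list of vectors. The helper `count_items` is made up for the example, and the definition of `reduce` is repeated only so that the block runs by itself.

```r
reduce <- function (f, xss, initial) {
    result <- initial
    for (xs in xss) {
        result <- f(result, xs)
    }
    result
}

# f takes the result so far and the next element of the list
count_items <- function (acc, xs) acc + length(xs)

# starts at 0, then adds 3, then 2, then 1
reduce(count_items, list(1:3, 4:5, 6), 0)
```

The `initial` argument is what comes back when the list is empty, so it should be the neutral value for whatever `f` computes.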
Reduce to the simpler case (optional exercise)
Write a function that gives the overall sum of a list of numeric vectors in terms of `reduce`. The whole body of the function should look like
Test your function:
[1] 505605
One-liner (optional exercise)
Setting aside `reduce`, write a one-line function that does the same thing. You’ll need one of the higher-order built-in functions from class.
Discussion (optional)
What is the relationship between `reduce` and the `summarize` function from `dplyr`?
Visualization introduced (no exercises)
Visualization is—this is my argument for next time—an extension into the visual domain of the operations of transformation and aggregation that you have already spent time learning. As it happens, the functions for plotting in R that we will learn are designed with this principle in mind.
We will begin with `qplot`, part of the `ggplot2` package. You already installed this, and I have set things up so that it is loaded when you load `litdata`.
`qplot` is a function which returns a special value called a `ggplot` object; when a `ggplot` object is evaluated, it has the side effect of printing a graphic. Normally you can get away with thinking of `qplot` as the “plot command.” It is invoked as follows:
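The invocation itself is not shown here. A sketch of its general shape, assuming the argument names `x`, `y`, `data`, and `geom` that the later examples in this section use; the toy frame and its `kind` column are invented. Recent `ggplot2` releases mark `qplot` as deprecated, so it may print a warning, but it should still run.

```r
library("ggplot2")

# Hypothetical data frame with columns named as in the discussion
frm <- data.frame(x_var=c(1, 2, 3, 4),
                  y_var=c(2, 4, 3, 5),
                  kind=c("a", "a", "b", "b"))

# General shape:
#   qplot(x=x_var, y=y_var, data=frm, geom="geom_name", ...)
# Here the geom is "point" and color is an extra aesthetic mapping.
p <- qplot(x=x_var, y=y_var, data=frm, geom="point", color=kind)

# p is a ggplot object; typing p at the console draws the plot
class(p)
```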
`x_var` and `y_var` are columns in a data frame `frm`. You don’t quote them (this is like `dplyr`). They are said to be aesthetic mappings, in the sense that the data in `x_var` will correspond to coordinates on the x axis, those in `y_var` to coordinates on the y axis. Where I’ve written `...` you can add other aesthetic mappings of further columns in `frm` to graphical attributes like color. `geom_name` is a string naming a “plot geom,” which corresponds pretty closely to the types of plots you might want to make (we’ll refine this understanding later).
Here is a question about the literary translations data: what is the relation between the number of titles each publisher publishes and the number of languages they publish translations from?
langs_titles <- txl %>%
group_by(Publisher) %>%
summarize(titles=n(), langs=n_distinct(Language))
qplot(x=titles, y=langs, data=langs_titles, geom="point")
If we want to compare the genres visually, we could make a simple bar chart:
# qplot knows to tally up frequencies for a bar chart,
# so the y variable is implicit
qplot(x=Genre, data=txl, geom="bar")
If we want to draw a line, we have to tell `qplot` we want the line to go through all the points. This demands an extra aesthetic mapping parameter, `group`. Here is a time series of the number of translations:
yearly_totals <- txl %>%
group_by(Year) %>%
summarize(count=n())
qplot(x=Year, y=count, data=yearly_totals, geom="line",
group=1)
That would be the recession at work, I’d guess, in the dip.
We can also use `group` to put more than one line on the chart, and let’s add a color mapping while we’re at it: