```{r setup, include=F}
knitr::opts_chunk$set(comment=NA, error=T, cache=T, autodep=T)
library("litdata")
```

# Passing show

Here's Jockers:

```{r}
show.files <- function(file.name.v) {
    for(i in 1:length(file.name.v)) {
        cat(i, file.name.v[i], "\n", sep=" ")
    }
}
```

Note, first, that this function does *not* look on your hard disk. It is purely a cosmetic function for taking a vector of file names and printing out a numbered listing. The work of checking your hard disk is done by the `dir` function, which you've met before under its alternate name, `list.files`. Jockers uses `dir` interactively, then passes the result to the function. As you'll find if you try to incorporate this function in an expression, it appears to have *no* return value:

```{r}
result <- show.files(c("pretend-file.txt", "pretend-again.txt"))
result
```

The return value of the function is not easy to read off the definition. But in R, a `for` loop is not a value-producing expression: it always evaluates to `NULL` (invisibly), so a function whose body ends with a loop returns `NULL`. Nor would the `cat` calls inside the loop be any help. You might think that the value of `cat(2, "pretend-again.txt", "\n", sep=" ")` is `"2 pretend-again.txt"`, but, as the help page for `cat` tells you, the return value of `cat` is always "None (invisible `NULL`)." `cat` has a *side effect* of printing its arguments to the console, but its *value* is always `NULL` (a special value which basically means "nothing").

(`cat` is nonetheless quite useful, since, as you'll note, unlike `print`, it doesn't add anything to its arguments---no `[1]` or anything like that appears when you `cat`; a quick demonstration follows at the end of this section. If you write an interactive script using R, use `cat` to print messages for your user, not `print`. `cat` shares its name, by the way, with the Unix utility for printing a file to the terminal.)

Most functions map inputs to outputs. `show.files` is *not* such a function. It maps inputs to `NULL` and prints them as a side effect. Even the name indicates that Jockers has been thinking of it *imperatively* rather than functionally: `show.files` is a little block of code that *does something*, rather than a machine for transforming data. Naturally the distinction is not really clear-cut, but it's good to practice the two ways of thinking. Here's a more functional way of doing it:

```{r}
show_files <- function (fs) {
    str_c(seq_along(fs), fs, sep=" ")
}
```

This maps an input to an output, as we can see by using it inside another expression (we say that we *compose* the functions `show_files` and `str_c`):

```{r}
fnames <- c("sheik-gutenberg.txt", "three-weeks-gutenberg.txt")
str_c("Novel file: ", show_files(fnames))
```

If we wanted the Jockers effect, we could still compose `show_files` with `cat`:

```{r}
cat(show_files(fnames), sep="\n")
```

Notice that our `show_files` has disposed of the `for` loop; our version is implicitly vectorized instead.
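And, as promised, here is the `print`/`cat` difference in miniature (on a made-up string):

```{r}
print("pretend-file.txt")      # print adds the `[1]` index marker and quotation marks
cat("pretend-file.txt", "\n")  # cat emits just the text itself
```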
# Get modular

Here's Jockers, being imperative again ("make!"):

```{r}
make.file.word.v.l <- function(files.v, input.dir) {
    #set up an empty container
    text.word.vector.l <- list()
    # loop over the files
    for(i in 1:length(files.v)) {
        # read the file in (notice that it is here that we need to know
        # the input directory)
        text.v <- scan(paste(input.dir, files.v[i], sep="/"),
                       what="character", sep="\n")
        #convert to single string
        text.v <- paste(text.v, collapse=" ")
        #lowercase and split on non-word characters
        text.lower.v <- tolower(text.v)
        text.words.v <- strsplit(text.lower.v, "\\W")
        text.words.v <- unlist(text.words.v)
        #remove the blanks
        text.words.v <- text.words.v[which(text.words.v!="")]
        #use the index id from the files.v vector as the "name" in the list
        text.word.vector.l[[files.v[i]]] <- text.words.v
    }
    return(text.word.vector.l)
}
```

Our task is to modularize this by noticing (just as Jockers's comments indicate) that this function does several discrete tasks. It reads files, featurizes them one at a time, and adds them to a list of word vectors. Featurizing, on its own, looks like this:

```{r}
featurize <- function (ll) {
    result <- unlist(strsplit(ll, "\\W+"))
    result <- result[result != ""]
    tolower(result)
}
```

You've written this code many times; this is the last time. Notice that Jockers's step of pasting the text lines into a single string is superfluous, because `unlist` flattens out the list produced by `strsplit` into a single vector anyway.

One of the benefits of the modular code is that we can test the featurizing component separately, as the homework asks you to do:

```{r}
pound <- c("In a Station of the Metro",
           "The apparition of these faces in the crowd;",
           "Petals on a wet, black bough."
)
featurize(pound)
```

Now we can stick our featurizer into `make.file.word.v.l`:

```{r}
feature_list <- function(files.v, input.dir) {
    #set up an empty container
    text.word.vector.l <- list()
    # loop over the files
    for(i in 1:length(files.v)) {
        # read the file in (notice that it is here that we need to know
        # the input directory)
        text.v <- scan(paste(input.dir, files.v[i], sep="/"),
                       what="character", sep="\n")
        # featurize and store under the file name
        text.word.vector.l[[files.v[i]]] <- featurize(text.v)
    }
    return(text.word.vector.l)
}
```

## Refinement 1: make it neater

The next step is small but addresses, I think, some points of confusion about files and directories. `make.file.word.v.l` has two formal parameters, `files.v` and `input.dir`. But in terms of the function logic, these two aren't really distinct: the function requires only a set of *paths to files* that it reads in and featurizes (see the demonstration just below). It's true that the nice touch of naming the resulting list elements by file names gets a little messier if we use paths instead, so that our list will have element names like `"plainText/data/austen.txt"` instead of just `"austen.txt"`. But let's not worry about that for now.
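To see the sense in which the two parameters collapse into one, notice that Jockers's `paste` line just glues them into a single vector of paths (shown here with the directory and file names we will work with below):

```{r}
paste("plainText", c("austen.txt", "melville.txt"), sep="/")
```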
If we had one formal parameter, the function could look like this:

```{r}
feature_list <- function (fs) {
    text.word.vector.l <- list()
    for (i in 1:length(fs)) {
        text.v <- scan(fs[i], what="character", sep="\n")
        text.word.vector.l[[fs[i]]] <- featurize(text.v)
    }
    return(text.word.vector.l)
}
```

With some of the clutter gone, we can also be a little more idiomatic about writing the function:

```{r}
feature_list <- function (fs) {
    result <- list()
    for (f in fs) { # no need for `1:length` or `seq_along`
        ll <- scan(f, what="character", sep="\n")
        result[[f]] <- featurize(ll)
    }
    result # no need for `return` since this is the last line
}
```

`scan(f, what="character", sep="\n")` is, for our purposes, synonymous with `readLines(f)`, so we can make this slightly more concise. [*Edited 3/12/15*.] This also lets us catch a potential source of errors in Jockers's code (and in my earlier version) by explicitly specifying the text encoding.

```{r}
feature_list <- function (fs) {
    result <- list()
    for (f in fs) {
        ll <- readLines(f, encoding="UTF-8")
        result[[f]] <- featurize(ll)
    }
    result
}
```

One benefit of this approach is that we are now in a position to clarify what is going on with the files here. `readLines` treats each element of the argument bound to `fs` as a path to a file. If the path is relative, it is relative to the working directory. Relative paths can just be plain file names, like `austen.txt`, in which case the file is sought in the working directory itself. Or they can have directory components, like `plainText/melville.txt`, in which case R looks for a folder *inside* the present working directory called `plainText`, and inside *that* for `melville.txt`.

Jockers wants to be able to slurp up a whole directory's worth of files at once, but, again, a more modular style lets us separate out this step, which involves using the function `dir`/`list.files` to get all the file names we need to pass on to `feature_list`. But the trick, as Jockers's code shows you, is that `list.files` returns a vector of file names, but *not* valid paths to those files. I have copied the `plainText` folder from the `data` folder in Jockers's `TextAnalysisWithR` files into the same folder as this solution set. That is, the current working directory includes both `ss6.Rmd` and the folder `plainText` (with `austen.txt` and `melville.txt` inside it).

```{r}
pt_files <- list.files("plainText")
pt_files
```

But if I try to read in the file `pt_files[1]`, I get an error, because that value is `austen.txt`, which is a path relative to the directory `plainText`, not to the working directory. To get the valid path, we need to put the directory name back on the front of the path, as Jockers does with his `paste` line. Then we can test that the files are found:

```{r}
file.exists(paste("plainText", pt_files, sep="/"))
```

Actually, the official way to paste together file paths is to use not `paste` or `str_c` but the convenience function `file.path`, which spares typing out `sep="/"`, and, like the other functions, is vectorized:

```{r}
file.path("plainText", pt_files)
```

So if we wanted to fully replicate Jockers's program logic, we'd write a "wrapper" function for `feature_list`:

```{r}
dir_feature_list <- function (input_dir) {
    fs <- file.path(input_dir, list.files(input_dir))
    feature_list(fs)
}
```

This assumes that *every* file in `input_dir` is a text file. It might be smarter only to take files that end in `.txt`. We could use...regular expressions!
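A quick check of the pattern we need (`\\.` matches a literal period, and `$` anchors the match at the end of the string; `notes.doc` is an invented non-text file name):

```{r}
grepl("\\.txt$", c("austen.txt", "notes.doc", "melville.txt"))
```

Filtering on that pattern, the wrapper becomes: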
```{r}
dir_feature_list <- function (input_dir) {
    fs <- file.path(input_dir, list.files(input_dir))
    text_files <- fs[grepl("\\.txt$", fs)]
    feature_list(text_files)
}
```

(*Actually* the super-fancy way to do this is not to use regular expressions but to use "shell globbing," a simpler kind of pattern matching just for filenames. See `help(Sys.glob)`.)

```{r}
corpus <- dir_feature_list("plainText")
```

Spot-check your results:

```{r}
corpus[["plainText/melville.txt"]][3926]
corpus[["plainText/austen.txt"]][118]
```

Now, about those annoying element names: we could fix that by changing the line in `feature_list` reading

```{r eval=F}
result[[f]] <- featurize(ll)
```

to

```{r eval=F}
result[[basename(f)]] <- featurize(ll)
```

`basename` removes everything from a string up to and including its last `/`.

## Refinement 2: tidy the texts

Our body-extractor from class looked like:

```{r}
gutenberg_body <- function (ll, start_pat, end_pat) {
    start <- grep(start_pat, ll)
    end <- grep(end_pat, ll)
    start <- start[1]
    end <- end[1]
    ll[start:end]
}
```

I could have been a little more exact by writing:

```{r}
gutenberg_body <- function (ll, start_pat, end_pat) {
    start <- grep(start_pat, ll)
    end <- grep(end_pat, ll)
    start <- start[1]
    end <- end[length(end)] # choose *last* occurrence of end_pat
    ll[start:end]
}
```

Now we can insert this into our feature-list pipeline, in the `for` loop, after we read in the text lines but before we pass the text on to `featurize`. The trick, though, is that we need to specify `start_pat` and `end_pat`. The natural choice is to rely on the Gutenberg convention:

```{r}
body_feature_list <- function (fs) {
    result <- list()
    for (f in fs) {
        ll <- readLines(f, encoding="UTF-8") # edited 3/12/15
        body_ll <- gutenberg_body(ll,
            "^\\*{3} START OF THIS PROJECT GUTENBERG",
            "^\\*{3} END OF THIS PROJECT GUTENBERG")
        result[[f]] <- featurize(body_ll)
    }
    result
}
```

Notice the backslashes in `\\*{3}` to specify a literal `*`. Testing this out:

```{r}
cleaned_corpus <- body_feature_list(c("plainText/melville.txt", "plainText/austen.txt"))
cleaned_corpus[["plainText/melville.txt"]][1:30]
cleaned_corpus[["plainText/austen.txt"]][1:30]
```

It turns out that there is a little more paratext within the "text," but as I remarked on the homework, if we wanted to dispose of this we'd have to do so case by case. Note that it would be a trivial matter to produce a function `dir_body_feature_list` by making one change to `dir_feature_list` above.

# KWIC, said the bird

The complete `kwic2` function looks like this:

```{r}
kwic2 <- function (words, feat) {
    # add padding to deal with keywords at the ends
    padded_words <- c("", "", words, "", "")
    # now find all locations of the keyword at once
    locs <- which(feat == padded_words)
    # return NA if no match
    if (length(locs) == 0) {
        return(NA)
    }
    # otherwise:
    str_c(padded_words[locs - 2],
          padded_words[locs - 1],
          padded_words[locs],
          padded_words[locs + 1],
          padded_words[locs + 2],
          sep=" ")
}
```

This way of doing it has a couple of tricks. For one, applying `which` to a logical vector (`feat == padded_words`) is useful: it gives the indices of the elements of `padded_words` that are equal to `feat`. (This was the step where you might have wanted `grep` but you didn't actually need it.)
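Here is the `which` trick in isolation, on a toy vector:

```{r}
"and" == c("young", "and", "fresh", "and")        # the logical vector
which("and" == c("young", "and", "fresh", "and")) # positions of its TRUEs
```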
The use of `str_c` may look a little funny. Each of its *five* positional arguments here is a vector. `padded_words[locs - 2]` is the vector of words two positions to the left of each occurrence of the keyword, `padded_words[locs - 1]` is the vector of words one position to the left of each occurrence of the keyword, and so on. The test example I supplied was:

```{r}
sent <- c("paul", "verdayne", "was", "young", "and", "fresh", "and",
          "foolish", "when", "his", "episode", "began")
```

Work through how R evaluates `kwic2(sent, "and")`. In the final step, the vectors to be pasted together are five two-element vectors, as in the columns of this table:

-2    -1      0     1         2
----  ------  ----  --------  ------
was   young   and   fresh     and
and   fresh   and   foolish   when

`str_c` then concatenates together each of the *rows* of this table to yield a two-element vector. Thanks to our padding, we can go off the end of the word list:

```{r}
kwic2(sent, "episode")
kwic2(sent, "paul")
```

(Though the blank spaces in our listing are perhaps not entirely satisfactory.)

Because I (following Jockers) specified that `kwic2` should operate on vectors of words, there was no need for regular expression matching. Now our task is to generalize to any number of context words. We can no longer just write out a big single `str_c` call, so we fall back on the `for` loop:

```{r}
kwic <- function (words, feat, n=2) {
    padding <- rep("", n)
    pwords <- c(padding, words, padding)
    locs <- which(feat == pwords)
    if (length(locs) == 0) {
        return(NA)
    }
    # start from the leftmost nth context word,
    # and add words one by one
    result <- pwords[locs - n]
    for (k in (-n + 1):n) {
        result <- str_c(result, pwords[locs + k], sep=" ")
    }
    result
}
```

Notice that we are still using vectorized `str_c`, building up all the KWIC listings at once. But we add on one context word at a time for each keyword match. We start with the leftmost of the $n$ context words on the left, then add the $(n - 1)$th context word, then the $(n - 2)$th, and so on up to the keyword. Then we add the 1st context word to the right, the 2nd, and so on up to the $n$th. What this means is that the number of times through our loop is $2n$, no matter how many matches for `feat` there are in `words`. (Well, there is a secret `for` loop hidden inside the code for `str_c` that runs over each of the elements of its vector arguments. The *time complexity* of this algorithm is linear both in the number of matches and in the number of context words.)

Again, if this function code is unclear, try to work through the example with `n=2` and `feat="and"` to see how `result` is built up piece by piece. Here was the test case:

```{r}
# assuming the lines of the novel were read in earlier, e.g. with
# sheik_ll <- readLines("sheik-gutenberg.txt", encoding="UTF-8")
kwic(featurize(sheik_ll), "passionate", n=3)
```
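As a quick extra check, the generalized `kwic` with `n=2` should reproduce `kwic2` exactly, and a narrower window should work too:

```{r}
identical(kwic(sent, "and", n=2), kwic2(sent, "and"))
kwic(sent, "and", n=1)
```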