More and more source text in the humanities gets digitized every day, making it accessible to large-scale computational analysis. Nevertheless, traditional methods of humanistic analysis are based on detailed arguments built upon close readings of individual texts. How will the field adapt? How do we use statistics and text mining to answer humanistic questions?
Zoom in to the field of American literature, and further into the realm of studying the (digitized) narratives of escaped former slaves, published by white abolitionists. There are widespread stylistic and thematic similarities among these narratives. How can text mining help literature scholars here? That’s where WordSeer, my latest project, comes in.
The MONK project at CMU and the Voyeur project at McMaster University share the same cause as WordSeer. But when it comes to text analysis, they are essentially search interfaces that show simple statistics about word order, type, and frequency; the grammatical relationships within the text are neglected.
WordSeer is an evolving project, as all digital humanities projects inevitably are. As my friends in the English department and I learn what we can do for each other, it will get steadily better defined, but right now, it’s simple: a search interface and a reading interface. The search interface allows queries based on grammatical structure, and the reading interface is for reading narratives, comparing them, and coming up with new queries.
The search screen is shown below. It supports standard keyword-based search, so scholars can look for words or exact matches in the text. More interestingly, there’s grammatical search. Using grammatical relationships extracted through natural language processing, users can ask how things were described, what actions were performed upon them and by them, who possessed certain things, or what was possessed by them.
For example, the figure above shows the query, ‘give all adjectives that are applied to the words “slave”, “bondman”, “negro”’. The system returns not only a list of occurrences in the narratives, but also automatically generated graphs showing the frequencies of the different words. As you can see, “poor” is the most frequent adjective. The results are sortable and filterable: clicking on bars filters the list to show just the results containing those words. Above, I’ve filtered to show just the instances where “valuable” is applied to “slave”.
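The sort-and-filter behavior behind those graphs can be sketched with a frequency count over extracted pairs. This is a toy illustration, not WordSeer’s actual code, and the (adjective, noun) pairs below are invented sample values, not real corpus results:

```python
from collections import Counter

# Hypothetical (adjective, noun) pairs, of the kind a grammatical search
# for adjectives applied to "slave", "bondman", "negro" might return.
# Illustrative values only, not actual results from the narratives.
matches = [
    ("poor", "slave"), ("poor", "slave"), ("poor", "negro"),
    ("old", "slave"), ("valuable", "slave"), ("poor", "bondman"),
    ("faithful", "slave"), ("valuable", "slave"),
]

# Frequency counts like these drive the sortable bar graph.
adjective_counts = Counter(adj for adj, noun in matches)
print(adjective_counts.most_common())  # → [('poor', 4), ('valuable', 2), ...]

# Clicking a bar filters the result list to just the matching pairs,
# e.g. the instances where "valuable" is applied to "slave".
filtered = [(a, n) for a, n in matches if a == "valuable" and n == "slave"]
print(filtered)  # → [('valuable', 'slave'), ('valuable', 'slave')]
```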
Interviews with our literary scholar friends suggested that a search interface alone would not be enough, so WordSeer supports reading narratives individually.
The reading view is shown below. Scholars can select one (or, indeed many) sentences from the search results and be taken to a reading screen, where the narratives are opened up to the correct place. Grammatical search doesn’t end there, however, because the entire text is interactive.
Highlighting a portion of a sentence and clicking the “examine” button (bottom right corner) shows the text pattern, as well as all the grammatical relationships in the highlighted portion. For example, I clicked on a passage about hospitals, and was presented with the pattern-examiner screen (below).
I can select some patterns (either the original passage or some of the grammatical patterns) and examine them further: I can use them as search queries and return to the original search screen, save them for later, or view their distributions in the text I’m reading.
Being able to compare the distribution of phrases or patterns across texts can give an idea of how similar the texts are, or of how much their subject matter overlaps. For example, if I wanted to know where plantations were mentioned in these texts, I would highlight the word “plantation” and click “See in Text”, giving the result below.
The white column represents the length of the entire text, and the green bars indicate where the pattern of interest occurred. If I had selected multiple patterns, I would see bars in different colors. Clicking on any of the little green bars takes me to an occurrence of the pattern, highlighted in the text.
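The idea behind that view is simple: each occurrence’s character offset, as a fraction of the text’s length, is where its bar sits on the column. A minimal sketch, using a made-up snippet of text rather than an actual narrative:

```python
import re

# A toy snippet standing in for one of the digitized narratives.
text = ("I was born on a plantation in Maryland. "
        "Years later I returned to the plantation. "
        "The plantation house still stood.")

# Each occurrence's position as a fraction of the text's length;
# each fraction is where a green bar would sit on the white column.
pattern = "plantation"
positions = [m.start() / len(text) for m in re.finditer(pattern, text)]
print([round(p, 2) for p in positions])

# Bucketing the positions into ten equal segments gives a coarse
# distribution: how often the pattern occurs in each part of the text.
buckets = [0] * 10
for p in positions:
    buckets[min(int(p * 10), 9)] += 1
print(buckets)
```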
All of this works because I applied language processing to the text beforehand, and stored the information in a database for quick access. I applied part-of-speech tagging, syntactic parsing, and dependency parsing to decompose sentences into their grammatical constituents. For example, the sentence “The cruel man beat us severely” contains the word “cruel”, which is an adjective modifier of the noun “man”. There is a verb-object relation between “beat” and “us”, and a verb-subject relation between “man” and “beat”.
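One way to picture the stored form of that example sentence is as (governor, relation, dependent) triples in a database table. The schema below is a sketch of the idea, not WordSeer’s actual tables; the three triples are exactly the relations described above:

```python
import sqlite3

# Dependency triples for "The cruel man beat us severely", stored in a
# hypothetical SQLite table so grammatical queries can run quickly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deps (governor TEXT, relation TEXT, dependent TEXT)")
conn.executemany(
    "INSERT INTO deps VALUES (?, ?, ?)",
    [
        ("man", "adjective_modifier", "cruel"),
        ("beat", "verb_object", "us"),
        ("beat", "verb_subject", "man"),
    ],
)

# A grammatical query: which adjectives are applied to "man"?
rows = conn.execute(
    "SELECT dependent FROM deps "
    "WHERE governor = 'man' AND relation = 'adjective_modifier'"
).fetchall()
print(rows)  # → [('cruel',)]
```

The same table, filled with triples from every sentence in the corpus, is what lets a query like “all adjectives applied to ‘slave’” come back instantly instead of re-parsing the text.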
If you want to know more about natural language processing, I gave a BootCamp about text mining at THATCamp SF recently; here are the slides [pdf]. I also wrote a blog post introducing the subject for a digital humanities audience.
Syntactic analysis is just a small part of what natural language processing can do. Right now, I’m working on being able to track named entities through a narrative and see descriptions applied to them, and actions in which they participate.