Text mining 19th century novels with the Stanford Literature Lab

Yesterday, I attended a group meeting with the Literature Lab at Stanford University’s English Department, where they presented some very cool new results on mining 19th Century British and American novels. The lab, newly on its feet, is headed by Matt Jockers and Franco Moretti, and consists of about eight graduate students (some shown below), including Cameron Blevins (who runs history-ing) and Kathryn VanAerendonk. In this post, I’m going to describe what they’re doing and summarize some of their results so far.

Humanistic Computing Lab
Graduate students in the Literature Lab at Stanford University’s English Department

The Literature Lab is tracking changes in literary style through 19th Century novels, focusing on how the frequencies of words that share a particular theme change over time. The idea is to see whether there are any themes that have “interesting” behavior over the course of the century — where “interesting” might mean increasing prevalence, decreasing prevalence, or some kind of peak.

They have a novel (if statistically disastrous) way of defining themes. A theme (or, in their words, a “semantic field”) is a group of words that satisfies two requirements:

  1. They all have semantic or functional similarity.
  2. They must behave in the same way over time.

One theme they found started with the seed word “integrity”. They created this theme in two steps. First, using an in-house program called Correlator, they found all the words whose frequencies in novels over time were highly correlated with “integrity”, i.e. increased and decreased in the same way as “integrity” over time. Then they manually removed all the words that they judged to be unrelated to “integrity” (e.g. “bosom”). They named this theme “abstract values”.
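I wasn’t shown the Correlator code itself, so the sketch below is only my guess at the mechanics: build a per-year relative-frequency table for every word, then rank words by how strongly their yearly frequencies correlate with the seed word’s. The corpus format and function names are my own assumptions, not theirs.

    from collections import Counter

    import numpy as np

    def yearly_frequencies(corpus_by_year):
        """corpus_by_year maps year -> list of tokens from all novels published that year.
        Returns (sorted years, {word: array of relative frequencies, one per year})."""
        years = sorted(corpus_by_year)
        counts = {y: Counter(corpus_by_year[y]) for y in years}
        totals = {y: sum(counts[y].values()) for y in years}
        vocab = set().union(*(counts[y] for y in years))
        freqs = {w: np.array([counts[y][w] / totals[y] for y in years]) for w in vocab}
        return years, freqs

    def correlates_of(seed, freqs, top_n=50):
        """Rank words by the Pearson correlation of their yearly frequencies
        with the seed word's yearly frequencies."""
        seed_series = freqs[seed]
        scored = []
        for word, series in freqs.items():
            if word == seed or series.std() == 0:
                continue
            r = np.corrcoef(seed_series, series)[0, 1]
            scored.append((r, word))
        return sorted(scored, reverse=True)[:top_n]

The manual step in their process then prunes the top of that ranked list, discarding words (like “bosom”) that track “integrity” numerically without being related to it in meaning.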

Different seed words create themes with different trends over time, and of course, not all trends are interesting. Nevertheless, the trend of the “abstract values” theme is interesting. The frequency of the “abstract values” theme in British novels decreased from about 0.8% of all words to about 0.2% of all words between 1791 and 1903, and the frequency of the “abstract values” theme in American novels decreased from about 0.6% to about 0.2% between 1789 and 1874.
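As I understand it, those theme-level percentages are just the summed relative frequencies of the theme’s words in each year’s text. Assuming the per-year frequency table from the sketch above, that aggregation is short:

    def theme_frequency(theme_words, years, freqs):
        """Fraction of all tokens in each year that belong to the theme."""
        present = [w for w in theme_words if w in freqs]
        return {year: sum(freqs[w][i] for w in present) for i, year in enumerate(years)}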

Interestingly, they found another theme, starting with the seed word “hard”, which has the opposite trend, and contains more concrete words. The “hard” theme in British novels increased in frequency from about 1% to about 3% over the 19th Century, and in American novels increased from about 1% to about 2.5%.

To the literary scholars in the group, these serendipitous results suggest:

“a more fundamental shift in the style of narration from abstraction to concreteness, from telling to showing. No longer talking about abstract values but embodying them in actions.”

Nevertheless, the work is still at an early stage. These are initial experiments, and the analysis is not yet statistically sound enough to demonstrate that the findings reflect actual trends in the data rather than artifacts of the trend-reinforcing way in which the words in the themes were chosen.

To avoid this, they need a more complete analysis. I can think of three additions. The first is to use a held-out subset of the novels to mine for themes, and then test for trends in the remaining set. The second is to use a thesaurus or dictionary as a source of related vocabulary, and to see whether the trends remain. The third (potentially embarrassing) experiment is to randomly assign novels to years and see if any “interesting” or “thought-provoking” patterns emerge.
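To make that third experiment concrete, here is a rough sketch reusing the helpers above; it shuffles the year labels on the corpus (a coarse stand-in for randomly assigning novels to years), rebuilds a theme around the seed word, and measures how steep a trend the selection procedure produces even on time-scrambled data.

    import random

    import numpy as np

    def trend_slope(series_by_year):
        """Least-squares slope of theme frequency against year."""
        years = sorted(series_by_year)
        values = [series_by_year[y] for y in years]
        return np.polyfit(years, values, 1)[0]

    def shuffled_year_trend(corpus_by_year, seed, top_n=50):
        """Randomly permute the year labels, rebuild the theme around the seed word,
        and return the slope of the theme's aggregate trend on the scrambled data."""
        years = list(corpus_by_year)
        permuted_labels = random.sample(years, len(years))
        permuted = {new: corpus_by_year[old] for new, old in zip(permuted_labels, years)}
        yrs, freqs = yearly_frequencies(permuted)
        theme = [w for _, w in correlates_of(seed, freqs, top_n)]
        return trend_slope(theme_frequency(theme, yrs, freqs))

Repeating this many times gives a feel for how large a slope the word-selection step can manufacture from pure noise, which is the baseline any real trend would have to beat.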

If trends like this are real, they would be fascinating, but they do need to actually exist. This is why digital humanities researchers, like all other scientists, need to talk to statisticians. Right now, they are new to methodology and know just enough probability to be dangerous. Experimental rigor in the form of “held-out data”, “cross-validation”, and “hypothesis testing” is very foreign to them.

Aside from caveats about scientific rigor, I’d like to draw attention to how text mining was used in combination with data visualization to uncover patterns that would have otherwise been extremely difficult to spot, and to spark off a whole set of interesting hypotheses.

In the past, I’ve seen humanities scholars treat text mining as a curious novelty, used to confirm something they already know, or to quantify an existing academic intuition, but not entirely to be trusted. But yesterday, I saw text mining used as more than that: it was a way to provoke investigation, find interesting hypotheses, and ask questions that didn’t even exist before.

Update, May 19th, 2010: The graphs and example themes that accompanied this post have been removed because the Literature Lab informed me that they are for internal use only.

Posted in Digital Humanities, Natural Language Processing, Text Mining
5 comments on “Text mining 19th century novels with the Stanford Literature Lab”
  1. Matthew Jockers says:

    Hi Aditi,

    Thanks for coming and for posting your thoughts here. One correction and a comment:

    First the correction: we are the Stanford “Literature Lab”, not the Stanford “Humanities Computing Lab.”

    Now the comment. You write that the group has a “novel (if statistically disastrous) way of defining themes?” I think that “statistically disastrous” mis-characterizes and misrepresents the purpose of the Correlator tool. The tool is designed as a “finding aid” and is not intended to produce statistically significant results or to “define” a theme. So whether it is “statistically disastrous” or not is irrelevant since the end goal of using the tool is to aid the researcher in identifying words that are likely to be part of a semantic field. And–as you saw and note in citing the “abstract values” field above–the tool does an exceptionally good job of identifying words that “jive” with our human sense of what words belong in a given theme, or topic.

    Indeed, you even write that the abstract values field contains “about 20 more words like this”. My point is that the words that the tool returns are generally “like this” and so the tool succeeds in doing what it was designed to do. The Correlator is not meant to be an alternative to an unsupervised topic model but, rather, a way of generating a word list, or “cluster” of words that are semantically related and can then be tracked using other methods.

    So, when you write that a theme must satisfy two requirements...

    1. They all have semantic or functional similarity
    2. They must behave in the same way over time.

    The truth is that the theme only has to meet the first. Your second point is not in fact a criterion for a word’s inclusion in the field, but rather a technique the group developed for identifying words that may be part of a particular field. The tool mines the corpus looking for words with usage patterns that fluctuate similarly over time. It turns out that words that behave similarly over time are also frequently semantically related. The tool provides a list of words that behave in similar ways and then the researcher weeds out the words, like “bosom”, that are obviously not part of the class.

    Your idea of using a dictionary or thesaurus is definitely something we have considered (e.g. using WordNet), and I think that combining the human-pruned results of the Correlator with a lookup in WordNet might be fruitful.

    Thanks again for coming to the session; I know the group appreciated your questions. I hope we’ll see you again!

    Matt

    • silverasm says:

      Hi Matt,

      As I understood it, the process of finding a theme was:
      Step 1: find all words that behave the same way over time as “integrity” (using Correlator)
      Step 2: remove the ones which are semantically unrelated

      If “finding aid” is meant as something that finds patterns that look interesting, I agree that it is suspicious that so many words that seem to name abstract values would have a decreasing trend over time. Unfortunately, as it is used now, Correlator is not a statistically sound finding aid for a semantic field.

      This is because if we then look at the words from steps 1 and 2 as a semantic field, any trends we see will be very strong indeed: we have specifically selected words that have the same trend over time, so their aggregate behavior will look significant.

      Some trends are just noise. Even if a group of words is “actually” constant over time, in any finite sample, such as your corpus, you expect that some words will slightly decrease over time, and that some words will slightly increase over time. It makes no sense to specifically select a subset of the increasing or decreasing group, and then call its aggregate behavior significant. This is what Correlator is doing right now.

      To make it specific, there is still the possibility that “integrity”, while being “actually” constant, had a slight decrease over time in this corpus, something which we would look at and call noise. Then we picked (using Correlator) what we found to be semantically related words that also had a slight decrease, ending up (surprise, surprise) with large aggregate behavior.

      On the other hand, “integrity” could have had a significant drop over time. To tell the difference, we could do some easy experiments. The first, which could be done with almost no additional coding, would be to have held-out data on which claims were tested. Just run Correlator to find semantic fields on half of the novels in each year, and then check if the trend appears in the other half, which would be the held-out set.
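      Reusing the sketches from the post above, and assuming at least a couple of novels per year, that held-out check might look something like this (the corpus format is again my own guess):

          import random

          def heldout_trend_check(novels_by_year, seed, top_n=50):
              """novels_by_year maps year -> list of token lists, one per novel.
              Build the theme on a random half of each year's novels, then measure
              its aggregate trend on the held-out half."""
              train, test = {}, {}
              for year, docs in novels_by_year.items():
                  docs = list(docs)
                  random.shuffle(docs)
                  half = len(docs) // 2
                  train[year] = [tok for doc in docs[:half] for tok in doc]
                  test[year] = [tok for doc in docs[half:] for tok in doc]
              yrs, train_freqs = yearly_frequencies(train)
              theme = [w for _, w in correlates_of(seed, train_freqs, top_n)]
              _, test_freqs = yearly_frequencies(test)
              return trend_slope(theme_frequency(theme, yrs, test_freqs))

      A trend that survives on the held-out half is much harder to dismiss as an artifact of the selection step.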

      Alternately, use a thesaurus to group “abstract values” (not just those abstract values that correlate with integrity) and check if there are trends in the data.

  2. Jonathan says:

    The most exciting thing about digital humanities is the potential for real interdisciplinary work between two fields (the humanities and the sciences) that have experienced an ever-widening cultural rift. Such work might help to bridge these entrenched divides.

    It needs to be noted though that a primary reason for the hardening of disciplinary divides has been a pernicious kind of disciplinary egocentrism, the pervasive sense that one’s own methods and way of looking at the world are right and that others are misguided at best. People on both sides of the humanities-science divide are guilty of this.

    So when I come to a blog post about research that has potential for real interdisciplinary work across this divide, it saddens me to come across rhetoric like this:

    “They have a novel (if statistically disastrous) way of defining themes.”
    “This is why digital humanities researchers, like all other scientists, need to talk to statisticians. Right now, they are new to methodology and know just enough probability to be dangerous. Experimental rigor in the form of “held-out data”, “cross-validation”, and “hypothesis testing” is very foreign to them.”

    Now here are some humanists who seem to be trying to buck the trend in their field, going against other humanists who are suspicious that collaborating with the sciences and bringing quantitative methods into literary study might take the human out of the humanities. They are doing so in order to reach out to potential colleagues in the sciences. Instead of a welcome and a tone of collegial collaboration from those potential colleagues, though, they meet here with a subtly patronizing attitude that implies that it’s only the humanities who have anything to gain from this collaboration. According to this rhetoric, statisticians and scientists will be called upon to patiently school their naïve and clueless humanities colleagues. Doesn’t sound like much of a collaboration to me. This sounds like the same disciplinary egocentrism that helped create the divide in the first place. Perhaps digital humanities scholars “need to talk to statisticians”, but if this is the tone of that conversation, I wouldn’t be surprised if they’re not clambering over each other to do so.

    • silverasm says:

      Quantitative methods in the digital humanities should be held up to the same level of rigor as the rest of the sciences. All scientists, at one point or another, are guilty of looking at their data too much and finding hypotheses that confirm their own ideas. This is bad science, whether it’s data about muons or word frequencies in 19th century novels.

      It’s more a “welcome to the club, where we all go to complain about statistical rigor” than anything else.

      If quantitative work in the digital humanities lacks this rigor, it risks not being taken seriously by anyone outside the digital humanities. Then, the world loses all the real information that only humanities researchers know to look for in the first place.

  3. Paul Flesher says:

    I find the reaction against the remark “novel (if statistically disastrous) way of defining themes” somewhat amusing. As someone who has watched several fields create themselves, then define themselves, and then refine themselves over the past few decades (and is still working on one), I know it is well and good to take your goals seriously. But it is important to realize there is a lot of stumbling and realigning and improving as one goes. We need to keep a sense of humor about our work and not be too touchy.
    Paul

4 Pings/Trackbacks for "Text mining 19th century novels with the Stanford Literature Lab"
  1. [...] is another interesting site linked from one of the [...]

  2. [...] do the humanists among you think of this work? Compared to the other literary analysis of the same novels done at Stanford under Moretti’s eye, this approach is more [...]

  3. [...] both to help find novel and interesting scholarly hypotheses in masses of data and also to experimentally test them.  These methods will be applied and expanded to market research, enabling more accurate [...]

  4. [...] Stanford University some its English literature graduate students are engaged in something called ‘literature mining’. They are currently “mining 19th Century British and American novels” to  track how [...]