Extracting Social Networks from 19th Century Novels

This year’s conference of the Association for Computational Linguistics, the most prestigious event in computational linguistics, had a paper that got me very excited. It’s called Extracting Social Networks from Literary Fiction [pdf], and here’s the abstract (emphasis added):

We present a method for extracting social networks from literature, namely, nineteenth-century British novels and serials. We derive the networks from dialogue interactions, and thus our method depends on the ability to determine when two characters are in conversation. Our approach involves character name chunking, quoted speech attribution and conversation detection given the set of quotes. We extract features from the social networks and examine their correlation with one another, as well as with metadata such as the novel’s setting. Our results provide evidence that the majority of novels in this time period do not fit two characterizations provided by literacy scholars. Instead, our results suggest an alternative explanation for differences in social networks.

The paper advances a new technique for extracting social networks from text, and uses it on 19th century novels to argue that certain aspects of literary theory about novels might be false. In this post, I’ll explain the analysis to the digital humanities audience and discuss some strengths and weaknesses in the argument.

Written at Columbia University by two computer scientists and one English scholar, this paper has something exciting for both computational linguists and literature researchers. For computational linguists, it proposes the first ever algorithm for extracting speaker-to-speaker networks from free text. This opens up fascinating new areas of study because it is now possible to computationally analyze interactions between people in a text and not just what they say to each other.

For literary scholars, it suggests that two hypotheses from literary theory about community and society in 19th century novels might be false, namely:

Literary studies about the nineteenth-century British novel are often concerned with the nature of the community that surrounds the protagonist. Some theorists have suggested a relationship between the size of a community and the amount of dialogue that occurs, positing that “face to face time” diminishes as the number of characters in the novel grows. Others suggest that as the social setting becomes more urbanized, the quality of dialogue also changes, with more interactions occurring in rural communities than urban communities. Such claims have typically been made, however, on the basis of a few novels that are studied in depth. In this paper, we aim to determine whether an automated study of a much larger sample of nineteenth century novels supports these claims.

To make their arguments, the authors frame the statements above in terms of social networks:

  • If face-to-face time diminishes as the number of characters grows, then the more characters the novel has, the less dense its extracted social network will be.
  • Second, if more interactions occur in rural settings than urban settings, networks from rural novels will be densely connected but contain fewer characters, while networks from urban novels will be large and loosely connected.
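To make "dense" concrete, here is a minimal sketch of one common density measure: the fraction of possible character pairs that actually converse. This is the standard graph-theoretic definition for an undirected network, used here as an assumption; the paper's own features may be defined differently, and the function name is hypothetical.

```python
from __future__ import annotations


def network_density(num_characters: int, edges: set[frozenset[str]]) -> float:
    """Fraction of possible character pairs that share at least one conversation.

    Assumes an undirected network: each edge is an unordered pair of names.
    This is the textbook density formula, not necessarily the paper's feature.
    """
    possible_pairs = num_characters * (num_characters - 1) // 2
    if possible_pairs == 0:
        return 0.0  # a network with 0 or 1 characters has no possible pairs
    return len(edges) / possible_pairs
```

Under the first hypothesis, this number should fall as the character count rises; under the second, rural novels should score higher than urban ones.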

Social network extracted from Mansfield Park by Jane Austen

Then, they extract social networks from novels using the following steps. First, the Stanford named-entity tagger automatically locates all the names in each novel. Then, a classifier automatically assigns a speaker to every instance of direct speech in the novel using features of the surrounding text. A “conversation” occurs if two characters speak within 300 words of each other. Finally, a social network is constructed from the conversations. Nodes are named speakers (that appear 3 times or more – the named-entity tagger is somewhat error prone). An edge appears if there was a conversation between two characters, and a heavier edge means more conversations. The end result is a social network like the one shown above, which was extracted from Mansfield Park by Jane Austen.
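The last two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes name tagging and quote attribution have already been done, it treats every pair of adjacent quotes by different speakers within the window as one conversation, and the function and parameter names are hypothetical.

```python
from collections import Counter


def build_network(quotes, mention_counts, window=300, min_mentions=3):
    """Build a weighted conversation network from attributed quotes.

    quotes: list of (word_offset, speaker) tuples, sorted by word_offset
            (assumed output of a quote-attribution step).
    mention_counts: dict mapping each name to how often it appears
            (assumed output of a named-entity tagger).
    Returns a Counter mapping frozenset({a, b}) to a conversation count,
    which plays the role of the edge weight.
    """
    edges = Counter()
    # Compare each quote with the one that follows it (a simplifying assumption).
    for (pos_a, spk_a), (pos_b, spk_b) in zip(quotes, quotes[1:]):
        if spk_a == spk_b:
            continue  # a character talking twice in a row is not an interaction
        if min(mention_counts.get(spk_a, 0), mention_counts.get(spk_b, 0)) < min_mentions:
            continue  # drop rarely mentioned names, which may be tagger errors
        if pos_b - pos_a <= window:
            edges[frozenset((spk_a, spk_b))] += 1  # heavier edge = more conversations
    return edges
```

Note that the min_mentions filter is exactly where infrequently named characters disappear from the network, and that anything not attributed as direct speech never reaches this function at all.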

Using the social networks they extract, the authors show that there is no significant difference in these network properties between urban and rural novels. Instead, they show that the biggest differences seem to be between novels in the third person and novels in the first person – the third-person novels have “dense, talkative” networks, whereas the first-person novels all center around the character “I”:

Our data suggests … that the “urban novel” is not as strongly distinctive a form as has been asserted, and that in fact it can look much like the village fictions of the century, as long as the same method of narration is used

Their claim seems too strong to me. In order to make the problem tractable, they have reduced the concepts of “characters” and “conversation” to simple metrics, but important information is left out:

  • Characters are equated with names that appear 3 times or more, but this leaves out
    • Named characters that are mentioned fewer than 3 times
    • Nameless characters that don’t speak
  • They assume all conversations are direct speech – they ignore indirect and reported speech

From an NLP perspective, it’s easy to see why they’ve made these simplifications. In the kinds of text that we are used to dealing with: expository things like news articles, or explanatory things like journal papers, infrequent, nameless entities don’t matter and are rare. We’re used to looking for significant entities, popular topics of discussion, so the “drop the infrequent” approach goes a long way, and eliminates noise.

Nevertheless, when it comes to characterizing the aesthetics of a novel’s depiction of a social network, it’s a different matter, no longer about how “important” a character is or how “significant” some topic is. To me, it seems plausible that the number of infrequently-appearing named characters, and the number of nameless characters who are seen but never heard, can change the quality of the social network one experiences in a novel. Without further investigation into how frequent these infrequent-character and nameless-character cases are in this particular corpus, I really don’t think they have enough data to claim to have refuted scholarly intuition: we have no idea whether the cases they leave out are frequent enough in urban novels to sway the analysis, or in which direction they would sway it.

It’s very easy to point out problems, but I can’t think of any ways to fix them: the tools to separate infrequent named entities from junk entities just don’t exist. And the tools to identify nameless, speechless characters haven’t made much headway either – what is the difference, quantitatively, between the words “the woman standing at the station” and “the hansom cab standing at the station”? To computers that rely on statistical, automatically extracted information about language, the difference is currently very difficult to detect.

What do the humanists among you think of this work? Compared to the other literary analysis of the same novels done at Stanford, this approach is more linguistically sophisticated – but do all of these computational attempts seem heavy-handed? Or do they spark ideas in your head, inspiring you to apply and improve them on your own problems?

Posted in Digital Humanities, Natural Language Processing
One comment on “Extracting Social Networks from 19th Century Novels”
  1. Thanks for this fascinating post and the equally compelling analysis. I gave the essay a quick read this morning and I agree with everything you have said, so I hope you’ll let me step into the role you’ve offered, and perform my humanities interests quickly (that… quickly for an English grad student; apologies for prolixity!).

    It seems worth stressing at the outset that this is a computer science paper rather than a humanities paper. The literary historical claims here provide an alibi for the methodology; but it is the methodology which is on display. Would it be unfair to say that the sort of analysis which has been done here was motivated not by the question scholars were interested in (primarily), but by the tools available? Or that the primary scholarly interest comes not from the literature scholars? (Is that unfair?)

    I ask in part because it’s not entirely clear to me what is really being demonstrated (beyond the methodology). The essay gestures towards scholarship on the nineteenth century novel, but in a rather cursory way (I do not focus on the nineteenth century novel, so I am loath to comment). The essay suggests: “Some theorists have suggested a relationship between the size of a community and the amount of dialogue that occurs, positing that ‘face to face’ diminishes as the number of characters in the novel grows.” This already seems to couch the issue in pre-digested, already quantized terms; it seems unrecognizable b/c it already sounds like a CS-version of a literary historical claim.

    Who are these theorists? The essay discusses Terry Eagleton, Raymond Williams, Mikhail Bakhtin, and Franco Moretti. (I bracket Mosteller & Wallace and John Burrows as figures who appear in the bibliography as text analysts rather than theorists of the novel). But really it is only Moretti who is being responded to. The ideas about the novel and urbanization evident in Bakhtin, Williams and Eagleton are never couched in terms of social networks or in terms of the number of encounters between characters. Moretti is, of course, quite interested in ‘distant reading’ by way of quantitative analysis. But even his claim is, I think, a little different than the one the paper discusses: in the city “The narrative system becomes complicated, unstable: the city turns into a gigantic roulette table, where helpers and antagonists mix in unpredictable combinations.” This sense of the experience of the city may hold (and is certainly the more interesting question to scholars of literature, I think), even if the raw quantification of number of conversations does not.

    The broader claim that this research demonstrates that the ‘urban novel’ is not as distinctive as we once supposed seems to move too quickly from the limited feature being analyzed to a much broader, more properly literary historical claim. And for that matter, why would we try to answer this question (“is there a distinct genre of the urban novel?”) with this method? Wouldn’t the better way to answer this question (if we really cared about the answer), simply be to look at how novels in the period were treated, discussed, and received? Perhaps one might still wish to do so by analyzing large amounts of review (or other) data. But questions of genre, I think, cannot be adjudicated through reference to the primary texts alone.

    The conclusion that “Narrative voice… trumps setting” seems reasonable and fair and, I think, is borne out by the analysis; but, in measuring social networks in this way (based on conversations), the conclusion seems to be so tightly correlated to its premise, that the amount of “new knowledge” generated seems a little minimal. To what extent, in pointing to narrative voice, are the authors simply rediscovering the mirror image of the assumptions governing an analysis of conversations in a novel?

    There is nevertheless, I think, great value in the way projects like this force us to try to be rigorous in our claims; to define what an “urban novel” would be, if we wanted to measure it quantitatively. To look at our claims from the practical perspective of how they could be demonstrated empirically is a valuable exercise. Though, in the end, I’m not sure that what is important or valuable to humanities scholars as humanities scholars is this sort of quantitative formalization (unless the very idea of the humanities washes away, like a face drawn in the sand—as someone once said).
