This year’s conference of the Association for Computational Linguistics, the most prestigious event in computational linguistics, had a paper that got me very excited. It’s called Extracting Social Networks from Literary Fiction [pdf], and here’s the abstract (emphasis added):
We present a method for extracting social networks from literature, namely, nineteenth-century British novels and serials. We derive the networks from dialogue interactions, and thus our method depends on the ability to determine when two characters are in conversation. Our approach involves character name chunking, quoted speech attribution and conversation detection given the set of quotes. We extract features from the social networks and examine their correlation with one another, as well as with metadata such as the novel’s setting. Our results provide evidence that the majority of novels in this time period do not fit two characterizations provided by literary scholars. Instead, our results suggest an alternative explanation for differences in social networks.
The paper advances a new technique for extracting social networks from text, and uses it on 19th century novels to argue that certain aspects of literary theory about novels might be false. In this post, I’ll explain the analysis to the digital humanities audience and discuss some strengths and weaknesses in the argument.
Written at Columbia University by two computer scientists and one English scholar, this paper offers exciting things to both computational linguists and literature researchers. For computational linguists, it proposes the first-ever algorithm for extracting speaker-to-speaker networks from free text. This opens up fascinating new areas of study because it is now possible to computationally analyze interactions between people in a text, and not just what they say to each other.
For literary scholars, it suggests that two hypotheses from literary theory about community and society in 19th century novels might be false, namely:
Literary studies about the nineteenth-century British novel are often concerned with the nature of the community that surrounds the protagonist. Some theorists have suggested a relationship between the size of a community and the amount of dialogue that occurs, positing that “face to face time” diminishes as the number of characters in the novel grows. Others suggest that as the social setting becomes more urbanized, the quality of dialogue also changes, with more interactions occurring in rural communities than urban communities. Such claims have typically been made, however, on the basis of a few novels that are studied in depth. In this paper, we aim to determine whether an automated study of a much larger sample of nineteenth century novels supports these claims.
To make their arguments, the authors frame the statements above in terms of social networks:
- If face-to-face time diminishes as the number of characters grows, then the more characters the novel has, the less dense its extracted social network will be.
- If more interactions occur in rural settings than urban settings, networks from rural novels will be densely connected but contain fewer characters, while networks from urban novels will be large and loosely connected.
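To make the second framing concrete, here is a minimal sketch of the network-density metric these hypotheses turn on. This is my own illustration with invented character names, not the authors’ code, and their actual feature set is richer than a single density number:

```python
def network_density(characters, conversations):
    """Fraction of possible character pairs that actually converse.

    characters: iterable of character names (nodes)
    conversations: iterable of (speaker_a, speaker_b) pairs (edges)
    """
    n = len(set(characters))
    if n < 2:
        return 0.0
    # Count each unordered pair once, ignoring self-loops.
    edges = {frozenset(pair) for pair in conversations
             if len(frozenset(pair)) == 2}
    return len(edges) / (n * (n - 1) / 2)

# A small, tightly knit "rural" cast where every pair converses...
rural = network_density(
    ["Anne", "Ben", "Clara"],
    [("Anne", "Ben"), ("Ben", "Clara"), ("Anne", "Clara")])
# ...versus a larger "urban" cast where only a few pairs do.
urban = network_density(
    ["Anne", "Ben", "Clara", "Dora", "Ed", "Frank"],
    [("Anne", "Ben"), ("Clara", "Dora"), ("Ed", "Frank")])
print(rural)  # 1.0 -- all 3 possible pairs converse
print(urban)  # 0.2 -- 3 of 15 possible pairs converse
```

Under the hypotheses above, rural novels should look like the first network and urban novels like the second; the paper’s finding is that narration style, not setting, predicts which shape you get.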
Then, they extract social networks from the novels in the following steps. First, the Stanford named-entity tagger automatically locates all the names in each novel. Then, a classifier automatically assigns a speaker to every instance of direct speech, using features of the surrounding text. A “conversation” occurs if two characters speak within 300 words of each other, and finally, a social network is constructed from the conversations. Nodes are named speakers that appear three times or more (the named-entity tagger is somewhat error-prone). An edge appears if there was a conversation between two characters, and a heavier edge means more conversations. The end result is a social network like the one shown above, which was extracted from Mansfield Park by Jane Austen.
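The final network-construction step can be sketched in a few lines. This is my reconstruction of the idea, not the authors’ code: it takes already-attributed quotes as `(word_offset, speaker)` pairs (i.e., it assumes the NER and speaker-attribution steps have run), applies the three-mention threshold, and counts conversations within the 300-word window as edge weights:

```python
from collections import Counter
from itertools import combinations

def build_network(quotes, window=300, min_mentions=3):
    """Toy sketch: quotes is a list of (word_offset, speaker) pairs."""
    # Keep only speakers the (noisy) tagger found at least min_mentions times.
    counts = Counter(speaker for _, speaker in quotes)
    kept = sorted((pos, s) for pos, s in quotes if counts[s] >= min_mentions)
    # Edge weight = number of conversations between the pair.
    edges = Counter()
    for (p1, s1), (p2, s2) in combinations(kept, 2):
        # Two different speakers quoted within `window` words: a conversation.
        if s1 != s2 and abs(p1 - p2) <= window:
            edges[frozenset((s1, s2))] += 1
    return edges

# Invented toy data: three named speakers, three conversational scenes.
quotes = [(10, "Fanny"), (120, "Edmund"), (150, "Fanny"),
          (2000, "Mary"), (2100, "Fanny"), (2200, "Mary"),
          (5000, "Edmund"), (5100, "Mary"), (5150, "Edmund")]
network = build_network(quotes)
print(network[frozenset(("Fanny", "Edmund"))])  # 2
```

A graph library would then compute density and the other network features from these weighted edges.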
Using the social networks they extract, the authors show that there is no significant difference in network density between urban and rural novels. Instead, the biggest differences seem to lie between novels in the third person and novels in the first person – the third-person novels have “dense, talkative” networks, whereas the first-person novels all center around the character “I”:
Our data suggests … that the “urban novel” is not as strongly distinctive a form as has been asserted, and that in fact it can look much like the village fictions of the century, as long as the same method of narration is used.
Their claim seems too strong to me. In order to make the problem tractable, they have reduced the concepts of “characters” and “conversation” to simple metrics, but important information is left out:
- Characters are equated with names that appear three times or more, but this leaves out:
    - named characters that are mentioned fewer than three times
    - nameless characters that don’t speak
- All conversations are assumed to be direct speech – indirect and reported speech are ignored.
From an NLP perspective, it’s easy to see why they’ve made these simplifications. In the kinds of text we are used to dealing with – expository texts like news articles, or explanatory ones like journal papers – infrequent, nameless entities are rare and don’t matter much. We’re used to looking for significant entities and popular topics of discussion, so the “drop the infrequent” approach goes a long way and eliminates noise.
Nevertheless, characterizing the aesthetics of a novel’s depiction of a social network is a different matter – it is no longer about how “important” a character is or how “significant” some topic is. To me, it seems plausible that the number of infrequently appearing named characters, and of nameless characters who are seen but never heard, can change the quality of the social network one experiences in a novel. Without further investigation into how common these infrequent-character and nameless-character cases are in this particular corpus, we have no idea whether what they leave out is frequent enough in urban novels to sway the analysis, or in which direction it would push – so I don’t think they have enough data to claim to have refuted scholarly intuition.
It’s very easy to point out problems, but I can’t think of any ways to fix them: the tools to separate infrequent named entities from junk entities just don’t exist. And the tools to identify nameless, speechless characters haven’t made much headway either – what is the difference, quantitatively, between the words “the woman standing at the station” and “the hansom cab standing at the station”? To computers that rely on statistical, automatically extracted information about language, the difference is currently very difficult to detect.
What do the humanists among you think of this work? Compared to the other literary analysis of the same novels done at Stanford, this approach is more linguistically sophisticated – but do all of these computational attempts seem heavy handed? Or do they spark ideas in your head, inspiring you to apply and improve them on your own problems?