As a text miner looking to the humanities as a source of interesting problems, I needed to know how “humanities researchers” use text. So I went to Great Lakes THATCamp in March (2010) to find out. I had conversations with about 20 historians, anthropologists, archeologists, political scientists, archivists, librarians, and others, eavesdropped on many, many more, and managed to sort their uses of text into two broad categories.
My findings may not be surprising to humanists and social scientists, but I hope they will be informative to my fellow techies, who have only the fuzziest notions of what humanities researchers do all day. I’ve drawn heavily on my conversations with historians, because we’re now working to develop a text mining tool with some of them at UC Berkeley.
Humanities researchers use text in two ways. The first is to get an idea of what’s out there, in a way common to all researchers in all fields. The second is as evidence – what traces might an event, personal characteristic, impression, or anything else have left in textual records from around a given time?
In the first domain, they ask the same questions of “the literature” as any other researcher – which are the good books or papers to read? Who are the people working in this area? What are the current opinions and approaches? Where did I read this idea? Where did I see this quote? This process is known as orienteering in the information-seeking literature, which my excellent advisor gives an overview of here.
Finding new and better ways to support the orienteering process is an active research area. Different aspects of the problem have been tackled by text mining, natural language processing, and information visualization. The “Previous Work” section of this paper has an overview of the high points.
In the second domain, evidence, they treat text like something out of a detective show. They have a hypothesis in mind and examine all the text they can lay their hands on for traces of evidence relevant in any way.
They might track the language around a term over time, find a change in the way a concept is discussed, observe the way people express their thoughts, or, if they are lucky, find an original document confirming or denying their hypothesis.
The second use case is more specific to the humanities than the first. Supporting it means giving researchers the ability to ask highly specific and structured natural language processing questions, such as “what phrases were used to describe this entity, and how did their use change over time?” Right now, I’m very interested in what kinds of tools we can build to help humanities researchers ask these kinds of questions of a large collection of text.
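For my fellow techies, here is a rough idea of what the simplest version of such a question could look like in code. This is only a minimal sketch, not the tool we are building: the function name `context_words_by_year`, the `(year, text)` record format, and the tiny sample corpus are all made up for illustration, and counting neighboring words is a crude stand-in for real phrase extraction.

```python
import re
from collections import Counter, defaultdict


def context_words_by_year(docs, entity, window=2):
    """Count the words found within `window` tokens of `entity`, per year.

    `docs` is an iterable of (year, text) pairs. This format is an
    assumption made for this sketch, not a real collection schema.
    """
    counts = defaultdict(Counter)
    target = entity.lower()
    for year, text in docs:
        # Naive tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[year][tokens[j]] += 1
    return counts


# Invented sample records, purely for illustration.
docs = [
    (1850, "The noble railway brought prosperity to the town."),
    (1850, "A noble railway, they called it."),
    (1900, "The crowded railway was late again."),
]

result = context_words_by_year(docs, "railway", window=1)
for year in sorted(result):
    print(year, result[year].most_common(3))
```

Even this toy version shows the shape of the problem: the interesting output is a comparison across time, not a ranked list of documents. A real tool would also have to cope with stop words, spelling variation, multi-word entities, and the OCR noise common in primary sources.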
Update: As Lincoln Mullen pointed out, researchers tend to use secondary texts for orienteering, and primary texts for evidence. These are very different kinds of text, and secondary sources tend to be much more available in digital collections. With primary sources, OCR, encoding, and availability make getting to the “text” stage of text mining quite a bit harder.