A big question for me, as a designer of text analysis tools for the humanities, is: how do the tools I’m building fit in? Sure, you can have fancy word trees and grammatical search histograms. Sure, they’re chock-full of interesting information that you can make an argument about. But where exactly in the humanistic analysis process does a scholar need things like that? I have no idea.
But there’s more. I don’t just build tools, I build environments. And that means support for reading the text, navigating it, searching it, and (most importantly) “working” with it. And I have no idea what that means either. So over the past few weeks I’ve been having hour-long chats with late-stage PhD students from the literature and history departments, and asking them to tell me about how they do research. I asked all kinds of confusing and mundane questions like, “How do you decide what to underline?” and, “Can you define formalism for me?” and, “You mean you actually copy it out by hand?” and, “How do you organize all the quotes you collect?” and, “How do you go about proving that?” and, “So you scanned in everything in those boxes?”
I only did twelve of those interviews, but patterns began to emerge. So I did a survey. A simple one, with six questions about reading habits. This survey’s purpose was to confirm whether some of the patterns I noticed around reading were general. If you just want the charts summarizing the responses, you can find them here (those numbers include around 20 more responses I got while I was writing this post). For a full analysis in which I extract some general patterns in humanities scholars’ reading processes, read on.
This was the first page of the survey, with two screener questions. Over two days, I got 153 responses from humanities scholars whose primary sources were mostly textual (and 18 responses from others, but I removed those responses for this analysis).
Working with text
My first question was about copied-out snippets. From what I heard in the interviews, humanities scholars working on a project eventually reach a point where they have some interpretation, some interesting angle they want to take. At this point, they start actively reading and re-reading their primary sources. Snippets begin to collect in their notes in large numbers.
The results are below. An overwhelming 90% of textual humanities scholars surveyed copy out snippets either frequently or occasionally.
What are snippets for?
But why? From what the interviewees said, copied-out snippets seem to be in a different category from other annotations (such as margin notes, underlines, highlights, or circles on the page). In fact, copying out seems reserved for a more important class of items. But what makes these copied-out snippets so special? This was my next question.
The results are below, and the responses confirmed the interviews. Most of the textual humanities scholars surveyed said that snippets were evidence, interesting or thought-provoking passages, or examples of something they were looking for.
In addition to the 4 existing choices, there were 28 “other” responses, which fell into these groups:
- As a way to store quotations I think I’ll need (5 such responses)
- Writing it out helps me understand and think about it (2)
- Writing it out helps me remember it (5)
- The snippet is a good summary of an author’s point or argument (4)
- They help me outline a longer argument (3)
- I need to translate it into another language (3)
It seems that copied-out snippets play a role very much like the evidence a lawyer lays before a jury. They provoke thought, they justify an interpretation, they are examples that support an argument (or indeed counterexamples that need to be explained away). A scholar amasses many such “pieces of argument” and then organizes them to tell a coherent story.
What do snippets look like?
As a tool builder and an information retrievalist I need to know: how long are these snippets, and is there accompanying information that users will want to add?
Are snippets a few words long? A few paragraphs?
The results are below. They revealed that snippets varied greatly in length.
- Many respondents had “some” or “many” snippets between a few words and a paragraph long
- About 25% of respondents had “some” or “many” snippets that were longer than a paragraph.
The interviews suggested that in addition to the literal text of the snippets themselves, there were often various kinds of notes as well as visual finding aids and citation information.
The results are below. Notes about why the snippet was relevant were very common.
And so were ideas that the scholars got from the copied-out snippet.
By far the most common was citation information: 76% always added it.
For a tool designer, the message is clear. If you want humanities scholars to read text using your tool, you must support all of the above activities.
When a scholar wonders, “Where else have I seen that before?” or, “Is this an unusual exception?” or, “Is this a pattern?”, my interviewees told me that one of two things happens: either they had thought of this before and the relevant passages were already copied into their notes, or they’d have to go back and re-read the relevant texts to search for evidence. To see if this was a general pattern, I created the next two survey questions.
The responses are below, and they confirm my interviewees’ responses. 79% re-read when they had a new idea or interpretation, 65% when they had a new hypothesis, and 67% when they noticed a new pattern.
There were also 13 “other” responses, which fell into the following categories:
- Just for fun (2 responses)
- When I teach (5)
- When I find my notes don’t have everything I need (3)
- To understand it better (2)
- To compare with other documents (1)
But what do scholars want to do with the text they re-read? This was my final question. I wanted to compare how note-taking behavior varied between first-time reading and re-reading:
The results are below. Compare how frequently scholars copied out snippets the first time (top) with when they were re-reading (bottom). The copy-out rate is pretty much the same, or perhaps even a little higher while re-reading.
These responses reveal an inefficiency in the “finding evidence” portion of the scholarly process. Reading is a great way to understand, to learn, to remember. But it is a very inefficient way to search for something. First, it’s very slow (and when you speed it up, it starts to lose its reliability). Second, it’s very subject to your state of mind (sometimes things pop out at you, sometimes they don’t). And third, it’s impossible to do thoroughly for very large collections. Yes, you need to revisit, re-find and re-acquaint yourself with the material. But what I object to is that you often have to re-read in order to do it.
A (partial) solution already exists: full-text digitization and search. Search can take you a long way, especially if your examples are associated with particular words. However, there are a great many sophisticated information retrieval technologies that go beyond keyword search. We can retrieve text passages by similarity; we can let you mark relevant passages and return more like those; we can train classifiers on what you’ve marked as interesting and have them automatically classify text; and we can use the Google Translate API to help identify foreign words, and online dictionaries to find synonyms.
Based on the answers to these two questions, it seems to me that the following three “finding a snippet-of-text” problems might really benefit from a little information retrieval and visualization, because finding them by reading is especially hard, and formulating them as search queries can be difficult:
- Find me more examples like this
- Find me other places in the text where this happens
- Show me all the places in the text where this concept comes up
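To make the first of these concrete, here is a minimal sketch of “find me more examples like this” as similarity retrieval. It is only an illustration under my own assumptions, not the method any particular tool uses: passages are compared with TF-IDF weighted cosine similarity, and the names `tokenize`, `tfidf_vectors`, `cosine` and `more_like_this` are hypothetical helpers invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(passages):
    """Build one sparse TF-IDF vector (a dict of word -> weight) per passage."""
    token_lists = [tokenize(p) for p in passages]
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    n = len(passages)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF: words found in every passage still get a small weight.
        vectors.append({
            word: count * (math.log((1 + n) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(example, passages, top_k=3):
    """Rank passages by similarity to an example snippet, best first."""
    # The example is vectorized together with the passages so they share weights.
    vectors = tfidf_vectors(passages + [example])
    query = vectors[-1]
    scored = [(cosine(query, vec), passage)
              for vec, passage in zip(vectors, passages)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Calling `more_like_this(snippet_text, list_of_passages)` returns `(score, passage)` pairs, so a scholar could start from one copied-out snippet and be shown the passages that share its vocabulary. Word overlap is a crude stand-in for “like this”, which is exactly why the harder versions of the problem need more attention.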
Together, the survey and interviews helped me understand the mechanics of scholarly reading. As I interpret it, humanities scholars working around textual primary sources follow a process like this (and this is why I’m blogging, so you can all violently disagree with me in the comments):
1. Scholars begin a project with a (sometimes vague) hypothesis, interest, or interpretation in mind
2. They read primary sources to solidify their understanding and find evidence for their arguments
3. They notice or realize things while reading passages from the text
4. They copy out the passages if they are sufficiently thought-provoking, provide evidence, or are relevant to an interpretation.
5. They add information to the passages they copy out:
   - Why the passage is relevant / how it fits into their argument
   - Ideas they got from the passage
   - Citation information so they can find it and cite it properly later
6. If they haven’t already collected the necessary supporting material, they search, read and re-read other texts to find more support.
7. When they find supporting or relevant passages, they copy them out (see step 5)
8. They curate their collected passages (“pieces of argument”) into a written product representing their argument
Laying out these steps has helped me come up with design requirements. For example, even if no extra visualizations and analysis tools are added, and the interface only supports reading, these steps told me the basic “reading tools” my system must have:
- Copying out variable-length snippets of text into a note-taking area
- Preserving the link between the copied-out snippet and its place in the text
- Taking notes around a snippet
- Tagging and keyword search
- Exporting all of the above into a format compatible with MS Word or similar, so that scholars can integrate it with their other work
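As a rough sketch of what those requirements could look like as a data structure, here is one possible shape for a copied-out snippet. Everything in it is an assumption made for illustration: the `Snippet` fields, the use of character offsets to preserve the link back to the source, and the helper names `copy_out`, `search` and `export_plain_text` are all hypothetical, and plain text stands in for a Word-compatible export format.

```python
from dataclasses import dataclass, field


@dataclass
class Snippet:
    """A copied-out passage that remembers where it came from."""
    source_id: str   # hypothetical identifier for the source document
    start: int       # character offset where the snippet begins in the source
    end: int         # character offset just past the end of the snippet
    text: str
    citation: str = ""
    notes: list = field(default_factory=list)
    tags: set = field(default_factory=set)


def copy_out(source_id, source_text, start, end, citation=""):
    """Copy a variable-length span out of a source, keeping its offsets."""
    if not 0 <= start < end <= len(source_text):
        raise ValueError("span falls outside the source text")
    return Snippet(source_id, start, end, source_text[start:end], citation)


def search(snippets, keyword=None, tag=None):
    """Return snippets matching a keyword (in text or notes) and/or a tag."""
    results = []
    for s in snippets:
        haystack = " ".join([s.text] + s.notes).lower()
        if keyword is not None and keyword.lower() not in haystack:
            continue
        if tag is not None and tag not in s.tags:
            continue
        results.append(s)
    return results


def export_plain_text(snippets):
    """Render snippets with citations, tags and notes as plain text."""
    blocks = []
    for s in snippets:
        lines = ['"' + s.text + '"']
        if s.citation:
            lines.append("Source: " + s.citation)
        if s.tags:
            lines.append("Tags: " + ", ".join(sorted(s.tags)))
        lines.extend("Note: " + note for note in s.notes)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

Because each snippet stores `source_id`, `start` and `end`, the interface can always jump from a note back to the exact place in the text, which is the part that hand-copied notes lose.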
Next, if we’re talking about systems with keyword search functionality, steps 6 and 7 make it clear that the ability to save, organize, and annotate search results is key. This is in addition to the ability to switch seamlessly between looking at search results and reading the source text surrounding any particular search result.
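One way to picture this is a keyword-in-context search whose results can be kept and annotated. The sketch below is an assumption-laden illustration rather than a description of any existing system: `keyword_in_context` is a hypothetical helper, and each hit is a plain dictionary that records its offsets (so the reader can jump to the surrounding source text) plus an empty `note` field for the scholar to fill in.

```python
import re


def keyword_in_context(text, keyword, width=40):
    """Find each whole-word occurrence of a keyword, with surrounding text."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = max(0, match.start() - width)
        right = min(len(text), match.end() + width)
        hits.append({
            "start": match.start(),        # offset of the hit in the source
            "end": match.end(),
            "context": text[left:right],   # the hit plus `width` characters either side
            "note": "",                    # filled in later when annotating the hit
        })
    return hits
```

Saving the list of hits, rather than only displaying it, is what lets a search session turn into organized notes instead of disappearing when the results page closes.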
If we want to get more sophisticated with information retrieval (which I do), the three “finding snippets of text” problems I identified above need attention.
But what about when you have visualizations? Coming back to my original question, where do these fit in? I’m still not sure, but I think the steps I’ve found above will give me a good place to start looking for answers.