A common problem in search and exploration interfaces is the vocabulary problem. This refers to the great variety of words that different people can use to describe the same concept. For people exploring a text collection, this makes search difficult. There are only a limited number of different queries they can think of to describe that concept, so they may miss many other instances that use different words. This is an important issue for humanities scholars. Often, the very first step of a literature analysis is to comb through text, trying to find thought-provoking examples to study later.
In this post, I give an example of how our project WordSeer, a text analysis environment for humanities scholars, can be used to overcome this problem. In this example, I’ll be using an instance of WordSeer running on the complete works of Shakespeare from the Internet Shakespeare Editions. It’s live, so you can follow along with this example on the web at wordseer.berkeley.edu/shakespeare.
You can read the post after the jump, or just watch this video.
I began with a simple question, “What are some things that are ‘beautiful’ in Shakespeare?”. Normally, this would be a challenging question by itself. WordSeer, however, uses grammatical search. This is a search feature that goes beyond keyword search and also searches over grammatical relationships between words. These are relationships such as subject-object and modifier-subject. For example, in the sentence “The good God has given every man intellect”, there is a relationship between “good” and “God”: “good” is an adjective that modifies “God”. There is also a verb-agency relationship between “God” and “given”, and a verb-object relationship between “given” and “man”. For more than a decade, it has been possible to automatically extract such relationships from text using computational linguistics algorithms. WordSeer uses these well-known algorithms to analyze the works of Shakespeare (in this case) and allow users to search over them.
In the case of my question, “What are some things that are ‘beautiful’ in Shakespeare?”, I can use grammatical search to good effect, as shown in Figure 1 above. I leave the left-hand-side box blank to retrieve all matches, select the “described as” grammatical relationship, and put in “beautiful” to create the fill-in-the-blanks query “_____ described-as beautiful”. I press “Go” to search, and WordSeer returns all the matching sentences. These are sentences containing the adjective “beautiful”, applied to some other word. I get the results in Figure 2.
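To make the fill-in-the-blanks idea concrete, here is a minimal Python sketch of searching over grammatical relationships. It is not WordSeer's implementation: the relationships are annotated by hand rather than extracted by a parser, and the `Relation` type, the `grammatical_search` function, and the relation names “done-by” and “done-to” are hypothetical names chosen for illustration. Only “described-as” comes from the interface described above.

```python
from typing import List, NamedTuple, Optional


class Relation(NamedTuple):
    """One grammatical relationship between two words in a sentence."""
    head: str       # the word being described or acted on/by
    relation: str   # e.g. "described-as"
    other: str      # the word on the right-hand side of the query
    sentence: str


GOD = "The good God has given every man intellect"
BIANCA = "His youngest daughter, beautiful Bianca"

# Hand-annotated for illustration; a real system would get these from a parser.
RELATIONS = [
    Relation("God", "described-as", "good", GOD),
    Relation("given", "done-by", "God", GOD),       # hypothetical relation name
    Relation("given", "done-to", "man", GOD),       # hypothetical relation name
    Relation("Bianca", "described-as", "beautiful", BIANCA),
]


def grammatical_search(
    relations: List[Relation],
    relation: str,
    head: Optional[str] = None,
    other: Optional[str] = None,
) -> List[Relation]:
    """Return relations matching the pattern; a slot left as None is a blank."""
    def fits(slot: Optional[str], word: str) -> bool:
        return slot is None or slot.lower() == word.lower()

    return [
        r for r in relations
        if r.relation == relation and fits(head, r.head) and fits(other, r.other)
    ]


# The query "_____ described-as beautiful": left-hand side left blank.
hits = grammatical_search(RELATIONS, "described-as", other="beautiful")
for hit in hits:
    print(hit.head, "|", hit.sentence)
```

Leaving both slots blank would return every “described-as” pair, which is a quick way to browse all adjective-noun pairings in this toy list.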
To my alarm, there was only one match: the sentence “His youngest daughter, beautiful Bianca”, from The Taming of the Shrew. Concerned that my algorithms had malfunctioned, I did a simple search for the word beautiful, without any grammatical relationships. Lo and behold, there were only 16 results (Figure 3).
I had encountered the vocabulary problem. Not being a Shakespeare scholar, I couldn’t think of any other words that could have been used instead of “beautiful”, and it seemed preposterous that these were the only results in all of Shakespeare. There must be other words.
To investigate further, I decided to read some of the context around the word “beautiful”. To do this, I clicked on the “book” icon to the left of my “beautiful Bianca” search result. This brought up a new window (Figure 4) with the full text of The Taming of the Shrew, opened up to the exact line matching the search result.
After convincing myself that “beautiful” did actually mean what I thought it did in Shakespeare, I decided that I needed to see synonyms. WordSeer supports this need. Using the contexts of words, it computes synonyms based on other words that “behave” in the same way — that are used in the same contexts, that have grammatical relationships to similar words, and so on. Right-clicking on a word while reading (Figure 5) brings up synonyms. These are computed based on being used in a similar way to “beautiful” in Shakespeare, and not based on some external measure of similarity, such as a dictionary or thesaurus. Therefore, they reflect the particular idiosyncrasies of just the Shakespeare collection.
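The general idea of finding words that “behave” alike can be sketched with context vectors and cosine similarity. This is one common way to do it, not necessarily how WordSeer computes its synonyms. The sentences below are invented toy examples, not quotations from Shakespeare, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy sentences for illustration only (not from Shakespeare).
TOY_SENTENCES = [
    "the fair maid smiled",
    "the beautiful maid smiled",
    "the gentle maid wept",
    "the fair queen spoke",
    "the beautiful queen spoke",
    "the angry king spoke",
]


def context_vectors(sentences, window=2):
    """Count, for every word, the words that appear within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_words(target, vectors, top_n=3):
    """Rank other words by how similar their contexts are to the target's."""
    scored = [
        (cosine(vectors[target], vec), word)
        for word, vec in vectors.items()
        if word != target
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:top_n]]


vectors = context_vectors(TOY_SENTENCES)
print(similar_words("beautiful", vectors))
```

In this toy data “fair” appears in exactly the same contexts as “beautiful”, so it ranks first. Because the scores come only from the collection itself, a different collection would give a different list.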
This list of synonyms seemed promising. It contained words such as “tractable”, “fair”, and “gentle”, that I would never have thought of including in my initial search. To find out how widespread they were in the collection, I turned to WordSeer’s heat map tool. I clicked some interesting words and added them to my query (Figure 6). This took me to the heat map view.
WordSeer heat maps can be confusing if you have never seen one before, so I’ll explain them here. If you already know how they work, you can skip this section.
WordSeer uses heat maps to visualize collection-wide occurrence patterns of words and phrases. In this example, I’ll use the word “fair” to illustrate.
Typing in “fair” creates the pretty picture in Figure 8 above. Each vertical column is a single document — in the picture, I am hovering over the column corresponding to “Macbeth”. The documents are lined up side by side in long vertical columns.
All of Shakespeare’s works are here. Figure 9 shows the column corresponding to “A Midsummer Night’s Dream”. The blue highlights show occurrences of the query. In each vertical column, blue blocks indicate that the query word, in this case “fair”, has occurred in that location. Blocks higher up in the column mean that the word occurred near the beginning, and blocks lower down in the column mean that the word occurred towards the end of the document. The documents all “appear” the same length, so shorter documents are “stretched” (a few taller blocks) and longer documents are squeezed (many squat blocks).
Hovering over a highlighted block brings up a window showing the matched sentence. In this case, I’ve hovered over a line containing “fair” from “The Tempest”. The popup shows the line, with “fair” highlighted in the same color, and a book icon. Clicking the icon opens up a new window, in which I can read more of the text if I wish to.
But back to the vocabulary of beauty
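The stretching and squeezing can be sketched in a few lines of Python. This assumes that each sentence gets one block and that every column is drawn at the same fixed height; the function name, the 100-unit column height, and the toy documents are all hypothetical and are not taken from WordSeer.

```python
def highlighted_blocks(sentences, query, column_height=100.0):
    """Return (top, height) pairs for the sentences that contain the query word.

    Every document fills the same column height, so a document with few
    sentences gets tall blocks and one with many sentences gets squat blocks.
    """
    if not sentences:
        return []
    block_height = column_height / len(sentences)
    blocks = []
    for index, sentence in enumerate(sentences):
        # Strip simple punctuation so "fair," still matches "fair".
        words = [w.strip(".,;:!?\"'").lower() for w in sentence.split()]
        if query.lower() in words:
            blocks.append((index * block_height, block_height))
    return blocks


# Two invented toy documents: one short, one long.
short_doc = ["A fair day.", "A storm came.", "No more.", "So fair, so still."]
long_doc = ["Nothing here."] * 5 + ["She is fair."] + ["Nothing here."] * 4

print(highlighted_blocks(short_doc, "fair"))  # two tall blocks, top and bottom
print(highlighted_blocks(long_doc, "fair"))   # one squat block near the middle
```

A block's `top` value says how far down the column the match sits, which is what lets you see at a glance whether a word clusters at the start or the end of a play.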
The heat map I got from my synonym query (Figure 11 above) showed that, although “beautiful” was quite a rare word, other synonyms for it seemed much more prevalent. “Fair”, in particular, seemed to be used a lot: the whole map was purple. Hovering over individual instances of “fair” showed that it did seem to be used the way “beautiful” is in today’s English.
For further verification, I looked at the word tree for beautiful (Figure 12), which was displayed on the same page just below the heat map. It showed that “fair” was used in constructions like “fair and virtuous”, “fair and happy”, “fair and good”.
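A word tree groups the phrases that follow a word by their shared prefixes. The sketch below builds one as a nested dictionary with counts, using only the three constructions quoted above. It is a simplified illustration with hypothetical function names, not WordSeer's word tree code.

```python
def build_word_tree(sentences, root, depth=2):
    """Build a prefix tree of the words that follow `root`, with counts."""
    tree = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != root:
                continue
            node = tree
            # Walk the next `depth` words, sharing branches for shared prefixes.
            for following in tokens[i + 1 : i + 1 + depth]:
                entry = node.setdefault(following, {"count": 0, "children": {}})
                entry["count"] += 1
                node = entry["children"]
    return tree


def print_tree(tree, indent=1):
    """Print branches, most frequent first, indented by depth."""
    ordered = sorted(tree.items(), key=lambda kv: (-kv[1]["count"], kv[0]))
    for word, entry in ordered:
        print("  " * indent + f"{word} ({entry['count']})")
        print_tree(entry["children"], indent + 1)


PHRASES = ["fair and virtuous", "fair and happy", "fair and good"]
tree = build_word_tree(PHRASES, "fair")
print("fair")
print_tree(tree)
```

The three phrases collapse into a single “and” branch with three leaves, which is the kind of pattern a word tree makes visible.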
All this evidence convinced me that I would be better off using the word “fair” instead of “beautiful” to investigate the concept of beauty in Shakespeare. Returning to the search page, I typed in a new grammatical search query: “_________ described-as fair”. The results (Figure 14) were much more informative:
Because there was more than one result this time, WordSeer showed a bar graph summarizing the matches. At a glance, I could see that I was on the right track. There were a lot of women’s names, interspersed with other words like “queen”, “daughter”, and “day”. It seemed that I had successfully overcome the vocabulary problem, at least this time. I had a starting answer to my original question, “What are some things that are ‘beautiful’ in Shakespeare?”
The goal of text analysis interfaces like WordSeer is to properly combine powerful language processing algorithms with easy user interfaces. Neither is enough by itself, but together, they allow users to progress naturally from step to step, assisting them through an analysis.
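A summary like this amounts to counting how often each word fills the blank. Here is a minimal sketch with a text bar chart. The list of matched words is an invented placeholder and does not reflect the real counts in Figure 14; `summarize` is a hypothetical helper name.

```python
from collections import Counter

# Invented placeholder matches, NOT real WordSeer output.
matched_words = ["queen", "daughter", "day", "queen", "daughter", "queen"]


def summarize(words):
    """Count the matched words, most frequent first."""
    return Counter(word.lower() for word in words).most_common()


for word, count in summarize(matched_words):
    print(f"{word:<10}{'#' * count} {count}")
```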