WordSeer 3.0

We’re thrilled to announce the latest release of WordSeer!

After almost a year of improvements, WordSeer is now capable of much more than it ever was. You can now filter, get overviews of a collection, do side by side comparisons, open up multiple visualizations, drill down into what you see, and save and export your results. This post introduces the new features through a series of demonstration videos, analyzing 30 years of New York Times editorials about China and Japan. For easy skimming, it also explains the new features with text and screenshots, but with examples from Shakespeare plays — which we’ve analyzed before.

Background

But first, some background. WordSeer is a web-based text analysis and sensemaking environment for humanists and social scientists. It’s a research project at UC Berkeley’s Computer Science Division and School of Information.

Let’s unpack that:

Text Analysis and Sensemaking. Sensemaking is a bit of jargon computer scientists use to describe the complex, drawn-out, iterative process we engage in when we’re trying to process and understand information. All scholarly research is a form of sensemaking. We’re particularly interested in sensemaking with text data, because there aren’t too many tools out there for this. Sensemaking with text is more difficult than with other kinds of data, because the only really good way to get meaning out of text is to read it. Tables of numbers, on the other hand, don’t need to be read in the same way to be meaningful. Numbers are therefore comparatively easier to condense, summarize, spot patterns in, and predict.

For humanists and social scientists. Literature scholars, historians, and many other kinds of humanists and social scientists (not to mention journalists and data analysts) need to do text-based sensemaking very deeply every day. It’s a hard problem that needs a good solution.

WordSeer is a research project, funded by two successive NEH digital humanities grants. We’re computer scientists, and our goal is to figure out how to make advanced computational technologies from the fields of information retrieval, data visualization, and computational linguistics work for scholars trying to deeply understand text.

Web-based. WordSeer is a program that runs in a web browser. In our case, the only browsers in which it works properly are Chrome and Safari (Firefox is currently breaking for mysterious reasons; see “research project” above). WordSeer’s main website is http://wordseer.berkeley.edu.

Previous Versions

Many of you already know WordSeer from its previous incarnations. You know we’ve been developing it through case studies with individual scholars, and that we’ve already done a few demos and gotten a few results. But this is WordSeer’s biggest jump yet.

It began last spring, with a class of undergraduate Shakespeare students in Canada. In April of 2012, students in Michael Ullyot’s “Hamlet in the Humanities Lab” at the University of Calgary had just finished putting WordSeer and four other web-based text analysis tools through their paces. The students had spent the first half of the class getting familiar with five tools (Voyant, Tapor, WordHoard, Monk and WordSeer). In the second half, they split into groups and each group analyzed a different act of Hamlet. Every week, they blogged about their experiences. At first, they talked about learning the tools: how they functioned, when they broke, what they seemed to be useful for. Later, they reported on using them to analyze text.

For the WordSeer project, those blog posts were a gold mine. We’d never had so many users, and we’d never had such detailed information about their experiences either. And these students weren’t just any users: they were motivated, skeptical, critical, and engaged. They used WordSeer over a significant amount of time, for pre-existing, well-defined goals, and not just in one-off “let’s see what this is about” sessions. By the end of the semester they had made around 180 posts.

It hadn’t even been a week before I was contemplating a significant change to WordSeer: the ability to isolate and group together sets of documents to analyze together. As a complete outsider, I’d failed to foresee something as simple as the students’ need to isolate just Hamlet for analysis.  I quickly added it while they were still learning the tool, so that WordSeer wouldn’t fall behind, but then backed off for the rest of the semester, except for a minor bugfix or two. The result of the change was a post, a few videos, and a conference paper about comparative analyses in Shakespeare.

The semester following that, I conducted several interviews with history and English PhD students, as well as an online survey to flesh out other parts of my understanding. The result is WordSeer, version 3.

The New WordSeer

We redesigned WordSeer around the following common analysis needs that the old WordSeer just didn’t meet:

  1. Getting an overview of the contents of a collection or sub-collection
  2. Drilling down into a sub-collection of interest
  3. Narrowing down analyses by meaningful metadata, such as (in Shakespeare) a particular speaker, act, or scene.
  4. Exploring ideas for a new search or analysis based on the results of a current one
  5. Comparing two or more visualizations side-by-side or referring to multiple tools simultaneously
  6. Forming custom categories for analysis (e.g. male speakers, female speakers), and comparing analyses across those categories
  7. Investigating a group of words together
  8. Saving and exporting work

Now, we’re pleased to introduce WordSeer 3.0. Because it’s very interactive, and supports many continuous sequences of analysis, we thought the best way to do that was through a series of videos.  But videos aren’t easy to skim, so the rest of this post explains the new features in regular blog format. If you’re still curious after the videos and reading this post, there’s also the WordSeer 3.0 Guide, but that’s more like an instruction manual for users.

A new collection: 5,000 New York Times editorials

In these videos, we’re analyzing a much larger text collection than we’ve ever done before. It’s every New York Times editorial about China or Japan published between 1980 and 2012. That’s about 5,000 editorials, each one about China or Japan or both. They were downloaded using Lexis Nexis, and filtered to just ‘China’ or ‘Japan’ using the subject categories available through that tool. This choice was motivated by the research interests of Chris Fan (@sea_fan), one of the scholars we’ve been collaborating with. He’s a PhD student in English at Berkeley and studies US-China relations. While taking our digital humanities course in the spring, he became interested in analyzing these articles computationally.

His work on American literature’s reactions to China’s rise draws upon a set of historical observations about the “rise of China” that are broadly accepted by historians and cultural historians. Literary scholars typically allow their claims to rest on observations made by field experts like historians and sociologists, or on their own inductive reasoning, but he wanted to verify some of those observations by gathering as much empirical evidence for them as possible. 

Demo Videos

Although WordSeer is very cool (or so we think, at any rate), it is still kind of slow, so we’ve sped these videos up to eliminate the 5- or 6-second waits that sometimes occur before the results load. Speed is really important in exploratory analysis interfaces, and we believe it makes a significant difference in the overall experience. The slowness is part of the reason we haven’t yet made WordSeer publicly available, but we’re working on it.

The first video shows the bones of the new WordSeer: overviews and filters. It shows how WordSeer allows you to drill down into exactly the subsets of data you want to analyze. Equally important, and equally new, is the way it shows you overviews of the contents of any subset of data, so even if you’re not completely familiar with a collection, you have places to start investigating, and a sense of what’s in the text.

Next, we look at how search can be used along with overviews and filters to navigate to very specific subsets of interest. As an example, we look at how searching for the word ‘economic’ allows us to discover the descriptor ‘second-largest’, and how we can follow up on that co-occurrence over time.

Video 3, demonstrating side by side comparisons, is where the fun starts. We now know enough to use WordSeer to actually discover something, so we compare and contrast the different ways China was talked about in the 80s, 90s, and 2000s side by side, and show how the conversation around China turned from cold-war politics in the 80s to the global economy in the 2000s.

In video 4, we do another analysis, showing how side-by-side views can help give you different angles on the same data. We look at sentences containing ‘economic’ and ‘world’, and show how, if you look at them using different visualizations, it’s very easy to re-discover the fact of China’s ‘rise’ from a developing country to a global power.

In video 5, we return to one of WordSeer’s staples: grammatical search, which allows you to query over grammatical relationships between words to discover, for example, which adjectives apply to certain words, or all the verbs that a particular word participates in. What’s new is how you can now discover the way a word interacts with other words without knowing the word or the particular relation beforehand, or even really how to use grammatical search. In particular, we show how you can discover a particular phrase, ‘China card’, and, by following up on it, discover that in the 80s China was seen as a ‘card’ that US politicians could play, but that the term almost completely disappeared in the 90s and after.

Finally, in video 6, we’ll look at related words and how they allow you to discover the main terms mentioned along with a word and drill down into any particular one.

In the rest of this post, I’ll go through all these features, but on our old standby collection: Shakespeare’s plays.

WordSeer Features

Overviews

When you go to Amazon.com to shop for books, you don’t need to know what the different categories of books are in advance — Amazon shows them to you in a browsable hierarchy. It allows you to discover what’s there, instead of insisting that you ask for something specifically by searching for it. The old WordSeer didn’t have that kind of discovery or navigation. When you opened it up for the first time, you were confronted with a blank search box. This is what the Shakespeare front page looked like:

An uninformative sight: you get no sense of the richness of the collection, the scope of it, the tone of it, or the variety of it.

By contrast, the new WordSeer’s Shakespeare front page is much more informative:

A word tree of the most frequent word in the collection (excluding function words like “the” and “a”, of course), along with overview-filters for metadata categories (act, scene, speaker, play), the most frequent phrases (two or more words) and most frequent nouns, verbs, and adjectives

The new system tries to expose as much of the content and diversity of the collection as possible, as soon as possible. Without having to type anything, or even know anything about the collection, a user is presented with four different kinds of overviews:

  1. An interactive Word Tree of the most frequent word in the collection — “good” (center)
  2. Overview-filters of the most frequent nouns, verbs, and adjectives (bottom)
  3. Overview-filters of the most frequent phrases (two or more words) (bottom left)
  4. A browsable list of categories extracted from the input XML. In the Shakespeare collection, these are Act, Scene, Play title, Speaker, and line number. (left)

At a glance, the user not only sees what the main categories and words are, but also sees how many sentences fall into each category.

Drilling Down and Navigating

As mentioned above, the previous WordSeer had no support for discovery or browsing.  When you opened up the Shakespeare page, you couldn’t tell, for example, which the most prominent characters were, or what the most frequent words and phrases were. Also missing was the ability to navigate based on that kind of information. You couldn’t express filters, like “all sentences spoken by Gertrude” or “everything from Act 3 of Romeo and Juliet”, only searches.

That is why all overviews in WordSeer double as navigable filters. For example, if I click on “Hamlet” under the “speaker” list in the categories on the left hand side of the figure above, WordSeer will filter my view to just the sentences in which the “speaker” is “Hamlet”.

Drilling down to just speeches by Hamlet

The result is still an overview, but of a smaller set of sentences: just those in speeches by Hamlet. All the details change: the Word Tree (center) changes to show the contexts surrounding the most frequent word in his speeches, and the lists of frequent words (bottom) and phrases (bottom left)  also change to reflect the content of Hamlet’s speeches:

An overview of just Hamlet’s speeches
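To make the idea of an overview that doubles as a filter more concrete, here is a minimal sketch. It is not WordSeer's actual code: the `SENTENCES` list, its field names, and the two helper functions are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: each sentence carries the kinds of metadata
# mentioned in the post (play, act, speaker).
SENTENCES = [
    {"play": "Hamlet", "act": 1, "speaker": "Hamlet",
     "text": "Frailty, thy name is woman"},
    {"play": "Hamlet", "act": 3, "speaker": "Hamlet",
     "text": "To be, or not to be, that is the question"},
    {"play": "Hamlet", "act": 3, "speaker": "Gertrude",
     "text": "The lady doth protest too much, methinks"},
    {"play": "Hamlet", "act": 5, "speaker": "Horatio",
     "text": "Good night, sweet prince"},
]


def overview(sentences, field):
    """Count how many sentences fall under each value of a metadata field."""
    return Counter(sentence[field] for sentence in sentences)


def drill_down(sentences, field, value):
    """Keep only the sentences whose metadata field equals the chosen value."""
    return [sentence for sentence in sentences if sentence[field] == value]


# The overview shows the counts; picking one entry narrows the view,
# and the same overview can then be recomputed on the smaller set.
by_speaker = overview(SENTENCES, "speaker")
hamlet_only = drill_down(SENTENCES, "speaker", "Hamlet")
acts_for_hamlet = overview(hamlet_only, "act")
```

Because the filtered result is just another list of sentences, every overview can be recomputed on it, which is the behavior described above.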

The above examples deal with categorical metadata. But what about data types that are more naturally expressed as continuous ranges, such as time? For these data types, the metadata pane shows a different type of overview-filter: a distribution chart. In our Shakespeare collection, the only numerical data type we have is “line” for the line number within the scene. This is what the overview-filter looks like:

Filters for numerical metadata categories

If I want to look at just lines in a particular range, I can drag the handles and click the “filter” button:

Dragging the handles allows you to select a particular range.

In other collections, these sliders might be used to select date ranges or other more meaningful spans. 
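For numerical metadata, the same idea can be sketched as a binned distribution plus a range filter. This is only an illustration under my own assumptions: the `line` field name, the bin width, and both function names are hypothetical and not taken from WordSeer.

```python
from collections import Counter


def distribution(sentences, field, bin_width):
    """Count sentences per bin; each key is the lower edge of its bin."""
    return Counter(
        (sentence[field] // bin_width) * bin_width for sentence in sentences
    )


def filter_range(sentences, field, low, high):
    """Keep sentences whose numeric field lies in the inclusive range."""
    return [s for s in sentences if low <= s[field] <= high]


# Hypothetical sentences identified only by their line number in a scene.
LINES = [{"line": n} for n in (3, 12, 15, 27, 41)]

chart = distribution(LINES, "line", bin_width=10)
selected = filter_range(LINES, "line", 10, 30)
```

Dragging the two handles corresponds to choosing `low` and `high`; for a date field the same two functions would work on years instead of line numbers.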

Getting a sense of the contents of your data

The overviews I’ve described so far give you a sense of how your metadata attributes are distributed, but what if your collection doesn’t have any built-in categories like ‘speaker’ and ‘Act’? How do you get a sense of its contents? This task came up over and over again: getting a sense of the contents of some collection of text.

In WordSeer, that’s why we have overviews of frequent words and phrases, and these overviews change to reflect the searches and filters you’ve applied.

Frequent Phrases

One way to get a sense of the contents of a text collection is to look at frequent phrases. In WordSeer, “phrases” are sequences of two or more words. Every panel in WordSeer shows you an overview of the most frequent phrases in whatever intersection of searches and filters you’ve selected. For example, if we zoom in on the panel showing the list of sentences in Hamlet, we see the most frequent phrases in just Hamlet:

The most frequent 2-word phrases in Hamlet (that don’t contain stop words like ‘the’ and ‘of’).

This frequent phrases overview doubles as a filter. For example, if you want to see all 13 occurrences of “good night” in Hamlet, you can click the table row for “good night”, producing this:

Adding a filter corresponding to the phrase ‘good night’ to the existing filter for ‘Hamlet’ shows only the sentences containing ‘good night’ within ‘Hamlet’. 
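Here is a rough sketch of how frequent phrases without stop words could be counted, and how a phrase could then be used as a filter. The tokenizer, the tiny stop word list, and the function names are my own assumptions for illustration; WordSeer's real implementation may work quite differently.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop word list.
STOP_WORDS = {"the", "of", "a", "and", "to"}


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def frequent_phrases(sentences, n=2, top=10):
    """Count n-word phrases that contain no stop words."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            phrase = tokens[i:i + n]
            if not any(token in STOP_WORDS for token in phrase):
                counts[" ".join(phrase)] += 1
    return counts.most_common(top)


def containing_phrase(sentences, phrase):
    """Keep only the sentences in which the phrase occurs word for word."""
    target = tokenize(phrase)
    n = len(target)
    kept = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1)):
            kept.append(sentence)
    return kept


# Toy sentences, loosely Shakespearean, used only as sample input.
SAMPLE = [
    "Good night, sweet prince",
    "Good night, ladies; good night",
    "The rest is silence",
]

top_phrases = frequent_phrases(SAMPLE)
good_night = containing_phrase(SAMPLE, "good night")
```

Note that the counts are occurrences, not sentences: a sentence that says “good night” twice contributes two to the total.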

That’s not the only kind of overview you get. Just like we can discover the most frequent phrases, we can discover the most frequent nouns, verbs, and adjectives. WordSeer uses a computational linguistics technology called part-of-speech tagging to automatically categorize words into their parts of speech. For example, here are the most frequent nouns, verbs, and adjectives in Hamlet:

The most frequent nouns, verbs, and adjectives in ‘Hamlet’ 

There is an option: “group by stem”. A stem is a common root from which different word forms are derived. For example, enabling this option would group together “read”, “reading”, and “reads” under the single label “read”, and show the added-up count for all of them. Just like the list of phrases, these word lists double as filters. For example, clicking on the word “lord” in the list above would further filter the “Hamlet” sentences to just those containing “lord”.
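As an illustration of what grouping by stem does to a word list, here is a small sketch. The suffix-stripping rule below is a deliberately naive stand-in that I made up for this example; it is not a real stemming algorithm (such as the Porter stemmer), and I am not claiming it is what WordSeer uses.

```python
from collections import Counter

# Naive, illustrative suffix list; a real stemmer handles many more cases.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(word_counts):
    """Add up the counts of all word forms that share the same stem."""
    grouped = Counter()
    for word, count in word_counts.items():
        grouped[naive_stem(word)] += count
    return grouped


# Hypothetical counts, not real figures from any collection.
counts = {"read": 4, "reading": 2, "reads": 1, "lord": 9}
grouped = group_by_stem(counts)
```

With these made-up counts, the three forms of “read” collapse into a single entry whose count is their sum, while “lord” is left alone.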

Video 1 demonstrates these overview and filtering features on the China & Japan editorials, showing how, by successively using overviews and filters (and with absolutely no prior knowledge of the collection), we can discover that, in the 1980s, there was a controversy over Japan’s take on whaling for research.

Search

The most basic component of any text analysis system is search, and WordSeer supports that. It also has a way to discover grammatical relationships between words, and we’ll talk about that too.

WordSeer’s search box. Pressing “Go” will show a list of all the sentences that contain the word “heaven”.

WordSeer has a pretty complicated search box, but for simple searches, you can ignore most of it.

WordSeer’s search box

To perform a keyword search in WordSeer, type words into the search box, and leave the grammatical relation set to ‘anywhere in the text’. It will show you a list of search results, and you can click anywhere.  Video 2 demonstrates this feature on the China and Japan editorials. 

Grammatical Search

Sometimes keyword search isn’t enough. Often, what we’re really after are questions like “What does X do?” and “How is X described?” These aren’t keyword searches, but questions about the relationships between words: what verbs apply to X, and what adjectives?

WordSeer’s other search mode allows you to search over exactly these types of grammatical relationships. These relationships are things like “verb subject”, “verb object”, “adjective modifier”, etc. Grammatical search allows you to ask questions like “what are all the adjectives that apply to the word ‘man’?” and “what are all the verbs that ‘Hamlet’ is the agent of?”

The full list of grammatical relationships supported by WordSeer is described in detail here, in the Stanford Dependencies Manual. It explains all the different kinds of relationships available in WordSeer and gives examples of them in sentences.

Grammatical searches are more complex because there are three pieces of information in a grammatical relationship:

  1. The type of relationship
  2. The first word in the relationship
  3. The second word in the relationship

Why aren’t 2 and 3 interchangeable? Consider the two sentences “Look at the poster display, it’s really nice” and “Look at the display poster, it’s really nice”. In both cases, there’s a noun compound relationship between “poster” and “display”. However, the different word orders give the compounds slightly different meanings. In the first, the “poster display” is a display of posters, and the display is really nice; in the second, the “display poster” is a poster for display, and the poster is really nice. Computational linguistics technology represents the two relationships as noun_compound(display, poster) and noun_compound(poster, display), respectively.
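One way to picture this is to store each relationship as an ordered triple and let a search fix any of the three pieces. The sketch below is only an illustration: the `Dependency` type, the hand-written triples, and the `grammatical_search` function are assumptions of mine, and in practice the triples would come from a dependency parser rather than being typed in.

```python
from collections import namedtuple

# An ordered triple: swapping first and second gives a different relationship.
Dependency = namedtuple("Dependency", ["relation", "first", "second"])

# Hand-written triples for illustration only.
DEPENDENCIES = [
    Dependency("noun_compound", "display", "poster"),   # "poster display"
    Dependency("noun_compound", "poster", "display"),   # "display poster"
    Dependency("adjectival_modifier", "father", "noble"),
    Dependency("adjectival_modifier", "father", "dear"),
    Dependency("adjectival_modifier", "prince", "sweet"),
]


def grammatical_search(dependencies, relation, first=None, second=None):
    """Match on the relation type and, optionally, on either word."""
    return [
        dep for dep in dependencies
        if dep.relation == relation
        and (first is None or dep.first == first)
        and (second is None or dep.second == second)
    ]


# "Which adjectives apply to 'father'?" leaves the second word open.
adjectives_for_father = [
    dep.second
    for dep in grammatical_search(DEPENDENCIES, "adjectival_modifier",
                                  first="father")
]
```

Leaving a word as `None` is what lets a query ask an open question, while the fixed order of `first` and `second` keeps “poster display” and “display poster” apart.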

Performing a Grammatical Search

You can activate grammatical search mode using the drop-down menu in the top search bar. Selecting any relationship other than “anywhere in the text” will perform a grammatical search with that relationship. Another search box will appear to the right of the relations menu, so you can specify both words.

The grammatical search options in WordSeer

The Grammatical Search Bar Charts visualization was developed specifically for grammatical search queries. It’s like a list of search results, except augmented with bar charts of how many words match the grammatical relationship. The figure below shows how this visualization can be used to investigate descriptions of facial attributes in Shakespeare:

Using the grammatical search bar charts visualization to look at the results of a grammatical search. Here, we’re searching for the “face, eyes, hair [described as] _______” with the Grammatical Search Bar Charts visualization.

The results of this search are shown below:

The grammatical search bar charts visualization shows the different words that enter into a grammatical relationship and how frequent they are. The list of matching sentences is below; clicking on any one of them opens up that play, with that sentence highlighted in context.

The bar charts show how often each of the words appears in a “described as” relationship, as well as the words that describe them. Above, the chart shows that “eyes” is the most commonly described feature, at 83 times. The list of matching sentences is below the chart, with the matching words highlighted. The charts are also interactive. Clicking on a word filters the list of sentences to match that word, as shown below:

Filtering the adjectives describing ‘face’, ‘hair’ and ‘eyes’ to just ‘sweet’, ‘heavenly’ and ‘fair’.

Here, I’ve clicked on the bars for ‘sweet’, ‘heavenly’ and ‘fair’, which has filtered my sentences to just the ones in which those adjectives describe ‘face’, ‘eyes’ or ‘hair’. In this way WordSeer allows you to not only discover new things about your text, but upon discovering them, drill down into those things further.

Video 5 demonstrates grammatical search on the China & Japan editorials, by showing what we find when we do a grammatical search for the different ways ‘economy’ is described.
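The numbers behind bar charts like these can be thought of as two tallies over the matches of a grammatical search, plus a filter that runs when a bar is clicked. The following is a sketch under my own assumptions: the `MATCHES` tuples and sentence ids are invented sample data, and the function names are hypothetical.

```python
from collections import Counter

# Invented sample matches: (described word, describing adjective, sentence id).
MATCHES = [
    ("eyes", "fair", 1),
    ("eyes", "sweet", 2),
    ("face", "fair", 3),
    ("hair", "golden", 4),
    ("eyes", "heavenly", 5),
]


def bar_counts(matches):
    """Tally each side of the relationship, one tally per bar chart."""
    described = Counter(match[0] for match in matches)
    describing = Counter(match[1] for match in matches)
    return described, describing


def filter_by_adjectives(matches, selected):
    """Simulate clicking bars: keep matches whose adjective was selected."""
    return [match for match in matches if match[1] in selected]


described, describing = bar_counts(MATCHES)
clicked = filter_by_adjectives(MATCHES, {"sweet", "heavenly", "fair"})
sentence_ids = [match[2] for match in clicked]
```

Keeping the sentence id on each match is what allows the sentence list under the chart to shrink along with the selection.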

Exploring new threads of inquiry with the word menu

Visualizations and search results are often just a starting point. We often do multiple initial searches, and only drill down after we see something interesting, or have gotten a sense of the contents of the collection. But in the old WordSeer, there was no good way to follow up and drill down. That’s why one of the most powerful new ways to get around WordSeer is the Word Menu. When you see something interesting in a visualization, the word menu gives you a way to follow up on that thought by creating a new visualization, adding something to a group, or exploring related ideas.

Words appear in a lot of different places in WordSeer: lists of frequent words, lists of nearby words, document views, sentence popups, and the list of sentences. If a word turns blue when you hover over it, right-clicking on it will open a word menu:

The word menu for ‘father’. Right-clicking on a word anywhere in WordSeer shows you the word menu.

The search options in the word menu allow you to create a new visualization around that word. For example, clicking on ‘Search’ in the menu above will open up a search for the word ‘father’ in a new panel.

Navigating a Grammatical Neighborhood with Search Options

One of the most common questions we encountered about words was ‘how is this word used?’ That’s why the word menu includes the ability to explore the grammatical neighborhood of a word. What’s a grammatical neighborhood? It’s a term I made up to stand for the way a word interacts with other words. You can see the grammatical neighborhood of a word by clicking on it (to open the Word Menu) and exploring the search options.

For example, the Word Menu for “father” shows the different ways in which “father” is used, and the number of times each one appears in the collection. Suppose we look at the “adjectival modifier” search option. This shows us the different adjectives that apply to the word ‘father’. Examining it, we discover that fathers in Shakespeare are “good”, “dear”, “noble”, “ghostly”, “royal” and “sweet”:

Clicking on any of these options does a grammatical search for that relationship, and brings up the search results in a new panel alongside.

Video 5 demonstrates this on the China & Japan editorials, showing what we discover when we explore the grammatical neighborhood of the word ‘China’. 

Related Words

Another way WordSeer lets you get a sense of how a word is used is through the ‘Related Words’ option. This shows you all the different words that co-occur in the same sentences as a given word.

This option pops up a window showing nouns, verbs, and adjectives that frequently occur along with that word. These words are sensitive to the searches or filters we’ve applied. For example, if we had previously searched for “father” and then wanted to investigate the word “son”, then the related words would only compute co-occurring words for “son” in the context “father”:

Of course, this list of co-occurring words is just that, a list of words, meaning that clicking on any of them would again bring up a word menu. Except, because these are co-occurring words, these word menus are special: they have an extra option, ‘see co-occurrences’.

Clicking on the “see co-occurrences” button above would bring up just those sentences where “son” and “daughter” co-occur, in a separate panel. Video 6 demonstrates this for the China & Japan editorials.
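Sentence-level co-occurrence, restricted to an earlier search, can be sketched as follows. This is an illustration only: it skips the part-of-speech step that splits related words into nouns, verbs, and adjectives, and the sample sentences, stop word list, and function names are all made up.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def related_words(sentences, word, context=None, stop_words=frozenset()):
    """Count words sharing a sentence with `word`.

    If `context` is given, only sentences that also contain it are used,
    mimicking a previously applied search.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        if word not in tokens:
            continue
        if context is not None and context not in tokens:
            continue
        counts.update(tokens - {word, context} - stop_words)
    return counts


# Made-up sample sentences.
SAMPLE = [
    "My father loves his son and daughter",
    "The son greets the daughter",
    "A father walks alone",
]
STOPS = frozenset({"my", "his", "and", "the", "a"})

with_father = related_words(SAMPLE, "son", context="father", stop_words=STOPS)
anywhere = related_words(SAMPLE, "son", stop_words=STOPS)
```

Each sentence counts at most once per word here, so the totals read as “number of sentences shared”, which matches the idea of co-occurring in the same sentence.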

Side-By-Side Comparisons

In the old WordSeer, side by side comparisons were a pain — yet the students in “Hamlet in the Humanities Lab” needed to do them all the time! You can see what I mean in this video, in which I compare the word “love” in the comedies and tragedies. You had to open up a new browser window, navigate to WordSeer, then type in the other search, and then switch between the two browsers.

This is why the new WordSeer has been redesigned to work somewhat like a computer desktop environment — it can display multiple “panels”, each with different information.

For example, here I am repeating a previous analysis from WordSeer 2, comparing “love” in the comedies and tragedies. There are two Word Trees for “love” side by side: over the comedies in the left panel, and over the tragedies in the right panel:

Comparing the word trees for love in the comedies (left) and the tragedies (right)

Video 3 demonstrates how to open up multiple panels, and shows how we can use this new feature to great effect in the China & Japan editorials to compare the language around the word ‘China’ across the 80s, 90s, and 2000s.

Analyses Across Categories

A common kind of analysis we found, not just in the blog posts but in our interviews as well, was comparing categories of data along some dimension. As a simple example, in Shakespeare, students would often pose questions like “How does the theme of the supernatural in Act 1 differ from Act 2?” or “How do the different characters’ levels of involvement change throughout the play?”

These questions involve comparing two or more pre-existing categories (Act, Scene, Speaker) along some dimension (“theme of the supernatural”, “involvement”, etc.).

The old WordSeer simply had no way to express such an analysis goal. You could search. That was it. If it wasn’t a search, you were out of luck.

That’s why we developed the “Word Frequencies” tool for the new WordSeer. It’s got a cryptic name (I’m not very good at names), but look how naturally it expresses these types of questions.

For example, I can search for the terms “ghost, spirit, heaven, hell” — an approximation to the “supernatural theme”, and see how the frequency of those words varies across all the different categories within my data, including Act:

Frequencies of the words “ghost, spirit, heaven, hell” across Act, Scene, and Speaker in Hamlet. Act 1 is where these words are the most frequent, specifically Act 1 Scene 4.

These graphs are interactive: I can click, for example, on just ‘Act 3’, and that shows me how this theme is distributed across the different characters:

Filtering the statistics to just Act 3, by clicking on the bar, shows which characters and scenes mention "heaven, hell, ghost, or spirit" and how frequently.

 Video 4 demonstrates this feature on the China & Japan editorials, by showing how we can use it to look at trends of words like ‘economy’ and ‘world’ over time, and to compare the trends for the two different countries.
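The kind of number such a chart plots can be sketched as the share of sentences in each category that mention any of the search terms. This is my own simplified reading, not WordSeer's code: the field names, the made-up sentences, and the choice to count sentences rather than word occurrences are all assumptions.

```python
import re
from collections import Counter


def word_frequencies(sentences, terms, field):
    """Fraction of sentences per category value that contain any term."""
    wanted = set(terms)
    totals = Counter()
    hits = Counter()
    for sentence in sentences:
        value = sentence[field]
        totals[value] += 1
        tokens = set(re.findall(r"[a-z']+", sentence["text"].lower()))
        if wanted & tokens:
            hits[value] += 1
    return {value: hits[value] / totals[value] for value in totals}


# Made-up sentences with act metadata, for illustration only.
SAMPLE = [
    {"act": 1, "text": "The ghost walks again"},
    {"act": 1, "text": "Heaven and hell are near"},
    {"act": 3, "text": "A spirit speaks"},
    {"act": 3, "text": "The players arrive"},
]

by_act = word_frequencies(SAMPLE, ["ghost", "spirit", "heaven", "hell"], "act")
```

Dividing by the number of sentences in each category keeps a long act from looking more “supernatural” simply because it contains more text. Passing a different `field`, such as a speaker or scene field, gives the other charts.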

Custom units of analysis

Pre-existing divisions (such as Act, Scene, Speaker, etc.) are useful, but they’re almost always not enough. As a scholar’s understanding of a body of text grows, he or she collects, categorizes and re-categorizes quotes, sentences, and documents into new categories.

For example, consider the question, “How does the treatment of love in Shakespeare vary between the comedies and tragedies”? Here, “comedies” and “tragedies” are units of analysis that don’t come pre-defined in our collection (but maybe they should?).

A better example of such a question is “What are the different characteristics of speeches by male and female speakers?” Here, our units of analysis are “speeches by male speakers” and “speeches by female speakers”, and we don’t have those as pre-defined categories either.

And yet another is, “How do concepts of emotion correlate with mentions of people in power? How often do emotions like ‘anger’, ‘sadness’, ‘joy’, ‘hate’ correlate with different kinds of people in power?” This is more complex. We want to look at “the sentences mentioning different types of people in power” and correlate them with “sentences mentioning different types of emotion”.

Here I’ll show how WordSeer’s Document Sets, Sentence Sets, and Word Sets features can help conduct exactly these types of analyses.

Sets aren’t just for comparison — once you make a set, it persists, you don’t lose it. You can use them to collect interesting things to look at, or to make conceptual groupings for your own understanding.

Document Sets

Document sets are most useful when you’re looking at gathering certain types of documents together. You make document sets by searching and filtering in the document browser (that’s a link to the guide entry).  To make a document set, just select some documents, and click “Add to Group”.

For example, if we wanted to follow up on the question “How does the treatment of love vary between the comedies and tragedies?”, we could do that in the following way. First, collect all the comedies:

Selecting the comedies for a document set

Then, add them to a group by clicking the “Add to group” button at the top, and typing in what we wanted to call it:

Adding plays to a new document set called ‘comedies’

We’d name the new group “comedies”, and hitting enter would create it. We can do the same for the tragedies, and the Document Sets overview now shows two sets, “comedies” and “tragedies”:

We can now use these sets as filters, because they appear in the metadata overview:

Any sets you create can act as filters. For example, we’re now seeing the comedies and tragedies sets.

So, if we now wanted to examine the treatment of “love” across the two sets of documents, we could do a word frequencies comparison. The word frequencies chart automatically uses the new document set categories.

Comparing the frequency of the phrase ‘in love’ across the comedies and tragedies. We see that about 0.4% of the sentences in the comedies mention “in love”, whereas only around 0.1% of the sentences in the tragedies do the same, which is less than half as many.
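If you’re curious what a chart like this boils down to, here’s a minimal sketch in Python. It is not WordSeer’s actual code: the function name `phrase_share`, the dictionary layout, and the toy sentences are all assumptions made up for illustration. It simply counts what fraction of the sentences in each set contain the phrase.

```python
def phrase_share(sentences, phrase):
    """Return the fraction of sentences containing `phrase` (case-insensitive)."""
    if not sentences:
        return 0.0
    needle = phrase.lower()
    # Plain substring matching: simple, but it would also match "within love".
    hits = sum(1 for s in sentences if needle in s.lower())
    return hits / len(sentences)

# Hypothetical toy sentences standing in for two document sets.
document_sets = {
    "comedies": ["I am in love with her.", "Come, let us go.",
                 "She is in love.", "Good morrow."],
    "tragedies": ["He is dead.", "I was in love once.",
                  "Draw thy sword.", "Farewell."],
}

for name, sentences in document_sets.items():
    print(f"{name}: {phrase_share(sentences, 'in love'):.1%}")
```

Because the result is a share of sentences rather than a raw count, sets of different sizes can be compared directly.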

Sentence Sets

You can put sentences into sets from the List of Search Results and Grammatical Search Bar Charts views. Click the checkboxes next to the sentences you want, and then add them to the set. For example, let’s collect speeches by female speakers in “The Merchant of Venice” into a sentence set. First, narrow down the list of sentences to just that play.

Quickly selecting ‘The Merchant of Venice’ using the auto-suggest box

Use the auto-suggest box to quickly select “The Merchant of Venice”. Then we can use the category filters to drill down into the sentences spoken by each female character, select them, and then add them to a sentence set.

Here, I’ve just finished adding the 240 sentences spoken by Portia to the set, and I’m about to add 50 by Nerissa:

Selecting sentences and adding them to a sentence set

After adding all the women’s sentences, I get 338 sentences. After doing the same for the men, and ignoring characters with fewer than 5 sentences, I get:

After adding the lines spoken by male and female characters in The Merchant of Venice to different sets.
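The same grouping can be sketched outside the tool. The snippet below is only an illustration under assumptions: the `(speaker, gender, sentence)` record layout, the function name `build_sentence_sets`, and the placeholder sentences are all hypothetical, and in WordSeer itself this is done by clicking through the category filters.

```python
from collections import Counter, defaultdict

def build_sentence_sets(records, min_sentences=5):
    """Group sentences by speaker gender, skipping rarely-speaking characters.

    `records` is an iterable of (speaker, gender, sentence) tuples.
    """
    records = list(records)
    per_speaker = Counter(speaker for speaker, _, _ in records)
    sets = defaultdict(list)
    for speaker, gender, sentence in records:
        # Ignore characters with fewer than `min_sentences` sentences.
        if per_speaker[speaker] >= min_sentences:
            sets[gender].append(sentence)
    return dict(sets)

# Placeholder records; real sentences would come from the play's text.
records = [
    ("Portia", "female", "sentence 1"),
    ("Portia", "female", "sentence 2"),
    ("Portia", "female", "sentence 3"),
    ("Nerissa", "female", "sentence 4"),
    ("Shylock", "male", "sentence 5"),
    ("Shylock", "male", "sentence 6"),
]

# A threshold of 2 keeps this toy example small; the text above uses 5.
sentence_sets = build_sentence_sets(records, min_sentences=2)
for gender, sentences in sentence_sets.items():
    print(gender, len(sentences))
```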

Now I can begin comparing them. I open up two panes, and look at the word frequencies across the acts for the two sets:

Comparing how many lines were spoken by male and female characters in different acts of the Merchant of Venice

The lists of frequent words (at the bottom of each panel) differ slightly between the two panels, and the characters’ patterns of involvement in the play are also very different.

Investigating groups of words together with word sets

One pattern of analysis that came up over and over again was investigating a group of words together. Both the undergraduate students in the Shakespeare class and the PhD students we interviewed described collecting or identifying a group of words of interest, and then analyzing them further in some way. The further analyses differed: searching for occurrences, trying to find patterns of co-occurrence or distribution, or something else. But the basic pattern was the same: collect a group of words, then analyze it.

In WordSeer, we support this kind of analysis using Word Sets. These are just collections of words. However, like document sets and sentence sets, they are units of analysis that have an independent existence. They act as filters, matching all the sentences that contain any of the words in the set, and can be used as search terms in the search box. Instead of typing in a long list of words, you can just use the word set:

Word sets can be used as search terms — it’s like searching for sentences that match any of the words in the set individually, but more compact.
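To make the “match any of the words” behavior concrete, here’s a small Python sketch. It’s an illustration only, not how WordSeer is implemented: the function names, the simple regular-expression tokenizer, and the example sentences are assumptions.

```python
import re

def matches_word_set(sentence, word_set):
    """True if the sentence contains at least one word from `word_set`.

    `word_set` is expected to hold lowercase words.
    """
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    return not tokens.isdisjoint(word_set)

def filter_sentences(sentences, word_set):
    """Keep only the sentences that match any word in the set."""
    lowered = {w.lower() for w in word_set}
    return [s for s in sentences if matches_word_set(s, lowered)]

# Hypothetical word set and sentences.
royal_words = {"lord", "king"}
sentences = [
    "My lord, the king is here.",
    "Good night, sweet prince.",
    "The King hath spoken.",
]

print(filter_sentences(sentences, royal_words))
```

Matching on whole tokens rather than substrings means “king” does not accidentally match “asking”.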

Word Sets with the Word Menu

As we saw in the word menu section, clicking or right-clicking on a word almost anywhere in WordSeer opens up the Word Menu. We saw how the word menu is a jumping-off point for analysis, but it’s also designed as a way to quickly collect words of interest into groups. The word menu has options both to add the word to a word set (you can add it to a new one if you don’t have existing ones) and to edit existing word sets:

The word set options in the word menu. This is the word menu for ‘lord’, and if I click ‘Add to word set > New’ it’ll make a new word set, with ‘lord’ in it.

If you make a new set, it’ll automatically be named after the word, and you can add more words to it using the word menu. Here, let me add ‘king’ to the ‘lord’ word set I just made:

Adding ‘king’ to the ‘lord’ word set through the word menu

Of course, it’s extremely cumbersome to click and add words individually if you already know which ones you want to add. You can therefore edit word sets directly, and type words in. Use the ‘Edit word set’ option to bring up that window:

To edit a word set by typing words in directly, use the ‘Edit word sets’ option in the word menu

This brings up a little window that you can edit, and you can click on the title bar to rename it.

Editing a word set by typing in words directly

You can also make and manage your Word Sets with the “Word Sets” overview:

The Word Sets pane allows you to create, delete, and edit word sets too.

Clicking “New” creates a new set, and “Delete” deletes the selected set. Double click to open the word set up in a window, or to rename it. Here, I’m creating a new set and naming it “god/supernatural”:

Creating a new word set and renaming it

In this set, I put “god”, “almighty”, “heaven”, and “spirit”, so I can compare how some emotion-related words co-occur with the two categories.

For this, I simply do two searches in the word frequency graph and compare the split across categories. As expected, the comedies have more happiness and the tragedies have more anger, but the comparison between royals and supernaturals is interesting:

Comparing the frequencies of anger-related words (blue) and happiness-related words (orange) across the comedies and tragedies, and the ‘royals’ and ‘supernaturals’ words

It appears, in fact, that the “royals” words are much less associated with the happy search (orange) than the “god/supernatural” words. Only 0.61% of the royal sentences have “happy” words, whereas twice as many (proportionally speaking) of the “god/supernatural” sentences do.
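The percentages here are conditional shares: of the sentences that match one word set, what fraction also match another? A rough sketch of that calculation follows. It rests on assumptions: the function names, the tokenizer, the “happy” word list, and the toy sentences are all invented for illustration, and WordSeer’s own computation may differ.

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

def cooccurrence_share(sentences, anchor_words, target_words):
    """Of the sentences matching `anchor_words`, the fraction that also
    contain at least one of `target_words`."""
    anchor = {w.lower() for w in anchor_words}
    target = {w.lower() for w in target_words}
    anchored = [t for t in map(tokenize, sentences) if t & anchor]
    if not anchored:
        return 0.0
    both = sum(1 for t in anchored if t & target)
    return both / len(anchored)

# Hypothetical word sets and sentences.
royals = {"king", "lord"}
supernatural = {"god", "almighty", "heaven", "spirit"}
happy = {"happy", "joyful", "glad"}
sentences = [
    "The king is happy today.",
    "My lord is angry.",
    "Heaven be joyful.",
    "The king rides out.",
    "O spirit, be glad.",
]

print(f"royals: {cooccurrence_share(sentences, royals, happy):.1%}")
print(f"god/supernatural: {cooccurrence_share(sentences, supernatural, happy):.1%}")
```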

Saving and Exporting Work

WordSeer isn’t a do-all system. It lacks even rudimentary note-taking capabilities, and it’s really more of a hypothesis-generation and evidence-gathering tool than a hypothesis-testing one. That’s why you can export almost all the data you see out of the tool, from the lists of frequent words, to the graphs and figures, to the lists of sentences, to the distributions of categories, so that you can include it in your other work, or do more sophisticated statistics on it.

Every single WordSeer data display has a tiny save button on its top left:

Exporting data from WordSeer with the ‘Save’ button.

For image-based visualizations, such as the Grammatical Search Bar Charts, Word Trees, and Word Frequencies, click on the save button at the top of the panel to generate download links to each of the visualizations as an image:

Saving images and downloading the data behind charts.

If you want to save the image in a filtered state, just click the save button after performing your operations — the images generated always reflect the current state of the chart.
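As one example of the kind of follow-up statistics you might run on exported counts, here is a sketch of a two-proportion z-test in plain Python. This is not part of WordSeer, and it is just one of several reasonable tests. The function name and the counts below are made up; you would substitute the numbers from your own export.

```python
import math

def two_proportion_z_test(hits_a, total_a, hits_b, total_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value), using the pooled-proportion standard error.
    """
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts for illustration only: matching sentences and total
# sentences in each of two sets.
z, p = two_proportion_z_test(hits_a=40, total_a=10_000, hits_b=10, total_b=10_000)
print(f"z = {z:.2f}, p = {p:.4g}")
```

A small p-value suggests the difference in shares is unlikely to be chance alone, though with text data the sentences are rarely independent, so treat the result as a rough guide.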

History

Finally, the last thing we created is a persistent History store on your computer. It may not seem like such a big deal, but it makes things more convenient. Because WordSeer is an application that runs inside your web browser, it’s common to leave and return later in a new session. It’s also easy to accidentally close the page. With the History module, when you open WordSeer up again, your history will be available from the pane on the left. Just click a row to reopen that panel. Your history won’t be available if you use a different computer or a different username, though.

The History pane on the left hand side of the display keeps your history of searches and visualizations available to you even if you close the page and return later.
