<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>WordSeer Project Page</title>
	<atom:link href="http://wordseer.berkeley.edu/feed/" rel="self" type="application/rss+xml" />
	<link>http://wordseer.berkeley.edu</link>
	<description>Project Page</description>
	<lastBuildDate>Mon, 07 Jan 2013 22:33:11 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.2</generator>
		<item>
		<title>How do you read? An analysis of survey responses.</title>
		<link>http://wordseer.berkeley.edu/how-do-you-read-survey/</link>
		<comments>http://wordseer.berkeley.edu/how-do-you-read-survey/#comments</comments>
		<pubDate>Tue, 09 Oct 2012 18:08:29 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Digital Humanities]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=552</guid>
		<description><![CDATA[A big question for me, as a designer of text analysis tools for the humanities is: how do the tools I&#8217;m building fit in? Sure, you can have fancy word trees and grammatical search histograms. Sure, they&#8217;re chock-full of interesting<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://wordseer.berkeley.edu/how-do-you-read-survey/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p>A big question for me, as a designer of text analysis tools for the humanities is: how do the tools I&#8217;m building fit in? Sure, you can have fancy word trees and grammatical search histograms. Sure, they&#8217;re chock-full of interesting information that you can make an argument about. But where exactly in the humanistic analysis process does a scholar need things like that? I have no idea.</p>
<p>But there&#8217;s more. I don&#8217;t just build tools, I build <em>environments</em>. And that means support for reading the text, navigating it, searching it, and (most importantly) &#8220;working&#8221; with it. And I have no idea what that means either. So over the past few weeks I&#8217;ve been having hour-long chats with late-stage PhD students from the literature and history departments, and asking them to tell me about how they do research. I asked all kinds of confusing and mundane questions like, &#8220;How do you decide what to underline?&#8221; and, &#8220;Can you define formalism for me?&#8221; and, &#8220;You mean you actually copy it out by hand?&#8221; and, &#8220;How do you organize all the quotes you collect?&#8221; and, &#8220;How do you go about proving that?&#8221; and, &#8220;So you scanned in <em>everything</em> in those boxes?&#8221;</p>
<p>I only did twelve of those interviews, but patterns began to emerge.  So I did a <a title="How do you read? A survey" href="http://wordseer.berkeley.edu/how-do-you-read.html" target="_blank">survey</a>. A simple one, with six questions about reading habits. This survey&#8217;s purpose was to confirm whether some of the patterns I noticed around reading were general. If you just want the charts summarizing the responses, you can find them <a title="Summary of responses" href="https://docs.google.com/spreadsheet/gform?key=0AvQrnc-ag48BdHZkMFZNeE9rRGZXeHExaGU3QXBFTHc&amp;gridId=0#chart" target="_blank">here</a> (those numbers include around 20 more responses I got while I was writing this post). For a full analysis in which I extract some general patterns in humanities scholars&#8217; reading processes, read on.</p>
<p><span id="more-552"></span></p>
<p>This was the first page of the survey, with two screener questions. Over two days, I got 153 responses from humanities scholars whose primary sources were mostly textual (and 18 responses from others, but I removed those responses for this analysis). <a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q0.png"><img class="aligncenter size-full wp-image-560" title="q0" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q0.png" alt="Filter questions: are you a humanities scholar? Are your primary sources mostly textual?" width="592" height="475" /></a></p>
<h1>Working with text</h1>
<p>My first question was about copied-out snippets. From what I heard in the interviews, humanities scholars working on a project eventually reach a point where they have some interpretation, some interesting angle they want to take. At this point, they start <em>actively</em> reading and re-reading their primary sources. Snippets begin to collect in their notes in large numbers.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q1.png"><img class="aligncenter size-full wp-image-561" title="q1" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q1.png" alt="Do you  copy out snippets of text from your primary sources?" width="595" height="210" /></a></p>
<p>The results are below. An overwhelming 90% of textual humanities scholars surveyed copy out snippets either frequently or occasionally.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a1.png"><img class="aligncenter size-full wp-image-568" title="a1" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a1.png" alt="Answers to &quot;Do you copy out snippets of text from your primary sources into your notes?&quot;" width="595" height="174" /></a></p>
<h2>What are snippets for?</h2>
<p>But why? From what the interviewees said, copied-out snippets seem to be in a different category from other annotations (such as margin notes, underlines, highlights, or circles on the page). In fact, copying out seems reserved for a more important class of items. But what makes these copied-out snippets so special? This was my next question.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q2.png"><img class="aligncenter size-full wp-image-562" title="q2" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q2.png" alt="Why do you copy out snippets of text" width="595" height="159" /></a></p>
<p>The results are below, and the responses confirmed the interviews. Most of the textual humanities scholars surveyed said that snippets were evidence, interesting or thought-provoking passages, or examples of something they were looking for.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a2.png"><img class="aligncenter size-full wp-image-569" title="a2" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a2.png" alt="Answer to why do you copy out snippets of text" width="595" height="131" /></a></p>
<p>In addition to the 4 existing choices, there were 28 &#8220;other&#8221; responses, which fell into these groups:</p>
<ol>
<li>As a way to store quotations I think I&#8217;ll need (5 such responses)</li>
<li>Writing it out helps me understand and think about it (2)</li>
<li>Writing it out helps me remember it (5)</li>
<li>The snippet is a good summary of an author&#8217;s point or argument (4)</li>
<li>They help me outline a longer argument (3)</li>
<li>I need to translate it into another language (3)</li>
</ol>
<p>It seems that copied-out snippets play a role very much like the evidence a lawyer lays before a jury. They provoke thought, they justify an interpretation, they are examples that support an argument (or indeed counterexamples that need to be explained away). A scholar amasses many such &#8220;pieces of argument&#8221; and then organizes them to tell a coherent story.</p>
<h2>What do snippets look like?</h2>
<p>As a tool builder and an information retrievalist I need to know: how long are these snippets, and is there accompanying information that users will want to add?</p>
<h3>Snippet length</h3>
<p>Are snippets a few words long? A few paragraphs?</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q3.png"><img class="aligncenter size-full wp-image-563" title="q3" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q3.png" alt="What do snippets look like?" width="595" height="225" /></a></p>
<p>The results are below. They revealed that snippets varied greatly in length.</p>
<ul>
<li>Many respondents had &#8220;some&#8221; or &#8220;many&#8221; snippets between a few words and a paragraph long.</li>
<li>About 25% of respondents had &#8220;some&#8221; or &#8220;many&#8221; snippets that were longer than a paragraph.</li>
</ul>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-1.png"><img class="aligncenter size-full wp-image-570" title="a3-1" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-1.png" alt="" width="585" height="182" /></a></p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-2.png"><img class="aligncenter size-full wp-image-571" title="a3-2" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-2.png" alt="" width="575" height="174" /></a><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-3.png"><img class="aligncenter size-full wp-image-572" title="a3-3" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-3.png" alt="" width="594" height="184" /></a><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-4.png"><img class="aligncenter size-full wp-image-573" title="a3-4" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-4.png" alt="" width="583" height="177" /></a><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-5.png"><img class="aligncenter size-full wp-image-574" title="a3-5" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a3-5.png" alt="" width="589" height="174" /></a></p>
<h3>Accompanying information</h3>
<p>The interviews suggested that in addition to the literal text of the snippets themselves, there were often various kinds of notes as well as visual finding aids and citation information.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q4.png"><img class="aligncenter size-full wp-image-564" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q4.png" alt="" width="586" height="313" /></a></p>
<p>The results are below. Notes about why the snippet was relevant were very common.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-1.png"><img class="aligncenter size-full wp-image-575" title="a4-1" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-1.png" alt="" width="590" height="226" /></a></p>
<p>And so were ideas that the scholars got from the copied-out snippet.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-2.png"><img class="aligncenter size-full wp-image-576" title="a4-2" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-2.png" alt="" width="587" height="217" /></a></p>
<p>By far the most common was citation information: 76% <strong>always</strong> added it.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-5.png"><img class="aligncenter size-full wp-image-579" title="a4-5" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-5.png" alt="" width="577" height="218" /></a></p>
<p>Other markers, such as post-it flags, colored highlights, and tags to enable keyword search were less common, but also done: <a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-3.png"><img class="aligncenter size-full wp-image-577" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-3.png" alt="" width="595" height="224" /></a> <a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-4.png"><img class="aligncenter size-full wp-image-578" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a4-4.png" alt="" width="595" height="221" /></a></p>
<p>For a tool designer, the message is clear. If you want humanities scholars to <em>read</em> text using your tool, you must support all of the above activities.</p>
<h1>Collecting evidence</h1>
<p>When a scholar wonders, &#8220;Where else have I seen that before?&#8221; or, &#8220;Is this an unusual exception?&#8221; or, &#8220;Is this a pattern?&#8221;, my interviewees told me that one of two things happens: either they had thought of this before, and relevant passages were already copied into their notes, or they&#8217;d have to go back and re-read the relevant texts to search for evidence. To see if this was a general pattern, I created the next two survey questions.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q5.png"><img class="aligncenter size-full wp-image-565" title="q5" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q5.png" alt="When do you revisit and re-read textual primary sources that you've already read?" width="585" height="267" /></a></p>
<p>The responses are below, and they confirm my interviewees&#8217; responses. 79% re-read when they had a new idea or interpretation, 65% when they had a new hypothesis, and 67% when they noticed a new pattern.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a5.png"><img class="aligncenter size-full wp-image-580" title="a5" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a5.png" alt="Answers to why do you revisit sources you've already read." width="595" height="149" /></a></p>
<p>There were also 13 &#8220;other&#8221; responses, which fell into the following categories:</p>
<ol>
<li>Just for fun (2 responses)</li>
<li>When I teach (5)</li>
<li>When I find my notes don&#8217;t have everything I need (3)</li>
<li>To understand it better (2)</li>
<li>To compare with other documents (1)</li>
</ol>
<p>But what do scholars want to do with the text they re-read? This was my final question. I wanted to compare how note-taking behavior varied between first-time reading and re-reading:</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q6.png"><img class="aligncenter size-full wp-image-566" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/q6.png" alt="" width="592" height="384" /></a></p>
<p>The results are below. Compare how frequently scholars copied out snippets the first time (top) with when they were re-reading (bottom). The copy-out rate is pretty much the same, or perhaps even a little higher while re-reading.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a6-2.png"><img class="aligncenter size-full wp-image-582" title="Copying out snippets the first time through a text" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a6-2.png" alt="" width="595" height="194" /></a> <a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a6-4.png"><img class="aligncenter size-full wp-image-584" title="Copying out snippets the second time through the text" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2012/10/a6-4.png" alt="" width="595" height="178" /></a></p>
<p>These responses reveal an inefficiency in the &#8220;finding evidence&#8221; portion of the scholarly process. Reading is a great way to understand, to learn, to remember. But it is a very inefficient way to search for something. First, it&#8217;s very slow (and when you speed it up, it starts to lose its reliability). Second, it&#8217;s very subject to your state of mind (sometimes things pop out at you, sometimes they don&#8217;t). And third, it&#8217;s impossible to do thoroughly for very large collections. Yes, you need to <em>revisit</em>, <em>re-find</em> and <em>re-acquaint</em> yourself with the material. But what I object to is that you often have to <em>re-read</em> in order to do it.</p>
<p>A (partial) solution already exists: full-text digitization and search. Search can take you a long way, especially if your examples are associated with particular words. However, there are a great many sophisticated information retrieval technologies that go beyond keyword search. We can retrieve text passages by similarity; we can allow you to mark relevant passages and return more like those; we can train classifiers on what you&#8217;ve marked interesting, and have them automatically classify text; and we can use the Google Translate API to help identify foreign words, and online dictionaries to find synonyms.</p>
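<p>To make &#8220;retrieve text passages by similarity&#8221; concrete, here is a minimal sketch in Python. It assumes the simplest possible representation (word counts compared by cosine similarity); the function names are hypothetical and are not part of WordSeer, and a real system would likely add stemming, stop-word handling, and term weighting.</p>

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a passage and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, passages, top_n=3):
    """Rank passages by similarity to a query passage (hypothetical helper).

    Returns a list of (score, passage) pairs, best match first.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

<p>Passages that share no words with the query score zero, which is exactly the weakness of pure word matching that the more sophisticated techniques try to address.</p>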
<p>Based on the answers to these two questions, it seems to me that the following three &#8220;finding a snippet-of-text&#8221; problems might really benefit from a little information retrieval and visualization, because finding them by reading is especially hard, and formulating them as search queries can be difficult:</p>
<ol>
<li>Find me more examples like this</li>
<li>Find me other places in the text where this happens</li>
<li>Show me all the places in the text where this concept comes up</li>
</ol>
<h1>In conclusion</h1>
<p>Together, the survey and interviews helped me understand the mechanics of scholarly reading. As I interpret it, humanities scholars working around <em>textual primary sources</em> follow a process like this (and this is why I&#8217;m blogging, so you can all violently disagree with me in the comments):</p>
<ol>
<li>Scholars begin a project with a (sometimes vague) hypothesis, interest, or interpretation in mind</li>
<li>They read primary sources to solidify their understanding and find evidence for their arguments</li>
<li>They notice or realize things while reading passages from the text</li>
<li>They copy out the passages if they are sufficiently thought-provoking, provide evidence, or are relevant to an interpretation.</li>
<li>They add information to the passages they copy out:
<ol>
<li>Why the passage is relevant/ how it fits into their argument</li>
<li>Ideas they got from the passage</li>
<li>Citation information so they can find it and cite it properly later</li>
</ol>
</li>
<li>If they haven&#8217;t already collected the necessary supporting material, they search, read and re-read other texts to find more support.</li>
<li>When they find supporting or relevant passages, they copy them out (see step 5)</li>
<li>They curate their collected passages (&#8220;pieces of argument&#8221;) into a written product representing their argument</li>
</ol>
<p>Laying out these steps has helped me come up with design requirements. For example, even if no extra visualizations and analysis tools are added, and the interface only supports reading, these steps told me the basic &#8220;reading tools&#8221; my system must have:</p>
<ol>
<li>Copying out variable-length snippets of text into a note-taking area</li>
<li>Preserving the link between the copied-out snippet and its place in the text</li>
<li>Taking notes around a snippet</li>
<li>Tagging and keyword search</li>
<li>Exporting all of the above into a format compatible with MS Word or similar, so that scholars can integrate it with their other work</li>
</ol>
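<p>As a rough illustration of these requirements, here is a sketch of how a copied-out snippet might be represented. Everything in it is an assumption made for illustration: the <code>Snippet</code> class, its field names, and the use of character offsets to preserve the link back to the source are hypothetical, not a description of how WordSeer stores notes.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Snippet:
    """One copied-out passage plus the information scholars attach to it."""
    text: str           # the literal copied-out passage, of any length
    source_id: str      # hypothetical identifier of the source document
    start: int          # character offset where the passage begins
    end: int            # character offset just past the end of the passage
    citation: str = ""  # citation information, so it can be cited later
    notes: list = field(default_factory=list)  # relevance notes and ideas
    tags: set = field(default_factory=set)     # tags for keyword search


def copy_out(source_id, source_text, start, end, citation=""):
    """Copy a variable-length passage while remembering where it came from."""
    return Snippet(source_text[start:end], source_id, start, end, citation)


def search_by_tag(snippets, tag):
    """Return the snippets that carry the given tag."""
    return [s for s in snippets if tag in s.tags]


def export_plain_text(snippets):
    """Render snippets as plain text that a word processor can open."""
    blocks = []
    for s in snippets:
        lines = ['"' + s.text + '"', "Source: " + s.citation]
        lines += ["Note: " + note for note in s.notes]
        if s.tags:
            lines.append("Tags: " + ", ".join(sorted(s.tags)))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

<p>Storing offsets alongside the copied text is one simple way to jump from a note back to its place in the source; it assumes the source text does not change after the snippet is taken.</p>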
<p>Next, if we&#8217;re talking about systems with keyword search functionality, steps 6 and 7 make it clear that the ability to save, organize, and annotate search results is <em>key</em>. This is in addition to the ability to switch seamlessly between looking at search results and reading the source text surrounding any particular search result.</p>
<p>If we want to get more sophisticated with information retrieval (which I do), the three &#8220;finding snippets of text&#8221; problems I identified above need attention.</p>
<p>But what about when you have visualizations? Coming back to my original question, where do these fit in? I&#8217;m still not sure, but I think the steps I&#8217;ve found above will give me a good place to start looking for answers.</p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/how-do-you-read-survey/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Empirical Study: Finding Examples of a Theme, by Example</title>
		<link>http://wordseer.berkeley.edu/finding-examples-of-a-theme-by-example/</link>
		<comments>http://wordseer.berkeley.edu/finding-examples-of-a-theme-by-example/#comments</comments>
		<pubDate>Tue, 24 Jul 2012 01:34:17 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Digital Humanities]]></category>
		<category><![CDATA[Natural Language Processing]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=538</guid>
		<description><![CDATA[A common task in literature study is to find examples of a theme. Until now, literary scholars searching for examples have had to rely on searching for sets of words they think are associated with the theme. Theme-finding by searching<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://wordseer.berkeley.edu/finding-examples-of-a-theme-by-example/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p>A common task in literature study is to find examples of a theme. Until now, literary scholars searching for examples have had to rely on searching for sets of words they think are associated with the theme.</p>
<p>Theme-finding by searching for words poses a problem. Synonymy and the infinite variance of language mean that the same theme might surface in many different forms using many different words. Even for scholars with intimate knowledge of the text, a single set of words is not enough. Depending on their mental context, the words that come to mind might not always be complete and representative.</p>
<p>For example, take the Shakespearean theme of &#8220;seeing is believing&#8221; &#8212; that seeing an event with one&#8217;s own eyes is more credible than hearing about it second-hand. A scholar might search for the words &#8220;believe&#8221;, &#8220;speak&#8221;, &#8220;eyes&#8221;, and &#8220;see&#8221;. That search might be able to capture this example (from The Winter&#8217;s Tale 5.2):</p>
<blockquote><p>Then have you lost a sight, which was to be seen, can not be spoken of.</p></blockquote>
<p>but not this one (from King Lear 4.6):</p>
<blockquote><p>I would not take this from report; it is, And my heart breaks at it.</p></blockquote>
<p>As a solution, we at WordSeer propose search-by-example. <a title="relevance feedback" href="http://en.wikipedia.org/wiki/Relevance_feedback">This technology</a> dates back to the &#8217;80s in the field of information retrieval, and so far, it&#8217;s been successful in helping find relevant documents. We think it could work for theme-finding too.</p>
<p>With search-by-example, instead of inferring which words represent a theme, and then searching for those words, a scholar can search for sentences that match <em>a set of examples</em>. A scholar marks a set of examples of a theme, and the system returns a list of sentences it thinks are relevant.</p>
<p>This process is a cycle. When the system returns results, the scholar gives it feedback by labeling sentences &#8220;relevant&#8221; if they match the theme, and &#8220;not-relevant&#8221; if they don&#8217;t. The system gradually builds a model of what the scholar is interested in, and eventually returns results that are mostly relevant.</p>
<p>For example, in under five minutes, I was able to use the examples above to come up with seven more candidates:</p>
<blockquote><p>Gracious my lord, I should report that which I say I saw, But know not how to do&#8217;t. (Macbeth 5.5)</p>
<p>Most noble sir, That which I shall report will bear no credit, Were not the proof so nigh. (Winter&#8217;s Tale 5.1)</p>
<p>I would not hear your enemy say so, Nor shall you do mine ear that violence, To<br />
make it truster of your own report Against yourself: I know you are no truant. (Hamlet 1.2)</p>
<p>If in Naples I should report this now, would they believe me? (The Tempest 3.3)</p>
<p>They call him Doricles; and boasts himself To have a worthy feeding: but I have it Upon his own report and I believe it; He looks like sooth. (Winter&#8217;s Tale 4.4)</p>
<p>It is not so; thou hast misspoke, misheard; Be well advised, tell o&#8217;er thy tale again: It can not be thou dost but say &#8217;tis so: I trust I may not trust thee; for thy word Is but the vain breath of a common man: Believe me, I do not believe thee, man; I have a king&#8217;s oath to the contrary. (King John 3.1)</p>
<p>I do beseech you, either not believe The envious slanders of her false accusers; Or, if she be accused on true report, Bear with her weakness, which, I think, proceeds From wayward sickness, and no grounded malice. (Richard III 1.3)</p></blockquote>
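<p>The feedback cycle can be sketched in a few lines of Python. This is an illustration only: it uses a Rocchio-style update, a classic relevance feedback technique, and is not a description of the system used here. The function names and the <code>beta</code> and <code>gamma</code> weights are assumptions.</p>

```python
import math
import re
from collections import Counter


def vectorize(sentence):
    """Bag-of-words vector for one sentence."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def rocchio(relevant, not_relevant, beta=0.75, gamma=0.15):
    """Build a weighted query from labeled sentences (Rocchio-style).

    Words from "relevant" sentences gain weight, words from
    "not-relevant" sentences lose weight, and only words that end up
    with a positive weight are kept.
    """
    query = Counter()
    for s in relevant:
        for word, count in vectorize(s).items():
            query[word] += beta * count / len(relevant)
    for s in not_relevant:
        for word, count in vectorize(s).items():
            query[word] -= gamma * count / len(not_relevant)
    return {w: v for w, v in query.items() if v > 0}


def score(query, sentence):
    """Length-normalized overlap between the query and a sentence."""
    vec = vectorize(sentence)
    dot = sum(weight * vec[w] for w, weight in query.items())
    length = math.sqrt(sum(c * c for c in vec.values()))
    return dot / length if length else 0.0


def rank(query, candidates):
    """Order candidate sentences from most to least relevant."""
    return sorted(candidates, key=lambda s: score(query, s), reverse=True)
```

<p>Each round of labeling calls <code>rocchio</code> again with the enlarged sets of &#8220;relevant&#8221; and &#8220;not-relevant&#8221; sentences, then re-ranks the remaining candidates with the updated query.</p>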
<p>Of course, this is all theory until it&#8217;s been proven to work. And while I&#8217;m not a Shakespeare scholar, I did build this particular system, so it might not be surprising that I can get a few results out of it.</p>
<p>So to find out whether search-by-example works, we&#8217;ve designed a five-minute study around three Shakespearean themes. There are three systems: one search, and two different example-based ones. Participants are shown an example of a theme, and asked to use a system to find as many relevant results as they can in five minutes. The systems and theme are randomly assigned.</p>
<p>We&#8217;ll find our answer by comparing the quality and quantity of the sentences the participants find on the three systems. Expert scholars will help us judge quality: they will rate the relevance of sentences the different systems produce (without knowing which system produced which sentence). For quantity, there is a time limit &#8212; which system produces more high-quality results in five minutes?</p>
<p>So, does example-based exploration work better than search for theme finding?</p>
<p>If you have five minutes, you can help us find out by participating in the study:</p>
<p><a title="WordSeer Themes Usability Study" href="http://wordseer.berkeley.edu/themes/">http://wordseer.berkeley.edu/themes/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/finding-examples-of-a-theme-by-example/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>WordSeer 2: Test users wanted</title>
		<link>http://wordseer.berkeley.edu/wordseer-2-test-users-wanted-2/</link>
		<comments>http://wordseer.berkeley.edu/wordseer-2-test-users-wanted-2/#comments</comments>
		<pubDate>Wed, 11 Jul 2012 13:00:45 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Digital Humanities]]></category>
		<category><![CDATA[WordSeer]]></category>
		<category><![CDATA[digital humanities]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=521</guid>
		<description><![CDATA[A new version of WordSeer is in the works. It&#8217;s been guided by the advice of our long-suffering literature-scholar collaborators. And by the tales of frustration and trial-and-error of the students of the Hamlet class who tried to use WordSeer to<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://wordseer.berkeley.edu/wordseer-2-test-users-wanted-2/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p>A new version of WordSeer is in the works.</p>
<p>It&#8217;s been guided by the advice of our long-suffering literature-scholar collaborators. And by the <a title="The blog of English 203 students at the University of Calgary" href="http://engl203.ucalgaryblogs.ca/category/ph1-wordseer/" target="_blank">tales</a> of frustration and trial-and-error of the students of the <a title="English 203: Michael Ullyot" href="http://ullyot.ucalgaryblogs.ca/teaching/hamlet/" target="_blank">Hamlet class</a> who tried to use WordSeer to analyze parts of the play. We also thought hard about the text analysis process as a series of steps. &#8220;What might Tanya Clement have been thinking and doing at each stage of her <a href="http://scholar.google.com/scholar?cluster=7775081911625158953" target="_blank">computational analysis of repetition</a> in Gertrude Stein&#8217;s <a class="zem_slink" title="The Making of Americans" href="http://en.wikipedia.org/wiki/The_Making_of_Americans" rel="wikipedia" target="_blank">The Making of Americans</a>?&#8221; &#8220;What about when we <a title="Men and Women in Shakespeare" href="http://mininghumanities.com/2012/01/24/men-and-women-in-shakespeare/" target="_blank">analyzed language use differences</a> in the descriptions of men and women in Shakespeare?&#8221; Out of this has come a better (we hope) understanding of the needs of scholars of text in the humanities.</p>
<p>We&#8217;ve completely rebuilt WordSeer. Instead of a traditional web application with a different visualization on each page, WordSeer now works more like an environment. Almost like a desktop &#8212; with windows and menu bars and persistent, useful, objects.</p>
<p>However, as researchers in <a class="zem_slink" title="Human–computer interaction" href="http://en.wikipedia.org/wiki/Human%E2%80%93computer_interaction" rel="wikipedia" target="_blank">Human-Computer Interaction</a>, we know that we need to do user studies. First, we need to check whether we&#8217;re on the right track. Do our improvements make for a better experience than the old version? More importantly, we need more observations. To understand the humanities text analysis process, we want to observe more humanities text analysis.</p>
<p>Until now, the closest we&#8217;ve come to &#8220;user studies&#8221; is an iterative bouncing-around of ideas with just three scholars. They have been more like guides and expert consultants than &#8220;users&#8221;, and they helped us sketch the first lines and refine our first ideas into something that was actually useful.</p>
<p>We&#8217;ve acted upon the knowledge they helped us accumulate, the result of which is the completely redesigned WordSeer. We&#8217;re looking for a bigger set of users now, for a formal study. We&#8217;re hoping to find a set of around 15 professional literature scholars who will allow us to observe them as they use WordSeer to explore a problem of genuine professional interest to them.</p>
<p>So what text collection could possibly interest 15 different scholars in the digital humanities community enough to want to do a computationally-assisted analysis of it? And allow us to observe them at it?</p>
<p>In a rare moment of epiphany, we realized we could just <em>ask</em> you. So here&#8217;s a poll. It&#8217;s populated with some examples, but we encourage you to respond in the &#8220;other&#8221; field. Tell us: what collection, if set up with text analysis and visualization tools, would make <em>you</em> interested?</p>
<p>[polldaddy poll=6382760]</p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/wordseer-2-test-users-wanted-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>WordSeer: &#8220;love&#8221; in Shakespeare&#8217;s tragedies and comedies</title>
		<link>http://wordseer.berkeley.edu/wordseer-compares-love-in-tragedies-and-comedies/</link>
		<comments>http://wordseer.berkeley.edu/wordseer-compares-love-in-tragedies-and-comedies/#comments</comments>
		<pubDate>Fri, 16 Dec 2011 01:16:08 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Information Seeking]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[WordSeer]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=430</guid>
		<description><![CDATA[When scholars try to make sense out of large collections of text, they frequently do two things: compare, and collect. They collect samples of &#8220;interesting&#8221; things, and compare them with each other along various relevant dimensions. In this post, I<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://wordseer.berkeley.edu/wordseer-compares-love-in-tragedies-and-comedies/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p>When scholars try to make sense out of large collections of text, they frequently do two things: <em>compare</em>, and <em>collect</em>. They collect samples of &#8220;interesting&#8221; things, and compare them with each other along various relevant dimensions.</p>
<p>In this post, I demonstrate the collection and comparison features of WordSeer by using it to compare the usage of the word &#8220;love&#8221; in Shakespeare&#8217;s comedies and tragedies. You can watch the screencast, or simply read on.</p>
<p>[youtube http://www.youtube.com/watch?v=DPhQQExQjZ4]</p>
<p><span id="more-430"></span></p>
<div id="attachment_432" class="wp-caption aligncenter" style="width: 547px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/02-tragedies-collection.png"><img class="size-full wp-image-432" title="02-tragedies-collection" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/02-tragedies-collection.png" alt="" width="537" height="284" /></a><p class="wp-caption-text">Figure 1. Creating a new collection called &quot;tragedies&quot;</p></div>
<p>The first thing to do is collect the comedies and tragedies into separate lists.  To do this, I created a new collection called &#8220;tragedies&#8221; using the new &#8220;collections&#8221; feature.</p>
<div id="attachment_431" class="wp-caption aligncenter" style="width: 605px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/01-all-plays.png"><img class="size-full wp-image-431" title="01-all-plays" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/01-all-plays.png" alt="" width="595" height="367" /></a><p class="wp-caption-text">Figure 2. The list of plays in WordSeer, sorted by title.</p></div>
<p>Next, I had to collect all of Shakespeare&#8217;s tragedies into that collection.  Figure 2 shows WordSeer&#8217;s list of plays.  I walked down this list and clicked the checkboxes next to the tragedies, using Wikipedia as an authoritative source of tragedies.</p>
<div id="attachment_433" class="wp-caption aligncenter" style="width: 155px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/02-add-items.png"><img class="size-full wp-image-433" title="02-add-items" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/02-add-items.png" alt="" width="145" height="92" /></a><p class="wp-caption-text">Figure 3. The Add Items button</p></div>
<p>Once I&#8217;d selected all the tragedies, I  clicked  the &#8220;Add Items&#8221; button to add them to a collection.  I selected the &#8220;tragedies&#8221; collection and added the plays.</p>
<div id="attachment_434" class="wp-caption aligncenter" style="width: 605px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/03-tragedies.png"><img class="size-full wp-image-434" title="03-tragedies" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/03-tragedies.png" alt="" width="595" height="391" /></a><p class="wp-caption-text">Figure 4. Adding some of the tragedies to the collection</p></div>
<p>This populated the collection with the plays. I did the same for the comedies, ending up with two collections.</p>
<div id="attachment_435" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/05-collections.png"><img class=" wp-image-435 " title="05-collections" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/05-collections.png" alt="" width="500" /></a><p class="wp-caption-text">Figure 5. The two collections. The &quot;comedies&quot; collection is currently open.</p></div>
<p>I was now ready to compare my collections. I opened up two windows to the heat map view. One was going to visualize the tragedies, and one the comedies.</p>
<div id="attachment_436" class="wp-caption aligncenter" style="width: 605px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/06-heatmap-setup.png"><img class="size-full wp-image-436" title="06-heatmap-setup" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/06-heatmap-setup.png" alt="" width="595" height="312" /></a><p class="wp-caption-text">Figure 6. Setting up the heat maps. One window visualized the &quot;tragedies&quot; collection, and the other window visualized &quot;comedies&quot;.</p></div>
<p>Finally, I was ready to compare the two. I was interested in the word &#8220;love&#8221;, and whether there would be any differences in how frequently it was used in the comedies and the tragedies. To that end, I typed &#8220;love&#8221; into the comedies window and got the heat map in Figure 7.</p>
<div id="attachment_438" class="wp-caption aligncenter" style="width: 605px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/07-comedies-love.png"><img class="size-full wp-image-438" title="07-comedies-love" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/07-comedies-love.png" alt="" width="595" height="352" /></a><p class="wp-caption-text">Figure 7. The occurrences of &quot;love&quot; in Shakespeare&#039;s comedies. Each column is a play, and each highlighted block marks a place where the word &quot;love&quot; occurred.</p></div>
<p>Not surprisingly, &#8220;love&#8221; is everywhere. But what about the tragedies? In the other window, typing in &#8220;love&#8221; yielded the results in Figure 8.</p>
<div id="attachment_439" class="wp-caption aligncenter" style="width: 605px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/08-tragedies-love1.png"><img class="size-full wp-image-439" title="08-tragedies-love" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/08-tragedies-love1.png" alt="" width="595" height="358" /></a><p class="wp-caption-text">Figure 8. The occurrences of &quot;love&quot; in Shakespeare&#039;s tragedies.</p></div>
<p>To my surprise, the tragedies were equally full of &#8220;love&#8221;. Which, among other things, reveals my poor knowledge of Shakespeare.</p>
<p>Still, the hope is that our Shakespeare scholar, <a title="Michael Ullyot" href="http://ucalgary.academia.edu/ullyot">Michael Ullyot</a> (<a href="http://twitter.com/#!/ullyot">@ullyot</a>), will use collections and heat maps to discover something truly interesting.</p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/wordseer-compares-love-in-tragedies-and-comedies/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Beautiful&#8221; in Shakespeare</title>
		<link>http://wordseer.berkeley.edu/beautiful-in-shakespeare-2/</link>
		<comments>http://wordseer.berkeley.edu/beautiful-in-shakespeare-2/#comments</comments>
		<pubDate>Wed, 07 Dec 2011 19:39:15 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[WordSeer]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=404</guid>
		<description><![CDATA[A common problem in search and exploration interfaces is the vocabulary problem. This refers to the great variety of words that different people can use to describe the same concept. For people exploring a text collection, this makes search difficult. There<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://wordseer.berkeley.edu/beautiful-in-shakespeare-2/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p>A common problem in search and exploration interfaces is the <em>vocabulary problem. </em>This refers to the great variety of words that different people can use to describe the same concept. For people exploring a text collection, this makes search difficult. There are only a limited number of different queries they can think of to describe that concept, so they may be missing many other instances that use different words. This is an important issue for humanities scholars. Often, the very first step of a literature analysis is to comb through text, trying to find thought-provoking examples to study later.</p>
<p>In this post, I give an example of how our project <a title="WordSeer project page" href="http://cs.berkeley.edu/~aditi/projects/wordseer.html" target="_blank">WordSeer</a>, a text analysis environment for humanities scholars, can be used to overcome this problem. In this example, I&#8217;ll be using an instance of WordSeer running on the complete works of Shakespeare from the <a title="Internet Shakespeare Editions" href="http://internetshakespeare.uvic.ca/" target="_blank">Internet Shakespeare Editions</a>. It&#8217;s live, so you can follow along with this example on the web at <a title="Shakespeare WordSeer" href="http://wordseer.berkeley.edu/shakespeare" target="_blank">wordseer.berkeley.edu/shakespeare</a>.</p>
<p>You can read the post after the jump, or just watch this video.</p>
<p>[youtube http://www.youtube.com/watch?v=OXkuOzl9GrI]</p>
<p><span id="more-404"></span></p>
<p>I began with a simple question, &#8220;What are some things that are &#8216;beautiful&#8217; in Shakespeare?&#8221;. Normally, this would be a challenging question by itself. WordSeer, however, uses <em>grammatical search</em>. This is a search feature that goes beyond keyword search and instead also searches over grammatical relationships between words. These are relationships such as subject-object and modifier-subject. For example, in the sentence &#8220;The good God has given every man intellect&#8221;, there is a relationship between &#8220;good&#8221; and &#8220;God&#8221; &#8212; &#8220;good&#8221; is an adjective that <em>modifies</em> &#8220;God&#8221;. There is also a verb-agency relationship between &#8220;God&#8221; and &#8220;given&#8221;, and a verb-object relationship between &#8220;given&#8221; and &#8220;man&#8221;. For more than a decade, it has been possible to automatically extract such relationships from text using computational linguistics algorithms. WordSeer uses these well-known algorithms to analyze the works of Shakespeare (in this case) and allow users to search over them.</p>
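<p>To make the idea concrete, here is a minimal sketch of how such relationships could be stored and searched. It is not WordSeer&#8217;s actual code: the <code>Dependency</code> tuple, the <code>search</code> function, and the hand-written relation labels are all assumptions made for illustration, and a real system would get the relationships from a parser rather than from a hard-coded list.</p>

```python
from collections import namedtuple

# One grammatical relationship: relation(governor, dependent).
Dependency = namedtuple("Dependency", ["relation", "governor", "dependent"])

# Hand-written relationships for the example sentence
# "The good God has given every man intellect".
# The labels are illustrative assumptions, not WordSeer's real output.
EXAMPLE = [
    Dependency("amod", "God", "good"),         # "good" modifies "God"
    Dependency("nsubj", "given", "God"),       # "God" is the agent of "given"
    Dependency("iobj", "given", "man"),        # "man" is an object of "given"
    Dependency("dobj", "given", "intellect"),  # "intellect" is what is given
]


def search(dependencies, relation, governor=None, dependent=None):
    """Return relationships matching the query; None acts as a blank slot."""
    return [
        d for d in dependencies
        if d.relation == relation
        and governor in (None, d.governor)
        and dependent in (None, d.dependent)
    ]


# "How is God described?" leaves the adjective slot blank.
print([d.dependent for d in search(EXAMPLE, "amod", governor="God")])   # ['good']
# "What is done by God?" leaves the verb slot blank.
print([d.governor for d in search(EXAMPLE, "nsubj", dependent="God")])  # ['given']
```

<p>Leaving a slot as <code>None</code> plays the same role as leaving a box blank in the search form: every word that fills that slot is returned.</p>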
<div id="attachment_410" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/described-as-beautiful.png"><img class=" wp-image-410 " title="described-as-beautiful" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/described-as-beautiful.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 1: The query &quot;____ described as beautiful&quot;, which retrieves all words to which the adjective &quot;beautiful&quot; has been applied.</p></div>
<p>In the case of my question, &#8220;What are some things that are &#8216;beautiful&#8217; in Shakespeare?&#8221; I can use grammatical search to good effect, as shown in Figure 1 above. I leave the left-hand-side box blank to retrieve all matches, select the &#8220;described as&#8221; grammatical relationship, and put in &#8220;beautiful&#8221; to create the fill-in-the blanks query &#8220;_____ <strong>described-as</strong> beautiful&#8221;. I press &#8220;Go&#8221; to search, and WordSeer returns all the matching sentences. These are sentences containing the adjective &#8220;beautiful&#8221;, applied to some other word. I get the results in Figure 2.</p>
<div id="attachment_411" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/beautiful-go-to-results.png"><img class=" wp-image-411   " title="beautiful-go-to-results" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/beautiful-go-to-results.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 2. Results for the query _____ described-as beautiful. To my alarm, there was only one match!</p></div>
<p>To my alarm, there was only one match: the sentence &#8220;His youngest daughter, beautiful Bianca&#8221;, from <em>The Taming of the Shrew</em>. Concerned that my algorithms had malfunctioned, I did a simple search for the <em>word</em> beautiful, without any grammatical relationships. Lo and behold, there were only 16 results (Figure 3).</p>
<div id="attachment_412" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/beautiful-only-search.png"><img class=" wp-image-412 " title="beautiful-only-search" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/beautiful-only-search.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 3. Search results for the word &quot;beautiful&quot; in Shakespeare. There were only 16 results.</p></div>
<p>I had encountered the <strong><em>vocabulary problem. </em></strong>Not being a Shakespeare scholar, I couldn&#8217;t think of any other words that could have been used instead of &#8220;beautiful&#8221;, and it seemed preposterous that these were the only results in all of Shakespeare. There must be other words.</p>
<div id="attachment_413" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/results-bianca.png"><img class=" wp-image-413 " title="results-bianca" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/results-bianca.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 4. Clicking on the &quot;book&quot; icon next to a search result opens a new window with the full text of the play, and automatically scrolls to, and highlights, the matching sentence.</p></div>
<p>To investigate further, I decided to read some of the context around the word &#8220;beautiful&#8221;. To do this, I clicked on the &#8220;book&#8221; icon to the left of my &#8220;beautiful Bianca&#8221; search result.  This brought up a new window (Figure 4) with the full text of <em>The Taming of the Shrew</em>, opened up to the exact line matching the search result.</p>
<div id="attachment_414" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/related-words.png"><img class=" wp-image-414 " title="related-words" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/related-words.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 5. Right-clicking on &quot;beautiful&quot; (or any other word) brings up synonyms of that word -- computed based on similar usage to the query word.</p></div>
<p>After convincing myself that &#8220;beautiful&#8221; did actually mean what I thought it did in Shakespeare, I decided that I needed to see synonyms. WordSeer supports this need. Using the contexts of words, it computes synonyms based on other words that &#8220;behave&#8221; in the same way &#8212; that are used in the same contexts, that have grammatical relationships to similar words, and so on. Right-clicking on a word while reading (Figure 5) brings up synonyms. These are computed based on being used in a similar way to &#8220;beautiful&#8221; in Shakespeare, and not based on some external measure of similarity, such as a dictionary or thesaurus. Therefore, they reflect the particular idiosyncrasies of <em>just</em> the Shakespeare collection.</p>
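<p>A rough sketch of the general idea follows, assuming a much simpler notion of &#8220;context&#8221; than WordSeer uses: just the words within two positions on either side. The function names and the three toy sentences are made up for illustration, and the scoring (cosine similarity over context counts) is a standard stand-in rather than WordSeer&#8217;s actual method.</p>

```python
import math
from collections import Counter


def context_vectors(sentences, window=2):
    """Count, for each word, the words appearing within `window` positions."""
    vectors = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            neighbours = words[start:i] + words[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors


def cosine(a, b):
    """Cosine similarity between two Counter-based context vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_words(word, vectors, top=3):
    """Rank other words by how closely their contexts match those of `word`."""
    target = vectors[word]
    scores = [
        (cosine(target, vector), other)
        for other, vector in vectors.items()
        if other != word
    ]
    return [other for score, other in sorted(scores, reverse=True)[:top]]


# Toy corpus: "fair" and "beautiful" occur in identical contexts.
toy_sentences = [
    "the beautiful lady smiled",
    "the fair lady smiled",
    "the old king frowned",
]
vectors = context_vectors(toy_sentences)
print(similar_words("beautiful", vectors, top=1))  # ['fair']
```

<p>Because the vectors are built only from the collection being searched, the ranking reflects that collection&#8217;s usage and nothing else.</p>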
<div id="attachment_415" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/all-heatmap-queries.png"><img class=" wp-image-415 " title="all-heatmap-queries" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/all-heatmap-queries.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 6. Adding some interesting synonyms to the heat map query -- click a word to add it to the heat map query.</p></div>
<p>This list of synonyms seemed promising. It contained words such as &#8220;tractable&#8221;, &#8220;fair&#8221;, and &#8220;gentle&#8221; that I would never have thought of including in my initial search.  To see whether these were more widespread than &#8220;beautiful&#8221;, I checked their prevalence in the collection using WordSeer&#8217;s heat map tool. I clicked some interesting words and added them to my query (Figure 6). This took me to the heat map view.</p>
<div id="attachment_416" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/fair-heatmap.png"><img class=" wp-image-416 " title="fair-heatmap" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/fair-heatmap.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 7. The collection-wide occurrence of some of the synonyms of &quot;beautiful&quot;. It is easy to see that these are much more prevalent.</p></div>
<p>WordSeer heat maps can be confusing if you have never seen one before, so I&#8217;ll explain them here. If you already know how they work, you can skip this section.</p>
<h4>Heat Maps</h4>
<p>WordSeer uses heat maps to visualize collection-wide occurrence patterns of words and phrases. In this example, I&#8217;ll use  the word &#8220;fair&#8221; to illustrate.</p>
<div id="attachment_417" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/macbeth-heatmap-example.png"><img class=" wp-image-417 " title="macbeth-heatmap-example" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/macbeth-heatmap-example.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 8. The column corresponding to &quot;Macbeth&quot;</p></div>
<p>Typing in &#8220;fair&#8221; creates the pretty picture in Figure 8 above. Each vertical column is a single document &#8212; in the picture, I am hovering over the column corresponding to &#8220;Macbeth&#8221;.  The documents are lined up side by side in long vertical columns.</p>
<div id="attachment_418" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/midsummer-heatmap-example.png"><img class=" wp-image-418 " title="midsummer-heatmap-example" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/midsummer-heatmap-example.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 9. The column corresponding to &quot;A Midsummer Night&#039;s Dream&quot;</p></div>
<p>All of Shakespeare&#8217;s works are here. Figure 9 shows the column corresponding to &#8220;A Midsummer Night&#8217;s Dream&#8221;.  The blue highlights show occurrences of the query. In each vertical column, blue blocks indicate that the query word, in this case &#8220;fair&#8221;, has occurred in that location. Blocks higher up in the column mean that the word occurred near the beginning, and blocks lower down in the column mean that the word occurred towards the end of the document. The documents all &#8220;appear&#8221; the same length, so shorter documents are &#8220;stretched&#8221; (a few taller blocks) and longer documents are &#8220;squeezed&#8221; (many squat blocks).</p>
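<p>The stretching and squeezing can be sketched in a few lines. This is only an illustration of the layout idea, assuming each document is a numbered list of sentences and each column is a fixed number of rows tall; the function names and the default height of 100 rows are hypothetical.</p>

```python
def highlighted_rows(match_positions, document_length, column_height=100):
    """Map the sentence numbers of matches to rows in a fixed-height column.

    Row 0 is the top of the column (start of the document). Every document
    is scaled to the same height, whatever its real length.
    """
    rows = set()
    for position in match_positions:
        rows.add(position * column_height // document_length)
    return sorted(rows)


def block_height(document_length, column_height=100):
    """Height of one sentence's block: short documents get taller blocks."""
    return column_height / document_length


# A short document (50 sentences) and a long one (1000 sentences),
# each with matches at the start, the middle, and near the end.
print(highlighted_rows([0, 25, 49], 50))       # [0, 50, 98]
print(highlighted_rows([0, 500, 990], 1000))   # [0, 50, 99]
print(block_height(50), block_height(1000))    # 2.0 0.1
```

<p>Both documents light up roughly the same rows, which is what makes columns comparable at a glance even when the plays differ greatly in length.</p>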
<div id="attachment_419" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/word-in-heatmap.png"><img class=" wp-image-419 " title="word-in-heatmap" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/word-in-heatmap.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 10. The blue highlights show where the query word has occurred in each document.</p></div>
<p>Hovering over a highlighted block brings up a window showing the matched sentence. In this case, I&#8217;ve hovered over a line containing &#8220;fair&#8221; from &#8220;The Tempest&#8221;. The popup shows the line, with &#8220;fair&#8221; highlighted in the same color, and a book icon. Clicking the icon opens up a new window, in which I can read more of the text if I wish to.</p>
<h4>But back to the vocabulary of beauty</h4>
<div id="attachment_420" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/fair-heatmap1.png"><img class=" wp-image-420  " title="fair-heatmap" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/fair-heatmap1.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 11. Heat map showing synonyms of the word &quot;beautiful&quot;</p></div>
<p>The heat map I got from my synonym query (Figure 11 above) showed that, although &#8220;beautiful&#8221; was quite a rare word, other synonyms for it seemed much more prevalent. &#8220;Fair&#8221;, in particular, seemed to be used a lot &#8212; the whole map was purple. Hovering over individual instances of &#8220;fair&#8221; showed that it did seem to be used the way &#8220;beautiful&#8221; is in today&#8217;s English.</p>
<div id="attachment_421" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/wordtree-fair.png"><img class=" wp-image-421 " title="wordtree-fair" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/wordtree-fair.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 12. The word tree for &quot;fair&quot;. This is a visualization of the contexts surrounding the word &quot;fair&quot;.</p></div>
<p>For further verification, I looked at the word tree for &#8220;fair&#8221; (Figure 12), which was displayed on the same page just below the heat map. It showed that &#8220;fair&#8221; was used in constructions like &#8220;fair and virtuous&#8221;, &#8220;fair and happy&#8221;, &#8220;fair and good&#8221;.</p>
<div id="attachment_422" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/described-as-fair.png"><img class=" wp-image-422 " title="described-as-fair" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/described-as-fair.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 13. A revised query: &quot;_______ described-as fair&quot;</p></div>
<p>All this evidence convinced me that I might do better to use the word &#8220;fair&#8221; instead of &#8220;beautiful&#8221; to investigate the concept of beauty in Shakespeare. Returning to the search page, I typed in a new grammatical search query &#8211; &#8220;_________ <strong>described-as</strong> fair&#8221;. The results (Figure 14) were much more informative:</p>
<div id="attachment_423" class="wp-caption aligncenter" style="width: 510px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/described-as-fair-results.png"><img class=" wp-image-423  " title="described-as-fair-results" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/12/described-as-fair-results.png?w=300" alt="" width="500" /></a><p class="wp-caption-text">Figure 14. Search results for the grammatical search query &quot;_________ described-as fair&quot;</p></div>
<p>Because there was more than one result this time, WordSeer showed a bar graph summarizing the matches. At a glance, I could see that I was on the right track.  There were a lot of women&#8217;s names, interspersed with other words like &#8220;queen&#8221;, &#8220;daughter&#8221;, and &#8220;day&#8221;. It seemed that I had successfully overcome the vocabulary problem, at least this time. I had a starting answer to my original question, &#8220;what are some things that are &#8216;beautiful&#8217; in Shakespeare?&#8221;</p>
<p>The goal of text analysis interfaces like WordSeer is to properly combine powerful language processing algorithms with easy user interfaces. Neither is enough by itself, but together, they allow users to progress naturally from step to step, assisting them through an analysis.</p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/beautiful-in-shakespeare-2/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Digital Humanities and the Future of Search</title>
		<link>http://wordseer.berkeley.edu/digital-humanities-and-the-future-of-search-2/</link>
		<comments>http://wordseer.berkeley.edu/digital-humanities-and-the-future-of-search-2/#comments</comments>
		<pubDate>Mon, 31 Jan 2011 07:08:22 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Digital Humanities]]></category>
		<category><![CDATA[Information Seeking]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=370</guid>
		<description><![CDATA[On Tuesday, Feb. 1, I&#8217;ll be presenting my latest project WordSeer, at the Farsight 2011 conference on the future of search.  This event will be streamed live from TechCrunch, the tech world&#8217;s favorite blog about new technology and startup news,<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://wordseer.berkeley.edu/digital-humanities-and-the-future-of-search-2/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p><a href="http://bebop.berkeley.edu/wordseer"></a><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/01/picture-2.png"></a><img class="alignleft" src="http://bebop.berkeley.edu/wordseer/img/wordseer.png" alt="" width="100" />On Tuesday, Feb. 1, I&#8217;ll be presenting my latest project <a title="WordSeer UC Berkeley project page " href="http://www.cs.berkeley.edu/~aditi/projects/wordseer.html"><span style="color:#ff6600;">WordSeer</span></a>, at the <span style="text-decoration:underline;"><a href="http://bigthink.com/series/62"><span style="color:#ff6600;">Farsight 2011</span></a></span> conference on the future of search.  This event will be streamed live from <span style="text-decoration:underline;"><a href="http://techcrunch.com"><span style="color:#ff6600;">TechCrunch</span></a></span>, the tech world&#8217;s favorite blog about new technology and startup news, and will be attended by high-profile techies from Bing, Google, <span style="text-decoration:underline;"><a class="zem_slink" title="Blekko" rel="homepage" href="http://www.blekko.com"><span style="color:#ff6600;">Blekko</span></a></span>, and the like. Please <span style="text-decoration:underline;"><a href="http://bigthink.com/series/62"><span style="color:#ff6600;">tune in</span></a></span> at 10am PST Tuesday, and follow along with <span style="text-decoration:underline;"><a href="http://twitter.com/#search?q=%23futuresearch"><span style="color:#ff6600;">#futuresearch</span></a></span> on twitter, and let&#8217;s get the digital humanities some high-tech exposure that day!</p>
<p><span id="more-370"></span></p>
<p><a href="http://bebop.berekeley.edu"><img class="aligncenter size-full wp-image-392" title="Picture 3" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/01/picture-3.png" alt="Wordseer Search Box" width="465" height="71" /></a></p>
<p><a href="http://bebop.berkeley.edu/wordseer"><span style="color:#ff6600;">WordSeer </span></a> is a new way of searching through text inspired by the way literary scholars work. Literature scholars ask detailed, analytical questions of text, for which it&#8217;s important for them to get a sense of how different words are used and in what contexts. For our project, we teamed up with scholars who are exploring language use in a collection of North American slave narratives.</p>
<p>When analyzing text, traditional keyword-based search can only take you so far. Instead of having to read every document hoping to come across relevant passages, you can immediately zoom in on them with a search. But can we do better? When trying to form a hypothesis or get a sense of contents, a long list of search results is still unwieldy  because it&#8217;s not really the matching sentences we&#8217;re interested in, it&#8217;s what they have to say about our topic.</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/01/picture-8.png"><img class="aligncenter size-full wp-image-397" title="Picture 8" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/01/picture-8.png" alt="The grammatical structure of a sentence" width="595" height="118" /></a></p>
<p>Luckily, we don&#8217;t have to stop at matching keywords. Sentences <a href="http://mininghumanities.com/2010/07/23/tools-2-nlp/">aren&#8217;t</a> mysterious bags of words; they follow rules and have structures, which computers have been capable of deciphering with speed and precision for some years now. From these structures, computers can automatically infer relationships between words. For example, in the sentence,</p>
<blockquote><p>&#8220;The good God has given every man intellect&#8221;</p></blockquote>
<p>computers can automatically infer that &#8220;God&#8221; is described as &#8220;good&#8221;, and that he is the agent doing the giving.</p>
<p>With WordSeer, we&#8217;re going beyond keyword search by using language processing to automatically extract and aggregate the parts of matching sentences relevant to a query. In the first place, we make it easy to express an analytical query in terms of a grammatical relationship. For example, if a scholar wanted to know what the slave narratives collection indicated about the relationship between slaves and God, <a href="http://bebop.berkeley.edu/wordseer/index.php?grammatical=on&amp;gov=God&amp;relation=amod+advmod&amp;dep=&amp;results=&amp;page=0&amp;pagelength=100" target="_blank"><span style="color:#ff6600;">they could simply ask</span></a> (live demo link) how God &#8220;is described&#8221; (for which WordSeer finds and displays all the adjectives that are applied to the word God) and what <a href="http://bebop.berkeley.edu/wordseer/index.php?grammatical=on&amp;gov=&amp;relation=agent+subj+nsubj+csubj+nsubjpass+csubjpass&amp;dep=God&amp;results=&amp;page=0&amp;pagelength=100" target="_blank">&#8220;<span style="color:#ff6600;">is done by</span>&#8221;</a> God (for which WordSeer finds and categorizes all instances of verbs in which God is an agent).</p>
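<p>For the curious, the two demo links above differ only in their query parameters. The sketch below rebuilds them with Python&#8217;s standard library. The parameter names are copied from the links themselves; what each one means is my reading of them (for instance, that <code>gov</code> and <code>dep</code> are the two slots of a relationship and a blank value means &#8220;match anything&#8221;), the helper function is hypothetical, and the demo server may no longer respond.</p>

```python
from urllib.parse import urlencode

# Base address taken from the demo links in this post.
BASE_URL = "http://bebop.berkeley.edu/wordseer/index.php"


def grammatical_query_url(relations, governor="", dependent=""):
    """Build a demo-style query URL; an empty slot is left blank."""
    params = {
        "grammatical": "on",
        "gov": governor,
        "relation": " ".join(relations),  # spaces are encoded as "+"
        "dep": dependent,
        "results": "",
        "page": 0,
        "pagelength": 100,
    }
    return BASE_URL + "?" + urlencode(params)


# How God "is described": God fills the gov slot, the other slot is blank.
print(grammatical_query_url(["amod", "advmod"], governor="God"))

# What "is done by" God: God fills the dep slot, the verb slot is blank.
agent_relations = ["agent", "subj", "nsubj", "csubj", "nsubjpass", "csubjpass"]
print(grammatical_query_url(agent_relations, dependent="God"))
```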
<p style="text-align:center;"><a href="http://bebop.berkeley.edu/wordseer/index.php?grammatical=on&amp;gov=&amp;relation=agent+subj+nsubj+csubj+nsubjpass+csubjpass&amp;dep=God&amp;results=&amp;page=0&amp;pagelength=100"><img class="aligncenter size-full wp-image-393" title="Picture 5" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2011/01/picture-5.png" alt="" width="553" height="383" /></a></p>
<p>Of course, this is only a rough, high-level picture of what the slave narratives say about how God is described and what God does, but a rough idea can often serve to guide intuition and help generate or discredit hypotheses. By making the process of &#8220;getting a rough idea&#8221; quick and inexpensive, we can speed up the entire research pipeline.</p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/digital-humanities-and-the-future-of-search-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>WordSeer: Exploring Language Use in Slave Narratives</title>
		<link>http://wordseer.berkeley.edu/wordseer/</link>
		<comments>http://wordseer.berkeley.edu/wordseer/#comments</comments>
		<pubDate>Tue, 07 Dec 2010 23:41:59 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Digital Collections]]></category>
		<category><![CDATA[Digital Humanities]]></category>
		<category><![CDATA[Information Seeking]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[english literature]]></category>
		<category><![CDATA[Grammar]]></category>
		<category><![CDATA[Natural language processing]]></category>
		<category><![CDATA[Parsing]]></category>
		<category><![CDATA[search interface]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[user interfaces]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=340</guid>
		<description><![CDATA[More and more source text in the humanities gets digitized every day, making it accessible to large-scale computational analysis. Nevertheless, traditional methods of humanistic analysis are based on detailed arguments built upon close readings of individual texts. How<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://wordseer.berkeley.edu/wordseer/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p><span style="font-style:normal;">More and more source text in the humanities gets digitized every day, making it accessible to large-scale computational analysis. Nevertheless, traditional methods of humanistic analysis are based on detailed arguments built upon close readings of individual texts. How will the field adapt? How do we use statistics and text mining to answer humanistic questions?</span></p>
<p>Zoom in to the field of  American literature, and further into the realm of  studying the <a href="http://docsouth.unc.edu/neh/">(digitized) narratives</a> of  escaped former slaves, published by white abolitionists. There are widespread stylistic and thematic similarities among these narratives. How can text mining help literature scholars here? That&#8217;s where WordSeer, my latest project, comes in.</p>
<p><span id="more-340"></span></p>
<p>The <a href="http://www.monkproject.org/">MONK project</a> at CMU, and the <a href="http://voyeurtools.org">Voyeur</a> project at McMaster University share the same cause as WordSeer. But, when it comes to text analysis, they are essentially search interfaces that show simple statistics about word order, type and frequency. The grammatical relationships within text are neglected.</p>
<h3>WordSeer</h3>
<p>WordSeer is an evolving project, as all digital humanities projects inevitably are. As my friends in the English department and I learn what we can do for each other, it will become steadily better defined, but right now, it&#8217;s simple: a search interface and a reading interface. The search interface allows queries based on grammatical structure, and the reading interface is for reading narratives, comparing them, and coming up with new queries.</p>
<h4>Search</h4>
<p>The search screen is shown below.  It supports standard keyword-based search, so scholars can look for words or exact matches in the text. More interestingly, there&#8217;s grammatical search. Using grammatical relationships extracted through natural language processing, users can ask how things were described, what actions were performed upon them and by them, who possessed  certain things, or what was possessed by them.</p>
<p><a href="http://www.cs.berkeley.edu/~aditi/blog/wordseer/searchscreen.png" target="_blank"><img class="   " title="The Search Screen" src="http://www.cs.berkeley.edu/~aditi/blog/wordseer/searchscreen.png" alt="" width="600" /></a></p>
<p>For example, the figure above (click for larger image in new window) shows the query, &#8216;give all adjectives that are applied to the words &#8220;slave, bondman, negro&#8221;&#8217;. The system returns not only a list of occurrences in the narratives, but also automatically generated graphs showing the frequencies of the different words. As you can see, &#8220;poor&#8221; is the most frequent adjective. The results are sortable and filterable: clicking on bars filters the list to show just results containing those words. Above, I&#8217;ve filtered to show just the instances where &#8220;valuable&#8221; is applied to &#8220;slave&#8221;.</p>
<h4>Reading</h4>
<p>Interviews with our literary scholar friends suggested that a search interface alone would not be enough, so WordSeer supports reading narratives individually.</p>
<p>The reading view is shown below. Scholars can select one (or indeed many) sentences from the search results and be taken to a reading screen, where the narratives are opened up to the correct place. Grammatical search doesn&#8217;t end there, however, because the <em>entire text</em> is interactive.</p>
<p><a href="http://www.cs.berkeley.edu/~aditi/blog/wordseer/readscreen.png" target="_blank"><img class="   " title="The reading screen" src="http://www.cs.berkeley.edu/~aditi/blog/wordseer/readscreen.png" alt="" width="600" /></a>Highlighting a portion of a sentence and clicking the &#8220;examine&#8221; button (bottom right corner) shows the text pattern, as well as all the grammatical relationships in the highlighted portion. For example, I clicked on a passage about hospitals, and was presented with the pattern-examiner screen (below).</p>
<div class="mceTemp mceIEcenter" style="text-align:left;">
<dl class="wp-caption aligncenter">
<dt class="wp-caption-dt"><a href="http://www.cs.berkeley.edu/~aditi/blog/wordseer/selectpattern.png" target="_blank"><img class="  " title="Highlighting Patterns" src="http://www.cs.berkeley.edu/~aditi/blog/wordseer/selectpattern.png" alt="" width="600" /></a></dt>
<dd class="wp-caption-dd"> </dd>
</dl>
</div>
<p style="text-align:left;">I can select some patterns, either the original passage or some grammatical patterns, and examine them further. I can use them as search queries and be taken back to the original search screen, I can save them for later, or I can view their distributions in the text I&#8217;m reading.</p>
<p style="text-align:left;">Being able to compare the distribution of phrases or patterns across texts can give an idea of how similar the texts are, or of how much their subject matter overlaps. For example, if I wanted to know where plantations were mentioned in these texts, I would highlight the word &#8220;plantation&#8221; and click &#8220;See in Text&#8221;, giving the result below.</p>
<p><a href="http://www.cs.berkeley.edu/~aditi/blog/wordseer/plantations.png" target="_blank"><img class="   " title="Comparing the distributions of phrases" src="http://www.cs.berkeley.edu/~aditi/blog/wordseer/plantations.png" alt="" width="600" /></a></p>
<p style="text-align:left;">The white column represents the length of the entire text, and green bars mark where the pattern of interest occurs. If I had selected multiple patterns, I would see different colored bars. Clicking on any of the little green bars takes me to an occurrence of the pattern, highlighted in the text.</p>
<h3 style="text-align:left;">Language Processing</h3>
<p style="text-align:left;">All of this works because I applied language processing to the text beforehand, and stored the information in a database for quick access. I applied part-of-speech tagging, syntactic parsing, and dependency parsing to decompose sentences into their grammatical constituents. For example, the sentence “The cruel man beat us severely” contains the word “cruel”, which is an adjective modifier of the noun “man”. There is a verb-object relation between “beat” and “us”, and a verb-subject relation between “beat” and “man”.</p>
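<p style="text-align:left;">To make the idea of storing grammatical relations for quick lookup concrete, here is a minimal sketch in Python using the standard <code>sqlite3</code> module. The table layout, the column names, and the helper functions <code>described_as</code> and <code>done_by</code> are assumptions made up for illustration, not WordSeer&#8217;s actual schema, and the relations are typed in by hand rather than produced by a parser.</p>

```python
import sqlite3

# Hypothetical schema: one row per grammatical relation found by a parser.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependency ("
    " sentence_id INTEGER, relation TEXT, governor TEXT, dependent TEXT)"
)

# Relations for "The cruel man beat us severely", written by hand here;
# in practice they would come from a dependency parser.
rows = [
    (1, "det", "man", "The"),
    (1, "amod", "man", "cruel"),
    (1, "nsubj", "beat", "man"),
    (1, "dobj", "beat", "us"),
    (1, "advmod", "beat", "severely"),
]
conn.executemany("INSERT INTO dependency VALUES (?, ?, ?, ?)", rows)


def described_as(word):
    """Return the adjectives (amod dependents) applied to `word`."""
    cursor = conn.execute(
        "SELECT dependent FROM dependency"
        " WHERE relation = 'amod' AND governor = ? ORDER BY dependent",
        (word,),
    )
    return [dependent for (dependent,) in cursor]


def done_by(word):
    """Return the verbs of which `word` is the grammatical subject."""
    cursor = conn.execute(
        "SELECT governor FROM dependency"
        " WHERE relation = 'nsubj' AND dependent = ? ORDER BY governor",
        (word,),
    )
    return [governor for (governor,) in cursor]


print(described_as("man"))
print(done_by("man"))
```

<p style="text-align:left;">Because the parsing happens once, ahead of time, a query like &#8220;how is X described&#8221; reduces to a simple lookup at search time.</p>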
<p style="text-align:left;">If you want to know more about natural language processing, I recently gave a BootCamp about text mining at THATCamp SF; here are the <a title="Text mining" href="http://www.cs.berkeley.edu/~aditi/thatcampsf/textmining.pptx">slides</a> [<a href="http://www.cs.berkeley.edu/~aditi/thatcampsf/textmining.pdf">pdf</a>]. I also wrote a <a title="Tools for Exploring Text: Natural Language Processing" href="http://mininghumanities.com/2010/07/23/tools-2-nlp/">blog post</a> introducing the subject for a digital humanities audience.</p>
<h3 style="text-align:left;">What next?</h3>
<p style="text-align:left;">Syntactic analysis is just a small part of what natural language processing can do. Right now, I&#8217;m working on being able to track named entities through a narrative and see descriptions applied to them, and actions in which they participate.</p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/wordseer/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Extracting Social Networks from 19th Century Novels</title>
		<link>http://wordseer.berkeley.edu/social-networks-19th-century/</link>
		<comments>http://wordseer.berkeley.edu/social-networks-19th-century/#comments</comments>
		<pubDate>Mon, 13 Sep 2010 16:21:48 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Digital Humanities]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[english literature]]></category>
		<category><![CDATA[text mining]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=302</guid>
		<description><![CDATA[This year&#8217;s conference of the Association for Computational Linguistics, the most prestigious event in computational linguistics, had a paper that got me very excited. It&#8217;s called Extracting Social Networks from Literary Fiction [pdf], and here&#8217;s the abstract (emphasis added): We<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://wordseer.berkeley.edu/social-networks-19th-century/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p>This year&#8217;s conference of the Association for Computational Linguistics, the most prestigious event in computational linguistics, had a paper that got me very excited. It&#8217;s called <a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=1&amp;ved=0CBcQFjAA&amp;url=http%3A%2F%2Fwww.cs.columbia.edu%2F~delson%2Fpubs%2FACL2010-ElsonDamesMcKeown.pdf&amp;ei=1dtpTNrDPIe6sAPrtrXUBw&amp;usg=AFQjCNEJza_v38BKBs5ZqCpA7CUTyp2L7g&amp;sig2=Lxq-cu2TGo-2Hzh1wVacqw">Extracting Social Networks from Literary Fiction</a> [pdf], and here&#8217;s the abstract (emphasis added):</p>
<blockquote><p>We present a method for <strong><span style="color:#ff6600;">extracting social networks from literature</span></strong>,  namely, nineteenth-century British novels and serials. We derive the  networks from dialogue interactions, and thus our method depends on the  ability to determine when two characters are in conversation. Our  approach involves character name chunking, quoted speech attribution and  conversation detection given the set of quotes. We extract features  from the social networks and examine their correlation with one another,  as well as with metadata such as the novel’s setting. <span style="color:#ff6600;"><strong>Our results provide evidence that</strong> <strong>the majority of novels in this time period do not fit two characterizations provided by literacy scholars. </strong></span>Instead, our results suggest an alternative explanation for differences in social networks.</p></blockquote>
<p>The paper advances a new technique for extracting social networks from text, and uses it on 19th century novels to argue that certain aspects of literary theory about novels might be false. In this post, I&#8217;ll explain the analysis to the digital humanities audience and discuss some strengths and weaknesses in the argument.</p>
<p><span id="more-302"></span></p>
<p>Written at Columbia University by two computer scientists and one English scholar, this paper offers something exciting for both computational linguists and literature researchers. For computational linguists, it proposes the first ever algorithm for extracting speaker-to-speaker networks from free text. This opens up fascinating new areas of study because it is now possible to computationally analyze interactions between people in a text and not just what they say to each other.</p>
<p>For literary scholars, it suggests that two hypotheses from literary theory about community and society in 19th century novels might be false, namely:</p>
<blockquote><p>Literary studies about the nineteenth-century British novel are often concerned with the nature of the community that surrounds the protagonist. Some theorists have suggested a relationship between the size of a community and the amount of dialogue that occurs, positing that <strong><span style="color:#ff6600;">“face to face time” diminishes as the number of characters in the novel grows</span></strong>. Others suggest that <strong><span style="color:#ff6600;">as the social setting becomes more urbanized, the quality of dialogue also changes,</span></strong> <span style="color:#ff6600;"><strong>with</strong> </span><strong><span style="color:#ff6600;">more interactions occurring in rural communities than urban communities</span></strong>. Such claims have typically been made, however,<strong><span style="color:#ff6600;"> </span></strong><span style="color:#000000;">on the basis of a few novels that are studied in depth</span>. In this paper, <strong><span style="color:#ff6600;">we aim to</span></strong> <strong><span style="color:#ff6600;">determine whether an automated study of a much larger sample of nineteenth century novels supports these claims</span></strong>.</p></blockquote>
<p>To make their arguments, the authors frame the statements above in terms of social networks:</p>
<ul>
<li>If face-to-face time diminishes as the number of characters grows, then the more characters the novel has, the less dense its extracted social network will be.</li>
<li>Second, if more interactions occur in rural settings than urban settings, networks from rural novels will be densely connected but contain fewer characters, while networks from urban novels will be large and loosely connected.</li>
</ul>
<p style="text-align:center;"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2010/08/picture-1.png"><img class="size-medium wp-image-318 aligncenter" title="Mansfield Park social network" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2010/08/picture-1.png?w=246" alt="Social network extracted from Mansfield Park by Jane Austen" width="246" height="299" /></a></p>
<p>Then, they extract social networks from novels using the following steps. First, the <a href="http://nlp.stanford.edu/software/">Stanford named-entity tagger</a> automatically locates all the names in each novel. Then, a classifier automatically assigns a speaker to every instance of direct speech in the novel using features of the surrounding text. A &#8220;conversation&#8221; occurs if two characters speak within 300 words of each other, and finally, a social network is constructed from the conversations. Nodes are named speakers (that appear 3 times or more &#8211; the named-entity tagger is somewhat error prone). An edge appears if there was a conversation between two characters, and a heavier edge means more conversations. The end result is a social network like the one shown above, which was extracted from <em>Mansfield Park</em> by Jane Austen.</p>
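<p>The last step, turning attributed quotes into a weighted network, can be sketched in a few lines of Python. This is a deliberately crude approximation and not the authors&#8217; implementation: it assumes name tagging and speaker attribution have already been done, takes a hypothetical list of <code>(word_offset, speaker)</code> pairs as input, and simply counts pairs of nearby quotes by different speakers as a stand-in for detected conversations. The function name, the data, and the mention counts are all made up for illustration.</p>

```python
from collections import Counter


def build_network(quotes, mention_counts, window=300, min_mentions=3):
    """Build weighted edges between speakers who talk near each other.

    quotes: list of (word_offset, speaker) pairs, one per attributed quote.
    mention_counts: how often each name appears in the novel.
    Returns a dict mapping (speaker_a, speaker_b) to an edge weight.
    """
    # Keep only names that appear often enough to be trusted as characters.
    keep = {name for name, n in mention_counts.items() if n >= min_mentions}
    ordered = sorted(quotes)
    edges = Counter()
    for i, (pos_a, a) in enumerate(ordered):
        for pos_b, b in ordered[i + 1:]:
            if pos_b - pos_a > window:
                break  # later quotes are even further away
            if a != b and a in keep and b in keep:
                edges[tuple(sorted((a, b)))] += 1
    return dict(edges)


# Hypothetical quote positions and mention counts.
quotes = [
    (100, "Fanny"), (250, "Edmund"), (400, "Fanny"),
    (1200, "Mary"), (1350, "Edmund"), (1400, "Tom"),
]
mentions = {"Fanny": 40, "Edmund": 35, "Mary": 20, "Tom": 2}

print(build_network(quotes, mentions))
```

<p>In this toy input, Tom speaks close to Mary and Edmund but is dropped because his name appears fewer than 3 times, which is exactly the kind of omission discussed later in this post.</p>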
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2010/09/picture-1.png"><img class="aligncenter size-full wp-image-331" title="First person network" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2010/09/picture-1.png" alt="" width="356" height="337" /></a></p>
<p>Using the social networks they extract, the authors show that there is no significant difference in this dimension between urban and rural novels. Instead, they show that the biggest differences seem to be between novels in the third person and novels in the first person &#8211; the third-person novels have &#8220;dense, talkative&#8221; networks, whereas the first-person novels all center around the character &#8220;I&#8221;:</p>
<blockquote><p>Our data suggests &#8230; that the “urban novel” is not as strongly distinctive a form as has been asserted, and that in fact it can look much like the village ﬁctions of the century, as long as the same method of narration is used</p></blockquote>
<p>Their claim seems too strong to me. In order to make the problem tractable, they have reduced the concepts of &#8220;characters&#8221; and &#8220;conversation&#8221; to simple metrics, but important information is left out:</p>
<ul>
<li>Characters are equated with names that appear three or more times, but this leaves out
<ul>
<li>Named characters that are mentioned fewer than 3 times</li>
<li>Nameless characters that don&#8217;t speak</li>
</ul>
</li>
<li>They assume  all conversations are direct speech &#8211; they ignore indirect and reported speech</li>
</ul>
<p>From an NLP perspective, it&#8217;s easy to see why they&#8217;ve made these simplifications. In the kinds of text that we are used to dealing with: expository things like news articles, or explanatory things like journal papers, infrequent, nameless entities don&#8217;t matter and are rare. We&#8217;re used to looking for <em>significant</em> entities, <em>popular</em> topics of discussion, so the &#8220;drop the infrequent&#8221; approach goes a long way, and eliminates noise.</p>
<p>Nevertheless, when it comes to characterizing the aesthetics of a novel&#8217;s depiction of a social network, it&#8217;s a different matter, no longer about how &#8220;important&#8221; a character is or how &#8220;significant&#8221; some topic is. To me, it seems plausible that the number of infrequently appearing named characters, and the number of nameless characters who are seen but never heard, can change the quality of the social network one experiences in a novel. Without further investigation into how common these infrequent-character and nameless-character cases are in this particular corpus, I really don&#8217;t think they have enough data to claim to have refuted scholarly intuition: we have no idea whether the cases they leave out are frequent enough in urban novels to sway the analysis, or in which direction they would sway it.</p>
<p>It&#8217;s very easy to point out problems, but I can&#8217;t think of any ways to fix them: the tools to separate infrequent named entities from junk entities just don&#8217;t exist. And the tools to identify nameless, speechless characters haven&#8217;t made much headway either &#8211; what is the difference, quantitatively, between the words &#8220;the woman standing at the station&#8221; and &#8220;the hansom cab standing at the station&#8221;? To computers that rely on statistical, automatically extracted information about language, the difference is currently very difficult to detect.</p>
<p>What do the humanists among you think of this work? Compared to <a href="http://mininghumanities.com/2010/05/11/text-mining-19th-century-novels-with-the-stanford-humanities-computing-lab/">the other literary analysis</a> of the same novels done at Stanford, this approach is more linguistically sophisticated &#8211; but do all of these computational attempts seem heavy-handed? Or do they spark ideas in your head, inspiring you to apply and improve them on your own problems?</p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/social-networks-19th-century/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Tools for Exploring Text: Natural Language Processing</title>
		<link>http://wordseer.berkeley.edu/tools-2-nlp/</link>
		<comments>http://wordseer.berkeley.edu/tools-2-nlp/#comments</comments>
		<pubDate>Fri, 23 Jul 2010 08:51:49 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Digital Humanities]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[text mining]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=242</guid>
		<description><![CDATA[Take an example question that a literary scholar might have,

    "How is the character Mary talked about in this text by author X"?

It's fairly open ended - what does "talked about" mean? How do we translate this into computational terms? In this post, I'll describe some tools that natural language processing (NLP) has to offer, and show how each can be used to tackle this question along with pointers to software and tutorials.<div class="read-more"><a href="http://wordseer.berkeley.edu/tools-2-nlp/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p>Natural language processing (NLP), also known as computational linguistics, is a set of models and techniques for analyzing text computationally. In the context of the digital humanities, it can help take a question that a literary scholar or historian might ask of a body of text, and help turn it into a quantitative hypothesis. In a <a href="http://mininghumanities.com/2010/04/28/tools-for-getting-a-sense-of-stuff-part-1-visualization/">previous post</a>, I talked about how visualization can be used to get a sense of text; this is the next in the series.</p>
<p>Throughout this post, we&#8217;ll try to answer a hypothetical question a scholar in the humanities, perhaps a literary scholar or historian, might be interested in:</p>
<blockquote><p><em>&#8220;<strong>How is the character Mary talked about in this novel or historical text?</strong>&#8220;</em></p></blockquote>
<p>It&#8217;s fairly open ended &#8211; what does &#8220;talked about&#8221; mean? How do we translate this into computational terms? In this post, I&#8217;ll describe some tools that natural language processing (NLP) has to offer, and show how each can be used to tackle this question along with pointers to software and tutorials.</p>
<p><span id="more-596"></span></p>
<p>The goal of NLP is to model the workings of <em>natural</em> language as we speak, read, and write it, so all the tools here are motivated by some kind of <em>language model</em>.</p>
<h3>N-Grams</h3>
<p>These are strings of consecutive words within a sentence. Take the sentence,</p>
<blockquote><p><span style="color:#ff6600;">Mary was born on a cold March morning.</span></p></blockquote>
<p>The words <code><span style="color:#ff6600;">born</span></code> and <code><span style="color:#ff6600;">morning</span></code> are <em>1-grams</em>. <code><span style="color:#ff6600;">cold</span> <span style="color:#ff6600;">March</span></code> and <span style="color:#800080;"><code><span style="color:#ff6600;">born on</span></code></span> are examples of <em>2-grams</em>. This might seem like a crude way of modeling a language, but n-grams capture a lot of information because we speak grammatically. We can use them to get a sense of how Mary is talked about, for example, by asking what 3-grams and 4-grams we can find that fit patterns like:</p>
<ul>
<li><span style="color:#000000;">Mary is ____.</span></li>
<li><span style="color:#000000;">Mary is a ____.</span></li>
<li><span style="color:#000000;">Mary is an ____.</span></li>
</ul>
<p>Toolkits like <a href="http://www.nltk.org">NLTK</a> and <a href="http://opennlp.sourceforge.net/">openNLP</a> come with <a href="http://nltk.googlecode.com/svn/trunk/doc/book/ch04.html">tutorials</a> that explain how to get started on such analyses. We may find that Mary is <code><span style="color:#ff6600;">nice</span></code> and <code><span style="color:#ff6600;">a darling</span></code>, but of course we may also end up with a fragment like <code><span style="color:#ff6600;">Mary is getting</span></code>, from the sentence <code><span style="color:#ff6600;">Mary is getting sleepy</span></code>, which isn&#8217;t what we&#8217;re looking for.</p>
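<p>As a rough illustration of this kind of n-gram lookup, here is a small sketch in plain Python (no toolkit required). The tokenizer is a naive regular expression, the sample text is made up, and the helper names <code>ngrams</code> and <code>completions</code> are my own, so treat it as a toy rather than a recommended pipeline.</p>

```python
import re


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def completions(text, prefix):
    """Return the words that directly follow `prefix` within a sentence."""
    prefix = tuple(prefix)
    found = []
    # Naive sentence splitting, so that n-grams never span two sentences.
    for sentence in re.split(r"[.!?]", text):
        tokens = re.findall(r"[A-Za-z']+", sentence)
        for gram in ngrams(tokens, len(prefix) + 1):
            if gram[:-1] == prefix:
                found.append(gram[-1])
    return found


text = "Mary is nice. Everyone says Mary is a darling. Mary is getting sleepy."

print(completions(text, ["Mary", "is"]))
print(completions(text, ["Mary", "is", "a"]))
```

<p>The first query returns the useful <code>nice</code> alongside the unhelpful <code>a</code> and <code>getting</code>, which is the weakness described above: an n-gram knows which words are adjacent, not what role they play.</p>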
<p>N-grams capture information about grammar through frequency of use: the less frequent an n-gram is, the less likely it is to be grammatical. But as early as 1957, <a href="http://en.wikipedia.org/wiki/Noam_Chomsky">Noam Chomsky</a> argued that there is much more to modeling grammar than this, using his <a href="http://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously">famous sentence</a>,</p>
<blockquote><p><span style="color:#ff6600;">&#8220;Colorless green ideas sleep furiously.&#8221;</span></p></blockquote>
<p>which is perfectly grammatical even though it contains no frequent n-grams.</p>
<h3>Parts of Speech</h3>
<p>Parts of speech (POS) are a more detailed way of modeling how we use words: verbs refer to actions, adjectives describe properties of things, nouns refer to entities, and so on. NLP algorithms have long been capable of assigning part-of-speech labels to words in sentences with high accuracy. This task is called <em>POS-tagging</em>, and we can use it to refine our analysis of how Mary is talked about by asking, &#8220;What are the adjectives that occur within five words of &#8216;Mary&#8217;?&#8221;. From a fragment like:</p>
<blockquote><p><span style="color:#ff6600;">It was a cloudy day, which young Mary found fortunate. &#8220;Are we close yet?&#8221; her companion asked. &#8220;No, actually, it&#8217;s quite far,&#8221; Mary replied.</span></p></blockquote>
<p>Using a part-of-speech tagger such as the <a href="http://nlp.stanford.edu/software/tagger.shtml">Stanford POS tagger</a>, <a href="http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/">GENIA</a>, or one of the <a href="http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html">many taggers</a> that come with NLTK, to extract adjectives, we would get a list like:</p>
<ul>
<li><span style="color:#000000;">cloudy</span></li>
<li><span style="color:#000000;">young</span></li>
<li><span style="color:#000000;">fortunate</span></li>
<li><span style="color:#000000;">far</span></li>
</ul>
<p>While this is more precise than an n-gram analysis in that we only see adjectives now, it&#8217;s still not perfect because only <code><span style="color:#ff6600;">young</span></code> refers to Mary. This is because &#8220;within five words&#8221; is still an approximation for what we really want: adjectives that <em>refer</em> to Mary. We could try this analysis again with a smaller window of 1 or 2 words, but then we might miss many adjectives.</p>
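<p>Here is a small sketch of the window-based search in Python. To keep it self-contained, the fragment is tagged by hand with Penn Treebank labels (<code>JJ</code> marks adjectives) instead of calling a real tagger, punctuation is dropped, and the window is applied within each sentence; all three are simplifying assumptions of mine, and the function name <code>adjectives_near</code> is made up.</p>

```python
# Hand-tagged version of the fragment, one list per sentence.
# A real analysis would get these tags from a POS tagger.
TAGGED = [
    [("It", "PRP"), ("was", "VBD"), ("a", "DT"), ("cloudy", "JJ"),
     ("day", "NN"), ("which", "WDT"), ("young", "JJ"), ("Mary", "NNP"),
     ("found", "VBD"), ("fortunate", "JJ")],
    [("Are", "VBP"), ("we", "PRP"), ("close", "JJ"), ("yet", "RB"),
     ("her", "PRP$"), ("companion", "NN"), ("asked", "VBD")],
    [("No", "UH"), ("actually", "RB"), ("it", "PRP"), ("'s", "VBZ"),
     ("quite", "RB"), ("far", "JJ"), ("Mary", "NNP"), ("replied", "VBD")],
]


def adjectives_near(sentences, target, window=5):
    """Return adjectives at most `window` words from `target`, per sentence."""
    found = []
    for sentence in sentences:
        positions = [i for i, (word, _) in enumerate(sentence) if word == target]
        for i, (word, tag) in enumerate(sentence):
            close_enough = any(abs(i - p) in range(1, window + 1) for p in positions)
            if tag.startswith("JJ") and close_enough:
                found.append(word)
    return found


print(adjectives_near(TAGGED, "Mary", window=5))
print(adjectives_near(TAGGED, "Mary", window=1))
```

<p>Shrinking the window changes which adjectives come back, but no window size can tell an adjective that describes Mary from one that merely sits next to her name.</p>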
<h3>Parsing Phrase Structure</h3>
<p>The structure of natural language extends beyond parts of speech, because words have relationships with each other. For example, in English, we say that every sentence has a main verb, which has a subject and, depending on the verb, an object, and an indirect object as well. These constituent parts can be small units like nouns, or bigger units, phrases which have their own constituents. NLP algorithms called <em>parsers</em> analyze sentences and return their internal structure. The <a href="http://nlp.cs.berkeley.edu/Main.html#parsing">Berkeley Parser</a>, for example, parsed the following sentence:</p>
<blockquote><p><span style="color:#ff6600;">She thinks Mary is nice to animals.</span></p></blockquote>
<p>to give:</p>
<p><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2010/07/shethinksmaryisparsetree.png"><img class="aligncenter size-full wp-image-259" title="shethinksmaryisparsetree" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2010/07/shethinksmaryisparsetree.png" alt="A parse tree of the sentence above" width="218" height="236" /></a>where the symbols on each branch represent parts of speech and phrase types. For example, <code><span style="color:#ff6600;">ADJP</span></code> is an adjective phrase, <code><span style="color:#ff6600;">NP</span></code> a noun phrase, and <code><span style="color:#ff6600;">VP</span></code> a verb phrase. <a href="http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html">Here</a> is a description of the standard set of labels.</p>
<p>We can use these concepts to ask very precise questions now. Referring to the tree above, if we&#8217;re searching for descriptions like &#8220;Mary is ____&#8221;, we&#8217;re searching for <code><span style="color:#ff6600;">ADJP</span></code>s (adjective phrases) that are part of a <code><span style="color:#ff6600;">VP</span></code> (verb phrase) which contains the word <code><span style="color:#ff6600;">is</span></code> and immediately follows the word <code><span style="color:#ff6600;">Mary</span></code>.</p>
<p>The easiest parser to use is the <a href="http://nlp.stanford.edu/software/lex-parser.shtml">Stanford Parser</a>, which parses about 4-5 sentences a second. Using their <a href="http://nlp.stanford.edu/software/tregex.shtml">Tregex</a> software (which is a little harder to use), you can browse the output and search for specific patterns like the one above.</p>
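<p>To show what such a pattern search amounts to, here is a sketch in plain Python that reads a bracketed parse and looks for adjective phrases inside an &#8220;is&#8221; verb phrase that directly follows the noun phrase <code>Mary</code>. The bracketed string is my own hand-written approximation of the parse of the example sentence, not actual parser output, and this is not how Tregex works internally; the function names are made up for illustration.</p>

```python
import re

# Hand-written Penn-Treebank-style parse of "She thinks Mary is nice to animals."
PARSE = (
    "(ROOT (S (NP (PRP She)) (VP (VBZ thinks) (SBAR (S (NP (NNP Mary))"
    " (VP (VBZ is) (ADJP (JJ nice) (PP (TO to) (NP (NNS animals))))))))"
    " (. .)))"
)


def parse(bracketed):
    """Turn a bracketed string into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            node = [tokens[pos + 1]]  # the phrase or POS label
            pos += 2
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # skip the closing bracket
            return node
        word = tokens[pos]
        pos += 1
        return word

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


def subtrees(node):
    """Yield every labeled node in the tree."""
    if isinstance(node, list):
        yield node
        for child in node[1:]:
            yield from subtrees(child)


def descriptions_of(tree, name):
    """Find ADJPs in a VP that starts with 'is' and follows the NP `name`."""
    results = []
    for clause in subtrees(tree):
        if clause[0] != "S":
            continue
        children = [c for c in clause[1:] if isinstance(c, list)]
        for np, vp in zip(children, children[1:]):
            if (np[0] == "NP" and leaves(np) == [name]
                    and vp[0] == "VP" and leaves(vp)[0] == "is"):
                results += [" ".join(leaves(c)) for c in vp[1:]
                            if isinstance(c, list) and c[0] == "ADJP"]
    return results


print(descriptions_of(parse(PARSE), "Mary"))
```

<p>Unlike the n-gram and window searches, this one would skip a sentence like &#8220;Mary is getting sleepy&#8221;, because there the word after <code>is</code> heads a verb phrase rather than an adjective phrase.</p>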
<h3>Dependency relations: grammatical structure</h3>
<p>The most precise way to ask which adjectives describe Mary is to look directly at grammatical relationships, and ask which adjectives <em>modify </em> <code><span style="color:#ff6600;">Mary</span></code>.  Modern parsers can do this accurately. For example, the Stanford Parser could look at the phrase structure in the sentence above (Figure 1) and return the following representations:</p>
<ul>
<li>nsubj(thinks, She)</li>
<li>ccomp(thinks, nice)</li>
<li>cop(nice, is)</li>
<li>nsubj(nice, Mary)</li>
<li>prep_to(nice, animals)</li>
</ul>
<p>The relation <code><span style="color:#ff6600;">nsubj</span></code> between <code><span style="color:#ff6600;">nice</span></code> and <code><span style="color:#ff6600;">Mary</span></code> indicates, for example, that <code><span style="color:#ff6600;">nice</span></code> is what <code><span style="color:#ff6600;">Mary is</span></code>. Parsers that extract these types of relationships are called <em>dependency parsers</em>; they can derive these grammatical relationships from phrase structures like the one in Figure 1. The Stanford Parser is one of many that include this ability, and <a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=1&amp;ved=0CBQQFjAA&amp;url=http%3A%2F%2Fnlp.stanford.edu%2Fsoftware%2Fdependencies_manual.pdf&amp;ei=AUpJTLmeEYr2tgPcyvRI&amp;usg=AFQjCNFvNTtNhYCa9IkZMIaIUvKnzka1nA&amp;sig2=krKH9uS0XcHHTw4kKqWtKA">here</a> [PDF] is a list of all the dependency relationships it can extract.</p>
<div id="attachment_266" class="wp-caption aligncenter" style="width: 585px"><a href="http://wordseer.berkeley.edu/www/wp-content/uploads/2010/07/picture-1.png"><img class="size-full wp-image-266" title="A Phrase Net of  &quot; ____ and _____ &quot; relationships in Pride and Prejudice" src="http://wordseer.berkeley.edu/www/wp-content/uploads/2010/07/picture-1.png" alt="A Phrase Net of  &quot; ____ and _____ &quot; relationships in Pride and Prejudice" width="575" height="464" /></a><p class="wp-caption-text">A Phrase Net of &#8221; ____ and _____ &#8221; relationships in Pride and Prejudice</p></div>
<p>Using dependency parsers gives us a lot of power: we can ask for all the adjectives that apply to Mary and locate them with high accuracy. We can find the verbs of which Mary is a subject, and those of which she is an object and see if there are any interesting patterns, or we can look at all the conjunctions in which Mary participates. A visualization like the one above, specifically designed for visualizing grammatical relationships (more <a href="http://mininghumanities.com/2010/04/29/just-because-its-obvious-doesnt-mean-its-useless-state-of-the-art-vs-useful-visualizations-for-information-seeking/">here</a>), might then make excellent food for thought.</p>
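<p>Once a text has been reduced to dependency triples, questions like these become simple filters. The sketch below assumes the triples are already available as <code>(relation, governor, dependent)</code> tuples; the ones here are hand-written for three made-up sentences (&#8220;Young Mary laughed.&#8221;, &#8220;Mary is nice.&#8221;, &#8220;She thanked Mary.&#8221;) rather than real parser output, and the function names are hypothetical.</p>

```python
# Hand-written (relation, governor, dependent) triples for:
#   "Young Mary laughed."  "Mary is nice."  "She thanked Mary."
TRIPLES = [
    ("amod", "Mary", "young"),
    ("nsubj", "laughed", "Mary"),
    ("nsubj", "nice", "Mary"),
    ("cop", "nice", "is"),
    ("nsubj", "thanked", "She"),
    ("dobj", "thanked", "Mary"),
]


def dependents(relation, governor):
    """Words attached to `governor` by `relation`."""
    return [d for r, g, d in TRIPLES if r == relation and g == governor]


def governors(relation, dependent):
    """Words to which `dependent` is attached by `relation`."""
    return [g for r, g, d in TRIPLES if r == relation and d == dependent]


def copular_predicates():
    """Words linked to their subject by a copula such as 'is'."""
    return {g for r, g, _ in TRIPLES if r == "cop"}


def descriptions_of(name):
    """Direct modifiers of `name`, plus whatever `name` 'is'.

    Note: a copular predicate may be a noun ("Mary is a darling"),
    so a fuller version would also check part-of-speech tags.
    """
    predicates = [g for g in governors("nsubj", name) if g in copular_predicates()]
    return dependents("amod", name) + predicates


def verbs_with_subject(name):
    return [g for g in governors("nsubj", name) if g not in copular_predicates()]


def verbs_with_object(name):
    return governors("dobj", name)


print(descriptions_of("Mary"))
print(verbs_with_subject("Mary"))
print(verbs_with_object("Mary"))
```

<p>The design choice worth noticing is that nothing here depends on word order or distance: the adjective is found because the parser said it modifies Mary, not because it happens to be nearby.</p>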
<h3>Topic modeling: a statistical approach</h3>
<p>With the availability and relative popularity of topic modeling algorithms in machine learning toolkits like <a href="http://mallet.cs.umass.edu/">Mallet</a>, it would not be appropriate to leave this class of analysis out of my post.  Topic models were originally developed as a way to represent a large collection of documents in a compact way, but are interesting to more people now because the &#8220;topics&#8221; they produce can sometimes correspond to coherent concepts.</p>
<p>One way of representing a document in a compact way is by representing it as a set of word counts. This<em> bag-of-words</em> contains no information about relative ordering, only information about co-occurrence. Topic modeling is motivated by the idea that there are more words in a language than topics to which they belong, so documents can be represented even <em>more</em> compactly by a set of topics, where a topic itself encodes some distribution of the probability of words. For example, one can imagine that every article in the literature on psychology can be compactly represented by its proportions of a vocabulary of topics such as experiments, personality, drugs, theories, cognition etc.</p>
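<p>The bag-of-words idea is easy to see in code. This is a minimal sketch using Python&#8217;s standard <code>collections.Counter</code> with a naive regular-expression tokenizer (my own simplification); it covers only the word-count representation, not topic modeling itself.</p>

```python
import re
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts, discarding word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


first = bag_of_words("Mary thanked the man")
second = bag_of_words("The man thanked Mary")

print(first)
print(first == second)
```

<p>The two sentences mean different things, yet their bags are identical, which is exactly the information about relative ordering that this representation gives up.</p>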
<p>Below are the most frequent words in the nine most frequent automatically extracted topics in the abstracts of the Psychological Review, extracted using <a href="http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm">this</a> topic modeling toolbox.</p>
<blockquote><p><span style="color:#ff6600;"><em>&#8216;similarity bias strategies drug systematic biases conditions&#8217;<br />
&#8216;order serial search process parallel elements attention&#8217;<br />
&#8216;stimulus response stimuli responses color cs increase&#8217;<br />
&#8216;ss s change rate normal underlying practice&#8217;<br />
&#8216;self individual situations individuals those others consequences&#8217;<br />
&#8216;environment general behaviors constraints internal other external&#8217;<br />
&#8216;2 experiments single results experimental high trial&#8217;<br />
&#8216;personality variables measures research consistency issues cross&#8217;<br />
&#8216;pattern patterns changes critical false food sequences&#8217;</em></span></p></blockquote>
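<p>Lists like these are typically read off a topic by ranking its words by weight and keeping the top few. Here is a minimal sketch of that step. The weights are invented for illustration and are not output from the toolbox, and <code>top_words</code> is a hypothetical helper, not a function the toolbox provides.</p>

```python
def top_words(word_weights, n):
    """Return the n highest-weighted words, heaviest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(word_weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Invented weights for a single topic (NOT real toolbox output).
topic = {
    "stimulus": 0.30,
    "response": 0.25,
    "stimuli": 0.20,
    "responses": 0.15,
    "color": 0.05,
    "increase": 0.05,
}

print(" ".join(top_words(topic, 3)))
```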
<p>This method can be applied to any text, and can give interesting results when paired with humanistic intuition. For an <a href="http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/">illustrative example</a> from the digital humanities (much better than any I could make up involving Mary), read the work of <a href="http://twitter.com/historying">Cameron Blevins</a>, a history Ph.D. student at Stanford, who has used topic modeling to glean relationships and trends from a text he was studying: Martha Ballard&#8217;s diary. Finally, for an excellent and thorough introduction aimed at the digital humanities audience, I can&#8217;t think of a better piece than Scott Weingart&#8217;s <a title="A guided tour of topic modeling" href="http://www.scottbot.net/HIAL/?p=19113">guided tour of topic modeling</a>.</p>
<p>The techniques I&#8217;ve talked about here are building blocks. Natural language processing algorithms exist for many more complicated (and potentially more useful) purposes: named entity recognition, semantic similarity calculation, relationship extraction, opinion mining, pronoun resolution, summarization, question answering, translation&#8230;</p>
<p>There are many tools, and they&#8217;re probably very badly documented, but hopefully I&#8217;ve managed to advance the case for considering sophisticated language processing like this as part of the natural toolkit of the digital humanities.</p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/tools-2-nlp/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MetaOptimize: Q+A for the large data set community</title>
		<link>http://wordseer.berkeley.edu/metaoptimize-qa-for-the-large-data-set-community-2/</link>
		<comments>http://wordseer.berkeley.edu/metaoptimize-qa-for-the-large-data-set-community-2/#comments</comments>
		<pubDate>Wed, 07 Jul 2010 18:19:44 +0000</pubDate>
		<dc:creator>silverasm</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[user interfaces]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://mininghumanities.com/?p=215</guid>
		<description><![CDATA[Joseph Turian &#38; co. at MetaOptimize have started a Q+A forum for "data geeks" - people in machine learning or data mining who deal with questions about visualizing, processing, or otherwise making sense of big data sets<div class="read-more"><a href="http://wordseer.berkeley.edu/metaoptimize-qa-for-the-large-data-set-community-2/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
			<content:encoded><![CDATA[<p><a title="Joseph Turian" href="http://metaoptimize.com/blog/about-joseph-turian/">Joseph Turian</a> &amp; co. at MetaOptimize have started a <a title="q+a" href="http://metaoptimize.com/qa">Q+A forum</a> for &#8220;data geeks&#8221; &#8211; people in machine learning or data mining who deal with questions about visualizing, processing, or otherwise making sense of big data sets:</p>
<blockquote><p>You and other data geeks can ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization.</p>
<p>Here you can ask and answer questions, comment and vote for the questions of others and their answers. Both questions and answers can be revised and improved. Questions can be tagged with the relevant keywords to simplify future access and organize the accumulated material.</p></blockquote>
<p>I&#8217;ve never been a forum participant, but finally, something I can use! This community was so spread out and disconnected that the best way to get advice about these topics used to be to walk up to one of your colleagues and hope they&#8217;d dealt with the same problem before. But, brought together by this forum, we can give each other informative answers to obscure (but terribly-important-at-the-moment) questions about stuff we work on every day.</p>
<p>(Thanks to <a href="http://flowingdata.com/2010/07/06/stack-overflow-for-data-geeks/">Flowing Data</a> for the tip.)</p>
]]></content:encoded>
			<wfw:commentRss>http://wordseer.berkeley.edu/metaoptimize-qa-for-the-large-data-set-community-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
