
Tools for Exploring Text: Visualization


In light of my new tool to help navigate the New York Times, I’ve been reading about previous approaches to the problem of making sense of large collections of text. As far as I can tell, the research comes from three different communities, and answers three slightly different questions:

  1. From the visualization community: how can we display aspects of text collections to give users a sense of what they contain?
  2. From the UI community: what kinds of interactions and information do users find useful when exploring text collections?
  3. From the NLP community: what can we extract from text collections that might give people a sense of their contents?

In this post, I’m going to summarize what seemed like the high points of the visualization sub-field, targeted towards the digital humanities community.

Visualization: how can we display aspects of text collections to give users a sense of what they contain?

The big problem that text collections (e.g. digital libraries) pose for visualization is that they consist of unstructured text (as contrasted with structured text, like a Wikipedia infobox). Unstructured text is difficult to visualize for two reasons. First, it is not what we usually think of as “data”: it has no inherent order and no clear hierarchy or relationship structure. Second, it’s just unwieldy: it takes up a lot of space, doesn’t lend itself to compact symbolic representation, and is rarely pre-attentive (easy to make out without really paying attention).
Nevertheless, Martin Wattenberg and Fernanda Viegas (IBM Research, ManyEyes, Flowing Media) have come up with great solutions for the problem of getting a “visual perspective” on a text: a sense of similarity, importance, relevance, and relationship. They seem to have had all the ideas lately:
• Phrase nets
• Word trees
• Two-word clouds (touched on in this magazine article [pdf])

Wattenberg and Viegas use tried-and-tested methods for extracting syntactic structure: parts of speech, phrase structures, and grammatical relationships (e.g. “iPad prices” is the subject of “fell” in “iPad prices fell by 70% this morning”). The natural language processing needed to do this has been around for more than 20 years and is actually very reliable. The downside is that it produces only very rough statistics: frequencies and co-occurrence counts of different kinds of language patterns. Their focus is instead on the quality of the visualization: responsiveness, legibility, and the understandability of the resulting graphics.
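To make “rough statistics” concrete, here is a minimal sketch of the sort of subject–verb counts such grammatical relationships reduce to. It uses spaCy and its “en_core_web_sm” model purely as an illustration of an off-the-shelf dependency parser; it is not what the ManyEyes team actually used.

```python
# A minimal sketch (not the ManyEyes code) of the rough statistics described
# above: counting (subject, verb) pairs with an off-the-shelf dependency parser.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed

def subject_verb_counts(text):
    """Count pairs like ('prices', 'fall') from 'iPad prices fell by 70%'."""
    counts = Counter()
    for token in nlp(text):
        if token.dep_ == "nsubj":  # token is the nominal subject of its head verb
            counts[(token.text.lower(), token.head.lemma_)] += 1
    return counts

print(subject_verb_counts("iPad prices fell by 70% this morning."))
# e.g. Counter({('prices', 'fall'): 1})
```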

a. Phrase Nets

A phrase net of Pride and Prejudice with the pattern "X and Y"

A phrase net of Pride and Prejudice with the pattern "X at Y"

Phrase nets are a way of visualizing relationships between words or phrases. You select a pattern, such as “X at Y” (above), or a grammatical relationship, such as “X is-the-subject-of Y”, and the algorithm creates a visualization of the most frequent occurrences of the pattern: larger font sizes indicate higher frequency, and a darker value indicates that a word occurs in the “X” position more often than in the “Y” position.
This is the best, easiest way I’ve seen to do an in-depth exploration of the relationships in a text. The visualization is designed to re-draw very quickly, so you can query different relationships as they strike you.
You can explore phrase nets built from your own text at the IBM Many Eyes project website. The research paper [Google Scholar] has more visualizations.
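As a rough sketch of the counting underneath a lexical phrase net (my own reconstruction, not the ManyEyes implementation), for a pattern like “X and Y” you tally every pair of words joined by “and” and keep track of which slot each word fills:

```python
# Rough sketch of the counts behind a phrase net for the pattern "X and Y".
# Edge frequency -> line thickness; X-slot frequency -> darker node colour.
import re
from collections import Counter

PATTERN = re.compile(r"\b(\w+) and (\w+)\b")

def phrase_net_counts(text):
    edges = Counter()   # (X, Y) pair frequencies
    x_slot = Counter()  # how often each word appears as X
    y_slot = Counter()  # how often each word appears as Y
    for x, y in PATTERN.findall(text.lower()):
        edges[(x, y)] += 1
        x_slot[x] += 1
        y_slot[y] += 1
    return edges, x_slot, y_slot

edges, x_slot, y_slot = phrase_net_counts(
    "Pride and Prejudice. Sense and Sensibility. Pride and joy."
)
print(edges.most_common())
# e.g. [(('pride', 'prejudice'), 1), (('sense', 'sensibility'), 1), (('pride', 'joy'), 1)]
```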

b. Word Trees

A word tree of all occurrences of "I have a dream" in Martin Luther King's speech

Word trees are a way of visualizing the sequences of words in a text. In the figure above, there’s a line between two words if the second follows the first. Word trees are great for exploring the contexts in which words appear, and for revealing which continuations are more frequent than others. IBM Many Eyes lets you create these from your own text.
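To make the structure concrete, here is a small sketch of how the data behind a word tree might be built, under the assumption that a word tree is essentially a prefix tree of the continuations of a chosen root phrase (my own sketch, not the Many Eyes code):

```python
# A sketch of the data structure behind a word tree: a trie of the word
# sequences that follow a chosen root phrase, with counts for sizing branches.
def word_tree(sentences, root, depth=3):
    """Return a nested dict of the words that follow `root`, with frequencies."""
    root_words = root.lower().split()
    tree = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - len(root_words) + 1):
            if words[i:i + len(root_words)] == root_words:
                node = tree
                tail = words[i + len(root_words):i + len(root_words) + depth]
                for w in tail:
                    child = node.setdefault(w, {"count": 0, "children": {}})
                    child["count"] += 1
                    node = child["children"]
    return tree

speech = [
    "I have a dream that one day this nation will rise up",
    "I have a dream that my four little children will one day live in a nation",
]
print(word_tree(speech, "I have a dream"))
# {'that': {'count': 2, 'children': {'one': ..., 'my': ...}}}
```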

c. Two-word word clouds

Comparison between a single-word and a two-word cloud of a speech by President Obama in 2007

We’re all familiar with tag clouds: displays of words that vary in size according to how frequent they are. While tag clouds are great for exploring a collection of tags, which are meaningful by themselves, they are less useful for words. Individual words are less informative, which is why the tag-cloud method can give less-than-intuitive results when applied to them. Context is an important source of meaning for words, and two-word tag clouds are a way to include some of it. I think they give a better sense of the contents of a text than the corresponding single-word cloud.
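The counting behind a two-word cloud is just bigram frequency. Here is a minimal sketch, with the layout and styling of the actual cloud omitted:

```python
# Minimal sketch of the counts behind a two-word cloud: tally adjacent word
# pairs; in the visualization, font size would scale with the count.
import re
from collections import Counter

def bigram_counts(text):
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

speech = "health care reform will make health care affordable for every family"
for (w1, w2), n in bigram_counts(speech).most_common(3):
    print(f"{w1} {w2}", n)
# health care 2
# care reform 1
# ...
```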

