Corpus: A collection of written texts.
Data mining: The application of statistical and computational methods to large data sets in order to unearth new information for research purposes.
Data visualization: Using graphics to display information in new and innovative ways.
"Distant reading": Phrase coined by literary theorist Franco Moretti to describe the process of using computational methods to analyze massive quantities of text. Often used as a counterpoint to "close reading."
Links: A visualization that represents the collocation of terms in a corpus as a network, using a force-directed graph. In this graph, the relative size of each term indicates its frequency.
Word cloud: A visualization of word frequencies. The more frequently a word appears in a given text, the larger its size in the visualization.
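To make the last two definitions concrete, here is a minimal Python sketch of the counts that sit behind a word cloud (word frequencies) and a collocation network (pairs of words that appear near each other). The function names, the two-word window, and the sample sentence are illustrative assumptions; tools such as Voyant may tokenize and count differently.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text):
    """Count how often each word occurs (the numbers behind a word cloud)."""
    return Counter(tokenize(text))

def collocations(text, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    tokens = tokenize(text)
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at the next `window` tokens only.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs

# Made-up sample sentence, used only to show the mechanics.
sample = "the cat sat on the mat and the cat slept"
freqs = word_frequencies(sample)
pairs = collocations(sample)
print(freqs.most_common(3))
print(pairs.most_common(3))
```

In a word cloud, the counts in freqs would set the size of each word; in a collocation network, the counts in pairs would decide which terms are linked.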
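Many full-text databases now offer embedded tools for basic text analytics. Two such databases, both used in the examples below, are the New York Times Historical database and the Archives of Sexuality and Gender.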
Example 1: Word frequencies over time
Using the New York Times Historical database, search for the term "poverty," with no date limitations. Look for the term frequency chart on the left of the results page.
What patterns do you notice? What questions do you have? How might you go about answering these questions?
Click on one of the decades to see term frequencies by year. What questions do you have?
Example 2: Word clusters over time
Using the New York Times Historical database, search for the terms: Islam* and terror*. (The asterisk tells the database to search for all forms of the words.) Are the results what you expect? Why or why not? What other terms might you want to search?
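The asterisk (truncation) search can be pictured as a pattern match on the start of a word. The sketch below is an assumption about the general idea, not a description of how any particular database implements it; the function name and sample sentence are hypothetical.

```python
import re

def truncation_to_regex(term):
    """Turn a search term such as 'terror*' into a whole-word regular expression."""
    stem = re.escape(term.rstrip("*"))
    # A trailing asterisk means "any word characters may follow the stem".
    suffix = r"\w*" if term.endswith("*") else ""
    return re.compile(rf"\b{stem}{suffix}\b", re.IGNORECASE)

pattern = truncation_to_regex("terror*")
# Made-up sentence for illustration.
sample = "Terror, terrorism, and terrorists are matched; counterterrorism is not."
matches = pattern.findall(sample)
print(matches)
```

Because the stem must begin the word in this sketch, "counterterrorism" is not matched. Whether a given database treats such cases the same way is worth checking in its help pages.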
Example 3: Word clusters in an archival corpus
Using the Archives of Sexuality and Gender, search for the term "discrimination." On the left navigation bar, look for "Analyze Results," and then select "Term clusters."
What does the visualization wheel tell you? What can't it tell you? What other term searches might you want to do?
Voyant Tools is a powerful, free, web-based tool for large-scale text analysis and "distant reading." Voyant is an easy entry point into text analysis because it does not require advanced technical skills. To begin working with Voyant, first gather the digital text(s) you want to analyze.
Voyant provides excellent online documentation and tutorials.
To practice, we will be using the full text of The Moynihan Report, which can be found here.
1. Copy the full text of the report and paste it into Voyant Tools.
2. Refine and apply your stopword list.
3. Experiment with the various visualization options until you find one that seems to offer the best insight into the text or that raises new avenues of inquiry.
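To see what refining a stopword list does to the results, the following sketch counts words after removing a stopword list, then again after adding one more term to that list. The tiny stopword set, the function name, and the sample sentence are hypothetical; Voyant ships with its own, much longer lists and handles this step inside its interface.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "that"}

def content_word_counts(text, stopwords=STOPWORDS, extra=()):
    """Count words after dropping stopwords plus any extra terms you choose."""
    tokens = re.findall(r"[a-z']+", text.lower())
    blocked = set(stopwords) | {w.lower() for w in extra}
    return Counter(t for t in tokens if t not in blocked)

# Made-up sentence, not a quotation from any report.
sample = "The report of the committee and the findings of the report"
counts = content_word_counts(sample)
refined = content_word_counts(sample, extra=["report"])
print(counts.most_common())
print(refined.most_common())
```

Adding a term such as "report" to the list is a typical refinement when a word is so common in the corpus that it crowds out more revealing terms.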
Now, imagine that you'd like to compare this visualization with a visualization of the editorial responses to the report that appeared in the African American press.
Step 1: Build your corpus.
Step 2: Copy and paste the cleaned-up text into Voyant and run the analysis.
Step 3: Experiment with the various visualization options until you find one that seems to offer the best insight into the text.
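When comparing two texts of different lengths, raw counts can mislead, so one common approach is to compare relative frequencies (occurrences per 1,000 words). The sketch below illustrates that idea under stated assumptions: the function names are hypothetical and the two short strings are placeholders standing in for the report and the editorial corpus, not real excerpts.

```python
import re
from collections import Counter

def relative_frequencies(text, per=1000):
    """Return each word's rate of occurrence per `per` words of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count * per / total for word, count in Counter(tokens).items()}

def compare(text_a, text_b, terms):
    """For each term, pair its relative frequency in text A with that in text B."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    return {t: (freq_a.get(t, 0.0), freq_b.get(t, 0.0)) for t in terms}

# Placeholder strings standing in for two corpora of different lengths.
doc_a = "family family work income"
doc_b = "family work work work policy policy income income"
result = compare(doc_a, doc_b, ["family", "work"])
print(result)
```

A term that is proportionally much more frequent in one corpus than in the other is a reasonable starting point for closer reading of the passages where it appears.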