Collins Memorial Library: Honors 214: Interrogating Inequality: Text Analytics

Glossary

Corpus: A collection of written texts.

Data mining: The application of statistical and computational methods to large data sets in order to unearth new information for research purposes.

Data visualization: Using graphics to display information in new and innovative ways.

"Distant reading": Phrase coined by literary theorist Franco Moretti to describe the process of using computational methods to analyze massive quantities of text. Often used as a counterpoint to "close reading."

Links: Represents the collocation of terms in a corpus by depicting them as a network in a force-directed graph. In this graph, the frequency of each word is indicated by the relative size of the term.

Word cloud: A visualization of word frequencies. The more frequently a word appears in a given text, the larger its size in the visualization.
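To make the idea of word frequency concrete, here is a minimal Python sketch of the counting step that sits behind a word cloud. It is an illustration only, not the method any particular tool uses; the function name and the sample sentence are made up for this example.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# A made-up sample sentence; replace it with your own text.
sample = "The cat sat on the mat and the dog sat too."
freqs = word_frequencies(sample)

# The most frequent words would be drawn largest in a word cloud.
print(freqs.most_common(2))
```

A word cloud simply maps these counts to font sizes, which is why very common words such as "the" dominate until they are filtered out.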

Embedded Visualization Tools

Many full-text databases now offer embedded tools for basic text analytics. Two such databases are:

New York Times Historical Newspaper
Includes full page images of the New York Daily Times (1851-1857) and the New York Times (1857-2017).
Archives of Sexuality & Gender
Primary source material that explores the many facets of LGBTQ history. Users will find page images of a mixture of activist groups’ organizational papers, flyers and pamphlets, NGO and Governmental reports, periodicals, newsletters, personal papers, legal documents and more.

Example 1: Word frequencies over time

Using the New York Times Historical database, search for the term "poverty," with no date limitations. Look for the term frequency chart on the left of the results page.

What patterns do you notice? What questions do you have? How might you go about answering these questions?

Click on one of the decades to see term frequencies by year. What questions do you have?
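If you export or jot down the publication years of your results, a chart like this one can be approximated by grouping years into decades. The sketch below assumes you already have a list of years; the years shown are made-up placeholders, not real search results, and the function name is hypothetical.

```python
from collections import Counter

def counts_by_decade(years):
    """Group publication years into decades and count the items in each."""
    return Counter((year // 10) * 10 for year in years)

# Made-up placeholder years standing in for the dates of search results.
years = [1893, 1897, 1931, 1933, 1936, 1964, 1965]
by_decade = counts_by_decade(years)

for decade in sorted(by_decade):
    print(f"{decade}s: {by_decade[decade]}")
```

Keep in mind that raw counts like these do not account for how much the newspaper published in each decade, which is one of the questions worth asking about any frequency chart.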

Example 2: Word clusters over time

Using the New York Times Historical database, search for the terms: Islam* and terror*. (The asterisk tells the database to search for all forms of the words.) Are the results what you expect? Why or why not? What other terms might you want to search?
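To see what a truncation search is doing, the following Python sketch mimics the asterisk with a regular expression. Databases implement wildcards in their own ways, so treat this as an approximation; the function name and the sample sentence are invented for illustration.

```python
import re

def wildcard_to_regex(term):
    """Turn a truncation search such as 'terror*' into a case-insensitive pattern."""
    stem = re.escape(term.rstrip("*"))
    # A trailing asterisk allows any further letters after the stem.
    suffix = r"\w*" if term.endswith("*") else ""
    return re.compile(r"\b" + stem + suffix + r"\b", re.IGNORECASE)

pattern = wildcard_to_regex("terror*")
text = "Terrorism, terrorist, and terror all match; deterrent and error do not."
matches = pattern.findall(text)
print(matches)
```

Notice that the pattern only matches words that begin with the stem, so a truncated search can pull in word forms you did not anticipate while missing related terms that are spelled differently.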

Example 3: Word clusters in an archival corpus

Using the Archives of Sexuality & Gender, search for the term "discrimination." On the left navigation bar, look for "Analyze Results," and then select "Term clusters."

What does the visualization wheel tell you? What can't it tell you? What other term searches might you want to do?
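One simple way to think about term clusters is as a count of the words that appear near your search term. The sketch below shows that idea in Python; it is not a description of how the database builds its visualization, and the function name, window size, and sample text are assumptions made for illustration.

```python
import re
from collections import Counter

def nearby_terms(text, target, window=3):
    """Count words that appear within `window` words of each occurrence of `target`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            start = max(0, i - window)
            # Words just before and just after the target, excluding the target itself.
            neighbors = words[start:i] + words[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts

# A made-up sample; replace it with text from your own corpus.
sample = "Housing discrimination persists; employment discrimination persists too."
result = nearby_terms(sample, "discrimination", window=1)
print(result)
```

A count like this shows which words travel together, but not why, which is one reason to return to the documents themselves.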

Voyant

Voyant Tools is a powerful, free, web-based tool for large-scale analysis of texts and "distant reading." Voyant is an easy entry point into text analysis because it does not require advanced technical skills. To begin working with Voyant, first gather the digital text(s) you want to analyze.

Voyant provides excellent online documentation and tutorials.

To practice, we will be using the full text of The Moynihan Report, which can be found here.

1. Copy the full text of the report and paste it into Voyant Tools.

2. Refine and apply your stopword list.

3. Experiment with the various visualization options until you find one that seems to offer the best insight into the text or that raises new avenues of inquiry.
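To illustrate what a stopword list does, here is a short Python sketch that removes common words before counting. It is not Voyant's own implementation; the tiny stopword list, the function name, and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that"}

def content_word_counts(text, stopwords=STOPWORDS, extra=()):
    """Count words after dropping stopwords plus any extra terms you choose."""
    words = re.findall(r"[a-z']+", text.lower())
    skip = set(stopwords) | set(extra)
    return Counter(w for w in words if w not in skip)

# A made-up sample sentence; "report" is added as a corpus-specific stopword.
sample = "The structure of the family is the subject of the report."
counts = content_word_counts(sample, extra={"report"})
print(counts.most_common())
```

Refining a stopword list is an interpretive choice: dropping a word that appears in nearly every document, as "report" does here, can reveal patterns, but it can also hide them.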

Now, imagine that you'd like to compare this visualization with one of the editorial responses to the report that appeared in the African American press.

Black Studies Center
Includes scholarly essays, recent periodicals, historical newspaper articles, reference books, and much more, including The Schomburg Studies on the Black Experience, Index to Black Periodicals Full Text, Black Literature Index, and the Chicago Defender historical newspaper from 1912-1975.

Step 1: Build your corpus.

Select the articles you want to include in your corpus.
You will need to OCR the PDFs of the articles. You can do this several ways:
- Print the PDF, then use the library's scanner to create "searchable PDFs."
- Import the PDF into Google Documents and export it from there as HTML, RTF, or another format that Voyant can read.
- Save the PDF to the desktop or to Dropbox and use an online OCR generator (these tend to be less reliable).
Clean up the text.
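Cleaning can be done by hand, but a few common OCR problems can also be fixed with a short script. The Python sketch below is one possible approach under simple assumptions (plain text input, words split across line breaks with a hyphen); the function name and the sample text are invented for illustration.

```python
import re

def clean_ocr_text(raw):
    """Apply a few common fixes to OCR output before analysis."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "fam-\nily" becomes "family".
    # Caution: this also joins true compounds that happen to break at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces, keeping blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

# A made-up sample of messy OCR output.
raw = "The fam-\nily   structure\nis discussed.\n\nNext paragraph."
cleaned = clean_ocr_text(raw)
print(cleaned)
```

A script like this will not catch misread characters, so it is still worth skimming the cleaned text for errors before loading it into Voyant.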

Step 2: Copy and paste the cleaned-up text into Voyant and run the program.

Step 3: Experiment with the various visualization options until you find one that seems to offer the best insight into the text.