A corpus is a collection of texts or text extracts that have been put together to be used as a sample of a language or language variety. It consists of texts that have been produced in "natural contexts" (published books, ordinary conversation, letters, newspapers, lectures, etc.), which means it mirrors natural language use.
A reference corpus (created to be a balanced sample of a language variety) can be used as the basis of comparison between a text/genre and standard language.
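As a rough illustration of such a comparison, the sketch below ranks the words of a target text by how much more frequent they are there than in a reference corpus. It is a minimal example under stated assumptions, not a standard keyness measure: the function names, the simple regular-expression tokenizer, and the `smoothing` constant are all hypothetical choices made for this example, and real studies usually rely on much larger corpora and statistics such as log-likelihood.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens):
    """Return each word's share of all tokens (the shares sum to 1)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def overused_words(target_text, reference_text, smoothing=1e-6):
    """Rank target words by relative frequency compared with the reference.

    `smoothing` is an assumed small constant that avoids division by zero
    for words that never occur in the reference corpus.
    """
    target = relative_frequencies(tokenize(target_text))
    reference = relative_frequencies(tokenize(reference_text))
    ratios = {
        word: freq / (reference.get(word, 0.0) + smoothing)
        for word, freq in target.items()
    }
    # Words with the highest ratio are the most characteristic of the target.
    return sorted(ratios, key=ratios.get, reverse=True)


if __name__ == "__main__":
    target = "the corpus the corpus annotation"
    reference = "the cat sat on the mat the end"
    print(overused_words(target, reference))
```

Words that are common in both texts, such as "the", end up near the bottom of the ranking, while words typical of the target text rise to the top.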
Specialized corpora can be used to examine or compare different language varieties, such as language from a particular area, language of a certain genre or text type, or language produced by particular users.
Corpora can be synchronic (covering a single time period) or diachronic (covering several time periods), can consist of different media (written or spoken language), and can include more than one language.
Annotated corpora have extra information added, usually linguistic information (such as part-of-speech tags) or metadata (information about the material in the corpus, speakers/authors, situation, extra-linguistic information, etc.).
There are corpora that can be consulted online, via a custom-built interface, and ones that you explore with stand-alone tools that you install on your computer.
Is it linguistics or philology? Take a look at word usage frequency over the past three centuries!
HathiTrust contains millions of digital books, journals, government documents, and other volumes, all digitized from research libraries. The collection includes both public domain and in-copyright works across a full range of subjects. Over half of the content is in English, but hundreds of languages are represented, including large amounts of material in German, French, Chinese, Russian and Spanish. Although all items are discoverable, viewability depends on the rights status of the individual item.
Check out our guide to using HathiTrust Digital Library for more information.