Monday, December 30, 2013

How Many Words are there in the English Language?

The Corpus of Global Web-Based English (GloWbE) is composed of 1.9 billion words from 1.8 million web pages in 20 different English-speaking countries. The corpus was created by Mark Davies of Brigham Young University, and it was released in April 2013.
 
GloWbE (pronounced like "globe") is related to other large corpora from the same creator, including the 450 million word Corpus of Contemporary American English (COCA) and the 400 million word Corpus of Historical American English (COHA). Together, these three corpora allow researchers to examine variation in English -- by dialect, genre, and over time -- in ways that are not possible with any other large corpora of English.

SIZE: At the most basic level, GloWbE allows you to search through a corpus that is more than four times as large as COCA (and nearly twenty times as large as the British National Corpus). This means that where you might only have 10-12 tokens in the BNC and 50-60 in COCA, you might have 250-300 in GloWbE.
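The size comparison above is simple proportional arithmetic, and a minimal Python sketch makes it explicit. The GloWbE and COCA sizes are the figures quoted in this post; the BNC size of roughly 100 million words is an assumption that the post does not state, and the helper names are purely illustrative. The projection assumes a word has the same relative frequency in every corpus, so it gives only a rough estimate; real counts depend on the kinds of text each corpus contains.

```python
# Corpus sizes in words. The GloWbE and COCA figures come from the post;
# the BNC figure (about 100 million words) is an assumption not stated there.
CORPUS_SIZES = {
    "BNC": 100_000_000,
    "COCA": 450_000_000,
    "GloWbE": 1_900_000_000,
}


def size_ratio(larger: str, smaller: str) -> float:
    """Return how many times bigger one corpus is than another."""
    return CORPUS_SIZES[larger] / CORPUS_SIZES[smaller]


def scale_tokens(count: float, source: str, target: str) -> float:
    """Project a raw token count from one corpus onto another.

    Assumes the item has the same relative frequency in both corpora,
    which is only a rough approximation.
    """
    return count * CORPUS_SIZES[target] / CORPUS_SIZES[source]


if __name__ == "__main__":
    print(f"GloWbE vs COCA: {size_ratio('GloWbE', 'COCA'):.1f}x")
    print(f"GloWbE vs BNC:  {size_ratio('GloWbE', 'BNC'):.1f}x")
    # Project the post's low-end BNC example (10 tokens) onto GloWbE.
    print(f"10 BNC tokens ~ {scale_tokens(10, 'BNC', 'GloWbE'):.0f} GloWbE tokens")
```

Under these assumptions the ratios come out at a little over four for COCA and nineteen for the BNC, in line with "more than four times" and "nearly twenty times" above.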

DIALECTS: The real power of GloWbE, though, is the ability to see the frequency of any word, phrase, or grammatical construction in each of the 20 different countries. You can also compare any features in two sets of dialects, such as British and American English (in more than 775 million words of text for just these two dialects). Or you could just limit your search to one or two countries (e.g. Australia (148 million words), South Africa (45 million), or Singapore (43 million)), and you'll still be searching the largest online corpus for most of these twenty countries. 
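Because the country sections differ so much in size, raw counts are not directly comparable across dialects; a common approach is to normalize each count to a rate per million words. The sketch below is only an illustration of that idea, not a description of how the corpus interface itself computes its figures. The country sizes are the ones quoted above, while the raw counts and the function names are hypothetical, made up for the example.

```python
# Section sizes in words, as quoted in the post.
COUNTRY_SIZES = {
    "Australia": 148_000_000,
    "South Africa": 45_000_000,
    "Singapore": 43_000_000,
}

# Hypothetical raw counts for some word or phrase (invented for illustration).
HYPOTHETICAL_COUNTS = {
    "Australia": 296,
    "South Africa": 45,
    "Singapore": 129,
}


def per_million(raw_count: int, corpus_words: int) -> float:
    """Convert a raw count into occurrences per million words."""
    return raw_count * 1_000_000 / corpus_words


def normalized_frequencies(counts: dict, sizes: dict) -> dict:
    """Return the per-million rate for every country present in counts."""
    return {
        country: per_million(count, sizes[country])
        for country, count in counts.items()
    }


if __name__ == "__main__":
    rates = normalized_frequencies(HYPOTHETICAL_COUNTS, COUNTRY_SIZES)
    # List countries from the highest normalized rate to the lowest.
    for country, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{country}: {rate:.2f} per million words")
```

With these invented counts, Australia has the most raw hits but Singapore has the highest rate per million words, which is the kind of difference that normalization is meant to expose.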

With GloWbE you can search for an extremely wide range of phenomena (the same as with all of the other corpora from corpus.byu.edu): words, phrases, grammatical constructions, synonyms, customized lists, and collocates (nearby words, which provide insight into meaning and usage). In addition, many of these searches run 5-6 times as fast as with other corpus architectures like Sketch Engine / CQPWeb.


To see a number of examples of what you can do with the corpus, feel free to take a quick five-minute tour.
