But that raw material is surprisingly elusive. Getting people to speak naturally in a controlled study is hard. Eavesdropping is difficult, time-consuming and invasive of privacy. For these reasons, linguists often rely on a "corpus" of language, a body of recorded speech and writing, nowadays usually computerised. But traditional corpora have their disadvantages too. The British National Corpus contains 100m words, of which 10m are speech and 90m writing. But it represents only British English, and 100m words is not so many when linguists search for rare usages. Other corpora, such as the North American News Text Corpus, are bigger, but contain only formal writing and speech.
Linguists, however, are slowly coming to discover the joys of a free and searchable corpus of maybe 10 trillion words that is available to anyone with an internet connection:the world wide web. The trend, predictably enough, is prevalent on the internet itself. For example, a group of linguists write informally on a weblog called Language Log [languagelog
discussion @ /.