New from the BYU corpora: the NOW corpus and virtual corpora

Mark Davies

May 2016 will see two exciting developments from the BYU corpora (corpus.byu.edu), which are probably the most widely used corpora at present. In this presentation I will give a “sneak peek” of these changes.

First, we will release the NOW corpus (Newspapers on the Web). The corpus is composed of about three billion words of data from web-based newspapers, covering every day from January 2010 to the present. Most importantly, the corpus grows by about 6 to 7 million words each day, which makes it ideal for looking at ongoing changes in the language.

Second, we have incorporated into all of the BYU corpora the ability to create and use “virtual corpora” (previously only available with the BYU Wikipedia corpus). Users can create virtual corpora based on source (e.g. a particular magazine or newspaper or author), title, date, (sub-)genre, and even words within the text. They can then search within their virtual corpora, compare across them, and even extract keywords.
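To make the idea of a virtual corpus concrete, the following is a minimal sketch in Python. It is not the BYU interface or its implementation: the `Text` record, the `virtual_corpus` and `keywords` functions, the smoothed frequency ratio used for ranking, and the three toy texts are all hypothetical and chosen only for illustration. It shows the general pattern described above: select texts by metadata (and optionally by a word they contain), then compare one selection against another to surface keywords.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Text:
    """A toy corpus text with the kinds of metadata mentioned above (hypothetical schema)."""
    source: str
    year: int
    genre: str
    words: list


def virtual_corpus(texts, source=None, year=None, genre=None, contains=None):
    """Select the texts that match every criterion that was supplied."""
    selected = []
    for t in texts:
        if source is not None and t.source != source:
            continue
        if year is not None and t.year != year:
            continue
        if genre is not None and t.genre != genre:
            continue
        if contains is not None and contains not in t.words:
            continue
        selected.append(t)
    return selected


def keywords(target, reference, top=3):
    """Rank words by relative frequency in target versus reference.

    Assumption: a simple smoothed frequency ratio stands in for whatever
    keyness measure a real corpus tool would use.
    """
    t_counts = Counter(w for t in target for w in t.words)
    r_counts = Counter(w for t in reference for w in t.words)
    t_total = sum(t_counts.values())
    r_total = sum(r_counts.values())

    def ratio(w):
        # Add-one smoothing so words absent from the reference do not divide by zero.
        return (t_counts[w] / t_total) / ((r_counts[w] + 1) / (r_total + 1))

    # Highest ratio first; ties are broken alphabetically for a stable order.
    return sorted(t_counts, key=lambda w: (-ratio(w), w))[:top]


# Invented example data, not drawn from any real corpus.
texts = [
    Text("Paper A", 2015, "news", "the election results surprised the voters".split()),
    Text("Paper A", 2016, "news", "the election campaign begins".split()),
    Text("Magazine B", 2016, "sports", "the team won the match".split()),
]

paper_a = virtual_corpus(texts, source="Paper A")
sports = virtual_corpus(texts, genre="sports")
top_keywords = keywords(paper_a, sports)
```

In this sketch, `paper_a` and `sports` play the role of two virtual corpora, and `top_keywords` lists the words that are most characteristic of the first relative to the second.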