The study proposes a method for comparing the ranges of linguistic variation covered by different corpora, using a model derived from a multi-dimensional (MD) analysis of register variability in a given language. The method is applied to a comparison of two corpora of Czech: Koditex, a “traditional” corpus carefully compiled from various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus whose composition is opportunistic but which is cheaper and easier to obtain. Texts from both corpora are projected onto the MD model, and the ranges of variation covered in each dimension are compared in order to identify areas of overlap as well as areas covered by only one of the two corpora. We also document a crucial methodological point relevant to MD analyses in general, namely that texts have to be of similar lengths for their scores on the dimensions to be comparable.
The results indicate that the type of language represented by traditional text categories such as journalism or non-fiction is equally well covered by web-crawled data, though traditional corpora of course keep their edge in the richness of the accompanying metadata. Importantly, text categories whose linguistic characteristics are partially or entirely unique emerged only from Koditex. They correspond to data that is hard to obtain with general-purpose web-crawling techniques: informal conversations, private correspondence, some types of fiction, but also user-generated content (comments on Facebook, forums, etc.).
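The projection-and-comparison step summarized above can be illustrated with a minimal sketch. It assumes a simple linear model in which a text's score on a dimension is a loading-weighted sum of its standardized feature frequencies; the study's actual model may differ. All function names, the toy loadings, and the toy feature values are hypothetical and serve only as illustration. Cutting texts into fixed-size chunks is likewise just one assumed way of making text lengths comparable.

```python
from typing import List, Optional, Sequence, Tuple

Range = Tuple[float, float]


def fixed_length_chunks(tokens: Sequence[str], size: int) -> List[List[str]]:
    """Cut a text into chunks of exactly `size` tokens (assumed strategy).

    The trailing remainder is dropped so that all chunks have equal length.
    """
    return [list(tokens[i:i + size])
            for i in range(0, len(tokens) - size + 1, size)]


def project(features: Sequence[float], means: Sequence[float],
            sds: Sequence[float],
            loadings: Sequence[Sequence[float]]) -> List[float]:
    """Score one text on every dimension.

    Features are z-scored with the reference means and standard deviations,
    then combined per dimension as a loading-weighted sum (assumed model).
    `loadings` holds one row of feature weights per dimension.
    """
    z = [(x - m) / s for x, m, s in zip(features, means, sds)]
    return [sum(w * zi for w, zi in zip(dim, z)) for dim in loadings]


def dimension_ranges(scores: Sequence[Sequence[float]]) -> List[Range]:
    """Return the (min, max) score per dimension over a set of texts."""
    return [(min(col), max(col)) for col in zip(*scores)]


def overlap(a: Range, b: Range) -> Optional[Range]:
    """Return the interval shared by two ranges, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# Hypothetical toy model: two features, two dimensions.
MEANS = [10.0, 5.0]
SDS = [2.0, 1.0]
LOADINGS = [[1.0, 0.0], [0.5, -0.5]]

# Hypothetical feature frequencies for two texts per corpus.
corpus_a = [[12.0, 5.0], [8.0, 6.0]]
corpus_b = [[14.0, 5.0], [10.0, 5.0]]

ranges_a = dimension_ranges([project(t, MEANS, SDS, LOADINGS) for t in corpus_a])
ranges_b = dimension_ranges([project(t, MEANS, SDS, LOADINGS) for t in corpus_b])
shared = [overlap(a, b) for a, b in zip(ranges_a, ranges_b)]

for dim, (a, b, s) in enumerate(zip(ranges_a, ranges_b, shared), start=1):
    print(f"dimension {dim}: corpus A {a}, corpus B {b}, overlap {s}")
```

In this sketch, the part of a corpus's range that lies outside the shared interval corresponds to variation covered by that corpus alone. Comparing only minima and maxima is a deliberate simplification: it is sensitive to outliers, so a fuller comparison would look at the distribution of scores within each dimension.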