About | International Comparable Corpus

The International Comparable Corpus (ICC) is an international collaborative project in the field of contrastive corpus-based linguistics. The ultimate goal of the project is to facilitate contrastive studies between English and other languages using highly comparable datasets of spoken, written and electronic registers. The ICC is not a parallel translation corpus (where source-language texts are aligned with their translations) but a set of comparable corpora in different languages. The languages currently included are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish, and Chinese. If you are interested in adding another language, please do get in touch.

The ICC starts from two observations: first, that many languages already have a wealth of language data that could be reused if carefully selected; and second, that contrastive analysis very often relies on comparisons with English. The ICC will therefore rely largely on re-using existing language resources and will be modelled to be comparable with the corpora of the International Corpus of English (ICE) family. As a result, a striking feature of each new corpus in the ICC will be its substantial spoken component, at present comprising ca. 600,000 words (or 60% of the current total). Providing spoken data across 15 or so discourse situations in several languages is unique in contrastive linguistics, as it will allow much-needed and so far unprecedented cross-linguistic corpus-based comparisons of spoken language. Together with balanced data across written registers, this will make the ICC an invaluable resource for future contrastive corpus-based research. The approach will also support replicability and allow comparisons with and between other languages.