A corpus-driven comparison of Slavic prepositions and derivational morphology, or: what massively parallel texts are good for

Ruprecht von Waldenfels

The comparison of cognate functional material in a set of closely related languages such as the Slavic genus is difficult and very labour-intensive, since differences tend to be subtle and are rarely clear-cut. The talk presents a method for investigating such differences on the basis of translationally equivalent texts, together with a corpus-driven system for the simple investigation of many heterogeneous linguistic variables.

I use a word-aligned, morphologically tagged and lemmatized parallel corpus of prose in all major Slavic languages (ParaSol, see www.parasolcorpus.org) to derive an extensionally defined handle on the domain of use of diverse linguistic categories across languages. In the talk, the use of prepositions and derivational affixes in translationally equivalent segments is compared across all major Slavic standard languages and evaluated using clustering algorithms as well as more qualitative techniques. The results show the usefulness of the technique and offer new insights into hard-to-see patterns of convergence and divergence of, say, Czech with respect to the other Slavic languages.
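To make the clustering step concrete, the following is a minimal sketch and not the procedure used in the talk, which the abstract does not specify. It assumes that each language is represented by a binary vector recording whether a given variable (for example a cognate preposition) occurs in each aligned segment, and it swaps in plain average-linkage agglomerative clustering on Hamming distance. The language labels, the `usage` table and all function names are hypothetical placeholders; none of the values come from ParaSol.

```python
from itertools import combinations

# Hypothetical toy data, NOT drawn from ParaSol: for each placeholder
# language, 1 means the variable (e.g. a given cognate preposition) occurs
# in the aligned segment, 0 means it does not.
usage = {
    "lang_A": [1, 1, 0, 1, 0, 1],
    "lang_B": [1, 1, 0, 1, 1, 1],
    "lang_C": [0, 0, 1, 0, 1, 0],
    "lang_D": [0, 1, 1, 0, 1, 0],
}


def hamming(u, v):
    """Share of aligned segments in which two languages differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def cluster_distance(c1, c2):
    """Average-linkage distance between two clusters of language names."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(hamming(usage[a], usage[b]) for a, b in pairs) / len(pairs)


def agglomerate(names):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = agglomerate(sorted(usage))
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

On this toy table the procedure first pairs lang_A with lang_B and lang_C with lang_D, and only then joins the two pairs. With real corpus data one such table would be built per variable, and the resulting groupings could be compared across variables and checked against qualitative inspection of the aligned segments.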