Multi-Dimensional Analysis of Czech

What is multi-dimensional analysis?

Multi-dimensional analysis (MDA) is a method for the empirical study of text variation, developed by the corpus linguist Douglas Biber. It aims to capture variation in terms of the functions that variable linguistic features perform in texts. Unlike earlier approaches, MDA does not start from an a priori identification of the linguistic features typical of a particular communication domain; instead, it takes the co-occurrence of linguistic features as the starting point for interpretation. The function that a group of features collectively fulfills is then inferred from the fact that these features frequently co-occur in texts.

What is the procedure of MDA?

MDA has been used as a research method for modeling register variation in many languages. The research procedure consists of the following steps:

corpus compilation,
feature selection and retrieval from the corpus (operationalization),
statistical evaluation using factor analysis,
interpretation of results.
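
The steps above can be illustrated with a minimal sketch. It makes several assumptions: the toy corpus, the feature names, and all counts are invented for illustration, and the statistical step uses the first principal axis of the feature correlation matrix (found by power iteration) as a simple stand-in for factor analysis proper, which would typically extract several factors and rotate them. It is not the procedure used in the Czech MDA project.

```python
import math
import statistics

# Hypothetical toy corpus: token counts and raw feature counts per text.
texts = {
    "chat_01":  {"tokens": 500,  "verbs": 110, "pronouns_2p": 30, "nouns": 90,  "adjectives": 20},
    "chat_02":  {"tokens": 400,  "verbs": 84,  "pronouns_2p": 20, "nouns": 80,  "adjectives": 18},
    "news_01":  {"tokens": 800,  "verbs": 96,  "pronouns_2p": 1,  "nouns": 260, "adjectives": 88},
    "paper_01": {"tokens": 1000, "verbs": 100, "pronouns_2p": 0,  "nouns": 350, "adjectives": 120},
    "novel_01": {"tokens": 900,  "verbs": 162, "pronouns_2p": 9,  "nouns": 216, "adjectives": 72},
}
features = ["verbs", "pronouns_2p", "nouns", "adjectives"]


def relative_frequencies(texts, features, per=1000):
    """Feature retrieval: convert raw counts to frequencies per `per` tokens."""
    return [[row[f] / row["tokens"] * per for f in features] for row in texts.values()]


def standardize(matrix):
    """Turn each feature column into z-scores (mean 0, sample sd 1)."""
    columns = list(zip(*matrix))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]


def correlation_matrix(z):
    """Correlations between features, computed from z-scored columns."""
    n, k = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(k)] for i in range(k)]


def first_axis(corr, iterations=200):
    """Power iteration: unit-length dominant eigenvector of `corr`.

    Starting from the first basis vector fixes the sign so that the
    first feature receives a positive weight.
    """
    v = [1.0] + [0.0] * (len(corr) - 1)
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in corr]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


z = standardize(relative_frequencies(texts, features))
weights = first_axis(correlation_matrix(z))

# Position of each text on the extracted dimension (weighted sum of z-scores).
scores = {name: sum(w * x for w, x in zip(weights, row)) for name, row in zip(texts, z)}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {score:+.2f}")
```

In this toy data, verbs and second-person pronouns co-occur and stand in opposition to nouns and adjectives, so the single extracted axis separates the conversational texts from the informational ones. Interpreting such an axis functionally corresponds to the last step of the procedure.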

In addition to describing language variation, MDA results can be used to determine the main registers in a given language (see register classification, which complements text type/genre classification).

Multi-dimensional model of Czech

Based on the analysis of the Koditex corpus, a model with 8 dimensions was created:

dynamic (+) vs. static (-),
spontaneous (+) vs. prepared (-),
higher (+) vs. lower (-) level of cohesion,
polythematic (+) vs. monothematic (-),
higher (+) vs. lower (-) amount of addressee coding,
general/intension (+) vs. particular/extension (-),
prospective (+) vs. retrospective (-),
attitudinal (+) vs. factual (-).

The naming of the dimensions is primarily based on information about which linguistic features contribute most to their establishment (see the inventory of prominent features), and on the position of texts within a given dimension (see the MDAvis tool).
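
As a rough illustration of how prominent features can guide the naming of a dimension, the sketch below splits the features of one dimension into a positive and a negative pole by loading. The feature names, the loading values, and the cutoff of 0.3 are all hypothetical and are not taken from the Czech model.

```python
# Hypothetical loadings of features on a single dimension (illustrative values only).
loadings = {
    "verbs_present": 0.71,
    "pronouns_2p": 0.64,
    "interjections": 0.58,
    "nouns": -0.69,
    "adjectives": -0.55,
    "prepositions": -0.12,
    "numerals": 0.08,
}


def prominent_features(loadings, threshold=0.3):
    """Return features at the positive and negative pole, strongest first.

    `threshold` is an assumed cutoff on the absolute loading; features
    below it are treated as not contributing to the dimension.
    """
    positive = sorted(
        (f for f, w in loadings.items() if w >= threshold),
        key=lambda f: -loadings[f],
    )
    negative = sorted(
        (f for f, w in loadings.items() if w <= -threshold),
        key=lambda f: loadings[f],
    )
    return positive, negative


positive_pole, negative_pole = prominent_features(loadings)
print("(+) pole:", ", ".join(positive_pole))
print("(-) pole:", ", ".join(negative_pole))
```

A researcher would then look for the function shared by the features at each pole and check it against the texts that score highest and lowest on the dimension.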

Team members

Key publications of the project (description of the Czech MDA)

Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2018). Variabilita češtiny: Multidimenzionální analýza [Variability of Czech: a multidimensional analysis]. Slovo a slovesnost, 79(4), 293–321.
Cvrček, V., Laubeová, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2020). Registry v češtině [Registers in Czech]. Nakladatelství Lidové noviny.
Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2021). From extra- to intratextual characteristics: Charting the space of variation in Czech through MDA. Corpus Linguistics and Linguistic Theory, 17(2), 351–382.

Learn more about the data

Tool for viewing MDA results

Lukeš, D., & Cvrček, V. (2021). MDAvis: A Shiny app for visualizing Multi-Dimensional Analysis results. Accessible on-line at https://korpus.cz/mdavis. Source code available at https://github.com/dlukes/shiny-mda.

Koditex Corpus Description

Zasina, A. J., Lukeš, D., Komrsková, Z., Poukarová, P., & Řehořková, A. (2018). Koditex: Korpus diverzifikovaných textů (Verze 1). Ústav Českého národního korpusu FF UK. www.korpus.cz
Zasina, A. J., & Komrsková, Z. (2019). Koditex – korpus diverzifikovaných textů. Studie z aplikované lingvistiky, 10(1), 127–132.

Data

Cvrček, V., et al. (2018). Multi-Dimensional Analysis of Czech (Original data for a general-purpose multi-dimensional analysis model of register variation in Czech). The Tromsø Repository of Language and Linguistics (TROLLing). https://doi.org/10.18710/QAJKZW
Lukeš, D. (2018). Tidiness: A measure based on information theory to help with selecting an appropriate number of dimensions to extract in MDA. Accessible on-line at https://github.com/czcorpus/mda.

Publications based on the project results

Cvrček, V., Komrsková, Z., & Lukeš, D. (2018). Rozsah registrové variability textů. In D. Kučera, J. M. Havigerová, J. Haviger, V. Cvrček, Z. Komrsková, D. Lukeš, T. Jelínek, T. Urbánek, & J. Franková, Výzkum CPACT: Komputační psycholingvistická analýza českého textu (pp. 153–172). Pedagogická fakulta Jihočeské univerzity v Českých Budějovicích.
Henyš, J. (2019). Registrová variabilita českých internetových textů [Master's thesis, FF UK]. https://dspace.cuni.cz/handle/20.500.11956/110335
Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., Zasina, A. J., & Benko, V. (2020). Comparing web-crawled and traditional corpora. Language Resources and Evaluation, 54, 713–745.
Cvrček, V., Laubeová, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2020). Author and register as sources of variation: A corpus-based study using elicited texts. International Journal of Corpus Linguistics, 25(4), 461–488.
Cvrček, V. (2022). Proměny registrů české žurnalistiky 1995–2018. Časopis pro moderní filologii, 104(1), 7–34.
Poukarová, P., & Cvrček, V. (2023). Proměny prózy v letech 1992 až 2018. Česká literatura, 70(6), 678–710.
Cvrček, V., Laubeová, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2024). Register differences and intra-register variation of elicited texts. Register Studies, 5(2), 143–170.

Grant support

The Czech MDA was conducted at Charles University by researchers from the Institute of the Czech National Corpus; it was supported by the ERDF project Language Variation in the CNC, no. CZ.02.1.01/0.0/0.0/16_013/0001758.