News from EuReCo: Annotations, Applications, and LLM Assistance

Keywords: European Languages; Light Verb Constructions; Comparable Corpora; Tools; LLMs

Contrastive corpus linguistics is inherently more resource-intensive than single-language research because it requires at least two corpora that are not only sufficiently representative with respect to the research question and intended language domain but also sufficiently similar to each other. While parallel or translation corpora exist for many languages and domains (cf. Čermák/Rosen 2012) and meet the similarity requirement, their linguistic utility is often limited by translation effects such as shining-through, over-normalization, and simplification (e.g., Teich 2003; Granger et al. 2003). Comparable corpora are a better basis for capturing authentic cross-linguistic patterns; however, locating or creating such corpora for specific language constellations and domains can be costly and labor-intensive.

The European Reference Corpus (EuReCo) open initiative (Kupietz et al. 2020) provides a sustainable solution to this challenge by re-using existing large corpora, virtually integrating them, and enabling users to define domain- and question-specific virtual sub-corpora based on metadata properties. This approach eliminates the need to build new corpora from scratch while ensuring legal and economic feasibility and preserving the autonomy of the corpus providers. EuReCo's strategy involves dynamically joining national and reference corpora, allowing each corpus to remain physically decentralized yet interoperable through infrastructural means. To keep the development and maintenance of the software behind this infrastructure feasible, EuReCo does not specify interfaces and protocols for interoperable searches across the network. Instead, it offers a prototype open-source implementation, KorAP (Bański et al. 2012; Diewald et al. 2016), which can be installed at the locations of the corpora to provide access to the corpus data, in most cases in addition to existing access routes (Kupietz et al. 2024). Sharing one implementation is more efficient than maintaining separate ones and ensures that new features reach all users in a timely manner without multiplying costs across participating sites or software systems.

Figure: KorAP instances for the Romanian (left), Hungarian (middle), and German (right) corpora
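The idea of a virtual sub-corpus defined purely by metadata can be illustrated with a small Python sketch. It is not KorAP's implementation: the record type `TextMeta`, its fields, the function `virtual_subcorpus`, and all sample records are hypothetical and serve only to show how one metadata-based definition, applied to two independently held corpora, could yield comparable selections without copying or rebuilding either corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    """Hypothetical metadata record for one corpus text."""
    text_id: str
    language: str
    domain: str
    year: int


def virtual_subcorpus(texts, *, domain, year_from, year_to):
    """Select texts by metadata only; the corpus itself stays untouched."""
    return [
        t for t in texts
        if t.domain == domain and year_from <= t.year <= year_to
    ]


# Invented sample metadata for two separately hosted corpora.
german_corpus = [
    TextMeta("de-001", "de", "politics", 2015),
    TextMeta("de-002", "de", "sports", 2016),
    TextMeta("de-003", "de", "politics", 1998),
]
polish_corpus = [
    TextMeta("pl-001", "pl", "politics", 2014),
    TextMeta("pl-002", "pl", "culture", 2015),
]

# The same definition is applied to both corpora to obtain comparable subsets.
definition = {"domain": "politics", "year_from": 2010, "year_to": 2020}
german_sub = virtual_subcorpus(german_corpus, **definition)
polish_sub = virtual_subcorpus(polish_corpus, **definition)

print([t.text_id for t in german_sub])
print([t.text_id for t in polish_sub])
```

Because the definition is just a set of metadata constraints, it can be stored and re-applied later, which is what makes such selections reproducible.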

The contribution presents fundamental approaches of EuReCo in addressing these challenges and reports on ongoing research related to methodological feasibility issues, particularly from a user perspective. This includes enhancing metadata on topic domains through multilingual text classifiers in order to construct more precise and comparable virtual sub-corpora. Additionally, it addresses the query-stage mapping of Universal Part-of-Speech (UPOS) annotations as an additional virtual layer, which facilitates cross-linguistic comparisons without requiring re-annotation of entire corpora. Moreover, we show how LLM-based approaches, such as DocPrompting (Zhou et al. 2023), automatic documentation tests, and specialized interfaces like the Model Context Protocol (MCP), can enable even casual users to perform complex and reproducible analyses in a methodologically sound way, using KorAP's client libraries for R and Python (Kupietz/Diewald/Margaretha 2020). We will demonstrate this with examples of comparative frequency and collocation analyses that identify light verb construction (LVC) candidates in different languages (drawing on Bański et al. 2023) and report on ongoing research concerning the identification of possibly emerging light verbs in German and Polish.
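As a rough illustration of the comparative frequency and collocation step mentioned above, the following Python sketch normalizes hit counts to instances per million tokens and ranks noun collocates of a verb by logDice. It is a minimal sketch under stated assumptions: all counts are invented placeholders rather than EuReCo results, the function names are hypothetical, and logDice is only one common association measure, not necessarily the one used in the studies. In practice the counts would be retrieved through KorAP's client libraries rather than hard-coded.

```python
import math


def per_million(hits: int, corpus_tokens: int) -> float:
    """Relative frequency, in instances per million tokens."""
    return hits * 1_000_000 / corpus_tokens


def log_dice(f_verb: int, f_noun: int, f_pair: int) -> float:
    """logDice association score: 14 + log2(2 * f_pair / (f_verb + f_noun))."""
    return 14 + math.log2(2 * f_pair / (f_verb + f_noun))


def rank_collocates(f_verb, noun_freqs, pair_freqs):
    """Rank noun collocates of a verb by descending logDice score."""
    scored = [
        (noun, log_dice(f_verb, noun_freqs[noun], f_pair))
        for noun, f_pair in pair_freqs.items()
        if f_pair > 0  # log2 is undefined for zero co-occurrences
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented placeholder counts for the German verb "treffen".
CORPUS_TOKENS = 1_000_000_000
VERB_FREQ = 40_000
NOUN_FREQS = {"Entscheidung": 30_000, "Freund": 50_000}
PAIR_FREQS = {"Entscheidung": 12_000, "Freund": 1_500}

ranking = rank_collocates(VERB_FREQ, NOUN_FREQS, PAIR_FREQS)
for noun, score in ranking:
    ipm = per_million(PAIR_FREQS[noun], CORPUS_TOKENS)
    print(f"treffen + {noun}: logDice={score:.2f}, {ipm:.2f} per million")
```

Normalizing to a per-million rate keeps counts comparable across corpora of different sizes, and logDice does not depend on corpus size, which makes scores from different languages easier to put side by side. A high score marks a verb-noun pair only as an LVC candidate that still needs linguistic inspection.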

References

Bański, P., Diewald, N., Kupietz, M., & Trawiński, B. (2023). Applying the newly extended European reference corpus EuReCo. Pilot studies of light-verb constructions in German, Romanian, Hungarian and Polish. In B. Trawiński, M. Kupietz, K. Proost, & J. Zinken (Eds.), Book of Abstracts of the 10th International Contrastive Linguistics Conference (ICLC-10), 18-21 July, 2023, Mannheim, Germany (pp. 274–276). IDS-Verlag. https://doi.org/10.14618/f8rt-m155

Bański, P., Fischer, P. M., Frick, E., Ketzan, E., Kupietz, M., Schnober, C., Schonefeld, O., & Witt, A. (2012). The New IDS Corpus Analysis Platform: Challenges and Prospects. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), 2905–2911. http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf

Čermák, F., & Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427. https://doi.org/10.1075/ijcl.17.3.05cer

Diewald, N., Hanl, M., Margaretha, E., Bingel, J., Kupietz, M., Bański, P., & Witt, A. (2016). KorAP architecture—Diving in the Deep Sea of Corpus Data. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 3586–3591.

Granger, S., Lerot, J., & Petch-Tyson, S. (2003). Corpus-based approaches to contrastive linguistics and translation studies (Vol. 20). Rodopi.

Kupietz, M., Bański, P., Diewald, N., Trawiński, B., & Witt, A. (2024). EuReCo: Not building and yet using federated comparable corpora for cross-linguistic research. In P. Zweigenbaum, R. Rapp, & S. Sharoff (Eds.), Proceedings of the BUCC 2024: The 17th workshop on building and using comparable corpora (pp. 94–103). ELRA Language Resource Association. https://aclanthology.org/2024.bucc-1.10.pdf

Kupietz, M., Diewald, N., & Margaretha, E. (2020). RKorAPClient: An R Package for Accessing the German Reference Corpus DeReKo via KorAP. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC’20) (pp. 7015–7021). European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.867

Kupietz, M., Diewald, N., Trawiński, B., Cosma, R., Cristea, D., Tufiş, D., Váradi, T., & Wöllstein, A. (2020). Recent developments in the European Reference Corpus EuReCo. Translating and Comparing Languages: Corpus-Based Insights. Selected Proceedings of the Fifth Using Corpora in Contrastive and Translation Studies Conference. Louvain-La-Neuve: Presses Universitaires de Louvain, 257–273.

Teich, E. (2003). Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter.

Zhou, S., Alon, U., Xu, F. F., Wang, Z., Jiang, Z., & Neubig, G. (2023). DocPrompting: Generating Code by Retrieving the Docs. https://arxiv.org/abs/2207.05987