One of the major resources in corpus-driven linguistic typology is comparable corpus material for a wide range of languages. Yet the number of resources that are significantly multilingual is very limited. Several sources not aimed at linguistic research can be used, such as Wikipedia, StoryWeaver, or Tatoeba, but without annotations these are hard to exploit without knowledge of all the languages involved. Potentially the largest source of annotated material comes from interlinear glossed text (IGT) examples, scattered across many linguistic descriptions but recently brought together in initiatives such as GlossLM (Ginn et al., 2024), IMTVault (Nordhoff and Krämer, 2022), and ODIN (Lewis and Xia, 2010). IGT vaults are great resources, but they are not natural language corpus data: they are more often than not isolated examples, often artificial, and often including marked or ungrammatical examples. Finally, there are lexicon-driven resources such as UniMorph (Batsuren et al., 2022) and the Swadesh lists, but those do not provide full sentences.
For comparing annotated natural language corpus data across many languages, that leaves largely only language documentation data on the one hand and POS-tagged corpora or treebanks on the other. On the language documentation side, the largest project is probably DoReCo (Seifart et al., 2018), a homogeneous collection of 53 spoken language documentation corpora, mostly of less-resourced languages. On the treebank side, it is Universal Dependencies (UD; de Marneffe et al., 2021), which currently provides dependency trees for over 100 languages, extended with UDMorph (Janssen, 2024), which provides corpora annotated with POS and lemma following the UD guidelines for data that do not (necessarily) have dependency relations.
There is little overlap between the languages included in UD and DoReCo: the only two overlapping languages are English and Beja, the latter of which has been converted to a UD dataset (Kahane et al., 2021). Bringing the two together would therefore, from the perspective of UD, significantly increase the language coverage by adding the languages of DoReCo. In this paper, we describe an initiative to do just that: to create a workflow that makes it as easy as possible to enrich DoReCo datasets with at least the POS and lemma layers of UD, and potentially also the dependency relations. The two types of resources differ in many respects, not only in format but also in their conception, nature, and transcription choices. UD consists mostly of written material, while DoReCo contains spoken data, sometimes from languages for which no orthography is available. UD corpora tend to be normalized to achieve greater computational accuracy, while DoReCo data retain disfluencies to represent the source material as faithfully as possible. UD transcribes morphosyntactic features from a grammatical perspective, while DoReCo tends to annotate morphosyntax only for explicit morphological markers.
In this paper, we describe the obstacles, challenges, and potential solutions in bringing together DoReCo and UD(Morph), as well as the design of the combined resource. We describe what can be done using only automatic methods, and how the creation of a fully merged corpus by manually enriching the data can be streamlined by maximizing the use of the existing data and minimizing the required manual intervention. The streamlined process is illustrated using an ongoing attempt to create a merged resource starting from the DoReCo data for Evenki (Kazakevich and Klyachko, 2024), a Northern Tungusic language spoken mainly in northern China, but also in Russia and Mongolia.
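To make the target of the conversion concrete, the sketch below renders token records in the CoNLL-U format used by UD, writing an underscore in every column that is not yet annotated so that the output remains valid CoNLL-U before manual enrichment. The record structure and field names ("form", "lemma", "upos"), as well as the example tokens, are illustrative assumptions, not the actual DoReCo export schema.

```python
def tokens_to_conllu(sent_id, tokens):
    """Render one sentence as a CoNLL-U block.

    `tokens` is a list of dicts with a "form" key and optional "lemma"
    and "upos" keys; missing annotation layers are written as "_" so the
    file stays valid CoNLL-U even before manual enrichment.
    """
    lines = [f"# sent_id = {sent_id}",
             "# text = " + " ".join(t["form"] for t in tokens)]
    for i, t in enumerate(tokens, start=1):
        lines.append("\t".join([
            str(i),               # ID
            t["form"],            # FORM (transcribed word form)
            t.get("lemma", "_"),  # LEMMA
            t.get("upos", "_"),   # UPOS (UD part-of-speech tag)
            "_", "_",             # XPOS, FEATS (not yet annotated)
            "_", "_",             # HEAD, DEPREL (no dependency layer)
            "_", "_",             # DEPS, MISC
        ]))
    return "\n".join(lines) + "\n"

# Hypothetical two-token sentence, for illustration only.
example = [
    {"form": "oron", "lemma": "oron", "upos": "NOUN"},
    {"form": "emeren", "lemma": "eme-", "upos": "VERB"},
]
print(tokens_to_conllu("evenki-0001", example))
```

Keeping the unfilled columns as underscores rather than omitting them means the same file can be progressively enriched, first with POS and lemma, later with dependency relations, without changing its structure.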
Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Abbott Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóga, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Woliński, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, and Ekaterina Vylomova. 2022. UniMorph 4.0: Universal Morphology.
In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 840–855, Marseille, France, June. European Language Resources Association.
Marie-Catherine de Marneffe, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255– 308, June.
Michael Ginn, Lindia Tjuatja, Taiqi He, Enora Rice, Graham Neubig, Alexis Palmer, and Lori Levin. 2024. GlossLM: A massively multilingual corpus and pretrained model for interlinear glossed text. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12267–12286, Miami, Florida, USA, November. Association for Computational Linguistics.
Maarten Janssen. 2024. UDMorph: Morphosyntactically tagged UD corpora. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16933–16940, Torino, Italy, May. ELRA and ICCL.
Sylvain Kahane, Martine Vanhove, Rayan Ziane, and Bruno Guillaume. 2021. A morph-based and a word-based treebank for Beja. In Daniel Dakota, Kilian Evang, and Sandra Kübler, editors, Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021), pages 48–60, Sofia, Bulgaria, December. Association for Computational Linguistics.
Olga Kazakevich and Elena Klyachko. 2024. Evenki DoReCo dataset. In Frank Seifart, Ludger Paschen, and Matthew Stave, editors, Language Documentation Reference Corpus (DoReCo) 2.0. Laboratoire Dynamique du Langage (UMR5596, CNRS & Université Lyon 2), Lyon.
William Lewis and Fei Xia. 2010. Developing ODIN: A multilingual repository of annotated language data for hundreds of the world's languages. Literary and Linguistic Computing, 25(3):303–319, August.
Sebastian Nordhoff and Thomas Krämer. 2022. IMTVault: Extracting and enriching low-resource language interlinear glossed text from grammatical descriptions and typological survey articles. In Thierry Declerck, John P. McCrae, Elena Montiel, Christian Chiarcos, and Maxim Ionov, editors, Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference, pages 17–25, Marseille, France, June. European Language Resources Association.
Frank Seifart, Nicholas Evans, Harald Hammarström, and Stephen C. Levinson. 2018. Language documentation 25 years on. Language, 94(4):e324–e345.