Building the new general purpose reference corpus of contemporary Polish (and some other related resources)

Date

Tuesday, 2 May 2023, 13:00

Speaker

Witold Kieraś

Abstract

The main topic of the presentation is the new corpus of contemporary Polish, which aims to continue and supplement the National Corpus of Polish (NCP) in the near future. More than a decade has passed since the NCP project was concluded, and despite its success in the local linguistic community, the corpus needs an update. The new corpus project seeks a balance between continuity with the NCP and the need to address new linguistic and technical realities. The presentation will cover the basic theoretical and technical concepts behind the new corpus, with particular attention to the grammatical annotation layers: morphosyntactic tagging, dependency parsing, and constituency parsing, all consistent with each other. The hybrid syntactic representation lets users focus on their research task rather than commit to a specific syntactic theory, and it makes corpus queries more expressive by allowing them to refer to immediate dependency relations and phrase structure simultaneously.

The presentation will also cover other corpus resources currently being developed at the Institute of Computer Science, Polish Academy of Sciences, which complement the environment for corpus linguistic research. These include a web corpus updated daily and a multilingual version of Korpusomat, a simple web application for building one's own corpora.