PolDSC: Polish dynamic corpus of normal and disordered speech across the lifespan - building a speech corpus (not necessarily) from scratch

  1. Marzena Stępień

Linguistic fieldwork, like laboratory work, is the daily bread of everyone who studies language acquisition, age-related changes in the multidimensional characteristics of speech (including semantic fluency, phonation, articulation, and prosody), or speech and language disorders, in both monolingual and multilingual individuals. Gathering spoken data together with metadata and orthographic and phonetic transcripts, and then annotating and segmenting it, requires hours of hard, indeed painstaking, work. Unfortunately, once a project is finished (including bachelor’s and master’s degree projects in speech and language therapy classes), much of the data ends up in external storage and is never used by anyone again. Given the effort needed to collect and analyze it, this is a huge waste.
Although initiatives such as DELAD and TalkBank/CHILDES were established to provide platforms for sharing speech data, most of them demand a lot of additional work without offering much in return, especially for Slavic languages. A few examples: TalkBank/CHILDES is a great and well-known initiative, but it does not include tokenization and lemmatization for Slavic languages, let alone a part-of-speech and morphological tag set or a parser for syntactic dependency trees. Yet these features are necessary for some of the most widely used measures in assessing speech development in children and normal and disordered speech across the lifespan: type/token ratio, part-of-speech percentages, and syntactic complexity. The Polish tools Inforex and Korpusomat, both wonderful for building a customized corpus from text samples collected independently by a researcher, do not include sound-to-transcript alignment. Spokes and Spokes.mix are not exactly dynamic corpora that would allow data sets to be compared (which is necessary when comparing speech samples from control and research groups). It seems that we have no choice but to build a speech corpus from its rudiments but, as will be shown in the presentation, not necessarily from scratch.
In the presentation, I will show:

(1)   different kinds of data that our team already has;
(2)   our plans for expanding the database, including existing protocols for sample collection, storage, and analysis;
(3)   which features are necessary and to what extent they can be covered by existing computational and corpus-linguistic tools (including the greatest possible degree of automation);
(4)   why Teitok is the platform most suitable for our goals and where we have to compromise between what we need and want to achieve and what is currently possible;
(5)   and, if the allotted time permits, how a basic, preliminary topic-focus articulation annotation for Polish can be combined with prosodic features on the one hand and pragmatic markers on the other.