Methodological challenges in the creation of a corpus of German and Italian non-parliamentary political spoken communication: Systems of Automatic Speech Recognition (ASR) and linguistic research questions

Keywords: Political language; non-parliamentary spoken communication; Automatic Speech Recognition; politolinguistics; German-Italian corpus

This contribution illustrates the challenges involved in creating a corpus of political spoken communication that takes place outside parliament. In disciplines such as politolinguistics (Burkhardt, 1996; Cedroni, 2014; Niehr, 2014) or political discourse analysis (Van Dijk, 1998; Spieß, 2020), few studies have dealt with this type of communication. Furthermore, studies on the application of ASR systems to this type of communication, and on their limitations, seem to be only preliminary and restricted to speeches delivered by one main speaker (Draxler, 2023; Palladino, 2024).

Several factors create difficulties for researchers in the field of spoken political language who wish to investigate less institutionalized spoken communication and need transcripts to conduct their analyses. First, while parliamentary communication is generally well-structured and speeches are often transcribed in advance or corrected afterwards by stenographers (see, for instance, Brambilla, 2007), less institutionalized forms of communication can preserve a more spontaneous style and aim to persuade an audience of ordinary citizens. Second, recordings of non-parliamentary communication are generally publicly available on platforms such as YouTube, but the audio quality and possible interactions make this type of communication difficult to transcribe orthographically. Finally, non-parliamentary speeches delivered at election campaigns or party rallies can last for hours, which poses a further challenge in the selection of the most suitable ASR system.

This study focuses on the different systems that can be used to create a multilingual corpus of non-parliamentary political spoken communication, concentrating on Italian and German from a contrastive linguistic perspective. Aspects such as hesitations, speakers’ mistakes and false starts are considered in order to determine which challenges ASR may solve and which further hurdles it may create. Studies on the performance of ASR in corpus creation have already been carried out, for instance, by Gorisch et al. (2020) and Gorisch and Schmidt (2024), who show possible limitations of ASR. However, they dealt with corpora of different speech genres, especially conversations.

The main research questions of the present study are: How well does ASR serve the linguistic research purposes of a multilingual corpus of non-parliamentary spoken communication? Which criteria should be considered in the selection of ASR systems? Examples of different ASR methods and their outputs will be presented, and Italian and German will be compared. The speeches selected for this contribution come from the Po.La.R. corpus (Political Language Repository)[1]. The discussion of the samples from the corpus will revolve around the research needs of linguistic analyses, which means that more technical parameters such as word error rate (WER) will not be discussed in depth. Instead, the discussion focuses on recurrent units of analysis from politolinguistics (see, among others, Dieckmann, 2005; Girnth, 2015) and discourse analysis (see, among others, Fairclough & Fairclough, 2012).

Preliminary results from the analysis and from previous contributions (see, for instance, Palladino, 2025) show that ASR is a useful resource for politolinguistic studies on spoken communication, since it can facilitate and speed up the process of data collection. However, the transcripts need to be adapted to research purposes, and all the examined ASR systems present challenges that range from the default insertion of unwanted punctuation to the automatic correction of mistakes made by the speakers.
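The two challenges just named can be made concrete with a minimal Python sketch. It is only an illustration under stated assumptions, not the workflow used for the corpus: the function names `normalize` and `dropped_tokens` are hypothetical, the example sentence is invented rather than taken from the corpus, and the alignment relies on the standard-library `difflib` module instead of a dedicated ASR evaluation tool. The sketch removes punctuation and capitalization from both transcripts and then lists the tokens of a manual transcript that are missing from the ASR output, such as repetitions or hesitation markers.

```python
import difflib
import string

# Assumption: besides ASCII punctuation, typographic quotes and the ellipsis
# character are treated as punctuation that the ASR system may have inserted.
PUNCTUATION = string.punctuation + "„“”‚‘’«»…"


def normalize(text: str) -> list[str]:
    """Remove punctuation, lowercase the text and split it into tokens."""
    table = str.maketrans("", "", PUNCTUATION)
    return text.translate(table).lower().split()


def dropped_tokens(manual: str, asr: str) -> list[str]:
    """Return tokens of the manual transcript that are absent from the ASR output.

    Only pure deletions are reported; substitutions are ignored in this sketch.
    """
    ref, hyp = normalize(manual), normalize(asr)
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            dropped.extend(ref[i1:i2])
    return dropped


# Invented example: a repetition ("wir wir") and a hesitation marker ("äh")
# that a hypothetical ASR system has silently removed.
manual = "wir wir müssen äh das Land verändern"
asr = "Wir müssen das Land verändern."
print(dropped_tokens(manual, asr))  # ['wir', 'äh']
```

Because apostrophes are removed as well, elided Italian forms are fused into one token; this does not affect the comparison as long as both transcripts are normalized in the same way.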

[1] This work was supported by the Università di Modena e Reggio Emilia – Fondazione di Modena Project “CUP E93C24001970005 Beyond Parliament: AI-Enhanced Multilingual Corpus Using Innovative Methodology for Non-Institutional Political Speeches in German, French, Spanish and Italian” funded by “Fondo di Ateneo per la ricerca Anno 2024 - Bando per il finanziamento di progetti di ricerca interdisciplinari”.

References

Brambilla, M. M. (2007). Il discorso politico nei paesi di lingua tedesca: metodi e modelli di analisi linguistica. Roma: Aracne.

Burkhardt, A. (1996). Politolinguistik: Versuch einer Ortsbestimmung. In J. Klein & H. Diekmannshenke (Eds.), Sprachstrategien und Dialogblockaden: Linguistische und politikwissenschaftliche Studien zur politischen Kommunikation (pp. 75–100). Berlin/Boston: De Gruyter.

Cedroni, L. (2014). Politolinguistica. L’analisi del discorso politico. Roma: Carocci Editore.

Dieckmann, W. (2005). Demokratische Sprache im Spiegel ideologischer Sprach(gebrauchs)konzepte. In J. Kilian (Ed.), Sprache und Politik. Deutsch im demokratischen Staat (pp. 11–30). Mannheim: Dudenverlag.

Draxler, C. (2023). Analysis of transcriptions using Octra – a pilot study. In C. Draxler (Ed.), Elektronische Sprachsignalverarbeitung 2023, 105. Dresden: TUDpress, 17–23. https://www.essv.de/pdf/2023_17_23.pdf

Fairclough, I. & Fairclough, N. (2012). Political Discourse Analysis: A Method for Advanced Students. London/New York: Routledge.

Girnth, H. (2015). Sprache und Sprachverwendung in der Politik. Eine Einführung in die linguistische Analyse öffentlich-politischer Kommunikation. Berlin/Boston: De Gruyter.

Gorisch, J., Gref, M. & Schmidt, T. (2020). Using Automatic Speech Recognition in Spoken Corpus Curation. In Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France. European Language Resources Association, 6423–6428. https://aclanthology.org/2020.lrec-1.790.pdf

Gorisch, J. & Schmidt, T. (2024). Evaluating Workflows for Creating Orthographic Transcripts for Oral Corpora by Transcribing from Scratch or Correcting ASR-Output. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italy. ELRA and ICCL, 6564–6574. https://aclanthology.org/2024.lrec-main.582.pdf

Haberl, A., Fleiß, J., Kowald, D. & Thalmann, S. (2024). Take the aTrain. Introducing an interface for the Accessible Transcription of Interviews. Journal of Behavioral and Experimental Finance, 41. https://doi.org/10.1016/j.jbef.2024.100891

Kisler, T., Reichel, U. & Schiel, F. (2017). Multilingual processing of speech via web services. Computer Speech & Language, 45, 326–347. https://doi.org/10.1016/j.csl.2017.01.005

Niehr, T. (2014). Einführung in die Politolinguistik. Gegenstände und Methoden. Göttingen: Vandenhoeck & Ruprecht.

Palladino, M. (2024). Webbasierte Tools für die Transkription und Analyse von Reden. Hilfreiche Instrumentarien für die (Polito)Linguistik. Lingue e Linguaggi, 65 (2024). Università del Salento, 413–437. https://doi.org/10.1285/i22390359v65p413

Palladino, M. (2025). Politolinguistics through Spoken Language Processing: A Methodological Framework for German and Italian Political Speeches. In S. Grawunder (Ed.), Elektronische Sprachsignalverarbeitung 2025. Tagungsband der 36. Konferenz. Dresden: TUDpress, 204–211. https://www.essv.de/pdf/2025_204_211.pdf?id=1254

Spieß, C. (2020). Politiksprache und politische Kommunikation. In T. Niehr, J. Kilian & J. Schiewe (Eds.), Handbuch Sprachkritik (pp. 302–309). Stuttgart: Metzler.

Van Dijk, T. A. (1998). What is political discourse analysis? In J. Blommaert & C. Bulcaen (Eds.), Political linguistics (pp. 11–52). Amsterdam: Benjamins.