ÚL Seminar | Czech National Corpus

Venue: P104, main building, 1st floor
Online: the seminar is also streamed online; if you would like the link, please write to Honza or Magda.
Time: Wednesdays, 14:10–15:40, unless stated otherwise

		Date	Topic · Speaker · Abstract
		Wednesday, 15 October 2025, 14:10	LLMs in Social Sciences and Humanities · Ondřej Tichý · One thing is clear about the role of AI in higher education: it is unavoidable. However, many other aspects remain uncertain. This paper aims to provide illustrative examples, offer several suggestions, and, most importantly, foster a discussion about how and in what contexts AI should be both taught and used in the humanities and social sciences. The official Recommendations regarding the use of generative artificial intelligence for university educators at Charles University advise educators to “Monitor developments in AI tools and spend some of your time exploring their capabilities. Check out what they can do, how they can benefit your work, and how reliable they are or aren’t... Actively use these tools where appropriate. Encourage students to use AI tools while respecting their varying levels of knowledge and skills.” These recommendations are, perhaps necessarily, somewhat vague, particularly regarding questions such as: To what extent should teachers and students study the theory behind large language models to truly understand their capabilities and limitations? How should AI be studied and taught? In which areas is the use of AI most beneficial, and where might it pose the greatest challenges?
		Wednesday, 29 October 2025, 14:10	Corpus and Psycholinguistic Perspectives on LLMs · Anna Marklová · If you’re already tired of hearing about AI, brace yourself: it is not going away. That is why it is crucial to study AI language, from the early days of AI Dungeon (circa 2019), which gave a broad public its first taste of large language models, to the present (and beyond). In this talk, I present research on AI language using corpus- and psycholinguistic methods. First, there will be a live demonstration (or, failing that, screenshots) of the new publicly accessible AI corpora, AI-Brown and AI-Koditex. Then, experiments on AI-generated texts (including poetry), analyses of stylistic variability, and a study of AI-generated images will be presented. The goal of this talk is to offer a concise overview of recent work on large language models conducted at the Czech National Corpus. Let’s study AI before it studies us.
		Wednesday, 12 November 2025, 14:10	Comparing quantitative morphological features of languages: a study on annotated multi-parallel texts · Vojtěch John · Research on morphological diversity in typology and contrastive linguistics has traditionally focused on discrete, predominantly inflectional features. However, corpus-based approaches can provide complementary insights into the quantitative and dynamic aspects of morphological systems. While multiple languages have both morphological resources and large parallel corpora, sizeable corpora with detailed morphological annotation, including morphological segmentation and morpheme classification, remain very scarce. As part of a broader effort to address this gap, we present our current work on the detailed automatic annotation of part of the multiparallel corpus Europarl, comprising over 10 million tokens in each of six languages: Czech, English, French, German, Hungarian, and Slovak. The presentation reports preliminary results on quantitative morphological features extracted from these data and their potential to inform further cross-linguistic research. In particular, we discuss observed cross-linguistic regularities in morpheme frequency distributions, relationships among morpheme classes, and their possible connection to word formation strategies.
		Wednesday, 26 November 2025, 14:10	Language in Aphasia with Naive Discriminative Learning · Michal Láznička · In this talk, I will give a brief overview of my research on language in aphasia. I will start with the relationship of aphasiology and aphasic data to linguistics. This will be followed by three case studies. In the first case study, I will show how entrenchment and chunking modulate fluency, using prepositional phrases as an example. This study shows how a usage-based approach to language can complement approaches that focus more on the role of structural complexity in explaining linguistic behaviour in aphasia. The second study shows how a linguistically informed analysis can provide a more systematic and principled description of aphasic data. Specifically, I will present a description of verb and argument structure production in aphasia, using the perspective of Construction Grammar and Frame Semantics. The last part of the talk will be dedicated to a new project in which I will focus on Czech inflectional morphology in speakers with aphasia and the possibilities of applying computational models of learning to these data.
		Wednesday, 10 December 2025, 14:10	Semantic networks for children with typical acquisition and specific language impairment · Tomáš Savčenko (OAJD) · I am preparing a study on semantic networks based on word vectors trained on the Clinical English Gillam corpus (Gillam & Pearson 2004), which contains narratives of children with typical language development and specific language impairment (SLI). The aim is to analyse the structure of those semantic networks at different stages of acquisition, with the hypothesis that a ‘small-world structure’, characterized by prominent hub words with many connections and local clusters of closely related words, will be found in typically developing children, while a network with less dominant hubs and more evenly linked nodes will be found for children with SLI. A small-world network allows, in theory, effective search strategies in local clusters as well as across distant domains via the hub nodes (Watts & Strogatz 1998; Steyvers & Tenenbaum 2005), which is why I assume that its disruption should occur in SLI. Special focus will lie on whether this network measure can distinguish typical and SLI children with similar mean length of utterance, in which case it would outperform a traditional psycholinguistic measure used to diagnose SLI (Rice et al. 2010).
		Wednesday, 18 February 2026, 14:10	Problems and Prospects in Genealogical Linguistics · Viktor Elšík · Genealogical linguistics is a branch of historical linguistics that classifies languages into genealogical groups and subgroups and formulates hypotheses about the (degree of) relatedness among linguistic varieties. After introducing the established methodology of genealogical linguistics and discussing its limitations, I will present selected empirical, theoretical, and methodological developments in genealogical classification, including some controversial lines of inquiry. I will also briefly assess the potential of computational phylogenetic models and interdisciplinary approaches for genealogical linguistics. Among other things, you will learn why relatedness and similarity are entirely different concepts; whether languages can shift their genealogical affiliation (at least in the minds of linguists); what distinguishes identical from shared linguistic innovations and why this distinction matters; why the tree (Stammbaum) model of linguistic divergence is too restrictive to capture linguistic history adequately; why we can hardly expect to reconstruct a Proto–Homo sapiens language; why the practice of genealogical linguistics has significant social dimensions; and even about a spurious Amazonian language once “discovered” by a Czech cactologist.
		Wednesday, 4 March 2026, 14:10	Atlas české světové literatury (1817–2019) · Ondřej Vimr · The lecture introduces the Atlas české světové literatury (Atlas of Czech World Literature) and the follow-up research, which uses bibliographic data science to trace the spread of Czech literature internationally over the past two hundred years. It will address not only how to approach research on the circulation of a national literature and what the term “Czech literature” actually means, but also the collection and processing of bibliographic data, the questions that can (and cannot) be asked in this way, and specific examples of the international reception of Czech authors. Both the book and the research draw on digital humanities approaches (quantitative analysis, visualisations, maps, and charts) and open up the topic of the international canon of Czech literature and of how the patterns of its dissemination have changed over time. The lecture closes by outlining new directions for research: the analysis of periodicals, work with more recent data, the use of machine learning methods, and the connection to cultural policy, in particular the question of how effective state subsidies for literary translation are.
		Wednesday, 25 March 2026, 14:10	Prophetic women and a learned lady: English religious vocabulary in a world turned upside down · Jeremy Smith · In the 1640s and 1650s English society was turned upside down by civil war and the beheading of King Charles I in 1649: ‘It was a hinge in the world’s history. God was about to do something new’ (Ryrie 2017: 118). In 1647, the victorious Parliamentary army had debated, at Putney near London, radical views on government, expressed by ‘Levellers’ such as Thomas Rainsborough (d.1648): ‘For really I think that the poorest he that is in England hath a life to live, as the greatest he ...’ The war also ‘released women into the public world of contention, and into speech and writing’ (Hobby 2001: 174), including figures such as the ‘Fifth Monarchist’ millenarian Anna Trapnel (fl. 1642-1660), or the Quaker Mary Howgill (c.1620-?1666), who denounced the Lord Protector Oliver Cromwell – to his face – as ‘a stinking dunghill in the sight of God’. All these groups developed distinctive linguistic codes to express their various ideologies. In this paper, curated electronic corpora of prophetic women’s and Leveller writings are examined; specialised lexicons thus identified are then contextualised, contributing to the developing field of theolinguistics (see e.g. Crystal 2018). The paper argues – in line with another linguistic paradigm, viz. historical pragmatics – that to understand the delicate shifts of meaning that individual lexemes undergo, when deployed by differing communities of religious practice, demands considerable interdisciplinary sensitivity to the complex cultural contexts of those communities. The paper is part of a larger project on the English religious lexicon’s historical evolution, funded by the Leverhulme Trust (see Smith, forthcoming). Crystal, David 2018. ‘Whatever happened to theolinguistics?’, in Paul Chilton and Monika Kopytowska (eds), Religion, Language, and the Human Mind (Oxford: University Press), 3-18. Hobby, Elaine 2001. ‘Prophecy, enthusiasm and female pamphleteers’, in Neil Keeble (ed), The Cambridge Companion to Writing of the English Revolution (Cambridge: University Press), 162-178. Ryrie, Alec 2017. Protestants (London: HarperCollins). Smith, Jeremy J. forthcoming. Lexicons of English Religion 1380-1850 (Cambridge: University Press).
		Wednesday, 1 April 2026, 14:10	Operationalising discourse-pragmatic omissibility · Veronika Raušová · This talk presents an ongoing corpus-based study that approaches omissibility not simply as a diagnostic criterion but as an empirical phenomenon in its own right. Omissibility is a context-dependent property of a linguistic unit, established when its deletion does not affect grammatical well-formedness or the propositional content of the sentence in which it occurs. It is operationalised through a controlled deletion procedure implemented in an automated analysis pipeline applied to datasets drawn from a 160M-token Reddit corpus annotated with UDPipe 2, in which candidate units are evaluated for omissibility by GPT-OSS-120B under explicitly defined criteria. At the current stage of the project, the study examines which linguistic units are classified as omissible by the pipeline and analyses properties such as their syntactic behaviour, immediate co-text, and degree of structural embedding, thereby generating distributional profiles that support systematic comparison across linguistic units, including multifunctional items.
		Wednesday, 15 April 2026, 14:10	Quechua: Sociolinguistic situation, and some aspects of verbal morphology and evidentiality · Vlastimil Rataj
		Wednesday, 29 April 2026, 14:10	“And that’s what makes us human.” Phraseology in AI compared to human language. Does English shape the phraseology of AI-produced Czech? · Denisa Šebestová · I am looking into phraseological sequences in AI-produced language from a cross-linguistic perspective, while also comparing them to human-produced texts. Differences between AI- and human-produced language on the phraseological level are subtle, yet they may contribute substantially to the perceived "otherness" of AI language. Mastering phraseological sequences is known to pose a challenge to foreign language learners; LLMs may thus face similar difficulties, particularly in Czech, an inflectional language with little representation in LLM training data. The basic premise is that LLMs process prompts in English internally before generating output in the target language (Zhao et al. 2024; Zhong et al. 2024; Schut, Gal, and Farquhar 2025). The study seeks to clarify whether and how this affects Czech output: Do English lexical bundles find their way into Czech AI texts? If so, what discourse functions do they fulfil, and how are they distributed across registers? To answer these questions, I compare frequent n-grams across two human-language corpora, Koditex (Czech) and BE21 (English), and two AI corpora, AI Koditex and AI Brown.
		Wednesday, 13 May 2026, 14:10	Ši-ko-ku-ko-ko-te-ko-ku-ru-ka-ka. The Verbal Extremism of Xavier Baumaxa · Pavel Machač, Michal Škrabal · When a linguistic phonetician joins forces with a corpus famulus, one can expect a seminar somewhat out of the ordinary. We will try to freshen up the end of the semester a little with a dive into pop-culture waters, though a serious dive nonetheless. We probe the speech acrobatics of one distinctive singer-songwriter whose extraordinary verbal performance inspired us to formulate (and perhaps even answer…?) several general linguistic questions. Using phonetic analysis of his speech performances, we confront the limits of ordinary spoken language, its production and perception.