ÚL Seminar | Czech National Corpus

Venue: P104, main building, 1st floor
Online: the seminar is also streamed online; if you would like the link, please write to Honza or Magda.
Time: Wednesdays, 14:10–15:40, unless stated otherwise

		Date	Topic · Speaker · Abstract
		Tuesday 8. 9. 2015 13:00	Big Tuesday · ÚČNK
		Monday 5. 10. 2015 13:00	William Gladstone and the Vocabulary of Homer · Geoffrey Sampson · A very successful recent popularization of linguistics, Guy Deutscher’s Through the Language Glass, is based on the claim that the British statesman William Gladstone thought the ancient Greeks were colour-blind. And indeed, ever since Gladstone published on the language of the Iliad and Odyssey, he has been misunderstood in this way. In reality, Gladstone made it explicit that the difference he postulated between the ancient Greeks and ourselves was a linguistic and intellectual difference, not a difference in physical perception. But, more than this, his analysis of Homer’s language was a remarkable 19th-century anticipation of a number of separate ideas all of which are commonly taken to be original with the linguistics of the 20th and 21st centuries. This issue is of more than merely historical interest. It has an important lesson for us today about the fact that novel ideas require not only an original mind to produce them, but an audience intellectually ready to receive them.
		Tuesday 6. 10. 2015 13:00	Who Uses Complex Language? · Geoffrey Sampson · Corpus linguists are good at thinking about how best to gather, record, and annotate language samples. They have sometimes seemed less successful at exploiting corpora to generate findings of serious human interest, beyond the narrow fields of dictionary-making and language-teaching. However, results that have begun to emerge from the British National Corpus, if borne out by larger-scale research, are potentially significant for social policy and for our understanding of human nature. One section of the BNC samples the conversational speech of a cross-section of the British population. Having equipped a small subset of this material with grammatical annotations, I looked for correlations between speakers’ demographic properties (sex, age, region, social class) and the grammatical complexity of their utterances. Famously, in the 1960s Basil Bernstein claimed that speech complexity correlated with social class, but the BNC data do not bear this out. On the other hand there is a clear, statistically significant correlation with age. Speech appears to grow more complex not just between childhood and maturity (which is not surprising) but throughout adult life (which is). The social implications of this finding are sufficiently weighty that it merits testing against larger data-sets and more languages.
		Tuesday 13. 10. 2015 13:00	Authorship Attribution of Documents Using Stylometry and Machine Learning · Jan Rygl · Authorship attribution is nowadays applied mainly in the judiciary (expert opinions: verifying the authorship of documents, detecting plagiarism) and in the fight against extremism (tracing the authorship of illegal documents on the Internet). Machine learning and stylometry are used to uncover an author's identity. Stylometric techniques extract from a text a set of the author's features (the author's fingerprint), which are then processed by machine learning. This makes it possible to address the problems of authorship verification and authorship attribution. The presentation will include a summary of four years of development of the ART (Authorship Recognition Tool) system for the Ministry of the Interior and the NLP Centre's plans for further stylometric applications.
		Tuesday 20. 10. 2015 13:00	no seminar
		Tuesday 27. 10. 2015 13:00	Corpus Linguistic Approaches to Learner Finnish Study · Jarmo Jantunen · Until recently most corpus studies that have focused on learner language have been biased towards Indo-European languages, especially English. This situation is, however, changing gradually. This presentation focuses on learner corpora of Finnish and provides examples of studies that have been conducted using those learner data. The current learner data comprise five second-language datasets and one foreign-language dataset. The examples show how these data provide information on phraseology in learner writing, how learners cope with certain grammatical forms and develop during learning, and how the learning context (FFL or FSL) affects learner production.
		Tuesday 3. 11. 2015 13:00	Automatic Syntactic Analysis and the SET Parser · Vojtěch Kovář · The talk will first deal in general terms with automatic syntactic analysis and the problem of evaluating it. It will then present the SET parser and its applications, including points of overlap with corpus linguistics.
		Tuesday 10. 11. 2015 13:00	New Automated Methods for Extracting Data from a Corpus: Monocollocable Words and Quotations · Jiří Milička · The talk will have two parts. The first will present a metric for monocollocability that is comprehensible and easy to interpret, so that it could also be used outside corpus linguistics, for instance in digital humanities (that is, how far MI-score can be pared down while still making sense, and what sense it then makes). The second part will concern a tool for automatically extracting repeated pieces of text from a corpus. The method lets the user set the percentage of permitted deviations and the tolerance for word-order changes. Tools for working with these data will also be presented.
		Tuesday 17. 11. 2015 13:00	public holiday
		Tuesday 24. 11. 2015 13:00	Discussion of Further Possibilities for Corpus Querying · Václav Cvrček / ÚČNK · The concordance is the classic output of a corpus query, but far from the only one. Some others are already embodied in existing corpus tools; others may only just be taking shape in our heads. Come and talk about everything else we would like to ask of corpus data. Václav Cvrček will give the opening remarks; the core of the meeting will be a joint discussion.
		Tuesday 1. 12. 2015 13:00	Cancelled (see below) · Alan Partington · Alan Partington's lectures on 1. and 2. 12. 2015 are unfortunately cancelled for unforeseen personal reasons; the speaker will not be able to come to Prague. We apologise to everyone who was planning to attend and thank you for your interest. We are working hard to make the visit possible in spring 2016. We will provide details in good time, including via the existing page http://partington.eventzilla.net. Thank you for your understanding!
		Tuesday 8. 12. 2015 13:00	pre-election assembly · ÚČNK
		Tuesday 15. 12. 2015 13:00	election of the leadership + Christmas party · ÚČNK
		Tuesday 22. 12. 2015 13:00	no seminar
		Tuesday 5. 1. 2016 13:00	no seminar
		Tuesday 12. 1. 2016 13:00	no seminar
		Tuesday 19. 1. 2016 13:00	Syntactic Annotation in the SYN2015 Corpus · Tomáš Jelínek · The talk will present the syntactic annotation used in the newly released SYN2015 corpus.
		Tuesday 26. 1. 2016 13:00	no seminar
		Tuesday 2. 2. 2016 13:00	Big Tuesday · ÚČNK
		Tuesday 9. 2. 2016 13:00	The Merlin Corpus · Barbora Štindlová, Pavel Pečený, Jirka Hana · Since its creation in 2001, the Common European Framework of Reference for Languages (CEFR) has become the most important reference tool for language teaching, certification and curriculum design. Despite this, there is growing concern about whether the CEFR classification is sufficiently clear and whether professionals in the field can reliably tell the individual levels apart without illustrative empirical characterisations of them at hand. This applies above all to languages other than English, for which there is an urgent demand for such empirical tools. The MERLIN project responds to this demand for Czech, German and Italian and offers a multilingual, multifunctional web interface that lets users explore written learner texts at levels A1 to C1 that are related to the CEFR, and that points out relevant linguistic features.
		Tuesday 16. 2. 2016 13:00	Multidimensional Analysis of Czech · ÚČNK doctoral students · Douglas Biber's multidimensional analysis (MDA) is a method that makes it possible to classify texts into functional macro-groups on the basis of text-internal criteria. For each text, a number of linguistic features expected to vary across genres are tracked; these features are then clustered into bundles (dimensions), for which a functional interpretation is sought. According to Biber, one such dimension is, for example, informational density, which is associated, among many other features, with a high frequency of nouns and a high average word length. In our project we are trying to build on Biber as well as on the pioneering work of Vilém Kodýtek and to use MDA to map the diversity of Czech. To this end we are preparing a corpus that is varied in genre and putting together a list of linguistic features in which this variability should be reflected. In the seminar we will briefly introduce the MDA method and the corpus in preparation, but we will focus primarily on the features, on whose correct selection the success of the whole analysis depends. We will therefore be grateful for any comments and additions. If you would prefer to think over your contribution to the discussion beforehand, you can get acquainted in advance with the features we are currently considering by means of this presentation.
		Tuesday 23. 2. 2016 13:00	no seminar
		Tuesday 1. 3. 2016 13:00	no seminar
		Tuesday 8. 3. 2016 13:00	Prothetic v- in the Czech Republic · Jan Chromý · The talk will present the results of the GAČR project Sociolinguistic Analysis of the Use of Prothetic v- in Bohemia, which investigated the use of this phenomenon in 5 cities (Prague, České Budějovice, Plzeň, Hradec Králové and Brno). Attention will be paid both to the linguistic and to the social factors that influence the use of v-. The results will also be compared with the existing literature on the topic.
		Tuesday 15. 3. 2016 13:00	Contrastive Linguistics with the InterCorp Parallel Corpus: The Case of the Intensifier quite · Michaela Martinková · According to the linguistic literature, the English intensifier quite is polyfunctional: with meanings that can be construed as bounded it acts as a maximizer, whereas with meanings construed as unbounded it tends to lower the intensity (e.g. Paradis 2008). In American English, however, quite can also function as a so-called booster (Quirk et al. 1985). An analysis of the Czech correspondences of adjectival constructions with the intensifier quite revealed, alongside cases of disambiguation, a high percentage of zero correspondences and pointed to its pragmatic rather than semantic function (Aijmer and Altenberg 2002). An analysis of the Spanish correspondences points in a similar direction, which is in line with Johansson's thesis that meaning can be seen through translation (e.g. 2007). The Czech counterparts are further tested for the presence of certain “contextual cues”. So far, the type of adjective has been used to diagnose the function of quite (Levshina 2014), but it is sometimes hard to determine, because an adjective can change its “construal” under contextual modulation (Paradis 2008). In the presentation I will therefore focus on a different phenomenon: the presence of a negative prefix in the adjective. On a balanced bidirectional corpus of English and Czech (created in InterCorp) I then determine the so-called mutual correspondences (Altenberg 1999) of English quite and its most frequent, similarly ambiguous Czech counterpart docela. Last but not least, the presentation thus uses the case of the intensifier quite to show various ways of working with a parallel corpus. Note: the abstract, including the list of references, is available for download: QUITE_PRAHA_brezen_2015.pdf
		Tuesday 22. 3. 2016 13:30	On Experience with Co-occurrence Analysis in Work on the Large German-Czech Lexical Database (Velká německo-česká lexikální databáze) · Marie Vachková
		Tuesday 29. 3. 2016 13:00	Presentation of the monograph Románské jazyky a čeština ve světle paralelních korpusů (Romance Languages and Czech in the Light of Parallel Corpora) · Petr Čermák et al.
		Tuesday 5. 4. 2016 14:00	Dictionary, How Proud That Sounds! Fourteen Stations on the Lexicographer's Calvary · Michal Škrabal · The abstract of this talk arose quite spontaneously, over a beer in some pub, above the draft of one of the short chapters of my dissertation, which I was reading and polishing endlessly and to the point of exhaustion. The surfeit of theory, together with the tyranny of academic phrasing, awoke in me a longing for a lighter, livelier, essayistic genre. And so the long-suppressed floodgates of poetic speech opened wide… oh, what a catharsis! (In the talk I will allow myself to be somewhat more personal; please do not hold it against me. Yet lovers of theory will get their fill too, I promise.)
		Monday 11. 4. 2016 13:00	Corpus-based analyses of variation in English: Why both size and structure matter (registration) · Mark Davies · Time and venue: Monday 12.4.2016 at 18:00, room 104 at FF UK. English corpus linguistics has a tradition of using small (1-5 million word) corpora to look at variation for high frequency phenomena. Within the last 5-10 years, however, very large web-based corpora (like those from Sketch Engine) have also become available. While both of these types of corpora certainly have their advantages, I argue that both have serious weaknesses when it comes to looking at many types of variation in English. I will present many examples of lexical, morphological, syntactic, and semantic variation in English, which can only be studied using corpora that are both large and which have a structure that lends itself to looking at variation (rather than just as a “blob” of billions of words of web pages). These examples of genre-based, historical, and dialectal variation in English will come from the 520 million word Corpus of Contemporary American English (COCA), the 400 million word Corpus of Historical American English (COHA), and the 1.9 billion word Corpus of Global Web-based English (GloWbE). All of these corpora are much larger than comparable corpora of English, and their unique structure allows them to provide insight into variation in English that cannot be obtained with any other source.
		Tuesday 12. 4. 2016 13:00	New from the BYU corpora: the NOW corpus and virtual corpora (registration) · Mark Davies · May 2016 will see two exciting developments from the BYU corpora (corpus.byu.edu), which are probably the most widely-used corpora at present. In this presentation I will give a “sneak peek” of these changes. First, we will release the NOW corpus (Newspapers on the Web). The corpus is composed of about three billion words of data from web-based newspapers for every day from January 2010 until now. Most importantly, the corpus grows by about 6-7 million words each day, which makes it ideal for looking at ongoing changes in the language. Second, we have incorporated into all of the BYU corpora the ability to create and use “virtual corpora” (previously only available with the BYU Wikipedia corpus). Users can create virtual corpora based on source (e.g. a particular magazine or newspaper or author), title, date, (sub-)genre, and even words within the text. They can then search within their virtual corpora, compare across them, and even extract keywords.
		Tuesday 19. 4. 2016 13:00	Morphological Homonymy in Czech · Vladimír Petkevič
		Tuesday 26. 4. 2016 13:00	no seminar
		Tuesday 3. 5. 2016 13:00	Corpus-assisted Discourse Studies (CADS): Good Practices and Potential Pitfalls · Alan Partington · I want to start by outlining some of the relatively well-known methodological and epistemological achievements of Corpus Linguistics. I’d like then to show how these both feed into but differ from the requirements and practices of corpus-assisted discourse studies, defined as the employment of corpus techniques to shed light on aspects of language used for communicative purposes or, put another way, to analyse how speakers (attempt to) influence the beliefs and behaviour of other people (Partington, Duguid & Taylor 2013). CADS does not refer to a particular school or approach, but is an umbrella term of convenience. Indeed, the types of research it refers to are eclectic and pragmatic in the techniques they adopt given that they are goal-driven, that is, the aims of the research dictate the methodology. However, although a broad church, it does possess its own characteristics, methods, resources, practices and is subject to its own particular temptations and pitfalls. By means of various case studies, I want to illustrate the added values of CADS to discourse study. It can supply an overview of large numbers of texts, and by shunting between statistical analyses, close reading and analysis types half-way between the two, CADS is able to look at language at different levels of abstraction. After all, ‘you cannot understand the world just by looking at it’ (Stubbs 1996: 92), and abstract representations of it need to be built and then tested. Indeed, far from being unable to take context into account (the most common accusation levelled at Corpus Linguistics), CADS contextualises, decontextualises and recontextualises language performance in a variety of ways according to research aims. It also highlights how statistical information, sometimes dismissed as ‘merely’ quantitative, is actually inherently also qualitative in nature.
Corpus techniques greatly facilitate comparison among datasets and therefore among discourse types. They can, moreover, ensure analytical transparency and replicability (and para-replicability). And because parts of the analysis are conducted by the machine, they enable the human analyst to step outside the hermeneutic circle, to place some distance between the interpreter and the interpretation. Finally, they enable the researcher to test the validity of their observations, for instance, by searching for counterexamples (‘positive cherry-picking’). Having said all this, the discourse analytical process is always guided by the analyst, and there are many parts of the process which a machine simply cannot tackle. This is why we prefer the term ‘corpus-assisted’ to alternatives such as ‘corpus-driven’ or ‘corpus-based’. The aim is to show how CADS sits within the wider framework of scientific research methodology, what we might mean by scientific objectivity in discourse analysis and what counts as good (in the senses of both ‘useful’ and ‘honest’) practices and what practices are best avoided. Abstract_for_Tuesday_Faculty.docx
		Wednesday 4. 5. 2016 14:10, room 104, main building of FF UK	“Why are you English all so anti-European?” A corpus-assisted discourse study (CADS) of “stay or leave?” arguments on the eve of the UK Referendum on withdrawal from the EU · Alan Partington · On June 23rd 2016, the British people will vote on whether to remain in or to withdraw from the European Union. The decision to hold the referendum was announced by PM David Cameron in January 2013. In this talk I want to compare and contrast the reactions to the referendum proposal in two English newspapers, the left-leaning Guardian and the right-leaning Daily Mail. In the first stage, two corpora were compiled, each containing all the articles in the two newspapers whose headline or leading paragraph contained the items eu OR european union OR brussels OR frankfurt for the year 2013, named respectively DM13_eu and GN13_eu. These were downloaded from the Lexis Nexis database. Another similar pair of focused corpora was compiled from 2005, named DM05_eu and GN05_eu, in order to conduct a diachronic comparison to examine whether the newspapers’ stances have altered over this time period as a result of changing circumstances, especially the Eurozone crisis. In a second updating stage, two more corpora were compiled of articles from the two newspapers which contain the terms eu OR europe AND referendum in the first three months of 2016, named respectively DM16_eu and GN16_eu. These corpora were compared and contrasted with each other using the WordSmith key-item tool which produced lists of words and short phrases (or ‘clusters’) which were more frequent in one data-set than another. This affords a window into the relative particular preoccupations of each paper at these particular times. Observations from the research include that British EU-scepticism - and EU-enthusiasm - come in various shades and varieties.
Anxieties over the EU have changed over time (especially if we reference Teubert’s 2001 seminal work on EU-sceptic discourses in 2000). The data show clearly that it is not simply a right versus left issue, and to divide viewpoints into just two camps - a pro-EU and anti-EU one - is hugely simplistic. But the problem with referendums is that they demand a ‘Yes’ or ‘No’ response, and shades of opinion cannot be envisaged. The principal endeavour of Corpus-assisted discourse study (CADS) is the investigation, and often comparison, of features of particular discourse types, integrating into the analysis techniques and tools developed within corpus linguistics, shunting between statistical quantitative overviews of data and traditional close reading, often of segments of data identified as potentially significant by the overview. The aim of the CADS approach is the uncovering, in the discourse type under study, of ‘non-obvious’ meanings and patterns of meanings, that is, meanings which might not be readily available to naked-eye perusal (Partington, Duguid and Taylor 2013). Abstract_for_Wednesday_students.docx
		Tuesday 10. 5. 2016 13:00
		Tuesday 17. 5. 2016 13:00	The ROMI Corpus · Kateřina Šormová
		Tuesday 24. 5. 2016 13:00	Lexical obsolescence: a qualitative and quantitative look · Jan Čermák, Ondřej Tichý
		Tuesday 31. 5. 2016 13:00	visit by the Dean
		Tuesday 7. 6. 2016 13:00
		Tuesday 14. 6. 2016 13:00
		Tuesday 21. 6. 2016 13:00
		Tuesday 28. 6. 2016 13:00	Echoes from the Port of Roses · Michal Křen · The seminar is conceived as a report from the LREC2016 conference, held this year in Portorož, and will have two main parts. The first will briefly present the papers that interested me most at the conference and that are relevant to (and potentially usable in) the ČNK. In the second part I will show the new features and improvements of (No)Sketch Engine that Lexical Computing is planning or that are already running at http://the.sketchengine.co.uk.