Abishek Stephen

Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University

Zdeněk Žabokrtský

Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University

Searching for a Minimal Model of Spread of Lexemes across Languages

Keywords: Lexical similarity; Borrowing; Language change; Hierarchical clustering; Indic languages

Lexical elements tend to spread across languages, and as a result the lexicon of a language may contain words of different genealogies whose parentage is difficult to trace on the synchronic level (Poplack and Sankoff, 1984). A vast number of words have been borrowed from Latin across languages and language families, whereas Latin itself has not been an extensive borrower of words from other languages. By contrast, we find many French-origin words in English and English-origin words in the contemporary French lexicon. This shows that there is a disparity in donor-recipient relations, which is to be expected given the linguistic histories, areal and genetic factors, and social circumstances involved, among other factors.

If we accept that the need for a concept is one of the motivations for languages to borrow (Campbell, 2020), shared concepts offer a way to illustrate the relations between languages in a linguistic area and to identify languages that are potential lexical donors, borrowers, or even resistant to lexical transfer. We propose to achieve this through agglomerative hierarchical clustering methods. This choice stems from the lack of annotated lexical databases for low-resource languages that indicate the donor languages of individual lexical items.

A synchronic clustering of languages based on their contemporary vocabulary can show how languages cluster differently from their genetic phylogenies. Such a shift can be seen as a result of extensive borrowing: languages now share words or semantic concepts with genealogically unrelated languages, and a language may cluster closer to the language group from which it received a higher influx of non-native vocabulary. For this study, we consider the languages spoken in the Indian subcontinent because of their rich linguistic diversity, although these languages are low-resourced. We use CogNet (Batsuren et al., 2019) as our data resource and consider the Indo-Aryan (Assamese, Bengali, Gujarati, Hindi, Kashmiri, Konkani, Nepali, Oriya, Punjabi, Sanskrit, Urdu), Dravidian (Kannada, Malayalam, Tamil, Telugu), and Sino-Tibetan (Bodo) languages (see Table 1).

We cluster the languages using our scoring metric and show that the resulting hierarchy is a promising data structure for visualizing lexeme transfers (Figure 1). The scoring function is inspired by Inverse Document Frequency (Sparck Jones, 1972) and is given by the following equation:

score(l₁,l₂) = -log((Σ(c₁) 1/2^(c₁-2) + ε) / (Σ(c₂) 1/2^(c₂-2) + 2ε))

where l₁ and l₂ are languages, c₁ is the number of concepts in l₁ ∩ l₂ (i.e., the intersection of concepts in languages l₁ and l₂), c₂ is the number of concepts in l₁ ∪ l₂ (i.e., the total number of concepts in l₁ and l₂), and ε = 1/len(languages), the reciprocal of the number of languages. Once we obtain the scores for all language pairs, we perform hierarchical clustering using the Ward linkage method, which aims to create compact, well-separated clusters by minimizing the spread of data points within clusters.
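The scoring and clustering steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The summand is read as an IDF-style weight 1/2^(n-2), where n is the number of languages in which a concept is attested. The score is treated as a distance, since it grows as the overlap shrinks. Ward linkage is computed with the Lance-Williams update on squared scores. The function names pair_score and ward_merges are hypothetical, and the concept sets A to D are toy data, not drawn from CogNet.

```python
import math
from collections import Counter
from itertools import combinations


def pair_score(a, b, n_langs_with, eps):
    """Distance-like score between two concept sets (smaller = more overlap)."""
    # Assumption: a concept attested in n languages is weighted 1 / 2**(n - 2).
    def weight(c):
        return 1.0 / 2 ** (n_langs_with[c] - 2)

    shared = sum(weight(c) for c in a & b) + eps
    total = sum(weight(c) for c in a | b) + 2 * eps
    return -math.log(shared / total)


def ward_merges(names, dist):
    """Agglomerative clustering with the Lance-Williams update for Ward linkage.

    dist maps frozenset({x, y}) to a distance. Returns the merges in order as
    (left, right, height); merged clusters are represented as nested tuples.
    """
    size = {n: 1 for n in names}
    d2 = {frozenset(p): dist[frozenset(p)] ** 2 for p in combinations(names, 2)}
    merges = []
    while len(size) > 1:
        pair = min(d2, key=d2.get)          # closest pair of clusters
        i, j = sorted(pair, key=str)        # deterministic left/right order
        dij = d2.pop(pair)
        ni, nj = size.pop(i), size.pop(j)
        new = (i, j)
        for k, nk in size.items():          # update distances to the new cluster
            dik = d2.pop(frozenset((i, k)))
            djk = d2.pop(frozenset((j, k)))
            d2[frozenset((new, k))] = (
                (ni + nk) * dik + (nj + nk) * djk - nk * dij
            ) / (ni + nj + nk)
        size[new] = ni + nj
        merges.append((i, j, math.sqrt(dij)))
    return merges


# Toy data (hypothetical): language label -> set of concept identifiers.
langs = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 6, 7, 8},
    "D": {6, 7, 8, 9},
}
n_langs_with = Counter(c for concepts in langs.values() for c in concepts)
eps = 1 / len(langs)
dist = {
    frozenset((x, y)): pair_score(langs[x], langs[y], n_langs_with, eps)
    for x, y in combinations(langs, 2)
}
merges = ward_merges(list(langs), dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

On this toy input, C and D are merged first, then A and B, and finally the two resulting clusters. Ward linkage formally assumes Euclidean distances, so applying it to these scores is itself an approximation.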

To induce the horizontal links, we collect the remainders, i.e., the exclusive concept overlaps, always using donor or recipient groups that are as large as possible rather than many smaller sharings at lower levels. We treat languages or language clusters as potential lexeme donors or recipients without inferring the exact direction of flow. Hence, either end of a horizontal link can be a donor or a recipient. The output representation is a graph structure that conceptualizes the “flow” or spread of lexemes. There are two types of nodes and two types of edges in the graph structure.

Types of nodes:

Nodes corresponding to actual languages (terminal nodes).
Nodes corresponding to (hierarchical) groupings of languages (non-terminal nodes).

Types of edges:

Vertical edges, approximating the inheritance of lexemes from the language's antecedents.
Horizontal edges, approximating borrowings from a donor language or a donor language group to a recipient language or a recipient language group.

The terminal nodes are linked by bidirectional arcs, reflecting the hypothesis that either node may be the donor or the recipient of a concept. The arc width follows a logarithmic scale of the number of exclusive overlaps. An exclusive overlap means that a concept is found only in these two languages or language groups and is not inherited vertically. We use a dashed line to connect two non-terminal nodes, or a non-terminal node to a terminal node, for representational purposes, but the underlying induction logic remains the same.
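The notion of an exclusive overlap between two terminal nodes can be illustrated with a short Python sketch. It covers only the simplest case, a concept attested in exactly two languages, and does not model language groups or the check against vertical inheritance. The function names, the log(1 + count) width scaling, and the toy concept sets are assumptions made for illustration; the text only states that the width is logarithmic.

```python
import math
from collections import defaultdict


def exclusive_overlaps(concepts_by_lang):
    """Count concepts attested in exactly two languages, keyed by that pair."""
    langs_with = defaultdict(set)
    for lang, concepts in concepts_by_lang.items():
        for c in concepts:
            langs_with[c].add(lang)
    counts = defaultdict(int)
    for attested_in in langs_with.values():
        if len(attested_in) == 2:           # shared by this pair and nobody else
            counts[frozenset(attested_in)] += 1
    return dict(counts)


def arc_width(n_exclusive, base_width=1.0):
    # Assumed scaling: width grows with log(1 + number of exclusive overlaps).
    return base_width * math.log1p(n_exclusive)


# Toy data (hypothetical): language label -> set of concept identifiers.
toy = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 6, 7, 8},
    "D": {6, 7, 8, 9},
}
overlaps = exclusive_overlaps(toy)
widths = {pair: arc_width(n) for pair, n in overlaps.items()}
```

Because the arcs are bidirectional, each pair is stored as an unordered frozenset rather than as a (donor, recipient) tuple.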

Several of the induced horizontal links are linguistically plausible. For example, Tamil and Kannada exclusively share some concepts that are not shared by the other Dravidian languages (or by languages of other families). Assamese and Bengali, both Eastern Indo-Aryan languages, are connected by a heavily weighted arc, as are Hindi and Urdu. In the case of Hindi and Urdu, this is expected given that they are the standard registers of the pluricentric language Hindustani.

Most of the horizontal links occur within the Indo-Aryan clusters, which suggests that horizontal links are more frequent within a language family than across language families. These horizontal innovations need further investigation to identify the phenomena responsible for this behavior. Although the horizontal links, or exclusive overlaps, depend heavily on the underlying data, our approach captures the spread of lexemes across languages belonging to three different language families using a minimal number of edges.

Figure 1: The visualization showing the inheritance and the horizontal links. The language codes used are ISO 639-2.

References

Shana Poplack and David Sankoff. 1984. Borrowing: The synchrony of integration. Linguistics, 22(1):99–136.

Lyle Campbell. 2020. Historical Linguistics: An Introduction. Edinburgh University Press.

Khuyagbaatar Batsuren, Gabor Bella, and Fausto Giunchiglia. 2019. CogNet: A large-scale cognate database. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3136–3145, Florence, Italy. Association for Computational Linguistics.

Karen Sparck Jones. 1972. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 28(1):11–21.