Correlating human ratings of idiom transparency and decomposability with computational semantic relatedness: a cross-linguistic study of Italian and English idioms.

Keywords: idioms; human ratings; semantic similarity; cross-lingual; idiom decomposability; embeddings

Idioms are traditionally defined as non-compositional multiword expressions (Cacciari & Tabossi, 2014): the meaning of an idiom is ‘more than the sum of the meanings of its parts’. While it has long been held that idioms are fixed expressions, possibly stored as whole chunks in memory, modern research has shown that not all idioms are the same (Libben & Titone, 2008). Idiomatic expressions display different degrees of syntactic flexibility and semantic modifiability (Geeraert et al. 2017; Mancuso et al. 2020). Moreover, idioms differ in their decomposability and transparency (Geeraerts, 2002), two key variables in understanding the relationship between idiom literality and figurativity. Idiom transparency is the degree to which the figurative meaning of an idiomatic expression can be intuitively inferred from its ‘literal interpretation’ (or how the literal meaning may have motivated the idiomatic sense; Vega Moreno, 2005). Decomposability, or analyzability, refers to the degree to which it is possible to observe or infer how the individual components of an idiomatic expression contribute to its overall figurative meaning (Tabossi et al. 2011). The common example is ‘spill the beans’, meaning ‘reveal secrets’, where ‘spill’ corresponds to ‘reveal’ and ‘beans’ corresponds to ‘secrets’. Idioms are considered opaque and nondecomposable if no such correspondence can be observed, e.g., ‘kick the bucket’ and ‘die’ (Nunberg et al. 1994).

Researchers in psycholinguistics have introduced idiom norming studies where human participants rate idiomatic expressions on different aspects, such as familiarity, ambiguity, transparency, and decomposability (Bulkes & Tanner, 2017; Titone & Connine, 1994). Such ratings have helped to document the variability of idioms on many dimensions, and in different languages, and are commonly used for psycholinguistic experiments on idiom comprehension.

In this study, we investigate how computational semantic relatedness between an idiom and its paraphrase informs our understanding of idiom transparency and decomposability. To this end, we use a dataset of 150 idiomatic expressions (Pagliai, 2023) rated by native speakers in two languages (English and Italian). For that dataset, participants provided subjective ratings of idiom decomposability and transparency on 5-point Likert scales, and each idiom was paired with a literal paraphrase of its figurative meaning. In the current study, we embed both the idiomatic expressions and the paraphrases of their idiomatic meanings into dense embedding vectors. We then compute the cosine similarity between each idiom embedding and the embedding of its paraphrase, estimating how semantically related an idiom (taken literally at ‘face value’) is to its figurative meaning. Finally, we calculate correlations between the obtained cosine similarity scores and the human ratings of idiom decomposability and transparency. Results indicate a weak to moderate correlation (Pearson correlation values in the range of 0.3 to 0.5) between cosine similarity scores and human ratings of idiom decomposability and transparency. This indicates that semantic relatedness (or similarity) between the literal and figurative meanings of an idiomatic expression may contribute to the perception of decomposability. These findings are consistent across both English and Italian and are replicated across different embedding spaces, including multilingual models. We present the technical details of this work and discuss the assumptions involved, the tentative conclusions, and some implications for future studies.
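The similarity-and-correlation step described above can be illustrated with a minimal sketch. It assumes the idiom and paraphrase embeddings have already been produced by some embedding model (not specified here) and are available as plain lists of floats. The function names, the toy two-dimensional vectors, and the toy ratings are hypothetical placeholders for illustration only; they are not data or code from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def similarity_rating_correlation(idiom_vecs, paraphrase_vecs, ratings):
    """Pair each idiom vector with its paraphrase vector, compute cosine
    similarities, and correlate them with the human ratings."""
    sims = [cosine_similarity(i, p) for i, p in zip(idiom_vecs, paraphrase_vecs)]
    return sims, pearson_r(sims, ratings)


# Toy placeholder inputs (NOT real embeddings or real ratings).
toy_idiom_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_paraphrase_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
toy_ratings = [5.0, 3.0, 1.0]  # stand-in for mean 5-point Likert ratings

toy_sims, toy_r = similarity_rating_correlation(
    toy_idiom_vecs, toy_paraphrase_vecs, toy_ratings
)
print(toy_sims, toy_r)
```

In practice the same two functions would be applied once per language, per embedding model, and per rated variable (decomposability and transparency), with the real embeddings and rating means substituted for the toy inputs.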

References

Bulkes, N. Z., and Tanner, D. (2017). “Going to town”: Large-scale norming and statistical analysis of 870 American English idioms. Behavior Research Methods, 49:772–783.

Cacciari, C., and Tabossi, P. (eds.) (2014). Idioms: Processing, structure, and interpretation. Psychology Press, New York.

Geeraerts, D. (2002). The interaction of metaphor and metonymy in composite expressions. In R. Dirven and R. Pörings (eds.), Metaphor and metonymy in comparison and contrast, pp. 435–465. Berlin, New York: Mouton de Gruyter.

Geeraert, K., Newman, J., and Baayen, R. H. (2017). Idiom variation: Experimental data and a blueprint of a computational model. Topics in Cognitive Science, 9(3), 653–669.

Libben, M. R., and Titone, D. A. (2008). The multidetermined nature of idiom processing. Memory & Cognition, 36(6), 1103–1121.

Mancuso, A., Elia, A., Laudanna, A., and Vietri, S. (2020). The Role of Syntactic Variability and Literal Interpretation Plausibility in Idiom Comprehension. Journal of Psycholinguistic Research, 49, 99–124.

Nunberg, G., Sag, I. A., and Wasow, T. (1994). Idioms. Language, 70, 491–538.

Pagliai, I. (2023). Bridging the Gap: Creation of a Lexicon of 150 Pairs of English and Italian Idioms Including Normed Variables for the Exploration of Idiomatic Ambiguity. Journal of Open Humanities Data, 9: 16, pp. 1–13. DOI: https://doi.org/10.5334/johd.123.

Tabossi, P., Arduino, L., and Fanari, R. (2011). Descriptive norms for 245 Italian idiomatic expressions. Behavior Research Methods, 43:110–123.

Titone, D. A., and Connine, C. M. (1994). Descriptive Norms for 171 Idiomatic Expressions: Familiarity, Compositionality, Predictability, and Literality. Metaphor and Symbolic Activity, 9(4), 247–270.

Vega Moreno, R. E. (2005). Idioms, transparency and pragmatic inference. Technical report, UCL Working Papers in Linguistics, 17:389–425.