Automatic Detection of Lexical Loanwords in a Text Corpus
https://doi.org/10.26907/2658-3321.2025.8.2.204-217
Abstract
In the context of globalization and sustained contact between speakers of different languages and cultures, the borrowing of foreign lexical items has become a major driver of vocabulary development and enrichment. The rapid growth of textual data, however, has made manual identification of lexical units inefficient and time-consuming, which calls for automated natural language processing (NLP) methods of loanword extraction. This paper reviews approaches to automatic lexical item extraction from 1986 to the present and develops an algorithm for the task. The material is a corpus of 22,348 English sentences parsed from the websites of 11 leading universities in Austria, Germany and Russia. To verify the results, 47 new sentences were used. In addition, 1,318 new sentences containing German loanwords were generated using chatbots. The study used the “bert-base-multilingual-cased” model. The corpus was annotated with two tags indicating the presence or absence of a German loanword in a sentence. The model was then retrained on the corpus and on the additionally generated sentences. The findings show that, while contemporary methods achieve high accuracy, significant challenges remain in model performance across different language pairs and in improving overall efficiency. The study also describes an algorithm for automatic extraction of German loanwords from English sentences using the BERT large language model trained on a corpus of 900 texts. The model demonstrated robust performance, identifying 30 of 43 words of German origin (about 70%).
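The two-tag annotation scheme and the reported result can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tag values, the `AnnotatedSentence` type, the `annotate` and `detection_rate` helpers, the small lexicon and the example sentences are all hypothetical. A simple lexicon lookup stands in for the annotation step, and the abstract does not name the metric behind the 30-of-43 figure, so it is treated here as a plain share of words found.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical tag values: the abstract only states that two tags mark
# the presence or absence of a German loanword in a sentence.
LOANWORD = 1
NO_LOANWORD = 0


@dataclass(frozen=True)
class AnnotatedSentence:
    text: str
    label: int  # LOANWORD or NO_LOANWORD


def annotate(text: str, known_loanwords: set[str]) -> AnnotatedSentence:
    """Tag a sentence by lexicon lookup (an assumed stand-in for annotation)."""
    tokens = {tok.strip(".,;:!?\"'()").lower() for tok in text.split()}
    label = LOANWORD if tokens & known_loanwords else NO_LOANWORD
    return AnnotatedSentence(text, label)


def detection_rate(found: int, total: int) -> float:
    """Share of target words that were identified."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


# Illustrative lexicon and sentences, not taken from the authors' corpus.
lexicon = {"kindergarten", "zeitgeist", "wanderlust"}
corpus = [
    annotate("The university runs a kindergarten for staff children.", lexicon),
    annotate("Applications close at the end of June.", lexicon),
]
labels = [sentence.label for sentence in corpus]

# Figures reported in the abstract: 30 of 43 words of German origin found.
rate = detection_rate(30, 43)
print(labels, round(rate, 3))
```

Sentences labelled this way form the kind of binary-classification input on which a model such as bert-base-multilingual-cased could be trained. The training step itself is left out of the sketch.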
About the Authors
A. V. Dmitrijev
Russian Federation
Dmitrijev Alexander Vladislavovich – Associate Professor
Saint-Petersburg
E. S. Krupnova
Russian Federation
Krupnova Elena Sergeevna – First-category specialist in educational and methodological work
Saint-Petersburg
Review
For citations:
Dmitrijev A.V., Krupnova E.S. Automatic Detection of Lexical Loanwords in a Text Corpus. Kazan Linguistic Journal. 2025;8(2):204–217. (In Russ.) https://doi.org/10.26907/2658-3321.2025.8.2.204-217
