Automatic Detection of Lexical Loanwords in a Text Corpus

A. V. Dmitrijev; E. S. Krupnova

doi:10.26907/2658-3321.2025.8.2.204-217

Abstract

In the context of globalization and active contact between speakers of different languages and cultures, the borrowing of foreign lexical items has become a major driver of vocabulary development and enrichment. At the same time, the rapid growth in the volume of text has made manual analysis and identification of lexical units slow and inefficient, which calls for automated natural language processing (NLP) methods of loanword extraction. This paper reviews approaches to automatic lexical item extraction from 1986 to the present and develops an algorithm for the task. The material is a corpus of 22,348 English sentences collected from the websites of 11 leading universities in Austria, Germany and Russia. To verify the results, 47 new sentences were used. In addition, 1,318 new sentences containing German loanwords were generated using chatbots. The study uses the bert-base-multilingual-cased model. The corpus was annotated with two tags indicating the presence or absence of a German loanword in a sentence, and the model was then fine-tuned on the corpus and on the additionally generated sentences. The review shows that, although contemporary methods achieve high accuracy, model performance still varies considerably across language pairs, and improving overall efficiency remains a challenge. The study also describes an algorithm for the automatic extraction of German loanwords from English sentences using the BERT language model trained on a corpus of 900 texts. The model identified 30 of 43 words of German origin (about 70%).
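The two-tag annotation and the reported detection figure can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tag values, the `Example` class, the `recall` helper and the three sample sentences are assumptions introduced here for illustration; only the figures 30 and 43 come from the abstract. Note also that the abstract describes tags on sentences but reports results in words, so the sample below is sentence-level only.

```python
from dataclasses import dataclass

# Hypothetical tag values: the abstract only says that two tags were used.
HAS_LOANWORD = 1
NO_LOANWORD = 0


@dataclass(frozen=True)
class Example:
    text: str
    gold: int       # tag assigned during annotation
    predicted: int  # tag assigned by a classifier


def recall(examples):
    """Share of gold-positive items that the classifier also marked positive."""
    positives = [ex for ex in examples if ex.gold == HAS_LOANWORD]
    if not positives:
        raise ValueError("no positive examples to evaluate")
    hits = sum(ex.predicted == HAS_LOANWORD for ex in positives)
    return hits / len(positives)


# Invented sentences for illustration only; not taken from the corpus.
sample = [
    Example("The kindergarten is located next to the campus.", HAS_LOANWORD, HAS_LOANWORD),
    Example("Students discussed the zeitgeist of the 1920s.", HAS_LOANWORD, NO_LOANWORD),
    Example("The library opens at nine.", NO_LOANWORD, NO_LOANWORD),
]

sample_recall = recall(sample)  # one of the two positive sentences is found
reported_rate = 30 / 43         # figures stated in the abstract

print(f"sample recall: {sample_recall:.2f}")
print(f"reported detection rate: {reported_rate:.3f}")
```

The same ratio applies whether the counted items are sentences or individual words: the 30 of 43 reported in the abstract corresponds to a detection rate of roughly 0.70.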

Keywords

borrowing, Germanisms, natural language processing, NLP, multilingual BERT model

About the Authors

A. V. Dmitrijev

Peter the Great St. Petersburg Polytechnic University
Russian Federation

Dmitrijev Alexander Vladislavovich – Associate Professor

Saint Petersburg

E. S. Krupnova

Peter the Great St. Petersburg Polytechnic University
Russian Federation

Krupnova Elena Sergeevna – First-category specialist in educational and methodological work

Saint Petersburg

Review

For citations:

Dmitrijev A.V., Krupnova E.S. Automatic Detection of Lexical Loanwords in a Text Corpus. Kazan Linguistic Journal. 2025;8(2):204–217. (In Russ.) https://doi.org/10.26907/2658-3321.2025.8.2.204-217

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2658-3321 (Print)
