Lemmatization of fiction: Unidentified words in the Russian short story corpus 1900–1930
DOI:
https://doi.org/10.33910/1992-6464-2026-219-269-279
Keywords:
morphological analyzer, lemmatization, fiction, short story, the Russian Short Story Corpus
Abstract
Introduction. Modern morphological analyzers can not only look up word forms in a dictionary but also segment unfamiliar units and derive hypothetical lemmas. Even so, some word forms in literary texts still cannot be identified during lemmatization.
Materials and Methods. This study examines the problem of word recognition using material from the Russian Short Story Corpus 1900–1930, a representative electronic corpus containing several thousand texts written in Russia, and later the Soviet Union, during the first three decades of the 20th century. By comparing a frequency dictionary built from a sample of the Corpus against the Russian Orthographic Dictionary, we obtained a list of unrecognized word forms. We then attempted to restore the original words from the text and to understand why they were not recognized, which allowed us to identify typical problems in the automatic lemmatization of literary texts.
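The comparison step described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the function name `unrecognized_forms`, the tokenization rule, and the toy word list are all hypothetical, and a real run would compare lemmatizer output against the full orthographic dictionary.

```python
import re
from collections import Counter


def unrecognized_forms(text, known_words):
    """Return {word form: frequency} for forms absent from known_words.

    Assumption: tokens are runs of letters, optionally joined by hyphens,
    compared in lower case. The study's actual tokenization is not given.
    """
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    freq = Counter(tokens)  # frequency dictionary of the sample
    return {w: n for w, n in freq.items() if w not in known_words}


# Hypothetical toy data standing in for a Corpus sample and the dictionary.
sample_text = "Комбед собрал хлеб, и комбед ушёл."
known = {"собрал", "хлеб", "и", "ушёл"}
print(unrecognized_forms(sample_text, known))
```

Each form left over by such a comparison would then be checked by hand against its source text to see why it was not recognized.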
Results. Unrecognized elements fall into several groups: abbreviations and acronyms, proper names, complex words, stylistically marked lexemes, and words containing Latin letters. Many of these are not found in Russian explanatory dictionaries. The article provides statistical data on the number of unrecognized units in each group and how these figures changed over the three decades. The general trend shows an increase in the number of such words from the first, pre-war period to the second (war and revolution) and third (early Soviet) periods. The most noticeable increase in unrecognized word forms is observed in stories from the Soviet period. We put forward hypotheses to explain the significant differences in the number of unrecognized words in different periods, primarily by appealing to extralinguistic factors, namely changes in the sociopolitical situation.
Conclusions. Changes in the sociopolitical environment inevitably lead to changes in language; because of their genre, short stories reflect these changes most rapidly.
License
Copyright (c) 2026 Tatyana G. Skrebtsova, Alexander O. Grebennikov

This work is licensed under a Creative Commons Attribution 4.0 International License.