Lemmatization of fiction: Unidentified words in the Russian short story corpus 1900–1930

Authors

DOI:

https://doi.org/10.33910/1992-6464-2026-219-269-279

Keywords:

morphological analyzer, lemmatization, fiction, short story, the Russian Short Story Corpus

Abstract

Introduction. Modern morphological analyzers can not only look up a dictionary of word forms but also segment unfamiliar units and derive hypothetical lemmas. Even so, some word forms in literary texts still cannot be identified during lemmatization.

Materials and Methods. This study examines the problem of word recognition using material from the Russian Short Story Corpus 1900–1930, a representative electronic corpus containing several thousand texts written in Russia, and later the Soviet Union, during the first three decades of the 20th century. By comparing a frequency dictionary built from a sample of the Corpus against the Russian Orthographic Dictionary, we obtained a list of unrecognized word forms. We then traced each form back to the original word in the text and established why it had not been recognized, which allowed us to identify typical problems in the automatic lemmatization of literary texts.
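The comparison step described above can be illustrated with a minimal sketch. It assumes the corpus sample is already tokenized and that the reference dictionary is available as a plain list of entries; the function name `unrecognized_forms` and the toy data are hypothetical and are not taken from the study, whose actual pipeline relies on a morphological analyzer rather than direct string lookup.

```python
from collections import Counter


def unrecognized_forms(tokens, dictionary):
    """Return a frequency list of tokens absent from the reference dictionary.

    Matching is case-insensitive exact lookup (an assumption of this sketch);
    a real pipeline would first reduce each token to a lemma.
    """
    known = {entry.lower() for entry in dictionary}
    freq = Counter(token.lower() for token in tokens)
    return Counter({form: n for form, n in freq.items() if form not in known})


# Toy data for illustration only, not drawn from the Corpus.
tokens = ["дом", "Дом", "комбед", "шел", "комбед", "vis-a-vis"]
dictionary = ["дом", "шел"]

missing = unrecognized_forms(tokens, dictionary)
for form, count in missing.most_common():
    print(form, count)
```

The resulting list is only a starting point: each item still has to be checked against its context in the text to decide why it was not recognized.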

Results. Unrecognized elements fall into several groups: abbreviations and acronyms, proper names, complex words, stylistically marked lexemes, and words containing Latin letters. Many of these are not found in Russian explanatory dictionaries. The article provides statistical data on the number of unrecognized units in each group and how these figures changed over the three decades. The general trend shows an increase in the number of such words from the first, pre-war period to the second (war and revolution) and third (early Soviet) periods. The most noticeable increase in unrecognized word forms is observed in stories from the Soviet period. We put forward hypotheses to explain the significant differences in the number of unrecognized words in different periods, primarily by appealing to extralinguistic factors, namely changes in the sociopolitical situation.
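As a rough illustration of how such groups could be counted per period, the sketch below sorts unrecognized forms with surface heuristics. The rules, the group labels, and the period numbering (1 = pre-war, 2 = war and revolution, 3 = early Soviet) are assumptions made for this example; the abstract does not say that the grouping was automated, and stylistically marked lexemes or syllabic acronyms written in lowercase cannot be detected by rules of this kind.

```python
import re
from collections import Counter, defaultdict

LATIN = re.compile(r"[A-Za-z]")


def rough_group(form):
    """Assign a form to a coarse group using surface cues only (heuristic)."""
    if LATIN.search(form):
        return "latin"
    if len(form) > 1 and form.isupper():
        return "abbreviation"
    if form[:1].isupper():
        return "proper_name"
    if "-" in form:
        return "compound"
    return "other"  # includes stylistically marked words; needs manual review


def counts_by_period(items):
    """Count groups per period from an iterable of (period, form) pairs."""
    table = defaultdict(Counter)
    for period, form in items:
        table[period][rough_group(form)] += 1
    return table


# Toy data for illustration only, not drawn from the Corpus.
sample = [
    (1, "vis-a-vis"),
    (2, "Петроград"),
    (3, "РСФСР"),
    (3, "изба-читальня"),
    (3, "комбед"),
]
table = counts_by_period(sample)
```

Note that "комбед" lands in "other" even though it is a syllabic acronym, which shows why a manual pass over the list remains necessary.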

Conclusions. Changes in the sociopolitical environment inevitably lead to changes in language; because of the nature of the genre, short stories reflect these changes most rapidly.

References

Barkhudarov, S. G., Protchenko, I. F., Skvortsov, L. I. (2007) The Large Spelling Dictionary of the Russian Language. Over 106 000 words. Moscow: Oniks Publ.; Mir i Obrazovaniye Publ., 1160 p. (In Russian)

Bolshakova, Ye. I., Vorontsov, K. V., Yefremova, N. E. et al. (2017) Automatic processing of natural language texts and data analysis. Moscow: HSE University Publ., 269 p. (In Russian)

Fesenko, A. V., Fesenko, T. (1955) The Russian language in the Soviet Era. New York: Rausen Bros. Publ., 222 p. (In Russian)

Grebennikov, A. O., Marusenko, N. M., Skrebtsova, T. G. (2023) Mapping word frequencies in fiction on sociopolitical context: the case of early 20th century Russian short stories. Terra Linguistica, vol. 14, no. 1, pp. 21–30. https://doi.org/10.18721/JHSS.14103 (In English)

Kartsevskiy, S. I. (1923) Language, War and Revolution. Berlin: Russian Universal Publ., 72 p. (In Russian)

Kuznetsov, S. A., Skrebtsova, T. G., Suvorov, S. G., Klementyeva, A. V. (2019) Linguistic analyser: converting text into a meta-language data structure. Saint Petersburg: St. Petersburg State University Publ., 238 p. (In Russian)

Markasova, E. V. (2011) Issues of Identification and Lexicographic Description of Sovietisms of 1920-30s. Russian Language Journal, vol. 61, pp. 94–114. (In Russian)

Mazon, A. (2013) Lexis of war and revolution in Russia (1914–1918). Introduction. Abbreviation. Political Linguistics, no. 1, pp. 203–210. (In Russian)

Panov, M. V. (1968) Russian Language and Soviet Society: A Sociological and Linguistic Study. Word formation of the modern Russian language. Moscow: Nauka Publ., 300 p. (In Russian)

Pertsova, N. N. (2000) Some aspects of semantics of compound words. In: A. S. Narinyani (ed.). Proceedings of the International Seminar on Computational Linguistics “Dialog” 2000. Vol. 1. Protvino: [s. n.], pp. 246–247. (In Russian)

Selishchev, A. M. (2008) The Language of the Revolutionary Epoch: Observing the Russian Language of Recent Years (1917–1926). Moscow: URSS Publ., 248 p. (In Russian)

Sherstinova, T., Grebennikov, A., Skrebtsova, T. et al. (2020) Frequency word lists and their variability (the case of Russian fiction in 1900-1930). In: Proceedings of the 27th Conference of FRUCT Association. Helsinki: FRUCT Oy Publ., no. 27, pp. 366–373. (In English)

Sherstinova, T., Skrebtsova, T. (2019) Russian literature around the October revolution: A quantitative exploratory study of literary themes and narrative structure in Russian short stories of 1900-1930. In: Ceur Workshop Proceedings. International Conference “Internet and Modern Society”. Aachen: RWTH Aachen University Publ., vol. 2813, pp. 117–128. (In English)

Skrebtsova, T. G. (2021) New sociopolitical realities of the 1920s as reflected in Russian literature and lexicography (on the basis of syllabic acronyms). Political Linguistics, no. 2 (86), pp. 146–154. (In Russian)

Skrebtsova, T. G. (2021) Thematic tagging of literary fiction: the case of early 20th century Russian short stories. In: Ceur Workshop Proceedings. International Conference “Internet and Modern Society”. Aachen: RWTH Aachen University Publ., vol. 2813, pp. 265–276. (In English)

Sternin, I. A. (1979) Issues of analysing the structure of word meaning. Voronezh: Voronezh State University Publ., 156 p. (In Russian)

Tolstaya, S. M. (2020) Complex words and phrases: syntax and semantics. Rocznik Slawisticzny, vol. LXIX, pp. 167–180. https://www.doi.org/10.24425/rslaw.2020.134712 (In Russian)

Voronova, I. B. (2000) The text-forming function of literary proper names (based on epic works of the XIX–XX centuries). Extended abstract of the PhD dissertation (Philology). Volgograd, Volgograd State Pedagogical University, 24 p. (In Russian)

Published

2026-05-08

Issue

Section

Philological Sciences
