Emostemmer: an efficient program for detecting emotions in the Russian language using N-grams (emotiograms)

M M Abbasi; A P Beltyukov

doi:10.22213/2410-9304-2021-4-148-157

Authors

M. M. Abbasi Udmurt State University
A. P. Beltyukov Udmurt State University

Keywords:

text, emotions, blog, communication, stemming, analysis, confusion matrix

Abstract

Emotions and the analysis of their expression in text have attracted growing interest in recent years. Researchers aim to build intelligent systems that can not only read a text but also determine the emotional state it conveys. The results can be used to train such a system to predict the emotional orientation of texts, their authors, and their readers. This kind of text analysis can also be used to gather feedback about a product or service, or reactions to an event or a government policy. It involves both syntactic and semantic analysis. The syntactic part consists of identifying the words in a text that represent emotions. Stemming, that is, reducing a word to its stem or root, plays an important role in this identification. In many Romance and Germanic languages, identifying emotion words is much easier than in Russian, because a single word represents the emotion regardless of grammatical form and gender. In Russian, where the ending of an emotionally charged word changes with gender, aspect, and other grammatical categories, the analysis becomes more complex. There are various methods for detecting emotions in text. This work focuses on identifying emotions in text while limiting the complexity of the algorithm, so that it requires a minimum of memory and time. We have created the Emostemmer program, an N-gram stemmer (in which the letters of a word are grouped into sequences of 2 letters, 3 letters, ..., N letters, called N-grams), to identify words that represent emotions in a text. The performance of Emostemmer was compared with that of RuSentiLex by training and testing a support vector machine classifier with both. The results are described in detail in the “Methodology” and “Discussion” sections.
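
The N-gram idea in the abstract can be illustrated with a minimal Python sketch. It is not the Emostemmer implementation: the stem lexicon `EMOTION_STEMS`, the function names, and the longest-match rule are all assumptions made for illustration. Each word is split into character N-grams of every length from the full word down to 2 letters, and the longest N-gram found in the lexicon is taken as the emotion stem.

```python
import re

# Hypothetical mini-lexicon of emotion stems (illustrative only,
# not taken from Emostemmer or RuSentiLex).
EMOTION_STEMS = {
    "радост": "joy",      # радость, радостный, радостью
    "груст": "sadness",   # грусть, грустная, грустно
    "страх": "fear",      # страх, страха, страхом
    "злост": "anger",     # злость, злостью
}


def char_ngrams(word, n):
    """Return all contiguous character n-grams of length n in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def match_emotion(word, lexicon=EMOTION_STEMS, min_n=2):
    """Return (stem, emotion) for the longest n-gram of word found in
    the lexicon, or None when no n-gram matches."""
    word = word.lower()
    # Try the longest n-grams first so the most specific stem wins.
    for n in range(len(word), min_n - 1, -1):
        for gram in char_ngrams(word, n):
            if gram in lexicon:
                return gram, lexicon[gram]
    return None


def emotions_in_text(text, lexicon=EMOTION_STEMS):
    """Return (word, stem, emotion) for every emotion word in text."""
    words = re.findall(r"[а-яё]+", text.lower())
    hits = []
    for word in words:
        match = match_emotion(word, lexicon)
        if match is not None:
            hits.append((word, match[0], match[1]))
    return hits


if __name__ == "__main__":
    print(char_ngrams("грусть", 3))
    print(emotions_in_text("Она была грустная, а он радостный."))
```

Because the lookup ignores endings, inflected forms such as «грустная» and «грустно» map to the same stem. Matching N-grams at any position can produce false positives when a stem happens to occur inside an unrelated word, so a practical stemmer would likely restrict matching to word-initial N-grams or use longer stems.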

Author Biographies

M. M. Abbasi, Udmurt State University

Post-graduate

A. P. Beltyukov, Udmurt State University

DSc (Physics and Mathematics), Professor

References

van Rijsbergen C. J., Robertson S. E., Porter M. F. (1980). New models in probabilistic information retrieval // British Library Research and Development Department, London. No. 5587.

Porter M. F. (1980). An algorithm for suffix stripping // Program, 14 (3), 130-137.

Krovetz R. (2000). Viewing morphology as an inference process // Artificial Intelligence, 118 (1), 277-294.

Paice C. D (1990). Another Stemmer // ACM SIGIR Forum, 24(3), 56-61.

Frakes W. B., Fox C. J. (2003). Strength and similarity of affix removal stemming algorithms // ACM SIGIR Forum, 37 (1), 26-30.

Bacchin M., Ferro N., Melucci M. (2005). A probabilistic model for stemmer generation // Information Processing and Management, 41 (1), 121-137.

Wiebe, J., Wilson T., Cardie C (2005). Annotating expressions of opinions and emotions in language // Language Resources and Evaluation 39 (2), 165-210.

Peng, F., Ahmed, N., Li, X., Lu, Y (2007). Context sensitive stemming for web search // Proc. of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval pp. 639-646.

Majumder P., Mitra M., Parui S. K., Kole G., Mitra P., Datta K. (2007). YASS: Yet another suffix stripper // ACM Transactions on Information Systems, 25 (4), Article 18.

Adam G., Asimakis K., Bouras C., Poulopoulos V (2010). An efficient mechanism for stemming and tagging: the case of Greek language // In the Proc. of the 14th International Conference on Knowledge-based and Intelligent Information and Engineering Systems: Part III, pp 389-397.

Feinerer I (2010). Analysis and Algorithms for Stemming Inversion. In: Cheng PJ., Kan MY., Lam W., Nakov P. (eds) // Information Retrieval Technology. AIRS 2010 // Lecture Notes in Computer Science vol. 6458. Springer, Berlin, Heidelberg.

Paik J. H., Mitra M., Parui S. K., Järvelin K. (2011). GRAS: An effective and efficient stemming algorithm for information retrieval // ACM Transactions on Information Systems, 29 (4), 1-24.

Fernández A., Díaz J., Gutiérrez Y., Muñoz R (2011). An Unsupervised Method to Improve Spanish Stemmer. In: Muñoz R., Montoyo A., Métais E. (eds) // Natural Language Processing and Information Systems // Lecture Notes in Computer Science. Vol. 6716. Springer, Berlin, Heidelberg.

Madani A., Kissi M. (2014). Building a syntactic rules-based stemmer to improve search effectiveness for Arabic language // 9th International Conference on Intelligent Systems: Theories and Applications (SITA-14), pp. 1-6.

Danilova V., Alexandrov M., Blanco X (2014). A Survey of Multilingual Event Extraction from Text. // In: Métais E., Roche M., Teisseire M. (eds) Natural Language Processing and Information Systems // Lecture Notes in Computer Science, Vol. 8455. Springer, Cham.

Moral C., de Antonio A., Imbert R., Ramírez J (2014). A survey of stemming algorithms in information retrieval // Information Research 19(1), 605.

Loukachevitch N. V., Chetviorkin I. (2014). Open evaluation of sentiment-analysis systems based on the material of the Russian language // Scientific and Technical Information Processing, 41 (6), 370-376.

Gadri S., Moussaoui A. (2015). Information retrieval: A new multilingual stemmer based on a statistical approach // 3rd International Conference on Control, Engineering & Information Technology, Tlemcen, Algeria, pp. 1-6.

Brychcín T., Konopík M. (2015). HPS: High precision stemmer // Information Processing and Management, 51 (1), 68-91.

Singh J., Gupta V (2016). Text Stemming: Approaches, Applications, and Challenges // ACM Computing Surveys (CSUR), 49 (3), Article 45.

Beltiukov A.P., Abbasi M.M (2019). Logical analysis of emotions in text from natural language // Vestnik Udmurtskogo Universiteta Matematika Mekhanika Komp'yuternye Nauki, 29 (1), 106-116.

Bölücü N., Can B. (2019). Unsupervised Joint POS Tagging and Stemming for Agglutinative Languages // ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 18 (3), Article 25.

Porter M.F (2001). Snowball: A language for stemming algorithms // Published online, (October 2001) Accessed 8.11.2021, 15.00h. http://snowball.tartarus.org/texts/introduction.html

Loukachevitch N. V., Levchik A. V. (2016). Creating the RuSentiLex lexicon of Russian sentiment words // Proceedings of the OSTIS-2016 Conference, pp. 377-382 (in Russian).

Loukachevitch N., Levchik A (2016). Creating a General Russian Sentiment Lexicon. // In the Proc. of Language Resources and Evaluation Conference LREC-2016.

List of feelings and emotions: blog of psychologist Petr Zarubin, Novosibirsk (in Russian). URL: https://peter-zarubin.ru/spisok-chuvstv-i-emotsij.