Data Preprocessing for the Doc2Vec Model in the Automatic Classification of Text Requests in Software Provider Technical Support Services

O M Shatalova; A A Kuzmin

doi:10.22213/2410-9304-2024-3-103-112

Authors

O. M. Shatalova, Kalashnikov Izhevsk State Technical University
A. A. Kuzmin, Kalashnikov Izhevsk State Technical University

Keywords:

support tickets, text classification, Doc2Vec, data preprocessing, natural language processing

Abstract

Leading software providers typically operate customer technical support functions, which are crucial for promoting their products and services and keeping them competitive in global markets. The high volume and heterogeneity of support tickets (functional, temporal, linguistic, etc.) make efficient classification systems important. Effective classification optimizes the distribution of tickets among support center specialists and allows their processing to be automated using an established knowledge base. However, classifying these tickets is a loosely formalized task. For companies that have accumulated substantial data on customer requests, it becomes feasible to automate classification with machine learning methods and natural language processing models such as Word2Vec, FastText, BERT, and GPT. It is generally accepted that classification effectiveness depends primarily on the model employed. Nevertheless, the quality of these models is significantly influenced by the nature of the training data. A review of the literature reveals significant research interest in methods for the automatic classification of tickets tailored to the operating conditions of software provider support centers. However, there is a noticeable gap in the literature regarding the impact of data preprocessing on the quality of these models. The article aims to clarify data preprocessing techniques and analyze their impact on the effectiveness of text classification, taking into account the specifics of software provider support centers. The study examines the stages of the automatic ticket classification process, accounting for the distinctive characteristics of the data (customer text requests). A corresponding set of methodological and instrumental tools was developed and tested on open data from a global software provider (DevExpress). The testing involved a database of 165,000 tickets.
The results indicate that preprocessing can improve classification metrics such as F-measure, Precision, and Recall from 77% to 79%. In addition, preprocessing significantly reduces the dimensionality of the text data (by 48.2%) and increases model training speed (by 26.5%) without loss of accuracy, which makes the use of computational resources more economical and efficient.
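To make the reported quantities concrete, the following is a minimal sketch of how a reduction in text dimensionality (vocabulary size) and the Precision, Recall, and F-measure metrics could be computed. The abstract does not list the specific preprocessing steps, so the steps used here (lowercasing, URL removal, stop-word filtering), the stop-word list, the sample tickets, and all function names are assumptions made for illustration only, not the authors' pipeline.

```python
import re

# Hypothetical stop-word list; the article does not specify one.
STOP_WORDS = {"the", "a", "an", "is", "to", "in", "on", "i", "my", "it", "and", "of", "when"}

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z][a-z0-9_]*")


def preprocess(text):
    """Lowercase, drop URLs, tokenize, and remove stop words (assumed steps)."""
    text = URL_RE.sub(" ", text.lower())
    return [tok for tok in TOKEN_RE.findall(text) if tok not in STOP_WORDS]


def vocabulary_size(token_lists):
    """Number of distinct tokens across all documents."""
    return len({tok for tokens in token_lists for tok in tokens})


def precision_recall_f1(y_true, y_pred, label):
    """One-vs-rest precision, recall, and F-measure for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented sample tickets, used only to exercise the functions.
tickets = [
    "The grid is not refreshing when I call Refresh, see https://example.com/a",
    "Export to PDF fails in the report designer",
]
raw = [t.split() for t in tickets]          # naive whitespace tokens
clean = [preprocess(t) for t in tickets]    # tokens after preprocessing

reduction = 1 - vocabulary_size(clean) / vocabulary_size(raw)
print(f"vocabulary reduced by {reduction:.1%}")
```

The cleaned token lists are the kind of input a Doc2Vec model would be trained on; the per-class metrics can then be averaged over ticket categories once a classifier has produced predictions.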

Author Biographies

O. M. Shatalova, Kalashnikov Izhevsk State Technical University

DSc in Economics

A. A. Kuzmin, Kalashnikov Izhevsk State Technical University

Post-graduate
