Topic Modeling of a Text Document Using Processed Document Term Matrix (PDTM)
DOI:
https://doi.org/10.22213/2410-9304-2024-4-24-31
Keywords:
lemmatization, tokenization, topic modeling, text, text analysis, topic
Abstract
Topic modeling is a method of determining the topic of a text document by analyzing its semantics and syntax. When analyzing text, the method determines the internal structure of a document or a set of documents and uses this information to classify or group similar words by topic. It also helps to identify the main trends of interest or information in a text document. For example, many people are interested in online shopping, politics, sports, economics, and society. Various online and offline data mining methods and algorithms are used to determine the topic of a text. Most of them rely on a mechanism based on the semantic characteristics of the language and the subject of the text. The main idea of this study is to develop a methodology that can be used effectively for topic modeling of text in different languages. First, the model preprocesses the text, which includes tokenization, removal of stop words, and lemmatization. Preprocessing and filtering out inappropriate text elements reduce the size of the text and improve classification performance. The algorithm then assumes the presence of n topics in the text document and, based on this assumption, generates the processed document term matrix (PDTM). The PDTM is a two-dimensional matrix that assigns a numerical value to each word in the text based on the frequency of its occurrence in the document and then relates the word to each of the assumed topics. The PDTM is generated to store the tokenized words. The proposed model and its results are described in detail in the methodology and discussion sections of this article.
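The pipeline outlined in the abstract (tokenization, stop-word removal, lemmatization, then a term-by-topic matrix for n assumed topics) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the tiny stop-word list, the toy suffix-stripping function standing in for a real lemmatizer, the function names, and in particular the rule that splits each term frequency evenly across the assumed topics, since the abstract does not state how the values are assigned.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real pipeline would
# use a language-specific list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into runs of Unicode letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lemmatize(token):
    """Toy stand-in for a lemmatizer: strips a few English suffixes."""
    if token.endswith("ss"):
        return token
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then lemmatize the remaining tokens."""
    return [lemmatize(t) for t in tokenize(text) if t not in STOP_WORDS]


def build_pdtm(tokens, n_topics):
    """Return (terms, matrix) with one row per term and one column per topic.

    Assumption: each term frequency is split evenly across the n assumed
    topics as a neutral starting point.
    """
    counts = Counter(tokens)
    terms = sorted(counts)
    matrix = [[counts[term] / n_topics] * n_topics for term in terms]
    return terms, matrix


if __name__ == "__main__":
    sample = "Voters discuss elections and sports fans discuss matches"
    terms, matrix = build_pdtm(preprocess(sample), n_topics=2)
    for term, row in zip(terms, matrix):
        print(term, row)
```

The uniform split only fixes the shape of the matrix (terms by topics); a topic model would subsequently replace these placeholder values with learned term-topic weights.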
License
Copyright (c) 2024 Мохсин Маншад Аббаси, Анатолий Петрович Бельтюков
This work is licensed under a Creative Commons Attribution 4.0 International License.