Data extraction from commercial web forums

Mokrousov M.N., Chirkova N.N.


This article reviews existing approaches to data retrieval from text and proposes a method for extracting data from commercial web forums based on regular expressions, dictionaries, and the analysis of adjacent attributes. It describes the data structure used to store and organize the regular expressions and information extraction rules, and gives examples of such rules. An experiment is conducted to determine the accuracy of the analysis, using a dedicated information system that implements the method described in the article.
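
The abstract does not reproduce the authors' rules or data structure, so the following is only a minimal sketch of how regular expressions, dictionaries, and adjacent-attribute analysis might be combined. The dictionaries `CURRENCIES` and `PRODUCTS`, the function `extract`, and the sample post are all hypothetical and are not taken from the article.

```python
import re

# Hypothetical dictionaries (illustrative only, not from the article).
CURRENCIES = {"rub": "RUB", "rubles": "RUB", "usd": "USD", "dollars": "USD"}
PRODUCTS = {"laptop", "phone", "bicycle"}

# Regular expressions: a number, and a token that is either a number or a word.
NUMBER = re.compile(r"\d+(?:\.\d+)?")
TOKEN = re.compile(r"\d+(?:\.\d+)?|[a-z]+")


def extract(post):
    """Extract a product, price, and currency from a forum post (toy example)."""
    tokens = TOKEN.findall(post.lower())
    result = {"product": None, "price": None, "currency": None}
    for i, tok in enumerate(tokens):
        if tok in PRODUCTS and result["product"] is None:
            # Dictionary lookup: the token is a known product word.
            result["product"] = tok
        elif NUMBER.fullmatch(tok) and result["price"] is None:
            # Adjacent-attribute check: a number counts as a price only
            # when the token right after it is a known currency word.
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if nxt in CURRENCIES:
                result["price"] = float(tok)
                result["currency"] = CURRENCIES[nxt]
    return result


print(extract("Selling a laptop, 2 years old, price 15000 rub"))
```

In this sketch the number `2` is skipped because its neighbour `years` is not a currency word, while `15000` is accepted because it is followed by `rub`. A real rule set would need to handle thousands separators, multi-word attributes, and morphology, none of which is attempted here.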


Keywords: natural language processing, data extraction, regular expressions, information retrieval



Copyright (c) 2016 Mokrousov M.N., Chirkova N.N.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

ISSN 1813-7911