Processing of semi-structured text data for use in data analysis models

Elena A. Makarova

Bryansk State Technical University

When building data analysis models, it is often advisable to use data of various forms and structures: numerical, categorical, textual, video, etc. The article studies how text data without a clear structure affects the quality of analysis models and shows how model accuracy depends on the methods used to process semi-structured text data. A model for intelligent processing of semi-structured text data is described, which includes visualization methods and data transformation algorithms proposed by the author in previous works. A modification of the algorithm for transforming erroneous spellings, based on vector word representation models, is proposed. An experiment was conducted on using data of different structures to classify applicants' resumes, and an example is given of processing semi-structured text data to classify resumes by profession. The stages of building a data mining model are described, including exploratory analysis, data extraction, and transformation. Problems inherent in the experimental data are described, such as spelling errors and the use of different terminology for the same concepts. The accuracy of classification models built on data processed in various ways is calculated. The experiments showed that semi-structured data barely improve the model's accuracy for this task when used without preliminary processing, but raise classification accuracy by several percent when processed correctly.
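The idea of transforming erroneous spellings with word vectors can be illustrated with a minimal sketch. This is not the author's algorithm: character trigram count vectors stand in here for a trained vector word representation model, and the vocabulary, the function names, and the 0.5 similarity threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=3):
    """Return a bag of character n-grams for a word, padded with boundary marks."""
    padded = f"<{word.lower()}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def normalize_token(token, vocabulary, threshold=0.5):
    """Map a token to the closest vocabulary word, or keep it if nothing is close enough."""
    if token.lower() in vocabulary:
        return token.lower()
    vec = char_ngrams(token)
    best, best_score = token, threshold
    for word in vocabulary:
        score = cosine(vec, char_ngrams(word))
        if score > best_score:
            best, best_score = word, score
    return best


# Hypothetical reference vocabulary of profession names (assumption, not from the article).
VOCABULARY = {"programmer", "accountant", "engineer", "manager"}

raw = ["programer", "acountant", "enginer", "driver"]
normalized = [normalize_token(t, VOCABULARY) for t in raw]
print(normalized)
```

In this sketch a misspelled token is replaced only when its vector is sufficiently similar to a known word, so unfamiliar but valid terms (such as "driver" above) pass through unchanged. A trained embedding model would be needed to also capture synonyms, that is, different terminology for the same concept.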

semi-structured text data, data analysis, data classification, CV analysis
