Automatic Identification of Slovak Text Author using Machine-Learning Methods

Abstract

In today’s world, computers are needed to process large quantities of text data. One task that can be automated is text document classification. Most classification algorithms require numerical input, so methods for transforming text into numerical vectors, i.e. vectorization, had to be developed. In this thesis we study different vectorization methods while solving the problem of author identification, using speeches made during Slovak national parliament meetings as training data. We compare the well-established bag-of-words family of vectorization methods with novel word-graph-based approaches. Bag-of-words methods are considered intuitive but come with a number of disadvantages; most notably, the numerical vectors they produce are sparse and high-dimensional. Word-graph-based vectorization addresses these issues. The main goal of the thesis is to determine whether this new approach is better suited to solving complex text classification problems. The tested vectorization methods are further combined with multiple algorithms for training classification models, and these combinations are compared in terms of classification accuracy and training time. Two dataset variants are examined in the experiments: the first has a similar number of documents for each class, and the second has significant differences in the number of samples available for different authors. The results show that bag-of-words methods provide better performance than the originally proposed word-graph algorithm. We propose a set of modifications to the word-graph algorithm which, when applied, significantly improve classification accuracy. We find the modified model especially useful in combination with the decision tree classification method, as it provides reasonable accuracy and the added benefit of easy interpretability.
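
To illustrate what vectorization means here, the following is a minimal sketch of a bag-of-words count vectorizer in plain Python. It is not the implementation used in the thesis; the function names (`build_vocabulary`, `bag_of_words`), the whitespace tokenization, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index (hypothetical helper)."""
    vocabulary = {}
    for text in documents:
        for token in text.lower().split():
            # Assign the next free index the first time a token is seen.
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def bag_of_words(text, vocabulary):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    # Dicts preserve insertion order, so this matches the column indices.
    return [counts.get(token, 0) for token in vocabulary]


# Hypothetical toy sentences, not taken from the parliamentary corpus.
documents = [
    "the budget proposal is ready",
    "the opposition rejects the proposal",
]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(text, vocabulary) for text in documents]
```

Each vector has one entry per distinct word across all documents, so on a realistic corpus the vectors grow with the vocabulary while most entries stay zero. This is the sparsity and high dimensionality mentioned above.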

Description

Subject(s)

natural language processing, machine learning, classification, text processing, author identification, vectorization, bag-of-words, word-graph

Citation