Utilization of Entropy in the Text Similarity

Abstract

In our computerized world, computers and users produce an enormous quantity of new data every day. One of the most challenging problems in modern informatics and computer science is detecting similarities and differences among large numbers of documents. This dissertation focuses on the use of entropy in measuring text similarity. Text similarity can be measured by compression-based similarity metrics, and their application is shown in three areas. The first area is spam detection, where an incoming e-mail is assigned to one of two classes: solicited or unsolicited (spam). This classification can be done by a Bayesian spam filter. Here the filter is extended with the Normalized Compression Distance and e-mail signatures, and the combination gives better results than a standalone Bayesian spam filter. The second area is plagiarism detection. Many kinds of documents are produced today, such as reports and theses in the school environment. The retrieval and extraction of reused text from large document collections are important to applications such as plagiarism detection, copyright protection, and information flow analysis. To address these problems, the thesis presents algorithms that can detect similar (plagiarized) documents. The proposed method is also inspired by data compression, but in a different way: it uses only some initialization parts of the compression algorithm and its modifications. The last part shows how electroencephalography (EEG) data can be processed as text documents. First, the data has to be converted from measured voltages into text codes. This conversion is performed with turtle graphics, and the result is encoded as text. After the conversion, the EEG data can be processed and classified with a compression-based similarity metric. This transformation of EEG data is applicable to the detection of simple cognitive tasks, for example finger movements.
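As an illustration of the compression-based similarity metric named in the abstract, the following is a minimal sketch of the Normalized Compression Distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The function names `compressed_size` and `ncd`, the sample texts, and the choice of zlib as the compressor are assumptions made for this example; the abstract does not state which compressor the thesis uses.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 suggest similar inputs; values near 1 suggest
    unrelated inputs. Real compressors only approximate the ideal,
    so results can fall slightly outside the range [0, 1].
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical sample texts, used only to exercise the function.
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox leaps over the lazy dog " * 20
    c = b"lorem ipsum dolor sit amet consectetur adipiscing elit " * 20
    print("near-duplicate:", ncd(a, b))
    print("unrelated:     ", ncd(a, c))
```

In a classifier of the kind the abstract describes, a distance like this could be compared against known examples of each class; how the thesis actually combines it with the Bayesian filter and e-mail signatures is not specified here.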

Description

Import 02/11/2016

Subject(s)

similarity, text data, spam detection, plagiarism detection, EEG, BCI

Citation