Data Compression Approach for Plagiarism Detection
Publisher
Vysoká škola báňská - Technická univerzita Ostrava
Location
ÚK/Sklad diplomových prací
Signature
201600193
Abstract
In our digital era, the need for plagiarism detection tools is growing with the tremendous number of documents produced daily, inside and outside academia, in all fields of science. These include reports, students’ assignments, undergraduate and graduate theses, and dissertations. While some students simply cut and paste, others resort to other forms of plagiarism, including changing the sentence structure, paraphrasing, and replacing words with their synonyms. This thesis focuses on creating tools for detecting textual plagiarism in Arabic and Czech texts by implementing the initial parts of a compression algorithm, with modifications, so that text similarity can be measured by compression-based similarity metrics. It then expands on this work by integrating the technique with a Czech synonym thesaurus and a Czech stemmer to detect semantic plagiarism, including paraphrasing and restructuring of Czech texts. Stemming and syllabification are also very important in information retrieval, data mining, and language processing, so creating good stemming and syllabification rules is crucial. The demand is even higher for languages spoken by large populations, such as Arabic. This thesis presents a novel method for syllabification of Arabic text based on Arabic vowel letters. It also presents a light stemming method for Arabic. To fine-tune the results of this method, an online parser is used before stemming to better categorize the parts of speech, and the output words are later matched against an electronic dictionary.
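The abstract does not state which compression-based similarity metric the thesis uses. As a rough illustration of the general idea only, the sketch below computes the well-known normalized compression distance (NCD) with Python's standard `zlib` compressor. Both the choice of NCD and the use of `zlib` are assumptions made for this example and are not taken from the thesis; the function names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the zlib-compressed input.

    zlib is a stand-in compressor here (an assumption), not the
    compression algorithm implemented in the thesis.
    """
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Lower values suggest that the
    two texts share more content. Texts are encoded as UTF-8, so
    Arabic or Czech input is handled at the byte level.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    source = (
        "Plagiarism detection tools compare a submitted document with "
        "known sources and report passages that appear to be copied."
    )
    suspect = (
        "Plagiarism detection tools compare a submitted text with "
        "known sources and report passages that seem to be copied."
    )
    print("NCD(source, suspect) =", round(ncd(source, suspect), 3))
```

A document pair with heavy overlap should score lower than an unrelated pair, but the absolute values depend on the compressor and on text length, so any decision threshold would need to be calibrated on real data. A measure of this kind works on surface bytes only; detecting synonym replacement or paraphrase would need the stemming and thesaurus steps the abstract describes.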
Description
Import 13/01/2017
Subject(s)
syllabification, stemming, data compression, similarity, plagiarism detection, text plagiarism