Determining Document Similarity Using Traditional Computational Methods and Crowd Collaboration

Abstract

This master's thesis deals with the categorization of text documents and its improvement through crowdsourcing. Its goal is to design and implement a prototype text document classifier based on document similarity, and to design a way of evaluating and improving the categorization using crowdsourcing. The N-grams algorithm was chosen for categorization and implemented in Java. A crowdsourcing interface was then created using the WordPress CMS. In addition to collecting data, the interface serves to evaluate categorization accuracy; this evaluation extends the classifier's test data set, which in turn makes the categorization more successful. Both parts of the thesis should serve as a basis for a project being prepared between the University of Ostrava and VŠB - Technical University of Ostrava.

Subject(s)

Categorization, text documents, natural language, document similarity, N-grams, crowdsourcing, WordPress, Java, PHP

Citation