Text Clustering
Publisher
Vysoká škola báňská - Technická univerzita Ostrava
Abstract
This thesis analyses the procedures and methods used for clustering text documents and explains the challenges involved in applying document clustering techniques. Document clustering is performed on two real-world text datasets, 20 News group and Reuters. The 20 News group dataset is split into two variants: one retains the headers, footers and quotes present in the text documents, and the other contains the documents without them. We discuss different document clustering methods, their similarities, the challenges in applying them, and techniques for validating cluster quality, and compare them in detail. We also discuss dimension reduction techniques and their advantages, again with a detailed comparison. Finally, we discuss and conclude whether these dimension reduction methods improve the results of the clustering algorithms.
Subject(s)
Document clustering, text clustering, 20 News group, Reuters, HAC, kmeans, buckshot