Text Clustering

Abstract

This thesis analyses the procedures and methods used for clustering text documents and explains the challenges involved in applying document clustering techniques. We perform document clustering on two real-world text datasets: 20 Newsgroups and Reuters. The 20 Newsgroups dataset is split into two variants: one retains the headers, footers and quotes present in the text documents, and the other contains the documents without these details. We discuss different document clustering methods, their similarities, the challenges in applying them, and techniques for validating cluster quality, together with a detailed comparison of the methods. We also discuss dimension reduction techniques and their advantages, again with a detailed comparison. Finally, we discuss and conclude whether these dimension reduction methods improve the results of both clustering algorithms.

Description

Subject(s)

Document clustering, text clustering, 20 Newsgroups, Reuters, HAC, k-means, Buckshot

Citation