Podobnost dokumentů na Webu (Document Similarity on the Web)

Authors

Suchánek, Jindřich

Publisher

Vysoká škola báňská - Technická univerzita Ostrava

Location

ÚK/Sklad diplomových prací (Central Library, thesis storage)

Signature

200905059

Abstract

This bachelor's thesis deals with extracting data from blogs on the Internet, analyzing that data, and presenting the results as graphs. In the first part of the thesis, a program was created that saves the entries acquired from blogs to XML files. These files include the title, a link to the entry, the author, the date, links, and the body of the entry. The second part of the thesis deals with processing these files. The analysis is visualized using column charts, pie charts, and directed graphs. The similarity between individual entries is then calculated using the cosine similarity formula, and the similarity between two texts is used as the weight of the edge connecting them in an undirected graph.
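
The final step of the abstract, turning pairwise text similarity into weighted edges of an undirected graph, could look roughly like the sketch below. It is an illustration only and not the program from the thesis: it assumes plain word counts as the term vectors (the thesis may weight terms differently), and the names `term_frequencies`, `cosine_similarity`, `similarity_edges`, and the `threshold` parameter are hypothetical. Cosine similarity is the dot product of two vectors divided by the product of their Euclidean lengths.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each token.

    Assumption: raw term counts; the thesis may use another weighting.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_edges(entries, threshold=0.0):
    """Build undirected weighted edges (u, v, weight) between blog entries.

    `entries` maps an entry title to its body text. Each unordered pair is
    visited once, so no edge is duplicated in the opposite direction. Pairs
    whose similarity does not exceed `threshold` (hypothetical) are skipped.
    """
    vectors = {title: term_frequencies(body) for title, body in entries.items()}
    edges = []
    for u, v in combinations(sorted(vectors), 2):
        weight = cosine_similarity(vectors[u], vectors[v])
        if weight > threshold:
            edges.append((u, v, weight))
    return edges
```

Calling `similarity_edges` on a dictionary of entry titles and bodies returns a list of `(title, title, weight)` tuples that can be passed to any graph-drawing tool. Entries that share no words produce no edge under the default threshold.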

Description

Import 01/09/2009

Subject(s)

extraction, similarity, entry, blog, XML, graph
