Podobnost dokumentů na Webu (Similarity of Documents on the Web)
Authors
Suchánek, Jindřich
Publisher
Vysoká škola báňská - Technická univerzita Ostrava
Location
ÚK/Sklad diplomových prací (Central Library / thesis storage)
Signature
200905059
Abstract
This bachelor's thesis deals with extracting data from blogs on the Internet, analysing the data, and presenting the results as graphs. In the first part of the thesis, a program was created that saves the entries acquired from blogs to XML files. Each file contains the title, a link to the entry, the author, the date, links, and the body of the entry. The second part of the thesis deals with processing these files. The results of the analysis are visualized using column charts, pie charts, and directed graphs. The similarity between individual entries is then calculated using the cosine similarity formula, and the resulting value is used as the edge weight in an undirected graph.
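The last step of the abstract, computing cosine similarity between entries and using it as an edge weight, can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names (`term_counts`, `cosine_similarity`, `similarity_edges`) are hypothetical, and the use of raw word counts as vector components is an assumption, since the abstract does not say how the entry texts were weighted or tokenized.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Lower-case the text and count word occurrences (a plain bag of words).

    Assumption: raw term counts; the thesis may have used another weighting.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty entry is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_edges(entries):
    """Map each unordered pair of entry ids to its similarity.

    `entries` is a dict of entry id -> body text. Each key of the result is
    one edge of an undirected graph, and the value is its weight.
    """
    vectors = {entry_id: term_counts(body) for entry_id, body in entries.items()}
    return {
        (u, v): cosine_similarity(vectors[u], vectors[v])
        for u, v in combinations(sorted(vectors), 2)
    }
```

Calling `similarity_edges` with a dictionary of three entry bodies yields three weighted edges, one per unordered pair. In practice a threshold could be applied to drop edges between entries that share almost no vocabulary.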
Description
Import 01/09/2009
Subject(s)
extraction, similarity, entry, blog, XML, graph