Monitoring Word Frequency on Internet News Servers (Sledování frekvence slov v internetových zpravodajských serverech)
Authors
Činčala, Radoslav
Publisher
Vysoká škola báňská - Technická univerzita Ostrava
Abstract
The aim of this work is to process articles from public Czech news servers. The output is the frequency of the most frequent words in a given period of time or on a given news server. Article formats differ considerably from server to server, so automatically extracting an article's main body is not easy. The work is primarily concerned with methods of extracting data from articles so that further news servers can easily be added to the monitoring.
The resulting solution is a robust tool for automatic data extraction from articles on news servers, together with a tool that makes it quick and easy to add news servers to automatic monitoring and extraction. The extracted data are then processed and stored in a database along with the frequencies of individual words and other related data, so that statistics can be obtained for different time intervals and for different servers.
The output of data extraction can be influenced by lists of stop words and equivalent words, which can easily be changed dynamically. The tool is operated through a simple web interface that allows efficient searching of word frequencies in a given time interval or on a given server.
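The role of the stop-word and equivalent-word lists can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `word_frequencies`, the two word lists, and the sample sentence are all assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical lists: the abstract does not give the actual stop words
# or equivalent words used by the tool.
STOP_WORDS = {"the", "a", "of", "in", "and"}
EQUIVALENTS = {"govt": "government", "governments": "government"}


def word_frequencies(text, stop_words=STOP_WORDS, equivalents=EQUIVALENTS):
    """Count words in text, merging equivalent words and dropping stop words."""
    # Lowercase the text and split it into runs of word characters.
    words = re.findall(r"\w+", text.lower())
    # Map each word to its canonical form if one is listed.
    normalized = (equivalents.get(w, w) for w in words)
    # Count everything that is not a stop word.
    return Counter(w for w in normalized if w not in stop_words)


# Illustrative input, not taken from any real news server.
article = "The government and the govt of the region met in Prague."
freq = word_frequencies(article)
print(freq.most_common(3))
```

Because both lists are passed as parameters, they can be swapped at call time without touching the counting logic, which loosely mirrors the dynamic changes the abstract describes.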
Description
Import 26/06/2013
Subject(s)
time, article, database, extraction, frequency, information, internet journalism, java, HTML language, lemming, rss feed, word, news server, information retrieval