Extrakce zpráv z webových stránek (News Extraction from Web Pages)

Downloads

12

Authors

Blanár, Štefan

Publisher

Vysoká škola báňská - Technická univerzita Ostrava

Abstract

The main goal of this diploma thesis is to carry out a broad survey of text mining methods, in particular the mining of structured data from the Web, specifically from HTML documents, which is a well-known problem. The results of this survey are summarized in the first part of the document. Next, I examine several web wrappers, looking in particular for an existing wrapper that could serve as a solution for extracting news from the Web. I also carry out an extensive survey of the best-known news portals and the news published on them. Finally, the knowledge gained is used to develop my own solution to the problem of extracting news from web pages. I define what web news is and how it differs from information. I then test my solution under real conditions on well-known news portals. All results of this testing are presented in the last chapter of the thesis.

Description

Import 05/08/2014

Subject(s)

text mining, regular expression, extraction, news, Internet, Web, URL, method, algorithm, ReLIE, ONTEA, DOM, XML, HTML, XPath, MDF, TPC, NCSCA, TTR, wrapper, crawler, automatic wrapper, semi-automatic wrapper, keywords, scheme
