Categorization of unorganized text corpora for better domain-specific language modeling

Staš, Ján; Zlacký, Daniel; Hládek, Daniel; Juhár, Jozef

dc.contributor.author	Staš, Ján
dc.contributor.author	Zlacký, Daniel
dc.contributor.author	Hládek, Daniel
dc.contributor.author	Juhár, Jozef
dc.date	2013
dc.date.accessioned	2014-01-15T12:07:04Z
dc.date.available	2014-01-15T12:07:04Z
dc.date.issued	2013
dc.identifier.citation	Advances in electrical and electronic engineering. 2013, vol. 11, no. 5, p. 398-403 : ill.	cs
dc.identifier.issn	1804-3119
dc.identifier.issn	1336-1376
dc.identifier.uri	http://hdl.handle.net/10084/101403
dc.description.abstract	This paper describes the categorization of unorganized text data gathered from the Internet into in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. An algorithm for text categorization and topic detection based on the most frequent key phrases is presented. In this scheme, each document entering the categorization process is represented by a vector space model with term weighting based on term frequency and inverse document frequency. Text documents are then automatically classified as in-domain or out-of-domain data by comparing them to the list of key phrases using one of the selected distance/similarity measures and a predefined threshold. The experimental results of language modeling and adaptation to the judicial domain show a significant improvement in model perplexity of about 19 % and a decrease in the word error rate of the Slovak transcription and dictation system of about 5.54 %, in relative terms.	cs
dc.format.extent	277132 bytes
dc.format.mimetype	application/pdf
dc.language.iso	en	cs
dc.publisher	Vysoká škola báňská - Technická univerzita Ostrava	cs
dc.relation.ispartofseries	Advances in electrical and electronic engineering	cs
dc.relation.uri	http://advances.utc.sk/index.php/AEEE/article/download/897/898	cs
dc.rights	© Vysoká škola báňská - Technická univerzita Ostrava
dc.rights	Creative Commons Attribution 3.0 Unported (CC BY 3.0)
dc.subject	language modeling	cs
dc.subject	large vocabulary continuous speech recognition	cs
dc.subject	similarity measure	cs
dc.subject	term weighting	cs
dc.subject	text categorization	cs
dc.subject	topic detection	cs
dc.title	Categorization of unorganized text corpora for better domain-specific language modeling	cs
dc.type	article	cs
dc.rights.access	openAccess
dc.type.version	publishedVersion	cs
dc.type.status	Peer-reviewed	cs
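
The categorization scheme summarized in the abstract (TF-IDF term weighting, comparison against a key-phrase list, and a threshold) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the whitespace tokenizer, the smoothed IDF formula, the use of cosine similarity as the "selected" measure, the treatment of the key-phrase list as one pseudo-document, the threshold value of 0.1, and the function names `tokenize`, `tfidf_vectors`, `cosine`, and `categorize`.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (dict: term -> weight) per token list."""
    n_docs = len(token_lists)
    # Document frequency: number of token lists in which each term occurs.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1, always positive.
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def categorize(documents, key_phrases, threshold=0.1):
    """Label each document 'in-domain' or 'out-of-domain'.

    The key-phrase list is treated as a single pseudo-document and
    weighted together with the corpus (an assumption of this sketch).
    """
    doc_tokens = [tokenize(doc) for doc in documents]
    key_tokens = tokenize(" ".join(key_phrases))
    vectors = tfidf_vectors(doc_tokens + [key_tokens])
    key_vector = vectors[-1]
    return [
        "in-domain" if cosine(vec, key_vector) >= threshold else "out-of-domain"
        for vec in vectors[:-1]
    ]
```

Calling `categorize(documents, key_phrases)` with a list of raw document strings and a list of domain key phrases returns one label per document. In practice the threshold and the similarity measure would need to be tuned on held-out data for the target domain.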

Files in this item

Name: 897-5029-1-PB-stas.pdf
Size: 270.6 KB
Format: PDF
Description: publishedVersion

This item appears in the following collections

AEEE. 2013, vol. 11 [58]
OpenAIRE [5085]
Collection intended for harvesting by the OpenAIRE infrastructure; it contains open-access publications and other publications resulting from projects of the European Commission framework programmes (FP7, H2020, Horizon Europe).