n-Gram-based text compression
dc.contributor.author | Nguyen, Vu H. | |
dc.contributor.author | Nguyen, Hien T. | |
dc.contributor.author | Duong, Hieu N. | |
dc.contributor.author | Snášel, Václav | |
dc.date.accessioned | 2017-01-05T07:13:14Z | |
dc.date.available | 2017-01-05T07:13:14Z | |
dc.date.issued | 2016 | |
dc.identifier.citation | Computational Intelligence and Neuroscience. 2016, art. no. 9483646. | cs |
dc.identifier.issn | 1687-5265 | |
dc.identifier.issn | 1687-5273 | |
dc.identifier.uri | http://hdl.handle.net/10084/116564 | |
dc.description.abstract | We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It achieves a significant compression ratio compared with state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them using n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, obtaining dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods. | cs |
dc.format.extent | 1900833 bytes | |
dc.format.mimetype | application/pdf | |
dc.language.iso | en | cs |
dc.publisher | Hindawi | cs |
dc.relation.ispartofseries | Computational Intelligence and Neuroscience | cs |
dc.relation.uri | http://dx.doi.org/10.1155/2016/9483646 | cs |
dc.rights | Copyright © 2016 Vu H. Nguyen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. | cs |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | cs |
dc.title | n-Gram-based text compression | cs |
dc.type | article | cs |
dc.identifier.doi | 10.1155/2016/9483646 | |
dc.rights.access | openAccess | |
dc.type.version | publishedVersion | cs |
dc.type.status | Peer-reviewed | cs |
dc.description.source | Web of Science | cs |
dc.description.firstpage | art. no. 9483646 | cs |
dc.identifier.wos | 000388857100001 | |
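
The encoding scheme outlined in the abstract (split the text into n-grams, then replace each one with a short code looked up in an n-gram dictionary) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the article: the function names, the greedy longest-match strategy standing in for the authors' sliding-window search, the fixed 3-byte code (one byte for n, two bytes for the dictionary index), and the literal fallback for words missing from the dictionaries.

```python
import struct

MAX_N = 5  # try five-grams first, then shorter n-grams down to unigrams


def build_dictionaries(corpus_tokens, max_n=MAX_N):
    """Assign an integer index to every n-gram seen in the corpus, one dict per n."""
    dicts = {n: {} for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(corpus_tokens) - n + 1):
            gram = tuple(corpus_tokens[i:i + n])
            dicts[n].setdefault(gram, len(dicts[n]))
    return dicts


def encode(tokens, dicts, max_n=MAX_N):
    """Greedy longest-match encoding (an assumed stand-in for the paper's window search)."""
    out = bytearray()
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            idx = dicts[n].get(tuple(tokens[i:i + n]))
            if idx is not None:
                # Hypothetical layout: 1 byte for n, 2 bytes for the index (index < 65536).
                out += struct.pack(">BH", n, idx)
                i += n
                break
        else:
            # Unknown word: tag 0, one length byte (word must be under 256 bytes), raw UTF-8.
            raw = tokens[i].encode("utf-8")
            out += struct.pack(">BB", 0, len(raw)) + raw
            i += 1
    return bytes(out)


def decode(data, dicts):
    """Invert encode() using the same dictionaries."""
    inverse = {n: {idx: gram for gram, idx in d.items()} for n, d in dicts.items()}
    tokens = []
    pos = 0
    while pos < len(data):
        n = data[pos]
        if n == 0:
            length = data[pos + 1]
            tokens.append(data[pos + 2:pos + 2 + length].decode("utf-8"))
            pos += 2 + length
        else:
            (idx,) = struct.unpack_from(">H", data, pos + 1)
            tokens.extend(inverse[n][idx])
            pos += 3
    return tokens


dicts = build_dictionaries("a b c d e a b".split())
message = "a b c d e x a b".split()
encoded = encode(message, dicts)
```

Here the eight-word message is covered by one five-gram code, one literal, and one bigram code, and `decode(encoded, dicts)` returns the original word list. A real system along the lines of the abstract would use dictionaries built from a large corpus and a variable two-to-four-byte code per n-gram size, which this sketch does not attempt to reproduce.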
This item appears in the following Collection(s)
- Publikační činnost VŠB-TUO ve Web of Science / Publications of VŠB-TUO in Web of Science [7798]
  The collection contains bibliographic records of articles by VŠB-TUO academic staff published in journals indexed in Web of Science from 1990 to the present.
- Články z časopisů s impakt faktorem / Articles from Impact Factor Journals [6377]
  Journal articles (from 2008 onward) from journals that had an impact factor at the time the article was published.
- OpenAIRE [5085]
  Collection intended for harvesting by the OpenAIRE infrastructure; it contains openly accessible publications and, where applicable, other publications resulting from projects of the European Commission framework programmes (FP7, H2020, Horizon Europe).
- Publikační činnost Katedry informatiky / Publications of Department of Computer Science (460) [562]
  The collection contains bibliographic records of publications (articles) by academic staff of the Department of Computer Science (460) in journals and in Lecture Notes in Computer Science registered in Web of Science from 2003 to the present.