Named Entity Recognition and Text Compression
| dc.contributor.advisor | Snášel, Václav | |
| dc.contributor.author | Nguyen, Hong Vu | |
| dc.contributor.referee | Platoš, Jan | cs |
| dc.contributor.referee | Čermák, Petr | cs |
| dc.contributor.referee | Neruda, Roman | cs |
| dc.date.accepted | 2016-11-02 | |
| dc.date.accessioned | 2016-12-13T12:07:17Z | |
| dc.date.available | 2016-12-13T12:07:17Z | |
| dc.date.issued | 2016 | |
| dc.description | Import 13/01/2017 | cs |
| dc.description.abstract | In recent years, social networks have become very popular, and it is easy for users to share their data through them. Because data on social networks is idiomatic, irregular, and brief, and includes acronyms and spelling errors, dealing with such data is more challenging than dealing with news or formal texts. Given the huge volume of posts each day, effective extraction and processing of these data will bring great benefit to information extraction applications. This thesis proposes a method to normalize Vietnamese informal text in social networks. The method identifies and normalizes informal text based on the structure of Vietnamese words, Vietnamese syllable rules, and a trigram model. After normalization, the data are processed by a named entity recognition (NER) model to identify and classify the named entities they contain. Our NER model uses six different types of features to recognize named entities in three predefined classes: Person (PER), Location (LOC), and Organization (ORG). When examining social network data, we found that the volume of these data is very large and increases daily, which raises the challenge of how to reduce it. In addition, because of the amount of data to be normalized, we use a trigram dictionary that is itself quite large, so its size also needs to be reduced. To address this challenge, we propose three methods to compress text files, especially Vietnamese text. The first is a syllable-based method relying on the structure of Vietnamese morphosyllables, consonants, syllables, and vowels. The second is a trigram-based Vietnamese text compression method built on a trigram dictionary. The last is based on an n-gram sliding window, in which we use five dictionaries for unigrams, bigrams, trigrams, four-grams, and five-grams. This method achieves a promising compression ratio of around 90% and can be used for text files of any size. | en |
| dc.description.department | 460 - Department of Computer Science | |
| dc.description.result | passed | cs |
| dc.format | 89 leaves : ill. | cs |
| dc.format.extent | 1328859 bytes | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.location | ÚK/Thesis storage | |
| dc.identifier.other | OSD002 | cs |
| dc.identifier.sender | S2724 | cs |
| dc.identifier.signature | 201700083 | cs |
| dc.identifier.thesis | NGU0030_FEI_P1807_1801V001_2016 | |
| dc.identifier.uri | http://hdl.handle.net/10084/116544 | |
| dc.language.iso | en | |
| dc.publisher | Vysoká škola báňská - Technická univerzita Ostrava | cs |
| dc.rights.access | openAccess | |
| dc.subject | text normalization, named entity recognition, text compression. | en |
| dc.thesis.degree-branch | Computer Science | cs |
| dc.thesis.degree-grantor | Vysoká škola báňská - Technická univerzita Ostrava. Fakulta elektrotechniky a informatiky | cs |
| dc.thesis.degree-level | Doctoral study program | cs |
| dc.thesis.degree-name | Ph.D. | |
| dc.thesis.degree-program | Computer Science, Communication Technology and Applied Mathematics | cs |
| dc.title | Named Entity Recognition and Text Compression | en |
| dc.title.alternative | Named Entity Recognition and Text Compression | cs |
| dc.type | Doctoral thesis | cs |
Files
Original bundle
- Name:
- NGU0030_FEI_P1807_1801V001_2016.pdf
- Size:
- 1.27 MB
- Format:
- Adobe Portable Document Format
- Name:
- NGU0030_FEI_P1807_1801V001_2016_autoreferat.pdf
- Size:
- 913.98 KB
- Format:
- Adobe Portable Document Format
- Name:
- NGU0030_FEI_P1807_1801V001_2016_posudek_oponent_Cermak_Petr.pdf
- Size:
- 480.7 KB
- Format:
- Adobe Portable Document Format
- Description:
- Opponent's review – Čermák, Petr
- Name:
- NGU0030_FEI_P1807_1801V001_2016_posudek_oponent_Neruda_Roman.pdf
- Size:
- 657.6 KB
- Format:
- Adobe Portable Document Format
- Description:
- Opponent's review – Neruda, Roman
- Name:
- NGU0030_FEI_P1807_1801V001_2016_posudek_oponent_Platos_Jan.pdf
- Size:
- 628.44 KB
- Format:
- Adobe Portable Document Format
- Description:
- Opponent's review – Platoš, Jan