Named Entity Recognition and Text Compression

dc.contributor.advisor: Snášel, Václav
dc.contributor.author: Nguyen, Hong Vu
dc.contributor.referee: Platoš, Jan [cs]
dc.contributor.referee: Čermák, Petr [cs]
dc.contributor.referee: Neruda, Roman [cs]
dc.date.accepted: 2016-11-02
dc.date.accessioned: 2016-12-13T12:07:17Z
dc.date.available: 2016-12-13T12:07:17Z
dc.date.issued: 2016
dc.description: Import 13/01/2017 [cs]
dc.description.abstract: In recent years, social networks have become very popular, and they make it easy for users to share their data online. Because social network data is idiomatic, irregular, and brief, and includes acronyms and spelling errors, it is more challenging to process than news or formal text. Given the huge volume of posts each day, effective extraction and processing of these data would greatly benefit information extraction applications. This thesis proposes a method to normalize Vietnamese informal text from social networks. The method identifies and normalizes informal text based on the structure of Vietnamese words, Vietnamese syllable rules, and a trigram model. After normalization, the data is processed by a named entity recognition (NER) model that identifies and classifies the named entities it contains. Our NER model uses six different types of features to recognize named entities in three predefined classes: Person (PER), Location (LOC), and Organization (ORG). Social network data is very large and grows daily, which raises the challenge of reducing its size. The trigram dictionary used for normalization is also quite large because of the volume of data to be normalized, so its size needs to be reduced as well. To address this challenge, the thesis proposes three methods for compressing text files, especially Vietnamese text. The first is a syllable-based method that relies on the structure of Vietnamese morphosyllables, consonants, syllables, and vowels. The second is a trigram-based Vietnamese text compression method built on a trigram dictionary. The third is based on an n-gram sliding window and uses five dictionaries, for unigrams, bigrams, trigrams, four-grams, and five-grams. This last method achieves a promising compression ratio of around 90% and can be used for text files of any size. [en]
dc.description.department: 460 - Department of Computer Science
dc.description.result: passed [cs]
dc.format: 89 leaves : illustrations [cs]
dc.format.extent: 1328859 bytes
dc.format.mimetype: application/pdf
dc.identifier.location: ÚK / Diploma thesis storage
dc.identifier.other: OSD002 [cs]
dc.identifier.sender: S2724 [cs]
dc.identifier.signature: 201700083 [cs]
dc.identifier.thesis: NGU0030_FEI_P1807_1801V001_2016
dc.identifier.uri: http://hdl.handle.net/10084/116544
dc.language.iso: en
dc.publisher: Vysoká škola báňská - Technická univerzita Ostrava [cs]
dc.rights.access: openAccess
dc.subject: text normalization, named entity recognition, text compression [en]
dc.thesis.degree-branch: Computer Science [cs]
dc.thesis.degree-grantor: Vysoká škola báňská - Technická univerzita Ostrava. Faculty of Electrical Engineering and Computer Science [cs]
dc.thesis.degree-level: Doctoral study programme [cs]
dc.thesis.degree-name: Ph.D.
dc.thesis.degree-program: Computer Science, Communication Technology and Applied Mathematics [cs]
dc.title: Named Entity Recognition and Text Compression [en]
dc.title.alternative: Named Entity Recognition and Text Compression [cs]
dc.type: Doctoral dissertation [cs]

Files

Original bundle

Name: NGU0030_FEI_P1807_1801V001_2016.pdf
Size: 1.27 MB
Format: Adobe Portable Document Format

Name: NGU0030_FEI_P1807_1801V001_2016_autoreferat.pdf
Size: 913.98 KB
Format: Adobe Portable Document Format

Name: NGU0030_FEI_P1807_1801V001_2016_posudek_oponent_Cermak_Petr.pdf
Size: 480.7 KB
Format: Adobe Portable Document Format
Description: Opponent's review – Čermák, Petr

Name: NGU0030_FEI_P1807_1801V001_2016_posudek_oponent_Neruda_Roman.pdf
Size: 657.6 KB
Format: Adobe Portable Document Format
Description: Opponent's review – Neruda, Roman

Name: NGU0030_FEI_P1807_1801V001_2016_posudek_oponent_Platos_Jan.pdf
Size: 628.44 KB
Format: Adobe Portable Document Format
Description: Opponent's review – Platoš, Jan