An efficient unsupervised approach for OCR error correction of Vietnamese OCR text

dc.contributor.author: Nguyen, Quoc-Dung
dc.contributor.author: Phan, Nguyet-Minh
dc.contributor.author: Krömer, Pavel
dc.contributor.author: Le, Duc-Anh
dc.date.accessioned: 2024-02-14T06:46:40Z
dc.date.available: 2024-02-14T06:46:40Z
dc.date.issued: 2023
dc.description.abstract: Different types of OCR errors often occur in OCR texts due to the low quality of scanned document images or limitations in OCR software. In this paper, we propose a novel unsupervised approach for OCR error correction. Correction candidates for OCR errors are generated and explored in their neighborhoods using correction character edits controlled by an adapted hill-climbing algorithm. Correction characters are extracted from only original ground truth texts, which do not depend on OCR texts in training data. A weighted objective function used to score and rank correction candidates is heuristically tested to find optimal weight combinations. The proposed model is evaluated on an OCR text dataset originating from the Vietnamese handwritten database in the ICFHR 2018 Vietnamese online handwritten text recognition competition. The proposed model is also verified concerning its stability and complexity. The experimental results show that our model achieves competitive performance compared to the other models in the ICFHR 2018 competition. [cs]
dc.description.firstpage: 58406 [cs]
dc.description.lastpage: 58421 [cs]
dc.description.source: Web of Science [cs]
dc.description.volume: 11 [cs]
dc.identifier.citation: IEEE Access. 2023, vol. 11, p. 58406-58421. [cs]
dc.identifier.doi: 10.1109/ACCESS.2023.3283340
dc.identifier.issn: 2169-3536
dc.identifier.uri: http://hdl.handle.net/10084/152178
dc.identifier.wos: 001012334700001
dc.language.iso: en [cs]
dc.publisher: IEEE [cs]
dc.relation.ispartofseries: IEEE Access [cs]
dc.relation.uri: https://doi.org/10.1109/ACCESS.2023.3283340 [cs]
dc.rights.access: openAccess [cs]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [cs]
dc.subject: OCR [cs]
dc.subject: character edit [cs]
dc.subject: error correction [cs]
dc.subject: attention-based encoder-decoder [cs]
dc.subject: hill climbing [cs]
dc.title: An efficient unsupervised approach for OCR error correction of Vietnamese OCR text [cs]
dc.type: article [cs]
dc.type.status: Peer-reviewed [cs]
dc.type.version: publishedVersion [cs]

Files

Original bundle

Name: 2169-3536-2023v11p58406.pdf
Size: 2.05 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 718 B
Format: Item-specific license agreed upon to submission