Evolutionary and Neural Approaches in OCR Error Correction

Optical Character Recognition (OCR) systems help to digitize paper-based archives. However, the poor quality of scanned documents and the limitations of text recognition techniques introduce various types of errors into the digitized texts, known as OCR texts. OCR errors reduce the readability of OCR texts and limit their usefulness for information retrieval and search applications. Post-processing, which detects and corrects OCR errors, is an essential step in improving the quality of OCR texts. Different approaches to OCR post-processing have been proposed, including corpus-based language models, machine learning, evolutionary algorithms, and statistical and neural machine translation. However, current OCR error detection and correction results show that the task remains challenging for low-quality OCR texts in different languages, especially for historical documents. In this thesis, we present an overview of related work on OCR post-processing; provide a statistical study of OCR errors and their causes; develop statistical, evolutionary, optimization-based, and neural methods for OCR error correction; and evaluate them on English and Vietnamese benchmark OCR text datasets. In particular, the main contributions of the thesis are as follows:

1. Designing and constructing a Vietnamese OCR text dataset for model training and evaluation.
2. Studying and providing statistical analyses of OCR errors and their possible causes.
3. Proposing algorithms for extracting and creating correction character patterns from training data, and for generating correction candidates with these patterns.
4. Proposing automatic OCR post-processing models that include preprocessing, error detection, and error correction phases using language models and error models.
5. Proposing three kinds of methods for OCR error correction: statistical language models (SLM), evolutionary and optimization algorithms, and neural machine translation (NMT). The proposed evolutionary and optimization-based methods are the first approaches to employ evolutionary and optimization algorithms for the OCR error correction problem.
6. Providing OCR post-processing models that can be used as a tool for OCR post-processing in various domains and languages.
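
To make the idea of correction character patterns more concrete, the following Python sketch shows one possible way to extract character-level edit patterns from aligned (OCR word, ground-truth word) pairs and to use them to generate correction candidates. It is an illustration only and not the thesis's algorithm: the function names `extract_patterns` and `generate_candidates`, the use of `difflib.SequenceMatcher` for alignment, the restriction to a single substitution per candidate, and the lexicon filter are all assumptions made for this example.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_patterns(pairs):
    """Count (ocr_substring, truth_substring) edits over aligned word pairs.

    `pairs` is an iterable of (ocr_word, truth_word) tuples. Alignment via
    difflib is an assumption of this sketch, not the thesis's method.
    """
    patterns = Counter()
    for ocr, truth in pairs:
        matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Record what the OCR produced and what it should have been.
                patterns[(ocr[i1:i2], truth[j1:j2])] += 1
    return patterns


def generate_candidates(token, patterns, lexicon):
    """Apply one pattern at one position; keep results found in `lexicon`.

    Candidates are ranked by the total training count of the patterns
    that produce them (a hypothetical scoring choice).
    """
    scores = Counter()
    for (src, dst), count in patterns.items():
        if not src:
            continue  # pure insertions are skipped in this simplified sketch
        start = token.find(src)
        while start != -1:
            candidate = token[:start] + dst + token[start + len(src):]
            if candidate in lexicon:
                scores[candidate] += count
            start = token.find(src, start + 1)
    return [word for word, _ in scores.most_common()]
```

For example, patterns learned from pairs such as ("tbe", "the") and ("rnodern", "modern") could then be applied to an unseen token like "bome" to propose "home", provided "home" is in the lexicon. A fuller system would presumably also rank candidates with a language model, as the abstract describes.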

Subject(s)

OCR, post-processing, error detection, error correction, language model, error model, evolutionary algorithm, machine translation

Item identifier

http://hdl.handle.net/10084/149025

Collections

Vysokoškolské kvalifikační práce Fakulty elektrotechniky a informatiky / Theses and dissertations of Faculty of Electrical Engineering and Computer Science (FEI)
