Data extraction from structured documents with Tesseract and OpenCV

Problem statement

R&D Center WINSTARS.AI
Jul 14, 2021
Figure 1. The pipeline of data extraction from the structured document

With the development of information technology and online resources, identity verification and data digitization have become increasingly important. Many companies that handle personal data require customers to confirm their identity by sending a photo or a scanned copy of a document. Checking and verifying all of this information manually costs a great deal of employee time and money, and manual checking is also error-prone. Automating the processing of textual information from images makes it possible to read information from any document quickly. In this article, we describe an approach for automatically reading data from document photos that has several advantages over manual data input:

  • Speed. You do not need to verify and type in each field of the document photo by hand. With machine learning tools, reading data from a document is much faster.
  • Accuracy. Manual data entry is a source of many inaccuracies and errors; modern tools avoid most mistakes in data collection.
  • Integration. The developed approach can be easily integrated into any application or service.
  • Security. There is a high risk of information leakage when confidential data is processed and entered manually. Automation helps keep personal data confidential.

Competitors

To understand the domain, it is worth checking an existing solution for automatic text reading from images, such as Nanonets. The most significant disadvantage of this service is its inability to read text in languages other than English. In addition, at least ten images very similar to the input image must be labeled for the service to work correctly. Finally, you have to buy a monthly subscription with a limit of 10,000 images, which is not cost-effective for users who need only a small number of requests.

Approaches

Most traditional approaches fall into two large categories, template-based and NLP-based; a third, more recent option uses graph convolutional networks.

  1. Template-Based approach

This approach analyzes previously known layouts to determine the type of information at a specific position in the image. The process involves locating the specific fields that contain data, after which OCR (optical character recognition) is applied.

For example, the system can be implemented for several types of sample documents: when a specific document is scanned, it is automatically classified into one of the categories, and then optical character recognition is applied to the field locations known for that category.
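As a minimal illustration of this step, here is a Python sketch that reads known field regions from an image already aligned to the template; the field names and coordinates are invented for the example:

```python
import cv2
import pytesseract

# Illustrative template: field name -> (x, y, width, height) in template
# coordinates. Real values would come from labeling a sample per document type.
TEMPLATE_FIELDS = {
    "surname": (120, 80, 400, 40),
    "date_of_birth": (120, 140, 200, 40),
}

def read_fields(aligned_image, fields=TEMPLATE_FIELDS):
    """OCR each known field region of an image aligned to the template."""
    results = {}
    for name, (x, y, w, h) in fields.items():
        crop = aligned_image[y:y + h, x:x + w]
        results[name] = pytesseract.image_to_string(crop).strip()
    return results

# Usage: fields = read_fields(cv2.imread("aligned_document.png"))
```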

Advantages and disadvantages of this approach:

+ a small amount of data to build and implement the service;

+ read only the necessary information;

+ possibility of local preprocessing for fields;

- the impossibility of processing a document whose type does not exist in the category list;

- the need to build and train a classifier of documents;

  2. Natural Language Processing (NLP) based approach

This approach first reads all the textual information from the document and then uses the NLP technique known as Named Entity Recognition (NER) to assign a tag to each piece of text.

A tag is the class to which a particular piece of information from the document belongs, for example, ‘date’ or ‘surname’.
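As a minimal illustration, here is how NER tagging might look with spaCy (the library choice is our assumption; a production system would train a model with document-specific tags such as ‘surname’ or ‘date of issue’):

```python
import spacy

# Requires a pretrained model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "John Smith, born 12.05.1985, passport issued 03.02.2015."
doc = nlp(text)

# Each recognized entity gets a tag (class label).
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "John Smith" PERSON
```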

Advantages and disadvantages of this approach:

+ you do not need to label field locations for each document type, so any document layout can be processed;

+ the text will be read even from a poorly structured document;

- several fields may be assigned to the same class, creating ambiguity;

- the model needs to be trained for each of the supported document languages;

  3. Graph Convolutional Networks

Figure 2. Graph convolutional neural network workflow

To implement this approach, we build a graph convolutional network. The document image must first be converted into a graph: for example, text boxes become nodes and spatial relations between them become edges. The network then scans local patterns in the graph, just as a CNN scans its input through a small window, recognizing local connections between nodes within the window. GCNs (Graph Convolutional Networks) can also recognize hierarchies of patterns: like a CNN, a GCN can stack many layers, where the patterns recognized by the first layer serve as input to the second, which learns to recognize combinations of patterns (for example, in a passport the date of issue is always below the name). Using the graph, such patterns can be recognized regardless of the scale or orientation of the image.
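As a rough sketch, a two-layer GCN for classifying text-box nodes might look like this in PyTorch Geometric (the framework is our assumption; node features would be things like box position and a text embedding):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class FieldClassifier(torch.nn.Module):
    """Classifies text-box nodes of a document graph into field types."""

    def __init__(self, num_node_features, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, 64)
        self.conv2 = GCNConv(64, num_classes)

    def forward(self, x, edge_index):
        # The first layer learns local patterns; the second learns
        # combinations of them (e.g. "date of issue" sits below "name").
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)
```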

Advantages and disadvantages of this approach:

+ the ability to find relationships between different types of information reduces the likelihood of assigning the wrong class to similar types of information;

+ you do not need to label templates for each document type;

- a large amount of data and resources is required to train the network;

Optical character recognition

Optical character recognition (OCR) technologies are used to read printed or handwritten text from an image. OCR consists of several subtasks that together convert a two-dimensional image of a document into text.

Subprocesses:

  • Image preprocessing
  • Text localization
  • Character segmentation
  • Character recognition
  • Post-processing

Many tools have been developed for OCR and image processing. In this project, our team used only open-source software.

Tesseract is an open-source text recognition engine. Text can be extracted from an image through its API or the command-line tool. It supports many languages and can be used both to recognize text in a whole document and to read individual fragments of an image.
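A minimal sketch of calling Tesseract from Python through the pytesseract wrapper (one of several ways to use the engine):

```python
import pytesseract
from PIL import Image

# Whole-page recognition; `lang` selects the trained language data,
# e.g. "eng" or "eng+ukr" for multi-language documents.
image = Image.open("document.png")
text = pytesseract.image_to_string(image, lang="eng")

# Word-level boxes and confidences, useful when reading individual fragments.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
print(text)
```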

Figure 3. OCR process flow

Tesseract 4.00 includes a new neural network subsystem configured as a text line recognizer. It has its origins in OCRopus’ Python-based LSTM implementation but has been redesigned for Tesseract in C++. The neural network system in Tesseract pre-dates TensorFlow but is compatible with it through a network description language called Variable Graph Specification Language (VGSL), which is also available for TensorFlow.

Preprocessing

Image Registration

Since we receive a photo of the document as input, we need to fit it to the template so that the document fields coincide with the template fields. This process is called image registration.

Figure 4. Image registration example

Feature-based approaches are the most widely used and effective for image registration. They involve several steps: detecting key points, computing descriptors, matching features, and warping the image.

In other words, the algorithm selects significant points in both images and matches pairs of points between the template image and the input image. From these matches, a transformation is estimated that aligns the input image to the template.

Key points define what is important and distinctive in the image (corners, edges, etc.). Each key point can be represented by a descriptor: a feature vector containing the main characteristics of the key point. The descriptor must be robust to image transformations (translation, scale, brightness, etc.). Many algorithms have been designed for key point detection; a registration sketch using one of them follows the list below:

  • SIFT (Scale-invariant feature transform) is a basic method that was developed to identify key points. The SIFT descriptor is robust to uniform scaling, orientation and brightness changes, and affine distortion.
  • SURF (Speeded Up Robust Features) is an algorithm that builds on the above-mentioned SIFT and includes both a detector and a descriptor. Its main advantage is that it is much faster than its predecessor, but it is not free for commercial use.
  • KAZE is a fast, multi-scale algorithm that detects and describes features in nonlinear scale spaces. This method is resistant to both scale and rotation.
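As promised above, here is a minimal registration sketch. It uses ORB, a free alternative to SIFT and SURF; the parameter values are illustrative, not tuned:

```python
import cv2
import numpy as np

def register_to_template(input_img, template_img, max_features=2000, keep_ratio=0.2):
    """Warp input_img onto template_img using ORB feature matching."""
    orb = cv2.ORB_create(max_features)
    kp1, des1 = orb.detectAndCompute(input_img, None)
    kp2, des2 = orb.detectAndCompute(template_img, None)

    # Match binary descriptors and keep only the strongest matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    matches = matches[: int(len(matches) * keep_ratio)]

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Estimate a homography robustly with RANSAC and warp the input.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = template_img.shape[:2]
    return cv2.warpPerspective(input_img, H, (w, h))
```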

Image denoising

Since we want accurate output, we need to preprocess the input images: they often contain noise or watermarks that degrade Tesseract’s output. To prepare images for OCR, we used some of the following OpenCV functions in Python. Let’s take a look at the output after applying each of these functions.

Figure 5. Example output of OpenCV algorithms on a noisy image

Blur. Used to smooth the image, which reduces noise. It works best on images with only mild noise.

Threshold. In some images, the text tends to be darker than the noise. Thresholding sets all pixels whose intensity is above a threshold to 1 and the remaining pixels to 0. The problem with simple thresholding is that the threshold value has to be specified manually, which is not suitable for automatic image processing. Otsu’s thresholding avoids this by determining the value automatically. If the images tend to have different lighting conditions in different areas, adaptive thresholding can help.

Canny. An edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images. Edge detection finds points where brightness changes sharply and links them into contours, which makes it a good way to keep only the outlines of letters.

Morphological transform (open, close, erode, dilate). Morphological transformations are normally performed on binary images. They take two inputs: the original image and a structuring element (kernel) that determines the nature of the operation. A powerful way to reduce noise in images.

FastNlMeansDenoising. Used for image denoising with the Non-local Means Denoising algorithm, with some computational optimizations; the noise is expected to be Gaussian white noise. A good way to reduce noise while keeping the quality of the text, but also time-consuming.
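For reference, here are the OpenCV calls behind the methods above; the kernel sizes and threshold values are illustrative and would need tuning for real documents:

```python
import cv2

img = cv2.imread("noisy_document.png", cv2.IMREAD_GRAYSCALE)

# Smooth mild noise before thresholding.
blurred = cv2.GaussianBlur(img, (5, 5), 0)

# Otsu picks the global threshold automatically; adaptive thresholding
# computes a local threshold per neighborhood for uneven lighting.
_, otsu = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
adaptive = cv2.adaptiveThreshold(
    blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 10)

# Multi-stage edge detection to keep only letter contours.
edges = cv2.Canny(blurred, 100, 200)

# Morphological opening on the binary image to remove small specks.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
opened = cv2.morphologyEx(otsu, cv2.MORPH_OPEN, kernel)

# Non-local Means: slower, but preserves text quality well.
denoised = cv2.fastNlMeansDenoising(img, None, 15)
```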

All of these methods can and should be combined and tested to find out what works best for a specific task. The output from Tesseract for each image processed by one of the functions above looks like this:

Figure 6. Example of Tesseract output on the preprocessed noisy image

Text post-processing

Having read the text from the image, we still have to process it. Each document has certain types of fields, such as dates, free text, TIN numbers, social security codes, etc. To “clean” the text of unnecessary characters and symbols, we need to know the type of the field, and then choose a processing algorithm depending on that type.

Each text field has characteristics of three kinds that should be considered in processing (a validation sketch follows the list):

  • Syntactic features: rules that define the structure of the text field. For example, the “date of issue” or “date of birth” field must follow a certain format, usually eight digits separated by dots.
  • Field semantics: rules based on the meaning of the field’s content. For example, in the “date of birth” field the first two digits should be the month number and the last four digits the year.
  • Relationship semantics: rules based on structural-semantic connections between a field and other text fields of the document. For example, the “date of issue” field cannot contain a timestamp preceding the one in the “date of birth” field.
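Here is the promised sketch of the three rule types applied to a date field; the MM.DD.YYYY format and the function names are assumptions that follow the examples above:

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"^\d{2}\.\d{2}\.\d{4}$")

def clean_date(raw):
    """Syntactic rule: strip OCR artifacts, keep only digits and dots."""
    cleaned = re.sub(r"[^\d.]", "", raw)
    return cleaned if DATE_RE.match(cleaned) else None

def parse_date(cleaned):
    """Field semantics: the digits must form a valid calendar date."""
    try:
        return datetime.strptime(cleaned, "%m.%d.%Y")
    except (TypeError, ValueError):
        return None

def issue_after_birth(date_of_birth, date_of_issue):
    """Relationship semantics: a document cannot be issued before birth."""
    return date_of_issue > date_of_birth
```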

Conclusion

Based on this information, it is fairly easy to develop an algorithm for automatically reading textual information from an image. A key advantage is that everything described here uses only open-source tools, which simplifies the process a lot.

The described algorithms can be widely used in the banking industry, government agencies, and elsewhere. Automatic reading of information from documents speeds up entering it into a database, significantly reduces the probability of erroneous data, and is much safer than manual entry. The solution also supports reading image data in multiple languages.

Possible key areas for improvement:

  • As Tesseract’s accuracy is not always sufficient, it can be improved by training Tesseract on custom data;
  • Having collected enough data, it is possible to train a graph convolutional network to improve the flexibility of the pipeline.

To get a detailed consultation on the topic of text extraction from images, contact us here.

Kostiantyn ISAIENKOV
