Introduction to OCR AI

Automated document processing is essential for modernizing enterprise workflows, and it is relevant to a wide variety of processes: expense management, accounts payable automation, procurement, accounting, insurance, user and employee onboarding, loan applications, underwriting, and more.

However, processing unstructured data such as PDFs or scanned documents with AI is not a straightforward task. High-quality data annotation is essential for training and maintaining accurate document processing and parsing models.

Open Access Dataset

Arabic Documents OCR Dataset

Looking for a dataset to test your Arabic document models on? Why not try our new Arabic Documents OCR Dataset – it’s as easy as 1, 2, 3!

Image of an Arabic document after annotation

Challenges and best practices

Based on our extensive experience annotating data for document processing and OCR, we are sharing below some of our best practices and tips on how to ensure your AI project will be a success:

Challenge: Document processing AI is difficult to scale to different markets, because models built for Latin scripts and languages such as English are much more advanced than models for other alphabets.

Best practice: With our multinational team spread across different geographies, we can collect custom training datasets with images of documents from around the world, in many world languages. This helps you scale your solution to different markets with the confidence that it will work well in every location.

Challenge: During deployment, AI systems encounter a variety of sizes, typographies, designs, and formats for each document type (e.g. invoices) and have to cope with every one of them.

Best practice: Using our adversarial example collection services, you can pinpoint the specific failure modes of your model that require additional data collection and enrich your dataset with challenging, hard-to-find examples.

Challenge: Documents contain many types of content: handwritten text may appear on top of typed text, alongside images and charts. Image quality and lighting also vary, and there may be blurry or badly scanned pages, folds and crumples, shadows, or other irregularities. Finally, text can be tilted, either because of the formatting or simply because the entire scanned page is not straight.

Best practice: These challenges can be addressed by defining clear annotation instructions and using a dedicated team of expert labelers who can learn the subtleties of your annotation process: do they annotate handwritten text and signatures, do they label the text on charts, and so on. In addition, we can offer specific insights, such as using 4-point polygons or rotated bounding boxes instead of axis-aligned bounding boxes, which solves the problem of tilted text detection (see the sketch below).
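
To make the tilted-text point concrete, here is a minimal sketch, assuming OpenCV and a synthetic text mask as placeholder input, of how a rotated bounding box can be computed and expressed as a 4-point polygon, compared with a loose axis-aligned box:

```python
import cv2
import numpy as np

# Placeholder input: a binary mask containing a tilted block of text pixels.
mask = np.zeros((400, 600), dtype=np.uint8)
cv2.putText(mask, "TILTED TEXT", (80, 250), cv2.FONT_HERSHEY_SIMPLEX, 2, 255, 8)
rotation = cv2.getRotationMatrix2D((300, 200), 15, 1.0)
mask = cv2.warpAffine(mask, rotation, (600, 400))

points = cv2.findNonZero(mask)

# Axis-aligned box: loose around tilted text
x, y, w, h = cv2.boundingRect(points)

# Rotated box: tight around the text, expressed as a 4-point polygon
rect = cv2.minAreaRect(points)             # (center, (width, height), angle)
polygon = cv2.boxPoints(rect).astype(int)  # four (x, y) corner points

print("axis-aligned box:", (x, y, w, h))
print("rotated 4-point polygon:", polygon.tolist())
```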

This is why it’s really important to start with the right training data and to adopt a human-in-the-loop approach in order to continuously improve the document classification and data extraction AI models that you are using!

Types of annotations for OCR AI

Below we feature several different use cases of data annotation for document processing, along with some best practices for each.

Document recognition

Many document scanning or processing apps, especially mobile applications, start with the simple task of recognizing that there is a document in the image and detecting its position. In order to train such a detector, you need a dataset where the documents are precisely annotated with polygonal or semantic segmentation masks that capture the page outline.

Image of book pages before annotation and after segmentation
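
As an illustration of what a page-outline detector ultimately has to produce, here is a minimal sketch of a classical, non-learned approach using OpenCV contour detection; the file name and threshold values are placeholder assumptions.

```python
import cv2

# Minimal sketch: find the largest 4-point contour and treat it as the page outline.
image = cv2.imread("scan.jpg")  # placeholder file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, 75, 200)

contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = sorted(contours, key=cv2.contourArea, reverse=True)

page_outline = None
for contour in contours:
    perimeter = cv2.arcLength(contour, True)
    approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True)
    if len(approx) == 4:  # a quadrilateral is a good candidate for the page
        page_outline = approx.reshape(4, 2)
        break

print(page_outline)  # four (x, y) corner points, or None if no page was found
```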

Document classification

In order to correctly process documents, it may be necessary to classify them first by type: e.g. invoice, receipt, contract, book, magazine, etc. The required annotation here takes the form of tags, or of splitting the dataset into different folders, one per class. In addition, documents may be classified according to the language they are in, but bear in mind that a single document may contain multiple languages.

Image of an invoice before and after annotation
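
To show what the folder-per-class convention mentioned above looks like in practice, here is a minimal sketch using torchvision's ImageFolder; the directory layout and transform are illustrative assumptions.

```python
from torchvision import datasets, transforms

# Assumed layout (one folder per document class):
# documents/
#   contract/   img001.jpg, img002.jpg, ...
#   invoice/    ...
#   receipt/    ...

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder derives the class labels directly from the folder names
dataset = datasets.ImageFolder("documents/", transform=transform)
print(dataset.classes)                      # e.g. ['contract', 'invoice', 'receipt']
print(dataset[0][0].shape, dataset[0][1])   # image tensor and its class index
```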

Text detection

In order to understand the text on a document, you first need to run a text detector to identify where text is present. The necessary annotation for this application is a bounding box or a polygon, which can be drawn at the block, paragraph, or line level.

Image of an Arabic document before and after annotation
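
For a quick sense of what word-level detection output looks like, here is a minimal sketch using the open-source Tesseract engine via pytesseract (not a trained custom detector); it assumes the tesseract binary is installed, and the file name and confidence threshold are placeholders.

```python
import pytesseract
from PIL import Image
from pytesseract import Output

# Run Tesseract and get word-level boxes with confidences
data = pytesseract.image_to_data(Image.open("page.jpg"), output_type=Output.DICT)

# Keep only reasonably confident, non-empty detections
for i, text in enumerate(data["text"]):
    if text.strip() and float(data["conf"][i]) > 60:
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        print(f"({x}, {y}, {w}, {h}) -> {text}")
```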

Text transcription

Once the text is detected, you need to recognize the actual words and turn them into machine-readable text by applying an OCR model. Most OCR systems now work at the word or even line level rather than the character level, so the abbreviation is becoming something of a misnomer, since it stands for Optical Character Recognition. The dataset requirement here is a bounding box together with a transcription of the value in each box.

Image of a filled-out document before and after annotation
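
The exact schema varies from tool to tool, but a minimal sketch of the box-plus-transcription annotation described above might look like this; the field names and values are illustrative assumptions, not a standard format.

```python
import json

# One annotated page: each region pairs a bounding box with its transcription.
# Box format assumed here: [x_min, y_min, width, height] in pixels.
annotation = {
    "image": "invoice_0001.jpg",
    "regions": [
        {"bbox": [120, 80, 340, 32],  "transcription": "ACME Corporation Ltd."},
        {"bbox": [120, 130, 180, 28], "transcription": "Invoice #2043"},
        {"bbox": [412, 560, 96, 28],  "transcription": "1,250.00 EUR"},
    ],
}

print(json.dumps(annotation, indent=2, ensure_ascii=False))
```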

Document parsing

Once the text on the document has been converted into plain text, it’s important to recover the document’s structure as well, based on the hierarchy of headings and subheadings and the presence of tables, charts, captions, and other text sections. This requires a dataset where the bounding boxes are also categorized by the type of text they contain (e.g. H1, H2, H3, etc.).

Image of a newspaper before and after annotation
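
To illustrate why those category labels matter downstream, here is a minimal sketch that turns labeled text boxes into a simple structured outline; the box list, labels, and top-to-bottom reading order are all illustrative assumptions.

```python
# Minimal sketch: reconstruct a document outline from labeled text boxes.
# Each box: (y_position, category, text) -- placeholder values.
boxes = [
    (40,  "H1", "Quarterly Report"),
    (120, "H2", "Revenue"),
    (160, "paragraph", "Revenue grew in all regions..."),
    (320, "H2", "Outlook"),
    (360, "paragraph", "We expect continued growth..."),
]

indent = {"H1": 0, "H2": 1, "H3": 2, "paragraph": 3}

# Sort by vertical position to approximate reading order, then indent by category
for _, category, text in sorted(boxes, key=lambda box: box[0]):
    print("  " * indent[category] + f"[{category}] {text}")
```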

Entity extraction

In some cases, such as invoice processing, you only need to detect and extract the values of particular entities of interest. For example, these can include the total price of the invoice, the unit price, the number of units, the currency, the date of issue, the due date, and so on. This can be a tricky task because the positioning and formatting vary between invoices, some invoices may be missing certain items, and some items may be spread across several pages.

Image of an invoice before and after annotation
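
As a very rough illustration of this final step, here is a minimal sketch that pulls a few entities out of already-OCR'd invoice text with regular expressions; the sample text, patterns, and field names are simplified assumptions, and a production system would rely on a trained extraction model rather than regex.

```python
import re

# Toy OCR output for a single invoice (illustrative only)
ocr_text = """
ACME Corporation Ltd.
Invoice #2043    Date of issue: 2021-03-15
Total due: 1,250.00 EUR    Due date: 2021-04-14
"""

# Very simplified patterns; real invoices vary far too much for plain regex
entities = {
    "issue_date": re.search(r"Date of issue:\s*(\d{4}-\d{2}-\d{2})", ocr_text),
    "due_date":   re.search(r"Due date:\s*(\d{4}-\d{2}-\d{2})", ocr_text),
    "total":      re.search(r"Total due:\s*([\d,]+\.\d{2})\s*([A-Z]{3})", ocr_text),
}

for name, match in entities.items():
    print(name, "->", match.groups() if match else None)
```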

Tools we love

Here are some of our tips and recommendations on the best tools we’ve used for this type of annotation, which will hopefully be useful to anyone working on document parsing or processing models.

As an open source tool, LabelStudio can be great for getting started with document annotation, given its customizable UI and its ability to support different types of annotation. 

V7 is a more advanced tool which offers great auto-labeling features for bounding boxes and polygons and will speed up your annotation process significantly. 

Blogs

Best Annotation Tools for 2021

Found your humans – now all you need is the right tool for the job? Here’s our review of the most popular tools for 2021!

Darwin V7 Visual Object Tracking Gif Cars on Road

How to use a human in the loop for document annotation for OCR AI

Document processing is a continuous operation within enterprises, so in order to keep models up to date and handle the inevitable data drift, it’s important to use human input on an ongoing basis, not just for the initial training of your models. Here are some of the ways in which humans can be plugged into the MLOps cycle:

1. Document collection: our humans in the loop can collect multilingual datasets of different types of documents from a variety of geographical locations

2. Ground truth annotation: in order to train your initial models, we offer full dataset annotation from scratch in batches: anything from classification to bounding box annotation, transcription, or page segmentation

3. Output validation with active learning: once you’ve trained an initial model, we can use it to pre-annotate a large part of the dataset, increasing both the speed of the annotators and the impact of their work, by setting up an active learning workflow that prioritizes instances where your model is least certain (see the sketch after this list)

4. Adversarial example collection: once you’ve trained an initial model, we can expand your core dataset with additional difficult and challenging edge cases, such as blurry, dark, or tilted images, or specific classes that are underrepresented, depending on the failure modes of your model during testing

5. Real-time edge case handling: once you have a model in deployment, our humans-in-the-loop are available 24/7 to handle potential edge cases that appear in real time or close to real time, using a simple API request and sending the correct response in seconds in order to ensure a second layer of verification for your model’s most critical responses
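
To make the active-learning idea in step 3 concrete, here is a minimal sketch of uncertainty-based prioritization: documents whose model predictions have the lowest confidence are sent to annotators first. The document IDs, confidence scores, and budget are made-up placeholders.

```python
# Minimal sketch of uncertainty-based prioritization for active learning.
# Each item: (document_id, model_confidence) -- placeholder values.
predictions = [
    ("doc_001", 0.98),
    ("doc_002", 0.41),
    ("doc_003", 0.87),
    ("doc_004", 0.52),
]

# Least-confident predictions go to human annotators first
review_queue = sorted(predictions, key=lambda item: item[1])

annotation_budget = 2  # how many documents the team can label in this batch
for doc_id, confidence in review_queue[:annotation_budget]:
    print(f"send {doc_id} for human review (model confidence {confidence:.2f})")
```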

 

Wondering who is annotating your data?

When you hire a company to help with your annotation needs, you rarely ever meet the workers who are labeling your data. We want to change this and present to you the inspiring stories of our annotators!

Image of team member Yalda

Does this sound like something you’d like to try out? Get in touch with us and we’d be happy to schedule a free trial to explore how we can best help you with your document annotation needs!

Get In Touch

We’re an award-winning social enterprise powering the AI solutions of the future.