Open Access Dataset

Arabic Documents OCR Dataset

Free Arabic Documents OCR Dataset

Humans in the Loop is thrilled to release our latest Arabic document OCR dataset as open access. The dataset is meant to support the development of document recognition and processing models, as well as Arabic text detection and OCR. It contains 10K images split into 12 classes: Handwritten text, Invoices, Official documents, Newspaper, Books, Receipts, Label, Business cards, Comics, Administrative forms, Magazine and Map.

On each image, the document outline is marked with a polygon of the class “Page” and each line of text is marked with a bounding box of the class “Body text”. In addition, each title is marked with a bounding box and labeled with a full transcription in Arabic.
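The annotation scheme described above could be modeled roughly as follows. This is only an illustrative sketch: the export format of the dataset is not specified here, so the `Annotation` structure, its field names, the `"Title"` label and the sample coordinates are all assumptions rather than the actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Annotation:
    """One labeled region on an image (hypothetical structure, not the real schema)."""
    label: str                           # "Page" or "Body text"; "Title" is an assumed name
    points: List[Point]                  # polygon vertices, or two opposite corners of a box
    transcription: Optional[str] = None  # Arabic text, only expected on titles


def bounding_box(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing all given points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def count_labels(annotations: List[Annotation]) -> Dict[str, int]:
    """Count how many annotations carry each label."""
    counts: Dict[str, int] = {}
    for ann in annotations:
        counts[ann.label] = counts.get(ann.label, 0) + 1
    return counts


# Made-up sample for a single image: one page outline, two text lines, one title.
example = [
    Annotation("Page", [(12, 8), (610, 20), (600, 880), (5, 870)]),
    Annotation("Body text", [(40, 120), (560, 150)]),
    Annotation("Body text", [(40, 160), (560, 190)]),
    Annotation("Title", [(150, 40), (450, 90)], transcription="عنوان"),
]

page_box = bounding_box(example[0].points)
label_counts = count_labels(example)
```

Reducing the “Page” polygon to its enclosing box, as `bounding_box` does, is one simple way to crop a document before running text detection on it. Whether that is appropriate depends on how skewed the photographed page is.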

All of these images were kindly collected and annotated by the Techfugees team in Lebanon. They represent a diverse selection of document types, angles, cameras, lighting conditions and backgrounds.

This Arabic documents OCR dataset is dedicated to the public domain by Humans in the Loop under the CC0 1.0 license.

Image of Arabic books
Image of Arabic document after annotation

Dataset size

The dataset includes 10K images.

Classes

The images are divided into twelve classes:

  1. Handwritten text
  2. Invoices
  3. Official documents
  4. Newspaper
  5. Books
  6. Receipts
  7. Label
  8. Business cards
  9. Comics
  10. Administrative forms
  11. Magazine
  12. Map

Access the dataset by filling in the form below