Humans in the Loop is thrilled to publish our latest Arabic document OCR dataset as open access. The dataset is meant to support the development of document recognition and processing models, as well as Arabic text detection and OCR. It contains 10K images split into 12 classes: Handwritten text, Invoices, Official documents, Newspaper, Book, Receipts, Label, Business cards, Comics, Administrative forms, Magazine and Map.
On each image, the document outline is marked with a polygon of the class “Page”, and each line of text is marked with a bounding box of the class “Body text”. In addition, each title is marked with a bounding box and labeled with its full transcription in Arabic.
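To make the annotation scheme above concrete, here is a minimal Python sketch of how one annotated image could be represented in memory. It is an illustration only, not the dataset's actual export format: the type names (`Polygon`, `BoundingBox`, `DocumentAnnotation`), the field names, the "Title" label, and all coordinate values are assumptions made for this example. Only the "Page" and "Body text" class names and the 12 document classes come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in pixels; the coordinate convention is assumed


@dataclass
class Polygon:
    """Outline of the document, e.g. the "Page" class."""
    label: str
    points: List[Point]


@dataclass
class BoundingBox:
    """Axis-aligned box around a line of text or a title."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    # Only titles carry a transcription in this sketch.
    transcription: Optional[str] = None

    def area(self) -> float:
        """Box area in square pixels (0 for degenerate boxes)."""
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


@dataclass
class DocumentAnnotation:
    """All annotations for one image (hypothetical structure)."""
    document_class: str  # one of the 12 classes, e.g. "Invoices"
    page: Polygon
    boxes: List[BoundingBox] = field(default_factory=list)

    def titles(self) -> List[BoundingBox]:
        """Return the boxes that have an Arabic transcription."""
        return [b for b in self.boxes if b.transcription is not None]


# Made-up example values for a single invoice image.
example = DocumentAnnotation(
    document_class="Invoices",
    page=Polygon(
        label="Page",
        points=[(10.0, 12.0), (590.0, 8.0), (600.0, 790.0), (5.0, 800.0)],
    ),
    boxes=[
        BoundingBox("Body text", 50.0, 200.0, 350.0, 240.0),
        # "Title" is an assumed label name; the transcription means "invoice".
        BoundingBox("Title", 150.0, 40.0, 450.0, 100.0, transcription="فاتورة"),
    ],
)

for title in example.titles():
    print(title.label, title.transcription)
```

Keeping the page outline as a polygon rather than a box reflects that photographed documents are rarely axis-aligned, while text lines and titles are described above as bounding boxes.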
All of these images were kindly collected and annotated by the Techfugees team from Lebanon, and they represent a diverse selection of document types, angles, cameras, lighting conditions and backgrounds.