OCR

OCR (Optical Character Recognition) is the process of converting an image of text — a scanned document, a photo of a printed page, a screenshot — into machine-readable text.

Without OCR, a scanned PDF is just a picture; the system cannot search it, extract from it, or summarise it. With OCR, the text in the scanned PDF becomes searchable and indexable, to the extent that the recognition is accurate.

Papyrus runs Tesseract 5 (open-source, server-side) for English and Swahili OCR. The pipeline includes pre-processing (deskew, denoise, binarisation) and quality scoring (so low-quality scans are flagged for manual review).
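The quality-scoring step can be illustrated with a minimal sketch. It is not Papyrus's actual scoring logic: the names `PageQuality` and `score_page` and the review threshold of 80 are hypothetical, chosen only for illustration. It assumes the score is derived from Tesseract's per-word confidence values, which range from 0 to 100, with -1 reported for layout rows that are not words.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PageQuality:
    mean_confidence: float  # average word confidence, 0-100
    word_count: int         # number of recognised words that were scored
    needs_review: bool      # True if the page should go to manual review


def score_page(word_confidences, review_threshold=80.0):
    """Score one OCR'd page from per-word confidence values.

    review_threshold is a hypothetical cut-off, not a Papyrus setting.
    Entries of -1 (non-word layout rows) are ignored.
    """
    words = [c for c in word_confidences if c >= 0]
    if not words:
        # Nothing was recognised at all: treat the page as lowest quality.
        return PageQuality(0.0, 0, True)
    avg = mean(words)
    return PageQuality(avg, len(words), avg < review_threshold)


clean = score_page([96, 94, -1, 91, 97])
smudged = score_page([62, 48, -1, 71, 55])
print(clean.needs_review, smudged.needs_review)
```

A mean over word confidences is the simplest possible score. A real pipeline might also weigh the share of very low-confidence words or the amount of text found on the page.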

OCR accuracy depends heavily on input quality:

High-quality digital scans: >99% character accuracy
Phone photos of printed pages: 90–97% with our pre-processing
Smudged or skewed scans: 70-90%
Handwriting: not handled by this OCR pipeline; HTR (Handwritten Text Recognition) is a separate pipeline
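For reference, the sketch below shows one common way to compute character accuracy: one minus the edit distance between a hand-checked reference transcription and the OCR output, divided by the reference length. The text above does not state how its figures were measured, so this definition is an assumption, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Return accuracy in [0, 1] relative to the reference transcription."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    # Clamp at 0 because heavy insertion noise can exceed the reference length.
    return max(0.0, 1.0 - errors / len(reference))


# One substituted character ("u" read as "n") in a 17-character reference.
print(character_accuracy("Habari ya asubuhi", "Habari ya asubnhi"))
```

Under this definition, 99% accuracy still means roughly one wrong character in every hundred, which on a dense page amounts to several errors.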