Glossary
OCR
Optical Character Recognition — converting an image of text (a scan, a photo) into machine-readable text.
OCR (Optical Character Recognition) is the process of converting an image of text — a scanned document, a photo of a printed page, a screenshot — into machine-readable text.
Without OCR, a scanned PDF is just a picture: the system cannot search it, extract data from it, or summarise it. With OCR, the text in the scan becomes searchable and indexable.
Papyrus runs Tesseract 5 (open-source, server-side) for English and Swahili OCR. The pipeline pre-processes each image (deskew, denoise, binarisation) and then assigns a quality score, so that low-quality scans are flagged for manual review.
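The quality-scoring step can be pictured with a minimal sketch. It assumes the score is the mean of Tesseract's per-word confidence values (0 to 100, with -1 marking rows that are not recognised words); the source does not say how Papyrus actually computes its score. The names `score_page`, `QualityReport` and `REVIEW_THRESHOLD`, and the cut-off of 80, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean

# Hypothetical cut-off for illustration; not Papyrus's actual value.
REVIEW_THRESHOLD = 80.0


@dataclass
class QualityReport:
    score: float        # mean word confidence on a 0-100 scale
    needs_review: bool  # True when the scan should go to manual review


def score_page(word_confidences: list[float],
               threshold: float = REVIEW_THRESHOLD) -> QualityReport:
    """Turn per-word OCR confidences into a page-level quality report."""
    # Tesseract reports -1 for rows that are not recognised words; drop them.
    valid = [c for c in word_confidences if c >= 0]
    # A page with no recognised words scores 0 and is therefore always flagged.
    score = float(mean(valid)) if valid else 0.0
    return QualityReport(score=score, needs_review=score < threshold)


clean_scan = score_page([96, 94, -1, 98])
smudged_scan = score_page([55, 60, 70])
print(clean_scan.needs_review, smudged_scan.needs_review)
```

A mean is the simplest possible aggregate. A real pipeline might also weight by word length or look at the share of very low-confidence words, but that is beyond what this entry describes.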
OCR accuracy depends heavily on input quality:
- High-quality digital scans: >99% character accuracy
- Phone photos of printed pages: 90-97% with our pre-processing
- Smudged or skewed scans: 70-90%
- Handwriting: handled by HTR (Handwritten Text Recognition), which is a separate pipeline
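To make the percentages above concrete, here is a sketch of one common way to compute character accuracy: one minus the character error rate, where errors are counted as the edit distance between a hand-checked reference text and the OCR output. This is an assumption about the metric; the source does not state how the figures were measured. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return 1 - (edit distance / reference length), clamped at 0."""
    if not reference:
        # With an empty reference, only an empty output counts as correct.
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# One substituted character ("o" read as "0") in a 7-character word.
print(f"{character_accuracy('invoice', 'inv0ice'):.1%}")
```

On this measure, 99% accuracy still means roughly one wrong character in every hundred, which matters for fields such as amounts or ID numbers.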