OCR

Optical Character Recognition — converting an image of text (a scan, a photo) into machine-readable text.

OCR (Optical Character Recognition) is the process of converting an image of text — a scanned document, a photo of a printed page, a screenshot — into machine-readable text.

Without OCR, a scanned PDF is just a picture: the system cannot search it, extract data from it, or summarise it. With OCR, the text in the scan becomes searchable and indexable.

Papyrus runs Tesseract 5 (open-source, server-side) for English and Swahili OCR. The pipeline includes pre-processing (deskew, denoise, binarisation) and quality scoring, which flags low-quality scans for manual review.
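
As an illustration of the quality-scoring step, here is a minimal sketch in Python. It assumes the score is the mean of Tesseract's per-word confidence values (0 to 100, with -1 on rows that are not words); the source does not say how Papyrus actually computes its score. The function names and the threshold of 80 are hypothetical.

```python
from statistics import mean

# Hypothetical cut-off: the real threshold used by Papyrus is not documented here.
REVIEW_THRESHOLD = 80.0


def quality_score(word_confidences):
    """Return the mean confidence (0-100) over recognised words, or 0.0 if there are none."""
    # Tesseract's TSV output reports conf = -1 for rows that are not words
    # (pages, blocks, lines), so those rows are skipped.
    words = [c for c in word_confidences if c >= 0]
    return mean(words) if words else 0.0


def needs_manual_review(word_confidences, threshold=REVIEW_THRESHOLD):
    """Flag a scan whose mean word confidence falls below the threshold."""
    return quality_score(word_confidences) < threshold
```

For example, `needs_manual_review([-1, 40, 60])` is true under this sketch, because the mean word confidence of 50 is below the assumed threshold.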

OCR accuracy depends heavily on input quality:

  • High-quality digital scans: >99% character accuracy
  • Phone photos of printed pages: 90-97% with our pre-processing
  • Smudged or skewed scans: 70-90%
  • Handwriting: not handled by OCR; HTR (Handwritten Text Recognition) is a separate pipeline
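
The figures above are character accuracies. One common way to measure character accuracy, sketched below in Python, is one minus the character error rate: the edit distance between the OCR output and a hand-checked reference transcript, divided by the reference length. This is an assumed definition for illustration; the source does not state how the percentages were measured, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Return 1 - (edit distance / reference length), clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

With this definition, a single misread character in an 11-character line (`"hello world"` read as `"hel1o world"`) gives an accuracy of 10/11, or about 91%.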
