We Indexed 1 Million Kenyan Documents — Here's What We Learned
After processing 1M+ documents from Kenyan enterprises, some surprising patterns emerged about document hygiene, language mixing, and what AI really needs to be good at.
Hitting one million documents in production was a milestone for us. It was also the moment we got to look at real data — not synthetic test sets, not benchmark corpora, but actual Kenyan enterprise documents — and update our priors.
Here's what we found.
1. Swahili is more present than expected
We initially designed for “English with occasional Swahili”. The data showed otherwise: about 18% of documents contain at least one Swahili phrase or paragraph. Among NGO and government tenants, that figure rises to 35%.
Implication: bilingual OCR and embedding became a priority, not a stretch goal.
2. The PDF is winning
We expected a mess of Word, Excel, scanned images, and PDFs. The reality:
- 71% PDF (mostly natively generated)
- 14% Image-based scans
- 9% Word (.docx)
- 4% Excel (.xlsx)
- 2% Everything else combined
The native-PDF dominance simplified our extraction pipeline more than we expected.
3. Date formats are gloriously inconsistent
A single document might contain:
- 12/04/2026 (DD/MM/YYYY)
- April 12, 2026 (US format)
- 2026-04-12 (ISO)
- 12th April, 2026 (Kenyan formal style)
Our extraction has to handle all four. We now do.
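As a rough illustration of handling the four styles listed above, here is a minimal sketch using only the Python standard library. It is not our production extractor. The function name `parse_date`, the order of the format list, and the decision to read slash dates day-first are assumptions made for this example, and month names are matched in English only.

```python
import re
from datetime import date, datetime
from typing import Optional

# Strip ordinal suffixes ("12th" -> "12") only when they follow a digit,
# so month names such as "August" are left alone.
_ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b", re.IGNORECASE)

# Formats are tried in order. Slash dates are read day-first (assumption).
_FORMATS = (
    "%Y-%m-%d",   # 2026-04-12
    "%d/%m/%Y",   # 12/04/2026
    "%B %d, %Y",  # April 12, 2026
    "%d %B, %Y",  # 12th April, 2026 (after the suffix is stripped)
    "%d %B %Y",   # 12 April 2026
)


def parse_date(text: str) -> Optional[date]:
    """Return a date for any supported style, or None if nothing matches."""
    cleaned = _ORDINAL.sub("", text.strip())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None
```

Normalising every match to a single `date` value means downstream fields can be compared and sorted regardless of how the author wrote them. A slash date such as 04/12/2026 remains ambiguous in isolation; this sketch always treats the first number as the day.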
4. KRA PINs are the most-extracted field
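By a wide margin. Invoices, contracts, employee records, supplier files: Kenya's tax PIN appears everywhere and is the single most valuable extraction target. Pattern: [AP]\d{9}[A-Z], that is, a leading A for individuals or P for companies and other non-individual taxpayers, then nine digits and a trailing letter.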
We optimised our extraction stack accordingly.
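A minimal sketch of what PIN extraction can look like with a plain regular expression. The helper name `extract_kra_pins` is hypothetical. The pattern accepts a leading A or P, where P is the prefix used for non-individual taxpayers such as companies; treat that as an assumption to verify against your own data. Word boundaries keep the pattern from matching inside longer reference numbers.

```python
import re
from typing import List

# Assumed shape: one letter (A or P), nine digits, one trailing letter.
KRA_PIN = re.compile(r"\b[AP]\d{9}[A-Z]\b")


def extract_kra_pins(text: str) -> List[str]:
    """Return unique PINs in order of first appearance, upper-cased."""
    found = (m.group(0) for m in KRA_PIN.finditer(text.upper()))
    # dict.fromkeys de-duplicates while preserving insertion order.
    return list(dict.fromkeys(found))
```

For example, `extract_kra_pins("Supplier PIN: P051234567X, director A001234567Z")` returns `["P051234567X", "A001234567Z"]`. A regex only checks the shape of the string; it does not confirm that the PIN is registered.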
5. Handwritten amendments are more common than we expected
We expected typed or printed documents. Roughly 12% of contracts carry handwritten margin amendments: initials, deleted clauses, scribbled new dates. Our HTR (Handwritten Text Recognition) layer became essential, not optional.
6. The 80/20 of complaints is OCR quality on bad scans
Our top customer-reported issue isn't classification accuracy or workflow design; it's OCR on poorly scanned documents. Cheap mobile scanners + low light + skewed pages = unreadable. We invested heavily in pre-processing (deskew, brightness normalisation, multi-pass OCR) and the complaint rate dropped by 70%.
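One plausible reading of "multi-pass OCR" is sketched below: run the OCR engine over progressively more aggressive pre-processing variants and keep the highest-confidence result. Everything here is illustrative. `OcrResult`, `multi_pass_ocr`, the `good_enough` threshold, and the idea that the engine reports a mean confidence are assumptions for this example, and the pre-processing steps and OCR engine are passed in as plain callables rather than tied to any real library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional, Tuple


@dataclass
class OcrResult:
    text: str
    confidence: float  # mean confidence in [0, 1], as reported by the engine


def multi_pass_ocr(
    image: Any,
    passes: Iterable[Tuple[str, Callable[[Any], Any]]],
    run_ocr: Callable[[Any], OcrResult],
    good_enough: float = 0.9,
) -> Optional[Tuple[str, OcrResult]]:
    """Try each (name, preprocess) pass in order and return the best one.

    Stops early once a pass reaches the `good_enough` confidence, so clean
    scans do not pay for the expensive passes. Returns None if no passes
    were supplied.
    """
    best: Optional[Tuple[str, OcrResult]] = None
    for name, preprocess in passes:
        result = run_ocr(preprocess(image))
        if best is None or result.confidence > best[1].confidence:
            best = (name, result)
        if result.confidence >= good_enough:
            break
    return best
```

Ordering the passes from cheapest to most expensive (for instance raw, then deskewed, then deskewed with brightness normalisation) keeps the average cost low while still rescuing the worst scans.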
7. The classification taxonomy needs to be tenant-flexible
We launched with 25 universal document categories. Within 6 months, customers were defining 40-80 additional tenant-specific categories (“Loan Restructuring Memo”, “Donor Disbursement Report”, “Tea Plucker Daily Log”). Our retrain pipeline now supports per-tenant fine-tuning.
8. Search queries are getting more conversational
In our first 6 months, the median search was 1.4 words. Today it's 4.2 words. Users are increasingly typing questions, not keywords. We doubled down on RAG and shipped the Copilot.
9. Most “lost” documents are permission issues
When users say “I can't find this”, 60% of the time the document exists but they don't have permission to see it. We built a better “ask for access” workflow: when a search returns no visible matches, we show “X documents matched but you don't have access. Request from [owner]”.
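The access hint can be sketched as a search that partitions matches into visible and hidden, and only surfaces the hint when nothing visible remains. This is a simplified illustration, not our search backend. The `Document` shape, the `search_with_access_hint` name, and the naive title match are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Document:
    title: str
    owner: str
    allowed_users: Set[str] = field(default_factory=set)


def search_with_access_hint(
    query: str, user: str, documents: List[Document]
) -> Tuple[List[Document], Optional[str]]:
    """Return the documents the user may see, plus an optional access hint."""
    visible: List[Document] = []
    hidden: List[Document] = []
    for doc in documents:
        # A naive substring match on the title stands in for the real search.
        if query.lower() not in doc.title.lower():
            continue
        if user == doc.owner or user in doc.allowed_users:
            visible.append(doc)
        else:
            hidden.append(doc)

    hint = None
    if not visible and hidden:
        owners = ", ".join(sorted({doc.owner for doc in hidden}))
        noun = "document" if len(hidden) == 1 else "documents"
        hint = (
            f"{len(hidden)} {noun} matched but you don't have access. "
            f"Request from {owners}."
        )
    return visible, hint
```

Note that the hint exposes only a count and the owners' names, never the titles or contents of the hidden documents.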
10. Mobile is closer to web than we expected
We assumed mobile users would be a small minority. They're 41% of monthly active users. The mobile app is no longer a nice-to-have; it's nearly half the product.
What's next
These findings are informing the 2026 roadmap. Watch the Release Notes for the specific shipping decisions they prompted.