How Papyrus Achieved 94% Classification Accuracy on Swahili Documents
Off-the-shelf models speak Swahili poorly. Here's the work that took us from 68% to 94% on Swahili-content documents.
The honest story.
Where we started
Our initial classification model was a multilingual transformer fine-tuned on document type labels. Performance:
- English documents: 91% accuracy
- Swahili documents: 68% accuracy
The drop on Swahili was especially painful for our NGO and county government tenants, who produce a significant volume of Swahili content.
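A per-language breakdown like the one above is straightforward to compute once evaluation records carry a language tag. The following is a minimal sketch, assuming records are available as (language, true label, predicted label) tuples; the function name and the toy records are illustrative and not taken from the Papyrus codebase.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for (language, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, true_label, predicted_label in records:
        total[language] += 1
        if true_label == predicted_label:
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


# Toy records, made up for illustration only.
records = [
    ("en", "minutes", "minutes"),
    ("en", "report", "report"),
    ("sw", "minutes", "memorandum"),
    ("sw", "report", "report"),
]
per_language = accuracy_by_language(records)
```

Reporting the split per language, rather than one aggregate number, is what makes a gap like 91% versus 68% visible in the first place.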
The diagnosis
We sampled 500 misclassified Swahili documents and found three recurring patterns:
- Code-switching confused the model: paragraphs mixing English and Swahili were classified on the English portions alone, ignoring the Swahili context
- Document-type vocabulary was English-biased: words like “minutes”, “memorandum”, “report” anchored classifications, while Swahili equivalents (“muhtasari”, “kumbukumbu”, “ripoti”) were underweighted
- Domain-specific Swahili was missing: county-government Swahili terminology was largely absent from the pretraining corpus
What we tried
Iteration 1: Add more Swahili to training data. We sourced an additional 12,000 Swahili-content documents from willing tenants (with consent), labelled them, and retrained.
Result: 68% → 79%. Improvement, but not enough.
Iteration 2: Custom tokenizer for Swahili. Default subword tokenizers were over-fragmenting Swahili words. We trained a SentencePiece tokenizer with a Swahili-aware vocabulary.
Result: 79% → 85%.
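One way to quantify over-fragmentation is "fertility": the average number of subword pieces per word. The sketch below uses a toy greedy longest-match tokenizer as a stand-in; it is not SentencePiece, and both vocabularies are invented for illustration. It only shows how adding whole Swahili words or stems to the vocabulary lowers the piece count.

```python
def greedy_subword_tokenize(word, vocab):
    """Segment a word by longest match first; unmatched characters become single pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or is one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def fertility(words, vocab):
    """Average number of subword pieces per word (lower means less fragmentation)."""
    total_pieces = sum(len(greedy_subword_tokenize(word, vocab)) for word in words)
    return total_pieces / len(words)


words = ["muhtasari", "kumbukumbu", "ripoti"]

# Hypothetical vocabularies: a generic one with only short pieces, and a
# Swahili-aware one that also contains whole words and stems.
generic_vocab = {"mu", "ku", "ri", "ta", "sa", "po", "ti", "bu"}
swahili_vocab = generic_vocab | {"muhtasari", "kumbu", "ripoti"}

generic_fertility = fertility(words, generic_vocab)
swahili_fertility = fertility(words, swahili_vocab)
```

With these toy vocabularies, "kumbukumbu" falls from six pieces to two. On a real corpus the same comparison can be run with the actual tokenizers in place of the toy function.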
Iteration 3: Language-aware preprocessing. When we detect that a document is predominantly Swahili, we route it to a Swahili-specialised classification head rather than the multilingual generalist.
Result: 85% → 91%.
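The routing step can be sketched as follows. This is a deliberately crude illustration: the marker-word list, the 0.15 threshold, and the head names are all assumptions made for the example, and a real system would more likely use a trained language-identification model than a function-word count.

```python
import re

# Small, illustrative list of common Swahili function words (an assumption for
# this sketch, not a production language detector).
SWAHILI_MARKERS = {"na", "ya", "wa", "kwa", "za", "katika", "ni", "la", "hii", "kuwa"}


def swahili_share(text):
    """Fraction of tokens that are Swahili marker words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in SWAHILI_MARKERS)
    return hits / len(tokens)


def choose_head(text, threshold=0.15):
    """Pick a (hypothetical) classification head name based on the marker share."""
    if swahili_share(text) >= threshold:
        return "swahili_head"
    return "multilingual_head"


swahili_doc = "Muhtasari wa mkutano wa kamati ya afya katika kaunti"
english_doc = "Minutes of the county health committee meeting"
```

Here `choose_head(swahili_doc)` selects the Swahili head and `choose_head(english_doc)` falls back to the generalist. Where the threshold sits matters most for mixed-language documents, which is also where accuracy remains weakest.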
Iteration 4: Hard negatives. We mined cases where the model confidently misclassified Swahili documents — particularly near-class neighbours (e.g., “muhtasari wa mkutano” classified as “memorandum” rather than “minutes”). We added these as hard negatives.
Result: 91% → 94%.
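Mining "confidently wrong" predictions amounts to a filter and a sort over model outputs. The sketch below assumes each prediction is a dict with `text`, `true_label`, `predicted_label`, and `confidence` keys; the field names, the 0.8 cutoff, and the toy predictions are illustrative assumptions rather than the values we used.

```python
def mine_hard_negatives(predictions, min_confidence=0.8):
    """Return misclassified examples the model was confident about, most confident first."""
    hard = [
        p for p in predictions
        if p["predicted_label"] != p["true_label"] and p["confidence"] >= min_confidence
    ]
    return sorted(hard, key=lambda p: p["confidence"], reverse=True)


# Toy predictions with made-up confidence scores.
predictions = [
    {"text": "muhtasari wa mkutano", "true_label": "minutes",
     "predicted_label": "memorandum", "confidence": 0.93},
    {"text": "ripoti ya mwaka", "true_label": "report",
     "predicted_label": "report", "confidence": 0.97},
    {"text": "kumbukumbu za kikao", "true_label": "minutes",
     "predicted_label": "report", "confidence": 0.55},
]
hard_negatives = mine_hard_negatives(predictions)
```

Only the first toy example survives the filter: it is wrong and confident. The low-confidence error is left out because it says less about where the decision boundary is misplaced.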
What stayed hard
- Mixed-language documents with English headers and Swahili body still hover around 89%
- Handwritten Swahili (county field officer reports) is harder still, because OCR errors compound the classification errors
- Domain-specific Swahili from NGO-speak (“walengwa”, “huduma za kijamii”) needs more training data
What this implies
If you're building NLP infrastructure for African markets:
- Don't trust multilingual benchmarks. A model that's “good at Swahili” by Western benchmarks may still trail English by more than 20 points on real-world African documents; ours started 23 points behind (68% versus 91%).
- Tokenizers matter more than you think. Don't accept a generic subword tokenizer if your language has rich morphology.
- Domain Swahili ≠ benchmark Swahili. Newspaper Swahili and county-government Swahili differ enough in vocabulary and register that a model trained on one can stumble on the other.
- Hard negatives beat more data. The marginal correctly-labelled document is less useful than the marginal hard-negative pair.
What's next
We're applying the same playbook to French (for our Francophone West Africa rollout) and exploring Sheng (urban code-mixed Kenyan slang) for customer service ticket classification.
The methodology generalises. The patience required does not.