How Papyrus Achieved 94% Classification Accuracy on Swahili Documents

The honest story.

Where we started

Our initial classification model was a multilingual transformer fine-tuned on document type labels. Performance:

English documents: 91% accuracy
Swahili documents: 68% accuracy

The drop on Swahili was especially painful for our NGO and county government tenants, who produce significant Swahili content.
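The per-language figures above only show up if evaluation is sliced by language rather than reported as a single aggregate. A minimal sketch of that slicing, assuming each evaluation record carries a language tag (the record layout, the function name, and the toy data are illustrative, not our production evaluation code):

```python
from collections import defaultdict

def accuracy_by_language(records):
    """Compute accuracy per language.

    records: iterable of (language, true_label, predicted_label) tuples.
    This tuple layout is an assumption made for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, true_label, predicted_label in records:
        total[language] += 1
        if true_label == predicted_label:
            correct[language] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy, made-up records purely to show the shape of the input.
toy_records = [
    ("en", "minutes", "minutes"),
    ("en", "report", "report"),
    ("sw", "minutes", "memorandum"),
    ("sw", "report", "report"),
]
per_language = accuracy_by_language(toy_records)
```

An aggregate score over these four toy records would hide the fact that every error falls on the Swahili slice, which is the same effect we saw at scale.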

The diagnosis

We sampled 500 misclassified Swahili documents. Three patterns:

Code-switching confused the model — paragraphs mixing English and Swahili were classified based on the English bits, ignoring Swahili context
Document-type vocabulary was English-biased — words like “minutes”, “memorandum”, “report” anchored classifications; Swahili equivalents (“muhtasari”, “kumbukumbu”, “ripoti”) were underweighted
Domain-specific Swahili — county-government Swahili terminology was largely absent from the pretraining corpus

What we tried

Iteration 1: Add more Swahili to training data. We sourced an additional 12,000 Swahili documents from willing tenants (with consent), labelled them, and retrained.

Result: 68% → 79%. Improvement, but not enough.

Iteration 2: Custom tokenizer for Swahili. Default subword tokenizers were over-fragmenting Swahili words. We trained a SentencePiece tokenizer with a Swahili-aware vocabulary.

Result: 79% → 85%.
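One way to make "over-fragmenting" measurable is fertility: the average number of subword pieces per word. The sketch below is a toy illustration, not SentencePiece itself. It uses a simple greedy longest-match splitter and two tiny hand-made vocabularies (both hypothetical) to show how a vocabulary that lacks Swahili pieces shreds Swahili words.

```python
def greedy_subword_split(word, vocab):
    """Greedy longest-match segmentation; unknown spans fall back to single characters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def fertility(words, vocab):
    """Average number of subword pieces per word (lower means less fragmentation)."""
    return sum(len(greedy_subword_split(w, vocab)) for w in words) / len(words)

# Hypothetical vocabularies: one with only short generic pieces,
# one that contains whole Swahili document-type words.
generic_vocab = {"ku", "mb", "ri", "po"}
swahili_vocab = {"kumbukumbu", "muhtasari", "ripoti"}

words = ["muhtasari", "kumbukumbu", "ripoti"]
generic_fertility = fertility(words, generic_vocab)
swahili_fertility = fertility(words, swahili_vocab)
```

With the Swahili-aware vocabulary each word stays whole, while the generic vocabulary breaks the same words into many short pieces. Tracking a metric like this on a held-out sample is a cheap check before and after retraining a tokenizer.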

Iteration 3: Language-aware preprocessing. When we detect a document is predominantly Swahili, we route to a Swahili-specialised classification head rather than the multilingual generalist.

Result: 85% → 91%.
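The routing step can be sketched as a simple threshold on how much of a document looks Swahili. Everything below is an assumption made for illustration: the lexicon-lookup detector stands in for a proper language-identification model, and the 0.5 threshold and head names are hypothetical.

```python
def swahili_fraction(tokens, swahili_lexicon):
    """Fraction of tokens found in a Swahili word list (a crude stand-in for language ID)."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in swahili_lexicon)
    return hits / len(tokens)

def route(tokens, swahili_lexicon, threshold=0.5):
    """Pick a classification head; the threshold value is an assumption."""
    if swahili_fraction(tokens, swahili_lexicon) >= threshold:
        return "swahili_head"
    return "multilingual_head"

# Tiny hypothetical lexicon for demonstration only.
lexicon = {"muhtasari", "wa", "mkutano", "kumbukumbu", "ripoti", "ya"}

swahili_doc = ["Muhtasari", "wa", "mkutano"]
english_doc = ["Minutes", "of", "the", "meeting"]
mixed_doc = ["Report", "ya", "mkutano", "summary"]

routes = [route(doc, lexicon) for doc in (swahili_doc, english_doc, mixed_doc)]
```

The mixed document sits exactly on the threshold here, which is a reminder that code-switched documents are where a hard routing rule is most fragile.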

Iteration 4: Hard negatives. We mined cases where the model confidently misclassified Swahili documents — particularly near-class neighbours (e.g., “muhtasari wa mkutano” classified as “memorandum” rather than “minutes”). We added these as hard negatives.

Result: 91% → 94%.
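The mining step amounts to filtering for predictions that are both wrong and confident. A minimal sketch, assuming each prediction is a dict with the keys shown; the field names, the 0.9 confidence cut-off, and the sample rows are illustrative rather than our actual pipeline:

```python
def mine_hard_negatives(predictions, min_confidence=0.9):
    """Return confidently wrong predictions, most confident first.

    predictions: list of dicts with keys "text", "true_label",
    "predicted_label", and "confidence" (an assumed layout).
    """
    hard = [
        p for p in predictions
        if p["predicted_label"] != p["true_label"]
        and p["confidence"] >= min_confidence
    ]
    return sorted(hard, key=lambda p: p["confidence"], reverse=True)

# Made-up predictions to show the selection behaviour.
sample = [
    {"text": "muhtasari wa mkutano", "true_label": "minutes",
     "predicted_label": "memorandum", "confidence": 0.97},
    {"text": "ripoti ya mwaka", "true_label": "report",
     "predicted_label": "report", "confidence": 0.99},
    {"text": "kumbukumbu za kikao", "true_label": "memorandum",
     "predicted_label": "minutes", "confidence": 0.62},
]
hard_negatives = mine_hard_negatives(sample)
```

Only the first row is selected: the second is correct, and the third is wrong but not confident, so it is more likely an ambiguous case than a systematic confusion between neighbouring classes.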

What stayed hard

Mixed-language documents with English headers and Swahili body still hover around 89%
Handwritten Swahili (county field officer reports) is harder still, because OCR errors compound the classification errors
Domain-specific Swahili from NGO-speak (“walengwa”, “huduma za kijamii”) needs more training data

What this implies

If you're building NLP infrastructure for African markets:

Don't trust multilingual benchmarks. A model that's “good at Swahili” by Western benchmarks may still be more than 20 points behind English on real-world African documents; ours started 23 points behind (91% vs 68%).
Tokenizers matter more than you think. Don't accept a generic subword tokenizer if your language has rich morphology.
Domain Swahili ≠ benchmark Swahili. Newspaper Swahili and county-government Swahili differ enough in vocabulary and register that a model trained on one can stumble on the other.
Hard negatives beat more data. The marginal correctly-labelled document is less useful than the marginal hard-negative pair.

What's next

We're applying the same playbook to French (for our Francophone West Africa rollout) and exploring Sheng (urban code-mixed Kenyan slang) for customer service ticket classification.

The methodology generalises. The patience required does not.