How Papyrus Achieved 94% Classification Accuracy on Swahili Documents

Off-the-shelf models speak Swahili poorly. Here's the work that took us from 68% to 94% on Swahili-content documents.

The honest story.

Where we started

Our initial classification model was a multilingual transformer fine-tuned on document type labels. Performance:

  • English documents: 91% accuracy
  • Swahili documents: 68% accuracy

The 23-point gap on Swahili was particularly painful for our NGO and county government tenants, who produce a significant share of their content in Swahili.
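A gap like this only shows up if accuracy is reported per language rather than as one aggregate number. A minimal sketch of that breakdown, assuming each evaluation record is a (language, gold label, predicted label) tuple; the record layout, function name, and toy data are hypothetical, chosen for illustration:

```python
from collections import defaultdict

def accuracy_by_language(records):
    """Return {language: accuracy} for (language, gold, predicted) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, gold, predicted in records:
        total[language] += 1
        correct[language] += int(gold == predicted)
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy records, invented for illustration only (not real evaluation data).
records = [
    ("en", "minutes", "minutes"),
    ("en", "report", "report"),
    ("sw", "minutes", "memorandum"),
    ("sw", "report", "report"),
]
scores = accuracy_by_language(records)
for lang, acc in sorted(scores.items()):
    print(f"{lang}: {acc:.0%}")
```

On the toy records the aggregate accuracy is 75%, which says nothing about the fact that every error sits in the Swahili slice.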

The diagnosis

We sampled 500 misclassified Swahili documents. Three patterns:

  1. Code-switching confused the model — paragraphs mixing English and Swahili were classified based on the English bits, ignoring Swahili context
  2. Document-type vocabulary was English-biased — words like “minutes”, “memorandum”, “report” anchored classifications; Swahili equivalents (“muhtasari”, “kumbukumbu”, “ripoti”) were underweighted
  3. Domain-specific Swahili — county-government Swahili terminology was largely absent from the pretraining corpus

What we tried

Iteration 1: Add more Swahili to training data. We sourced an additional 12,000 Swahili-content documents from willing tenants (with consent), labelled them, and retrained.

Result: 68% → 79%. Improvement, but not enough.

Iteration 2: Custom tokenizer for Swahili. Default subword tokenizers were over-fragmenting Swahili words. We trained a SentencePiece tokenizer with a Swahili-aware vocabulary.

Result: 79% → 85%.
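Over-fragmentation can be made measurable as the average number of subword pieces per word (often called fertility). The sketch below is a toy: it uses greedy longest-match splitting, which is not how SentencePiece segments text, and both vocabularies are invented for illustration. It only shows the shape of the comparison between a generic and a Swahili-aware vocabulary.

```python
def greedy_pieces(word, vocab):
    """Split a word into the longest matching vocabulary pieces, left to right.

    Falls back to a single character when no longer piece matches.
    """
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def fertility(words, vocab):
    """Average number of pieces per word under the given vocabulary."""
    return sum(len(greedy_pieces(w, vocab)) for w in words) / len(words)

# Hypothetical vocabularies, invented for illustration only.
generic_vocab = {"mu", "ht", "as", "ari", "ku", "mbu", "wa", "le", "ng"}
swahili_vocab = {"muhtasari", "kumbu", "wa", "lengwa"}

words = ["muhtasari", "kumbukumbu", "walengwa"]
print(fertility(words, generic_vocab))   # more pieces per word
print(fertility(words, swahili_vocab))   # fewer pieces per word
```

With a real tokenizer, the same comparison can be run by swapping greedy_pieces for the tokenizer's own encode call and feeding it a sample of in-domain Swahili text.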

Iteration 3: Language-aware preprocessing. When we detect that a document is predominantly Swahili, we route it to a Swahili-specialised classification head rather than the multilingual generalist.

Result: 85% → 91%.
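The routing step can be sketched as a share-of-language estimate followed by a threshold. Everything below is an assumption for illustration: the function-word lists stand in for a proper language-identification model, the 0.6 threshold is arbitrary, and the head names are hypothetical labels rather than real components.

```python
import re

# Small function-word lists used as a crude language signal (illustrative only).
SWAHILI_MARKERS = {"na", "ya", "wa", "kwa", "za", "ni", "katika", "la"}
ENGLISH_MARKERS = {"the", "and", "of", "to", "in", "is", "for", "on"}

def swahili_share(text):
    """Fraction of marker words in the text that are Swahili markers."""
    tokens = re.findall(r"[a-z']+", text.lower())
    sw = sum(t in SWAHILI_MARKERS for t in tokens)
    en = sum(t in ENGLISH_MARKERS for t in tokens)
    if sw + en == 0:
        return 0.0  # no evidence either way: treat as not predominantly Swahili
    return sw / (sw + en)

def choose_head(text, threshold=0.6):
    """Pick a classification head name based on the estimated Swahili share."""
    if swahili_share(text) >= threshold:
        return "swahili_head"
    return "multilingual_head"

print(choose_head("Muhtasari wa mkutano wa kamati ya afya katika kaunti"))
print(choose_head("Minutes of the meeting of the health committee"))
```

Documents with no marker words fall through to the multilingual head, which is the conservative default in this sketch. The threshold matters most for code-switched documents, where the share lands somewhere in the middle.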

Iteration 4: Hard negatives. We mined cases where the model confidently misclassified Swahili documents — particularly near-class neighbours (e.g., “muhtasari wa mkutano” classified as “memorandum” rather than “minutes”). We added these as hard negatives.

Result: 91% → 94%.
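The mining step, finding confident misclassifications, can be sketched as a filter over model predictions. This is a minimal illustration under stated assumptions: the Prediction record, the 0.8 confidence cut-off, and the toy data are all hypothetical, and the sketch covers only selecting the candidates, not how they are fed back into training.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    doc_id: str
    gold: str          # human-assigned label
    predicted: str     # model's label
    confidence: float  # model's probability for the predicted label

def mine_hard_negatives(predictions, min_confidence=0.8):
    """Return confidently wrong predictions, most confident first."""
    wrong = [
        p for p in predictions
        if p.predicted != p.gold and p.confidence >= min_confidence
    ]
    return sorted(wrong, key=lambda p: p.confidence, reverse=True)

# Toy predictions, invented for illustration only.
predictions = [
    Prediction("d1", "minutes", "memorandum", 0.93),
    Prediction("d2", "minutes", "minutes", 0.97),
    Prediction("d3", "report", "memorandum", 0.55),
    Prediction("d4", "memorandum", "minutes", 0.88),
]
for p in mine_hard_negatives(predictions):
    print(p.doc_id, p.gold, "->", p.predicted, p.confidence)
```

Low-confidence errors such as d3 are left out on purpose: the model is already uncertain there, so the confident mistakes are the ones that point to a wrong decision boundary between neighbouring classes.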

What stayed hard

  • Mixed-language documents with English headers and Swahili body still hover around 89%
  • Handwritten Swahili (county field officer reports) is harder still, because OCR errors compound the classification errors
  • Domain-specific Swahili from NGO-speak (“walengwa”, “huduma za kijamii”) needs more training data

What this implies

If you're building NLP infrastructure for African markets:

  1. Don't trust multilingual benchmarks. A model that's “good at Swahili” by Western benchmarks may still be more than 20 points behind English on real-world African documents; ours started 23 points behind (68% vs. 91%).
  2. Tokenizers matter more than you think. Don't accept a generic subword tokenizer if your language has rich morphology.
  3. Domain Swahili ≠ benchmark Swahili. Newspaper Swahili and county-government Swahili are different languages in some respects.
  4. Hard negatives beat more data. The marginal correctly-labelled document is less useful than the marginal hard-negative pair.

What's next

We're applying the same playbook to French (for our Francophone West Africa rollout) and exploring Sheng (urban code-mixed Kenyan slang) for customer service ticket classification.

The methodology generalises. The patience required does not.
