Joshua Opolko

Lost at the Token: Why Jamaican Patois Needs Its Own AI

A case, with sources, and an invitation to the person who should build this.

Patois says in four words what many take four paragraphs to avoid saying. As language grows more verbose and more evasive, some languages keep it honest, and that is worth preserving for everyone.

1. The defaults are being written now

There are roughly 7,000 living languages, and a landmark 2020 survey of the field found that the overwhelming majority are effectively absent from natural-language processing, while a handful of "winners" absorb nearly all the resources and attention [1]. Models learn a language from whatever text exists in it. For Jamaican Creole (Patois, or Patwa), with around 3 million speakers in Jamaica plus a large diaspora [10], there is comparatively little such text, and the language's depth lives in speech, music, and conversation more than in the indexed web.

Absence is not neutrality. When a model has little authentic data, it doesn't abstain; it fills the gap with stereotype, the flattened version it can assemble from the little it has seen. The way machines "understand" Jamaican Patois for the next generation is being set right now, by whoever supplies the data. That's the window this is about.

2. The mechanism: meaning is fragmented before it's read

LLMs don't process words; they process tokens, subword units learned from corpora that are overwhelmingly English. A word the tokenizer has seen often becomes one clean token. A word it has rarely seen is shattered into many fragments, sometimes single characters.

That's exactly Patois's situation. Whether it is written in the standardized Cassidy/JLU orthography [10] or in ad-hoc spellings, the distinctively Creole layer is largely absent from training data: grammatical particles (a, fi, deh, nuh, dem), pronouns (unu, mi, im), and core lexicon (pickney, nyam). The predictable result: a sentence a Jamaican reads as ordinary prose is, to the tokenizer, a string of rare fragments.
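A toy sketch can make the splitting concrete. It assumes a greedy longest-match tokenizer and a tiny, made-up, English-skewed vocabulary (`TOY_VOCAB` is hypothetical). Production tokenizers use learned BPE, WordPiece, or unigram vocabularies, so the exact pieces will differ, but the pattern of one clean token versus many fragments is the same.

```python
# Hypothetical, English-skewed vocabulary; not taken from any real model.
TOY_VOCAB = {"child", "children", "eat", "eating", "the", "is", "pick", "name"}


def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split a word by greedy longest-match, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matched: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens


for word in ["children", "pickney", "nyam"]:
    print(word, greedy_tokenize(word, TOY_VOCAB))
```

With this vocabulary the English word survives whole, while the Patois words fall back to a false-friend piece and loose characters.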

This isn't speculative. Petrov et al. (2023) showed that the same text can take up to 15× as many tokens depending on the language, and that the gap persists even in "multilingual" tokenizers [2]. Ahia et al. (2023), studying 22 languages on a commercial API, found that speakers of many languages are overcharged and given poorer results, and that they disproportionately come from regions where the service is least affordable [3].
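The kind of ratio these studies report can be sketched as a token premium over parallel text: total tokens for the target language divided by total tokens for the English baseline. The function names and the stand-in tokenizer below are assumptions for illustration only; the number it prints says nothing about any real tokenizer.

```python
from typing import Callable, Iterable


def token_premium(
    pairs: Iterable[tuple[str, str]],
    tokenize: Callable[[str], list[str]],
) -> float:
    """Ratio of target-language tokens to baseline tokens over parallel sentences."""
    target_total = 0
    baseline_total = 0
    for target_text, baseline_text in pairs:
        target_total += len(tokenize(target_text))
        baseline_total += len(tokenize(baseline_text))
    if baseline_total == 0:
        raise ValueError("baseline text produced no tokens")
    return target_total / baseline_total


# Stand-in tokenizer (hypothetical): known words are one token each,
# anything else is split into single characters.
KNOWN_WORDS = {"he", "is", "eating"}


def stand_in_tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.lower().split():
        if word in KNOWN_WORDS:
            tokens.append(word)
        else:
            tokens.extend(word)  # one token per character
    return tokens


# One parallel pair: the Patois phrase and an English gloss of it.
pairs = [("im a nyam", "he is eating")]
premium = token_premium(pairs, stand_in_tokenize)
print(f"token premium: {premium:.2f}x")
```

Swapping `stand_in_tokenize` for the encode function of a tokenizer under audit, and the single pair for a real parallel corpus, would turn the sketch into a measurement.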

The cost to understanding: five ways meaning degrades

The token tax is the visible half. The subtler and costlier half is what fragmentation does to meaning, through five compounding effects:

  1. The meaning-unit disappears. As one token, "pickney" gets a single vector that means "child." Split into pick + ney, no piece carries "child"; the model has to reassemble it from sub-pieces that mean nothing, or mean something wrong, like "pick." The semantic anchor is gone at the token boundary.
  2. Rare fragments are under-trained. A token's vector is refined every time it appears in training. Low-frequency Patois fragments appear rarely, so their representations stay noisy and badly placed in semantic space: weaker, less reliable predictions before any reasoning even starts.
  3. Longer dependency chains. Meaning that English packs into one token now spans four or five positions. The model's attention has to reach across more distance to recover a single concept: more hops, more chances to lose the thread mid-context.
  4. False-friend normalization (the dangerous one). Because Patois is English-lexified, fragments often map onto English subwords, so the model "corrects" Patois into standard English and confidently changes the meaning. The phrase "im a nyam" ("he or she is eating") can be read as a typo of "I'm a" or "named." The model does not just lose information; it injects wrong information with confidence.
  5. Grammar evaporates. Patois encodes tense, aspect, and plurality in small particles ("a," "fi," and "dem" as a plural marker). If those do not survive as clean tokens, the grammatical scaffolding is lost exactly where the meaning lives. Misreading "dem" (plural marker) as English "them" breaks the parse.

So the two halves compound: the cost side is a clean multiplier (more tokens, more money, less context room, more compute), and the understanding side is a quality collapse (lost units, weak embeddings, stretched reasoning, English-ward distortion, and dropped grammar).
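The "less context room, more compute" part of that multiplier is simple arithmetic, sketched below. The window size and tokens-per-word figures are hypothetical placeholders, not measurements, and the quadratic rule assumes standard full self-attention, where every token is compared with every other token.

```python
def effective_context_words(context_tokens: int, tokens_per_word: float) -> int:
    """Approximate number of words that fit in a fixed token window."""
    if tokens_per_word <= 0:
        raise ValueError("tokens_per_word must be positive")
    return int(context_tokens / tokens_per_word)


def relative_attention_cost(length_multiplier: float) -> float:
    """Relative cost of full self-attention when the sequence grows by a factor.

    Pairwise attention scales with the square of sequence length.
    """
    return length_multiplier ** 2


WINDOW = 8192            # hypothetical context window, in tokens
ENGLISH_FERTILITY = 1.3  # hypothetical tokens per word
PATOIS_FERTILITY = 4.0   # hypothetical tokens per word

print("English words that fit:", effective_context_words(WINDOW, ENGLISH_FERTILITY))
print("Patois words that fit: ", effective_context_words(WINDOW, PATOIS_FERTILITY))
print("Attention cost at 3x tokens:", relative_attention_cost(3.0))
```

Under these assumed figures the same window holds roughly a third as many Patois words, and a threefold increase in tokens costs about nine times the attention compute.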

What's the exact fragmentation rate for Patois? Nobody has published it. That gap is the argument: the precise measurement should be the first deliverable of a Jamaican-built effort. It's cheap, citable, and impossible for the labs to ignore once it exists.
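One plausible shape for that measurement is a fertility report: average tokens per word, plus the share of words that get split at all. This is a minimal sketch, not an established protocol. `fragmentation_report` and the three-character chunker are hypothetical; a real audit would pass in the encode function of each tokenizer being tested and a representative word list.

```python
from typing import Callable


def fragmentation_report(
    words: list[str],
    tokenize: Callable[[str], list[str]],
) -> dict[str, float]:
    """Summarise how heavily a tokenizer fragments a list of words."""
    if not words:
        raise ValueError("need at least one word")
    counts = [len(tokenize(word)) for word in words]
    return {
        "words": len(words),
        "tokens": sum(counts),
        # Fertility: average number of tokens per word.
        "fertility": sum(counts) / len(words),
        # Share of words broken into more than one token.
        "share_split": sum(count > 1 for count in counts) / len(words),
    }


def chunk3(word: str) -> list[str]:
    """Stand-in tokenizer: fixed three-character chunks (illustration only)."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


sample = ["pickney", "nyam", "unu", "dem", "fi"]
report = fragmentation_report(sample, chunk3)
print(report)
```

Running the same report over English glosses of the same words gives the comparison figure the argument calls for.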

3. From fragments to cost, capability, and justice

Cost. Most LLM services bill per token. If Patois costs several times more tokens for the same meaning [2, 3], a Patois-speaking user, business, or agency pays a multiple for identical service, a tax levied by the tokenizer, hitting hardest those least able to absorb it.
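The arithmetic behind that tax is short enough to write down. The price, monthly volume, and the 3x multiplier below are hypothetical placeholders chosen only to show the calculation.

```python
def request_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of processing a number of tokens at a flat per-token price."""
    return tokens * price_per_million_tokens / 1_000_000


PRICE_PER_MILLION = 2.00          # hypothetical price, in dollars
ENGLISH_TOKENS = 1_000_000        # hypothetical monthly volume for an English user
TOKEN_MULTIPLIER = 3              # hypothetical: same meaning, three times the tokens

english_cost = request_cost(ENGLISH_TOKENS, PRICE_PER_MILLION)
patois_cost = request_cost(ENGLISH_TOKENS * TOKEN_MULTIPLIER, PRICE_PER_MILLION)

print(f"English: ${english_cost:.2f}  Patois: ${patois_cost:.2f}")
```

Because billing is linear in tokens, whatever multiplier the tokenizer imposes passes straight through to the invoice.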

Capability. Fragmentation degrades comprehension. When a word dissolves into fragments, the unit the model should reason over is gone; nuance and intent are lost at the token boundary. The English-lexified surface makes it worse: a model can mistake Patois for "bad English" and confidently normalize it into something the speaker never meant.

Justice. This is what the other two build toward. As AI mediates more of public life (benefits chatbots, healthcare triage, immigration interviews, courtroom transcription, content moderation that ranks "low-quality" text), a population the system systematically misreads accumulates disadvantage at every contact point. A Patois statement misparsed in a tribunal transcript; a claim misclassified by a benefits bot; a Patois post suppressed as spam. Each is a small algorithmic mishearing; in aggregate it's a structural one. Kornai (2013) warned that fewer than 5% of languages are likely to make the leap into the digital realm [4]. The subtler modern danger is a language that survives in the street but is misheard by the machines that increasingly arbitrate access. When communication is brokered by AI, to be untokenizable is to be, in part, unheard.

This is more than a digital divide. A divide means being left behind: slower, costlier, last in line. The sharper risk is digital erasure: if the models only ever ingest a thin, unrepresentative slice of Patois, the machine-mediated version of the language, reinforced across millions of interactions, can begin to crowd out the real one with a flattened imitation. Kornai's "digital language death" [4] names the endpoint; the engine is a feedback loop in which sparse, skewed data trains the next model, which produces more of the same. Inclusion now isn't about catching up. It's about making sure the real language is the one that endures.

4. The data is thin; the language is not

Patois is a full language: rule-governed, expressive, and carried by millions in conversation, music, scripture, and proverb. Its thin presence online reflects not the depth of the language but where that depth has lived: in speech and song far more than in the written, indexed text the web is built from. Linguists have made this point for decades; Michel DeGraff (2005), among others, has shown that creoles are complete linguistic systems, as structured and capable as any other [5]. That distinction matters technically: when a language's richness lives mostly in the spoken world, a model trained on web text simply hasn't met it yet, and tends to smooth its unfamiliar forms into something flatter than the original. The opportunity is to introduce the models to that richness directly, with authoritative, speaker-grounded data, built on the premise the language has always deserved: that it is whole.

5. Jamaica is unusually ready

Most languages similar to Patois would greatly benefit from the scaffolding Jamaica already has:

Together these point to a specific technical tailwind: Patois's shared roots with English give it a transfer-learning head start, and the groundwork exists.

And the people are already here. The linguistic foundation was laid at the Jamaican Language Unit by scholars like Professor Emeritus Hubert Devonish, who spent decades arguing for Patois as a language and engineering a writing system for it [10, 26], and lexicographer Dr. Joseph Farquharson [10]. The first computational step was taken by a Jamaican of the diaspora: Ruth-Ann Armstrong, who built JamPatoisNLI as a Stanford researcher and is now a computer-science PhD [6]. So the scholarship exists, the linguists exist, and an early NLP beachhead exists. What's missing isn't feasibility, and it isn't expertise. It's the founder layer: the person who turns this foundation into owned, productized infrastructure.

6. You wouldn't be starting from zero, or alone

An entire movement already exists, it's winning, and it's open. Across Africa, communities stopped waiting to be included and built for themselves.

For a Jamaican founder this means the playbooks already exist (Masakhane's governance, NaijaSenti's funded-corpus model) and the tables are already set: the Creole NLP community, the AfricaNLP workshops, the Deep Learning Indaba. The pitch is simple: Patois is the Caribbean's Naija; let's share what works.

7. The training data is already on the shelf

A builder doesn't begin with a blank page. The raw material exists: the Cassidy/JLU standard [10], the Dictionary of Jamaican English [11], Di Jamiekan Nyuu Testiment as parallel text [12], JamPatoisNLI as a ready evaluation set [6], and Jamaican coverage inside CreoleVal [7] and Kreyòl-MT [19]. The first technical act is assembly and measurement: clean these into a governed corpus and run the tokenization audit. That's a weekend of undeniable results, and the credential that opens every door below.
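What "a governed corpus" might mean in practice can be sketched as a record that carries provenance alongside the text, plus a cleaning pass that normalises and de-duplicates. Every field name, the placeholder licence value, and the sample rows here are assumptions for illustration; actual licence terms for each source would have to be checked before any text is included.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusRecord:
    text: str
    source: str       # citation key for where the text came from
    licence: str      # placeholder until the real terms are verified
    orthography: str  # e.g. "cassidy-jlu" or "ad-hoc"


def normalise(text: str) -> str:
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def build_corpus(records: list[CorpusRecord]) -> list[CorpusRecord]:
    """Clean each record and drop empty or exactly duplicated texts."""
    seen: set[str] = set()
    cleaned: list[CorpusRecord] = []
    for record in records:
        text = normalise(record.text)
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(
            CorpusRecord(text, record.source, record.licence, record.orthography)
        )
    return cleaned


# Hypothetical sample rows; the second is a whitespace variant of the first.
raw = [
    CorpusRecord("im a nyam", "example-a", "TO-VERIFY", "ad-hoc"),
    CorpusRecord("im  a   nyam ", "example-b", "TO-VERIFY", "ad-hoc"),
    CorpusRecord("unu", "example-a", "TO-VERIFY", "ad-hoc"),
]
corpus = build_corpus(raw)
print(len(corpus), "records kept")
```

Keeping source and licence on every row is the design choice that matters: it is what lets the corpus be audited, shared, or withdrawn on the community's terms.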

8. Ownership, not extraction

Two futures. In one, Patois becomes raw material: scraped, tokenized abroad, folded into someone else's model, the community credited with nothing. Couldry and Mejias (2019) call this data colonialism: extracting value from human life as data the way territorial colonialism extracted land [8]. In the other, Jamaicans build and hold the assets: the corpus, the benchmark, the fine-tuned models, the standards. These are the rails every downstream Patois application will run on.

Infrastructure compounds. Whoever builds the canonical Patois stack owns the category that every Patois-facing product (government services, fintech, education, media) will depend on. That's durable value and durable power, and there's no structural reason it must be built in California rather than Kingston.

9. Where the money is

Funding for exactly this is unusually available, and one stream already paid for a Pidgin sibling:

The path isn't charity-dependent: assemble the corpus → publish the tokenization audit and benchmark → win a Lacuna/AI4D grant and a DBJ/Digicel match → build the models → own the rails.

10. The window

Languages don't lose the digital age in a single event; they lose it one default at a time, set quietly by whoever shows up first. Jamaica has the speakers, the orthography, the corpus, and a transfer-learning edge few languages can claim. What it needs is the person who treats this as theirs to build and to own. The window is open and it's closing.


Compiled from public research. Corrections and collaborators welcome.


Sources

  1. Joshi, P. et al. (2020). The State and Fate of Linguistic Diversity and Inclusion in the NLP World. ACL 2020. https://aclanthology.org/2020.acl-main.560/
  2. Petrov, A., La Malfa, E., Torr, P., & Bibi, A. (2023). Language Model Tokenizers Introduce Unfairness Between Languages. NeurIPS 36. https://arxiv.org/abs/2305.15425
  3. Ahia, O. et al. (2023). Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. EMNLP 2023. https://arxiv.org/abs/2305.13707
  4. Kornai, A. (2013). Digital Language Death. PLOS ONE 8(10):e77056. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0077056
  5. DeGraff, M. (2005). Linguists' most dangerous myth: The fallacy of Creole Exceptionalism. Language in Society 34(4). https://doi.org/10.1017/S0047404505050207
  6. Armstrong, R.-A., Hewitt, J., & Manning, C. (2022). JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset. Findings of EMNLP 2022. https://arxiv.org/abs/2212.03419
  7. Lent, H. et al. (2024). CreoleVal: Multilingual Multitask Benchmarks for Creoles. TACL. https://aclanthology.org/2024.tacl-1.53/
  8. Couldry, N., & Mejias, U. A. (2019). Data Colonialism: Rethinking Big Data's Relation to the Contemporary Subject. Television & New Media 20(4):336-349. https://journals.sagepub.com/doi/10.1177/1527476418796632
  9. Nekoto, W. et al. (2020). Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages. Findings of EMNLP 2020. https://aclanthology.org/2020.findings-emnlp.195/
  10. Jamaican Language Unit, UWI Mona (est. 2002); Cassidy/JLU orthography; Writing Jamaican the Jamaican Way. https://www.mona.uwi.edu/dllp/jlu/background
  11. Cassidy, F. G., & Le Page, R. B. (1980). Dictionary of Jamaican English (2nd ed.). Cambridge University Press.
  12. Bible Society of the West Indies (2012). Di Jamiekan Nyuu Testiment; precursor Jiizas: di Buk We Luuk Rait bout Im (2010). https://en.wikipedia.org/wiki/Di_Jamiekan_Nyuu_Testiment
  13. Masakhane. https://www.masakhane.io/ ; ∀ et al. (2020), Masakhane, Machine Translation for Africa, https://arxiv.org/abs/2003.11529
  14. Adelani, D. et al. (2021). MasakhaNER: Named Entity Recognition for African Languages. TACL. https://arxiv.org/abs/2103.11811
  15. UD_Naija-NSC (Naija / Nigerian Pidgin Universal Dependencies treebank), NaijaSynCor. https://universaldependencies.org/treebanks/pcm_nsc/index.html
  16. Muhammad, S. H. et al. (2022). NaijaSenti: A Nigerian Twitter Sentiment Corpus. LREC 2022. https://arxiv.org/abs/2201.08277 ; https://github.com/hausanlp/NaijaSenti
  17. The Rise of AfricaNLP: A Survey of Contributions, Contributors, Community Impact (2025). https://arxiv.org/abs/2509.25477
  18. Ghana NLP. https://ghananlp.org/
  19. Robinson, N. R., Dabre, R. et al. (2024). Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages. NAACL 2024. https://arxiv.org/abs/2405.05376
  20. Lacuna Fund. https://lacunafund.org/
  21. AI4D / IDRC African Languages Lab. https://idrc-crdi.ca/en/what-we-do/projects-we-support/project/ai4d-african-languages-lab ; https://ai4d.ai/about-ai4d/
  22. Mozilla Common Voice. https://commonvoice.mozilla.org/
  23. Development Bank of Jamaica, BIGEE programme. https://dbankjm.com/
  24. DBJ × Caribbean Export corporate-venturing grant facility (2025). https://eulacdigitalaccelerator.com/2025/02/27/the-development-bank-of-jamaica-signs-memorandum-of-understanding-with-caribbean-export-to-launch-corporate-venturing-pilot-programme/
  25. Digicel Foundation, Build Jamaica / Mek A Muckle grants. https://www.digicelfoundation.org/jm/en/apply-for-a-grant
  26. Devonish, H., & Carpenter, K. (2020). Language, Race and the Global Jamaican. Palgrave Macmillan. (Hubert Devonish & Joseph Farquharson, Jamaican Language Unit, UWI Mona.) https://mona-uwi.academia.edu/HubertDevonish