trigrams

Trigrams for 500+ languages.
Contents
What is this?
This package exposes all trigrams for natural languages.
Based on the most translated copyright-free document on this planet:
UDHR.
When should I use this?
When you are dealing with natural language detection.
Install
This package is ESM only.
In Node.js
(version 18+),
install with npm:
npm install trigrams
In Deno with esm.sh
:
import {min, top} from 'https://esm.sh/trigrams@6'
In browsers with esm.sh
:
<script type="module">
import {min, top} from 'https://esm.sh/trigrams@6?bundle'
</script>
Use
import {min, top} from 'trigrams'
console.log((await min()).nld)
console.log((await top()).pam)
Yields:
[
' ar',
'eer',
'tij',
'de ',
'an ',
'en '
]
{
'isa': 6,
'upa': 6,
'i k': 6,
'ang': 273,
'ing': 282,
'ng ': 572
}
API
This package exports the identifiers
min
and
top
.
It exports no TypeScript types.
There is no default export.
min()
Get top trigrams.
Returns
Returns a promise resolving to arrays containing the top 300 trigrams sorted
from least occurring to most occurring
(Promise<Record<string, Array<string>>>
).
top()
Get top trigrams to occurrence counts.
Returns
Returns a promise resolving to an object mapping
UDHR in Unicode
codes to objects mapping the top 300 trigrams to occurrence counts
(Promise<Record<string, Record<string, number>>>
).
Data
The trigrams are based on the unicode versions of the
universal declaration of human rights.
The files are created from all paragraphs made available by
wooorm/udhr
and do not include headings and such.
Before creating trigrams,
- the unicode characters from
\u0021
to \u0040
(both including) are
removed
- one or more white space characters (
\s+
) are replaced with a single space
- alphabetic characters are lower cased (
[A-Z]
)
Additionally,
the input is padded with two spaces on both sides.
Code |
Name |
007 |
Sãotomense |
008 |
Crioulo, Upper Guinea (008) |
009 |
Mbundu (009) |
010 |
Tetun Dili |
011 |
Umbundu (011) |
013 |
(Mijisa) |
014 |
(Maiunan) |
016 |
(Minjiang, spoken) |
017 |
(Minjiang, written) |
020 |
Drung |
021 |
(Muzzi) |
022 |
(Klau) |
025 |
(Bizisa) |
026 |
(Yeonbyeon) |
027 |
Gumuz |
028 |
Kafa |
029 |
Sidamo |
030 |
Kituba (2) |
032 |
South Azerbaijani |
041 |
Latvian (2) |
042 |
Spanish (resolution) |
043 |
Zarma |
044 |
Mirandese |
045 |
Maasai |
046 |
Malay, Papuan |
047 |
Malay, Ambonese |
048 |
Minangkabau (2) |
049 |
Banjar |
050 |
(Bataknese) |
052 |
Morisyen |
053 |
Hausa (2) |
054 |
Catalan (2) |
055 |
Jamaican Creole English |
056 |
Saint Lucian Creole French |
057 |
Maay |
058 |
Somali (Af Marka) |
059 |
North Saami (2) |
060 |
Inari Saami |
061 |
Skolt Saami |
062 |
Swahili (Chimwiini) |
063 |
Swahili (Kibajuni) |
064 |
Dabarre |
065 |
Garre |
066 |
Jiiddu |
067 |
Finnish (2) |
068 |
French (Welche) |
069 |
Maori (2) |
071 |
Kabyle |
aar |
Afar |
abk |
Abkhaz |
ace |
Aceh |
acu |
Achuar-Shiwiar |
acu_1 |
Achuar-Shiwiar (1) |
ada |
Dangme |
ady |
Adyghe |
afr |
Afrikaans |
agr |
Aguaruna |
aii |
Assyrian Neo-Aramaic |
ajg |
Aja |
aka_akuapem |
Twi (Akuapem) |
aka_asante |
Twi (Asante) |
aka_fante |
Fante |
als |
Albanian, Tosk |
alt |
Altai, Southern |
amc |
Amahuaca |
ame |
Yaneshaʼ |
amh |
Amharic |
ami |
Amis |
amr |
Amarakaeri |
arb |
Arabic, Standard |
arl |
Arabela |
arn |
Mapudungun |
ast |
Asturian |
auc |
Waorani |
auv |
Occitan (Auvergnat) |
ayo |
Ayoreo |
ayr |
Aymara, Central |
azj_cyrl |
Azerbaijani, North (Cyrillic) |
azj_latn |
Azerbaijani, North (Latin) |
bam |
Bamanankan |
ban |
Bali |
bax |
Bamun |
bba |
Baatonum |
bci |
Baoulé |
bcl |
Bicolano, Central |
bel |
Belarusan |
bem |
Bemba |
ben |
Bengali |
bfa |
Bari |
bho |
Bhojpuri |
bin |
Edo |
bis |
Bislama |
blt |
Tai Dam |
blu |
Hmong Njua |
boa |
Bora |
bod |
Tibetan, Central |
bos_cyrl |
Bosnian (Cyrillic) |
bos_latn |
Bosnian (Latin) |
bre |
Breton |
btb |
Bulu |
buc |
Bushi |
bug |
Bugis |
bul |
Bulgarian |
bvi |
Belanda Viri |
cab |
Garifuna |
cak |
Kaqchikel, Central |
cas |
Tsimané |
cat |
Catalan |
cbi |
Chachi |
cbr |
Cashibo-Cacataibo |
cbs |
Cashinahua |
cbt |
Chayahuita |
cbu |
Candoshi-Shapra |
ccx |
Zhuang, Yongbei |
ceb |
Cebuano |
ces |
Czech |
cha |
Chamorro |
chj |
Chinantec, Ojitlán |
chk |
Chuukese |
chr_cased |
Cherokee (cased) |
chr_uppercase |
Cherokee (uppercase) |
chv |
Chuvash |
cic |
Chickasaw |
cjk |
Chokwe |
cjk_AO |
Chokwe (Angola) |
cjs |
Shor |
ckb |
Kurdish, Central |
cnh |
Chin, Haka |
cni |
Asháninka |
cnr |
Montenegrin |
cof |
Colorado |
cos |
Corsican |
cot |
Caquinte |
cpu |
Ashéninka, Pichis |
crh |
Crimean Tatar |
crs |
Seselwa Creole French |
csa |
Chinantec, Chiltepec |
csw |
Cree, Swampy |
ctd |
Chin, Tedim |
cym |
Welsh |
dag |
Dagbani |
dan |
Danish |
ddn |
Dendi |
deu_1901 |
German, Standard (1901) |
deu_1996 |
German, Standard (1996) |
dga |
Dagaare, Southern |
dip |
Dinka, Northeastern |
div |
Maldivian |
dyo |
Jola-Fonyi |
dyu |
Jula |
dzo |
Dzongkha |
ell_monotonic |
Greek (monotonic) |
ell_polytonic |
Greek (polytonic) |
emk |
Maninkakan, Eastern |
eml |
Romagnolo |
eng |
English |
epo |
Esperanto |
ese |
Ese Ejja |
est |
Estonian |
eus |
Basque |
eve |
Even |
evn |
Evenki |
ewe |
Éwé |
fao |
Faroese |
fij |
Fijian |
fin |
Finnish |
fkv |
Finnish, Kven |
flm |
Chin, Falam |
fon |
Fon |
fra |
French |
fri |
Frisian, Western |
fuf |
Pular |
fur |
Friulian |
fuv |
Fulfulde, Nigerian |
fuv2 |
Fulfulde, Nigerian (2) |
fvr |
Fur |
gaa |
Ga |
gag |
Gagauz |
gax |
Oromo, Borana-Arsi-Guji |
gjn |
Gonja |
gkp |
Kpelle, Guinea |
gla |
Gaelic, Scottish |
gld |
Nanai |
gle |
Gaelic, Irish |
glg |
Galician |
glv |
Manx |
gnw |
Guarani, Western Bolivian |
gsw1 |
Alemannisch (Elsassisch) |
guc |
Wayuu |
gug |
Guaraní, Paraguayan |
guj |
Gujarati |
guu |
Yanomamö |
gyr |
Guarayu |
hat_kreyol |
Haitian Creole French (Kreyol) |
hat_popular |
Haitian Creole French (Popular) |
hau_NE |
Hausa (Niger) |
hau_NG |
Hausa (Nigeria) |
hau_3 |
Hausa |
haw |
Hawaiian |
hea |
Hmong, Northern Qiandong |
heb |
Hebrew |
hil |
Hiligaynon |
hin |
Hindi |
hlt |
Chin, Matu |
hms |
Hmong, Southern Qiandong |
hna |
Gen |
hni |
Hani |
hns |
Hindustani, Sarnami |
hrv |
Croatian |
hsb |
Sorbian, Upper |
hsf |
Huastec (Sierra de Otontepec) |
hun |
Hungarian |
hus |
Huastec (Veracruz) |
huu |
Huitoto, Murui |
hva |
Huastec (San Luís Potosí) |
hye |
Armenian |
ibb |
Ibibio |
ibo |
Igbo |
ido |
Ido |
idu |
Idoma |
ijs |
Ijo, Southeast |
ike |
Inuktitut, Eastern Canadian |
ilo |
Ilocano |
ina |
Interlingua |
ind |
Indonesian |
isl |
Icelandic |
ita |
Italian |
jav |
Javanese (Latin) |
jav_java |
Javanese (Javanese) |
jiv |
Shuar |
jpn |
Japanese |
jpn_osaka |
Japanese (Osaka) |
jpn_tokyo |
Japanese (Tokyo) |
kaa |
Karakalpak |
kal |
Inuktitut, Greenlandic |
kan |
Kannada |
kat |
Georgian |
kaz |
Kazakh |
kbd |
Kabardian |
kbp |
Kabiyé |
kde |
Makonde |
kdh |
Tem |
kea |
Kabuverdianu |
kek |
Q'eqchi' |
kha |
Khasi |
khk |
Mongolian, Halh (Cyrillic) |
khm |
Khmer, Central |
kin |
Rwanda |
kir |
Kirghiz |
kjh |
Khakas |
kkh_lana |
Khün |
kmb |
Mbundu |
kmr |
Kurdish, Northern |
knc |
Kanuri, Central |
kng |
Koongo |
kng_AO |
Koongo (Angola) |
koi |
Komi-Permyak |
koo |
Konjo |
kor |
Korean |
kqn |
Kaonde |
kqs |
Kissi, Northern |
kri |
Krio |
krl |
Karelian |
ktu |
Kituba |
kwi |
Awa-Cuaiquer |
lad |
Ladino |
lao |
Lao |
lat |
Latin |
lat_1 |
Latin (1) |
lav |
Latvian |
lia |
Limba, West-Central |
lij |
Ligurian |
lin |
Lingala |
lin_tones |
Lingala (tones) |
lit |
Lithuanian |
lld |
Ladin |
lnc |
Occitan (Languedocien) |
lns |
Lamnso' |
lob |
Lobi |
lot |
Otuho |
loz |
Lozi |
ltz |
Luxembourgeois |
lua |
Luba-Kasai |
lue |
Luvale |
lug |
Ganda |
lun |
Lunda |
lus |
Mizo |
mad |
Madura |
mag |
Magahi |
mah |
Marshallese |
mai |
Maithili |
mal |
Malayalam |
mal_chillus |
Malayalam |
mam |
Mam, Northern |
mar |
Marathi |
maz |
Mazahua Central |
mcd |
Sharanahua |
mcf |
Matsés |
men |
Mende |
mfq |
Moba |
mic |
Micmac |
min |
Minangkabau |
miq |
Mískito |
mkd |
Macedonian |
mlt |
Maltese |
mly_arab |
Malay (Arabic) |
mly_latn |
Malay (Latin) |
mnw |
Mon |
mor |
Moro |
mos |
Mòoré |
mri |
Maori |
mto |
Mixe, Totontepec |
mtp |
Wichí Lhamtés Nocten |
mxi |
Mozarabic |
mxv |
Mixtec, Metlatónoc |
mya |
Burmese |
mzi |
Mazatec, Ixcatlán |
nav |
Navajo |
nba |
Nyemba |
nbl |
Ndebele |
ndo |
Ndonga |
nds |
Saxon, Low |
nep |
Nepali |
nhn |
Nahuatl, Central |
nio |
Nganasan |
niu |
Niue |
niv |
Gilyak |
njo |
Naga, Ao |
nku |
Kulango, Bouna |
nld |
Dutch |
nno |
Norwegian, Nynorsk |
nob |
Norwegian, Bokmål |
not |
Nomatsiguenga |
nso |
Sotho, Northern |
nya_chechewa |
Nyanja (Chechewa) |
nya_chinyanja |
Nyanja (Chinyanja) |
nym |
Nyamwezi |
nyn |
Nyankore |
nzi |
Nzema |
oaa |
Orok |
oci_1 |
Francoprovençal (Fribourg) |
oci_2 |
Francoprovençal (Savoie) |
oci_3 |
Francoprovençal (Vaud) |
oci_4 |
Francoprovençal (Valais) |
ojb |
Ojibwa, Northwestern |
oki |
Okiek |
orh |
Oroqen |
oss |
Osetin |
ote |
Otomi, Mezquital |
pam |
Pampangan |
pan |
Panjabi, Eastern |
pap |
Papiamentu |
pau |
Palauan |
pbb |
Páez |
pbu |
Pashto, Northern |
pcd |
Picard |
pcm |
Pidgin, Nigerian |
pes_1 |
Farsi, Western |
pes_2 |
Dari |
pis |
Pijin |
piu |
Pintupi-Luritja |
plt |
Malagasy, Plateau |
pnb |
Panjabi, Western |
pol |
Polish |
pon |
Pohnpeian |
por_BR |
Portuguese (Brazil) |
por_PT |
Portuguese (Portugal) |
pov |
Crioulo, Upper Guinea |
ppl |
Pipil |
prv |
Occitan |
quc |
K'iche', Central |
qud |
Quechua (Unified Quichua, old Hispanic orthography) |
qug |
Quichua, Chimborazo Highland |
qul |
Quechua, North Bolivian |
quy |
Quechua, Ayacucho |
quz |
Quechua, Cusco |
qva |
Quechua, Ambo-Pasco |
qvc |
Quechua, Cajamarca |
qvh |
Quechua, Huamalíes-Dos de Mayo Huánuco |
qvm |
Quechua, Margos-Yarowilca-Lauricocha |
qvn |
Quechua, North Junín |
qwh |
Quechua, Huaylas Ancash |
qxa |
Quechua, South Bolivian |
qxn |
Quechua, Northern Conchucos Ancash |
qxu |
Quechua, Arequipa-La Unión |
rar |
Rarotongan |
rmn |
Romani, Balkan |
rmn_1 |
Romani, Balkan (1) |
rmy |
Aromanian |
roh |
Romansch |
roh_puter |
Romansch (Puter) |
roh_rumgr |
Romansch (Grischun) |
roh_surmiran |
Romansch (Surmiran) |
roh_sursilv |
Romansch (Sursilvan) |
roh_sutsilv |
Romansch (Sutsilvan) |
roh_vallader |
Romansch (Vallader) |
ron_1953 |
Romanian (1953) |
ron_1993 |
Romanian (1993) |
ron_2006 |
Romanian (2006) |
run |
Rundi |
rus |
Russian |
sag |
Sango |
sah |
Yakut |
san |
Sanskrit |
sco |
Scots |
sey |
Secoya |
shk |
Shilluk |
shn |
Shan |
shp |
Shipibo-Conibo |
sin |
Sinhala |
skr |
Seraiki |
slk |
Slovak |
slr |
Salar |
slv |
Slovenian |
sme |
North Saami |
smo |
Samoan |
sna |
Shona |
snk |
Soninke |
snn |
Siona |
som |
Somali |
sot |
Sotho, Southern |
spa |
Spanish |
src |
Sardinian, Logudorese |
srp_cyrl |
Serbian (Cyrillic) |
srp_latn |
Serbian (Latin) |
srq |
Sirionó |
srr |
Serer-Sine |
ssw |
Swati |
suk |
Sukuma |
sun |
Sunda |
sus |
Susu |
swb |
Comorian, Maore |
swe |
Swedish |
swh |
Swahili |
tah |
Tahitian |
tam |
Tamil |
tam_LK |
Tamil (Sri Lanka) |
tat |
Tatar |
tbz |
Ditammari |
tca |
Ticuna |
tel |
Telugu |
tem |
Themne |
tet |
Tetun |
tgk |
Tajiki |
tgl |
Tagalog |
tha |
Thai |
tha2 |
Thai (2) |
tir |
Tigrigna |
tiv |
Tiv |
tji |
Tujia, Nothern |
tly |
Talysh |
tna |
Tacana |
tob |
Toba |
toi |
Tonga |
toj |
Tojolabal |
ton |
Tongan |
top |
Totonac, Papantla |
tpi |
Tok Pisin |
trn |
Trinitario |
tsn |
Tswana |
tso_MZ |
Tsonga (Mozambique) |
tso_ZW |
Tsonga (Zimbabwe) |
tsz |
Purepecha |
tuk_cyrl |
Turkmen (Cyrillic) |
tuk_latn |
Turkmen (Latin) |
tur |
Turkish |
tyv |
Tuva |
tzc |
Tzotzil (Chamula) |
tzh |
Tzeltal, Oxchuc |
tzm |
Tamazight, Central Atlas |
udu |
Uduk |
uig_arab |
Uyghur (Arabic) |
uig_latn |
Uyghur (Latin) |
ukr |
Ukrainian |
umb |
Umbundu |
ura |
Urarina |
urd |
Urdu |
urd_2 |
Urdu (2) |
uzn_cyrl |
Uzbek, Northern (Cyrillic) |
uzn_latn |
Uzbek, Northern (Latin) |
vai |
Vai |
vec |
Venetian |
ven |
Venda |
ven2 |
Venda |
vep |
Veps |
vie |
Vietnamese |
vmw |
Makhuwa |
war |
Waray-Waray |
wln |
Walloon |
wol |
Wolof |
wwa |
Waama |
xho |
Xhosa |
xsm |
Kasem |
yad |
Yagua |
yao |
Yao |
yap |
Yapese |
ydd |
Yiddish, Eastern |
ykg |
Yukaghir, Northern |
yor |
Yoruba |
yrk |
Nenets |
yua |
Maya, Yucatán |
yuz |
Yuracare |
zam |
Zapotec, Miahuatlán |
zdj |
Comorian, Ngazidja |
zgh |
Tamazight, Standard Morocan |
zro |
Záparo |
ztu |
Zapotec, Güilá |
zul |
Zulu |
Compatibility
This package is at least compatible with all maintained versions of Node.js.
As of now,
that is Node.js 18+.
It also works in Deno and modern browsers.
Contribute
Yes please!
See How to Contribute to Open Source.
Security
This package is safe.
License
MIT © Titus Wormer