Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
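To illustrate what honoring the Robots Exclusion Standard and crawling slowly can look like, here is a minimal sketch built on Python's standard `urllib.robotparser`. It is not the crawler's actual code: the user agent name, the default delay, and the helper names `make_robots`, `polite_delay`, and `crawl_politely` are all assumptions made for this example.

```python
import time
import urllib.robotparser

USER_AGENT = "ExampleCorpusBot"   # hypothetical user agent, not the crawler's real one
DEFAULT_DELAY_SECONDS = 5.0       # assumed pause when robots.txt names no Crawl-delay


def make_robots(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent=USER_AGENT, default=DEFAULT_DELAY_SECONDS):
    """Return the site's Crawl-delay for this agent, or a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl_politely(parser, urls, fetch, sleep=time.sleep):
    """Fetch only the URLs that robots.txt allows, pausing after each request.

    `fetch` is any callable that maps a URL to page content; it is passed in
    so that the sketch stays free of network code.
    """
    delay = polite_delay(parser)
    pages = {}
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # excluded by robots.txt, so never requested
        pages[url] = fetch(url)
        sleep(delay)  # deliberately slow: keep the load on the site low
    return pages
```

Passing `fetch` and `sleep` as parameters keeps the exclusion and rate-limiting logic separate from the networking, which also makes it easy to try out against a robots.txt held in a string.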

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

The crawled corpora have been used to compute word frequencies in Unicode’s Unilex project.

Supported Languages

IETF BCP47 Code Language Tokens¹
aai Arifama-Miniafia 181K ๐Ÿ’พ
aak Ankave 194K ๐Ÿ’พ
aau Abau 313K ๐Ÿ’พ
aaz Amarasi 308K ๐Ÿ’พ
abt Ambulas 297K ๐Ÿ’พ
aby Aneme Wake 233K ๐Ÿ’พ
acd Gikyode 323K ๐Ÿ’พ
ace Aceh/Acehnese 817K ๐Ÿ’พ
acf Saint Lucian Creole French 236K ๐Ÿ’พ
ach Acoli 178K ๐Ÿ’พ
acn Achang 232K ๐Ÿ’พ
acr Achi 239K ๐Ÿ’พ
acu Achuar-Shiwiar 174K ๐Ÿ’พ
ade Adele 267K ๐Ÿ’พ
adh Adhola 166K ๐Ÿ’พ
adj Adioukrou 233K ๐Ÿ’พ
ae Avestan 129K ๐Ÿ’พ
ae-Latn Avestan (Latin) 141K ๐Ÿ’พ
aey Amele 218K ๐Ÿ’พ
agd Agarabi 256K ๐Ÿ’พ
agg Angor 214K ๐Ÿ’พ
agm Angaataha 238K ๐Ÿ’พ
agn Agutaynen 234K ๐Ÿ’พ
agr Aguaruna 149K ๐Ÿ’พ
ahk Akha 367K ๐Ÿ’พ
aia Arosi 223K ๐Ÿ’พ
akb Batak Angkola 220K ๐Ÿ’พ
ake Akawaio 190K ๐Ÿ’พ
akh Akha 408K ๐Ÿ’พ
akp Siwu 191K ๐Ÿ’พ
alj Alangan 185K ๐Ÿ’พ
alp Alune 225K ๐Ÿ’พ
alt Southern Altai 121K ๐Ÿ’พ
alz Alur 160K ๐Ÿ’พ
am Amharic 2,170K ๐Ÿ’พ
ame Yanesha' 221K ๐Ÿ’พ
amf Hamer-Banna 152K ๐Ÿ’พ
amk Ambai 229K ๐Ÿ’พ
amm Ama (Papua New Guinea) 246K ๐Ÿ’พ
amn Amanab 207K ๐Ÿ’พ
amp Alamblak 241K ๐Ÿ’พ
amr Amarakaeri 151K ๐Ÿ’พ
amu Guerrero Amuzgo 202K ๐Ÿ’พ
ann Obolo 236K ๐Ÿ’พ
anv Denya 214K ๐Ÿ’พ
aoj Mufian 217K ๐Ÿ’พ
aom Ömie 231K 💾
aon Bumbita Arapesh 294K ๐Ÿ’พ
aoz Uab Meto 197K ๐Ÿ’พ
ape Bukiyip 294K ๐Ÿ’พ
apr Arop-Lokep 373K ๐Ÿ’พ
apz Safeyoka 235K ๐Ÿ’พ
ar Arabic 19,593K ๐Ÿ’พ
arl Arabela 206K ๐Ÿ’พ
asg Cishingini 270K ๐Ÿ’พ
aso Dano 290K ๐Ÿ’พ
ata Pele-Ata 248K ๐Ÿ’พ
atb Zaiwa 291K ๐Ÿ’พ
atg Ivbie North-Okpela-Arhe 229K ๐Ÿ’พ
atq Aralle-Tabulahan 202K ๐Ÿ’พ
auy Awiyaana 164K ๐Ÿ’พ
av Avaric 111K ๐Ÿ’พ
avn Avatime 229K ๐Ÿ’พ
avt Au 263K ๐Ÿ’พ
avu Avokaya 391K ๐Ÿ’พ
awa Awadhi 211K ๐Ÿ’พ
awb Awa (Papua New Guinea) 179K ๐Ÿ’พ
ay Aymara 482K ๐Ÿ’พ
ayo Ayoreo 264K ๐Ÿ’พ
az Azerbaijani 3,413K ๐Ÿ’พ
azg San Pedro Amuzgos Amuzgo 271K ๐Ÿ’พ
azz Highland Puebla Nahuatl 265K ๐Ÿ’พ
ba Bashkir 666K ๐Ÿ’พ
ban Balinese 211K ๐Ÿ’พ
bao Waimaha 232K ๐Ÿ’พ
bav Vengo 250K ๐Ÿ’พ
bba Baatonum 792K ๐Ÿ’พ
bbb Barai 289K ๐Ÿ’พ
bbo Northern Bobo Madaré 211K 💾
bbr Girawa 245K ๐Ÿ’พ
bch Bariai 248K ๐Ÿ’พ
bcw Bana 304K ๐Ÿ’พ
bdd Bunama 171K ๐Ÿ’พ
be Belarusian 1,441K ๐Ÿ’พ
be-tarask Belarusian (Taraškievica) 108,431K 💾
bef Benabena 239K ๐Ÿ’พ
bep Besoa 204K ๐Ÿ’พ
bex Jur Modo 254K ๐Ÿ’พ
bfd Bafut 276K ๐Ÿ’พ
bfo Malba Birifor 260K ๐Ÿ’พ
bg Bulgarian 10,597K ๐Ÿ’พ
bgr Bawm Chin 213K ๐Ÿ’พ
bgz Banggai 186K ๐Ÿ’พ
bhl Bimin 324K ๐Ÿ’พ
bhw Biak 164K ๐Ÿ’พ
bi Bislama 315K ๐Ÿ’พ
bib Bissa 243K ๐Ÿ’พ
big Biangai 229K ๐Ÿ’พ
bik Central Bikol 183K ๐Ÿ’พ
bim Bimoba 215K ๐Ÿ’พ
biv Southern Birifor 221K ๐Ÿ’พ
bjr Binumarien 226K ๐Ÿ’พ
bjv Bedjond 268K ๐Ÿ’พ
bkl Berik 306K ๐Ÿ’พ
bku Buhid 204K ๐Ÿ’พ
bkv Bekwarra 244K ๐Ÿ’พ
blh Kuwaa 259K ๐Ÿ’พ
blt-Latn Tai Dam (Latin) 262K ๐Ÿ’พ
blz Balantak 199K ๐Ÿ’พ
bm Bambara 30K ๐Ÿ’พ
bmh Kein 253K ๐Ÿ’พ
bmq Bomu 207K ๐Ÿ’พ
bmr Muinane 122K ๐Ÿ’พ
bmu Somba-Siawari 234K ๐Ÿ’พ
bmv Bum 258K ๐Ÿ’พ
bn Bangla 7,258K ๐Ÿ’พ
bnj Eastern Tawbuid 239K ๐Ÿ’พ
bnp Bola 263K ๐Ÿ’พ
bo Tibetan 5,642K ๐Ÿ’พ
boa Bora 133K ๐Ÿ’พ
boj Anjam 255K ๐Ÿ’พ
bon Bine 244K ๐Ÿ’พ
bov Tuwuli 203K ๐Ÿ’พ
box Buamu 274K ๐Ÿ’พ
bpr Koronadal Blaan 204K ๐Ÿ’พ
bps Sarangani Blaan 214K ๐Ÿ’พ
bqc Boko 567K ๐Ÿ’พ
bqj Bandial 175K ๐Ÿ’พ
bqp Busa 162K ๐Ÿ’พ
bru Eastern Bru 261K ๐Ÿ’พ
bs Bosnian 8,993K ๐Ÿ’พ
bsn Barasana-Eduria 225K ๐Ÿ’พ
bss Akoose 199K ๐Ÿ’พ
btd Batak Dairi 192K ๐Ÿ’พ
bts Batak Simalungun 175K ๐Ÿ’พ
btt Bete-Bendi 266K ๐Ÿ’พ
btx Batak Karo 189K ๐Ÿ’พ
bua Buriat 143K ๐Ÿ’พ
bud Ntcham 207K ๐Ÿ’พ
buk Bugawac 264K ๐Ÿ’พ
bus Bokobaru 159K ๐Ÿ’พ
bvc Baelelea 308K ๐Ÿ’พ
bvz Bauzi 509K ๐Ÿ’พ
bwq Southern Bobo Madaré 214K 💾
bwu Buli 285K ๐Ÿ’พ
byr Baruya 182K ๐Ÿ’พ
byx Qaqet 387K ๐Ÿ’พ
bzh Mapos Buang 251K ๐Ÿ’พ
bzi Bisu 381K ๐Ÿ’พ
bzj Belize Kriol English 240K ๐Ÿ’พ
ca-valencia Valencian 24,295K ๐Ÿ’พ
caa Chortí 307K 💾
cab Garifuna 154K ๐Ÿ’พ
cac Chuj 244K ๐Ÿ’พ
cak Kaqchikel 259K ๐Ÿ’พ
cap Chipaya 154K ๐Ÿ’พ
car Galibi Carib 160K ๐Ÿ’พ
cax Chiquitano 149K ๐Ÿ’พ
cbc Carapana 256K ๐Ÿ’พ
cbi Chachi 187K ๐Ÿ’พ
cbl Bualkhaw Chin 210K ๐Ÿ’พ
cbr Cashibo-Cacataibo 236K ๐Ÿ’พ
cbs Cashinahua 198K ๐Ÿ’พ
cbt Chayahuita 150K ๐Ÿ’พ
cbv Cacua 265K ๐Ÿ’พ
cce Chopi 204K ๐Ÿ’พ
ccp Chakma 79K ๐Ÿ’พ
cdf Chiru 193K ๐Ÿ’พ
ce Chechen 669K ๐Ÿ’พ
ceb Cebuano 1,067K ๐Ÿ’พ
ceg Chamacoco 232K ๐Ÿ’พ
cfm Falam Chin 438K ๐Ÿ’พ
cgc Kagayanen 299K ๐Ÿ’พ
chj Ojitlán Chinantec 305K 💾
chm Mari 132K ๐Ÿ’พ
chr Cherokee 119K ๐Ÿ’พ
chz Ozumacín Chinantec 205K 💾
cjo Ashéninka Pajonal 141K 💾
cjp Cabécar 199K 💾
cjv Chuave 286K ๐Ÿ’พ
cko Anufo 272K ๐Ÿ’พ
cle Lealao Chinantec 313K ๐Ÿ’พ
cme Cerma 230K ๐Ÿ’พ
cmr Mro-Khimi Chin 275K ๐Ÿ’พ
cnh Hakha Chin 934K ๐Ÿ’พ
cni Asháninka 122K 💾
cnk Khumi Chin 237K ๐Ÿ’พ
cnl Lalana Chinantec 308K ๐Ÿ’พ
cnt Tepetotutla Chinantec 261K ๐Ÿ’พ
coe Koreguaje 181K ๐Ÿ’พ
cof Colorado 183K ๐Ÿ’พ
cok Santa Teresa Cora 230K ๐Ÿ’พ
con Cofán 151K 💾
cot Caquinte 128K ๐Ÿ’พ
crh Crimean Tatar 505K ๐Ÿ’พ
cs Czech 3,141K ๐Ÿ’พ
csk Jola-Kasa 177K ๐Ÿ’พ
cso Sochiapam Chinantec 328K ๐Ÿ’พ
ctd-Latn Tedim Chin (Latin) 852K ๐Ÿ’พ
ctu Chol 203K ๐Ÿ’พ
cub Cubeo 220K ๐Ÿ’พ
cuc Usila Chinantec 278K ๐Ÿ’พ
cui Cuiba 292K ๐Ÿ’พ
cuk San Blas Kuna 187K ๐Ÿ’พ
cul Culina 221K ๐Ÿ’พ
cv Chuvash 111K ๐Ÿ’พ
cwe Kwere 144K ๐Ÿ’พ
cwt Kuwaataay 168K ๐Ÿ’พ
cy Welsh 11,519K ๐Ÿ’พ
cya Nopala Chatino 245K ๐Ÿ’พ
czt Zotung Chin 227K ๐Ÿ’พ
da Danish 655K ๐Ÿ’พ
daa Dangaléat 208K 💾
dad Marik 197K ๐Ÿ’พ
dah Gwahatike 274K ๐Ÿ’พ
ddn Dendi 210K ๐Ÿ’พ
de German 46,431K ๐Ÿ’พ
ded Dedua 146K ๐Ÿ’พ
des Desano 210K ๐Ÿ’พ
dga Southern Dagaare 458K ๐Ÿ’พ
dgi Northern Dagara 257K ๐Ÿ’พ
dgz Daga 219K ๐Ÿ’พ
din Southwestern Dinka 196K ๐Ÿ’พ
dip Northeastern Dinka 193K ๐Ÿ’พ
djk Eastern Maroon Creole 307K ๐Ÿ’พ
dln Darlong 776K ๐Ÿ’พ
dnw Western Dani 254K ๐Ÿ’พ
dob Dobu 179K ๐Ÿ’พ
dop Lukpa 226K ๐Ÿ’พ
dsh Daasanach 211K ๐Ÿ’พ
dtb Labuk-Kinabatangan Kadazan 248K ๐Ÿ’พ
dtp Kadazan Dusun 1,038K ๐Ÿ’พ
dts Toro So Dogon 202K ๐Ÿ’พ
due Umiray Dumaget Agta 247K ๐Ÿ’พ
dug Duruma 172K ๐Ÿ’พ
duo Dupaninan Agta 266K ๐Ÿ’พ
dwr Dawro 254K ๐Ÿ’พ
dww Dawawa 208K ๐Ÿ’พ
dyi Djimini Senoufo 268K ๐Ÿ’พ
dyo Jola-Fonyi 158K ๐Ÿ’พ
dyu Dyula 1,156K ๐Ÿ’พ
dz Dzongkha 61K ๐Ÿ’พ
ee Ewe 421K ๐Ÿ’พ
eka Ekajuk 213K ๐Ÿ’พ
el Greek 5,470K ๐Ÿ’พ
emi Mussau-Emira 176K ๐Ÿ’พ
emp Northern Emberá 158K 💾
enb Markweeta 147K ๐Ÿ’พ
enq Enga 217K ๐Ÿ’พ
enx Enxet 772K ๐Ÿ’พ
eri Ogea 269K ๐Ÿ’พ
es Spanish 32,670K ๐Ÿ’พ
ese Ese Ejja 226K ๐Ÿ’พ
et Estonian 3,658K ๐Ÿ’พ
eu Basque 130K ๐Ÿ’พ
ewo Ewondo 158K ๐Ÿ’พ
eza Ezaa 963K ๐Ÿ’พ
fa Persian 9,114K ๐Ÿ’พ
fa-AF Dari 7,363K ๐Ÿ’พ
faa Fasu 238K ๐Ÿ’พ
fai Faiwol 256K ๐Ÿ’พ
fal South Fali 198K ๐Ÿ’พ
far Fataleka 286K ๐Ÿ’พ
fi Finnish 4,837K ๐Ÿ’พ
fil Tagalog 184K ๐Ÿ’พ
fip Fipa 134K ๐Ÿ’พ
fit Tornedalen Finnish 292K ๐Ÿ’พ
fj Fijian 257K ๐Ÿ’พ
fo Faroese 851K ๐Ÿ’พ
fon Fon 266K ๐Ÿ’พ
for Fore 169K ๐Ÿ’พ
fr French 5,488K ๐Ÿ’พ
fue Borgu Fulfulde 148K ๐Ÿ’พ
fuf Pular 174K ๐Ÿ’พ
fuq Central-Eastern Niger Fulfulde 156K ๐Ÿ’พ
fuv Nigerian Fulfulde 13K ๐Ÿ’พ
ga Irish 7,587K ๐Ÿ’พ
gag Gagauz 245K ๐Ÿ’พ
gah Alekano 210K ๐Ÿ’พ
gam Kandawo 250K ๐Ÿ’พ
gaw Nobonob 246K ๐Ÿ’พ
gbi Galela 288K ๐Ÿ’พ
gd Scottish Gaelic 17,105K ๐Ÿ’พ
gde Gude 217K ๐Ÿ’พ
gdn Umanakaina 306K ๐Ÿ’พ
gdr Wipi 271K ๐Ÿ’พ
gej Gen 236K ๐Ÿ’พ
gfk Patpatar 294K ๐Ÿ’พ
ghs Guhu-Samane 186K ๐Ÿ’พ
gil Gilbertese 228K ๐Ÿ’พ
gkn Gokana 267K ๐Ÿ’พ
gmv-Latn Gamo (Latin) 127K ๐Ÿ’พ
gn Guarani 142K ๐Ÿ’พ
gnd Zulgo-Gemzek 364K ๐Ÿ’พ
gng Ngangam 219K ๐Ÿ’พ
gnw Western Bolivian Guaraní 263K 💾
gof Gofa 124K ๐Ÿ’พ
gog Gogo 173K ๐Ÿ’พ
gor Gorontalo 211K ๐Ÿ’พ
gqr Gor 218K ๐Ÿ’พ
grb Northern Grebo 270K ๐Ÿ’พ
grt Garo 141K ๐Ÿ’พ
gso Southwest Gbaya 228K ๐Ÿ’พ
gsw-u-sd-chag Swiss German (Aargau) 99K ๐Ÿ’พ
gsw-u-sd-chbe Swiss German (Bern) 73K ๐Ÿ’พ
gsw-u-sd-chfr Swiss German (Fribourg) 42K ๐Ÿ’พ
gu Gujarati 702K ๐Ÿ’พ
gub Guajajára 997K 💾
guc Wayuu 211K ๐Ÿ’พ
gud Yocoboué Dida 216K 💾
guh Guahibo 204K ๐Ÿ’พ
gui Eastern Bolivian Guaraní 197K 💾
gum Guambiano 186K ๐Ÿ’พ
gun Mbyá Guaraní 176K 💾
guo Guayabero 203K ๐Ÿ’พ
guq Aché 184K 💾
gur Farefare 240K ๐Ÿ’พ
gux Gourmanchéma 215K 💾
gv Manx Gaelic 152K ๐Ÿ’พ
gvc Guanano 241K ๐Ÿ’พ
gvf Golin 276K ๐Ÿ’พ
gvl Gulay 270K ๐Ÿ’พ
gwr Gwere 157K ๐Ÿ’พ
gym Ngäbere 294K 💾
gyr Guarayu 176K ๐Ÿ’พ
ha Hausa 1,775K ๐Ÿ’พ
hae Eastern Oromo 163K ๐Ÿ’พ
hag Hanga 202K ๐Ÿ’พ
haw Hawaiian 2,221K ๐Ÿ’พ
hay Haya 112K ๐Ÿ’พ
heh Hehe 136K ๐Ÿ’พ
hi Hindi 10,004K ๐Ÿ’พ
hif Fiji Hindi 204K ๐Ÿ’พ
hig Kamwe 261K ๐Ÿ’พ
hil Hiligaynon 208K ๐Ÿ’พ
hla Halia 273K ๐Ÿ’พ
hne Chhattisgarhi 207K ๐Ÿ’พ
hnn Hanunoo 212K ๐Ÿ’พ
hns Caribbean Hindustani 312K ๐Ÿ’พ
ho Hiri Motu 240K ๐Ÿ’พ
hot Hote 222K ๐Ÿ’พ
hr Croatian 8,188K ๐Ÿ’พ
ht Haitian 1,101K ๐Ÿ’พ
hto Minica Huitoto 182K ๐Ÿ’พ
hu Hungarian 600K ๐Ÿ’พ
hub Huambisa 160K ๐Ÿ’พ
hui Huli 232K ๐Ÿ’พ
hus Huastec 236K ๐Ÿ’พ
huu Murui Huitoto 165K ๐Ÿ’พ
huv San Mateo Del Mar Huave 197K ๐Ÿ’พ
hvn Sabu 312K ๐Ÿ’พ
hy Armenian 25,972K ๐Ÿ’พ
ian Iatmul 224K ๐Ÿ’พ
iba Iban 179K ๐Ÿ’พ
icr Islander Creole English 248K ๐Ÿ’พ
id Indonesian 6,634K ๐Ÿ’พ
ifa Amganad Ifugao 810K ๐Ÿ’พ
ifb Batad Ifugao 835K ๐Ÿ’พ
ife Ifè 300K 💾
ifk Tuwali Ifugao 214K ๐Ÿ’พ
ifu Mayoyao Ifugao 258K ๐Ÿ’พ
ify Keley-I Kallahan 863K ๐Ÿ’พ
ig Igbo 13K ๐Ÿ’พ
ign Ignaciano 161K ๐Ÿ’พ
ik Inupiaq 96K ๐Ÿ’พ
ilo Iloko 169K ๐Ÿ’พ
imo Imbongu 280K ๐Ÿ’พ
inb Inga 151K ๐Ÿ’พ
ino Inoke-Yate 236K ๐Ÿ’พ
iou Tuma-Irumu 225K ๐Ÿ’พ
ipi Ipili 312K ๐Ÿ’พ
iri Irigwe 243K ๐Ÿ’พ
irk Iraqw 184K ๐Ÿ’พ
iry Iraya 205K ๐Ÿ’พ
it Italian 13,569K ๐Ÿ’พ
itv Itawit 242K ๐Ÿ’พ
iu Inuktitut 98K ๐Ÿ’พ
iws Sepik Iwam 307K ๐Ÿ’พ
izr Izere 216K ๐Ÿ’พ
izz Izii 908K ๐Ÿ’พ
ja Japanese 2,116K ๐Ÿ’พ
jac Popti' 221K ๐Ÿ’พ
jae Yabem 186K ๐Ÿ’พ
jam Jamaican Creole English 254K ๐Ÿ’พ
jbu Jukun Takum 264K ๐Ÿ’พ
jic Tol 285K ๐Ÿ’พ
jiv Shuar 134K ๐Ÿ’พ
jmc Machame 150K ๐Ÿ’พ
jun Juang 178K ๐Ÿ’พ
jv Javanese 177K ๐Ÿ’พ
jvn Caribbean Javanese 211K ๐Ÿ’พ
ka Georgian 4,978K ๐Ÿ’พ
kaa Kara-Kalpak 135K ๐Ÿ’พ
kab-Arab Kabyle (Arabic) 715K ๐Ÿ’พ
kab-Tfng Kabyle (Tifinagh) 1,338K ๐Ÿ’พ
kab Kabyle 66K ๐Ÿ’พ
kac Kachin 1,057K ๐Ÿ’พ
kao Xaasongaxango 205K ๐Ÿ’พ
kaq Capanahua 164K ๐Ÿ’พ
kbh Camsá 193K 💾
kbm Iwal 298K ๐Ÿ’พ
kbp Kabiyè 571K 💾
kbq Kamano 156K ๐Ÿ’พ
kbr Kafa 147K ๐Ÿ’พ
kcg Tyap 279K ๐Ÿ’พ
kdc Kutu 140K ๐Ÿ’พ
kdi Kumam 195K ๐Ÿ’พ
kdj Karamojong 163K ๐Ÿ’พ
kdn Kunda 144K ๐Ÿ’พ
kek Kekchí 406K 💾
ken Kenyang 200K ๐Ÿ’พ
keo Kakwa 215K ๐Ÿ’พ
ker Kera 267K ๐Ÿ’พ
kew West Kewa 247K ๐Ÿ’พ
kez Kukele 173K ๐Ÿ’พ
kgf Kube 175K ๐Ÿ’พ
kgr Abun 356K ๐Ÿ’พ
khz Keapara 196K ๐Ÿ’พ
kia Kim 525K ๐Ÿ’พ
kij Kilivila 155K ๐Ÿ’พ
kj Kuanyama 1,474K ๐Ÿ’พ
kjb Q'anjob'al 263K ๐Ÿ’พ
kje Kisar 235K ๐Ÿ’พ
kjh Khakas 128K ๐Ÿ’พ
kjs East Kewa 251K ๐Ÿ’พ
kk Kazakh 642K ๐Ÿ’พ
kki Kagulu 125K ๐Ÿ’พ
kkj Kako 263K ๐Ÿ’พ
kln Kalenjin 149K ๐Ÿ’พ
km Khmer 29,110K ๐Ÿ’พ
kma Konni 230K ๐Ÿ’พ
kmg Kâte 127K 💾
kmo Kwoma 213K ๐Ÿ’พ
kms Kamasau 293K ๐Ÿ’พ
kmu Kanite 214K ๐Ÿ’พ
kn Kannada 126K ๐Ÿ’พ
kne Kankanaey 230K ๐Ÿ’พ
knf Mankanya 164K ๐Ÿ’พ
knj Western Kanjobal 1,350K ๐Ÿ’พ
knk Kuranko 228K ๐Ÿ’พ
kno Kono 360K ๐Ÿ’พ
knv Tabo 243K ๐Ÿ’พ
kog Cogui 189K ๐Ÿ’พ
kpf Komba 174K ๐Ÿ’พ
kpg Kapingamarangi 967K ๐Ÿ’พ
kpr Korafe-Yegha 262K ๐Ÿ’พ
kpw Kobon 288K ๐Ÿ’พ
kpx Mountain Koiali 190K ๐Ÿ’พ
kpz Kupsabiny 166K ๐Ÿ’พ
kqc Doromu-Koki 209K ๐Ÿ’พ
kqe Kalagan 241K ๐Ÿ’พ
kqp Kimré 254K 💾
kqw Kandas 201K ๐Ÿ’พ
kqy Koorete 156K ๐Ÿ’พ
krc Karachay-Balkar 132K ๐Ÿ’พ
kri Krio 256K ๐Ÿ’พ
krj Kinaray-A 228K ๐Ÿ’พ
kru Kurukh 182K ๐Ÿ’พ
ksd Kuanua 228K ๐Ÿ’พ
ksr Borong 233K ๐Ÿ’พ
ktb Kambaata 113K ๐Ÿ’พ
ktj Plapo Krumen 356K ๐Ÿ’พ
kto Kuot 286K ๐Ÿ’พ
ku Kurdish 2,479K ๐Ÿ’พ
kub Kutep 281K ๐Ÿ’พ
kud ‘Auhelawa 167K 💾
kue Kuman (Papua New Guinea) 230K ๐Ÿ’พ
kum Kumyk 142K ๐Ÿ’พ
kup Kunimaipa 279K ๐Ÿ’พ
kus Kusaal 200K ๐Ÿ’พ
kv Komi 122K ๐Ÿ’พ
kvn Border Kuna 212K ๐Ÿ’พ
kwf Kwara'ae 296K ๐Ÿ’พ
kwi Awa-Cuaiquer 165K ๐Ÿ’พ
kwj Kwanga 290K ๐Ÿ’พ
kxc Konso 148K ๐Ÿ’พ
kxm Northern Khmer 257K ๐Ÿ’พ
ky Kyrgyz 18,597K ๐Ÿ’พ
kyc Kyaka 220K ๐Ÿ’พ
kyf Kouya 215K ๐Ÿ’พ
kyg Keyagana 190K ๐Ÿ’พ
kyq Kenga 250K ๐Ÿ’พ
kyu Western Kayah 466K ๐Ÿ’พ
kyz Kayabí 324K 💾
kze Kosena 164K ๐Ÿ’พ
kzf Da'a Kaili 213K ๐Ÿ’พ
kzj Coastal Kadazan 215K ๐Ÿ’พ
la Latin 48K ๐Ÿ’พ
laj Lango 175K ๐Ÿ’พ
las Lama 235K ๐Ÿ’พ
law Lauje 262K ๐Ÿ’พ
lb Luxembourgish 5,173K ๐Ÿ’พ
lcm Tungag 239K ๐Ÿ’พ
lee Lyélé 257K 💾
lef Lelemi 211K ๐Ÿ’พ
lem Nomaande 249K ๐Ÿ’พ
leu Kara (Papua New Guinea) 255K ๐Ÿ’พ
lew Ledo Kaili 198K ๐Ÿ’พ
lex Luang 271K ๐Ÿ’พ
lgg Lugbara 188K ๐Ÿ’พ
lhu Lahu 352K ๐Ÿ’พ
lia West-Central Limba 247K ๐Ÿ’พ
lid Nyindrou 308K ๐Ÿ’พ
lif Limbu 138K ๐Ÿ’พ
lip Sekpele 214K ๐Ÿ’พ
lis Lisu 304K ๐Ÿ’พ
ljp Lampung Api 188K ๐Ÿ’พ
lln Lele 291K ๐Ÿ’พ
lme Pévé 245K 💾
lmk Lamkang 217K ๐Ÿ’พ
lnd Lundayeh 670K ๐Ÿ’พ
lo Lao 4,384K ๐Ÿ’พ
lob Lobi 192K ๐Ÿ’พ
loe Saluan 220K ๐Ÿ’พ
lok Loko 264K ๐Ÿ’พ
lon Malawi Lomwe 137K ๐Ÿ’พ
lsi Lashi 1,077K ๐Ÿ’พ
lsm Saamia 156K ๐Ÿ’พ
lt Lithuanian 39,575K ๐Ÿ’พ
luc Aringa 242K ๐Ÿ’พ
lus Lushai 204K ๐Ÿ’พ
lv Latvian 1,020K ๐Ÿ’พ
lwo Luwo 255K ๐Ÿ’พ
maa San Jerónimo Tecóatl Mazatec 487K 💾
mad Madurese 706K ๐Ÿ’พ
mag Magahi 193K ๐Ÿ’พ
mai Maithili 211K ๐Ÿ’พ
maj Jalapa De Díaz Mazatec 188K 💾
mak Makasar 179K ๐Ÿ’พ
mam Mam 834K ๐Ÿ’พ
maw Mampruli 251K ๐Ÿ’พ
maz Central Mazahua 286K ๐Ÿ’พ
mbb Western Bukidnon Manobo 278K ๐Ÿ’พ
mbc Macushi 221K ๐Ÿ’พ
mbh Mangseng 321K ๐Ÿ’พ
mbt Matigsalug Manobo 226K ๐Ÿ’พ
mca Maca 208K ๐Ÿ’พ
mcb Machiguenga 132K ๐Ÿ’พ
mcd Sharanahua 200K ๐Ÿ’พ
mco Coatlán Mixe 217K 💾
mcp Makaa 237K ๐Ÿ’พ
mcq Ese 158K ๐Ÿ’พ
mcu Cameroon Mambila 260K ๐Ÿ’พ
mda Mada 312K ๐Ÿ’พ
mdy Male 589K ๐Ÿ’พ
med Melpa 283K ๐Ÿ’พ
mee Mengen 301K ๐Ÿ’พ
mej Meyah 323K ๐Ÿ’พ
mek Mekeo 234K ๐Ÿ’พ
men Mende 210K ๐Ÿ’พ
meq Merey 291K ๐Ÿ’พ
meu Motu 175K ๐Ÿ’พ
mfe Morisyen 172K ๐Ÿ’พ
mfh Matal 238K ๐Ÿ’พ
mfi Wandala 265K ๐Ÿ’พ
mfk North Mofu 248K ๐Ÿ’พ
mfq Moba 232K ๐Ÿ’พ
mfy Mayo 167K ๐Ÿ’พ
mfz Mabaan 237K ๐Ÿ’พ
mg Malagasy 1,623K ๐Ÿ’พ
mgd Moru 192K ๐Ÿ’พ
mgh Makhuwa-Meetto 150K ๐Ÿ’พ
mgo Meta' 251K ๐Ÿ’พ
mh Marshallese 750K ๐Ÿ’พ
mhi Ma'di 192K ๐Ÿ’พ
mhl Mauwake 235K ๐Ÿ’พ
mhx Maru 291K ๐Ÿ’พ
mhy Ma'anyan 190K ๐Ÿ’พ
mi Maori 1,504K ๐Ÿ’พ
mib Atatláhuca Mixtec 263K 💾
mif Mofu-Gudur 283K ๐Ÿ’พ
mil Peñoles Mixtec 365K 💾
min Minangkabau 242K ๐Ÿ’พ
mio Pinotepa Nacional Mixtec 288K ๐Ÿ’พ
miq Mískito 214K 💾
mit Southern Puebla Mixtec 273K ๐Ÿ’พ
mk Macedonian 10,422K ๐Ÿ’พ
mkl Mokole 230K ๐Ÿ’พ
ml Malayalam 118K ๐Ÿ’พ
mlh Mape 235K ๐Ÿ’พ
mlp Bargam 297K ๐Ÿ’พ
mmo Mangga Buang 269K ๐Ÿ’พ
mmx Madak 271K ๐Ÿ’พ
mna Mbula 257K ๐Ÿ’พ
mnb Muna 151K ๐Ÿ’พ
mnf Mundani 241K ๐Ÿ’พ
mnw Mon 1,836K ๐Ÿ’พ
moa Mwan 308K ๐Ÿ’พ
mog Mongondow 220K ๐Ÿ’พ
mop Mopán Maya 296K 💾
mor Moro 152K ๐Ÿ’พ
mox Molima 222K ๐Ÿ’พ
mpg Marba 210K ๐Ÿ’พ
mpm Yosondúa Mixtec 336K 💾
mps Dadibi 1,270K ๐Ÿ’พ
mpt Mian 256K ๐Ÿ’พ
mpx Misima-Panaeati 227K ๐Ÿ’พ
mqb Mbuko 302K ๐Ÿ’พ
mqj Mamasa 164K ๐Ÿ’พ
mqn Moronene 164K ๐Ÿ’พ
mr Marathi 16,594K ๐Ÿ’พ
mrw Maranao 912K ๐Ÿ’พ
ms Malay 659K ๐Ÿ’พ
msm Agusan Manobo 225K ๐Ÿ’พ
msy Aruamu 229K ๐Ÿ’พ
mt Maltese 3,331K ๐Ÿ’พ
mta Cotabato Manobo 262K ๐Ÿ’พ
mti Maiwa (Papua New Guinea) 166K ๐Ÿ’พ
mtj Moskona 321K ๐Ÿ’พ
mto Totontepec Mixe 233K ๐Ÿ’พ
mtp Wichí Lhamtés Nocten 183K 💾
muh Mündü 392K 💾
mur Murle 210K ๐Ÿ’พ
mux Bo-Ung 363K ๐Ÿ’พ
muy Muyang 265K ๐Ÿ’พ
mva Manam 231K ๐Ÿ’พ
mvp Duri 174K ๐Ÿ’พ
mwv Mentawai 141K ๐Ÿ’พ
mxb Tezoatlán Mixtec 281K 💾
mxt Jamiltepec Mixtec 267K ๐Ÿ’พ
my Burmese 1,007K ๐Ÿ’พ
my-t-d0-zawgyi Burmese (Zawgyi encoding) 593K ๐Ÿ’พ
myb Mbay 192K ๐Ÿ’พ
myk Mamara Senoufo 272K ๐Ÿ’พ
myv Erzya 143K ๐Ÿ’พ
myw Muyuw 150K ๐Ÿ’พ
myx Masaaba 164K ๐Ÿ’พ
myy Macuna 245K ๐Ÿ’พ
mza Santa María Zacatepec Mixtec 316K 💾
mzi Ixcatlán Mazatec 190K 💾
mzk Nigeria Mambila 283K ๐Ÿ’พ
mzm Mumuye 265K ๐Ÿ’พ
naf Nabak 220K ๐Ÿ’พ
nak Nakanai 333K ๐Ÿ’พ
nan-Latn Min Nan Chinese (Latin) 231K ๐Ÿ’พ
nas Naasioi 168K ๐Ÿ’พ
nca Iyo 203K ๐Ÿ’พ
nch Central Huasteca Nahuatl 195K ๐Ÿ’พ
ncj Northern Puebla Nahuatl 164K ๐Ÿ’พ
ncu Chumburung 312K ๐Ÿ’พ
ndj Ndamba 141K ๐Ÿ’พ
ndy Lutos 216K ๐Ÿ’พ
ndz Ndogo 350K ๐Ÿ’พ
neb Toura 326K ๐Ÿ’พ
new Newari 150K ๐Ÿ’พ
nfr Nafaanra 233K ๐Ÿ’พ
ngp Ngulu 149K ๐Ÿ’พ
nho Takuu 309K ๐Ÿ’พ
nhu Noone 270K ๐Ÿ’พ
nhw Western Huasteca Nahuatl 194K ๐Ÿ’พ
nhy Northern Oaxaca Nahuatl 185K ๐Ÿ’พ
nia Nias 182K ๐Ÿ’พ
nii Nii 316K ๐Ÿ’พ
nij Ngaju 194K ๐Ÿ’พ
nim Nilamba 117K ๐Ÿ’พ
nin Ninzo 267K ๐Ÿ’พ
nkf Inpui Naga 197K ๐Ÿ’พ
nko Nkonya 168K ๐Ÿ’พ
nl Dutch 58,357K ๐Ÿ’พ
nlc Nalca 241K ๐Ÿ’พ
nmz Nawdm 209K ๐Ÿ’พ
nnb Nande 127K ๐Ÿ’พ
nnq Ngindo 137K ๐Ÿ’พ
nnw Southern Nuni 291K ๐Ÿ’พ
noa Woun Meu 275K ๐Ÿ’พ
nog Nogai 104K ๐Ÿ’พ
nop Numanggang 183K ๐Ÿ’พ
not Nomatsiguenga 141K ๐Ÿ’พ
nou Ewage-Notu 266K ๐Ÿ’พ
npl Southeastern Puebla Nahuatl 148K ๐Ÿ’พ
npy Napu 192K ๐Ÿ’พ
nsn Nehan 248K ๐Ÿ’พ
nsu Sierra Negra Nahuatl 170K ๐Ÿ’พ
ntm Nateni 229K ๐Ÿ’พ
ntp Northern Tepehuan 173K ๐Ÿ’พ
ntr Delo 272K ๐Ÿ’พ
nuj Nyole 151K ๐Ÿ’พ
nus Nuer 195K ๐Ÿ’พ
nvm Namiae 290K ๐Ÿ’พ
nwb Nyabwa 316K ๐Ÿ’พ
nwi Southwest Tanna 230K ๐Ÿ’พ
ny Nyanja 356K ๐Ÿ’พ
nyf Giryama 169K ๐Ÿ’พ
nyn Nyankole 120K ๐Ÿ’พ
nyo Nyoro 120K ๐Ÿ’พ
nyy Nyakyusa-Ngonde 138K ๐Ÿ’พ
nzi Nzima 201K ๐Ÿ’พ
obo Obo Manobo 266K ๐Ÿ’พ
oc Occitan 2,706K ๐Ÿ’พ
oku Oku 239K ๐Ÿ’พ
okv Orokaiva 212K ๐Ÿ’พ
old Mochi 151K ๐Ÿ’พ
ong Olo 284K ๐Ÿ’พ
opm Oksapmin 332K ๐Ÿ’พ
or Oriya 175K ๐Ÿ’พ
os Ossetic 135K ๐Ÿ’พ
osa Osage 3K ๐Ÿ’พ
otd Ot Danum 187K ๐Ÿ’พ
ote Mezquital Otomi 251K ๐Ÿ’พ
ozm Koonzime 267K ๐Ÿ’พ
pa Punjabi 59,990K ๐Ÿ’พ
pab Parecís 156K 💾
pad Paumarí 242K 💾
pag Pangasinan 177K ๐Ÿ’พ
pah Tenharim 268K ๐Ÿ’พ
pam Pampanga 196K ๐Ÿ’พ
pau Palauan 255K ๐Ÿ’พ
pbc Patamona 181K ๐Ÿ’พ
pbi Parkwa 272K ๐Ÿ’พ
pck Paite Chin 770K ๐Ÿ’พ
pcm Nigerian Pidgin 315K ๐Ÿ’พ
pez Eastern Penan 235K ๐Ÿ’พ
pib Yine 114K ๐Ÿ’พ
pir Piratapuyo 229K ๐Ÿ’พ
pis Pijin 263K ๐Ÿ’พ
pjt Pitjantjatjara 237K ๐Ÿ’พ
pkb Pokomo 166K ๐Ÿ’พ
pl Polish 7,148K ๐Ÿ’พ
plw Brooke's Point Palawano 203K ๐Ÿ’พ
pmf Pamona 307K ๐Ÿ’พ
pny Pinyin 247K ๐Ÿ’พ
poh Poqomchi' 266K ๐Ÿ’พ
poi Highland Popoluca 179K ๐Ÿ’พ
poy Pogolo 147K ๐Ÿ’พ
ppk Uma 220K ๐Ÿ’พ
ppo Folopa 258K ๐Ÿ’พ
prf Paranan 203K ๐Ÿ’พ
prk Parauk 1,026K ๐Ÿ’พ
ps Pashto 7,343K ๐Ÿ’พ
pss Kaulong 326K ๐Ÿ’พ
pt Portuguese 20,891K ๐Ÿ’พ
pt-PT Portuguese (Portugal) 666K ๐Ÿ’พ
ptp Patep 294K ๐Ÿ’พ
ptu Bambam 194K ๐Ÿ’พ
pwg Gapapaiwa 208K ๐Ÿ’พ
pww Pwo Northern Karen 345K ๐Ÿ’พ
pxm Quetzaltepec Mixé 720K 💾
qu Quechua 580K ๐Ÿ’พ
qub Huallaga Huánuco Quechua 122K 💾
quc K'iche' 207K ๐Ÿ’พ
quf Lambayeque Quechua 161K ๐Ÿ’พ
quh South Bolivian Quechua 623K ๐Ÿ’พ
qul North Bolivian Quechua 140K ๐Ÿ’พ
qup Southern Pastaza Quechua 177K ๐Ÿ’พ
quw Tena Lowland Quichua 116K ๐Ÿ’พ
quy Ayacucho Quechua 106K ๐Ÿ’พ
qvc Cajamarca Quechua 166K ๐Ÿ’พ
qve Eastern Apurímac Quechua 168K 💾
qvi Imbabura Highland Quichua 146K ๐Ÿ’พ
qvm Margos-Yarowilca-Lauricocha Quechua 132K ๐Ÿ’พ
qvn North Junín Quechua 139K 💾
qvo Napo Lowland Quechua 117K ๐Ÿ’พ
qvs San Martín Quechua 153K 💾
qvw Huaylla Wanca Quechua 111K ๐Ÿ’พ
qvz Northern Pastaza Quichua 157K ๐Ÿ’พ
qwh Huaylas Ancash Quechua 128K ๐Ÿ’พ
qxh Panao Huánuco Quechua 123K 💾
qxl Salasaca Highland Quichua 127K ๐Ÿ’พ
qxn Northern Conchucos Ancash Quechua 150K ๐Ÿ’พ
qxo Southern Conchucos Ancash Quechua 136K ๐Ÿ’พ
qxr Cañar Highland Quichua 509K 💾
rai Ramoaaina 273K ๐Ÿ’พ
raj Malvi 198K ๐Ÿ’พ
rav Sampang 138K ๐Ÿ’พ
rej Rejang 178K ๐Ÿ’พ
rim Nyaturu 151K ๐Ÿ’พ
rm-puter Romansh (Puter) 1,068K ๐Ÿ’พ
rm-rumgr Romansh (Grischun) 4,794K ๐Ÿ’พ
rm-surmiran Romansh (Surmiran) 2,540K ๐Ÿ’พ
rm-sursilv Romansh (Sursilvan) 11,678K ๐Ÿ’พ
rm-sutsilv Romansh (Sutsilvan) 1,007K ๐Ÿ’พ
rm-vallader Romansh (Vallader) 5,560K ๐Ÿ’พ
rmc Carpathian Romani 170K ๐Ÿ’พ
rmo Sinte Romani 228K ๐Ÿ’พ
rn Rundi 120K ๐Ÿ’พ
rnl Ranglong 221K ๐Ÿ’พ
ro Romanian 13,962K ๐Ÿ’พ
ro-MD Moldavian 2,694K ๐Ÿ’พ
rom Vlax Romani 186K ๐Ÿ’พ
roo Rotokas 292K ๐Ÿ’พ
rro Waima 177K ๐Ÿ’พ
ru Russian 40,987K ๐Ÿ’พ
ruf Luguru 135K ๐Ÿ’พ
rug Roviana 956K ๐Ÿ’พ
rw Kinyarwanda 605K ๐Ÿ’พ
rwo Rawa 261K ๐Ÿ’พ
sab Buglere 405K ๐Ÿ’พ
sah Sakha 2,457K ๐Ÿ’พ
sas Sasak 196K ๐Ÿ’พ
sat Santali 149K ๐Ÿ’พ
sba Ngambay 246K ๐Ÿ’พ
sbl Botolan Sambal 251K ๐Ÿ’พ
sck Sadri 189K ๐Ÿ’พ
sda Toraja-Sa'dan 154K ๐Ÿ’พ
seh Sena 155K ๐Ÿ’พ
sey Secoya 163K ๐Ÿ’พ
sg Sango 265K ๐Ÿ’พ
sgb Mag-antsi Ayta 233K ๐Ÿ’พ
sgw Sebat Bet Gurage 116K ๐Ÿ’พ
sgz Sursurunga 327K ๐Ÿ’พ
shk Shilluk 189K ๐Ÿ’พ
shn Shan 1,435K ๐Ÿ’พ
shp Shipibo-Conibo 169K ๐Ÿ’พ
si Sinhala 1,046K ๐Ÿ’พ
sig Paasaal 277K ๐Ÿ’พ
sil Tumulung Sisaala 256K ๐Ÿ’พ
sim Mende (Papua New Guinea) 273K ๐Ÿ’พ
sja Epena 194K ๐Ÿ’พ
sk Slovak 70,933K ๐Ÿ’พ
sl Slovenian 10,975K ๐Ÿ’พ
sld Sissala 206K ๐Ÿ’พ
sll Salt-Yui 264K ๐Ÿ’พ
sm Samoan 248K ๐Ÿ’พ
smt Simte 177K ๐Ÿ’พ
sn Shona 2,542K ๐Ÿ’พ
snc Sinaugoro 216K ๐Ÿ’พ
snn Siona 222K ๐Ÿ’พ
snp Siane 237K ๐Ÿ’พ
snw Selee 212K ๐Ÿ’พ
sny Saniyo-Hiyewe 348K ๐Ÿ’พ
so Somali 874K ๐Ÿ’พ
soq Kanasi 213K ๐Ÿ’พ
soy Miyobe 205K ๐Ÿ’พ
spl Selepet 244K ๐Ÿ’พ
spp Supyire Senoufo 251K ๐Ÿ’พ
sps Saposa 324K ๐Ÿ’พ
sq Albanian 10,104K ๐Ÿ’พ
sr Serbian 4,785K ๐Ÿ’พ
sr-Latn Serbian (Latin) 10,143K ๐Ÿ’พ
sri Siriano 166K ๐Ÿ’พ
srm Saramaccan 369K ๐Ÿ’พ
srn Sranan Tongo 232K ๐Ÿ’พ
ssd Siroi 210K ๐Ÿ’พ
ssg Seimat 221K ๐Ÿ’พ
ssx Samberigi 233K ๐Ÿ’พ
stn Owa 263K ๐Ÿ’พ
su Sundanese 172K ๐Ÿ’พ
sua Sulka 458K ๐Ÿ’พ
sue Suena 227K ๐Ÿ’พ
sur Mwaghavul 261K ๐Ÿ’พ
sus Susu 205K ๐Ÿ’พ
suz Sunwar 732K ๐Ÿ’พ
sv Swedish 33,633K ๐Ÿ’พ
sw Swahili 8,817K ๐Ÿ’พ
swp Suau 175K ๐Ÿ’พ
sxn Sangir 209K ๐Ÿ’พ
ta Tamil 1,413K ๐Ÿ’พ
tab Tabassaran 132K ๐Ÿ’พ
taj Eastern Tamang 169K ๐Ÿ’พ
tap Taabwa 145K ๐Ÿ’พ
taq Tamasheq 218K ๐Ÿ’พ
tav Tatuyo 256K ๐Ÿ’พ
taw Tai 268K ๐Ÿ’พ
tbc Takia 278K ๐Ÿ’พ
tbg North Tairora 235K ๐Ÿ’พ
tbo Tawala 198K ๐Ÿ’พ
tby Tabaru 226K ๐Ÿ’พ
tbz Ditammari 692K ๐Ÿ’พ
tca Ticuna 251K ๐Ÿ’พ
tcc Datooga 135K ๐Ÿ’พ
te Telugu 574K ๐Ÿ’พ
ted Tepo Krumen 346K ๐Ÿ’พ
tem Timne 190K ๐Ÿ’พ
teo Teso 118K ๐Ÿ’พ
ter Tereno 187K ๐Ÿ’พ
tfr Teribe 228K ๐Ÿ’พ
tgo Sudest 216K ๐Ÿ’พ
tgp Tangoa 228K ๐Ÿ’พ
thk Tharaka 150K ๐Ÿ’พ
ti Tigrinya 803K ๐Ÿ’พ
tif Tifal 413K ๐Ÿ’พ
tih Timugon Murut 879K ๐Ÿ’พ
tik Tikar 264K ๐Ÿ’พ
tim Timbe 206K ๐Ÿ’พ
tk Turkmen 516K ๐Ÿ’พ
tlb Tobelo 209K ๐Ÿ’พ
tlf Telefol 422K ๐Ÿ’พ
tlj Talinga-Bwisi 159K ๐Ÿ’พ
tmc Tumak 245K ๐Ÿ’พ
tna Tacana 216K ๐Ÿ’พ
tnr Ménik 254K 💾
to Tonga 1,214K ๐Ÿ’พ
tob Toba 229K ๐Ÿ’พ
toc Coyutla Totonac 218K ๐Ÿ’พ
toh Gitonga 194K ๐Ÿ’พ
top Papantla Totonac 168K ๐Ÿ’พ
tos Highland Totonac 224K ๐Ÿ’พ
tpi Tok Pisin 8,049K ๐Ÿ’พ
tpm Tampulma 892K ๐Ÿ’พ
tpp Pisaflores Tepehua 162K ๐Ÿ’พ
tpt Tlachichilco Tepehua 173K ๐Ÿ’พ
tpz Tinputz 370K ๐Ÿ’พ
tqo Toaripi 215K ๐Ÿ’พ
tr Turkish 13,846K ๐Ÿ’พ
trs Chicahuaxtla Triqui 287K ๐Ÿ’พ
tsz Purepecha 129K ๐Ÿ’พ
tt Tatar 1,356K ๐Ÿ’พ
ttc Tektiteko 231K ๐Ÿ’พ
tte Bwanabwana 198K ๐Ÿ’พ
tue Tuyuca 141K ๐Ÿ’พ
tuf Central Tunebo 237K ๐Ÿ’พ
twb Western Tawbuid 198K ๐Ÿ’พ
twu Termanu 242K ๐Ÿ’พ
txa Tombonuo 224K ๐Ÿ’พ
txu Kayapó 354K 💾
tyv Tuvinian 614K ๐Ÿ’พ
tyz Tày 260K 💾
tzh Tzeltal 901K ๐Ÿ’พ
tzj Tz'utujil 245K ๐Ÿ’พ
ubr Ubir 222K ๐Ÿ’พ
ubu Umbu-Ungu 308K ๐Ÿ’พ
udm Udmurt 135K ๐Ÿ’พ
udu Uduk 287K ๐Ÿ’พ
ug Uyghur 9,493K ๐Ÿ’พ
uk Ukrainian 12,921K ๐Ÿ’พ
ur Urdu 3,622K ๐Ÿ’พ
ura Urarina 193K ๐Ÿ’พ
urb Urubú-Kaapor 347K 💾
urk Urak Lawoi' 368K ๐Ÿ’พ
ury Orya 301K ๐Ÿ’พ
usa Usarufa 171K ๐Ÿ’พ
usp Uspanteco 228K ๐Ÿ’พ
uvl Lote 277K ๐Ÿ’พ
uz Uzbek 131K ๐Ÿ’พ
vag Vagla 221K ๐Ÿ’พ
vec Venetian 2K ๐Ÿ’พ
vec-u-sd-itpd Venetian (Padua) 813K ๐Ÿ’พ
vec-u-sd-itts Venetian (Trieste) 12K ๐Ÿ’พ
vec-u-sd-itvr Venetian (Verona) 16K ๐Ÿ’พ
vid Vidunda 151K ๐Ÿ’พ
viv Iduna 220K ๐Ÿ’พ
vmw Makhuwa 130K ๐Ÿ’พ
vun Vunjo 141K ๐Ÿ’พ
vut Vute 206K ๐Ÿ’พ
waj Waffa 236K ๐Ÿ’พ
wap Wapishana 193K ๐Ÿ’พ
war Waray 208K ๐Ÿ’พ
way Wayana 143K ๐Ÿ’พ
wer Weri 209K ๐Ÿ’พ
wiu Wiru 232K ๐Ÿ’พ
wlx Wali 847K ๐Ÿ’พ
wmw Mwani 139K ๐Ÿ’พ
wnc Wantoat 238K ๐Ÿ’พ
wnu Usan 234K ๐Ÿ’พ
wob Wè Northern 270K 💾
wos Hanga Hundi 264K ๐Ÿ’พ
wrs Waris 213K ๐Ÿ’พ
wsk Waskia 239K ๐Ÿ’พ
wuv Wuvulu-Aua 187K ๐Ÿ’พ
wwa Waama 239K ๐Ÿ’พ
xal Kalmyk 135K ๐Ÿ’พ
xav Xavánte 440K 💾
xed Hdi 229K ๐Ÿ’พ
xla Kamula 230K ๐Ÿ’พ
xog Soga 127K ๐Ÿ’พ
xrb Eastern Karaboro 286K ๐Ÿ’พ
xsb Sambal 244K ๐Ÿ’พ
xsi Sio 319K ๐Ÿ’พ
xsm Kasem 604K ๐Ÿ’พ
xsr Sherpa 184K ๐Ÿ’พ
xsu Sanumá 408K 💾
xtd Diuxi-Tilantongo Mixtec 277K ๐Ÿ’พ
xtm Magdalena Peñasco Mixtec 335K 💾
xuo Kuo 306K ๐Ÿ’พ
yaa Yaminahua 204K ๐Ÿ’พ
yad Yagua 142K ๐Ÿ’พ
yal Yalunka 203K ๐Ÿ’พ
yam Yamba 277K ๐Ÿ’พ
yaz Lokaa 222K ๐Ÿ’พ
yby Yaweyuha 219K ๐Ÿ’พ
ycn Yucuna 202K ๐Ÿ’พ
yle Yele 298K ๐Ÿ’พ
yli Angguruk Yali 221K ๐Ÿ’พ
yml Iamalele 245K ๐Ÿ’พ
yo Yoruba 270K ๐Ÿ’พ
yon Yongkom 202K ๐Ÿ’พ
yrb Yareba 184K ๐Ÿ’พ
yre Yaouré 285K 💾
yss Yessan-Mayo 227K ๐Ÿ’พ
yua Yucateco 813K ๐Ÿ’พ
yuj Karkar-Yuri 258K ๐Ÿ’พ
yut Yopno 227K ๐Ÿ’พ
yuw Yau (Morobe Province) 243K ๐Ÿ’พ
yva Yawa 250K ๐Ÿ’พ
zaa Sierra de Juárez Zapotec 265K 💾
zad Cajonos Zapotec 180K ๐Ÿ’พ
zae Yareni Zapotec 248K ๐Ÿ’พ
zap Zapotec 194K ๐Ÿ’พ
zas Santo Domingo Albarradas Zapotec 184K ๐Ÿ’พ
zaw Mitla Zapotec 157K ๐Ÿ’พ
zca Coatecas Altas Zapotec 236K ๐Ÿ’พ
zia Zia 242K ๐Ÿ’พ
ziw Zigula 140K ๐Ÿ’พ
zlm Malay 664K ๐Ÿ’พ
zne Zande 253K ๐Ÿ’พ
zpc Choapan Zapotec 208K ๐Ÿ’พ
zpi Santa María Quiegolani Zapotec 209K 💾
zpq Zoogocho Zapotec 208K ๐Ÿ’พ
zpt San Vicente Coatlán Zapotec 229K 💾
zpz Texmelucan Zapotec 281K ๐Ÿ’พ
zyp Zyphe Chin 230K ๐Ÿ’พ

¹ Downloadable files include counts for each token; to get raw text, run the crawler yourself. For breaking text into words, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO.
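As a rough illustration of what counting word tokens means here, the sketch below tallies letter-based tokens using only Python's standard library. It is an approximation and does not use ICU: it treats any run of Unicode letters (plus combining marks that follow a letter) as one token and skips numbers and punctuation. Its boundaries will differ from an ICU word break iterator, for example around apostrophes and in scripts such as Thai or Japanese that need dictionary-based segmentation. The function names are hypothetical.

```python
import unicodedata
from collections import Counter


def iter_word_tokens(text):
    """Yield runs of letters; combining marks extend the current run.

    Approximation only: ICU's word break rules are more elaborate than
    this per-character category check.
    """
    token = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L") or (token and category.startswith("M")):
            token.append(ch)
        elif token:
            # Any other character (digit, space, punctuation) ends the token.
            yield "".join(token)
            token = []
    if token:
        yield "".join(token)


def count_tokens(text):
    """Count letter-based tokens after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text)
    return Counter(iter_word_tokens(normalized))
```

Normalizing to NFC first means that a letter typed with a separate combining accent and its precomposed form are counted as the same token.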

Running the Crawler

./corpuscrawler --language=yo --output=./corpus