jonathandunn/earthLings

earthLings

Corpus-based language and dialect mapping

This project visualizes three datasets:

Twitter Corpus 
	(23.7 billion words, 2017-2023) 
	
Corpus of Global Language Use
	(ISLRN: 951-235-998-601-3)
	(329.4 billion words, 2013-2019, from the Common Crawl)
	
GeoWAC 
	(ISLRN: 946-519-559-042-9)
	(42 billion words; geographically-balanced gigaword corpora for 48 languages)

The per-country aggregates can be found in the docs/data folder as CSV files.
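
As a starting point for exploring those files, here is a minimal sketch that lists each CSV in a folder together with its column names and row count. It assumes the repository has been cloned locally, that the files are UTF-8 encoded, and that the first row of each file is a header. None of these are confirmed by this README, and the function name `summarize_csv_folder` is hypothetical.

```python
import csv
from pathlib import Path


def summarize_csv_folder(folder):
    """Return {file name: (header columns, number of data rows)} for each CSV in folder.

    Assumption: the first row of every file is a header row.
    """
    summary = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # empty list if the file is empty
            row_count = sum(1 for _ in reader)  # remaining rows are data rows
        summary[path.name] = (header, row_count)
    return summary


if __name__ == "__main__":
    # Assumes this script is run from the root of a local clone.
    for name, (header, rows) in summarize_csv_folder("docs/data").items():
        print(f"{name}: {rows} rows, columns = {header}")
```

Inspecting the headers first is a cheap way to learn the actual column layout before loading anything into a larger analysis tool.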

View this project through GitHub Pages: https://jonathandunn.github.io/earthLings/

The full web dataset is now available through this repository:

CGLU: https://www.earthlings.io/download_cglu.html

GeoWAC: https://www.earthlings.io/download_geowac.html

For a description of data collection procedures and the language identification component, see this paper: https://jdunn.name/2020/03/08/mapping-languages-the-corpus-of-global-language-use/

For a description of population-based sampling techniques to create unbiased corpora, see this paper: https://jdunn.name/2020/03/08/geographically-balanced-gigaword-corpora-for-50-language-varieties/

For a study of changes in linguistic diversity during COVID-19, see this paper: https://jdunn.name/2020/10/14/measuring-linguistic-diversity-during-covid-19/

For a demographic and census-based evaluation of these corpora, see this paper: https://jdunn.name/2019/07/22/mapping-languages-and-demographics-with-georeferenced-corpora/

For an overview of dialectal variation and dialect uniqueness values, see this paper: https://jdunn.name/2019/07/22/global-syntactic-variation-in-seven-languages-towards-a-computational-dialectology/

You can also look at my related repositories:

Language ID: https://github.com/jonathandunn/idNet

Web Collection: https://github.com/jonathandunn/common_crawl_corpus
