Character-based N-gram Modeling Predicts Language Groups and Evolution

The main results and writeup are in /results.ipynb, and the code is in /code-project-1/main.ipynb. The project is also available at https://github.com/kakduman/cs482/. Before running the code, download the Tatoeba Sentences dataset into the data/tatoeba/ directory: https://downloads.tatoeba.org/exports/sentences.tar.bz2

Koray Akduman | March 3, 2024

Introduction

The study of language evolution and grouping is a pivotal area of research that examines the origins, development, and classification of languages. Understanding how languages evolve sheds light on human communication, social dynamics, and even biological evolution (Dunbar, 1997). Languages mirror the socio-political changes, migratory patterns, and cultural interactions of different communities. By tracing the genealogy and transformation of languages, researchers can piece together the history of human civilization and its diverse manifestations.

In this paper, as a proof of concept, we implement a simple character-based n-gram approach that successfully predicts commonly recognized language families (e.g., "West Germanic" and "Romance") in a completely unsupervised manner, using k-means clustering to group the n-gram vectors and Uniform Manifold Approximation and Projection (UMAP) to visualize them. Further, our approach classifies languages like "Old German" and "German" into the same language group, demonstrating an ability to trace backwards in time and follow the evolution of a language through machine learning. Our results also show that the approach can find relationships between different language groups, predicting, for example, that West Germanic and North Germanic languages are similar.
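
To make the idea of a character-based n-gram representation concrete, the following is a minimal sketch, not the implementation in main.ipynb. It builds a relative-frequency profile of character trigrams for each language and compares profiles with cosine similarity. The function names (`char_ngrams`, `language_profile`, `cosine_similarity`), the choice of n = 3, and the tiny hand-written sample sentences are all assumptions made for illustration; the actual project works on the Tatoeba dataset and may vectorize text differently.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def language_profile(sentences, n=3):
    """Pool n-gram counts over sentences and normalize to relative frequencies."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_ngrams(sentence, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse n-gram profiles (dicts)."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Illustrative sentences only (assumption): not taken from the Tatoeba dataset.
samples = {
    "eng": ["the house is old", "the water is cold"],
    "deu": ["das haus ist alt", "das wasser ist kalt"],
    "spa": ["la casa es vieja", "el agua está fría"],
}

profiles = {lang: language_profile(sents) for lang, sents in samples.items()}

# Print pairwise similarities between the language profiles.
langs = sorted(profiles)
for i, a in enumerate(langs):
    for b in langs[i + 1:]:
        print(a, b, round(cosine_similarity(profiles[a], profiles[b]), 3))
```

With profiles like these computed over a full corpus, each language becomes a single vector. Those vectors are what a clustering step such as k-means would group and what a projection such as UMAP would reduce to two dimensions for plotting.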
