language-detect

language-detect is a language detection library built on machine learning models. It supports 862 languages out of the box, and users can update it to support more.

System Description

The goal is to examine the script and word formations of a given text in order to accurately identify the language. This is particularly important for distinguishing between languages that share the same script. This functionality serves as a crucial first step in text processing pipelines, where subsequent steps such as spell checking or rendering are dependent on the identified language.
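As a rough illustration of the script-examination step described above, the sketch below guesses a script from Unicode character names. It is not the library's implementation: `guess_script` is a hypothetical helper, and treating the first word of a character's Unicode name as its script is a simplifying assumption.

```python
import unicodedata
from collections import Counter


def guess_script(text: str) -> str:
    """Return the most common script among the letters in `text`.

    Assumption: the first word of a character's Unicode name
    (for example "LATIN" or "DEVANAGARI") stands in for its script.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skip digits, punctuation, whitespace and combining marks
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0].title()] += 1
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


print(guess_script("Hello there, how are you?"))
print(guess_script("नमस्ते, आप कैसे हैं?"))
```

Script detection alone cannot separate languages that share a script, which is why a per-script language model is applied afterwards.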

Installation

Install language-detect with pip:

    pip install language-detect

Usage

    import language_detect as ld

    # Detect the language and script of a text
    res1 = ld.recognize_language("Hello there, how are you? Hope you are doing well.")
    # ("English", "Latin")

    # Detect only the script
    res2 = ld.detect_script("नमस्ते, आप कैसे हैं? मुझे आशा है कि आप अच्छा कार्य कर रहे हैं।")
    # "Devanagari"

    # List the supported (language, script) pairs and the supported scripts
    ld.list_languages()  # [("English", "Latin"), ("Hindi", "Devanagari"), ...]
    ld.list_scripts()  # ["Latin", "Devanagari", "Cyrillic", ...]

    # List models, optionally filtered by script, language or download status
    models = ld.list_models(script_name="Devanagari", lang_name="Hindi", downloaded=True)
    # [{"script_name": "Devanagari", "languages": ["Marathi", "Nepali (individual language)", "Sanskrit", "Urdu", "Hindi", ...], "model_name": "Devanagari_model", "downloaded": True, "model_type": "Multinomial Naive Bayes", "vectorizer_model_name": "Devanagari_vectorizer", "vectorizer_type": "CountVectorizer", "vectorizer_params": {"ngram_range": "(3, 3)", "max_features": 2000, "analyzer": "char"}}, ...]

    # Fetch the description of a single model
    model = ld.get_model(script_name="Devanagari", lang_name="Hindi")
    # {"script_name": "Devanagari", "languages": ["Marathi", "Nepali (individual language)", "Sanskrit", "Urdu", "Hindi", ...], "model_name": "Devanagari_model", "downloaded": True, "model_type": "Multinomial Naive Bayes", "vectorizer_model_name": "Devanagari_vectorizer", "vectorizer_type": "CountVectorizer", "vectorizer_params": {"ngram_range": "(3, 3)", "max_features": 2000, "analyzer": "char"}}

Functions

recognize_language(text) : Takes a string as input and returns a tuple containing the detected language name and script name.

detect_script(text) : Takes a string as input and returns the name of the script used in the given text.

list_languages() : Returns a list of all languages available in the database, each paired with its associated script.

list_scripts() : Returns a list of all script names available in the database.

list_models(script_name=None, lang_name=None, downloaded=None) : Filters and returns models as a list of dictionaries based on the provided script name, language name or download status. The filter arguments are optional, and if no arguments are provided, it returns all available models.

get_model(script_name, lang_name=None) : Fetches and returns a specific model from the database as a dictionary, based on the provided script name and optional language name.
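To make the optional-filter behaviour of list_models concrete, here is a minimal sketch over an in-memory list. The `MODELS` registry and the `filter_models` helper are hypothetical and exist only for illustration. The field names follow the dictionaries shown under Usage, but the Latin entry and its values are made up, and the library's real storage and lookup may differ.

```python
from typing import Optional

# Hypothetical registry for illustration only. Field names mirror the
# Usage output; the Latin entry and its values are invented.
MODELS = [
    {"script_name": "Devanagari", "languages": ["Hindi", "Marathi"],
     "model_name": "Devanagari_model", "downloaded": True},
    {"script_name": "Latin", "languages": ["English", "Spanish"],
     "model_name": "Latin_model", "downloaded": False},
]


def filter_models(script_name: Optional[str] = None,
                  lang_name: Optional[str] = None,
                  downloaded: Optional[bool] = None) -> list:
    """Return the entries that match every filter that is not None."""
    result = []
    for model in MODELS:
        if script_name is not None and model["script_name"] != script_name:
            continue
        if lang_name is not None and lang_name not in model["languages"]:
            continue
        if downloaded is not None and model["downloaded"] != downloaded:
            continue
        result.append(model)
    return result


print(filter_models())                   # no filters: every entry
print(filter_models(lang_name="Hindi"))  # only models covering Hindi
```

Using None as the default for each filter keeps "not specified" distinct from `downloaded=False`, which is a real filter value.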

Experiments

We conducted experiments to identify the most effective methods for script-wise language detection. Below is a summary of our approach:

Datasets: We utilized datasets from the eBible Corpus and Vachan Data. These datasets were categorized based on their scripts.

Models and Algorithms: Each script-wise dataset was used to train multiple machine learning models, including:

Multinomial Naive Bayes
Logistic Regression
Decision Tree
Support Vector Machine (SVM)

Feature Extraction Techniques: To represent text data effectively, we tried different feature extraction techniques:

Character n-gram Approach: Explored ranges like [3, 3], [2, 4], and [2, 3].
Frequent Word Selection: Retained frequent words from the dataset as features.
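
The two feature types above can be sketched in plain Python as follows. The helper names `char_ngrams` and `frequent_words` are hypothetical, and splitting words on whitespace is an assumption. The vectorizer settings shown under Usage suggest the experiments used scikit-learn's CountVectorizer rather than code like this.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int, n_max: int) -> Counter:
    """Count every character n-gram of length n_min..n_max (inclusive)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def frequent_words(texts, top_k: int) -> list:
    """Return the top_k most frequent whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(top_k)]


print(char_ngrams("hello", 3, 3))   # range [3, 3]: trigrams only
print(char_ngrams("hello", 2, 3))   # range [2, 3]: bigrams and trigrams
print(frequent_words(["the cat", "the dog"], 1))
```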

Feature Limitations: To optimize performance and reduce overfitting, we experimented with limiting the maximum number of features:

All available features
5000 features
2000 features
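
A feature limit can be read as "keep only the N most frequent n-grams across the training texts". The sketch below shows that idea under this assumption. `build_vocabulary` is a hypothetical helper, not part of the library.

```python
from collections import Counter


def build_vocabulary(texts, n: int = 3, max_features=None) -> dict:
    """Map the most frequent character n-grams to column indices."""
    totals = Counter()
    for text in texts:
        totals.update(text[i:i + n] for i in range(len(text) - n + 1))
    # most_common(None) returns every n-gram, i.e. no feature limit
    kept = [gram for gram, _ in totals.most_common(max_features)]
    return {gram: idx for idx, gram in enumerate(sorted(kept))}


texts = ["aaaa", "aaab"]
print(build_vocabulary(texts))                  # all features
print(build_vocabulary(texts, max_features=1))  # only the most frequent trigram
```

A smaller vocabulary drops rare n-grams, which tend to be specific to the training texts rather than to the language.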

Results: After testing various combinations of algorithms, feature extraction methods, and feature limits, we found that the most accurate results were achieved using:

Multinomial Naive Bayes as the algorithm
Character n-gram Approach with ranges [3, 3] or [2, 3]
Maximum Features set to 2000

Using this combination, we trained and finalized 15 models, each corresponding to a specific script.
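
To show how the winning combination fits together, here is a small self-contained sketch of Multinomial Naive Bayes over character 3-gram counts with a feature cap. It is written from scratch for illustration and is not the project's training code. The class name `TrigramNaiveBayes`, the Laplace smoothing value `alpha=1.0` and the toy sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict


def trigrams(text: str) -> list:
    """Return all character 3-grams of `text`."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


class TrigramNaiveBayes:
    """Multinomial Naive Bayes over character 3-gram counts (illustrative)."""

    def __init__(self, max_features: int = 2000, alpha: float = 1.0):
        self.max_features = max_features
        self.alpha = alpha  # Laplace smoothing; the value is an assumption

    def fit(self, texts, labels):
        # Keep only the most frequent trigrams across all training texts.
        totals = Counter()
        for text in texts:
            totals.update(trigrams(text))
        self.vocab = {g for g, _ in totals.most_common(self.max_features)}

        # Count in-vocabulary trigrams per label.
        self.counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.counts[label].update(g for g in trigrams(text) if g in self.vocab)

        docs_per_label = Counter(labels)
        self.log_prior = {label: math.log(count / len(texts))
                          for label, count in docs_per_label.items()}
        self.total = {label: sum(self.counts[label].values())
                      for label in docs_per_label}
        return self

    def predict(self, text: str) -> str:
        grams = [g for g in trigrams(text) if g in self.vocab]
        vocab_size = len(self.vocab)

        def score(label):
            # Log prior plus smoothed log likelihood of each observed trigram.
            denom = self.total[label] + self.alpha * vocab_size
            return self.log_prior[label] + sum(
                math.log((self.counts[label][g] + self.alpha) / denom)
                for g in grams)

        return max(self.log_prior, key=score)


# Toy data, invented for this example.
texts = ["hello there how are you", "the weather is nice today",
         "hola como estas amigo", "el tiempo es bueno hoy"]
labels = ["English", "English", "Spanish", "Spanish"]

model = TrigramNaiveBayes(max_features=2000).fit(texts, labels)
print(model.predict("how is the weather"))
print(model.predict("como estas hoy"))
```

In the setup described above, one such model would be trained per script, so the script detected for a text decides which model classifies it.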

For detailed experiment findings, please refer to the Experiment notes.

Bridgeconn/LanguageRecognizer
