GitHub - akshita-sukhlecha/bhasha-embed: Text Embeddings for Hindi, English and Romanized Hindi texts

Bhasha embed v0 model

This is an embedding model that can embed texts in Hindi (Devanagari script), English and Romanized Hindi. Many multilingual embedding models work well for Hindi and English texts individually, but they lack the following capabilities:

Romanized Hindi support: This is the first embedding model to support Romanized Hindi (transliterated Hindi / hin_Latn).
Cross-lingual alignment: This model outputs language-agnostic embeddings. This enables querying a multilingual candidate pool containing a mix of Hindi, English and Romanized Hindi texts.
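To make the idea of querying a mixed-language pool concrete, here is a minimal sketch of ranking candidates by cosine similarity. The three-dimensional vectors are made-up stand-ins for real model embeddings, and the helper names `cosine` and `rank_pool` are hypothetical, not part of this repository.

```python
import math
from typing import List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_pool(query_vec: Sequence[float],
              pool: List[Tuple[str, Sequence[float]]]) -> List[str]:
    """Return the pool texts sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]

# Made-up vectors: real embeddings would come from the model.
pool = [
    ("आज मौसम अच्छा है", [0.90, 0.10, 0.00]),             # Hindi (Devanagari)
    ("aaj mausam accha hai", [0.88, 0.12, 0.05]),         # Romanized Hindi
    ("The stock market fell today", [0.00, 0.20, 0.95]),  # unrelated English
]
query_vec = [0.92, 0.08, 0.02]  # stand-in for "The weather is nice today"

ranked = rank_pool(query_vec, pool)
print(ranked)
```

With language-agnostic embeddings, the two weather sentences should rank above the unrelated English sentence regardless of the script they are written in.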

Model Details

Supported Languages: Hindi, English, Romanized Hindi
Base model: google/muril-base-cased
Training GPUs: 1xRTX4090
Training methodology: Distillation from an English embedding model and fine-tuning on triplet data.
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
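The README does not state which loss functions were used for training. As an illustration only, the sketch below shows two commonly used formulations for the steps named above: a mean-squared-error distillation loss that pulls a student embedding toward a teacher embedding, and a margin-based triplet loss on cosine distance. The function names and the margin value are assumptions, not the author's training code.

```python
import math
from typing import Sequence

Vector = Sequence[float]

def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def distillation_mse(student: Vector, teacher: Vector) -> float:
    """Mean squared error between student and teacher embeddings."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 0.5) -> float:
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy 2-d vectors: the positive matches the anchor, the negative is orthogonal.
print(triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
print(distillation_mse([1.0, 2.0], [1.0, 4.0]))
```

In practice these losses would be computed over batches with an autodiff framework; plain Python is used here only to keep the arithmetic visible.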

Model Sources

Hugging Face: https://huggingface.co/AkshitaS/bhasha-embed-v0
Developer: Akshita Sukhlecha

Results

Results for English-Hindi cross-lingual alignment: Tasks with a corpus containing texts in both Hindi and English

Results for Romanized Hindi tasks: Tasks with texts in Romanized Hindi

Results for retrieval tasks with multilingual corpus: Retrieval tasks with a corpus containing texts in Hindi, English and Romanized Hindi

Results for Hindi tasks: Tasks with texts in Hindi (Devanagari script)

Additional information

Some task dataset links: Belebele, MLQA, XQuAD, SemRel24
hin_Latn tasks: Most hin_Latn tasks were created by transliterating Hindi texts using the indic-trans library
Detailed results: github_link

Sample outputs

Example 1

Example 2

Example 3

Example 4

Scripts

Replicate results

cd eval
pip install -r requirements.txt
python evaluator.py

Run examples

Run a few examples that demonstrate the model's capabilities:

pip install sentence-transformers numpy
python examples.py

Usage

The usage.py script encodes queries and passages and computes similarity scores using either Sentence Transformers or 🤗 Transformers.

pip install sentence-transformers numpy
python usage.py
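For orientation, here is a minimal sketch of what encoding and scoring with Sentence Transformers might look like. It is not a copy of usage.py. The model ID is taken from the citation URL in this README, the helper names `load_encoder` and `score_passages` are hypothetical, and whether the model expects any query or passage prefix is not covered here, so check usage.py for the authoritative version.

```python
from typing import Callable, List, Sequence

# Model ID as it appears in the citation URL of this README.
MODEL_ID = "AkshitaS/bhasha-embed-v0"

Encoder = Callable[[List[str]], Sequence[Sequence[float]]]

def load_encoder(model_id: str = MODEL_ID) -> Encoder:
    """Return a function that maps texts to L2-normalized embeddings."""
    # Imported lazily so the scoring helper below works without the package.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_id)
    return lambda texts: model.encode(texts, normalize_embeddings=True)

def score_passages(query: str, passages: List[str],
                   encode: Encoder) -> List[float]:
    """Score each passage against the query.

    With unit-length embeddings the dot product equals cosine similarity.
    """
    vectors = encode([query] + passages)
    query_vec, passage_vecs = vectors[0], vectors[1:]
    return [float(sum(a * b for a, b in zip(query_vec, vec)))
            for vec in passage_vecs]
```

With the package installed, calling `encode = load_encoder()` and then `score_passages(query, passages, encode)` would download the model on first use and return one similarity score per passage. Passing the encoder in as an argument also makes it easy to swap in a 🤗 Transformers based encoder.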

Citation

To cite this model:

@misc{sukhlecha_2024_bhasha_embed_v0,
  author = {Sukhlecha, Akshita},
  title = {Bhasha-embed-v0},
  howpublished = {Hugging Face},
  month = {June},
  year = {2024},
  url = {https://huggingface.co/AkshitaS/bhasha-embed-v0}
}
