This embedding model embeds text in Hindi (Devanagari script), English, and Romanized Hindi. Many multilingual embedding models work well for Hindi and English texts individually, but lack the following capabilities:
- Romanized Hindi support: This is the first embedding model to support Romanized Hindi (transliterated Hindi / hin_Latn).
- Cross-lingual alignment: This model outputs language-agnostic embeddings. This enables querying a multilingual candidate pool containing a mix of Hindi, English and Romanised Hindi texts.
- Supported Languages: Hindi, English, Romanised Hindi
- Base model: google/muril-base-cased
- Training GPUs: 1xRTX4090
- Training methodology: Distillation from an English embedding model and fine-tuning on triplet data.
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Hugging Face: https://huggingface.co/AkshitaS/bhasha-embed-v0
- Developer: Akshita Sukhlecha
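Since the model card lists cosine similarity as the similarity function and 768 as the output dimensionality, ranking a mixed-language candidate pool reduces to comparing vectors. The following is a minimal sketch of that step using NumPy only. The helper names `cosine_similarity` and `rank_candidates` are hypothetical, and the vectors are random stand-ins for real embeddings, not model outputs.

```python
import numpy as np

EMBEDDING_DIM = 768  # output dimensionality listed in the model details


def cosine_similarity(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = np.atleast_2d(np.asarray(a, dtype=np.float64))
    b = np.atleast_2d(np.asarray(b, dtype=np.float64))
    # L2-normalise each row so the dot product equals the cosine.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices from most to least similar, plus their scores."""
    scores = cosine_similarity(query_vec, candidate_vecs)[0]
    order = np.argsort(-scores)
    return order, scores[order]


# Random stand-ins for embeddings (assumption: real vectors would come from the model).
rng = np.random.default_rng(0)
candidates = rng.normal(size=(3, EMBEDDING_DIM))
# Build a query that lies close to candidate 1.
query = candidates[1] + 0.1 * rng.normal(size=EMBEDDING_DIM)

order, scores = rank_candidates(query, candidates)
print(order, scores)
```

Because the embeddings are intended to be language-agnostic, the same ranking step would apply whether a candidate is written in Hindi, English or Romanised Hindi. Note that this sketch does not guard against all-zero vectors.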
Results for English-Hindi cross-lingual alignment: tasks whose corpus contains texts in both Hindi and English
Results for Romanised Hindi tasks: tasks with texts in Romanised Hindi
Results for retrieval tasks with a multilingual corpus: retrieval tasks whose corpus contains texts in Hindi, English and Romanised Hindi
Results for Hindi tasks: tasks with texts in Hindi (Devanagari script)
- Some task dataset links: Belebele, MLQA, XQuAD, SemRel24
- hin_Latn tasks: Most hin_Latn tasks were created by transliterating Hindi texts with the indic-trans library
- Detailed results: github_link
cd eval
pip install -r requirements.txt
python evaluator.py
Run a few examples demonstrating the model's capabilities:
pip install sentence-transformers numpy
python examples.py
Script to encode queries and passages and compute similarity scores using Sentence Transformers or 🤗 Transformers.
pip install sentence-transformers numpy
python usage.py
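For readers who want to see the shape of such a script, here is a minimal sketch of the Sentence Transformers path. It is not the contents of `usage.py`. The model ID is taken from the citation URL, the function names `load_model`, `similarity_scores` and `main` are hypothetical, and the example texts are illustrative. Whether the model expects any query or passage prefix is not stated here, so none is added.

```python
import numpy as np

# Model ID inferred from the citation URL (assumption).
MODEL_ID = "AkshitaS/bhasha-embed-v0"


def load_model(model_id=MODEL_ID):
    """Load the model with Sentence Transformers (downloads weights on first use)."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_id)


def similarity_scores(model, queries, passages):
    """Return a (num_queries, num_passages) matrix of cosine similarities."""
    # With unit-length embeddings, the dot product equals cosine similarity.
    q = np.asarray(model.encode(queries, normalize_embeddings=True))
    p = np.asarray(model.encode(passages, normalize_embeddings=True))
    return q @ p.T


def main():
    # Illustrative texts: one Hindi query against a mixed-language pool.
    queries = ["भारत की राजधानी क्या है?"]
    passages = [
        "The capital of India is New Delhi.",
        "Bharat ki rajdhani Nayi Dilli hai.",
        "Cricket is a popular sport.",
    ]
    model = load_model()
    scores = similarity_scores(model, queries, passages)
    for passage, score in zip(passages, scores[0]):
        print(f"{score:.3f}  {passage}")
```

Calling `main()` runs the example end to end; it needs network access the first time so the model weights can be downloaded.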
To cite this model:
@misc{sukhlecha_2024_bhasha_embed_v0,
author = {Sukhlecha, Akshita},
title = {Bhasha-embed-v0},
howpublished = {Hugging Face},
month = {June},
year = {2024},
url = {https://huggingface.co/AkshitaS/bhasha-embed-v0}
}