Text Embeddings for Hindi, English and Romanized Hindi texts

akshita-sukhlecha/bhasha-embed

Bhasha embed v0 model

Bhasha embed v0 is an embedding model for texts in Hindi (Devanagari script), English, and Romanized Hindi. Many multilingual embedding models work well on Hindi and English texts individually but lack the following capabilities:

  1. Romanized Hindi support: This is the first embedding model to support Romanized Hindi (transliterated Hindi / hin_Latn).
  2. Cross-lingual alignment: The model outputs language-agnostic embeddings, so a single query can be matched against a multilingual candidate pool containing a mix of Hindi, English, and Romanized Hindi texts.

Model Details

  • Supported Languages: Hindi, English, Romanized Hindi
  • Base model: google/muril-base-cased
  • Training GPUs: 1 x RTX 4090
  • Training methodology: Distillation from an English embedding model and fine-tuning on triplet data.
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
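
As a rough illustration of the similarity function listed above, the sketch below computes cosine similarity in plain Python by L2-normalizing both vectors and taking their dot product. The helper names (`l2_normalize`, `cosine_similarity`) and the toy vectors are assumptions made for this example; they are not part of the repository and are not real model outputs.

```python
import math

EMBEDDING_DIM = 768  # output dimensionality listed in the model details


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v, a value in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    u_hat, v_hat = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u_hat, v_hat))


# Toy 768-dimensional vectors, made up for illustration (not model outputs).
u = [1.0] * EMBEDDING_DIM
v = [2.0] * EMBEDDING_DIM               # same direction as u, different length
w = [1.0, -1.0] * (EMBEDDING_DIM // 2)  # orthogonal to u

print(round(cosine_similarity(u, v), 6))  # close to 1: direction matters, length does not
print(round(cosine_similarity(u, w), 6))  # close to 0: orthogonal vectors
```

Because the score depends only on direction, embeddings that are normalized once up front can be compared afterwards with a plain dot product.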

Model Sources


Results

Results for English-Hindi cross-lingual alignment: tasks whose corpus contains texts in both Hindi and English

Results for Romanized Hindi tasks: tasks with texts in Romanized Hindi

Results for retrieval tasks with a multilingual corpus: retrieval tasks whose corpus contains texts in Hindi, English, and Romanized Hindi

Results for Hindi tasks: tasks with texts in Hindi (Devanagari script)

Additional information


Sample outputs

Example 1

Example 2

Example 3

Example 4


Scripts

Replicate results

  1. cd eval
  2. pip install -r requirements.txt
  3. python evaluator.py

Run examples

Run a few examples that demonstrate the model's capabilities:

  1. pip install sentence-transformers numpy
  2. python examples.py

Usage

Script to encode queries and passages and compute similarity scores using Sentence Transformers or 🤗 Transformers.

  1. pip install sentence-transformers numpy
  2. python usage.py
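
For readers who want to see the general shape of such a script, here is a minimal sketch using Sentence Transformers. It is not the repository's `usage.py`. The model ID is assumed from the citation URL, the helper names (`load_encoder`, `rank_candidates`, `demo`) and the sample texts are hypothetical, and the sketch assumes the model loads through `SentenceTransformer` without any query or passage prefix.

```python
MODEL_ID = "AkshitaS/bhasha-embed-v0"  # assumed from the citation URL


def load_encoder(model_id=MODEL_ID):
    """Return a function mapping a list of texts to unit-length embeddings."""
    # Imported lazily so rank_candidates can be used without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return lambda texts: model.encode(texts, normalize_embeddings=True)


def rank_candidates(query, candidates, encode):
    """Return (score, candidate) pairs sorted by descending similarity.

    `encode` must return one unit-length vector per input text, so the
    dot product computed here equals cosine similarity.
    """
    vectors = encode([query] + list(candidates))
    query_vec, cand_vecs = vectors[0], vectors[1:]
    scored = [
        (float(sum(q * c for q, c in zip(query_vec, vec))), text)
        for vec, text in zip(cand_vecs, candidates)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def demo():
    """Rank a small mixed-script pool; downloads the model on first run."""
    pool = [
        "भारत की राजधानी नई दिल्ली है",       # Hindi (Devanagari)
        "New Delhi is the capital of India",  # English
        "mujhe chai bahut pasand hai",        # Romanized Hindi
    ]
    encode = load_encoder()
    for score, text in rank_candidates("bharat ki rajdhani kya hai", pool, encode):
        print(f"{score:.3f}  {text}")
```

Calling `demo()` prints each candidate with its score, highest first. Passing the encoder in as a function keeps the ranking logic independent of the model, so it can be exercised with any stand-in that returns unit-length vectors.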

Citation

To cite this model:

@misc{sukhlecha_2024_bhasha_embed_v0,
  author = {Sukhlecha, Akshita},
  title = {Bhasha-embed-v0},
  howpublished = {Hugging Face},
  month = {June},
  year = {2024},
  url = {https://huggingface.co/AkshitaS/bhasha-embed-v0}
}
