Implementation of the Matryoshka Adaptor (Yoon et al., 2024).
The Matryoshka Adaptor is a proposed method to tune an adaptor network that maps embeddings from a source (potentially black-box) embedding model into a space with a more desirable structure. First and foremost, the mapped embeddings should possess the "Matryoshka property", which for embedding models means that performance is largely maintained when only the first k dimensions of an embedding are used.
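As a concrete illustration of what the property means (not part of the repository's code; the helper below is purely for demonstration), truncating an embedding to its first k dimensions and re-normalizing should ideally yield similarity scores close to the full-dimensional ones:

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

def truncate(embeddings: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the first k dimensions and re-normalize (illustration only)."""
    return F.normalize(embeddings[:, :k], dim=-1)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(["a query", "a relevant passage"], convert_to_tensor=True)

for k in (32, 64, 128, 384):
    a, b = truncate(emb, k)
    # For a model with the Matryoshka property, these scores stay close to the full-dimensional one.
    print(f"k={k}: cosine similarity {(a * b).sum().item():.3f}")
```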
For details of the method please refer to the paper.
To train an adaptor on a specific MTEB dataset, configure and run main.py.
To evaluate a specific adaptor, configure and run matryoshka_adaptor/eval.py.
main.py: entry point to train an adaptor
matryoshka_adaptor/
├── model.py: adaptor model design
├── train.py: training loop and data loading
├── losses.py: implementation of loss functions
└── eval.py: evaluation entry point and logic
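The actual architecture lives in matryoshka_adaptor/model.py. As a rough, illustrative sketch of a shallow MLP adaptor (layer sizes and the residual connection are assumptions, not necessarily the repository's exact design):

```python
import torch
import torch.nn as nn

class AdaptorSketch(nn.Module):
    """Illustrative shallow MLP adaptor mapping source embeddings to adapted
    embeddings of the same dimension. Not the repository's exact architecture."""

    def __init__(self, dim: int = 384, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Learn a correction on top of the frozen base-model embeddings.
        return embeddings + self.mlp(embeddings)
```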
- torch
- sentence-transformers: base embedding model
- datasets: training data
- pandas: training data processing
- matplotlib: evaluation plotting
- mteb: evaluation framework
- ptvsd: debugging
pip install -r requirements.txt
python main.py
This will run the current config defined in main.py (training on HotpotQA). Tested on a GH200 superchip and a Vast.ai machine with an A100, an AMD EPYC 7J13 64-Core Processor, and 256 GB RAM.
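Training optimizes the losses implemented in matryoshka_adaptor/losses.py. As a simplified, hedged sketch of the core unsupervised idea described in the paper (preserving pairwise similarities when only the first k adapted dimensions are used; the exact loss terms, prefixes, and weights in the repository differ):

```python
import torch
import torch.nn.functional as F

def prefix_similarity_loss(original: torch.Tensor,
                           adapted: torch.Tensor,
                           dims=(32, 64, 128, 256)) -> torch.Tensor:
    """Simplified sketch: pairwise cosine similarities computed from the first k
    adapted dimensions should match those of the full original embeddings.
    Not the exact objective from losses.py."""
    full = F.normalize(original, dim=-1)
    target = full @ full.T                      # full-dimensional similarity matrix
    loss = original.new_zeros(())
    for k in dims:
        prefix = F.normalize(adapted[:, :k], dim=-1)
        loss = loss + F.mse_loss(prefix @ prefix.T, target)
    return loss / len(dims)
```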
Both the unsupervised and the supervised adaptor trained on SciFact are included in the repository and can be evaluated (see matryoshka_adaptor/eval.py).
For the experiment below, the sentence-transformers/all-MiniLM-L6-v2 model was used. The trained adaptor is a shallow multi-layer perceptron, yielding the performance shown below in the supervised and unsupervised settings, respectively.
The experiment was conducted on a GH200 chip; re-running it might require adjusting batch sizes and other parameters.
Training and evaluation have been performed on both the SciFact and HotpotQA datasets.
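At evaluation time, a trained adaptor is applied on top of the frozen base-model embeddings, which are then truncated to the target dimension; the actual evaluation logic lives in matryoshka_adaptor/eval.py. A minimal usage sketch, reusing the illustrative AdaptorSketch class from above (the checkpoint path is hypothetical):

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
adaptor = AdaptorSketch(dim=384)                    # illustrative class from the sketch above
adaptor.load_state_dict(torch.load("adaptor.pt"))   # hypothetical checkpoint path
adaptor.eval()

with torch.no_grad():
    emb = model.encode(["query text"], convert_to_tensor=True)
    adapted = adaptor(emb)
    reduced = F.normalize(adapted[:, :64], dim=-1)  # truncate to 64 dims, then re-normalize

print(reduced.shape)  # torch.Size([1, 64])
```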
Yoon, Jinsung, et al. "Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions." arXiv preprint arXiv:2407.20243 (2024).
Muennighoff, Niklas, et al. "MTEB: Massive Text Embedding Benchmark." Conference of the European Chapter of the Association for Computational Linguistics (2022).
sentence-transformers/all-MiniLM-L6-v2 · Hugging Face. Huggingface.co. Published January 5, 2024. Accessed September 1, 2024. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2