Model compatibility #50
-
Hi! I noticed that you rely neither on Stanford ColBERT nor on RAGatouille in your dependencies. What models are compatible with
Replies: 10 comments 5 replies
-
Hello! Indeed, PyLate is a standalone library that uses sentence-transformers (and thus transformers) for the modeling, so we use neither Stanford ColBERT nor RAGatouille. Right now, the usable models are the ones trained using the library, as well as ColBERT-v2 and ColBERT-small from AnswerAI. For starters, I'll share a conversion script soon so people can translate their favorite models. Did you have any model in mind so I can help?
-
Makes sense! But how come
Edit: I'm trying to get up and running in German, which is unfortunately a bit difficult. I came across these models: AdrienB134/ColBERTv1.0-german-mmarcoDE: only ColBERTv1, but it seems ready to go.
-
It comes from the fact that I know Ben and I worked with him to add the weights for PyLate to the repository!
However, this approach comes at the cost of having to modify repositories and duplicate weights, and it actually created some issues.
Thus, I reworked the loading logic in #52, and you should now be able to load any existing stanford-nlp model!
I am still making some adjustments to handle more models (e.g., the very recent jina-colbert-v2), but it's already usable and should be even better very soon!
-
So by "stanford-nlp model", you mean the reference implementation? I guess this warning refers to the loading logic you mentioned. Is there any way to store the converted model on disk in order to save some initialization time?
Apart from that, kudos for this library. I got a small retriever running already :)
-
Yes indeed, the models built using this repository (and also the RAGatouille ones, as it uses the lib as backend). I am not 100% sure the model from Antoine Louis will be loaded correctly, as it has been trained with his own codebase (and I don't know how compatible the modeling is). You can save the model locally by using
My last PR (#54) allows loading the recent jina-colbert-v2 model, which is an amazing multilingual ColBERT. You could use this one once the PR is merged.
-
That's good to know. I haven't checked the results, but so far everything works without errors. How much work would it be to make sure
-
I honestly have no idea, as I did not dig much into the ColBERT-XM code.
-
Alright, I understand. I guess the following warning indicates that the conversion did not work properly...?
-
Not really, it just means that the query/document prefixes have been added to the vocabulary. Which model is it?
-
Still