I am currently working on vision model support. Pixtral should be up and running soon. There's an early example in the dev branch which relies on HF Transformers for the vision tower, but the plan is to fully integrate it. Then I'll look at other formats like Llava and Qwen2. Llama3.2 specifically looks very tricky to implement because of the cross-attention approach they went for, and I'll have to do some comparisons to other models (eventually) before I can tell if it's even worth it. That's the thing with these models: LLMs are basically all transformers with a few minor variations, but VLMs are all special snowflakes, each with its own unique approach to mixing in image data. And the way Llama3.2 does it is especially annoying. I can't even tell at a glance whether it supports multiple images in the same context or not.
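For context, here is a minimal sketch of the Llava/Pixtral-style "embedding injection" approach, using a generic CLIP vision tower from HF Transformers. This is purely illustrative (not the dev-branch code), and the model names, projector shape and LLM hidden size are assumptions. The point is that the vision tower's patch embeddings are projected into the LLM's embedding space and spliced into the token sequence, whereas Llama3.2 instead routes image features through extra cross-attention layers inside the decoder, which is much harder to bolt onto an existing text-only pipeline.

```python
# Illustrative sketch only: Llava/Pixtral-style image embedding injection
# using a CLIP vision tower from HF Transformers. Model names, the projector
# and the LLM hidden size (4096) are assumptions, not the actual integration.

import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical projector: vision hidden size -> assumed LLM hidden size (4096)
projector = nn.Linear(vision_tower.config.hidden_size, 4096)

image = Image.open("example.jpg")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # (1, num_patches + 1, vision_hidden_size)
    patch_embeds = vision_tower(pixel_values).last_hidden_state

image_embeds = projector(patch_embeds)  # (1, num_patches + 1, 4096)

# The LLM then consumes these directly as input embeddings: the image
# embeddings are concatenated with (or spliced into) the text token
# embeddings at the image-placeholder position, and the decoder runs as usual.
# inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
```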
Hi turboderp!
Just wanted to ask: what are your thoughts on integrating this architecture?
Is this within the scope of this repo, and is there any plan to implement it any time soon?
I saw some feature requests for multimodal models under the issues tab, but those were left unanswered.
Looking forward to hearing from you!
Best regards,
gros