I am currently working on vision model support. Pixtral should be up and running soon. There's an early example in the dev branch which relies on HF Transformers for the vision tower, but the plan is to fully integrate it. Then I'll look at other formats like Llava and Qwen2. Llama3.2 specifically looks very tricky to implement because of the cross-attention approach they went for, and I'll have to do some comparisons to other models (eventually) before I can tell if it's even worth it. That's the thing with these models: LLMs are basically all transformers with a few minor variations, but VLMs are all special snowflakes, each with its own unique approach to mixing in image data. And the way Llama3.2 does it is especially annoying. I can't even tell at a glance whether it supports multiple images in the same context or not.
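For context, here is a minimal sketch of the Llava/Pixtral-style "embedding injection" approach, using a generic CLIP vision tower from HF Transformers. This is purely illustrative (not the dev-branch code), and the model names, projector shape and LLM hidden size are assumptions. The point is that the vision tower's patch embeddings are projected into the LLM's embedding space and spliced into the token sequence, whereas Llama3.2 instead routes image features through extra cross-attention layers inside the decoder, which is much harder to bolt onto an existing text-only pipeline.

```python
# Illustrative sketch only: Llava/Pixtral-style image embedding injection
# using a CLIP vision tower from HF Transformers. Model names, the projector
# and the LLM hidden size (4096) are assumptions, not the actual integration.

import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical projector: vision hidden size -> assumed LLM hidden size (4096)
projector = nn.Linear(vision_tower.config.hidden_size, 4096)

image = Image.open("example.jpg")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # (1, num_patches + 1, vision_hidden_size)
    patch_embeds = vision_tower(pixel_values).last_hidden_state

image_embeds = projector(patch_embeds)  # (1, num_patches + 1, 4096)

# The LLM then consumes these directly as input embeddings: the image
# embeddings are concatenated with (or spliced into) the text token
# embeddings at the image-placeholder position, and the decoder runs as usual.
# inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
```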
Hi turboderp!
Just wanted to ask: what are your thoughts on integrating this architecture?
Is this within the scope of this repo, and is there any plan to implement it any time soon?
I saw some feature requests for multimodal models under the issues tab, but those were left unanswered.
Looking forward to hearing from you!
Best regards,
gros