Conceptual help on SigLIP + pre-trained CLIP #891
miguelalba96
asked this question in
Q&A
How reasonable is it to fine-tune a pre-trained CLIP checkpoint with the SigLIP loss?
I have millions of image-text pairs for fashion. Each image can contain multiple attributes such as colors, fabrics, and embellishments. The data is very sparse, i.e. for some dresses I only have the caption "white dress", while for others I have "white dress with back zipper and buttons and sequins".
To do zero-shot classification, I took fashion-CLIP and fine-tuned it with LoRA, reaching an effective batch size of 12,000. After training, I defined a hierarchical softmax approach for the "multi-label" classification, so color logits are compared only with colors, fabrics with fabrics, and so on. That works reasonably well, except in groups like closures (5 categories), where the labels are not mutually exclusive: a jacket can have a front zipper and front line buttons at the same time, so softmax hurts those cases a lot.
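To make the closures problem concrete, here is a minimal sketch comparing a per-group softmax with independent sigmoids. The label names and logit values are made up for illustration (they are not from my model), and the 0.5 threshold is an arbitrary assumption.

```python
import math

def softmax(logits):
    """Softmax over one attribute group; outputs sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    """Independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-v))

def predict(probs, labels, threshold=0.5):
    """Return every label whose probability exceeds the threshold."""
    return [label for label, p in zip(labels, probs) if p > threshold]

# Hypothetical "closures" group with 5 categories and made-up logits.
# The jacket has both a front zipper and front buttons, so both score high.
closure_labels = ["front zipper", "front buttons", "back zipper",
                  "hook and eye", "no closure"]
closure_logits = [4.0, 4.0, -2.0, -3.0, -3.0]

softmax_probs = softmax(closure_logits)
sigmoid_probs = [sigmoid(v) for v in closure_logits]

# Softmax splits the probability mass between the two true labels,
# so each ends up just under 0.5 and neither passes the threshold.
print(predict(softmax_probs, closure_labels))
# Sigmoid scores each label on its own, so both true labels pass.
print(predict(sigmoid_probs, closure_labels))
```

With softmax, two equally strong labels in the same group can never both exceed 0.5, which matches what I see in the closures group.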
I saw that SigLIP uses a sigmoid-based loss instead, which seems more suitable for the cases I am facing. So I tried fine-tuning again with LoRA, starting from the base 224 checkpoint, but I noticed the model produces very low logits/probabilities when the captions are very simple, e.g. "red dress". So I was thinking: would it make sense to take fashion-CLIP and fine-tune it with SigLIP's sigmoid loss? Maybe the initial logits would be larger than with pre-trained SigLIP? Does that make sense?
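For reference, this is how I understand the pairwise sigmoid loss, as a minimal pure-Python sketch rather than the actual open_clip implementation. The default `temperature=10.0` and `bias=-10.0` are the initialization values I recall from the SigLIP paper; a trained checkpoint will have learned different values, and a CLIP checkpoint has a learnable logit scale but no bias term, so the bias would have to be added as a new parameter.

```python
import math

def log_sigmoid(v):
    """Numerically stable log(sigmoid(v))."""
    return min(v, 0.0) - math.log1p(math.exp(-abs(v)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def siglip_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image/text embeddings.

    Pair (i, i) is a positive (label +1); every other pair (i, j)
    is a negative (label -1). Each pair is an independent binary term.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(imgs[i], txts[j]) + bias
            total -= log_sigmoid(label * logit)
    return total / n

# Toy 2-D embeddings (assumed values): perfectly aligned vs. swapped pairs.
aligned = siglip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = siglip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Note that with these assumed values a perfectly matched pair (cosine similarity 1) gets a logit of 10 * 1 - 10 = 0, i.e. a sigmoid probability of only 0.5. So the negative bias alone might explain part of the low probabilities, independent of how simple the caption is.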
Also, is the sigmoid loss less susceptible to "class collisions", i.e. cases where several items in the same batch have similar or identical captions, such as multiple dresses with the same caption?
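To illustrate what I mean by a collision: in the plain sigmoid loss, every off-diagonal pair is treated as a negative, so two items with the same caption are pushed apart even though the pairing is valid. Since each pair is an independent binary term, one option I am considering is to relabel those pairs. The sketch below is my own hypothetical modification, not part of SigLIP; `pair_labels` and `duplicate_policy` are names I made up.

```python
def pair_labels(captions, duplicate_policy="positive"):
    """Build the (image i, text j) label matrix for a sigmoid loss.

    +1 = positive pair, -1 = negative pair, 0 = pair excluded from the loss.
    Off-diagonal pairs whose captions are identical are either treated as
    positives ("positive") or masked out ("ignore").
    """
    n = len(captions)
    labels = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(1)
            elif captions[i] == captions[j]:
                row.append(1 if duplicate_policy == "positive" else 0)
            else:
                row.append(-1)
        labels.append(row)
    return labels

# Hypothetical batch in which two items share the caption "red dress".
batch_captions = ["red dress", "white dress", "red dress"]
as_positive = pair_labels(batch_captions, "positive")
as_ignored = pair_labels(batch_captions, "ignore")
print(as_positive)
print(as_ignored)
```

This only catches exact string matches; near-duplicate captions would need some similarity threshold, which I have not tried.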