Text Feature Unsupervised Clustering #16

Open · mightycatty opened this issue Nov 28, 2024 · 3 comments

mightycatty commented Nov 28, 2024

Observations on Text Clustering

Thank you for your excellent work.

I am running an experiment that clusters texts with DBSCAN on their text embeddings. However, after comparing results from microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned and openai/clip-vit-base-patch16, I have noticed some peculiarities in the clustering outcomes.

Here is a simplified example:

Text Samples:

this is a small apple
a small apple on the table
a rotten apple
a green apple
red apple

a running dog
dogs fighting each other
a dog is playing with a ball
cute dog

DBSCAN Setup:

  • eps = 0.8
  • min_samples = 3

Clustering Results:

  • microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned (1 cluster):
      • this is a small apple
      • a rotten apple
      • a green apple
      • red apple
  • openai/clip-vit-base-patch16 (2 clusters):
      • Cluster 1:
          • this is a small apple
          • a small apple on the table
          • a rotten apple
          • a green apple
          • red apple
      • Cluster 2:
          • a running dog
          • dogs fighting each other
          • a dog is playing with a ball
          • cute dog

Note: This is a simplified example for the purpose of this issue. My actual dataset is much more complex, and the performance of microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned appears to be significantly worse.
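For concreteness, here is a minimal sketch of how the comparison above might be reproduced with the openai/clip-vit-base-patch16 text tower (the LLM2CLIP text encoder would be swapped in analogously). The cosine metric, the Hugging Face loading code, and the unit normalization are my assumptions; the issue does not state them, and eps=0.8 means something different under cosine vs. Euclidean distance:

```python
# Minimal sketch: embed the sample texts with a CLIP text encoder and
# cluster them with DBSCAN. Assumes the cosine metric; the issue does
# not say which distance was used, so interpret eps=0.8 accordingly.
import torch
from sklearn.cluster import DBSCAN
from transformers import CLIPModel, CLIPProcessor

texts = [
    "this is a small apple", "a small apple on the table",
    "a rotten apple", "a green apple", "red apple",
    "a running dog", "dogs fighting each other",
    "a dog is playing with a ball", "cute dog",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

with torch.no_grad():
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

labels = DBSCAN(eps=0.8, min_samples=3, metric="cosine").fit_predict(feats.numpy())
for text, label in zip(texts, labels):
    print(label, text)  # -1 marks points DBSCAN treats as noise
```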

My Questions:

  1. Is it expected that the text-clustering performance of LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned is inferior?
  2. I have also compared the image features, and they do indeed perform better than OpenCLIP's. If the first point is valid, why does LLM2CLIP's retrieval performance still exceed OpenCLIP's?

Yif-Yang (Collaborator) commented Dec 3, 2024

Hello, I find your exploration very interesting. In our own experiments, the DBSCAN results on the LLM2CLIP text-encoder output are indeed only middling (though perhaps not as poor as your evaluation suggests; different DBSCAN parameters are needed), while the visual side performs quite well. For comparison, we also tested the pre-adapter LLM part of the LLM2CLIP text encoder, and its performance is actually quite good; after the adapter is applied, however, performance seems to degrade significantly. We suspect that retrieval tasks and DBSCAN clustering may not be highly consistent with each other. What are your thoughts on this?
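On the point that different DBSCAN parameters are needed: one common heuristic is to pick eps from the k-distance curve of each embedding space separately, since different encoders produce embeddings at different density scales. A minimal sketch, assuming `feats` is a unit-normalized (n, d) array of text embeddings produced as in the earlier snippet:

```python
# Sketch of the standard k-distance heuristic for choosing DBSCAN's eps.
# The "elbow" of the sorted k-distance curve is a common eps candidate,
# and it will generally differ between embedding spaces.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(feats: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the sorted cosine distance to each point's k-th neighbor."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(feats)
    dists, _ = nn.kneighbors(feats)
    return np.sort(dists[:, -1])  # column 0 is the point itself (distance 0)

# curve = k_distance_curve(feats, k=3)  # inspect where the curve bends;
# the same eps=0.8 need not be comparable across models.
```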

Yif-Yang (Collaborator) commented Dec 3, 2024

We would be happy to work with you to analyze similar phenomena, and perhaps we can assist you in conducting some experiments if needed.

mightycatty (Author) commented Dec 17, 2024

Hello, sorry for taking so long to reply. I have moved on to other projects due to a project rotation.

I carefully reviewed the principles of DBSCAN and now think that DBSCAN clustering performance indeed does not necessarily align with image-text retrieval performance.
Because DBSCAN is a density-based method, the text embeddings do not necessarily form dense clusters on the unit hypersphere, especially for an LLM, whose internal representation of text is far more complex than CLIP's.
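This intuition can be made concrete with a toy example: embeddings can support perfect nearest-neighbor retrieval while containing no dense region at all, in which case DBSCAN marks every point as noise. The construction below (all numbers invented purely for illustration) spreads paired text/image embeddings around the unit circle:

```python
# Toy illustration: perfect text-image retrieval with zero DBSCAN clusters.
# Retrieval only needs relative ranking; DBSCAN needs dense neighborhoods.
import numpy as np
from sklearn.cluster import DBSCAN

n = 8
angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
texts = np.stack([np.cos(angles), np.sin(angles)], axis=1)             # unit circle
images = np.stack([np.cos(angles + 0.01), np.sin(angles + 0.01)], axis=1)

# Retrieval: every image's most similar text is its true partner.
sims = images @ texts.T
print("retrieval@1:", (sims.argmax(axis=1) == np.arange(n)).mean())    # -> 1.0

# Clustering: adjacent points sit ~0.29 apart in cosine distance, beyond
# eps, so no point is a core point and DBSCAN labels everything noise (-1).
labels = DBSCAN(eps=0.2, min_samples=3, metric="cosine").fit_predict(texts)
print("labels:", labels)                                               # -> all -1
```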
