
TSNE vis: update the model & embeddings #102

Merged: 1 commit into ml4code:source from bzz:source-new-emb on May 11, 2023

Conversation

@bzz (Contributor) commented Apr 30, 2023

Some improvements in visualisation relevant to #62.

It uses the 'all-MiniLM-L6-v2' model (87 MB instead of 418 MB, with the same 512 context size), which is faster and seems to provide a better visualisation.
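For reference, a minimal sketch of what the swap looks like with the sentence-transformers library; the model name is the one from this PR, while the sample data and variable names are hypothetical:

```python
from sentence_transformers import SentenceTransformer

# ~87 MB model, from https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer("all-MiniLM-L6-v2")

abstracts = [
    "We propose a neural model for code completion ...",
    "A study of bug localization using learned embeddings ...",
]
embeddings = model.encode(abstracts)
print(embeddings.shape)  # (2, 384): this model emits 384-dim vectors
```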

Before: the previous model
[screenshot]

The previous model, using both abstracts and titles.
[screenshot]

The new model (batch size 1, abstracts only)
[screenshot]

After: the new model + titles + batched
[screenshot]

I tried UMAP for it as well, and the results seem less interesting (but I didn't experiment much).

UMAP:
[screenshot]

I also tried a larger 420 MB model fine-tuned on scientific papers from SciRepEval, allenai/specter2 with the proximity adapter, which takes 1.5 min vs 30 sec for the above. It can't be switched through the CLI alone, though, as it requires loading an adapter.
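To illustrate the extra step, a hedged sketch of loading the proximity adapter, following the allenai/specter2 model card and assuming the `adapters` add-on package for Hugging Face transformers is installed (this is not this repo's code):

```python
from transformers import AutoTokenizer
from adapters import AutoAdapterModel  # assumption: `adapters` package installed

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoAdapterModel.from_pretrained("allenai/specter2_base")

# The proximity adapter must be loaded and activated explicitly,
# which is why a model-name-only CLI switch is not enough.
model.load_adapter("allenai/specter2", source="hf",
                   load_as="proximity", set_active=True)

text = "Some paper title" + tokenizer.sep_token + "Some paper abstract"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
embedding = model(**inputs).last_hidden_state[:, 0]  # first-token embedding
```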

specter2 TSNE:
[screenshot]

Let me know what you think and which one you prefer!

Use a smaller model that is fast and provides better quality:
'all-MiniLM-L6-v2' from https://www.sbert.net/docs/pretrained_models.html

Use title as well as abstract for paper embeddings.

Encode & avg. in batches.
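A minimal sketch of what the commit message describes, assuming sentence-transformers; the paper fields and the helper name are hypothetical:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_papers(papers, batch_size=64):
    """Encode titles and abstracts in batches, then average the two views."""
    titles = [p["title"] for p in papers]
    abstracts = [p["abstract"] for p in papers]
    t = model.encode(titles, batch_size=batch_size)
    a = model.encode(abstracts, batch_size=batch_size)
    return (t + a) / 2.0  # element-wise average per paper
```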
@mallamanis (Contributor)

Thanks for looking into this @bzz! All these options seem quite interesting, yet it's hard to decide without looking at what each point represents.

May I suggest that you sample 3-4 papers that you know and see which of the visualizations gives "reasonable" neighbors? I'd be happy to go with whichever you find more useful, in that sense 👓

@bzz (Contributor, Author) commented May 3, 2023

Indeed, that is exactly what I did; I apologize for missing this crucial information :)

Here are the interactive visualisations of the embeddings plus metadata for all 4 models, sorted by my subjective perception of "cluster quality"; they should be easy to explore:

The best results were achieved with T-SNE hyperparameters: perplexity 10-20, learning rate 0.01, 1-2k steps.
It's also better to switch "Label By" to Title.
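Those runs were interactive (the "Label By" control suggests the TensorFlow Embedding Projector), but an equivalent offline run with the same hyperparameters might look like this scikit-learn sketch, reusing `embeddings` from the earlier snippets:

```python
from sklearn.manifold import TSNE

# hyperparameters mirror the ones quoted above; note that perplexity
# must stay below the number of papers being projected
tsne = TSNE(
    n_components=2,
    perplexity=15,       # "ppl 10-20"
    learning_rate=0.01,  # "LR 0.01"
    n_iter=2000,         # "1-2k steps"
    random_state=42,
)
coords = tsne.fit_transform(embeddings)  # (n_papers, 2) points to plot
```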

Clearly identifiable clusters

  • Types
  • Completion
  • Search
  • Summarization
  • Bugs (localization, fix/repair)
  • (OpenAI) Commit messages, Decompilation/Obfuscation, ...

Let me know what you think!

@mallamanis merged commit edb4eb5 into ml4code:source on May 11, 2023
@mallamanis (Contributor)

This looks great! Thanks a lot for this 💯

@mallamanis (Contributor)

Hi @bzz, it seems that the Action fails with this change; I suspect this is due to some restriction on GitHub Actions (memory?). Do you maybe have time to investigate?

https://github.com/ml4code/ml4code.github.io/actions/runs/4952439871

For now, I've reverted this PR in #103

@bzz (Contributor, Author) commented May 15, 2023

Oh my! From a quick glance, it may also have to do with some changes in the CI runner image, actions/runner-images#7188 🤔

I'll set up CI on my fork to try to reproduce, and will report back.

@bzz (Contributor, Author) commented May 28, 2023

> I suspect this is due to some restriction on GitHub Actions (memory?)

You are right; I had missed that the CI failed repeatedly, and it's easily reproducible.

Here is the RAM profile across different batch sizes (1 to 512, doubling every ~20 sec) 🙄
[screenshot: all-MiniLM-L6-v2 memory profile]
So the VM gets killed with an OOM after exceeding the 7 GB RAM limit.
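For anyone who wants to reproduce the measurement locally, a hypothetical sketch (assumes psutil is installed; the texts are synthetic stand-ins for the real abstracts):

```python
import os
import psutil
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["some long abstract " * 100] * 2048  # synthetic stand-in corpus
proc = psutil.Process(os.getpid())

batch_size = 1
while batch_size <= 512:
    model.encode(texts, batch_size=batch_size)
    rss_gb = proc.memory_info().rss / 1e9
    print(f"batch_size={batch_size:3d}  rss={rss_gb:.2f} GB")
    batch_size *= 2  # doubles, mirroring the profile above
```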

@bzz bzz deleted the source-new-emb branch May 28, 2023 11:31