-
I fully agree! It would definitely be nice if we could find a way that allows users to easily explore hyperparameters.
Interesting idea to check out the similarity of topics across different runs. It also highlights what I believe is the main problem to figure out, namely: "What are you evaluating?". For instance, if you find common topics across multiple runs, does that then mean they are actually "good" topics? This means we first have to define what "good" topics actually are for specific users.

As such, I believe an interface/UI to explore the topics created with BERTopic would be ideal. The exploration itself can still be user-guided but automated with several common (potentially user-chosen) values to choose from. An interface would also alleviate problems that arise from hyperparameters that are exceedingly difficult to automate. For instance, what values are you going to pick for …? That's why I think exploration rather than optimization would be preferred, as the latter remains dependent on the user's use case. The moment a user is done with exploration and has figured out what is important to them, you can then use those parameters for optimization.

Part of this discussion potentially also involves the newly released EVōC, which creates a hierarchy of assigned labels that can also be used to explore the number of topics users might be interested in.
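For illustration, that kind of user-guided but automated exploration could look something like the minimal sketch below, which fits one BERTopic model per candidate value of a single parameter and keeps the topic summaries around for side-by-side inspection rather than optimizing a metric. The choice of HDBSCAN's `min_cluster_size`, the candidate values, and the `docs` variable are all assumptions for the sake of the example:

```python
# Minimal exploration sketch: fit one BERTopic model per candidate value of a
# hyperparameter and keep the topic summaries around for manual comparison.
from bertopic import BERTopic
from hdbscan import HDBSCAN

candidate_sizes = [10, 25, 50, 100]   # user-chosen values to explore (assumed)
topic_info = {}

for size in candidate_sizes:
    hdbscan_model = HDBSCAN(min_cluster_size=size, prediction_data=True)
    topic_model = BERTopic(hdbscan_model=hdbscan_model)
    topic_model.fit_transform(docs)   # `docs`: your list of documents (assumed)
    topic_info[size] = topic_model.get_topic_info()

# Inspect how the number of topics (and their top words) shifts per setting.
for size, info in topic_info.items():
    print(size, len(info))            # row count includes the -1 outlier topic
```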
-
Hmm, all this makes sense, and I definitely have preferences for …. I think one of the issues I'm having is that the data I'm using has several million documents in it, so it takes some time to estimate any given model, which slows down the exploration process. I'm going to take a look at EVoC, though; it looks potentially helpful for this. Thanks!
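One common way to cut down the per-run cost on a corpus that size (a sketch, not part of the original reply; the embedding model name, `docs`, and the candidate values are assumptions) is to compute the document embeddings once and pass them into every fit, so only the dimensionality reduction and clustering steps are repeated per setting:

```python
# Precompute embeddings once, then reuse them across every exploratory fit.
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model
embeddings = embedding_model.encode(docs, show_progress_bar=True)

for size in [25, 50, 100]:                        # assumed candidate values
    topic_model = BERTopic(min_topic_size=size)
    topic_model.fit_transform(docs, embeddings=embeddings)
    print(size, len(topic_model.get_topic_info()))
```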
-
I've read a fair number of posts on here talking about evaluation and hyperparameter tuning (see #90, #582, and #1031 for some examples with a lot of good info). My summary of all this is that the hyperparameters can matter, but that @MaartenGr believes simply trying to maximize some metric is likely to overfit the data and can lead to less useful results. I generally agree with this, but I'm also in a world where, if there are parameters to be set, people are going to want to know how they were set.
What I am considering doing, and was hoping to get some feedback on, is selecting a variety of reasonable hyperparameters, estimating N different models, and then comparing how similar the topics from each model are. If I can say that under these N reasonable choices the model output didn't change in any unreasonable way, I can show that my (somewhat arbitrary) choices don't matter.
It seems that I can use something similar to the code here to calculate how similar the topics are.
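I don't have that linked code in front of me, but as a rough sketch of what the comparison could look like, each topic in one fitted model can be matched to its most similar topic in another using the Jaccard overlap of their top words (the helper names below are hypothetical):

```python
# Compare two fitted BERTopic models by matching topics on top-word overlap.
def top_words(model, n=10):
    """Return {topic_id: set of top-n words}, skipping the -1 outlier topic."""
    return {
        topic_id: {word for word, _ in words[:n]}
        for topic_id, words in model.get_topics().items()
        if topic_id != -1
    }

def best_matches(model_a, model_b, n=10):
    """For each topic in model_a, find its closest topic in model_b (Jaccard)."""
    words_a, words_b = top_words(model_a, n), top_words(model_b, n)
    matches = {}
    for topic_a, set_a in words_a.items():
        scores = {
            topic_b: len(set_a & set_b) / len(set_a | set_b)
            for topic_b, set_b in words_b.items()
        }
        matches[topic_a] = max(scores.items(), key=lambda kv: kv[1])
    return matches   # {topic_in_a: (best_topic_in_b, jaccard_score)}
```

If the best-match scores stay high across the N runs, that would be one way to argue the (somewhat arbitrary) choices aren't moving the topics much.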
I guess my question is, does this all seem reasonable? I was also thinking about using the hierarchical topics in this similarity part as well, as I've noticed that when `min_cluster_size` in HDBSCAN is increased you sort of move up the hierarchical tree to a certain degree (not sure if this makes sense).
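To poke at that hierarchy intuition concretely, BERTopic's `hierarchical_topics` can be computed on the fine-grained model and its tree compared against the coarser model's flat topics (a sketch; `fine_model` and `docs` are assumed names):

```python
# Build the merge hierarchy of the fine-grained model and print it as a tree,
# to eyeball whether a larger min_cluster_size roughly corresponds to cutting
# this tree higher up.
hierarchical_topics = fine_model.hierarchical_topics(docs)
print(fine_model.get_topic_tree(hierarchical_topics))
```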