Best practices for multilabeling #2131

annarrrzzz · 2024-08-28T21:47:47Z

Hi BERTopic experts, I have some documents and would like to assign each of them to multiple groups where relevant. Per this post, we can find the distribution of multiple topics for each document using Topic Distribution. However, per this other post, "there is a difference between [a topic] having a high probability and actually being contained in the document". How do we reconcile these? More specifically:

1/ Is Topic Distribution the solution to unsupervised multilabeling?
2/ If so, is it advised that we pick the top n topics where the probabilities do not differ by X ppt?

Answered by MaartenGr

MaartenGr · 2024-08-29T06:08:58Z

Maintainer

1/ Is Topic Distribution the solution to unsupervised multilabeling?

It is indeed one of the solutions you could take. The most straightforward method (and perhaps the most accurate) is to split your documents into sentences before feeding them to BERTopic. Each sentence then gets its own topic, and a document can be labeled with the topics of its sentences.
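A minimal sketch of the sentence-splitting approach, assuming a naive regex splitter is good enough for your text (a proper sentence tokenizer would likely do better). The helper names `split_into_sentences` and `labels_per_document` are hypothetical and not part of BERTopic; the BERTopic call is shown only in a comment so the helpers can be run on their own.

```python
import re
from collections import defaultdict


def split_into_sentences(docs):
    """Split each document into sentences with a naive regex.

    Returns two parallel lists: the sentences and the index of the
    document each sentence came from.
    """
    sentences, doc_ids = [], []
    for doc_id, doc in enumerate(docs):
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sentence:
                sentences.append(sentence)
                doc_ids.append(doc_id)
    return sentences, doc_ids


def labels_per_document(doc_ids, sentence_topics, n_docs):
    """Collect the set of sentence-level topics for each document.

    Topic -1 (BERTopic's outlier topic) is ignored.
    """
    labels = defaultdict(set)
    for doc_id, topic in zip(doc_ids, sentence_topics):
        if topic != -1:
            labels[doc_id].add(topic)
    return [sorted(labels[i]) for i in range(n_docs)]


# With BERTopic installed, the sentence topics could come from, for example:
#   from bertopic import BERTopic
#   sentences, doc_ids = split_into_sentences(docs)
#   sentence_topics, _ = BERTopic().fit_transform(sentences)
#   doc_labels = labels_per_document(doc_ids, sentence_topics, len(docs))
```

A document whose sentences are all outliers ends up with an empty label list, which may or may not be what you want for your use case.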

2/ If so, is it advised that we pick the top n topics where the probabilities do not differ by X ppt?

There isn't a single answer to this, as it depends heavily on both the data and your use case. However, .approximate_distribution has a min_similarity parameter that sets the minimum similarity a piece of the document must have to a topic before it counts toward that topic, so you can use it to control how strict the resulting distribution is.
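As a rough illustration of the "top n topics within X ppt" idea from the question, here is a sketch that post-processes one row of a topic distribution. The function `select_topics` and its defaults (`max_gap=0.10`, `top_n=3`) are hypothetical choices for illustration, not BERTopic settings or recommended values. It assumes each row holds one score per topic, which is how the output of `approximate_distribution` is typically used.

```python
def select_topics(row, max_gap=0.10, top_n=3):
    """Pick up to top_n topics whose score is within max_gap of the best one.

    `row` is a sequence of per-topic scores for a single document.
    Topics with a score of zero are never selected.
    """
    ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
    best = ranked[0][1] if ranked else 0.0
    if best <= 0:
        return []  # no topic passed the similarity threshold
    return [
        topic
        for topic, score in ranked[:top_n]
        if score > 0 and best - score <= max_gap
    ]


# With a fitted BERTopic model, the rows could come from, for example:
#   topic_distr, _ = topic_model.approximate_distribution(docs, min_similarity=0.1)
#   doc_labels = [select_topics(row) for row in topic_distr]
```

Raising `min_similarity` zeroes out weak matches before this step, while `max_gap` and `top_n` only decide how many of the remaining topics to keep, so the two are worth tuning together on a sample of your own documents.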
