Best practices for multilabeling #2131
Hi BERTopic Experts, I have some documents, and would like to cluster them into multiple groups if relevant. Per this post, we can find out the distribution of multiple topics for each document using Topic Distribution. However, per this other post, "there is a difference between [a topic] having a high probability and actually being contained in the document". How do we consolidate these? More specifically: 1/ Is Topic Distribution the solution to unsupervised multilabeling?
Replies: 1 comment
It is indeed one of the approaches you could take. The most straightforward (and perhaps most accurate) method is to split your documents into sentences before feeding them to BERTopic.
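As a rough sketch of the sentence-splitting approach, assuming a naive regex splitter (a proper sentence tokenizer would likely do better) and two hypothetical helpers, `split_into_sentences` and `labels_per_document`, that are not part of BERTopic: each sentence gets a single topic, and a document's labels are the set of topics assigned to its sentences.

```python
import re


def split_into_sentences(docs):
    """Split each document into sentences and remember which document each came from.

    Naive splitter (an assumption for illustration): breaks after '.', '!' or '?'.
    """
    sentences, doc_ids = [], []
    for doc_id, doc in enumerate(docs):
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sentence:
                sentences.append(sentence)
                doc_ids.append(doc_id)
    return sentences, doc_ids


def labels_per_document(doc_ids, sentence_topics, n_docs):
    """Collect the unique topics of each document's sentences, skipping outliers (-1)."""
    labels = [set() for _ in range(n_docs)]
    for doc_id, topic in zip(doc_ids, sentence_topics):
        if topic != -1:  # BERTopic uses -1 for the outlier topic
            labels[doc_id].add(int(topic))
    return [sorted(topics) for topics in labels]


# Usage with BERTopic (left commented out because it needs a real corpus):
# from bertopic import BERTopic
# sentences, doc_ids = split_into_sentences(docs)
# topics, _ = BERTopic().fit_transform(sentences)
# doc_labels = labels_per_document(doc_ids, topics, len(docs))
```

Whether to keep every topic that appears in at least one sentence, or only topics that appear in several, depends on the use case.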
There isn't a single answer to this, as it depends heavily on your data and your use case. However, there is the `min_similarity` parameter in `.approximate_distribution` that you can use to decide how high the similarity scores need to be before a topic is counted.
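For illustration, a minimal sketch of turning the output of `.approximate_distribution` into multiple labels per document. The helper names `assign_labels` and `multilabel`, and the extra `threshold` cutoff, are assumptions made for this example and not part of BERTopic; the sketch also assumes that column `i` of the returned distribution corresponds to topic `i`, with the outlier topic excluded.

```python
def assign_labels(topic_distr, threshold=0.2):
    """For each document, return the topic indices whose weight is >= threshold.

    `threshold` is a hypothetical cutoff chosen for this sketch; tune it on your data.
    """
    return [
        [topic for topic, weight in enumerate(row) if weight >= threshold]
        for row in topic_distr
    ]


def multilabel(topic_model, docs, min_similarity=0.1, threshold=0.2):
    """Approximate topic distributions with a fitted BERTopic model, then threshold them."""
    # approximate_distribution returns (topic_distr, topic_token_distr);
    # raising min_similarity makes a topic harder to attribute to a document
    topic_distr, _ = topic_model.approximate_distribution(
        docs, min_similarity=min_similarity
    )
    return assign_labels(topic_distr, threshold)


# Usage (requires a fitted model, so left commented out):
# doc_labels = multilabel(topic_model, docs, min_similarity=0.2, threshold=0.3)
```

A document whose weights all fall below the cutoff ends up with no labels, which may be preferable to forcing a topic onto it.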