
Best practices for multilabeling #2131

Answered by MaartenGr
annarrrzzz asked this question in Q&A

1/ Is Topic Distribution the solution to unsupervised multilabeling?

It is one of the solutions you could take. The most straightforward method (and perhaps the most accurate) is to split your documents into sentences before feeding them to BERTopic, so that each sentence can receive its own topic and a document inherits the set of topics found across its sentences.
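A minimal sketch of that splitting step, assuming a naive regex-based sentence splitter (`split_into_sentences` is a hypothetical helper, not part of BERTopic; a real pipeline would likely use nltk or spaCy for splitting):

```python
import re

def split_into_sentences(docs):
    """Split each document into sentences, keeping a per-sentence
    document index so sentence-level topics can later be mapped
    back to their source document."""
    sentences, doc_ids = [], []
    for doc_id, doc in enumerate(docs):
        # Naive splitter: break after ., !, or ? followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sent:
                sentences.append(sent)
                doc_ids.append(doc_id)
    return sentences, doc_ids

docs = [
    "Topic modeling is fun. It also scales well.",
    "BERTopic uses embeddings.",
]
sentences, doc_ids = split_into_sentences(docs)

# The sentences (not the full documents) are then passed to BERTopic:
# from bertopic import BERTopic
# topic_model = BERTopic()
# topics, probs = topic_model.fit_transform(sentences)
# A document's label set is the union of its sentences' topics,
# recovered via doc_ids.
```

The BERTopic call itself is left commented out since fitting a model requires the library and an embedding backend; the splitting and index bookkeeping are the part specific to this multilabel setup.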

2/ If so, is it advised that we pick the top n topics where the probabilities do not differ by X ppt?

There isn't a single answer to this, as it depends heavily on your data and your use case. However, `.approximate_distribution` has a min_similarity parameter that you can use to set a threshold: topics whose similarity score falls below it are not assigned to the document.
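To illustrate the "top n topics within X ppt" idea from the question, here is a sketch that filters one row of a topic distribution. `select_topics` and `max_gap` are hypothetical helpers for this answer, not BERTopic API; only `approximate_distribution` and `min_similarity` (shown in the comment) are from the library itself:

```python
import numpy as np

def select_topics(distribution, min_similarity=0.1, max_gap=0.05):
    """Pick topic labels for one document from its distribution row:
    keep topics scoring at least `min_similarity` AND within
    `max_gap` (X percentage points) of the best-scoring topic."""
    distribution = np.asarray(distribution, dtype=float)
    best = distribution.max()
    mask = (distribution >= min_similarity) & (distribution >= best - max_gap)
    return np.flatnonzero(mask).tolist()

# In BERTopic, the distribution rows would come from:
# topic_distr, _ = topic_model.approximate_distribution(docs, min_similarity=0.1)
row = [0.42, 0.40, 0.08, 0.30]
select_topics(row, min_similarity=0.1, max_gap=0.05)  # → [0, 1]
```

Topic 2 is dropped by the absolute `min_similarity` floor and topic 3 by the relative `max_gap` cutoff; tuning those two values against your data is where the use-case dependence comes in.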
