Best practice regarding outliers for dynamic topic modeling #1738
-
Hi! I'm working on dynamic topic modeling for a corpus of scientific abstracts, and I wanted to check on a best practice I didn't see mentioned in the docs. The docs do cover outlier reduction, which forces all points to be assigned to a cluster. I performed this on my data, since 2,694 of my original 5,564 abstracts were outliers when I first ran the model. I then performed topic modeling with:
I had the thought that maybe the topics over time would be clearer if I didn't reduce outliers, because (to my understanding of how this works) so many reassigned outliers might be weakening the specific signal of the topic clusters (I'm not sure if that's true). I did notice quite a difference between the two.

[Plot: topics over time, with outlier reduction]

[Plot: topics over time, without outlier reduction]

The difference in frequencies is large enough that it would change my interpretation of the popularity of topics relative to one another over time. I was wondering if folks had thoughts on which plot is the "truer" representation of topic evolution over time?

EDIT: I realized I posted the wrong screenshots! The previous ones were from two separate models with different random seeds; now that I'm looking at both versions from the same model, it looks like the impact isn't that large. I'd imagine it could be, though, so I would still love to know people's thoughts!
Replies: 1 comment 5 replies
-
HDBSCAN can be quite strict in assigning outliers. Therefore, `reduce_outliers` was introduced to reduce their impact. In practice, you would have to check yourself whether the new assignments make sense by inspecting a subset manually. Do note that this function can also reduce only *some* of the outliers, and it's actually advised not to reduce all of them.
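To make that last point concrete, here is a minimal NumPy sketch of threshold-based reassignment in the spirit of `reduce_outliers` with an embeddings-style strategy (this is an illustration of the idea, not BERTopic's actual implementation; the function name and threshold value are made up for the example). Each outlier document is moved to its most similar topic only if the cosine similarity clears a threshold, so low-confidence documents stay outliers rather than being forced into a cluster:

```python
import numpy as np

def reassign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.3):
    """Reassign outlier documents (topic -1) to their most similar topic,
    but only when cosine similarity clears `threshold`. Documents below
    the threshold stay outliers, so not every outlier is reduced."""
    new_topics = list(topics)
    # Normalize rows so the dot product below is cosine similarity.
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    t = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sims = d @ t.T  # shape: (n_docs, n_topics)
    for i, topic in enumerate(topics):
        if topic == -1:  # only touch outliers
            best = int(np.argmax(sims[i]))
            if sims[i, best] >= threshold:
                new_topics[i] = best
    return new_topics

# Toy data: two topic centroids and three documents, one a clear outlier.
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_embeddings = np.array([
    [0.9, 0.1],    # close to topic 0 -> reassigned
    [0.1, 0.9],    # already assigned to topic 1 -> untouched
    [-1.0, -1.0],  # dissimilar to both -> stays an outlier
])
topics = [-1, 1, -1]
print(reassign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.5))
```

Raising the threshold keeps more documents as outliers; lowering it toward zero approaches the "reduce everything" behavior that is advised against above. Inspecting which documents crossed the threshold is a practical way to do the manual spot-check mentioned.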