Best practice regarding outliers for dynamic topic modeling #1738
-
Hi! I'm working on dynamic topic modeling for a corpus of scientific abstracts, and I wanted to check on a best practice I didn't see mentioned in the docs. The docs do cover outlier reduction, which forces all points to be assigned to a cluster. I performed this on my data, since 2,694 of my original 5,564 abstracts were outliers when I first ran the model. I then performed topic modeling with:
I had the thought that maybe the topics over time would be clearer if I didn't reduce outliers, because (to my understanding of how this works) so many reassigned outliers might be weakening the specific signal of the topic clusters (I'm not sure if that's true). I did notice quite a difference between the two.

[Plot: topics over time, with outlier reduction]

[Plot: topics over time, without outlier reduction]

The difference in frequencies is large enough that it would change my interpretation of the popularity of topics relative to one another over time. I was wondering if folks had thoughts on which plot is the "truer" representation of topic evolution over time?

EDIT: I realized I posted the wrong screenshots! The previous ones were from two separate models with different random seeds; now that I'm looking at both versions from the same model, it looks like the impact isn't that large. I'd imagine it could be, though, so I would still love to know people's thoughts!
Replies: 1 comment 5 replies
-
HDBSCAN can be quite strict in assigning outliers. Therefore, `reduce_outliers` was introduced to reduce their impact. In practice, you would have to check yourself whether the new assignments make sense by inspecting a subset manually. Do note that this function can also reduce only *some* of the outliers, and it's actually advised not to reduce all of them.
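To make that last point concrete, here is a minimal NumPy sketch of threshold-based reassignment in the spirit of `reduce_outliers` with an embeddings-style strategy (this is an illustration of the idea, not BERTopic's actual implementation; the function name and threshold value are made up for the example). Each outlier document is moved to its most similar topic only if the cosine similarity clears a threshold, so low-confidence documents stay outliers rather than being forced into a cluster:

```python
import numpy as np

def reassign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.3):
    """Reassign outlier documents (topic -1) to their most similar topic,
    but only when cosine similarity clears `threshold`. Documents below
    the threshold stay outliers, so not every outlier is reduced."""
    new_topics = list(topics)
    # Normalize rows so the dot product below is cosine similarity.
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    t = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sims = d @ t.T  # shape: (n_docs, n_topics)
    for i, topic in enumerate(topics):
        if topic == -1:  # only touch outliers
            best = int(np.argmax(sims[i]))
            if sims[i, best] >= threshold:
                new_topics[i] = best
    return new_topics

# Toy data: two topic centroids and three documents, one a clear outlier.
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_embeddings = np.array([
    [0.9, 0.1],    # close to topic 0 -> reassigned
    [0.1, 0.9],    # already assigned to topic 1 -> untouched
    [-1.0, -1.0],  # dissimilar to both -> stays an outlier
])
topics = [-1, 1, -1]
print(reassign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.5))
```

Raising the threshold keeps more documents as outliers; lowering it toward zero approaches the "reduce everything" behavior that is advised against above. Inspecting which documents crossed the threshold is a practical way to do the manual spot-check mentioned.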