New data coming in every week. Can I merge models with overlapping documents? #2117
Unanswered
sfc-gh-mschipper
asked this question in
Q&A
Replies: 1 comment
-
Both approaches are valid use cases, and which one fits depends in part on the size and stability of your data. More specifically, if only a small amount of data comes in each week, you might want to approach it as in your example, since 100 documents may be too few to cluster properly on their own. You can also use semi-supervised topic modeling to make sure that the topics you created previously remain very similar, while still allowing the new documents to be clustered into similar topics.
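As a rough sketch of the semi-supervised suggestion above: BERTopic accepts a `y` argument in `fit_transform`, where `-1` marks a document as unlabeled. The helper names below (`build_semi_supervised_labels`, `refit_with_previous_topics`) are hypothetical, and the sketch assumes the previous topic assignments are kept in the same order as the old documents.

```python
from typing import List, Sequence


def build_semi_supervised_labels(previous_topics: Sequence[int], n_new_docs: int) -> List[int]:
    """Reuse earlier topic assignments as labels; mark new documents as unlabeled (-1)."""
    # Old documents that were outliers (-1) before simply stay unlabeled.
    return list(previous_topics) + [-1] * n_new_docs


def refit_with_previous_topics(old_docs, new_docs, previous_topics):
    """Refit on old + new documents, nudging old documents toward their earlier topics."""
    # Imported lazily so the label helper above works without bertopic installed.
    from bertopic import BERTopic

    docs = list(old_docs) + list(new_docs)
    y = build_semi_supervised_labels(previous_topics, len(new_docs))

    topic_model = BERTopic()
    # y steers the dimensionality reduction; it does not guarantee identical topic IDs.
    topics, _ = topic_model.fit_transform(docs, y=y)
    return topic_model, topics
```

Note that the labels only guide the embedding reduction step, so the resulting topics should stay close to the earlier ones, but the topic numbering may still differ between runs.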
-
I receive ~100 new documents every week, which I need to merge with an existing topic model. I'd like to be able to discover new topics if they emerge from the new data, while keeping the previous documents mapped to the existing topic model. Rerunning the topic model every week might reshuffle the documents and create new labels, which is not ideal.
Online topic modeling looks like the ideal choice in theory, but I'd like to stick with UMAP + HDBSCAN, as this approach has shown very good results, so I will not pursue online topic modeling.
Merging models seems like the right choice in this case. However, I noticed that in the example, Maarten trains 3 topic models on data that doesn't overlap. Is it recommended to merge topic models that have been built incrementally, on overlapping documents?
Example:
Topic Model 1: 2000 initial documents
Topic Model 2: 2000 initial documents + 100 new documents.
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2])
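A slightly fuller sketch of the weekly merge described above, assuming `topic_model_1` is already fitted. `BERTopic.merge_models` takes a `min_similarity` threshold (0.7 by default): topics from the later model that are not similar enough to any topic in the first model are added as new topics. The helper names `merge_weekly` and `added_topic_ids` are hypothetical, and the comparison assumes the first model's topic IDs are kept unchanged in the merged model.

```python
from typing import Iterable, List


def added_topic_ids(base_ids: Iterable[int], merged_ids: Iterable[int]) -> List[int]:
    """Return topic IDs present in the merged model but not in the base model."""
    return sorted(set(merged_ids) - set(base_ids))


def merge_weekly(topic_model_1, all_docs, min_similarity: float = 0.7):
    """Fit a second model on old + new documents and merge it into the first."""
    # Imported lazily so added_topic_ids can be used without bertopic installed.
    from bertopic import BERTopic

    topic_model_2 = BERTopic().fit(all_docs)
    merged_model = BERTopic.merge_models(
        [topic_model_1, topic_model_2], min_similarity=min_similarity
    )

    # Compare topic tables to see which topics came only from the second model.
    base_ids = topic_model_1.get_topic_info()["Topic"]
    merged_ids = merged_model.get_topic_info()["Topic"]
    return merged_model, added_topic_ids(base_ids, merged_ids)
```

Raising `min_similarity` makes it easier for topics from the second model to count as new; lowering it folds more of them into the existing topics.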