General Questions about use in Production #18

zilch42 · 2023-03-09T01:24:45Z

Hi there,

Thanks for your work on this package. It is nice to have a way to optimise some of these model parameters. I'm interested in your experience of how the two parameters being optimised behave with datasets of different sizes, and how you might see this fitting into a production environment.

Say for example I am only ever interested in scientific journal articles, and generally have a consistent level of granularity that I want to work at, but will be working on different groups of publications. One day I may be looking at 100k publications over all subject areas, another day I may be looking at 2k publications in agriculture, another day I may be looking at 10k publications on astronomy.

Is it likely that I might find a set of parameters from TopicTuner during development that works well for this type of data, which I can then stick to regardless of the size of the corpus I'm working on? Or is it more likely that I would need to tune the model for each corpus? And if I were building a tool for someone else, would that rely on some analyst skill to select the right parameters from the parameter search, or would I need to come up with some heuristic for choosing them automatically?

Interested in your thoughts.
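To make the "heuristic" option concrete, here is a rough sketch of the kind of automatic rule I have in mind: keep only the searched parameter pairs whose cluster count falls inside a target range, then take the one that leaves the fewest documents uncategorized. Everything here is an assumption for illustration: `pick_parameters`, the dictionary keys, and the numbers in `search_results` are made up and are not TopicTuner's actual output format.

```python
from typing import Dict, List, Optional

# Hypothetical shape for one row of a parameter search; the key names
# are assumptions, not TopicTuner's real column names.
SearchRow = Dict[str, int]


def pick_parameters(
    results: List[SearchRow], min_clusters: int, max_clusters: int
) -> Optional[SearchRow]:
    """Return the row with the fewest uncategorized documents among rows
    whose cluster count lies in [min_clusters, max_clusters], or None."""
    candidates = [
        row for row in results
        if min_clusters <= row["n_clusters"] <= max_clusters
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda row: row["n_uncategorized"])


# Made-up values, only to show how the rule behaves.
search_results = [
    {"min_cluster_size": 20, "min_samples": 2, "n_clusters": 95, "n_uncategorized": 310},
    {"min_cluster_size": 40, "min_samples": 8, "n_clusters": 48, "n_uncategorized": 520},
    {"min_cluster_size": 40, "min_samples": 20, "n_clusters": 41, "n_uncategorized": 890},
    {"min_cluster_size": 120, "min_samples": 12, "n_clusters": 9, "n_uncategorized": 640},
]

best = pick_parameters(search_results, min_clusters=20, max_clusters=60)
print(best)
```

The target range would stand in for the "consistent level of granularity" I mentioned above; whether a single range transfers across corpora of 2k and 100k documents is exactly what I'm unsure about.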

drob-xx · 2023-03-09T06:41:46Z

Maintainer

Good question. This is something I haven't played with much. I don't think reusing the same parameters on a new set of vectors from UMAP will give a similar clustering; at the very least you would need newly tuned parameters. I might be wrong. HDBSCAN has approximate_predict() for this, so you should test with that. Of course you can't call approximate_predict through TopicTuner; you'll need to get an HDBSCAN instance and then call it yourself. This could be an interesting use case to build into TopicTuner though - it could even produce a BERTopic instance whose HDBSCAN implementation actually calls approximate_predict. Let me know how it goes - I'll likely play around with this over the weekend.
