Releases: MaartenGr/BERTopic

Major Release v0.6

09 Mar 13:23
1ffc456

Highlights:

  • DTM: Added a basic dynamic topic modeling technique based on the global c-TF-IDF representation (see the sketch after this list)
    • model.topics_over_time(docs, timestamps, global_tuning=True)
  • DTM: Option to fine-tune topics based on the c-TF-IDF representation at t-1, which results in topics that evolve over time
    • Only uses topics at t-1 and skips evolution if there is a gap
    • model.topics_over_time(docs, timestamps, evolution_tuning=True)
  • DTM: Function to visualize topics over time
    • model.visualize_topics_over_time(topics_over_time)
  • DTM: Add binning of timestamps
    • model.topics_over_time(docs, timestamps, nr_bins=10)
  • Added a function to get general information about topics (id, frequency, name, etc.)
    • get_topic_info()
  • Improved stability of c-TF-IDF by taking the average number of words across all topics instead of the number of documents
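
A minimal sketch of the new DTM workflow, following the calls listed above; docs (a list of documents) and timestamps (a matching list of dates) are assumed to be given:

```python
from bertopic import BERTopic

# docs and timestamps are assumed to be given
model = BERTopic()
topics, probs = model.fit_transform(docs)

# Bin the timestamps and compute topics over time with global tuning
topics_over_time = model.topics_over_time(docs, timestamps,
                                          global_tuning=True,
                                          evolution_tuning=True,
                                          nr_bins=10)

# General information about the resulting topics (id, frequency, name, etc.)
print(model.get_topic_info())

# Visualize how the topics evolve over time
fig = model.visualize_topics_over_time(topics_over_time)
```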

Fixes:

  • Fixed _map_probabilities() not taking into account that the outlier class has no probability, and mutating the probabilities instead of copying them (#63, #64)

Major Release v0.5

08 Feb 13:35
e84d7d1

Features

  • Added Flair to allow for more (custom) token/document embeddings
  • Option to use a custom UMAP, HDBSCAN, or CountVectorizer model (see the sketch after this list)
  • Added a low_memory parameter to reduce memory usage during computation
  • Improved verbosity (shows a progress bar)
  • Improved testing
  • Use the newest version of sentence-transformers, as it speeds up encoding significantly
  • Return the figure of visualize_topics()
  • Expose all parameters with a single function: get_params()
  • Option to disable saving the embedding_model, which should significantly reduce the size of a saved BERTopic model
  • Added an FAQ page
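
A hedged sketch of the custom sub-model options above; the parameter names umap_model, hdbscan_model, and vectorizer_model follow BERTopic's API, but verify them against your installed version:

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Custom sub-models instead of the defaults
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", low_memory=True)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean")
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

model = BERTopic(umap_model=umap_model,
                 hdbscan_model=hdbscan_model,
                 vectorizer_model=vectorizer_model,
                 low_memory=True)

# Expose all parameters with a single call
print(model.get_params())
```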

Fixes

  • To simplify the API, the parameters stop_words and n_neighbors were removed. They can still be used by passing in a custom CountVectorizer or UMAP model (see the sketch above).
  • Set calculate_probabilities to False by default, since calculating probabilities with HDBSCAN significantly increases computation time and memory usage. It can still be enabled manually (see the sketch after this list).
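
If the probability distributions are needed, a minimal sketch of opting back in (docs is assumed to be given):

```python
from bertopic import BERTopic

# Probabilities are now off by default; opt in explicitly when needed
model = BERTopic(calculate_probabilities=True)
topics, probs = model.fit_transform(docs)  # probs is no longer None
```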

Fix embedding parameter

10 Jan 07:54
8813b4d

Fixed the parameter embedding_model not working properly when language had been set. If you are using an older version of BERTopic, please set language to False when you want to set embedding_model.
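
As a sketch of the workaround on older versions (the model name here is only an example):

```python
from bertopic import BERTopic

# On older versions, disable the language default so embedding_model takes effect
model = BERTopic(language=False,
                 embedding_model="distilbert-base-nli-mean-tokens")
```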

Language fix

07 Jan 11:30
c271ec6

There was an issue with selecting the correct language model. This is now fixed with this small PyPI update.

Major Release

21 Dec 09:34

Highlights:

  • Visualize topics in a way similar to LDAvis (see the sketch after this list)
  • Added option to reduce topics after training
  • Added option to update topic representation after training
  • Added option to search topics using a search term
  • Significantly improved the stability of generating clusters
  • Fine-tune the topic words by selecting the most coherent words with the highest c-TF-IDF values
  • More extensive tutorials in the documentation
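
A hypothetical sketch of the post-training options above; the exact argument lists changed across versions, so treat these signatures as assumptions:

```python
from bertopic import BERTopic

model = BERTopic(language="english")
topics, probs = model.fit_transform(docs)  # docs assumed given

# Reduce the number of topics after training
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=30)

# Update the topic representation after training
model.update_topics(docs, new_topics, n_gram_range=(1, 3))

# Search topics using a search term
similar_topics, similarity = model.find_topics("vehicle", top_n=5)

# LDAvis-style visualization; the figure is returned
fig = model.visualize_topics()
```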

Notable Changes:

  • Option to select language instead of sentence-transformers models to minimize the complexity of using BERTopic
  • Improved logging (removed duplicates)
  • Check if BERTopic is fitted
  • Added TF-IDF as an embedder instead of transformer models (see tutorial)
  • NumPy is dropping support for Python 3.6, which was therefore removed from the workflow
  • Preprocess text before passing it through c-TF-IDF
  • Merged get_topics_freq() with get_topic_freq()

Fixes:

  • Fixed an error in handling topic probabilities

BugFix Topic Reduction

16 Nov 11:52
07c3be9

Fixed a bug in the topic reduction method that reduced the number of topics, but not to the nr_topics defined in the class. Since this, to a certain extent, broke the topic reduction method, a new release was necessary.

Custom Embeddings

04 Nov 14:53
d96f5a1

Added the option to use custom embeddings, or embeddings that you generated beforehand with whatever package you'd like to use. This allows users to further customize BERTopic to their liking.

NOTE: I cannot guarantee that using your own embeddings will result in better performance. It is likely to swing both ways depending on the embeddings you use. For example, poorly trained W2V embeddings are likely to result in poor topics. Thus, it is up to the user to experiment with the embeddings that best serve their purposes.
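
A minimal sketch of the pre-computed-embeddings flow, using sentence-transformers as an example encoder (any package that produces one vector per document should work):

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Generate embeddings beforehand with whatever package you like
encoder = SentenceTransformer("distilbert-base-nli-mean-tokens")
embeddings = encoder.encode(docs, show_progress_bar=True)  # docs assumed given

# Pass the pre-computed embeddings to BERTopic
model = BERTopic()
topics, probs = model.fit_transform(docs, embeddings)
```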

Topic Probability Distribution

29 Oct 13:23
bdfebd5
  • transform() and fit_transform() now also return the topic probability distributions
  • Added visualize_distribution() which visualizes the topic probability distribution for a single document
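
A short sketch of the new return values and the visualization (docs assumed given):

```python
from bertopic import BERTopic

model = BERTopic()
topics, probs = model.fit_transform(docs)  # probability distributions now returned too

# Visualize the topic probability distribution of a single document
model.visualize_distribution(probs[0])
```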

Small patch release

17 Oct 06:43
a61a768
  • Fixed n_gram_range not being used
  • Added option for using stopwords
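
A sketch of both options, assuming they are exposed as constructor parameters in this release (stop_words was later removed in favor of a custom CountVectorizer):

```python
from bertopic import BERTopic

# n_gram_range is now respected; stop_words filters common words
model = BERTopic(n_gram_range=(1, 2), stop_words="english")
```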

Small patch release

11 Oct 13:37
dd9582e

Improved the class-based TF-IDF procedure by performing the calculation entirely on sparse matrices. This prevents out-of-memory problems on large datasets.
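
A hypothetical sketch of a class-based TF-IDF computed entirely on sparse matrices (not the library's exact implementation); docs_per_topic is an assumed list holding one concatenated string per topic:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# docs_per_topic: one concatenated string per topic (assumed given, non-empty)
counts = CountVectorizer().fit_transform(docs_per_topic)  # sparse: topics x terms

# Term frequency per class, computed without densifying the matrix
tf = counts.multiply(1 / counts.sum(axis=1))

# An IDF-like weight; a small dense vector of shape (1, n_terms)
idf = np.log(1 + counts.shape[0] / (1 + counts.sum(axis=0)))

# Class-based TF-IDF, still sparse throughout
ctfidf = sp.csr_matrix(tf.multiply(np.asarray(idf)))
```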