v0.12.0
Highlights
- Perform online/incremental topic modeling with .partial_fit
- Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
- The parameters bm25_weighting and reduce_frequent_words were added to potentially improve representations
- Expose attributes for easier access to internal data
- Added many tests with the intention of making development a bit more stable
Documentation
- Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
- Added an example of combining BERTopic with KeyBERT
Fixes
- Fixed iteratively merging topics (#632 and #648)
- Fixed 0th topic not showing up in visualizations (#667)
- Fixed lowercasing not being optional (#682)
- Fixed spelling (#664 and #673)
- Fixed 0th topic not shown in .get_topic_info by @oxymor0n in #660
- Fixed spelling by @domenicrosati in #674
- Added custom labels and title options to barchart by @leloykun in #694
Online/incremental topic modeling
Online topic modeling (sometimes called "incremental topic modeling") is the ability to learn incrementally from a mini-batch of instances. Essentially, it is a way to update your topic model with data on which it was not trained before. In Scikit-Learn, this technique is often modeled through a .partial_fit function, which is also used in BERTopic.
At a minimum, the cluster model needs to support a .partial_fit function in order to use this feature. The default HDBSCAN model will not work as it does not support online updating.
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic import BERTopic
# Prepare documents
all_docs = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'))["data"]
doc_chunks = [all_docs[i:i+1000] for i in range(0, len(all_docs), 1000)]
# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)
# Incrementally fit the topic model by training on 1000 documents at a time
for docs in doc_chunks:
    topic_model.partial_fit(docs)
Only the topics for the most recent batch of documents are tracked. If you want to use online topic modeling not for a streaming setting but merely for low-memory use cases, it is advised to also update the .topics_ attribute, as variations such as hierarchical topic modeling will not work afterward:
# Incrementally fit the topic model by training on 1000 documents at a time
# and track the topics in each iteration
topics = []
for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)
topic_model.topics_ = topics
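After updating the model with .partial_fit, unseen documents can be assigned to the learned topics in the usual way. A minimal sketch, assuming the topic_model fitted above; the example documents are made up for illustration, and assigning new documents requires the chosen cluster model to support prediction (MiniBatchKMeans does):
# Assign topics to documents the model was never trained on
new_docs = ["The graphics card market is finally cooling down.",
            "Our team won the championship game last night."]
topics, probs = topic_model.transform(new_docs)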
c-TF-IDF
Explicitly define, use, and adjust the ClassTfidfTransformer with the new parameters, bm25_weighting and reduce_frequent_words, to potentially improve the topic representation:
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
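The two parameters can also be combined. A minimal sketch; reduce_frequent_words is meant to dampen the influence of very frequent words on the topic representation, and whether either option helps depends on your data:
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
# Combine BM25 weighting with frequent-word reduction;
# either parameter can also be used on its own
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)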
Attributes
After having fitted your BERTopic instance, you can use the following attributes to have quick access to certain information, such as the topic assignment for each document in topic_model.topics_.
Attribute | Type | Description |
---|---|---|
topics_ | List[int] | The topics that are generated for each document after training or updating the topic model. The most recent topics are tracked. |
probabilities_ | List[float] | The probability of the assigned topic per document. These are only calculated if an HDBSCAN model is used for the clustering step. When calculate_probabilities=True, these are the probabilities of all topics per document. |
topic_sizes_ | Mapping[int, int] | The size of each topic. |
topic_mapper_ | TopicMapper | A class for tracking topics and their mappings anytime they are merged, reduced, added, or removed. |
topic_representations_ | Mapping[int, Tuple[str, float]] | The top n terms per topic and their respective c-TF-IDF values. |
c_tf_idf_ | csr_matrix | The topic-term matrix as calculated through c-TF-IDF. To access its respective words, run .vectorizer_model.get_feature_names() or .vectorizer_model.get_feature_names_out(). |
topic_labels_ | Mapping[int, str] | The default labels for each topic. |
custom_labels_ | List[str] | Custom labels for each topic as generated through .set_topic_labels. |
topic_embeddings_ | np.ndarray | The embeddings for each topic. They are calculated by taking the weighted average of word embeddings in a topic based on their c-TF-IDF values. |
representative_docs_ | Mapping[int, str] | The representative documents for each topic if HDBSCAN is used. |
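A minimal sketch of accessing a few of these attributes after fitting; the 20 NewsGroups data is reused from the example above purely for illustration:
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
docs = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'))["data"]
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# Topic assignment of the first ten documents
print(topic_model.topics_[:10])
# Size of each topic, including the -1 outlier topic
print(topic_model.topic_sizes_)
# Top n terms and their c-TF-IDF values for topic 0
print(topic_model.topic_representations_[0])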