v0.6.0 (#120)
* Major speedup, up to 2x to 5x for MMR and MaxSum, when passing multiple documents at once instead of one at a time
* Same results whether passing a single document or multiple documents
* MMR and MaxSum now work when passing a single document or multiple documents
* Improved documentation
* Added support for 🤗 Hugging Face Transformers as an embedding backend
* Highlighting support for Chinese texts
    * Now uses the CountVectorizer for creating the tokens
    * This should also improve highlighting for most applications, including those using higher n-grams
* Fix #106 
* Fix #116
MaartenGr authored Jul 27, 2022
1 parent 412099b commit 9dd7b59
Showing 20 changed files with 428 additions and 266 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -20,7 +20,7 @@ Corresponding medium post can be found [here](https://towardsdatascience.com/key
2. [Getting Started](#gettingstarted)
2.1. [Installation](#installation)
2.2. [Basic Usage](#usage)
2.3. [Max Sum Similarity](#maxsum)
2.3. [Max Sum Distance](#maxsum)
2.4. [Maximal Marginal Relevance](#maximal)
2.5. [Embedding Models](#embeddings)
<!--te-->
@@ -134,7 +134,7 @@ I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase
for multi-lingual documents or any other language.

<a name="maxsum"/></a>
### 2.3. Max Sum Similarity
### 2.3. Max Sum Distance

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
Then, we take all top_n combinations from the 2 x top_n words and extract the combination
4 changes: 2 additions & 2 deletions docs/api/maxsum.md
@@ -1,3 +1,3 @@
# `Max Sum Similarity`
# `Max Sum Distance`

::: keybert._maxsum.max_sum_similarity
::: keybert._maxsum.max_sum_distance
37 changes: 37 additions & 0 deletions docs/changelog.md
@@ -1,3 +1,38 @@
## **Version 0.6.0**
*Release date: 25 July, 2022*

**Highlights**:

* Major speedup, up to 2x to 5x for MMR and MaxSum, when passing multiple documents at once instead of one at a time (see the sketch at the end of these highlights)
* Same results whether passing a single document or multiple documents
* MMR and MaxSum now work when passing a single document or multiple documents
* Improved documentation
* Added support for 🤗 Hugging Face Transformers as an embedding backend

```python
from keybert import KeyBERT
from transformers.pipelines import pipeline

hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
kw_model = KeyBERT(model=hf_model)
```

* Highlighting support for Chinese texts
* Now uses the `CountVectorizer` for creating the tokens
    * This should also improve highlighting for most applications, including those using higher n-grams

![image](https://user-images.githubusercontent.com/25746895/179488649-3c66403c-9620-4e12-a7a8-c2fab26b18fc.png)

**NOTE**: Although highlighting for Chinese texts is improved, I am not familiar with the Chinese language, so there is a good chance it is not yet as optimized as for other languages. Any feedback with respect to this is highly appreciated!
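
The multi-document speedup comes from batching extraction in a single call. A minimal sketch (the two documents are placeholders):

```python
from keybert import KeyBERT

kw_model = KeyBERT()

# Placeholder documents; in practice, pass your own list of strings.
docs = [
    "Supervised learning is the machine learning task of learning a function.",
    "Unsupervised learning looks for structure in unlabeled data.",
]

# One batched call embeds all documents and candidate words together,
# which is what drives the 2x to 5x speedup for MMR and MaxSum.
keywords = kw_model.extract_keywords(docs, use_mmr=True)
```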

**Fixes**:

* Fix typo in ReadMe by [@priyanshul-govil](https://github.com/priyanshul-govil) in [#117](https://github.com/MaartenGr/KeyBERT/pull/117)
* Add missing optional dependencies (gensim, use, and spacy) by [@yusuke1997](https://github.com/yusuke1997)
in [#114](https://github.com/MaartenGr/KeyBERT/pull/114)



## **Version 0.5.1**
*Release date: 31 March, 2022*

@@ -25,6 +60,8 @@
**Miscellaneous**:

* Added instructions in the FAQ to extract keywords from Chinese documents
* Fix typo in ReadMe by [@koaning](https://github.com/koaning) in [#51](https://github.com/MaartenGr/KeyBERT/pull/51)


## **Version 0.4.0**
*Release date: 23 June, 2021*
37 changes: 34 additions & 3 deletions docs/faq.md
@@ -15,12 +15,39 @@ typically do not contribute to the meaning of a document and should therefore be
topic modeling to HTML-code to extract topics of code, then it becomes important.


## **Can I use the GPU to speed up the model?**
Yes! Since KeyBERT uses embeddings as its backend, a GPU is actually preferred when using this package.
## **How can I speed up the model?**
Since KeyBERT uses large language models as its backend, a GPU is typically preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.

A second method for speeding up KeyBERT is to pass it multiple documents at once. That way, words
only need to be embedded a single time, which can result in a major speedup.

This is **faster**:

```python
from keybert import KeyBERT

kw_model = KeyBERT()

keywords = kw_model.extract_keywords(my_list_of_documents)
```

This is **slower**:

```python
from keybert import KeyBERT

kw_model = KeyBERT()

keywords = []
for document in my_list_of_documents:
    keyword = kw_model.extract_keywords(document)
    keywords.append(keyword)
```


## **How can I use KeyBERT with Chinese documents?**
You need to make sure you use a Tokenizer in KeyBERT that supports tokenization of Chinese. I suggest installing [`jieba`](https://github.com/fxsjy/jieba) for this:
You need to make sure you use a tokenizer in KeyBERT that supports tokenization of Chinese. I suggest installing [`jieba`](https://github.com/fxsjy/jieba) for this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
```
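
The middle of the block above is collapsed in this diff; a hedged reconstruction of the full setup (the `tokenize_zh` helper name and the sample document are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT
import jieba

def tokenize_zh(text):
    # jieba segments Chinese text into words for the vectorizer.
    return jieba.lcut(text)

doc = "今天天气真好,适合出去散步。"  # placeholder Chinese document
vectorizer = CountVectorizer(tokenizer=tokenize_zh)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
```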

It also supports highlighting:

![image](https://user-images.githubusercontent.com/25746895/179488649-3c66403c-9620-4e12-a7a8-c2fab26b18fc.png)
6 changes: 4 additions & 2 deletions docs/guides/countvectorizer.md
@@ -42,8 +42,10 @@ Next, we can use a basic vectorizer when extracting keywords as follows:
('mapping', 0.3700)]
```
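
The extraction call producing these keywords is collapsed above; a minimal sketch with a plain vectorizer (`doc` stands in for the guide's example document):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function."  # placeholder
kw_model = KeyBERT()

# Passing the vectorizer makes it responsible for tokenization
# and candidate keyword generation.
keywords = kw_model.extract_keywords(doc, vectorizer=CountVectorizer())
```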

**NOTE**: Although I typically like to use `use_mmr=True` as it often improves upon the generated keywords, this tutorial will do without
in order to give you a clear view of the effects of the CountVectorizer.
!!! note "NOTE"
    Although I typically like to use `use_mmr=True` as it often improves upon the generated keywords, this tutorial will do without
    in order to give you a clear view of the effects of the CountVectorizer.


## **Parameters**

15 changes: 15 additions & 0 deletions docs/guides/embeddings.md
@@ -20,6 +20,21 @@ sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

### 🤗 **Hugging Face Transformers**
To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models):

```python
from transformers.pipelines import pipeline

hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
kw_model = KeyBERT(model=hf_model)
```

!!! tip "Tip!"
    These transformers also work quite well using `sentence-transformers`, which has a number of
    optimization tricks that make using it a bit faster.

### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:
23 changes: 9 additions & 14 deletions docs/guides/quickstart.md
@@ -14,12 +14,6 @@ pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```

## **Usage**

The most minimal example can be seen below for the extraction of keywords:
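
The example itself is collapsed in this diff; a hedged reconstruction of the usual minimal call (the document text is a placeholder):

```python
from keybert import KeyBERT

doc = """
    Supervised learning is the machine learning task of learning a function
    that maps an input to an output based on example input-output pairs.
"""  # placeholder document

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```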
@@ -65,17 +59,18 @@ of words you would like in the resulting keyphrases:
('learning function', 0.5850)]
```

We can highlight the keywords in the document by simply setting `hightlight`:
We can highlight the keywords in the document by simply setting `highlight`:

```python
keywords = kw_model.extract_keywords(doc, highlight=True)
```

**NOTE**: For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multi-lingual documents or any other language.
!!! note "NOTE"
    For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
    I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
    for multi-lingual documents or any other language.

### Max Sum Similarity
### **Max Sum Distance**

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
Then, we take all top_n combinations from the 2 x top_n words and extract the combination
@@ -91,7 +86,7 @@ whose words are the least similar to each other by cosine similarity.
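
The call producing the results below is collapsed in this diff; a sketch with assumed parameter values (`doc` and `kw_model` as set up earlier):

```python
# nr_candidates sets the 2 x top_n candidate pool described above.
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(3, 3),
    stop_words="english",
    use_maxsum=True,
    nr_candidates=20,
    top_n=5,
)
```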
('learning machine learning', 0.2891)]
```

### Maximal Marginal Relevance
### **Maximal Marginal Relevance**

To diversify the results, we can use Maximal Marginal Relevance (MMR), which is also based
on cosine similarity, to create keywords / keyphrases. The results
@@ -119,7 +114,7 @@ The results with **low diversity**:
('learning algorithm generalize', 0.7514)]
```
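
A sketch of the corresponding MMR call; the `diversity` value is an assumption (values closer to 1 give more diverse keyphrases):

```python
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(3, 3),
    stop_words="english",
    use_mmr=True,
    diversity=0.2,  # assumed low-diversity setting
)
```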

### Candidate Keywords/Keyphrases
### **Candidate Keywords/Keyphrases**
In some cases, one might want to use candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases. In KeyBERT, you can easily use those candidate keywords to perform keyword extraction:

```python
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates)
```
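
The candidate-generation step is collapsed above; one hedged way to produce candidates is an external extractor such as YAKE (the choice of YAKE is an assumption, any source of candidate strings works):

```python
import yake
from keybert import KeyBERT

# `doc` is the document used throughout this guide.
# Generate candidate keywords/keyphrases with YAKE.
kw_extractor = yake.KeywordExtractor(top=50)
candidates = [kw for kw, _score in kw_extractor.extract_keywords(doc)]

# KeyBERT then only ranks these candidates against the document.
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=candidates)
```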

### Guided KeyBERT
### **Guided KeyBERT**

Guided KeyBERT is similar to Guided Topic Modeling in that it tries to steer the training towards a set of seeded terms. When applying KeyBERT, it automatically extracts the keywords most related to a specific document. However, there are times when stakeholders and users are looking for specific types of keywords. For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article. However, there might be a specific topic in the article that you would like to be extracted through the keywords. To achieve this, we simply give KeyBERT a set of related seeded keywords (it can also be a single one!) and search for keywords that are similar to both the document and the seeded keywords.
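
A minimal sketch of guided extraction (the seed term and document are placeholders):

```python
from keybert import KeyBERT

doc = "..."  # your document
kw_model = KeyBERT()

# Seed keywords nudge extraction towards a topic of interest.
keywords = kw_model.extract_keywords(doc, seed_keywords=["information"])
```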

18 changes: 7 additions & 11 deletions docs/index.md
@@ -1,11 +1,11 @@
<img src="https://raw.githubusercontent.com/MaartenGr/KeyBERT/master/images/logo.png" width="35%" height="35%" align="right" />

# KeyBERT
# **KeyBERT**

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to
create keywords and keyphrases that are most similar to a document.

## About the Project
## **About the Project**

Although there are already many methods available for keyword generation
(e.g.,
@@ -31,7 +31,7 @@ papers and solutions out there that use BERT-embeddings
could be used for beginners (**correct me if I'm wrong!**).
Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.

## Installation
## **Installation**
Installation can be done using [pypi](https://pypi.org/project/keybert/):

```
pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```


## Usage
## **Usage**


The most minimal example can be seen below for the extraction of keywords:
@@ -99,3 +92,6 @@ of words you would like in the resulting keyphrases:
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
```

!!! note "NOTE"
    You can also pass multiple documents at once if you are looking for a major speed-up!
Empty file removed docs/style.css
7 changes: 7 additions & 0 deletions docs/stylesheets/extra.css
@@ -0,0 +1,7 @@
:root {
--md-primary-fg-color: #234E70;
}

:root>* {
--md-typeset-a-color: #0277BD;
}
2 changes: 1 addition & 1 deletion keybert/__init__.py
@@ -1,3 +1,3 @@
from keybert._model import KeyBERT

__version__ = "0.5.1"
__version__ = "0.6.0"