v0.6.0 (#120)
* Major speedup, up to 2x to 5x for MMR and MaxSum, when passing multiple documents at once instead of one at a time
* Same results whether passing a single document or multiple documents
* MMR and MaxSum now work when passing a single document or multiple documents
* Improved documentation
* Added support for 🤗 Hugging Face Transformers as an embedding backend
* Highlighting support for Chinese texts
    * Now uses the CountVectorizer for creating the tokens
    * This should also improve highlighting for most applications, including those using higher n-grams
* Fix #106 
* Fix #116
MaartenGr authored Jul 27, 2022
1 parent 412099b commit 9dd7b59
Showing 20 changed files with 428 additions and 266 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -20,7 +20,7 @@ Corresponding medium post can be found [here](https://towardsdatascience.com/key
2. [Getting Started](#gettingstarted)
2.1. [Installation](#installation)
2.2. [Basic Usage](#usage)
2.3. [Max Sum Similarity](#maxsum)
2.3. [Max Sum Distance](#maxsum)
2.4. [Maximal Marginal Relevance](#maximal)
2.5. [Embedding Models](#embeddings)
<!--te-->
@@ -134,7 +134,7 @@ I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase
for multi-lingual documents or any other language.

<a name="maxsum"/></a>
### 2.3. Max Sum Similarity
### 2.3. Max Sum Distance

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
Then, we take all top_n combinations from the 2 x top_n words and extract the combination
4 changes: 2 additions & 2 deletions docs/api/maxsum.md
@@ -1,3 +1,3 @@
# `Max Sum Similarity`
# `Max Sum Distance`

::: keybert._maxsum.max_sum_similarity
::: keybert._maxsum.max_sum_distance
37 changes: 37 additions & 0 deletions docs/changelog.md
@@ -1,3 +1,38 @@
## **Version 0.6.0**
*Release date: 25 July, 2022*

**Highlights**:

* Major speedup, up to 2x to 5x for MMR and MaxSum, when passing multiple documents at once instead of one at a time (see the sketch at the end of these highlights)
* Same results whether passing a single document or multiple documents
* MMR and MaxSum now work when passing a single document or multiple documents
* Improved documentation
* Added support for 🤗 Hugging Face Transformers as an embedding backend

```python
from keybert import KeyBERT
from transformers.pipelines import pipeline

hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
kw_model = KeyBERT(model=hf_model)
```

* Highlighting support for Chinese texts
* Now uses the `CountVectorizer` for creating the tokens
    * This should also improve highlighting for most applications, including those using higher n-grams

![image](https://user-images.githubusercontent.com/25746895/179488649-3c66403c-9620-4e12-a7a8-c2fab26b18fc.png)

**NOTE**: Although highlighting for Chinese texts is improved, I am not familiar with the Chinese language, so there is a good chance it is not yet as optimized as for other languages. Any feedback with respect to this is highly appreciated!
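
The multi-document speedup comes from batching extraction in a single call. A minimal sketch (the two documents are placeholders):

```python
from keybert import KeyBERT

kw_model = KeyBERT()

# Placeholder documents; in practice, pass your own list of strings.
docs = [
    "Supervised learning is the machine learning task of learning a function.",
    "Unsupervised learning looks for structure in unlabeled data.",
]

# One batched call embeds all documents and candidate words together,
# which is what drives the 2x to 5x speedup for MMR and MaxSum.
keywords = kw_model.extract_keywords(docs, use_mmr=True)
```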

**Fixes**:

* Fix typo in ReadMe by [@priyanshul-govil](https://github.com/priyanshul-govil) in [#117](https://github.com/MaartenGr/KeyBERT/pull/117)
* Add missing optional dependencies (gensim, use, and spacy) by [@yusuke1997](https://github.com/yusuke1997)
in [#114](https://github.com/MaartenGr/KeyBERT/pull/114)



## **Version 0.5.1**
*Release date: 31 March, 2022*

@@ -25,6 +60,8 @@
**Miscellaneous**:

* Added instructions in the FAQ to extract keywords from Chinese documents
* Fix typo in ReadMe by [@koaning](https://github.com/koaning) in [#51](https://github.com/MaartenGr/KeyBERT/pull/51)


## **Version 0.4.0**
*Release date: 23 June, 2021*
37 changes: 34 additions & 3 deletions docs/faq.md
@@ -15,12 +15,39 @@ typically do not contribute to the meaning of a document and should therefore be
topic modeling to HTML-code to extract topics of code, then it becomes important.


## **Can I use the GPU to speed up the model?**
Yes! Since KeyBERT uses embeddings as its backend, a GPU is actually preferred when using this package.
## **How can I speed up the model?**
Since KeyBERT uses large language models as its backend, a GPU is typically preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.

A second method for speeding up KeyBERT is to pass it multiple documents at once. That way, words
only need to be embedded a single time, which can result in a major speedup.

This is **faster**:

```python
from keybert import KeyBERT

kw_model = KeyBERT()

keywords = kw_model.extract_keywords(my_list_of_documents)
```

This is **slower**:

```python
from keybert import KeyBERT

kw_model = KeyBERT()

keywords = []
for document in my_list_of_documents:
    keyword = kw_model.extract_keywords(document)
    keywords.append(keyword)
```


## **How can I use KeyBERT with Chinese documents?**
You need to make sure you use a Tokenizer in KeyBERT that supports tokenization of Chinese. I suggest installing [`jieba`](https://github.com/fxsjy/jieba) for this:
You need to make sure you use a tokenizer in KeyBERT that supports tokenization of Chinese. I suggest installing [`jieba`](https://github.com/fxsjy/jieba) for this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
```
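
The middle of the block above is collapsed in this diff; a hedged reconstruction of the full setup (the `tokenize_zh` helper name and the sample document are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT
import jieba

def tokenize_zh(text):
    # jieba segments Chinese text into words for the vectorizer.
    return jieba.lcut(text)

doc = "今天天气真好,适合出去散步。"  # placeholder Chinese document
vectorizer = CountVectorizer(tokenizer=tokenize_zh)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
```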

It also supports highlighting:

![image](https://user-images.githubusercontent.com/25746895/179488649-3c66403c-9620-4e12-a7a8-c2fab26b18fc.png)
6 changes: 4 additions & 2 deletions docs/guides/countvectorizer.md
@@ -42,8 +42,10 @@ Next, we can use a basic vectorizer when extracting keywords as follows:
('mapping', 0.3700)]
```
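
The extraction call producing these keywords is collapsed above; a minimal sketch with a plain vectorizer (`doc` stands in for the guide's example document):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function."  # placeholder
kw_model = KeyBERT()

# Passing the vectorizer makes it responsible for tokenization
# and candidate keyword generation.
keywords = kw_model.extract_keywords(doc, vectorizer=CountVectorizer())
```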

**NOTE**: Although I typically like to use `use_mmr=True` as it often improves upon the generated keywords, this tutorial will do without
in order to give you a clear view of the effects of the CountVectorizer.
!!! note "NOTE"
    Although I typically like to use `use_mmr=True` as it often improves upon the generated keywords, this tutorial will do without
    in order to give you a clear view of the effects of the CountVectorizer.


## **Parameters**

15 changes: 15 additions & 0 deletions docs/guides/embeddings.md
@@ -20,6 +20,21 @@ sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

### 🤗 **Hugging Face Transformers**
To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models):

```python
from transformers.pipelines import pipeline

hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
kw_model = KeyBERT(model=hf_model)
```

!!! tip "Tip!"
    These transformers also work quite well using `sentence-transformers`, which has a number of
    optimization tricks that make using it a bit faster.

### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:
23 changes: 9 additions & 14 deletions docs/guides/quickstart.md
@@ -14,12 +14,6 @@ pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```

## **Usage**

The most minimal example can be seen below for the extraction of keywords:
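
The example itself is collapsed in this diff; a hedged reconstruction of the usual minimal call (the document text is a placeholder):

```python
from keybert import KeyBERT

doc = """
    Supervised learning is the machine learning task of learning a function
    that maps an input to an output based on example input-output pairs.
"""  # placeholder document

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```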
@@ -65,17 +59,18 @@ of words you would like in the resulting keyphrases:
('learning function', 0.5850)]
```

We can highlight the keywords in the document by simply setting `hightlight`:
We can highlight the keywords in the document by simply setting `highlight`:

```python
keywords = kw_model.extract_keywords(doc, highlight=True)
```

**NOTE**: For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multi-lingual documents or any other language.
!!! note "NOTE"
    For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
    I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
    for multi-lingual documents or any other language.

### Max Sum Similarity
### **Max Sum Distance**

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
Then, we take all top_n combinations from the 2 x top_n words and extract the combination
@@ -91,7 +86,7 @@ whose words are the least similar to each other by cosine similarity.
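
The call producing the results below is collapsed in this diff; a sketch with assumed parameter values (`doc` and `kw_model` as set up earlier):

```python
# nr_candidates sets the 2 x top_n candidate pool described above.
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(3, 3),
    stop_words="english",
    use_maxsum=True,
    nr_candidates=20,
    top_n=5,
)
```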
('learning machine learning', 0.2891)]
```

### Maximal Marginal Relevance
### **Maximal Marginal Relevance**

To diversify the results, we can use Maximal Marginal Relevance (MMR), which is also based
on cosine similarity, to create keywords / keyphrases. The results
@@ -119,7 +114,7 @@ The results with **low diversity**:
('learning algorithm generalize', 0.7514)]
```
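
A sketch of the corresponding MMR call; the `diversity` value is an assumption (values closer to 1 give more diverse keyphrases):

```python
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(3, 3),
    stop_words="english",
    use_mmr=True,
    diversity=0.2,  # assumed low-diversity setting
)
```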

### Candidate Keywords/Keyphrases
### **Candidate Keywords/Keyphrases**
In some cases, one might want to use candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases. In KeyBERT, you can easily use those candidate keywords to perform keyword extraction:

```python
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates)
```
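
The candidate-generation step is collapsed above; one hedged way to produce candidates is an external extractor such as YAKE (the choice of YAKE is an assumption, any source of candidate strings works):

```python
import yake
from keybert import KeyBERT

# `doc` is the document used throughout this guide.
# Generate candidate keywords/keyphrases with YAKE.
kw_extractor = yake.KeywordExtractor(top=50)
candidates = [kw for kw, _score in kw_extractor.extract_keywords(doc)]

# KeyBERT then only ranks these candidates against the document.
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=candidates)
```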

### Guided KeyBERT
### **Guided KeyBERT**

Guided KeyBERT is similar to Guided Topic Modeling in that it tries to steer the training towards a set of seeded terms. When applying KeyBERT, it automatically extracts the keywords most related to a specific document. However, there are times when stakeholders and users are looking for specific types of keywords. For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article. However, there might be a specific topic in the article that you would like to be extracted through the keywords. To achieve this, we simply give KeyBERT a set of related seeded keywords (it can also be a single one!) and search for keywords that are similar to both the document and the seeded keywords.
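
A minimal sketch of guided extraction (the seed term and document are placeholders):

```python
from keybert import KeyBERT

doc = "..."  # your document
kw_model = KeyBERT()

# Seed keywords nudge extraction towards a topic of interest.
keywords = kw_model.extract_keywords(doc, seed_keywords=["information"])
```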

18 changes: 7 additions & 11 deletions docs/index.md
@@ -1,11 +1,11 @@
<img src="https://raw.githubusercontent.com/MaartenGr/KeyBERT/master/images/logo.png" width="35%" height="35%" align="right" />

# KeyBERT
# **KeyBERT**

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to
create keywords and keyphrases that are most similar to a document.

## About the Project
## **About the Project**

Although there are already many methods available for keyword generation
(e.g.,
@@ -31,7 +31,7 @@ papers and solutions out there that use BERT-embeddings
could be used for beginners (**correct me if I'm wrong!**).
Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.

## Installation
## **Installation**
Installation can be done using [pypi](https://pypi.org/project/keybert/):

```
pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```


## Usage
## **Usage**


The most minimal example can be seen below for the extraction of keywords:
@@ -99,3 +92,6 @@ of words you would like in the resulting keyphrases:
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
```

!!! note "NOTE"
    You can also pass multiple documents at once if you are looking for a major speed-up!
Empty file removed docs/style.css
7 changes: 7 additions & 0 deletions docs/stylesheets/extra.css
@@ -0,0 +1,7 @@
:root {
--md-primary-fg-color: #234E70;
}

:root>* {
--md-typeset-a-color: #0277BD;
}
2 changes: 1 addition & 1 deletion keybert/__init__.py
@@ -1,3 +1,3 @@
from keybert._model import KeyBERT

__version__ = "0.5.1"
__version__ = "0.6.0"