## **Version 0.3.0**
*Release date: 10 May, 2021*

The two main features are **candidate keywords**
and several **backends** to use instead of Flair and SentenceTransformers!

**Highlights**:

* Use candidate words instead of extracting those from the documents ([#25](https://github.com/MaartenGr/KeyBERT/issues/25))
    * ```KeyBERT().extract_keywords(doc, candidates)```
* Spacy, Gensim, USE, and Custom Backends were added (see documentation [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html))
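Conceptually, candidate keywords are ranked by their similarity to the document instead of being mined from the document itself. A minimal sketch of that idea with a toy embedding lookup (the vectors and helper names below are illustrative only, not KeyBERT's actual backend):

```python
import numpy as np

# Toy embedding lookup standing in for a real sentence-embedding model
toy_vectors = {
    "machine learning is great": np.array([0.9, 0.4]),
    "machine learning":          np.array([1.0, 0.3]),
    "cooking":                   np.array([0.1, 1.0]),
}

def embed(text):
    return toy_vectors[text]

def rank_candidates(doc, candidates, top_n=1):
    # Rank the supplied candidates by cosine similarity to the document,
    # instead of extracting candidate phrases from the document
    doc_emb = embed(doc)
    sims = []
    for cand in candidates:
        c = embed(cand)
        sim = float(doc_emb @ c / (np.linalg.norm(doc_emb) * np.linalg.norm(c)))
        sims.append((cand, sim))
    return sorted(sims, key=lambda x: x[1], reverse=True)[:top_n]

print(rank_candidates("machine learning is great", ["machine learning", "cooking"]))
```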
**Fixes**:

* Improved imports
* Fix encoding error when locally installing KeyBERT ([#30](https://github.com/MaartenGr/KeyBERT/issues/30))

**Miscellaneous**:

* Improved documentation (ReadMe & MKDocs)
* Add the main tutorial as a shield
* Typos ([#31](https://github.com/MaartenGr/KeyBERT/pull/31), [#35](https://github.com/MaartenGr/KeyBERT/pull/35))
## **Version 0.2.0**
*Release date: 9 Feb, 2021*

**Highlights**:

* Add similarity scores to the output
* Add Flair as a possible back-end
* Update documentation + improved testing
## **Version 0.1.2**
*Release date: 28 Oct, 2020*

Added Max Sum Similarity as an option to diversify your results.

## **Version 0.1.0**
*Release date: 27 Oct, 2020*

This first release includes keyword/keyphrase extraction using BERT and simple cosine similarity.
There is also an option to use Maximal Marginal Relevance to select the candidate keywords/keyphrases.
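For readers curious how Maximal Marginal Relevance diversifies results: it greedily balances a candidate's relevance to the document against its similarity to keywords already chosen. A minimal sketch with toy two-dimensional vectors (not KeyBERT's internal implementation):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between rows of a and rows of b -> (len(a), len(b)) matrix."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mmr(doc_emb, word_embs, words, top_n=2, diversity=0.5):
    """Greedily pick words relevant to the doc but dissimilar to earlier picks."""
    doc_sim = cosine(word_embs, doc_emb).ravel()   # relevance of each word
    word_sim = cosine(word_embs, word_embs)        # redundancy between words
    selected = [int(np.argmax(doc_sim))]           # start with the most relevant word
    while len(selected) < top_n:
        remaining = [i for i in range(len(words)) if i not in selected]
        # MMR score: relevance minus similarity to the already-selected words
        scores = [(1 - diversity) * doc_sim[i]
                  - diversity * max(word_sim[i][j] for j in selected)
                  for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return [words[i] for i in selected]

# Toy embeddings: "web" and "internet" are near-duplicates, "cooking" is distinct
doc_emb = np.array([[1.0, 0.2]])
word_embs = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 1.0]])
words = ["web", "internet", "cooking"]
print(mmr(doc_emb, word_embs, words, top_n=2, diversity=0.7))
# high diversity skips the redundant "internet" in favour of "cooking"
```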
# Embedding Models
In this tutorial we will go through the embedding models that can be used in KeyBERT.
Having the option to choose embedding models allows you to leverage pre-trained embeddings that suit your use case.

### **Sentence Transformers**
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT

kw_model = KeyBERT(model="xlm-r-bert-base-nli-stsb-mean-tokens")
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
kw_model = KeyBERT(model=sentence_model)
```
|
||
### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).

Moreover, you can also use Flair to create document embeddings by pooling word embeddings.
Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
pass it to KeyBERT in order to use those word embeddings as document embeddings:

```python
from keybert import KeyBERT
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

glove_embedding = WordEmbeddings('crawl')
document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])

kw_model = KeyBERT(model=document_glove_embeddings)
```
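Since `DocumentPoolEmbeddings` averages the word vectors by default, the pooling step itself can be sketched in plain numpy (toy three-dimensional vectors standing in for real GloVe embeddings):

```python
import numpy as np

# Toy word vectors standing in for GloVe embeddings (3 dimensions for readability)
word_vectors = {
    "keyword":    np.array([0.9, 0.1, 0.0]),
    "extraction": np.array([0.7, 0.3, 0.1]),
    "with":       np.array([0.0, 0.0, 0.2]),
    "bert":       np.array([0.8, 0.2, 0.1]),
}

def pool_document(tokens):
    """Mean-pool word vectors into one document vector, as mean pooling does."""
    return np.mean([word_vectors[t] for t in tokens], axis=0)

doc_emb = pool_document(["keyword", "extraction", "with", "bert"])
print(doc_emb)  # one vector per document, same dimensionality as the word vectors
```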
|
||
### **Spacy**
[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
many models available across many languages for modeling text.

To use Spacy's non-transformer models in KeyBERT:

```python
import spacy
from keybert import KeyBERT

nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=nlp)
```

Using spacy-transformer models:

```python
import spacy
from keybert import KeyBERT

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=nlp)
```

If you run into memory issues with spacy-transformer models, try:

```python
import spacy
from keybert import KeyBERT
from thinc.api import set_gpu_allocator, require_gpu

nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
set_gpu_allocator("pytorch")
require_gpu(0)

kw_model = KeyBERT(model=nlp)
```
|
||
### **Universal Sentence Encoder (USE)**
The Universal Sentence Encoder encodes text into high-dimensional vectors that are used here
for embedding the documents. The model is trained and optimized for greater-than-word-length text,
such as sentences, phrases, or short paragraphs.

Using USE in KeyBERT is rather straightforward:

```python
import tensorflow_hub
from keybert import KeyBERT

embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
kw_model = KeyBERT(model=embedding_model)
```

### **Gensim**
For Gensim, KeyBERT supports its `gensim.downloader` module, which lets you download any word embedding model
to be used in KeyBERT. Note that Gensim is primarily used for word embedding models, which typically work
best for short documents since the word embeddings are pooled.

```python
import gensim.downloader as api
from keybert import KeyBERT

ft = api.load('fasttext-wiki-news-subwords-300')
kw_model = KeyBERT(model=ft)
```
|
||
### **Custom Backend**
If your backend or model cannot be found in the ones currently available, you can use the `keybert.backend.BaseEmbedder` class to
create your own backend. Below, you will find an example of creating a SentenceTransformer backend for KeyBERT:

```python
from keybert import KeyBERT
from keybert.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
        return embeddings

# Create custom backend
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
custom_embedder = CustomEmbedder(embedding_model=distilbert)

# Pass custom backend to keybert
kw_model = KeyBERT(model=custom_embedder)
```
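The only contract a custom backend needs to honor is that `embed` receives a list of documents and returns one embedding per document. A toy illustration of that shape contract, with a hypothetical stand-in model so the sketch runs without sentence-transformers (it mirrors the `BaseEmbedder` pattern above without importing keybert):

```python
import numpy as np

class ToyModel:
    """Stand-in for a real embedding model: hashes characters into a fixed-size vector."""
    def encode(self, documents, show_progress_bar=False):
        vecs = np.zeros((len(documents), 8))
        for i, doc in enumerate(documents):
            for ch in doc:
                vecs[i, ord(ch) % 8] += 1.0
        return vecs

class CustomEmbedder:
    # Same interface as the BaseEmbedder subclass above, keybert-free for illustration
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        return self.embedding_model.encode(documents, show_progress_bar=verbose)

embedder = CustomEmbedder(embedding_model=ToyModel())
embeddings = embedder.embed(["first document", "second document"])
print(embeddings.shape)  # (2, 8): one 8-dimensional vector per document
```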