Matching with Synonyms using KeyLLM OR KeyBERT #245

ChettakattuA · 2024-07-29T09:59:53Z

I have been experimenting with KeyBERT and KeyLLM for a while now, and here is something I would like to achieve.

If I have a text "CO2 emissions are high these days" and a list of candidate words that contains "Carbon dioxide" but not "CO2", will KeyBERT or KeyLLM find "Carbon dioxide" as a match?

Text = "CO2 emissions are high these days"
Candidate keyword list = ["Carbon dioxide"] (it does not contain "CO2")

Expected output = ["Carbon dioxide"]

MaartenGr · 2024-07-30T13:33:27Z

If I have a text "CO2 emissions are high these days" and a list of candidate words that contains "Carbon dioxide" but not "CO2", will KeyBERT or KeyLLM find "Carbon dioxide" as a match?

I think it should be possible if you use it as a candidate word. Have you tried it out?

ChettakattuA · 2024-08-08T14:08:11Z

In my test, KeyBERT did not identify the acronym or the synonym. These are the variations I tried:

Acronym: CO2 -> carbon dioxide
Synonym: emission -> release
Plural: emission -> emissions

The code used:

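from keybert import KeyBERT

kw_model = KeyBERT()
text = "CO2 emissions are high these days"
# Candidates cover the acronym, a synonym, and the singular/plural forms
candidates = ["carbon dioxide", "emissions", "release", "emission", "co2"]
keywords = kw_model.extract_keywords(text, candidates=candidates)
print(keywords)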

Is there some way to resolve this?

MaartenGr · 2024-08-10T06:38:53Z

Ah right, that's because the candidates have to appear in the original document in order to be found. Instead, you might want to use the seed_keywords parameter, which allows you to steer the model towards certain words. Note that you might have to use the global perspective here.
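
For reference, a minimal sketch of the seed_keywords suggestion above. The helper name extract_with_seeds and the demo function are hypothetical, introduced only for illustration; seed_keywords and top_n are existing arguments of extract_keywords. As far as I understand, seeding only steers the ranking, so the returned keywords are still taken from the document, and "carbon dioxide" itself would not come back unless it occurs in the text.

```python
from typing import List, Tuple


def extract_with_seeds(
    kw_model, text: str, seeds: List[str], top_n: int = 5
) -> List[Tuple[str, float]]:
    """Hypothetical helper: forward seed keywords to a KeyBERT-like model."""
    # seed_keywords nudges the ranking towards the seeds; it does not add
    # the seeds themselves to the pool of extractable keywords.
    return kw_model.extract_keywords(text, seed_keywords=seeds, top_n=top_n)


def demo() -> None:
    # Imported here so the helper above can be loaded without KeyBERT installed.
    from keybert import KeyBERT

    kw_model = KeyBERT()
    text = "CO2 emissions are high these days"
    print(extract_with_seeds(kw_model, text, ["carbon dioxide"]))
```

Calling demo() with KeyBERT installed prints the seeded keyword ranking for the example sentence.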

ChettakattuA · 2024-08-13T13:49:44Z

But do you know why it requires the word itself to appear in the text? What I understood from the documentation is that it uses embeddings and cosine similarity. Isn't that enough to match similar words or synonyms between the text and the candidates?

MaartenGr · 2024-08-17T14:07:19Z

@ChettakattuA That depends on what you want. Generally, keywords are derived directly from the written article, for example for SEO reasons. In KeyBERT, candidates are passed to the CountVectorizer as a vocabulary, which means they should appear in the original documents (as the vectorizer is fitted on the original documents):

KeyBERT/keybert/_model.py

Lines 163 to 182 in f0f96a6

    
           # Extract potential words using a vectorizer / tokenizer 
        
           if vectorizer: 
        
               count = vectorizer.fit(docs) 
        
           else: 
        
               try: 
        
                   count = CountVectorizer( 
        
                       ngram_range=keyphrase_ngram_range, 
        
                       stop_words=stop_words, 
        
                       min_df=min_df, 
        
                       vocabulary=candidates, 
        
                   ).fit(docs) 
        
               except ValueError: 
        
                   return [] 
        
           # Scikit-Learn Deprecation: get_feature_names is deprecated in 1.0 
        
           # and will be removed in 1.2. Please use get_feature_names_out instead. 
        
           if version.parse(sklearn_version) >= version.parse("1.0.0"): 
        
               words = count.get_feature_names_out() 
        
           else: 
        
               words = count.get_feature_names()
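
To make the effect concrete, here is a rough, dependency-free sketch of why only candidates that literally occur in the text survive. It is not KeyBERT's code: candidates_in_document and ngrams are hypothetical helpers that imitate a count vectorizer with a fixed vocabulary (lowercased text, tokens of two or more word characters) and keep the candidates with a non-zero count. Stop-word handling is ignored, and the candidates are assumed to be lowercase already.

```python
import re
from typing import List, Set, Tuple

# Tokens of two or more word characters, similar to a typical word tokenizer
TOKEN = re.compile(r"\b\w\w+\b")


def ngrams(tokens: List[str], ngram_range: Tuple[int, int]) -> Set[str]:
    """Return every n-gram of the token list for n within ngram_range."""
    low, high = ngram_range
    found = set()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return found


def candidates_in_document(
    text: str, candidates: List[str], ngram_range: Tuple[int, int] = (1, 1)
) -> List[str]:
    """Keep only the candidates that literally occur in the text."""
    tokens = TOKEN.findall(text.lower())
    present = ngrams(tokens, ngram_range)
    return [c for c in candidates if c in present]


text = "CO2 emissions are high these days"
can = ["carbon dioxide", "emissions", "release", "emission", "co2"]
found = candidates_in_document(text, can)
print(found)  # ['emissions', 'co2']
```

In this sketch "carbon dioxide", "release" and "emission" drop out before any embedding is compared, which would explain why the synonym, the acronym expansion and the singular form are never returned.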
