This library is really superb.
One of the tools I wish I had is a basic statistical language model (relative frequencies) of unigrams, bigrams, and trigrams. When extracting keywords from text, one failure of TF-IDF is that its scores are not calibrated, so unigram and bigram scores cannot be compared with one another. There is also the trouble of needing document and token frequencies. Instead, I normalize the TF/TF-IDF scores against English corpus statistics, which you already have within your models. Usually I use the unwieldy Google NGrams corpus, but yours is succinct and quite helpful. Is this easily accessed?
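To make the idea concrete, here is a rough sketch of the kind of normalization I mean. It does not use this library's API: `corpus_freq` is a hypothetical mapping from an n-gram to its relative frequency in a reference English corpus, the `floor` fallback for unseen n-grams is my own assumption, and the numbers are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def keyword_scores(tokens, corpus_freq, max_n=3, floor=1e-9):
    """Score each n-gram by log(document relative frequency / corpus relative frequency).

    corpus_freq is a hypothetical dict mapping an n-gram string to its
    relative frequency in a reference corpus; n-grams missing from it
    fall back to `floor` (an assumed smoothing value).
    """
    scores = {}
    for n in range(1, max_n + 1):
        grams = ngrams(tokens, n)
        if not grams:
            continue
        counts = Counter(grams)
        total = len(grams)
        for gram, count in counts.items():
            doc_rel = count / total
            scores[gram] = math.log(doc_rel / corpus_freq.get(gram, floor))
    return scores


# Toy corpus statistics, invented purely for illustration.
corpus_freq = {
    "the": 0.05,
    "language": 0.0002,
    "model": 0.0003,
    "language model": 0.00001,
}
tokens = "the language model the language model".split()
scores = keyword_scores(tokens, corpus_freq, max_n=2)
```

Because every score is a ratio against the same reference corpus, unigram and bigram scores land on one scale. In this toy example "language model" ranks above "language", which ranks above "the". N-grams absent from the reference corpus get very high scores under the floor fallback, so a real version would need better smoothing.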
Thanks!