Replies: 1 comment 2 replies
-
I'm familiar with many smoothing formulas. I asked you for evidence that your preferred smoothing formula would make any difference in dedupe performance: by that I mean showing that it makes a consistent difference in performance and recall across datasets, for example the datasets in the dedupe-examples repo. I would be very interested in what is possible with sklearn; it would be great to drop the dependency on BTrees. I suspect that it will not be favorable, because you will still need an implementation of an inverted index.
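To make the last point concrete, here is a minimal sketch of the kind of inverted index being referred to: a plain mapping from term to the set of document ids containing it, built with stdlib dicts rather than BTrees. The record strings are made-up toy data.

```python
# Minimal inverted-index sketch: term -> set of document ids.
# Toy data; dedupe's real index also stores frequencies for IDF.
from collections import defaultdict

docs = {0: "123 main st", 1: "123 main street", 2: "456 oak ave"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(index["123"])  # doc ids containing the term "123" -> {0, 1}
```

Whatever library computes the weights, some structure like this is still needed to go from a query term back to candidate records, which is the crux of the objection above.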
-
Hi @fgregg, following your suggestion in issue #1126, here are my thoughts on how the IDF should be computed.
Currently the IDF is computed as

idf(t) = log(1 + N / n_t)

where N is the total number of documents and n_t is the number of documents in which the term t appears. Reading the sklearn documentation, the smoothed variant it uses by default (`TfidfTransformer` with `smooth_idf=True`) is instead

idf(t) = log((1 + N) / (1 + n_t)) + 1

which adds one "virtual" document containing every term, preventing zero divisions and zero weights.
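The difference between the two formulas can be checked numerically. This is a toy comparison with made-up corpus sizes, not a benchmark on the dedupe-examples datasets:

```python
# Toy comparison of the two IDF variants discussed above:
# current:  idf(t) = log(1 + N / n_t)
# sklearn:  idf(t) = log((1 + N) / (1 + n_t)) + 1  (smooth_idf=True)
import math

N = 1000  # total number of documents (made-up example size)

for n_t in (1, 10, 100, 1000):  # document frequency of a term t
    idf_current = math.log(1 + N / n_t)
    idf_sklearn = math.log((1 + N) / (1 + n_t)) + 1
    print(f"n_t={n_t:5d}  current={idf_current:.3f}  sklearn={idf_sklearn:.3f}")
```

Note the behaviour at the extremes: for a term in every document (n_t = N), the current formula gives log(2) ≈ 0.693 while sklearn's gives exactly 1.0, and the two scales diverge as terms get rarer.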
In this Medium post you can see a comparison, on a toy example, of the behaviour of the standard TF-IDF against the version implemented in sklearn.
Please keep in mind that I'm not an expert in this field.
P.S.: have you ever considered using the sklearn implementation to compute TF-IDF? It is likely much faster.