Replies: 1 comment 2 replies
-
I'm familiar with many smoothing formulas. I asked you for evidence that your preferred smoothing formula would make any difference in dedupe performance: by that I mean showing that it makes a consistent difference in performance and recall across datasets, for example the datasets in the dedupe-examples repo. I would be very interested in what is possible with sklearn; it would be great to drop the dependency on BTrees. I suspect that it will not be favorable, because you will still need an implementation of an inverted index.
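To make the last point concrete, here is a minimal sketch of the kind of inverted index being referred to: a plain mapping from term to the set of document ids containing it, built with stdlib dicts rather than BTrees. The record strings are made-up toy data.

```python
# Minimal inverted-index sketch: term -> set of document ids.
# Toy data; dedupe's real index also stores frequencies for IDF.
from collections import defaultdict

docs = {0: "123 main st", 1: "123 main street", 2: "456 oak ave"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(index["123"])  # doc ids containing the term "123" -> {0, 1}
```

Whatever library computes the weights, some structure like this is still needed to go from a query term back to candidate records, which is the crux of the objection above.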
-
Hi @fgregg, following your suggestion in issue #1126, here are my thoughts on how the IDF should be computed.
Currently the IDF is computed as

idf(t) = log(1 + N / n_t)

where N is the total number of documents and n_t is the number of documents in which the term t appears. Reading the sklearn documentation, the smoothed variant it uses by default (`TfidfTransformer` with `smooth_idf=True`) is instead

idf(t) = log((1 + N) / (1 + n_t)) + 1

which adds one "virtual" document containing every term, preventing zero divisions and zero weights.
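The difference between the two formulas can be checked numerically. This is a toy comparison with made-up corpus sizes, not a benchmark on the dedupe-examples datasets:

```python
# Toy comparison of the two IDF variants discussed above:
# current:  idf(t) = log(1 + N / n_t)
# sklearn:  idf(t) = log((1 + N) / (1 + n_t)) + 1  (smooth_idf=True)
import math

N = 1000  # total number of documents (made-up example size)

for n_t in (1, 10, 100, 1000):  # document frequency of a term t
    idf_current = math.log(1 + N / n_t)
    idf_sklearn = math.log((1 + N) / (1 + n_t)) + 1
    print(f"n_t={n_t:5d}  current={idf_current:.3f}  sklearn={idf_sklearn:.3f}")
```

Note the behaviour at the extremes: for a term in every document (n_t = N), the current formula gives log(2) ≈ 0.693 while sklearn's gives exactly 1.0, and the two scales diverge as terms get rarer.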
In this Medium post you can see a comparison, on a toy example, of the behaviour of the standard TF-IDF against the version implemented in sklearn.
Please keep in mind that I'm not an expert in this field.
P.S.: have you ever considered using the sklearn implementation to compute TF-IDF? It is likely much faster.