Fix Term_Frequency #165

mk2510 · 2020-08-25T22:47:54Z

The term_frequency function is currently implemented incorrectly. Example:

>>> import pandas as pd
>>> import texthero as hero
>>> s = pd.Series(["Text Text of doc one", "Text of of doc two", "Aha hi bnd one"]).pipe(hero.tokenize)
>>> s
0    [Text, Text, of, doc, one]
1      [Text, of, of, doc, two]
2           [Aha, hi, bnd, one]
dtype: object

Old implementation of term frequency returns:

  term_frequency                                                                      
             Aha      Text       bnd       doc        hi        of       one       two
0       0.000000  0.142857  0.000000  0.071429  0.000000  0.071429  0.071429  0.000000
1       0.000000  0.071429  0.000000  0.071429  0.000000  0.142857  0.000000  0.071429
2       0.071429  0.000000  0.071429  0.000000  0.071429  0.000000  0.071429  0.000000

New version returns:

  term_frequency
             Aha  Text   bnd  doc    hi   of   one  two
0           0.00   0.4  0.00  0.2  0.00  0.2  0.20  0.0
1           0.00   0.2  0.00  0.2  0.00  0.4  0.00  0.2
2           0.25   0.0  0.25  0.0  0.25  0.0  0.25  0.0

The old version counted the terms in each document and then divided every count by the total number of terms across all documents (the .sum() in the old code). For "Text" in document 0 that gives 2/14 ≈ 0.142857, which is not a term frequency. The new version also counts, but divides each document's counts by the number of terms in that document, so the same entry becomes 2/5 = 0.4. That is the same as L1-normalizing every row, so that's how we implemented it 🚥.
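To make the difference concrete, here is a minimal stdlib-only sketch of both normalizations on the example documents. It is an illustration, not texthero's actual implementation; the names `docs`, `vocabulary`, `old_tf`, and `new_tf` are made up for this example.

```python
from collections import Counter

# Tokenized example documents from the description above.
docs = [
    ["Text", "Text", "of", "doc", "one"],
    ["Text", "of", "of", "doc", "two"],
    ["Aha", "hi", "bnd", "one"],
]

# Sorted vocabulary gives the same column order as the tables above.
vocabulary = sorted({term for doc in docs for term in doc})
counts = [Counter(doc) for doc in docs]

# Old behaviour: divide every count by the number of terms in the whole corpus.
total_terms = sum(len(doc) for doc in docs)  # 14
old_tf = [[c[t] / total_terms for t in vocabulary] for c in counts]

# New behaviour: divide each row by that document's own length (L1 row normalization).
new_tf = [[c[t] / len(doc) for t in vocabulary] for c, doc in zip(counts, docs)]

for row in new_tf:
    print([round(value, 2) for value in row])
```

With the new normalization every row sums to 1, whereas with the old one only the whole matrix does.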

Note: the diff shows this many changed lines only because this PR builds upon the DocumentTermDF work (see #156).

Support MultiIndex as function parameter; return MultiIndex where Representation was returned. * missing: correct test. Co-authored-by: Henri Froese <hf2000510@gmail.com>

*missing: test adopting for new types Co-authored-by: Henri Froese <hf2000510@gmail.com>

Co-authored-by: Henri Froese <henri.froese@yahoo.com>

mk2510 · 2020-09-22T12:18:11Z

This branch is now based on the master branch again and is ready for review/merge 🥇 🌵

jbesomi · 2020-09-22T19:10:34Z

Looks good and it has been merged. What would be nice (probably in a separate PR) is friendlier and funnier doctest text content (instead of "Aha", "Text", ...). One idea, for instance, is to use famous lines spoken by movie superheroes. Here are a few examples:

I have the power!
Flame on!
HULK SMASH!
Holy ____ Batman!
I am the vengeance, I am the night, I am BATMAN!
I am GROOT.
I’m going ghost!
I am the law!
SPOOOON!!!

Opinions?

henrifroese · 2020-09-23T05:10:49Z

Added an issue for that, see #189

Iota87 · 2020-10-07T21:29:36Z

Great idea. I have already started to use hero-related text examples in the docs and tutorials I am working on.

mk2510 and others added 11 commits August 18, 2020 22:06

added MultiIndex DF support

fa342a9

Support MultiIndex as function parameter; return MultiIndex where Representation was returned. * missing: correct test. Co-authored-by: Henri Froese <hf2000510@gmail.com>

beginning with tests

59a9f8c

implemented correct sparse support

19c52de

*missing: test adopting for new types Co-authored-by: Henri Froese <hf2000510@gmail.com>

Merge branch 'master_upstream' into change_representation_to_multicolumn

66e566c

added back list() and rm .tolist()

41f55a8

rm .tolist() and added list()

217611a

Adopted the test to the new dataframes

6a3b56d

wrong format

b8ff561

Address most review comments.

e3af2f9

Add more unittests for representation

77ad80e

Fix the term_frequency formula. Simplify the function body.

3fbeaa5

Co-authored-by: Henri Froese <henri.froese@yahoo.com>

vercel bot deployed to Preview August 25, 2020 22:48

henrifroese added the bug Something isn't working label Aug 26, 2020

henrifroese mentioned this pull request Aug 28, 2020

👩‍💻 API next steps: checklist #85

Open

17 tasks

jbesomi marked this pull request as draft September 14, 2020 15:47

mk2510 added 2 commits September 22, 2020 12:35

Merge branch 'master_upstream' into fix_formula_in_term_frequency

5ed8283

fixed merge issues

efd9fde

vercel bot deployed to Preview September 22, 2020 12:17

mk2510 marked this pull request as ready for review September 22, 2020 12:18

fix formatting

c1dd5eb

vercel bot deployed to Preview September 22, 2020 13:00

mk2510 mentioned this pull request Sep 22, 2020

Implement filter_extremes #169

Open

jbesomi merged commit 417fbcb into jbesomi:master Sep 22, 2020
