Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Term_Frequency #165

Merged

Conversation

mk2510
Copy link
Collaborator

@mk2510 mk2510 commented Aug 25, 2020

The term_frequency function is currently implemented incorrectly. Example:

>>> s = pd.Series(["Text Text of doc one", "Text of of doc two", "Aha hi bnd one"]).pipe(hero.tokenize)
>>> s
0    [Text, Text, of, doc, one]
1      [Text, of, of, doc, two]
2           [Aha, hi, bnd, one]
dtype: object

Old implementation of term frequency returns:

  term_frequency                                                                      
             Aha      Text       bnd       doc        hi        of       one       two
0       0.000000  0.142857  0.000000  0.071429  0.000000  0.071429  0.071429  0.000000
1       0.000000  0.071429  0.000000  0.071429  0.000000  0.142857  0.000000  0.071429
2       0.071429  0.000000  0.071429  0.000000  0.071429  0.000000  0.071429  0.000000

New version returns:

  term_frequency                                      
            Aha   Text  bnd  doc    hi   of   one  two
0           0.00  0.4  0.00  0.2  0.00  0.2  0.20  0.0
1           0.00  0.2  0.00  0.2  0.00  0.4  0.00  0.2
2           0.25  0.0  0.25  0.0  0.25  0.0  0.25  0.0

We can see that the old version simply did count for each document and then divided each count value by the total number of terms across all documents (the .sum() in the old code). That does not make sense for term_frequency. The new version will also do count, but then divide the count values of each document by the number of terms in that document to produce the correct term frequencies. Of course, that's the same as L1-Normalizing every row, so that's how we implemented it 🚥 .

Note: only so many lines changed as this builds upon the DocumentTermDF (see #156)

mk2510 and others added 11 commits August 18, 2020 22:06
suport MultiIndex as function parameter

returns MultiIndex, where Representation was returned

* missing: correct test


Co-authored-by: Henri Froese <hf2000510@gmail.com>
*missing: test adopting for new types


Co-authored-by: Henri Froese <hf2000510@gmail.com>
Co-authored-by: Henri Froese <henri.froese@yahoo.com>
@henrifroese henrifroese added the bug Something isn't working label Aug 26, 2020
@jbesomi jbesomi marked this pull request as draft September 14, 2020 15:47
@mk2510
Copy link
Collaborator Author

mk2510 commented Sep 22, 2020

now, this branch is based on the master branch again and is ready for review/merge 🥇 🌵

@mk2510 mk2510 marked this pull request as ready for review September 22, 2020 12:18
@jbesomi
Copy link
Owner

jbesomi commented Sep 22, 2020

Looks good and it has been merged. What would be nice ( probably to be done in a separate PR) is to have more friendly and funny doctest text content (instead of "Aha", "Text", ...). One idea, for instance, is to use famous sentences said by movie Superheroes. Here are a few examples:

  • I have the power!
  • Flame on!
  • HULK SMASH!
  • Holy ____ Batman!
  • I am the vengeance, I am the night, I am BATMAN!
  • I am GROOT.
  • I’m going ghost!
  • I am the law!
  • SPOOOON!!!

Opinions?

@henrifroese
Copy link
Collaborator

Added an issue for that, see #189

@Iota87
Copy link

Iota87 commented Oct 7, 2020

Great idea, I have already started to use Hero related text examples in the Docs and Tutorials I am working on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants