Fix Term_Frequency #165
Support MultiIndex as function parameter; returns MultiIndex where Representation was returned. Missing: correct test. Co-authored-by: Henri Froese <hf2000510@gmail.com>
Missing: adapt tests for the new types. Co-authored-by: Henri Froese <hf2000510@gmail.com>
Co-authored-by: Henri Froese <henri.froese@yahoo.com>
This branch is now based on the master branch again and is ready for review/merge 🥇 🌵
Looks good and it has been merged. What would be nice (probably to be done in a separate PR) is to have more friendly and funny doctest text content (instead of "Aha", "Text", ...). One idea, for instance, is to use famous sentences said by movie superheroes. Here are a few examples:
Opinions?
Added an issue for that, see #189
Great idea, I have already started to use hero-related text examples in the docs and tutorials I am working on.
The `term_frequency` function is currently implemented incorrectly. Example:

Old implementation of term frequency returns:

New version returns:

We can see that the old version simply did `count` for each document and then divided each count value by the total number of terms across all documents (the `.sum()` in the old code). That does not make sense for `term_frequency`. The new version also does the count, but then divides the count values of each document by the number of terms in that document to produce the correct term frequencies. Of course, that's the same as L1-normalizing every row, so that's how we implemented it 🚥.

Note: so many lines changed only because this builds upon the DocumentTermDF (see #156).
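
To make the difference concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the helper names (`count_terms`, `term_frequency_global`, `term_frequency_per_document`), the whitespace tokenization, and the toy documents are all assumptions made for illustration. It only contrasts dividing by the total over all documents with L1-normalizing each row.

```python
from collections import Counter


def count_terms(docs):
    """Return (vocabulary, count rows) for whitespace-tokenized documents.

    Hypothetical helper; the real function works on pandas structures.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def term_frequency_global(rows):
    """Old behaviour as described above: divide by the total over ALL documents."""
    total = sum(sum(row) for row in rows)
    return [[c / total for c in row] for row in rows]


def term_frequency_per_document(rows):
    """New behaviour as described above: L1-normalize each row (document)."""
    result = []
    for row in rows:
        n_terms = sum(row)
        # Guard against empty documents to avoid division by zero.
        result.append([c / n_terms if n_terms else 0.0 for c in row])
    return result


docs = ["a b a", "b c"]  # toy input, assumed for illustration
vocab, rows = count_terms(docs)
old = term_frequency_global(rows)
new = term_frequency_per_document(rows)

print(vocab)
print(old)
print(new)
```

With the global variant, all values across the whole matrix sum to 1, so a document's frequencies depend on how long the other documents are. With the per-document variant, each row sums to 1, which is what a term frequency is expected to express.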