To count tokens, use a word tokenizer in wordview.text_analysis.core.do_txt_analysis #144

Closed
meghdadFar opened this issue Apr 8, 2024 · 1 comment
Labels
enhancement (New feature or request), help wanted (Extra attention is needed), up for grabs

Comments

@meghdadFar
Owner

meghdadFar commented Apr 8, 2024

Description

Currently, wordview.text_analysis.core.do_txt_analysis extracts tokens by splitting the text on spaces. This miscounts tokens whenever punctuation is attached to a word or whitespace is irregular. Improve this by using a proper word tokenizer, e.g. the NLTK word tokenizer.
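To illustrate the difference, here is a minimal sketch comparing the two counting strategies. It is not the wordview code: the function names and the sample string are hypothetical, it assumes the current behaviour amounts to `text.split(" ")`, and a simple standard-library regex stands in for a real word tokenizer such as NLTK's so that the snippet runs without extra downloads.

```python
import re


def count_tokens_by_space(text: str) -> int:
    # Assumed stand-in for the current behaviour: split on single spaces.
    # Punctuation stays glued to words and repeated spaces yield empty tokens.
    return len(text.split(" "))


def count_tokens_by_regex(text: str) -> int:
    # Simplified stand-in for a word tokenizer (NOT NLTK's algorithm):
    # runs of word characters and individual punctuation marks are tokens.
    return len(re.findall(r"\w+|[^\w\s]", text))


# Hypothetical sample with attached punctuation and a double space.
sample = "Great movie, loved it!  Would watch again."

print(count_tokens_by_space(sample))  # 8, including one empty string
print(count_tokens_by_regex(sample))  # 10, punctuation counted separately
```

The two counts differ because the space split treats "movie," as one token and counts the empty string between the two spaces, while the tokenizer-style approach separates punctuation and ignores extra whitespace.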

Solution:

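# Requires NLTK's Punkt tokenizer data to be downloaded.
from nltk.tokenize import sent_tokenize, word_tokenize
from tqdm import tqdm

num_tokens = 0
# df is the DataFrame holding the texts in its "review" column.
for text in tqdm(df["review"]):
    try:
        # Split into sentences first, then count the word tokens in each one.
        for sentence in sent_tokenize(text.lower()):
            num_tokens += len(word_tokenize(sentence))
    except Exception as e:
        # Report entries that cannot be tokenized (e.g. non-string values) and move on.
        print("Processing entry --- %s --- led to exception: %s" % (text, e))
        continue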
@meghdadFar meghdadFar added the enhancement, help wanted, and up for grabs labels Apr 8, 2024
@meghdadFar meghdadFar changed the title from "To count tokens, use nltk.tokenizer in core.do_txt_analysis()" to "To count tokens, use a word tokenizer in wordview.text_analysis.core.do_txt_analysis" Apr 8, 2024
@meghdadFar
Owner Author

Resolved by PR #145
