To count tokens, use a word tokenizer in `wordview.text_analysis.core.do_txt_analysis` #144

meghdadFar · 2024-04-08T08:54:37Z

Description

Currently, `wordview.text_analysis.core.do_txt_analysis` extracts tokens by splitting the text on spaces, so punctuation stays attached to words and the token count is off. Improve this by using a proper word tokenizer, e.g. the NLTK word tokenizer.

Solution:

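from nltk.tokenize import sent_tokenize, word_tokenize
from tqdm import tqdm

# `df` is assumed to be a pandas DataFrame with a "review" column.
num_tokens = 0
for text in tqdm(df["review"]):
    try:
        # Split into sentences first, then count the word tokens in each one.
        for sentence in sent_tokenize(text.lower()):
            num_tokens += len(word_tokenize(sentence))
    except Exception as e:
        # Report and skip entries that cannot be tokenized (e.g. non-string values).
        print("Processing entry --- %s --- led to exception: %s" % (text, e))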
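
To illustrate why splitting on spaces miscounts tokens, here is a minimal sketch that compares the two approaches. It uses only the standard library: `TOKEN_PATTERN`, `count_tokens_split`, and `count_tokens_regex` are hypothetical names, and the regex is a simplified stand-in for a real word tokenizer, not a reproduction of NLTK's behaviour, so its counts may differ from what `word_tokenize` returns.

```python
import re

# Hypothetical stand-in for a real word tokenizer (NOT NLTK's algorithm):
# a token is either a run of word characters, optionally with internal
# apostrophes, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def count_tokens_split(text: str) -> int:
    """Count tokens by splitting on whitespace (the old behaviour)."""
    return len(text.split())


def count_tokens_regex(text: str) -> int:
    """Count tokens with punctuation separated from words."""
    return len(TOKEN_PATTERN.findall(text))


sample = "Great movie, loved it!"
print(count_tokens_split(sample))  # 4: "movie," and "it!" each count as one token
print(count_tokens_regex(sample))  # 6: "," and "!" are counted as separate tokens
```

Whitespace splitting keeps `movie,` and `it!` as single tokens, which is the kind of discrepancy a dedicated tokenizer is meant to avoid.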

meghdadFar · 2024-04-08T09:28:53Z

Resolved by PR #145

meghdadFar added the enhancement, help wanted, and up for grabs labels Apr 8, 2024

meghdadFar changed the title ~~To count tokens, use nltk.tokenizer in core.do_txt_analysis()~~ To count tokens, use a word tokenizer in wordview.text_analysis.core.do_txt_analysis Apr 8, 2024

meghdadFar mentioned this issue Apr 8, 2024

Use tokenizer instead of split #145

Merged

meghdadFar closed this as completed Apr 8, 2024
