TF-IDF

TF-IDF (Term Frequency, Inverse Document Frequency) is a way to score the importance of words (or 'terms') in a document based on how frequently they appear in that document and across a collection of documents.

In other words:

  • If a word appears frequently in a document, it's important. Give the word a high score.
  • But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
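
The two rules above can be sketched in a few lines of plain Python. This is a minimal illustration of one common textbook formulation (TF as a word's share of the document's tokens, IDF as the natural log of total documents over documents containing the word); it is not necessarily the exact formula the script in this repository uses. The function names and the toy corpus are made up for the example.

```python
import math

def tf(word, doc):
    """Term frequency: share of the tokens in doc that equal word."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency: log(total docs / docs containing word)."""
    containing = sum(1 for d in docs if word in d)
    if containing == 0:
        return 0.0  # word never appears; avoid dividing by zero
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    """Score is high for words frequent in doc but rare across docs."""
    return tf(word, doc) * idf(word, docs)

# Toy corpus of already-tokenized documents (illustrative only).
docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["bird", "sang", "song"],
]

scores = {w: tfidf(w, docs[0], docs) for w in set(docs[0])}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

Here "cat" ranks first because it appears twice in the first document and nowhere else, while "sat" ranks last because it also appears in a second document.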

Prerequisites:

Before using this TF-IDF implementation, ensure you have the following packages installed:

  • textblob
  • nltk

You can install these packages using pip: 'pip install textblob nltk'
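
If you want to confirm the prerequisites are importable before running the script, a small check along these lines may help. It uses only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(["textblob", "nltk"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All prerequisites found.")
```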

Improved TF-IDF Implementation

This implementation of TF-IDF features improvements such as:

  • Utilizing NLTK to download stopwords and tokenize the text.
  • Filtering out stopwords from the document before calculating TF-IDF scores.
  • Lowercasing the words to ensure case insensitivity.
  • Calculating TF-IDF scores based on the filtered document.
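
The preprocessing steps listed above can be sketched roughly as follows. To keep the example self-contained, it assumes a tiny hard-coded stopword set and a simple regular-expression tokenizer as stand-ins; the script itself is described as using NLTK's stopword list and tokenizer, so its output may differ.

```python
import re

# Stand-in stopword list (an assumption for this sketch); the script is
# described as using NLTK's downloaded stopwords instead.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "on", "in", "to"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The Cat sat on the mat, and THE cat purred.")
print(tokens)
```

Lowercasing before filtering means "The", "THE", and "the" are all treated as the same stopword, and "Cat" and "cat" are counted as the same term.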

Usage

  • Ensure you have Python installed on your system.
  • Install the required packages using pip as mentioned in the Prerequisites section.
  • Clone or download this repository.
  • Navigate to the directory containing the TF-IDF script.
  • Run the script and follow the prompts to enter the location of the document file.

The script will calculate the TF-IDF scores and display the top words along with their scores.
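
As a rough idea of the final step, ranking and printing the top words could look like the sketch below. The `top_words` helper and the scores are hypothetical, made up for illustration; the script's actual output format may differ.

```python
def top_words(scores, n=3):
    """Return the n highest-scoring (word, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical scores, not produced by the script.
example_scores = {"cat": 0.55, "mat": 0.27, "sat": 0.10, "purred": 0.20}

for word, score in top_words(example_scores):
    print(f"{word}: {score:.2f}")
```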

Examples

Two example text files have been provided in the repository for testing the TF-IDF algorithm.

  • text.txt
  • text2.txt

Further Reading

License