Enhancing Text Analysis with Advanced Algorithms (#75)
* Add files via upload

* TF-IDF and Word2Vec

* Delete NLP/Algorithms/Word2Vec/word2vec.ipynb

* Delete NLP/Algorithms/TF-IDF directory

* Add files via upload

* Update README.md
UTSAVS26 authored Jul 25, 2024
1 parent 6b3afd0 commit e8104ed
Showing 5 changed files with 431 additions and 1 deletion.
82 changes: 82 additions & 0 deletions NLP/Algorithms/TF-IDF/README.md
@@ -0,0 +1,82 @@
# TF-IDF Implementation

## Introduction

The `TFIDF` class converts a collection of documents into their respective TF-IDF (Term Frequency-Inverse Document Frequency) representations. TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus).
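
For a term *t* and a document *d* in a corpus of *N* documents, the implementation in `tf_idf.py` below computes:

```math
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total number of words in } d}, \qquad \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t) + 1}, \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

where df(*t*) is the number of documents containing *t*. The `+1` in the denominator is a smoothing term; other libraries use slightly different IDF variants.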

## Table of Contents

1. [Attributes](#attributes)
2. [Methods](#methods)
- [fit Method](#fit-method)
- [transform Method](#transform-method)
- [fit_transform Method](#fit_transform-method)
3. [Explanation of the Code](#explanation-of-the-code)
4. [References](#references)

## Attributes

The `TFIDF` class is initialized with two main attributes:

- **`self.vocabulary`**: A dictionary that maps each word to its column index in the TF-IDF matrix.
- **`self.idf_values`**: A dictionary that stores the IDF (Inverse Document Frequency) values for each word.

## Methods

### fit Method

#### Input

- **`documents`** (list of str): List of documents where each document is a string.

#### Purpose

Calculate the IDF values for all unique words in the corpus.

#### Steps

1. **Count Document Occurrences**: Determine how many documents contain each word.
2. **Compute IDF**: Calculate the importance of each word across all documents. Higher values indicate that a word appears in fewer documents and is therefore more distinctive.
3. **Build Vocabulary**: Create a mapping of words to unique indexes.
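
As a concrete illustration of these steps (mirroring the `fit` method in `tf_idf.py` below), the document frequencies, IDF values, and vocabulary for a tiny corpus could be computed like this:

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog ran", "the cat ran"]
doc_count = len(docs)

# Step 1: count how many documents contain each word
df = Counter()
for doc in docs:
    for word in set(doc.split()):
        df[word] += 1
# df == {'the': 3, 'cat': 2, 'ran': 2, 'sat': 1, 'dog': 1}

# Step 2: compute smoothed IDF, e.g. idf('sat') = log(3 / (1 + 1)) ≈ 0.405
idf = {word: math.log(doc_count / (count + 1)) for word, count in df.items()}

# Step 3: assign each word a column index
vocabulary = {word: idx for idx, word in enumerate(idf)}
```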

### transform Method

#### Input

- **`documents`** (list of str): A list where each entry is a document in the form of a string.

#### Purpose

Convert each document into a numerical representation that shows the importance of each word.

#### Steps

1. **Compute Term Frequency (TF)**: Determine how often each word appears in a document relative to the total number of words in that document.
2. **Compute TF-IDF**: Multiply the term frequency of each word by its IDF to get a measure of its relevance in each document.
3. **Store Values**: Save these numerical values in a matrix where each row represents a document.
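
Continuing the tiny corpus above (and reusing its `Counter`, `idf`, and `vocabulary`), a single document is turned into one TF-IDF row like this:

```python
doc = "the cat sat"
words = doc.split()
counts = Counter(words)
row = [0.0] * len(vocabulary)

for word, count in counts.items():
    if word in vocabulary:
        tf = count / len(words)                 # term frequency within this document
        row[vocabulary[word]] = tf * idf[word]  # TF-IDF weight for this column
# Each document in the corpus yields one such row of the final matrix.
```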

### fit_transform Method

#### Purpose

Perform both fitting (computing IDF values) and transforming (converting documents to TF-IDF representation) in one step.

## Explanation of the Code

The `TFIDF` class includes methods for fitting the model to the data, transforming new data into the TF-IDF representation, and combining these steps. Here's a breakdown of the primary methods:

1. **`fit` Method**: Calculates IDF values for all unique words in the corpus. It counts the number of documents containing each word and computes the IDF. The vocabulary is built with a word-to-index mapping.

2. **`transform` Method**: Converts each document into a TF-IDF representation. It computes Term Frequency (TF) for each word in the document, calculates TF-IDF by multiplying TF with IDF, and stores these values in a matrix where each row corresponds to a document.

3. **`fit_transform` Method**: Combines the fitting and transforming steps into a single method for efficient processing of documents.
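
Putting it together, typical usage (as in the example block at the bottom of `tf_idf.py`, assuming that file is importable from the working directory) looks like:

```python
from tf_idf import TFIDF

documents = [
    "the cat sat on the mat",
    "the dog ate my homework",
]

tfidf = TFIDF()
matrix = tfidf.fit_transform(documents)  # fit IDF values, then transform

print(tfidf.vocabulary)  # word -> column index
print(matrix[0])         # TF-IDF row for the first document
```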

## References

1. [TF-IDF - Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
2. [Understanding TF-IDF](https://towardsdatascience.com/understanding-tf-idf-a-traditional-approach-to-feature-extraction-in-nlp-a5bfbe04723f)
3. [Scikit-learn: TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

---

This document provides a clear and structured explanation of the TF-IDF algorithm, including its attributes, methods, and overall functionality.
117 changes: 117 additions & 0 deletions NLP/Algorithms/TF-IDF/tf_idf.py
@@ -0,0 +1,117 @@
import math
from collections import Counter


class TFIDF:
    def __init__(self):
        self.vocabulary = {}  # Vocabulary to store word indices
        self.idf_values = {}  # IDF values for words

    def fit(self, documents):
        """
        Compute IDF values based on the provided documents.
        Args:
            documents (list of str): List of documents where each document is a string.
        """
        doc_count = len(documents)
        term_doc_count = Counter()  # To count the number of documents containing each word

        # Count occurrences of words in documents
        for doc in documents:
            words = set(doc.split())  # Unique words in the current document
            for word in words:
                term_doc_count[word] += 1

        # Compute IDF values; the +1 in the denominator is a smoothing term
        # (note it gives words that appear in every document a slightly negative IDF)
        self.idf_values = {
            word: math.log(doc_count / (count + 1))
            for word, count in term_doc_count.items()
        }

        # Build vocabulary
        self.vocabulary = {word: idx for idx, word in enumerate(self.idf_values.keys())}

    def transform(self, documents):
        """
        Transform documents into TF-IDF representation.
        Args:
            documents (list of str): List of documents where each document is a string.
        Returns:
            list of list of float: TF-IDF matrix where each row corresponds to a document.
        """
        rows = []
        for doc in documents:
            words = doc.split()
            word_count = Counter(words)
            doc_length = len(words)
            row = [0] * len(self.vocabulary)

            for word, count in word_count.items():
                if word in self.vocabulary:
                    tf = count / doc_length
                    idf = self.idf_values[word]
                    index = self.vocabulary[word]
                    row[index] = tf * idf
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        """
        Compute IDF values and transform documents into TF-IDF representation.
        Args:
            documents (list of str): List of documents where each document is a string.
        Returns:
            list of list of float: TF-IDF matrix where each row corresponds to a document.
        """
        self.fit(documents)
        return self.transform(documents)

# Example usage
if __name__ == "__main__":
    documents = [
        "the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the dog food",
        "I love programming in Python",
        "Machine learning is fun",
        "Python is a versatile language",
        "Learning new skills is always beneficial"
    ]

    # Initialize the TF-IDF model
    tfidf = TFIDF()

    # Fit the model and transform the documents
    tfidf_matrix = tfidf.fit_transform(documents)

    # Print the vocabulary
    print("Vocabulary:", tfidf.vocabulary)

    # Print the TF-IDF representation
    print("TF-IDF Representation:")
    for i, vector in enumerate(tfidf_matrix):
        print(f"Document {i + 1}: {vector}")

    # More example documents with mixed content
    more_documents = [
        "the quick brown fox jumps over the lazy dog",
        "a journey of a thousand miles begins with a single step",
        "to be or not to be that is the question",
        "the rain in Spain stays mainly in the plain",
        "all human beings are born free and equal in dignity and rights"
    ]

    # Fit the model and transform the new set of documents
    tfidf_more = TFIDF()
    tfidf_matrix_more = tfidf_more.fit_transform(more_documents)

    # Print the vocabulary for the new documents
    print("\nVocabulary for new documents:", tfidf_more.vocabulary)

    # Print the TF-IDF representation for the new documents
    print("TF-IDF Representation for new documents:")
    for i, vector in enumerate(tfidf_matrix_more):
        print(f"Document {i + 1}: {vector}")
105 changes: 105 additions & 0 deletions NLP/Algorithms/Word2Vec/README.md
@@ -0,0 +1,105 @@
# Word2Vec Skip-gram Implementation

## Introduction

Word2Vec is a technique to learn word embeddings using neural networks. The primary goal is to represent words in a continuous vector space where semantically similar words are mapped to nearby points. Word2Vec can be implemented using two main architectures:

1. **Continuous Bag of Words (CBOW)**: Predicts the target word based on the context words (surrounding words).
2. **Skip-gram**: Predicts the context words based on a given target word.

In this example, we focus on the Skip-gram approach, which is more commonly used in practice. The Skip-gram model tries to maximize the probability of context words given a target word.
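
Concretely, for a target word `w_t` and a context word `w_c`, the Skip-gram model with a full softmax output (the variant described in this README) models

```math
P(w_c \mid w_t) = \frac{\exp\!\left(\mathbf{v}'_{w_c} \cdot \mathbf{v}_{w_t}\right)}{\sum_{w \in V} \exp\!\left(\mathbf{v}'_{w} \cdot \mathbf{v}_{w_t}\right)}
```

where `v_{w_t}` is the input (hidden-layer) embedding of the target word, `v'_w` are the output-layer vectors, and `V` is the vocabulary. Training adjusts both sets of vectors to maximize this probability over all observed (target, context) pairs.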

## Table of Contents

1. [Installation](#installation)
2. [Usage](#usage)
- [Initialization](#initialization)
- [Tokenization](#tokenization)
- [Generate Training Data](#generate-training-data)
- [Training](#training)
- [Retrieve Word Vector](#retrieve-word-vector)
3. [Explanation of the Code](#explanation-of-the-code)
4. [References](#references)

## Installation

Ensure you have Python installed. You can install the necessary dependencies using pip:

```sh
pip install numpy
```

## Usage

### Initialization

Define the parameters for the Word2Vec model:

- `window_size`: Defines the size of the context window around the target word.
- `embedding_dim`: Dimension of the word vectors (embedding space).
- `learning_rate`: Rate at which weights are updated.

### Tokenization

The `tokenize` method creates a vocabulary from the documents and builds mappings between words and their indices.
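
The implementation file is not reproduced in this commit, but a minimal tokenizer consistent with this description might look like the sketch below; attribute names such as `word_to_index` and `index_to_word` are illustrative assumptions, not necessarily the actual names used in the code.

```python
def tokenize(self, documents):
    # Collect the unique words across all documents to form the vocabulary
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    # Build both directions of the word <-> index mapping
    self.word_to_index = {word: idx for idx, word in enumerate(vocabulary)}
    self.index_to_word = {idx: word for word, idx in self.word_to_index.items()}
    self.vocab_size = len(vocabulary)
```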

### Generate Training Data

The `generate_training_data` method creates pairs of target words and context words based on the window size.
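
A sketch of that pairing logic, under the same assumptions as above:

```python
def generate_training_data(self, documents):
    training_pairs = []
    for doc in documents:
        indices = [self.word_to_index[word] for word in doc.split()]
        for pos, target in enumerate(indices):
            # Context window: up to window_size positions to the left and right
            start = max(0, pos - self.window_size)
            end = min(len(indices), pos + self.window_size + 1)
            for ctx_pos in range(start, end):
                if ctx_pos != pos:
                    training_pairs.append((target, indices[ctx_pos]))
    return training_pairs
```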

### Training

The `train` method initializes the weight matrices and updates them using gradient descent.

For each word-context pair, it computes the hidden layer representation, predicts context probabilities, calculates the error, and updates the weights.
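
The following is a simplified, self-contained sketch of a single update step (one-hot input, full softmax output, plain gradient descent). It is an illustrative reconstruction of the described procedure, not the file's exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, learning_rate = 10, 4, 0.01
W1 = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))  # input -> hidden (the embeddings)
W2 = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))  # hidden -> output
target_idx, context_idx = 3, 7                                # one (target, context) training pair

# Forward pass
hidden = W1[target_idx]                     # hidden layer = embedding of the target word
scores = hidden @ W2                        # one raw score per vocabulary word
exp_scores = np.exp(scores - scores.max())  # numerically stable softmax
probs = exp_scores / exp_scores.sum()

# Backward pass (cross-entropy loss against the true context word)
error = probs.copy()
error[context_idx] -= 1.0                   # dL/dscores
grad_hidden = W2 @ error                    # backpropagate into the hidden layer

# Gradient-descent updates
W2 -= learning_rate * np.outer(hidden, error)
W1[target_idx] -= learning_rate * grad_hidden
```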

### Retrieve Word Vector

The `get_word_vector` method retrieves the embedding of a specific word.

## Explanation of the Code

### Initialization

- **Parameters**:
- `window_size`: Size of the context window around the target word.
- `embedding_dim`: Dimension of the word vectors (embedding space).
- `learning_rate`: Rate at which weights are updated.

### Tokenization

- The `tokenize` method creates a vocabulary from the documents.
- Builds mappings between words and their indices.

### Generate Training Data

- The `generate_training_data` method creates pairs of target words and context words based on the window size.

### Training

- The `train` method initializes the weight matrices.
- Updates the weights using gradient descent.
- For each word-context pair:
- Computes the hidden layer representation.
- Predicts context probabilities.
- Calculates the error.
- Updates the weights.

### Softmax Function

- The `softmax` function converts the output layer scores into probabilities.
- Used to compute the error and update the weights.
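
A typical numerically stable implementation (subtracting the maximum score before exponentiating to avoid overflow) might read:

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but prevents overflow in exp
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()
```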

### Retrieve Word Vector

- The `get_word_vector` method retrieves the embedding of a specific word.

## References

1. [Word2Vec - Google](https://code.google.com/archive/p/word2vec/)
2. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
3. [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

---

This README file provides a comprehensive overview of the Word2Vec Skip-gram implementation, including installation instructions, usage details, and an explanation of the code.