Enhancing Text Analysis with Advanced Algorithms (#75)
* Add files via upload

* TF-IDF and Word2Vec

* Delete NLP/Algorithms/Word2Vec/word2vec.ipynb

* Delete NLP/Algorithms/TF-IDF directory

* Add files via upload

* Update README.md
UTSAVS26 authored Jul 25, 2024
1 parent 6b3afd0 commit e8104ed
Showing 5 changed files with 431 additions and 1 deletion.
82 changes: 82 additions & 0 deletions NLP/Algorithms/TF-IDF/README.md
@@ -0,0 +1,82 @@
# TF-IDF Implementation

## Introduction

The `TFIDF` class converts a collection of documents into their respective TF-IDF (Term Frequency-Inverse Document Frequency) representations. TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus).
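
For a term *t* and a document *d* in a corpus of *N* documents, the implementation in `tf_idf.py` below computes:

```math
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total number of words in } d}, \qquad \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t) + 1}, \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

where df(*t*) is the number of documents containing *t*. The `+1` in the denominator is a smoothing term; other libraries use slightly different IDF variants.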

## Table of Contents

1. [Attributes](#attributes)
2. [Methods](#methods)
- [fit Method](#fit-method)
- [transform Method](#transform-method)
- [fit_transform Method](#fit_transform-method)
3. [Explanation of the Code](#explanation-of-the-code)
4. [References](#references)

## Attributes

The `TFIDF` class is initialized with two main attributes:

- **`self.vocabulary`**: A dictionary that maps each word to its column index in the TF-IDF matrix.
- **`self.idf_values`**: A dictionary that stores the IDF (Inverse Document Frequency) values for each word.

## Methods

### fit Method

#### Input

- **`documents`** (list of str): List of documents where each document is a string.

#### Purpose

Calculate the IDF values for all unique words in the corpus.

#### Steps

1. **Count Document Occurrences**: Determine how many documents contain each word.
2. **Compute IDF**: Calculate the importance of each word across all documents. Higher values indicate that a word appears in fewer documents and is therefore more distinctive.
3. **Build Vocabulary**: Create a mapping of words to unique indexes.
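
As a concrete illustration of these steps (mirroring the `fit` method in `tf_idf.py` below), the document frequencies, IDF values, and vocabulary for a tiny corpus could be computed like this:

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog ran", "the cat ran"]
doc_count = len(docs)

# Step 1: count how many documents contain each word
df = Counter()
for doc in docs:
    for word in set(doc.split()):
        df[word] += 1
# df == {'the': 3, 'cat': 2, 'ran': 2, 'sat': 1, 'dog': 1}

# Step 2: compute smoothed IDF, e.g. idf('sat') = log(3 / (1 + 1)) ≈ 0.405
idf = {word: math.log(doc_count / (count + 1)) for word, count in df.items()}

# Step 3: assign each word a column index
vocabulary = {word: idx for idx, word in enumerate(idf)}
```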

### transform Method

#### Input

- **`documents`** (list of str): A list where each entry is a document in the form of a string.

#### Purpose

Convert each document into a numerical representation that shows the importance of each word.

#### Steps

1. **Compute Term Frequency (TF)**: Determine how often each word appears in a document relative to the total number of words in that document.
2. **Compute TF-IDF**: Multiply the term frequency of each word by its IDF to get a measure of its relevance in each document.
3. **Store Values**: Save these numerical values in a matrix where each row represents a document.
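
Continuing the tiny corpus above (and reusing its `Counter`, `idf`, and `vocabulary`), a single document is turned into one TF-IDF row like this:

```python
doc = "the cat sat"
words = doc.split()
counts = Counter(words)
row = [0.0] * len(vocabulary)

for word, count in counts.items():
    if word in vocabulary:
        tf = count / len(words)                 # term frequency within this document
        row[vocabulary[word]] = tf * idf[word]  # TF-IDF weight for this column
# Each document in the corpus yields one such row of the final matrix.
```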

### fit_transform Method

#### Purpose

Perform both fitting (computing IDF values) and transforming (converting documents to TF-IDF representation) in one step.

## Explanation of the Code

The `TFIDF` class includes methods for fitting the model to the data, transforming new data into the TF-IDF representation, and combining these steps. Here's a breakdown of the primary methods:

1. **`fit` Method**: Calculates IDF values for all unique words in the corpus. It counts the number of documents containing each word and computes the IDF. The vocabulary is built with a word-to-index mapping.

2. **`transform` Method**: Converts each document into a TF-IDF representation. It computes Term Frequency (TF) for each word in the document, calculates TF-IDF by multiplying TF with IDF, and stores these values in a matrix where each row corresponds to a document.

3. **`fit_transform` Method**: Combines the fitting and transforming steps into a single method for efficient processing of documents.
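
Putting it together, typical usage (as in the example block at the bottom of `tf_idf.py`, assuming that file is importable from the working directory) looks like:

```python
from tf_idf import TFIDF

documents = [
    "the cat sat on the mat",
    "the dog ate my homework",
]

tfidf = TFIDF()
matrix = tfidf.fit_transform(documents)  # fit IDF values, then transform

print(tfidf.vocabulary)  # word -> column index
print(matrix[0])         # TF-IDF row for the first document
```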

## References

1. [TF-IDF - Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
2. [Understanding TF-IDF](https://towardsdatascience.com/understanding-tf-idf-a-traditional-approach-to-feature-extraction-in-nlp-a5bfbe04723f)
3. [Scikit-learn: TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

---

This document provides a clear and structured explanation of the TF-IDF algorithm, including its attributes, methods, and overall functionality.
117 changes: 117 additions & 0 deletions NLP/Algorithms/TF-IDF/tf_idf.py
@@ -0,0 +1,117 @@
import math
from collections import Counter


class TFIDF:
    def __init__(self):
        self.vocabulary = {}  # Vocabulary to store word indices
        self.idf_values = {}  # IDF values for words

    def fit(self, documents):
        """
        Compute IDF values based on the provided documents.
        Args:
            documents (list of str): List of documents where each document is a string.
        """
        doc_count = len(documents)
        term_doc_count = Counter()  # To count the number of documents containing each word

        # Count occurrences of words in documents
        for doc in documents:
            words = set(doc.split())  # Unique words in the current document
            for word in words:
                term_doc_count[word] += 1

        # Compute IDF values; the +1 in the denominator is a smoothing term
        # (note it gives words that appear in every document a slightly negative IDF)
        self.idf_values = {
            word: math.log(doc_count / (count + 1))
            for word, count in term_doc_count.items()
        }

        # Build vocabulary
        self.vocabulary = {word: idx for idx, word in enumerate(self.idf_values.keys())}

    def transform(self, documents):
        """
        Transform documents into TF-IDF representation.
        Args:
            documents (list of str): List of documents where each document is a string.
        Returns:
            list of list of float: TF-IDF matrix where each row corresponds to a document.
        """
        rows = []
        for doc in documents:
            words = doc.split()
            word_count = Counter(words)
            doc_length = len(words)
            row = [0] * len(self.vocabulary)

            for word, count in word_count.items():
                if word in self.vocabulary:
                    tf = count / doc_length
                    idf = self.idf_values[word]
                    index = self.vocabulary[word]
                    row[index] = tf * idf
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        """
        Compute IDF values and transform documents into TF-IDF representation.
        Args:
            documents (list of str): List of documents where each document is a string.
        Returns:
            list of list of float: TF-IDF matrix where each row corresponds to a document.
        """
        self.fit(documents)
        return self.transform(documents)

# Example usage
if __name__ == "__main__":
    documents = [
        "the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the dog food",
        "I love programming in Python",
        "Machine learning is fun",
        "Python is a versatile language",
        "Learning new skills is always beneficial"
    ]

    # Initialize the TF-IDF model
    tfidf = TFIDF()

    # Fit the model and transform the documents
    tfidf_matrix = tfidf.fit_transform(documents)

    # Print the vocabulary
    print("Vocabulary:", tfidf.vocabulary)

    # Print the TF-IDF representation
    print("TF-IDF Representation:")
    for i, vector in enumerate(tfidf_matrix):
        print(f"Document {i + 1}: {vector}")

    # More example documents with mixed content
    more_documents = [
        "the quick brown fox jumps over the lazy dog",
        "a journey of a thousand miles begins with a single step",
        "to be or not to be that is the question",
        "the rain in Spain stays mainly in the plain",
        "all human beings are born free and equal in dignity and rights"
    ]

    # Fit the model and transform the new set of documents
    tfidf_more = TFIDF()
    tfidf_matrix_more = tfidf_more.fit_transform(more_documents)

    # Print the vocabulary for the new documents
    print("\nVocabulary for new documents:", tfidf_more.vocabulary)

    # Print the TF-IDF representation for the new documents
    print("TF-IDF Representation for new documents:")
    for i, vector in enumerate(tfidf_matrix_more):
        print(f"Document {i + 1}: {vector}")
105 changes: 105 additions & 0 deletions NLP/Algorithms/Word2Vec/README.md
@@ -0,0 +1,105 @@
# Word2Vec Skip-gram Implementation

## Introduction

Word2Vec is a technique to learn word embeddings using neural networks. The primary goal is to represent words in a continuous vector space where semantically similar words are mapped to nearby points. Word2Vec can be implemented using two main architectures:

1. **Continuous Bag of Words (CBOW)**: Predicts the target word based on the context words (surrounding words).
2. **Skip-gram**: Predicts the context words based on a given target word.

In this example, we focus on the Skip-gram approach, which is more commonly used in practice. The Skip-gram model tries to maximize the probability of context words given a target word.
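
Concretely, for a target word `w_t` and a context word `w_c`, the Skip-gram model with a full softmax output (the variant described in this README) models

```math
P(w_c \mid w_t) = \frac{\exp\!\left(\mathbf{v}'_{w_c} \cdot \mathbf{v}_{w_t}\right)}{\sum_{w \in V} \exp\!\left(\mathbf{v}'_{w} \cdot \mathbf{v}_{w_t}\right)}
```

where `v_{w_t}` is the input (hidden-layer) embedding of the target word, `v'_w` are the output-layer vectors, and `V` is the vocabulary. Training adjusts both sets of vectors to maximize this probability over all observed (target, context) pairs.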

## Table of Contents

1. [Installation](#installation)
2. [Usage](#usage)
- [Initialization](#initialization)
- [Tokenization](#tokenization)
- [Generate Training Data](#generate-training-data)
- [Training](#training)
- [Retrieve Word Vector](#retrieve-word-vector)
3. [Explanation of the Code](#explanation-of-the-code)
4. [References](#references)

## Installation

Ensure you have Python installed. You can install the necessary dependencies using pip:

```sh
pip install numpy
```

## Usage

### Initialization

Define the parameters for the Word2Vec model:

- `window_size`: Defines the size of the context window around the target word.
- `embedding_dim`: Dimension of the word vectors (embedding space).
- `learning_rate`: Rate at which weights are updated.

### Tokenization

The `tokenize` method creates a vocabulary from the documents and builds mappings between words and their indices.
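
The implementation file is not reproduced in this commit, but a minimal tokenizer consistent with this description might look like the sketch below; attribute names such as `word_to_index` and `index_to_word` are illustrative assumptions, not necessarily the actual names used in the code.

```python
def tokenize(self, documents):
    # Collect the unique words across all documents to form the vocabulary
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    # Build both directions of the word <-> index mapping
    self.word_to_index = {word: idx for idx, word in enumerate(vocabulary)}
    self.index_to_word = {idx: word for word, idx in self.word_to_index.items()}
    self.vocab_size = len(vocabulary)
```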

### Generate Training Data

The `generate_training_data` method creates pairs of target words and context words based on the window size.
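
A sketch of that pairing logic, under the same assumptions as above:

```python
def generate_training_data(self, documents):
    training_pairs = []
    for doc in documents:
        indices = [self.word_to_index[word] for word in doc.split()]
        for pos, target in enumerate(indices):
            # Context window: up to window_size positions to the left and right
            start = max(0, pos - self.window_size)
            end = min(len(indices), pos + self.window_size + 1)
            for ctx_pos in range(start, end):
                if ctx_pos != pos:
                    training_pairs.append((target, indices[ctx_pos]))
    return training_pairs
```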

### Training

The `train` method initializes the weight matrices and updates them using gradient descent.

For each word-context pair, it computes the hidden layer representation, predicts context probabilities, calculates the error, and updates the weights.
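
The following is a simplified, self-contained sketch of a single update step (one-hot input, full softmax output, plain gradient descent). It is an illustrative reconstruction of the described procedure, not the file's exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, learning_rate = 10, 4, 0.01
W1 = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))  # input -> hidden (the embeddings)
W2 = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))  # hidden -> output
target_idx, context_idx = 3, 7                                # one (target, context) training pair

# Forward pass
hidden = W1[target_idx]                     # hidden layer = embedding of the target word
scores = hidden @ W2                        # one raw score per vocabulary word
exp_scores = np.exp(scores - scores.max())  # numerically stable softmax
probs = exp_scores / exp_scores.sum()

# Backward pass (cross-entropy loss against the true context word)
error = probs.copy()
error[context_idx] -= 1.0                   # dL/dscores
grad_hidden = W2 @ error                    # backpropagate into the hidden layer

# Gradient-descent updates
W2 -= learning_rate * np.outer(hidden, error)
W1[target_idx] -= learning_rate * grad_hidden
```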

### Retrieve Word Vector

The `get_word_vector` method retrieves the embedding of a specific word.

## Explanation of the Code

### Initialization

- **Parameters**:
- `window_size`: Size of the context window around the target word.
- `embedding_dim`: Dimension of the word vectors (embedding space).
- `learning_rate`: Rate at which weights are updated.

### Tokenization

- The `tokenize` method creates a vocabulary from the documents.
- Builds mappings between words and their indices.

### Generate Training Data

- The `generate_training_data` method creates pairs of target words and context words based on the window size.

### Training

- The `train` method initializes the weight matrices.
- Updates the weights using gradient descent.
- For each word-context pair:
- Computes the hidden layer representation.
- Predicts context probabilities.
- Calculates the error.
- Updates the weights.

### Softmax Function

- The `softmax` function converts the output layer scores into probabilities.
- Used to compute the error and update the weights.
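
A typical numerically stable implementation (subtracting the maximum score before exponentiating to avoid overflow) might read:

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but prevents overflow in exp
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()
```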

### Retrieve Word Vector

- The `get_word_vector` method retrieves the embedding of a specific word.

## References

1. [Word2Vec - Google](https://code.google.com/archive/p/word2vec/)
2. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
3. [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

---

This README file provides a comprehensive overview of the Word2Vec Skip-gram implementation, including installation instructions, usage details, and an explanation of the code.