Enhancing Text Analysis with Advanced Algorithms (#75)

* Add files via upload
* TF-IDF and Word2Vec
* Delete NLP/Algorithms/Word2Vec/word2vec.ipynb
* Delete NLP/Algorithms/TF-IDF directory
* Add files via upload
* Update README.md

Showing 5 changed files with 431 additions and 1 deletion.
@@ -0,0 +1,82 @@

# TF-IDF Implementation

## Introduction

The `TFIDF` class converts a collection of documents into their TF-IDF (Term Frequency-Inverse Document Frequency) representations. TF-IDF is a statistical measure of how important a word is to a document relative to a collection of documents (a corpus).
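In symbols, the weight this implementation assigns to a word $w$ in a document $d$ is

$$\text{tfidf}(w, d) = \mathrm{tf}(w, d) \times \mathrm{idf}(w)$$

where $\mathrm{tf}(w, d)$ is the fraction of the words in $d$ that are $w$, and $\mathrm{idf}(w)$ shrinks as $w$ appears in more documents of the corpus (the exact IDF formula used here is given under the `fit` method).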
## Table of Contents

1. [Attributes](#attributes)
2. [Methods](#methods)
   - [fit Method](#fit-method)
   - [transform Method](#transform-method)
   - [fit_transform Method](#fit_transform-method)
3. [Explanation of the Code](#explanation-of-the-code)
4. [References](#references)

## Attributes

The `TFIDF` class is initialized with two main attributes:

- **`self.vocabulary`**: A dictionary that maps words to their column indices in the TF-IDF matrix.
- **`self.idf_values`**: A dictionary that stores the IDF (Inverse Document Frequency) value for each word.

## Methods

### fit Method

#### Input

- **`documents`** (list of str): List of documents, where each document is a string.

#### Purpose

Calculate the IDF values for all unique words in the corpus.

#### Steps

1. **Count Document Occurrences**: Determine how many documents contain each word.
2. **Compute IDF**: Calculate the importance of each word across the corpus; higher values indicate words that appear in fewer documents and are therefore more distinctive (see the formula below).
3. **Build Vocabulary**: Create a mapping from words to unique indices.
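Concretely, for a corpus of `N` documents in which `df(w)` documents contain the word `w`, the implementation in this commit computes

$$\mathrm{idf}(w) = \log\frac{N}{\mathrm{df}(w) + 1}$$

The `+1` in the denominator is a smoothing term; one side effect worth knowing is that a word appearing in every document receives a slightly negative IDF.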
### transform Method

#### Input

- **`documents`** (list of str): A list in which each entry is a document in the form of a string.

#### Purpose

Convert each document into a numerical representation that reflects the importance of each word.

#### Steps

1. **Compute Term Frequency (TF)**: Determine how often each word appears in a document relative to the total number of words in that document.
2. **Compute TF-IDF**: Multiply each word's term frequency by its IDF to measure its relevance to the document.
3. **Store Values**: Save these values in a matrix in which each row represents a document.
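As a worked example using the corpus from the demo at the bottom of the implementation: in the six-word document "the cat sat on the mat", the word "the" has TF = 2/6 ≈ 0.33. It appears in 3 of the 7 documents, so its IDF is log(7 / (3 + 1)) ≈ 0.56, giving a TF-IDF weight of about 0.33 × 0.56 ≈ 0.19 in that document.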
### fit_transform Method

#### Purpose

Perform both fitting (computing IDF values) and transforming (converting documents to their TF-IDF representation) in one step.
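A minimal usage sketch, assuming the `TFIDF` class from this commit is available in the current scope (a fuller demo appears at the bottom of the implementation file):

```python
tfidf = TFIDF()
matrix = tfidf.fit_transform([
    "the cat sat on the mat",
    "the dog ate my homework",
])
print(tfidf.vocabulary)  # word -> column index
print(matrix[0])         # TF-IDF weights for the first document
```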
## Explanation of the Code

The `TFIDF` class includes methods for fitting the model to the data, transforming new data into the TF-IDF representation, and combining these steps. Here's a breakdown of the primary methods:

1. **`fit` Method**: Calculates IDF values for all unique words in the corpus. It counts the number of documents containing each word, computes the IDF, and builds the vocabulary as a word-to-index mapping.

2. **`transform` Method**: Converts each document into its TF-IDF representation. It computes the term frequency (TF) of each word in the document, multiplies TF by IDF, and stores the results in a matrix in which each row corresponds to a document.

3. **`fit_transform` Method**: Combines the fitting and transforming steps into a single call for convenience.

## References

1. [TF-IDF - Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
2. [Understanding TF-IDF](https://towardsdatascience.com/understanding-tf-idf-a-traditional-approach-to-feature-extraction-in-nlp-a5bfbe04723f)
3. [Scikit-learn: TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

---

This document provides a structured explanation of the TF-IDF algorithm, including its attributes, methods, and overall functionality.
@@ -0,0 +1,117 @@
import math
from collections import Counter


class TFIDF:
    def __init__(self):
        self.vocabulary = {}   # Maps each word to its column index in the TF-IDF matrix
        self.idf_values = {}   # Maps each word to its IDF value

    def fit(self, documents):
        """
        Compute IDF values based on the provided documents.

        Args:
            documents (list of str): List of documents where each document is a string.
        """
        doc_count = len(documents)
        term_doc_count = Counter()  # Number of documents containing each word

        # Count, for each word, how many documents it appears in
        for doc in documents:
            words = set(doc.split())  # Unique words in the current document
            for word in words:
                term_doc_count[word] += 1

        # Compute IDF values. The +1 is a smoothing term that dampens very
        # common words; a word appearing in every document gets a slightly
        # negative IDF.
        self.idf_values = {
            word: math.log(doc_count / (count + 1))
            for word, count in term_doc_count.items()
        }

        # Build the vocabulary (word -> column index)
        self.vocabulary = {word: idx for idx, word in enumerate(self.idf_values.keys())}

    def transform(self, documents):
        """
        Transform documents into their TF-IDF representation.

        Args:
            documents (list of str): List of documents where each document is a string.

        Returns:
            list of list of float: TF-IDF matrix where each row corresponds to a document.
        """
        rows = []
        for doc in documents:
            words = doc.split()
            word_count = Counter(words)
            doc_length = len(words)
            row = [0.0] * len(self.vocabulary)

            if doc_length > 0:  # Guard against empty documents
                for word, count in word_count.items():
                    if word in self.vocabulary:
                        tf = count / doc_length
                        idf = self.idf_values[word]
                        row[self.vocabulary[word]] = tf * idf
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        """
        Compute IDF values and transform documents into their TF-IDF representation.

        Args:
            documents (list of str): List of documents where each document is a string.

        Returns:
            list of list of float: TF-IDF matrix where each row corresponds to a document.
        """
        self.fit(documents)
        return self.transform(documents)


# Example usage
if __name__ == "__main__":
    documents = [
        "the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the dog food",
        "I love programming in Python",
        "Machine learning is fun",
        "Python is a versatile language",
        "Learning new skills is always beneficial",
    ]

    # Initialize the TF-IDF model
    tfidf = TFIDF()

    # Fit the model and transform the documents
    tfidf_matrix = tfidf.fit_transform(documents)

    # Print the vocabulary
    print("Vocabulary:", tfidf.vocabulary)

    # Print the TF-IDF representation
    print("TF-IDF Representation:")
    for i, vector in enumerate(tfidf_matrix):
        print(f"Document {i + 1}: {vector}")

    # More example documents with mixed content
    more_documents = [
        "the quick brown fox jumps over the lazy dog",
        "a journey of a thousand miles begins with a single step",
        "to be or not to be that is the question",
        "the rain in Spain stays mainly in the plain",
        "all human beings are born free and equal in dignity and rights",
    ]

    # Fit a fresh model and transform the new set of documents
    tfidf_more = TFIDF()
    tfidf_matrix_more = tfidf_more.fit_transform(more_documents)

    # Print the vocabulary for the new documents
    print("\nVocabulary for new documents:", tfidf_more.vocabulary)

    # Print the TF-IDF representation for the new documents
    print("TF-IDF Representation for new documents:")
    for i, vector in enumerate(tfidf_matrix_more):
        print(f"Document {i + 1}: {vector}")
@@ -0,0 +1,105 @@
# Word2Vec Skip-gram Implementation

## Introduction

Word2Vec is a technique for learning word embeddings with a shallow neural network. The goal is to represent words in a continuous vector space in which semantically similar words map to nearby points. Word2Vec comes in two main architectures:

1. **Continuous Bag of Words (CBOW)**: Predicts the target word from its context words (the surrounding words).
2. **Skip-gram**: Predicts the context words from a given target word.

This implementation focuses on the Skip-gram approach, the more commonly used of the two in practice. The Skip-gram model tries to maximize the probability of the context words given a target word.
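For example, with `window_size = 1` and the sentence "the cat sat on the mat", the target word "sat" yields the training pairs ("sat", "cat") and ("sat", "on"); the model is trained to assign high probability to "cat" and "on" when shown "sat".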
## Table of Contents

1. [Installation](#installation)
2. [Usage](#usage)
   - [Initialization](#initialization)
   - [Tokenization](#tokenization)
   - [Generate Training Data](#generate-training-data)
   - [Training](#training)
   - [Retrieve Word Vector](#retrieve-word-vector)
3. [Explanation of the Code](#explanation-of-the-code)
4. [References](#references)

## Installation

Ensure you have Python installed. You can install the only dependency, NumPy, using pip:

```sh
pip install numpy
```
## Usage

### Initialization

Define the parameters of the Word2Vec model:

- `window_size`: Size of the context window around the target word.
- `embedding_dim`: Dimension of the word vectors (the embedding space).
- `learning_rate`: Rate at which the weights are updated.

### Tokenization

The `tokenize` method creates a vocabulary from the documents and builds mappings between words and their indices.

### Generate Training Data

The `generate_training_data` method creates pairs of target and context words based on the window size.

### Training

The `train` method initializes the weight matrices and updates them using gradient descent.

For each (target, context) pair, it computes the hidden-layer representation, predicts context probabilities, calculates the error, and updates the weights.

### Retrieve Word Vector

The `get_word_vector` method retrieves the embedding of a specific word.

## Explanation of the Code

### Initialization

- **Parameters**:
  - `window_size`: Size of the context window around the target word.
  - `embedding_dim`: Dimension of the word vectors (the embedding space).
  - `learning_rate`: Rate at which the weights are updated.
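Since the source file itself is not visible in this view, here is a minimal sketch of how such an initializer could look. The class name `Word2Vec` and the default values are assumptions for illustration; only the three parameter names come from this README:

```python
class Word2Vec:
    def __init__(self, window_size=2, embedding_dim=10, learning_rate=0.01):
        # Defaults are assumed for illustration only
        self.window_size = window_size      # context words taken on each side of the target
        self.embedding_dim = embedding_dim  # size of each word vector
        self.learning_rate = learning_rate  # gradient-descent step size
```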
### Tokenization

- The `tokenize` method creates a vocabulary from the documents.
- It builds mappings between words and their indices.
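A standalone sketch of the idea, assuming simple whitespace tokenization (the actual method is not shown in this view):

```python
def tokenize(documents):
    """Build a sorted vocabulary and word <-> index mappings."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    index_to_word = {i: word for word, i in word_to_index.items()}
    return vocabulary, word_to_index, index_to_word
```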
### Generate Training Data

- The `generate_training_data` method creates pairs of target and context words based on the window size.
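A sketch of that pairing logic, matching each target word with every neighbor inside its window (function and argument names are assumptions):

```python
def generate_training_data(documents, word_to_index, window_size):
    """Collect (target_index, context_index) pairs within each window."""
    pairs = []
    for doc in documents:
        indices = [word_to_index[word] for word in doc.split()]
        for pos, target in enumerate(indices):
            window = range(max(0, pos - window_size),
                           min(len(indices), pos + window_size + 1))
            pairs.extend((target, indices[ctx]) for ctx in window if ctx != pos)
    return pairs
```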
### Training

- The `train` method initializes the weight matrices.
- It updates the weights using gradient descent.
- For each (target, context) pair, it:
  - computes the hidden-layer representation,
  - predicts context probabilities,
  - calculates the error, and
  - updates the weights.
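Here is a minimal sketch of a single gradient step following those four sub-steps, assuming a full-softmax skip-gram with input weights `W1` of shape `(vocab_size, embedding_dim)` and output weights `W2` of shape `(embedding_dim, vocab_size)`; all names and shapes are assumptions, not the file's actual code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def train_step(W1, W2, target, context, learning_rate):
    """One gradient-descent update for a single (target, context) pair."""
    h = W1[target]              # hidden layer: the target word's embedding
    y = softmax(h @ W2)         # predicted probability of every vocabulary word
    error = y.copy()
    error[context] -= 1.0       # prediction minus the one-hot context vector
    grad_h = W2 @ error         # gradient w.r.t. the target embedding
    W2 -= learning_rate * np.outer(h, error)
    W1[target] -= learning_rate * grad_h
    return -np.log(y[context])  # cross-entropy loss, handy for monitoring
```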
### Softmax Function

- The `softmax` function converts the output-layer scores into probabilities.
- These probabilities are used to compute the error and update the weights.
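As a concrete check of the definition

$$\mathrm{softmax}(u)_i = \frac{e^{u_i}}{\sum_j e^{u_j}}$$

the scores `[2.0, 1.0, 0.1]` map to roughly `[0.659, 0.242, 0.099]`: all positive, summing to 1, with the largest score receiving the most probability mass.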
### Retrieve Word Vector

- The `get_word_vector` method retrieves the embedding of a specific word.
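Assuming, as is conventional for skip-gram, that the learned embeddings are the rows of the input weight matrix `W1`, retrieval is a simple lookup (a sketch, not the file's actual code):

```python
def get_word_vector(W1, word_to_index, word):
    """Return the learned embedding for `word` (a row of W1)."""
    return W1[word_to_index[word]]
```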
## References

1. [Word2Vec - Google](https://code.google.com/archive/p/word2vec/)
2. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
3. [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

---

This README provides an overview of the Word2Vec Skip-gram implementation, including installation instructions, usage details, and an explanation of the code.