Enhancing Text Analysis with Advanced Algorithms #75

Merged
Changes from 2 commits
1 change: 1 addition & 0 deletions NLP/Algorithms/TF-IDF/tf-idf.ipynb
@@ -0,0 +1 @@
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[],"dockerImageVersionId":30746,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"#### The TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is used to convert a collection of text documents into a matrix of TF-IDF features. It is commonly used in text mining and information retrieval to reflect the importance of a word in a document relative to a collection of documents.","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19"}},{"cell_type":"code","source":"import math\nfrom collections import Counter\n\nclass TFIDF:\n def __init__(self):\n self.vocabulary = {} # Vocabulary to store word indices\n self.idf_values = {} # IDF values for words\n\n def fit(self, documents):\n \"\"\"\n Compute IDF values based on the provided documents.\n \n Args:\n documents (list of str): List of documents where each document is a string.\n \"\"\"\n doc_count = len(documents)\n term_doc_count = Counter() # To count the number of documents containing each word\n\n # Count occurrences of words in documents\n for doc in documents:\n words = set(doc.split()) # Unique words in the current document\n for word in words:\n term_doc_count[word] += 1\n\n # Compute IDF values\n self.idf_values = {\n word: math.log(doc_count / (count + 1)) # +1 to avoid division by zero\n for word, count in term_doc_count.items()\n }\n\n # Build vocabulary\n self.vocabulary = {word: idx for idx, word in enumerate(self.idf_values.keys())}\n\n def transform(self, documents):\n \"\"\"\n Transform documents into TF-IDF representation.\n\n Args:\n documents (list of str): List of documents where each document is a string.\n \n Returns:\n list of list of float: TF-IDF matrix where each row corresponds to a document.\n \"\"\"\n rows = []\n for doc in documents:\n words = doc.split()\n word_count = Counter(words)\n doc_length = len(words)\n row = [0] * len(self.vocabulary)\n\n for word, count in word_count.items():\n if word in self.vocabulary:\n tf = count / doc_length\n idf = self.idf_values[word]\n index = self.vocabulary[word]\n row[index] = tf * idf\n rows.append(row)\n return rows\n\n def fit_transform(self, documents):\n \"\"\"\n Compute IDF values and transform documents into TF-IDF representation.\n\n Args:\n documents (list of str): List of documents where each document is a string.\n\n Returns:\n list of list of float: TF-IDF matrix where each row corresponds to a document.\n \"\"\"\n self.fit(documents)\n return self.transform(documents)","metadata":{"execution":{"iopub.status.busy":"2024-07-20T10:08:08.207148Z","iopub.execute_input":"2024-07-20T10:08:08.207645Z","iopub.status.idle":"2024-07-20T10:08:08.222510Z","shell.execute_reply.started":"2024-07-20T10:08:08.207605Z","shell.execute_reply":"2024-07-20T10:08:08.221404Z"},"trusted":true},"execution_count":5,"outputs":[]},{"cell_type":"code","source":"# Example usage\nif __name__ == \"__main__\":\n documents = [\n \"the cat sat on the mat\",\n \"the dog ate my homework\",\n \"the cat ate the dog food\"\n ]\n\n tfidf = TFIDF()\n tfidf_matrix = 
```python
# Example usage
if __name__ == "__main__":
    documents = [
        "the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the dog food"
    ]

    tfidf = TFIDF()
    tfidf_matrix = tfidf.fit_transform(documents)
    for i, row in enumerate(tfidf_matrix):
        print(f"Document {i}: {row}")
```

Output:

```
Document 0: [0.0, -0.09589402415059363, 0.06757751801802739, 0.06757751801802739, 0.06757751801802739, 0, 0, 0, 0, 0]
Document 1: [0, -0.05753641449035618, 0, 0, 0, 0.08109302162163289, 0.08109302162163289, 0.0, 0.0, 0]
Document 2: [0.0, -0.09589402415059363, 0, 0, 0, 0, 0, 0.0, 0.0, 0.06757751801802739]
```
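Since each row is a plain list of floats, the matrix can be used directly for retrieval-style comparisons. A minimal sketch reusing `tfidf_matrix` from the cell above (the `cosine_similarity` helper here is illustrative and not part of the notebook):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Compare "the cat sat on the mat" with "the cat ate the dog food"
print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[2]))
```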
```python
# Additional example usage
if __name__ == "__main__":
    # Sample documents
    documents = [
        "I love programming in Python",
        "Machine learning is fun",
        "Python is a versatile language",
        "Learning new skills is always beneficial"
    ]

    # Initialize the TF-IDF model
    tfidf = TFIDF()

    # Fit the model and transform the documents
    tfidf_matrix = tfidf.fit_transform(documents)

    # Print the vocabulary
    print("Vocabulary:", tfidf.vocabulary)

    # Print the TF-IDF representation
    print("TF-IDF Representation:")
    for i, vector in enumerate(tfidf_matrix):
        print(f"Document {i + 1}: {vector}")

    # More example documents with mixed content
    more_documents = [
        "the quick brown fox jumps over the lazy dog",
        "a journey of a thousand miles begins with a single step",
        "to be or not to be that is the question",
        "the rain in Spain stays mainly in the plain",
        "all human beings are born free and equal in dignity and rights"
    ]

    # Fit a fresh model and transform the new set of documents
    tfidf_more = TFIDF()
    tfidf_matrix_more = tfidf_more.fit_transform(more_documents)

    # Print the vocabulary for the new documents
    print("\nVocabulary for new documents:", tfidf_more.vocabulary)

    # Print the TF-IDF representation for the new documents
    print("TF-IDF Representation for new documents:")
    for i, vector in enumerate(tfidf_matrix_more):
        print(f"Document {i + 1}: {vector}")
```

Output:

```
Vocabulary: {'love': 0, 'I': 1, 'Python': 2, 'programming': 3, 'in': 4, 'learning': 5, 'fun': 6, 'Machine': 7, 'is': 8, 'a': 9, 'language': 10, 'versatile': 11, 'Learning': 12, 'beneficial': 13, 'new': 14, 'always': 15, 'skills': 16}
TF-IDF Representation:
Document 1: [0.13862943611198905, 0.13862943611198905, 0.05753641449035617, 0.13862943611198905, 0.13862943611198905, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Document 2: [0, 0, 0, 0, 0, 0.17328679513998632, 0.17328679513998632, 0.17328679513998632, 0.0, 0, 0, 0, 0, 0, 0, 0, 0]
Document 3: [0, 0, 0.05753641449035617, 0, 0, 0, 0, 0, 0.0, 0.13862943611198905, 0.13862943611198905, 0.13862943611198905, 0, 0, 0, 0, 0]
Document 4: [0, 0, 0, 0, 0, 0, 0, 0, 0.0, 0, 0, 0, 0.11552453009332421, 0.11552453009332421, 0.11552453009332421, 0.11552453009332421, 0.11552453009332421]

Vocabulary for new documents: {'brown': 0, 'fox': 1, 'quick': 2, 'over': 3, 'the': 4, 'lazy': 5, 'dog': 6, 'jumps': 7, 'thousand': 8, 'journey': 9, 'single': 10, 'a': 11, 'step': 12, 'with': 13, 'of': 14, 'miles': 15, 'begins': 16, 'to': 17, 'or': 18, 'question': 19, 'not': 20, 'be': 21, 'that': 22, 'is': 23, 'Spain': 24, 'rain': 25, 'mainly': 26, 'plain': 27, 'stays': 28, 'in': 29, 'human': 30, 'and': 31, 'all': 32, 'born': 33, 'equal': 34, 'dignity': 35, 'are': 36, 'rights': 37, 'beings': 38, 'free': 39}
TF-IDF Representation for new documents:
Document 1: [0.10181008131935056, 0.10181008131935056, 0.10181008131935056, 0.10181008131935056, 0.049587455847602165, 0.10181008131935056, 0.10181008131935056, 0.10181008131935056, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Document 2: [0, 0, 0, 0, 0, 0, 0, 0, 0.08329915744310501, 0.08329915744310501, 0.08329915744310501, 0.249897472329315, 0.08329915744310501, 0.08329915744310501, 0.08329915744310501, 0.08329915744310501, 0.08329915744310501, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Document 3: [0, 0, 0, 0, 0.02231435513142098, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.18325814637483104, 0.09162907318741552, 0.09162907318741552, 0.09162907318741552, 0.18325814637483104, 0.09162907318741552, 0.09162907318741552, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Document 4: [0, 0, 0, 0, 0.049587455847602165, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.10181008131935056, 0.10181008131935056, 0.10181008131935056, 0.10181008131935056, 0.10181008131935056, 0.11351680528133126, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Document 5: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.04256880198049923, 0.07635756098951292, 0.15271512197902584, 0.07635756098951292, 0.07635756098951292, 0.07635756098951292, 0.07635756098951292, 0.07635756098951292, 0.07635756098951292, 0.07635756098951292, 0.07635756098951292]
```

#### Explanation:

1. **Initialization**:
   - `self.vocabulary`: Dictionary to store the mapping of words to their indices in the TF-IDF matrix.
   - `self.idf_values`: Dictionary to store the IDF (Inverse Document Frequency) values for each word.

2. **`fit` Method**:
   - **Input**: List of documents.
   - **Purpose**: Calculate the IDF values for all unique words in the corpus.
   - **Steps**:
     1. Count the number of documents containing each word.
     2. Compute the IDF for each word using the formula:
        $$\text{IDF}(\text{word}) = \log \left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word} + 1}\right)$$
        The added 1 smooths the denominator; as a side effect, a word that appears in every document receives a slightly negative IDF.
     3. Build the vocabulary with word-to-index mapping.

3. **`transform` Method**:
   - **Input**: List of documents.
   - **Purpose**: Convert each document into a TF-IDF representation.
   - **Steps**:
     1. Compute the Term Frequency (TF) for each word in the document:
        $$\text{TF} = \frac{\text{Count of the word}}{\text{Total number of words in the document}}$$
     2. Compute the TF-IDF value:
        $$\text{TF-IDF} = \text{TF} \times \text{IDF}$$
     3. Store the TF-IDF values in a matrix where each row corresponds to a document.

4. **`fit_transform` Method**:
   - **Purpose**: Perform both fitting (computing IDF values) and transforming (converting documents to TF-IDF representation) in one step.
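For comparison, scikit-learn's `TfidfVectorizer` implements the same idea, but with a different default IDF formula (`log((1 + n) / (1 + df)) + 1` with smoothing) and L2-normalised rows, so its numbers will not match the class above exactly. A rough sketch, assuming a recent scikit-learn (1.0 or later, for `get_feature_names_out`) is installed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat ate the dog food",
]

vectorizer = TfidfVectorizer()                 # lowercases and tokenises internally
matrix = vectorizer.fit_transform(documents)   # sparse matrix of shape (n_docs, n_terms)

print(vectorizer.get_feature_names_out())
print(matrix.toarray())
```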