From 822814ace8236b1fc7d22850a21ce66cba7b5043 Mon Sep 17 00:00:00 2001 From: UTSAV SINGHAL Date: Tue, 30 Jul 2024 11:40:50 +0530 Subject: [PATCH 1/3] Add files via upload --- .../Named_Entity_Recognition/README.md | 87 ++ .../data/ner_data.txt | 1015 +++++++++++++++++ .../Named_Entity_Recognition/preprocess.py | 41 + .../Named_Entity_Recognition/recognize.py | 17 + .../Named_Entity_Recognition/train.py | 42 + NLP/Algorithms/Text_Classification/README.md | 90 ++ .../Text_Classification/classify.py | 19 + .../Text_Classification/data/text_data.csv | 175 +++ .../Text_Classification/preprocess.py | 24 + NLP/Algorithms/Text_Classification/train.py | 34 + 10 files changed, 1544 insertions(+) create mode 100644 NLP/Algorithms/Named_Entity_Recognition/README.md create mode 100644 NLP/Algorithms/Named_Entity_Recognition/data/ner_data.txt create mode 100644 NLP/Algorithms/Named_Entity_Recognition/preprocess.py create mode 100644 NLP/Algorithms/Named_Entity_Recognition/recognize.py create mode 100644 NLP/Algorithms/Named_Entity_Recognition/train.py create mode 100644 NLP/Algorithms/Text_Classification/README.md create mode 100644 NLP/Algorithms/Text_Classification/classify.py create mode 100644 NLP/Algorithms/Text_Classification/data/text_data.csv create mode 100644 NLP/Algorithms/Text_Classification/preprocess.py create mode 100644 NLP/Algorithms/Text_Classification/train.py diff --git a/NLP/Algorithms/Named_Entity_Recognition/README.md b/NLP/Algorithms/Named_Entity_Recognition/README.md new file mode 100644 index 0000000..670c4d9 --- /dev/null +++ b/NLP/Algorithms/Named_Entity_Recognition/README.md @@ -0,0 +1,87 @@ +# Named Entity Recognition (NER) Project + +This project demonstrates a basic Named Entity Recognition (NER) algorithm using Python and the `spacy` library. The goal is to identify named entities in text and classify them into predefined categories. 
+
+## Directory Structure
+
+```
+ner_project/
+├── data/
+│   └── ner_data.txt
+├── models/
+│   └── ner_model
+├── preprocess.py
+├── train.py
+├── recognize.py
+└── README.md
+```
+
+- **data/ner_data.txt**: Contains the dataset used for training the NER model.
+- **models/ner_model**: Stores the trained NER model.
+- **preprocess.py**: Contains the code for preprocessing the text data.
+- **train.py**: Script for training the NER model.
+- **recognize.py**: Script for recognizing named entities in new text using the trained model.
+- **README.md**: Project documentation.
+
+## Dataset
+
+The dataset (`ner_data.txt`) contains sentences and their corresponding entity labels in the IOB format. Each line contains a word and its label, separated by a space. Sentences are separated by blank lines.
+
+## Preprocessing
+
+The `preprocess.py` file contains functions to preprocess the text data. It reads the dataset and writes it to `data/training_data.spacy`, a binary `DocBin` file that `train.py` loads for training with `spacy`.
+
+## Training the Model
+
+The `train.py` script is used to train the NER model. It performs the following steps:
+
+1. Load a blank English model.
+2. Create the NER pipeline component and add it to the pipeline.
+3. Add labels to the NER component.
+4. Load the training data.
+5. Train the model using the training data.
+6. Save the trained model to `models/ner_model`.
+
+### Running the Training Script
+
+To train the model, first run `python preprocess.py` to generate `data/training_data.spacy`, then run:
+```bash
+python train.py
+```
+
+## Recognizing Named Entities
+
+The `recognize.py` script is used to recognize named entities in new text using the trained model. It performs the following steps:
+
+1. Load the trained model.
+2. Process the input text.
+3. Print the recognized entities and their labels.
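The entities printed in step 3 correspond to the IOB tags in the dataset, where a `B-` tag opens an entity and the `I-` tags of the same type that follow continue it. As a minimal sketch of that convention, assuming plain Python lists of tokens and tags, the helper below merges tagged tokens into whole entities. The name `iob_to_entities` is hypothetical and is not part of the project's scripts; treating a stray `I-` tag as the start of a new entity is also an assumption made here for leniency.

```python
from typing import List, Optional, Tuple


def iob_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity_text, label) pairs.

    Hypothetical helper for illustration only; not part of the project.
    """
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-PER" -> ("B", "PER"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")

        if prefix == "I" and label == current_label:
            # Same entity continues.
            current_words.append(token)
            continue

        # Anything else closes the entity that was open, if any.
        if current_words:
            entities.append((" ".join(current_words), current_label))
        current_words, current_label = [], None

        # "B-" opens a new entity; a stray "I-" is treated the same way (assumption).
        if prefix in ("B", "I"):
            current_words, current_label = [token], label

    # Flush an entity that runs to the end of the sentence.
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


sentence = ["Elon", "Musk", "is", "the", "CEO", "of", "Tesla", "."]
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "O"]
print(iob_to_entities(sentence, labels))
```

Grouping tokens this way reports "Elon Musk" as one `PER` entity rather than two single-token entities, which is the reading the IOB format is meant to convey.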
+ +### Running the Recognition Script + +To recognize named entities in new text, run: +```bash +python recognize.py +``` + +## Dependencies + +The project requires the following Python libraries: + +- spacy + +You can install the dependencies using: +```bash +pip install spacy +``` + +## Example Usage + +```python +# Example usage of the recognize.py script +if __name__ == "__main__": + text = "I love programming in Python. Machine learning is fascinating. Spacy is a useful library." + recognize_entities(text) +``` + +This project provides a basic implementation of Named Entity Recognition using the `spacy` library. You can expand it by using more advanced models or preprocessing techniques based on your requirements. \ No newline at end of file diff --git a/NLP/Algorithms/Named_Entity_Recognition/data/ner_data.txt b/NLP/Algorithms/Named_Entity_Recognition/data/ner_data.txt new file mode 100644 index 0000000..3e60694 --- /dev/null +++ b/NLP/Algorithms/Named_Entity_Recognition/data/ner_data.txt @@ -0,0 +1,1015 @@ +I O +love O +programming O +in O +Python B-LANG +. O + +Machine B-SKILL +learning I-SKILL +is O +fascinating O +. O + +Spacy B-TECH +is O +a O +useful O +library O +. O + +Python B-LANG +is O +a O +versatile O +language O +. O + +Named B-TASK +Entity I-TASK +Recognition I-TASK +is O +important O +. O + +Google B-ORG +is O +a O +leading O +tech B-INDUSTRY +company O +. O + +Elon B-PER +Musk I-PER +is O +the O +CEO O +of O +Tesla B-ORG +. O + +The O +COVID-19 B-EVENT +pandemic I-EVENT +has O +impacted O +the O +world O +significantly O +. O + +Microsoft B-ORG +was O +founded O +by O +Bill B-PER +Gates I-PER +. O + +Apple B-ORG +introduced O +the O +iPhone B-PROD +in O +2007 B-DATE +. O + +Amazon B-ORG +is O +a O +global O +e-commerce B-INDUSTRY +giant O +. O + +Facebook B-ORG +rebranded O +to O +Meta B-ORG +in O +2021 B-DATE +. O + +New B-GPE +York I-GPE +City I-GPE +is O +a O +major O +financial B-INDUSTRY +hub O +. 
O + +SpaceX B-ORG +launched O +the O +Falcon B-PROD +Heavy I-PROD +rocket I-PROD +. O + +The O +Olympics B-EVENT +are O +held O +every O +four O +years O +. O + +Python B-LANG +and O +Java B-LANG +are O +popular O +programming O +languages O +. O + +NASA B-ORG +is O +planning O +a O +mission O +to O +Mars B-LOC +. O + +The O +Great B-LOC +Wall I-LOC +of I-LOC +China I-LOC +is O +a O +historic O +landmark O +. O + +Tesla B-ORG +produces O +electric B-PROD +vehicles I-PROD +. O + +The O +Eiffel B-LOC +Tower I-LOC +is O +in O +Paris B-GPE +. O + +Google B-ORG +Maps I-PROD +is O +a O +widely O +used O +navigation O +app O +. O + +The O +Amazon B-LOC +Rainforest I-LOC +is O +the O +largest O +tropical O +rainforest O +. O + +Lionel B-PER +Messi I-PER +is O +a O +famous O +football B-SPORT +player O +. O + +The O +United B-ORG +Nations I-ORG +is O +an O +international O +organization O +. O + +The O +stock B-INDUSTRY +market O +can O +be O +volatile O +. O + +The O +Academy B-EVENT +Awards I-EVENT +celebrate O +cinematic B-INDUSTRY +achievements O +. O + +Mount B-LOC +Everest I-LOC +is O +the O +highest O +mountain O +in O +the O +world O +. O + +Jeff B-PER +Bezos I-PER +is O +the O +founder O +of O +Amazon B-ORG +. O + +The O +Pacific B-LOC +Ocean I-LOC +is O +the O +largest O +ocean O +. O + +The O +Sahara B-LOC +Desert I-LOC +is O +in O +Africa B-LOC +. O + +Harvard B-ORG +University I-ORG +is O +an O +Ivy B-INDUSTRY +League I-INDUSTRY +school O +. O + +The O +World B-ORG +Health I-ORG +Organization I-ORG +monitors O +global O +health O +trends O +. O + +The O +Wright B-PER +Brothers I-PER +invented O +the O +airplane B-PROD +. O + +Python B-LANG +is O +used O +in O +data B-TASK +science I-TASK +. O + +The O +Taj B-LOC +Mahal I-LOC +is O +in O +India B-GPE +. O + +Tesla B-ORG +is O +known O +for O +its O +electric B-PROD +cars I-PROD +. O + +The O +Google B-ORG +search B-TASK +engine I-TASK +is O +widely O +used O +. O + +NASA B-ORG +is O +exploring O +space B-LOC +. 
O + +The O +Great B-LOC +Barrier I-LOC +Reef I-LOC +is O +in O +Australia B-GPE +. O + +The O +United B-GPE +States I-GPE +of I-GPE +America I-GPE +is O +a O +large O +country O +. O + +The O +Silicon B-LOC +Valley I-LOC +is O +a O +tech B-INDUSTRY +hub O +. O + +The O +European B-ORG +Union I-ORG +is O +a O +political B-INDUSTRY +and O +economic B-INDUSTRY +union O +. O + +Python B-LANG +is O +great O +for O +machine B-TASK +learning I-TASK +. O + +The O +Sphinx B-LOC +is O +a O +historic O +monument O +in O +Egypt B-GPE +. O + +New B-GPE +York I-GPE +is O +a O +busy O +city O +. O + +The O +Grand B-LOC +Canyon I-LOC +is O +a O +natural O +wonder O +. O + +Python B-LANG +supports O +multiple O +programming O +paradigms O +. O + +Amazon B-ORG +Prime I-PROD +offers O +fast O +shipping O +. O + +The O +Mona B-PROD +Lisa I-PROD +is O +a O +famous O +painting O +. O + +The O +Berlin B-LOC +Wall I-LOC +fell O +in O +1989 B-DATE +. O + +Elon B-PER +Musk I-PER +founded O +SpaceX B-ORG +. O + +The O +Great B-LOC +Lakes I-LOC +are O +a O +group O +of O +five O +large O +lakes O +. O + +Python B-LANG +is O +used O +for O +web B-TASK +development I-TASK +. O + +The O +Rocky B-LOC +Mountains I-LOC +are O +in O +North B-LOC +America I-LOC +. O + +Microsoft B-ORG +Office I-PROD +is O +a O +suite O +of O +productivity O +tools O +. O + +Python B-LANG +is O +popular O +in O +data B-TASK +analysis I-TASK +. O + +The O +Golden B-LOC +Gate I-LOC +Bridge I-LOC +is O +in O +San B-GPE +Francisco I-GPE +. O + +The O +Big B-LOC +Ben I-LOC +is O +in O +London B-GPE +. O + +Python B-LANG +is O +known O +for O +its O +simplicity O +. O + +The O +Statue B-LOC +of I-LOC +Liberty I-LOC +is O +in O +New B-GPE +York I-GPE +. O + +The O +Great B-LOC +Pyramids I-LOC +are O +in O +Giza B-LOC +. O + +The O +Amazon B-LOC +River I-LOC +is O +the O +second O +longest O +river O +. O + +Python B-LANG +is O +widely O +used O +in O +scientific B-TASK +computing I-TASK +. 
O + +The O +Great B-LOC +Barrier I-LOC +Reef I-LOC +is O +in O +Australia B-GPE +. O + +The O +Sydney B-LOC +Opera I-LOC +House I-LOC +is O +a O +famous O +landmark O +. O + +Amazon B-ORG +Web I-PROD +Services I-PROD +is O +a O +cloud B-INDUSTRY +computing I-INDUSTRY +platform O +. O + +Python B-LANG +is O +used O +for O +AI B-TASK +research I-TASK +. O + +The O +Empire B-LOC +State I-LOC +Building I-LOC +is O +in O +New B-GPE +York I-GPE +. O + +The O +Great B-LOC +Wall I-LOC +of I-LOC +China I-LOC +is O +a O +historic O +landmark O +. O + +Google B-ORG +Drive I-PROD +is O +a O +cloud B-INDUSTRY +storage I-INDUSTRY +service O +. O + +The O +Amazon B-LOC +Rainforest I-LOC +is O +the O +largest O +tropical O +rainforest O +. O + +Python B-LANG +is O +known O +for O +its O +extensive O +libraries O +. O + +The O +Colosseum B-LOC +is O +in O +Rome B-GPE +. O + +Tesla B-ORG +produces O +electric B-PROD +cars I-PROD +. O + +The O +United B-ORG +Nations I-ORG +was O +founded O +in O +1945 B-DATE +. O + +Python B-LANG +is O +great O +for O +data B-TASK +visualization I-TASK +. O + +The O +Sphinx B-LOC +is O +a O +historic O +monument O +in O +Egypt B-GPE +. O + +The O +Rocky B-LOC +Mountains I-LOC +are O +in O +North B-LOC +America I-LOC +. O + +The O +World B-ORG +Health I-ORG +Organization I-ORG +monitors O +global O +health O +trends O +. O + +Python B-LANG +is O +used O +in O +natural B-TASK +language I-TASK +processing I-TASK +. O + +The O +Grand B-LOC +Canyon I-LOC +is O +a O +natural O +wonder O +. O + +The O +Pacific B-LOC +Ocean I-LOC +is O +the O +largest O +ocean O +. O + +The O +Wright B-PER +Brothers I-PER +invented O +the O +airplane B-PROD +. O + +The O +Great B-LOC +Lakes I-LOC +are O +a O +group O +of O +five O +large O +lakes O +. O + +Python B-LANG +is O +used O +for O +web B-TASK +development I-TASK +. O + +Amazon B-ORG +Prime I-PROD +offers O +fast O +shipping O +. O + +The O +Eiffel B-LOC +Tower I-LOC +is O +in O +Paris B-GPE +. 
O + +The O +United B-GPE +Kingdom I-GPE +is O +a O +country O +in O +Europe B-LOC +. O + +Python B-LANG +is O +known O +for O +its O +simplicity O +. O + +The O +Great B-LOC +Pyramids I-LOC +are O +in O +Giza B-LOC +. O + +Google B-ORG +Maps I-PROD +is O +a O +widely O +used O +navigation O +app O +. O + +The O +Amazon B-LOC +River I-LOC +is O +the O +second O +longest O +river O +. O + +Python B-LANG +is O +widely O +used O +in O +scientific B-TASK +computing I-TASK +. O + +The O +Berlin B-LOC +Wall I-LOC +fell O +in O +1989 B-DATE +. O + +The O +Olympics B-EVENT +are O +held O +every O +four O +years O +. O + +Python B-LANG +and O +Java B-LANG +are O +popular O +programming O +languages O +. O + +The O +Mona B-PROD +Lisa I-PROD +is O +a O +famous O +painting O +. O + +The O +Great B-LOC +Barrier I-LOC +Reef I-LOC +is O +in O +Australia B-GPE +. O + +The O +Sydney B-LOC +Opera I-LOC +House I-LOC +is O +a O +famous O +landmark O +. O + +Elon B-PER +Musk I-PER +is O +the O +CEO O +of O +Tesla B-ORG +. O + +Microsoft B-ORG +was O +founded O +by O +Bill B-PER +Gates I-PER +. O + +The O +Academy B-EVENT +Awards I-EVENT +celebrate O +cinematic B-INDUSTRY +achievements O +. O + +The O +Great B-LOC +Wall I-LOC +of I-LOC +China I-LOC +is O +a O +historic O +landmark O +. O + +Jeff B-PER +Bezos I-PER +is O +the O +founder O +of O +Amazon B-ORG +. O + +The O +Pacific B-LOC +Ocean I-LOC +is O +the O +largest O +ocean O +. O + +Lionel B-PER +Messi I-PER +is O +a O +famous O +football B-SPORT +player O +. O + +The O +United B-GPE +States I-GPE +of I-GPE +America I-GPE +is O +a O +large O +country O +. O + +The O +Silicon B-LOC +Valley I-LOC +is O +a O +tech B-INDUSTRY +hub O +. O + +The O +World B-ORG +Health I-ORG +Organization I-ORG +monitors O +global O +health O +trends O +. O + +The O +Golden B-LOC +Gate I-LOC +Bridge I-LOC +is O +in O +San B-GPE +Francisco I-GPE +. O + +The O +Empire B-LOC +State I-LOC +Building I-LOC +is O +in O +New B-GPE +York I-GPE +. 
O + +Python B-LANG +is O +great O +for O +machine B-TASK +learning I-TASK +. O + +NASA B-ORG +is O +planning O +a O +mission O +to O +Mars B-LOC +. O + +The O +Olympics B-EVENT +are O +held O +every O +four O +years O +. O + +The O +COVID-19 B-EVENT +pandemic I-EVENT +has O +impacted O +the O +world O +significantly O +. O \ No newline at end of file diff --git a/NLP/Algorithms/Named_Entity_Recognition/preprocess.py b/NLP/Algorithms/Named_Entity_Recognition/preprocess.py new file mode 100644 index 0000000..fec41dd --- /dev/null +++ b/NLP/Algorithms/Named_Entity_Recognition/preprocess.py @@ -0,0 +1,41 @@ +import spacy +from spacy.tokens import DocBin +import os + +def create_training_data(file_path): + nlp = spacy.blank("en") + doc_bin = DocBin() + + with open(file_path, 'r') as file: + lines = file.read().split('\n') + + words = [] + tags = [] + for line in lines: + if line: + parts = line.split() + words.append(parts[0]) + tags.append(parts[1]) + else: + if words: + doc = nlp.make_doc(' '.join(words)) + ents = [] + start = 0 + for i, word in enumerate(words): + end = start + len(word) + tag = tags[i] + if tag != 'O': + label = tag.split('-')[1] + span = doc.char_span(start, end, label=label) + if span: + ents.append(span) + start = end + 1 + doc.ents = ents + doc_bin.add(doc) + words = [] + tags = [] + + doc_bin.to_disk("data/training_data.spacy") + +if __name__ == "__main__": + create_training_data("data/ner_data.txt") \ No newline at end of file diff --git a/NLP/Algorithms/Named_Entity_Recognition/recognize.py b/NLP/Algorithms/Named_Entity_Recognition/recognize.py new file mode 100644 index 0000000..6bb9cfe --- /dev/null +++ b/NLP/Algorithms/Named_Entity_Recognition/recognize.py @@ -0,0 +1,17 @@ +import spacy + +def recognize_entities(text): + # Load the trained model + nlp = spacy.load("models/ner_model") + + # Process the text + doc = nlp(text) + + # Print the entities + for ent in doc.ents: + print(f"Text: {ent.text}, Label: {ent.label_}") + +# Example usage 
+if __name__ == "__main__":
+    text = "I love programming in Python. Machine learning is fascinating. Spacy is a useful library."
+    recognize_entities(text)
diff --git a/NLP/Algorithms/Named_Entity_Recognition/train.py b/NLP/Algorithms/Named_Entity_Recognition/train.py
new file mode 100644
index 0000000..d82007a
--- /dev/null
+++ b/NLP/Algorithms/Named_Entity_Recognition/train.py
@@ -0,0 +1,42 @@
+import os
+import random
+import spacy
+from spacy.tokens import DocBin
+from spacy.training import Example
+from spacy.util import minibatch
+
+def train_ner_model():
+    # Load blank English model
+    nlp = spacy.blank("en")
+
+    # Create the NER pipeline component and add it to the pipeline (spaCy v3 API)
+    ner = nlp.add_pipe("ner", last=True)
+
+    # Add labels to the NER component (every label used in data/ner_data.txt)
+    labels = ["LANG", "SKILL", "TECH", "TASK", "ORG", "PER", "GPE", "LOC", "PROD", "EVENT", "DATE", "INDUSTRY", "SPORT"]
+    for label in labels:
+        ner.add_label(label)
+
+    # Load training data and pair each annotated Doc with an unannotated copy
+    doc_bin = DocBin().from_disk("data/training_data.spacy")
+    training_data = [Example(nlp.make_doc(doc.text), doc) for doc in doc_bin.get_docs(nlp.vocab)]
+
+    # Train the model (fixed batch size of 8)
+    optimizer = nlp.initialize(lambda: training_data)
+    for itn in range(20):
+        random.shuffle(training_data)
+        losses = {}
+        batches = minibatch(training_data, size=8)
+        for batch in batches:
+            nlp.update(batch, sgd=optimizer, drop=0.5, losses=losses)
+        print(f"Iteration {itn}, Losses: {losses}")
+
+    # Save the trained model
+    output_dir = "models/ner_model"
+    if not os.path.exists(output_dir):
+        os.makedirs(output_dir)
+    nlp.to_disk(output_dir)
+    print(f"Model saved to {output_dir}")
+
+if __name__ == "__main__":
+    train_ner_model()
\ No newline at end of file
diff --git a/NLP/Algorithms/Text_Classification/README.md b/NLP/Algorithms/Text_Classification/README.md
new file mode 100644
index 0000000..0108124
--- /dev/null
+++ b/NLP/Algorithms/Text_Classification/README.md
@@ -0,0 +1,90 @@
+# Text Classification Project
+
+This project
demonstrates a basic text classification algorithm using Python and common NLP libraries. The goal is to classify text into two categories: positive and negative sentiment.
+
+## Directory Structure
+
+```
+text_classification/
+├── data/
+│   └── text_data.csv
+├── models/
+│   └── text_classifier.pkl
+├── preprocess.py
+├── train.py
+└── classify.py
+```
+
+- **data/text_data.csv**: Contains the dataset used for training and testing the model.
+- **models/text_classifier.pkl**: Stores the trained text classification model.
+- **preprocess.py**: Contains the code for preprocessing the text data.
+- **train.py**: Script for training the text classification model.
+- **classify.py**: Script for classifying new text using the trained model.
+
+## Dataset
+
+The dataset (`text_data.csv`) contains 174 entries with two columns: `text` and `label`. The `text` column contains the text to be classified, and the `label` column contains the corresponding category (positive or negative).
+
+## Preprocessing
+
+The `preprocess.py` file contains functions to preprocess the text data. It includes removing non-alphabetic characters, converting text to lowercase, removing stopwords, and lemmatizing the words.
+
+## Training the Model
+
+The `train.py` script is used to train the text classification model. It performs the following steps:
+
+1. Load the dataset.
+2. Preprocess the text data.
+3. Split the dataset into training and testing sets.
+4. Create a pipeline with `TfidfVectorizer` and `MultinomialNB`.
+5. Train the model.
+6. Evaluate the model and print the accuracy and classification report.
+7. Save the trained model to `models/text_classifier.pkl`.
+
+### Running the Training Script
+
+To train the model, run:
+```bash
+python train.py
+```
+
+## Classifying New Text
+
+The `classify.py` script is used to classify new text using the trained model. It performs the following steps:
+
+1. Load the trained model.
+2. Preprocess the input text.
+3.
Predict the class of the input text. + +### Running the Classification Script + +To classify new text, run: +```bash +python classify.py +``` + +## Dependencies + +The project requires the following Python libraries: + +- pandas +- scikit-learn +- nltk +- joblib + +You can install the dependencies using: +```bash +pip install pandas scikit-learn nltk joblib +``` + +## Example Usage + +```python +# Example usage of the classify.py script +if __name__ == "__main__": + new_text = "I enjoy working on machine learning projects." + print(f'Text: "{new_text}"') + print(f'Predicted Class: {classify_text(new_text)}') +``` + +This project provides a simple text classification model using Naive Bayes with TF-IDF vectorization. You can expand it by using more advanced models or preprocessing techniques based on your requirements. \ No newline at end of file diff --git a/NLP/Algorithms/Text_Classification/classify.py b/NLP/Algorithms/Text_Classification/classify.py new file mode 100644 index 0000000..d10bae6 --- /dev/null +++ b/NLP/Algorithms/Text_Classification/classify.py @@ -0,0 +1,19 @@ +import joblib +from preprocess import preprocess_text + +# Load the model +model = joblib.load('models/text_classifier.pkl') + +def classify_text(text): + # Preprocess the text + processed_text = preprocess_text(text) + + # Predict the class + prediction = model.predict([processed_text]) + return prediction[0] + +# Example usage +if __name__ == "__main__": + new_text = "I enjoy working on machine learning projects." 
+ print(f'Text: "{new_text}"') + print(f'Predicted Class: {classify_text(new_text)}') diff --git a/NLP/Algorithms/Text_Classification/data/text_data.csv b/NLP/Algorithms/Text_Classification/data/text_data.csv new file mode 100644 index 0000000..699ba2e --- /dev/null +++ b/NLP/Algorithms/Text_Classification/data/text_data.csv @@ -0,0 +1,175 @@ +text,label +"I love programming in Python.",positive +"I hate bugs in my code.",negative +"Machine learning is fascinating.",positive +"This is frustrating.",negative +"I enjoy learning new algorithms.",positive +"Debugging can be very annoying.",negative +"Data science is an interesting field.",positive +"I don't like syntax errors.",negative +"Natural Language Processing is a complex field.",positive +"Software crashes are irritating.",negative +"Python is a versatile language.",positive +"I dislike runtime errors.",negative +"Artificial Intelligence is the future.",positive +"Dealing with legacy code is a pain.",negative +"Deep learning opens new possibilities.",positive +"Compilation errors are troublesome.",negative +"I find joy in coding challenges.",positive +"Unoptimized code is slow.",negative +"Exploring new tech is exciting.",positive +"Unexpected bugs can ruin the day.",negative +"Cloud computing is a game changer.",positive +"I get frustrated with merge conflicts.",negative +"Building models is rewarding.",positive +"Manual testing is tedious.",negative +"AI can solve many problems.",positive +"I dislike waiting for tests to run.",negative +"Analyzing data is intriguing.",positive +"Code reviews can be stressful.",negative +"Learning new frameworks is fun.",positive +"I hate network issues.",negative +"Data visualization brings insights.",positive +"Memory leaks are a nightmare.",negative +"Training models is satisfying.",positive +"Race conditions are tricky to debug.",negative +"Big data is powerful.",positive +"Resolving dependencies is a hassle.",negative +"Predictive analytics is impactful.",positive 
+"Performance bottlenecks are annoying.",negative +"AI research is fascinating.",positive +"Segmentation faults are frustrating.",negative +"Reinforcement learning is innovative.",positive +"Version control is essential.",positive +"Data preprocessing is crucial.",positive +"Invalid inputs cause errors.",negative +"Optimization is challenging but fun.",positive +"Cross-browser compatibility is a pain.",negative +"Learning SQL is beneficial.",positive +"Cryptic error messages are frustrating.",negative +"Automation increases efficiency.",positive +"Legacy systems are difficult to manage.",negative +"Model deployment is exciting.",positive +"Kernel panics are troublesome.",negative +"Feature engineering is creative.",positive +"System crashes are annoying.",negative +"Understanding algorithms is rewarding.",positive +"Dependency hell is frustrating.",negative +"Blockchain is revolutionary.",positive +"Server downtimes are a nuisance.",negative +"Quantum computing is the next frontier.",positive +"Troubleshooting can be time-consuming.",negative +"Big data analytics is powerful.",positive +"Compiler warnings should be addressed.",negative +"Graph theory is fascinating.",positive +"Interrupted connections are frustrating.",negative +"Text mining reveals hidden patterns.",positive +"Deadlocks are problematic.",negative +"APIs make development easier.",positive +"Stale documentation is unhelpful.",negative +"Predictive modeling is insightful.",positive +"Slow internet is frustrating.",negative +"AI ethics is important.",positive +"Memory corruption is a serious issue.",negative +"Text classification is useful.",positive +"Server misconfigurations cause issues.",negative +"Computer vision is fascinating.",positive +"Outdated libraries can cause problems.",negative +"Speech recognition is impressive.",positive +"Hotfixes can be risky.",negative +"Recommender systems are powerful.",positive +"Intermittent bugs are hard to fix.",negative +"AI in healthcare is 
promising.",positive +"Security vulnerabilities are concerning.",negative +"Natural language understanding is impressive.",positive +"Hard disk failures are disruptive.",negative +"Image recognition is advanced.",positive +"Permission issues are annoying.",negative +"AI in finance is transformative.",positive +"Unintended behaviors are problematic.",negative +"Robotics combines AI and engineering.",positive +"Network congestion is troublesome.",negative +"Machine learning pipelines are essential.",positive +"File corruption is a serious issue.",negative +"AI in marketing is effective.",positive +"Out-of-memory errors are problematic.",negative +"Ethical AI is necessary.",positive +"Server overloads are concerning.",negative +"AI in manufacturing boosts efficiency.",positive +"Software glitches are frustrating.",negative +"Text summarization is useful.",positive +"Packet loss causes issues.",negative +"AI in transportation is innovative.",positive +"Application freezes are annoying.",negative +"Data augmentation improves models.",positive +"Compatibility issues are troublesome.",negative +"AI in education is transformative.",positive +"Blue screen errors are problematic.",negative +"AI in retail enhances customer experience.",positive +"Invalid configurations cause problems.",negative +"Time series forecasting is powerful.",positive +"Network latency is annoying.",negative +"AI in agriculture is revolutionary.",positive +"Disk space issues are frustrating.",negative +"AI in cybersecurity is critical.",positive +"Software conflicts cause problems.",negative +"Sentiment analysis provides insights.",positive +"Permission denied errors are annoying.",negative +"AI in entertainment is exciting.",positive +"Server timeouts are problematic.",negative +"AI in energy management is impactful.",positive +"System hangs are troublesome.",negative +"Reinforcement learning is promising.",positive +"Invalid pointers cause crashes.",negative +"Graph neural networks are 
innovative.",positive +"Network partitions are concerning.",negative +"Text generation is creative.",positive +"Hardware malfunctions are problematic.",negative +"AI in sports analytics is exciting.",positive +"Unexpected reboots are annoying.",negative +"AI in urban planning is beneficial.",positive +"Corrupt databases are frustrating.",negative +"Transfer learning is effective.",positive +"Incorrect settings cause issues.",negative +"Federated learning is emerging.",positive +"Browser crashes are annoying.",negative +"AI in environmental monitoring is crucial.",positive +"Unresponsive applications are troublesome.",negative +"AI in logistics optimizes operations.",positive +"Disk read errors are problematic.",negative +"AI-driven diagnostics are accurate.",positive +"Permission errors are frustrating.",negative +"Collaborative filtering is powerful.",positive +"DNS issues cause connectivity problems.",negative +"AI in customer service is efficient.",positive +"Unexpected shutdowns are annoying.",negative +"Self-supervised learning is innovative.",positive +"Invalid arguments cause errors.",negative +"AI in gaming enhances experience.",positive +"Configuration mismatches cause problems.",negative +"AI in language translation is impressive.",positive +"System instability is concerning.",negative +"Explainable AI improves transparency.",positive +"Corrupted installations cause issues.",negative +"AI in social media analysis is insightful.",positive +"Browser compatibility issues are annoying.",negative +"AI in supply chain management is impactful.",positive +"Software updates can cause disruptions.",negative +"Graph embeddings are useful.",positive +"Invalid states cause crashes.",negative +"Generative models are creative.",positive +"Network security breaches are concerning.",negative +"AI in fraud detection is crucial.",positive +"System errors are frustrating.",negative +"Zero-shot learning is impressive.",positive +"Device driver issues are 
problematic.",negative
+"AI in personalized recommendations is effective.",positive
+"Login failures are annoying.",negative
+"AI in autonomous vehicles is groundbreaking.",positive
+"Application bugs are concerning.",negative
+"Few-shot learning is promising.",positive
+"Memory allocation errors cause issues.",negative
+"AI in medical imaging is revolutionary.",positive
+"Database deadlocks are troublesome.",negative
+"Unsupervised learning reveals patterns.",positive
+"Network drops cause problems.",negative
\ No newline at end of file
diff --git a/NLP/Algorithms/Text_Classification/preprocess.py b/NLP/Algorithms/Text_Classification/preprocess.py
new file mode 100644
index 0000000..e42c8b6
--- /dev/null
+++ b/NLP/Algorithms/Text_Classification/preprocess.py
@@ -0,0 +1,24 @@
+import nltk
+import re
+from nltk.corpus import stopwords
+from nltk.stem import WordNetLemmatizer
+
+# Download necessary NLTK data
+nltk.download('stopwords')
+nltk.download('wordnet')
+
+# Initialize lemmatizer and stop words
+lemmatizer = WordNetLemmatizer()
+stop_words = set(stopwords.words('english'))
+
+def preprocess_text(text):
+    # Remove non-alphabetic characters (flags go by keyword; the fourth positional argument of re.sub is count)
+    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I | re.A)
+    text = text.lower()
+    text = text.strip()
+
+    # Tokenize, drop stop words, and lemmatize
+    tokens = text.split()
+    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
+
+    return ' '.join(tokens)
\ No newline at end of file
diff --git a/NLP/Algorithms/Text_Classification/train.py b/NLP/Algorithms/Text_Classification/train.py
new file mode 100644
index 0000000..e629a9e
--- /dev/null
+++ b/NLP/Algorithms/Text_Classification/train.py
@@ -0,0 +1,34 @@
+import pandas as pd
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.model_selection import train_test_split
+from sklearn.naive_bayes import MultinomialNB
+from sklearn.pipeline import Pipeline
+from sklearn.metrics import accuracy_score, classification_report
+import joblib
+from
preprocess import preprocess_text
+import os
+
+# Load dataset
+data = pd.read_csv('data/text_data.csv')
+
+# Preprocess text data
+data['text'] = data['text'].apply(preprocess_text)
+
+# Split dataset into training and testing sets
+X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)
+
+# Create a pipeline with TfidfVectorizer and MultinomialNB
+model = Pipeline([
+    ('tfidf', TfidfVectorizer()),
+    ('nb', MultinomialNB())
+])
+
+# Train the model
+model.fit(X_train, y_train)
+
+# Evaluate the model
+y_pred = model.predict(X_test)
+print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
+print(f'Classification Report:\n{classification_report(y_test, y_pred)}')
+
+# Save the model (create the output directory if it does not exist yet)
+os.makedirs('models', exist_ok=True)
+joblib.dump(model, 'models/text_classifier.pkl')
From 8014573aa4f98d1fc96302d169fe0e2773b1cfd6 Mon Sep 17 00:00:00 2001
From: UTSAV SINGHAL
Date: Tue, 30 Jul 2024 11:41:33 +0530
Subject: [PATCH 2/3] Add files via upload

---
 NLP/Documentation/Transformers.md | 95 +++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)
 create mode 100644 NLP/Documentation/Transformers.md

diff --git a/NLP/Documentation/Transformers.md b/NLP/Documentation/Transformers.md
new file mode 100644
index 0000000..489f793
--- /dev/null
+++ b/NLP/Documentation/Transformers.md
@@ -0,0 +1,95 @@
+# 📚 Transformers Library Overview
+
+Welcome to this overview of the **Transformers** library! 🚀 This library, developed by Hugging Face, is designed to provide state-of-the-art natural language processing (NLP) models and tools. It's widely used for a variety of NLP tasks, including text classification, translation, summarization, and more.
+
+## 📑 Table of Contents
+
+1. [Overview](#-overview)
+2. [Installation](#-installation)
+3. [Quick Start](#-quick-start)
+4. [Documentation](#-documentation)
+5. [Community and Support](#-community-and-support)
+6. [Additional Resources](#-additional-resources)
+7. [FAQ](#-faq)
+
+## 🔍 Overview
+
+Transformers are a type of deep learning model that excels at handling sequential data, like text. They rely on mechanisms such as attention to process and generate text in a way that captures long-range dependencies and contextual information.
+
+### Key Features
+
+- **State-of-the-art Models**: Access pre-trained models like BERT, GPT, T5, and many more. 🏆
+- **Easy-to-use Interface**: Simplify the process of using and fine-tuning models with a user-friendly API. 🎯
+- **Tokenization Tools**: Tokenize and preprocess text efficiently for model input. 🧩
+- **Multi-Framework Support**: Compatible with PyTorch and TensorFlow, giving you flexibility in your deep learning environment. ⚙️
+- **Extensive Documentation**: Detailed guides and tutorials to help you get started and master the library. 📖
+
+## 🔧 Installation
+
+To get started with the Transformers library, you need to install it via pip:
+
+```bash
+pip install transformers
+```
+
+### System Requirements
+
+- **Python**: Version 3.6 or later.
+- **PyTorch** or **TensorFlow**: Depending on your preferred framework. Visit the [official documentation](https://huggingface.co/transformers/installation.html) for compatibility details.
+
+## 🚀 Quick Start
+
+Here's a basic example to demonstrate how to use the library for sentiment classification:
+
+```python
+from transformers import pipeline
+
+# Initialize the pipeline for sentiment analysis
+classifier = pipeline('sentiment-analysis')
+
+# Analyze sentiment of a sample text
+result = classifier("Transformers are amazing for NLP tasks! 🌟")
+
+print(result)
+```
+
+### Common Pipelines
+
+- **Text Classification**: Classify text into predefined categories.
+- **Named Entity Recognition (NER)**: Identify entities like names, dates, and locations.
+- **Text Generation**: Generate text based on a prompt.
+- **Question Answering**: Answer questions based on a given context.
+- **Translation**: Translate text between different languages. + +## 📚 Documentation + +For comprehensive guides, tutorials, and API references, check out the following resources: + +- **[Transformers Documentation](https://huggingface.co/transformers/)**: The official site with detailed information on using and customizing the library. +- **[Model Hub](https://huggingface.co/models)**: Explore a wide range of pre-trained models available for different NLP tasks. +- **[API Reference](https://huggingface.co/transformers/main_classes/pipelines.html)**: Detailed descriptions of classes and functions in the library. + +## 🛠️ Community and Support + +Join the vibrant community of Transformers users and contributors to get support, share your work, and stay updated: + +- **[Hugging Face Forums](https://discuss.huggingface.co/)**: Engage with other users and experts. Ask questions, share your projects, and participate in discussions. +- **[GitHub Repository](https://github.com/huggingface/transformers)**: Browse the source code, report issues, and contribute to the project. Check out the [issues](https://github.com/huggingface/transformers/issues) for ongoing conversations. + +## 🔗 Additional Resources + +- **[Research Papers](https://huggingface.co/papers)**: Read the research papers behind the models and techniques used in the library. +- **[Blog Posts](https://huggingface.co/blog/)**: Discover insights, tutorials, and updates from the Hugging Face team. +- **[Webinars and Talks](https://huggingface.co/events/)**: Watch recorded talks and webinars on the latest developments and applications of Transformers. + +## ❓ FAQ + +**Q: What are the main differences between BERT and GPT?** + +A: BERT (Bidirectional Encoder Representations from Transformers) is designed for understanding the context of words in both directions (left and right). 
GPT (Generative Pre-trained Transformer), on the other hand, is designed for generating text and understanding context in a left-to-right manner. + +**Q: Can I fine-tune a model on my own data?** + +A: Yes, the Transformers library provides tools for fine-tuning pre-trained models on your custom datasets. Check out the [fine-tuning guide](https://huggingface.co/transformers/training.html) for more details. + +Happy Transforming! 🌟 \ No newline at end of file From d2fbe224306604d306c8290bdd3c12c655d9e21c Mon Sep 17 00:00:00 2001 From: UTSAV SINGHAL Date: Tue, 30 Jul 2024 11:44:34 +0530 Subject: [PATCH 3/3] Update README.md - Algorithms - 6, 7 - Documentation - 5 --- NLP/README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/NLP/README.md b/NLP/README.md index 57eeec5..d0f71dc 100644 --- a/NLP/README.md +++ b/NLP/README.md @@ -7,7 +7,8 @@ | S.No | Algorithm | S.No. | Algorithm | S.No. | Algorithm | |-------|-----------|-------|-----------|-------|-----------| | 1 | [Bag of Words](./Algorithms/BagOfWords) | 2 | [TF-IDF](./Algorithms/TF-IDF) | 3 | [Word2Vec](./Algorithms/Word2Vec) | -| 4 | [GloVe](./Algorithms/GloVe) | 5 | [FastText](./Algorithms/FastText) | 6 | | +| 4 | [GloVe](./Algorithms/GloVe) | 5 | [FastText](./Algorithms/FastText) | 6 | [Text Classification](./Algorithms/Text_Classification) | +| 7 | [Named Entity Recognition](./Algorithms/Named_Entity_Recognition) | 8 | | 9 | | ## Available Documentations @@ -16,7 +17,7 @@ | S.No | Documentation | S.No | Documentation | S.No | Documentation | |-------|---------------|-------|---------------|------|---------------| | 1 | [NLP Introduction](./Documentation/NLP_Introduction.md) | 2 | [NLTK Setup](./Documentation/NLTK-Setup.md) | 3 | [Text Preprocessing Techniques](./Documentation/Text_Preprocessing_Techniques.md) | -| 4 | [Word Embeddings](./Documentation/Word_Embeddings.md) | 5 | | 6 | | +| 4 | [Word Embeddings](./Documentation/Word_Embeddings.md) | 5 | [Transformers 
Overview](./Documentation/Transformers.md) | 6 | | ## Available Projects