diff --git a/NLP/Algorithms/Installation_set-up_NLTK.ipynb b/NLP/Algorithms/Installation_set-up_NLTK.ipynb deleted file mode 100644 index 367fde1..0000000 --- a/NLP/Algorithms/Installation_set-up_NLTK.ipynb +++ /dev/null @@ -1,835 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction to NLTK\n", - "\n", - "Hello there! 🌟 Welcome to your first step into the fascinating world of Natural Language Processing (NLP) with the Natural Language Toolkit (NLTK). This guide is designed to be super beginner-friendly. We’ll cover everything from installation to basic operations with lots of explanations along the way. Let's get started!\n", - "\n", - "## What is NLTK?\n", - "The Natural Language Toolkit (NLTK) is a comprehensive Python library for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It also includes wrappers for industrial-strength NLP libraries.\n", - "\n", - "Key Features of NLTK:\n", - "\n", - "1. Corpora and Lexical Resources: NLTK includes access to a variety of text corpora and lexical resources, such as WordNet, the Brown Corpus, the Gutenberg Corpus, and many more.\n", - "\n", - "2. Text Processing Libraries: It provides tools for a wide range of text processing tasks:\n", - "\n", - " Tokenization (splitting text into words, sentences, etc.)\n", - "\n", - " Part-of-Speech (POS) tagging\n", - "\n", - " Named Entity Recognition (NER)\n", - "\n", - " Stemming and Lemmatization\n", - "\n", - " Parsing (syntax analysis)\n", - "\n", - " Semantic reasoning\n", - "\n", - "3. Classification and Machine Learning: NLTK includes various classifiers and machine learning algorithms that can be used for text classification tasks.\n", - "\n", - "4. Visualization and Demonstrations: It offers visualization tools for trees, graphs, and other linguistic structures. It also includes a number of interactive demonstrations and sample data.\n", - "\n", - "## Installation\n", - "\n", - "First, we need to install NLTK. Make sure you have Python installed on your system. If not, you can download it from [python.org](https://www.python.org/). Once you have Python, open your command prompt (or terminal) and type the following command:\n", - "\n", - "```bash\n", - "pip install nltk\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To verify that NLTK is installed correctly, open a Python shell and import the library:\n", - "\n", - "```bash\n", - "import nltk\n", - "```\n", - "\n", - "- If no errors occur, NLTK is successfully installed." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "NLTK requires additional data packages for various functionalities. 
To download all the data packages, open a python shell and run : \n", - "\n", - "```bash\n", - "import nltk\n", - "nltk.download ('all')\n", - "```\n", - "Alternatively you can download specific data packages using : \n", - "\n", - "```bash\n", - "nltk.download ('punkt') # Tokenizer for splitting sentences into words\n", - "nltk.download ('averaged_perceptron_tagger') # Part-of-speech tagger for tagging words with their parts of speech\n", - "nltk.download ('maxent_ne_chunker') # Named entity chunker for recognizing named entities in text\n", - "nltk.download ('words') # Corpus of English words required for many NLTK functions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that we have everything set up, let’s dive into some basic NLP operations with NLTK." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: nltk in /Users/praneet/anaconda3/envs/proj1maverick/lib/python3.9/site-packages (3.8.1)\n", - "Requirement already satisfied: click in /Users/praneet/anaconda3/envs/proj1maverick/lib/python3.9/site-packages (from nltk) (8.1.7)\n", - "Requirement already satisfied: joblib in /Users/praneet/anaconda3/envs/proj1maverick/lib/python3.9/site-packages (from nltk) (1.4.2)\n", - "Requirement already satisfied: regex>=2021.8.3 in /Users/praneet/anaconda3/envs/proj1maverick/lib/python3.9/site-packages (from nltk) (2024.5.15)\n", - "Requirement already satisfied: tqdm in /Users/praneet/anaconda3/envs/proj1maverick/lib/python3.9/site-packages (from nltk) (4.66.4)\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install nltk" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[nltk_data] Downloading collection 'all'\n", - "[nltk_data] | \n", - "[nltk_data] | Downloading package abc to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/abc.zip.\n", - "[nltk_data] | Downloading package alpino to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/alpino.zip.\n", - "[nltk_data] | Downloading package averaged_perceptron_tagger to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip.\n", - "[nltk_data] | Downloading package averaged_perceptron_tagger_ru to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping\n", - "[nltk_data] | taggers/averaged_perceptron_tagger_ru.zip.\n", - "[nltk_data] | Downloading package basque_grammars to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping grammars/basque_grammars.zip.\n", - "[nltk_data] | Downloading package bcp47 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package biocreative_ppi to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/biocreative_ppi.zip.\n", - "[nltk_data] | Downloading package bllip_wsj_no_aux to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping models/bllip_wsj_no_aux.zip.\n", - "[nltk_data] | Downloading package book_grammars to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping grammars/book_grammars.zip.\n", - "[nltk_data] | Downloading package brown to\n", - "[nltk_data] | 
/Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/brown.zip.\n", - "[nltk_data] | Downloading package brown_tei to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/brown_tei.zip.\n", - "[nltk_data] | Downloading package cess_cat to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/cess_cat.zip.\n", - "[nltk_data] | Downloading package cess_esp to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/cess_esp.zip.\n", - "[nltk_data] | Downloading package chat80 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/chat80.zip.\n", - "[nltk_data] | Downloading package city_database to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/city_database.zip.\n", - "[nltk_data] | Downloading package cmudict to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/cmudict.zip.\n", - "[nltk_data] | Downloading package comparative_sentences to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/comparative_sentences.zip.\n", - "[nltk_data] | Downloading package comtrans to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package conll2000 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/conll2000.zip.\n", - "[nltk_data] | Downloading package conll2002 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/conll2002.zip.\n", - "[nltk_data] | Downloading package conll2007 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package crubadan to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/crubadan.zip.\n", - "[nltk_data] | Downloading package dependency_treebank to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/dependency_treebank.zip.\n", - "[nltk_data] | Downloading package dolch to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/dolch.zip.\n", - "[nltk_data] | Downloading package europarl_raw to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/europarl_raw.zip.\n", - "[nltk_data] | Downloading package extended_omw to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package floresta to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/floresta.zip.\n", - "[nltk_data] | Downloading package framenet_v15 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/framenet_v15.zip.\n", - "[nltk_data] | Downloading package framenet_v17 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/framenet_v17.zip.\n", - "[nltk_data] | Downloading package gazetteers to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/gazetteers.zip.\n", - "[nltk_data] | Downloading package genesis to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/genesis.zip.\n", - "[nltk_data] | Downloading package gutenberg to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/gutenberg.zip.\n", - "[nltk_data] | Downloading package ieer to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping 
corpora/ieer.zip.\n", - "[nltk_data] | Downloading package inaugural to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/inaugural.zip.\n", - "[nltk_data] | Downloading package indian to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/indian.zip.\n", - "[nltk_data] | Downloading package jeita to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package kimmo to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/kimmo.zip.\n", - "[nltk_data] | Downloading package knbc to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package large_grammars to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping grammars/large_grammars.zip.\n", - "[nltk_data] | Downloading package lin_thesaurus to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/lin_thesaurus.zip.\n", - "[nltk_data] | Downloading package mac_morpho to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/mac_morpho.zip.\n", - "[nltk_data] | Downloading package machado to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package masc_tagged to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package maxent_ne_chunker to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping chunkers/maxent_ne_chunker.zip.\n", - "[nltk_data] | Downloading package maxent_treebank_pos_tagger to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping taggers/maxent_treebank_pos_tagger.zip.\n", - "[nltk_data] | Downloading package moses_sample to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping models/moses_sample.zip.\n", - "[nltk_data] | Downloading package movie_reviews to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/movie_reviews.zip.\n", - "[nltk_data] | Downloading package mte_teip5 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/mte_teip5.zip.\n", - "[nltk_data] | Downloading package mwa_ppdb to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping misc/mwa_ppdb.zip.\n", - "[nltk_data] | Downloading package names to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/names.zip.\n", - "[nltk_data] | Downloading package nombank.1.0 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package nonbreaking_prefixes to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/nonbreaking_prefixes.zip.\n", - "[nltk_data] | Downloading package nps_chat to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/nps_chat.zip.\n", - "[nltk_data] | Downloading package omw to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package omw-1.4 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package opinion_lexicon to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/opinion_lexicon.zip.\n", - "[nltk_data] | Downloading package panlex_swadesh to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package paradigms to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - 
"[nltk_data] | Unzipping corpora/paradigms.zip.\n", - "[nltk_data] | Downloading package pe08 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/pe08.zip.\n", - "[nltk_data] | Downloading package perluniprops to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping misc/perluniprops.zip.\n", - "[nltk_data] | Downloading package pil to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/pil.zip.\n", - "[nltk_data] | Downloading package pl196x to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/pl196x.zip.\n", - "[nltk_data] | Downloading package porter_test to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping stemmers/porter_test.zip.\n", - "[nltk_data] | Downloading package ppattach to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/ppattach.zip.\n", - "[nltk_data] | Downloading package problem_reports to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/problem_reports.zip.\n", - "[nltk_data] | Downloading package product_reviews_1 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/product_reviews_1.zip.\n", - "[nltk_data] | Downloading package product_reviews_2 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/product_reviews_2.zip.\n", - "[nltk_data] | Downloading package propbank to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package pros_cons to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/pros_cons.zip.\n", - "[nltk_data] | Downloading package ptb to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/ptb.zip.\n", - "[nltk_data] | Downloading package punkt to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping tokenizers/punkt.zip.\n", - "[nltk_data] | Downloading package qc to /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/qc.zip.\n", - "[nltk_data] | Downloading package reuters to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package rslp to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping stemmers/rslp.zip.\n", - "[nltk_data] | Downloading package rte to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/rte.zip.\n", - "[nltk_data] | Downloading package sample_grammars to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping grammars/sample_grammars.zip.\n", - "[nltk_data] | Downloading package semcor to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package senseval to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/senseval.zip.\n", - "[nltk_data] | Downloading package sentence_polarity to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/sentence_polarity.zip.\n", - "[nltk_data] | Downloading package sentiwordnet to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/sentiwordnet.zip.\n", - "[nltk_data] | Downloading package shakespeare to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/shakespeare.zip.\n", - "[nltk_data] | Downloading package sinica_treebank to\n", - 
"[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/sinica_treebank.zip.\n", - "[nltk_data] | Downloading package smultron to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/smultron.zip.\n", - "[nltk_data] | Downloading package snowball_data to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package spanish_grammars to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping grammars/spanish_grammars.zip.\n", - "[nltk_data] | Downloading package state_union to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/state_union.zip.\n", - "[nltk_data] | Downloading package stopwords to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/stopwords.zip.\n", - "[nltk_data] | Downloading package subjectivity to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/subjectivity.zip.\n", - "[nltk_data] | Downloading package swadesh to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/swadesh.zip.\n", - "[nltk_data] | Downloading package switchboard to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/switchboard.zip.\n", - "[nltk_data] | Downloading package tagsets to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping help/tagsets.zip.\n", - "[nltk_data] | Downloading package timit to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/timit.zip.\n", - "[nltk_data] | Downloading package toolbox to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/toolbox.zip.\n", - "[nltk_data] | Downloading package treebank to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/treebank.zip.\n", - "[nltk_data] | Downloading package twitter_samples to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/twitter_samples.zip.\n", - "[nltk_data] | Downloading package udhr to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/udhr.zip.\n", - "[nltk_data] | Downloading package udhr2 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/udhr2.zip.\n", - "[nltk_data] | Downloading package unicode_samples to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/unicode_samples.zip.\n", - "[nltk_data] | Downloading package universal_tagset to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping taggers/universal_tagset.zip.\n", - "[nltk_data] | Downloading package universal_treebanks_v20 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package vader_lexicon to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package verbnet to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/verbnet.zip.\n", - "[nltk_data] | Downloading package verbnet3 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/verbnet3.zip.\n", - "[nltk_data] | Downloading package webtext to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/webtext.zip.\n", - "[nltk_data] | Downloading package wmt15_eval to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | 
Unzipping models/wmt15_eval.zip.\n", - "[nltk_data] | Downloading package word2vec_sample to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping models/word2vec_sample.zip.\n", - "[nltk_data] | Downloading package wordnet to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package wordnet2021 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package wordnet2022 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/wordnet2022.zip.\n", - "[nltk_data] | Downloading package wordnet31 to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Downloading package wordnet_ic to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/wordnet_ic.zip.\n", - "[nltk_data] | Downloading package words to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/words.zip.\n", - "[nltk_data] | Downloading package ycoe to\n", - "[nltk_data] | /Users/praneet/nltk_data...\n", - "[nltk_data] | Unzipping corpora/ycoe.zip.\n", - "[nltk_data] | \n", - "[nltk_data] Done downloading collection all\n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import nltk\n", - "nltk.download('all')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Tokenization\n", - "Tokenization is the process of breaking down text into smaller pieces, like words or sentences. It's like cutting a big cake into smaller slices." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Word Tokenization: ['Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'fun', 'and', 'educational', '.']\n", - "Sentence Tokenization: ['Natural Language Processing with NLTK is fun and educational.']\n" - ] - } - ], - "source": [ - "from nltk.tokenize import word_tokenize, sent_tokenize\n", - "\n", - "# Sample text to work with\n", - "text = \"Natural Language Processing with NLTK is fun and educational.\"\n", - "\n", - "# Tokenize into words\n", - "words = word_tokenize(text)\n", - "print(\"Word Tokenization:\", words)\n", - "\n", - "# Tokenize into sentences\n", - "sentences = sent_tokenize(text)\n", - "print(\"Sentence Tokenization:\", sentences)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Explanation:\n", - "\n", - "word_tokenize(text): This function splits the text into individual words.\n", - "\n", - "sent_tokenize(text): This function splits the text into individu" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Stopwords Removal\n", - "Stopwords are common words that don’t carry much meaning on their own. In many NLP tasks, we remove these words to focus on the important ones." 
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Filtered Words: ['Natural', 'Language', 'Processing', 'NLTK', 'fun', 'educational', '.']\n" - ] - } - ], - "source": [ - "from nltk.corpus import stopwords\n", - "\n", - "# Get the list of stopwords in English\n", - "stop_words = set(stopwords.words('english'))\n", - "\n", - "# Remove stopwords from our list of words\n", - "filtered_words = [word for word in words if word.lower() not in stop_words]\n", - "\n", - "print(\"Filtered Words:\", filtered_words)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Explanation:\n", - "\n", - "stopwords.words('english'): This gives us a list of common English stopwords.\n", - "\n", - "[word for word in words if word.lower() not in stop_words]: This is a list comprehension that filters out the stopwords from our list of words." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Stemming\n", - "Stemming is the process of reducing words to their root form. It’s like finding the 'stem' of a word." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Stemmed Words: ['natur', 'languag', 'process', 'with', 'nltk', 'is', 'fun', 'and', 'educ', '.']\n" - ] - } - ], - "source": [ - "from nltk.stem import PorterStemmer\n", - "\n", - "# Create a PorterStemmer object\n", - "ps = PorterStemmer()\n", - "\n", - "# Stem each word in our list of words\n", - "stemmed_words = [ps.stem(word) for word in words]\n", - "\n", - "print(\"Stemmed Words:\", stemmed_words)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Explanation:\n", - "\n", - "PorterStemmer(): This creates a PorterStemmer object, which is a popular stemming algorithm.\n", - "\n", - "[ps.stem(word) for word in words]: This applies the stemming algorithm to each word in our list." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Lemmatization\n", - "Lemmatization is similar to stemming but it uses a dictionary to find the base form of a word. It’s more accurate than stemming." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Lemmatized Words: ['Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'fun', 'and', 'educational', '.']\n" - ] - } - ], - "source": [ - "from nltk.stem import WordNetLemmatizer\n", - "\n", - "# Create a WordNetLemmatizer object\n", - "lemmatizer = WordNetLemmatizer()\n", - "\n", - "# Lemmatize each word in our list of words\n", - "lemmatized_words = [lemmatizer.lemmatize(word) for word in words]\n", - "\n", - "print(\"Lemmatized Words:\", lemmatized_words)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Explanation:\n", - "\n", - "WordNetLemmatizer(): This creates a lemmatizer object.\n", - "\n", - "[lemmatizer.lemmatize(word) for word in words]: This applies the lemmatization process to each word in our list." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Part-of-speech tagging \n", - "Part-of-speech tagging is the process of labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc.\n", - "NLTK provides functionality to perform POS tagging easily." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Part-of-speech tags:\n", - "[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('tool', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')]\n" - ] - } - ], - "source": [ - "# Import the word_tokenize function from nltk.tokenize module\n", - "# Import the pos_tag function from nltk module\n", - "from nltk.tokenize import word_tokenize\n", - "from nltk import pos_tag\n", - "\n", - "# Sample text to work with\n", - "text = \"NLTK is a powerful tool for natural language processing.\"\n", - "\n", - "# Tokenize the text into individual words\n", - "# The word_tokenize function splits the text into a list of words\n", - "words = word_tokenize(text)\n", - "\n", - "# Perform Part-of-Speech (POS) tagging\n", - "# The pos_tag function takes a list of words and assigns a part-of-speech tag to each word\n", - "pos_tags = pos_tag(words)\n", - "\n", - "# Print the part-of-speech tags\n", - "print(\"Part-of-speech tags:\")\n", - "print(pos_tags)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Explanation:\n", - "\n", - "pos_tags = pos_tag(words): The pos_tag function takes the list of words and assigns a part-of-speech tag to each word. For example, it might tag 'NLTK' as a proper noun (NNP), 'is' as a verb (VBZ), and so on." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here is a list of common POS tags used in the Penn Treebank tag set, along with explanations and examples:\n", - "\n", - "### Common POS Tags:\n", - "\n", - "CC: Coordinating conjunction (e.g., and, but, or)\n", - "\n", - "CD: Cardinal number (e.g., one, two)\n", - "\n", - "DT: Determiner (e.g., the, a, an)\n", - "\n", - "EX: Existential there (e.g., there is)\n", - "\n", - "FW: Foreign word (e.g., en route)\n", - "\n", - "IN: Preposition or subordinating conjunction (e.g., in, of, like)\n", - "\n", - "JJ: Adjective (e.g., big, blue, fast)\n", - "\n", - "JJR: Adjective, comparative (e.g., bigger, faster)\n", - "\n", - "JJS: Adjective, superlative (e.g., biggest, fastest)\n", - "\n", - "LS: List item marker (e.g., 1, 2, One)\n", - "\n", - "MD: Modal (e.g., can, will, must)\n", - "\n", - "NN: Noun, singular or mass (e.g., dog, city, music)\n", - "\n", - "NNS: Noun, plural (e.g., dogs, cities)\n", - "\n", - "NNP: Proper noun, singular (e.g., John, London)\n", - "\n", - "NNPS: Proper noun, plural (e.g., Americans, Sundays)\n", - "\n", - "PDT: Predeterminer (e.g., all, both, half)\n", - "\n", - "POS: Possessive ending (e.g., 's, s')\n", - "\n", - "PRP: Personal pronoun (e.g., I, you, he)\n", - "\n", - "PRP$: Possessive pronoun (e.g., my, your, his)\n", - "\n", - "RB: Adverb (e.g., quickly, softly)\n", - "\n", - "RBR: Adverb, comparative (e.g., faster, harder)\n", - "\n", - "RBS: Adverb, superlative (e.g., fastest, hardest)\n", - "\n", - "RP: Particle (e.g., up, off)\n", - "\n", - "SYM: Symbol (e.g., $, %, &)\n", - "\n", - "TO: to (e.g., to go, to read)\n", - "\n", - "UH: Interjection (e.g., uh, well, wow)\n", - "\n", - "VB: Verb, base form (e.g., run, eat)\n", - "\n", - "VBD: Verb, past tense (e.g., ran, ate)\n", - "\n", - "VBG: Verb, gerund or present participle (e.g., running, eating)\n", - "\n", - "VBN: Verb, past participle (e.g., run, eaten)\n", - "\n", - "VBP: Verb, non-3rd person singular present (e.g., run, eat)\n", - "\n", - "VBZ: Verb, 3rd person singular present (e.g., runs, eats)\n", - "\n", - "WDT: Wh-determiner (e.g., which, that)\n", - "\n", - "WP: Wh-pronoun (e.g., who, what)\n", - "\n", - "WP$: Possessive wh-pronoun (e.g., whose)\n", - "\n", - "WRB: Wh-adverb (e.g., where, when)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This guide provides a basic introduction to NLTK and some fundamental operations in NLP. For more information, refer to the NLTK documentation (https://www.nltk.org/).\n", - "\n", - "Keep exploring, and thank you for using this guide!" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "proj1maverick", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.19" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/NLP/Documentation/NLTK-Setup.md b/NLP/Documentation/NLTK-Setup.md new file mode 100644 index 0000000..9c88bd4 --- /dev/null +++ b/NLP/Documentation/NLTK-Setup.md @@ -0,0 +1,239 @@ +# Introduction to NLTK +Hello there! 🌟 Welcome to your first step into the fascinating world of Natural Language Processing (NLP) with the Natural Language Toolkit (NLTK). This guide is designed to be super beginner-friendly. 
We’ll cover everything from installation to basic operations, with lots of explanations along the way. Let's get started!
+
+# What is NLTK?
+The Natural Language Toolkit (NLTK) is a comprehensive Python library for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It also includes wrappers for industrial-strength NLP libraries.
+
+Key Features of NLTK:
+
+1. Corpora and Lexical Resources: NLTK includes access to a variety of text corpora and lexical resources, such as WordNet, the Brown Corpus, the Gutenberg Corpus, and many more.
+
+2. Text Processing Libraries: It provides tools for a wide range of text processing tasks:
+
+   - Tokenization (splitting text into words, sentences, etc.)
+   - Part-of-Speech (POS) tagging
+   - Named Entity Recognition (NER)
+   - Stemming and Lemmatization
+   - Parsing (syntax analysis)
+   - Semantic reasoning
+
+3. Classification and Machine Learning: NLTK includes various classifiers and machine learning algorithms that can be used for text classification tasks.
+
+4. Visualization and Demonstrations: It offers visualization tools for trees, graphs, and other linguistic structures, and includes a number of interactive demonstrations and sample data.
+
+# Installation
+First, we need to install NLTK. Make sure you have Python installed on your system. If not, you can download it from [python.org](https://www.python.org/). Once you have Python, open your command prompt (or terminal) and type the following command:
+```
+pip install nltk
+```
+To verify that NLTK is installed correctly, open a Python shell and import the library:
+```
+import nltk
+```
+If no errors occur, NLTK is successfully installed.
+
+NLTK requires additional data packages for various functionalities. To download all the data packages, open a Python shell and run:
+```
+import nltk
+nltk.download('all')
+```
+Alternatively, you can download specific data packages:
+```
+nltk.download('punkt')                       # Tokenizer for splitting sentences into words
+nltk.download('averaged_perceptron_tagger')  # Part-of-speech tagger for tagging words with their parts of speech
+nltk.download('maxent_ne_chunker')           # Named entity chunker for recognizing named entities in text
+nltk.download('words')                       # Corpus of English words required by many NLTK functions
+```
+Now that we have everything set up, let’s dive into some basic NLP operations with NLTK.
+
+# Tokenization
+Tokenization is the process of breaking down text into smaller pieces, like words or sentences. It's like cutting a big cake into smaller slices.
+```
+from nltk.tokenize import word_tokenize, sent_tokenize
+
+# Sample text to work with
+text = "Natural Language Processing with NLTK is fun and educational."
+
+# Tokenize into words
+words = word_tokenize(text)
+print("Word Tokenization:", words)
+
+# Tokenize into sentences
+sentences = sent_tokenize(text)
+print("Sentence Tokenization:", sentences)
+```
+Word Tokenization: ['Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'fun', 'and', 'educational', '.']
+
+Sentence Tokenization: ['Natural Language Processing with NLTK is fun and educational.']
+
+### Explanation:
+
+word_tokenize(text): This function splits the text into individual words.
+
+sent_tokenize(text): This function splits the text into individual sentences. Our sample is a single sentence, so the result is a one-element list; multi-sentence input would yield one string per sentence.
+
+# Stopwords Removal
+Stopwords are common words that don’t carry much meaning on their own. In many NLP tasks, we remove these words to focus on the important ones.
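+As a quick aside, you can peek at the list itself before filtering with it. This is a minimal sketch, assuming the stopwords corpus is already downloaded (it is included in 'all'):
+```
+from nltk.corpus import stopwords
+
+# The English stopword list is an ordinary Python list of lowercase words;
+# the first few entries are pronouns such as 'i', 'me', 'my', ...
+print(stopwords.words('english')[:10])
+```
+With the list in hand, the removal step looks like this: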
+```
+from nltk.corpus import stopwords
+
+# Get the list of stopwords in English
+stop_words = set(stopwords.words('english'))
+
+# Remove stopwords from our list of words
+filtered_words = [word for word in words if word.lower() not in stop_words]
+
+print("Filtered Words:", filtered_words)
+```
+Filtered Words: ['Natural', 'Language', 'Processing', 'NLTK', 'fun', 'educational', '.']
+
+### Explanation:
+
+stopwords.words('english'): This gives us a list of common English stopwords.
+
+[word for word in words if word.lower() not in stop_words]: This is a list comprehension that filters out the stopwords from our list of words.
+
+# Stemming
+Stemming is the process of reducing words to their root form. It’s like finding the 'stem' of a word.
+```
+from nltk.stem import PorterStemmer
+
+# Create a PorterStemmer object
+ps = PorterStemmer()
+
+# Stem each word in our list of words
+stemmed_words = [ps.stem(word) for word in words]
+
+print("Stemmed Words:", stemmed_words)
+```
+Stemmed Words: ['natur', 'languag', 'process', 'with', 'nltk', 'is', 'fun', 'and', 'educ', '.']
+
+### Explanation:
+
+PorterStemmer(): This creates a PorterStemmer object, which implements a popular stemming algorithm.
+
+[ps.stem(word) for word in words]: This applies the stemming algorithm to each word in our list. Note that stems such as 'natur' and 'educ' are not real words; stemming simply chops off suffixes.
+
+# Lemmatization
+Lemmatization is similar to stemming, but it uses a dictionary (WordNet) to find the base form of a word, so it’s more accurate than stemming.
+```
+from nltk.stem import WordNetLemmatizer
+
+# Create a WordNetLemmatizer object
+lemmatizer = WordNetLemmatizer()
+
+# Lemmatize each word in our list of words
+lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
+
+print("Lemmatized Words:", lemmatized_words)
+```
+Lemmatized Words: ['Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'fun', 'and', 'educational', '.']
+
+### Explanation:
+
+WordNetLemmatizer(): This creates a lemmatizer object.
+
+[lemmatizer.lemmatize(word) for word in words]: This applies the lemmatization process to each word in our list. By default, lemmatize() treats every word as a noun, which is why our words come back unchanged; passing a part of speech helps, e.g. lemmatizer.lemmatize('running', pos='v') returns 'run'.
+
+# Part-of-speech tagging
+Part-of-speech tagging is the process of labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc. NLTK provides functionality to perform POS tagging easily.
+```
+# Import the word_tokenize function from the nltk.tokenize module
+# Import the pos_tag function from the nltk module
+from nltk.tokenize import word_tokenize
+from nltk import pos_tag
+
+# Sample text to work with
+text = "NLTK is a powerful tool for natural language processing."
+
+# Tokenize the text into individual words
+# The word_tokenize function splits the text into a list of words
+words = word_tokenize(text)
+
+# Perform Part-of-Speech (POS) tagging
+# The pos_tag function takes a list of words and assigns a part-of-speech tag to each word
+pos_tags = pos_tag(words)
+
+# Print the part-of-speech tags
+print("Part-of-speech tags:")
+print(pos_tags)
+```
+Part-of-speech tags:
+[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('tool', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')]
+
+### Explanation:
+
+pos_tags = pos_tag(words): The pos_tag function takes the list of words and assigns a part-of-speech tag to each word. For example, it tags 'NLTK' as a proper noun (NNP), 'is' as a verb (VBZ), and so on.
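+The earlier download step included maxent_ne_chunker and words, which power Named Entity Recognition, and NER builds directly on these POS tags. Below is a minimal sketch, assuming those packages are downloaded; the sample sentence is invented for illustration:
+```
+from nltk.tokenize import word_tokenize
+from nltk import pos_tag, ne_chunk
+
+# NER in NLTK is a pipeline: tokenize, POS-tag, then chunk named entities
+sentence = "Barack Obama was born in Hawaii."
+tree = ne_chunk(pos_tag(word_tokenize(sentence)))
+
+# The result is an nltk Tree; entities appear as labeled subtrees,
+# e.g. (PERSON Barack/NNP Obama/NNP) and (GPE Hawaii/NNP)
+print(tree)
+```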
+
+Here is a list of common POS tags used in the Penn Treebank tag set, along with explanations and examples:
+
+## Common POS Tags
+
+- CC: Coordinating conjunction (e.g., and, but, or)
+- CD: Cardinal number (e.g., one, two)
+- DT: Determiner (e.g., the, a, an)
+- EX: Existential there (e.g., there is)
+- FW: Foreign word (e.g., en route)
+- IN: Preposition or subordinating conjunction (e.g., in, of, like)
+- JJ: Adjective (e.g., big, blue, fast)
+- JJR: Adjective, comparative (e.g., bigger, faster)
+- JJS: Adjective, superlative (e.g., biggest, fastest)
+- LS: List item marker (e.g., 1, 2, One)
+- MD: Modal (e.g., can, will, must)
+- NN: Noun, singular or mass (e.g., dog, city, music)
+- NNS: Noun, plural (e.g., dogs, cities)
+- NNP: Proper noun, singular (e.g., John, London)
+- NNPS: Proper noun, plural (e.g., Americans, Sundays)
+- PDT: Predeterminer (e.g., all, both, half)
+- POS: Possessive ending (e.g., 's, s')
+- PRP: Personal pronoun (e.g., I, you, he)
+- PRP$: Possessive pronoun (e.g., my, your, his)
+- RB: Adverb (e.g., quickly, softly)
+- RBR: Adverb, comparative (e.g., faster, harder)
+- RBS: Adverb, superlative (e.g., fastest, hardest)
+- RP: Particle (e.g., up, off)
+- SYM: Symbol (e.g., $, %, &)
+- TO: to (e.g., to go, to read)
+- UH: Interjection (e.g., uh, well, wow)
+- VB: Verb, base form (e.g., run, eat)
+- VBD: Verb, past tense (e.g., ran, ate)
+- VBG: Verb, gerund or present participle (e.g., running, eating)
+- VBN: Verb, past participle (e.g., run, eaten)
+- VBP: Verb, non-3rd person singular present (e.g., run, eat)
+- VBZ: Verb, 3rd person singular present (e.g., runs, eats)
+- WDT: Wh-determiner (e.g., which, that)
+- WP: Wh-pronoun (e.g., who, what)
+- WP$: Possessive wh-pronoun (e.g., whose)
+- WRB: Wh-adverb (e.g., where, when)
+
+This guide provides a basic introduction to NLTK and some fundamental operations in NLP. For more information, refer to the [NLTK documentation](https://www.nltk.org/).
+
+Keep exploring, and thank you for using this guide!
diff --git a/NLP/README.md b/NLP/README.md
index b611db5..17202b2 100644
--- a/NLP/README.md
+++ b/NLP/README.md
@@ -15,7 +15,7 @@
 | S.No | Documentation | S.No | Documentation | S.No | Documentation |
 |-------|---------------|-------|---------------|------|---------------|
-| 1 | [NLP Introduction](./Documentation/NLP_Introduction.md) | 2 | | 3 | |
+| 1 | [NLP Introduction](./Documentation/NLP_Introduction.md) | 2 | [NLTK Setup](./Documentation/NLTK-Setup.md) | 3 | |
 | 4 | | 5 | | 6 | |
 
 ## Available Projects