maryamxasghari/NLP-with-PySpark

MET CS 777 - Big Data Analytics
Fall 2021

Natural Language Processing with PySpark

Disaster Tweets classification

Author

Project description

Predict whether a given tweet is about a real disaster, using PySpark with both the Spark ML libraries and my own implementations.

Dataset

Source: https://www.kaggle.com/c/nlp-getting-started/data

Files

  • train.csv - the training set
  • test.csv - the test set (does not include labels)

Columns

  • id - a unique identifier for each tweet
  • text - the text of the tweet
  • location - the location the tweet was sent from (may be blank)
  • keyword - a particular keyword from the tweet (may be blank)
  • target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
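
Tweet text can contain commas and quotes, so splitting lines on "," is not enough to recover the columns listed above. The following is a minimal sketch of reading rows in this layout with Python's standard csv module. The function name `parse_tweets` and the two sample rows are made up for illustration and do not come from the repository or the dataset.

```python
import csv
import io


def parse_tweets(csv_text):
    """Parse train.csv-style text into a list of dicts, one per tweet."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        target = row.get("target")  # absent in test.csv
        rows.append({
            "id": int(row["id"]),
            "keyword": row["keyword"] or None,    # may be blank
            "location": row["location"] or None,  # may be blank
            "text": row["text"],
            "target": int(target) if target not in (None, "") else None,
        })
    return rows


# Made-up sample rows; note the comma inside the quoted text field.
sample = (
    "id,keyword,location,text,target\n"
    '1,,,"Flood warning issued, roads closed",1\n'
    '2,ablaze,London,"What a lovely day, truly",0\n'
)
rows = parse_tweets(sample)
print(rows[0]["text"], rows[0]["target"])
```

Blank `keyword` and `location` values become `None`, and a missing `target` column (as in test.csv) also yields `None`, so the same function covers both files.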

Python scripts

Utils

Python files with functions used in the notebooks

  • Plots.py
  • prep_ml.py
  • prep_rdd.py
  • nn_func.py

Scripts to run each classifier in Spark

  • LogisticRegression.py
  • NaiveBayes.py
  • SVM.py
  • Trees.py

  • RDD_logisticRegression.py
  • LR_Optimizers.py
  • RDD_SVM.py
  • SVM_Optimizer.py
  • RDD_NN.py
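
This README does not describe how the RDD-based scripts work internally. As a rough illustration of what a hand-written, map/reduce-style logistic regression can look like, here is a minimal sketch in plain Python. A list stands in for the RDD so the example runs without Spark. The function names, the toy features (a bias term and a hypothetical count of disaster-related words), the learning rate, and the iteration count are all assumptions, not taken from the repository.

```python
import math
from functools import reduce


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def record_gradient(weights, features, label):
    """Log-loss gradient contributed by one (features, label) record."""
    z = sum(w * x for w, x in zip(weights, features))
    error = sigmoid(z) - label
    return [error * x for x in features]


def add_vectors(a, b):
    """Element-wise sum, used as the reduce function."""
    return [x + y for x, y in zip(a, b)]


def train(records, num_features, learning_rate=0.5, iterations=200):
    """Batch gradient descent: map each record to a gradient, then reduce."""
    weights = [0.0] * num_features
    n = len(records)
    for _ in range(iterations):
        grads = [record_gradient(weights, x, y) for x, y in records]  # map
        total = reduce(add_vectors, grads)                            # reduce
        weights = [w - learning_rate * g / n for w, g in zip(weights, total)]
    return weights


def predict(weights, features):
    z = sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy data: [bias, hypothetical disaster-word count] -> label
toy_records = [([1.0, 0.0], 0), ([1.0, 0.2], 0), ([1.0, 2.0], 1), ([1.0, 3.0], 1)]
weights = train(toy_records, num_features=2)
print(predict(weights, [1.0, 0.0]), predict(weights, [1.0, 3.0]))
```

With a real RDD of `(features, label)` pairs, the map and reduce steps inside `train` would correspond to `rdd.map(...)` followed by `rdd.reduce(add_vectors)`, with the weights updated on the driver after each pass.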

Notebooks

  • Part1

    • Data visualization
    • LogisticRegression
    • NaiveBayes
    • SVM
    • Trees
    • RDD_logisticRegression
    • LR_Optimizers
    • RDD_SVM
    • SVM_Optimizer
  • Part2

    • RDD_NN

Presentation

How to run the scripts

spark-submit LogisticRegression.py './nlp-getting-started/train.csv'
spark-submit NaiveBayes.py './nlp-getting-started/train.csv'
spark-submit SVM.py './nlp-getting-started/train.csv'
spark-submit Trees.py './nlp-getting-started/train.csv'
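
Each command passes the path to train.csv as the first positional argument, and some scripts in this README also take an output directory as a second argument. How the scripts read these arguments is not shown here. A minimal sketch of one way to handle them, assuming argparse and the hypothetical argument names `train_path` and `output_path`:

```python
import argparse


def parse_args(argv=None):
    """Parse the positional arguments; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Train a disaster-tweet classifier (illustrative sketch)"
    )
    parser.add_argument("train_path", help="path to train.csv")
    parser.add_argument(
        "output_path",
        nargs="?",        # optional: only some scripts write results
        default=None,
        help="directory for results",
    )
    return parser.parse_args(argv)


args = parse_args(["./nlp-getting-started/train.csv"])
print(args.train_path, args.output_path)
```

Under spark-submit, everything after the script name is forwarded to the script unchanged, so calling `parse_args()` with no argument would pick up the same values.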

NOTE: The following scripts require the NLTK library.

spark-submit RDD_logisticRegression.py './nlp-getting-started/train.csv' './output_LR'
spark-submit RDD_SVM.py './nlp-getting-started/train.csv' './output_svm'
spark-submit LR_Optimizers.py './nlp-getting-started/train.csv' './out/optimizer'
spark-submit SVM_Optimizer.py './nlp-getting-started/train.csv' './out/optimizer2'
spark-submit RDD_NN.py './nlp-getting-started/train.csv' './out/NN_rdd'
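
Since these scripts depend on NLTK, it can help to confirm the package is importable before submitting a job. The helper below is a small sketch using only the standard library. The name `missing_packages` is hypothetical, and the check covers only the Python environment it runs in; on a cluster, the executors need NLTK installed as well.

```python
import importlib.util


def missing_packages(required=("nltk",)):
    """Return the names in `required` that cannot be imported here."""
    return [name for name in required if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Install before running the RDD scripts:", ", ".join(missing))
```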

About

CS 777 - Big Data Analytics Final project
