- Project Overview
- Project Components
- Running
- Screenshots
- Software Requirements
- File Descriptions
- Credits and Acknowledgements
## Project Overview

This project is part of the Data Science Nanodegree Program by Udacity, in collaboration with Figure Eight. The initial dataset contains pre-labeled, real messages from real-life disaster events. The aim of the project is to build a Natural Language Processing (NLP) tool that categorizes messages.
The project is divided into the following sections:
- Data processing: an ETL pipeline extracts data from the source, cleans it, and stores it in a database
- Machine learning pipeline: trains a model to classify messages into categories
- Web app: shows model results intuitively in real time
## Project Components

There are three components in this project:
File `data/process_data.py`:
- Loads the `messages` and `categories` datasets
- Merges and cleans the data
- Stores the data in a SQLite database
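The load-merge-clean-store steps can be sketched as follows. This is an illustrative sketch, not the actual script: the column names (`id`, `message`, `categories`) and the semicolon-delimited category format (e.g. `related-1;request-0`) are assumptions based on the Figure Eight dataset.

```python
import pandas as pd
from sqlalchemy import create_engine

def load_and_clean(messages_csv, categories_csv):
    """Load, merge, and clean the disaster data (illustrative sketch)."""
    messages = pd.read_csv(messages_csv)
    categories = pd.read_csv(categories_csv)
    df = messages.merge(categories, on="id")

    # Split the single semicolon-delimited 'categories' column into one
    # binary column per category (assumed format: "related-1;request-0;...").
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = [c.split("-")[0] for c in cats.iloc[0]]
    for col in cats:
        cats[col] = cats[col].str[-1].astype(int)

    df = pd.concat([df.drop(columns="categories"), cats], axis=1)
    return df.drop_duplicates()

def save_to_db(df, db_path, table="Messages"):
    """Store the cleaned DataFrame in a SQLite database."""
    engine = create_engine(f"sqlite:///{db_path}")
    df.to_sql(table, engine, index=False, if_exists="replace")
```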
File `models/train_classifier.py`:
- Loads data from the SQLite database
- Creates training and test datasets
- Builds and trains a text processing and machine learning pipeline
- Uses GridSearchCV to optimize hyperparameters
- Scores the model on the test dataset
- Exports the final model as a pickle file
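A sketch of what such a pipeline might look like. The choice of a TfidfVectorizer with a RandomForestClassifier wrapped in MultiOutputClassifier, and the small parameter grid, are illustrative assumptions; the actual script may use different estimators and hyperparameters.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

def build_model():
    """Text-processing + multi-label classification pipeline with grid search."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", MultiOutputClassifier(RandomForestClassifier())),
    ])
    # Small illustrative grid; the real script likely tunes more hyperparameters.
    params = {"clf__estimator__n_estimators": [10, 50]}
    return GridSearchCV(pipeline, params, cv=2)

def train_and_save(X, Y, model_path):
    """Split, train with grid search, score on the test set, and pickle the model."""
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    model = build_model()
    model.fit(X_train, Y_train)
    score = model.score(X_test, Y_test)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return score
```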
File `app/run.py`:
Run `python run.py` from the `app` directory to start the web app, where users can enter messages, i.e., messages sent during a disaster. The app classifies each message into categories.
## Running

There are three steps to get up and running with the web app if you want to start from the ETL process.
Run the following command to execute the ETL pipeline, which cleans the data and stores it in a database:

`python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db`

The first two arguments are the input data files, and the third argument is the SQLite database in which to store the cleaned data.
Run the following command to execute the ML pipeline, which trains the classifier and saves the model:

`python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl`

This loads data from the SQLite database, trains the model, and saves it to a pickle file.
Run the following command in the `app` directory to start the web app:

`python run.py`

This starts the web app and directs you to http://0.0.0.0:3001/, where you can enter messages and get classification results.
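A minimal sketch of how such an app can serve classifications. The `/go` route name, the two-category list, and the stub model are assumptions made so the example is self-contained; the real run.py loads the pickled classifier and renders HTML templates instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In the real run.py the model is unpickled from models/classifier.pkl;
# a stub stands in here so the sketch runs on its own.
class _StubModel:
    def predict(self, texts):
        # Pretend every message is 'related' but not a 'request'.
        return [[1, 0] for _ in texts]

model = _StubModel()
CATEGORIES = ["related", "request"]  # assumed subset of the categories

@app.route("/go")
def go():
    """Classify the user's message and return the category labels."""
    query = request.args.get("query", "")
    labels = model.predict([query])[0]
    return jsonify(dict(zip(CATEGORIES, map(int, labels))))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=3001)
```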
## Screenshots

- Information about the training dataset can be seen on the main page of the web app.
- Below is an example of a message used to test the ML model's performance.
- Clicking Classify Message highlights the relevant text categories.
## Software Requirements

- Python 3.7
- Machine learning libraries: NumPy, pandas, scikit-learn, pickle
- Natural language processing libraries: NLTK, re
- SQLite database libraries: SQLAlchemy
- Web app and data visualization: Flask, Plotly
- Python standard library: sys, warnings
## File Descriptions

There are three main folders:
- data
  - disaster_categories.csv: dataset containing the message categories
  - disaster_messages.csv: dataset containing the messages
  - process_data.py: ETL pipeline script that loads, cleans, merges, and stores the data in a database
  - DisasterResponse.db: SQLite database containing the processed messages and categories
- models
  - train_classifier.py: machine learning pipeline script that trains and saves a model
  - classifier.pkl: the saved model in pickle format
- app
  - run.py: Python script that integrates the above files and starts the web application
  - templates: contains the HTML files for the web application
  - run-windows.py: Python script that starts the web application on a Windows machine
## Credits and Acknowledgements

- Udacity, for curating a program with intensive projects.
- Figure Eight, for providing the dataset used in this project.