Capstone

Predicting Coffee Ratings from Expert Reviews

This project scrapes review data from CoffeeReview.com using Beautiful Soup, then cleans and compiles it into a single dataset. The data includes a mix of evaluative numeric values, descriptive numeric data, evaluative text, and descriptive text. Several natural language processing (NLP) methods are applied to the text to make it usable for machine learning. Multiple models are then tested and evaluated to determine which one best predicts the coffee rating. Finally, the selected model is examined to see which features are most important and what other insights it offers.
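
As a rough illustration of the scraping step, the sketch below parses review fields out of an HTML string with Beautiful Soup. The markup, class names, and the `parse_reviews` helper are invented for this example; the real CoffeeReview.com page structure and the code in Notebook 1 may look quite different. In practice the HTML would come from something like `requests.get(url).text` rather than an inline string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real CoffeeReview.com page structure may differ.
SAMPLE_HTML = """
<div class="review">
  <h2 class="title">Example Roaster Ethiopia Natural</h2>
  <span class="rating">94</span>
  <p class="blind-assessment">Richly fruit-toned, with notes of guava and dark chocolate.</p>
</div>
<div class="review">
  <h2 class="title">Example Roaster House Blend</h2>
  <span class="rating">88</span>
  <p class="blind-assessment">Crisp and nutty, with a short, dry finish.</p>
</div>
"""


def parse_reviews(html):
    """Return one dict per review block found in the HTML string."""
    # "html.parser" is the stdlib backend; lxml could be swapped in if installed.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="review"):
        rows.append({
            "title": block.find("h2", class_="title").get_text(strip=True),
            "rating": int(block.find("span", class_="rating").get_text(strip=True)),
            "text": block.find("p", class_="blind-assessment").get_text(strip=True),
        })
    return rows


reviews = parse_reviews(SAMPLE_HTML)
for review in reviews:
    print(review["rating"], review["title"])
```

A list of dicts like this can be passed straight to `pandas.DataFrame(reviews)` for the cleaning steps.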

Built With

pandas
requests
jupyter
lxml
matplotlib
seaborn
numpy
scipy
python=3.8
beautifulsoup4
pymysql
sqlalchemy
scikit-learn
statsmodels
tensorflow=2.7.0
gensim=3.8.3
xgboost=1.1.1

Getting Started

The project spans seven Jupyter notebooks:

Notebook 1: Collecting the Data - scrapes the data from CoffeeReview.com
Notebook 2: Cleaning the Data Part 1 - cleans the first scraped dataset (split because of complications with scraping)
Notebook 3: Cleaning the Data Part 2 - cleans the second scraped dataset (split because of complications with scraping)
Notebook 4: Compiling the Data and Further Cleaning - combines the two datasets and does further cleaning
Notebook 5: Baseline EDA and Modeling on Non-Text Data - runs initial EDA and baseline models to see how well the numeric data performs
Notebook 6: Natural Language Processing - transforms the text and tests it with the baseline model
Notebook 7: Optimizing Models on All Data - searches for the best model on the combined data, with interpretation and conclusions

Follow the steps outlined below to get your environment set up to support the project.

In addition, a Streamlit app (coffee_streamlit.py) can be used to launch a local site for testing the best text-only model on new reviews. Supporting images for rendering the Streamlit page are included. Local URL: http://localhost:8502

Prerequisites

The following packages must be installed to run the notebooks.

channels:

defaults

dependencies:

pandas 1.4.4
requests 2.28.1
jupyter 1.0.0
lxml 4.8.0
matplotlib 3.5.2
seaborn 0.11.2
numpy 0.11.2
scipy 1.7.3
python 3.8.13
beautifulsoup4 4.11.1
pymysql 1.0.2
sqlalchemy 1.4.39
scikit-learn 1.1.2
statsmodels 0.13.2
tensorflow 2.7.0
gensim 3.8.3
xgboost 1.1.1
nltk 3.7

Installation

Packages can be installed from the environment file included in the GitHub repository (coffee_conda_env.yml):

conda env create --file coffee_conda_env.yml
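
After creating and activating the environment, it can be useful to confirm that the installed versions match the pins. The snippet below is a small sketch using only the standard library; the `check_versions` helper and the `PINNED` subset are illustrative and not part of the repository.

```python
from importlib import metadata

# A few pins copied from the Prerequisites list; extend as needed.
PINNED = {"pandas": "1.4.4", "scikit-learn": "1.1.2", "nltk": "3.7"}


def check_versions(pinned):
    """Map each package to (expected, installed); installed is None if missing."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


report = check_versions(PINNED)
for name, (expected, installed) in report.items():
    status = "ok" if installed == expected else "mismatch"
    print(f"{name}: expected {expected}, found {installed} ({status})")
```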

Use Case

This project can be used to explore web scraping, data cleaning, text vectorization, and machine learning modeling.

Contact

Kate Meredith: kmere21@gmail.com
Project Link: https://github.com/KMere21/capstone

Acknowledgements

Coffee Review
Choose an Open Source License
Best README template
Recreating python environment
Additional resources are cited in the individual notebooks where they were used

License

Distributed under the MIT License. See license.txt for more information.
