Spam detection is an old and ongoing problem. I get spam texts every day and wonder how they get past my spam filter. I flag them daily, and in doing so I train what I assume should be a state-of-the-art spam detector, since I have a Google phone. Shouldn't the filter catch what is clearly a spam SMS to me?
In this project I tackle this old problem using a small corpus (the SMS Spam Collection, available from the UCI Machine Learning Repository) and classical ML algorithms, aiming at explainability. I achieve 99% accuracy during model evaluation (see more evaluation metrics and tests in this notebook), yet since the training data is small I expect the model to generalize poorly, despite all the tests.
So I deploy the model in an app to see how it does in the wild - with unseen data - to fully understand the challenge.
[Image: homepage]
Hosted on Heroku, the app consists of a simple homepage (above) with a form that accepts a text input, and a results page (below) that offers a detailed look at everything that goes on behind the scenes to transform this text into a prediction of whether it is spam or not.
[Image: top of results page]
The app is meant to demystify machine learning (or "AI", as it is commonly called), since it is often just a series of transformations, however complex and probabilistic, of inputs into outputs: a text becomes a 1 or a 0.
Machines are not intelligent. As one of the founders of the field, Michael I. Jordan, expertly comments in this Lex Fridman Podcast: the "I" in "AI" is a misnomer. We have yet to fully comprehend how humans think, let alone understand whether machines think at all and, if so, how that might differ from how humans think.
This app employs both Natural Language Processing (NLP) and Supervised Machine Learning, which are widely applicable to businesses in a variety of ways. The proportion of unstructured text data on the internet keeps growing relative to structured data such as tables. Text data often sits untapped in databases, as front-facing apps continuously capture user comments through open text fields.
Insights can be extracted from text using NLP and various analytic methods, whether with machine learning or with simpler designs and iteration through solutions. This project's framework for processing text and for classification can be extended and adapted to other classification tasks involving textual data.
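To make the idea of turning a text into a 1 or a 0 concrete, here is a minimal sketch of one classical approach: bag-of-words counts fed into a multinomial Naive Bayes classifier with add-one smoothing. It is not the model deployed in the app; the technique is swapped in for illustration, the training messages are invented toy examples rather than rows from the SMS Spam Collection, and the names `tokenize`, `train_naive_bayes`, and `predict` are hypothetical helpers.

```python
import math
import re
from collections import Counter

# Hypothetical toy messages, invented for illustration only
# (label 1 = spam, label 0 = not spam).
TRAIN = [
    ("win a free prize now", 1),
    ("free cash claim your prize", 1),
    ("urgent claim free reward now", 1),
    ("are we still meeting for lunch", 0),
    ("see you at home tonight", 0),
    ("can you call me after lunch", 0),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(examples):
    """Count words per class and return the pieces needed for scoring."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocabulary


def predict(text, model):
    """Return 1 (spam) or 0 (not spam) by comparing class log-probabilities."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior: fraction of training messages with this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for words
            # that never appeared in this class.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    # Ties (for example, a message of only unseen words) fall back to 0.
    return 1 if scores[1] > scores[0] else 0


model = train_naive_bayes(TRAIN)
print(predict("claim your free prize", model))  # prints 1
print(predict("lunch at home tonight", model))  # prints 0
```

Each step is an ordinary, inspectable transformation: a string becomes tokens, tokens become counts, counts become two log-probability scores, and the larger score decides the label. Adapting the sketch to another text classification task would mostly mean swapping in different labeled examples.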
This journey into the fields of NLP and ML took months of learning, including building my own understanding of the inner workings of various models I never ended up deploying. I am indebted to the numerous tutorials and blogs I read and watched along the way. Below is a list, ordered from most to least influential:
- Data Science Dojo's Introduction To Text Analytics With R by David Langer
- Aurélien Géron's Classification Notebook
- Scikit-Learn's API Docs
- Chayan Kathuria's tutorial Build & Deploy a Spam Classifier app on Heroku Cloud in 10 minutes!
- Analytics Vidhya's Introduction to Topic Modeling and Latent Semantic Analysis
- Prof. Steve Brunton's YouTube lectures on Singular Value Decomposition
- Kevin Arvai's tutorial Fine Tuning a Classifier in Scikit-Learn
- Cole Brendel's article Quickly Compare Multiple Models
- Josh Starmer's StatQuest YouTube channel