Terminator is a C++ library for spam filtering, similar to the well-known SpamBayes and OSBF-Lua. It can be embedded into other spam filtering software or services as the machine learning module. Its advantages are:
- Very high precision and recall, with state-of-the-art results on public spam filtering corpora at the time of development.
- It is fast and needs only a few MB of memory.
- No hyper-parameters need tuning.
Terminator can also be used for other binary text classification problems, especially those that need an adaptive model for online learning.
Terminator is not a complete end-to-end spam filtering solution. Instead, it focuses on the machine learning part and does not provide blocklists, allowlists, or DKIM verification. My paper, "[An Adaptive Fusion Algorithm for Spam Detection](http://csse.szu.edu.cn/staff/panwk/publications/Journal-IEEE-IS-14-AFSD.pdf)", describes the implementation in detail.
(Update, January 2023: work on this library dates back to around 2010. It consistently achieved state-of-the-art results on most online learning email filtering corpora, including TREC, CEAS, and a private dataset from NetEase. I have not followed this area for a long time, so I may have missed more recent research. For batch learning, I think the newest Transformer-based LLMs have great potential.)
Terminator uses a fusion model that combines eight machine learning algorithms to boost spam filtering performance. The algorithms, named as in the papers that introduced them, are:
- Naive Bayes
- Not So Naive Bayes
- Online Logistic Regression
- HIT
- Winnow
- Balanced Winnow
- Passive Aggressive
- Online Perceptron Algorithm with Margins
We use a novel adaptive model fusion technique: the weight of each individual model is learned during the online learning process.
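To give a feel for what learning per-model weights online can look like, here is a minimal sketch. It is not the fusion rule used by Terminator (the paper has the details). It assumes a plain logistic-regression combiner over the base model scores, and the `OnlineFusion` class, its methods, and the learning rate are hypothetical names chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: NOT the fusion rule implemented in Terminator.
// Combines the scores of several base models with weights that are updated
// online by one gradient step on the logistic loss per labeled email.
class OnlineFusion {
 public:
  explicit OnlineFusion(std::size_t num_models, double learning_rate = 0.1)
      : weights_(num_models, 1.0 / static_cast<double>(num_models)),
        bias_(0.0),
        learning_rate_(learning_rate) {}

  // scores[i] is the spam score of base model i, in [0, 1].
  // scores.size() is assumed to equal the number of models.
  double Predict(const std::vector<double>& scores) const {
    double z = bias_;
    for (std::size_t i = 0; i < weights_.size(); ++i) {
      z += weights_[i] * (2.0 * scores[i] - 1.0);  // center each score to [-1, 1]
    }
    return 1.0 / (1.0 + std::exp(-z));  // fused spam probability
  }

  // Called once the true label is known: models that agreed with the label
  // gain weight, models that disagreed lose weight.
  void Update(const std::vector<double>& scores, bool is_spam) {
    const double error = (is_spam ? 1.0 : 0.0) - Predict(scores);
    for (std::size_t i = 0; i < weights_.size(); ++i) {
      weights_[i] += learning_rate_ * error * (2.0 * scores[i] - 1.0);
    }
    bias_ += learning_rate_ * error;
  }

  const std::vector<double>& weights() const { return weights_; }

 private:
  std::vector<double> weights_;
  double bias_;
  double learning_rate_;
};
```

With this kind of update, a base model that keeps agreeing with the true labels sees its weight grow over time, while a model that keeps disagreeing sees its weight shrink and eventually turn negative.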
The only dependency is [kyotocabinet](http://fallabs.com/kyotocabinet/), which is used for persistence and must be installed first.
git clone https://github.com/freiz/terminator.git
cd terminator
make
You can change the compiler in the Makefile; the build output is a static library that you can link against.
#include "terminator.h"
// The first parameter is the path of the database file.
// The second parameter is the size of the in-memory cache in bytes, so 5 << 20 is 5 MiB (5,242,880 bytes).
Terminator* classifier = new Terminator("terminator.kch", 5 << 20);
// Now you can write the main logic.
// There are two public APIs: Train and Predict.
// [Predict] takes the email content (a std::string) and returns a score ranging from 0 (100% ham) to 1 (100% spam).
// Choose your own threshold on the score to make the final decision.
double score = classifier->Predict(email_content);
// [Train] takes the email content (a std::string) and a boolean flag.
// Set the flag to true if the email is spam and to false if it is ham.
classifier->Train(email_content, is_spam);
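The two calls are usually combined into a predict-then-train loop: score each incoming email, compare the score with a threshold, and feed the true label back once it is known. The sketch below shows that loop. `StubClassifier` is a hypothetical keyword-based stand-in that only mirrors the `Predict`/`Train` shape described above so the example compiles without the library; `LabeledEmail`, `RunOnline`, and the 0.5 threshold are likewise assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the real classifier, NOT part of Terminator.
// It flags any email containing "lottery" and only counts Train calls.
class StubClassifier {
 public:
  double Predict(const std::string& email_content) const {
    return email_content.find("lottery") != std::string::npos ? 0.9 : 0.1;
  }
  void Train(const std::string& /*email_content*/, bool /*is_spam*/) { ++trained_; }
  int trained() const { return trained_; }

 private:
  int trained_ = 0;
};

// Hypothetical helper type: one email plus its true label.
struct LabeledEmail {
  std::string content;
  bool is_spam;
};

// Online protocol: predict first, then train on the true label.
// Returns the number of emails misclassified at the given threshold.
template <typename Classifier>
int RunOnline(Classifier& classifier, const std::vector<LabeledEmail>& stream,
              double threshold = 0.5) {
  int errors = 0;
  for (const auto& email : stream) {
    const bool predicted_spam = classifier.Predict(email.content) > threshold;
    if (predicted_spam != email.is_spam) ++errors;
    classifier.Train(email.content, email.is_spam);  // feed the label back
  }
  return errors;
}
```

Because `RunOnline` is a template, any object with the same two methods can be passed in, for example `*classifier` from the snippet above. Raising the threshold trades missed spam for fewer ham emails flagged by mistake.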
make run-demo
This runs a demo application that simulates spam filtering on the SpamAssassin corpus. You can also put another dataset (such as ceas08) under demo/corpus to check the results on it.
When building your own application, remember to link against the kyotocabinet library as well (typically `-lkyotocabinet`).
Here I quote only a sample of the results, on the public corpus Trec05-p1:
Filter | (1-ROCA)% (lower is better) |
---|---|
bogofilter | 0.048 |
spamprobe | 0.059 |
spamassassin | 0.059 |
terminator | 0.0055 |
The paper "An Adaptive Fusion Algorithm for Spam Detection" contains a complete set of experiment results.
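For readers unfamiliar with the metric in the table, ROCA is the area under the ROC curve, and (1-ROCA)% is the area above it, expressed as a percentage. The sketch below computes it from its pairwise definition: the fraction of (spam, ham) pairs in which the spam email does not receive the higher score. The function name `OneMinusRocaPercent` is hypothetical, and this simple quadratic-time version is meant for illustration, not as the evaluation code behind the numbers above.

```cpp
#include <cstddef>
#include <vector>

// Returns (1 - AUC) * 100, where AUC is the fraction of (spam, ham) pairs
// in which the spam email gets the higher score. Ties count as half.
// scores and is_spam are assumed to have the same length.
double OneMinusRocaPercent(const std::vector<double>& scores,
                           const std::vector<bool>& is_spam) {
  double correct_pairs = 0.0;
  double total_pairs = 0.0;
  for (std::size_t i = 0; i < scores.size(); ++i) {
    if (!is_spam[i]) continue;  // i ranges over spam emails
    for (std::size_t j = 0; j < scores.size(); ++j) {
      if (is_spam[j]) continue;  // j ranges over ham emails
      total_pairs += 1.0;
      if (scores[i] > scores[j]) {
        correct_pairs += 1.0;
      } else if (scores[i] == scores[j]) {
        correct_pairs += 0.5;
      }
    }
  }
  // The metric is undefined without both classes; return 0 by convention here.
  if (total_pairs == 0.0) return 0.0;
  return (1.0 - correct_pairs / total_pairs) * 100.0;
}
```

As a small worked case, take scores 0.9 and 0.3 for two spam emails and 0.8 and 0.1 for two ham emails. Three of the four (spam, ham) pairs are ordered correctly, so the AUC is 0.75 and (1-ROCA)% is 25. A filter that ranks every spam above every ham scores 0.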