Demonstrate how adding unlabeled data to a supervised problem can improve out-of-sample accuracy.
This repo provides an example of this accuracy improvement through a loss function := labeled_loss + unlabeled_loss which is optimized using expectation-maximization (EM).
Follows Nigam et al 2006 from "Semi-Supervised Learning", Chapelle et al.
python -m venv venv4ssl
Windows/Anaconda: venv4ssl/Script/activate
Powershell notes:
- you may need to Powershell as administrator
- \venv4ssl\Scripts\Activate.ps1
macos/Linux: source venv4ssl/bin/activate
Linux notes: installing matplotlib may require headers like ft2build.h
python -m pip install --upgrade pip
pip install -r requirements.txt
Handled by specifying nltk data dir in run_experiments_main
python src/run_experiments_main.py
--n_labeled 20,100,300,500,700,1000
--n_unlabeled 10000
--max_iters 5
--out_dir <your_output_dir>
--nltk_data_dir <optional: local nltk data dir>
--test_acc_plot_fname test_acc.png