This base code comes from the template provided by Jonathan Tay in his repository, which has been modified to use a different dataset.
Initial testing of the base code used his datasets, as described below; the code was then modified to use a different dataset.
This file describes the structure of this assignment submission.
This is the code for Assignment 1 for the OMSCS CS7641 Machine Learning course taught in the Spring of 2018.
The assignment code was originally written in Python 3.5.1, but for this assignment it is run in Python 3.6.0 on Microsoft Windows using the PyCharm IDE.
Both machines run Windows 7 Professional 64-bit Edition. Hardware used to run the code:
- Lenovo Thinkpad T460P with 32 GB RAM, SSD and i7-6820HQ (4 cores, 8 threads)
- Dell Precision T1650 with 16 GB RAM, SSD and i7-3770 (4 cores, 8 threads)
The Dell desktop is about 11% faster overall at computation. The Lenovo laptop is my primary development environment and follows me everywhere.
Processor and memory access speeds seem to be the limiting factors for processing time.
Lenovo Thinkpad
- Python 3.6.0 for Windows x86-64 retrieved from Dec 2016.
- PyCharm 2017.3.3 Professional Edition IDE retrieved January 2018.
- VirtualEnv used for isolating environment from other projects
- Github account used for hosting this code and data
Dell Precision
- Python 3.5.0 for Windows x86-64 retrieved from
- PyCharm 2017.3.3 Professional Edition IDE retrieved January 2018.
Note: I did not use the Anaconda build of Python but the version.
Library dependencies are:
- scikit-learn 0.19.1
- numpy 1.14.0
- pandas 0.22.0
- matplotlib 2.1.2
- tables 3.4.2
- scipy 1.0.0
Other libraries used are part of the Python standard library.
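To compare an environment against the versions listed above, a minimal sketch along the following lines could be used. The helper name installed_versions is hypothetical and not part of the assignment code. It assumes that each library exposes a __version__ attribute and that scikit-learn and PyTables are imported as sklearn and tables.

```python
import importlib


def installed_versions(module_names):
    """Return {module_name: version string, or None if the import fails}."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The library is not installed in this environment.
            report[name] = None
        else:
            # Assumption: the library exposes __version__; otherwise "unknown".
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    # Import names for the libraries listed above (scikit-learn -> sklearn).
    names = ["sklearn", "numpy", "pandas", "matplotlib", "tables", "scipy"]
    for name, version in installed_versions(names).items():
        print(name, version if version is not None else "not installed")
```

The printed versions can then be checked by eye against the dependency list.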
The main folder contains the following files:
- ./jtay-data adult., madelon_.* -> These are the original datasets, as downloaded from the UCI Machine Learning Repository
- Note: This is the original data for the base template for validation of the code.
- ./jmm-data -> These are the original datasets, as downloaded from the UCI Machine Learning Repository
- Note: This is the actual data for the assignment.
- datasets.hdf -> A pre-processed/cleaned up copy of the datasets. This file is created by the code. Note: Move datasets.hdf manually to the root folder before processing.
- "" -> This Python script pre-processes the original UCI ML repo files into a cleaner form for the experiments
- "xxx-analysis.pdf" -> The analysis for this assignment.
- -> A collection of helper functions used for this assignment
- -> Code for the Neural Network Experiments
- -> Code for the Boosted Tree experiments
- "Decision" -> Code for the Decision Tree experiments
- -> Code for the K-nearest Neighbours experiments
- -> Code for the Support Vector Machine (SVM) experiments
- -> Code to plot the learning and validation curves in the report
- README.txt -> This file
- Weka 3.8.2 for Windows x86-64 retrieved from Feb 2018.
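The manual step of moving datasets.hdf to the root folder could also be scripted. The sketch below is only an illustration: the function name move_to_root and the source folder passed to it are hypothetical, since this file does not state where the pre-processing script writes datasets.hdf.

```python
import shutil
from pathlib import Path


def move_to_root(source_dir, root_dir, filename="datasets.hdf"):
    """Move filename from source_dir into root_dir and return the new path."""
    source = Path(source_dir) / filename
    if not source.is_file():
        raise FileNotFoundError("expected file not found: {}".format(source))
    target = Path(root_dir) / filename
    # Refuse to overwrite an existing copy in the root folder.
    if target.exists():
        raise FileExistsError("target already exists: {}".format(target))
    shutil.move(str(source), str(target))
    return target
```

For example, move_to_root("some-folder", ".") would move some-folder/datasets.hdf into the current directory, where "some-folder" stands in for wherever the file was actually created.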
There is also a subfolder called "output". This folder contains the experimental results.
Here, I use DT/ANN/BT/KNN/SVM_Lin/SVM_RBF to refer to decision trees, artificial neural networks, boosted trees, K-nearest neighbours, linear and RBF kernel SVMs respectively. A suffix of _OF indicates a deliberately "overfitted" version of the model where regularisation is turned off.
The datasets are adult/madelon, referring to the two datasets used (the UCI Adult dataset and the UCI Madelon dataset).
There are 83 files in this folder. They come in the following types:
- __reg.csv -> The validation curve tests for on
- __LC_train.csv -> Table of # of examples vs. CV training accuracy (for 5 folds) for on . For learning curves
- __LC_test.csv -> Table of # of examples vs. CV testing accuracy (for 5 folds) for on . For learning curves
- __timing.csv -> Table of fraction of training set vs. training and evaluation times. If the full training set is of size T and a fraction f is used for training, then the evaluation set is of size T - fT = (1-f)T
- ITER_base__.csv -> Table of results for learning curves based on number of iterations/epochs.
- ITERtestSET__.csv -> Table showing training and test set accuracy as number of iterations/epochs is varied. NOT USED in report.
- "test results.csv" -> Table showing the optimal hyper-parameters chosen, as well as the final accuracy on the held out test set.
- "test results Madelon No feature selection.csv" -> Table showing the optimal hyper-parameters chosen, as well as the final accuracy on the held out test set on Madelon with feature selection turned off. (Feature selection can be turned off by removing the "Cull" stages in the experiment pipelines (pipeM objects).) Note that these results were produced before random seeds were fixed throughout the code, so any attempt to regenerate them will give slightly different numbers.
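As a worked example of the split sizes behind the timing tables described above: with a full training set of size T and a training fraction f, fT examples are used for training and (1-f)T for evaluation. The sketch below uses a hypothetical helper, split_sizes, and assumes the training count is rounded down; the assignment code may round differently.

```python
def split_sizes(total, fraction):
    """Return (n_train, n_eval) for a training set of size total.

    n_train is fraction * total rounded down (an assumption for this sketch);
    n_eval is the remainder, i.e. roughly (1 - fraction) * total.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    n_train = int(total * fraction)
    n_eval = total - n_train
    return n_train, n_eval


print(split_sizes(2000, 0.25))  # (500, 1500)
```

With T = 2000 and f = 0.25, 500 examples are used for training and the remaining 1500 for evaluation.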