Team
: Hello World
Team Members
: Sung Lin Chan, Xiangzhe Meng, Süha Kagan Köse
This project was run as a private Kaggle competition, similar to the Higgs Boson Machine Learning Challenge (2014). For more details about the project requirements, see `project1_description` in this repository. Ultimately, we ranked 8th out of 164 teams on the leaderboard.
To reproduce the result we submitted to Kaggle, please follow the instructions below step by step.
- Make sure `NumPy` is installed. It is the only third-party package required for this project.
- Download `train.csv` and `test.csv` from the Kaggle competition and put them in the `/data` folder.
- Run the Python script `run.py`.
Below are brief overviews of the Python scripts used in this project. For more detail, see the PDF report in this repository.
Contains the helper functions used to load the data, generate predictions, and write the results to a CSV submission file in Kaggle's format.
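As an illustration, the loading and submission helpers described above might look like the sketch below. The names `load_csv_data` and `create_csv_submission` and the exact column layout are assumptions for illustration, not the repo's actual code:

```python
import csv
import numpy as np

def load_csv_data(path):
    """Load a Kaggle-style CSV: first column id, second column label,
    remaining columns features. (Hypothetical layout; the repo's actual
    helper may differ.)"""
    data = np.genfromtxt(path, delimiter=",", skip_header=1, dtype=str)
    ids = data[:, 0].astype(int)
    labels = np.where(data[:, 1] == "s", 1, -1)  # signal -> 1, background -> -1
    features = data[:, 2:].astype(float)
    return ids, labels, features

def create_csv_submission(ids, predictions, path):
    """Write (Id, Prediction) rows in Kaggle's submission format."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i, p in zip(ids, predictions):
            writer.writerow([int(i), int(p)])
```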
Contains the helper functions for the main regression models and for hyperparameter tuning.
- `build_polynomial_features`, `standardize`, `replace_missing_data_by_frequent_value`, `process_data` and `group_features_by_jet`: pre-process the raw dataset to generate the desired features for the training and prediction steps
- `compute_accuracy`, `build_k_indices`: compute the accuracy for the cross-validation step
- `compute_gradient`: computes the gradient for gradient descent and stochastic gradient descent
- `batch_iter`: generates a minibatch iterator for a dataset
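For instance, a minibatch iterator like the `batch_iter` listed above is commonly written along these lines. This is a minimal sketch; the repo's actual signature and shuffling scheme may differ:

```python
import numpy as np

def batch_iter(y, tx, batch_size, num_batches=1, shuffle=True, seed=None):
    """Yield `num_batches` minibatches (y_batch, tx_batch) from the dataset,
    optionally shuffling the rows first. Sketch only, not the repo's code."""
    n = len(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n) if shuffle else np.arange(n)
    for b in range(num_batches):
        start = (b * batch_size) % n          # wrap around if we run past the end
        end = min(start + batch_size, n)
        sel = idx[start:end]
        yield y[sel], tx[sel]
```

Used inside stochastic gradient descent, each call yields one small random slice of the data, so each gradient step touches only `batch_size` samples.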
Contains 3 auxiliary functions and 2 different cost functions:
- `calculate_mse`: computes the mean square error, an auxiliary function of `compute_loss`
- `calculate_mae`: computes the mean absolute error, an auxiliary function of `compute_loss`
- `compute_loss`: computes the loss for regression models
- `sigmoid`: an auxiliary function of `compute_loss_neg_log_likelihood`
- `compute_loss_neg_log_likelihood`: computes the negative log likelihood for logistic regression
A Python file that contains the helper functions required for regularized logistic regression, which is implemented in `lr.ipynb`.
Contains the mandatory implementations of the 6 regression models for this project:
- `least_squares_GD`: linear regression using gradient descent
- `least_squares_SGD`: linear regression using stochastic gradient descent
- `least_squares`: least squares regression using the normal equations
- `ridge_regression`: ridge regression using the normal equations
- `logistic_regression`: logistic regression using stochastic gradient descent
- `regularized_logistic_regression`: regularized logistic regression
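As an example of the normal-equation models, `least_squares` and `ridge_regression` can be sketched as follows. This is a minimal sketch under the common convention of returning the weights together with the MSE loss, and with the ridge penalty scaled as `2 * N * lambda_`; the actual implementations may differ in these details:

```python
import numpy as np

def least_squares(y, tx):
    """Solve the normal equations tx.T @ tx @ w = tx.T @ y."""
    w = np.linalg.solve(tx.T @ tx, tx.T @ y)
    e = y - tx @ w
    return w, np.mean(e ** 2) / 2

def ridge_regression(y, tx, lambda_):
    """Normal equations with an L2 penalty on the weights."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    w = np.linalg.solve(a, tx.T @ y)
    e = y - tx @ w
    return w, np.mean(e ** 2) / 2
```

With `lambda_ = 0`, ridge regression reduces exactly to ordinary least squares.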
Script that generates the exact CSV file submitted on Kaggle.
A Python notebook used to find the best hyperparameters by running cross-validation.
A Python notebook used for feature-engineering experiments and for selecting the best method for the final prediction, by running cross-validation for all 6 methods and comparing their average test accuracy.
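The cross-validation comparison described above can be sketched as follows. Here `cross_validation_accuracy` and its callable-based interface are hypothetical, added only to illustrate the k-fold procedure, and `build_k_indices` mirrors the helper name listed earlier:

```python
import numpy as np

def build_k_indices(n, k_fold, seed=1):
    """Split the row indices 0..n-1 into k_fold shuffled folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    fold_size = n // k_fold
    return [idx[k * fold_size:(k + 1) * fold_size] for k in range(k_fold)]

def cross_validation_accuracy(y, tx, train_fn, predict_fn, k_fold=4, seed=1):
    """Average test accuracy over k folds for one model.
    train_fn(y, tx) -> weights; predict_fn(tx, w) -> predicted labels."""
    k_indices = build_k_indices(len(y), k_fold, seed)
    accs = []
    for k in range(k_fold):
        test_idx = k_indices[k]
        train_idx = np.concatenate([k_indices[j] for j in range(k_fold) if j != k])
        w = train_fn(y[train_idx], tx[train_idx])
        preds = predict_fn(tx[test_idx], w)
        accs.append(np.mean(preds == y[test_idx]))
    return np.mean(accs)
```

Running this once per model and comparing the returned averages is the essence of the method-selection step.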
A Python notebook that walks through the data pre-processing, model creation, and prediction steps for regularized logistic regression.