Objective: Create machine learning models capable of classifying candidate exoplanets from the raw NASA Kepler space telescope dataset.
Background: The Kepler Space Observatory had verified 1,284 new exoplanets as of May 2016. As of October 2017, there were over 3,000 confirmed exoplanets in total. The raw dataset, exoplanet_data.csv, is a cumulative record of all observed Kepler "objects of interest" (Source).
The models below were chosen for binary classification, a predictive modeling task in which a class label is predicted for a given example of input data (Source). Each object of interest is either confirmed as a new exoplanet or not.
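As a minimal sketch of the binary framing described above, the snippet below collapses a disposition column into a 0/1 label. The column name `koi_disposition`, the label strings, and the toy rows are assumptions for illustration only; the actual contents of exoplanet_data.csv may differ.

```python
import csv
import io

# Toy rows standing in for exoplanet_data.csv. The column names, label
# strings, and values are assumptions, not taken from the real file.
SAMPLE_CSV = """koi_disposition,koi_period
CONFIRMED,9.49
FALSE POSITIVE,19.90
CANDIDATE,1.74
CONFIRMED,2.53
"""


def to_binary_label(disposition: str) -> int:
    """Return 1 for a confirmed exoplanet, 0 for anything else."""
    return 1 if disposition.strip().upper() == "CONFIRMED" else 0


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
labels = [to_binary_label(row["koi_disposition"]) for row in rows]
print(labels)
```

Treating everything that is not confirmed as the negative class is one possible design choice; unresolved candidates could instead be dropped before training.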
Of all the algorithms below, only Random Forests and Logistic Regression reached greater than 85% accuracy, with Random Forests at 89%. If one were to predict exoplanets with these five models, the best choice would be Random Forests. However, the accuracy scores have a limitation: no features were specifically selected or dropped before training, so erroneous or irrelevant data might have been included in the Hyperparameter Tuning phase, skewing the test accuracy.
To improve accuracy, the following may be considered:
- Choosing effective and efficient hyperparameter value ranges when tuning with GridSearch
- Removing features that contribute little to classifying the exoplanets, which also reduces processing time
- Guarding against overfitting and underfitting in each model
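The three suggestions above could be combined roughly as follows. This is a sketch under stated assumptions, not the project's actual pipeline: it uses scikit-learn on synthetic data from `make_classification` rather than the Kepler dataset, and the grid values, `SelectKBest` feature selection, and fold count are all hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the Kepler data (assumption: not the real dataset).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Feature selection sits inside the pipeline so it is re-fit on each
# cross-validation fold and never sees the held-out fold.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("model", RandomForestClassifier(random_state=42)),
])

# Deliberately small, hypothetical grid: 2 * 2 * 2 = 8 candidates.
param_grid = {
    "select__k": [5, 10],
    "model__n_estimators": [25, 50],
    "model__max_depth": [3, None],  # a depth cap is one way to limit overfitting
}

search = GridSearchCV(
    pipeline, param_grid, cv=3, scoring="accuracy", return_train_score=True
)
search.fit(X_train, y_train)

best = search.best_index_
print("Best params:", search.best_params_)
print("Best grid score:", round(search.best_score_, 3))
print("Mean train score:", round(search.cv_results_["mean_train_score"][best], 3))
print("Test accuracy:", round(search.score(X_test, y_test), 3))
```

A mean train score far above the best grid score suggests overfitting, while low values for both suggest underfitting.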
- Predictive Test Accuracy: 0.890 (Random Forests)
- Best Grid score: 0.873 (Random Forests)
- Predictive Test Accuracy: 0.880 (Logistic Regression)
- Best Grid score: 0.885 (Logistic Regression)
- Predictive Test Accuracy: 0.799
- Best Grid score: 0.790
- Predictive Test Accuracy: 0.660
- Best Grid score: 0.672
- Predictive Test Accuracy: 0.600
- Best Grid score: 0.605
- Decision Trees at 5.1 seconds with 216 fits
- K-Nearest Neighbors (KNN) at 29.1 minutes with 28420 fits
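For context on the fit counts above: a grid search fits one model per parameter combination per cross-validation fold, so the total is candidates times folds. The sketch below uses a hypothetical decision-tree grid and an assumed fold count (the real grids are not listed here) that happen to multiply out to 216, then derives the average wall-clock time per fit from the two reported runs.

```python
from itertools import product

# Hypothetical decision-tree grid (assumption: the real grid is not listed).
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 3  # assumed number of cross-validation folds

# Every combination of grid values is one candidate model.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")


def seconds_per_fit(total_seconds: float, n_fits: int) -> float:
    """Average wall-clock seconds spent on each fit."""
    return total_seconds / n_fits


# Run times and fit counts as reported above.
decision_tree_rate = seconds_per_fit(5.1, 216)
knn_rate = seconds_per_fit(29.1 * 60, 28420)
print(f"Decision Trees: {decision_tree_rate:.4f} s per fit")
print(f"KNN: {knn_rate:.4f} s per fit")
```

By this measure, the KNN search was slower mainly because it ran far more fits, not only because each fit took longer.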
- Capturing all Hyperparameter Tuning model parameters for each model
- Run time for fitting after GridSearchCV
- Extra visualizations can be found in the data-visualizations directory
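Capturing the tuned parameters and the fitting run time for each model could look roughly like the sketch below. The `run_and_record` helper and the `StubSearch` class are hypothetical names introduced for illustration. The helper only assumes the search object exposes `fit`, `best_params_`, and `best_score_`, as a fitted scikit-learn `GridSearchCV` does.

```python
import time


def run_and_record(name, search, X, y):
    """Fit a search object and return its tuning results plus wall-clock time."""
    start = time.perf_counter()
    search.fit(X, y)
    elapsed = time.perf_counter() - start
    return {
        "model": name,
        "fit_seconds": elapsed,
        # Attributes that GridSearchCV sets after fitting; None if absent.
        "best_params": getattr(search, "best_params_", None),
        "best_score": getattr(search, "best_score_", None),
    }


class StubSearch:
    """Stand-in for GridSearchCV so the sketch runs without real data."""

    def fit(self, X, y):
        # Placeholder values, not real tuning results.
        self.best_params_ = {"max_depth": 5}
        self.best_score_ = 0.5
        return self


record = run_and_record("stub", StubSearch(), X=[[0], [1]], y=[0, 1])
print(record)
```

Collecting one such record per model in a list would make it straightforward to write all parameters and run times to a single summary file.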