Used Supervised Classification Predictive Machine Learning models such as Decision Trees, KNN, Logistic Regression, Random Forests, and SVM

diannejardinez/machine-learning-challenge

Machine Learning - Exoplanet Exploration Analysis

Objective: Create machine learning models capable of classifying candidate exoplanets from the raw NASA Kepler space telescope dataset.

Background: The Kepler Space Observatory had verified 1,284 new exoplanets as of May 2016. As of October 2017, there were over 3,000 confirmed exoplanets in total. The raw dataset, exoplanet_data.csv, is a cumulative record of all observed Kepler "objects of interest" (Source).

The models below were chosen for binary classification predictive modeling, in which a class label is predicted for each input example (Source). Each object is classified as either a confirmed exoplanet or not.

Analysis Report

Of the five algorithms compared below, only Random Forests and Logistic Regression exceeded 85% test accuracy, at 89% and 88% respectively. For predicting exoplanets with these five models, Random Forests would be the best choice. The accuracy scores have a limitation, however: no features were selected or dropped before training, so erroneous or uninformative data may have been included during hyperparameter tuning and may have skewed the test accuracy.

To improve accuracy, the following may be considered:

  • Choosing effective and efficient hyperparameter values when tuning with GridSearchCV
  • Removing features that contribute little to classifying the exoplanets, which also reduces processing time
  • Guarding against overfitting and underfitting in each model
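The three suggestions above can be combined in a single cross-validated search. The following is a minimal sketch, not the repository's code: it assumes scikit-learn is installed, uses a synthetic dataset from make_classification as a stand-in for exoplanet_data.csv, and the step names ("select", "model") and grid values are hypothetical. Feature selection sits inside the pipeline so that it is refit on each training fold, and the gap between training and validation scores gives a rough signal of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: NOT the Kepler dataset).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Feature selection and the classifier are tuned together.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# A deliberately small grid (hypothetical values): 3 x 3 = 9 candidates.
param_grid = {
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, return_train_score=True)
search.fit(X_train, y_train)

# A large positive gap between mean training and validation accuracy
# for the best candidate suggests overfitting.
best = search.best_index_
train_cv_gap = (
    search.cv_results_["mean_train_score"][best]
    - search.cv_results_["mean_test_score"][best]
)

print("Best parameters:", search.best_params_)
print("Train/validation gap:", round(train_cv_gap, 3))
```

Keeping the grid small limits the number of fits, and therefore the run time, while still letting the search decide how many features to keep.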

Model Rank

1. Random Forests

  • Predictive Test Accuracy: 0.890
  • Best Grid score: 0.873

2. Logistic Regression

  • Predictive Test Accuracy: 0.880
  • Best Grid score: 0.885

3. Decision Trees

  • Predictive Test Accuracy: 0.799
  • Best Grid score: 0.790

4. K-Nearest Neighbors (KNN)

  • Predictive Test Accuracy: 0.660
  • Best Grid score: 0.672

5. Support Vector Machine (SVM)

  • Predictive Test Accuracy: 0.600
  • Best Grid score: 0.605
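The two figures listed for each model measure different things. The sketch below assumes that "Best Grid score" corresponds to scikit-learn's GridSearchCV.best_score_ (the mean cross-validated score of the best candidate on the training data) and that "Predictive Test Accuracy" is the accuracy of the refit best model on a held-out test split. It uses synthetic data and a hypothetical parameter grid, so the numbers it prints are not comparable to those above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the Kepler dataset).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Hypothetical grid: 2 x 2 = 4 candidates, 3 folds each.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)

# Mean cross-validated accuracy of the best candidate (training data only).
best_grid_score = grid.best_score_

# Accuracy of the refit best estimator on data the search never saw.
test_accuracy = grid.score(X_test, y_test)

print(f"Best grid score: {best_grid_score:.3f}")
print(f"Test accuracy:   {test_accuracy:.3f}")
```

Because the two scores come from different data, either one can be slightly higher than the other, as in the Random Forests and Logistic Regression results above.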

Additional Information

Fastest run time

  • Decision Trees at 5.1 seconds with 216 fits

Slowest run time

  • K-Nearest Neighbors (KNN) at 29.1 minutes with 28,420 fits
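The number of fits reported by a grid search is the number of parameter combinations multiplied by the number of cross-validation folds, which is why a larger grid drives run time up so quickly. The arithmetic can be sketched in plain Python. The grid and fold count below are hypothetical: they were chosen only because they happen to produce 216 fits, and the grids actually used in this repository are not stated here.

```python
import math

def count_fits(param_grid, n_folds):
    """Return (candidates, total fits) for an exhaustive grid search."""
    # Every combination of values across all parameters is one candidate.
    candidates = math.prod(len(values) for values in param_grid.values())
    # Each candidate is fitted once per cross-validation fold.
    return candidates, candidates * n_folds

# Hypothetical decision tree grid: 2 * 6 * 3 * 2 = 72 candidates.
tree_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6, 8, 10, 12],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2],
}

candidates, fits = count_fits(tree_grid, n_folds=3)
print(candidates, fits)  # 72 216
```

Adding one more parameter with five values to this grid would multiply the fit count by five, so trimming even a single parameter list can shorten a search substantially.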

Challenges

  • Capturing all hyperparameter tuning parameters for each model
  • Long run times when fitting with GridSearchCV

Model and Dataset Visualizations

  • Extra visualizations can be found in the data-visualizations directory
