Machine Learning - Exoplanet Exploration Analysis

Objective: Build machine learning models that classify candidate exoplanets using the raw dataset from NASA's Kepler space telescope.

Background: The Kepler Space Observatory had verified 1,284 new exoplanets as of May 2016. As of October 2017 there were more than 3,000 confirmed exoplanets in total. The raw dataset exoplanet_data.csv is a cumulative record of all observed Kepler "objects of interest" (Source).

The models below were chosen for binary classification, a predictive modeling task in which a class label is predicted for each input example (Source). Each object of interest is classified as either a confirmed exoplanet or not.
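As a minimal sketch of this binary framing, assume each object carries a disposition string such as CONFIRMED, FALSE POSITIVE, or CANDIDATE. The `to_binary_label` helper and the exact label strings are illustrative assumptions, not taken from the notebooks, which may encode the target differently.

```python
# Hypothetical helper: collapse a disposition string into a binary target.
def to_binary_label(disposition: str) -> int:
    """Return 1 for a confirmed exoplanet, 0 for anything else."""
    return 1 if disposition.strip().upper() == "CONFIRMED" else 0


# Illustrative values only; these are not rows from exoplanet_data.csv.
dispositions = ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE", "confirmed "]
labels = [to_binary_label(d) for d in dispositions]
print(labels)
```

Normalizing case and whitespace before comparing keeps stray formatting in the raw data from silently turning a confirmed planet into a negative example.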

Analysis Report

Of the five algorithms below, only Random Forests and Logistic Regression exceeded 85% test accuracy, at 89% and 88% respectively. For predicting exoplanets with these models, Random Forests is the best choice. The accuracy scores have a limitation, however: no features were selected or dropped before training, so uninformative or erroneous columns may have been included during hyperparameter tuning and may have skewed the test accuracy.

To improve accuracy, the following may be considered:

Choosing effective and efficient hyperparameter value ranges when tuning with GridSearchCV
Removing features that contribute little to classifying the exoplanets, which would also reduce processing time
Guarding against overfitting and underfitting in each model
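The suggestions above can be sketched with scikit-learn. This is a hedged illustration on synthetic data from `make_classification`, not the Kepler dataset; the grid values, fold count, and split size are assumptions chosen to keep the run short, not the settings used in the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Kepler features; not the real dataset.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# A deliberately small grid: 2 x 2 = 4 candidates, 3 folds each.
param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_grid_score = search.best_score_          # mean cross-validated accuracy
test_accuracy = search.score(X_test, y_test)  # accuracy on held-out data

# Rank feature indices by importance; low-ranked ones are candidates for removal.
importances = search.best_estimator_.feature_importances_
ranked_features = sorted(
    range(len(importances)), key=lambda i: importances[i], reverse=True
)

print(search.best_params_, round(best_grid_score, 3), round(test_accuracy, 3))
print("features by importance:", ranked_features)
```

Comparing `best_grid_score` with `test_accuracy` gives a rough overfitting check, and the importance ranking is one possible starting point for deciding which features to drop before retuning.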

Model Rank

1. Random Forests

Predictive Test Accuracy: 0.890
Best Grid score: 0.873

2. Logistic Regression

Predictive Test Accuracy: 0.880
Best Grid score: 0.885

3. Decision Trees

Predictive Test Accuracy: 0.799
Best Grid score: 0.790

4. K-Nearest Neighbors (KNN)

Predictive Test Accuracy: 0.660
Best Grid score: 0.672

5. Support Vector Machine (SVM)

Predictive Test Accuracy: 0.600
Best Grid score: 0.605
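One way to read the two scores side by side is to compute the gap between them for each model. The sketch below uses only the numbers reported in this ranking and assumes that "Best Grid score" is the mean cross-validated score of the best parameter set; the `scores` and `gaps` names are illustrative.

```python
# Scores copied from the Model Rank section: (test accuracy, best grid score).
scores = {
    "Random Forests": (0.890, 0.873),
    "Logistic Regression": (0.880, 0.885),
    "Decision Trees": (0.799, 0.790),
    "K-Nearest Neighbors": (0.660, 0.672),
    "Support Vector Machine": (0.600, 0.605),
}

# Positive gap: the model did better on the test set than in cross-validation.
gaps = {name: round(test - grid, 3) for name, (test, grid) in scores.items()}

# List the models from largest to smallest absolute gap.
for name, gap in sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:<24} gap = {gap:+.3f}")
```

All five gaps are under 0.02, which suggests the cross-validated estimates and the held-out accuracies broadly agree; the weak KNN and SVM scores therefore look more like underfitting or unsuitable inputs than a tuning artifact, though the notebooks would be needed to confirm that.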

Additional Information

Fastest run time

Decision Trees at 5.1 seconds with 216 fits

Slowest run time

K-Nearest Neighbors (KNN) at 29.1 minutes with 28,420 fits
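The number of fits in an exhaustive grid search is the number of parameter combinations multiplied by the number of cross-validation folds, which is why a large grid becomes slow. The sketch below shows the arithmetic; the `count_fits` helper and the example grid are hypothetical and are not the grids used in the notebooks.

```python
from math import prod


def count_fits(param_grid: dict, n_folds: int) -> int:
    """Return combinations x folds for an exhaustive grid search."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * n_folds


# Hypothetical KNN grid: 15 x 2 x 2 = 60 candidates.
example_grid = {
    "n_neighbors": list(range(1, 30, 2)),  # 1, 3, ..., 29
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
total_fits = count_fits(example_grid, n_folds=5)
print(total_fits)
```

Because the count is a product, removing one value from any parameter list shrinks the whole search proportionally, so trimming the grid is usually the quickest way to cut run time.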

Challenges

Capturing every tuned hyperparameter for each model
Long run times when fitting with GridSearchCV

Model and Dataset Visualizations

Extra visualizations can be found in the data-visualizations directory

