This is a course assignment on supervised machine learning models in R, from the Data Science and Advanced Analytics course of the Big Data & Analytics Masters @ EAE, class of 2021. The assignment has three sections:
- Regression Analysis for Child Carseat Sales
- Classification Analysis for Breast Cancer
- Classification Analysis for Iris Species
Given a dataset of 400 observations (locations) with 11 variables, we need to predict the sales volume.
Answer
I used linear regression with 8 different variable combinations. Model performance was evaluated using Mean Squared Error (MSE).
R script found here: regression_hands_on.R
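As a minimal sketch of comparing variable combinations by MSE, the snippet below fits a few linear models and scores each one on held-out rows. It is not the code from regression_hands_on.R: the data frame `stores` is simulated, its column names are hypothetical stand-ins, and the 75/25 train/test split is an assumption, since the write-up does not say how the error was measured.

```r
# Simulated stand-in data with hypothetical columns; NOT the assignment dataset.
set.seed(1)
n <- 400
stores <- data.frame(
  Price       = rnorm(n, mean = 115, sd = 20),
  Advertising = rpois(n, lambda = 6),
  Income      = rnorm(n, mean = 70, sd = 25)
)
stores$Sales <- 14 - 0.06 * stores$Price + 0.12 * stores$Advertising +
  rnorm(n, sd = 1.5)

# Hold out 25% of the rows for testing (an assumed split).
test_idx <- sample(n, size = n / 4)
train <- stores[-test_idx, ]
test  <- stores[test_idx, ]

# Candidate variable combinations, written as model formulas.
candidates <- list(
  price_only   = Sales ~ Price,
  price_advert = Sales ~ Price + Advertising,
  all_three    = Sales ~ Price + Advertising + Income
)

# Fit each model on the training rows and compute MSE on the held-out rows.
test_mse <- sapply(candidates, function(f) {
  fit <- lm(f, data = train)
  mean((test$Sales - predict(fit, newdata = test))^2)
})

print(sort(test_mse))                  # lowest test MSE first
best <- names(which.min(test_mse))     # name of the best-scoring combination
```

Adding more formulas to `candidates` extends the comparison to as many variable combinations as needed.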
Given a dataset of 699 observations with 11 variables, apparently derived from imaging of breast tissue, we need to train a model to predict whether each observation belongs to the benign or malignant class.
Answer
I used Support Vector Machine models with different kernel functions. For model evaluation, I added a cost matrix based on these assumptions
Conclusion: use model #7, since it has the lowest prediction cost, even though its accuracy (~93%) is below that of other models (~95%).
R script found here: svm_hands_on_breast_cancer.R
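To illustrate how a cost matrix can rank a less accurate model above a more accurate one, here is a small sketch in base R. Everything in it is hypothetical: the ten labels, the two sets of predictions, and the cost values (a missed malignant case costing 10, a false alarm costing 1) are placeholders chosen for the example, not the assumptions or results from the assignment.

```r
# Hypothetical labels for ten held-out cases; not results from the assignment.
lv <- c("benign", "malignant")
actual  <- factor(rep(lv, times = c(6, 4)), levels = lv)
model_a <- factor(c(rep("benign", 6), rep("malignant", 3), "benign"),
                  levels = lv)   # misses one malignant case
model_b <- factor(c(rep("benign", 4), rep("malignant", 6)),
                  levels = lv)   # raises two false alarms

# Hypothetical cost matrix: rows are the actual class, columns the prediction.
costs <- matrix(c( 0, 1,
                  10, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(actual = lv, predicted = lv))

# Accuracy and total cost computed from the same confusion matrix.
evaluate <- function(actual, predicted, costs) {
  confusion <- table(actual = actual, predicted = predicted)
  c(accuracy = sum(diag(confusion)) / sum(confusion),
    cost     = sum(confusion * costs))
}

results <- rbind(model_a = evaluate(actual, model_a, costs),
                 model_b = evaluate(actual, model_b, costs))
print(results)
```

With these placeholder numbers, `model_a` is more accurate (0.9 against 0.8) but `model_b` is cheaper (cost 2 against 10), which is the same kind of trade-off behind preferring a ~93% model over a ~95% one.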
Dataset of 150 observations with 4 variables and a class. The purpose is to predict the Iris species: Setosa, Versicolor, or Virginica.
Answer
I again used Support Vector Machine models. Eyeballing the distribution of the species across pairs of variables suggested that Sepal Width and Sepal Length are good input variables. Of the kernel functions tested, I chose the polynomial kernel with degree 3 and gamma 2.5. Another useful takeaway from this assignment was using the plot feature to visualize observations against predictions.
R script found here: svm_hands_on_flowers.R
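The model described above could be set up roughly as follows. This is a sketch rather than the contents of svm_hands_on_flowers.R: it assumes the `e1071` package for `svm()`, uses R's built-in `iris` data frame, and picks a 70/30 train/test split and a seed that are not stated in the write-up.

```r
library(e1071)  # assumed SVM package; provides svm() and a plot method

# Keep only the two sepal measurements plus the class label.
iris_2d <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]

# Assumed 70/30 train/test split.
set.seed(42)
train_idx <- sample(nrow(iris_2d), size = 105)
train <- iris_2d[train_idx, ]
test  <- iris_2d[-train_idx, ]

# Polynomial kernel with degree 3 and gamma 2.5, as chosen in the text.
fit <- svm(Species ~ Sepal.Length + Sepal.Width, data = train,
           kernel = "polynomial", degree = 3, gamma = 2.5)

# Compare predictions with the true species on the held-out rows.
predicted <- predict(fit, newdata = test)
confusion <- table(actual = test$Species, predicted = predicted)
print(confusion)
accuracy <- mean(predicted == test$Species)
```

Calling `plot(fit, train)` afterwards draws the decision regions over the two sepal variables with the training observations overlaid, which is one way to view observations against predictions.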
Professor Assistants