PREDICTING THE DIABETES PREVALENCE RATE BASED ON FEATURES DERIVED FROM THE TESCO SHOPPERS' PURCHASING BEHAVIOUR IN THE UK DATASET
The objective of this question is to gain insight from a dataset released by Tesco, a large supermarket chain in the UK. The dataset describes the purchasing behaviour of shoppers aggregated at the ward level, giving the fraction of different product types in the overall shopping basket. The features are a subset of the Tesco dataset available at the Tesco Dataset Source, where the various fields are described. The last column is a categorical feature capturing the diabetes prevalence rate in the ward, and your task is to predict this categorical feature from the features derived from the shopping behaviour.
I did a comprehensive check on the dataset before moving on to data cleaning. First, I imported the dataset in Python and checked for null values in every way I could; there are no missing values. There was also no need to tidy up the field types, since all the feature variables are numerical. I dropped the 'area_id' column because it is an identifier rather than a predictor and would otherwise have been wrongly picked up as a feature. I considered assigning it as the row index, but decided to keep the default integer index, which is more intuitive when jumping to a specific row. As a preprocessing step I lowercased the column names and replaced spaces with underscores, and I removed the quotes around the class-variable column name.

After this tidying, I suspected that many columns might be correlated or effectively duplicated, so I plotted a correlation heatmap and immediately saw high correlation between many columns, though I did not remove anything straight away. I first checked for zero-variance features and found none, which was a good sign. I also looked for quasi-constant features; several turned up, but since the dataset is domain specific I chose not to drop any of them, although I learnt a lot while implementing the check. There were no duplicate columns.

After the basic cleaning, I found two columns that are already normalised, h_nutrients_weight and h_nutrients_calories, so I dropped them manually. I also dropped several columns with the _std suffix that showed redundancy: their information gain scores were practically identical and the correlation heatmap showed correlations above 0.9, and I did not want redundant features in the selected subsets. I did the same with the energy_ prefixed columns, deleting almost all of them after cross-checking the heatmap and the IG scores. One column was a 95% confidence interval and arguably irrelevant, but I kept it, since it sat at the bottom of the IG table and would not be selected anyway. I also kept in mind that the data must be normalised before fitting any model and split into training/testing sets. I learnt a great deal about the basics of feature selection and data cleaning from this task, and at the end I had a clean, tidy dataset ready for the ML algorithms.
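A minimal sketch of these cleaning steps is shown below; the file name tesco_ward.csv and the class-column name diabetes_estimate are assumptions for illustration, not the actual names used.

import pandas as pd

df = pd.read_csv('tesco_ward.csv')          # file name assumed

# Check for missing values across all columns (expected to be 0 here).
print(df.isnull().sum().sum())

# Tidy the column names: strip stray quotes, lowercase, replace spaces.
df.columns = (df.columns.str.strip().str.strip('"')
              .str.lower().str.replace(' ', '_'))

# Drop the identifier so it is not treated as a predictor.
df = df.drop(columns=['area_id'])

# Inspect pairwise correlations between the numerical predictors.
corr = df.drop(columns=['diabetes_estimate']).corr()   # class column name assumed
high_corr = (corr.abs() > 0.9) & (corr.abs() < 1.0)
print(corr.columns[high_corr.any()].tolist())           # candidates for manual removal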
I am aware that there are multiple filter methods for finding feature subsets. For this dataset the input predictors are numerical and the class variable (the prediction target) is categorical, so technically f_classif() (ANOVA) is the appropriate test. However, I also ran the mutual information test (mutual_info_classif()) alongside ANOVA to find the best feature subsets for the filter technique. I carried this out in Python and, somewhat surprisingly, both tests returned the same top features. Before applying the filter techniques I carefully split the data using train_test_split. I then used the SelectKBest() function from sklearn.feature_selection to pick the top features for both filter methods, as sketched below. I also repeated this in Weka using its attribute selection panel and obtained the same top features. How did I settle on the top 15 features? Initially I ran the Decision Tree, KNN and Naïve Bayes models with only the top 3 features and got poor accuracy, but with more than 10 features the training accuracy rose above 70%, which is what I expected, in both Python and Weka. I also ran an SVM (linear kernel) on the top-15 subset and reached an accuracy of 80%. Hence I heuristically chose these top 15 features for the filter technique, and I am content with that choice since I tried extensive implementations in both Python and Weka.
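As mentioned, a minimal sketch of the filter selection, assuming the predictors and labels are already loaded into a DataFrame X and a Series y (names assumed):

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)       # random_state assumed

# ANOVA F-test: numerical inputs, categorical target.
anova = SelectKBest(score_func=f_classif, k=15).fit(X_train, y_train)

# Mutual information as a second opinion.
mi = SelectKBest(score_func=mutual_info_classif, k=15).fit(X_train, y_train)

print(X_train.columns[anova.get_support()].tolist())
print(X_train.columns[mi.get_support()].tolist())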
I want to clarify and emphasise that my data was normalised before being split into train/test (test_size = 30%), and the accuracies I report, here and in what follows, are strictly test accuracies. For the wrapper feature selection techniques I went into some depth: I found feature subsets not only for the Decision Tree, Naïve Bayes and KNN, but also wrapped a Random Forest classifier and a linear SVM to look at their subsets, since both gave excellent accuracies in the previous step and I wanted to see how they behaved with the wrapper technique. In Weka this took a long time compared to the filter technique, but in Python I found the right parameter setting for the SequentialFeatureSelector to exploit parallelism (n_jobs=-1), and the results came back noticeably faster. Unlike Weka, in Python we can explicitly specify k_features in advance for the wrapper technique. The long computation time reflects the fact that the wrapper technique runs a heuristic search over feature subsets, at a considerable cost. The wrapper-selected features for the various classifiers are listed below, followed by a sketch of the selector call:
- Sequential Forward Selection (Floating = False):
  - SVM: 'weight_std', 'fat', 'fat_std', 'saturate_std', 'protein', 'f_energy_sugar', 'h_nutrients_weight_norm', 'h_nutrients_calories_norm', 'f_dairy', 'f_fats_oils', 'f_meat_red', 'f_tea_coffee', 'f_eggs_weight', 'f_fish_weight', 'f_meat_red_weight'. Test set accuracy: 79.33%.
  - Decision Tree Classifier: 'sugar', 'protein_std', 'energy_tot', 'f_energy_saturate', 'f_energy_carb', 'energy_density', 'h_nutrients_weight_norm', 'f_beer', 'f_grains', 'f_sauces', 'f_sweets', 'f_water', 'f_eggs_weight', 'f_fish_weight', 'f_sweets_weight'. Test set accuracy: 73.33%.
  - Naïve Bayes: 'weight', 'weight_std', 'volume_std', 'fat', 'fat_std', 'saturate', 'saturate_std', 'sugar', 'protein', 'protein_std', 'carb', 'fibre', 'fibre_std', 'alcohol', 'f_readymade_weight'. Test set accuracy: 62.67%.
  - KNN: 'weight', 'weight_std', 'fat', 'saturate', 'sugar', 'protein_std', 'fibre_std', 'f_energy_protein', 'energy_density', 'f_dairy', 'f_grains', 'f_sauces', 'f_spirits', 'f_fish_weight', 'f_meat_red_weight'. Test set accuracy: 74.67%.
  - Random Forest Classifier: 'weight', 'weight_std', 'fat', 'saturate_std', 'protein', 'protein_std', 'alcohol', 'f_energy_protein', 'h_nutrients_weight_norm', 'f_beer', 'f_grains', 'f_sauces', 'f_soft_drinks', 'f_fish_weight', 'f_meat_red_weight'. Test set accuracy: 77.33%.
- Sequential Backward Selection (Floating = False):
  - SVM: 'weight', 'weight_std', 'saturate_std', 'protein_std', 'f_energy_sugar', 'f_energy_alcohol', 'f_dairy', 'f_grains', 'f_meat_red', 'f_poultry', 'f_spirits', 'f_eggs_weight', 'f_fish_weight', 'f_meat_red_weight', 'f_readymade_weight'. Test set accuracy: 74.00%.
  - Decision Tree Classifier: 'weight_std', 'volume_std', 'fat', 'protein', 'carb', 'f_energy_fat', 'f_energy_sugar', 'f_energy_protein', 'f_energy_carb', 'f_energy_fibre', 'f_grains', 'f_water', 'f_wine', 'f_eggs_weight', 'f_readymade_weight'. Test set accuracy: 74.00%.
  - Naïve Bayes: 'weight_std', 'sugar', 'protein_std', 'carb', 'f_energy_fat', 'f_energy_carb', 'h_nutrients_weight_norm', 'f_fish', 'f_fruit_veg', 'f_grains', 'f_water', 'f_wine', 'f_dairy_weight', 'f_grains_weight', 'f_sweets_weight'. Test set accuracy: 70.67%.
  - KNN: 'fat_std', 'saturate_std', 'carb', 'fibre', 'f_energy_carb', 'f_fish', 'f_poultry', 'f_sauces', 'f_spirits', 'f_water', 'f_wine', 'f_dairy_weight', 'f_eggs_weight', 'f_fish_weight', 'f_readymade_weight'. Test set accuracy: 78.67%.
  - Random Forest Classifier: 'weight_std', 'volume_std', 'saturate_std', 'protein', 'fibre', 'fibre_std', 'f_energy_fat', 'f_energy_sugar', 'f_grains', 'f_sauces', 'f_water', 'f_dairy_weight', 'f_fish_weight', 'f_poultry_weight', 'f_readymade_weight'. Test set accuracy: 80.67%.
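As referenced above, a minimal sketch of one wrapper run (forward, non-floating) with mlxtend's SequentialFeatureSelector, assuming X_train/y_train from the earlier split; the estimator and scoring settings shown are illustrative assumptions:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.svm import SVC

sfs = SFS(SVC(kernel='linear', C=1),   # wrapped estimator (illustrative)
          k_features=15,               # number of features requested in advance
          forward=True,                # set False for backward selection
          floating=False,
          scoring='accuracy',
          cv=5,
          n_jobs=-1)                   # parallelise to cut the long runtime
sfs = sfs.fit(X_train, y_train)
print(sfs.k_feature_names_)            # the selected subset
print(sfs.k_score_)                    # cross-validated score of that subset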
After analysing both sets of feature subsets, I found that although the wrapper subsets are computationally slow to obtain, they carry more meaningful importance because they are selected using an actual model, whereas the filter subsets are evaluated with statistical tests rather than a model-driven approach. In terms of test accuracy, the wrapper method wins: better features were found with both the forward and backward selection strategies. Where the filter technique focuses on the relationship with the target variable, the wrapper technique works hand in hand with another model to produce the feature subsets. I feel it is not about which one is best; each has its use cases depending on the task (dataset) at hand. But for my dataset, the wrapper subsets reach accuracies around 70%, and above 80% in some cases. I was also going to run Recursive Feature Elimination, forward floating, backward floating and the ExhaustiveFeatureSelector for the wrapper technique, but I skipped them because of their high time complexity, since the backward SequentialFeatureSelector already gave viable accuracies. After implementing the wrapper/filter techniques hands-on in Python, I can confidently say I have a much better understanding of what they do internally. I will return to these subsets as we move forward, since ML is an iterative (agile) process rather than a linear (waterfall) one.
Now, having discussed the predictor (independent) variables, let us consider the target (dependent) variable. The dataset provided to me is clearly imbalanced across the classes: the multi-class support is exactly low = 16, mid = 308, high = 76. Does that mean we need to upsample or downsample? I think not, because the evaluation metrics will tell us how well the model predicts the unseen test data, and the labels, although not perfectly balanced, are reasonably dispersed for a small dataset of around 500 rows. As for which evaluation measures to use, it is too early to decide definitively without looking at a classification report, but with a sense of the dataset and the task I jotted down which measures I would need rather than blindly running every available metric. In this dataset both the true positives and the true negatives matter equally, since everyone needs to know whether their ward has a high, mid or low diabetes prevalence rate. So I will first generate a confusion matrix using the Yellowbrick classification visualiser in Python (I used Yellowbrick and sklearn.metrics interchangeably in my implementation), followed by a full classification report (precision, recall, F1 score and support). I will also look at the accuracies, but accuracy alone is somewhat unreliable here because the class labels are imbalanced, even if only mildly. So it is good that we have the test accuracies, but we also need to keep the Type I (false positive) and Type II (false negative) errors as low as possible. For the F1 measure I use micro averaging, since the data is unbalanced and micro averaging accounts for the contribution of every class when computing the average metric. I also use Cohen's Kappa, which suits multi-class classifiers: ordinary metrics tend to be biased towards the majority classes, whereas Cohen's Kappa compares the predicted classes with the actual classes while correcting for chance agreement, returning a value of at most 1 (values near 0 mean chance-level agreement); the higher the Kappa, the better the classifier. I also use the precision-recall curve, where high precision and recall together reduce the chance of a high FPR and FNR: if a classifier's precision and recall are both high, it is producing accurate and positive results. I will additionally consider the area under the curve to check whether the true positives dominate the false positives. All of these measures should, of course, be as high as possible. Last but not least, I will consider the Matthews Correlation Coefficient, which takes every cell of the confusion matrix into account, with +1 representing a perfect prediction.
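Before moving on, here is a minimal sketch of how these measures can be computed with sklearn.metrics, assuming a fitted classifier clf and the held-out split X_test/y_test (variable names assumed):

from sklearn.metrics import (confusion_matrix, classification_report,
                             cohen_kappa_score, matthews_corrcoef, f1_score)

y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))           # precision, recall, F1, support
print('F1 (micro):', f1_score(y_test, y_pred, average='micro'))
print('Cohen kappa:', cohen_kappa_score(y_test, y_pred))
print('MCC:', matthews_corrcoef(y_test, y_pred))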
I want to clarify that the full process I used to evaluate the different wrapper-derived feature subsets on the 5 chosen classifiers, with their different parameter settings and evaluation measures, is difficult to cover completely here because of the page limit. But my hands-on knowledge of evaluating classifiers with different parameter tuning and evaluation measures has only grown because I chose to do this assignment in Python, and I am glad about it. Coming to the task at hand: for the report, I took the transformed feature subset produced by the wrapper technique for each classifier and used it to evaluate that classifier's performance. The parameters I chose are the best parameters found with GridSearchCV, where I set up a pipeline of hyperparameters with stratified cross-validation (k = 10), fitted it, and read off best_params_ for each classifier (a sketch is given at the end of this paragraph). First, I took the forward-SVM feature subset and ran the SVM classifier over its different parameters (C, gamma, kernel); I found that poly with C = 1 behaves like linear, and after trying various combinations, svm.SVC(kernel='linear', C=1) is the best setting for this dataset, with a Cohen's Kappa above 0.60 and good precision-recall values (above 0.80) for all 3 classes. I will give the results in a table at the end; I cannot paste every plot here because of the page limit, but everything is plotted in my Jupyter notebook. Next, I used the backward-Decision-Tree feature subset with a Decision Tree classifier using the 'entropy' criterion rather than 'gini', again because it showed better metrics on my data. Initially the Cohen's Kappa was below 0.50; setting max_depth=3 gave slightly higher accuracy than the default, with the splitter at 'best', and the other parameters made no difference, so I kept them and reached an accuracy of 0.76. However, the precision-recall (micro) score was around 0.70, lower than the SVM, so up to this point the SVM is outperforming the Decision Tree. I then used the backward-wrapper Naïve Bayes features with GaussianNB(). Why Gaussian and not Categorical, Complement or the others? Because MultinomialNB() gave a very poor Cohen's Kappa of around 0.22, whereas with GaussianNB the Kappa and the accuracy rose markedly to 0.44 and 0.726 respectively; Gaussian NB is the natural choice here because the predictors are continuous numerical features. From the start I suspected Naïve Bayes would perform badly on my dataset, since it assumes all features are conditionally independent, and indeed with the forward Naïve Bayes subset the accuracy was even worse (below 0.65), which is why I preferred the backward subsets together with the change of NB variant. So far, SVM is still outperforming both Naïve Bayes and the Decision Tree with their respective subsets. Now let us see how KNN performs!
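A minimal sketch of the grid search, assuming the forward-SVM feature subset is stored in X_train_svm (name assumed) and an illustrative parameter grid:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

param_grid = {'kernel': ['linear', 'rbf', 'poly'],   # illustrative grid
              'C': [0.1, 1, 10],
              'gamma': ['scale', 0.01, 0.1]}

grid = GridSearchCV(SVC(),
                    param_grid,
                    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
                    scoring='accuracy',
                    n_jobs=-1)
grid.fit(X_train_svm, y_train)          # forward-SVM subset of the training data
print(grid.best_params_, grid.best_score_)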
I again used the backward-KNN feature subset, since it gave a more distinctive, domain-specific set of features than the forward-KNN subset. Regarding KNN, accuracy typically improves as n_neighbors increases up to a point, after which it starts to fall away. Surprisingly, that point was 20 for my data, and it gave an accuracy of 82%, which I did not expect: I had assumed SVM would overpower every other classifier, as I personally consider SVM the most powerful of them, but I was astonished to see KNN outperform it. With a Cohen's Kappa above 0.65, KNN was leading; the average precision-recall (micro) was 0.86, which is very good, and the MCC score was also 0.65. Finally, the Random Forest classifier: I used the backward-Random-Forest feature subset and fitted it on a freshly parameterised RandomForestClassifier. I am aware this is an ensemble technique we have not yet covered in class, but I picked it because I wanted to see how it combines a number of decision trees to avoid over-fitting and improve on the accuracy of a single Decision Tree. To my surprise, it nearly matched KNN with an accuracy of 81.33%. Its Cohen's Kappa (0.63) was lower than KNN's, but its F1 score was higher at 0.77, which also means its average PR-curve (micro) value is greater than KNN's, though only by 0.1. A sketch of the n_neighbors sweep follows.
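As referenced above, a minimal sketch of the n_neighbors sweep, assuming the backward-KNN subset is already stored in X_train_knn/X_test_knn (names assumed):

from sklearn.neighbors import KNeighborsClassifier

# Sweep k and report test accuracy; in my run the accuracy peaked around k = 20.
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1).fit(X_train_knn, y_train)
    print(k, round(knn.score(X_test_knn, y_test), 3))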
From the table, the picture is clear: for my dataset, KNeighborsClassifier(n_neighbors=20, n_jobs=-1) is the winner according to the MCC, the Cohen's Kappa score and the accuracy.
After evaluating the performance of the various classifiers with the best parameter settings for my dataset, I saw an interesting result: the feature subset derived by the backward (non-floating) wrapper KNN, fitted with KNN (k = 20), gave the best performance of all. The evaluation measures I selected with care and prior research were also used to their full potential through the Yellowbrick and sklearn libraries. I have already answered parts of this question in the earlier sections, as I wrote the report while implementing everything in Python. This was one really big, time-consuming task, from data cleaning and splitting, through feature selection with different techniques, to evaluating the subsets on various classifiers and collecting the results, and I took on the extra challenge of doing it in Python. All in all it was a great experience, and I learned things beyond the lecture slides that simply cannot be done in Weka. As for the results, apart from KNN I was really expecting SVM to outperform the rest, but it could not, despite my trying all the plausible SVM hyperparameters; I picked SVM in the first place because this is a multi-class classification dataset. I also took on a more challenging classifier, the Random Forest, which we have not even finished covering in class, and its results were satisfying too.
I had created a copy of the original dataset before doing all these implementations, so I split it again into train/test and applied the 5 classifiers with the same configuration settings to see what would happen, and the results were surprising!
I was really not expecting the Decision Tree and Random Forest classifiers trained on the full dataset to outperform the ones trained on the best wrapper-selected features, and I see that only these two do so; for the other classifiers, the accuracy with the selected feature subsets was in fact better than with the whole dataset. Unintentionally, my reason for including the Random Forest paid off, since it gave a remarkable accuracy of 0.826 on the full dataset. A high F1 score in turn means the classifier is working well on the given data. The Cohen's Kappa, MCC and average PR-curve values also explain why Naïve Bayes works poorly in both settings and why KNN performs up to standard with the feature subsets. I could also have used logistic regression and the log_loss metric to draw further distinctions between the classifiers, but the picture is already clear. All in all, I had never explored so many configuration settings for these classifiers before, nor so many evaluation metrics, and I think I used them to their best potential. The most important lesson about feature selection, I believe, is that different classifiers need different features: a feature subset that works well with one classifier may be beaten by another classifier on a different subset. Hence there is no single best feature subset; every subset I generated gives somewhat different results under different evaluation measures, and finding the best classifier with the best subset is a prolonged, open-ended process. Still, I did my best to show that KNN is the best classifier for the selected feature subsets and Random Forest is the best classifier on the whole dataset.
I have plotted ROC curves for the same feature subsets whose metrics I evaluated above. The Receiver Operating Characteristic curve visualises the trade-off between sensitivity and specificity; I discuss ROC curves in detail in the last question. For this task, why does the "high" class show such a high area under its ROC curve? At first I thought I had done something wrong, but then I noticed that the "high" class has far lower support than the dominant "mid" class, which is likely why its AUC comes out higher than that of the other classes.
As we can see, the ROC curve for the "high" class has the largest area under the curve for KNN, SVM and Random Forest. For my dataset, as mentioned earlier, I kept test_size = 0.3, so the support of the "high" class in the test set shrank even further. Conversely, because the support of "mid" is the highest, its AUC is the lowest in every classifier's ROC plot. The micro-average ROC curve for all the classifiers, however, lies close to the top-left corner, where the FPs and FNs approach 0. So, because of the class imbalance, "high" gets the highest AUC compared to the other classes, with the support playing an important role. But for our dataset, is AUC really a valid (best) measure?
The answer is no, and that is why AUC was not on my list of evaluation metrics: for imbalanced multi-class data there is no single fixed threshold, the curves are quite irregular, and for the "high" class the curve almost hugs the edge of the plot. The best example is Naïve Bayes: even though its accuracy is low, its AUC is 0.88, which is very high and makes comparison with the other classifiers misleading. Although I am satisfied with how the ROC plots look, they feel too good to be true, purely because of the imbalanced multi-class labels. ROC curves are best suited to binary classification; for multi-class datasets they tend to overestimate the AUC, and I would not recommend AUC as the main evaluation measure for multi-class classification. For SVM, however, it works well, as is evident when comparing the three algorithms across the three classes, which is one more unintended reason for including SVM. Instead of ROC curves, I think precision-recall curves are the best fit for my dataset, and they also justify my earlier results. A sketch of how the per-class and micro-average ROC scores can be computed is given below.
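As referenced above, a minimal sketch of the one-vs-rest and micro-average ROC computation for the 3 classes, assuming a fitted classifier clf that exposes predict_proba (variable names assumed):

from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

classes = clf.classes_                            # column order of predict_proba
y_test_bin = label_binarize(y_test, classes=classes)
y_score = clf.predict_proba(X_test)               # shape (n_samples, 3)

# One-vs-rest ROC AUC per class.
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    print(cls, 'AUC =', round(auc(fpr, tpr), 3))

# Micro-average: pool all class/score pairs before computing one curve.
fpr_mi, tpr_mi, _ = roc_curve(y_test_bin.ravel(), y_score.ravel())
print('micro-average AUC =', round(auc(fpr_mi, tpr_mi), 3))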
As argued earlier, and now plotted, I feel the precision-recall curve is much more justifiable and suitable for my dataset: for the "high" class the PR AUC is in all cases lower than for the other two classes, and the average precision (micro) reflects both the quality of the output and the trade-off between accurate results and positive results. So, all in all, ROC curves really do overestimate the performance of an algorithm in our case, where the real story is different, whereas the PR curve gives us the right evaluation measure to distinguish between the various classifiers. If this had been a binary classification problem, the ROC curve would have made much more sense. In the end, I enjoyed finding the best feature subsets and classifier model for my dataset, and I am satisfied with its performance. A PR-curve sketch follows.
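A minimal PR-curve sketch under the same assumptions (a fitted clf with predict_proba; variable names assumed):

import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize
from sklearn.metrics import precision_recall_curve, average_precision_score

classes = clf.classes_
y_test_bin = label_binarize(y_test, classes=classes)
y_score = clf.predict_proba(X_test)

# One-vs-rest precision-recall curve and average precision per class.
for i, cls in enumerate(classes):
    prec, rec, _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    ap = average_precision_score(y_test_bin[:, i], y_score[:, i])
    plt.plot(rec, prec, label=f'{cls} (AP = {ap:.2f})')

micro_ap = average_precision_score(y_test_bin, y_score, average='micro')
plt.xlabel('Recall'); plt.ylabel('Precision')
plt.legend(title=f'micro AP = {micro_ap:.2f}')
plt.show()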
Question: Consider the nightmare situation in which you struggled hard to obtain a very high accuracy (>95%) on your training data for a binary classification task, but when your client ran it on their test data, the accuracy was very low (<50%). This is despite the fact that your dataset is reasonably balanced (majority class < 65%) and you are using a fairly complex learning algorithm with many parameters to fit your dataset. How do you explain this situation? What are the possible causes? How can you improve the testing accuracy in this situation? What precautions should you take in your evaluation procedure to avoid it?
Answer: This situation really is a nightmare. I have seen cases where the training accuracy is around 90% and the test accuracy around 75-80%, but a scenario where the training accuracy is above 95% and the test accuracy below 50%, on a reasonably balanced binary classification dataset, is genuinely worrying. It points to a major fault in the train/test procedure or the data cleaning, or to something as simple as using two different kinds of feature subsets for training and testing.

The first cause that comes to mind is over-fitting. When the training accuracy is considerably higher than the test accuracy, the model has effectively memorised patterns that work on the training data, and when it meets new or more complex data it can no longer identify the true positives and true negatives, so the test accuracy collapses. I also note that the scenario mentions a classifier with many parameters: is that really wise? In my earlier work I did the same at first, throwing many parameters at the classifiers; for instance, with SVM I assumed the linear kernel would give low test accuracy, so I started tweaking C and gamma with rbf, sigmoid and polynomial kernels, but when I finally ran the linear kernel the accuracy was better than I expected. So with many parameters one should be very careful: not that multiple parameters should be avoided, but they should be chosen with the dataset and the objective in mind. Another possible cause is the choice of classifier itself; we should not rely on a single algorithm, which is why I compared several classifiers earlier. Other causes include insufficient data cleaning beforehand, and even something as minor as missing standardisation or normalisation: if two variables X and Y differ greatly in scale, classifiers such as KNN can misinterpret them unless the data is normalised.

Still, the most likely reason, most of the time, is that the model is over-fitting: it has been trained in an overly complex way, which produces high variance at test time, and high variance hurts any classification algorithm. To avoid this we can use k-fold cross-validation, a reliable evaluation procedure in which the training data is repeatedly split into a training part and a validation part. With 10-fold cross-validation we fit the model on 9/10 of the folds and predict the held-out 1/10, then repeat for the next fold, and so on; the ten per-fold results are then combined into a single performance estimate. A single train_test_split can also be unrepresentative, which contributes to this kind of over-fitting, so we should always keep the test set untouched and save it for the real hold-out evaluation. A minimal cross-validation sketch follows.
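As referenced above, a minimal sketch of 10-fold cross-validation, assuming X_train/y_train exist and using a linear SVM purely as an illustrative estimator:

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(SVC(kernel='linear', C=1), X_train, y_train, cv=cv)
print(scores.mean(), scores.std())   # a stable estimate before touching the test set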
By using cross-validation we can tune multiple classifiers using only the training set and still have the test set to evaluate truly unseen data at the end. When selecting the best parameters we can also use a GridSearchCV pipeline with stratified k-fold cross-validation, which returns the best parameters without our having to plug in and reason over a huge list by hand. On the subject of parameters, sklearn's defaults are very often a solid starting point. So, all in all, k-fold cross-validation is the best way out of this situation. If the test accuracy is still low, we can use more training data if the dataset is big enough; in my view 70:30 and 80:20 are sensible splits. And of course there is filter/wrapper feature selection to pick better feature subsets, or PCA to remove unwanted dimensions. More involved remedies such as early stopping, regularisation and ensembling are the next resort if nothing else works. A leakage-safe pipeline sketch is shown below.
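As mentioned above, a minimal sketch in which the scaling step is kept inside a pipeline so that tuning only ever sees the training data and the test set is touched once at the end; the component choices are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([('scale', StandardScaler()),          # fitted inside each CV fold
                 ('knn', KNeighborsClassifier())])

grid = GridSearchCV(pipe,
                    {'knn__n_neighbors': list(range(5, 31, 5))},
                    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
                    n_jobs=-1)
grid.fit(X_train, y_train)                              # tuning uses only the training data
print(grid.best_params_, grid.score(X_test, y_test))    # single final hold-out check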
Question: What is a ROC curve? What is the motivation behind using it? How do you interpret it? How do you use it to compare two different classification approaches? Why is the reference line considered to correspond to a random classifier?
Answer: The Receiver Operating Characteristic (ROC) curve is a graphical plot used as an evaluation measure, predominantly for binary classification. It has the True Positive Rate (TPR), also called sensitivity or recall, on its Y-axis and the False Positive Rate (FPR), also called the probability of a false alarm, on its X-axis. The curve is traced out by sweeping the classification threshold over the predicted probabilities and plotting the TPR against the FPR at each threshold; the area under this curve can be read as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one.
Take the example of logistic regression used to classify a group of people into those infected by Ebola (positive) and those not infected (negative). Since logistic regression is a probabilistic classifier, we can pick a classification threshold of, say, 0.6: a predicted probability > 0.6 means "has Ebola" and ≤ 0.6 means "does not". Imagine a plot with the people on the X-axis and the predicted probability against the threshold on the Y-axis: marking the TPs, FPs, TNs and FNs shows which samples fall where for that particular threshold. But what if we want the confusion matrix at every threshold? We would have to build a confusion matrix for each possible threshold, which quickly becomes an unmanageable amount of evaluation work. Instead of evaluating each threshold separately, the ROC curve summarises all of them in a single graph of sensitivity versus (1 − specificity): for every threshold we compute the TPR and FPR and plot them, obtaining a rising probabilistic curve. Next, I show how to plot one using a random binary classification sample and 4 algorithms; I will explain below why I used 4.
I plotted an ROC curve for 4 classification algorithms that are commonly used with predict_proba. I first built a sample binary classification dataset, fitted it with the 4 classifiers, predicted the test probabilities, computed the FPRs and TPRs from those probabilities and the test labels, and finally plotted TPR against FPR (a sketch follows). How do we interpret this plot? We use the area under the curve (AUC), which summarises the balance of true positives against false positives and gives the overall performance of the classification model: the higher the AUC, the better. In the plot I produced, the Random Forest had the highest AUC on this dataset and the Decision Tree the lowest, so we can infer that the Random Forest performs best on the tested data.
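As referenced above, a minimal sketch of such a comparison; the synthetic dataset and the particular choice of four classifiers here are assumptions for illustration:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# A random binary classification sample.
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.3, random_state=42)

models = {'Logistic Regression': LogisticRegression(max_iter=1000),
          'Naive Bayes': GaussianNB(),
          'Decision Tree': DecisionTreeClassifier(random_state=42),
          'Random Forest': RandomForestClassifier(random_state=42)}

# No-skill reference line along the diagonal.
plt.plot([0, 1], [0, 1], 'b--', label='No-skill reference (AUC = 0.5)')

for name, model in models.items():
    probs = model.fit(trainX, trainy).predict_proba(testX)[:, 1]   # P(class = 1)
    fpr, tpr, _ = roc_curve(testy, probs)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc(fpr, tpr):.2f})')

plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate')
plt.legend(); plt.show()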
We should also consider the steepness of the curve: Naïve Bayes has the steepest initial rise, meaning its TPR climbs quickly while its FPR stays low. I used 4 algorithms precisely to show that the ROC curve, together with the AUC, lets us compare the performance of two or more classification models at once.
As we can see in the plot above, the blue dashed (--) diagonal line is the reference line: the baseline produced by a no-skill prediction (r_probs, e.g. [0 for _ in range(len(testy))]). It is considered to correspond to a random classifier because, as the "code" above suggests, it assigns the class with no regard for the input; it is also called the random-chance line or a no-skill classifier. Its AUC is 0.5, and for the ROC curve this baseline is fixed regardless of the class distribution, whereas for the precision-recall curve (as in my earlier analysis) the baseline depends on the class balance and is given by y = P / (P + N), where P and N are the numbers of positive and negative examples. A model performs best when its curve reaches towards the top-left corner of the plot, where the TPR is 1 and the FPR is 0: the greater the bow of the curve, the better the model, and anything below the reference line is the kind of nightmare situation discussed in the previous question. I will say again that precision-recall curves should be preferred over ROC curves on imbalanced data, because the ROC curve can overestimate a model's performance: the FPR calculation is dominated by the abundant true negatives. The ROC curve is best suited to balanced, binary problems like the example above. This goes back to the earlier observation that the "high" class showed a higher ROC AUC for nearly all the classification models, whereas the PR curve gave the more faithful picture.
To finally conclude,
“I feel that whatever algorithm it may be, there is always a limitation and there is always a future work awaiting to solve it. Nothing is too perfect; nothing is too bad!”
\\\\\\\\\\ THANK - YOU //////////
PROJECT CREATED BY - Prashant Wakchaure
Email ID - prashant900555@gmail.com
Contact No. - +373 892276183