Python implementations of Naïve Bayes (NB) and K-Nearest Neighbor (KNN) classifiers on the MCS dataset, exploring advanced techniques to enhance ML performance. Developed during the uOttawa 2023 ML course.

RimTouny/Feature-Selection

Mobile Crowd Sensing (MCS) Data Analysis with NB and KNN Classifiers Using Different Feature Selection Methods

This repository contains Python implementations of Naïve Bayes (NB) and K-Nearest Neighbor (KNN) classifiers applied to the MCS dataset. The work explores advanced techniques for improving machine learning performance and was carried out during the 2023 uOttawa ML course.

  • Required libraries: scikit-learn, pandas, matplotlib.
  • Execute cells in a Jupyter Notebook environment.
  • The uploaded code has been executed and tested successfully within the Google Colab environment.

Binary-class classification problem

The task is to classify the legitimacy status of entries in the MCS dataset: Legitimate or Fake.

Independent Variables:

  • Features include ID, Latitude, Longitude, Day, Hour, Minute, Duration, RemainingTime, Resources, Coverage, OnPeakHours, GridNumber.

Target variable:

  • 'Legitimacy' column represents the target with two classes: 'Legitimate' and 'Fake'.

Key Tasks Undertaken

  1. Dataset Splitting based on 'Day' Feature:

    • Created training (days 0, 1, 2) and test (day 3) datasets from the 'Day' feature values.
  2. Baseline Performance of NB and KNN:

    • Presented confusion matrices and F1 scores as baseline performance measures for both classifiers.

      • Bernoulli Naive Bayes

      • K-Nearest Neighbors

      • 2D t-SNE plots for the training and test sets
  3. Dimensionality Reduction (DR) using PCA and Auto Encoder (AE):

    • Explored PCA and AE methods to determine the optimal reduced dimensionality based on F1 scores on the test dataset.

    • Plotted the number of components vs. F1 score for both classifiers, showcasing the best performance.

      • Maximum F1 score (PCA + Bernoulli Naive Bayes): 93.31858407079646
      • Best n_components (PCA + Bernoulli Naive Bayes): 10
      • Maximum F1 score (PCA + K-Nearest Neighbors): 94.81165600568585
      • Best n_components (PCA + K-Nearest Neighbors): 2
  4. Feature Selection with Filter and Wrapper Methods:

    • Applied filter methods (Information Gain, Mutual Information, Variance Threshold, and Chi-Square) to determine the optimal number of features, and analyzed how the F1 score varies with the number of features in order to improve on the baseline performance.

    • Applied wrapper methods (Forward Feature Selection, Backward Feature Elimination, and Recursive Feature Elimination) to evaluate feature relevance, and examined how the F1 score varies with the number of features in order to improve on the baseline performance.

    • Visualized the results of the best-performing selection method with 2D t-SNE plots of the training and test datasets.

  5. Clustering Analysis using Latitude and Longitude:

    • Explored clustering methods (K-means, SOFM, DBSCAN) on latitude and longitude features to identify legitimate-only clusters.
    • Plotted, for each algorithm, the total number of members in legitimate-only clusters against the number of clusters.
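
The legitimate-only cluster count from task 5 can be sketched as follows. This is a minimal illustration on random coordinates, not the real latitude and longitude values; the function name `legit_only_members`, the cluster counts, and the DBSCAN parameters are assumptions. SOFM is left out because scikit-learn does not provide a self-organizing map.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def legit_only_members(labels, is_legit):
    """Count samples that fall in clusters containing only legitimate samples."""
    total = 0
    for cluster in np.unique(labels):
        if cluster == -1:  # DBSCAN marks noise with -1; noise is not a cluster
            continue
        members = is_legit[labels == cluster]
        if members.all():
            total += len(members)
    return total


rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(300, 2))  # stand-in for (Latitude, Longitude)
is_legit = rng.random(300) < 0.85          # stand-in for the Legitimacy labels

# K-means: repeat for several cluster counts to get one point per count.
kmeans_counts = {
    k: legit_only_members(
        KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords), is_legit
    )
    for k in (8, 12, 16, 20)
}

# DBSCAN chooses the number of clusters itself, so eps is what gets varied.
dbscan_labels = DBSCAN(eps=0.05, min_samples=3).fit_predict(coords)
dbscan_count = legit_only_members(dbscan_labels, is_legit)
print(kmeans_counts, dbscan_count)
```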