Mobile Crowd Sensing (MCS) Data Analysis with NB and KNN Classifiers Using Different Feature Selection Methods

This repository contains Python implementations of Naïve Bayes (NB) and K-Nearest Neighbor (KNN) classifiers applied to the MCS dataset. The work explores dimensionality reduction, feature selection, and clustering techniques to improve classifier performance, and was carried out during the 2023 uOttawa ML course.

Required libraries: scikit-learn, pandas, matplotlib.
Run the cells in a Jupyter Notebook environment.
The uploaded code has been executed and tested successfully in Google Colab.

Binary classification problem

The task is to classify the legitimacy status of each entry in the MCS dataset: Legitimate or Fake.

Independent Variables:

Features include ID, Latitude, Longitude, Day, Hour, Minute, Duration, RemainingTime, Resources, Coverage, OnPeakHours, GridNumber.

Target variable:

The 'Legitimacy' column is the target, with two classes: 'Legitimate' and 'Fake'.

Key Tasks Undertaken

Dataset Splitting based on 'Day' Feature:
- Created training (days 0, 1, 2) and test (day 3) datasets based on the values of the 'Day' feature.
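
The day-based split can be sketched as follows. This is a minimal illustration, assuming the data is loaded into a pandas DataFrame; the tiny `df` below is a synthetic stand-in (only the column names 'Day', 'Hour', and 'Legitimacy' come from the dataset description), and the variable names are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the MCS data (assumption: the real data has the
# same 'Day' and 'Legitimacy' columns, plus the other listed features).
df = pd.DataFrame({
    "Day": [0, 1, 2, 3, 3, 1],
    "Hour": [4, 9, 13, 7, 22, 18],
    "Legitimacy": ["Legitimate", "Fake", "Legitimate",
                   "Legitimate", "Fake", "Legitimate"],
})

# Days 0, 1, 2 form the training set; day 3 forms the test set.
train_df = df[df["Day"].isin([0, 1, 2])]
test_df = df[df["Day"] == 3]

# Separate the features from the target column.
X_train = train_df.drop(columns=["Legitimacy"])
y_train = train_df["Legitimacy"]
X_test = test_df.drop(columns=["Legitimacy"])
y_test = test_df["Legitimacy"]

print(len(train_df), "training rows,", len(test_df), "test rows")
```
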
Baseline Performance of NB and KNN:
- Presented confusion matrices and F1 scores as baseline performance measures for both classifiers.
  - Bernoulli Naive Bayes
  - K-Nearest Neighbors
  - 2D t-SNE plots for the training and test sets
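
A baseline evaluation of this kind might look like the sketch below. It is not the notebook's code: the data is generated with `make_classification` as a placeholder for the MCS features, and the standardization step and `n_neighbors=5` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Placeholder data with 12 features (assumption: stands in for the MCS set).
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "BernoulliNB": BernoulliNB(),  # binarizes each feature at 0 by default
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    results[name] = {
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
        "f1": f1_score(y_test, y_pred),
    }
    print(name, "F1:", round(results[name]["f1"], 4))
    print(results[name]["confusion_matrix"])
```
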
Dimensionality Reduction (DR) using PCA and Autoencoder (AE):
- Explored PCA and AE methods to determine the optimal number of reduced dimensions based on F1 scores on the test dataset.
- Plotted the number of components vs. F1 score for both classifiers, highlighting the best-performing setting.
  - PCA + Bernoulli Naive Bayes: maximum F1 score of 93.32% at n_components = 10
  - PCA + K-Nearest Neighbors: maximum F1 score of 94.81% at n_components = 2
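
The PCA sweep described above can be sketched as a loop over `n_components` that records the test F1 score for each setting. The data here is synthetic, the classifier settings are assumptions, and the autoencoder variant is omitted because it needs a deep learning library that is not listed in the requirements.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Placeholder data (assumption: stands in for the 12 MCS features).
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           n_redundant=0, random_state=0)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Try every possible number of principal components and record the test F1.
f1_by_k = {}
for k in range(1, X_train_s.shape[1] + 1):
    pca = PCA(n_components=k, random_state=0).fit(X_train_s)
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(pca.transform(X_train_s), y_train)
    f1_by_k[k] = f1_score(y_test, clf.predict(pca.transform(X_test_s)))

best_k = max(f1_by_k, key=f1_by_k.get)
print("Best n_components:", best_k, "F1:", round(f1_by_k[best_k], 4))
```

Plotting `f1_by_k` with matplotlib gives the components vs. F1 curve. Selecting `best_k` on the test set follows the description above; a separate validation split is the more common choice when an unbiased test estimate is needed.
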
Feature Selection with Filter and Wrapper Methods:
- Explored filter methods (Information Gain, Mutual Information, Variance Threshold, and Chi-Square) to determine the optimal number of features, and analyzed how F1 scores vary with the number of selected features, improving on the baseline performance.
- Employed wrapper methods (Forward Feature Selection, Backward Feature Elimination, and Recursive Feature Elimination) to evaluate feature relevance. Investigated the relationship between the number of features and F1 scores, again improving on the baseline performance.
- Visualized the results with 2D t-SNE plots of the training and test datasets, using the best-performing selection method.
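
As a rough sketch of how filter and wrapper selection can be compared with scikit-learn, the example below sweeps the number of features for a chi-square filter and evaluates a few wrapper selectors at a fixed size. The synthetic data, the min-max scaling (chi-square needs non-negative inputs), the use of `LogisticRegression` inside RFE, and the choice of 4 features are all assumptions for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import (RFE, SelectKBest,
                                       SequentialFeatureSelector, chi2,
                                       mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Placeholder data (assumption: stands in for the MCS features).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

scaler = MinMaxScaler().fit(X_train)  # training features scaled to [0, 1]
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)


def f1_with_selector(selector):
    """Fit the selector on the training set, then score KNN on the test set."""
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(selector.fit_transform(X_train_s, y_train), y_train)
    return f1_score(y_test, knn.predict(selector.transform(X_test_s)))


# Filter method: F1 score as a function of the number of chi-square features.
filter_f1 = {k: f1_with_selector(SelectKBest(chi2, k=k))
             for k in range(1, X.shape[1] + 1)}

# One more filter and three wrapper selectors, each keeping 4 features.
knn = KNeighborsClassifier(n_neighbors=5)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selectors = {
    "mutual_info": SelectKBest(partial(mutual_info_classif, random_state=0), k=4),
    "forward": SequentialFeatureSelector(knn, n_features_to_select=4,
                                         direction="forward", cv=3),
    "backward": SequentialFeatureSelector(knn, n_features_to_select=4,
                                          direction="backward", cv=3),
    "rfe": rfe,
}
selector_f1 = {name: f1_with_selector(sel) for name, sel in selectors.items()}

print(filter_f1)
print(selector_f1)
```
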
Clustering Analysis using Latitude and Longitude:
- Explored clustering methods (K-means, SOFM, DBSCAN) on latitude and longitude features to identify legitimate-only clusters.
- Plotted the total number of members in legitimate-only clusters against the number of clusters for each algorithm.
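
One way to count the members of legitimate-only clusters is sketched below for K-means and DBSCAN. The coordinates and labels are randomly generated placeholders, the helper `legit_only_members` is a hypothetical name, and the `eps` and `min_samples` values are arbitrary. SOFM is left out because scikit-learn does not provide it.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Synthetic latitude/longitude pairs and legitimacy flags (assumption:
# placeholders for the 'Latitude', 'Longitude', and 'Legitimacy' columns).
rng = np.random.default_rng(0)
coords = np.column_stack([
    rng.uniform(45.40, 45.50, size=500),    # latitude
    rng.uniform(-75.75, -75.65, size=500),  # longitude
])
is_legit = rng.random(500) < 0.85


def legit_only_members(labels, is_legit):
    """Total number of points in clusters that contain no fake entries."""
    total = 0
    for cluster in np.unique(labels):
        if cluster == -1:  # DBSCAN marks noise points with label -1
            continue
        mask = labels == cluster
        if is_legit[mask].all():
            total += int(mask.sum())
    return total


# K-means: vary the number of clusters and count legitimate-only members.
kmeans_counts = {}
for k in (4, 8, 16, 32):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    kmeans_counts[k] = legit_only_members(labels, is_legit)

# DBSCAN: the number of clusters follows from eps and min_samples.
dbscan_labels = DBSCAN(eps=0.01, min_samples=5).fit_predict(coords)
dbscan_count = legit_only_members(dbscan_labels, is_legit)

print(kmeans_counts, dbscan_count)
```
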