Classifies wines using K-Nearest Neighbors (KNN) and Random Forest with chemical attributes from the Wine dataset. Includes Exploratory Data Analysis, preprocessing, and hyperparameter tuning, achieving over 95% accuracy.


Wine Dataset Classification and Analysis using K-Nearest Neighbors (KNN) & Random Forest

This project classifies wine samples based on their chemical composition using the K-Nearest Neighbors (KNN) and Random Forest algorithms. We explore the dataset to understand feature distributions, evaluate correlations between attributes, and apply data transformations for optimal model performance. This repository provides the code, analysis, and insights gained from applying these machine learning techniques to the Wine dataset.

Table of Contents

  1. Project Overview
  2. Data Description
  3. Methodology
  4. Results
  5. Conclusion
  6. References
  7. Images and Visuals
  8. Installation and Usage

Project Overview

In this study, we utilize the Wine dataset to classify wines into categories based on their chemical attributes. Through extensive exploratory data analysis (EDA) and preprocessing techniques, we prepare the data to ensure optimal model performance. The analysis includes normalization, standardization, and parameter tuning, providing a structured workflow for machine learning classification.

Data Description

The Wine dataset, obtained from the UCI Machine Learning Repository, contains 13 features representing chemical attributes of different wine samples, plus a class label identifying three distinct wine types.

Key features include:

  • Alcohol
  • Malic acid
  • Ash
  • Alcalinity of ash
  • Magnesium
  • Total phenols
  • Flavanoids
  • Nonflavanoid phenols
  • Proanthocyanins
  • Color intensity
  • Hue
  • OD280/OD315 of diluted wines
  • Proline

Each sample is labeled as one of three wine types, facilitating supervised classification.
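The same dataset is also bundled with scikit-learn, so it can be loaded without a manual download. A minimal loading sketch (column names follow scikit-learn's `load_wine` convention, which may differ slightly from the spellings above):

```python
from sklearn.datasets import load_wine
import pandas as pd

# Load the Wine dataset: 178 samples, 13 chemical features, 3 classes
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target

print(df.shape)  # (178, 14): 13 features plus the class label
print(df["target"].value_counts().sort_index())
```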

Methodology

Exploratory Data Analysis

We conducted a detailed EDA to explore feature distributions, check for outliers, and analyze correlations:

  1. Distribution Analysis: Understanding the spread and skew of data in each feature.
  2. Correlation Analysis: Identifying correlations among features (e.g., Total Phenols and Flavanoids) to gauge their influence on wine classification.
  3. Data Separation: Examining features like Proline and Color Intensity, which exhibit distinct separation across classes.
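The correlation step above can be sketched with pandas; the feature pair shown (Total Phenols and Flavanoids) is the one called out in the text, and the printed value is for illustration:

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Pairwise Pearson correlations between all 13 features
corr = df.corr()

# Total phenols and flavanoids are strongly positively correlated
print(corr.loc["total_phenols", "flavanoids"].round(3))

# Per-feature skewness, used in the distribution analysis
print(df.skew().round(2))
```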

Data Transformation

Normalization
To standardize the scale across features, we normalized data to a range of [0, 1], ensuring uniform weight across variables during distance calculations in KNN.

Standardization
For further uniformity, we standardized the dataset to a mean of 0 and a standard deviation of 1, which is crucial in distance-based models like KNN because it prevents features with large numeric ranges (such as Proline) from dominating the distance metric.
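Both transformations map directly onto scikit-learn preprocessors; a minimal sketch:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = load_wine().data

# Normalization: rescale each feature to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: shift each feature to mean 0, standard deviation 1
X_std = StandardScaler().fit_transform(X)

print(X_norm.min(), X_norm.max())  # 0.0 1.0
print(np.abs(X_std.mean(axis=0)).max() < 1e-8)  # True: zero mean per feature
```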

Model Training and Evaluation

Using KNeighborsClassifier from scikit-learn, we configured key parameters:

  • n_neighbors: Number of nearest neighbors for classification.
  • weights: Influence of distance on neighbor selection.
  • metric: Distance calculation method (e.g., Euclidean).
  • p: Power parameter for Minkowski distance.

Following data preparation, we applied KNN and evaluated the model’s performance on both normalized and standardized datasets.
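A sketch of this training and evaluation workflow, assuming a 70/30 stratified split and illustrative parameter values (the project's exact settings may differ). A Random Forest baseline is included for comparison, matching the repository title:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so the test set never leaks into fitting
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, weights="distance", metric="minkowski", p=2),
)
knn.fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))

# Tree ensembles are scale-invariant, so no scaler is needed here
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
rf_acc = accuracy_score(y_test, rf.predict(X_test))

print(f"KNN accuracy:           {acc:.3f}")
print(f"Random Forest accuracy: {rf_acc:.3f}")
```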

Results

With optimal parameter tuning and preprocessing, both normalization and standardization produced accuracies above 95%, suggesting that:

  1. Data Separability: The dataset's inherent feature separability contributed to model accuracy.
  2. Preprocessing: Effective scaling enhanced KNN’s ability to generalize.
  3. Hyperparameter Selection: Carefully selected hyperparameters like n_neighbors and weights were critical to achieving high accuracy.
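The hyperparameter selection described above can be sketched with a cross-validated grid search; the grid shown here is an assumed, illustrative search space rather than the project's exact one:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],  # Manhattan vs. Euclidean Minkowski distance
}

search = GridSearchCV(pipe, grid, cv=StratifiedKFold(5), scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```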

Conclusion

This study confirms that feature separability, appropriate preprocessing, and hyperparameter tuning significantly impact classification accuracy in KNN models. Our findings support the efficacy of KNN in distinguishing wine types based on chemical composition.


References

  • Aich S., Al-Absi A.A., et al. (2018) - A classification approach using various feature sets for predicting wine quality using machine learning.
  • Arauzo-Azofra A., et al. (2011) - Feature selection methods in classification problems.
  • [Further references related to this study]

Images and Visuals

  • ANOVA F-Value
  • Distribution Analysis
  • Correlation Heatmap
  • Feature Distribution
  • Model Performance
  • Final Result

Installation and Usage

To set up the project locally, follow these steps:

  1. Clone this repository

    git clone https://github.com/yourusername/wine-dataset-analysis.git
  2. Install dependencies
    Use the requirements.txt file to install the necessary libraries:

    pip install -r requirements.txt
  3. Run the Jupyter Notebook
    Launch Jupyter Notebook or JupyterLab and open the main analysis notebook to execute the code.
