This project focuses on classifying wine samples based on their chemical composition using the K-Nearest Neighbors (KNN) algorithm. We explore the dataset to understand feature distributions, evaluate data correlations, and implement data transformations for optimal model performance. This repository provides code, analysis, and insights gained from applying machine learning techniques to the Wine dataset.
- Project Overview
- Data Description
- Methodology
- Results
- Conclusion
- References
- Images and Visuals
- Installation and Usage
In this study, we utilize the Wine dataset to classify wines into categories based on their chemical attributes. Through extensive exploratory data analysis (EDA) and preprocessing techniques, we prepare the data to ensure optimal model performance. The analysis includes normalization, standardization, and parameter tuning, providing a structured workflow for machine learning classification.
The Wine dataset, obtained from the UCI Machine Learning Repository, contains 13 features representing chemical attributes of different wine samples, plus a class label identifying three distinct wine types.
Key features include:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
Each sample is labeled as one of three wine types, facilitating supervised classification.
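As a quick orientation, the data can be inspected with a few lines of Python. This is a minimal sketch that assumes the copy of the UCI Wine data bundled with scikit-learn (`load_wine`) matches the file used in this repository; the notebook itself may load the data differently.

```python
from sklearn.datasets import load_wine

# Load the copy of the UCI Wine data that ships with scikit-learn
# (assumption: equivalent to the dataset used in this project).
wine = load_wine()
X, y = wine.data, wine.target

print(X.shape)             # (n_samples, n_features)
print(wine.feature_names)  # the 13 chemical attributes
print(wine.target_names)   # the three wine classes
```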
We conducted a detailed EDA to explore feature distributions, check for outliers, and analyze correlations:
- Distribution Analysis: Understanding the spread and skew of data in each feature.
- Correlation Analysis: Identifying correlations among features (e.g., Total Phenols and Flavanoids) to gauge their influence on wine classification.
- Data Separation: Examining features like Proline and Color Intensity, which exhibit distinct separation across classes.
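The correlation and separation checks listed above could be sketched roughly as follows. The feature names (`total_phenols`, `flavanoids`, `proline`) are those used by scikit-learn's `load_wine`, which is an assumption about how the data is loaded here; the notebook may use plots rather than these summary numbers.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target
names = list(wine.feature_names)

# Pearson correlation matrix across the features (columns are variables).
corr = np.corrcoef(X, rowvar=False)
i, j = names.index("total_phenols"), names.index("flavanoids")
phenols_flavanoids_corr = corr[i, j]
print(f"total_phenols vs flavanoids: r = {phenols_flavanoids_corr:.2f}")

# Per-class mean of proline as a rough indicator of class separation.
k = names.index("proline")
proline_means = [X[y == c, k].mean() for c in np.unique(y)]
print("mean proline per class:", np.round(proline_means, 1))
```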
Normalization
To put all features on a common scale, we min-max normalized the data to the range [0, 1], so that each feature contributes comparably to the distance calculations in KNN.
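A minimal sketch of this step, assuming scikit-learn's `MinMaxScaler` is used (it maps each feature x to (x - min) / (max - min)). For brevity the scaler is fit on the full dataset; in a train/test workflow it would normally be fit on the training split only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)

# Rescale every feature independently to the range [0, 1].
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)

print(X_norm.min(axis=0))  # per-feature minimum after scaling
print(X_norm.max(axis=0))  # per-feature maximum after scaling
```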
Standardization
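We also standardized the dataset so that each feature has a mean of 0 and a standard deviation of 1. This matters in distance-based models like KNN, where features with large numeric ranges (such as Proline) would otherwise dominate the distance. Standardization is also generally less distorted by extreme values than min-max scaling, although it does not remove outliers.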
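The same idea as a short sketch, assuming scikit-learn's `StandardScaler` (z = (x - mean) / std, computed per feature). As with normalization, a real evaluation would fit the scaler on training data only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Center each feature at 0 and scale it to unit standard deviation.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(np.round(X_std.mean(axis=0), 6))  # approximately 0 for every feature
print(np.round(X_std.std(axis=0), 6))   # approximately 1 for every feature
```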
Using KNeighborsClassifier from scikit-learn, we configured key parameters:
- n_neighbors: Number of nearest neighbors for classification.
- weights: How neighbors' votes are weighted (uniformly, or by inverse distance so that closer neighbors count more).
- metric: Distance calculation method (e.g., Euclidean).
- p: Power parameter for Minkowski distance.
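To make these parameters concrete, here is one possible configuration. The specific values (`n_neighbors=5`, `weights="distance"`, `p=2`), the 70/30 split, and the random seed are illustrative assumptions, not the settings used in the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Hold out 30% of the samples for testing, preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Illustrative values only: 5 neighbors, inverse-distance voting,
# Minkowski metric with p=2 (equivalent to Euclidean distance).
knn = KNeighborsClassifier(
    n_neighbors=5, weights="distance", metric="minkowski", p=2
)

# Scaling inside the pipeline is fit on the training split only.
model = make_pipeline(StandardScaler(), knn)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Setting `p=1` instead would switch the Minkowski metric to Manhattan distance.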
Following data preparation, we applied KNN and evaluated the model’s performance on both normalized and standardized datasets.
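One way to compare the two scaling approaches is cross-validated accuracy with the scaler placed inside a pipeline. This is a sketch under assumptions: 5-fold stratified cross-validation, `n_neighbors=5`, and a fixed seed are choices made here for illustration and may differ from the notebook's evaluation protocol.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_wine(return_X_y=True)

# Same folds for both variants so the comparison is like-for-like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, scaler in [("normalized", MinMaxScaler()),
                     ("standardized", StandardScaler())]:
    # The scaler is refit on each training fold, avoiding leakage.
    pipe = make_pipeline(scaler, KNeighborsClassifier(n_neighbors=5))
    fold_scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, acc in scores.items():
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```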
With optimal parameter tuning and preprocessing, both normalization and standardization produced accuracies above 95%, suggesting that:
- Data Separability: The dataset's inherent feature separability contributed to model accuracy.
- Preprocessing: Effective scaling enhanced KNN’s ability to generalize.
- Hyperparameter Selection: Carefully selected hyperparameters like n_neighbors and weights were critical to achieving high accuracy.
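A hyperparameter search of the kind described above might be set up with `GridSearchCV`. The grid values below are hypothetical and chosen only for illustration; the source does not state which values were searched or which were selected.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical search grid; step names follow make_pipeline's
# lowercase-class-name convention.
param_grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11],
    "kneighborsclassifier__weights": ["uniform", "distance"],
    "kneighborsclassifier__p": [1, 2],
}

# 5-fold cross-validated grid search over all combinations.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```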
This study confirms that feature separability, appropriate preprocessing, and hyperparameter tuning significantly impact classification accuracy in KNN models. Our findings support the efficacy of KNN in distinguishing wine types based on chemical composition.
- Aich S., Al-Absi A.A., et al. (2018) - A classification approach using various feature sets for predicting wine quality using machine learning.
- Arauzo-Azofra A., et al. (2011) - Feature selection methods in classification problems.
To set up the project locally, follow these steps:
- Clone this repository:
  git clone https://github.com/yourusername/wine-dataset-analysis.git
- Install dependencies. Use the requirements.txt file to install the necessary libraries:
  pip install -r requirements.txt
- Run the Jupyter Notebook. Launch Jupyter Notebook or JupyterLab and open the main analysis notebook to execute the code.