Breast Cancer Classification Model: End to End Machine Learning Pipeline

This project builds a breast cancer classification model using structured numerical data consisting of diagnostic measurements such as mean_radius, mean_texture, and mean_area. These measurements, along with others, are used to predict the diagnosis—either benign or malignant.

A range of machine learning models were developed, and the Random Forest model was identified as the best performer based on metrics like accuracy, precision, recall, and F1-score. Additionally, a KNIME Analytics workflow ensures reproducibility. The dataset is tabular, with continuous numerical features and a binary target, making it ideal for supervised learning tasks.

All results and insights are available in the Jupyter notebook and can be reproduced using the KNIME workflow. A sample prediction on unseen data is also demonstrated. The dataset is available upon request.

Breast Cancer Classification Model

This project uses diagnostic measurements to classify breast cancer as either benign or malignant. Through various models such as Random Forest, Logistic Regression, SVM, Decision Tree, and k-NN, we aim to identify the most accurate classification method. The Random Forest model performed best and was saved for future predictions, tested on unseen data to ensure reliability.

Project Structure & Highlights

Technologies Used: Python, Jupyter Notebook, KNIME Analytics
Dataset: Includes continuous numerical measurements like mean_radius, mean_texture, and mean_smoothness
Data Type: Tabular data with numerical features and a binary categorical target (diagnosis)
Skills Demonstrated:
- Data cleaning, handling missing data, and detecting duplicates
- Model evaluation using metrics (accuracy, precision, recall, F1-score)
- Prediction on unseen data with trained Random Forest model
- Workflow automation and reproducibility using KNIME Analytics
- Dataset available upon request

Results Overview

Model	Accuracy	Precision	Recall	F1 Score	Confusion Matrix
Logistic Regression	0.912	0.931	0.931	0.931	[[37, 5], [5, 67]]
Decision Tree	0.886	0.954	0.861	0.905	[[39, 3], [10, 62]]
Random Forest	0.921	0.957	0.917	0.936	[[39, 3], [6, 66]]
SVM	0.895	0.885	0.958	0.920	[[33, 9], [3, 69]]
k-NN	0.895	0.905	0.931	0.918	[[35, 7], [5, 67]]

How to Explore the Project

Interactive Jupyter Notebook:
Access the complete code and analysis here on Google Colab to explore the models, results, and predictions.
KNIME Analytics Workflow:
Use the KNIME workflow for reproducibility and automation.

Project Workflow

Data Cleaning & Preparation
- Checked for missing values and duplicates
- Standardized data types to ensure smooth processing
Exploratory Data Analysis (EDA)
- Examined summary statistics and the median of the variables
- Verified data distribution and relationships between features
Model Training & Evaluation
- Models trained: Logistic Regression, Decision Tree, Random Forest, SVM, and k-NN
- Evaluated on accuracy, precision, recall, and F1-score
- Confusion matrix used to assess performance across classes
Testing the Best Model (Random Forest)
- Saved the Random Forest model as random_forest_model.pkl
- Tested the model on unseen data to validate its reliability
KNIME Workflow for Reproducibility
- Developed an automated workflow to handle data preprocessing and model training

Repository Content

File/Link	Description
`breast cancer data analysis.ipynb`	Jupyter notebook containing the code
`random_forest_model.pkl`	Saved Random Forest model
`Breast cancer classification model.py`	Python Script
Jupyter Notebook	Viewable notebook on Google Colab
`Breast_cancer_classification.knmf`	Workflow file for reproducibility

How to Run the Code Locally

Clone the repository:
Install dependencies:
Run the Jupyter notebook:

Dataset Availability

The dataset used for this project is available upon request.

Let me know if anything needs further refinement!

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
Breast cancer classification model.py		Breast cancer classification model.py
Breast_cancer_classification.knwf		Breast_cancer_classification.knwf
LICENSE		LICENSE
README.md		README.md
random_forest_model.pkl		random_forest_model.pkl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Breast Cancer Classification Model: End to End Machine Learning Pipeline

Breast Cancer Classification Model

Project Structure & Highlights

Results Overview

How to Explore the Project

Project Workflow

Repository Content

How to Run the Code Locally

Dataset Availability

About

Releases

Packages

Languages

License

preciousoben/Breast-Cancer-Classification-Model-End-to-End-Machine-Learning-Pipeline

Folders and files

Latest commit

History

Repository files navigation

Breast Cancer Classification Model: End to End Machine Learning Pipeline

Breast Cancer Classification Model

Project Structure & Highlights

Results Overview

How to Explore the Project

Project Workflow

Repository Content

How to Run the Code Locally

Dataset Availability

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages