Droplet Drug Detector (DDD) is a machine learning project focused on analyzing high-resolution microscopic images of dried droplets for pharmaceutical analysis. The project aims to improve substance identification and quantification in drug analysis and quality control.

- Objective: Apply advanced ML techniques to substance classification, concentration estimation, and rare substance detection in pharmaceuticals.
- Dataset: High-resolution microscopic images of various substances at different concentrations.
- Key Features:
  - Substance Classification: CNNs and Vision Transformers for pattern recognition in dried droplet images.
  - Concentration Estimation: Regression models for accurate measurement of concentration levels (future work).
  - Rare Substance Detection: Siamese network-based methods (future work).
- Technologies: Python 3.10, PyTorch + PyTorch Lightning, Azure DevOps, Azure Machine Learning.
- Current Status: The substance classification model shows high accuracy (F1-score: 0.9933) and robust performance. Concentration estimation and rare substance detection are planned future expansions.
- Contributors: Tomasz Urbaniak, PhD; Adam Siemaszkiewicz (myself), MSc; Nicole Cutajar, MSc.
For detailed information on installation, development practices, and the project's structure, refer to the corresponding sections in this README.
- Research Objective
- Dataset
- Analysis Goals
- Substance Classification (work in progress)
- Concentration Estimation (future work)
- Rare Substance Detection (future work)
- Authors & Contributors
The Droplet Drug Detector (DDD) project aims to revolutionize pharmaceutical analysis by using advanced machine learning to analyze high-resolution microscopic images of dried droplets. This cutting-edge approach is designed to improve the identification and quantification of substances, thereby enhancing drug analysis and quality control.
The dataset comprises high-resolution microscopic images of various droplet samples, with each droplet being a few microliters in volume. Approximately 2000 images of substance droplets of different concentrations were captured under controlled conditions to ensure data consistency and reliability. The dataset includes images of the following substances:
- gelatin capsules,
- lactose,
- methyl-cellulose,
- naproxen,
- pearlitol,
- polyvinyl-alcohol.
Future expansions of the dataset will include images of droplets containing mixtures of these substances.
This project is based on the study of patterns formed in dried droplets, commonly referred to as the 'coffee ring effect'. These patterns are influenced by the substance's physical and chemical properties, concentration, and interaction within the mixture, providing valuable information for substance analysis.
Images are captured under strictly controlled conditions to guarantee data consistency and reliability. However, slight imperfections and variations are intentionally included to ensure the model's robustness in less controlled environments.
*Sample images: Lactose (0.25 mg/ml), Methyl Cellulose (1 mg/ml), Gelatin Capsule (1 mg/ml)*
- Single-Substance Classification: Develop a model to classify individual substances based on the patterns in dried droplet images.
- Multiple-Substance Classification: Extend the model to classify mixtures of substances, addressing the added complexity of inter-substance interactions.
- Concentration Estimation: Design and implement regression models to accurately estimate the concentration levels of the substances. We aim to introduce novel methodologies in this area.
- Rare Substance Detection: Develop a Siamese network-based approach for identifying rare substances. This network will be trained on existing data, emphasizing its utility in scenarios with limited sample availability.
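The Siamese approach is still future work; purely to illustrate the general idea (this is a minimal conceptual sketch, not the project's implementation, and the backbone, embedding size, and margin are arbitrary assumptions):

```python
import torch
import torch.nn as nn
import torchvision.models as models


class SiameseEmbedder(nn.Module):
    """Shared-weight encoder mapping a droplet image to an embedding vector."""

    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)  # backbone choice is an assumption
        backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.normalize(self.backbone(x), dim=1)


def contrastive_loss(z1, z2, same_label, margin: float = 1.0):
    """Pull embeddings of the same substance together, push different ones apart."""
    dist = torch.norm(z1 - z2, dim=1)
    loss_same = same_label * dist.pow(2)
    loss_diff = (1 - same_label) * torch.clamp(margin - dist, min=0).pow(2)
    return (loss_same + loss_diff).mean()


# Toy usage: two batches of images plus a 0/1 vector marking same-substance pairs.
model = SiameseEmbedder()
x1, x2 = torch.randn(4, 3, 256, 256), torch.randn(4, 3, 256, 256)
same = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = contrastive_loss(model(x1), model(x2), same)
```

A network of this kind can be trained on the existing labeled substances and then compare a new droplet image against a small set of reference embeddings, which is what makes it attractive when only a handful of samples of a rare substance are available.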
(Work in progress)
A few experiments were conducted to determine a baseline model and hyperparameters for further experiments.
- Epochs: 50 (max), with early stopping implemented to prevent overfitting.
- Data Split: 10:10:80 (training, validation, test), stratified across substances and concentration levels.
- Preprocessing: Normalization and resizing to 256x256 pixels.
- Data Augmentation: Color jitter, random Gaussian noise, mirroring, and rotation.
- Model Architecture: ResNet18.
- Loss Function: Cross-entropy.
- Optimizer: Adam with a constant learning rate of 3e-4.
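As a rough sketch of how the baseline configuration above could be assembled with torchvision and PyTorch Lightning (this is not the project's actual code; normalization statistics, augmentation parameters, and the monitored early-stopping metric are illustrative assumptions):

```python
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# Preprocessing and augmentation roughly matching the list above;
# the exact parameter values here are assumptions, not the project's settings.
train_transforms = T.Compose([
    T.Resize((256, 256)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.RandomRotation(degrees=15),
    T.ToTensor(),
    T.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # random Gaussian noise
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])


class DropletClassifier(pl.LightningModule):
    """ResNet18 + cross-entropy + Adam(3e-4), as in the baseline experiment."""

    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.model = models.resnet18(weights=None)
        self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        loss = self.criterion(self(images), labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=3e-4)


# Up to 50 epochs with early stopping, as described above.
trainer = pl.Trainer(
    max_epochs=50,
    callbacks=[pl.callbacks.EarlyStopping(monitor="val_loss", patience=5)],
)
# trainer.fit(model, datamodule=...) would then run training on the droplet dataset.
```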
*Learning curves: loss and F1 score*
- Metrics: Accuracy, precision, recall, and F1 score (a minimal computation sketch follows the results table below).
- Results: Our initial experiments yielded a very high F1-score (0.9933) on the test set, indicating robust model performance.
| Experiment | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| Base experiment | 0.993292 | 0.993328 | 0.993292 | 0.993297 |
| coming soon | - | - | - | - |
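For reference, metrics of this kind are typically computed with a library such as torchmetrics; a minimal, self-contained sketch (not the project's code, with toy labels and an assumed averaging scheme) could look like:

```python
import torch
from torchmetrics.classification import MulticlassAccuracy, MulticlassF1Score

num_classes = 6
preds = torch.tensor([0, 2, 1, 3, 3])   # toy predicted class indices
target = torch.tensor([0, 2, 1, 3, 4])  # toy ground-truth labels

accuracy = MulticlassAccuracy(num_classes=num_classes, average="micro")
f1 = MulticlassF1Score(num_classes=num_classes, average="weighted")

print(accuracy(preds, target), f1(preds, target))
```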
*Confusion matrices: best validation epoch and test set*
- Misclassification Analysis: Images with high loss values are analyzed and stored for further examination.
*Example misclassifications — True: gelatin-capsule, Predicted: polyvinyl-alcohol (two samples); True: methyl-cellulose, Predicted: polyvinyl-alcohol*
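As a hedged illustration of how such high-loss samples could be collected during evaluation (the helper below is hypothetical, not the project's API):

```python
import torch
import torch.nn as nn


def collect_high_loss_samples(model, dataloader, top_k=10, device="cpu"):
    """Return the top_k samples with the highest per-sample cross-entropy loss."""
    criterion = nn.CrossEntropyLoss(reduction="none")
    records = []
    model.eval()
    with torch.no_grad():
        for batch_idx, (images, labels) in enumerate(dataloader):
            logits = model(images.to(device))
            losses = criterion(logits, labels.to(device))
            preds = logits.argmax(dim=1)
            for i, loss in enumerate(losses):
                records.append({
                    "batch": batch_idx,
                    "index": i,
                    "loss": loss.item(),
                    "true": labels[i].item(),
                    "pred": preds[i].item(),
                })
    # Keep only the worst offenders for manual inspection or saving to disk.
    return sorted(records, key=lambda r: r["loss"], reverse=True)[:top_k]
```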
- Class Activation Mapping (CAM): Used to visualize the image regions most influential in the model's predictions.
*CAM visualizations for test samples 0, 40, and 60*
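A minimal CAM computation for a ResNet-style classifier could be sketched as follows, assuming the classic CAM formulation (last convolutional feature map weighted by the final linear layer's weights); this is a generic sketch with an untrained ResNet18 standing in for the trained model:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=None)  # stand-in; the trained classifier would be loaded instead
model.eval()

features = {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(last=o))

image = torch.randn(1, 3, 256, 256)  # toy input in place of a droplet image
logits = model(image)
class_idx = logits.argmax(dim=1).item()

# CAM: weight the last conv feature maps by the FC weights of the predicted class.
fc_weights = model.fc.weight[class_idx]                   # (512,)
fmap = features["last"].squeeze(0)                        # (512, 8, 8) for a 256x256 input
cam = torch.einsum("c,chw->hw", fc_weights, fmap)
cam = F.interpolate(cam[None, None], size=image.shape[-2:], mode="bilinear")[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1] for overlaying
```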
- Activation Feature Analysis: Analyzing how different layers of the network process the input images to gain insight into the model's internal workings.
*Feature activations from layers 1-4*
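Intermediate activations like these are commonly captured with forward hooks; a minimal sketch (again using an untrained ResNet18 as a stand-in for the trained model) might look like:

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None)
model.eval()

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register hooks on the four residual stages analyzed above.
for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(save_activation(name))

with torch.no_grad():
    model(torch.randn(1, 3, 256, 256))  # toy input instead of a droplet image

for name, act in activations.items():
    print(name, tuple(act.shape))  # e.g. layer1 -> (1, 64, 64, 64)
```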
(To be added) This section will discuss the challenges associated with classifying mixtures of substances and our approach to addressing them.
(To be added) This section will detail our methodology for developing regression models aimed at quantifying substance concentrations.
(To be added) This section will discuss the use of Siamese networks for detecting rare substances and the unique challenges associated with limited sample sizes.
- Tomasz Urbaniak, PhD (Wrocław Medical University)
A pharmaceutical expert, Tomasz is the co-author responsible for guiding the project's pharmaceutical aspects, leveraging his extensive knowledge in the field.
- Adam Siemaszkiewicz, MSc (myself) (Wrocław University of Science & Technology)
As a co-author, I specialize in machine learning, data science, and software engineering, driving the technical and analytical facets of the project.
- Nicole Cutajar, MSc (University of Malta)
A vital contributor focusing on sample collection and image acquisition, ensuring the integrity and quality of our dataset.
Before installing the project, ensure that you have the following requirements:
- Python 3.10
- Mamba (for faster and more efficient virtual environments)
- Docker (optional, needed for containerization)
- Git (for version control)
Follow these steps to set up your local environment:
- Clone the repository to your local machine:

  ```bash
  git clone [repository-url]
  cd [repository-name]
  ```

- Install Mamba: If you do not have Mamba installed, you can install it through Conda:

  ```bash
  conda install mamba -n base -c conda-forge
  ```

- Create and activate a Conda environment: Use the provided environment YAML files to create and activate your Conda environment:

  ```bash
  mamba env create -f environments/[environment-name].yaml
  conda activate [environment-name]
  ```

- Set up pre-commit hooks to enforce a variety of standards and validations during each commit:

  ```bash
  pre-commit install
  ```

  To run all pre-commit hooks on all files in the repository, execute:

  ```bash
  pre-commit run --all-files
  ```

- Docker setup (optional): For projects that require Docker, build and run your containers using:

  ```bash
  docker build -t [image-name]:[tag] .
  docker run -it [image-name]:[tag]
  ```
The `.azure-devops` directory contains configurations specific to Azure DevOps features and services to support the project's development workflow.

The `.github` directory contains configurations specific to GitHub features and services to support the project's development workflow.

All experiment-related artifacts, such as configuration files, model checkpoints, and logs, are saved in the `artifacts` directory.

The `configs` directory contains configuration YAML files for different machine learning tasks.

Store all project-related data inside the `data` folder.

All Docker-related files necessary for building Docker images and managing Docker containers for the project are located in the `docker` directory.

The `environments` directory stores YAML files that define the different Conda environments needed for the project.

Jupyter notebooks integral to the project are located in the `notebooks` directory.

The `src` directory contains all source code for the project.

The `tests` directory contains various types of automated tests to ensure the codebase works correctly:

- `tests/unit` for unit tests that validate individual pieces of code independently.
- `tests/integration` for integration tests that ensure different code sections work together as intended.
- `tests/e2e` for end-to-end tests that verify complete user workflows.

Execute all tests with:

```bash
pytest
```
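As a purely hypothetical illustration of what a unit test under `tests/unit` might look like (the file name and helper below are invented for the example, not taken from the project):

```python
# tests/unit/test_preprocessing.py  (hypothetical example)
import torch


def resize_and_normalize(image: torch.Tensor, size: int = 256) -> torch.Tensor:
    """Toy stand-in for a preprocessing helper: resize and scale to [0, 1]."""
    resized = torch.nn.functional.interpolate(
        image.unsqueeze(0), size=(size, size), mode="bilinear", align_corners=False
    ).squeeze(0)
    return resized / 255.0


def test_resize_and_normalize_shape_and_range():
    image = torch.rand(3, 512, 512) * 255
    out = resize_and_normalize(image)
    assert out.shape == (3, 256, 256)
    assert out.min() >= 0.0 and out.max() <= 1.0
```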
A detailed explanation of the layout and purpose of the `.azure-devops` directory contents:

- `.azure-devops/pipelines`: This folder holds the YAML pipeline definitions for building and deploying using Azure DevOps services.
  - `build-aml-environment.yaml` sets up the Azure Machine Learning environment needed for running Azure ML tasks.
  - `droplet-drug-classificator-training.yaml` runs the Droplet Drug Classificator training pipeline as an Azure Machine Learning task.
- `.azure-devops/templates`: Reusable YAML templates with encapsulated functionalities to streamline pipeline creation. The templates include:
  - `configure-aml-extension.yaml` for setting up Azure ML extensions.
  - `connect-to-aml-workspace.yaml` for connecting to an Azure ML workspace within the pipeline.
  - `create-conda-env.yaml` for constructing Conda environments needed for the pipeline's operations.
  - `install-azure-cli.yaml` for installing the Azure CLI.
  - `substitute-env-vars.yaml` for injecting environment variables dynamically into the pipeline process.
A detailed explanation of the layout and purpose of the `src` directory contents:

- `aml`: Azure Machine Learning utilities, components, and pipelines.
  - `components`: Code for individual Azure Machine Learning components.
    - `classificator_training`: A component for classification model training & evaluation, with its specification YAML, entrypoint script, options, configuration, and custom functions.
  - `pipelines`: Code for Azure Machine Learning pipelines.
    - `classificator_training`: A pipeline running the classification component, containing its specification YAML.
  - `blob_storage.py`: Azure Blob Storage service for uploading and downloading files and folders.
  - `build_aml_environment.py`: A script to set up the Azure Machine Learning environment.
  - `client.py`: Azure Machine Learning client for interacting with AML objects (a generic connection sketch follows this listing).
  - `environment.py`: Azure Machine Learning environment helpers for creating and managing AML environments.
- `common`: Shared utilities and constants used across the project.
  - `consts`: Definitions of constants used throughout the codebase, such as Azure-specific constants, directory paths, and extensions.
  - `settings`: Infrastructure settings storing things such as Azure ML, Azure Blob Storage, cluster & database credentials.
  - `utils`: General utility functions and classes, such as settings management, logging, converters, and validators.
- `configs`: Configuration classes for machine learning tasks.
- `machine_learning`: Code for machine learning tasks, divided into categories, each providing types, configuration, and creation logic.
  - `augmentations`: Data augmentation
  - `callbacks`: PyTorch Lightning training callbacks
  - `classification`: Classification-specific modules
    - `loss_functions`: Loss functions
    - `metrics`: Evaluation metrics
    - `models`: Model architectures
    - `module.py`: PyTorch Lightning module
  - `loggers`: PyTorch Lightning loggers
  - `optimizer`: Optimizer
  - `preprocessing`: Data preprocessing transformations
  - `scheduler`: Learning rate scheduler
  - `trainer`: PyTorch Lightning trainer
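For orientation, connecting to an Azure ML workspace with the `azure-ai-ml` SDK generally looks like the following. This is a generic sketch rather than the project's `client.py`, and the placeholder identifiers are assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Placeholder identifiers; in practice these would come from the project's settings.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# List registered environments as a simple smoke test of the connection.
for env in ml_client.environments.list():
    print(env.name)
```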
The `.github` directory contains configurations specific to GitHub features and services to support the project's development workflow.

- `workflows`: Includes automation workflows for GitHub Actions. The `ci.yaml` file in this directory configures the continuous integration workflow, which is triggered on push and pull request events to run tests, perform linting, and other checks integral to maintaining code quality and operational integrity.
- `ISSUE_TEMPLATE`: Provides templates for opening new issues on GitHub. The templates ensure that all necessary details are included when contributors report bugs (`bug_report.md`) or propose new features (`feature_request.md`). Use these templates to create issues that are consistent and informative.
The `docker` directory is intended to house all Docker-related files necessary for building Docker images and managing Docker containers for the project. This includes:

- Dockerfiles: Each Dockerfile contains a set of instructions to assemble a Docker image. Dockerfiles should be named with the convention `Dockerfile` or `Dockerfile.<environment>` to denote different setups, such as development, testing, or production environments.
- docker-compose files: For projects that run multiple containers that need to work together, `docker-compose.yml` files define and run multi-container Docker applications. With Compose, you use a YAML file to configure your application's services and create and start all the services from your configuration with a single command.
- Configuration Scripts: Any scripts that aid in setting up, building, or deploying Docker containers, such as initialization scripts or entrypoint scripts, belong here.
- Environment Files: `.env` files that contain environment variables necessary for Docker to run or for Dockerized applications to operate correctly can be placed in this directory. These files should not contain sensitive information and should be excluded from version control if they do.

As the project develops, ensure that you populate the `docker` directory with these files and provide documentation on their purpose and how they should be used. This could include instructions on how to build images, start containers, and manage containerized environments effectively.
The `environments` directory contains configuration files that define the different environments needed for the project. These files are essential for ensuring that the project runs with the correct versions of its dependencies and in a way that's consistent across different setups.

- Conda Environment Files: YAML files that specify the packages required for a Conda environment. Environment YAML files should have the same name as the project they relate to.
- Infrastructure Configuration: The `infra.yaml` file might include configurations for setting up the infrastructure as code, which can be particularly useful when working with cloud services or when you want to automate the setup of your project's infrastructure.
The pre-commit hooks will now automatically check each file when you attempt to commit them to your git repository. If any hooks make changes or fail, fix the issues, and try committing again.
Here are the hooks configured for this project:
- `flake8`: Lints Python source files for coding standard violations, complexity, and style issues.
- `black`: Formats Python code to ensure consistent styling.
- `isort`: Sorts Python imports alphabetically within respective sections and by type.
- `mypy`: Checks type hints and enforces type checking on your code.
- `pytest`: Runs automated tests to make sure new changes do not break functionality.
This project is licensed under the MIT License - see the LICENSE.md file for details.