
The Droplet Drug Detector (DDD) 💊🔬🧠 aims to revolutionize pharmaceutical analysis by using advanced machine learning to analyze high-resolution microscopic images of dried droplets. This cutting-edge approach is designed to improve the identification and quantification of substances, thereby enhancing drug analysis and quality control.


adamsiemaszkiewicz/droplet-drug-detector


Droplet Drug Detector



TL;DR (Too Long; Didn't Read)

Droplet Drug Detector (DDD) 💊🔬🧠 is an innovative machine learning project focused on analyzing high-resolution microscopic images of dried droplets for pharmaceutical analysis. This project aims to revolutionize substance identification and quantification in drug analysis and quality control.

  • Objective 🎯: Utilize advanced ML techniques for substance classification, concentration estimation, and rare substance detection in pharmaceuticals.
  • Dataset 🔬: High-resolution microscopic images of various substances at different concentrations.
  • Key Features 📃:
    • Substance Classification 💊: Using CNNs and Vision Transformers for pattern recognition in dried droplet images.
    • Concentration Estimation 📈: Developing regression models for accurate measurement of concentration levels (future work).
    • Rare Substance Detection 🔍: Employing Siamese network-based methods (future work).
  • Technologies 💻: Python 3.10, PyTorch + PyTorch Lightning, Azure DevOps, Azure Machine Learning.
  • Current Status 🚀: The substance classification model reaches an F1-score of 0.9933 on the test set. Concentration estimation and rare substance detection are planned future expansions.
  • Contributors 👥: Tomasz Urbaniak, PhD; Adam Siemaszkiewicz (myself), MSc; Nicole Cutajar, MSc.

For detailed information on installation, development practices, and the project's structure, refer to the corresponding sections in this README.


Table of Contents

  1. Table of Contents

  2. Project Overview

  3. Installation

  4. Repository Structure

  5. Development

  6. Configuration

  7. License


Project overview

Research objective

The Droplet Drug Detector (DDD) project aims to revolutionize pharmaceutical analysis by using advanced machine learning to analyze high-resolution microscopic images of dried droplets. This cutting-edge approach is designed to improve the identification and quantification of substances, thereby enhancing drug analysis and quality control.

Dataset

The dataset comprises high-resolution microscopic images of various droplet samples, with each droplet being a few microliters in volume. Approximately 2000 images of substance droplets of different concentrations were captured under controlled conditions to ensure data consistency and reliability. The dataset includes images of the following substances:

  • gelatin capsules,
  • lactose,
  • methyl-cellulose,
  • naproxen,
  • pearlitol,
  • polyvinyl-alcohol.

Future expansions of the dataset will include images of droplets containing mixtures of these substances.

Theoretical basis

This project is based on the study of patterns formed in dried droplets, commonly referred to as the 'coffee ring effect'. These patterns are influenced by the substance's physical and chemical properties, concentration, and interaction within the mixture, providing valuable information for substance analysis.

Sample collection

Images are captured under strictly controlled conditions to guarantee data consistency and reliability. However, slight imperfections and variations are intentionally included to ensure the model's robustness in less controlled environments.

Sample images: Lactose, 0.25 mg/ml; Methyl Cellulose, 1 mg/ml; Gelatin Capsule, 1 mg/ml.

Analysis goals

  1. Single-Substance Classification: Develop a model to classify individual substances based on the patterns in dried droplet images.
  2. Multiple-Substance Classification: Extend the model to classify mixtures of substances, addressing the added complexity of inter-substance interactions.
  3. Concentration Estimation: Design and implement regression models to accurately estimate the concentration levels of the substances. We aim to introduce novel methodologies in this area.
  4. Rare Substance Detection: Develop a Siamese network-based approach for identifying rare substances. This network will be trained on existing data, emphasizing its utility in scenarios with limited sample availability.

Single-Substance Classification

(Work in progress)

Model Training

A few experiments were conducted to determine a baseline model and hyperparameters for further experiments.

  • Epochs: 50 (max), with early stopping implemented to prevent overfitting.
  • Data Split: Stratified split (10:10:80 for training, validation & test subsets) across substances and concentration levels.
  • Preprocessing: Normalization, resizing to 256x256 pixels.
  • Data Augmentation: Color jitter, random gaussian noise, mirroring, and rotation.
  • Model Architecture: ResNet18.
  • Loss Function: Cross-entropy.
  • Optimizer: Adam with a constant learning rate of 3e-4.
Figures: learning curves (loss) and learning curves (F1 score).
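
The stratified split described in the Data Split bullet can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `stratified_split`, the tuple layout of a sample, the toy data, and the seed are all hypothetical. The fractions are passed in the order train, validation, test, using the ratio as stated above.

```python
import random
from collections import defaultdict


def stratified_split(samples, fractions=(0.1, 0.1, 0.8), seed=0):
    """Split samples into train/val/test, stratified by (substance, concentration).

    `samples` is a list of (image_id, substance, concentration) tuples.
    `fractions` are given in the order (train, validation, test).
    """
    # Group samples so that every stratum is split with the same proportions.
    groups = defaultdict(list)
    for sample in samples:
        _, substance, concentration = sample
        groups[(substance, concentration)].append(sample)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # remainder goes to the test subset
    return train, val, test


# Toy data: 2 substances x 2 concentrations x 20 images each (hypothetical values).
samples = [
    (f"{substance}-{concentration}-{index}", substance, concentration)
    for substance in ("lactose", "naproxen")
    for concentration in (0.25, 1.0)
    for index in range(20)
]
train, val, test = stratified_split(samples)
print(len(train), len(val), len(test))
```

Splitting each (substance, concentration) group separately keeps every combination represented in all three subsets, which a plain random split does not guarantee for small groups.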

Model Evaluation

  • Metrics: Accuracy, precision, recall and F1 score.
  • Results: Our initial experiments yielded a very high F1-score (0.9933) on the test set, indicating robust model performance.
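| Experiment      | Accuracy | Precision | Recall   | F1 score |
|-----------------|----------|-----------|----------|----------|
| Base experiment | 0.993292 | 0.993328  | 0.993292 | 0.993297 |
| coming soon     | -        | -         | -        | -        |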
Figures: confusion matrix (best validation epoch) and confusion matrix (test set).
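
The four metrics listed above can all be derived from a confusion matrix. The sketch below computes accuracy together with support-weighted precision, recall, and F1 score. The averaging mode is an assumption: the README does not state which one was used, although identical accuracy and recall values are what weighted averaging produces, since support-weighted recall always equals accuracy. The function name `weighted_metrics` and the matrix values are made up for illustration.

```python
def weighted_metrics(confusion):
    """Compute accuracy and support-weighted precision, recall and F1.

    `confusion[i][j]` counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    precision = recall = f1 = 0.0
    for i in range(n_classes):
        tp = confusion[i][i]
        support = sum(confusion[i])  # number of true samples of class i
        predicted = sum(confusion[r][i] for r in range(n_classes))  # predictions of class i
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total  # weight each class by its share of samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {
        "accuracy": correct / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up 3-class confusion matrix, for illustration only.
confusion = [
    [48, 1, 1],
    [0, 50, 0],
    [2, 0, 48],
]
metrics = weighted_metrics(confusion)
print({name: round(value, 4) for name, value in metrics.items()})
```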

Explainability

  • Misclassification Analysis: Images with high loss values are analyzed and stored for further examination.
    Figures: misclassified image 1 (true: gelatin-capsule, predicted: polyvinyl-alcohol), misclassified image 2 (true: gelatin-capsule, predicted: polyvinyl-alcohol), misclassified image 3 (true: methyl-cellulose, predicted: polyvinyl-alcohol).
  • Class Activation Mapping (CAM): Used to visualize significant regions in the images for making predictions.
    Figures: CAM for test samples 0, 40, and 60.
  • Activation Feature Analysis: Analyzing how different layers of the network process the input images, to gain insights into the model's internal workings.
    Figures: activation features for layers 1, 2, 3, and 4.
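
The misclassification analysis relies on ranking samples by their loss. A minimal sketch of that ranking step follows, assuming the model outputs are already available as per-class probabilities. The function names, sample IDs, and probability values are hypothetical and are not taken from the repository.

```python
import math


def cross_entropy(probabilities, true_index):
    """Per-sample cross-entropy: negative log-probability of the true class."""
    return -math.log(max(probabilities[true_index], 1e-12))  # clamp to avoid log(0)


def highest_loss_samples(predictions, k=3):
    """Return the k (loss, sample_id) pairs with the highest loss, worst first.

    `predictions` is a list of (sample_id, probabilities, true_index) tuples.
    """
    scored = [
        (cross_entropy(probabilities, true_index), sample_id)
        for sample_id, probabilities, true_index in predictions
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical softmax outputs for four test samples over three classes.
predictions = [
    ("img_000", [0.90, 0.05, 0.05], 0),
    ("img_001", [0.10, 0.20, 0.70], 0),  # confidently wrong
    ("img_002", [0.40, 0.35, 0.25], 0),
    ("img_003", [0.02, 0.97, 0.01], 1),
]
for loss, sample_id in highest_loss_samples(predictions, k=2):
    print(f"{sample_id}: loss={loss:.3f}")
```

Sorting by loss rather than by a simple correct/incorrect flag also surfaces samples that were classified correctly but with low confidence.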

Multiple-Substance Classification

(To be added) This section will discuss the challenges associated with classifying mixtures of substances and our approach to addressing them.

Concentration Estimation

(To be added) This section will detail our methodology for developing regression models aimed at quantifying substance concentrations.

Rare Substance Detection

(To be added) This section will discuss the use of Siamese networks for detecting rare substances and the unique challenges associated with limited sample sizes.

Authors & Contributors

  • Tomasz Urbaniak, PhD (Wrocław Medical University)

A pharmaceutical expert, Tomasz is the co-author responsible for guiding the project's pharmaceutical aspects, leveraging his extensive knowledge in the field.

  • Adam Siemaszkiewicz, MSc (myself) (Wrocław University of Science & Technology)

As a co-author, I specialize in machine learning, data science, and software engineering, driving the technical and analytical facets of the project.

  • Nicole Cutajar, MSc (University of Malta)

A vital contributor focusing on sample collection and image acquisition, ensuring the integrity and quality of our dataset.



Installation

Before installing the project, ensure that you have the following requirements:

  • Python 3.10
  • Mamba (for faster and more efficient virtual environments)
  • Docker (optional, needed for containerization)
  • Git (for version control)

Follow these steps to set up your local environment:

  1. Clone the repository to your local machine:

    git clone [repository-url]
    cd [repository-name]
  2. Install Mamba: If you do not have Mamba installed, you can install it through Conda:

    conda install mamba -n base -c conda-forge
  3. Create and activate a Conda environment: Use the provided environment YAML files to create and activate your Conda environment:

    mamba env create -f environments/[environment-name].yaml
    conda activate [environment-name]
  4. Set up pre-commit hooks to enforce a variety of standards and validations during each commit:

    pre-commit install

    To run all pre-commit hooks on all files in the repository, execute:

    pre-commit run --all-files
  5. Docker setup (optional): For projects that require Docker, build and run your containers using:

    docker build -t [image-name]:[tag] .
    docker run -it [image-name]:[tag]
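
Before running the steps above, it can help to confirm that the listed prerequisites are present. The following is a small sketch using only the Python standard library; the function name `check_prerequisites` and the report keys are illustrative assumptions and are not part of the repository.

```python
import shutil
import sys


def check_prerequisites(required=("git", "mamba"), optional=("docker",)):
    """Report whether the interpreter is Python 3.10 and which tools are on PATH."""
    report = {"python_3_10": sys.version_info[:2] == (3, 10)}
    for tool in required + optional:
        # shutil.which returns None when the executable cannot be found.
        report[tool] = shutil.which(tool) is not None
    return report


for name, available in check_prerequisites().items():
    print(f"{name}: {'found' if available else 'missing'}")
```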



Repository Structure

Azure DevOps

The .azure-devops directory contains configurations specific to Azure DevOps features and services to support the project's development workflow.

GitHub

The .github directory contains configurations specific to GitHub features and services to support the project's development workflow.

Artifacts

All experiment-related artifacts, such as configuration files, model checkpoints, and logs, are saved in the artifacts directory.

Configs

The configs directory contains configuration YAML files for different machine learning tasks.
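
To illustrate how a YAML configuration might map onto a typed object, here is a sketch that uses the baseline hyperparameters from the Model Training section. The class name `TrainingConfig` and its field names are hypothetical and do not reflect the repository's actual schema. A plain dictionary stands in for the output of a YAML loader so that the example has no third-party dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical training configuration; field names are illustrative."""

    model_name: str = "resnet18"
    max_epochs: int = 50
    learning_rate: float = 3e-4
    image_size: int = 256

    def __post_init__(self):
        # Fail early on values that would make training meaningless.
        if self.max_epochs <= 0:
            raise ValueError("max_epochs must be positive")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")


# Stand-in for the dictionary a YAML loader would return.
raw = {"model_name": "resnet18", "max_epochs": 50, "learning_rate": 3e-4}
config = TrainingConfig(**raw)
print(config)
```

Validating values when the object is constructed means a malformed configuration fails before any training job is submitted.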

Data

Store all project-related data inside the data folder.

Docker

All Docker-related files necessary for building Docker images and managing Docker containers for the project are located in the docker directory.

Environments

The environments directory stores YAML files that define the different Conda environments needed for the project.

Notebooks

Jupyter notebooks integral to the project are located in the notebooks directory.

Source Code

The src directory contains all source code for the project.

Testing

The tests directory contains various types of automated tests to ensure the codebase works correctly:

  • tests/unit for unit tests to validate individual pieces of code independently.
  • tests/integration for integration tests to ensure different code sections work together as intended.
  • tests/e2e for end-to-end tests that verify complete user workflows.

Running Tests

Execute all tests with:

pytest
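
For orientation, a unit test under tests/unit could look like the sketch below. The helper `normalize_pixels` is a hypothetical function defined inline to keep the example self-contained; it is not taken from the repository's source code.

```python
def normalize_pixels(pixels, mean, std):
    """Hypothetical helper: standardize pixel values with a given mean and std."""
    if std <= 0:
        raise ValueError("std must be positive")
    return [(value - mean) / std for value in pixels]


def test_normalize_pixels_centers_and_scales():
    assert normalize_pixels([0.0, 0.5, 1.0], mean=0.5, std=0.5) == [-1.0, 0.0, 1.0]


def test_normalize_pixels_rejects_non_positive_std():
    try:
        normalize_pixels([0.1], mean=0.0, std=0.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

pytest collects functions whose names start with `test_` from files named `test_*.py`, so saving this under a name such as tests/unit/test_normalize.py (a hypothetical path) is enough for the `pytest` command above to pick it up.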



Development

Azure DevOps Code Structure

A detailed explanation of the layout and purpose of the .azure-devops directory contents.

  • .azure-devops/pipelines: This folder holds the YAML pipeline definitions for building and deploying using Azure DevOps services.

    • build-aml-environment.yaml sets up the Azure Machine Learning environment needed for running Azure ML tasks.
    • droplet-drug-classificator-training.yaml runs the Droplet Drug Classificator training pipeline as an Azure Machine Learning task.
  • .azure-devops/templates: Reusable YAML templates with encapsulated functionalities to streamline pipeline creation. The templates include:

    • configure-aml-extension.yaml for setting up Azure ML extensions.
    • connect-to-aml-workspace.yaml for connecting to an Azure ML workspace within the pipeline.
    • create-conda-env.yaml for constructing Conda environments needed for the pipeline's operations.
    • install-azure-cli.yaml for installing the Azure CLI.
    • substitute-env-vars.yaml for injecting environment variables dynamically into the pipeline process.

Source Code Structure

A detailed explanation of the layout and purpose of the src directory contents.

  • aml: Azure Machine Learning utilities, components and pipelines.

    • components: Contains code for individual Azure Machine Learning components.
      • classificator_training: A component for classification model training & evaluation, with its specification YAML, entrypoint script, options, configuration and custom functions.
    • pipelines: Contains code for Azure Machine Learning pipelines.
      • classificator_training: A pipeline that runs the classification component, with its specification YAML.
    • blob_storage.py: Azure Blob Storage service for uploading and downloading files and folders.
    • build_aml_environment.py: A script to set up the Azure Machine Learning environment.
    • client.py: Azure Machine Learning client for interacting with AML objects.
    • environment.py: Azure Machine Learning environment utilities for creating and managing AML environments.
  • common: Shared utilities and constants used across the project.

    • consts: Definitions of constants used throughout the codebase, like Azure-specific constants, directory paths, and extensions.
    • settings: Infrastructure settings such as Azure ML, Azure Blob Storage, cluster & database credentials.
    • utils: General utility functions and classes, such as settings management, logging, converters and validators.
  • configs: Configuration classes for machine learning tasks.

  • machine_learning: Contains code for machine learning tasks, divided into categories, each providing types, configuration and creation functions.

    • augmentations: Data augmentation
    • callbacks: PyTorch Lightning training callbacks
    • classification: Classification-specific modules
      • loss_functions: Loss functions
      • metrics: Evaluation metrics
      • models: Model architectures
      • module.py: PyTorch Lightning module
    • loggers: PyTorch Lightning loggers
    • optimizer: Optimizer
    • preprocessing: Data preprocessing transformations
    • scheduler: Learning rate scheduler
    • trainer: PyTorch Lightning trainer
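
The "types, configuration and creation" layout mentioned for machine_learning can be pictured as a small registry plus a factory function. The sketch below is one possible reading of that description and not the repository's code: `SchedulerConfig`, `SCHEDULERS`, and `create_scheduler` are hypothetical names, and the schedulers are plain Python functions instead of PyTorch objects so that the example runs without dependencies.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class SchedulerConfig:
    """Hypothetical config: a scheduler name plus optional keyword arguments."""

    name: str
    extra_arguments: Dict[str, Any] = field(default_factory=dict)


def constant(step: int, base_lr: float) -> float:
    """Keep the learning rate unchanged."""
    return base_lr


def step_decay(step: int, base_lr: float, step_size: int = 10, gamma: float = 0.1) -> float:
    """Multiply the learning rate by gamma every step_size steps."""
    return base_lr * gamma ** (step // step_size)


# Registry mapping the names allowed in a config to their implementations.
SCHEDULERS: Dict[str, Callable[..., float]] = {
    "constant": constant,
    "step_decay": step_decay,
}


def create_scheduler(config: SchedulerConfig, base_lr: float) -> Callable[[int], float]:
    """Build a step -> learning-rate function from a config object."""
    if config.name not in SCHEDULERS:
        raise ValueError(f"Unknown scheduler: {config.name}")
    implementation = SCHEDULERS[config.name]
    return lambda step: implementation(step, base_lr, **config.extra_arguments)


schedule = create_scheduler(
    SchedulerConfig("step_decay", {"step_size": 10, "gamma": 0.5}), base_lr=3e-4
)
print([schedule(step) for step in (0, 10, 25)])
```

Keeping the name-to-implementation mapping in one place lets a configuration file select a component by name, and an unknown name fails with a clear error.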



Configuration

GitHub Configuration

The .github directory contains configurations specific to GitHub features and services to support the project's development workflow.

GitHub Actions Workflows

  • workflows: Includes automation workflows for GitHub Actions. The ci.yaml file in this directory configures the continuous integration workflow, which is triggered on push and pull request events to run tests, perform linting, and other checks integral to maintaining code quality and operational integrity.

Issue and Pull Request Templates

  • ISSUE_TEMPLATE: Provides templates for opening new issues on GitHub. The templates ensure that all necessary details are included when contributors report bugs (bug_report.md) or propose new features (feature_request.md). Use these templates to create issues that are consistent and informative.

Docker Configuration

The docker directory is intended to house all Docker-related files necessary for building Docker images and managing Docker containers for the project. This includes:

  • Dockerfiles: Each Dockerfile contains a set of instructions to assemble a Docker image. Dockerfiles should be named with the convention Dockerfile or Dockerfile.<environment> to denote different setups, such as development, testing, or production environments.

  • docker-compose files: For projects that run multiple containers that need to work together, docker-compose.yml files define and run multi-container Docker applications. With Compose, you use a YAML file to configure your application's services and create and start all the services from your configuration with a single command.

  • Configuration Scripts: Any scripts that aid in setting up, building, or deploying Docker containers, such as initialization scripts or entrypoint scripts, belong here.

  • Environment Files: .env files that contain environment variables necessary for Docker to run or for Dockerized applications to operate correctly can be placed in this directory. These files should not contain sensitive information and should be excluded from version control if they do.

As the project develops, ensure that you populate the docker directory with these files and provide documentation on their purpose and how they should be used. This could include instructions on how to build images, start containers, and manage containerized environments effectively.

Environment Configuration

The environments directory contains configuration files that define the different environments needed for the project. These files are essential for ensuring that the project runs with the correct versions of its dependencies and in a way that's consistent across different setups.

  • Conda Environment Files: YAML files that specify the packages required for a conda environment. Environment YAML files should have the same name as the project they relate to.

  • Infrastructure Configuration: The infra.yaml file might include configurations for setting up the infrastructure as code, which can be particularly useful when working with cloud services or when you want to automate the setup of your project's infrastructure.

Formatting, Linting & Type Checking

Once installed, the pre-commit hooks automatically check each staged file when you attempt to commit to your Git repository. If any hook makes changes or fails, fix the issues and try committing again.

Here are the hooks configured for this project:

  • flake8: Lints Python source files for coding standard violations, complexity, and style issues.
  • black: Formats Python code to ensure consistent styling.
  • isort: Sorts Python imports alphabetically within respective sections and by type.
  • mypy: Checks type hints and enforces type checking on your code.
  • pytest: Runs automated tests to make sure new changes do not break the functionality.



License

This project is licensed under the MIT License - see the LICENSE.md file for details.
