Merge branch 'main' into explainability_module

basf · Sep 27, 2024 · 4d20311
2 parents 76a0972 + d67807f
commit 4d20311
Showing 51 changed files with 2,338 additions and 553 deletions.
diff --git a/.github/molpipeline.png b/.github/molpipeline.png
diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
@@ -22,7 +22,7 @@ jobs:
         pip install pylint
     - name: Analysing the code with pylint
       run: |
-        pylint  -d C0301,R0913,W1202 $(git ls-files '*.py') --ignored-modules "rdkit"
+        pylint  -d C0301,R0913,W1202 $(git ls-files '*.py') --ignored-modules "rdkit"  --max-positional-arguments 10
   mypy:
     runs-on: ubuntu-latest
     steps:
@@ -150,7 +150,7 @@ jobs:
           pip install isort
       - name: Analysing the code with isort
         run: |
-          isort --profile black .
+          isort --profile black --check-only .
 
   test_basis:
     needs:
@@ -181,7 +181,7 @@ jobs:
       - name: Run unit-tests
         run: |
           # Run only the core test suite in the tests directory.
-          coverage run -m unittest discover tests
+          coverage run --source=molpipeline,tests -m unittest discover tests
           # Create a coverage report. Fail if the coverage is below 85%. Exclude extra packages from the report.
           coverage report --fail-under=85 --omit="*chemprop*","*/*chemprop*/*"
 
@@ -204,7 +204,7 @@ jobs:
       - name: Run unit-tests for chemprop
         run: |
           # Run only the chemprop test suite.
-          coverage run -m unittest discover test_extras/test_chemprop
+          coverage run --source=molpipeline,tests -m unittest discover test_extras/test_chemprop
           # Create a coverage report. Fail if the coverage is below 85%. Include only chemprop files in the report.
           coverage report --fail-under=85 --include="*chemprop*","*/*chemprop*/*"
 

diff --git a/.gitignore b/.gitignore
@@ -3,4 +3,5 @@ __pycache__
 molpipeline.egg-info/
 lib/
 build/
+lightning_logs/
 
diff --git a/README.md b/README.md
@@ -1,35 +1,56 @@
 # MolPipeline
-MolPipeline is a Python package providing RDKit functionality in a Scikit-learn like fashion.
+MolPipeline is a Python package for processing molecules with RDKit in scikit-learn.
+
+<p align="center"><img src=".github/molpipeline.png" height="250"/></p>
 
 ## Background
 
-The open-source package [scikit-learn](https://scikit-learn.org/) provides a large variety of machine
+The [scikit-learn](https://scikit-learn.org/) package provides a large variety of machine
 learning algorithms and data processing tools, among which is the `Pipeline` class, allowing users to
 prepend custom data processing steps to the machine learning model.
-`MolPipeline` extends this concept to the field of chemoinformatics by
-wrapping default functionalities of [RDKit](https://www.rdkit.org/), such as reading and writing SMILES strings
+`MolPipeline` extends this concept to the field of cheminformatics by
+wrapping standard [RDKit](https://www.rdkit.org/) functionality, such as reading and writing SMILES strings
 or calculating molecular descriptors from a molecule-object.
 
-A notable difference to the `Pipeline` class of scikit-learn is that the Pipline from `MolPipeline` allows for 
-instances to fail during processing without interrupting the whole pipeline.
-Such behaviour is useful when processing large datasets, where some SMILES strings might not encode valid molecules
-or some descriptors might not be calculable for certain molecules.
+MolPipeline aims to provide:
 
+- Automated end-to-end processing from molecule data sets to deployable machine learning models.
+- Scalable parallel processing and low memory usage through instance-based processing.
+- Standard pipeline building blocks for flexibly building custom pipelines for various
+cheminformatics tasks.
+- Consistent error handling for tracking, logging, and replacing failed instances (e.g., a
+SMILES string that could not be parsed correctly).
+- Integrated and self-contained pipeline serialization for easy deployment and tracking
+in version control.
 
 ## Publications
 
-The publication is freely available [here](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036).
+[Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A python package for processing
+molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024](https://doi.org/10.1021/acs.jcim.4c00863)
+\
+Further links: [ChemRxiv](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036)
+
+Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural
+fingerprint-based models, 2024
+\
+Further links: [repository](https://github.com/basf/neural-fingerprint-uncertainty)
 
 ## Installation
 ```commandline
 pip install molpipeline
 ```
 
-## Usage
+## Documentation
+
+The [notebooks](notebooks) folder contains many basic and advanced examples of how to use MolPipeline.
+
+A nice introduction to the basic usage is in the [01_getting_started_with_molpipeline notebook](notebooks/01_getting_started_with_molpipeline.ipynb).
 
-See the [notebooks](notebooks) folder for basic and advanced examples of how to use Molpipeline.
+## Quick Start
 
-A basic example of how to use MolPipeline to create a fingerprint-based model is shown below (see also the [notebook](notebooks/01_getting_started_with_molpipeline.ipynb)): 
+### Model building
+
+Create a fingerprint-based prediction model:
 ```python
 from molpipeline import Pipeline
 from molpipeline.any2mol import AutoToMol
@@ -58,8 +79,42 @@ pipeline.predict(["CCC"])
 # output: array([0.29])
 ```
 
-Molpipeline also provides custom estimators for standard cheminformatics tasks that can be integrated into pipelines,
-like clustering for scaffold splits (see also the [notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb)):
+### Feature calculation
+
+Calculating molecular descriptors from SMILES strings is straightforward. For example, physicochemical properties can
+be calculated like this:
+```python
+from molpipeline import Pipeline
+from molpipeline.any2mol import AutoToMol
+from molpipeline.mol2any import MolToRDKitPhysChem
+
+pipeline_physchem = Pipeline(
+    [
+        ("auto2mol", AutoToMol()),
+        (
+            "physchem",
+            MolToRDKitPhysChem(
+                standardizer=None,
+                descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"],
+            ),
+        ),
+    ],
+    n_jobs=-1,
+)
+physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"])
+physchem_matrix
+# output: array([[72.066,  0.   ,  0.   ],
+#                [88.065, 20.23 ,  1.   ]])
+```
+
+MolPipeline provides further features and descriptors from RDKit, 
+for example Morgan (binary/count) fingerprints and MACCS keys.
+See the [04_feature_calculation notebook](notebooks/04_feature_calculation.ipynb) for more examples.
+
+### Clustering
+
+MolPipeline provides several clustering algorithms as sklearn-like estimators. For example, molecules can be
+clustered by their Murcko scaffold. See the [02_scaffold_split_with_custom_estimators notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb) for scaffold splits and further examples.
 
 ```python
 from molpipeline.estimators import MurckoScaffoldClustering

diff --git a/molpipeline/abstract_pipeline_elements/any2mol/string2mol.py b/molpipeline/abstract_pipeline_elements/any2mol/string2mol.py
@@ -4,8 +4,11 @@
 
 import abc
 
-from molpipeline.abstract_pipeline_elements.core import AnyToMolPipelineElement
-from molpipeline.utils.molpipeline_types import OptionalMol
+from molpipeline.abstract_pipeline_elements.core import (
+    AnyToMolPipelineElement,
+    InvalidInstance,
+)
+from molpipeline.utils.molpipeline_types import OptionalMol, RDKitMol
 
 
 class StringToMolPipelineElement(AnyToMolPipelineElement, abc.ABC):
@@ -43,3 +46,60 @@ def pretransform_single(self, value: str) -> OptionalMol:
         OptionalMol
             RDKit molecule if representation was valid, else InvalidInstance.
         """
+
+
+class SimpleStringToMolElement(StringToMolPipelineElement, abc.ABC):
+    """Transforms string representation to RDKit Mol objects."""
+
+    def pretransform_single(self, value: str) -> OptionalMol:
+        """Transform string to molecule.
+
+        Parameters
+        ----------
+        value: str
+            string representation.
+
+        Returns
+        -------
+        OptionalMol
+            RDKit molecule if the string representation is valid, else InvalidInstance.
+        """
+        if value is None:
+            return InvalidInstance(
+                self.uuid,
+                f"Invalid representation: {value}",
+                self.name,
+            )
+
+        if not isinstance(value, str):
+            return InvalidInstance(
+                self.uuid,
+                f"Not a string: {value}",
+                self.name,
+            )
+
+        mol: RDKitMol = self.string_to_mol(value)
+
+        if not mol:
+            return InvalidInstance(
+                self.uuid,
+                f"Invalid representation: {value}",
+                self.name,
+            )
+        mol.SetProp("identifier", value)
+        return mol
+
+    @abc.abstractmethod
+    def string_to_mol(self, value: str) -> RDKitMol:
+        """Transform string representation to molecule.
+
+        Parameters
+        ----------
+        value: str
+            string representation
+
+        Returns
+        -------
+        RDKitMol
+            Rdkit molecule if valid representation, else None.
+        """
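The `SimpleStringToMolElement` class added in this hunk appears to follow a template-method layout: `pretransform_single` performs the shared input validation, and subclasses only supply `string_to_mol`. The library-free sketch below illustrates that layout under stated assumptions. `InvalidInstance`, `SimpleStringToValueElement`, and `IntFromString` are hypothetical stand-ins written for this example, not the molpipeline classes, and parsing an integer stands in for parsing a molecule.

```python
import abc
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class InvalidInstance:
    """Hypothetical stand-in for molpipeline's InvalidInstance."""

    element_id: str
    message: str
    element_name: str


class SimpleStringToValueElement(abc.ABC):
    """Stand-in base class: shared validation here, parsing in subclasses."""

    def __init__(self, name: str, uuid: str) -> None:
        self.name = name
        self.uuid = uuid

    def pretransform_single(self, value: object) -> Union[int, InvalidInstance]:
        # Reject anything that is not a string before calling the parser.
        if not isinstance(value, str):
            return InvalidInstance(self.uuid, f"Not a string: {value}", self.name)
        parsed = self.string_to_value(value)
        # The parser signals failure by returning None.
        if parsed is None:
            return InvalidInstance(
                self.uuid, f"Invalid representation: {value}", self.name
            )
        return parsed

    @abc.abstractmethod
    def string_to_value(self, value: str) -> Optional[int]:
        """Parse the string; return None if it is not valid."""


class IntFromString(SimpleStringToValueElement):
    """Concrete subclass: only the parsing step is implemented."""

    def string_to_value(self, value: str) -> Optional[int]:
        try:
            return int(value)
        except ValueError:
            return None


element = IntFromString(name="int_parser", uuid="demo-uuid")
results = [element.pretransform_single(v) for v in ["42", "abc", None]]
```

In this sketch a failed instance is returned as a value rather than raised, so one bad input does not interrupt the processing of the remaining inputs.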
diff --git a/molpipeline/abstract_pipeline_elements/core.py b/molpipeline/abstract_pipeline_elements/core.py
@@ -97,21 +97,23 @@ class ABCPipelineElement(abc.ABC):
 
     def __init__(
         self,
-        name: str = "ABCPipelineElement",
+        name: Optional[str] = None,
         n_jobs: int = 1,
         uuid: Optional[str] = None,
     ) -> None:
         """Initialize ABCPipelineElement.
 
         Parameters
         ----------
-        name: str
+        name: Optional[str], optional (default=None)
             Name of PipelineElement
         n_jobs: int
             Number of cores used for processing.
         uuid: Optional[str]
             Unique identifier of the PipelineElement.
         """
+        if name is None:
+            name = self.__class__.__name__
         self.name = name
         self.n_jobs = n_jobs
         if uuid is None:
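Resolving a `None` name to `self.__class__.__name__` means each subclass gets its own default name without overriding `__init__`, which is presumably why the per-class constructors are deleted later in this file. A minimal sketch of the idea, using hypothetical class names (`BaseElement`, `MolToMolElement`) rather than the molpipeline classes:

```python
from typing import Optional


class BaseElement:
    """Hypothetical base class mirroring the name-defaulting logic."""

    def __init__(self, name: Optional[str] = None) -> None:
        # Fall back to the name of the concrete class, not the base class.
        if name is None:
            name = self.__class__.__name__
        self.name = name


class MolToMolElement(BaseElement):
    """No __init__ override is needed just to change the default name."""


default_name = MolToMolElement().name
custom_name = MolToMolElement(name="standardize").name
```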
@@ -182,12 +184,12 @@ def get_params(self, deep: bool = True) -> dict[str, Any]:
             "uuid": self.uuid,
         }
 
-    def set_params(self, **parameters: dict[str, Any]) -> Self:
+    def set_params(self, **parameters: Any) -> Self:
         """As the setter function cannot be accessed with super(), this method is implemented for inheritance.
 
         Parameters
         ----------
-        parameters: dict[str, Any]
+        parameters: Any
             Parameters to be set.
 
         Returns
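The annotation change from `dict[str, Any]` to `Any` matters because an annotation on `**parameters` describes the type of each keyword value, not of the collected dictionary. With the old annotation, a type checker such as mypy would expect every keyword argument to be a dict itself. A small sketch with a hypothetical free function `set_params` (not the method from this file):

```python
from typing import Any


def set_params(**parameters: Any) -> dict[str, Any]:
    # Inside the function, `parameters` is always a dict keyed by argument
    # name; the annotation above describes each value, not the dict itself.
    return parameters


collected = set_params(n_jobs=2, name="physchem")
```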
@@ -338,15 +340,15 @@ class TransformingPipelineElement(ABCPipelineElement):
 
     def __init__(
         self,
-        name: str = "ABCPipelineElement",
+        name: Optional[str] = None,
         n_jobs: int = 1,
         uuid: Optional[str] = None,
     ) -> None:
         """Initialize TransformingPipelineElement.
 
         Parameters
         ----------
-        name: str
+        name: Optional[str], optional (default=None)
             Name of PipelineElement
         n_jobs: int
             Number of cores used for processing.
@@ -377,12 +379,12 @@ def parameters(self) -> dict[str, Any]:
         return self.get_params()
 
     @parameters.setter
-    def parameters(self, **parameters: dict[str, Any]) -> None:
+    def parameters(self, **parameters: Any) -> None:
         """Set the parameters of the object.
 
         Parameters
         ----------
-        parameters: dict[str, Any]
+        parameters: Any
             Object parameters as a dictionary.
 
         Returns
@@ -616,25 +618,6 @@ class MolToMolPipelineElement(TransformingPipelineElement, abc.ABC):
     _input_type = "RDKitMol"
     _output_type = "RDKitMol"
 
-    def __init__(
-        self,
-        name: str = "MolToMolPipelineElement",
-        n_jobs: int = 1,
-        uuid: Optional[str] = None,
-    ) -> None:
-        """Initialize MolToMolPipelineElement.
-
-        Parameters
-        ----------
-        name: str
-            Name of the PipelineElement.
-        n_jobs: int
-            Number of cores used for processing.
-        uuid: Optional[str]
-            Unique identifier of the PipelineElement.
-        """
-        super().__init__(name=name, n_jobs=n_jobs, uuid=uuid)
-
     def transform(self, values: list[OptionalMol]) -> list[OptionalMol]:
         """Transform list of molecules to list of molecules.
 
@@ -700,25 +683,6 @@ class AnyToMolPipelineElement(TransformingPipelineElement, abc.ABC):
 
     _output_type = "RDKitMol"
 
-    def __init__(
-        self,
-        name: str = "AnyToMolPipelineElement",
-        n_jobs: int = 1,
-        uuid: Optional[str] = None,
-    ) -> None:
-        """Initialize AnyToMolPipelineElement.
-
-        Parameters
-        ----------
-        name: str
-            Name of the PipelineElement.
-        n_jobs: int
-            Number of cores used for processing.
-        uuid: Optional[str]
-            Unique identifier of the PipelineElement.
-        """
-        super().__init__(name=name, n_jobs=n_jobs, uuid=uuid)
-
     def transform(self, values: Any) -> list[OptionalMol]:
         """Transform list of instances to list of molecules.
 
@@ -756,25 +720,6 @@ class MolToAnyPipelineElement(TransformingPipelineElement, abc.ABC):
 
     _input_type = "RDKitMol"
 
-    def __init__(
-        self,
-        name: str = "MolToAnyPipelineElement",
-        n_jobs: int = 1,
-        uuid: Optional[str] = None,
-    ) -> None:
-        """Initialize MolToAnyPipelineElement.
-
-        Parameters
-        ----------
-        name: str
-            Name of the PipelineElement.
-        n_jobs: int
-            Number of cores used for processing.
-        uuid: Optional[str]
-            Unique identifier of the PipelineElement.
-        """
-        super().__init__(name=name, n_jobs=n_jobs, uuid=uuid)
-
     @abc.abstractmethod
     def pretransform_single(self, value: RDKitMol) -> Any:
         """Transform the molecule, but skip parameters learned during fitting.