diff --git a/.github/molpipeline.png b/.github/molpipeline.png
new file mode 100755
index 00000000..fc6129a2
Binary files /dev/null and b/.github/molpipeline.png differ
diff --git a/README.md b/README.md
index 7479707b..b42b655e 100644
--- a/README.md
+++ b/README.md
@@ -1,35 +1,56 @@
# MolPipeline
-MolPipeline is a Python package providing RDKit functionality in a Scikit-learn like fashion.
+MolPipeline is a Python package for processing molecules with RDKit in scikit-learn.
+
+
## Background
-The open-source package [scikit-learn](https://scikit-learn.org/) provides a large variety of machine
+The [scikit-learn](https://scikit-learn.org/) package provides a large variety of machine
learning algorithms and data processing tools, among which is the `Pipeline` class, allowing users to
prepend custom data processing steps to the machine learning model.
-`MolPipeline` extends this concept to the field of chemoinformatics by
-wrapping default functionalities of [RDKit](https://www.rdkit.org/), such as reading and writing SMILES strings
+`MolPipeline` extends this concept to the field of cheminformatics by
+wrapping standard [RDKit](https://www.rdkit.org/) functionality, such as reading and writing SMILES strings
or calculating molecular descriptors from a molecule-object.
-A notable difference to the `Pipeline` class of scikit-learn is that the Pipline from `MolPipeline` allows for
-instances to fail during processing without interrupting the whole pipeline.
-Such behaviour is useful when processing large datasets, where some SMILES strings might not encode valid molecules
-or some descriptors might not be calculable for certain molecules.
+MolPipeline aims to provide:
+- Automated end-to-end processing from molecule data sets to deployable machine learning models.
+- Scalable parallel processing and low memory usage through instance-based processing.
+- Standard pipeline building blocks for flexibly building custom pipelines for various
+cheminformatics tasks.
+- Consistent error handling for tracking, logging, and replacing failed instances (e.g., a
+SMILES string that could not be parsed correctly).
+- Integrated and self-contained pipeline serialization for easy deployment and tracking
+in version control.
## Publications
-The publication is freely available [here](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036).
+[Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A python package for processing
+molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024](https://doi.org/10.1021/acs.jcim.4c00863)
+\
+Further links: [arXiv](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036)
+
+Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural
+fingerprint-based models, 2024
+\
+Further links: [repository](https://github.com/basf/neural-fingerprint-uncertainty)
## Installation
```commandline
pip install molpipeline
```
-## Usage
+## Documentation
+
+The [notebooks](notebooks) folder contains many basic and advanced examples of how to use Molpipeline.
+
+A nice introduction to the basic usage is in the [01_getting_started_with_molpipeline notebook](notebooks/01_getting_started_with_molpipeline.ipynb).
-See the [notebooks](notebooks) folder for basic and advanced examples of how to use Molpipeline.
+## Quick Start
-A basic example of how to use MolPipeline to create a fingerprint-based model is shown below (see also the [notebook](notebooks/01_getting_started_with_molpipeline.ipynb)):
+### Model building
+
+Create a fingerprint-based prediction model:
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
@@ -58,8 +79,42 @@ pipeline.predict(["CCC"])
# output: array([0.29])
```
-Molpipeline also provides custom estimators for standard cheminformatics tasks that can be integrated into pipelines,
-like clustering for scaffold splits (see also the [notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb)):
+### Feature calculation
+
+Calculating molecular descriptors from SMILES strings is straightforward. For example, physicochemical properties can
+be calculated like this:
+```python
+from molpipeline import Pipeline
+from molpipeline.any2mol import AutoToMol
+from molpipeline.mol2any import MolToRDKitPhysChem
+
+pipeline_physchem = Pipeline(
+ [
+ ("auto2mol", AutoToMol()),
+ (
+ "physchem",
+ MolToRDKitPhysChem(
+ standardizer=None,
+ descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"],
+ ),
+ ),
+ ],
+ n_jobs=-1,
+)
+physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"])
+physchem_matrix
+# output: array([[72.066, 0. , 0. ],
+# [88.065, 20.23 , 1. ]])
+```
+
+MolPipeline provides further features and descriptors from RDKit,
+for example Morgan (binary/count) fingerprints and MACCS keys.
+See the [04_feature_calculation notebook](notebooks/04_feature_calculation.ipynb) for more examples.
+
+### Clustering
+
+Molpipeline provides several clustering algorithms as sklearn-like estimators. For example, molecules can be
+clustered by their Murcko scaffold. See the [02_scaffold_split_with_custom_estimators notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb) for scaffolds splits and further examples.
```python
from molpipeline.estimators import MurckoScaffoldClustering
diff --git a/notebooks/04_feature_calculation.ipynb b/notebooks/04_feature_calculation.ipynb
new file mode 100644
index 00000000..1bcf9ce8
--- /dev/null
+++ b/notebooks/04_feature_calculation.ipynb
@@ -0,0 +1,915 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "a5e18566-ab97-4ead-b6e3-0ad930754a21",
+ "metadata": {},
+ "source": [
+ "# Feature calculation\n",
+ "\n",
+ "\n",
+ "\n",
+ "Molpipeline provides multiple molecular featurization methods and descriptors from RDKit. This notebook shows how features like\n",
+ "\n",
+ "- Morgan binary fingerprints\n",
+ "- Morgan count fingerprints\n",
+ "- MACCS keys fingerprints\n",
+ "- Physicochemical features\n",
+ "\n",
+ "can be easily calculated in parallel and in different variations with MolPipeline. If you are interested in further molecular featurization and descriptors check out the `molpipeline.mol2any` module."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "6872cc5e-5851-42ec-a63e-071d8139829e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "from molpipeline import Pipeline\n",
+ "from molpipeline.any2mol import AutoToMol\n",
+ "from molpipeline.mol2any import MolToMorganFP, MolToMACCSFP, MolToRDKitPhysChem"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8a6ba6bf-c0cd-4949-82f3-e71e538cdee0",
+ "metadata": {},
+ "source": [
+ "In this example we fetch the ESOL (delaney) data set. However, you can use any other data set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "761f0ee7-3e66-4e86-bdac-e9dcec9ecb17",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_full = pd.read_csv(\n",
+ " \"https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv\",\n",
+ " usecols=lambda col: col != \"num\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6853d13e-c371-49cc-8009-544022c67d34",
+ "metadata": {},
+ "source": [
+ "We use a smaller portion of the data set for illustration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "d47ea54e-ac15-4358-ae2b-7e8428642a26",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " Compound ID | \n",
+ " ESOL predicted log solubility in mols per litre | \n",
+ " Minimum Degree | \n",
+ " Molecular Weight | \n",
+ " Number of H-Bond Donors | \n",
+ " Number of Rings | \n",
+ " Number of Rotatable Bonds | \n",
+ " Polar Surface Area | \n",
+ " measured log solubility in mols per litre | \n",
+ " smiles | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " Amigdalin | \n",
+ " -0.974 | \n",
+ " 1 | \n",
+ " 457.432 | \n",
+ " 7 | \n",
+ " 3 | \n",
+ " 7 | \n",
+ " 202.32 | \n",
+ " -0.77 | \n",
+ " OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Fenfuram | \n",
+ " -2.885 | \n",
+ " 1 | \n",
+ " 201.225 | \n",
+ " 1 | \n",
+ " 2 | \n",
+ " 2 | \n",
+ " 42.24 | \n",
+ " -3.30 | \n",
+ " Cc1occc1C(=O)Nc2ccccc2 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " citral | \n",
+ " -2.579 | \n",
+ " 1 | \n",
+ " 152.237 | \n",
+ " 0 | \n",
+ " 0 | \n",
+ " 4 | \n",
+ " 17.07 | \n",
+ " -2.06 | \n",
+ " CC(C)=CCCC(C)=CC(=O) | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " Picene | \n",
+ " -6.618 | \n",
+ " 2 | \n",
+ " 278.354 | \n",
+ " 0 | \n",
+ " 5 | \n",
+ " 0 | \n",
+ " 0.00 | \n",
+ " -7.87 | \n",
+ " c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " Thiophene | \n",
+ " -2.232 | \n",
+ " 2 | \n",
+ " 84.143 | \n",
+ " 0 | \n",
+ " 1 | \n",
+ " 0 | \n",
+ " 0.00 | \n",
+ " -1.33 | \n",
+ " c1ccsc1 | \n",
+ "
\n",
+ " \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ "
\n",
+ " \n",
+ " 95 | \n",
+ " diethylstilbestrol | \n",
+ " -5.074 | \n",
+ " 1 | \n",
+ " 268.356 | \n",
+ " 2 | \n",
+ " 2 | \n",
+ " 4 | \n",
+ " 40.46 | \n",
+ " -4.07 | \n",
+ " CCC(=C(CC)c1ccc(O)cc1)c2ccc(O)cc2 | \n",
+ "
\n",
+ " \n",
+ " 96 | \n",
+ " Chlorothalonil | \n",
+ " -3.995 | \n",
+ " 1 | \n",
+ " 265.914 | \n",
+ " 0 | \n",
+ " 1 | \n",
+ " 0 | \n",
+ " 47.58 | \n",
+ " -5.64 | \n",
+ " c1(C#N)c(Cl)c(C#N)c(Cl)c(Cl)c(Cl)1 | \n",
+ "
\n",
+ " \n",
+ " 97 | \n",
+ " 2,3',4',5-PCB | \n",
+ " -6.312 | \n",
+ " 1 | \n",
+ " 291.992 | \n",
+ " 0 | \n",
+ " 2 | \n",
+ " 1 | \n",
+ " 0.00 | \n",
+ " -7.25 | \n",
+ " Clc1ccc(Cl)c(c1)c2ccc(Cl)c(Cl)c2 | \n",
+ "
\n",
+ " \n",
+ " 98 | \n",
+ " styrene oxide | \n",
+ " -1.826 | \n",
+ " 2 | \n",
+ " 120.151 | \n",
+ " 0 | \n",
+ " 2 | \n",
+ " 1 | \n",
+ " 12.53 | \n",
+ " -1.60 | \n",
+ " C1OC1c2ccccc2 | \n",
+ "
\n",
+ " \n",
+ " 99 | \n",
+ " Isopropylbenzene | \n",
+ " -3.265 | \n",
+ " 1 | \n",
+ " 120.195 | \n",
+ " 0 | \n",
+ " 1 | \n",
+ " 1 | \n",
+ " 0.00 | \n",
+ " -3.27 | \n",
+ " CC(C)c1ccccc1 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
100 rows × 10 columns
\n",
+ "
"
+ ],
+ "text/plain": [
+ " Compound ID ESOL predicted log solubility in mols per litre \\\n",
+ "0 Amigdalin -0.974 \n",
+ "1 Fenfuram -2.885 \n",
+ "2 citral -2.579 \n",
+ "3 Picene -6.618 \n",
+ "4 Thiophene -2.232 \n",
+ ".. ... ... \n",
+ "95 diethylstilbestrol -5.074 \n",
+ "96 Chlorothalonil -3.995 \n",
+ "97 2,3',4',5-PCB -6.312 \n",
+ "98 styrene oxide -1.826 \n",
+ "99 Isopropylbenzene -3.265 \n",
+ "\n",
+ " Minimum Degree Molecular Weight Number of H-Bond Donors \\\n",
+ "0 1 457.432 7 \n",
+ "1 1 201.225 1 \n",
+ "2 1 152.237 0 \n",
+ "3 2 278.354 0 \n",
+ "4 2 84.143 0 \n",
+ ".. ... ... ... \n",
+ "95 1 268.356 2 \n",
+ "96 1 265.914 0 \n",
+ "97 1 291.992 0 \n",
+ "98 2 120.151 0 \n",
+ "99 1 120.195 0 \n",
+ "\n",
+ " Number of Rings Number of Rotatable Bonds Polar Surface Area \\\n",
+ "0 3 7 202.32 \n",
+ "1 2 2 42.24 \n",
+ "2 0 4 17.07 \n",
+ "3 5 0 0.00 \n",
+ "4 1 0 0.00 \n",
+ ".. ... ... ... \n",
+ "95 2 4 40.46 \n",
+ "96 1 0 47.58 \n",
+ "97 2 1 0.00 \n",
+ "98 2 1 12.53 \n",
+ "99 1 1 0.00 \n",
+ "\n",
+ " measured log solubility in mols per litre \\\n",
+ "0 -0.77 \n",
+ "1 -3.30 \n",
+ "2 -2.06 \n",
+ "3 -7.87 \n",
+ "4 -1.33 \n",
+ ".. ... \n",
+ "95 -4.07 \n",
+ "96 -5.64 \n",
+ "97 -7.25 \n",
+ "98 -1.60 \n",
+ "99 -3.27 \n",
+ "\n",
+ " smiles \n",
+ "0 OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... \n",
+ "1 Cc1occc1C(=O)Nc2ccccc2 \n",
+ "2 CC(C)=CCCC(C)=CC(=O) \n",
+ "3 c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 \n",
+ "4 c1ccsc1 \n",
+ ".. ... \n",
+ "95 CCC(=C(CC)c1ccc(O)cc1)c2ccc(O)cc2 \n",
+ "96 c1(C#N)c(Cl)c(C#N)c(Cl)c(Cl)c(Cl)1 \n",
+ "97 Clc1ccc(Cl)c(c1)c2ccc(Cl)c(Cl)c2 \n",
+ "98 C1OC1c2ccccc2 \n",
+ "99 CC(C)c1ccccc1 \n",
+ "\n",
+ "[100 rows x 10 columns]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = df_full.head(n=100)\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "80d9843a-a702-4da5-8a4f-6c5ed7a5034b",
+ "metadata": {},
+ "source": [
+ "## Calculating fingerprints"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "15dcb6cb-2a8e-4d62-a218-826581155816",
+ "metadata": {},
+ "source": [
+ "### Morgan binary fingerprints\n",
+ "\n",
+ "Morgan fingerprints are the most popular molecular fingerprints. They are also known as [Extended-Connectivity Fingerprints (ECFP)](https://doi.org/10.1021/ci100050t). They encode circular substructures in the molecule. The binary version contains only 0s and 1s indicating the presence or absence of the substructures in the molecule."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1a838dd7-ec21-4875-a5b8-c5e0c27d9389",
+ "metadata": {},
+ "source": [
+ "Let's define the Pipeline to first read the molecule and then calculate the binary Morgan fingerprint. Then, we execute it by calling the `transform` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "b6be019a-cc4d-45b2-b41a-9dca98d9644c",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 181 ms, sys: 247 ms, total: 428 ms\n",
+ "Wall time: 12.6 s\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "# define the pipeline\n",
+ "pipeline_morgan = Pipeline(\n",
+ " [(\"auto2mol\", AutoToMol()), (\"morgan2_2048\", MolToMorganFP(n_bits=2048, radius=2))],\n",
+ " n_jobs=-1,\n",
+ ")\n",
+ "# execute the pipeline\n",
+ "morgan_matrix = pipeline_morgan.transform(df[\"smiles\"])\n",
+ "morgan_matrix"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a13cc430-1c5e-4399-ab50-4b56ce8a7c09",
+ "metadata": {},
+ "source": [
+ "By default, the `MolToMorganFP` element returns a sparse matrix. More specifically, a [csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) is returned which is more memory efficient than a dense matrix since most elements in the matrix are zero."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d872a591-cbfe-4158-8960-da813249fd1b",
+ "metadata": {},
+ "source": [
+ "To get a dense matrix you can convert the `csr_matrix` to a dense numpy matrix like this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "5d9d772b-98b9-42e5-ba12-11f007a3d17f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "matrix([[0, 1, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " ...,\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 1, 0, ..., 0, 0, 0]])"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "morgan_matrix.todense()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "923f168d-e6e4-418d-adb3-5451555b1303",
+ "metadata": {},
+ "source": [
+ "Alternatively, you can specify in the `MolToMorganFP` element the return type of the feature matrix by using the `return_as` option. You can choose between\n",
+ "\n",
+ "- `return_as=\"sparse\"` which returns a `csr_matrix`\n",
+ "- `return_as=\"dense` which returns a dense numpy matrix\n",
+ "- `return_as=\"explicit_bit_vect\"` which returns RDKit's dense [ExplicitBitVect](https://www.rdkit.org/new_docs/cppapi/classExplicitBitVect.html)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "e728cf48-10bb-4168-9229-fe48b462ac03",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 45.4 ms, sys: 11.7 ms, total: 57 ms\n",
+ "Wall time: 62.4 ms\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([[0, 1, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " ...,\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 1, 0, ..., 0, 0, 0]], dtype=uint8)"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "pipeline_morgan_dense = Pipeline(\n",
+ " [\n",
+ " (\"auto2mol\", AutoToMol()),\n",
+ " (\"morgan2_2048\", MolToMorganFP(n_bits=2048, radius=2, return_as=\"dense\")),\n",
+ " ],\n",
+ " n_jobs=-1,\n",
+ ")\n",
+ "dense_morgan_matrix = pipeline_morgan_dense.transform(df[\"smiles\"])\n",
+ "dense_morgan_matrix"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6aecd789-2198-4325-b892-6aeecf857e25",
+ "metadata": {},
+ "source": [
+ "The feature matrix can be used to train a machine learning model but also for various analyses."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "85043b30-7476-4204-8268-a9375b2ee4f8",
+ "metadata": {},
+ "source": [
+ "### Morgan count fingerprints"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9897e96f-4ffd-434b-b629-837a31a99f04",
+ "metadata": {},
+ "source": [
+ "Just set `counted=True` to compute Morgan count fingerprints instead of binary fingerprints."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "477ebba4-0fbe-46c2-8c4a-13f9051ae85b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[0, 1, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " ...,\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 1, 0, ..., 0, 0, 0]], dtype=uint32)"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pipeline_morgan_counted = Pipeline(\n",
+ " [\n",
+ " (\"auto2mol\", AutoToMol()),\n",
+ " (\n",
+ " \"morgan2_2048\",\n",
+ " MolToMorganFP(n_bits=2048, radius=2, counted=True, return_as=\"dense\"),\n",
+ " ),\n",
+ " ],\n",
+ " n_jobs=-1,\n",
+ ")\n",
+ "count_morgan_matrix = pipeline_morgan_counted.transform(df[\"smiles\"])\n",
+ "count_morgan_matrix"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0e24ea56-f0f8-4426-b3e3-da960b93d431",
+ "metadata": {},
+ "source": [
+ "When we sort the matrix values we see that some substructures are present up to 14 times in a single molecule."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "189ea2d6-9274-4097-b654-5ca88c318abf",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[14, 13, 12, 12, 11, 10, 10, 10, 10, 10, 10, 10, 9, 9, 8, 8, 8, 8, 8, 8]"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sorted(count_morgan_matrix.ravel(), reverse=True)[:20]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "80fb055a-1b4c-4c69-989c-5f3e774e80e1",
+ "metadata": {},
+ "source": [
+ "### MACCS key fingerprints\n",
+ "\n",
+ "MACCS keys are a manually defined set of 166 substructures whose presence is checked in the molecule. MACCS keys contain for example common functional groups."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "d9a11c62-c8ad-470f-b40f-f5d4ddc16b61",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 43.8 ms, sys: 1.15 ms, total: 44.9 ms\n",
+ "Wall time: 70.9 ms\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([[0, 0, 0, ..., 1, 1, 0],\n",
+ " [0, 0, 0, ..., 1, 1, 0],\n",
+ " [0, 0, 0, ..., 1, 0, 0],\n",
+ " ...,\n",
+ " [0, 0, 0, ..., 0, 1, 0],\n",
+ " [0, 0, 0, ..., 1, 1, 0],\n",
+ " [0, 0, 0, ..., 0, 1, 0]])"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "pipeline_maccs_dense = Pipeline(\n",
+ " [(\"auto2mol\", AutoToMol()), (\"maccs\", MolToMACCSFP(return_as=\"dense\"))],\n",
+ " n_jobs=-1,\n",
+ ")\n",
+ "dense_maccs_matrix = pipeline_maccs_dense.transform(df[\"smiles\"])\n",
+ "dense_maccs_matrix"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7d3546ca-6d58-4a69-a252-d7deb3147a40",
+ "metadata": {},
+ "source": [
+ "## Physicochemical features\n",
+ "\n",
+ "RDKit also provides more than 200 physicochemical descriptors that can readily be computed from most molecules. In MolPipeline we can compute these features with the `MolToRDKitPhysChem` element."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "858afb55-7e24-415d-bb5a-e0d7c811d6df",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 68.1 ms, sys: 2.43 ms, total: 70.5 ms\n",
+ "Wall time: 171 ms\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([[10.25332888, 10.25332888, 0.48660209, ..., 0. ,\n",
+ " 0. , 0. ],\n",
+ " [11.72491119, 11.72491119, 0.14587963, ..., 0. ,\n",
+ " 0. , 0. ],\n",
+ " [10.02049761, 10.02049761, 0.84508976, ..., 0. ,\n",
+ " 0. , 0. ],\n",
+ " ...,\n",
+ " [ 6.08815823, 6.08815823, 0.49556374, ..., 0. ,\n",
+ " 0. , 0. ],\n",
+ " [ 5.09453704, 5.09453704, 0.40851852, ..., 0. ,\n",
+ " 0. , 0. ],\n",
+ " [ 2.2037037 , 2.2037037 , 0.65851852, ..., 0. ,\n",
+ " 0. , 0. ]])"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "pipeline_physchem = Pipeline(\n",
+ " [(\"auto2mol\", AutoToMol()), (\"physchem\", MolToRDKitPhysChem(standardizer=None))],\n",
+ " n_jobs=-1,\n",
+ ")\n",
+ "physchem_matrix = pipeline_physchem.transform(df[\"smiles\"])\n",
+ "physchem_matrix"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8746f6cb-dc30-4435-a97b-0235f2c8c47a",
+ "metadata": {},
+ "source": [
+ "We can get the name of the descriptors like this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "f0b5fe47-54f0-4cca-9a1a-aa689a0b2d0c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['MaxAbsEStateIndex',\n",
+ " 'MaxEStateIndex',\n",
+ " 'MinAbsEStateIndex',\n",
+ " 'MinEStateIndex',\n",
+ " 'qed',\n",
+ " 'SPS',\n",
+ " 'HeavyAtomMolWt',\n",
+ " 'ExactMolWt',\n",
+ " 'NumValenceElectrons',\n",
+ " 'NumRadicalElectrons',\n",
+ " 'MaxPartialCharge',\n",
+ " 'MinPartialCharge',\n",
+ " 'MaxAbsPartialCharge',\n",
+ " 'MinAbsPartialCharge',\n",
+ " 'FpDensityMorgan1',\n",
+ " 'FpDensityMorgan2',\n",
+ " 'FpDensityMorgan3',\n",
+ " 'BCUT2D_MWHI',\n",
+ " 'BCUT2D_MWLOW',\n",
+ " 'BCUT2D_CHGHI']"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pipeline_physchem[\"physchem\"].descriptor_list[:20]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b0823f4d-8a2e-4ae2-91f7-3db6ecaf0c0e",
+ "metadata": {},
+ "source": [
+ "When we only want to calculate a subset of all available descriptors we can specify this during pipeline construction"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "a3e005f3-f421-4634-9135-860e91a19de1",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 41.2 ms, sys: 3.38 ms, total: 44.6 ms\n",
+ "Wall time: 47.5 ms\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([[430.216, 202.32 , 12. ],\n",
+ " [190.137, 42.24 , 2. ],\n",
+ " [136.109, 17.07 , 1. ],\n",
+ " [264.242, 0. , 0. ],\n",
+ " [ 80.111, 0. , 1. ],\n",
+ " [130.151, 12.89 , 2. ],\n",
+ " [321.397, 0. , 0. ],\n",
+ " [248.196, 40.46 , 2. ],\n",
+ " [372.849, 12.53 , 1. ],\n",
+ " [372.247, 63.22 , 6. ],\n",
+ " [ 78.05 , 29.1 , 1. ],\n",
+ " [155.563, 0. , 0. ],\n",
+ " [ 60.055, 0. , 0. ],\n",
+ " [204.144, 58.2 , 2. ],\n",
+ " [168.154, 0. , 0. ],\n",
+ " [ 71.486, 0. , 0. ],\n",
+ " [ 76.054, 20.23 , 1. ],\n",
+ " [ 98.084, 23.79 , 1. ],\n",
+ " [283.184, 53.47 , 6. ],\n",
+ " [148.12 , 20.23 , 1. ],\n",
+ " [321.397, 0. , 0. ],\n",
+ " [216.155, 54.86 , 3. ],\n",
+ " [243.25 , 18.46 , 5. ],\n",
+ " [166.115, 38.33 , 2. ],\n",
+ " [309.139, 115.54 , 6. ],\n",
+ " [100.076, 20.23 , 1. ],\n",
+ " [172.103, 72.68 , 5. ],\n",
+ " [196.121, 75.27 , 3. ],\n",
+ " [309.966, 0. , 0. ],\n",
+ " [140.097, 26.3 , 2. ],\n",
+ " [120.11 , 0. , 0. ],\n",
+ " [267.272, 18.46 , 5. ],\n",
+ " [284.186, 76.66 , 4. ],\n",
+ " [ 94.928, 0. , 0. ],\n",
+ " [168.154, 0. , 0. ],\n",
+ " [ 76.054, 17.07 , 1. ],\n",
+ " [158.139, 12.03 , 1. ],\n",
+ " [234.215, 29.54 , 3. ],\n",
+ " [325.266, 38.77 , 5. ],\n",
+ " [210.981, 0. , 0. ],\n",
+ " [179.585, 0. , 0. ],\n",
+ " [ 76.054, 20.23 , 1. ],\n",
+ " [160.088, 75.27 , 3. ],\n",
+ " [136.109, 20.23 , 1. ],\n",
+ " [ 80.042, 26.3 , 2. ],\n",
+ " [100.076, 20.23 , 1. ],\n",
+ " [205.998, 29.1 , 1. ],\n",
+ " [258.034, 60.91 , 4. ],\n",
+ " [328.195, 107.77 , 7. ],\n",
+ " [146.128, 12.89 , 1. ],\n",
+ " [ 96.088, 0. , 0. ],\n",
+ " [220.143, 75.27 , 3. ],\n",
+ " [216.198, 0. , 0. ],\n",
+ " [248.015, 54.86 , 3. ],\n",
+ " [356.85 , 0. , 0. ],\n",
+ " [100.076, 20.23 , 1. ],\n",
+ " [108.099, 0. , 0. ],\n",
+ " [144.132, 0. , 0. ],\n",
+ " [228.209, 0. , 0. ],\n",
+ " [ 76.054, 17.07 , 1. ],\n",
+ " [427.756, 0. , 0. ],\n",
+ " [104.064, 26.3 , 2. ],\n",
+ " [367.223, 115.06 , 6. ],\n",
+ " [102.072, 46.25 , 2. ],\n",
+ " [248.157, 90.06 , 5. ],\n",
+ " [347.692, 54.37 , 3. ],\n",
+ " [213.587, 53.94 , 5. ],\n",
+ " [118.075, 68.87 , 3. ],\n",
+ " [223.993, 72.19 , 2. ],\n",
+ " [215.038, 0. , 0. ],\n",
+ " [232.111, 118.05 , 6. ],\n",
+ " [277.042, 52.37 , 3. ],\n",
+ " [136.109, 17.07 , 1. ],\n",
+ " [232.154, 75.27 , 3. ],\n",
+ " [116.075, 26.3 , 2. ],\n",
+ " [116.075, 26.3 , 2. ],\n",
+ " [356.252, 75.71 , 4. ],\n",
+ " [250.491, 0. , 0. ],\n",
+ " [115.937, 0. , 0. ],\n",
+ " [325.09 , 49.17 , 5. ],\n",
+ " [245.177, 55.84 , 6. ],\n",
+ " [140.105, 51.56 , 4. ],\n",
+ " [ 72.092, 52.04 , 1. ],\n",
+ " [ 96.088, 0. , 0. ],\n",
+ " [120.11 , 0. , 0. ],\n",
+ " [236.74 , 0. , 0. ],\n",
+ " [428.285, 68.55 , 5. ],\n",
+ " [ 82.038, 43.14 , 2. ],\n",
+ " [136.109, 17.07 , 1. ],\n",
+ " [261.627, 45.23 , 3. ],\n",
+ " [188.977, 43.14 , 2. ],\n",
+ " [236.211, 58.2 , 3. ],\n",
+ " [192.176, 0. , 0. ],\n",
+ " [ 88.065, 9.23 , 1. ],\n",
+ " [144.132, 0. , 0. ],\n",
+ " [248.196, 40.46 , 2. ],\n",
+ " [265.914, 47.58 , 2. ],\n",
+ " [285.944, 0. , 0. ],\n",
+ " [112.087, 12.53 , 1. ],\n",
+ " [108.099, 0. , 0. ]])"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "pipeline_physchem_small = Pipeline(\n",
+ " [\n",
+ " (\"auto2mol\", AutoToMol()),\n",
+ " (\n",
+ " \"physchem\",\n",
+ " MolToRDKitPhysChem(\n",
+ " standardizer=None,\n",
+ " descriptor_list=[\"HeavyAtomMolWt\", \"TPSA\", \"NumHAcceptors\"],\n",
+ " ),\n",
+ " ),\n",
+ " ],\n",
+ " n_jobs=-1,\n",
+ ")\n",
+ "physchem_matrix_small = pipeline_physchem_small.transform(df[\"smiles\"])\n",
+ "physchem_matrix_small"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}