diff --git a/.github/molpipeline.png b/.github/molpipeline.png new file mode 100755 index 00000000..fc6129a2 Binary files /dev/null and b/.github/molpipeline.png differ diff --git a/README.md b/README.md index 7479707b..b42b655e 100644 --- a/README.md +++ b/README.md @@ -1,35 +1,56 @@ # MolPipeline -MolPipeline is a Python package providing RDKit functionality in a Scikit-learn like fashion. +MolPipeline is a Python package for processing molecules with RDKit in scikit-learn. + +

## Background -The open-source package [scikit-learn](https://scikit-learn.org/) provides a large variety of machine +The [scikit-learn](https://scikit-learn.org/) package provides a large variety of machine learning algorithms and data processing tools, among which is the `Pipeline` class, allowing users to prepend custom data processing steps to the machine learning model. -`MolPipeline` extends this concept to the field of chemoinformatics by -wrapping default functionalities of [RDKit](https://www.rdkit.org/), such as reading and writing SMILES strings +`MolPipeline` extends this concept to the field of cheminformatics by +wrapping standard [RDKit](https://www.rdkit.org/) functionality, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule-object. -A notable difference to the `Pipeline` class of scikit-learn is that the Pipline from `MolPipeline` allows for -instances to fail during processing without interrupting the whole pipeline. -Such behaviour is useful when processing large datasets, where some SMILES strings might not encode valid molecules -or some descriptors might not be calculable for certain molecules. +MolPipeline aims to provide: +- Automated end-to-end processing from molecule data sets to deployable machine learning models. +- Scalable parallel processing and low memory usage through instance-based processing. +- Standard pipeline building blocks for flexibly building custom pipelines for various +cheminformatics tasks. +- Consistent error handling for tracking, logging, and replacing failed instances (e.g., a +SMILES string that could not be parsed correctly). +- Integrated and self-contained pipeline serialization for easy deployment and tracking +in version control. ## Publications -The publication is freely available [here](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036). 
+[Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A Python package for processing +molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024](https://doi.org/10.1021/acs.jcim.4c00863) +\ +Further links: [ChemRxiv](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036) + +Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural +fingerprint-based models, 2024 +\ +Further links: [repository](https://github.com/basf/neural-fingerprint-uncertainty) ## Installation ```commandline pip install molpipeline ``` -## Usage +## Documentation + +The [notebooks](notebooks) folder contains many basic and advanced examples of how to use MolPipeline. + +A nice introduction to the basic usage is in the [01_getting_started_with_molpipeline notebook](notebooks/01_getting_started_with_molpipeline.ipynb). -See the [notebooks](notebooks) folder for basic and advanced examples of how to use Molpipeline. +## Quick Start -A basic example of how to use MolPipeline to create a fingerprint-based model is shown below (see also the [notebook](notebooks/01_getting_started_with_molpipeline.ipynb)): +### Model building + +Create a fingerprint-based prediction model: ```python from molpipeline import Pipeline from molpipeline.any2mol import AutoToMol @@ -58,8 +79,42 @@ pipeline.predict(["CCC"]) # output: array([0.29]) ``` -Molpipeline also provides custom estimators for standard cheminformatics tasks that can be integrated into pipelines, -like clustering for scaffold splits (see also the [notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb)): +### Feature calculation + +Calculating molecular descriptors from SMILES strings is straightforward. 
For example, physicochemical properties can +be calculated like this: +```python +from molpipeline import Pipeline +from molpipeline.any2mol import AutoToMol +from molpipeline.mol2any import MolToRDKitPhysChem + +pipeline_physchem = Pipeline( + [ + ("auto2mol", AutoToMol()), + ( + "physchem", + MolToRDKitPhysChem( + standardizer=None, + descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"], + ), + ), + ], + n_jobs=-1, +) +physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"]) +physchem_matrix +# output: array([[72.066, 0. , 0. ], +# [88.065, 20.23 , 1. ]]) +``` + +MolPipeline provides further features and descriptors from RDKit, +for example Morgan (binary/count) fingerprints and MACCS keys. +See the [04_feature_calculation notebook](notebooks/04_feature_calculation.ipynb) for more examples. + +### Clustering + +MolPipeline provides several clustering algorithms as sklearn-like estimators. For example, molecules can be +clustered by their Murcko scaffold. See the [02_scaffold_split_with_custom_estimators notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb) for scaffold splits and further examples. ```python from molpipeline.estimators import MurckoScaffoldClustering diff --git a/notebooks/04_feature_calculation.ipynb b/notebooks/04_feature_calculation.ipynb new file mode 100644 index 00000000..1bcf9ce8 --- /dev/null +++ b/notebooks/04_feature_calculation.ipynb @@ -0,0 +1,915 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a5e18566-ab97-4ead-b6e3-0ad930754a21", + "metadata": {}, + "source": [ + "# Feature calculation\n", + "\n", + "\n", + "\n", + "MolPipeline provides multiple molecular featurization methods and descriptors from RDKit. This notebook shows how features like\n", + "\n", + "- Morgan binary fingerprints\n", + "- Morgan count fingerprints\n", + "- MACCS keys fingerprints\n", + "- Physicochemical features\n", + "\n", + "can be easily calculated in parallel and in different variations with MolPipeline. 
If you are interested in further molecular featurization and descriptors, check out the `molpipeline.mol2any` module." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "6872cc5e-5851-42ec-a63e-071d8139829e", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "from molpipeline import Pipeline\n", + "from molpipeline.any2mol import AutoToMol\n", + "from molpipeline.mol2any import MolToMorganFP, MolToMACCSFP, MolToRDKitPhysChem" + ] + }, + { + "cell_type": "markdown", + "id": "8a6ba6bf-c0cd-4949-82f3-e71e538cdee0", + "metadata": {}, + "source": [ + "In this example we fetch the ESOL (Delaney) data set. However, you can use any other data set." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "761f0ee7-3e66-4e86-bdac-e9dcec9ecb17", + "metadata": {}, + "outputs": [], + "source": [ + "df_full = pd.read_csv(\n", + " \"https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv\",\n", + " usecols=lambda col: col != \"num\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6853d13e-c371-49cc-8009-544022c67d34", + "metadata": {}, + "source": [ + "We use a smaller portion of the data set for illustration." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d47ea54e-ac15-4358-ae2b-7e8428642a26", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Compound ID ESOL predicted log solubility in mols per litre \\\n", + "0 Amigdalin -0.974 \n", + "1 Fenfuram -2.885 \n", + "2 citral -2.579 \n", + "3 Picene -6.618 \n", + "4 Thiophene -2.232 \n", + ".. ... ... \n", + "95 diethylstilbestrol -5.074 \n", + "96 Chlorothalonil -3.995 \n", + "97 2,3',4',5-PCB -6.312 \n", + "98 styrene oxide -1.826 \n", + "99 Isopropylbenzene -3.265 \n", + "\n", + " Minimum Degree Molecular Weight Number of H-Bond Donors \\\n", + "0 1 457.432 7 \n", + "1 1 201.225 1 \n", + "2 1 152.237 0 \n", + "3 2 278.354 0 \n", + "4 2 84.143 0 \n", + ".. ... ... ... \n", + "95 1 268.356 2 \n", + "96 1 265.914 0 \n", + "97 1 291.992 0 \n", + "98 2 120.151 0 \n", + "99 1 120.195 0 \n", + "\n", + " Number of Rings Number of Rotatable Bonds Polar Surface Area \\\n", + "0 3 7 202.32 \n", + "1 2 2 42.24 \n", + "2 0 4 17.07 \n", + "3 5 0 0.00 \n", + "4 1 0 0.00 \n", + ".. ... ... ... \n", + "95 2 4 40.46 \n", + "96 1 0 47.58 \n", + "97 2 1 0.00 \n", + "98 2 1 12.53 \n", + "99 1 1 0.00 \n", + "\n", + " measured log solubility in mols per litre \\\n", + "0 -0.77 \n", + "1 -3.30 \n", + "2 -2.06 \n", + "3 -7.87 \n", + "4 -1.33 \n", + ".. ... \n", + "95 -4.07 \n", + "96 -5.64 \n", + "97 -7.25 \n", + "98 -1.60 \n", + "99 -3.27 \n", + "\n", + " smiles \n", + "0 OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... \n", + "1 Cc1occc1C(=O)Nc2ccccc2 \n", + "2 CC(C)=CCCC(C)=CC(=O) \n", + "3 c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 \n", + "4 c1ccsc1 \n", + ".. ... 
\n", + "95 CCC(=C(CC)c1ccc(O)cc1)c2ccc(O)cc2 \n", + "96 c1(C#N)c(Cl)c(C#N)c(Cl)c(Cl)c(Cl)1 \n", + "97 Clc1ccc(Cl)c(c1)c2ccc(Cl)c(Cl)c2 \n", + "98 C1OC1c2ccccc2 \n", + "99 CC(C)c1ccccc1 \n", + "\n", + "[100 rows x 10 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = df_full.head(n=100)\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "80d9843a-a702-4da5-8a4f-6c5ed7a5034b", + "metadata": {}, + "source": [ + "## Calculating fingerprints" + ] + }, + { + "cell_type": "markdown", + "id": "15dcb6cb-2a8e-4d62-a218-826581155816", + "metadata": {}, + "source": [ + "### Morgan binary fingerprints\n", + "\n", + "Morgan fingerprints are the most popular molecular fingerprints. They are also known as [Extended-Connectivity Fingerprints (ECFP)](https://doi.org/10.1021/ci100050t). They encode circular substructures in the molecule. The binary version contains only 0s and 1s indicating the presence or absence of the substructures in the molecule." + ] + }, + { + "cell_type": "markdown", + "id": "1a838dd7-ec21-4875-a5b8-c5e0c27d9389", + "metadata": {}, + "source": [ + "Let's define the Pipeline to first read the molecule and then calculate the binary Morgan fingerprint. Then, we execute it by calling the `transform` function." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "b6be019a-cc4d-45b2-b41a-9dca98d9644c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 181 ms, sys: 247 ms, total: 428 ms\n", + "Wall time: 12.6 s\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "# define the pipeline\n", + "pipeline_morgan = Pipeline(\n", + " [(\"auto2mol\", AutoToMol()), (\"morgan2_2048\", MolToMorganFP(n_bits=2048, radius=2))],\n", + " n_jobs=-1,\n", + ")\n", + "# execute the pipeline\n", + "morgan_matrix = pipeline_morgan.transform(df[\"smiles\"])\n", + "morgan_matrix" + ] + }, + { + "cell_type": "markdown", + "id": "a13cc430-1c5e-4399-ab50-4b56ce8a7c09", + "metadata": {}, + "source": [ + "By default, the `MolToMorganFP` element returns a sparse matrix. More specifically, a [csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) is returned which is more memory efficient than a dense matrix since most elements in the matrix are zero." 
+ ] + }, + { + "cell_type": "markdown", + "id": "d872a591-cbfe-4158-8960-da813249fd1b", + "metadata": {}, + "source": [ + "To get a dense matrix you can convert the `csr_matrix` to a dense numpy matrix like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "5d9d772b-98b9-42e5-ba12-11f007a3d17f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "matrix([[0, 1, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " ...,\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 1, 0, ..., 0, 0, 0]])" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "morgan_matrix.todense()" + ] + }, + { + "cell_type": "markdown", + "id": "923f168d-e6e4-418d-adb3-5451555b1303", + "metadata": {}, + "source": [ + "Alternatively, you can specify the return type of the feature matrix with the `return_as` option of the `MolToMorganFP` element. You can choose between\n", + "\n", + "- `return_as=\"sparse\"` which returns a `csr_matrix`\n", + "- `return_as=\"dense\"` which returns a dense numpy array\n", + "- `return_as=\"explicit_bit_vect\"` which returns RDKit's dense [ExplicitBitVect](https://www.rdkit.org/new_docs/cppapi/classExplicitBitVect.html)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "e728cf48-10bb-4168-9229-fe48b462ac03", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 45.4 ms, sys: 11.7 ms, total: 57 ms\n", + "Wall time: 62.4 ms\n" + ] + }, + { + "data": { + "text/plain": [ + "array([[0, 1, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " ...,\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 1, 0, ..., 0, 0, 0]], dtype=uint8)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + 
"pipeline_morgan_dense = Pipeline(\n", + " [\n", + " (\"auto2mol\", AutoToMol()),\n", + " (\"morgan2_2048\", MolToMorganFP(n_bits=2048, radius=2, return_as=\"dense\")),\n", + " ],\n", + " n_jobs=-1,\n", + ")\n", + "dense_morgan_matrix = pipeline_morgan_dense.transform(df[\"smiles\"])\n", + "dense_morgan_matrix" + ] + }, + { + "cell_type": "markdown", + "id": "6aecd789-2198-4325-b892-6aeecf857e25", + "metadata": {}, + "source": [ + "The feature matrix can be used to train a machine learning model but also for various analyses." + ] + }, + { + "cell_type": "markdown", + "id": "85043b30-7476-4204-8268-a9375b2ee4f8", + "metadata": {}, + "source": [ + "### Morgan count fingerprints" + ] + }, + { + "cell_type": "markdown", + "id": "9897e96f-4ffd-434b-b629-837a31a99f04", + "metadata": {}, + "source": [ + "Just set `counted=True` to compute Morgan count fingerprints instead of binary fingerprints." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "477ebba4-0fbe-46c2-8c4a-13f9051ae85b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[0, 1, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " ...,\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 1, 0, ..., 0, 0, 0]], dtype=uint32)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pipeline_morgan_counted = Pipeline(\n", + " [\n", + " (\"auto2mol\", AutoToMol()),\n", + " (\n", + " \"morgan2_2048\",\n", + " MolToMorganFP(n_bits=2048, radius=2, counted=True, return_as=\"dense\"),\n", + " ),\n", + " ],\n", + " n_jobs=-1,\n", + ")\n", + "count_morgan_matrix = pipeline_morgan_counted.transform(df[\"smiles\"])\n", + "count_morgan_matrix" + ] + }, + { + "cell_type": "markdown", + "id": "0e24ea56-f0f8-4426-b3e3-da960b93d431", + "metadata": {}, + "source": [ + "When we sort the matrix values we see that some substructures are present up to 14 times in a single 
molecule." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "189ea2d6-9274-4097-b654-5ca88c318abf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[14, 13, 12, 12, 11, 10, 10, 10, 10, 10, 10, 10, 9, 9, 8, 8, 8, 8, 8, 8]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sorted(count_morgan_matrix.ravel(), reverse=True)[:20]" + ] + }, + { + "cell_type": "markdown", + "id": "80fb055a-1b4c-4c69-989c-5f3e774e80e1", + "metadata": {}, + "source": [ + "### MACCS key fingerprints\n", + "\n", + "MACCS keys are a manually defined set of 166 substructures whose presence is checked in the molecule. MACCS keys contain for example common functional groups." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "d9a11c62-c8ad-470f-b40f-f5d4ddc16b61", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 43.8 ms, sys: 1.15 ms, total: 44.9 ms\n", + "Wall time: 70.9 ms\n" + ] + }, + { + "data": { + "text/plain": [ + "array([[0, 0, 0, ..., 1, 1, 0],\n", + " [0, 0, 0, ..., 1, 1, 0],\n", + " [0, 0, 0, ..., 1, 0, 0],\n", + " ...,\n", + " [0, 0, 0, ..., 0, 1, 0],\n", + " [0, 0, 0, ..., 1, 1, 0],\n", + " [0, 0, 0, ..., 0, 1, 0]])" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "pipeline_maccs_dense = Pipeline(\n", + " [(\"auto2mol\", AutoToMol()), (\"maccs\", MolToMACCSFP(return_as=\"dense\"))],\n", + " n_jobs=-1,\n", + ")\n", + "dense_maccs_matrix = pipeline_maccs_dense.transform(df[\"smiles\"])\n", + "dense_maccs_matrix" + ] + }, + { + "cell_type": "markdown", + "id": "7d3546ca-6d58-4a69-a252-d7deb3147a40", + "metadata": {}, + "source": [ + "## Physicochemical features\n", + "\n", + "RDKit also provides more than 200 physicochemical descriptors that can readily be computed from most molecules. 
In MolPipeline we can compute these features with the `MolToRDKitPhysChem` element." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "858afb55-7e24-415d-bb5a-e0d7c811d6df", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 68.1 ms, sys: 2.43 ms, total: 70.5 ms\n", + "Wall time: 171 ms\n" + ] + }, + { + "data": { + "text/plain": [ + "array([[10.25332888, 10.25332888, 0.48660209, ..., 0. ,\n", + " 0. , 0. ],\n", + " [11.72491119, 11.72491119, 0.14587963, ..., 0. ,\n", + " 0. , 0. ],\n", + " [10.02049761, 10.02049761, 0.84508976, ..., 0. ,\n", + " 0. , 0. ],\n", + " ...,\n", + " [ 6.08815823, 6.08815823, 0.49556374, ..., 0. ,\n", + " 0. , 0. ],\n", + " [ 5.09453704, 5.09453704, 0.40851852, ..., 0. ,\n", + " 0. , 0. ],\n", + " [ 2.2037037 , 2.2037037 , 0.65851852, ..., 0. ,\n", + " 0. , 0. ]])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "pipeline_physchem = Pipeline(\n", + " [(\"auto2mol\", AutoToMol()), (\"physchem\", MolToRDKitPhysChem(standardizer=None))],\n", + " n_jobs=-1,\n", + ")\n", + "physchem_matrix = pipeline_physchem.transform(df[\"smiles\"])\n", + "physchem_matrix" + ] + }, + { + "cell_type": "markdown", + "id": "8746f6cb-dc30-4435-a97b-0235f2c8c47a", + "metadata": {}, + "source": [ + "We can get the name of the descriptors like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "f0b5fe47-54f0-4cca-9a1a-aa689a0b2d0c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['MaxAbsEStateIndex',\n", + " 'MaxEStateIndex',\n", + " 'MinAbsEStateIndex',\n", + " 'MinEStateIndex',\n", + " 'qed',\n", + " 'SPS',\n", + " 'HeavyAtomMolWt',\n", + " 'ExactMolWt',\n", + " 'NumValenceElectrons',\n", + " 'NumRadicalElectrons',\n", + " 'MaxPartialCharge',\n", + " 'MinPartialCharge',\n", + " 'MaxAbsPartialCharge',\n", + " 'MinAbsPartialCharge',\n", + " 
'FpDensityMorgan1',\n", + " 'FpDensityMorgan2',\n", + " 'FpDensityMorgan3',\n", + " 'BCUT2D_MWHI',\n", + " 'BCUT2D_MWLOW',\n", + " 'BCUT2D_CHGHI']" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pipeline_physchem[\"physchem\"].descriptor_list[:20]" + ] + }, + { + "cell_type": "markdown", + "id": "b0823f4d-8a2e-4ae2-91f7-3db6ecaf0c0e", + "metadata": {}, + "source": [ + "When we only want to calculate a subset of all available descriptors we can specify this during pipeline construction" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "a3e005f3-f421-4634-9135-860e91a19de1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 41.2 ms, sys: 3.38 ms, total: 44.6 ms\n", + "Wall time: 47.5 ms\n" + ] + }, + { + "data": { + "text/plain": [ + "array([[430.216, 202.32 , 12. ],\n", + " [190.137, 42.24 , 2. ],\n", + " [136.109, 17.07 , 1. ],\n", + " [264.242, 0. , 0. ],\n", + " [ 80.111, 0. , 1. ],\n", + " [130.151, 12.89 , 2. ],\n", + " [321.397, 0. , 0. ],\n", + " [248.196, 40.46 , 2. ],\n", + " [372.849, 12.53 , 1. ],\n", + " [372.247, 63.22 , 6. ],\n", + " [ 78.05 , 29.1 , 1. ],\n", + " [155.563, 0. , 0. ],\n", + " [ 60.055, 0. , 0. ],\n", + " [204.144, 58.2 , 2. ],\n", + " [168.154, 0. , 0. ],\n", + " [ 71.486, 0. , 0. ],\n", + " [ 76.054, 20.23 , 1. ],\n", + " [ 98.084, 23.79 , 1. ],\n", + " [283.184, 53.47 , 6. ],\n", + " [148.12 , 20.23 , 1. ],\n", + " [321.397, 0. , 0. ],\n", + " [216.155, 54.86 , 3. ],\n", + " [243.25 , 18.46 , 5. ],\n", + " [166.115, 38.33 , 2. ],\n", + " [309.139, 115.54 , 6. ],\n", + " [100.076, 20.23 , 1. ],\n", + " [172.103, 72.68 , 5. ],\n", + " [196.121, 75.27 , 3. ],\n", + " [309.966, 0. , 0. ],\n", + " [140.097, 26.3 , 2. ],\n", + " [120.11 , 0. , 0. ],\n", + " [267.272, 18.46 , 5. ],\n", + " [284.186, 76.66 , 4. ],\n", + " [ 94.928, 0. , 0. ],\n", + " [168.154, 0. , 0. 
],\n", + " [ 76.054, 17.07 , 1. ],\n", + " [158.139, 12.03 , 1. ],\n", + " [234.215, 29.54 , 3. ],\n", + " [325.266, 38.77 , 5. ],\n", + " [210.981, 0. , 0. ],\n", + " [179.585, 0. , 0. ],\n", + " [ 76.054, 20.23 , 1. ],\n", + " [160.088, 75.27 , 3. ],\n", + " [136.109, 20.23 , 1. ],\n", + " [ 80.042, 26.3 , 2. ],\n", + " [100.076, 20.23 , 1. ],\n", + " [205.998, 29.1 , 1. ],\n", + " [258.034, 60.91 , 4. ],\n", + " [328.195, 107.77 , 7. ],\n", + " [146.128, 12.89 , 1. ],\n", + " [ 96.088, 0. , 0. ],\n", + " [220.143, 75.27 , 3. ],\n", + " [216.198, 0. , 0. ],\n", + " [248.015, 54.86 , 3. ],\n", + " [356.85 , 0. , 0. ],\n", + " [100.076, 20.23 , 1. ],\n", + " [108.099, 0. , 0. ],\n", + " [144.132, 0. , 0. ],\n", + " [228.209, 0. , 0. ],\n", + " [ 76.054, 17.07 , 1. ],\n", + " [427.756, 0. , 0. ],\n", + " [104.064, 26.3 , 2. ],\n", + " [367.223, 115.06 , 6. ],\n", + " [102.072, 46.25 , 2. ],\n", + " [248.157, 90.06 , 5. ],\n", + " [347.692, 54.37 , 3. ],\n", + " [213.587, 53.94 , 5. ],\n", + " [118.075, 68.87 , 3. ],\n", + " [223.993, 72.19 , 2. ],\n", + " [215.038, 0. , 0. ],\n", + " [232.111, 118.05 , 6. ],\n", + " [277.042, 52.37 , 3. ],\n", + " [136.109, 17.07 , 1. ],\n", + " [232.154, 75.27 , 3. ],\n", + " [116.075, 26.3 , 2. ],\n", + " [116.075, 26.3 , 2. ],\n", + " [356.252, 75.71 , 4. ],\n", + " [250.491, 0. , 0. ],\n", + " [115.937, 0. , 0. ],\n", + " [325.09 , 49.17 , 5. ],\n", + " [245.177, 55.84 , 6. ],\n", + " [140.105, 51.56 , 4. ],\n", + " [ 72.092, 52.04 , 1. ],\n", + " [ 96.088, 0. , 0. ],\n", + " [120.11 , 0. , 0. ],\n", + " [236.74 , 0. , 0. ],\n", + " [428.285, 68.55 , 5. ],\n", + " [ 82.038, 43.14 , 2. ],\n", + " [136.109, 17.07 , 1. ],\n", + " [261.627, 45.23 , 3. ],\n", + " [188.977, 43.14 , 2. ],\n", + " [236.211, 58.2 , 3. ],\n", + " [192.176, 0. , 0. ],\n", + " [ 88.065, 9.23 , 1. ],\n", + " [144.132, 0. , 0. ],\n", + " [248.196, 40.46 , 2. ],\n", + " [265.914, 47.58 , 2. ],\n", + " [285.944, 0. , 0. ],\n", + " [112.087, 12.53 , 1. 
],\n", + " [108.099, 0. , 0. ]])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "pipeline_physchem_small = Pipeline(\n", + " [\n", + " (\"auto2mol\", AutoToMol()),\n", + " (\n", + " \"physchem\",\n", + " MolToRDKitPhysChem(\n", + " standardizer=None,\n", + " descriptor_list=[\"HeavyAtomMolWt\", \"TPSA\", \"NumHAcceptors\"],\n", + " ),\n", + " ),\n", + " ],\n", + " n_jobs=-1,\n", + ")\n", + "physchem_matrix_small = pipeline_physchem_small.transform(df[\"smiles\"])\n", + "physchem_matrix_small" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}
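
Note on the notebook's remark that the feature matrix "can be used to train a machine learning model but also for various analyses": one common analysis is pairwise Tanimoto similarity between binary fingerprints. The following is only a minimal sketch, assuming a dense binary matrix of the kind returned with `return_as="dense"`. The names `tanimoto_matrix` and `toy_fps` are hypothetical and not part of MolPipeline, and the toy array is made-up data standing in for real fingerprints.

```python
import numpy as np


def tanimoto_matrix(fps: np.ndarray) -> np.ndarray:
    """Pairwise Tanimoto similarity for a dense binary fingerprint matrix."""
    # Treat any non-zero entry as a set bit (hypothetical helper, not a MolPipeline API).
    bits = fps.astype(bool).astype(np.int64)
    common = bits @ bits.T  # number of shared set bits for every pair of rows
    counts = bits.sum(axis=1)  # number of set bits per row
    union = counts[:, None] + counts[None, :] - common  # size of the bit union
    # Two all-zero fingerprints have an empty union; define their similarity as 1.0.
    return np.where(union == 0, 1.0, common / np.maximum(union, 1))


# Toy stand-in for a dense Morgan matrix (3 molecules x 8 bits), not real fingerprints.
toy_fps = np.array(
    [
        [1, 1, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 1, 0, 1, 0],
        [0, 0, 1, 1, 0, 0, 0, 1],
    ],
    dtype=np.uint8,
)
similarity = tanimoto_matrix(toy_fps)
print(similarity.round(2))
```

In the toy data, rows 0 and 1 share two set bits out of four in their union, so their similarity is 0.5, while row 2 shares no bits with either. With the notebook's data, `dense_morgan_matrix` could presumably be passed in place of `toy_fps`.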