diff --git a/.github/molpipeline.png b/.github/molpipeline.png
new file mode 100755
index 00000000..fc6129a2
Binary files /dev/null and b/.github/molpipeline.png differ
diff --git a/README.md b/README.md
index 7479707b..b42b655e 100644
--- a/README.md
+++ b/README.md
@@ -1,35 +1,56 @@
 # MolPipeline
-MolPipeline is a Python package providing RDKit functionality in a Scikit-learn like fashion.
+MolPipeline is a Python package for processing molecules with RDKit in scikit-learn.
+
+<p align="center"><img src=".github/molpipeline.png" height="250"/></p>
 
 ## Background
 
-The open-source package [scikit-learn](https://scikit-learn.org/) provides a large variety of machine
+The [scikit-learn](https://scikit-learn.org/) package provides a large variety of machine
 learning algorithms and data processing tools, among which is the `Pipeline` class, allowing users to
 prepend custom data processing steps to the machine learning model.
-`MolPipeline` extends this concept to the field of chemoinformatics by
-wrapping default functionalities of [RDKit](https://www.rdkit.org/), such as reading and writing SMILES strings
+`MolPipeline` extends this concept to the field of cheminformatics by
+wrapping standard [RDKit](https://www.rdkit.org/) functionality, such as reading and writing SMILES strings
 or calculating molecular descriptors from a molecule-object.
 
-A notable difference to the `Pipeline` class of scikit-learn is that the Pipline from `MolPipeline` allows for 
-instances to fail during processing without interrupting the whole pipeline.
-Such behaviour is useful when processing large datasets, where some SMILES strings might not encode valid molecules
-or some descriptors might not be calculable for certain molecules.
+MolPipeline aims to provide:
 
+- Automated end-to-end processing from molecule data sets to deployable machine learning models.
+- Scalable parallel processing and low memory usage through instance-based processing.
+- Standard pipeline building blocks for flexibly building custom pipelines for various
+cheminformatics tasks.
+- Consistent error handling for tracking, logging, and replacing failed instances (e.g., a
+SMILES string that could not be parsed correctly).
+- Integrated and self-contained pipeline serialization for easy deployment and tracking
+in version control.
 
 ## Publications
 
-The publication is freely available [here](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036).
+[Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A python package for processing
+molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024](https://doi.org/10.1021/acs.jcim.4c00863)
+\
+Further links: [arXiv](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036)
+
+Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural
+fingerprint-based models, 2024
+\
+Further links: [repository](https://github.com/basf/neural-fingerprint-uncertainty)
 
 ## Installation
 ```commandline
 pip install molpipeline
 ```
 
-## Usage
+## Documentation
+
+The [notebooks](notebooks) folder contains many basic and advanced examples of how to use Molpipeline.
+
+A nice introduction to the basic usage is in the [01_getting_started_with_molpipeline notebook](notebooks/01_getting_started_with_molpipeline.ipynb).
 
-See the [notebooks](notebooks) folder for basic and advanced examples of how to use Molpipeline.
+## Quick Start
 
-A basic example of how to use MolPipeline to create a fingerprint-based model is shown below (see also the [notebook](notebooks/01_getting_started_with_molpipeline.ipynb)): 
+### Model building
+
+Create a fingerprint-based prediction model:
 ```python
 from molpipeline import Pipeline
 from molpipeline.any2mol import AutoToMol
@@ -58,8 +79,42 @@ pipeline.predict(["CCC"])
 # output: array([0.29])
 ```
 
-Molpipeline also provides custom estimators for standard cheminformatics tasks that can be integrated into pipelines,
-like clustering for scaffold splits (see also the [notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb)):
+### Feature calculation
+
+Calculating molecular descriptors from SMILES strings is straightforward. For example, physicochemical properties can
+be calculated like this:
+```python
+from molpipeline import Pipeline
+from molpipeline.any2mol import AutoToMol
+from molpipeline.mol2any import MolToRDKitPhysChem
+
+pipeline_physchem = Pipeline(
+    [
+        ("auto2mol", AutoToMol()),
+        (
+            "physchem",
+            MolToRDKitPhysChem(
+                standardizer=None,
+                descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"],
+            ),
+        ),
+    ],
+    n_jobs=-1,
+)
+physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"])
+physchem_matrix
+# output: array([[72.066,  0.   ,  0.   ],
+#                [88.065, 20.23 ,  1.   ]])
+```
+
+MolPipeline provides further features and descriptors from RDKit, 
+for example Morgan (binary/count) fingerprints and MACCS keys.
+See the [04_feature_calculation notebook](notebooks/04_feature_calculation.ipynb) for more examples.
+
+### Clustering
+
+Molpipeline provides several clustering algorithms as sklearn-like estimators. For example, molecules can be
+clustered by their Murcko scaffold. See the [02_scaffold_split_with_custom_estimators notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb) for scaffolds splits and further examples.
 
 ```python
 from molpipeline.estimators import MurckoScaffoldClustering
diff --git a/notebooks/04_feature_calculation.ipynb b/notebooks/04_feature_calculation.ipynb
new file mode 100644
index 00000000..1bcf9ce8
--- /dev/null
+++ b/notebooks/04_feature_calculation.ipynb
@@ -0,0 +1,915 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a5e18566-ab97-4ead-b6e3-0ad930754a21",
+   "metadata": {},
+   "source": [
+    "# Feature calculation\n",
+    "\n",
+    "\n",
+    "\n",
+    "Molpipeline provides multiple molecular featurization methods and descriptors from RDKit. This notebook shows how features like\n",
+    "\n",
+    "- Morgan binary fingerprints\n",
+    "- Morgan count fingerprints\n",
+    "- MACCS keys fingerprints\n",
+    "- Physicochemical features\n",
+    "\n",
+    "can be easily calculated in parallel and in different variations with MolPipeline. If you are interested in further molecular featurization and descriptors check out the `molpipeline.mol2any` module."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "6872cc5e-5851-42ec-a63e-071d8139829e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "from molpipeline import Pipeline\n",
+    "from molpipeline.any2mol import AutoToMol\n",
+    "from molpipeline.mol2any import MolToMorganFP, MolToMACCSFP, MolToRDKitPhysChem"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a6ba6bf-c0cd-4949-82f3-e71e538cdee0",
+   "metadata": {},
+   "source": [
+    "In this example we fetch the ESOL (delaney) data set. However, you can use any other data set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "761f0ee7-3e66-4e86-bdac-e9dcec9ecb17",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_full = pd.read_csv(\n",
+    "    \"https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv\",\n",
+    "    usecols=lambda col: col != \"num\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6853d13e-c371-49cc-8009-544022c67d34",
+   "metadata": {},
+   "source": [
+    "We use a smaller portion of the data set for illustration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "d47ea54e-ac15-4358-ae2b-7e8428642a26",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Compound ID</th>\n",
+       "      <th>ESOL predicted log solubility in mols per litre</th>\n",
+       "      <th>Minimum Degree</th>\n",
+       "      <th>Molecular Weight</th>\n",
+       "      <th>Number of H-Bond Donors</th>\n",
+       "      <th>Number of Rings</th>\n",
+       "      <th>Number of Rotatable Bonds</th>\n",
+       "      <th>Polar Surface Area</th>\n",
+       "      <th>measured log solubility in mols per litre</th>\n",
+       "      <th>smiles</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Amigdalin</td>\n",
+       "      <td>-0.974</td>\n",
+       "      <td>1</td>\n",
+       "      <td>457.432</td>\n",
+       "      <td>7</td>\n",
+       "      <td>3</td>\n",
+       "      <td>7</td>\n",
+       "      <td>202.32</td>\n",
+       "      <td>-0.77</td>\n",
+       "      <td>OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Fenfuram</td>\n",
+       "      <td>-2.885</td>\n",
+       "      <td>1</td>\n",
+       "      <td>201.225</td>\n",
+       "      <td>1</td>\n",
+       "      <td>2</td>\n",
+       "      <td>2</td>\n",
+       "      <td>42.24</td>\n",
+       "      <td>-3.30</td>\n",
+       "      <td>Cc1occc1C(=O)Nc2ccccc2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>citral</td>\n",
+       "      <td>-2.579</td>\n",
+       "      <td>1</td>\n",
+       "      <td>152.237</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>4</td>\n",
+       "      <td>17.07</td>\n",
+       "      <td>-2.06</td>\n",
+       "      <td>CC(C)=CCCC(C)=CC(=O)</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Picene</td>\n",
+       "      <td>-6.618</td>\n",
+       "      <td>2</td>\n",
+       "      <td>278.354</td>\n",
+       "      <td>0</td>\n",
+       "      <td>5</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>-7.87</td>\n",
+       "      <td>c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>Thiophene</td>\n",
+       "      <td>-2.232</td>\n",
+       "      <td>2</td>\n",
+       "      <td>84.143</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>-1.33</td>\n",
+       "      <td>c1ccsc1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>95</th>\n",
+       "      <td>diethylstilbestrol</td>\n",
+       "      <td>-5.074</td>\n",
+       "      <td>1</td>\n",
+       "      <td>268.356</td>\n",
+       "      <td>2</td>\n",
+       "      <td>2</td>\n",
+       "      <td>4</td>\n",
+       "      <td>40.46</td>\n",
+       "      <td>-4.07</td>\n",
+       "      <td>CCC(=C(CC)c1ccc(O)cc1)c2ccc(O)cc2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>96</th>\n",
+       "      <td>Chlorothalonil</td>\n",
+       "      <td>-3.995</td>\n",
+       "      <td>1</td>\n",
+       "      <td>265.914</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>47.58</td>\n",
+       "      <td>-5.64</td>\n",
+       "      <td>c1(C#N)c(Cl)c(C#N)c(Cl)c(Cl)c(Cl)1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>97</th>\n",
+       "      <td>2,3',4',5-PCB</td>\n",
+       "      <td>-6.312</td>\n",
+       "      <td>1</td>\n",
+       "      <td>291.992</td>\n",
+       "      <td>0</td>\n",
+       "      <td>2</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>-7.25</td>\n",
+       "      <td>Clc1ccc(Cl)c(c1)c2ccc(Cl)c(Cl)c2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>98</th>\n",
+       "      <td>styrene oxide</td>\n",
+       "      <td>-1.826</td>\n",
+       "      <td>2</td>\n",
+       "      <td>120.151</td>\n",
+       "      <td>0</td>\n",
+       "      <td>2</td>\n",
+       "      <td>1</td>\n",
+       "      <td>12.53</td>\n",
+       "      <td>-1.60</td>\n",
+       "      <td>C1OC1c2ccccc2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>99</th>\n",
+       "      <td>Isopropylbenzene</td>\n",
+       "      <td>-3.265</td>\n",
+       "      <td>1</td>\n",
+       "      <td>120.195</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>-3.27</td>\n",
+       "      <td>CC(C)c1ccccc1</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>100 rows × 10 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "           Compound ID  ESOL predicted log solubility in mols per litre  \\\n",
+       "0            Amigdalin                                           -0.974   \n",
+       "1             Fenfuram                                           -2.885   \n",
+       "2               citral                                           -2.579   \n",
+       "3               Picene                                           -6.618   \n",
+       "4            Thiophene                                           -2.232   \n",
+       "..                 ...                                              ...   \n",
+       "95  diethylstilbestrol                                           -5.074   \n",
+       "96      Chlorothalonil                                           -3.995   \n",
+       "97       2,3',4',5-PCB                                           -6.312   \n",
+       "98       styrene oxide                                           -1.826   \n",
+       "99   Isopropylbenzene                                            -3.265   \n",
+       "\n",
+       "    Minimum Degree  Molecular Weight  Number of H-Bond Donors  \\\n",
+       "0                1           457.432                        7   \n",
+       "1                1           201.225                        1   \n",
+       "2                1           152.237                        0   \n",
+       "3                2           278.354                        0   \n",
+       "4                2            84.143                        0   \n",
+       "..             ...               ...                      ...   \n",
+       "95               1           268.356                        2   \n",
+       "96               1           265.914                        0   \n",
+       "97               1           291.992                        0   \n",
+       "98               2           120.151                        0   \n",
+       "99               1           120.195                        0   \n",
+       "\n",
+       "    Number of Rings  Number of Rotatable Bonds  Polar Surface Area  \\\n",
+       "0                 3                          7              202.32   \n",
+       "1                 2                          2               42.24   \n",
+       "2                 0                          4               17.07   \n",
+       "3                 5                          0                0.00   \n",
+       "4                 1                          0                0.00   \n",
+       "..              ...                        ...                 ...   \n",
+       "95                2                          4               40.46   \n",
+       "96                1                          0               47.58   \n",
+       "97                2                          1                0.00   \n",
+       "98                2                          1               12.53   \n",
+       "99                1                          1                0.00   \n",
+       "\n",
+       "    measured log solubility in mols per litre  \\\n",
+       "0                                       -0.77   \n",
+       "1                                       -3.30   \n",
+       "2                                       -2.06   \n",
+       "3                                       -7.87   \n",
+       "4                                       -1.33   \n",
+       "..                                        ...   \n",
+       "95                                      -4.07   \n",
+       "96                                      -5.64   \n",
+       "97                                      -7.25   \n",
+       "98                                      -1.60   \n",
+       "99                                      -3.27   \n",
+       "\n",
+       "                                               smiles  \n",
+       "0   OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...  \n",
+       "1                              Cc1occc1C(=O)Nc2ccccc2  \n",
+       "2                                CC(C)=CCCC(C)=CC(=O)  \n",
+       "3                  c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43  \n",
+       "4                                             c1ccsc1  \n",
+       "..                                                ...  \n",
+       "95                 CCC(=C(CC)c1ccc(O)cc1)c2ccc(O)cc2   \n",
+       "96                 c1(C#N)c(Cl)c(C#N)c(Cl)c(Cl)c(Cl)1  \n",
+       "97                   Clc1ccc(Cl)c(c1)c2ccc(Cl)c(Cl)c2  \n",
+       "98                                     C1OC1c2ccccc2   \n",
+       "99                                      CC(C)c1ccccc1  \n",
+       "\n",
+       "[100 rows x 10 columns]"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df = df_full.head(n=100)\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "80d9843a-a702-4da5-8a4f-6c5ed7a5034b",
+   "metadata": {},
+   "source": [
+    "## Calculating fingerprints"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15dcb6cb-2a8e-4d62-a218-826581155816",
+   "metadata": {},
+   "source": [
+    "### Morgan binary fingerprints\n",
+    "\n",
+    "Morgan fingerprints are the most popular molecular fingerprints. They are also known as [Extended-Connectivity Fingerprints (ECFP)](https://doi.org/10.1021/ci100050t). They encode circular substructures in the molecule. The binary version contains only 0s and 1s indicating the presence or absence of the substructures in the molecule."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1a838dd7-ec21-4875-a5b8-c5e0c27d9389",
+   "metadata": {},
+   "source": [
+    "Let's define the Pipeline to first read the molecule and then calculate the binary Morgan fingerprint. Then, we execute it by calling the `transform` function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "b6be019a-cc4d-45b2-b41a-9dca98d9644c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 181 ms, sys: 247 ms, total: 428 ms\n",
+      "Wall time: 12.6 s\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "<Compressed Sparse Row sparse matrix of dtype 'int64'\n",
+       "\twith 2191 stored elements and shape (100, 2048)>"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "# define the pipeline\n",
+    "pipeline_morgan = Pipeline(\n",
+    "    [(\"auto2mol\", AutoToMol()), (\"morgan2_2048\", MolToMorganFP(n_bits=2048, radius=2))],\n",
+    "    n_jobs=-1,\n",
+    ")\n",
+    "# execute the pipeline\n",
+    "morgan_matrix = pipeline_morgan.transform(df[\"smiles\"])\n",
+    "morgan_matrix"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a13cc430-1c5e-4399-ab50-4b56ce8a7c09",
+   "metadata": {},
+   "source": [
+    "By default, the `MolToMorganFP` element returns a sparse matrix. More specifically, a [csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) is returned which is more memory efficient than a dense matrix since most elements in the matrix are zero."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d872a591-cbfe-4158-8960-da813249fd1b",
+   "metadata": {},
+   "source": [
+    "To get a dense matrix you can convert the `csr_matrix` to a dense numpy matrix like this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "5d9d772b-98b9-42e5-ba12-11f007a3d17f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "matrix([[0, 1, 0, ..., 0, 0, 0],\n",
+       "        [0, 0, 0, ..., 0, 0, 0],\n",
+       "        [0, 0, 0, ..., 0, 0, 0],\n",
+       "        ...,\n",
+       "        [0, 0, 0, ..., 0, 0, 0],\n",
+       "        [0, 0, 0, ..., 0, 0, 0],\n",
+       "        [0, 1, 0, ..., 0, 0, 0]])"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "morgan_matrix.todense()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "923f168d-e6e4-418d-adb3-5451555b1303",
+   "metadata": {},
+   "source": [
+    "Alternatively, you can specify in the `MolToMorganFP` element the return type of the feature matrix by using the `return_as` option. You can choose between\n",
+    "\n",
+    "- `return_as=\"sparse\"` which returns a `csr_matrix`\n",
+    "- `return_as=\"dense` which returns a dense numpy matrix\n",
+    "- `return_as=\"explicit_bit_vect\"` which returns RDKit's dense [ExplicitBitVect](https://www.rdkit.org/new_docs/cppapi/classExplicitBitVect.html)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "e728cf48-10bb-4168-9229-fe48b462ac03",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 45.4 ms, sys: 11.7 ms, total: 57 ms\n",
+      "Wall time: 62.4 ms\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[0, 1, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       ...,\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 1, 0, ..., 0, 0, 0]], dtype=uint8)"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "pipeline_morgan_dense = Pipeline(\n",
+    "    [\n",
+    "        (\"auto2mol\", AutoToMol()),\n",
+    "        (\"morgan2_2048\", MolToMorganFP(n_bits=2048, radius=2, return_as=\"dense\")),\n",
+    "    ],\n",
+    "    n_jobs=-1,\n",
+    ")\n",
+    "dense_morgan_matrix = pipeline_morgan_dense.transform(df[\"smiles\"])\n",
+    "dense_morgan_matrix"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6aecd789-2198-4325-b892-6aeecf857e25",
+   "metadata": {},
+   "source": [
+    "The feature matrix can be used to train a machine learning model but also for various analyses."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "85043b30-7476-4204-8268-a9375b2ee4f8",
+   "metadata": {},
+   "source": [
+    "### Morgan count fingerprints"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9897e96f-4ffd-434b-b629-837a31a99f04",
+   "metadata": {},
+   "source": [
+    "Just set `counted=True` to compute Morgan count fingerprints instead of binary fingerprints."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "477ebba4-0fbe-46c2-8c4a-13f9051ae85b",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[0, 1, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       ...,\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 1, 0, ..., 0, 0, 0]], dtype=uint32)"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pipeline_morgan_counted = Pipeline(\n",
+    "    [\n",
+    "        (\"auto2mol\", AutoToMol()),\n",
+    "        (\n",
+    "            \"morgan2_2048\",\n",
+    "            MolToMorganFP(n_bits=2048, radius=2, counted=True, return_as=\"dense\"),\n",
+    "        ),\n",
+    "    ],\n",
+    "    n_jobs=-1,\n",
+    ")\n",
+    "count_morgan_matrix = pipeline_morgan_counted.transform(df[\"smiles\"])\n",
+    "count_morgan_matrix"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e24ea56-f0f8-4426-b3e3-da960b93d431",
+   "metadata": {},
+   "source": [
+    "When we sort the matrix values we see that some substructures are present up to 14 times in a single molecule."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "189ea2d6-9274-4097-b654-5ca88c318abf",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[14, 13, 12, 12, 11, 10, 10, 10, 10, 10, 10, 10, 9, 9, 8, 8, 8, 8, 8, 8]"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "sorted(count_morgan_matrix.ravel(), reverse=True)[:20]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "80fb055a-1b4c-4c69-989c-5f3e774e80e1",
+   "metadata": {},
+   "source": [
+    "### MACCS key fingerprints\n",
+    "\n",
+    "MACCS keys are a manually defined set of 166 substructures whose presence is checked in the molecule. MACCS keys contain for example common functional groups."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "d9a11c62-c8ad-470f-b40f-f5d4ddc16b61",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 43.8 ms, sys: 1.15 ms, total: 44.9 ms\n",
+      "Wall time: 70.9 ms\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[0, 0, 0, ..., 1, 1, 0],\n",
+       "       [0, 0, 0, ..., 1, 1, 0],\n",
+       "       [0, 0, 0, ..., 1, 0, 0],\n",
+       "       ...,\n",
+       "       [0, 0, 0, ..., 0, 1, 0],\n",
+       "       [0, 0, 0, ..., 1, 1, 0],\n",
+       "       [0, 0, 0, ..., 0, 1, 0]])"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "pipeline_maccs_dense = Pipeline(\n",
+    "    [(\"auto2mol\", AutoToMol()), (\"maccs\", MolToMACCSFP(return_as=\"dense\"))],\n",
+    "    n_jobs=-1,\n",
+    ")\n",
+    "dense_maccs_matrix = pipeline_maccs_dense.transform(df[\"smiles\"])\n",
+    "dense_maccs_matrix"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7d3546ca-6d58-4a69-a252-d7deb3147a40",
+   "metadata": {},
+   "source": [
+    "## Physicochemical features\n",
+    "\n",
+    "RDKit also provides more than 200 physicochemical descriptors that can readily be computed from most molecules. In MolPipeline we can compute these features with the `MolToRDKitPhysChem` element."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "858afb55-7e24-415d-bb5a-e0d7c811d6df",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 68.1 ms, sys: 2.43 ms, total: 70.5 ms\n",
+      "Wall time: 171 ms\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[10.25332888, 10.25332888,  0.48660209, ...,  0.        ,\n",
+       "         0.        ,  0.        ],\n",
+       "       [11.72491119, 11.72491119,  0.14587963, ...,  0.        ,\n",
+       "         0.        ,  0.        ],\n",
+       "       [10.02049761, 10.02049761,  0.84508976, ...,  0.        ,\n",
+       "         0.        ,  0.        ],\n",
+       "       ...,\n",
+       "       [ 6.08815823,  6.08815823,  0.49556374, ...,  0.        ,\n",
+       "         0.        ,  0.        ],\n",
+       "       [ 5.09453704,  5.09453704,  0.40851852, ...,  0.        ,\n",
+       "         0.        ,  0.        ],\n",
+       "       [ 2.2037037 ,  2.2037037 ,  0.65851852, ...,  0.        ,\n",
+       "         0.        ,  0.        ]])"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "pipeline_physchem = Pipeline(\n",
+    "    [(\"auto2mol\", AutoToMol()), (\"physchem\", MolToRDKitPhysChem(standardizer=None))],\n",
+    "    n_jobs=-1,\n",
+    ")\n",
+    "physchem_matrix = pipeline_physchem.transform(df[\"smiles\"])\n",
+    "physchem_matrix"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8746f6cb-dc30-4435-a97b-0235f2c8c47a",
+   "metadata": {},
+   "source": [
+    "We can get the name of the descriptors like this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "f0b5fe47-54f0-4cca-9a1a-aa689a0b2d0c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['MaxAbsEStateIndex',\n",
+       " 'MaxEStateIndex',\n",
+       " 'MinAbsEStateIndex',\n",
+       " 'MinEStateIndex',\n",
+       " 'qed',\n",
+       " 'SPS',\n",
+       " 'HeavyAtomMolWt',\n",
+       " 'ExactMolWt',\n",
+       " 'NumValenceElectrons',\n",
+       " 'NumRadicalElectrons',\n",
+       " 'MaxPartialCharge',\n",
+       " 'MinPartialCharge',\n",
+       " 'MaxAbsPartialCharge',\n",
+       " 'MinAbsPartialCharge',\n",
+       " 'FpDensityMorgan1',\n",
+       " 'FpDensityMorgan2',\n",
+       " 'FpDensityMorgan3',\n",
+       " 'BCUT2D_MWHI',\n",
+       " 'BCUT2D_MWLOW',\n",
+       " 'BCUT2D_CHGHI']"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pipeline_physchem[\"physchem\"].descriptor_list[:20]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b0823f4d-8a2e-4ae2-91f7-3db6ecaf0c0e",
+   "metadata": {},
+   "source": [
+    "When we only want to calculate a subset of all available descriptors we can specify this during pipeline construction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "a3e005f3-f421-4634-9135-860e91a19de1",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 41.2 ms, sys: 3.38 ms, total: 44.6 ms\n",
+      "Wall time: 47.5 ms\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[430.216, 202.32 ,  12.   ],\n",
+       "       [190.137,  42.24 ,   2.   ],\n",
+       "       [136.109,  17.07 ,   1.   ],\n",
+       "       [264.242,   0.   ,   0.   ],\n",
+       "       [ 80.111,   0.   ,   1.   ],\n",
+       "       [130.151,  12.89 ,   2.   ],\n",
+       "       [321.397,   0.   ,   0.   ],\n",
+       "       [248.196,  40.46 ,   2.   ],\n",
+       "       [372.849,  12.53 ,   1.   ],\n",
+       "       [372.247,  63.22 ,   6.   ],\n",
+       "       [ 78.05 ,  29.1  ,   1.   ],\n",
+       "       [155.563,   0.   ,   0.   ],\n",
+       "       [ 60.055,   0.   ,   0.   ],\n",
+       "       [204.144,  58.2  ,   2.   ],\n",
+       "       [168.154,   0.   ,   0.   ],\n",
+       "       [ 71.486,   0.   ,   0.   ],\n",
+       "       [ 76.054,  20.23 ,   1.   ],\n",
+       "       [ 98.084,  23.79 ,   1.   ],\n",
+       "       [283.184,  53.47 ,   6.   ],\n",
+       "       [148.12 ,  20.23 ,   1.   ],\n",
+       "       [321.397,   0.   ,   0.   ],\n",
+       "       [216.155,  54.86 ,   3.   ],\n",
+       "       [243.25 ,  18.46 ,   5.   ],\n",
+       "       [166.115,  38.33 ,   2.   ],\n",
+       "       [309.139, 115.54 ,   6.   ],\n",
+       "       [100.076,  20.23 ,   1.   ],\n",
+       "       [172.103,  72.68 ,   5.   ],\n",
+       "       [196.121,  75.27 ,   3.   ],\n",
+       "       [309.966,   0.   ,   0.   ],\n",
+       "       [140.097,  26.3  ,   2.   ],\n",
+       "       [120.11 ,   0.   ,   0.   ],\n",
+       "       [267.272,  18.46 ,   5.   ],\n",
+       "       [284.186,  76.66 ,   4.   ],\n",
+       "       [ 94.928,   0.   ,   0.   ],\n",
+       "       [168.154,   0.   ,   0.   ],\n",
+       "       [ 76.054,  17.07 ,   1.   ],\n",
+       "       [158.139,  12.03 ,   1.   ],\n",
+       "       [234.215,  29.54 ,   3.   ],\n",
+       "       [325.266,  38.77 ,   5.   ],\n",
+       "       [210.981,   0.   ,   0.   ],\n",
+       "       [179.585,   0.   ,   0.   ],\n",
+       "       [ 76.054,  20.23 ,   1.   ],\n",
+       "       [160.088,  75.27 ,   3.   ],\n",
+       "       [136.109,  20.23 ,   1.   ],\n",
+       "       [ 80.042,  26.3  ,   2.   ],\n",
+       "       [100.076,  20.23 ,   1.   ],\n",
+       "       [205.998,  29.1  ,   1.   ],\n",
+       "       [258.034,  60.91 ,   4.   ],\n",
+       "       [328.195, 107.77 ,   7.   ],\n",
+       "       [146.128,  12.89 ,   1.   ],\n",
+       "       [ 96.088,   0.   ,   0.   ],\n",
+       "       [220.143,  75.27 ,   3.   ],\n",
+       "       [216.198,   0.   ,   0.   ],\n",
+       "       [248.015,  54.86 ,   3.   ],\n",
+       "       [356.85 ,   0.   ,   0.   ],\n",
+       "       [100.076,  20.23 ,   1.   ],\n",
+       "       [108.099,   0.   ,   0.   ],\n",
+       "       [144.132,   0.   ,   0.   ],\n",
+       "       [228.209,   0.   ,   0.   ],\n",
+       "       [ 76.054,  17.07 ,   1.   ],\n",
+       "       [427.756,   0.   ,   0.   ],\n",
+       "       [104.064,  26.3  ,   2.   ],\n",
+       "       [367.223, 115.06 ,   6.   ],\n",
+       "       [102.072,  46.25 ,   2.   ],\n",
+       "       [248.157,  90.06 ,   5.   ],\n",
+       "       [347.692,  54.37 ,   3.   ],\n",
+       "       [213.587,  53.94 ,   5.   ],\n",
+       "       [118.075,  68.87 ,   3.   ],\n",
+       "       [223.993,  72.19 ,   2.   ],\n",
+       "       [215.038,   0.   ,   0.   ],\n",
+       "       [232.111, 118.05 ,   6.   ],\n",
+       "       [277.042,  52.37 ,   3.   ],\n",
+       "       [136.109,  17.07 ,   1.   ],\n",
+       "       [232.154,  75.27 ,   3.   ],\n",
+       "       [116.075,  26.3  ,   2.   ],\n",
+       "       [116.075,  26.3  ,   2.   ],\n",
+       "       [356.252,  75.71 ,   4.   ],\n",
+       "       [250.491,   0.   ,   0.   ],\n",
+       "       [115.937,   0.   ,   0.   ],\n",
+       "       [325.09 ,  49.17 ,   5.   ],\n",
+       "       [245.177,  55.84 ,   6.   ],\n",
+       "       [140.105,  51.56 ,   4.   ],\n",
+       "       [ 72.092,  52.04 ,   1.   ],\n",
+       "       [ 96.088,   0.   ,   0.   ],\n",
+       "       [120.11 ,   0.   ,   0.   ],\n",
+       "       [236.74 ,   0.   ,   0.   ],\n",
+       "       [428.285,  68.55 ,   5.   ],\n",
+       "       [ 82.038,  43.14 ,   2.   ],\n",
+       "       [136.109,  17.07 ,   1.   ],\n",
+       "       [261.627,  45.23 ,   3.   ],\n",
+       "       [188.977,  43.14 ,   2.   ],\n",
+       "       [236.211,  58.2  ,   3.   ],\n",
+       "       [192.176,   0.   ,   0.   ],\n",
+       "       [ 88.065,   9.23 ,   1.   ],\n",
+       "       [144.132,   0.   ,   0.   ],\n",
+       "       [248.196,  40.46 ,   2.   ],\n",
+       "       [265.914,  47.58 ,   2.   ],\n",
+       "       [285.944,   0.   ,   0.   ],\n",
+       "       [112.087,  12.53 ,   1.   ],\n",
+       "       [108.099,   0.   ,   0.   ]])"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "pipeline_physchem_small = Pipeline(\n",
+    "    [\n",
+    "        (\"auto2mol\", AutoToMol()),\n",
+    "        (\n",
+    "            \"physchem\",\n",
+    "            MolToRDKitPhysChem(\n",
+    "                standardizer=None,\n",
+    "                descriptor_list=[\"HeavyAtomMolWt\", \"TPSA\", \"NumHAcceptors\"],\n",
+    "            ),\n",
+    "        ),\n",
+    "    ],\n",
+    "    n_jobs=-1,\n",
+    ")\n",
+    "physchem_matrix_small = pipeline_physchem_small.transform(df[\"smiles\"])\n",
+    "physchem_matrix_small"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

	Compound ID	ESOL predicted log solubility in mols per litre	Minimum Degree	Molecular Weight	Number of H-Bond Donors	Number of Rings	Number of Rotatable Bonds	Polar Surface Area	measured log solubility in mols per litre	smiles
0	Amigdalin	-0.974	1	457.432	7	3	7	202.32	-0.77	OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...
1	Fenfuram	-2.885	1	201.225	1	2	2	42.24	-3.30	Cc1occc1C(=O)Nc2ccccc2
2	citral	-2.579	1	152.237	0	0	4	17.07	-2.06	CC(C)=CCCC(C)=CC(=O)
3	Picene	-6.618	2	278.354	0	5	0	0.00	-7.87	c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43
4	Thiophene	-2.232	2	84.143	0	1	0	0.00	-1.33	c1ccsc1
...	...	...	...	...	...	...	...	...	...	...
95	diethylstilbestrol	-5.074	1	268.356	2	2	4	40.46	-4.07	CCC(=C(CC)c1ccc(O)cc1)c2ccc(O)cc2
96	Chlorothalonil	-3.995	1	265.914	0	1	0	47.58	-5.64	c1(C#N)c(Cl)c(C#N)c(Cl)c(Cl)c(Cl)1
97	2,3',4',5-PCB	-6.312	1	291.992	0	2	1	0.00	-7.25	Clc1ccc(Cl)c(c1)c2ccc(Cl)c(Cl)c2
98	styrene oxide	-1.826	2	120.151	0	2	1	12.53	-1.60	C1OC1c2ccccc2
99	Isopropylbenzene	-3.265	1	120.195	0	1	1	0.00	-3.27	CC(C)c1ccccc1