Scaffold clustering #43

Merged
merged 13 commits into from
May 13, 2024
175 changes: 175 additions & 0 deletions src/konnektor/network_tools/clustering/scaffold_clustering.py
@@ -0,0 +1,175 @@
"""Clustering compounds based on scaffolds

This clusterer attempts to cluster compounds based on their scaffolds.
It is built on rdkit's rdScaffoldNetwork module.
"""
from collections import defaultdict
import itertools
import gufe
import openfe
from rdkit import Chem
from rdkit.Chem import rdMolHash
from rdkit.Chem.Scaffolds import rdScaffoldNetwork

from ._abstract_clusterer import _AbstractClusterer


class TwoDimensionalScaffoldClusterer(_AbstractClusterer):
    scaffold_looseness: int

    def __init__(self, scaffold_looseness: int = 9):
        """
        Parameters
        ----------
        scaffold_looseness : int
          a heuristic defining to what extent alternate/smaller scaffolds
          can be used to match a given molecule.
          this value sets how many fewer heavy atoms than the *largest*
          matching scaffold another scaffold may have and still be considered.
          too high a value may result in inappropriately generic scaffolds
          being used, while too low a value will result in too many scaffolds
          being identified
        """
        self.scaffold_looseness = scaffold_looseness

    @staticmethod
    def normalise_molecules(mols: list[gufe.SmallMoleculeComponent]) -> dict[gufe.SmallMoleculeComponent, Chem.Mol]:
        # Convert SMC to a normalised (reduced & cleaned up) version in rdkit
        # This is anonymous, so no bond orders, charges or elements
        # This makes comparing scaffolds more suited to RBFEs
        def normalised_rep(mol):
            smi = rdMolHash.MolHash(Chem.RemoveHs(mol.to_rdkit()),
                                    rdMolHash.HashFunction.AnonymousGraph)
            return Chem.MolFromSmiles(smi)

        # returns mapping of molecules to their normalised rep
        mols2anonymous = {mol: normalised_rep(mol) for mol in mols}

        return mols2anonymous

    @staticmethod
    def generate_scaffold_network(mols: list[Chem.Mol]) -> rdScaffoldNetwork.ScaffoldNetwork:
        # generates the scaffold network from the rdkit mol objects
        params = rdScaffoldNetwork.ScaffoldNetworkParams()
        params.includeScaffoldsWithAttachments = False
        params.flattenChirality = True
        params.pruneBeforeFragmenting = True
        net = rdScaffoldNetwork.CreateScaffoldNetwork(mols, params)

        return net

    @staticmethod
    def match_scaffolds_to_source(network: rdScaffoldNetwork.ScaffoldNetwork,
                                  mols: list[Chem.Mol],
                                  hac_heuristic: int) -> dict[Chem.Mol, list[tuple[str, int]]]:
        # match scaffolds in network back to normalised input molecules
        # i.e. for each molecule, which scaffolds can apply

        # will store for each molecule, potential scaffolds and their size
        mols2scaffolds = defaultdict(list)
        for scaff in network.nodes:
            # determine size of scaffold
            q = Chem.MolFromSmarts(scaff)
            natoms = q.GetNumAtoms()

            for m in mols:
                if m.HasSubstructMatch(q):
                    mols2scaffolds[m].append((scaff, natoms))

        # then filter these scaffolds to only allow those which are large enough
        mols2candidates = defaultdict(list)
        for m, scaffs in mols2scaffolds.items():
            # determine what the largest scaffold for this molecule was
            largest_scaff = max(scaffs, key=lambda x: x[1])
            # for each molecule, a size ordered list of scaffolds that they could be assigned to
            mols2candidates[m] = sorted(
                [s for s in scaffs if (largest_scaff[1] - s[1]) <= hac_heuristic],
                key=lambda x: x[1], reverse=True
            )

        return mols2candidates

    @staticmethod
    def find_solution(mol_to_candidates: dict[Chem.Mol, list[tuple[str, int]]]) -> list[tuple[str, int]]:
        # returns the best scaffolds that cover all mols
        # returns a list of (scaffold smiles, n heavy atoms)

        # reverse mapping of scaffolds onto the mols they cater for
        scaffold2mols = defaultdict(list)
        anon_mols = set()
        for mol, scaffs in mol_to_candidates.items():
            anon_mols.add(mol)
            for scaff in scaffs:
                scaffold2mols[scaff].append(mol)

        def scaffold_coverage(scaffolds, scaff2mol, all_mols) -> bool:
            """Does this combination of scaffolds cover all ligands"""
            covered_mols = set()

            for scaff in scaffolds:
                covered_mols |= set(scaff2mol[scaff])

            return covered_mols == set(all_mols)

        candidate_scaffolds = set(itertools.chain.from_iterable(mol_to_candidates.values()))
        # try one scaffold to see if it catches all molecules
        # then try all combinations of two scaffolds to see if we cover
        # etc until we find a solution
        # then pick the solution with the largest scaffolds
        # the upper bound is inclusive so that using every candidate is also tried
        for i in range(1, len(candidate_scaffolds) + 1):
            solutions = []

            for scaffolds in itertools.combinations(candidate_scaffolds, i):
                if not scaffold_coverage(scaffolds, scaffold2mols, anon_mols):
                    continue

                solutions.append(scaffolds)

            if solutions:
                # pick the best, based on HAC
                solution = max(solutions, key=lambda x: sum(s[1] for s in x))
                return solution

    @staticmethod
    def formulate_answer(solution: list[tuple[str, int]],
                         mols_to_norm: dict[gufe.SmallMoleculeComponent, Chem.Mol]
                         ) -> dict[str, list[gufe.SmallMoleculeComponent]]:
        # relate the solution scaffolds back to the input SMC

        # for each molecule, pick the largest scaffold that matches
        relationship = []
        for input_mol, anon_mol in mols_to_norm.items():
            best = -1
            best_scaff = None
            for scaff, natoms in solution:
                if natoms < best:
                    continue
                if anon_mol.HasSubstructMatch(Chem.MolFromSmarts(scaff)):
                    # remember the size so a smaller scaffold cannot overwrite it
                    best = natoms
                    best_scaff = scaff
            relationship.append((input_mol, best_scaff))

        final_answer = defaultdict(list)
        for input_mol, scaff in relationship:
            final_answer[scaff].append(input_mol)

        return dict(final_answer)

    def cluster_compounds(self, components: list[gufe.SmallMoleculeComponent]):
        # first normalise the molecules to a generic representation
        mols_to_norm = self.normalise_molecules(components)

        # then create a scaffold network from the normalised molecules
        network = self.generate_scaffold_network(list(mols_to_norm.values()))

        # reassign the scaffolds in the network back to the normalised reps
        mol_to_candidates = self.match_scaffolds_to_source(
            network,
            list(mols_to_norm.values()),
            self.scaffold_looseness,
        )

        # then try increasingly larger number of scaffolds
        # until we hit full coverage of molecules
        solution = self.find_solution(mol_to_candidates)

        # finally, relate this solution back to the input set
        return self.formulate_answer(solution, mols_to_norm)
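
The search in `find_solution` amounts to an exhaustive minimum set cover: try every combination of one scaffold, then two, and so on, and among the smallest covering combinations keep the one with the most heavy atoms in total. The following is only a minimal sketch of that idea without RDKit, assuming molecules are plain strings and scaffolds are `(label, heavy_atom_count)` tuples. The function name `smallest_cover` and the toy data are hypothetical and not part of konnektor.

```python
import itertools
from collections import defaultdict


def smallest_cover(mol_to_candidates):
    """Fewest scaffolds covering every molecule; ties broken by total size."""
    # reverse mapping: scaffold -> molecules it can represent
    scaffold2mols = defaultdict(set)
    for mol, scaffs in mol_to_candidates.items():
        for scaff in scaffs:
            scaffold2mols[scaff].add(mol)

    all_mols = set(mol_to_candidates)
    # sorted only to make the iteration order deterministic
    candidates = sorted(scaffold2mols)
    for n in range(1, len(candidates) + 1):
        covers = [
            combo for combo in itertools.combinations(candidates, n)
            if set().union(*(scaffold2mols[s] for s in combo)) == all_mols
        ]
        if covers:
            # prefer the combination with the most heavy atoms overall
            return max(covers, key=lambda combo: sum(size for _, size in combo))
    return ()


# toy input: no single scaffold covers all three molecules
toy = {
    "mol_a": [("ring6", 6), ("chain", 3)],
    "mol_b": [("ring6", 6)],
    "mol_c": [("ring5", 5), ("chain", 3)],
}
solution = smallest_cover(toy)
```

Here both `chain` + `ring6` and `ring5` + `ring6` cover the toy set, and the second pair wins on total size. The number of combinations grows quickly with the number of candidate scaffolds, so this style of search is presumably intended for the modest ligand sets typical of RBFE campaigns.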

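
The effect of `scaffold_looseness` in `match_scaffolds_to_source` can be illustrated in isolation: for one molecule, only scaffolds within that many heavy atoms of its largest matching scaffold are kept, ordered largest first. This is a hedged sketch under the same simplification as above (scaffolds as `(label, heavy_atom_count)` tuples); `filter_by_looseness` and the labels are hypothetical names chosen for illustration.

```python
def filter_by_looseness(scaffolds, looseness):
    """Keep scaffolds within `looseness` heavy atoms of the largest, biggest first."""
    largest = max(size for _, size in scaffolds)
    kept = [s for s in scaffolds if largest - s[1] <= looseness]
    return sorted(kept, key=lambda s: s[1], reverse=True)


# scaffolds that matched a single (hypothetical) molecule
matches = [("ring6", 6), ("bicyclic", 10), ("ring3", 3)]
tight = filter_by_looseness(matches, looseness=4)
loose = filter_by_looseness(matches, looseness=9)
```

With `looseness=4` the three-atom ring is dropped because it is seven heavy atoms smaller than the largest match; with the class default of 9 all three remain candidates.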