Open Chemistry, RDKit & Neo4j GSoC 2019 project
Chemical and pharmaceutical R&D produce large amounts of data of completely different natures, such as chemical structures, recipe and process data, formulation data, and data from various application tests. Taken together, these data rarely follow a schema. Consequently, relational data models and databases are frequently at a disadvantage when mapping these data appropriately. Chemical data frequently lead to rather abstract data models, which are difficult to develop, align, and maintain together with the domain experts. On retrieval, computationally expensive joins of depths that are not known in advance may cause issues.
Graph data models promise advantages here:
- they can easily be understood by non-IT experts from the research domains
- due to their plasticity, they can easily be extended and refactored
- graph databases such as Neo4j are made for coping with arbitrary path lengths
Chemical data models usually require a database that can handle chemical structures, so that structure-based queries can be used either to identify records or as filter criteria.
The project focuses on developing an extension for the Neo4j graph database for querying knowledge graphs that store molecular and chemical information. The first use case (UC1) is to identify entry points into the graph via exact, substructure, or similarity searches. UC2 is closely related to UC1, but here the intention is to use chemical structures as limiting conditions in graph traversals originating from different entry points. Both use cases rely on the same integration of RDKit and Neo4j and differ only in their Cypher statements.
Mentors:
- Greg Landrum
- Christian Pilger
- Stefan Armbruster
- Install lib/org.RDKit.jar and lib/org.RDKitDoc.jar into your local maven repository
mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
-Dfile=lib/org.RDKit.jar -DgroupId=org.rdkit \
-DartifactId=rdkit -Dversion=1.0.0 \
-Dpackaging=jar
mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
-Dfile=lib/org.RDKitDoc.jar -DgroupId=org.rdkit \
-DartifactId=rdkit-doc -Dversion=1.0.0 \
-Dpackaging=jar
- Generate .jar file with all dependencies with
mvn package
- Put the generated .jar file into the plugins/ folder of your Neo4j instance and start the server
- Add server.rdkit.index.sanitize=false to neo4j.conf if you want to switch off sanitizing for indexing. If it is not provided, true is assumed as the default.
- After executing CALL dbms.procedures(), you should see the org.rdkit.* procedures
The native libraries of RDKit depend on libFreetype and libPng. On desktop Linux systems those are typically installed by default. The Neo4j docker image is based on openjdk:11-jdk-slim, which itself is based on a minimal Debian Linux image. That image does not contain these two libraries, so you need to make sure the corresponding packages get installed.
In docker_example there is a script, run_docker.sh, that mounts a volume with these Debian packages and uses an extension script to install them upon startup of the docker container. Before using it, make sure to populate the plugins folder with the plugin's jar file.
- Plugin not present
- Feed Neo4j DB
- Then run
CALL org.rdkit.update(['Chemical', 'Structure'])
and
CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])
This triggers computation of the additional properties (fp, etc.) and creation of the fp index
Automated computation of properties is enabled only after the update procedure has run
- Plugin present
- Feed Neo4j DB
- then
CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])
The additional properties (fp, etc.) are computed automatically and index creation is triggered
The fp index is updated automatically when new :Structure:Chemical records arrive
- Plugin present
CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])
- Then feed Knime
The additional properties (fp, etc.) and the index are computed automatically
An empty Neo4j instance is prepared in advance
Whenever a new :Structure:Chemical entry arrives, the property calculation and the fp index update are performed automatically
It is possible to check index existence with CALL db.indexes
- Exact search performs considerably better if the createIndex procedure was called beforehand (it creates a property index).
CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
CALL org.rdkit.search.exact.mol(['Chemical', 'Structure'], '<mdlmol>')
(refer to tests for examples)
- Make sure the fulltext index exists with CALL db.indexes; fp_index must exist. (It should be created with the createIndex procedure.)
CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', <sanitize> (true/false))
CALL org.rdkit.search.substructure.mol(['Chemical', 'Structure'], '<mol value>', <sanitize> (true/false))
CALL org.rdkit.fingerprint.create(['Chemical', 'Structure'], 'torsion_fp', 'torsion', <sanitize> (true/false))
- a new property torsion_fp is created
CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'torsion', 'torsion_fp', 0.4, <sanitize> (true/false))
CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'pattern', 'fp', 0.7, <sanitize> (true/false))
CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(C)(C)OC(=O)N1CCC(COc2ccc(OCc3ccccc3)cc2)CC1') YIELD luri
MATCH (finalProduct:Entity{luri:luri})
CALL apoc.path.expand(finalProduct, "<HAS_PRODUCT,>HAS_INGREDIENT", ">Reaction", 0, 4) yield path
WITH nodes(path)[-1] as reaction, path, (length(path)+1)/2 as depths
MATCH (reaction)-[:HAS_INGREDIENT]->(c:Compound) WHERE org.rdkit.search.substructure.is.smiles(c, 'CC(C)C(O)=O')
RETURN path
CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(C)(C)OC(=O)N1CCC(COc2ccc(OCc3ccccc3)cc2)CC1') YIELD luri
MATCH (finalProduct:Entity{luri:luri})
CALL apoc.path.expand(finalProduct, "<HAS_PRODUCT,>HAS_INGREDIENT", ">Reaction", 0, 4) yield path
WITH nodes(path)[-1] AS reaction, path, (length(path)+1)/2 AS depths
MATCH (reaction)-[:HAS_INGREDIENT]->(c:Compound)
WITH path, COLLECT(c) as compounds
WHERE ANY( x IN compounds where org.rdkit.search.substructure.is.mol(x, '
Ketcher 9 71921 82D 1 1.00000 0.00000 0
6 5 0 0 0 999 V2000
8.9170 -12.3000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.7830 -11.8000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
10.6490 -12.3000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.7830 -10.8000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
10.6490 -10.3000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
8.9170 -10.3000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0
2 3 1 0 0 0
2 4 1 0 0 0
4 5 1 0 0 0
4 6 2 0 0 0
M END'))
RETURN path
CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CCCC(C(=O)Nc1ccc(S(N)(=O)=O)cc1)C(C)(C)C')
YIELD canonical_smiles
RETURN org.rdkit.utils.svg(canonical_smiles) as svg
- Whenever a new node is added with labels, an rdkit event handler is applied and new node properties are constructed from the mdlmol property. These are also reserved property names:
canonical_smiles
inchi
formula
molecular_weight
fp - bit-vector fingerprint in the form of indexes of positive bits ("1 4 19 23")
fp_ones - count of positive bits
mdlmol
Additional reserved property names:
smiles
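As a rough illustration of the fp and fp_ones encoding described above, the following Python sketch parses a fingerprint string of on-bit indexes and derives the bit count. The helper names are hypothetical and not part of the plugin, and the sketch assumes the indexes are separated by whitespace, as in the "1 4 19 23" example.

```python
def parse_fp(fp):
    """Parse a whitespace-separated list of on-bit indexes into a set of ints."""
    return {int(token) for token in fp.split()}


def count_ones(fp):
    """Number of positive bits, i.e. the kind of value fp_ones holds."""
    return len(parse_fp(fp))


bits = parse_fp("1 4 19 23")
print(sorted(bits))             # [1, 4, 19, 23]
print(count_ones("1 4 19 23"))  # 4
```

Storing only the indexes of positive bits keeps sparse fingerprints short and lets each index be treated as a separate token.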
If the graph was populated with nodes before the extension was loaded, it is possible to apply a procedure:
CALL org.rdkit.update(['Chemical', 'Structure'])
which iterates through the nodes with the specified labels and creates the properties described above.
In order to speed up an exact search, create an index on the canonical_smiles property
CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
CALL org.rdkit.search.exact.mol(['Chemical', 'Structure'], '<mdlmol block>')
- RDKit provides functionality to run an exact search on top of smiles and mdlmol blocks; it returns a node whose canonical smiles matches
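Since exact search comes down to matching canonical smiles, it can be thought of as an equality lookup on the canonical_smiles property. The Python sketch below contrasts a full scan with a dictionary lookup. It is only an analogy for what a property index buys, not the plugin's implementation; the node records and ids are invented, and it assumes the query has already been converted to the same canonical form as the stored values.

```python
# Invented sample records; only the first smiles string appears in this README.
nodes = [
    {"id": 1, "canonical_smiles": "CC(=O)Nc1nnc(S(N)(=O)=O)s1"},
    {"id": 2, "canonical_smiles": "CCO"},
    {"id": 3, "canonical_smiles": "c1ccccc1"},
]


def exact_scan(nodes, smiles):
    """Without an index: compare the query against every node."""
    return [n["id"] for n in nodes if n["canonical_smiles"] == smiles]


def build_index(nodes):
    """With an index: map each canonical smiles to the ids that carry it."""
    index = {}
    for n in nodes:
        index.setdefault(n["canonical_smiles"], []).append(n["id"])
    return index


index = build_index(nodes)
query = "CC(=O)Nc1nnc(S(N)(=O)=O)s1"
print(exact_scan(nodes, query))  # [1]
print(index.get(query, []))      # [1]
```

Both calls return the same ids; the scan touches every node, while the lookup cost does not grow with the number of nodes.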
CALL org.rdkit.update(['Chemical', 'Structure'])
- Update procedure (manual initialization of the properties from the mdlmol property)
- The current implementation uses a single thread and may take a long time (>3 minutes) on a huge database
CALL org.rdkit.search.createIndex(['Chemical', 'Structure'])
- Creates a fulltext index (called rdkitIndex) on property fp, which is required for substructure search
- Creates an index for the :Chemical(canonical_smiles) property
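One common reason a fingerprint index helps substructure search is pre-screening: a molecule can only contain the query as a substructure if every bit set in the query's pattern fingerprint is also set in the molecule's fingerprint. The Python sketch below illustrates that screening idea on on-bit index strings. It describes the general technique as an assumption, not how the plugin actually queries its fulltext index; the function names and sample fingerprints are made up.

```python
def on_bits(fp):
    """Parse a whitespace-separated string of on-bit indexes into a set."""
    return {int(token) for token in fp.split()}


def may_contain(candidate_fp, query_fp):
    """Screen: True if every query bit is also set in the candidate."""
    return on_bits(query_fp) <= on_bits(candidate_fp)


# Hypothetical fingerprints keyed by an invented molecule name.
candidates = {
    "mol_a": "1 4 19 23",
    "mol_b": "4 23 42",
    "mol_c": "1 2 3",
}
query = "4 23"

survivors = [name for name, fp in candidates.items() if may_contain(fp, query)]
print(survivors)  # ['mol_a', 'mol_b']
```

The screen only rules candidates out. Survivors can still be false positives, so each one needs a real substructure match afterwards.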
CALL org.rdkit.search.deleteIndex()
- Deletes the fulltext index (called rdkitIndex) on property fp, which is required for substructure search
- Deletes the index for the :Chemical(canonical_smiles) property
CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
- SSS based on smiles substructure
CALL org.rdkit.search.substructure.mol(['Chemical', 'Structure'], '<mol value>')
- SSS based on mdlmol block substructure
CALL org.rdkit.fingerprint.create(['Chemical', 'Structure'], 'morgan_fp', 'morgan')
- Creates a new property called morgan_fp with fingerprint type morgan on all nodes
- The supporting properties morgan_fp_type and morgan_fp_ones are also added
- Creates a fulltext index on this property
- A node is skipped if its smiles cannot be converted with this fingerprint type
- It is not allowed to use a property name equal to a predefined (reserved) one
CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'pattern', 'fp', 0.7)
- Calls similarity search with the following parameters:
  - Node labels: ['Chemical', 'Structure']
  - Smiles: 'CC(=O)Nc1nnc(S(N)(=O)=O)s1'
  - Fingerprint type: 'pattern'
  - Property name: 'fp'
  - Threshold: 0.7
- The smiles value is converted into the specified fingerprint type (if possible) and compared with the nodes that have the property ('fp' in this case)
- The threshold is a lower bound for the score value
- The current implementation uses a single thread and may take a long time (>3 minutes) on a huge database
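The thresholding step can be sketched in Python as follows. The sketch assumes the score is the Tanimoto coefficient between sets of on-bit indexes, which is a common choice for fingerprint similarity; this README does not state which metric the plugin uses, and the function names and sample fingerprints are hypothetical.

```python
def on_bits(fp):
    """Parse a whitespace-separated string of on-bit indexes into a set."""
    return {int(token) for token in fp.split()}


def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by all on-bits; 0.0 if both are empty."""
    a, b = on_bits(fp_a), on_bits(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_search(candidates, query_fp, threshold):
    """Keep candidates scoring at or above the threshold, best first."""
    hits = []
    for name, fp in candidates.items():
        score = tanimoto(fp, query_fp)
        if score >= threshold:  # threshold is a lower bound on the score
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints keyed by an invented molecule name.
candidates = {"mol_a": "1 4 19 23", "mol_b": "1 4 19", "mol_c": "7 8"}
print(similarity_search(candidates, "1 4 19 23", 0.7))
# [('mol_a', 1.0), ('mol_b', 0.75)]
```

A loop like this visits every candidate once, which is consistent with the single-threaded run time noted above growing with database size.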
- User-defined functions
org.rdkit.search.substructure.is.smiles(<node object>, '<smiles_string>')
org.rdkit.search.substructure.is.mol(<node object>, '<mol_string>')
- Return a boolean answer: does the specified node object have a substructure match for the provided smiles_string or mol_string?
- User-defined function
org.rdkit.utils.svg('<smiles_string>')
- Return svg image in text format from smiles
- Implementation of exact search (100%)
- Implementation of substructure search (90%, several minor bugs)
- Implementation of condition based graph traversal - usage of function calls in complex queries (100%)
- Implementation of similarity search (70%, major performance issues)
- Coverage with unit tests (80%, not all invalid arguments for procedures are tested)
- Speed up batch tasks by utilizing several threads (currently waiting for an issue on the native level to be resolved)
- Speed up the similarity search procedures
- Solve minor bugs (todos) such as the unclosed query object during SSS
- Compatibility of native libraries for win64 (beginning of the development)
- Lazy stream evaluation and the unresolved issue with the query object during SSS
- Parallelization of stream evaluations
The plugin supports OpenJDK and Oracle JDK Java versions below 12.
Later versions changed the policy for security-sensitive fields and are currently not supported.