{"name":"ParaSim Documentation","tagline":"All you need to know on one page","body":"![ParaSim Logo](https://raw.github.com/cherhaus/cherhaus.github.io/master/images/parasim-logo-v02.png) ParaSim - Parallelized high-throughput structural similarity calculations\r\n=========================================================================\r\n\r\nDiversity assessments and structural comparisons of large compound databases require calculating similarities of millions of compounds in a reasonable time. The *ParaSim* programme addresses this challenge by adapting similarity calculations to high-performance computer environments.\r\n\r\n*ParaSim* parallelizes the calculations according to the number of available computing cores on a single machine. The programme is optimized for the throughput of very large numbers of query structures against very large numbers of reference structures. For that reason the reference structure dataset in its entirety is loaded into memory prior to calculations. The size of the reference dataset is therefore limited solely by the available memory. As a special feature, repeatedly queried reference datasets can be kept in memory as persistent memory objects to be immediately available.\r\n\r\n*ParaSim* calculates similarities based on binary structural fingerprints. A fingerprint is a set of \"on\" (1) or \"off\" (0) bits, each indicating whether a particular structural feature is present, which can be stored as a binary object.\r\n*ParaSim* does not compute fingerprints by itself but relies on third-party software to do so. In principle, any type of structural fingerprint which can be stored in a bitset can be used by *ParaSim*. 
Examples of fingerprints usable by *ParaSim* are included in the open-source chemoinformatics toolkits *RDKit* (http://www.rdkit.org), *CDK* (http://cdk.sourceforge.net/) or *OpenBabel* (http://www.openbabel.org) as well as the commercial chemoinformatics software packages *Pipeline Pilot*™ from Accelrys® (http://www.accelrys.com/products/pipeline-pilot/) or the Digital Chemistry® toolkit (http://www.digitalchemistry.co.uk). As *ParaSim* calculates similarities based on binary representations, fingerprint lengths should ideally be a multiple of 32 (as integer size is 32-bit on most systems, also on 64-bit machines) and must be a multiple of 8 (as character size is 8-bit).\r\n\r\n\r\n------\r\n\r\n## What ParaSim does\r\n\r\n*ParaSim* calculates the well-known Tanimoto (or Jaccard) and Dice similarity indexes from fingerprint query and reference input files. Dissimilarity is represented by a similarity index of 0.0, identity by 1.0. The user can define how many reference molecules (nearest neighbors) are identified per query molecule. By default, one hit molecule, the nearest neighbour, is captured. Moreover, thresholds can be defined indicating minimum and maximum similarities for hit molecules. A maximum similarity threshold can further be used to decide whether identity hits with a similarity of 1.0 should be included or excluded. In case they should be excluded, the maximum similarity threshold can be set to a value < 1, e.g. 0.999999.\r\n\r\n*ParaSim* accepts query and reference input files containing fingerprint information in a format described below in more detail. *ParaSim* output is written in a tab-delimited format to the system's standard output stream (stdout, usually the console) from where it can easily be redirected to files or pipes using the operating system's redirection mechanisms. Output consists of one row per result containing the ID of the query structure, the ID of the found reference structure and the computed similarity index. 
If only one nearest neighbour per query is requested and no similarity thresholds are applied (so the full reference set will be queried), then also the average similarity of the query molecule against all reference molecules is printed out for statistical purposes. For multiple nearest neighbour reference molecules per query, multiple output lines are written, all containing the same ID of the query structure. In this case the statistical information is omitted in order to keep the output clearly arranged. Using the \"verbose\" option `-v`, additional information describing the progress of file reading and calculating steps is written to the standard error stream (stderr, usually also the console).\r\n\r\n**Notes**\r\n* Depending on the fingerprint type generated with third-party software, a similarity index of 1.0 does not necessarily mean full structural identity! E.g. fingerprints based on functional classifications may lead to a similarity of 1.0 for highly similar but not identical structures.\r\n* For speed reasons, *ParaSim* does not sort the output in any way but immediately returns results as they are computed by the executing threads. Therefore, the order of output lines may vary from run to run. If the output is required in a sorted way, this can easily be achieved by piping it into a subsequent sort command.\r\n\r\n\r\n-----\r\n\r\n## Contents of this Package\r\n\r\n*ParaSim* is principally able to be executed by itself if appropriate fingerprint input files are provided. 
However, in order to extend the usage of *ParaSim* to persistent memory objects and to facilitate similarity searches directly from Smiles strings or structure files (SDF or Smiles), several additional scripts are provided:\r\n\r\n1.`fp2mem.pl` : Creates and manages persistently stored memory objects with reference fingerprint data.\r\n\r\n2.`rdkit2parasim.py` : Generates *ParaSim* input files from Smiles strings or SDF/Smiles files applying RDKit's Morgan or feature-based Morgan fingerprints. Requires installation of Python and RDKit.\r\n\r\n3.`Molecule2Parasim.xml` : A Pipeline Pilot™ protocol for the generation of *ParaSim* input files from Smiles strings or SDF/Smiles files applying ECFP or FCFP fingerprints. Requires installation of Pipeline Pilot™.\r\n\r\n4.`parasim-conversion-knime-demo.zip` : An example workflow for the open-source workflow engine KNIME demonstrating how *ParaSim* input files can be generated.\r\n\r\n5.`simsearch.pl` : Allows similarity searches directly from a Smiles string or structure files (SDF or Smiles) against a reference dataset stored in *ParaSim* format.\r\n\r\nFor testing, sample files with 10 records drawn from the freely available PubChem and ZINC databases are provided as well in the data/ subdirectory.\r\n\r\nMoreover, several installation-related files are packaged together with *ParaSim*:\r\n\r\n1.`parasim-config.txt` : This central configuration file may be edited by the user and stores several default values used by the different scripts.\r\n\r\n2.`prepare_and_call_pipeline_pilot.csh` : A sample configuration shell script to prepare the system environment for inclusion of Pipeline Pilot™ fingerprint calculations by `simsearch.pl`.\r\n\r\n3.`prepare_and_call_python_rdkit.csh` : A sample configuration shell script to prepare the system environment for inclusion of RDKit fingerprint calculations by `simsearch.pl`.\r\n\r\n4.`Parasim.pm` : A Perl module containing all shared 
*ParaSim* functions.\r\n\r\n-----\r\n\r\n## Requirements\r\n\r\n**Operating System**\r\n\r\nIn the current implementation, *ParaSim* itself is a single Perl script with a parallelized computational core written in C. The C core potentially applies extensions of the GCC compiler or hardware routines of Intel® processors (Intel® Streaming SIMD Extensions/SSE4). For multithreading, the C core makes use of POSIX threads via the *pthread* library which has to be accessible to the compiler. For the use of persistent memory objects *ParaSim* uses SysV inter-process communication (IPC) concepts. Due to these requirements, *ParaSim* currently can only be executed in a Unix/Linux OS environment with the GCC compiler installed.\r\n\r\n\r\n**Software**\r\n\r\n*ParaSim* was tested successfully with Perl version 5.10.0 and 5.12.1 under OpenSuse Linux 11.3 on 32-bit dual-core and Suse Linux Enterprise Server 11 SP3 on 64-bit multiprocessor machines up to 192 cores. Some Perl modules which are not part of the standard distribution are required:\r\n* The C code is directly integrated into the Perl code and compiled by the Perl module [Inline::C](http://search.cpan.org/~sisyphus/Inline-0.50/C/C.pod) which is not part of the standard Perl distribution and therefore must be manually installed.\r\n* SysV IPC support is supplied by the Perl module [IPC::ShareLite](http://search.cpan.org/~andya/IPC-ShareLite-0.17/lib/IPC/ShareLite.pm) which also requires separate installation.\r\n\r\nIf you want to make use of the tools packaged together with *ParaSim*, installation of third-party software like Python, RDKit and Pipeline Pilot™ or further software packages for fingerprint calculations may be necessary.\r\n\r\n\r\n**Memory**\r\n\r\nBecause *ParaSim* loads the reference set into memory, the size of the reference set is limited only by the available memory. 
Typically, memory consumption per 1 million reference fingerprints of length 1024 is ~150 MB as persistent memory object and ~300 MB during runtime.\r\n\r\n-----\r\n\r\n## Installation\r\n\r\n*ParaSim* itself currently consists of just a single Perl script which also includes the C code. Compilation of the C source code is performed automatically by the Inline::C module when calling the Perl script. Therefore, basically no installation is required:\r\n* Make sure that the OS and software requirements described above are met\r\n* Extract the archive\r\n* In case you do not want to prepend the Perl call itself each time, make the script executable (`chmod 755 parasim.pl`). *ParaSim* expects the perl executable to be located in /usr/bin/perl. If that is not true in your case, change the default Perl path in the first line of the script's source code.\r\n\r\nIn order to test if *ParaSim* runs correctly, try\r\n\r\n perl parasim.pl -q data/pubchem-test-fcfp6.txt -r data/zinc-test-fcfp6.txt\r\n\r\nThe output should be (row order may vary between runs)\r\n\r\n QUERY REFERENCE TANIMOTO AVG_TANIMOTO\r\n 68664 ZINC01914437 0.198019801980198 0.104496307506587\r\n 71360 ZINC03775002 0.133333333333333 0.103979492391050\r\n 68938 ZINC03774999 0.160377358490566 0.122158970101436\r\n 71696 ZINC03774999 0.163636363636364 0.118017086925888\r\n 71917 ZINC03774999 0.147368421052632 0.102165139370256\r\n 71107 ZINC03774999 0.173076923076923 0.128406853662191\r\n 71542 ZINC01914437 0.185185185185185 0.107759423159295\r\n 71227 ZINC03774999 0.181818181818182 0.129684949182247\r\n 71767 ZINC03775009 0.174418604651163 0.122120643622887\r\n 71923 ZINC03774991 0.154761904761905 0.117569042869504\r\n<br>\r\n\r\nIf you want to start similarity searches directly from SDF or Smiles files using `simsearch.pl`, fingerprints and input files for *ParaSim* need to be generated during runtime using third-party software. 
Therefore, third-party software packages like Python and RDKit or Pipeline Pilot™ need to be installed separately:\r\n* For `rdkit2parasim.py` make sure that, besides the RDKit modules, the modules \"sys\", \"argparse\", \"gzip\" and \"base64\" are also accessible to the Python installation.\r\n* Paths to executables and scripts need to be defined in the respective section of `parasim-config.txt`. Therefore replace placeholders like \"my_path\" or \"my_server\" in `parasim-config.txt` by the path and server information fitting your environment.\r\n* It is sometimes necessary to prepare environments for scripting languages like Perl or Python or for Pipeline Pilot™. This can be achieved either by calling executables from within a preparatory shell script or by calling several commands combined by '&&'. You may use the example scripts `prepare_and_call_pipeline_pilot.csh` and `prepare_and_call_python_rdkit.csh` for this purpose and adapt them to your needs.\r\n\r\n**Technical note**: In the current version of *ParaSim*, Inline::C compiles the C sources only if binaries do not yet exist or if the C sources were modified. Therefore, if you use *ParaSim* on different machines in a network, it may happen that you cannot run *ParaSim* on one architecture because it was compiled on a different architecture before. In this case, either re-run the script from a different run directory or apply a slight change in the C section of the source code (a single space character is already sufficient) to trigger a recompilation for the new architecture. 
This issue will be addressed in a future version of *ParaSim*.\r\n\r\n-----\r\n\r\n## How to use ParaSim\r\n\r\n### Synopsis\r\n\r\n USAGE: parasim.pl [options] -q query.txt[.gz] [-r reference.txt[.gz]]\r\n\r\n OPTIONS: -min #min_similarity The minimum similarity (0.0 = dissimilarity, 1.0 = identity).\r\n This has impact on the performance.\r\n Default: 0.00\r\n -max #max_similarity The maximum similarity (0.0 = dissimilarity, 1.0 = identity).\r\n This has impact on the performance.\r\n Default: 1.00\r\n -n/k #num_similars The number of hits to keep (k nearest neighbors).\r\n Default: 1\r\n -c similarity_coeff The similarity coefficient to use. Allowed values:\r\n 'tan' : Tanimoto/Jaccard similarity coefficient\r\n 'dice' : Dice similarity coefficient\r\n Default: 'tan'\r\n -v Verbose. Print detailed status and progress information.\r\n -q query.txt[.gz] The file containing the query fingerprints.\r\n Wildcards are expanded but have to be quoted.\r\n -r reference.txt[.gz] The file containing the reference fingerprints.\r\n Wildcards are expanded but have to be quoted.\r\n Use 'mem:#key' to identify a persistent memory object\r\n which was created with fp2mem-persist.pl before.\r\n Default: 'mem:0'\r\n -h/help Show this help.\r\n\r\n ADVANCED OPTIONS:\r\n -t #threads The number of threads to be used in parallel.\r\n Default: Number of available cores on host\r\n -b binary_class The class used to represent the fingerprint.\r\n This has impact on the performance. Allowed values:\r\n 'int' : Integer representation of fingerprint bitset\r\n 'char' : Character representation of fingerprint bitset\r\n Default: 'int' for fingerprints being a multiple of 32,\r\n 'char' for fingerprints being a multiple of 8.\r\n -u on/off Switch on/off loop-unrolling. 
This has impact on the performance.\r\n Default: 'on' for 32 x sizeof(int) bit fingerprints,\r\n 'on' for 64 x sizeof(char) bit fingerprints,\r\n 'off' for all other fingerprint lengths.\r\n\r\n\r\n<br>\r\n\r\n### Advanced Options\r\n\r\nBeside the set of standard options whose purpose is to control the basic features of *ParaSim*, *ParaSim* also provides a set of advanced options for experienced users which control the technical behaviour of the software.\r\n\r\nBy default, *ParaSim* uses all available CPU cores for parallel calculations and automatically reduces the number to the number of query fingerprints if necessary. However, given the case that only a lower,\r\nlimited number of cores shall be used by *ParaSim*, this can be manually defined using option `-t`.\r\n\r\n*ParaSim* implements several different options for the most time-consuming calculation, the count of on-bits in a fingerprint, the so-called *bitcount* or *popcount*. By default, it determines the best\r\napplicable method based on the length of the fingerprint. However, for test or research purposes, the calculation method can completely be controlled by the user:\r\n\r\n1.The way how the fingerprint is internally interpreted (option `-b`): `char` (character) or `int` (integer) with a speed advantage for 'int'.\r\n\r\n2.Loop-unrolling (option `-u`): `on` or `off`. For particular fingerprints lengths (currently `32 x sizeof(int)` and `64 x sizeof(char)` with `sizeof(int) = 32` on most systems and usually `sizeof(char) = 8`) a special internal algorithm is available which is supposed to result in additional gain of performance. If not set manually, it will be used automatically if applicable.\r\n\r\n<br>\r\n\r\n### User Defaults\r\n\r\n*ParaSim* comes with a central configuration file `parasim-config.txt` which consolidates the different default values and makes it easy to modify them. 
Especially, paths to preinstalled third-party software\r\npackages for the calculation of fingerprints from chemical structure files are defined here. Just use a text editor of your choice to edit the file and change default values if required. Comments within the file explain the default values' meanings.\r\n\r\nThe maximum number of allowed parallel threads (set to 256) is the only default value which can only be modified in the C source code section of the *ParaSim* Perl script. This parameter limits the memory used for thread function parameters and is more of technical value. The practically used number of threads is defined by option `-t` and must be equal or lower than this value (checked during runtime). If this is not sufficient, replace the value by the one you require in the source code command `#define MAX_THREADS 256`.\r\n\r\n-----\r\n\r\n## Factors influencing Calculation Performance\r\n\r\nSeveral factors have direct influence on the calculation performance. In parts this can be significant.\r\n\r\n1.The number of cores: Obviously, parallelisation has the strongest impact on performance (option `-t`, see Advanced Options).\r\n\r\n2.The fingerprint binary class: Where applicable, fingerprints should be interpreted as integers which is faster (option `-b`, see Advanced Options).\r\n\r\n3.The fingerprint length: Depending on the length of the fingerprint, faster or slower calculation routines can be called. Advisable is a fingerprint length of a multiple of 32 as fingerprints can then be interpreted as integers. Moreover, if the fingerprint length fulfils the requirements for loop-unrolling (option `-u`, see Advanced Options), this adds additional speed. 
The current version of *ParaSim* contains algorithms optimized for a fingerprint length of 512 or, even better, 1024.\r\n\r\n4.Thresholds: Application of similarity thresholds has strong influence on the computation speed because thresholds allow purging of reference compounds prior to similarity calculations. The narrower the thresholds are set, the faster the calculations are performed. Usually, for finding nearest neighbors, a minimum similarity of about 0.3-0.5 may be sufficient which already allows to save about a third to half of the computation time.\r\n\r\n-----\r\n\r\n## Input File Format\r\n\r\nIn the current version, *ParaSim* expects that a software package which is able to compute structural fingerprints and to convert them into any binary format is also able to compute the number of \"on\" bits for that fingerprint, the so-called *bitcount* or *popcount*. Bitcounts of query and reference bitsets as well as the intersection of both are the basis of the similarity calculation. Therefore the *ParaSim* file format for the query and reference fingerprints is a tab-delimited plain text format (Windows or Linux style) with one row for each structure containing three columns:\r\n\r\n* A unique alphanumeric row/structure identifier\r\n* The bitcount of the fingerprint for that structure in integer format\r\n* The fingerprint bitset encoded in the common Base64 string format\r\n\r\nA headline containing column identifiers describing the fingerprint type is mandatory. This fingerprint description is used by *ParaSim* to check if the same fingerprint type is used in both the query and reference data sets. A more descriptive appended '\\_BASE64' (or prepended 'BASE64\\_') is tolerated but not mandatory and will be ignored during comparison of the fingerprint types. The name of the structure identifier is detected by the file parser but so far this information is not used. 
The name of the bitcount column must be 'BITCOUNT' or 'POPCOUNT'.\r\n\r\nThe size of the fingerprint bitset (and Base64 string) is not fixed. This implies that the fingerprint bitset size has to be the same for query as well as for reference fingerprint files which is checked by *ParaSim* when the reference file is loaded.\r\n\r\nFiles can be used either in plain text or in compressed gzip format in order to save disk space for large databases. Filename wildcards are extrapolated to multiple files but need to be quoted. Example query and reference files are packaged together with the *ParaSim* script itself in the data/ subdirectory. A typical input file looks like the following:\r\n\r\n CID BITCOUNT FCFP_6_BASE64\r\n 68664 52 AwIDARAAAAAAAIAAAAAAAAAEACAABgAAEAAAAAAAAAAAAAAAAAEAAAAA [... truncated]\r\n 68938 56 CxIBCZAAAAIBAAAEAAABAAAggAAABgAAQIBAAAAAAYAAAEAAAAAAAAAA [... truncated]\r\n 71360 70 A0IDAREAAQMBEIAAAACAwAAEAAAAAmAAAIAEAAAIQIAAAIAgAAIAQAAA [... truncated]\r\n [...]\r\n\r\n<br>\r\n-----\r\n\r\n## Persistent Memory Objects\r\n\r\nAs a special feature, *ParaSim* makes use of pre-stored persistent memory objects. This is because, for large data sets, reading of input files from disk becomes the performance-limiting step in comparison to\r\npure calculation times. This is particularly true for repeated queries against the same set(s) of data.\r\n\r\nFor that purpose, a supportive tool for *ParaSim* is available, `fp2mem.pl`, which reads a reference fingerprint file and stores it persistently in RAM. Memory consumption is about 100 MB per 1 million of fingerprints of length 1024. 
Several memory objects can be stored in parallel; each is identified and addressed by an integer key. `fp2mem.pl` can also be used to retrieve information about all stored memory objects on a machine as well as to destroy a particular memory object identified by its key.\r\n\r\nTo access a memory object which was generated with fp2mem.pl as a reference dataset with *ParaSim*, use the *ParaSim* option `-r` (to define the reference set) together with the keyword `mem:` combined with the integer key of the object you want to use, e.g. `parasim.pl -r mem:7`. This will trigger *ParaSim* to read all reference fingerprint information directly from that particular memory object with key 7 and will significantly reduce the time until calculation results are returned.\r\n\r\nFor creation of a memory object, `fp2mem.pl` reads a valid *ParaSim* fingerprint file. Creation is triggered using option `-create` together with a numeric key which can be selected from a limited range of allowed integer values (default: 0-10) in order to avoid excessive consumption of memory.\r\n\r\nInformation about stored datasets can be reviewed together with all information about the originator, the source file and the fingerprint type applying option `-info` for information about all datasets or again in combination with an integer key for one particular dataset. Similarly, options `-destroy` and `-dump`, in combination with an integer key, remove a dataset from memory or dump its content to stdout (for debugging/testing only).\r\n\r\nIt may be useful to trigger regular updating of a frequently used reference data set in memory by a cron job. For that purpose, option `-force` was added to prevent fp2mem.pl from requesting confirmation\r\nbefore overwriting an existing memory object. 
For the same purpose, option `-silent` suppresses all output of progress information.\r\n\r\n**fp2mem.pl options summary:**\r\n\r\n -info [#key] Output information about all existing memory objects.\r\n Optionally, output information for one object identified by #key.\r\n -destroy #key Destroy the memory object identified by #key.\r\n -dump #key For testing only: Dumps the mem object's content to stdout.\r\n -create #key Create the memory object identified by #key. Requires option -file.\r\n -file fingerprints.txt[.gz] Used together with -create: The file containing the fingerprint data.\r\n Wildcards are expanded but have to be quoted.\r\n -silent Used together with -create: Suppress progress information output.\r\n -force Force deletion or recreation of existing memory object without confirmation.\r\n CAUTION: This will overwrite all existing content of this object!\r\n -help/h Show this help.\r\n\r\n<br>\r\n**Technical note:** The integer keys provided by the user are not used as they are but are converted internally to a numerical key which is unique for each machine. The reason is that all *ParaSim*-related tools need to identify the same memory objects from the same keys, but the key structure should not be so simple that the keys could be confused with keys potentially used by other applications.\r\n\r\n-----\r\n\r\n## How to use the Tools shipped together with ParaSim\r\n\r\nTogether with *ParaSim* several additional scripts are packaged to facilitate the application of *ParaSim* and to demonstrate possible use cases. The scripts wrap pre-installed third-party software for\r\ncalculation of fingerprints. Thus, query or reference files for *ParaSim* can be generated directly from available structure files (SDF or Smiles).\r\n\r\n### rdkit2parasim.py\r\n\r\nThis script expects a running installation of Python and RDKit. It converts an SDF or Smiles file (also gz-compressed) into a *ParaSim* fingerprint input file. 
If the script's default parameters are used, it requires source and destination filenames as arguments as well as the name of a property containing the unique integer ID of the structure. For regular Smiles files containing only two columns without column names, this parameter must be \"\\_Name\". So far, the RDKit implementations of Morgan fingerprints and feature-based Morgan fingerprints with different radii can be generated.\r\n\r\nIn order to check if the script runs correctly, try\r\n\r\n python rdkit2parasim.py pubchem-test.sdf dest.txt CID\r\nor\r\n\r\n python rdkit2parasim.py pubchem-test.smi dest.txt _Name\r\n\r\nThe content of file dest.txt should be identical to the provided file pubchem-test-featmorgan3.txt.\r\n\r\n**Options:**\r\n\r\n positional arguments:\r\n source A valid Smiles string or the path to the source file. Can be a\r\n .sdf[.gz] or .smi file\r\n destination Path of the destination file. Will be a tabbed .txt[.gz] file\r\n id Name of the property containing the unique structure\r\n identifier. If the source is a valid Smiles string, the name of\r\n the property can be freely chosen and it will be created during\r\n runtime. If the source is a file of type .smi without title\r\n line, it must be \"_Name\"\r\n\r\n optional arguments:\r\n -h, --help show this help message and exit\r\n -f FP RDKit fingerprint to be used. Allowed values: MORGAN or FEATMORGAN.\r\n DEFAULT: FEATMORGAN\r\n -r RADIUS radius of the fingerprint. DEFAULT: 3\r\n -l LENGTH length of the fingerprint in bits. Must be a multiple of 8.\r\n DEFAULT: 1024\r\n -v verbose: Print additional status information\r\n\r\n<br>\r\n### Molecule2Parasim.xml\r\n\r\nThis is a protocol for Pipeline Pilot™. It can be run either by importing it directly into a Pipeline Pilot™ client window or by calling it through another supportive script, `simsearch.pl`. 
Therefore, it requires a running Pipeline Pilot™ server (tested with version 8.5) which needs to be accessible via http to be called by `parasim.pl`. Make sure that you properly set the execution path for anonymous user access to Pipeline Pilot™ protocols in parasim-config.txt. The protocol reads molecules from SDF or Smiles files (also gz-compressed) and converts them either to FCFP or ECFP fingerprints of radius 2, 4, 6, 8, 10 or 12.\r\n\r\nIn order to check if Pipeline Pilot™ settings are set correctly for access by `simsearch.pl`, try:\r\n\r\n perl simsearch.pl -fp FCFP_6 -q data/pubchem-test.sdf -r data/zinc-test-fcfp6.txt -id CID\r\n\r\nFor the Smiles input version, the internal name of the ID property is \"Data\":\r\n\r\n perl simsearch.pl -fp FCFP_6 -q data/pubchem-test.smi -r data/zinc-test-fcfp6.txt -id Data\r\n\r\nIn both cases, output should be:\r\n\r\n QUERY REFERENCE TANIMOTO AVG_TANIMOTO\r\n 68664 ZINC01914437 0.198019801980198 0.104496307506587\r\n 68938 ZINC03774999 0.160377358490566 0.122158970101436\r\n 71360 ZINC03775002 0.133333333333333 0.103979492391050\r\n 71696 ZINC03774999 0.163636363636364 0.118017086925888\r\n 71917 ZINC03774999 0.147368421052632 0.102165139370256\r\n 71107 ZINC03774999 0.173076923076923 0.128406853662191\r\n 71542 ZINC01914437 0.185185185185185 0.107759423159295\r\n 71227 ZINC03774999 0.181818181818182 0.129684949182247\r\n 71767 ZINC03775009 0.174418604651163 0.122120643622887\r\n 71923 ZINC03774991 0.154761904761905 0.117569042869504\r\n\r\n<br>\r\n### parasim-conversion-knime-demo.zip\r\n\r\nThis example workflow demonstrates how, in principle, *ParaSim* input files can be generated with the open-source workflow engine KNIME (http://www.knime.org/) applying either RDKit or CDK fingerprints. 
Before using it, make sure you have the required Knime packages installed.\r\n\r\n**Caution:**As the internal calculations applied within KNIME may differ from the implementations in the Perl or Python scripts, fingerprint files generated with KNIME may be different to those generated with the scripts. Therefore, only use fingerprint input files from the same source.\r\n\r\n### simsearch.pl\r\n\r\nThis is the most powerful supportive script for *ParaSim* as it integrates the generation of fingerprint files either with RDKit or with Pipeline Pilot™ and the similarity search done with *ParaSim* itself. Therefore it allows similarity search against pre-computed reference fingerprint files directly from SDF or Smiles query files.\r\n\r\nAs a wrapper script, simsearch.pl combines the functionalities and parameter sets of the three wrapped scripts. In addition to the already described *ParaSim* parameters, additional parameters are required for the fingerprint type to generate (option `-fp`) and the input file data field which contains the unique integer ID identifying each structure (option `-id`). For the full list of the combined set of options, use `perl simsearch.pl -h`.\r\n\r\nSimsearch.pl accepts SDF and Smiles files, also gz-compressed. For common Smiles files which only contain two columns without column names, one for the Smiles code and one for the ID, the ID data field name needs to be \"Data\" for use with Pipeline Pilot™ and \"\\_Name\" for use with RDKit.\r\n\r\n**Initialisation:**If you want to start similarity searches directly from SDF or Smiles files using simsearch.pl, fingerprints and ParaSim input files need to be generated during runtime using either RDKit (through rdkit2parasim.py) or PipelinePilot™ (through Molecule2ParaSim.xml). Therefore, paths to the executables and scripts need to be defined in the paths section of `parasim-config.txt`.\r\n\r\nThe functionality check for Pipeline Pilot™ fingerprints was described above. 
In order to check if it runs correctly for RDKit fingerprints, try:\r\n\r\n perl simsearch.pl -fp featmorgan_3 -q data/pubchem-test.sdf -id CID -r data/zinc-test-featmorgan3.txt\r\n\r\nOutput should be:\r\n\r\n QUERY REFERENCE TANIMOTO AVG_TANIMOTO\r\n 68664 ZINC03775002 0.181818181818182 0.116428576403331\r\n 68938 ZINC03774999 0.146788990825688 0.121462650379503\r\n 71696 ZINC03774999 0.168141592920354 0.108370896568424\r\n 71360 ZINC03774991 0.125000000000000 0.101284467917619\r\n 71542 ZINC01914437 0.228571428571429 0.141443815964918\r\n 71917 ZINC01914437 0.135416666666667 0.101810400382934\r\n 71227 ZINC03775002 0.191304347826087 0.133582252553683\r\n 71107 ZINC03774999 0.216981132075472 0.124678368523578\r\n 71767 ZINC03774991 0.116504854368932 0.096029465692885\r\n 71923 ZINC03774999 0.144230769230769 0.121297586262549\r\n<br>\r\n\r\n\r\n----\r\n\r\n## Application Examples\r\n\r\n**1. Load file pubchem-test-featmorgan3.txt into persistent memory object with key 0:**\r\n\r\n perl fp2mem.pl -create 0 -file data/pubchem-test-featmorgan3.txt\r\n\r\n Reading Reference fingerprints from base64...\r\n \r\n File data/pubchem-test-featmorgan3.txt:\r\n ID: CID, FP type: FEATMORGAN_3, has bitcounts\r\n FP length: 1024\r\n Fingerprints read from file: 10\r\n\r\n Reference: 10 fingerprints read in total\r\n\r\n Created key 0 from file data/pubchem-test-featmorgan3.txt<br>\r\n\r\n KEY : 0\r\n RECORDS : 10\r\n FILE : <your_parasim_path>/data/pubchem-test-featmorgan3.txt\r\n ID FIELD : CID\r\n FP TYPE : FEATMORGAN_3\r\n FP LENGTH : 1024\r\n DATE : <creation_date>\r\n CREATOR : <your_name>\r\n BYTES USED : 1'360\r\n PERMISSIONS : 660\r\n SEGMENT COUNT : 1\r\n SEGMENT SIZE : 65'536\r\n BYTES NET : 1'450\r\n BYTES GROSS : 65'536\r\n<br>\r\n\r\n**2. 
For file pubchem-test-featmorgan3.txt, find the two nearest neighbours within the file itself (searched against memory object 0), applying the Dice similarity coefficient:**\r\n\r\n    perl parasim.pl -n 2 -c dice -q data/pubchem-test-featmorgan3.txt -r mem:0\r\n\r\n    QUERY REFERENCE DICE\r\n    71923 68664 0.285714285714286\r\n    71923 71923 1.000000000000000\r\n    68664 68664 1.000000000000000\r\n    68664 71542 0.348623853211009\r\n    68938 71917 0.387096774193548\r\n    68938 68938 1.000000000000000\r\n    71360 71360 1.000000000000000\r\n    71360 68938 0.347107438016529\r\n    71696 71696 1.000000000000000\r\n    71696 71917 0.380000000000000\r\n    71917 71917 1.000000000000000\r\n    71917 68938 0.387096774193548\r\n    71542 68664 0.348623853211009\r\n    71542 71542 1.000000000000000\r\n    71107 71107 1.000000000000000\r\n    71107 68938 0.321428571428571\r\n    71227 71227 1.000000000000000\r\n    71227 71360 0.291970802919708\r\n    71767 71767 1.000000000000000\r\n    71767 71360 0.333333333333333\r\n<br>\r\n\r\n**3. Same query, but directly from the SDF file and including only Dice similarities between 0.35 and 0.99:**\r\n\r\n    perl simsearch.pl -n 2 -c dice -min 0.35 -max 0.99 -q data/pubchem-test.sdf -id CID -fp featmorgan_3 -r mem:0\r\n\r\n    QUERY REFERENCE DICE\r\n    71696 68938 0.365217391304348\r\n    71696 71917 0.380000000000000\r\n    71917 68938 0.387096774193548\r\n    71917 71696 0.380000000000000\r\n    68938 71696 0.365217391304348\r\n    68938 71917 0.387096774193548\r\n<br>\r\n\r\n**4. Search a Smiles string directly against pubchem-test-featmorgan3.txt, which was stored in memory:**\r\n\r\n    perl simsearch.pl -id Name -fp featmorgan_3 -r mem:0 -q 'o1c2c\\(cccc2\\)cc1C\\(=O\\)N3CCNCC3'\r\n\r\n    QUERY REFERENCE TANIMOTO AVG_TANIMOTO\r\n    1 68664 0.542372881355932 0.174922486279414\r\n\r\nIn this case, the ID property Name was generated at runtime.\r\n<br>\r\n**5. Destroy memory object with key 0:**\r\n\r\n    perl fp2mem.pl -destroy 0\r\n\r\n    WARNING: Key 0 is already present! The next action will destroy all existing data! 
Continue (y/n): y\r\n\r\n    Killed memory object with key 0 and all attached data.\r\n<br>\r\n\r\n**6. Generate histogram data for the occurrence of nearest-neighbor similarities between pubchem-test-fcfp6.txt and zinc-test-fcfp6.txt, rounded to two decimal places:**\r\n\r\n    perl parasim.pl -q data/pubchem-test-fcfp6.txt -r data/zinc-test-fcfp6.txt | awk '{printf(\"%.2f\\n\",$3)}' | sort -n | uniq -c\r\n    1 0.00\r\n    1 0.13\r\n    2 0.15\r\n    2 0.16\r\n    2 0.17\r\n    1 0.18\r\n    1 0.19\r\n    1 0.20\r\n<br>\r\n\r\n-----\r\n\r\n## Troubleshooting\r\n\r\n* In general, if something does not work as expected, first rerun with option `-v`. The scripts are quite verbose, and in most cases the output allows a solid guess at what went wrong.\r\n* Issues occur most frequently during generation of persistent memory objects, e.g. due to lack of memory. In that case, fragmented semaphore arrays or memory segments may prevent generation of further memory objects. If this happens, remove all existing memory segments with `fp2mem.pl -destroy`, use `ipcs -a` to list the remaining shared memory segments and semaphore arrays, and then remove the leftover segments and arrays with `ipcrm -m` or `ipcrm -s` together with the respective segment or semaphore IDs.\r\n\r\n-----\r\n\r\n## Version Info\r\n\r\n**V 0.04:**\r\n* This is an important bugfix release. The persistent memory segment size is no longer static but adapted to the memory used, to avoid depletion of segment addresses.\r\n* rdkit2parasim.py, Molecule2ParaSim.xml and simsearch.pl now accept not only filenames as query input parameters but also valid Smiles strings.\r\n\r\n**V 0.03:**\r\n* Allow non-integer structure IDs\r\n* Architecture: Externalize shared procedures (i.e. parsers) into module\r\n* Determine and control fingerprint length from fingerprint itself (no option `-l`)\r\n* Check query vs. 
reference fingerprint types to avoid mismatches\r\n\r\n**V 0.02:**\r\n* Proof of concept \r\n\r\n-----\r\n\r\n## Development Roadmap\r\n* Calculate input bitcounts if not present\r\n* Optionally report progress if output is redirected to file\r\n* Fix reading twice during Perl to C data transfer \r\n* Read fingerprints as blocks\r\n* Avoid manual recompilation for different processor architectures\r\n* Additional similarity indexes\r\n* Try a Windows version using Win32::MMF for shared memory and OpenMP for multithreading\r\n* Different input (FPS) and output formats\r\n\r\n-----\r\n\r\n## *ParaSim* vs. *ChemFP*\r\n\r\nAndrew Dalke from Dalke Scientific develops and provides *ChemFP*, an OpenSource fingerprint toolbox optimized for fast similarity searches, which is currently about two to five times faster than *ParaSim* (see http://code.google.com/p/chem-fingerprints/). However, development of *ParaSim* continues as a separate project with the specific goal of making use of persistent memory objects for frequently repeated large-scale similarity searches. Later stages of *ParaSim* development will presumably attempt to integrate *ChemFP* function calls into *ParaSim*. 
If one day *ChemFP* should make use of persistent memory objects by itself, further development of *ParaSim* may become obsolete.\r\n\r\n\r\n-----\r\n\r\n## Acknowledgements\r\n\r\nAlgorithms in the current version of *ParaSim* are inspired by, and with kind permission contain, concepts for speed-optimized bitcount calculations presented by Andrew Dalke from Dalke Scientific (http://www.dalkescientific.com, see [detailed documentation](http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html)).\r\n\r\nThanks to Thomas Fahle (http://www.thomas-fahle.de) for the introduction to the concept of IPC::ShareLite.\r\n\r\n-----\r\n\r\n## Licence\r\n\r\nIn order to allow usage of *ParaSim* in different collaboration scenarios with academic or industrial partners, the source code of the programme itself and all present and future supporting scripts and material are released under the [GNU General Public Licence v3](http://www.gnu.org/licenses/gpl.html).\r\n","google":"","note":"Don't delete this file! It's used internally to help with page regeneration."}