2. Installation
In principle, ParaSim can be run on its own if appropriate fingerprint input files are provided. However, to extend the use of ParaSim to persistent memory objects and to support similarity searches directly from Smiles strings or structure files (SDF or Smiles), several additional tools are provided:
1. fp2mem.pl: Creates and manages persistently stored memory objects with reference fingerprint data.
2. rdkit2parasim.py: Generates ParaSim input files from Smiles strings or SDF/Smiles files using RDKit's Morgan or feature-based Morgan fingerprints. Requires Python and RDKit.
3. Molecule2Parasim.xml: A Pipeline Pilot™ protocol that generates ParaSim input files from Smiles strings or SDF/Smiles files using ECFP or FCFP fingerprints. Requires Pipeline Pilot™.
4. parasim-conversion-knime-demo.zip: An example workflow demonstrating how ParaSim input files can be generated from within the open-source workflow engine Knime.
5. simsearch.pl: Allows similarity searches directly from a Smiles string or structure files (SDF or Smiles) against a reference dataset stored in ParaSim format.
For testing, sample files with 10 records drawn from the freely available PubChem and ZINC databases are also provided in the data/ subdirectory.
Moreover, several installation-related files are packaged together with ParaSim:
1. parasim-config.txt: This central configuration file may be edited by the user and stores several default values used by the different scripts.
2. prepare_and_call_pipeline_pilot.csh: A sample configuration shell script to prepare the system environment for inclusion of Pipeline Pilot™ fingerprint calculations by simsearch.pl.
3. prepare_and_call_python_rdkit.csh: A sample configuration shell script to prepare the system environment for inclusion of RDKit fingerprint calculations by simsearch.pl.
4. Parasim.pm: A Perl module containing all shared ParaSim functions.
Operating System
In the current implementation, ParaSim itself is a single Perl script with a parallelized computational core written in C. The C core may use extensions of the GCC compiler or hardware routines of Intel® processors (Intel® Streaming SIMD Extensions/SSE4). For multithreading, the C core uses POSIX threads via the pthread library, which must be accessible to the compiler. For persistent memory objects, ParaSim uses SysV inter-process communication (IPC). Because of these requirements, ParaSim can currently be run only in a Unix/Linux environment with the GCC compiler installed.
Software
ParaSim was tested successfully with Perl versions 5.10.0 and 5.12.1 under OpenSuse Linux 11.3 on 32-bit dual-core machines and under Suse Linux Enterprise Server 11 SP3 on 64-bit multiprocessor machines with up to 192 cores. Some Perl modules that are not part of the standard distribution are required:
- The C code is embedded directly in the Perl code and compiled by the Perl module Inline::C, which is not part of the standard Perl distribution and must therefore be installed manually.
- SysV IPC support is provided by the Perl module IPC::ShareLite, which also requires separate installation.
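A quick way to confirm that both modules are installed is to ask Perl to load them. The following Python sketch wraps the common `perl -MModule -e 1` idiom, which exits with a non-zero status when a module cannot be loaded. The helper names `build_check_command` and `perl_module_available` are illustrative and not part of ParaSim.

```python
import shutil
import subprocess

# Modules named in the requirements above (CPAN spelling).
REQUIRED_PERL_MODULES = ["Inline::C", "IPC::ShareLite"]


def build_check_command(module, perl="perl"):
    """Return the command that asks Perl to load `module` and do nothing else."""
    return [perl, "-M" + module, "-e", "1"]


def perl_module_available(module, perl="perl"):
    """Return True if Perl exits successfully after loading `module`.

    Returns False if the Perl executable is not found or the module fails to load.
    """
    if shutil.which(perl) is None:
        return False
    result = subprocess.run(
        build_check_command(module, perl),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    for name in REQUIRED_PERL_MODULES:
        status = "found" if perl_module_available(name) else "MISSING"
        print(f"{name}: {status}")
```

If a module is reported as missing, it can be installed with the CPAN client or the package manager of your distribution.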
If you want to make use of the tools packaged together with ParaSim, installation of third-party software like Python, RDKit and Pipeline Pilot™ or further software packages for fingerprint calculations may be necessary.
Memory
Because ParaSim loads the reference set into memory, the size of the reference set is limited only by the available memory. Typically, memory consumption per 1 million reference fingerprints of length 1024 is ~150 MB as a persistent memory object and ~300 MB at runtime.
Since version 0.05, ParaSim can store additional data such as Smiles strings. Depending on the amount of additional data, this directly affects memory consumption.
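To get a feel for these figures, the sketch below scales them linearly. It assumes that memory grows in proportion to both the number of fingerprints and the fingerprint length, which the text does not state explicitly, and it ignores additional stored data such as Smiles strings. The function name and constants are illustrative.

```python
# Approximate figures quoted above for 1 million fingerprints of length 1024.
PERSISTENT_MB_PER_MILLION = 150.0
RUNTIME_MB_PER_MILLION = 300.0
REFERENCE_LENGTH_BITS = 1024


def estimate_memory_mb(n_fingerprints, length_bits=REFERENCE_LENGTH_BITS):
    """Return (persistent_mb, runtime_mb) under a linear-scaling assumption."""
    scale = (n_fingerprints / 1_000_000) * (length_bits / REFERENCE_LENGTH_BITS)
    return PERSISTENT_MB_PER_MILLION * scale, RUNTIME_MB_PER_MILLION * scale


if __name__ == "__main__":
    persistent, runtime = estimate_memory_mb(20_000_000)
    print(f"20 million fingerprints: ~{persistent:.0f} MB persistent, "
          f"~{runtime:.0f} MB at runtime")
```

Under this assumption, 20 million fingerprints of length 1024 would need roughly 3 GB as a persistent memory object and roughly 6 GB at runtime.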
ParaSim itself currently consists of a single Perl script that also includes the C code. The C source code is compiled automatically by the Inline::C module when the Perl script is called. Therefore, essentially no installation is required:
- Make sure that the OS and software requirements described above are met.
- Extract the archive.
- If you do not want to prepend the Perl call each time, make the script executable (chmod 755 parasim.pl). ParaSim expects the perl executable to be located at /usr/bin/perl. If that is not the case on your system, change the Perl path in the first line of the script's source code.
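Whether the first line of the script matches your Perl installation can be checked before making the script executable. The sketch below reads the interpreter path from a script's first line and tests whether that file exists and is executable. The helper names `read_shebang` and `shebang_is_usable` are hypothetical and not part of ParaSim.

```python
import os
import shutil


def read_shebang(script_path):
    """Return the interpreter path from a '#!' first line, or None if absent."""
    with open(script_path, "r", encoding="utf-8", errors="replace") as handle:
        first_line = handle.readline().strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    return parts[0] if parts else None


def shebang_is_usable(script_path):
    """Return True if the '#!' line names an existing executable file."""
    interpreter = read_shebang(script_path)
    return (
        interpreter is not None
        and os.path.isfile(interpreter)
        and os.access(interpreter, os.X_OK)
    )


if __name__ == "__main__":
    script = "parasim.pl"
    if not os.path.exists(script):
        print(f"{script} not found in the current directory")
    elif shebang_is_usable(script):
        print("Interpreter in the first line is available")
    else:
        # Show where perl actually lives so the first line can be adjusted.
        print("Interpreter in the first line not found; perl on PATH:",
              shutil.which("perl"))
```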
To test whether ParaSim runs correctly, try
perl parasim.pl -q data/pubchem-test-fcfp6.txt -r data/zinc-test-fcfp6.txt
The output should be
QUERY REFERENCE TANIMOTO AVG_TANIMOTO
68664 ZINC01914437 0.198019801980198 0.104496307506587
71360 ZINC03775002 0.133333333333333 0.103979492391050
68938 ZINC03774999 0.160377358490566 0.122158970101436
71696 ZINC03774999 0.163636363636364 0.118017086925888
71917 ZINC03774999 0.147368421052632 0.102165139370256
71107 ZINC03774999 0.173076923076923 0.128406853662191
71542 ZINC01914437 0.185185185185185 0.107759423159295
71227 ZINC03774999 0.181818181818182 0.129684949182247
71767 ZINC03775009 0.174418604651163 0.122120643622887
71923 ZINC03774991 0.154761904761905 0.117569042869504
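For orientation, the TANIMOTO column presumably holds the Tanimoto coefficient between the query and the listed reference fingerprint. The sketch below shows the standard definition on sets of "on" bit positions: the number of shared bits divided by the number of bits set in either fingerprint. It is not ParaSim's C implementation, and the bit positions are made up for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


if __name__ == "__main__":
    query = {1, 5, 9, 12}    # hypothetical on-bit positions
    reference = {5, 9, 20}   # hypothetical on-bit positions
    print(tanimoto(query, reference))  # 2 shared bits / 5 distinct bits = 0.4
```

Values range from 0 (no shared bits) to 1 (identical fingerprints), so the low values in the sample output indicate that the test queries and references share few bits.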
If you want to start similarity searches directly from SDF or Smiles files using simsearch.pl, fingerprints and input files for ParaSim need to be generated at runtime using third-party software. Therefore, third-party software packages such as Python and RDKit or Pipeline Pilot™ need to be installed separately:
- For rdkit2parasim.py, make sure that, besides the RDKit modules, the modules "sys", "argparse", "gzip" and "base64" are also accessible to the Python installation.
- Paths to executables and scripts need to be defined in the respective section of parasim-config.txt. To do so, replace placeholders like "my_path" or "my_server" in parasim-config.txt with the path and server information for your environment.
- It is sometimes necessary to prepare environments for scripting languages like Perl or Python or for Pipeline Pilot™. This can be achieved either by calling the executables from within a preparatory shell script or by chaining several commands with '&&'. You may use the example scripts prepare_and_call_pipeline_pilot.csh and prepare_and_call_python_rdkit.csh for this purpose and adapt them to your needs.
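The first two checks in the list above can be automated. The following sketch reports Python modules that cannot be located and lines in a configuration file that still contain the placeholders "my_path" or "my_server". The function names are illustrative, and nothing is assumed about the layout of parasim-config.txt beyond it being a text file.

```python
import importlib.util
import os

REQUIRED_MODULES = ["sys", "argparse", "gzip", "base64", "rdkit"]
PLACEHOLDERS = ("my_path", "my_server")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that the current Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def unresolved_placeholders(config_path, placeholders=PLACEHOLDERS):
    """Return (line_number, line) pairs that still contain a placeholder."""
    hits = []
    with open(config_path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if any(token in line for token in placeholders):
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules() or "none")
    config = "parasim-config.txt"
    if os.path.exists(config):
        for number, line in unresolved_placeholders(config):
            print(f"{config}:{number}: placeholder still present: {line}")
```

Run it with the same Python interpreter that will later execute rdkit2parasim.py, since module availability can differ between installations.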
Technical note: In the current version of ParaSim, Inline::C compiles the C sources only if binaries do not yet exist or if the C sources were modified. Therefore, if you use ParaSim on different machines in a network, it may happen that you cannot run ParaSim on one architecture because it was compiled on a different architecture before. In this case, make sure that you either re-run the script from a different run directory or that you apply a slight change in the C section of the source code (a single space character is already sufficient) to trigger a recompilation for the new architecture. This issue will be addressed in a future version of ParaSim.
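Why a single space is enough can be illustrated with a digest-based cache. Inline's documentation describes fingerprinting the embedded source with MD5 to decide whether a rebuild is needed. The sketch below mimics that idea in Python and is not Inline::C's actual code; the C snippet and the function name are made up for illustration.

```python
import hashlib


def source_fingerprint(c_source):
    """Return a hex digest that changes whenever the source text changes."""
    return hashlib.md5(c_source.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    original = "int add(int a, int b) { return a + b; }\n"
    touched = original + " "  # one extra space character
    print(source_fingerprint(original) == source_fingerprint(touched))  # False
```

Because such a digest depends only on the source text and not on the machine, an unchanged script would find the existing binary even when it was built on a different architecture, which is consistent with the behaviour described in the note.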