2. Installation

cherhaus edited this page Aug 30, 2013 · 1 revision

Contents of this Package

In principle, ParaSim can be run on its own if appropriate fingerprint input files are provided. However, to extend ParaSim to persistent memory objects and to enable similarity searches directly from Smiles strings or structure files (SDF or Smiles), several additional tools are provided:

1. fp2mem.pl: Creates and manages persistently stored memory objects containing reference fingerprint data.

2. rdkit2parasim.py: Generates ParaSim input files from Smiles strings or SDF/Smiles files using RDKit's Morgan or feature-based Morgan fingerprints. Requires installation of Python and RDKit.

3. Molecule2Parasim.xml: A Pipeline Pilot™ protocol that generates ParaSim input files from Smiles strings or SDF/Smiles files using ECFP or FCFP fingerprints. Requires installation of Pipeline Pilot™.

4. parasim-conversion-knime-demo.zip: An example workflow for the open-source workflow engine Knime that demonstrates how ParaSim input files can be generated from within Knime.

5. simsearch.pl: Runs similarity searches directly from a Smiles string or structure files (SDF or Smiles) against a reference dataset stored in ParaSim format.

For testing, sample files with 10 records drawn from the freely available PubChem and ZINC databases are provided as well in the data/ subdirectory.

Moreover, several installation-related files are packaged together with ParaSim:

1. parasim-config.txt: This central configuration file may be edited by the user and stores several default values used by the different scripts.

2. prepare_and_call_pipeline_pilot.csh: A sample configuration shell script to prepare the system environment for inclusion of Pipeline Pilot™ fingerprint calculations by simsearch.pl.

3. prepare_and_call_python_rdkit.csh: A sample configuration shell script to prepare the system environment for inclusion of RDKit fingerprint calculations by simsearch.pl.

4. Parasim.pm: A Perl module containing all shared ParaSim functions.


Requirements

Operating System

In the current implementation, ParaSim itself is a single Perl script with a parallelized computational core written in C. The C core may use extensions of the GCC compiler or hardware routines of Intel® processors (Intel® Streaming SIMD Extensions/SSE4). For multithreading, the C core uses POSIX threads via the pthread library, which must be accessible to the compiler. For persistent memory objects, ParaSim uses SysV interprocess communication (IPC). Because of these requirements, ParaSim can currently be run only in a Unix/Linux environment with the GCC compiler installed.

Software

ParaSim was tested successfully with Perl versions 5.10.0 and 5.12.1 under OpenSuse Linux 11.3 on 32-bit dual-core machines and under Suse Linux Enterprise Server 11 SP3 on 64-bit multiprocessor machines with up to 192 cores. Two Perl modules that are not part of the standard distribution are required:

  • The C code is directly integrated into the Perl code and compiled by the Perl module Inline::C which is not part of the standard Perl distribution and therefore must be manually installed.
  • SysV IPC support is supplied by the Perl module IPC::ShareLite, which also requires separate installation.
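
Whether both modules are installed can be checked before the first run. The following is a minimal sketch in Python, assuming only that a perl executable is on the PATH. The helper name perl_module_available is made up for this illustration and is not part of ParaSim. It relies on the common `perl -MModule -e 1` idiom, which exits with a non-zero status if the module cannot be loaded.

```python
import shutil
import subprocess


def perl_module_available(module, perl="perl"):
    """Return True if `perl -M<module> -e 1` succeeds, False otherwise."""
    if shutil.which(perl) is None:
        # No Perl interpreter found, so the module cannot be loaded either.
        return False
    result = subprocess.run(
        [perl, "-M" + module, "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    for name in ("Inline::C", "IPC::ShareLite"):
        status = "found" if perl_module_available(name) else "MISSING"
        print(f"{name}: {status}")
```

A module reported as missing can usually be installed from CPAN with the tooling of your Perl distribution.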

If you want to make use of the tools packaged together with ParaSim, installation of third-party software like Python, RDKit and Pipeline Pilot™ or further software packages for fingerprint calculations may be necessary.

Memory

Because ParaSim loads the reference set into memory, the size of the reference set is limited only by the available memory. Typically, 1 million reference fingerprints of length 1024 consume ~150 MB as a persistent memory object and ~300 MB at runtime.
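
As a rough worked example, these figures can be scaled to other reference set sizes. The sketch below assumes that memory grows linearly with the number of fingerprints and that fingerprints have length 1024 with no additional data stored. Both the function name and the linearity assumption are illustrative and not taken from ParaSim.

```python
# Figures quoted in this section for 1 million fingerprints of length 1024.
MB_PER_MILLION_PERSISTENT = 150
MB_PER_MILLION_RUNTIME = 300


def estimate_memory_mb(n_fingerprints):
    """Return (persistent_mb, runtime_mb), assuming linear scaling."""
    millions = n_fingerprints / 1_000_000
    return (millions * MB_PER_MILLION_PERSISTENT,
            millions * MB_PER_MILLION_RUNTIME)


if __name__ == "__main__":
    persistent, runtime = estimate_memory_mb(20_000_000)
    print(f"persistent: ~{persistent:.0f} MB, runtime: ~{runtime:.0f} MB")
    # prints: persistent: ~3000 MB, runtime: ~6000 MB
```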

Since version 0.05, ParaSim can store additional data such as Smiles strings. Depending on its amount, this additional data directly increases memory consumption.


Installation

ParaSim itself currently consists of a single Perl script that also contains the C code. The C source code is compiled automatically by the Inline::C module when the Perl script is called. Therefore, essentially no installation is required:

  • Make sure that the OS and software requirements described above are met.
  • Extract the archive.
  • If you do not want to invoke perl explicitly each time, make the script executable (chmod 755 parasim.pl). ParaSim expects the perl executable to be located at /usr/bin/perl. If that is not the case on your system, change the default Perl path in the first line of the script's source code.

To test whether ParaSim runs correctly, try:

perl parasim.pl -q data/pubchem-test-fcfp6.txt -r data/zinc-test-fcfp6.txt

The output should be

QUERY   REFERENCE       TANIMOTO        AVG_TANIMOTO
68664   ZINC01914437    0.198019801980198       0.104496307506587
71360   ZINC03775002    0.133333333333333       0.103979492391050
68938   ZINC03774999    0.160377358490566       0.122158970101436
71696   ZINC03774999    0.163636363636364       0.118017086925888
71917   ZINC03774999    0.147368421052632       0.102165139370256
71107   ZINC03774999    0.173076923076923       0.128406853662191
71542   ZINC01914437    0.185185185185185       0.107759423159295
71227   ZINC03774999    0.181818181818182       0.129684949182247
71767   ZINC03775009    0.174418604651163       0.122120643622887
71923   ZINC03774991    0.154761904761905       0.117569042869504
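
For automated checks it can be convenient to read this output into a script. The following Python sketch assumes that the columns are separated by whitespace and that identifiers contain no whitespace, as in the sample above. The function name parse_parasim_output is hypothetical and not part of the ParaSim package.

```python
def parse_parasim_output(text):
    """Parse whitespace-separated ParaSim result lines into dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        record = dict(zip(header, line.split()))
        # The two score columns are numeric; identifiers stay strings.
        record["TANIMOTO"] = float(record["TANIMOTO"])
        record["AVG_TANIMOTO"] = float(record["AVG_TANIMOTO"])
        rows.append(record)
    return rows


# First two result lines of the expected test output shown above.
SAMPLE = """\
QUERY   REFERENCE       TANIMOTO        AVG_TANIMOTO
68664   ZINC01914437    0.198019801980198       0.104496307506587
71360   ZINC03775002    0.133333333333333       0.103979492391050
"""

if __name__ == "__main__":
    for row in parse_parasim_output(SAMPLE):
        print(row["QUERY"], row["REFERENCE"], row["TANIMOTO"])
```

Comparing the parsed values with the expected output above, within a small numerical tolerance, gives a quick regression check after moving ParaSim to a new machine.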

If you want to start similarity searches directly from SDF or Smiles files using simsearch.pl, fingerprints and input files for ParaSim need to be generated at runtime by third-party software. Therefore, third-party software packages such as Python and RDKit or Pipeline Pilot™ need to be installed separately:

  • For rdkit2parasim.py, make sure that, besides the RDKit modules, the modules "sys", "argparse", "gzip" and "base64" are accessible to the Python installation.
  • Paths to executables and scripts need to be defined in the respective section of parasim-config.txt. Replace placeholders such as "my_path" or "my_server" in parasim-config.txt with the path and server information that fits your environment.
  • It is sometimes necessary to prepare environments for scripting languages like Perl or Python or for Pipeline Pilot™. This can be achieved either by calling executables from within a preparatory shell script or by chaining several commands with '&&'. You may use the example scripts prepare_and_call_pipeline_pilot.csh and prepare_and_call_python_rdkit.csh for this purpose and adapt them to your needs.
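
The module check for rdkit2parasim.py can be scripted. The sketch below is an illustration rather than part of the package: it uses the standard library's importlib.util.find_spec to test whether each module can be located, and it assumes that RDKit is importable under its usual top-level package name rdkit. Run it with the same Python interpreter that simsearch.pl will call.

```python
import importlib.util

# Modules named above, plus RDKit's top-level package.
REQUIRED_MODULES = ("sys", "argparse", "gzip", "base64", "rdkit")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be located by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Python modules:", ", ".join(missing))
    else:
        print("All required Python modules are available.")
```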

Technical note: In the current version of ParaSim, Inline::C compiles the C sources only if binaries do not yet exist or if the C sources were modified. Therefore, if you use ParaSim on different machines in a network, ParaSim may fail to run on one architecture because it was previously compiled on a different one. In this case, either re-run the script from a different run directory or make a slight change in the C section of the source code (adding a single space character is sufficient) to trigger a recompilation for the new architecture. This issue will be addressed in a future version of ParaSim.
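
One way to apply the run-directory workaround consistently is to derive the run directory from the machine's architecture. The following Python sketch is only a suggestion: the function name arch_run_dir and the base directory "parasim-runs" are hypothetical, and it assumes, as described in the note, that running from a different directory triggers a fresh compilation.

```python
import os
import platform


def arch_run_dir(base_dir):
    """Return a run directory path that differs per OS and CPU architecture."""
    tag = f"{platform.system()}-{platform.machine()}".lower()
    return os.path.join(base_dir, tag)


if __name__ == "__main__":
    # "parasim-runs" is a placeholder; choose any location you can write to.
    run_dir = arch_run_dir(os.path.join(os.path.expanduser("~"), "parasim-runs"))
    print("Run ParaSim from:", run_dir)
```

Create the printed directory once per architecture and call parasim.pl from there, so that each architecture keeps its own compiled binaries.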