elss is a Monte Carlo code for the modeling of protein sequence data (and more generally multivariate discrete data). The basic usage of the code consists in 1) learning a probabilistic model for the joint distribution of variables from the correlations in real samples and 2) generating artificial discrete data sampled from the model via Markov chain Monte Carlo (MCMC) sampling.
elss was originally developed to show that pairwise models for protein sequences with correlated amino acids can be learned and resampled using MCMC methods (paper).
All elss source code is hosted on GitHub.
A copy of the code can be downloaded from this page.
The git repo can be cloned with:
$ git clone https://github.com/simomarsili/elss.git
In order to compile elss, you will need a Fortran compiler installed on your machine. If you are using Debian or a Debian derivative such as Ubuntu, you can install the gfortran compiler using the following command:
$ sudo apt-get install gfortran
The inference algorithm works by simulating a swarm of persistent Markov chains. To compile elss with support for parallel runs on a distributed-memory architecture, you will need to have a valid MPI implementation installed on your machine. The code has been tested and is known to work with the latest versions of both OpenMPI and MPICH.
OpenMPI (recommended) can be installed on Debian derivatives with:
$ sudo apt-get install openmpi-bin libopenmpi-dev
For details on running MPI jobs with OpenMPI see this link
Alternatively, MPICH can be installed with:
$ sudo apt-get install mpich libmpich-dev
Compiling and linking of the source files is handled by GNU Make. If you are using Debian or a Debian derivative such as Ubuntu, you should find GNU Make 4.1 already installed.
(optional) git version control software for obtaining the source code:
$ sudo apt-get install git
To compile elss, type make in the src directory:
$ cd src; make
This will build the elss executables (elss-learn, elss-sample and elss-eval).
To install the executables, type make install:
$ cd src; make install
This installs to the default installation dir (/usr/local/bin); use DESTDIR to override it.
For example, to install in ~/.local instead of /usr/local:
$ cd src; make install DESTDIR=~/.local
Run the run-test.bash script in the test directory:
$ cd test; bash run-test.bash
The input data should be encoded as space/tab separated integer labels, with variables as columns and samples as rows:
$ head encoded.txt
20 13 8 21 4 17 18 8 19 24
12 0 13 0 6 4 12 4 13 19
19 4 2 7 13 14 11 14 6 24
6 14 21 4 17 13 12 4 13 19
3 4 15 0 17 19 12 4 13 19
2 0 19 4 6 14 17 8 4 18
2 14 13 3 8 19 8 14 13 18
The code assumes that all variables share a common set of classes,
encoded as integer indices starting from zero.
Alternatively, biological sequence data can be directly read from a
multiple sequence alignment file in FASTA format.
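As a minimal sketch of the integer encoding described above (not part of elss), the Python snippet below turns equal-length strings into rows of zero-based integer labels, one sample per line. The alphabet is an assumption: plain A-Z mapped to 0-25, which happens to reproduce the first rows of the encoded.txt excerpt. The alphabet elss itself uses for FASTA input is not documented here, and the names encode_rows and format_rows are hypothetical.

```python
import string

# Assumed alphabet: A-Z mapped to 0-25 (illustrative only, not elss's own mapping).
ALPHABET = string.ascii_uppercase
LABEL = {symbol: index for index, symbol in enumerate(ALPHABET)}


def encode_rows(sequences):
    """Map equal-length strings to rows of zero-based integer labels."""
    rows = [[LABEL[symbol] for symbol in seq.upper()] for seq in sequences]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all samples must have the same number of variables")
    return rows


def format_rows(rows):
    """Write one sample per line, with labels separated by single spaces."""
    return "\n".join(" ".join(str(label) for label in row) for row in rows) + "\n"


if __name__ == "__main__":
    # Prints:
    # 20 13 8 21 4 17 18 8 19 24
    # 12 0 13 0 6 4 12 4 13 19
    print(format_rows(encode_rows(["UNIVERSITY", "MANAGEMENT"])), end="")
```

Every variable shares the same label set here, matching the assumption that all variables have a common set of classes.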
The standard workflow has two steps:
- the fitting of the pairwise model to data (using the elss-learn executable) and
- the sampling of artificially generated data from the fitted model (using elss-sample).
The fitting consists of a first-order iterative minimization of a cost function with two terms: a term proportional to the parameters' likelihood and a regularization term. Example:
$ mpiexec -n 4 elss-learn --fasta 1.fa --niter 2000 -n 10000
- mpiexec -n 4: compute the gradient of the cost function by simulating 4 independent Markov chains
- --fasta 1.fa: read data from file 1.fa in FASTA format
- --niter 2000: set the number of iterations for iterative minimization to 2000
- -n 10000: set the length of each MC chain (per gradient evaluation) to 10000 MC sweeps
The run will produce a binary checkpoint file, chk, that contains all the fitted parameters.
For a full list of options and details, type elss-learn -h.
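To make the idea of a first-order, MC-estimated fit concrete, here is a rough Python sketch of one gradient step for the single-variable biases only. It is not the elss implementation. It assumes the likelihood term is the negative log-likelihood per sample, whose gradient is the difference between model and data frequencies, and that the regularizer is an L2 penalty of strength lam. Neither choice is specified above, and all function names are hypothetical.

```python
def site_frequencies(samples, nc):
    """Return f[i][a]: fraction of samples in which variable i takes class a."""
    nv = len(samples[0])
    freq = [[0.0] * nc for _ in range(nv)]
    weight = 1.0 / len(samples)
    for sample in samples:
        for i, a in enumerate(sample):
            freq[i][a] += weight
    return freq


def bias_gradient(data, chain_samples, h, nc, lam=0.01):
    """Gradient of the assumed cost with respect to the biases h[i][a].

    Likelihood part: model frequencies (estimated from the Markov chains)
    minus data frequencies. Regularization part: lam * h (assumed L2 penalty).
    """
    f_data = site_frequencies(data, nc)
    f_model = site_frequencies(chain_samples, nc)
    return [
        [f_model[i][a] - f_data[i][a] + lam * h[i][a] for a in range(nc)]
        for i in range(len(h))
    ]


def gradient_step(h, grad, rate=0.1):
    """Plain gradient descent update; the actual optimizer may differ."""
    return [
        [h[i][a] - rate * grad[i][a] for a in range(len(h[i]))]
        for i in range(len(h))
    ]
```

The couplings would be updated in the same way from pairwise frequencies. With several chains, as in the mpiexec -n 4 example, the model frequencies would be averaged over the samples of all chains.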
elss-sample reads the parameters contained in the checkpoint file produced by elss-learn and simulates an MC trajectory, sampling from a model of pairwise-interacting variables:
$ elss-sample --chk chk -n 100000 -u 100
- --chk chk: read the fitted model from checkpoint file chk
- -n 100000: run an MC trajectory for 100000 MC sweeps
- -u 100: dump a configuration every 100 MC sweeps to a trj file.
The output of the calculation is a trj file containing 1000 (100000/100) configurations sampled according to the fitted pairwise model. For a full list of options and details, type elss-sample -h.
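For readers unfamiliar with this kind of sampling, the following Python sketch shows a plain Metropolis sampler for a pairwise model, with a sweep count and a dump interval playing the roles of -n and -u. It illustrates the general technique and is not the elss sampler. It assumes the model is P(x) proportional to exp(sum_i h[i][x_i] + sum_{i<j} J[(i,j)][x_i][x_j]), that one sweep means one update attempt per variable on average, and that variables are indexed from zero. The function and argument names are hypothetical.

```python
import math
import random


def sample_pairwise(h, J, n_sweeps, dump_every, seed=0):
    """Metropolis sampling from an assumed pairwise model.

    h[i][a]         : bias of class a for variable i
    J[(i, j)][a][b] : coupling between x_i = a and x_j = b, stored for i < j;
                      missing pairs are treated as zero
    """
    rng = random.Random(seed)
    nv, nc = len(h), len(h[0])
    x = [rng.randrange(nc) for _ in range(nv)]

    def local_weight(i, a):
        # Log-weight terms that depend on variable i taking class a.
        total = h[i][a]
        for j in range(nv):
            if j == i:
                continue
            if i < j and (i, j) in J:
                total += J[(i, j)][a][x[j]]
            elif j < i and (j, i) in J:
                total += J[(j, i)][x[j]][a]
        return total

    trajectory = []
    for sweep in range(1, n_sweeps + 1):
        for _ in range(nv):
            i = rng.randrange(nv)
            proposal = rng.randrange(nc)
            delta = local_weight(i, proposal) - local_weight(i, x[i])
            # Accept uphill moves always, downhill moves with probability exp(delta).
            if delta >= 0 or rng.random() < math.exp(delta):
                x[i] = proposal
        if sweep % dump_every == 0:
            trajectory.append(list(x))
    return trajectory
```

Calling sample_pairwise(h, J, n_sweeps=100000, dump_every=100) would return 1000 configurations, mirroring the 100000/100 count above.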
A checkpoint file is an unformatted binary file containing a set of fitted parameters and all the information needed to restart a previously interrupted optimization. For example, the command:
$ mpiexec -n 4 elss-learn --chk old.chk --fasta 1.fa --niter 2000 -n 10000
will restart the optimization process from the values found in old.chk.
A checkpoint file can be converted to a plain text file using the elss-pchk tool together with the -u option.
The command elss-pchk -u <file> will generate a file named <file>.txt, that is:
$ elss-pchk -u chk
$ head chk.txt
10 # n. of variables
21 # n. of classes
protein # n. output data format
4 # n. samples in checkpoint
# samples start here
12 4 4 1 17 4 4 3 10 9
12 20 3 18 12 4 4 4 18 9
7 14 16 18 17 13 9 3 5 3
6 11 3 17 17 4 3 10 18 4
# parms start here
1 0.36794536927277111 -0.38797251076999351 0.45155637053903591 ....
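A text checkpoint laid out like the excerpt above can be read back with a few lines of Python. The sketch below is not shipped with elss. It assumes the layout shown: four header lines, then the samples, then parameter lines, with # starting a comment. It tells bias lines from coupling lines by their token count (NC + 1 versus NC*NC + 2), which is an assumption about the file rather than documented behavior, and parse_chk_text is a hypothetical name.

```python
def strip_comment(line):
    """Drop everything after '#' and trim surrounding whitespace."""
    return line.split("#", 1)[0].strip()


def parse_chk_text(text):
    """Parse the plain-text checkpoint layout into a dictionary."""
    # Comment-only and empty lines carry no data.
    lines = [s for s in (strip_comment(raw) for raw in text.splitlines()) if s]
    nf, nc = int(lines[0]), int(lines[1])
    fmt = lines[2]
    ns = int(lines[3])
    samples = [[int(tok) for tok in ln.split()] for ln in lines[4:4 + ns]]

    biases, couplings = {}, {}
    for ln in lines[4 + ns:]:
        tok = ln.split()
        if len(tok) == nc + 1:
            # Bias line: p x(1) ... x(NC)
            biases[int(tok[0])] = [float(t) for t in tok[1:]]
        elif len(tok) == nc * nc + 2:
            # Coupling line: p q followed by NC*NC values in row-major order
            vals = [float(t) for t in tok[2:]]
            rows = [vals[r * nc:(r + 1) * nc] for r in range(nc)]
            couplings[(int(tok[0]), int(tok[1]))] = rows
        else:
            raise ValueError("unexpected parameter line: " + ln[:40])
    return {"nf": nf, "nc": nc, "format": fmt, "samples": samples,
            "biases": biases, "couplings": couplings}
```

Variable indices are kept exactly as they appear in the file, so the sketch takes no position on whether they start from 0 or 1.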
Conversely, a custom text file containing user-defined parameters can be used to generate a checkpoint file using the -f option. Given a valid input file of parameters, the command elss-pchk -f <file> will generate an unformatted checkpoint file named <file>.chk.
The lines of a valid input file to elss-pchk -f will contain, in this order:
- the number of features/variables in the system, NF
- the number of classes or possible values that can be taken by a variable, NC
- a keyword selecting the format of output data (int, protein)
- the number of samples contained in the subsequent lines, NS
- NS lines, each containing a sample encoded as a space/tab separated array of integer labels
- an arbitrary number of lines each containing the biases for each class of a given variable, with this format:
  p x(1) x(2) ... x(NC)
  where the {x} are the elements of the NC-long array of biases for variable p.
  The program will set to zero the biases for those variables that are not explicitly defined.
- an arbitrary number of lines each containing the matrix of couplings for a pair of variables, with this format:
  p q x(1,1) x(1,2) ... x(1,NC) ...
  where the {x} are the elements of the NC x NC array of couplings for the variables p and q, iterated sequentially in row-major order.
  NB: q > p
  The program will set to zero the couplings for those pairs that are not explicitly defined.
- all characters following the # symbol are comments and are ignored
- empty lines are ignored
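The format described above can be written from Python with a small helper. This is a hedged sketch, not a tool distributed with elss: format_pchk_input is a hypothetical name, the comment text after # is arbitrary, and variable indices are written exactly as passed in, since the indexing base is not stated above.

```python
def format_pchk_input(nf, nc, fmt, samples, biases=None, couplings=None):
    """Build the text of a parameter file in the layout described above.

    samples   : list of NS rows, each with NF integer labels
    biases    : {p: [x(1), ..., x(NC)]}
    couplings : {(p, q): NC x NC nested list}, with q > p
    """
    lines = [
        f"{nf} # n. of variables",
        f"{nc} # n. of classes",
        f"{fmt} # output data format",
        f"{len(samples)} # n. of samples",
    ]
    for row in samples:
        if len(row) != nf:
            raise ValueError("each sample needs exactly NF labels")
        lines.append(" ".join(str(label) for label in row))
    for p, values in sorted((biases or {}).items()):
        if len(values) != nc:
            raise ValueError("each bias line needs exactly NC values")
        lines.append(" ".join([str(p)] + [repr(float(v)) for v in values]))
    for (p, q), matrix in sorted((couplings or {}).items()):
        if not q > p:
            raise ValueError("couplings must be given with q > p")
        # Flatten the NC x NC matrix in row-major order.
        flat = [v for matrix_row in matrix for v in matrix_row]
        if len(flat) != nc * nc:
            raise ValueError("each coupling line needs exactly NC*NC values")
        lines.append(" ".join([str(p), str(q)] + [repr(float(v)) for v in flat]))
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file named, say, params gives an input for elss-pchk -f params. Biases and couplings left out of the dictionaries are simply omitted, relying on the program to set them to zero as described.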
elss is an open source project, so please help out by reporting bugs or forking and opening pull requests when possible.
Copyright (c) 2016, Simone Marsili
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
If this software has been useful for your work, please cite it using the following BibTeX entries:
@article{sutto2015residue,
title={From residue coevolution to protein conformational ensembles and functional dynamics},
author={Sutto, Ludovico and Marsili, Simone and Valencia, Alfonso and Gervasio, Francesco Luigi},
journal={Proceedings of the National Academy of Sciences},
volume={112},
number={44},
pages={13567--13572},
year={2015},
publisher={National Acad Sciences}
}
@misc{fmpl,
author = {Simone Marsili},
title = {elss 0.5},
month = aug,
year = 2019,
doi = {10.5281/zenodo.3359018},
url = {http://dx.doi.org/10.5281/zenodo.3359018}
}