A SAT-based program to compute a haplotype configuration on pedigrees with recombinations, genotyping errors, and missing genotypes over biallelic and multi-allelic loci.
by Yuri Pirola
Started: September 27, 2010
Current release: 2.1.5 (March 21, 2016)
This program is based on a reduction of the Haplotype Configuration with Recombinations and Errors problem to Boolean Satisfiability, which is then solved by a SAT solver. A haplotype configuration is finally recovered from the satisfying assignment.
The algorithm is described in the following papers:
Yuri Pirola, Gianluca Della Vedova, Stefano Biffani, Alessandra Stella,
and Paola Bonizzoni.
A fast and practical approach to genotype phasing and imputation on a
pedigree with erroneous and incomplete information.
IEEE/ACM Transactions on Computational Biology and Bioinformatics
(2012).
Link
Yuri Pirola, Gianluca Della Vedova, Stefano Biffani, Alessandra Stella,
and Paola Bonizzoni.
A fast and practical approach to genotype phasing and imputation on a
pedigree with erroneous and incomplete information.
In: Proc. of IEEE 2nd International Conference on Computational Advances
in Bio and Medical Sciences, ICCABS 2012.
Link
reHC-* is currently distributed only on source form. It has been developed on Ubuntu Linux machines (10.04 and later) and has been tested on both 32 and 64 bit. The program should work on (or should be easily ported to) on MacOS X but has not been tested and it is not supported on this operating system.
- Python (>= 2.7)
- CMake (>= 2.8)
- GNU make
- Boost FileSystem, System, DateTime, ProgramOptions, IOStreams, and other include-only libraries (tested with 1.42)
- Apache Log4cxx (tested with 0.10.0)
The easiest way for having reHC-* correctly installed on your machine is through PyPI with the command:
$ pip install -v reHCstar
(Please be patient, because it can take some time to build the package.)
If pip
is not available on your system (and you cannot install it
following these instructions,
you can manually download the reHC-* package from
https://github.com/yp/reHCstar/tarball/master, unpack it in a directory
of your choice, and then build it with the command:
$ python setup.py install
Alternatively, you can proceed with the manual download, compilation, and installation as detailed below.
reHC-* is developed on the yp/reHCstar
Git repository hosted by
GitHub.
The repository can be explored using the GitHub web interface at
https://github.com/yp/reHCstar.
The latest stable version of reHC-* can be downloaded in either .tar.gz or in .zip format. Previous stable releases can be downloaded from https://github.com/yp/reHCstar/archives/master.
It is also possible to clone the entire repository using the following command:
$ git clone git://github.com/yp/reHCstar.git
Or, if you have a GitHub account, you can fork the project from the repository web page.
The program can be compiled by issuing the command at the command prompt:
$ make STATUS=Release
The program can be compiled in three different variants:
- with an integrated SAT solver and without the ability to invoke an external SAT solver. This is the default variant, and should be the most efficient on both time and memory.
- with an integrated SAT solver and with the ability to invoke an external SAT solver. This is the most flexible variant, but it is also the variant that uses more time and memory than the others.
- without an integrated SAT solver but with the ability to invoke an external SAT solver. This variant should be preferred when one wants to use an external SAT solver, since it is more memory- and time-efficient than the previous one.
Currently, the SAT solvers that can be directly integrated into reHC-*
are CryptoMiniSat v2.9.1 by Mate
Soos and MiniSat v2.2.0 by Niklas
Een and Niklas Sorensson.
reHC-* can interact (in the 2nd and 3rd variants) with any SAT solver
that follow the standard
requirements
of the last SAT competitions.
The variant can be specified by modifying the file CMakeOptions.txt
in
the root directory.
In particular, two options have to be used to specify the variant:
INTEGRATE_SAT_SOLVER
, which specifies if the SAT solver should be integrated into reHC-*;DISABLE_EXTERNAL_SAT_SOLVERS
, which specifies if the invocation of external SAT solvers is allowed.
The following combinations are allowed:
INTEGRATE_SAT_SOLVER=ON
andDISABLE_EXTERNAL_SAT_SOLVERS=ON
, (default), which corresponds to the first variant;INTEGRATE_SAT_SOLVER=ON
andDISABLE_EXTERNAL_SAT_SOLVERS=OFF
, which corresponds to the second variant;INTEGRATE_SAT_SOLVER=OFF
andDISABLE_EXTERNAL_SAT_SOLVERS=OFF
, which corresponds to the third variant.
The SAT solver that will be integrated (if INTEGRATE_SAT_SOLVER
is
ON
) can be specified by setting USE_CRYPTOMINISAT
or USE_MINISAT
to ON
in the file CMakeOptions.txt
.
By default, reHC-* generates SAT instances in a augmented CNF format
where the "extended clauses" can also be XORs of literals.
The SAT instance is generally smaller if XORs are allowed,
represents better the "internal structure" of the Boolean formula, and
should be used if it is supported by the SAT solver.
The SAT solver embedded by default, CryptoMiniSat, supports this
functionality, while MiniSat does not.
If MiniSat is chosen instead of CryptoMiniSat by editing the file
CMakeOptions.txt
, then XOR-clauses are automatically disabled.
To disable the augmented CNF format, reHC-* must be rebuilt specifying
the preprocessor symbol AVOID_XOR_CLAUSES
in the CXXFLAGS
.
For example, if the program is compiled in a bash shell in Linux, it
suffices the following command to enable the "pure" CNF format:
$ CXXFLAGS="-DAVOID_XOR_CLAUSES" make STATUS=Release
The program takes as input a genotyped pedigree (with missing genotypes) and returns (if possible) a complete haplotype configuration with at most r recombinations and e errors. (The file formats are described below.) Depending on the variant that has been compiled, the program works in four different modes that have to be specified on the command line as program parameter:
--create
(short form-1
), that, given a genotyped pedigree, creates the associated SAT instance. (Available only on variants 2 and 3.)--read
(short form-2
), that, given a genotyped pedigree, reads a satisfying model of the associated SAT instance (if such a model exists) and computes the associated haplotype configuration. (Available only on variants 2 and 3.)--create-read
(short form-3
), that, given a genotyped pedigree, creates the associated SAT instance, invokes the external SAT solver, reads a satisfying model of the SAT instance (if such a model exists), and computes the associated haplotype configuration. This mode essentially combines the previous two modes by automatically invoking the external SAT solver. (Available only on variants 2 and 3.)--solve-internal
(short form-4
), that, given a genotyped pedigree, creates the associated SAT instance, uses the integrated SAT solver for solving the instance, and, if the SAT instance is satisfiable, computes the associated haplotype configuration. (Available only on variant 1.)
The following options are used to specify the input/output files:
-
--pedigree
(short form-p
), that specifies the file containing the genotyped pedigree (input file); -
--sat
(short form-s
), that specifies the file containing the SAT instance associated with the genotyped pedigree (output file); -
--result
(short form-r
), that specifies the file containing the results computed by the external SAT solver for the SAT instance associated with the genotyped pedigree (input file); -
--haplotypes
(short form-h
), that specifies the file that will contain the haplotype configuration of the genotyped pedigree computed by reHC-* (output file); -
--assumptions
(short form-a
), that specifies an optional file that contains additional assumptions that must be satisfied by the resulting haplotype configuration. Assumptions are specified one for each row with the following syntax:<variable kind> <individual id> <locus> <value>
Where
<variable kind>
is one ofsp
(paternal source),sm
(maternal source),p
(paternal allele),m
(maternal allele),rp
(paternal recombination),rm
(maternal recombination), ande
(genotyping error),<individual id>
is the numerical identifier of the individual (1-based),<locus>
is the genotype locus, and<value>
is the boolean value (0/1) that the variable must have. Please note that biallelic and multi-allelic loci are treated differently, thus they have different set of variables.
For the --create-read
mode, the command-line that has to be used to
invoke the external SAT must be specified by using the --sat-cmdline
(short form -c
) program option.
The strings %%INPUT%%
and %%OUTPUT%%
are placeholders for,
respectively, the input and the output files of the SAT solver.
If the SAT solver can read the SAT instance from its standard input,
then it is possible to write the SAT instance to the solver's standard
input by specifying the option --pipe
.
In this case, the placeholder %%INPUT%%
will not be used.
By default, reHC-* search for a haplotype configuration with zero
recombinations and zero errors.
To enable recombinations in the haplotyping process, the program options
--global-recomb
and either --global-recomb-rate=XX
or
--global-recomb-number=YY
must be specified.
Here XX
is a number between 0.0
and 1.0
that represents the
maximum number of recombinations r as a fraction of the total number
of possible recombination loci, while YY
is (directly) the maximum
number of recombinations r.
Moreover, if option --global-recomb
is enabled and
--global-recomb-number
is used, it is also possible to search for a
haplotype configuration with a given minimum number of recombinations by
specifying the option --global-recomb-min-number=ZZ
, where ZZ
is the
sought lower bound.
This option should only be used to specify a lower bound that has been
already proved since the resulting haplotype configuration could induce
unnecessary recombination in order to satisfy the given lower bound.
Similarly, to enable genotyping errors in the computed haplotype
configuration, the program options --global-error
and
either --global-error-rate=XX
or --global-error-number=YY
must be
specified.
As before, XX
is a number between 0.0
and 1.0
that represents the
maximum number of errors e as a fraction of the number of non-missing
genotypes, while YY
is (directly) the maximum number of errors e.
Other program options allow a finer control over the distribution of
recombinations and errors.
Please refer to the help of the program (that can be obtained by
specifying the --help
program option) for their presentation and
explanation.
reHC-* can also read and write files compressed by GZip. The GZip compression allows to save some space and, especially for large instances and when an external SAT solver is used, it could reduce the running time, since it greatly reduces to time spent for I/O operations. It is disabled by default since not all the SAT solvers support it. Three options regulates the GZip compression:
--compress-input
, which enables the GZip compression of some files that are read by reHC-* (currently only the--pedigree
file);--compress-output
, which enables the GZip compression of some files that are written by reHC-* (currently the--sat
and--haplotypes
files);--compress
(short form-z
), which is equivalent to specify both--compress-input
and--compress-output
;--compress-sat
, which enables the GZip compression only for the file that contains the computed SAT instance.
Temporary files of the --create-read
mode are automatically removed by
default.
To keep them (for example, for manual inspection), the program option
--keep
(short form -k
) has to be specified.
A summary of the available program options can be printed by invoking
reHC-* with the --help
(short form -?
) option.
For example, if the genotyped pedigree is described in file
genotyped-pedigree.txt
, the following commands perform the complete
haplotype inference process (saving the resulting haplotype
configuration in file haplotype-configuration.txt
).
Using the integrated SAT solver (variant 1 or 2):
$ ./bin/reHCstar -4 \
-p genotyped-pedigree.txt \
-h haplotype-configuration.txt
Using an external SAT solver (variant 2 or 3) with manual invocation of the SAT solver:
$ ./bin/reHCstar -1 \
-p genotyped-pedigree.txt \
-s instance.cnf
# ...execution of the external SAT solver, assuming that
# it writes the results in file sat-result.txt
$ ./bin/reHCstar -2 \
-p genotyped-pedigree.txt \
-r sat-result.txt \
-h haplotype-configuration.txt
Using an external SAT solver (variant 2 or 3) with automatic invocation of the SAT solver:
$ ./bin/reHCstar -3 \
-p genotyped-pedigree.txt \
-h haplotype-configuration.txt \
-c "./external-sat-solver %%INPUT%% %%OUTPUT%%"
Or, if the SAT solver reads the SAT instance from its standard input:
$ ./bin/reHCstar -3 \
-p genotyped-pedigree.txt \
-h haplotype-configuration.txt \
--pipe \
-c "./external-sat-solver %%OUTPUT%%"
reHC-* also includes a program that uses the basic reHCstar
executable
in order to achieve two different aims:
- finding (by a bisect-like search) the haplotype configuration that induces the minimum number of recombinations;
- splitting long input genotypes into smaller overlapping blocks on which a partial haplotype configuration is computed independently and then used to reconstruct the complete haplotype configuration.
Please notice that the optimality of the solution (in term of number of recombinations) is guaranteed if the genotypes are not split into smaller blocks.
These functionalities are provided by the program reHCstar-mgr
written in Python version 3 and later.
reHCstar-mgr
requires two parameters, -p
and -r
, that specify,
respectively, the file containing the input genotyped pedigree and the
file on which the computed haplotype configuration will be saved.
By default, reHCstar-mgr
invokes the reHCstar
executable in the
current directory using the internal SAT solver mode (option
--solve-internal
described above).
To change the default, the complete command line must be provided as
argument of the program option --cmd
and must contain the following
three placeholders {pedigree}
, {haplotypes}
, and {assumptions}
that will be replaced, respectively, with the input pedigree file, the
output haplotype configuration file, and the input additional assumption
file.
For example, the default value of the --cmd
option (i.e. the default
command line) is:
./reHCstar -4 -p "{pedigree}" -h "{haplotypes}" -a "{assumptions}"
The command line used to invoke the reHCstar
executable is
composed by concatenating the argument of the previous option with the
arguments of two other options: --cmd-rec
and --cmd-time
.
The first one, --cmd-rec
, specifies the options (of reHCstar
) that
regulates the maximum (and, possibly, minimum) number of recombinations.
In particular, the argument must include the placeholder {number}
which will be replaced before invocation with the actual maximum number
of recombinations.
Moreover, the argument may include the placeholder {min_number}
which
will be replaced before invocation with the largest lower bound on the
number of recombinations computed so far.
For example, the default value of the --cmd-rec
option is:
--global-recomb --global-recomb-number "{number}" --global-recomb-min-number "{min_number}"
The last option that regulates the final command line of reHCstar
is
--cmd-time
and, if specified, must include the placeholder {time}
which will be replaced before invocation with the maximum CPU time of
the reHCstar
execution (in seconds).
An empty argument disables the running time limit control (albeit it
could be enforced anyway via OS services).
For example, the default value of the --cmd-time
option is:
--time-limit {time}
The following sections present the other main features of
reHCstar-mgr
while the full list of its options is available in the
integrated help (option -h
).
The subdivision of the input genotypes in (smaller) overlapping blocks
is regulated by the following two options: --block-length
(short form
-l
, default 50
) and --lookahead-length
(short form -a
,
default 0
).
The first option specifies the non-overlapping (maximum) length of each
block which the genotypes are divided into, while the second option
specifies the number of loci (in addition to a single fixed locus) which
two consecutive blocks overlap on.
In other words, a single block can be considered as composed by three
parts: the first part spans block-length
loci, the second is composed
by a single locus, and the third (optional) part spans
lookahead-length
loci.
(Hence, the total length of a block is block-length
+ 1
+
lookahead-length
.)
The second part of a block always overlaps with the first locus of the
first part of the next genotype block.
Moreover the haplotype configuration computed on this locus during the
solution of the "current" block is used as assumptions during the
solution of the next block (thus coincide).
The third part of a block, the "look-ahead" part, if it is present
overlaps with the next block starting from its second locus.
This part is used to compute a haplotype configuration of the "current"
block, but the solution is then discarded when the next block is
considered (thus it may not coincide).
Its purpose is to provide a hint of the structure of the next block and
it should be particularly useful when the proportion of missing
genotypes is relevant, since when the overlapping locus has many missing
genotypes, the solution of the current block could impute the genotypes
in a way that is locally optimal, but globally sub-optimal.
Please notice that reHCstar-mgr
finds a solution that requires the
minimum number of recombinations only if the genotypes are not divided
into blocks.
Initial lower and upper bounds on the number of recombinations may be
specified with the options --initial-recomb-lb=XX
and
--initial-recomb-ub=YY
, respectively.
The options' arguments, XX
and YY
, are non-negative numbers such
that a haplotype configuration with XX
recombinations does not exist
and a haplotype configuration with YY
recombinations certainly exists.
The default value of both of them is -1
which means that no bound is
known/provided.
Moreover it is possible to specify a file containing an initial
haplotype configuration that reHCstar-mgr
tries to improve (in
terms of number of recombinations).
In this case, the initial haplotype configuration is read and the number
of recombinations that it induces is used as initial upper bound.
If not better solution is found (for example, due to time limits), then
reHCstar-mgr
outputs the initial haplotype configuration.
The file containing the initial haplotype configuration is specified as
argument of the --initial-haplotype-configuration
program option.
Please notice that options --initial-haplotype-configuration
and
--initial-recomb-ub
cannot be used together.
These options could help to speed-up the process of searching the
solution with the minimum number of recombinations since they provide
the initial interval which the bisect-like search is performed on.
If an initial upper bound is known but an initial lower bound is not, it
is possible to enable a bootstrap phase that attempts to quickly
identify an initial lower bound and then the execution continues by
bisecting the interval so determined.
The bootstrap phase can be activated by specifying the --bootstrap
switch, while the maximum CPU time spent in the bootstrap
phase can be specified with the --bootstrap-time-limit=XX
parameter,
where XX
is the time limit expressed in seconds.
reHCstar-mgr
provides basic tools for limiting its total running
time (CPU time).
In particular, option --time-limit=SS
specifies the maximum running
time of the program (SS
seconds).
For the proper functioning of this feature, the option --cmd-time
must
be valid.
If the program execution exceeds the given time limit, then
reHCstar-mgr
tries to save the solution computed so far in a file
whose name is the name specified by the option --results
concatenated
with the (fixed) extension .part
.
The saved solution could be partial (if the original instance has been
partitioned in blocks) and/or suboptimal (if the minimum number of
recombinations has not been computed within the time limit).
The status of the solution is saved as a comment line in the same file
of the solution.
We suggest to enable the verbose mode (with -v
or -vv
) for getting
additional information.
Genotyped pedigrees are described by a single file with the standard PED
format used in
plink
.
In particular, each line of the pedigree file fully describes a single
individual and it is composed by at least six whitespace-separated
fields.
The first (mandatory) six fields are:
Family ID
(numeric only)Individual ID
(numeric only, greater than0
)Paternal ID
(the ID of the father,0
if unknown/not present)Maternal ID
(the ID of the mother,0
if unknown/not present)Sex
(1
=male,2
=female)Phenotype
(ignored, could be any string not containing a whitespace)
Remark:
reHC-* currently works only on single-family pedigrees, thus the
Family ID
s must be the same for all the individuals.
The remaining fields (field 7 onwards) represent the genotype of the
individual, where each field represents a single allele of a single SNP
biallelic locus.
Both the alleles of each locus must be specified (they can be
missing alleles), thus the total number of fields of each row must be
even.
Major and minor alleles are encoded by the characters 1
and 2
.
Missing genotypes are encoded by the pair 0 0
(i.e. by two fields
containing the missing allele 0
).
The pairs composed by a valid allele (1
or 2
) and a missing allele
(0
) are not valid.
Since reHC-* 2.0.0, there could also be multi-allelic loci. Alleles
are encoded by a number greater than 0
(which is always considered the
missing allele code).
Rows starting with the character #
are considered as comments and
ignored.
Remark:
The order of the two alleles on each locus is meaningless (i.e., the
pair 2 1
is considered the same as the pair 1 2
).
A simple single-family pedigree composed by 5 individuals genotyped over 5 biallelic loci is as follows.
0 1 0 0 1 phenotype 1 1 2 2 2 2 2 2 1 1
0 2 0 0 2 phenotype 2 2 1 1 1 1 1 1 1 1
0 3 1 2 2 phenotype 1 2 0 0 1 2 1 2 1 1
0 4 0 0 1 phenotype 1 2 1 2 1 1 1 1 0 0
0 5 4 3 1 phenotype 1 2 1 2 0 0 1 1 1 2
The haplotype configuration computed by reHC-* is represented in a
PED-like format.
In particular, the first six fields are equal to the PED format.
The remaining fields represent the computed haplotype pair of the
individual, where each field represents the two alleles on a
single locus (separated by the character |
).
In this case, the order of the two alleles is important and represents
the phase of each locus.
The first allele in each pair is the paternal allele, while the second
one is the maternal allele.
For the example, a zero-recombinant haplotype configuration for the previous genotyped pedigree is as follows.
0 1 0 0 1 phenotype 1|1 2|2 2|2 2|2 1|1
0 2 0 0 2 phenotype 2|2 1|1 1|1 1|1 1|1
0 3 1 2 2 phenotype 1|2 2|1 2|1 2|1 1|1
0 4 0 0 1 phenotype 2|1 1|2 1|1 1|1 2|2
0 5 4 3 1 phenotype 1|2 2|1 1|1 1|1 2|1
where the two (multi-locus) haplotypes of individual 5
are 12112
(paternal haplotype) and 21111
(maternal haplotype).
reHC-* is released under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
reHC-* is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Please refer to file COPYING
or to the
GNU website for a copy of the GNU
General Public License.
The template of reHC-* is based on the cpp-project-template by Michael Aaron Safyan.
reHC-* incorporates the following SAT solvers:
- CryptoMiniSat version 2.9.1 (commit e819ab3236e, date 26/May/2011) by Mate Soos, which is distributed under the GNU General Public License version 3;
- MiniSat version 2.2.0 by Niklas Een and Niklas Sorensson, which is distributed under the MIT license.
For extracting source version information from git repository tags, reHC-* uses autorevision by dak180 and others, which is distributed under the MIT license.
We would like to thank Gianluca Della Vedova for useful discussions.
Please contact Yuri Pirola for additional information.
E-mail: yuri.pirola@gmail.com
Web page: http://algolab.eu/pirola