Skip to content

Commit

Permalink
set temporal constraints and variance by default
Browse files Browse the repository at this point in the history
  • Loading branch information
Thu-Hien To committed May 13, 2020
1 parent c1c35aa commit fb9317a
Show file tree
Hide file tree
Showing 6 changed files with 51 additions and 41 deletions.
61 changes: 35 additions & 26 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,22 @@
# LSD2: LEAST-SQUARES METHODS TO ESTIMATE RATES AND DATES FROM PHYLOGENIES

__For people who prefer R, an R-wrapper of lsd2 is developping here: https://github.com/tothuhien/Rlsd2 .__
##News

- lsd2 is now integrated in IQ-tree

- For people who prefer R, an R-wrapper of lsd2 is developping here: https://github.com/tothuhien/Rlsd2 .

- Temporal constraint is now imposed by default without option -c

- Variance is also used by default without applying -v 1

## Compile/install LSD2:

Type __make__ from the folder __src__, you will have the executable file __lsd2__ in the same place.
Note that C++ compiler and library support for the ISO C++ 2011 is required to compile the program from sources.


Mac/Linux users can install lsd2 via Homebrew as follows (the Homebrew version is not yet updated with the current one on github):

`brew install brewsci/bio/lsd2`
Mac/Linux users can install lsd2 via Homebrew as follows (the Homebrew version is not yet updated with the current one on github): `brew install brewsci/bio/lsd2`

## Run LSD2:

Expand All @@ -27,9 +33,8 @@ In order to avoid undetermined problem, sufficient dates should be given.
A tree where all tips having the same date with no further date information on internal nodes will not be able to infer absolute dates.
In this case, you can still estimate relative dates using options -a and -z to specify the root date and tip date.

Option __-c is recommended__ to take into account the temporal constraints (date of a node >= date of its ancestors).
It should be noticed that LSD2 always assumes an increasing-time order from root to tips, i.e the date of a node is smaller than that of its children. If your data has the reverse order, the simplest way is to take the negation of the
input date, and take the negation again of the output date to obtain your expected results.
By default, lsd2 imppose the __temporal constraints__ (date of a node >= date of its ancestors) on every node.
It should be noticed that LSD2 always assumes an increasing-time order from root to tips, i.e the date of a node is smaller than that of its children. If your data has the reverse order, the simplest way is to take the negation of the input date, and take the negation again of the output date to obtain your expected results.

The program first __collapses__ all internal branches that are considered uninformative (<= 0.5/seqlength by default, and low support value if specified) and impose a constraint of __minimum branches lengths__ for the time scaled tree.
These values could be manually specified via option -l (for uninformative branch length threshold), -S (support value threshold), and options -u, -U (for minimum internal/external branches lengths of time scaled tree).
Expand Down Expand Up @@ -72,7 +77,7 @@ then an input date file can be as follows:
n2 u(2003-02-12)
root b(1998-10-11,1999-11-12)

If the date format is detected as year-month-day and there're some imprecise date (missing month, or missing day) then lsd2 automatically turns it into the corresponding interval.
If the date format is detected as year-month-day and there're some imprecise dates (missing month, or missing day) then lsd2 automatically turns it into the corresponding interval.

### Given rate file

Expand All @@ -90,7 +95,7 @@ to each tree in the Input_tree_file, for example:
outgroup1
outgroup2

If there are more than 1 outgroups, than they must be monophyletic in the input trees.
If there are more than 1 outgroups, than they must be monophyletic in the input trees. By default, the program uses outgroup to root the tree and removes them before dating process. To keep to root in the final tree, use option -k in addition.

### Partition file

Expand All @@ -106,72 +111,76 @@ then an example for partition file can be as follows:
group1 1 {n1} {n5 n4}
group2 1 {n3}

Each line defines a list of subtrees whose branches are supposed to have the same substitution rate. It starts by the name of the group (`group1`), then the prior proportion (`1`) of the group rate compared to the main rate. This is just a starting value, and the proportion will be estimated; giving an appropriate value helps to converge faster.
Each line defines a group rate, which contains a list of subtrees whose branches are supposed to have the same substitution rate. It starts by the name of the group (`group1`), then the prior proportion (`1`) of the group rate compared to the main rate. This is just a starting value, and the proportion will be estimated at the end; giving an appropriate value helps to converge faster.
Each subtree is then defined between {}: the first node is the root of the subtree and the following nodes (if there any) define its tips. If the first node is a tip label then it takes the mrca of all tips as the root of the subtree.
If there's only root and not any tip defined, then the subtree is extended down to the tips of the full tree. Hence, {n1} defines the subtree rooted at the node n1; and {n5 n4} defines the subtree rooted at n5 that has one tip as n4 and other tips as the ones of the full trees (here are B,C).
As a consequence, in this example, the branches will be partitioned into 3 groups such that each group has a different rate:

- group1: (n1,A), (n1,D), (n5,n4), (n5,n2), (n2,B), (n2,C);
- group2: (n3,F), (n3,G);
- group0: the remaining branches of the tree.
- group0: the remaining branches of the tree (main rate).

Note that if the internal nodes don't have labels, then they can be defined by mrca of at least two tips, for example n1 is mrca(A,D)

## Using variances

Variance is used to penalize long branch lengths. The variance formula of each branch v_i is proprtion to (b_i + b), where b (specified by option __-b__) is the pseudo constant added to adjust the dependency of variances to branch lengths. This parameter is a positive number, and by defaul is maximum of median branch length and 10/seqlength. It could be adjusted based on how much your input tree is relaxed. The smaller it is, the more variances are linear to branch lengths, which is more appropriate for strict clock tree. The bigger it is the less dependent of branch lengths on variances, which may be better for relaxed tree. Set __-v 1__ to use variances, and __-v 2__ to run program twice where the second time calculates variances based of the estimated branch length of the first time. Simulation shows that -v 2 give slightly better result than -v 1 in average.
Variance is used to penalize long branch lengths. The variance formula of each branch v_i is proprtion to (b_i + b), where b (specified by option __-b__) is the pseudo constant added to adjust the dependency of variances to branch lengths. This parameter is a positive number, and by defaul is maximum of median branch length and 10/seqlength. It could be adjusted based on how much your input tree is relaxed. The smaller it is, the more variances are linear to branch lengths, which is more appropriate for strict clock tree. The bigger it is the less dependent of branch lengths on variances, which may be better for relaxed tree. Option -v is used to set variance option. __-v 1__ is set by default to use variance. Set __-v 0__ if you don't want to use variance, and __-v 2__ to run program twice where the second time calculates variances based of the estimated branch length of the first time. Simulation shows that -v 2 give slightly better result than -v 1 in average.

## Some examples of command lines:

* If the input tree is rooted:

- You want to estimate rate & dates under temporal constrained mode, using variances:
- You want to estimate rate & dates (by default under temporal constraints, and using variances):

`./lsd2 -i rootedtree_file -d date_file -c -v 1`
`./lsd2 -i rootedtree_file -d date_file`

- You want to remove outlier nodes with Zscore threshold 3:

`./lsd2 -i rootedtree_file -d date_file -c -v 1 -e 3`
`./lsd2 -i rootedtree_file -d date_file -e 3`

- You just want to collapse null branches in the input tree (by default all branches <= 0.5/seqlength are collapsed), and impose a minimum of 0.1 (estimated by default) for the branches of the time scaled tree:
- You want to collapse only null branches in the input tree (by default all branches <= 0.5/seqlength are collapsed), and impose a minimum of 0.1 (estimated by default) for the branches of the time scaled tree:

`./lsd2 -i rootedtree_file -d date_file -c -v 1 -e 3 -u 0.1 -l 0`
`./lsd2 -i rootedtree_file -d date_file -e 3 -u 0.1 -l 0`

- Similar as above, but you don't want to collapse any branch even null, then set a negative value for option -l:

`./lsd2 -i rootedtree_file -d date_file -e 3 -u 0.1 -l -1`

- Similar as above, but you allow nullability for external branches of output tree:

`./lsd2 -i rootedtree_file -d date_file -c -v 1 -e 3 -u 0.1 -U 0 -l 0`
`./lsd2 -i rootedtree_file -d date_file -e 3 -u 0.1 -U 0 -l 0`

- You know the tree partition where each part should have a different rate:

`./lsd2 -i rootedtree_file -d date_file -c -v 1 -p parition_file`
`./lsd2 -i rootedtree_file -d date_file -p parition_file`

- You want to re-estimate the root position locally around the given root

`./lsd2 -i rootedtree_file -d date_file -c -r l`
`./lsd2 -i rootedtree_file -d date_file -r l`

- You want to calculate confidence intervals from 100 simulated trees. The sequence length used to build your tree was 1000, and you'd like to apply a lognormal relaxed clock of standard deviation 0.4 to the simulated branch lengths.

`./lsd2 -i rootedtree_file -d date_file -c -r l -f 100 -s 1000 -q 0.4`
`./lsd2 -i rootedtree_file -d date_file -r l -f 100 -s 1000 -q 0.4`

(To calculate confidence intervals, the sequence length is required via option -s. The program generates simulated branch lengths using Poisson distributions whose mean equal to the estimated ones multiplied with sequence length. In addition, a lognormal relaxed clock is also applied to the branch lengths. This ditribution has mean 1 and standard deviation settable by users with option -q, by default is 0.2; 0 means strict clock. The bigger q is, the more your tree is relaxed and the bigger confidence intervals you should get).

- If all tips are supposed to have the same date, you can still estimate the rate but only relative dates.
- If all tips are supposed to have the same date, you can still estimate the rate but only relative dates when specifying the root date (for example 0) and the tip date (for example 1)

`./lsd2 -i tree_file -c`
`./lsd2 -i tree_file -a 0 -z 1`

* If the input tree is unrooted, you should either specify outgroups or use option -r to estimate the root position.

- If you don't have any outgroup and you want to estimate the root position:

`./lsd2 -i unrootedtree_file -d date_file -c -r a -v 1`
`./lsd2 -i unrootedtree_file -d date_file -r a`

- If you have a list of outgroups and want to use them for rooting (also remove them):

`./lsd2 -i unrootedtree_file -d date_file -g outgroup_file -c -v 1`
`./lsd2 -i unrootedtree_file -d date_file -g outgroup_file`

- If you want to keep the outgroups in the tree, just estimate the root position on the branch that defined by the outgroups:

`./lsd2 -i unrootedtree_file -d date_file -g outgroup_file -k -c -v 1`
`./lsd2 -i unrootedtree_file -d date_file -g outgroup_file -k`

## Output files:

Expand Down
Binary file modified bin/lsd2_mac
Binary file not shown.
Binary file modified bin/lsd2_unix
Binary file not shown.
4 changes: 2 additions & 2 deletions src/lsd.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -120,8 +120,7 @@ int lsd::buildTimeTree( int argc, char** argv, InputOutputStream *inputOutput)
constraintConsistent = initConstraint(opt, nodes);
bool medianRateOK = true;
if (opt->e>0) medianRateOK = calculateOutliers(opt,nodes,median_rate);
else if (opt->minblen<0) medianRateOK = calculateMedianRate(opt,nodes,median_rate);
imposeMinBlen(*(io->outResult),opt,nodes,median_rate,medianRateOK);
else if (opt->minblen<0 && opt->constraint) medianRateOK = calculateMedianRate(opt,nodes,median_rate);
if (!opt->constraint){//LD without constraints
if (!constraintConsistent){
ostringstream oss;
Expand Down Expand Up @@ -174,6 +173,7 @@ int lsd::buildTimeTree( int argc, char** argv, InputOutputStream *inputOutput)
}
}
else {//QPD with temporal constrains
imposeMinBlen(*(io->outResult),opt,nodes,median_rate,medianRateOK);
if (constraintConsistent || (opt->estimate_root!="" && opt->estimate_root!="k")){
if (constraintConsistent){
if (opt->estimate_root==""){//keep the given root
Expand Down
23 changes: 12 additions & 11 deletions src/options.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ Pr* getCommandLine( int argc, char** argv)
uflag=false,
vflag=false,
validDate = true;
while ( (c = getopt(argc, argv, ":i:d:D:o:s:n:g:r:v:ct:w:b:ha:z:f:kje:m:p:q:u:l:U:R:S:V")) != -1 )
while ( (c = getopt(argc, argv, ":i:d:D:o:s:n:g:r:v:Ft:w:b:ha:z:f:kje:m:p:q:u:l:U:R:S:V")) != -1 )
{
switch (c)
{
Expand Down Expand Up @@ -112,15 +112,15 @@ Pr* getCommandLine( int argc, char** argv)
break;
case 'v':
if( !isInteger(optarg) )
myExit("Argument of option -v must be either 1 or 2.\n");
myExit("Argument of option -v must be an integer either 0, 1, or 2.\n");
opt->variance = atoi(optarg);
if (opt->variance!=1 && opt->variance!=2){
myExit("Argument of option -v must be either 1 (for using orginal branches to compute variance) or 2 (LSD will be run twice, the second time uses the variances based on the estimated branch lengths of the first time).\n");
if (opt->variance!=0 && opt->variance!=1 && opt->variance!=2){
myExit("Argument of option -v must be either 0 (do not use variance) 1 (default value, to using orginal branches to compute variance) or 2 (LSD will be run twice, the second time uses the variances based on the estimated branch lengths of the first time).\n");
}
vflag = true;
break;
case 'c':
opt->constraint = true;
case 'F':
opt->constraint = false;
break;
case 'b':
if( !isReal(optarg) )
Expand Down Expand Up @@ -247,7 +247,8 @@ Pr* getCommandLine( int argc, char** argv)
cout<<"lsd2 "<<VERSION<<endl;
exit( EXIT_SUCCESS );
case '?':
myExit("Unrecognized option: -%c\n", optopt);
if (optopt == 'c') myExit("Unrecognized option -c, temporal constraint now becomes by default without this option.\n");
else myExit("Unrecognized option: -%c\n", optopt);
case ':':
if (optopt=='v') myExit("Argument of option -v must be either 1 (for using orginal branches to compute variance) or 2 (for using branches estimated by LSD to compute variance, i.e LSD will be run 2 times: the first time is just used to compute the variance).\n");
else myExit("Option -%c requires an operand\n", optopt );
Expand Down Expand Up @@ -534,10 +535,6 @@ void printHelp( void )
<<FLAT<<"\t and 10/seqlength; but it should be adjusted based on how/whether the input tree is relaxed or strict. The smaller it is the more variances\n"
<<FLAT<<"\t would be linear to branch lengths, which is relevant for strict clock. The bigger it is the less effect of branch lengths on variances, \n"
<<FLAT<<"\t which might be better for relaxed clock.\n"
<<FLAT<<"\t" <<BOLD<<"-c \n"
<<FLAT<<"\t By using this option, we impose the constraints that the date of every node is equal or smaller then\n"
<<FLAT<<"\t the dates of its descendants. Without constraints, the runtime is linear (LD). With constraints, the\n"
<<FLAT<<"\t problem is a quadratic programming and is solved efficiently (quasi-linear) by the active-set method.\n"
<<FLAT<<"\t" <<BOLD<<"-d " <<LINE<<"inputDateFile\n"
<<FLAT<<"\t This options is used to read the name of the input date file which contains temporal constraints of internal nodes\n"
<<FLAT<<"\t or tips. An internal node can be defined either by its label (given in the input tree) or by a subset of tips that have it as \n"
Expand Down Expand Up @@ -579,6 +576,10 @@ void printHelp( void )
<<FLAT<<"\t OUTGROUP2\n"
<<FLAT<<"\t ...\n"
<<FLAT<<"\t OUTGROUPn\n"
<<FLAT<<"\t" <<BOLD<<"-F \n"
<<FLAT<<"\t By default without this option, we impose the constraints that the date of every node is equal or smaller then the\n"
<<FLAT<<"\t dates of its descendants, so the running time is quasi-linear. Using this option we ignore this temporal constraints, and\n"
<<FLAT<<"\t the the running time becomes linear, much faster.\n"
<<FLAT<<"\t" <<BOLD<<"-h " <<LINE<<"help\n"
<<FLAT<<"\t Print this message.\n"
<<FLAT<<"\t" <<BOLD<<"-i " <<LINE<<"inputTreesFile\n"
Expand Down
4 changes: 2 additions & 2 deletions src/pr.h
Original file line number Diff line number Diff line change
Expand Up @@ -131,8 +131,8 @@ typedef struct Pr
mrca=0;
leaves=1;
estimate_root = "";
constraint = false;
variance = 0;
constraint = true;
variance = 1;
minblen = -1;
minblenL = -1;
nullblen = NAN;
Expand Down

0 comments on commit fb9317a

Please sign in to comment.