
Releases: ksahlin/IsoCon

0.3.2

29 Jul 02:48
  • Several speed improvements. IsoCon 0.3.2 is ~2-10x faster than version 0.3.1 and earlier.
  • Various bugfixes.
  • Added a script to estimate the sequencing depth required to achieve >T% recall in the statistical test.

0.3.1

25 Mar 13:57
  • New pairwise alignment approach: switched from ssw (Smith-Waterman) to parasail, which supports NW (Needleman-Wunsch).
  • IsoCon can now take a fastq file with CCS reads and their quality values as input, instead of the flnc and ccs.bam files.
  • Some minor bugfixes.

Major runtime improvements and code base changes

03 Feb 02:49

Major updates to speed and code readability, minor bugfixes. Previous versions are deprecated.

Fixed

  • Bugfix when --nearest_neighbor_depth is set. The previous version would not explore fewer than the specified number of sequences when assigning reads to candidates (statistical test step).

Added

  • Added parameter --min_test_ratio X (default 5). This parameter omits testing a candidate t against a candidate c if c has X times more support than t. This speeds up the algorithm by omitting tests that will (most likely) be significant anyway, since c is dominant over t.
  • Added CHANGELOG and GPL LICENSE.
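A minimal sketch of the --min_test_ratio check described above; the function name and structure are hypothetical, and only the parameter and its default of 5 come from these notes:

```python
def should_test(support_c, support_t, min_test_ratio=5):
    # Skip the statistical test of candidate t against candidate c when
    # c's read support is at least min_test_ratio times t's: such a test
    # would (most likely) come out significant anyway.
    return support_c < min_test_ratio * support_t

print(should_test(60, 10))  # False: 60 >= 5 * 10, test omitted
print(should_test(40, 10))  # True: 40 < 5 * 10, test performed
```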

Changed

  • IsoCon no longer builds the multi-alignment matrix (MAM) in the statistical testing step. Support, base probabilities, etc. are now obtained by (1) aligning only c to t to get the positions where they differ, and (2) obtaining base qualities over these positions from the (already created) read alignments to c and t (each read is assigned to either c or t). This implementation therefore skips both the realignment of all reads to the reference t and the creation of the MAM. This gives a speed-up of 5-20x for the statistical test (most speedup for longer, noisier sequences).
  • Improved speed (~2-3x) of the multi-alignment function (now used only in the correction step).
  • Significant re-write of code in some regions, such as the statistical test, improving readability and factorization.
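The two-step procedure above can be sketched as follows; the alignment representation (equal-length strings with '-' for gaps) and the function name are assumptions for illustration:

```python
def differing_positions(aln_c, aln_t):
    # Step (1): given a pairwise alignment of candidate c and reference t,
    # return the alignment columns where they disagree. Step (2) would then
    # look up base qualities at only these columns, using the read
    # alignments to c and t that already exist, instead of realigning all
    # reads to t and building a full multi-alignment matrix.
    assert len(aln_c) == len(aln_t)
    return [i for i, (a, b) in enumerate(zip(aln_c, aln_t)) if a != b]

print(differing_positions("ACGT-A", "ACCTGA"))  # [2, 4]
```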

v0.2.5.1

02 Feb 23:47

Stable version before a major re-write of the code resulting in improved readability and a significant speedup in the statistical test. This release fixed several minor bugs present in the IsoCon version (commit 85eb122, tag 0.2.4) that was used to generate the results in the bioRxiv preprint made available 2018-01-10, as well as in the version sent to a journal for review.

Fixed

  • Bugfix in the if-statement described in Supplementary Section A: "Estimating the probability of a sequencing error" (bioRxiv supplement).
  • Bugfix in how tiebreakers are treated, described in Supplementary Section A: "Implementation details" (bioRxiv supplement).

Added

  • Test data and instructions for running IsoCon on testdata.
  • Automatic builds and testing with Travis.
  • Installation through pip now possible.
  • Added parameter --verbose and removed a lot of prints to stdout.
  • Added parameter --min_exon_diff to break an alignment containing this many (or more) consecutive '-' (an indel). Previously this was hardcoded.
  • Made the upper bound T on mapping quality values (described in Supplementary Section A: "Estimating the probability of a sequencing error", bioRxiv supplement) a parameter --max_phred_q_trusted instead of a hardcoded value.
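A sketch of what the --min_exon_diff check does; the function name is hypothetical and the threshold of 20 used below is an arbitrary illustration, not the tool's actual default:

```python
def has_long_indel(aligned_seq, min_exon_diff=20):
    # Break the alignment if it contains a run of at least min_exon_diff
    # consecutive '-' characters, i.e. an indel long enough to look like
    # an exon-level difference rather than sequencing noise.
    run = 0
    for ch in aligned_seq:
        run = run + 1 if ch == '-' else 0
        if run >= min_exon_diff:
            return True
    return False

print(has_long_indel("ACGT" + "-" * 25 + "ACGT"))  # True
print(has_long_indel("AC--GT"))                    # False
```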

Changed

  • Changed parameter --single_core (a flag, false by default, with IsoCon using all available cores) to the more flexible --nr_cores, where the user can specify how many cores to use.
  • Changed terminology. All occurrences of "minimizer" are changed to "nearest_neighbor" (or "neighbor" in parameters) to adopt the new notation of a nearest neighbor graph instead of a minimizer graph.
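A sketch of the changed interface; the default and help text are assumptions, only the flag names come from these notes:

```python
import argparse
import multiprocessing

parser = argparse.ArgumentParser()
# --nr_cores replaces the old --single_core flag: instead of a binary
# all-cores/one-core choice, the user picks any number of cores.
parser.add_argument("--nr_cores", type=int,
                    default=multiprocessing.cpu_count(),
                    help="number of cores to use (default: all available)")

args = parser.parse_args(["--nr_cores", "4"])
print(args.nr_cores)  # 4
```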

Removed

  • Removed option --barcodes as it no longer serves a purpose: if there are barcodes, they have been detected and the reads split into batches in an upstream step.

bioRxiv version

03 Feb 00:02

This version was used to generate the results in the bioRxiv preprint made available 2018-01-10, and is also the version sent to a journal for review.

Various improvements

09 Jul 19:49
  • Fixed places in code that were stochastic and gave inconsistent results between runs:
    1. Fixed stochasticity when correcting sequences in a partition. This was due to the majority base pair being chosen arbitrarily (in python3) when there was a tie. In this version, we do not correct sequences at positions where the majority is ambiguous.
    2. Fixed ambiguity in partitioning: we choose the node with the largest number of reachable nodes. Within this partition, we choose as minimizer the string with the most direct support (number of identical strings + direct neighbors in the graph).
  • Fixed a logical bug in the function creating a multiple alignment matrix from pairwise alignments.
  • No filtering of candidates based on the requirement of being consensus over each base pair is performed (neither in the output of the minimizer step nor in the output of the final candidates step). This heuristic in earlier implementations served only to limit the number of candidates and covered up other flaws/bugs in the old implementation, such as the one in the multiple alignment creation mentioned above.
  • Fixed bug in bipartite partitioning algorithm. Now we use the "bipartite" data structure from networkx which also improves readability of code.
  • In statistical test:
    1. Better multiple-testing correction factor where there are differences in homopolymers.
    2. The statistical test is performed for each edge in the minimizer graph formed by the candidates, as described in the methods section.
    3. Added support for calculating the p-value under a normal approximation, which can be better than the Poisson if the sample size is very large and the probabilities are large enough. However, we only output the log of the difference; the Poisson is still used to derive results.
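A stdlib-only sketch of the two tail probabilities being compared; the actual test statistic and parameters used by IsoCon are not shown here:

```python
import math

def poisson_sf(k, lam):
    # P(X >= k) for X ~ Poisson(lam): one minus the lower-tail sum,
    # with terms computed iteratively to avoid overflow.
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return 1.0 - cdf

def normal_sf(k, lam):
    # Normal approximation to the same tail probability
    # (mean = variance = lam, with continuity correction).
    z = (k - 0.5 - lam) / math.sqrt(lam)
    return 0.5 * math.erfc(z / math.sqrt(2))

# For large counts the two agree closely; per the notes above,
# the Poisson value is still the one used to derive results.
print(poisson_sf(120, 100), normal_sf(120, 100))
```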

Exact minimizer graph, partitioning, reducing stochasticity.

02 May 16:10
  • Exact minimizer graph computation with library edlib.
  • Partitioning the graph by selecting the minimizers m with the highest number of reachable nodes as centers, instead of the heuristic approximation based on the largest number of neighbors used in v0.1.0.
  • Update in error correction:
    • The total number of errors of each type (insertions, deletions and substitutions) is calculated for each partition. In v0.1.0, the weight of each error was simply its count. In this version, the weight is the count at the position divided by the total error count of that type.
    • Instead of computing the edit distance E to the minimizer, we compute it to the consensus of the partition (denote this edit distance M), so the number of errors corrected changes from E/2 to M/2 in this version compared to the old one.
    • Version 0.1.0 finds the weight w (i.e., the count) of the E/2-th lowest-weight position after sorting positions by weight. If P is the set of all positions with weight <= w, then v0.1.0 selects a random subset of E/2 positions from P to correct (E/2 <= |P|). In contrast, v0.2.0 finds the weight w' (normalized count) of the M/2-th lowest-weight position after sorting, and if P' is the set of all positions with weight <= w', v0.2.0 corrects all |P'| positions instead of a random subset.
  • The error model in the statistical test still assumes uniform errors across CCS reads.
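The v0.2.0 selection described above might be sketched as follows; the data structures and function name are hypothetical:

```python
def positions_to_correct(pos_counts, pos_types, type_totals, M):
    # Weight of each candidate position = its count divided by the total
    # error count of its type (insertion, deletion or substitution).
    weights = {p: pos_counts[p] / type_totals[pos_types[p]]
               for p in pos_counts}
    ranked = sorted(weights, key=weights.get)
    k = M // 2
    if k == 0 or not ranked:
        return []
    # w' = weight of the (M/2)-th lowest-weight position ...
    w_prime = weights[ranked[min(k, len(ranked)) - 1]]
    # ... and *all* positions with weight <= w' get corrected,
    # not a random subset as in v0.1.0.
    return [p for p in ranked if weights[p] <= w_prime]

counts = {0: 1, 3: 2, 7: 1, 9: 5}
types = {0: "sub", 3: "ins", 7: "del", 9: "sub"}
totals = {"sub": 6, "ins": 2, "del": 1}
print(positions_to_correct(counts, types, totals, M=4))  # [0, 9]
```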

First stable version v0.1.0

15 Mar 02:59
Pre-release
  • First stable version on both simulated and biological datasets
  • Error model assuming uniform errors across CCS reads
  • Fast approximation of minimizer graph
  • Partitioning the graph by selecting the minimizers m with the highest number of neighbors as centers, and assigning to m each read that can reach m in the minimizer graph.
  • Correcting E/2 errors in each pass, for each read in a partition, where E is the edit distance between the minimizer and the read. Positions to correct: find the weight w (i.e., the count) of the E/2-th lowest-weight position after sorting positions by weight, and let P be all positions with weight <= w. Then select a random subset of E/2 positions from P to correct (E/2 <= |P|). The E/2 positions with the lowest counts in the PFM of the partition consensus get corrected; in summary, if more than E/2 positions have equally low counts, a random subset of E/2 positions is chosen.
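The v0.1.0 position selection might be sketched as follows; names and data structures are hypothetical, and the seeded generator is only for reproducibility of the example:

```python
import random

def v010_positions_to_correct(pos_counts, E, rng=random.Random(0)):
    # Sort positions by count, take the count w of the (E/2)-th lowest,
    # let P be all positions with count <= w, and correct a *random*
    # subset of E/2 of them -- the stochastic step that later versions
    # removed to make runs reproducible.
    k = E // 2
    ranked = sorted(pos_counts, key=pos_counts.get)
    if k == 0 or not ranked:
        return []
    w = pos_counts[ranked[min(k, len(ranked)) - 1]]
    P = [p for p in ranked if pos_counts[p] <= w]
    return rng.sample(P, min(k, len(P)))

# Three positions tie at count 1; with E = 4 we correct a random 2 of them.
picked = v010_positions_to_correct({0: 1, 1: 1, 2: 1, 5: 4}, E=4)
print(sorted(picked))
```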