Releases · jtnystrom/Discount

13 Feb 06:40

jtnystrom

v3.0.1

f93da32

Version 3.0.1

Bugfix for self-union and self-intersection of indexes (in v3.0.0, these operations produced a cartesian product).

Additional convenience methods for Index, such as unionLeft and unionRight.


16 Jan 06:36

jtnystrom

v3.0.0

7a2ac8b

Version 3.0.0

This version adds indexes (k-mer databases with counted k-mers) and ways to combine these, including intersect, union and subtract. Various rules for intersection and union are available, including max, min, left, right. Most operations that could formerly be done only on raw sequence files can now also be done on indexes with a similar syntax.
Indexes are stored as bucketed parquet files, which makes repeated use of the same input data efficient, as the k-mers do not have to be shuffled again on subsequent use.

Indexes can be manipulated using the command-line interface as well as the API from notebooks or from the Spark shell.
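
The combination rules are easiest to see on a toy example. The following is a minimal Python sketch of the semantics only, using plain dictionaries from k-mer to count; it is not the Discount API, and the names `RULES`, `intersect` and `union` are hypothetical. It also assumes that, in a union, a k-mer present on only one side keeps the count it has there, which the release notes do not specify.

```python
# Illustrative only: a toy model of combining two counted k-mer sets.
# This is NOT Discount's API; all names here are hypothetical.

RULES = {
    "min": min,
    "max": max,
    "sum": lambda a, b: a + b,
    "left": lambda a, b: a,    # keep the count from the left index
    "right": lambda a, b: b,   # keep the count from the right index
}

def intersect(left, right, rule):
    """Keep k-mers present in both inputs, combining counts with the rule."""
    f = RULES[rule]
    return {k: f(left[k], right[k]) for k in left.keys() & right.keys()}

def union(left, right, rule):
    """Keep k-mers present in either input.

    Assumption: a k-mer found on only one side keeps that side's count.
    """
    f = RULES[rule]
    out = {}
    for k in left.keys() | right.keys():
        if k in left and k in right:
            out[k] = f(left[k], right[k])
        else:
            out[k] = left[k] if k in left else right[k]
    return out
```

In this model, intersecting an index with itself under `min` or `max` returns the same k-mers and counts, one entry per k-mer.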

Summary of changes in this version:

The minimum supported Spark version is now 3.1.0.
Support for indexes (k-mer databases) written as parquet files.
Index operations such as union, intersect, subtract, with various combination rules like min, max, sum, left, right.
Restructured the API to use indexes as much as possible.
Several operations were moved to Spark SQL (from handcrafted Scala) for performance and simplicity.
Run scripts were renamed and can now detect their location, which makes it easy to symlink them to somewhere in $PATH.
The new -p flag is now the preferred way to specify the number of partitions.
Most commands that take input can now read input from an index (using -i) as well as from sequence files.
K-mer counts are now consistently represented as Int instead of Long in the user API, as they were already limited to 32-bit signed integers internally.
Added com.globalmentor's hadoop-bare-naked-local-fs to avoid dependency on winutils.exe on Windows when running tests.
Various simplifications and speedups.


06 Jun 06:00

jtnystrom

v2.3.0

fd351f8

Version 2.3.0

Version 2.3.0 greatly increases the maximum data size that can be analysed (we have tested up to 6 TB of input data). As some very minor changes are incompatible, API users may need to manually migrate some code.

Pre-grouped mode for handling repetitive or very large data, which can be enabled with --method pregrouped.
Some minimizer sets are now bundled in the Discount jar, which means that many users will not need to supply minimizers manually.
Improved support for large m (up to 13), which helps subdivide complex data with many distinct k-mers.
Automatic coalescing of partitions in frequency sampling when appropriate (improves performance).
Support for @inputs.txt syntax to supply a list of input files on the command line.
More efficient frequency sampling by doing the sampling entirely in Spark SQL, instead of partially on the driver.
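
As an illustration of the @inputs.txt convention, here is a minimal Python sketch of how an argument list containing such an entry could be expanded. It is not Discount's implementation: the function name `expand_inputs`, the `read_lines` hook and the one-path-per-line file layout are assumptions made for the example.

```python
# Illustrative only: expanding "@file" arguments into a list of input paths.
# Assumption: the list file holds one path per line; blank lines are skipped.

def expand_inputs(args, read_lines=None):
    """Replace each '@path' argument with the paths listed in that file."""
    if read_lines is None:
        def read_lines(path):
            with open(path) as f:
                return f.read().splitlines()

    files = []
    for arg in args:
        if arg.startswith("@"):
            # Strip the leading '@' to get the name of the list file.
            for line in read_lines(arg[1:]):
                line = line.strip()
                if line:
                    files.append(line)
        else:
            files.append(arg)
    return files
```

The `read_lines` parameter exists only so that the sketch can be exercised without touching the file system.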


11 Mar 03:27

jtnystrom

v2.2.1

8b04c8e

Version 2.2.1

This release fixes a data loss bug in the parsing of some fastq files.


08 Feb 07:16

jtnystrom

v2.2.0

8f7846d

Version 2.2.0

Improved support for very long fasta sequences (e.g. full chromosomes), even for multiple sequences per file. This is done by relying on an external .fai index, which is now necessary for sequences with unbounded length.
File input formats can now be mixed (e.g. fastq, fasta, long fasta can be read by the same job).
k-mer statistics can now optionally be written to an output file using a new argument (not just to standard output as before).
For convenience, additional PASHA minimizer sets for k >= 19 and m = 10, 11 were added to the distribution.
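
To show why a .fai index helps with very long sequences, here is a minimal Python sketch based on the standard samtools faidx layout (tab-separated NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH). It only illustrates the index format and is not taken from Discount; `FaiEntry`, `parse_fai_line` and `byte_offset` are hypothetical names.

```python
# Illustrative only: locating a base in a line-wrapped fasta file via .fai.
from typing import NamedTuple

class FaiEntry(NamedTuple):
    name: str        # sequence name
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the file
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the newline

def parse_fai_line(line):
    """Parse the first five tab-separated columns of a .fai line."""
    name, length, offset, line_bases, line_width = line.rstrip("\n").split("\t")[:5]
    return FaiEntry(name, int(length), int(offset), int(line_bases), int(line_width))

def byte_offset(entry, pos):
    """Byte offset in the fasta file of the 0-based base position `pos`."""
    if not 0 <= pos < entry.length:
        raise IndexError("position outside sequence")
    full_lines = pos // entry.line_bases
    return entry.offset + full_lines * entry.line_width + pos % entry.line_bases
```

With this arithmetic, a reader can seek directly to any position in a chromosome-length sequence instead of scanning it from the start.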


22 Oct 03:00

jtnystrom

v2.1.0

e8c265b

Version 2.1.0

Classes were restructured under the com.jnpersson.discount package (instead of simply "discount") to comply with normal Java/Scala conventions. This is a breaking change for API users, but should be a simple migration.
Faster algorithms for read splitting and bitwise encoding.
Sampling and input parsing have been unified into a single API that is consistent across short reads and long sequences, and that samples long sequences more fairly.
Foundational work towards preserving the sequence locations of input sequence fragments.
Additional test cases for different kinds of input data.


30 Aug 06:40

jtnystrom

v2.0.1

9f4af4a

Version 2.0.1

This release fixes a bug in which long, multiline input sequences were handled incorrectly, occasionally producing wrong k-mer counts. It also includes some other minor improvements.


20 Aug 09:02

jtnystrom

v2.0.0

f947178

Version 2.0.0

This version includes the following improvements:

Nearly 50% faster counting due to better algorithms, including a version of radix sort from the Fastutil library
Automatic selection of the most appropriate minimizer set from a directory, by matching with the desired (k, m) values
Support for interactive notebooks (a Zeppelin example is included) and a restructured API to support this
Hashed superkmers can now be queried by sequences to find matching k-mers
Support for lowercase nucleotide letters in input
Support for user-defined minimizer orderings (-o given)
Various simplifications and enhancements
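
For readers unfamiliar with the terms used above, the following minimal Python sketch shows how a read can be split into superkmers, that is, runs of consecutive k-mers sharing the same minimizer. It is a teaching example and not Discount's algorithm. The names `minimizer`, `split_superkmers` and `rank` are hypothetical, the default ordering is assumed to be lexicographic, and a user-defined ordering is modelled simply as a different `rank` function.

```python
# Illustrative only: minimizers and superkmers on plain strings.

def minimizer(kmer, m, rank):
    """Return the m-mer of `kmer` that is smallest under `rank`."""
    return min((kmer[i:i + m] for i in range(len(kmer) - m + 1)), key=rank)

def split_superkmers(read, k, m, rank=lambda s: s):
    """Group consecutive k-mers with the same minimizer into superkmers.

    Returns a list of (minimizer, superkmer) pairs in read order.
    Input is upper-cased first, so lowercase nucleotides are accepted.
    """
    read = read.upper()
    out = []
    current = None   # minimizer of the run being built
    start = 0        # start position of that run in the read
    for i in range(len(read) - k + 1):
        mn = minimizer(read[i:i + k], m, rank)
        if current is None:
            current, start = mn, i
        elif mn != current:
            # The run ends with the k-mer starting at i - 1.
            out.append((current, read[start:i - 1 + k]))
            current, start = mn, i
    if current is not None:
        out.append((current, read[start:]))
    return out
```

Each superkmer of length n contains n - k + 1 k-mers, so k-mers can be grouped by minimizer without storing every k-mer separately.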


24 Apr 07:12

jtnystrom

v1.4.0

8493bc0

Version 1.4.0

Version 1.4.0 has the following improvements:

Scala 2.12/Spark 3.1 are now the default versions when compiling.
Bugfix for incorrect counting when k mod 16 = 0.
sbt-assembly is now the preferred way to package Discount, including its dependencies (Scallop and Fastdoop) in a "fat" jar.
Additional property-based unit tests using ScalaCheck.
A minimal demo application (ReadSplitDemo) shows how to use the Discount API without Spark.
Various simplifications, code cleanups and speedups.


04 Mar 01:47

jtnystrom

v1.3.0

ad92cb2

Version 1.3.0

Version 1.3.0 has the following improvements:

Improved performance for large m
Reduced memory usage in the hashing stage
Fixed a bug that caused Discount to crash on empty inputs
Improved command line argument validation
Renamed the output path for count --stats
Renamed the command line arguments --motif-set and --stats to --minimizers and --buckets, respectively, for improved clarity
