Releases · ksahlin/ultra

17 May 12:50

ksahlin

6b81c05

v0.1 Latest


Major update. Previous versions of uLTRA had several bottlenecks that made it infeasible to map larger datasets (in terms of number of reads). The most notable updates:

A faster and more memory-efficient seed finder, namfinder (>10x faster than the previously used MEM finders).
Reads and SAM files are no longer loaded into memory in several places; uLTRA now streams over the files instead (previously, a SAM file of alignments was loaded into RAM).
Intermediate output is now compressed.
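To illustrate the streaming and compression ideas in the list above, here is a minimal Python sketch. It is not uLTRA's actual code: the function names `stream_sam_records` and `write_compressed` are hypothetical, and the sketch only assumes the standard tab-separated SAM layout (read name in column 1, reference name in column 3).

```python
import gzip
import os
import tempfile
from typing import Iterable, Iterator, Tuple


def write_compressed(path: str, lines: Iterable[str]) -> None:
    """Write text lines to a gzip-compressed file (hypothetical helper)."""
    with gzip.open(path, "wt") as out:
        for line in lines:
            out.write(line + "\n")


def stream_sam_records(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (read_name, reference_name) one alignment at a time.

    Only the current line is held in memory, so peak memory does not
    grow with the number of alignments. Header lines ('@...') are skipped.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("@"):
                continue
            fields = line.rstrip("\n").split("\t")
            yield fields[0], fields[2]


if __name__ == "__main__":
    # Tiny made-up example: one header line and two alignment records.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.sam.gz")
    write_compressed(
        demo_path,
        [
            "@HD\tVN:1.6",
            "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\t*",
            "read2\t16\tchr2\t5\t60\t4M\t*\t0\t0\tACGT\t*",
        ],
    )
    for read_name, ref_name in stream_sam_records(demo_path):
        print(read_name, ref_name)
```

The contrast is with calling something like `handle.readlines()`, which would hold every alignment in RAM at once.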

This version has been tested on the datasets I used in the 2021 publication of uLTRA. The largest dataset in the evaluation is the IsoSeq Alzheimer dataset (4.5M reads). On the Alzheimer dataset using 19 cores, peak memory usage is now less than 30GB (previously ~100GB), the runtime is 3h 40m (previously 5h 40m), and disk usage has gone down due to compressed files (I have not measured the reduction in size).
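For a sense of scale, the figures quoted above can be turned into relative improvements. This is only a back-of-the-envelope sketch: the memory numbers are approximate in the notes ("less than 30" and "~100"), so the memory saving is a rough lower bound, and the helper name `percent_reduction` is just for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Runtimes from the release notes, converted to minutes.
runtime_before_min = 5 * 60 + 40  # 5h 40m
runtime_after_min = 3 * 60 + 40   # 3h 40m

speedup = runtime_before_min / runtime_after_min
runtime_saving = percent_reduction(runtime_before_min, runtime_after_min)

# Approximate peak memory figures (~100 before, just under 30 after).
memory_saving = percent_reduction(100, 30)

print(f"speedup: {speedup:.2f}x")                # speedup: 1.55x
print(f"runtime saving: {runtime_saving:.1f}%")  # runtime saving: 35.3%
print(f"memory saving: {memory_saving:.0f}%")    # memory saving: 70%
```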

The accuracy of v0.1 is only a very small fraction lower than that of the previous version (v0.0.4.2) on the tested simulated datasets. The output is not identical to that of previous versions because of the new seed finder. The improvement over other aligners when aligning to, e.g., small exons is still there.


26 Oct 15:50

ksahlin

v0.0.4.2

b3e6c27


Fixes issues #17 and #2.


21 Sep 08:13

ksahlin

v0.0.4

a6b6fb0


Fixed issue #4.
Added an option --use_NAM_seeds, which changes the seeding from MEMs to NAMs (with strobemers). NAM seeding makes uLTRA faster and produces smaller intermediate files. Memory usage with --use_NAM_seeds is "fixed" regardless of the number of cores/threads (about 80-90GB for the human genome), whereas memory usage with the default option grows with the number of cores. As a result, --use_NAM_seeds gives lower peak memory usage than the default option when using more than 18 cores, and higher memory usage otherwise. Alignment accuracy is largely the same: NAM seeds decrease accuracy by about 0.01%-0.05% compared to MEMs (i.e., 1 alignment in every 2,000-10,000). Because of the faster runtime and smaller disk usage, at the cost of high memory usage, I recommend --use_NAM_seeds for large datasets (>5M reads) when running on nodes with >90GB of memory and more than 20 cores.
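The recommendation above can be written down as a small rule of thumb. The sketch below is not part of uLTRA: `recommend_nam_seeds` and `one_in_n` are hypothetical helpers that simply encode the thresholds stated in the note (more than 5M reads, more than 90 GB of memory, more than 20 cores) and the conversion from an accuracy drop in percent to "1 alignment in every N".

```python
def recommend_nam_seeds(num_reads: int, memory_gb: float, cores: int) -> bool:
    """Return True when all three conditions from the release note hold.

    Hypothetical helper; the thresholds are copied from the note, not
    taken from uLTRA's source code.
    """
    return num_reads > 5_000_000 and memory_gb > 90 and cores > 20


def one_in_n(percent: float) -> int:
    """Convert an accuracy drop in percent to '1 alignment in every N'."""
    return round(100 / percent)


if __name__ == "__main__":
    # A 6M-read dataset on a 128 GB, 24-core node meets the recommendation.
    print(recommend_nam_seeds(6_000_000, 128, 24))  # True
    # The same node with only 64 GB of memory does not.
    print(recommend_nam_seeds(6_000_000, 64, 24))   # False
    # 0.05% and 0.01% correspond to the 2,000-10,000 range in the note.
    print(one_in_n(0.05), one_in_n(0.01))           # 2000 10000
```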

Assets 2