shortseq

ShortSeqs are compact and efficient Python objects that hold short sequences while using up to 73% less memory compared to built-in types. They are prehashed and comparable, they support slicing, indexing, and a variety of vectorized operations, and they easily convert back to their original string form. Their memory advantage is shown in the table below.

Sequence Length	PyUnicode Size	PyBytes Size	ShortSeq Size	% Reduced
0-32 nt	56-88 bytes	40-72 bytes	32 bytes (fixed)	43-64%
33-96 nt	88-152 bytes	72-136 bytes	48 bytes (fixed)	55-68%
97-1024 nt	152-1080 bytes	136-1064 bytes	64-288 bytes	57-73%

^{* Object sizes were measured on Python 3.10 using asizeof() from the pympler package. % Reduced is PyUnicode vs. ShortSeq}

Installation

mamba install -c bioconda -c conda-forge shortseq

However, I strongly recommend compiling the project from source. The instructions that make ShortSeqs fast are very specific to your CPU's microarchitecture, so the prebuilt binaries on Bioconda are not optimized for your machine. To compile from source:

git clone https://github.com/AlexTate/shortseq.git
cd shortseq
python -m pip install .

Usage

import shortseq as sq

# Construct from PyUnicode or PyBytes
seq_str = "ATGC"
seq_bytes = b"ATGC"
seq_1 = sq.pack(seq_str)
seq_2 = sq.pack(seq_bytes)

# Verify outputs (optional)
assert seq_1 == seq_2 == seq_str
assert len(seq_1) == len(seq_2) == len(seq_str)

seq_3 = sq.pack("TATTAGCGATTGACAGTTGTCCTGTAATAACGCCGGGTAAATTTGCCG")
seq_4 = sq.pack("TATTACCGATTGACAGTTGTCCTGTAATAACGGCGGGTAAATTTGCTG")  # 5M1X26M1X13M1X1M
seq_str = str(seq_4)

# Slice and subscript
assert seq_4[5:15] == seq_str[5:15]
assert seq_4[-2] == seq_str[-2]

# Vectorized hamming distance (differing bases)
hammd = sum(a!=b for a, b in zip(seq_3, seq_4))
assert seq_3 ^ seq_4 == hammd == 3

# Count unique sequences similar to collections.Counter
from shortseq import ShortSeqCounter
counts = ShortSeqCounter([seq_bytes] * 10)
assert counts == {sq.pack("ATGC"): 10}

CPU Requirements

Intel Haswell (2014) and newer, or
AMD Excavator (2015) and newer, or
Apple M1 and newer

However, AMD processors prior to Zen 3 (2020) aren't recommended for 65-1024 nt sequences if runtime performance is a high priority.

Performance

Note that the measurement of Gzip Bytes is the length in bytes of the compressed sequence at maximum compression (level 9), which is much smaller than the actual PyBytes object returned by gzip.compress(). This footprint is therefore unattainable when using Python's gzip module, and instead serves as a theoretical lower bound for the memory footprint of a compressed sequence.

View source: MemoryBenchmarks.test_mem_by_length()

Edit distance calculation is extremely efficient for ShortSeqs and can be performed in near-constant time. This is a huge advantage over other sequence types, which require O(n) time to compute the edit distance.

View source: TimeBenchmarks.test_hamming_distance()

This benchmark is very generous towards the competition because no attempt is made to check for non-nucleotide characters, whereas ShortSeq uses a bloom filter to do so for all inputs. ShortSeq construction involves encoding the sequence string into a compressed binary representation, which is an O(n) operation, whereas x.decode() and np.char.asarray() are O(1) because they essentially copy the object's internal buffer.

View source: TimeBenchmarks.test_construction_from_bytes()

Encoding (Compression)

We represent DNA with four symbols: A, C, T, and G. Generally speaking, when these symbols are represented in computer systems, each symbol takes up one byte or 8 bits of memory because these letters are part of a symbol system that requires 7/8 of those bits for its range. These symbols can instead be represented ordinally as 0, 1, 2, and 3, which only requires 2 bits, allowing us to pack 4 nucleotides into each byte rather than just one.

Nucleotide	In ASCII	Ordinal Value
A	`0100 00 01`	`00`
C	`0100 01 11`	`01`
T	`0101 10 00`	`10`
G	`0100 11 11`	`11`

This table shows how each nucleotide is represented in ASCII and how it's ordinal value is represented in binary. I should mention that this scheme isn't my work, but rather a well-known technique that's been around for a while.

Decoding

ShortSeqs are decoded back to their original sequence strings "lazily", i.e. it happens only when you ask and the result isn't cached in the object, so it has to be recomputed with each request. However, ShortSeqs retain the original string's length and can be compared to each other for equality and edit distance without decoding.

Acknowledgements

Huge thanks to the Montgomery Lab at Colorado State University. This began as an experiment of mine while in pursuit of faster sequence deduplication and optimization for tinyRNA.

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
.github/workflows		.github/workflows
doc/plots		doc/plots
shortseq		shortseq
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
requirements-bench.txt		requirements-bench.txt
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
setup.py		setup.py
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

shortseq

Installation

Usage

CPU Requirements

Performance

Encoding (Compression)

Decoding

Acknowledgements

About

Releases 1

Languages

License

AlexTate/shortseq

Folders and files

Latest commit

History

Repository files navigation

shortseq

Installation

Usage

CPU Requirements

Performance

Encoding (Compression)

Decoding

Acknowledgements

About

Resources

License

Stars

Watchers

Forks

Releases 1

Languages