This repository provides the information needed to reproduce measurements of how compressible a human genome sequence is (T2T CHM13 version 2.0 [article, sequence]) using different data compressors.
The 3,117,292,120 human DNA symbols have been compressed (lossless) to
(*) The baseline of 2 bits per symbol is used to calculate the (data compression) Factor, which is the percentage by which the compressed size falls below that baseline and is given by `100-((CompressedBytes*8)/(3117292120*2)*100)`. Run1.sh and Run4.sh ran on a laptop running Linux with an 11th Gen Intel® Core™ i5-1135G7 @ 2.40GHz × 8, 8 GB of RAM, and a 512 GB SSD. The remaining computations ran on a desktop running Linux with an Intel® Core™ i7-6700 CPU @ 3.40GHz × 8, 31.2 GiB of RAM, and a 3 TB disk. The ranking is given by the lowest number of compressed bytes (an approximation of the Kolmogorov complexity).
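The Factor formula above can be checked with a small shell helper. This is only a sketch: the function name `compression_factor` and the two input sizes are hypothetical and chosen for illustration (the 2-bit baseline size and half of it); they are not results from this repository.

```shell
#!/bin/sh
# Number of DNA symbols in the sequence (taken from the text above).
SYMBOLS=3117292120

# compression_factor BYTES   (hypothetical helper, not part of the repository)
# Prints 100 - ((BYTES*8) / (SYMBOLS*2) * 100) with four decimals.
compression_factor() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v bytes="$1" -v symbols="$SYMBOLS" \
        'BEGIN { printf "%.4f\n", 100 - ((bytes * 8) / (symbols * 2) * 100) }'
}

# Illustrative input: a file exactly at the 2-bit baseline
# (3117292120 * 2 / 8 = 779323030 bytes) gives a Factor of 0.
compression_factor 779323030    # prints 0.0000

# Illustrative input: a file half the baseline size gives a Factor of 50.
compression_factor 389661515    # prints 50.0000
```

To apply it to a real result, pass the size in bytes of a compressed output file, for example as reported by `wc -c`.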
Data Compressor | Repository | Description |
---|---|---|
GeCo3 | code | article |
GeCo2 | code | article |
paq8l | code | article |
nncp v3.1 | code | article |
NAF | code | article |
lzma 5.2.5 | code | article |
JARVIS | code | article |
bzip2 1.0.8 | code | article |
MFCompress | code | article |
bsc-m03 v0.2.1 | code | article |
JARVIS2 | code | article |
JARVIS3 | code | - |
Zstandard | code | base |
Change directory and give the scripts execute permission:
cd scripts/ && chmod +x Run*.sh
To replicate each run, use the respective replication script.