This repository provides reproducible information on how compressible a human genome sequence (T2T CHM13 version 2.0 [article, sequence]) is when using different data compressors.
Results:
The 3,117,292,120 human DNA symbols have been losslessly compressed to the sizes listed below:
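| Rank | Bytes | Bps | Time (m) | RAM (GB) | Program | Replication | Factor (*) |
|---|---|---|---|---|---|---|---|
| 1 | 538,155,679 | 1.381 | ? | ? | JARVIS3 | ? | |
| 2 | 539,129,963 | 1.384 | 641 | 13.7 | JARVIS3 | | |
| 3 | 543,855,534 | 1.395 | 381 | 28.8 | JARVIS2 | | |
| 4 | 544,059,173 | 1.396 | 389 | 28.8 | JARVIS2 | | |
| 5 | 544,267,353 | 1.396 | 420 | 27.4 | JARVIS2 | | |
| 6 | 544,292,577 | 1.397 | 399 | 26.9 | JARVIS2 | | |
| 7 | 545,960,947 | 1.401 | 283 | 26.9 | JARVIS2 | | |
| 8 | 549,594,830 | 1.410 | 284 | 11 | JARVIS2 | | |
| 9 | 550,041,600 | 1.411 | 340 | 18.8 | JARVIS2 | | |
| 10 | 550,051,840 | 1.411 | 309 | 18.8 | JARVIS2 | | |
| 11 | 550,379,520 | 1.412 | 279 | 18.7 | JARVIS2 | | |
| 12 | 554,823,680 | 1.423 | 253 | 18.7 | JARVIS2 | | |
| 13 | 554,985,480 | 1.424 | 219 | 4.1 | JARVIS2 | | |
| 14 | 555,412,871 | 1.425 | 690 | 24.8 | GeCo3 | | |
| 15 | 555,679,745 | 1.426 | 616 | 24.3 | GeCo3 | | |
| 16 | 555,977,522 | 1.427 | 488 | 22.2 | GeCo3 | | |
| 17 | 556,415,717 | 1.428 | 427 | 19.7 | GeCo3 | | |
| 18 | 557,100,364 | 1.430 | 428 | 17.2 | GeCo3 | | |
| 19 | 557,438,004 | 1.431 | 426 | 15.7 | GeCo3 | | |
| 20 | 557,995,100 | 1.432 | 406 | 14.6 | GeCo3 | | |
| 21 | 558,343,430 | 1.433 | 396 | 13.3 | GeCo3 | | |
| 22 | 559,124,034 | 1.435 | 425 | 11.6 | GeCo3 | | |
| 23 | 560,694,405 | 1.439 | 354 | 12.8 | GeCo3 | | |
| 24 | 560,982,904 | 1.440 | 416 | 8.1 | GeCo3 | | |
| 25 | 561,644,781 | 1.441 | 280 | 11.3 | GeCo3 | | |
| 26 | 562,253,393 | 1.443 | 281 | 11.3 | GeCo3 | | |
| 27 | 564,282,192 | 1.448 | 222 | 6.3 | GeCo3 | | |
| 28 | 564,613,120 | 1.449 | 82 | 4.5 | JARVIS2 | | |
| 29 | 564,913,725 | 1.450 | 262 | 7.3 | GeCo3 | | |
| 30 | 566,108,106 | 1.453 | 54 | 8.4 | JARVIS2 | | |
| 31 | 566,387,531 | 1.454 | 215 | 6.3 | GeCo3 | | |
| 32 | 575,830,095 | 1.478 | 94 | 2.9 | GeCo3 | | |
| 33 | 576,296,690 | 1.479 | 37 | 5.9 | JARVIS2 | | |
| 34 | 577,672,973 | 1.482 | 88 | 1.9 | GeCo3 | | |
| 35 | 578,588,274 | 1.485 | 101 | 3.3 | GeCo3 | | |
| 36 | 581,917,199 | 1.493 | 97 | 1.8 | GeCo3 | | |
| 37 | 583,746,074 | 1.498 | 86 | 3.3 | GeCo3 | | |
| 38 | 589,813,339 | 1.514 | 17,465 | 0.6 | nncp | | |
| 39 | 603,726,643 | 1.549 | 71 | 3.3 | GeCo3 | | |
| 40 | 607,749,667 | 1.560 | 22 | 2.5 | MFCompress | | |
| 41 | 607,835,665 | 1.560 | 48 | 1.8 | GeCo2 | | |
| 42 | 609,579,746 | 1.564 | 171 | 13.8 | JARVIS | | |
| 43 | 612,331,601 | 1.571 | 4,588 | 1.6 | paq8l | | |
| 44 | 614,339,951 | 1.577 | 39 | 28.5 | bsc-m03 | | |
| 45 | 614,919,247 | 1.578 | 39 | 20.4 | bsc-m03 | | |
| 46 | 618,241,906 | 1.587 | 39 | 16.3 | bsc-m03 | | |
| 47 | 619,369,574 | 1.590 | 20 | 2.0 | GeCo2 | | |
| 48 | 619,837,647 | 1.591 | 12 | 0.6 | MFCompress | | |
| 49 | 620,837,061 | 1.593 | 39 | 11.2 | bsc-m03 | | |
| 50 | 625,647,034 | 1.606 | 38 | 5.6 | bsc-m03 | | |
| 51 | 625,753,521 | 1.606 | 11 | 0.6 | MFCompress | | |
| 52 | 628,342,060 | 1.613 | 18 | 0.5 | GeCo2 | | |
| 53 | 639,222,915 | 1.640 | 43 | 0.8 | NAF-22 | | |
| 54 | 646,062,792 | 1.658 | 84 | 0.6 | lzma -9 | | |
| 55 | 661,591,088 | 1.698 | 36 | 0.05 | bsc-m03 | | |
| 56 | 752,793,986 | 1.932 | 5 | 0.001 | bzip2 -9 | | |
| Baseline | 779,323,017 | 2.000 | - | - | 2 BPS | - | |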
(*) The baseline of 2 bits per symbol is used to calculate the (data compression) Factor, which represents the percentage of space saved relative to that baseline and is given by 100-((CompressedBytes*8)/(3117292120*2)*100). Run1.sh and Run4.sh were run on a laptop running Linux with an 11th Gen Intel® Core™ i5-1135G7 @ 2.40GHz × 8, 8 GB of RAM, and a 512 GB SSD. The remaining computations were run on a desktop computer running Linux with an Intel® Core™ i7-6700 CPU @ 3.40GHz × 8, 31.2 GiB of RAM, and a 3 TB disk. The ranking is ordered by the lowest number of bytes (an approximation of the Kolmogorov complexity).
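
As a quick check of the table, the Bps and Factor values can be recomputed from the byte counts. The following is a minimal Python sketch, assuming that Bps means compressed bits divided by the 3,117,292,120 symbols and that Factor follows the formula in the note above. The names `TOTAL_SYMBOLS`, `bits_per_symbol`, and `compression_factor` are illustrative and not part of the repository.

```python
# Sequence length reported in the Results section (number of DNA symbols).
TOTAL_SYMBOLS = 3_117_292_120


def bits_per_symbol(compressed_bytes: int, symbols: int = TOTAL_SYMBOLS) -> float:
    """Compressed size in bits divided by the number of symbols (assumed Bps definition)."""
    return compressed_bytes * 8 / symbols


def compression_factor(compressed_bytes: int, symbols: int = TOTAL_SYMBOLS) -> float:
    """Percentage saved relative to the 2 bits-per-symbol baseline.

    Implements 100 - ((CompressedBytes * 8) / (symbols * 2) * 100).
    """
    return 100 - (compressed_bytes * 8) / (symbols * 2) * 100


if __name__ == "__main__":
    # Byte counts copied from the first and last ranked rows of the table.
    for program, size in [("JARVIS3", 538_155_679), ("bzip2 -9", 752_793_986)]:
        bps = bits_per_symbol(size)
        factor = compression_factor(size)
        print(f"{program}: {bps:.3f} bps, factor {factor:.2f}%")
```

Under these definitions the Factor equals `100 - 50 * Bps`, so a file stored at exactly 2 bits per symbol has a Factor of 0.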