GitHub - cobilab/HumanGenome: How compressible is a human genome sequence?

How compressible is a human genome sequence?

This repository provides reproducible results on how compressible a human genome sequence is (T2T-CHM13 version 2.0 [article, sequence]) using different data compressors.

Results:

The 3,117,292,120 human DNA symbols have been losslessly compressed to the following sizes:

Rank	Bytes	Bps	Time (m)	RAM (GB)	Program	Replication
1	538,155,679	1.381	?	?	JARVIS3	?
2	539,129,963	1.384	641	13.7	JARVIS3
3	543,855,534	1.395	381	28.8	JARVIS2
4	544,059,173	1.396	389	28.8	JARVIS2
5	544,267,353	1.396	420	27.4	JARVIS2
6	544,292,577	1.397	399	26.9	JARVIS2
7	545,960,947	1.401	283	26.9	JARVIS2
8	549,594,830	1.410	284	11	JARVIS2
9	550,041,600	1.411	340	18.8	JARVIS2
10	550,051,840	1.411	309	18.8	JARVIS2
11	550,379,520	1.412	279	18.7	JARVIS2
12	554,823,680	1.423	253	18.7	JARVIS2
13	554,985,480	1.424	219	4.1	JARVIS2
14	555,412,871	1.425	690	24.8	GeCo3
15	555,679,745	1.426	616	24.3	GeCo3
16	555,977,522	1.427	488	22.2	GeCo3
17	556,415,717	1.428	427	19.7	GeCo3
18	557,100,364	1.430	428	17.2	GeCo3
19	557,438,004	1.431	426	15.7	GeCo3
20	557,995,100	1.432	406	14.6	GeCo3
21	558,343,430	1.433	396	13.3	GeCo3
22	559,124,034	1.435	425	11.6	GeCo3
23	560,694,405	1.439	354	12.8	GeCo3
24	560,982,904	1.440	416	8.1	GeCo3
25	561,644,781	1.441	280	11.3	GeCo3
26	562,253,393	1.443	281	11.3	GeCo3
27	564,282,192	1.448	222	6.3	GeCo3
28	564,613,120	1.449	82	4.5	JARVIS2
29	564,913,725	1.450	262	7.3	GeCo3
30	566,108,106	1.453	54	8.4	JARVIS2
31	566,387,531	1.454	215	6.3	GeCo3
32	575,830,095	1.478	94	2.9	GeCo3
33	576,296,690	1.479	37	5.9	JARVIS2
34	577,672,973	1.482	88	1.9	GeCo3
35	578,588,274	1.485	101	3.3	GeCo3
36	581,917,199	1.493	97	1.8	GeCo3
37	583,746,074	1.498	86	3.3	GeCo3
38	589,813,339	1.514	17,465	0.6	nncp
39	603,726,643	1.549	71	3.3	GeCo3
40	607,749,667	1.560	22	2.5	MFCompress
41	607,835,665	1.560	48	1.8	GeCo2
42	609,579,746	1.564	171	13.8	JARVIS
43	612,331,601	1.571	4,588	1.6	paq8l
44	614,339,951	1.577	39	28.5	bsc-m03
45	614,919,247	1.578	39	20.4	bsc-m03
46	618,241,906	1.587	39	16.3	bsc-m03
47	619,369,574	1.590	20	2.0	GeCo2
48	619,837,647	1.591	12	0.6	MFCompress
49	620,837,061	1.593	39	11.2	bsc-m03
50	625,647,034	1.606	38	5.6	bsc-m03
51	625,753,521	1.606	11	0.6	MFCompress
52	628,342,060	1.613	18	0.5	GeCo2
53	639,222,915	1.640	43	0.8	NAF-22
54	646,062,792	1.658	84	0.6	lzma -9
55	661,591,088	1.698	36	0.05	bsc-m03
56	752,793,986	1.932	5	0.001	bzip2 -9
Baseline	779,323,017	2.000	-	-	2 BPS	-

(*) The baseline of 2 bits per symbol is used to calculate the data compression Factor, which represents the percentage of the sequence that has been compressed away and is given by 100-((CompressedBytes*8)/(3117292120*2)*100). Run1.sh and Run4.sh were run on a laptop running Linux with an 11th Gen Intel® Core™ i5-1135G7 @ 2.40GHz × 8, 8 GB of RAM, and a 512 GB SSD. The remaining computations were run on a desktop computer running Linux with an Intel® Core™ i7-6700 CPU @ 3.40GHz × 8, 31.2 GiB of RAM, and a 3 TB disk. The ranking is by the lowest number of bytes (an approximation of the Kolmogorov complexity).
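The Bps column and the Factor formula above can be checked with a few lines of shell. This is a minimal sketch, not part of the repository: the function names `bps` and `factor` are hypothetical, the symbol count is the one stated in the Results section, and `awk` is assumed to be available.

```shell
#!/bin/sh
# Number of DNA symbols in the sequence, as stated in the Results section.
SYMBOLS=3117292120

# bps BYTES (hypothetical helper): bits per symbol = (bytes * 8) / symbols
bps() {
    awk -v b="$1" -v n="$SYMBOLS" 'BEGIN { printf "%.3f\n", (b * 8) / n }'
}

# factor BYTES (hypothetical helper): 100 - ((bytes * 8) / (symbols * 2) * 100)
factor() {
    awk -v b="$1" -v n="$SYMBOLS" 'BEGIN { printf "%.2f\n", 100 - ((b * 8) / (n * 2) * 100) }'
}

# Example: the rank 1 entry of the table (538,155,679 bytes).
bps 538155679
factor 538155679
```

For the rank 1 entry this reproduces the 1.381 bits per symbol listed in the table; the Factor value is the percentage saved relative to storing each symbol in 2 bits.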

Data compression tools

Data Compressor	Repository	Description
GeCo3	code	article
GeCo2	code	article
paq8l	code	article
nncp v3.1	code	article
NAF	code	article
lzma 5.2.5	code	article
JARVIS	code	article
bzip2 1.0.8	code	article
MFCompress	code	article
bsc-m03 v0.2.1	code	article
JARVIS2	code	article
JARVIS3	code	-
Zstandard	code	base

Reproducibility:

Change directory and give execute permissions:

cd scripts/
chmod +x Run*.sh

To replicate each run, use the respective replication script.
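When replicating a run, it can help to record how long it took so the result can be compared with the Time (m) column. The following is a rough sketch under stated assumptions: `run_timed` is a hypothetical helper that is not part of the repository, it measures only wall-clock time (not RAM), and whether a given replication script expects arguments is not covered here.

```shell
#!/bin/sh
# run_timed LOGFILE COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND, stores its output in LOGFILE, and prints the elapsed
# wall-clock time in whole minutes. Returns the command's exit status.
run_timed() {
    log="$1"
    shift
    status=0
    start=$(date +%s)
    "$@" > "$log" 2>&1 || status=$?
    end=$(date +%s)
    echo "$(( (end - start) / 60 ))"
    return "$status"
}

# Usage sketch (assumes the script can be started without arguments):
#   run_timed run1.log ./Run1.sh
```

Elapsed time is truncated to whole minutes, which matches the granularity of the results table; peak memory would need a separate tool.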
