This project is my implementation of bzip2 (wiki) — a popular and efficient data compression algorithm.
Here's how it zips Leo Tolstoy's "War and Peace" versus standard ZIP-algorithm:
- a pure Python (
>=3.10
) implementation with no third-party dependencies - it outperforms (slightly) the standard zip-algorithm
- it works with binary data, therefore no file-type restrictions
Though the code presented is fully functional, passes all the tests and has notable compression efficiency, the following should be taken into account:
- it's a pet project — it is made to satisfy my curiosity, it was never meant to be used in production
- the file binary structure is incompatible with the original bzip2 format (= can't be opened with an archive manager app)
- no consistency check (conversely to bzip canonical implementation)
- optimization leaves much to be desired due to a variety of factors:
- it's written on pure Python
- algorithm parameters are not fine-tuned enough
- only works with single files (so to compress a folder you have to tar it first)
The project files can be roughly groupt into three cathegories:
- The implementation itself
- Infrastructure (
Makefile
,main.py
, settings, tests) - Extended documentation (all the
README.md
files)
bzip2 algorithm can be described as a chain of reversible transformations:
- Splitting into blocks
- RLE
→
BWT→
MTF→
RLE→
HFC - Merging the blocks
where:
term | wiki | description |
---|---|---|
RLE | link | run-length encoding |
BWT | link | Burrows-Wheeler transform |
MTF | link | move-to-front transform |
HFC | link | Huffman coding |
So, to encode (and compress) the file we apply the transformations from the list sequentially.
Thus to decode the file, one should apply the inverse transformations in inverse order.
The Splitting into blocks
step is just making an inerator based on the file descriptor given.
This iterator yields byte-blocks of a fixed size (currently the default block size is 128 KiB
).
See: RLE README (≈ 12 minutes to read)
See: BWT README (≈ 20 minutes to read)
See: MTF README.md (≈ 6 minutes to read)
See: HFC README.md (≈ 20 minutes to read)
Merging the blocks
is a little bit trickier. The final block size is indetermined,
so we have to store it somewhere. Otherwise we won't be able to reverse this operation.
The binary format is as follow:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8 9 A B C D E F
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 1st block size (4 bytes) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
| 1st block data (up to 4 GiB) |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 2nd block size (4 bytes) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
| 2nd block data (up to 4 GiB) |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
| ... |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Python >=3.10
- (optional)
make
tool — for build-automation
- Choose a file you want to compress and copy its path to the clipboard.
- Paste it into the
in_file
variable inmain.py
. - Run
main
from the project root folder.- or just run
python app/main.py
- or just run
pip install -r requirements.dev.txt
— to install all the dev dependenciesmake lint
— for formating and linting (isort
=>black
=>flake8
)make test
— to run fast testsmake test-all
— to run all the tests (incloding the slow ones)