bzip2

This project is my implementation of bzip2 (wiki) — a popular and efficient data compression algorithm.

Here's how it compresses Leo Tolstoy's "War and Peace" compared with the standard ZIP algorithm:

Features

a pure Python (>=3.10) implementation with no third-party dependencies
it (slightly) outperforms the standard ZIP algorithm
it works with binary data, therefore no file-type restrictions

Disclaimer

Though the code presented is fully functional, passes all the tests and has notable compression efficiency, the following should be taken into account:

it's a pet project — it is made to satisfy my curiosity, it was never meant to be used in production
the file binary structure is incompatible with the original bzip2 format (= can't be opened with an archive manager app)
no integrity check (unlike the canonical bzip2 implementation)
optimization leaves much to be desired due to a variety of factors:
- it's written in pure Python
- algorithm parameters are not fine-tuned enough
only works with single files (so to compress a folder you have to tar it first)

Project overview

The project files can be roughly grouped into three categories:

The implementation itself
Infrastructure (Makefile, main.py, settings, tests)
Extended documentation (all the README.md files)

Algorithm specification

The bzip2 algorithm can be described as a chain of reversible transformations:

Splitting into blocks
RLE → BWT → MTF → RLE → HFC
Merging the blocks

where:

term	wiki	description
RLE	link	run-length encoding
BWT	link	Burrows-Wheeler transform
MTF	link	move-to-front transform
HFC	link	Huffman coding
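Of the four transformations, BWT is probably the least intuitive, so here is a minimal sketch of it. It is a naive illustration and not the code from this repository: the names `bwt_encode` and `bwt_decode`, and the choice to return the row index instead of appending an end-of-block marker, are assumptions. Sorting full rotations like this is far too slow and memory-hungry for real blocks; it only shows the idea.

```python
def bwt_encode(data: bytes) -> tuple[bytes, int]:
    """Return the last column of the sorted rotation table and the row of the original."""
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data  # rotation i is doubled[i:i + n]
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last_column = bytes(doubled[i + n - 1] for i in order)
    return last_column, order.index(0)


def bwt_decode(last_column: bytes, index: int) -> bytes:
    """Invert bwt_encode by walking the table one row at a time."""
    n = len(last_column)
    # A stable sort of positions by byte value maps each row of the sorted
    # table to the row that starts one character later in the original text.
    next_row = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    row = index
    for _ in range(n):
        row = next_row[row]
        out.append(last_column[row])
    return bytes(out)


encoded, row_index = bwt_encode(b"banana")
print(encoded, row_index)              # b'nnbaaa' 3
print(bwt_decode(encoded, row_index))  # b'banana'
```

Notice how equal bytes end up next to each other in the output. That is what makes the following MTF and RLE steps effective.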

So, to encode (and compress) a file, we apply the transformations from the list sequentially.

To decode it, we apply the inverse transformations in reverse order.
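The "apply forward, then undo in reverse" idea can be sketched as follows. This is a simplified illustration and not the project's code: the `Step` type, the function names, and the two-step toy chain (a naive RLE followed by MTF, standing in for the full RLE → BWT → MTF → RLE → HFC chain) are all assumptions.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    """A reversible transformation: decode(encode(x)) == x."""
    encode: Callable[[bytes], bytes]
    decode: Callable[[bytes], bytes]


def rle_encode(data: bytes) -> bytes:
    """Toy RLE: emit (run length, byte) pairs, runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i + 1]]) * data[i]
    return bytes(out)


def mtf_encode(data: bytes) -> bytes:
    """Replace each byte with its position in a recently-used list."""
    alphabet = list(range(256))
    out = bytearray()
    for byte in data:
        position = alphabet.index(byte)
        out.append(position)
        alphabet.insert(0, alphabet.pop(position))
    return bytes(out)


def mtf_decode(data: bytes) -> bytes:
    alphabet = list(range(256))
    out = bytearray()
    for position in data:
        byte = alphabet.pop(position)
        out.append(byte)
        alphabet.insert(0, byte)
    return bytes(out)


STEPS = [Step(rle_encode, rle_decode), Step(mtf_encode, mtf_decode)]


def encode_block(block: bytes, steps: list[Step] = STEPS) -> bytes:
    for step in steps:  # forward order
        block = step.encode(block)
    return block


def decode_block(block: bytes, steps: list[Step] = STEPS) -> bytes:
    for step in reversed(steps):  # inverse steps, reverse order
        block = step.decode(block)
    return block


sample = b"aaaabbbcc"
assert decode_block(encode_block(sample)) == sample
```

Because every step is reversible on its own, the whole chain is reversible too, no matter how many steps it contains.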

Splitting into blocks

The Splitting into blocks step just creates an iterator over the given file descriptor. This iterator yields byte blocks of a fixed size (the default block size is currently 128 KiB).
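A minimal sketch of such an iterator is shown below. The name `iter_blocks` and the `BLOCK_SIZE` constant are made up for illustration and are not necessarily what the project uses. Note that the last block is usually shorter than the others.

```python
import io
from typing import BinaryIO, Iterator

BLOCK_SIZE = 128 * 1024  # 128 KiB, the default block size mentioned above


def iter_blocks(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive blocks of at most `block_size` bytes until EOF."""
    while block := stream.read(block_size):
        yield block


# Demo on an in-memory stream; a file opened with open(path, "rb") works the same way.
demo_blocks = list(iter_blocks(io.BytesIO(b"x" * 10), block_size=4))
print(demo_blocks)  # [b'xxxx', b'xxxx', b'xx']
```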

Merging the blocks

Merging the blocks is a little trickier. The size of a block after encoding is not known in advance, so we have to store it somewhere; otherwise we won't be able to reverse this operation. The binary format is as follows:

 0               1               2               3
 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8 9 A B C D E F
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  1st block size (4 bytes)                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                1st block data (up to 4 GiB)                   |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  2nd block size (4 bytes)                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                2nd block data (up to 4 GiB)                   |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                             ...                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
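The layout above can be written and read back with a few lines of `struct` code. This is a sketch, not the project's actual code: the function names are hypothetical, and the byte order of the size field is an assumption (big-endian here), since the diagram does not specify it.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator

SIZE_FORMAT = ">I"  # assumption: unsigned 32-bit integer, big-endian
SIZE_LENGTH = struct.calcsize(SIZE_FORMAT)  # 4 bytes


def write_blocks(out: BinaryIO, blocks: Iterable[bytes]) -> None:
    """Write each block prefixed with its size."""
    for block in blocks:
        # struct.pack raises struct.error if the block does not fit in 4 bytes
        out.write(struct.pack(SIZE_FORMAT, len(block)))
        out.write(block)


def read_blocks(src: BinaryIO) -> Iterator[bytes]:
    """Yield the blocks back, using the size prefix to find the boundaries."""
    while header := src.read(SIZE_LENGTH):
        if len(header) < SIZE_LENGTH:
            raise ValueError("truncated block size field")
        (size,) = struct.unpack(SIZE_FORMAT, header)
        block = src.read(size)
        if len(block) < size:
            raise ValueError("truncated block data")
        yield block


# Round trip through an in-memory buffer
buffer = io.BytesIO()
write_blocks(buffer, [b"abc", b"hello"])
buffer.seek(0)
print(list(read_blocks(buffer)))  # [b'abc', b'hello']
```

Without the size prefix the reader would have no way to tell where one encoded block ends and the next one begins.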

Project infrastructure

Software requirements

Python >=3.10
(optional) make tool — for build-automation

How to launch

Choose a file you want to compress and copy its path to the clipboard.
Paste it into the in_file variable in main.py.
Run main from the project root folder.
- or just run python app/main.py

How to set up the developer environment

pip install -r requirements.dev.txt: installs all the dev dependencies
make lint: runs formatting and linting (isort => black => flake8)
make test: runs the fast tests
make test-all: runs all the tests (including the slow ones)

Licensing

MIT License
