Skip to content

Commit

Permalink
docs: add sizing explanation to bloom filter docs in parquet (#5705)
Browse files Browse the repository at this point in the history
* docs: add sizing explanation to bloom filter docs in parquet

Added documentation detailing the sizing of bloom filters in the parquet
crate.

* docs: fix doc comment typo

* docs: clarify doc comment on bloom filters in metadata

Updated the bloom filter module doc comment to clarify that the metadata stores the offset/length of the bloom filter, and not the bloom filter in its entirety.

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
  • Loading branch information
hiltontj and alamb authored May 4, 2024
1 parent b3f06f6 commit 2bdc9c1
Showing 1 changed file with 55 additions and 1 deletion.
56 changes: 55 additions & 1 deletion parquet/src/bloom_filter/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,61 @@
// under the License.

//! Bloom filter implementation specific to Parquet, as described
//! in the [spec](https://github.com/apache/parquet-format/blob/master/BloomFilter.md).
//! in the [spec][parquet-bf-spec].
//!
//! # Bloom Filter Size
//!
//! Parquet uses the [Split Block Bloom Filter][sbbf-paper] (SBBF) as its bloom filter
//! implementation. For each column upon which bloom filters are enabled, the offset and length of an SBBF
//! is stored in the metadata for each row group in the parquet file. The size of each filter is
//! initialized using a calculation based on the desired number of distinct values (NDV) and false
//! positive probability (FPP). The FPP for a SBBF can be approximated as<sup>[1][bf-formulae]</sup>:
//!
//! ```text
//! f = (1 - e^(-k * n / m))^k
//! ```
//!
//! Where, `f` is the FPP, `k` the number of hash functions, `n` the NDV, and `m` the total number
//! of bits in the bloom filter. This can be re-arranged to determine the total number of bits
//! required to achieve a given FPP and NDV:
//!
//! ```text
//! m = -k * n / ln(1 - f^(1/k))
//! ```
//!
//! SBBFs use eight hash functions to cleanly fit in SIMD lanes<sup>[2][sbbf-paper]</sup>, therefore
//! `k` is set to 8. The SBBF will spread those `m` bits accross a set of `b` blocks that
//! are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen as:
//!
//! ```text
//! b = NP2(m/8) / 32
//! ```
//!
//! Where, `NP2` denotes *the next power of two*, and `m` is divided by 8 to be represented as bytes.
//!
//! Here is a table of calculated sizes for various FPP and NDV:
//!
//! | NDV | FPP | b | Size (KB) |
//! |-----------|-----------|---------|-----------|
//! | 10,000 | 0.1 | 256 | 8 |
//! | 10,000 | 0.01 | 512 | 16 |
//! | 10,000 | 0.001 | 1,024 | 32 |
//! | 10,000 | 0.0001 | 1,024 | 32 |
//! | 100,000 | 0.1 | 4,096 | 128 |
//! | 100,000 | 0.01 | 4,096 | 128 |
//! | 100,000 | 0.001 | 8,192 | 256 |
//! | 100,000 | 0.0001 | 16,384 | 512 |
//! | 100,000 | 0.00001 | 16,384 | 512 |
//! | 1,000,000 | 0.1 | 32,768 | 1,024 |
//! | 1,000,000 | 0.01 | 65,536 | 2,048 |
//! | 1,000,000 | 0.001 | 65,536 | 2,048 |
//! | 1,000,000 | 0.0001 | 131,072 | 4,096 |
//! | 1,000,000 | 0.00001 | 131,072 | 4,096 |
//! | 1,000,000 | 0.000001 | 262,144 | 8,192 |
//!
//! [parquet-bf-spec]: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
//! [sbbf-paper]: https://arxiv.org/pdf/2101.01719
//! [bf-formulae]: http://tfk.mit.edu/pdf/bloom.pdf
use crate::data_type::AsBytes;
use crate::errors::ParquetError;
Expand Down

0 comments on commit 2bdc9c1

Please sign in to comment.