-
Notifications
You must be signed in to change notification settings - Fork 838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write Bloom filters between row groups instead of the end #5860
Conversation
This allows Bloom filters to not be saved in memory, which can save significant space when writing long files
Based on the test failures, it seems the Bloom Filters are either not written, or not picked up by the readers. Not sure why that is. |
Thank you @progval cc @Ted-Jiang and @jimexist I think there is a tradeoff:
Thus given there is a tradeoff it seems like we should at least offer an config setting of where to write the bloom filters. I don't know if the parquet bloom filter spec dictates where the bloom filters should be written or if the ecosystem (aka paruqet-java) implicity requires them in a particular location |
When using BloomFilterPosition::AfterRowGroup this was only writing the Bloom Filter offset to a temporary clone of the metadata, causing the Bloom Filter to never be seen by readers
f3e7e78
to
83b475e
Compare
Indeed, done. I believe my changes should make it easy to add an API to allow writers to trigger flushing of Bloom Filters, so they can pick a middle-ground themselves by writing all Bloom Filters for a group of row groups next to each other.
The way I read it, it allows them to be anywhere with any layout we like
I don't know about this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @progval -- this looks very nice (as always 🙏 )
The only thing I think needs to be changed is removing the new dependencies. Otherwise this PR looks ready to me
parquet/Cargo.toml
Outdated
@@ -68,6 +68,9 @@ twox-hash = { version = "1.6", default-features = false } | |||
paste = { version = "1.0" } | |||
half = { version = "2.1", default-features = false, features = ["num-traits"] } | |||
|
|||
dsi-progress-logger = { version = "0.2.4", optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please remove these new dependencies (even though I do realize they are optional and won't be activated very often)
I think they will add some ongoing maintenance cost (keeping the dependencies updated) which I would prefer to avoid if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about depending only on sysinfo
to display the RAM usage? It has a small set of dependencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. It now looks like this:
$ cargo run --release --features="cli sysinfo" --example write_parquet -- /tmp/test.parquet
2024-06-13 21:45:40 Writing 1000 batches of 1000000 rows. RSS = 1MB
2024-06-13 21:45:50 Iteration 260/1000. RSS = 50MB
2024-06-13 21:46:00 Iteration 518/1000. RSS = 50MB
2024-06-13 21:46:10 Iteration 772/1000. RSS = 50MB
2024-06-13 21:46:19 Done. RSS = 17MB
$ cargo run --release --features="cli sysinfo" --example write_parquet -- /tmp/test.parquet --bloom-filter-position end
2024-06-13 21:46:29 Writing 1000 batches of 1000000 rows. RSS = 1MB
2024-06-13 21:46:39 Iteration 267/1000. RSS = 451MB
2024-06-13 21:46:49 Iteration 533/1000. RSS = 791MB
2024-06-13 21:46:59 Iteration 799/1000. RSS = 1151MB
2024-06-13 21:47:07 Done. RSS = 1055MB
use parquet::errors::Result; | ||
use parquet::file::properties::WriterProperties; | ||
|
||
fn main() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could add some comments here explaining what this example is trying to show
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, along with a Clap argument parser:
$ cargo run --release --features="cli sysinfo" --example write_parquet -- -h
Writes sequences of integers, with a Bloom Filter, while logging timing and memory usage
Usage: write_parquet [OPTIONS] <PATH>
Arguments:
<PATH> Path to the file to write
Options:
--iterations <ITERATIONS> Number of batches to write [default: 1000]
--batch <BATCH> Number of rows in each batch [default: 1000000]
--bloom-filter-position <BLOOM_FILTER_POSITION> Where to write Bloom Filters [default: after-row-group] [possible values: end, after-row-group]
-h, --help Print help
-V, --version Print version
Improve documentation Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @progval -- this PR now looks good to me
I merged this branch up to main to resolve conflicts and I double checked that this is an additive API (rather than an API change) so I think it can be merged for inclusion in the next minor release |
|
Shoot -- you are right I merged this PR by accident. I will revert this change in #5932 and open a new PR to re-add it marked correctly with api-change |
PR with the changes re-introduced: #5933 |
This was reverted and thus does will not be present in 52.1.0 release #5905 |
I will merge #5933 when we open for breaking API changes |
to avoid OOMs due to storing Bloom Filters in memory while writing, see apache/arrow-rs#5860
Which issue does this PR close?
Closes #5859.
Rationale for this change
This allows Bloom filters to not be saved in memory, which can save significant space when writing long files. This switches between the two layouts mentioned in the spec
What changes are included in this PR?
This includes a script that demonstrates the memory usage.
Increases linearly up to 4.3GB of RAM before the change:
Remains constant at 55.2MB after the change:
This is a demo of the change, just to make sure this is something we want.
In particular, this breakswrong, see commentarrow::arrow_writer::tests::*_bloom_filter
because they expect to read the Bloom Filters from the memory at the end except... they aren't anymore.So if this looks good to you, I'll add a field indoneWriterProperties
to switch between the old behavior (all Bloom Filters at the end) and this one (interleaved Bloom Filters). How should I call it?Are there any user-facing changes?
The layout of output files changes significantly. This may have a negative performance effect on readers expecting data locality, as Bloom Filters are now scattered across the file.
This required changes to the
flushed_row_groups
return type (Arc<T>
toT
) andOnCloseRowGroup
, as we now need to mutate row groups whileSerializedRowGroupWriter
is "live" instead of just at the end inwrite_metadata()
(which used to leave the structure in an inconsistent state, but it didn't matter because onlyclose()
called it)