Write Bloom filters between row groups instead of the end #5860

Merged (16 commits) on Jun 21, 2024

@progval progval commented Jun 10, 2024

Which issue does this PR close?

Closes #5859.

Rationale for this change

This avoids keeping Bloom filters in memory until the end of the file, which can save significant memory when writing long files. It switches between the two layouts mentioned in the spec.
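
To make the two layouts concrete, here is a minimal toy model in Rust. It is not the parquet crate's writer: `Position`, `Section`, and `simulate` are hypothetical names, and each row group is assumed to produce exactly one filter of a fixed size. It only illustrates why buffering until the end makes peak memory grow with the number of row groups, while interleaving keeps it bounded by a single filter.

```rust
/// Where Bloom filters are placed in the output file (toy model, hypothetical names).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Position {
    /// Buffer every filter and write them all just before the footer.
    End,
    /// Write each row group's filter right after that row group.
    AfterRowGroup,
}

/// One section of the simulated file.
#[derive(Debug, PartialEq)]
enum Section {
    RowGroup(usize),
    BloomFilter(usize),
    Footer,
}

/// Simulates writing `row_groups` row groups, each with one Bloom filter of
/// `filter_bytes` bytes (an assumption of this sketch). Returns the file
/// layout and the peak number of filter bytes held in memory at once.
fn simulate(position: Position, row_groups: usize, filter_bytes: usize) -> (Vec<Section>, usize) {
    let mut layout = Vec::new();
    let mut buffered: Vec<usize> = Vec::new(); // filters not yet written
    let mut peak = 0;

    for rg in 0..row_groups {
        layout.push(Section::RowGroup(rg));
        buffered.push(rg); // the filter is in memory once the row group closes
        peak = peak.max(buffered.len() * filter_bytes);
        if position == Position::AfterRowGroup {
            // Flush immediately, so at most one filter is ever buffered.
            layout.extend(buffered.drain(..).map(Section::BloomFilter));
        }
    }
    // With Position::End, everything buffered so far is written here.
    layout.extend(buffered.drain(..).map(Section::BloomFilter));
    layout.push(Section::Footer);
    (layout, peak)
}

fn main() {
    for &position in &[Position::End, Position::AfterRowGroup] {
        let (layout, peak) = simulate(position, 3, 1_000_000);
        println!("{:?}: peak buffered = {} bytes", position, peak);
        println!("  {:?}", layout);
    }
}
```

In this model the `End` layout buffers all three filters before writing any of them, while `AfterRowGroup` never holds more than one.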

What changes are included in this PR?

This includes a script that demonstrates the memory usage.

Before the change, memory usage grows linearly, reaching 4.3GB of RAM at the last progress report (4.49GB at completion):

$ cargo run --example write_parquet --release --features=log
    Finished release [optimized] target(s) in 0.11s
     Running `target/release/examples/write_parquet`
12:52:11 [INFO] Writing batches
12:52:21 [INFO] 267 iterations, 10s, 26.68 iterations/s, 37.48 ms/iterations; 8.90% done, 1m 42s to end; res/vir/avail/free/total mem 399.72MB/419.99MB/25.93GB/10.45GB/33.44GB
12:52:31 [INFO] 536 iterations, 20s, 26.75 iterations/s, 37.38 ms/iterations; 17.87% done, 1m 31s to end; res/vir/avail/free/total mem 805.78MB/829.16MB/25.93GB/10.45GB/33.44GB
12:52:41 [INFO] 805 iterations, 30s, 26.80 iterations/s, 37.31 ms/iterations; 26.83% done, 1m 21s to end; res/vir/avail/free/total mem 1.24GB/1.27GB/25.93GB/10.45GB/33.44GB
12:52:51 [INFO] 1,073 iterations, 40s, 26.79 iterations/s, 37.33 ms/iterations; 35.77% done, 1m 11s to end; res/vir/avail/free/total mem 1.61GB/1.64GB/25.93GB/10.45GB/33.44GB
12:53:01 [INFO] 1,342 iterations, 50s, 26.80 iterations/s, 37.31 ms/iterations; 44.73% done, 1m 1s to end; res/vir/avail/free/total mem 2.00GB/2.03GB/25.93GB/10.45GB/33.44GB
12:53:11 [INFO] 1,610 iterations, 1m 0s, 26.80 iterations/s, 37.32 ms/iterations; 53.67% done, 51s to end; res/vir/avail/free/total mem 2.39GB/2.42GB/25.93GB/10.45GB/33.44GB
12:53:21 [INFO] 1,869 iterations, 1m 10s, 26.65 iterations/s, 37.52 ms/iterations; 62.30% done, 42s to end; res/vir/avail/free/total mem 2.78GB/2.82GB/25.93GB/10.45GB/33.44GB
12:53:31 [INFO] 2,130 iterations, 1m 20s, 26.57 iterations/s, 37.63 ms/iterations; 71.00% done, 32s to end; res/vir/avail/free/total mem 3.16GB/3.21GB/25.93GB/10.45GB/33.44GB
12:53:41 [INFO] 2,391 iterations, 1m 30s, 26.52 iterations/s, 37.71 ms/iterations; 79.70% done, 22s to end; res/vir/avail/free/total mem 3.54GB/3.59GB/25.93GB/10.45GB/33.44GB
12:53:51 [INFO] 2,650 iterations, 1m 40s, 26.45 iterations/s, 37.80 ms/iterations; 88.33% done, 13s to end; res/vir/avail/free/total mem 3.93GB/3.98GB/25.93GB/10.45GB/33.44GB
12:54:01 [INFO] 2,908 iterations, 1m 50s, 26.39 iterations/s, 37.90 ms/iterations; 96.93% done, 3s to end; res/vir/avail/free/total mem 4.32GB/4.37GB/25.93GB/10.45GB/33.44GB
12:54:05 [INFO] Completed.
12:54:05 [INFO] Elapsed: 1m 53s [3,000 iterations, 26.36 iterations/s, 37.93 ms/iterations]; res/vir/avail/free/total mem 4.49GB/4.54GB/25.93GB/10.45GB/33.44GB

After the change, memory usage remains constant at 55.2MB:

$ cargo run --example write_parquet --release --features=log
   Compiling parquet v51.0.0 (/home/rust/arrow-rs/parquet)
    Finished release [optimized] target(s) in 11.24s
     Running `target/release/examples/write_parquet`
12:54:29 [INFO] Writing batches
12:54:39 [INFO] 261 iterations, 10s, 26.02 iterations/s, 38.43 ms/iterations; 8.70% done, 1m 44s to end; res/vir/avail/free/total mem 49.92MB/69.59MB/25.87GB/10.40GB/33.44GB
12:54:49 [INFO] 525 iterations, 20s, 26.20 iterations/s, 38.17 ms/iterations; 17.50% done, 1m 34s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:54:59 [INFO] 791 iterations, 30s, 26.32 iterations/s, 38.00 ms/iterations; 26.37% done, 1m 23s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:09 [INFO] 1,058 iterations, 40s, 26.40 iterations/s, 37.88 ms/iterations; 35.27% done, 1m 13s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:19 [INFO] 1,325 iterations, 50s, 26.45 iterations/s, 37.81 ms/iterations; 44.17% done, 1m 3s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:29 [INFO] 1,593 iterations, 1m 0s, 26.50 iterations/s, 37.74 ms/iterations; 53.10% done, 53s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:39 [INFO] 1,861 iterations, 1m 10s, 26.54 iterations/s, 37.68 ms/iterations; 62.03% done, 42s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:49 [INFO] 2,128 iterations, 1m 20s, 26.55 iterations/s, 37.66 ms/iterations; 70.93% done, 32s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:59 [INFO] 2,384 iterations, 1m 30s, 26.44 iterations/s, 37.82 ms/iterations; 79.47% done, 23s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:56:09 [INFO] 2,642 iterations, 1m 40s, 26.37 iterations/s, 37.91 ms/iterations; 88.07% done, 13s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:56:19 [INFO] 2,900 iterations, 1m 50s, 26.32 iterations/s, 38.00 ms/iterations; 96.67% done, 3s to end; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:56:23 [INFO] Completed.
12:56:23 [INFO] Elapsed: 1m 54s [3,000 iterations, 26.29 iterations/s, 38.04 ms/iterations]; res/vir/avail/free/total mem 55.23MB/73.79MB/25.87GB/10.40GB/33.44GB

This is a demo of the change, just to make sure this is something we want.
In particular, this breaks arrow::arrow_writer::tests::*_bloom_filter because they expect to read the Bloom Filters from memory at the end, except they aren't there anymore. (Edit: this was wrong, see comment.)

So if this looks good to you, I'll add a field in WriterProperties to switch between the old behavior (all Bloom Filters at the end) and this one (interleaved Bloom Filters). What should I call it? (Edit: done.)

Are there any user-facing changes?

The layout of output files changes significantly. This may have a negative performance effect on readers expecting data locality, as Bloom Filters are now scattered across the file.

This required changes to the flushed_row_groups return type (Arc<T> to T) and to OnCloseRowGroup, as we now need to mutate row groups while SerializedRowGroupWriter is "live" instead of only at the end in write_metadata() (which used to leave the structure in an inconsistent state, but that didn't matter because only close() called it).
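
The ownership change described above can be illustrated with a small sketch. `RowGroupMeta` and `set_offset_shared` are hypothetical stand-ins, not the crate's types; the point is only that metadata behind an `Arc` cannot be updated in place while another handle exists, whereas owned metadata can.

```rust
use std::sync::Arc;

/// Toy stand-in for row group metadata (not the parquet crate's type).
#[derive(Debug, Default)]
struct RowGroupMeta {
    bloom_filter_offset: Option<u64>,
}

/// Tries to set the offset through an `Arc`. This succeeds only while no
/// other `Arc` points at the same metadata.
fn set_offset_shared(meta: &mut Arc<RowGroupMeta>, offset: u64) -> bool {
    match Arc::get_mut(meta) {
        Some(inner) => {
            inner.bloom_filter_offset = Some(offset);
            true
        }
        None => false,
    }
}

fn main() {
    // Shared ownership: a second handle blocks in-place mutation.
    let mut shared = Arc::new(RowGroupMeta::default());
    let other_handle = Arc::clone(&shared);
    println!("mutated through Arc: {}", set_offset_shared(&mut shared, 128));
    drop(other_handle);

    // Plain ownership: the writer can update the metadata directly.
    let mut owned = vec![RowGroupMeta::default()];
    owned[0].bloom_filter_offset = Some(128);
    println!("owned metadata: {:?}", owned[0]);
}
```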

progval added 2 commits June 10, 2024 15:03
This allows Bloom filters to not be saved in memory, which can save
significant space when writing long files
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 10, 2024

progval commented Jun 10, 2024

Based on the test failures, it seems the Bloom Filters are either not written, or not picked up by the readers. Not sure why that is.

alamb commented Jun 10, 2024

Thank you @progval

cc @Ted-Jiang and @jimexist

I think there is a tradeoff:

  • Writing all the bloom filters at the end requires them to be buffered (which you point out)
  • Writing all the bloom filters at the end means they are contiguous and thus the reader can fetch multiple bloom filters in a single IO (which is important if reading from something like S3)

Thus, given there is a tradeoff, it seems like we should at least offer a config setting for where to write the bloom filters.
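
The read-side half of this tradeoff can be sketched as range coalescing: a reader that merges adjacent byte ranges needs one request for contiguous filters but one per filter when they are scattered. This is a minimal illustration, not how any particular reader works; `coalesce`, `max_gap`, and the sizes in `main` are assumptions made up for the example.

```rust
/// A byte range `[start, end)` in the file.
type Range = (u64, u64);

/// Merges ranges whose gap is at most `max_gap` bytes and returns the ranges
/// a reader would actually request. `max_gap` is a hypothetical tuning knob.
fn coalesce(mut ranges: Vec<Range>, max_gap: u64) -> Vec<Range> {
    ranges.sort_unstable();
    let mut merged: Vec<Range> = Vec::new();
    for (start, end) in ranges {
        if let Some(last) = merged.last_mut() {
            if start <= last.1.saturating_add(max_gap) {
                // Close enough to the previous range: extend it.
                last.1 = last.1.max(end);
                continue;
            }
        }
        merged.push((start, end));
    }
    merged
}

fn main() {
    let filter_len: u64 = 1_000; // assumed filter size in bytes
    let row_group_len: u64 = 1_000_000; // assumed row group size in bytes

    // Filters at the end: three filters back to back after all row groups.
    let base = 3 * row_group_len;
    let at_end: Vec<Range> = (0..3u64)
        .map(|i| (base + i * filter_len, base + (i + 1) * filter_len))
        .collect();

    // Filters interleaved: each filter follows its own row group.
    let interleaved: Vec<Range> = (0..3u64)
        .map(|i| {
            let start = (i + 1) * row_group_len + i * filter_len;
            (start, start + filter_len)
        })
        .collect();

    println!("at end: {} request(s)", coalesce(at_end, 0).len());
    println!("interleaved: {} request(s)", coalesce(interleaved, 0).len());
}
```

A larger `max_gap` lets the reader trade extra bytes fetched for fewer requests, which matters most on high-latency stores such as S3.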

I don't know if the parquet bloom filter spec dictates where the bloom filters should be written or if the ecosystem (aka parquet-java) implicitly requires them in a particular location.

progval added 2 commits June 10, 2024 22:10
When using BloomFilterPosition::AfterRowGroup this was only writing the Bloom Filter
offset to a temporary clone of the metadata, causing the Bloom Filter to never
be seen by readers
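
The bug described in this commit message follows a common pattern, sketched here with a hypothetical `ColumnMeta` type (not the crate's metadata struct): an update applied to a temporary clone is lost when the clone is dropped, so whatever is serialized later never sees it.

```rust
/// Toy stand-in for a column chunk's metadata (not the parquet crate's type).
#[derive(Clone, Debug, Default)]
struct ColumnMeta {
    bloom_filter_offset: Option<u64>,
}

/// Buggy: records the offset on a temporary clone, which is then discarded.
fn record_offset_on_clone(meta: &ColumnMeta, offset: u64) {
    let mut copy = meta.clone();
    copy.bloom_filter_offset = Some(offset);
    drop(copy); // the only place the offset was recorded is thrown away
}

/// Fixed: records the offset on the metadata that will be serialized.
fn record_offset_in_place(meta: &mut ColumnMeta, offset: u64) {
    meta.bloom_filter_offset = Some(offset);
}

fn main() {
    let mut meta = ColumnMeta::default();
    record_offset_on_clone(&meta, 4096);
    println!("after buggy write: {:?}", meta.bloom_filter_offset);
    record_offset_in_place(&mut meta, 4096);
    println!("after fixed write: {:?}", meta.bloom_filter_offset);
}
```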
@progval progval force-pushed the interleave-bloom branch from f3e7e78 to 83b475e Compare June 10, 2024 20:18

progval commented Jun 10, 2024

> Thus, given there is a tradeoff, it seems like we should at least offer a config setting for where to write the bloom filters.

Indeed, done.

I believe my changes should make it easy to add an API to allow writers to trigger flushing of Bloom Filters, so they can pick a middle-ground themselves by writing all Bloom Filters for a group of row groups next to each other.
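
As a rough illustration of that middle ground, the sketch below models a purely hypothetical "flush every N row groups" policy (no such API is claimed to exist in the crate; `flush_every` and its parameters are made up for this example). It assumes one filter per row group and reports how many filters are buffered at peak and how many contiguous runs of filters end up in the file.

```rust
/// For a hypothetical "flush every `group` row groups" policy, returns
/// (peak number of filters buffered, number of contiguous filter runs).
/// Assumes one Bloom filter per row group.
fn flush_every(row_groups: usize, group: usize) -> (usize, usize) {
    assert!(group > 0, "group must be at least 1");
    let peak_buffered = group.min(row_groups);
    let runs = (row_groups + group - 1) / group; // ceiling division
    (peak_buffered, runs)
}

fn main() {
    let row_groups = 3000; // assumed number of row groups
    for &group in &[1, 100, row_groups] {
        let (peak, runs) = flush_every(row_groups, group);
        println!(
            "flush every {:>4}: peak {:>4} filters buffered, {:>4} contiguous run(s)",
            group, peak, runs
        );
    }
}
```

A group size of 1 corresponds to writing filters after every row group, and a group size equal to the number of row groups corresponds to writing them all at the end.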

> I don't know if the parquet bloom filter spec dictates where the bloom filters should be written

The way I read it, the spec allows them to be anywhere, with any layout we like.

> or if the ecosystem (aka parquet-java) implicitly requires them in a particular location

I don't know about this

@alamb alamb left a comment

Thank you @progval -- this looks very nice (as always 🙏 )

The only thing I think needs to be changed is removing the new dependencies. Otherwise this PR looks ready to me

@@ -68,6 +68,9 @@ twox-hash = { version = "1.6", default-features = false }
paste = { version = "1.0" }
half = { version = "2.1", default-features = false, features = ["num-traits"] }

dsi-progress-logger = { version = "0.2.4", optional = true }

Contributor

Could you please remove these new dependencies (even though I do realize they are optional and won't be activated very often)?

I think they will add some ongoing maintenance cost (keeping the dependencies updated) which I would prefer to avoid if possible

@progval progval Jun 13, 2024

How do you feel about depending only on sysinfo to display the RAM usage? It has a small set of dependencies

Contributor

I think it would be ok

@progval progval Jun 13, 2024

Done. It now looks like this:

$ cargo run --release --features="cli sysinfo" --example write_parquet -- /tmp/test.parquet
2024-06-13 21:45:40 Writing 1000 batches of 1000000 rows. RSS = 1MB
2024-06-13 21:45:50 Iteration 260/1000. RSS = 50MB
2024-06-13 21:46:00 Iteration 518/1000. RSS = 50MB
2024-06-13 21:46:10 Iteration 772/1000. RSS = 50MB
2024-06-13 21:46:19 Done. RSS = 17MB

$ cargo run --release --features="cli sysinfo" --example write_parquet -- /tmp/test.parquet --bloom-filter-position end
2024-06-13 21:46:29 Writing 1000 batches of 1000000 rows. RSS = 1MB
2024-06-13 21:46:39 Iteration 267/1000. RSS = 451MB
2024-06-13 21:46:49 Iteration 533/1000. RSS = 791MB
2024-06-13 21:46:59 Iteration 799/1000. RSS = 1151MB
2024-06-13 21:47:07 Done. RSS = 1055MB

use parquet::errors::Result;
use parquet::file::properties::WriterProperties;

fn main() -> Result<()> {

Contributor

Perhaps we could add some comments here explaining what this example is trying to show

Contributor Author

Done, along with a Clap argument parser:

$ cargo run --release --features="cli sysinfo" --example write_parquet -- -h
Writes sequences of integers, with a Bloom Filter, while logging timing and memory usage

Usage: write_parquet [OPTIONS] <PATH>

Arguments:
  <PATH>  Path to the file to write

Options:
      --iterations <ITERATIONS>                        Number of batches to write [default: 1000]
      --batch <BATCH>                                  Number of rows in each batch [default: 1000000]
      --bloom-filter-position <BLOOM_FILTER_POSITION>  Where to write Bloom Filters [default: after-row-group] [possible values: end, after-row-group]
  -h, --help                                           Print help
  -V, --version                                        Print version

@alamb alamb left a comment

Thank you @progval -- this PR now looks good to me

@alamb alamb added the api-change Changes to the arrow API label Jun 21, 2024

alamb commented Jun 21, 2024

I merged this branch up to main to resolve conflicts, and I double-checked that this is an additive API (rather than an API change), so I think it can be merged for inclusion in the next minor release.

progval commented Jun 21, 2024

OnCloseRowGroup grew a type parameter and flushed_row_groups changed return type (Arc<RowGroupMetaData> -> RowGroupMetaData)

@alamb alamb merged commit 3930d5b into apache:master Jun 21, 2024
17 checks passed
alamb added a commit that referenced this pull request Jun 21, 2024

alamb commented Jun 21, 2024

> OnCloseRowGroup grew a type parameter and flushed_row_groups changed return type (Arc<RowGroupMetaData> -> RowGroupMetaData)

Shoot -- you are right, I merged this PR by accident. I will revert this change in #5932 and open a new PR to re-add it, marked correctly with api-change.

alamb commented Jun 21, 2024

PR with the changes re-introduced: #5933

alamb commented Jul 2, 2024

This was reverted and thus will not be present in the 52.1.0 release (#5905).

alamb commented Jul 2, 2024

I will merge #5933 when we open for breaking API changes

swhmirror pushed a commit to SoftwareHeritage/swh-graph that referenced this pull request Dec 10, 2024
to avoid OOMs due to storing Bloom Filters in memory while writing,
see apache/arrow-rs#5860
Labels
parquet Changes to the parquet crate

Successfully merging this pull request may close these issues.

parquet::ArrowWriter show allow writing Bloom filters before the end of the file
2 participants