-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support reading and writingStringView
and BinaryView
in parquet (part 2)
#5557
Conversation
17695d3
to
1c7260e
Compare
The parquet build for wasm32 check always failed, but it works on my machine... |
Thanks @ariesdevil -- I think @tustvold is away for a few days. I will try and give this PR a look over the next day or two. Very exciting! |
d2f1a30
to
fdd2e48
Compare
StringView
and BinaryView
in parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ariesdevil -- I think this is looking close to something we can merge 🙏
To merge this, I think we need:
- Benchmarks for reading/writing StringView / BinaryViews
- Cover a few more cases for round tripping
I think there is quite a bit more performance to be had as well by avoiding string copies. This PR will copy the data I think twice (once out of parquet buffer and once out of the offset buffer).
It would be nice to avoid at least one of those copies prior to merge, but I also think once we have the test coverage it would be ok to merge this PR and then do additional optimizations as follow on PRs
We can add benchmarks here:
- reading
arrow-rs/parquet/benches/arrow_reader.rs
Lines 867 to 868 in 7e134f4
// string benchmarks //============================== - writing:
arrow-rs/parquet/benches/arrow_writer.rs
Lines 382 to 392 in 7e134f4
let batch = create_string_bench_batch(4096, 0.25, 0.75).unwrap(); group.throughput(Throughput::Bytes( batch .columns() .iter() .map(|f| f.get_array_memory_size() as u64) .sum(), )); group.bench_function("4096 values string", |b| { b.iter(|| write_batch(&batch).unwrap()) });
End to end round trip tests:
This is really nice step forward -- thank you @ariesdevil |
f37c3dd
to
f65807f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ariesdevil -- I think this PR looks like a really nice incremental step.
As I think we have identified, it can be significantly improved performance wise but having the basic tests and benchmarks in place we are in a good space to optimize it.
I left some other smaller style / documentation comments that might be nice to cleanup, but I don't think they are strictly required
Let's wait another day for @tustvold to see if he wants to review (i think he is still not back yet) but otherwise we'll merge this PR and add additional features as a follow on PR
a22ee14
to
d286521
Compare
I plan to give this another review today |
The integration test https://github.com/apache/arrow-rs/actions/runs/8553472637/job/23436722309?pr=5557 seems to have failed for reasons unrelated to this PR. I have restarted it
|
I am sorry @ariesdevil -- I need more time and a fresh set of eyes to review the new implementation in this PR. I will find time over the next few days. |
Hi @alamb ,the new implementation has indeed changed a lot. It should have been implemented in another PR, but the original performance is really poor. Thanks for your patience and have a nice weekend. |
I wrote new tests in #5639 |
fef3c9f
to
cbf08f4
Compare
I am on vacation this week but I hope to find some time later today or tomorrow to give this another look. Thanks @ariesdevil |
No need to rush, enjoy your vacation, just so I can have time to rethink how to continue optimization 😃 |
I see some new updates -- I plan to look at this PR later this week (probably won't be until Wednesday) |
Hi @ariesdevil, thanks for the great work! I'm looking at this PR these days and trying to push it forward. However, I do feel the code change is too large for a single PR that it is quite difficult to understand every detail of the implementation. I suggest we break the PR into even smaller pieces and incrementally add the features/optimizations so it poses less mental burdens on the reviewers, what do you think? I'll try to consolidate a few low-hanging fruit changes and try to get them merged first; let me know your thoughts! |
Hi @XiangpengHao, thanks for your reply. I don't think this PR is very large, the previous reviewers are already very familiar with this PR, this PR has been submitted for almost three months, and the reviewer has had enough time to get familiar with it. |
Indeed -- I am sorry this PR is stalled @ariesdevil -- it has been very hard for me to find enough contiguous time to get familar enough with this PR and really understand it (and ensure the test coverage is adequate, etc). @ariesdevil do you think we should attempt to merge this entire PR at once? I do think it would help speed up review if we could break it into some smaller PRs |
Thank you for understanding -- I feel bad whenever PRs get stuck.
That is awesome. THank you. I will make every effort to help get this over the line and now that @XiangpengHao is helping I am optimistic we can get it in shortly |
#5877 is merged now! |
5730aa0
to
7fe2847
Compare
closed by #5877 |
Which issue does this PR close?
Closes #5530
Rationale for this change
Use view_buffer instead of offset_buffer for view type parquet reader
What changes are included in this PR?
Use view_buffer instead of offset_buffer for view type parquet reader
Are there any user-facing changes?
Yes