-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of casting StringView
/BinaryView
to DictionaryArray
#5872
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @XiangpengHao --
Since the rationale for these changes is performance, I think we should have benchmark results showing how much faster it is.
Thus, can you please make a PR with a banchmark (a single PR with benchmarks both for this change and the one in #5871)
Then we'll use the benchmarks to verify that this PR really does improve performance (I am sure it will, but we should double check)
StringView
/BinaryView
to DictionaryArray
I took the liberty of merging this PR up from main. Running performance benchmarks on it now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I measure a 30% performance improvement on this branch. Thank you @XiangpengHao
++ critcmp master dict-to-view
group dict-to-view master
----- ------------ ------
cast dict to string view 1.00 69.0±2.58µs ? ?/sec 1.02 70.6±12.48µs ? ?/sec
cast string view to dict 1.00 203.3±0.32µs ? ?/sec 1.34 272.1±0.69µs ? ?/sec
Does this PR close #5861 I wonder 🤔 |
I guess it should. Are there any remaining items? |
Not that I know of |
🚀 |
Which issue does this PR close?
Closes #5861 . This pr should only be reviewed/merged after #5871 is merged.
Rationale for this change
Cast view types to dictionary arrays, the opposite of #5871
Note that casting was handled by pack_byte_to_dictionary:
https://github.com/XiangpengHao/arrow-rs/blob/dict-to-view/arrow-cast/src/cast/dictionary.rs#L307
But it takes two steps: Utf8View -> Utf8 -> Dictionary, which causes two copies.
This PR only do one copy (directly from Utf8View -> Dictionary<>)
What changes are included in this PR?
Are there any user-facing changes?