Compute data buffer length by using start and end values in offset buffer #5756

Closed
viirya opened this issue May 11, 2024 · 2 comments · Fixed by #5741 or #5964
Labels
arrow Changes to the arrow crate bug


viirya commented May 11, 2024

Describe the bug

Encountered an issue when importing an empty variable-size binary layout array (e.g., a string array) from Java Arrow.

There is a difference between Java Arrow and arrow-rs when computing the length of the data buffer: apache/arrow#41610 (comment)

This is how Java Arrow imports a Utf8 array:

try (ArrowBuf offsets = importOffsets(type, VarCharVector.OFFSET_WIDTH)) {
      final int start = offsets.getInt(0);
      final int end = offsets.getInt(fieldNode.getLength() * (long) VarCharVector.OFFSET_WIDTH);
      final int len = end - start;
      ...
}

So even if the offset buffer is not initialized, for an empty array with a one-element offset buffer, end - start is always 0, which is the length of the data buffer. That is why the added roundtrip tests pass.

But arrow-rs takes the last value of the offset buffer as the length of the data buffer, i.e., end. If that value is not initialized to zero, the computed length of the data buffer is incorrect.
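
To make the difference concrete, here is a minimal sketch in plain Rust. It does not use the arrow crate; both function names are hypothetical and only illustrate the two ways of deriving the data buffer length from an i32 offset buffer, with `len` being the array length.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not an arrow-rs API): data buffer length computed
/// as `end - start`, mirroring the Java Arrow snippet above.
/// `len` is the number of elements, so the end offset lives at index `len`.
fn data_len_end_minus_start(offsets: &[i32], len: usize) -> Option<usize> {
    let start = *offsets.first()?;
    let end = *offsets.get(len)?;
    // A negative difference means the offsets are invalid; report None.
    usize::try_from(end.checked_sub(start)?).ok()
}

/// Hypothetical helper for comparison: trust the end offset alone, which is
/// only correct when the first offset is 0.
fn data_len_last_offset(offsets: &[i32], len: usize) -> Option<usize> {
    usize::try_from(*offsets.get(len)?).ok()
}
```

For an empty array whose single offset slot holds an arbitrary value such as 7, the first function yields 0 while the second yields 7.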

This is what the spec says about the first offset value:

Generally the first slot in the offsets array is 0, and the last slot is the length of the values array.
When serializing this layout, we recommend normalizing the offsets to start at 0.

It looks like the first value doesn't have to be 0, although generally it is. So it seems Java Arrow's approach is (more) correct.



tustvold commented Jun 3, 2024

label_issue.py automatically added labels {'arrow'} from #5741


When serializing this layout, we recommend normalizing the offsets to start at 0.

My reading is that this is for IPC and not FFI. It makes sense to normalise when serializing, as otherwise you are encoding a lot of "invisible" value data. For FFI I'm not sure you want to pay this cost.
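
As a rough illustration of what normalising costs, here is a sketch in plain Rust. It is not how arrow-rs or the IPC writer implements it, and `normalize_offsets` is a hypothetical name. It rebases the offsets to start at 0 and copies only the value bytes they reference, which is the extra work an FFI import would have to pay.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not an arrow-rs API): rebase i32 offsets so the
/// first one is 0 and copy only the referenced slice of the value bytes.
/// Returns None if the offsets are empty, negative, or out of bounds.
fn normalize_offsets(offsets: &[i32], values: &[u8]) -> Option<(Vec<i32>, Vec<u8>)> {
    let start = *offsets.first()?;
    let end = *offsets.last()?;
    let lo = usize::try_from(start).ok()?;
    let hi = usize::try_from(end).ok()?;
    // Drop the "invisible" bytes before `start` and after `end`.
    let data = values.get(lo..hi)?.to_vec();
    // Shift every offset down by the original first offset.
    let rebased = offsets
        .iter()
        .map(|o| o.checked_sub(start))
        .collect::<Option<Vec<i32>>>()?;
    Some((rebased, data))
}
```

With offsets `[2, 7, 12]` over the bytes of `"xxhelloworld"`, this returns offsets `[0, 5, 10]` and the bytes of `"helloworld"`. The copy is proportional to the referenced data, whereas computing `end - start` on import needs no copy at all.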
