Encountered an issue when importing an empty variable-size binary layout array (e.g., a string array) from Java Arrow.
Java Arrow and arrow-rs differ in how they compute the length of the data buffer: apache/arrow#41610 (comment)
This is how Java Arrow imports a Utf8 array:
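try (ArrowBuf offsets = importOffsets(type, VarCharVector.OFFSET_WIDTH)) {
  // First offset; read as-is, not assumed to be 0.
  final int start = offsets.getInt(0);
  // Last offset, at byte position length * OFFSET_WIDTH.
  final int end = offsets.getInt(fieldNode.getLength() * (long) VarCharVector.OFFSET_WIDTH);
  // The data buffer length is the difference, so it is 0 for an empty array.
  final int len = end - start;
  ...
}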
So even if the offset buffer is not initialized, for an empty array whose offset buffer has a single element, start and end are the same value, so end - start is always 0, which is the length of the data buffer. That is why the added roundtrip tests pass.
But arrow-rs takes the last value of the offset buffer as the length of the data buffer, i.e., end. If that value is not initialized to zero, the computed data buffer length is incorrect.
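To make the difference concrete, here is a minimal Rust sketch of the two calculations described above. It is not the actual arrow-rs or Java Arrow code: the function names are hypothetical, and a plain `&[i32]` slice stands in for the imported offset buffer.

```rust
/// Data buffer length computed as last offset minus first offset
/// (the calculation in the Java Arrow snippet above).
/// `offsets` must hold at least `array_len + 1` entries.
fn data_len_end_minus_start(offsets: &[i32], array_len: usize) -> i32 {
    offsets[array_len] - offsets[0]
}

/// Data buffer length taken from the last offset alone
/// (the behavior this issue describes for arrow-rs).
/// `offsets` must hold at least `array_len + 1` entries.
fn data_len_last_offset(offsets: &[i32], array_len: usize) -> i32 {
    offsets[array_len]
}
```

For an empty array whose single offset slot holds an arbitrary value such as 7, `data_len_end_minus_start(&[7], 0)` gives 0 while `data_len_last_offset(&[7], 0)` gives 7. With offsets that start at 0, for example `[0, 3, 5]` for a two-element array, both give 5.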
This is what I found in the spec about the first offset value:
Generally the first slot in the offsets array is 0, and the last slot is the length of the values array.
When serializing this layout, we recommend normalizing the offsets to start at 0.
It looks like the first value doesn't have to be 0, although it generally is. So Java Arrow's approach seems (more) correct.
To Reproduce
Expected behavior
Additional context
When serializing this layout, we recommend normalizing the offsets to start at 0.
My reading is that this is for IPC and not FFI. It makes sense to normalise when serializing, as otherwise you are encoding a lot of "invisible" value data. For FFI I'm not sure you want to pay this cost.
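As a rough illustration of what normalising involves, the Rust sketch below rebases a set of offsets so that the first one is 0 and reports which byte range of the values buffer is actually referenced. The function name is hypothetical and this is not how arrow-rs or the IPC writer implements it; it only shows that normalising means building a new offsets buffer, which is the cost in question.

```rust
use std::ops::Range;

/// Rebase `offsets` so the first entry becomes 0, and return the byte range
/// of the values buffer that the original offsets refer to.
/// Assumes `offsets` is non-empty and its entries are non-negative and
/// non-decreasing, as in a valid variable-size binary layout.
fn normalize_offsets(offsets: &[i32]) -> (Vec<i32>, Range<usize>) {
    let start = offsets[0];
    let end = offsets[offsets.len() - 1];
    // Every offset is rewritten, so this allocates a new offsets buffer.
    let rebased = offsets.iter().map(|o| o - start).collect();
    (rebased, start as usize..end as usize)
}
```

For offsets `[3, 5, 9]` this returns `[0, 2, 6]` together with the range `3..9`, so a serializer would only need to write those 6 bytes of value data rather than all 9.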