support avro to record batch directly #768

zhuliquan · 2024-10-28T10:35:41Z

Arroyo is a very good library, but we ran into some performance issues when using it: we found that it performs a large amount of decoding work.

I analyzed the code
https://github.com/zhuliquan/arroyo/blob/776965ae9d6ee818595197288d5cca379c564368/crates/arroyo-formats/src/de.rs#L338-L355
We found that Avro data consumed from Kafka is first converted to an Avro Value, then to a JSON Value, then serialized to bytes, and finally converted to a RecordBatch. My question is: why not convert from Avro to RecordBatch directly? arrow-rs also supports the Avro format (https://github.com/apache/arrow-rs/tree/master/arrow-avro).
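
To make the question concrete, here is a minimal sketch of what a direct path could look like: decoded values are appended straight into per-column buffers, with no JSON value or byte buffer in between. Everything in it is hypothetical and uses only the standard library. `AvroValue`, `Column`, `Batch`, `FieldType` and `records_to_batch` are toy stand-ins, not the apache-avro, arrow-rs or Arroyo APIs.

```rust
// Toy stand-in for a decoded Avro value; NOT the apache-avro `Value` type.
#[derive(Debug, Clone, PartialEq)]
enum AvroValue {
    Long(i64),
    Str(String),
    Null,
}

// Toy stand-in for one Arrow column; real code would use arrow-rs array builders.
#[derive(Debug, PartialEq)]
enum Column {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

// Hypothetical columnar batch: one named column per schema field.
#[derive(Debug, PartialEq)]
struct Batch {
    columns: Vec<(String, Column)>,
}

// Hypothetical field types for the toy schema.
#[derive(Debug, Clone, Copy)]
enum FieldType {
    Int64,
    Utf8,
}

// Append each record's fields directly into the column buffers.
fn records_to_batch(
    schema: &[(&str, FieldType)],
    records: &[Vec<AvroValue>],
) -> Result<Batch, String> {
    // Allocate one empty column per schema field.
    let mut columns: Vec<(String, Column)> = schema
        .iter()
        .map(|(name, ty)| {
            let col = match ty {
                FieldType::Int64 => Column::Int64(Vec::with_capacity(records.len())),
                FieldType::Utf8 => Column::Utf8(Vec::with_capacity(records.len())),
            };
            (name.to_string(), col)
        })
        .collect();

    for (row, record) in records.iter().enumerate() {
        if record.len() != columns.len() {
            return Err(format!(
                "row {row}: expected {} fields, got {}",
                columns.len(),
                record.len()
            ));
        }
        for ((name, col), value) in columns.iter_mut().zip(record) {
            match (col, value) {
                (Column::Int64(v), AvroValue::Long(x)) => v.push(Some(*x)),
                (Column::Utf8(v), AvroValue::Str(s)) => v.push(Some(s.clone())),
                // Nulls become missing entries in whichever column they land in.
                (Column::Int64(v), AvroValue::Null) => v.push(None),
                (Column::Utf8(v), AvroValue::Null) => v.push(None),
                (_, other) => {
                    return Err(format!(
                        "row {row}, field {name}: unexpected value {other:?}"
                    ))
                }
            }
        }
    }
    Ok(Batch { columns })
}
```

A real implementation would also need to handle the rest of the Avro type system (nested records, arrays, maps, unions, logical types), which is where most of the difficulty lies.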

mwylde · 2024-10-28T21:00:24Z

The answer has two parts:

1. When we built Avro support into Arroyo, the arrow-rs Avro implementation was not complete enough to use, so we took a bit of a shortcut with the avro-to-json approach.
2. It's not straightforward to support all Avro features as SQL data types (for example, arbitrary unions), so today, for any field that has an unsupported data type, we use "raw json" encoding: we re-encode those columns as JSON and make them available for querying with JSON functions. This allows us to support any Avro schema.

Assuming we can find a pathway to support (2) with the arrow-rs implementation (and it's reasonably complete/fast) we can move to that. The approach might look like what we already do for JSON in our arrow-rs fork: https://github.com/ArroyoSystems/arrow-rs/blob/52.1.0/json/arrow-json/src/reader/json_array.rs
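
As an illustration of the "raw json" idea described above, the following is a rough sketch under stated assumptions, not Arroyo's actual implementation. Fields with a supported type get a typed cell, while fields marked `RawJson` are re-encoded as JSON text that JSON functions could later query. The names `AvroValue`, `FieldKind`, `Cell`, `to_json` and `encode_cell` are hypothetical, and the hand-written encoder only covers this toy value type.

```rust
// Toy stand-in for a decoded Avro value; NOT the apache-avro `Value` type.
#[derive(Debug, Clone)]
enum AvroValue {
    Null,
    Long(i64),
    Str(String),
    Array(Vec<AvroValue>),
    Record(Vec<(String, AvroValue)>),
}

// Hypothetical per-field decision made once from the schema.
enum FieldKind {
    Int64,
    Utf8,
    RawJson, // type has no direct SQL equivalent, e.g. an arbitrary union
}

// Hypothetical output cell for one field of one row.
#[derive(Debug, PartialEq)]
enum Cell {
    Int64(i64),
    Utf8(String),
    RawJson(String),
}

// Escape a string for use inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

// Re-encode a value as compact JSON text.
fn to_json(v: &AvroValue) -> String {
    match v {
        AvroValue::Null => "null".to_string(),
        AvroValue::Long(n) => n.to_string(),
        AvroValue::Str(s) => format!("\"{}\"", escape_json(s)),
        AvroValue::Array(items) => {
            let parts: Vec<String> = items.iter().map(to_json).collect();
            format!("[{}]", parts.join(","))
        }
        AvroValue::Record(fields) => {
            let parts: Vec<String> = fields
                .iter()
                .map(|(k, v)| format!("\"{}\":{}", escape_json(k), to_json(v)))
                .collect();
            format!("{{{}}}", parts.join(","))
        }
    }
}

// Typed cell for supported fields, JSON text for everything else.
// Returns None when a value does not match its declared typed kind.
fn encode_cell(kind: &FieldKind, value: &AvroValue) -> Option<Cell> {
    match (kind, value) {
        (FieldKind::Int64, AvroValue::Long(n)) => Some(Cell::Int64(*n)),
        (FieldKind::Utf8, AvroValue::Str(s)) => Some(Cell::Utf8(s.clone())),
        (FieldKind::RawJson, v) => Some(Cell::RawJson(to_json(v))),
        _ => None,
    }
}
```

The point of the sketch is that only the `RawJson` fields pay the JSON re-encoding cost, so a direct decoder would need some equivalent escape hatch to keep supporting arbitrary schemas.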

mwylde added the enhancement New feature or request label Nov 18, 2024
