Tips on speeding up serialization to parquet using RecordBatch #4755
-
In my framework, I export analysis results to a parquet file. Serializing 163 thousand rows with 81 columns takes just under ten minutes. The exact export code is here. According to the log timestamps, all of the computations from line 80 through line 334 finish in less than one second, and it then takes just under eight minutes to serialize the data to parquet. That seems very slow... Is there any option I can pass to the ArrowWriter or the RecordBatch to speed up the serialization? Thanks
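For reference, the writer is created along these lines — a minimal sketch rather than the exact export code, with a single hypothetical `value` column standing in for the real 81, a made-up output file name, and Brotli as the codec currently in use (it assumes a recent `parquet` crate where `Compression::BROTLI` carries a level, built with the `arrow` and `brotli` features enabled):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::{BrotliLevel, Compression};
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One hypothetical Int64 column standing in for the real 81-column schema.
    let schema = Arc::new(Schema::new(vec![Field::new("value", DataType::Int64, false)]));
    let column: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_new(schema.clone(), vec![column])?;

    // The codec (Brotli here) is set through WriterProperties that are handed
    // to the ArrowWriter when it is created.
    let props = WriterProperties::builder()
        .set_compression(Compression::BROTLI(BrotliLevel::default()))
        .build();

    let file = File::create("results.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```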
-
Have you tried using a profiling tool to see where the time is being spent? On Linux I can recommend linux perf combined with hotspot. I also presume you are compiling in release mode? Generally, things to check:
-
Thanks for the tip on using flamegraph. It seems that the Brotli compression accounts for about 50% of the CPU cycles, and about 80% of the overall run is spent in Brotli. I'll try switching to another compression algorithm and see how that performs.
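Concretely, the plan is just to swap the codec in the writer properties and keep everything else about the export unchanged — a sketch, assuming a `parquet` crate version whose codec variants carry a level and with the `zstd` feature enabled:

```rust
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

// Same builder as before, with Zstd swapped in for Brotli.
fn zstd_props() -> WriterProperties {
    WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        .build()
}
```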
Zstd is also significantly faster, and the files are not much larger (5.4 MB instead of 4.9 MB). One will note in the SVG below that there is now a visible band corresponding to the actual creation of the data.