feat: Parquet reader builder supports building multiple ranges to read #3841
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
What's changed and what's your intention?
This PR defines the `FileRange` struct for parquet files and implements a method, `build_file_ranges()`, to build ranges from a parquet file. A `FileRange` contains a range of rows to read from a parquet file, so different ranges can be read in parallel later. For now, a `FileRange` is exactly one row group.

To reuse code, this PR:
- Refactors `ParquetReader`. It now adds a `RowGroupReader` to read `Batch`es from a row group, and the `ParquetReader` invokes the `RowGroupReader` to read the parquet file.
- Adds a `FileRangeContext` for all ranges of the same parquet file. This PR also moves the `precise_filter()` method to the context, as the context contains all the inputs the method needs.

The builder now uses a method, `build_reader_input()`, to construct the `FileRangeContext` and the row groups to read. Both `build()` and `build_file_ranges()` can reuse this method.

This PR also fixes some remaining issues and improves the `ReadFormat` helper struct:
- Computes `projection_indices` in advance.
- Computes `field_id_to_projected_index` in advance. `convert_record_batch()` then doesn't require `&mut self`. This is necessary for the `FileRangeContext`, as we have to share the `ReadFormat`.
- Wraps `SimpleFilterEvaluator` into a `SimpleFilterContext`. The context gets the column info in advance, from the expected region metadata. This ensures the context can use the correct column id to find the column in the `ReadFormat`.
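As a rough illustration of the design described above, the sketch below shows one range per row group, all sharing a single immutable context whose lookup table is computed up front. Every type and function name ending in `Sketch` is hypothetical and is not the code in this PR; the real `FileRange`, `FileRangeContext`, and `ReadFormat` carry more state (readers, filters, region metadata) that is omitted here.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for `ReadFormat`: the lookup table is built once in
/// the constructor, so later lookups only need `&self`.
#[derive(Debug)]
pub struct ReadFormatSketch {
    field_id_to_projected_index: HashMap<u32, usize>,
}

impl ReadFormatSketch {
    /// Precomputes the mapping from a column id to its position in the projection.
    pub fn new(projected_column_ids: &[u32]) -> Self {
        let field_id_to_projected_index = projected_column_ids
            .iter()
            .enumerate()
            .map(|(index, id)| (*id, index))
            .collect();
        Self {
            field_id_to_projected_index,
        }
    }

    /// Read-only lookup; no `&mut self` is required, so the value can be shared.
    pub fn projected_index(&self, column_id: u32) -> Option<usize> {
        self.field_id_to_projected_index.get(&column_id).copied()
    }
}

/// Hypothetical stand-in for `FileRangeContext`: state shared by every range
/// of the same parquet file.
#[derive(Debug)]
pub struct FileRangeContextSketch {
    pub file_path: String,
    pub read_format: ReadFormatSketch,
}

/// Hypothetical stand-in for `FileRange`: here, exactly one row group.
#[derive(Debug, Clone)]
pub struct FileRangeSketch {
    pub context: Arc<FileRangeContextSketch>,
    pub row_group_idx: usize,
}

/// Builds one range per selected row group; all ranges share the same context.
pub fn build_file_ranges_sketch(
    context: FileRangeContextSketch,
    row_groups: &[usize],
) -> Vec<FileRangeSketch> {
    let context = Arc::new(context);
    row_groups
        .iter()
        .map(|&row_group_idx| FileRangeSketch {
            context: Arc::clone(&context),
            row_group_idx,
        })
        .collect()
}
```

Because the shared context is never mutated after construction, each range in this sketch could be handed to a different task and still resolve columns through the same `ReadFormatSketch`, which is the property the `&self` change is meant to enable.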
Checklist