feat: Parquet reader builder supports building multiple ranges to read #3841

evenyag · 2024-04-30T03:59:59Z

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

Support parallel scan in mito engine #2806

What's changed and what's your intention?

This PR defines the FileRange struct for parquet files and implements a method, build_file_ranges(), to build ranges from a parquet file. A FileRange contains a range of rows to read from a parquet file, so different ranges can be read in parallel later. For now, a FileRange corresponds to exactly one row group.

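/// A range of rows to read from a parquet file (currently one row group).
pub struct FileRange {}

impl FileRange {
    /// Returns a reader that reads the rows in this range.
    pub(crate) async fn reader(&self) -> Result<RowGroupReader> {}
}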
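
To illustrate why ranges that share one context are convenient for parallel reads, here is a minimal, self-contained sketch. It is not the code from this PR: `SharedContext`, `ToyFileRange`, `build_ranges` and `parallel_sums` are hypothetical names, the "file" is an in-memory vector, and plain threads stand in for the async readers.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the state shared by all ranges of one file.
struct SharedContext {
    /// Rows of the toy "file", grouped by row group.
    row_groups: Vec<Vec<i64>>,
}

/// Hypothetical stand-in for a file range: one row group plus the shared context.
struct ToyFileRange {
    context: Arc<SharedContext>,
    row_group_idx: usize,
}

impl ToyFileRange {
    /// Returns a copy of the rows that belong to this range only.
    fn read(&self) -> Vec<i64> {
        self.context.row_groups[self.row_group_idx].clone()
    }
}

/// Builds one range per row group; every range holds the same context.
fn build_ranges(context: Arc<SharedContext>) -> Vec<ToyFileRange> {
    (0..context.row_groups.len())
        .map(|row_group_idx| ToyFileRange {
            context: context.clone(),
            row_group_idx,
        })
        .collect()
}

/// Reads every range on its own thread and returns the sum of each range,
/// in the same order as the input ranges.
fn parallel_sums(ranges: Vec<ToyFileRange>) -> Vec<i64> {
    let handles: Vec<_> = ranges
        .into_iter()
        .map(|range| thread::spawn(move || range.read().into_iter().sum::<i64>()))
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("reader thread panicked"))
        .collect()
}

fn main() {
    let context = Arc::new(SharedContext {
        row_groups: vec![vec![1, 2, 3], vec![10, 20], vec![100]],
    });
    let sums = parallel_sums(build_ranges(context));
    println!("{:?}", sums); // prints [6, 30, 100]
}
```

Each toy range only needs an index and a reference-counted pointer to the shared state, which is what makes handing ranges to separate workers cheap in this sketch.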

To reuse code, this PR:

- Refactors the ParquetReader: it adds a RowGroupReader that reads Batches from a row group, and the ParquetReader now invokes the RowGroupReader to read the parquet file.
- Defines a struct FileRangeContext that is shared by all ranges of the same parquet file. This PR also moves the precise_filter() method to the context, as the context contains all the inputs the method needs.

Now the builder uses a method build_reader_input() to construct the FileRangeContext and the row groups to read. Both build() and build_file_ranges() reuse this method.

impl ParquetReaderBuilder {
    pub async fn build(&self) -> Result<ParquetReader> {
        let (context, row_groups) = self.build_reader_input().await?;
        ParquetReader::new(context, row_groups).await
    }

    pub async fn build_file_ranges(&self, file_ranges: &mut Vec<FileRange>) -> Result<()> {}

    async fn build_reader_input(&self) -> Result<(FileRangeContextRef, RowGroupMap)> {}
}
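
The reuse pattern above can be sketched in a synchronous toy form. Everything here is an assumption made for illustration: `ToyBuilder`, `ToyContext`, `ToyReader`, `ToyRange` and `ToyRowGroupMap` are hypothetical, and "skip empty row groups" stands in for whatever pruning the real builder performs.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical shared state for all ranges of one file.
struct ToyContext {
    file_name: String,
}

/// Toy stand-in: maps a row group index to the number of rows in it.
type ToyRowGroupMap = BTreeMap<usize, usize>;

/// Toy reader that covers all selected row groups.
struct ToyReader {
    context: Arc<ToyContext>,
    row_groups: ToyRowGroupMap,
}

/// Toy range that covers a single row group.
struct ToyRange {
    context: Arc<ToyContext>,
    row_group_idx: usize,
}

struct ToyBuilder {
    file_name: String,
    rows_per_group: Vec<usize>,
}

impl ToyBuilder {
    /// Shared step: builds the context and selects the row groups to read.
    /// In this sketch, selection only skips empty row groups.
    fn build_reader_input(&self) -> (Arc<ToyContext>, ToyRowGroupMap) {
        let context = Arc::new(ToyContext {
            file_name: self.file_name.clone(),
        });
        let row_groups: ToyRowGroupMap = self
            .rows_per_group
            .iter()
            .copied()
            .enumerate()
            .filter(|(_, rows)| *rows > 0)
            .collect();
        (context, row_groups)
    }

    /// Builds a single reader over all selected row groups.
    fn build(&self) -> ToyReader {
        let (context, row_groups) = self.build_reader_input();
        ToyReader { context, row_groups }
    }

    /// Appends one range per selected row group to `file_ranges`.
    fn build_file_ranges(&self, file_ranges: &mut Vec<ToyRange>) {
        let (context, row_groups) = self.build_reader_input();
        file_ranges.extend(row_groups.into_keys().map(|row_group_idx| ToyRange {
            context: context.clone(),
            row_group_idx,
        }));
    }
}

fn main() {
    let builder = ToyBuilder {
        file_name: "toy.parquet".to_string(),
        rows_per_group: vec![100, 0, 50],
    };
    let reader = builder.build();
    println!(
        "reader for {} covers {} row groups",
        reader.context.file_name,
        reader.row_groups.len()
    );

    let mut ranges = Vec::new();
    builder.build_file_ranges(&mut ranges);
    for range in &ranges {
        println!(
            "range over row group {} of {}",
            range.row_group_idx, range.context.file_name
        );
    }
}
```

Keeping the selection logic in one private method means the sequential reader and the per-range path cannot drift apart in which row groups they consider.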

This PR also fixes some remaining issues and improves the ReadFormat helper struct.

It passes the projection while constructing the struct, so it can:
- compute the projection_indices in advance
- compute the field_id_to_projected_index in advance. Then convert_record_batch() doesn't require &mut self. This is necessary for the FileRangeContext, as we have to share the ReadFormat.

It wraps the SimpleFilterEvaluator in a SimpleFilterContext. The context gets the column info in advance, from the expected region metadata. This ensures the context uses the correct column id to find the column in the ReadFormat.
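
A rough sketch of the two ideas above, under loud assumptions: `ToyReadFormat` and `ToyFilterContext` are hypothetical, rows are plain `i64` slices rather than record batches, and the filter is a simple "greater than or equal" check. The point is only that precomputing the index maps lets every method take `&self`, so one format can sit behind an `Arc`, and that the filter resolves its column id once from the expected metadata.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, simplified read format built from a projection.
struct ToyReadFormat {
    /// Index in the file schema of each projected column.
    projection_indices: Vec<usize>,
    /// Column id -> position in the projected output.
    id_to_projected_index: HashMap<u32, usize>,
}

impl ToyReadFormat {
    /// Computes both lookups up front from the file's column ids and the projection.
    fn new(file_column_ids: &[u32], projection: &[u32]) -> Self {
        let mut projection_indices = Vec::new();
        let mut id_to_projected_index = HashMap::new();
        for (file_idx, column_id) in file_column_ids.iter().enumerate() {
            if projection.contains(column_id) {
                id_to_projected_index.insert(*column_id, projection_indices.len());
                projection_indices.push(file_idx);
            }
        }
        Self {
            projection_indices,
            id_to_projected_index,
        }
    }

    /// Projects one full row; needs only `&self` because nothing is computed lazily.
    fn project_row(&self, row: &[i64]) -> Vec<i64> {
        self.projection_indices.iter().map(|idx| row[*idx]).collect()
    }

    /// Finds the value of a column, by id, in an already projected row.
    fn projected_value(&self, column_id: u32, projected_row: &[i64]) -> Option<i64> {
        self.id_to_projected_index
            .get(&column_id)
            .and_then(|idx| projected_row.get(*idx).copied())
    }
}

/// Hypothetical filter that resolves its column id once, from the expected metadata.
struct ToyFilterContext {
    column_id: u32,
    min_value: i64,
}

impl ToyFilterContext {
    /// Looks up the column by name in the expected metadata (name -> column id).
    fn new(expected: &HashMap<&str, u32>, column_name: &str, min_value: i64) -> Option<Self> {
        expected.get(column_name).map(|id| Self {
            column_id: *id,
            min_value,
        })
    }

    /// Keeps the row if the column is not projected, otherwise applies the check.
    fn matches(&self, format: &ToyReadFormat, projected_row: &[i64]) -> bool {
        format
            .projected_value(self.column_id, projected_row)
            .map_or(true, |value| value >= self.min_value)
    }
}

fn main() {
    // The toy file stores columns with ids 1, 2, 3; the query projects ids 3 and 1.
    let format = Arc::new(ToyReadFormat::new(&[1, 2, 3], &[3, 1]));
    let expected = HashMap::from([("host", 1), ("cpu", 3)]);
    let filter = ToyFilterContext::new(&expected, "cpu", 50).expect("known column");

    let row = format.project_row(&[7, 8, 90]);
    println!("{:?} matches: {}", row, filter.matches(&format, &row)); // prints [7, 90] matches: true
}
```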

Checklist

I have written the necessary rustdoc comments.
I have added the necessary unit tests and integration tests.
This PR does not require documentation updates.

codecov · 2024-04-30T07:13:34Z

Codecov Report

Attention: Patch coverage is 88.25397%, with 37 lines in your changes missing coverage. Please review.

Project coverage is 85.32%. Comparing base (154f561) to head (53396e1).
Report is 6 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3841      +/-   ##
==========================================
- Coverage   85.60%   85.32%   -0.28%     
==========================================
  Files         952      953       +1     
  Lines      163079   163431     +352     
==========================================
- Hits       139596   139451     -145     
- Misses      23483    23980     +497

src/mito2/src/sst/parquet/reader.rs

fengjiachun

LGTM

v0y4g3r

LGTM

github-actions bot added the docs-not-required This change does not impact docs. label Apr 30, 2024

evenyag force-pushed the feat/par-row-group-scan branch from 4c6649b to 8f9c9d2 Compare April 30, 2024 06:14

evenyag mentioned this pull request Apr 30, 2024

Support parallel scan in mito engine #2806

Closed

7 tasks

evenyag marked this pull request as ready for review April 30, 2024 07:00

evenyag requested review from v0y4g3r and waynexia as code owners April 30, 2024 07:00

evenyag requested a review from fengjiachun April 30, 2024 07:09

evenyag changed the title ~~feat: Parquet reader builder support building multiple partitions~~ feat: Parquet reader builder supports building multiple partitions Apr 30, 2024

evenyag marked this pull request as draft April 30, 2024 08:37

evenyag changed the title ~~feat: Parquet reader builder supports building multiple partitions~~ feat: Parquet reader builder supports building multiple ranges to read Apr 30, 2024

evenyag added 12 commits May 7, 2024 15:30

chore: change &mut self to &self

48054c0

feat: define partition and partition context

647cd55

refactor: move precise_filter to PartitionContext

a6302fc

feat: filter wip

628ad79

feat: compute projection and fields in format

7c9dadf

feat: use RowGroupReader to implement ParquetReader

7f1bbd5

fix: use expected meta to get column id for filters

478db9d

feat: partition returns row group reader

13ec554

style: fix clippy

dec6b71

feat: add build partitions method

66f3ce1

docs: comment

97bc9b5

refactor: rename Partition to FileRange

907fae7

evenyag force-pushed the feat/par-row-group-scan branch from 94a1590 to 907fae7 Compare May 7, 2024 07:31

evenyag marked this pull request as ready for review May 7, 2024 07:34

v0y4g3r reviewed May 7, 2024

Review threads on src/mito2/src/sst/parquet/reader.rs: 4 resolved (1 outdated)

evenyag added 2 commits May 7, 2024 19:44

chore: address CR comments

bd66d9d

feat: avoid allocating column ids while constructing ReadFormat

53396e1

evenyag requested a review from v0y4g3r May 7, 2024 12:07

fengjiachun approved these changes May 9, 2024

v0y4g3r approved these changes May 10, 2024

v0y4g3r added this pull request to the merge queue May 10, 2024

Merged via the queue into GreptimeTeam:main with commit 5a0629e May 10, 2024
21 checks passed

v0y4g3r deleted the feat/par-row-group-scan branch May 10, 2024 07:53
