Support parallel scan in mito engine #2806

evenyag · 2023-11-23T13:48:48Z

What type of enhancement is this?

Performance

What does the enhancement do?

Currently, the Mito engine only supports single-threaded scanning. We can consider using parallel scanning to improve speed when dealing with larger amounts of data.

Implementation challenges

There are several approaches to parallel scanning:

- Parallel scanning of files: this does not help when there is only one file to scan.
- Parallel scanning of row groups: to keep the final result sorted, the parallel scan has to cover the row groups that the MergeReader consumes, which makes this approach more complex to implement.
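To illustrate why sorted output constrains row-group parallelism, here is a minimal sketch of a k-way merge over already-sorted sources. It is not Mito's MergeReader: `merge_sorted` is a hypothetical name, and plain integers stand in for row keys. The point is that every source feeds a single merge step, so reading the sources concurrently does not remove that step.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges already-sorted sources into one sorted output.
/// Hypothetical stand-in for a merge reader; keys are plain integers here.
fn merge_sorted(sources: Vec<Vec<u64>>) -> Vec<u64> {
    let mut iters: Vec<_> = sources.into_iter().map(|s| s.into_iter()).collect();
    // Min-heap of (next key, source index), built with `Reverse`.
    let mut heap = BinaryHeap::new();
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(key) = it.next() {
            heap.push(Reverse((key, idx)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((key, idx))) = heap.pop() {
        out.push(key);
        // Refill the heap from the source that just produced a key.
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}
```

Calling `merge_sorted` with one vector per source returns a single sorted vector, provided each input is itself sorted.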

We also need a way to control the parallelism of a query, so spawning a task for each file might not be the best solution (this probably needs some experiments).
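One possible way to bound parallelism, instead of one task per file, is a fixed number of workers that claim files from a shared counter. The sketch below uses only the standard library and is an assumption for illustration, not the engine's design: `scan_with_limit` and the `scan_file` callback are hypothetical, and the callback just returns a number in place of scanned rows.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Scans `num_files` files with at most `parallelism` worker threads.
/// `scan_file` is a hypothetical placeholder for reading one file.
fn scan_with_limit<F>(num_files: usize, parallelism: usize, scan_file: F) -> Vec<usize>
where
    F: Fn(usize) -> usize + Send + Sync + 'static,
{
    let next = Arc::new(AtomicUsize::new(0)); // index of the next file to claim
    let scan_file = Arc::new(scan_file);
    // Never start more workers than files, and always at least one.
    let workers = parallelism.clamp(1, num_files.max(1));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let next = Arc::clone(&next);
            let scan_file = Arc::clone(&scan_file);
            thread::spawn(move || {
                let mut results = Vec::new();
                loop {
                    // Each index is handed out exactly once.
                    let file = next.fetch_add(1, Ordering::Relaxed);
                    if file >= num_files {
                        break;
                    }
                    results.push((file, scan_file(file)));
                }
                results
            })
        })
        .collect();
    // Gather per-file results and restore file order.
    let mut all: Vec<(usize, usize)> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    all.sort_by_key(|(file, _)| *file);
    all.into_iter().map(|(_, rows)| rows).collect()
}
```

With this shape the number of threads stays fixed regardless of how many files a query touches, which is the property the experiments would need to evaluate.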

Steps

- Scan files in parallel: "perf(mito): scan SSTs and memtables in parallel" #2852
- Scan row groups in parallel if the table is append-only
  - Build multiple ranges to scan: "feat: Parquet reader builder supports building multiple ranges to read" #3841
  - Scan multiple ranges: "feat: Implements row group level parallel unordered scanner" #3992
- Scan row groups in parallel for non-append-only tables: "feat: Implement RegionScanner for SeqScan" #4060
- Fix the pruning issue for columns with the same name but different column ids: "fix: prune row groups correctly for columns with the same name" #3802
- Partitioned scan for RegionEngine #3886

evenyag · 2024-04-25T08:46:25Z

I'm going to reopen this issue, as file-level parallelism is not enough when the number of Parquet files is smaller than the parallelism. We still need a more fine-grained parallel scan strategy, such as a row-group-level parallel scan. However, scanning row groups in parallel might not preserve the sorted order of the SST file. As a result, we can implement this strategy for append-only tables that use UnorderedScan. In the future, we might support a parallel merge to lift this restriction.
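A rough sketch of how row groups could be spread over partitions when there are fewer files than the target parallelism. The type `RowGroupRange` and the function `partition_row_groups` are hypothetical names for illustration, not the engine's API, and round-robin assignment is just one possible policy.

```rust
/// Identifies one row group in one SST file (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RowGroupRange {
    file_index: usize,
    row_group: usize,
}

/// Assigns every row group to one of `parallelism` partitions, round-robin.
/// `row_groups_per_file[i]` is the number of row groups in file `i`.
fn partition_row_groups(
    row_groups_per_file: &[usize],
    parallelism: usize,
) -> Vec<Vec<RowGroupRange>> {
    let parallelism = parallelism.max(1);
    let mut partitions = vec![Vec::new(); parallelism];
    let mut next = 0;
    for (file_index, &count) in row_groups_per_file.iter().enumerate() {
        for row_group in 0..count {
            partitions[next % parallelism].push(RowGroupRange { file_index, row_group });
            next += 1;
        }
    }
    partitions
}
```

Each partition can then be read independently, but rows from different partitions interleave arbitrarily, which is why this only fits scans that do not need sorted output.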

I did some experiments on this branch. I also found that we must support multiple output partitions to maximize parallel computation. I also fixed an issue where the Parquet reader might prune unexpectedly, because it always uses the file's schema to create the physical expression.
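The pruning problem can be pictured with a small sketch. It assumes, for illustration only, that a file may contain a column with the same name as the one in the predicate but a different column id; in that case the file's statistics for that name describe a different column and should not be used for pruning. `Column` and `resolve_in_file` are hypothetical names, not the reader's API.

```rust
/// Column metadata: a stable id plus a name (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Column {
    id: u32,
    name: String,
}

/// Returns the position in the file schema of the column the predicate refers to,
/// matching by id. `None` means the file has no such column, even if a column
/// with the same name exists, so its statistics must not be used for pruning.
fn resolve_in_file(predicate_col: &Column, file_schema: &[Column]) -> Option<usize> {
    file_schema.iter().position(|c| c.id == predicate_col.id)
}
```

Matching by name alone would return the same-named column here and could prune row groups based on the wrong statistics.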

Another potential optimization is to scan columns in parallel when there are not enough row groups, as Polars (pola-rs) does. This is not strictly necessary, though.

I'll update this issue to track the implementation of parallel scanning row groups.

evenyag added the C-enhancement (Category Enhancements) and C-performance (Category Performance) labels Nov 23, 2023

evenyag self-assigned this Nov 23, 2023

evenyag mentioned this issue Dec 5, 2023

perf(mito): scan SSTs and memtables in parallel #2852

Merged

2 tasks

killme2008 closed this as completed Dec 12, 2023

evenyag reopened this Apr 25, 2024

evenyag added the tracking-issue (A tracking issue for a feature) label Apr 25, 2024

evenyag mentioned this issue Apr 30, 2024

feat: Parquet reader builder supports building multiple ranges to read #3841

Merged

3 tasks

evenyag mentioned this issue May 8, 2024

Partitioned scan for RegionEngine #3886

Closed

This was referenced May 15, 2024

feat: Adds RegionScanner trait #3948

Merged

feat: Implements row group level parallel unordered scanner #3992

Merged

evenyag mentioned this issue May 28, 2024

feat: Implement RegionScanner for SeqScan #4060

Merged

3 tasks

killme2008 closed this as completed Jul 18, 2024
