Skip to content

Commit

Permalink
Minor: Improve ArrowReaderBuilder::with_row_selection docs
Browse files Browse the repository at this point in the history
  • Loading branch information
alamb committed May 31, 2024
1 parent c2b05cd commit 283491e
Showing 1 changed file with 35 additions and 6 deletions.
41 changes: 35 additions & 6 deletions parquet/src/arrow/arrow_reader/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,8 @@ impl<T> ArrowReaderBuilder<T> {
}

/// Only read data from the provided row group indexes
///
/// This is also called row group filtering
pub fn with_row_groups(self, row_groups: Vec<usize>) -> Self {
Self {
row_groups: Some(row_groups),
Expand All @@ -135,14 +137,41 @@ impl<T> ArrowReaderBuilder<T> {
/// Provide a [`RowSelection`] to filter out rows, and avoid fetching their
/// data into memory.
///
/// Row group filtering is applied prior to this, and therefore rows from skipped
/// row groups should not be included in the [`RowSelection`]
/// This feature is used to restrict which rows are decoded within row
/// groups, skipping ranges of rows that are not needed. Such selections
/// could be determined by evaluating predicates against the parquet page
/// [`Index`] or some other external information available to a query
/// engine.
///
/// An example use case of this would be applying a selection determined by
/// evaluating predicates against the [`Index`]
/// # Notes
///
/// It is recommended to enable reading the page index if using this functionality, to allow
/// more efficient skipping over data pages. See [`ArrowReaderOptions::with_page_index`]
/// Row group filtering (see [`Self::with_row_groups`]) is applied prior to
/// applying the row selection, and therefore rows from skipped row groups
/// should not be included in the [`RowSelection`] (see example below)
///
/// It is recommended to enable writing the page index if using this
/// functionality, to allow more efficient skipping over data pages. See
/// [`ArrowReaderOptions::with_page_index`].
///
/// # Example
///
/// Given a parquet file with 3 row groups, and a row group filter of
/// `[0, 2]`, in order to only scan rows 50-100 in row group 2:
///
/// ```text
/// Row Group 0, 1000 rows (selected)
/// Row Group 1, 1000 rows (skipped)
/// Row Group 2, 1000 rows (selected, but want to only scan rows 50-100)
/// ```
///
/// You would pass the following [`RowSelection`]:
///
/// ```text
/// Select 1000 (scan all rows in row group 0)
/// Select 50-100 (scan rows 50-100 in row group 2)
/// ```
///
/// Note there is no entry for the (entirely) skipped row group 1.
///
/// [`Index`]: crate::file::page_index::index::Index
pub fn with_row_selection(self, selection: RowSelection) -> Self {
Expand Down

0 comments on commit 283491e

Please sign in to comment.