Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix writing of invalid Parquet ColumnIndex when row group contains null pages #6319

Merged
merged 5 commits into from
Aug 31, 2024

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Aug 28, 2024

Fixes #6310. All credit goes to @etseidl.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 28, 2024
Comment on lines 586 to 597
// assert_eq!(left.offset_index_length(), right.offset_index_length());
// assert_eq!(left.column_index_length(), right.column_index_length());
Copy link
Contributor Author

@adriangb adriangb Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With just the changes to this file (so the new test data) this fails on master because of #6316, so I commented it out for now and if:

  1. Pass empty vectors as min/max for all null pages when building ColumnIndex #6316 gets merged first I can rebase and re-enable the code
  2. If this gets merged first then Pass empty vectors as min/max for all null pages when building ColumnIndex #6316 can uncomment and re-enable this code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-enabled in 8eea1e5

Comment on lines +489 to +502
let writer_props = writer_props_builder
.set_max_row_group_size(4)
.set_data_page_row_count_limit(2)
.set_write_batch_size(2)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recognize that this is forcing things a bit. I leaned towards this approach because (1) it felt like a small step from where we are now and (2) I'm not that familiar with the lower level APIs that would be required if we wanted to build a ParquetMetadata by hand and use that for tests instead of relying on file writing to build one for us. I think a reasonable piece of feedback to this PR would be that now is the time to refactor the tests to build the ParquetMetadata by hand since we now care about getting pages of a certain size, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thik this is reasonable

Comment on lines +659 to +673
// make sure each row group has 2 pages by checking the first column
// page counts for each column for each row group, should all be the same and there should be
// 12 pages in total across 6 row groups / 1 column
let mut page_counts = vec![];
for row_group in column_indexes {
for column in row_group {
match column {
Index::INT32(column_index) => {
page_counts.push(column_index.indexes.len());
}
_ => panic!("unexpected column index type"),
}
}
}
assert_eq!(page_counts, vec![2; 6]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we refactor this test to create the metadata directly I think this is a good idea. Previously there was no assertion that we were creating more than 1 page or anything of the sort, since this is testing code that evidently diverges in code paths depending on how many pages there are, etc. it's good to at least have assertions that the data is laid out as we expect it to be.

Comment on lines 759 to 760
vec![],
vec![],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, that's reading ahead 🤣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted it and pushed a single commit that also credits you 😄

…ll pages

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@adriangb adriangb force-pushed the fix-metadata-write-nulls branch from cd93a1e to 5a127c6 Compare August 28, 2024 17:42
@adriangb adriangb requested a review from etseidl August 28, 2024 17:42
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 (non-binding). I like the increased sanity checking in the tests. Having second thoughts on the fix, however, but it's just a style thing and not blocking.

@@ -450,19 +451,57 @@ mod tests {
true,
)]));

let a: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(2)]));
// build row groups / pages that exercise different combinations of nulls and values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these test additions, particularly that they build off of the tests added in #6197. They also seem much more thorough than what I proposed 😄 .

Comment on lines 230 to 239
let mut min_values = vec![vec![]; self.indexes.len()];
let mut max_values = vec![vec![]; self.indexes.len()];
for (i, index) in self.indexes.iter().enumerate() {
if let Some(min) = index.min_bytes() {
min_values[i].extend_from_slice(min);
}
if let Some(max) = index.max_bytes() {
max_values[i].extend_from_slice(max);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I'm responsible for this, but now I wonder if simply changing the above to something like:

        let min_values = self
            .indexes
            .iter()
            .map(|x| x.min_bytes().unwrap_or(&[]).to_vec())
            .collect::<Vec<_>>();

would be more rusty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup i think it would be, changed

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@adriangb
Copy link
Contributor Author

Thank you @etseidl! Can we merge this or do we need to loop in some more folks?

@etseidl
Copy link
Contributor

etseidl commented Aug 28, 2024

I'm not a committer so I have no power here 😅. @alamb will hopefully take a look once his vacation is over.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @adriangb and @etseidl . I really appreciate the quick turnaround and help. Again sorry for the delay in reviewing

I just merged #6316 and re-enabled the tests in https://github.com/apache/arrow-rs/pull/6319/files#r1735056465

Comment on lines +489 to +502
let writer_props = writer_props_builder
.set_max_row_group_size(4)
.set_data_page_row_count_limit(2)
.set_write_batch_size(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thik this is reasonable

@alamb
Copy link
Contributor

alamb commented Aug 31, 2024

🔧

@alamb alamb merged commit acdd27a into apache:master Aug 31, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid ColumnIndex written in parquet
3 participants