Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow typeLength to come from opts.column when decoding FIXED_LEN_BYTE_ARRAY #108

Merged
merged 4 commits into from
Jan 19, 2024

Conversation

j4ys0n
Copy link

@j4ys0n j4ys0n commented Jan 5, 2024

Problem

typeLength is present in column options but decoding is throwing an error.

thrown: "missing option: typeLength (required for FIXED_LEN_BYTE_ARRAY)"

options object for reference:

    {
      type: 'FIXED_LEN_BYTE_ARRAY',
      rLevelMax: 0,
      dLevelMax: 1,
      compression: 'SNAPPY',
      column: {
        name: 'BLOCK_NUMBER',
        primitiveType: 'FIXED_LEN_BYTE_ARRAY',
        originalType: 'DECIMAL',
        path: [ 'BLOCK_NUMBER' ],
        repetitionType: 'OPTIONAL',
        encoding: 'PLAIN',
        statistics: undefined,
        compression: 'UNCOMPRESSED',
        precision: 38,
        scale: 0,
        typeLength: 16,
        rLevelMax: 0,
        dLevelMax: 1
      },
      num_values: { buffer: <Buffer 00 00 00 00 00 00 27 10>, offset: 0 }
    }

using parquet-tools schema here is the schema for this column:

optional fixed_len_byte_array(16) BLOCK_NUMBER (DECIMAL(38,0))

The parquet file is a direct export from snowflake and the data type of the column is NUMBER(38,0).

Solution

I traced through the code to find where the decode was erroring and added the ability to take the typeLength from column in the column options when it is not present at the top level.

Change summary:

see above

Steps to Verify:

decode a parquet file with this type of field.

Copy link
Member

@wilwade wilwade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. @j4ys0n if you have a simple parquet file with this, you can add it to test/test-files.js and that would provide a basic integration type test for it as well, but I don't see that as blocking.

@wilwade wilwade requested a review from shannonwells January 12, 2024 13:19
Copy link
Collaborator

@shannonwells shannonwells left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a unit test for this, thanks

Copy link
Member

@wilwade wilwade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@j4ys0n Can you add a test with a parquet file that this fails on? I tried one and it already was working.

A simple test in test/test-files.js that looks something like:

it('nested_fixed_length_byte_array.parquet loads', async function () {
    const schema = await readSchema('nested_fixed_length_byte_array.parquet');
    const data = await readData('nested_fixed_length_byte_array.parquet');

    const typeLength = schema.fields["BLOCK_NUMBER"].typeLength;
    assert.equal(typeLength, 16);
    const expectedBuffer = Buffer.from([0x00, 0x00, 0x03, 0xe8]);
    assert.deepEqual(data[0], { BLOCK_NUMBER: expectedBuffer });
  });

@wilwade
Copy link
Member

wilwade commented Jan 18, 2024

Testing again with float16_nonzeros_and_nans.parquet and float16_zeros_and_nans.parquet and they do show this fixing things. I'm about to get up a pr that tests all the various reference tests, so we should just be able to uncomment those two and that shows this works. PR coming soon to fix that, then I think we can update and merge this one

@wilwade wilwade requested a review from shannonwells January 19, 2024 00:09
Copy link
Member

@wilwade wilwade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the tests from the reference test, I think we are good!

Copy link
Collaborator

@shannonwells shannonwells left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allllllrightythen

@wilwade wilwade merged commit 2622ff1 into LibertyDSNP:main Jan 19, 2024
1 check passed
@j4ys0n j4ys0n deleted the fix-typelength branch January 19, 2024 05:29
@j4ys0n
Copy link
Author

j4ys0n commented Jan 19, 2024

apologies for the unresponsiveness - i'm away for a bit. thank you for moving this forward 🙏

@wilwade
Copy link
Member

wilwade commented Jan 19, 2024

No problem. Thanks for the PR! This was released with v1.5.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants