aws-sdk-pandas throws "Unknown encoding type" exception while reading a parquet file with column with type boolean #1773

yaronso · 2022-11-14T20:41:23Z

I have a Python AWS Lambda function that tries to read a parquet file containing two boolean columns (each parquet file has 46 columns in total).
When I exclude those two boolean columns, named "iscritical" and "iscyclic", from the input column list, the read_parquet operation succeeds.

code snippet:

valid_cols = [col for col in list(parquet_file_cols_metadata.keys()) if col != "iscritical" and col != "iscyclic"]
stage_file_full_data_df = wr.s3.read_parquet(
    path=stage_file,
    ignore_empty=True,
    use_threads=True,
    columns=valid_cols,
)

When I try to read the entire data (including the boolean columns), the read_parquet operation fails with the exception: "Unknown encoding"

code snippet:

stage_file_full_data_df = wr.s3.read_parquet(
    path=stage_file,
    ignore_empty=True,
    use_threads=True,
)

My question is: why can't wr.s3.read_parquet() handle boolean column dtypes?
Thanks.

malachi-constant · 2022-11-16T11:40:54Z

Maintainer

wr.s3.read_parquet() can handle bool column types, and I'm unable to replicate the issue. I've created two DataFrames: one whose boolean column has no null values and one whose boolean column contains a null value.

> python
Python 3.9.13 (main, Aug  2 2022, 14:33:29)
[Clang 12.0.0 (clang-1200.0.31.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import awswrangler as wr
>>> import pandas as pd
>>> bucket = "{BUCKET OBFUSCATED}"
>>> path = f"s3://{bucket}/parquet/1773.parquet"
>>> df = pd.DataFrame({
...     "id": [1, 2],
...     "name": ["foo", "bar"],
...     "critical": [True, False]
... })
>>> df.dtypes
id           int64
name        object
critical      bool
dtype: object
>>> wr.s3.to_parquet(df, path)
{'paths': ['s3://{BUCKET OBFUSCATED}/parquet/1773.parquet'], 'partitions_values': {}}
>>> read_df = wr.s3.read_parquet(
...   path=path,
...   ignore_empty=True,
...   use_threads=True
... )
>>> read_df.dtypes
id            Int64
name         string
critical    boolean
dtype: object
>>> df1 = pd.DataFrame({
...     "id": [1, 2],
...     "name": ["foo", "bar"],
...     "critical": [True, None]
... })
>>> wr.s3.to_parquet(df1, path)
{'paths': ['s3://{BUCKET OBFUSCATED}/parquet/1773.parquet'], 'partitions_values': {}}
>>> df1.dtypes
id           int64
name        object
critical    object
dtype: object
>>> read_df = wr.s3.read_parquet(
...   path=path,
...   ignore_empty=True,
...   use_threads=True
... )
>>> read_df.dtypes
id            Int64
name         string
critical    boolean
dtype: object

Are you able to provide a simplified example of your parquet file to help us replicate your issue?
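
One way to narrow this down before sharing a file might be to read each column on its own and record which ones raise. The sketch below is only an illustration: `find_failing_columns` and `fake_reader` are hypothetical names, the `OSError` type is an assumption (the thread only reports the message), and the stand-in reader merely mimics the reported behaviour so the snippet runs without S3 access.

```python
from typing import Callable, Dict, Iterable, List


def find_failing_columns(
    columns: Iterable[str],
    read_columns: Callable[[List[str]], object],
) -> Dict[str, str]:
    """Try to read each column on its own and collect the ones that fail.

    Returns a mapping of column name -> "ExceptionType: message".
    """
    failures: Dict[str, str] = {}
    for col in columns:
        try:
            read_columns([col])
        except Exception as exc:  # broad on purpose: we only want to report
            failures[col] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in reader for demonstration only (hypothetical): it raises for the
# two column names from the report and succeeds for everything else.
def fake_reader(cols: List[str]) -> List[str]:
    if any(c in {"iscritical", "iscyclic"} for c in cols):
        raise OSError("Unknown encoding type.")  # assumed exception type
    return cols


failing = find_failing_columns(["id", "name", "iscritical", "iscyclic"], fake_reader)
print(failing)
```

Against the real file, the reader could be something like `lambda cols: wr.s3.read_parquet(path=stage_file, columns=cols)`, using the same `path` and `columns` arguments as in the snippets above. If only the boolean columns fail, that points at how those columns were written rather than at the reader as a whole.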
