
Fix batch transform issue for tabular predictor with multiple partitions #138

Merged
merged 8 commits into autogluon:master on Sep 19, 2024

Conversation

tonyhoo
Contributor

@tonyhoo tonyhoo commented Sep 17, 2024

Description:

This PR fixes the issue where batch transform jobs fail due to column misalignment when the input CSV file is partitioned into multiple records. The problem arises because headers from different partitions are not handled properly, leading to misaligned columns and prediction failures during inference.

Changes:

  • Added logic to align columns across partitions by ensuring headers are managed correctly.
  • Introduced _read_with_fallback and _align_columns helper functions to handle column alignment.
  • Updated transform_fn in tabular_serve.py to use these helper functions.

Limitations:

  • This fix currently only works for the tabular predictor. Support for multimodal and timeseries predictors depends on the implementation of original_features, which can be tracked in issue #4477.

Steps to Reproduce:
The following script can be used to reproduce the issue:

from autogluon.cloud import TabularCloudPredictor
import pandas as pd

# Load datasets
train_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")
test_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")
test_data.drop(columns=['class'], inplace=True)

# Cloud Predictor Arguments
predictor_init_args = {"label": "class"}  
predictor_fit_args = {"train_data": train_data, "time_limit": 60}  

# Initialize Cloud Predictor and Fit
cloud_predictor = TabularCloudPredictor(cloud_output_path='tonyhu-autogluon')
cloud_predictor.fit(predictor_init_args=predictor_init_args, predictor_fit_args=predictor_fit_args)

# Batch Inference with small max_payload to force multiple partitions
result = cloud_predictor.predict(test_data, backend_kwargs={"transformer_kwargs": {"max_payload": 1}})

Expected Behavior:
The batch transform job should handle multiple partitions correctly, aligning columns across the partitions and ignoring or managing headers if present in individual partitions.

Observed Behavior:
The job fails with the following error logs:

Bad HTTP status received from algorithm: 500
invalid literal for int() with base 10: '0.1': Error while type casting for column 'capital-loss'

Logs show that the columns are misaligned for certain partitions:

test_columns: [' 11th', ' Machine-op-inspct', ' Male', ' Never-married', ' Other-relative', ' Private', ' United-States', ' White', '0', '0.1', '207443', '50', '62', '7']
2024-09-13T21:56:19,062 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - train_columns: ['age', 'capital-gain', 'capital-loss', 'education', 'education-num', 'fnlwgt', 'hours-per-week', 'marital-status', 'native-country', 'occupation', 'race', 'relationship', 'sex', 'workclass']

Environment:

  • autogluon==1.1.0
  • Running batch transform in SageMaker with MultiRecord strategy.
  • MaxPayloadInMB=1 is set to ensure multiple partitions.

Additional Information:
The issue seems to be that AutoGluon Cloud is not handling the headers properly when dealing with batch transform partitioned records. In a multi-partition job, not all batches will have the header/column, which is causing the column misalignment.
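The failure mode can be demonstrated in isolation: when a header-less partition is read with pandas defaults, its first data row is promoted to the header, which is exactly how column names like '0.1' and '207443' appear in the logs above. A minimal illustration (standalone, not AutoGluon code):

```python
import io

import pandas as pd

# Simulated header-less partition from a MultiRecord batch transform:
# only the first partition of the original CSV carries the header row.
partition = b"62,0,0.1,207443,50\n39,0,0,77516,40\n"

# Default read: the first data row is promoted to the header.
wrong = pd.read_csv(io.BytesIO(partition))

# Correct read for a header-less partition: keep every row as data.
right = pd.read_csv(io.BytesIO(partition), header=None)

print(list(wrong.columns))  # ['62', '0', '0.1', '207443', '50']
print(len(wrong), len(right))  # 1 2 -- one data row was silently lost
```

This is why the fix needs to detect whether a given partition actually contains the header before parsing it.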


Comment on lines -1108 to -1115
if isinstance(test_data, str) and not os.path.isdir(test_data):
    # either a file to a dataframe, or a file to an image
    if is_image_file(test_data):
        logger.warning(
            "Are you sure you want to do batch inference on a single image? You might want to try `deploy()` and `predict_real_time()` instead"
        )
    else:
        test_data = load_pd.load(test_data)
Contributor
Will it cause problem if images are uploaded here?

Contributor Author

This logic has been copied over to lines #L1184 - #L1192

Contributor

I figured. Now after I read it again, I think moving this to _predict() makes sense.

if isinstance(test_data, str) and not os.path.isdir(test_data):
    # either a file to a dataframe, or a file to an image
    if is_image_file(test_data):
        raise ValueError("Image file is not supported for batch inference")
Contributor

Maybe reword to "Single image file"? And suggest to try .deploy()

Comment on lines +1292 to +1293
if not persist:
    os.remove(results_path)
Contributor

Good catch!

Comment on lines +82 to +84
# Ensure inference_kwargs is a dictionary right before use
if inference_kwargs is None:
    inference_kwargs = {}
Contributor

The check seems to be redundant?

Contributor Author

No, this code was moved from the previous #L64. It ensures that if inference_kwargs is explicitly set to None in the payload, we still end up with a dict.

Contributor

@suzhoum suzhoum Sep 19, 2024

I think you updated line #65 to make the default value {}, so inference_kwargs should not be None. But this is a good safety net.

Contributor Author

The default is set to {} only if inference_kwargs is absent from the payload; it will still be None if

payload = {
    "inference_kwargs": None
}
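The distinction the author is drawing here is standard dict.get behavior: the default applies only when the key is missing, not when its value is explicitly None. A minimal demonstration:

```python
# dict.get's default only applies when the key is absent,
# not when the value is explicitly None.
payload_absent = {}
payload_explicit = {"inference_kwargs": None}

print(payload_absent.get("inference_kwargs", {}))    # {}
print(payload_explicit.get("inference_kwargs", {}))  # None

# Hence the extra guard after the .get() call:
inference_kwargs = payload_explicit.get("inference_kwargs", {})
if inference_kwargs is None:
    inference_kwargs = {}
```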

Contributor

@suzhoum suzhoum left a comment

Thanks for the fix!

@tonyhoo tonyhoo merged commit ee65620 into autogluon:master Sep 19, 2024
17 of 19 checks passed
@tonyhoo tonyhoo deleted the fix_batch_transform branch October 1, 2024 21:13