
Fix batch transform issue for tabular predictor with multiple partitions #138

Merged
merged 8 commits into autogluon:master on Sep 19, 2024

Conversation

tonyhoo
Contributor

@tonyhoo tonyhoo commented Sep 17, 2024

Description:

This PR fixes the issue where batch transform jobs fail due to column misalignment when the input CSV file is partitioned into multiple records. The problem arises because headers from different partitions are not handled properly, leading to misaligned columns and prediction failures during inference.

Changes:

  • Added logic to align columns across partitions by ensuring headers are managed correctly.
  • Introduced _read_with_fallback and _align_columns helper functions to handle column alignment.
  • Updated transform_fn in tabular_serve.py to use these helper functions.

Limitations:

  • This fix currently only works for the tabular predictor. Support for multimodal and timeseries predictors depends on the implementation of original_features, which can be tracked in issue #4477.

Steps to Reproduce:
The following script can be used to reproduce the issue:

from autogluon.cloud import TabularCloudPredictor
import pandas as pd

# Load datasets
train_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")
test_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")
test_data.drop(columns=['class'], inplace=True)

# Cloud Predictor Arguments
predictor_init_args = {"label": "class"}  
predictor_fit_args = {"train_data": train_data, "time_limit": 60}  

# Initialize Cloud Predictor and Fit
cloud_predictor = TabularCloudPredictor(cloud_output_path='tonyhu-autogluon')
cloud_predictor.fit(predictor_init_args=predictor_init_args, predictor_fit_args=predictor_fit_args)

# Batch Inference with small max_payload to force multiple partitions
result = cloud_predictor.predict(test_data, backend_kwargs={"transformer_kwargs": {"max_payload": 1}})

Expected Behavior:
The batch transform job should handle multiple partitions correctly, aligning columns across the partitions and ignoring or managing headers if present in individual partitions.

Observed Behavior:
The job fails with the following error logs:

Bad HTTP status received from algorithm: 500
invalid literal for int() with base 10: '0.1': Error while type casting for column 'capital-loss'

Logs show that the columns are misaligned for certain partitions:

test_columns: [' 11th', ' Machine-op-inspct', ' Male', ' Never-married', ' Other-relative', ' Private', ' United-States', ' White', '0', '0.1', '207443', '50', '62', '7']
2024-09-13T21:56:19,062 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - train_columns: ['age', 'capital-gain', 'capital-loss', 'education', 'education-num', 'fnlwgt', 'hours-per-week', 'marital-status', 'native-country', 'occupation', 'race', 'relationship', 'sex', 'workclass']

Environment:

  • autogluon==1.1.0
  • Running batch transform in SageMaker with MultiRecord strategy.
  • MaxPayloadInMB=1 is set to ensure multiple partitions.

Additional Information:
The issue seems to be that AutoGluon Cloud is not handling the headers properly when dealing with batch transform partitioned records. In a multi-partition job, not all batches will have the header/column, which is causing the column misalignment.
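The failure mode can be demonstrated in isolation: when a header-less partition is read with pandas defaults, its first data row is promoted to the header, which is exactly how column names like '0.1' and '207443' appear in the logs above. A minimal illustration (standalone, not AutoGluon code):

```python
import io

import pandas as pd

# Simulated header-less partition from a MultiRecord batch transform:
# only the first partition of the original CSV carries the header row.
partition = b"62,0,0.1,207443,50\n39,0,0,77516,40\n"

# Default read: the first data row is promoted to the header.
wrong = pd.read_csv(io.BytesIO(partition))

# Correct read for a header-less partition: keep every row as data.
right = pd.read_csv(io.BytesIO(partition), header=None)

print(list(wrong.columns))  # ['62', '0', '0.1', '207443', '50']
print(len(wrong), len(right))  # 1 2 -- one data row was silently lost
```

This is why the fix needs to detect whether a given partition actually contains the header before parsing it.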


Comment on lines -1108 to -1115
if isinstance(test_data, str) and not os.path.isdir(test_data):
    # either a file to a dataframe, or a file to an image
    if is_image_file(test_data):
        logger.warning(
            "Are you sure you want to do batch inference on a single image? You might want to try `deploy()` and `predict_real_time()` instead"
        )
    else:
        test_data = load_pd.load(test_data)
Contributor
Will it cause problem if images are uploaded here?

Contributor Author

This logic has been copied over to lines #L1184 - #L1192

Contributor

I figured. Now after I read it again, I think moving this to _predict() makes sense.

if isinstance(test_data, str) and not os.path.isdir(test_data):
    # either a file to a dataframe, or a file to an image
    if is_image_file(test_data):
        raise ValueError("Image file is not supported for batch inference")
Contributor

Maybe reword to "Single image file"? And suggest to try .deploy()

Comment on lines +1292 to +1293
if not persist:
    os.remove(results_path)
Contributor

Good catch!

Comment on lines +82 to +84
# Ensure inference_kwargs is a dictionary right before use
if inference_kwargs is None:
    inference_kwargs = {}
Contributor

The check seems to be redundant?

Contributor Author

No, this code was moved from the previous #L64. It ensures that if inference_kwargs is explicitly set to None in the payload, we still end up with a dict.

Contributor

@suzhoum suzhoum Sep 19, 2024

I think you updated line #65 to make the default value {}, so inference_kwargs should not be None. But this is a good safety net.

Contributor Author

The default is set to {} only if inference_kwargs is absent from the payload; it will still be None if

payload = {
    "inference_kwargs": None
}
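The distinction the author is drawing here is standard dict.get behavior: the default applies only when the key is missing, not when its value is explicitly None. A minimal demonstration:

```python
# dict.get's default only applies when the key is absent,
# not when the value is explicitly None.
payload_absent = {}
payload_explicit = {"inference_kwargs": None}

print(payload_absent.get("inference_kwargs", {}))    # {}
print(payload_explicit.get("inference_kwargs", {}))  # None

# Hence the extra guard after the .get() call:
inference_kwargs = payload_explicit.get("inference_kwargs", {})
if inference_kwargs is None:
    inference_kwargs = {}
```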

Contributor

@suzhoum suzhoum left a comment

Thanks for the fix!

@tonyhoo tonyhoo merged commit ee65620 into autogluon:master Sep 19, 2024
17 of 19 checks passed
@tonyhoo tonyhoo deleted the fix_batch_transform branch October 1, 2024 21:13