For "Out Of Memory Error" add functionality to skip the current config and resume training #1273

zxingz · 2024-02-04T08:07:07Z

Users may have GPUs with 4 GB, 6 GB, 12 GB, or 24 GB of VRAM. During hyperparameter tuning, the whole training run stops as soon as a GPU out-of-memory (OOM) error is encountered. This can happen, for example, when different "max_bin" values are tried.

Please add a "skip_and_resume" flag to the settings so that, when an OOM error is encountered, the search skips the current configuration and continues with the next hyperparameter configuration.

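from flaml import AutoML, tune
from flaml.automl.model import LGBMEstimator

class GPULGBM(LGBMEstimator):
    def __init__(self, **config):
        # Force LightGBM to train on the GPU.
        super().__init__(device="gpu", **config)

automl = AutoML()
automl.add_learner(learner_name='gpulgbm', learner_class=GPULGBM)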

settings = {
    "time_budget": 12*3600,  # total running time in seconds
    "metric": 'mae',  # primary metrics for regression can be chosen from: ['mae','mse','r2']
    "estimator_list": ['gpulgbm'],  # list of ML learners; we tune lightgbm in this example
    "task": 'regression',  # task type
    "log_file_name": 'HFT_experiment.log',  # flaml log file
    "seed": 7654321,  # random seed
    "eval_method": "cv",
    "ensemble": True,
    "custom_hp": {
        "gpulgbm": {
            "log_max_bin": {
                "domain": tune.lograndint(lower=3, upper=7),
                "init_value": 5,
            },
        }
    },
}
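
To make the requested behavior concrete, here is a minimal sketch of a "skip and resume" loop, written independently of FLAML. Everything in it is hypothetical: `run_trials`, `is_oom`, and `fake_train` are illustrative names, not FLAML or LightGBM APIs, and the check for "out of memory" in the exception message is only an assumption about how an OOM failure might surface.

```python
def is_oom(exc):
    """Heuristic (assumption): MemoryError, or a message mentioning 'out of memory'."""
    return isinstance(exc, MemoryError) or "out of memory" in str(exc).lower()


def run_trials(configs, train_fn):
    """Evaluate each config; skip configs that hit OOM instead of aborting the search."""
    results, skipped = [], []
    for config in configs:
        try:
            loss = train_fn(config)
        except Exception as exc:
            if not is_oom(exc):
                raise  # unrelated failures should still stop the run
            skipped.append(config)  # remember the config and move on to the next one
            continue
        results.append((config, loss))
    return results, skipped


def fake_train(config):
    # Hypothetical stand-in for GPU training: pretend the device
    # cannot fit histograms with more than 31 bins.
    max_bin = (1 << config["log_max_bin"]) - 1
    if max_bin > 31:
        raise RuntimeError("Out of memory (simulated)")
    return 1.0 / max_bin  # dummy loss value


configs = [{"log_max_bin": k} for k in range(3, 8)]
results, skipped = run_trials(configs, fake_train)
print("evaluated:", [c["log_max_bin"] for c, _ in results])
print("skipped (OOM):", [c["log_max_bin"] for c in skipped])
```

With a flag like "skip_and_resume", the tuner could apply the same idea internally: record the failed configuration, treat it as a failed trial, and continue with the remaining time budget.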
