
bug: Error with logging #664

Open
eddiebergman opened this issue Dec 4, 2024 · 12 comments

Labels
bug Something isn't working

Comments

@eddiebergman
Collaborator

I'm sporadically encountering a logging error that boils down to this line, which raises ValueError: I/O operation on closed file. It's not deterministic and I haven't been able to write a reproducing script; just rerunning the failed jobs seems to work ~99% of the time.

buf.write(line) # "end" newline always added by logger
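
For reference, that ValueError is simply what a StringIO raises once it has already been closed:

import io

buf = io.StringIO()
buf.close()
buf.write("some line")  # ValueError: I/O operation on closed file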

I'm not sure where, but somehow this error cascades into a recursion error, as the error itself keeps trying to get logged (perhaps in the overridden print?).

The initial trigger seems to come from inside .framework_module.run(), which then cascades several times. I can say for certain that it originates in this block:

try:
    log.info("Running task %s on framework %s with config:\n%s", task_config.name, self.benchmark.framework_name, task_config)
    json_dump(task_config, task_config.output_metadata_file, style='pretty')
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
except Exception as e:
    if rconfig().job_scheduler.exit_on_job_failure:
        raise
    log.exception(e)
    result = ErrorResult(e)
finally:
    self._dataset.release()
return results.compute_score(result=result, meta_result=meta_result)

This is a trimmed-down version of the exec.py file:

from __future__ import annotations

# Yup, don't ask, this is required, and in this order
# https://github.com/pypa/setuptools/issues/3044
import setuptools  # noqa: F401
import pip  # noqa: F401

import logging

from typing import TYPE_CHECKING
from frameworks.shared.callee import call_run, result
import pandas as pd
import time

log = logging.getLogger(__name__)
log.setLevel(logging.INFO)

if TYPE_CHECKING:
    from amlb.benchmark import TaskConfig
    from amlb.data import Dataset

def run(dataset: Dataset, config: TaskConfig):
    log.info("\n**** OMITTED ****\n")

    is_classification = config.type == "classification"

    # Helpful to debug locally
    print("cat indices:", dataset.categorical_indices)
    print("feature names:", dataset.train.X.columns)

    if is_classification:
        model = OMITTED
    else:
        model = OMITTED

    X_train, X_test = dataset.train.X, dataset.test.X
    y_train, y_test = dataset.train.y, dataset.test.y
    # Turn into a Series if need be
    if isinstance(y_train, pd.DataFrame):
        y_train = y_train.squeeze()
    if isinstance(y_test, pd.DataFrame):
        y_test = y_test.squeeze()

    now = time.monotonic()
    model.fit(X_train, y_train)
    train_duration_seconds = time.monotonic() - now

    log.info(f"Finished fit in {train_duration_seconds}s.")

    # 2x forwards
    probabilities = model.predict_proba(X_test) if is_classification else None

    now = time.monotonic()
    predictions = model.predict(X_test)
    predict_duration_seconds = time.monotonic() - now
    log.info(f"Finished predict in {predict_duration_seconds}s.")

    return result(
        output_file=config.output_predictions_file,
        predictions=predictions,
        probabilities=probabilities,
        target_is_encoded=False,
        models_count=1,
        training_duration=train_duration_seconds,
        predict_duration=predict_duration_seconds,
    )


if __name__ == "__main__":
    call_run(run)

Here is the traceback; I cut out all of the recursion that happens, as the original file is 20k lines long.

Running benchmark `OMITTED` on `openml/t/359939` framework in `local` mode.
Loading frameworks definitions from ['SOMEPATH/resources/frameworks.yaml', 'SOMEPATH/benchmarking-user-dir/frameworks.yaml'].
Loading benchmark constraint definitions from ['SOMEPATH/resources/constraints.yaml'].
Loading openml task 359939.

------------------------------------------------------------------
Starting job local.openml_t_359939.4h8c.topo_2_1.5.OMITTED.
Assigning 8 cores (total=20) for new task topo_2_1.
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] CPU Utilization: 24.4%
Assigning 373846 MB (total=385593 MB) for new topo_2_1 task.
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] Memory Usage: 2.5%
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] Disk Usage: 22.2%
Running task topo_2_1 on framework OMITTED with config:
TaskConfig({'framework': 'OMITTED', 'framework_params': {}, 'framework_version': 'main', 'type': 'regression', 'name': 'topo_2_1', 'openml_task_id': 359939, 'test_server': False, 'fold': 5, 'metric': 'rmse', 'metrics': ['rmse', 'r2', 'mae'], 'seed': 892804397, 'job_timeout_seconds': 21600, 'max_runtime_seconds': 14400, 'cores': 8, 'max_mem_size_mb': 373846, 'min_vol_size_mb': 65536, 'input_dir': 'SOMEPATH/openml-cache', 'output_dir': 'SOMEPATH/results/tabpfnrefactor.openml_t_359939.4h8c.local.20241204T122418', 'output_predictions_file': 'SOMEPATH/results/tabpfnrefactor.openml_t_359939.4h8c.local.20241204T122418/predictions/topo_2_1/5/predictions.csv', 'tag': None, 'command': 'SOMEPATH/runbenchmark.py OMITTED openml/t/359939 4h8c --fold 5 --indir SOMEPATH/openml-cache --outdir SOMEPATH/results --userdir SOMEPATH/benchmarking-user-dir --setup skip', 'git_info': {'repo': 'git@github.com:LennartPurucker/amlb-fork.git', 'branch': 'ag_synthetic_bench', 'commit': '303b14df7018c6eb9266693343744fa3c2e35231', 'tags': [], 'status': []}, 'measure_inference_time': False, 'ext': {}, 'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'type_': 'regression', 'output_metadata_file': 'SOMEPATH/results/tabpfnrefactor.openml_t_359939.4h8c.local.20241204T122418/predictions/topo_2_1/5/metadata.json'})
Running cmd `SOMEPATH/frameworks/OMITTED/venv/bin/python -W ignore SOMEPATH/frameworks/OMITTED/exec.py`
INFO:__main__:
**** OMITTED ****

cat indices: []
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] CPU Utilization: 35.2%
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] Memory Usage: 3.7%
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] Disk Usage: 22.2%
feature names: Index(['oz1', 'oz2', 'oz3', 'oz4', 'oz5', 'oz6', 'oz7', 'oz8', 'oz9', 'oz10',

--- Logging error ---
Traceback (most recent call last):
maximum recursion depth exceeded while calling a Python object

Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
    self.flush()
  File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle

# =================
# OMITTED REPEATS
RecursionError: maximum recursion depth exceeded while calling a Python object
# =================

--- Logging error ---
Job `local.openml_t_359939.4h8c.topo_2_1.5.OMITTED` failed with error: I/O operation on closed file

Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
    self.flush()
  File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle

# ============================
# OMITTED REPEATS
RecursionError: maximum recursion depth exceeded while calling a Python object
# ============================

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "SOMEPATH/amlb/benchmark.py", line 581, in run
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
  File "SOMEPATH/frameworks/OMITTED/__init__.py", line 28, in run
    return run_in_venv(
  File "SOMEPATH/frameworks/shared/caller.py", line 134, in run_in_venv
    output, err = run_cmd(cmd, *args,
  File "SOMEPATH/amlb/utils/process.py", line 255, in run_cmd
    completed = run_subprocess(str_cmd if params.shell else full_cmd,
  File "SOMEPATH/amlb/utils/process.py", line 85, in run_subprocess
    stdout, stderr = communicate(process, input, timeout=timeout)
  File "SOMEPATH/amlb/utils/process.py", line 80, in communicate
    return (communicate_fn(process, input=input, timeout=timeout) if communicate_fn
  File "SOMEPATH/amlb/utils/process.py", line 249, in communicate
    res = h(*args, **kwargs)
  File "SOMEPATH/amlb/utils/process.py", line 179, in live_output_unix
    process_output = list(iter(
  File "SOMEPATH/amlb/utils/process.py", line 180, in <lambda>
    lambda: read_pipe(
  File "SOMEPATH/amlb/utils/process.py", line 173, in read_pipe
    print(re.sub(r'\n$', '', line, count=1))
  File "SOMEPATH/amlb/logger.py", line 102, in new_print
    print_logger.log(level, buf.getvalue())
  File "/usr/lib/python3.10/logging/__init__.py", line 1547, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1218, in emit
    StreamHandler.emit(self, record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1108, in emit
    self.handleError(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1022, in handleError
    traceback.print_exception(t, v, tb, None, sys.stderr)

# ============================
# OMITTED REPEATS
RecursionError: maximum recursion depth exceeded while calling a Python object
# ============================

  File "/usr/lib/python3.10/traceback.py", line 121, in print_exception
    print(line, file=file, end="")
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
    self.flush()
  File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "SOMEPATH/amlb/benchmark.py", line 585, in run
    log.exception(e)
  File "/usr/lib/python3.10/logging/__init__.py", line 1512, in exception
    self.error(msg, *args, exc_info=exc_info, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1506, in error
    self._log(ERROR, msg, args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1218, in emit
    StreamHandler.emit(self, record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1108, in emit
    self.handleError(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1022, in handleError
    traceback.print_exception(t, v, tb, None, sys.stderr)
  File "/usr/lib/python3.10/traceback.py", line 121, in print_exception
    print(line, file=file, end="")
  File "SOMEPATH/amlb/logger.py", line 98, in new_print
    buf.write(line)  # "end" newline always added by logger
ValueError: I/O operation on closed file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
    self.flush()
  File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "SOMEPATH/amlb/job.py", line 123, in start
    result = self._run()
  File "SOMEPATH/amlb/utils/process.py", line 744, in profiler
    return fn(*args, **kwargs)
  File "SOMEPATH/amlb/benchmark.py", line 588, in run
    self._dataset.release()
  File "SOMEPATH/amlb/utils/process.py", line 744, in profiler
    return fn(*args, **kwargs)
  File "SOMEPATH/amlb/data.py", line 236, in release
    self.train.release()
  File "SOMEPATH/amlb/datasets/openml.py", line 283, in release
    super().release(properties)
  File "SOMEPATH/amlb/utils/process.py", line 744, in profiler
    return fn(*args, **kwargs)
  File "SOMEPATH/amlb/data.py", line 169, in release
    clear_cache(self, properties)
  File "SOMEPATH/amlb/utils/cache.py", line 23, in clear_cache
    log.debug("Cleared cached properties: %s.", properties_to_clear)
  File "/usr/lib/python3.10/logging/__init__.py", line 1465, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1218, in emit
    StreamHandler.emit(self, record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1108, in emit
    self.handleError(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1022, in handleError
    traceback.print_exception(t, v, tb, None, sys.stderr)
  File "/usr/lib/python3.10/traceback.py", line 121, in print_exception
    print(line, file=file, end="")
  File "SOMEPATH/amlb/logger.py", line 98, in new_print
    buf.write(line)  # "end" newline always added by logger
ValueError: I/O operation on closed file
--- Logging error ---
--- Logging error ---
--- Logging error ---
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] CPU Utilization: 30.2%
--- Logging error ---

ERROR:
I/O operation on closed file
--- Logging error ---
@PGijsbers added the bug (Something isn't working) label Dec 4, 2024
@PGijsbers
Collaborator

Thanks. This has me stumped too. Maybe I'll find some time for a closer look over the weekend. I don't recall seeing this before. Is there a particular reason for Py3.10? Just grasping at straws here.

@PGijsbers
Collaborator

I didn't manage to reproduce it earlier, but I have now encountered it myself on Py3.9. I don't know if it was related, but it happened in a run during which my computer also went to sleep. Is it possible your experiments were similarly "interrupted"?

@Innixma
Collaborator

Innixma commented Dec 13, 2024

@eddiebergman If this is the cause of the sporadic 20-30% instance failures without any S3 file saves/logs that I've been having with AMLB in AWS mode for the past year without finding a fix, you are my favorite person ever.

@PGijsbers
Collaborator

PGijsbers commented Dec 13, 2024

@Innixma Do you have any idea when your errors started?

@PGijsbers
Collaborator

I guess in any case we can try to add some guardrails, e.g., if buf.closed: ...

@PGijsbers
Collaborator

I still don't understand the root cause of the failure, but I do think I might understand what leads to it crashing the benchmark. Have a look at this:

def new_print(*args, sep=" ", end=nl, file=None):
    if file not in [None, sys.stdout, sys.stderr]:
        return ori_print(*args, sep=sep, end=end, file=file)
    nonlocal buffer
    buf_type = "err" if file is sys.stderr else "out"
    buf = buffer[buf_type]
    if buf is None:
        buf = buffer[buf_type] = io.StringIO()
    line = sep.join(map(str, [*args]))
    buf.write(line)  # "end" newline always added by logger
    if end == nl or line.endswith(nl):  # flush buffer for every line
        with buf:
            level = logging.ERROR if buf_type == "err" else logging.INFO
            print_logger.log(level, buf.getvalue())
        buffer[buf_type] = None

When line 113 raises an error (it appears as line 102 in your stack trace), it exits the context manager, thus closing buf. However, buf is never set to None (line 114), which means that the next time this function is called, buf.write(...) attempts to write to a closed stream, hence the error. That error then gets printed in turn and... well, you see where this is going.

In #672 I'll add a clause to the guard that checks whether a new StringIO should be opened. It's not a great solution: it will drop at least the one log line that we attempted to write. However, hopefully it allows the app to recover and/or write more informative errors. It's late now, I can't come up with anything better. I'll try to continue tomorrow; thoughts and ideas welcome.
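
Roughly like this, just as a sketch of the guard (not necessarily the exact change that ends up in #672):

if buf is None or buf.closed:  # buf can be left closed if a previous flush errored mid-way
    buf = buffer[buf_type] = io.StringIO()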

@PGijsbers
Collaborator

PGijsbers commented Dec 13, 2024

The root cause may well be the app logger and the framework logger writing to the same file:

# create file handler
app_handler = logging.FileHandler(log_file, mode="a")
app_handler.setLevel(app_level)
app_handler.setFormatter(file_formatter)
app_logger.addHandler(app_handler)
frameworks_logger.addHandler(app_handler)

The framework logger is in a separate process, so there are probably race conditions around writing to the same file. If one process writes to the file while the other also attempts to log, this can apparently lead to stale file handle errors (reddit).

So solving the root cause will likely mean redoing logging with something multiprocessing-safe. A quick hack would be to keep separate logs in separate files, but switching to something multiprocess-safe like loguru is probably inevitable if we want to keep a single log.

It does at least match the sporadic nature of failures.
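
For what it's worth, the stdlib pattern for this is a QueueHandler/QueueListener setup where only one process ever touches the log file. It wouldn't map one-to-one onto how AMLB launches the framework process (a separate interpreter via subprocess), but as a sketch of the idea (the file name and logger name below are made up):

import logging
import logging.handlers
import multiprocessing as mp


def worker(queue):
    # Workers only put records on the queue; they never open the log file themselves.
    logger = logging.getLogger("frameworks")
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.setLevel(logging.INFO)
    logger.info("hello from %s", mp.current_process().name)


if __name__ == "__main__":
    queue = mp.Queue()
    file_handler = logging.FileHandler("runbenchmark.log", mode="a")  # made-up file name
    # The listener is the single writer: it drains the queue and forwards records to the file.
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()
    procs = [mp.Process(target=worker, args=(queue,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()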

@PGijsbers
Collaborator

Tried to write an MWE which uses the Python logger to write to the same log from two processes, but it does not seem to generate the same error despite writing gigabytes of log data (gist).
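
The shape of it was roughly the following (a sketch only, not the gist itself; the file name is made up):

import logging
import multiprocessing as mp


def spam(n_lines):
    # Each process attaches its own FileHandler to the same file, like the app and framework loggers do.
    logger = logging.getLogger("spam")
    logger.addHandler(logging.FileHandler("shared.log", mode="a"))  # made-up file name
    logger.setLevel(logging.INFO)
    for i in range(n_lines):
        logger.info("line %d from %s", i, mp.current_process().name)


if __name__ == "__main__":
    procs = [mp.Process(target=spam, args=(1_000_000,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()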

@PGijsbers
Collaborator

PGijsbers commented Dec 14, 2024

While it doesn't address the stale file error, the other part of the problem is that overridden print. When an error occurs while writing to the log file, as part of the handler's emit routine, the error is instead printed with traceback.print_exception, which, as the name suggests, calls print. But now we have a hook which intercepts that call and instead tries to log it; the handle is still stale, so during the emit routine the error is raised again, which is intercepted and fed to a traceback.print_exception routine, and... well. This at least is trivially reproducible:

import logging
import builtins


root = logging.getLogger()
handler = logging.FileHandler("log.txt", mode="a")
root.addHandler(handler)

def new_print(*args, sep=" ", end="\n", file=None):
    root.log(logging.ERROR, sep.join(map(str, [*args])))

builtins.print = new_print

print("Hey!")

And you'll need to modify your Python's StreamHandler emit method to raise an error, e.g.:

# logging/__init__.py lines 1080-1090ish (depending on python version)
        try:
            msg = self.format(record)
            stream = self.stream
            # issue 35046: merged two stream.writes into one.
            stream.write(msg + self.terminator)
+           raise OSError("Oops!")
            self.flush()
        except RecursionError:  # See issue 36272
            raise
        except Exception:
            self.handleError(record)

@Innixma
Collaborator

Innixma commented Dec 14, 2024

@PGijsbers roughly 1 year ago, but I'm unsure of the exact date. I was also using an old version of AMLB, but it also occurs after I switched to mainline. My issue is either a dependency upgrade causing a bug or AWS causing a bug, unsure which; I haven't been able to solve it despite spending over a week of effort trying.

@PGijsbers
Collaborator

That's interesting; I did not observe any such failures during the 2024 paper experiments (June '23, v2.1.x), even though that was after what I would guess is the main culprit (the v2.1.0 release with the bump to Py3.9 and dependency upgrades in June 2023). This error almost certainly has nothing to do with it, unless something in the underlying stack causes more frequent stale file handles - the AMLB code around this hasn't changed since January 2019 (or 2020, for a minor change in how this method is invoked). :/

@Innixma
Collaborator

Innixma commented Dec 14, 2024

Yeah unsure. It is probably something weird on my end with the AWS account.
