
bug: Error with logging #664

Open
eddiebergman opened this issue Dec 4, 2024 · 12 comments

Labels
bug Something isn't working

Comments

@eddiebergman
Collaborator

I'm sporadically encountering a logging error that boils down to this line, which raises ValueError: I/O operation on closed file. It's not deterministic and I haven't been able to write a reproducing script; just rerunning the failed jobs seems to work ~99% of the time.

buf.write(line) # "end" newline always added by logger
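
For reference, that ValueError is simply what a StringIO raises once it has already been closed:

import io

buf = io.StringIO()
buf.close()
buf.write("some line")  # ValueError: I/O operation on closed file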

I'm not sure where, but somehow this error cascades into a recursion error, as the error itself keeps trying to get logged (perhaps in the overridden print?).

The initial trigger seems to come from inside .framework_module.run(), which then cascades several times. I can say for certain that it originates in this block:

try:
    log.info("Running task %s on framework %s with config:\n%s", task_config.name, self.benchmark.framework_name, task_config)
    json_dump(task_config, task_config.output_metadata_file, style='pretty')
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
except Exception as e:
    if rconfig().job_scheduler.exit_on_job_failure:
        raise
    log.exception(e)
    result = ErrorResult(e)
finally:
    self._dataset.release()
return results.compute_score(result=result, meta_result=meta_result)

This is a trimmed-down version of the exec.py file:

from __future__ import annotations

# Yup, don't ask, this is required, and in this order
# https://github.com/pypa/setuptools/issues/3044
import setuptools  # noqa: F401
import pip  # noqa: F401

import logging

from typing import TYPE_CHECKING
from frameworks.shared.callee import call_run, result
import pandas as pd
import time

log = logging.getLogger(__name__)
log.setLevel(logging.INFO)

if TYPE_CHECKING:
    from amlb.benchmark import TaskConfig
    from amlb.data import Dataset

def run(dataset: Dataset, config: TaskConfig):
    log.info("\n**** OMITTED ****\n")

    is_classification = config.type == "classification"

    # Helpful to debug locally
    print("cat indices:", dataset.categorical_indices)
    print("feature names:", dataset.train.X.columns)

    if is_classification:
        model = OMITTED
    else:
        model = OMITTED

    X_train, X_test = dataset.train.X, dataset.test.X
    y_train, y_test = dataset.train.y, dataset.test.y
    # Turn into a Series if need be
    if isinstance(y_train, pd.DataFrame):
        y_train = y_train.squeeze()
    if isinstance(y_test, pd.DataFrame):
        y_test = y_test.squeeze()

    now = time.monotonic()
    model.fit(X_train, y_train)
    train_duration_seconds = time.monotonic() - now

    log.info(f"Finished fit in {train_duration_seconds}s.")

    # 2x forwards
    probabilities = model.predict_proba(X_test) if is_classification else None

    now = time.monotonic()
    predictions = model.predict(X_test)
    predict_duration_seconds = time.monotonic() - now
    log.info(f"Finished predict in {predict_duration_seconds}s.")

    return result(
        output_file=config.output_predictions_file,
        predictions=predictions,
        probabilities=probabilities,
        target_is_encoded=False,
        models_count=1,
        training_duration=train_duration_seconds,
        predict_duration=predict_duration_seconds,
    )


if __name__ == "__main__":
    call_run(run)

Here is the traceback; I cut out all of the recursion that happens, as the original file is 20k lines long.

Running benchmark `OMITTED` on `openml/t/359939` framework in `local` mode.
Loading frameworks definitions from ['SOMEPATH/resources/frameworks.yaml', 'SOMEPATH/benchmarking-user-dir/frameworks.yaml'].
Loading benchmark constraint definitions from ['SOMEPATH/resources/constraints.yaml'].
Loading openml task 359939.

------------------------------------------------------------------
Starting job local.openml_t_359939.4h8c.topo_2_1.5.OMITTED.
Assigning 8 cores (total=20) for new task topo_2_1.
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] CPU Utilization: 24.4%
Assigning 373846 MB (total=385593 MB) for new topo_2_1 task.
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] Memory Usage: 2.5%
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] Disk Usage: 22.2%
Running task topo_2_1 on framework OMITTED with config:
TaskConfig({'framework': 'OMITTED', 'framework_params': {}, 'framework_version': 'main', 'type': 'regression', 'name': 'topo_2_1', 'openml_task_id': 359939, 'test_server': False, 'fold': 5, 'metric': 'rmse', 'metrics': ['rmse', 'r2', 'mae'], 'seed': 892804397, 'job_timeout_seconds': 21600, 'max_runtime_seconds': 14400, 'cores': 8, 'max_mem_size_mb': 373846, 'min_vol_size_mb': 65536, 'input_dir': 'SOMEPATH/openml-cache', 'output_dir': 'SOMEPATH/results/tabpfnrefactor.openml_t_359939.4h8c.local.20241204T122418', 'output_predictions_file': 'SOMEPATH/results/tabpfnrefactor.openml_t_359939.4h8c.local.20241204T122418/predictions/topo_2_1/5/predictions.csv', 'tag': None, 'command': 'SOMEPATH/runbenchmark.py OMITTED openml/t/359939 4h8c --fold 5 --indir SOMEPATH/openml-cache --outdir SOMEPATH/results --userdir SOMEPATH/benchmarking-user-dir --setup skip', 'git_info': {'repo': 'git@github.com:LennartPurucker/amlb-fork.git', 'branch': 'ag_synthetic_bench', 'commit': '303b14df7018c6eb9266693343744fa3c2e35231', 'tags': [], 'status': []}, 'measure_inference_time': False, 'ext': {}, 'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'type_': 'regression', 'output_metadata_file': 'SOMEPATH/results/tabpfnrefactor.openml_t_359939.4h8c.local.20241204T122418/predictions/topo_2_1/5/metadata.json'})
Running cmd `SOMEPATH/frameworks/OMITTED/venv/bin/python -W ignore SOMEPATH/frameworks/OMITTED/exec.py`
INFO:__main__:
**** OMITTED ****

cat indices: []
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] CPU Utilization: 35.2%
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] Memory Usage: 3.7%
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] Disk Usage: 22.2%
feature names: Index(['oz1', 'oz2', 'oz3', 'oz4', 'oz5', 'oz6', 'oz7', 'oz8', 'oz9', 'oz10',

--- Logging error ---
Traceback (most recent call last):
maximum recursion depth exceeded while calling a Python object

Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
    self.flush()
  File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle

# =================
# OMITTED REPEATS
RecursionError: maximum recursion depth exceeded while calling a Python object
# =================

--- Logging error ---
Job `local.openml_t_359939.4h8c.topo_2_1.5.OMITTED` failed with error: I/O operation on closed file

Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
    self.flush()
  File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle

# ============================
# OMITTED REPEATS
RecursionError: maximum recursion depth exceeded while calling a Python object
# ============================

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "SOMEPATH/amlb/benchmark.py", line 581, in run
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
  File "SOMEPATH/frameworks/OMITTED/__init__.py", line 28, in run
    return run_in_venv(
  File "SOMEPATH/frameworks/shared/caller.py", line 134, in run_in_venv
    output, err = run_cmd(cmd, *args,
  File "SOMEPATH/amlb/utils/process.py", line 255, in run_cmd
    completed = run_subprocess(str_cmd if params.shell else full_cmd,
  File "SOMEPATH/amlb/utils/process.py", line 85, in run_subprocess
    stdout, stderr = communicate(process, input, timeout=timeout)
  File "SOMEPATH/amlb/utils/process.py", line 80, in communicate
    return (communicate_fn(process, input=input, timeout=timeout) if communicate_fn
  File "SOMEPATH/amlb/utils/process.py", line 249, in communicate
    res = h(*args, **kwargs)
  File "SOMEPATH/amlb/utils/process.py", line 179, in live_output_unix
    process_output = list(iter(
  File "SOMEPATH/amlb/utils/process.py", line 180, in <lambda>
    lambda: read_pipe(
  File "SOMEPATH/amlb/utils/process.py", line 173, in read_pipe
    print(re.sub(r'\n$', '', line, count=1))
  File "SOMEPATH/amlb/logger.py", line 102, in new_print
    print_logger.log(level, buf.getvalue())
  File "/usr/lib/python3.10/logging/__init__.py", line 1547, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1218, in emit
    StreamHandler.emit(self, record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1108, in emit
    self.handleError(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1022, in handleError
    traceback.print_exception(t, v, tb, None, sys.stderr)

# ============================
# OMITTED REPEATS
RecursionError: maximum recursion depth exceeded while calling a Python object
# ============================

  File "/usr/lib/python3.10/traceback.py", line 121, in print_exception
    print(line, file=file, end="")
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
    self.flush()
  File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "SOMEPATH/amlb/benchmark.py", line 585, in run
    log.exception(e)
  File "/usr/lib/python3.10/logging/__init__.py", line 1512, in exception
    self.error(msg, *args, exc_info=exc_info, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1506, in error
    self._log(ERROR, msg, args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1218, in emit
    StreamHandler.emit(self, record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1108, in emit
    self.handleError(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1022, in handleError
    traceback.print_exception(t, v, tb, None, sys.stderr)
  File "/usr/lib/python3.10/traceback.py", line 121, in print_exception
    print(line, file=file, end="")
  File "SOMEPATH/amlb/logger.py", line 98, in new_print
    buf.write(line)  # "end" newline always added by logger
ValueError: I/O operation on closed file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
    self.flush()
  File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "SOMEPATH/amlb/job.py", line 123, in start
    result = self._run()
  File "SOMEPATH/amlb/utils/process.py", line 744, in profiler
    return fn(*args, **kwargs)
  File "SOMEPATH/amlb/benchmark.py", line 588, in run
    self._dataset.release()
  File "SOMEPATH/amlb/utils/process.py", line 744, in profiler
    return fn(*args, **kwargs)
  File "SOMEPATH/amlb/data.py", line 236, in release
    self.train.release()
  File "SOMEPATH/amlb/datasets/openml.py", line 283, in release
    super().release(properties)
  File "SOMEPATH/amlb/utils/process.py", line 744, in profiler
    return fn(*args, **kwargs)
  File "SOMEPATH/amlb/data.py", line 169, in release
    clear_cache(self, properties)
  File "SOMEPATH/amlb/utils/cache.py", line 23, in clear_cache
    log.debug("Cleared cached properties: %s.", properties_to_clear)
  File "/usr/lib/python3.10/logging/__init__.py", line 1465, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1218, in emit
    StreamHandler.emit(self, record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1108, in emit
    self.handleError(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1022, in handleError
    traceback.print_exception(t, v, tb, None, sys.stderr)
  File "/usr/lib/python3.10/traceback.py", line 121, in print_exception
    print(line, file=file, end="")
  File "SOMEPATH/amlb/logger.py", line 98, in new_print
    buf.write(line)  # "end" newline always added by logger
ValueError: I/O operation on closed file
--- Logging error ---
--- Logging error ---
--- Logging error ---
[MONITORING] [local.openml_t_359939.4h8c.topo_2_1.5.OMITTED] CPU Utilization: 30.2%
--- Logging error ---

ERROR:
I/O operation on closed file
--- Logging error ---
@PGijsbers added the bug (Something isn't working) label Dec 4, 2024
@PGijsbers
Collaborator

Thanks. This has me stumped too. Maybe I'll find some time for a closer look over the weekend. I don't recall seeing this before. Is there a particular reason for Py3.10? Just grasping at straws here.

@PGijsbers
Collaborator

I didn't manage to reproduce it earlier, but I have now encountered it myself on Py3.9. I don't know if it was related, but it happened in a run during which my computer also went to sleep. Is it possible your experiments were similarly "interrupted"?

@Innixma
Collaborator

Innixma commented Dec 13, 2024

@eddiebergman If this is the cause of the sporadic 20-30% instance failures without any S3 file saves/logs that I've been having with AMLB in AWS mode for the past year without finding a fix, you are my favorite person ever.

@PGijsbers
Collaborator

PGijsbers commented Dec 13, 2024

@Innixma Do you have any idea when your errors started?

@PGijsbers
Collaborator

I guess in any case we can try to add some guardrails, e.g., if buf.closed: ...

@PGijsbers
Collaborator

I still don't understand the root cause of the failure, but I do think I might understand what leads to it crashing the benchmark. Have a look at this:

def new_print(*args, sep=" ", end=nl, file=None):
    if file not in [None, sys.stdout, sys.stderr]:
        return ori_print(*args, sep=sep, end=end, file=file)
    nonlocal buffer
    buf_type = "err" if file is sys.stderr else "out"
    buf = buffer[buf_type]
    if buf is None:
        buf = buffer[buf_type] = io.StringIO()
    line = sep.join(map(str, [*args]))
    buf.write(line)  # "end" newline always added by logger
    if end == nl or line.endswith(nl):  # flush buffer for every line
        with buf:
            level = logging.ERROR if buf_type == "err" else logging.INFO
            print_logger.log(level, buf.getvalue())
        buffer[buf_type] = None

When line 113 raises an error (it appears as line 102 in your stack trace), it exits the context manager, thus closing buf. However, buf is never set to None (line 114), which means that the next time this function is called, buf.write(...) attempts to write to a closed stream, hence the error. That error then gets printed in turn and... well, you see where this is going.

In #672 I'll add a clause to the guard that checks whether a new StringIO should be opened. It's not a great solution: it will drop at least the one log line that we attempted to write. However, hopefully it allows the app to recover and/or write more informative errors. It's late now, I can't come up with anything better. I'll try to continue tomorrow; thoughts and ideas welcome.
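
Roughly like this, just as a sketch of the guard (not necessarily the exact change that ends up in #672):

if buf is None or buf.closed:  # buf can be left closed if a previous flush errored mid-way
    buf = buffer[buf_type] = io.StringIO()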

@PGijsbers
Collaborator

PGijsbers commented Dec 13, 2024

The root cause may well be the app logger and the framework logger writing to the same file:

# create file handler
app_handler = logging.FileHandler(log_file, mode="a")
app_handler.setLevel(app_level)
app_handler.setFormatter(file_formatter)
app_logger.addHandler(app_handler)
frameworks_logger.addHandler(app_handler)

The framework logger is in a separate process, so there are probably race conditions around writing to the same file. If one process writes to the file while the other also attempts to log, this can apparently lead to stale file handle errors (reddit).

So solving the root cause will likely mean redoing logging with something multiprocessing-safe. A quick hack would be to keep separate logs in separate files, but switching to something multiprocess-safe like loguru is probably inevitable if we want to keep a single log.

It does at least match the sporadic nature of failures.
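
For what it's worth, the stdlib pattern for this is a QueueHandler/QueueListener setup where only one process ever touches the log file. It wouldn't map one-to-one onto how AMLB launches the framework process (a separate interpreter via subprocess), but as a sketch of the idea (the file name and logger name below are made up):

import logging
import logging.handlers
import multiprocessing as mp


def worker(queue):
    # Workers only put records on the queue; they never open the log file themselves.
    logger = logging.getLogger("frameworks")
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.setLevel(logging.INFO)
    logger.info("hello from %s", mp.current_process().name)


if __name__ == "__main__":
    queue = mp.Queue()
    file_handler = logging.FileHandler("runbenchmark.log", mode="a")  # made-up file name
    # The listener is the single writer: it drains the queue and forwards records to the file.
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()
    procs = [mp.Process(target=worker, args=(queue,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()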

@PGijsbers
Collaborator

Tried to write an MWE which uses the Python logger to write to the same log from two processes, but it does not seem to generate the same error despite writing gigabytes of log data (gist).
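
The shape of it was roughly the following (a sketch only, not the gist itself; the file name is made up):

import logging
import multiprocessing as mp


def spam(n_lines):
    # Each process attaches its own FileHandler to the same file, like the app and framework loggers do.
    logger = logging.getLogger("spam")
    logger.addHandler(logging.FileHandler("shared.log", mode="a"))  # made-up file name
    logger.setLevel(logging.INFO)
    for i in range(n_lines):
        logger.info("line %d from %s", i, mp.current_process().name)


if __name__ == "__main__":
    procs = [mp.Process(target=spam, args=(1_000_000,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()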

@PGijsbers
Collaborator

PGijsbers commented Dec 14, 2024

While it doesn't address the stale file error, the other part of the problem is that overridden print. When an error occurs while writing to the log file, as part of the handler's emit routine, the error is instead printed with traceback.print_exception, which, as the name suggests, calls print. But now we have a hook which intercepts that call and instead tries to log it; the handle is still stale, so during the emit routine the error is raised again, which is intercepted and fed to a traceback.print_exception routine, and... well. This at least is trivially reproducible:

import logging
import builtins


root = logging.getLogger()
handler = logging.FileHandler("log.txt", mode="a")
root.addHandler(handler)

def new_print(*args, sep=" ", end="\n", file=None):
    root.log(logging.ERROR, sep.join(map(str, [*args])))

builtins.print = new_print

print("Hey!")

And you'll need to modify your Python's StreamHandler emit method to raise an error, e.g.:

# logging/__init__.py lines 1080-1090ish (depending on python version)
        try:
            msg = self.format(record)
            stream = self.stream
            # issue 35046: merged two stream.writes into one.
            stream.write(msg + self.terminator)
+           raise OSError("Oops!")
            self.flush()
        except RecursionError:  # See issue 36272
            raise
        except Exception:
            self.handleError(record)

@Innixma
Collaborator

Innixma commented Dec 14, 2024

@PGijsbers roughly 1 year ago, but I'm unsure of the exact date. I was also using an old version of AMLB, but it also occurs after I switched to mainline. My issue is either a dependency upgrade causing a bug or AWS causing a bug, unsure which; I haven't been able to solve it despite spending over a week of effort trying.

@PGijsbers
Collaborator

That's interesting; I did not observe any such failures during the 2024 paper experiments (June '23, v2.1.x), even though that was after what I would guess is the main culprit (the v2.1.0 release with the bump to Py3.9 and dependency upgrades in June 2023). This error almost certainly has nothing to do with it, unless something in the underlying stack causes more frequent stale file handles - the AMLB code around this hasn't changed since January 2019 (or 2020, for a minor change in how this method is invoked). :/

@Innixma
Collaborator

Innixma commented Dec 14, 2024

Yeah unsure. It is probably something weird on my end with the AWS account.
