bug: Error with logging #664
Thanks. This has me stumped too. Maybe I can find some time for a closer look over the weekend. I don't recall seeing this before. Is there a particular reason for Py3.10? Just grasping at straws here.
I didn't manage to reproduce it earlier, but I've now encountered it myself on Py3.9. I don't know if it's related, but it was during a run in which my computer also went to sleep. Is it possible your experiments were similarly "interrupted"?
@eddiebergman If this is the cause of the sporadic 20-30% instance failures without any S3 file saves/logs I've been having on AMLB in AWS mode for the past year without finding a fix, you are my favorite person ever.
@Innixma Do you have any idea when your errors started?
I guess in any case we can try to add some guardrails, e.g.:
I still don't understand the root cause of the failure, but I think I might understand what leads to it crashing the benchmark. Have a look at this:

automlbenchmark/amlb/logger.py, lines 99 to 114 in 98bf554
When line 113 has an error (it is present as line 102 in your stack trace), it exits the context manager, thus closing the buffer.

In #672 I'll add a clause to the guard that checks whether a new StringIO should be opened. It's not a great solution: it will drop at least the one log line that was attempted to be written. However, hopefully it allows the app to recover and/or write more informative errors. It's late now, I can't come up with anything better. I'll try to continue tomorrow; thoughts and ideas welcome.
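For reference, a minimal sketch of the kind of guard meant above; the class and attribute names here are made up for illustration and do not match the actual identifiers in amlb/logger.py:

```python
import io

class BufferedPrinter:
    """Hypothetical stand-in for the print/log buffering in amlb/logger.py."""

    def __init__(self):
        self.buffer = io.StringIO()

    def write(self, msg: str):
        # Guard: if an earlier error exited the context manager and closed the
        # buffer, open a fresh StringIO instead of raising
        # "ValueError: I/O operation on closed file". The line that caused the
        # close is lost, but subsequent writes can recover.
        if self.buffer.closed:
            self.buffer = io.StringIO()
        self.buffer.write(msg)
```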
The root cause may well be the app logger and the framework logger writing to the same file:

automlbenchmark/amlb/logger.py, lines 77 to 82 in 98bf554
The framework logger is in a separate process, so there are probably race conditions around writing to the same file. If one process writes to the file while the other also attempts to log, that can apparently lead to stale file errors (reddit). So solving the root cause will likely mean redoing the logging with something multiprocessing-safe. A quick hack would be to keep separate logs in separate files, but switching to something multiprocess-safe like loguru is probably inevitable if we want to keep single logs. It does at least match the sporadic nature of the failures.
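For illustration, one multiprocessing-safe pattern from the standard library is to funnel records through a queue so that only one process ever writes the file. This is just a sketch of the pattern, not a drop-in for AMLB's logger setup:

```python
import logging
import logging.handlers
import multiprocessing as mp

def worker(queue: mp.Queue):
    # The child process only puts records on the queue; it never opens the file.
    logger = logging.getLogger("framework")
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.setLevel(logging.INFO)
    logger.info("hello from the framework process")

if __name__ == "__main__":
    queue = mp.Queue()
    file_handler = logging.FileHandler("single.log", mode="a")
    # The listener runs in the parent process and is the only writer of single.log.
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()

    p = mp.Process(target=worker, args=(queue,))
    p.start()
    p.join()

    listener.stop()
```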
I tried to write an MWE which uses the Python logger to write to the same log from two processes, but it does not seem to generate the same error despite writing gigabytes of log data (gist).
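The gist itself isn't reproduced here, but the shape of such an MWE might be something like the following, assuming plain FileHandlers attached to the same (hypothetical) shared.log in both processes:

```python
import logging
import multiprocessing as mp

LOG_FILE = "shared.log"  # hypothetical path; both processes append to it

def spam(name: str, lines: int):
    # Each process attaches its own FileHandler to the same file,
    # mirroring the app logger / framework logger situation.
    logger = logging.getLogger(name)
    logger.addHandler(logging.FileHandler(LOG_FILE, mode="a"))
    logger.setLevel(logging.INFO)
    for i in range(lines):
        logger.info("process=%s line=%d %s", name, i, "x" * 200)

if __name__ == "__main__":
    procs = [mp.Process(target=spam, args=(f"p{i}", 1_000_000)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```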
While it doesn't address the stale file error, the other part of the problem is the overwritten `print`:

```python
import logging
import builtins

root = logging.getLogger()
handler = logging.FileHandler("log.txt", mode="a")
root.addHandler(handler)

def new_print(*args, sep=" ", end="\n", file=None):
    # Every print call now goes through the root logger instead of stdout.
    root.log(logging.ERROR, sep.join(map(str, [*args])))

builtins.print = new_print
print("Hey!")
```

And you'll need to modify your Python's `StreamHandler`:
```diff
 # logging/__init__.py lines 1080-1090ish (depending on python version)
 try:
     msg = self.format(record)
     stream = self.stream
     # issue 35046: merged two stream.writes into one.
     stream.write(msg + self.terminator)
+    raise OSError("Oops!")
     self.flush()
 except RecursionError:  # See issue 36272
     raise
 except Exception:
     self.handleError(record)
```
@PGijsbers roughly 1 year ago, but I'm unsure of the exact date. I was also using an old version of AMLB at the time, but it still occurs after I switched to mainline. My issue is either a dependency upgrade causing a bug or AWS causing a bug, I'm unsure which; I haven't been able to solve it despite spending over a week of effort trying.
That's interesting. I did not observe any such failures during the 2024 paper experiments (June '23, v2.1.x), even though that was after what I would guess is the main culprit (the v2.1.0 release, with the bump to Py3.9 and dependency upgrades, in June 2023). This error almost certainly has nothing to do with it, unless something in the underlying stack causes more frequent stale file handles - the AMLB code around this hasn't changed since January 2019 (or 2020, for a minor change in how this method itself is invoked). :/
Yeah, unsure. It is probably something weird on my end with the AWS account.
I'm sporadically encountering an error in logging which boils down to this line, which states `ValueError: I/O operation on closed file`. It's not conclusive and I can't make a reproducing script; just rerunning the failed jobs seems to work ~99% of the time.

automlbenchmark/amlb/logger.py, line 98 in aded7c7

I'm not sure where, but somehow this error cascades into a recursion error as the error keeps trying to get logged (perhaps in the overwritten `print`?). The initial trigger seems to come from inside `framework_module.run()`, which then cascades several times. I can say for certain that

automlbenchmark/amlb/benchmark.py, lines 602 to 613 in aded7c7

This is a trimmed-down version of the `exec.py` file:

Here is the traceback; I cut out all of the recursion that happens, as the original file is 20k lines long.