
CloudWatchLogHandler deadlocks when used with PYTHONBREAKPOINT=ipdb.set_trace #197

Open
dreid opened this issue Apr 4, 2024 · 2 comments

dreid commented Apr 4, 2024

Versions:

  • watchtower==3.1.0
  • ipython==8.23.0
  • ipdb==0.13.13
  • Python 3.11

Reproduction

The following script hangs when run with PYTHONBREAKPOINT=ipdb.set_trace:

from __future__ import annotations

from logging import INFO, LogRecord

from watchtower import CloudWatchLogHandler

log_record = LogRecord(
    name="",
    level=INFO,
    pathname="",
    lineno=0,
    msg="ok",
    args={},
    exc_info=None,
    func="",
    sinfo="",
)

handler = CloudWatchLogHandler()

print("sending messages")
handler.emit(log_record)

print("breaking")
breakpoint()

The running process has the following stack traces:

💤 Process 56995 🧵 Thread 0:6155792384

    Thread._bootstrap (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/threading.py:995)
    Thread._bootstrap_inner (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/threading.py:1038)
    Thread.run (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/threading.py:975)
    CloudWatchLogHandler._dequeue_batch (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/watchtower/__init__.py:470)
    CloudWatchLogHandler._submit_batch (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/watchtower/__init__.py:344)
    ClientCreator._create_api_method.<locals>._api_call (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/client.py:553)
    BaseClient._make_api_call (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/client.py:946)
    BaseClient._emit_api_params (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/client.py:1072)
    EventAliaser.emit (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/hooks.py:413)
    HierarchicalEmitter.emit (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/hooks.py:257)
    HierarchicalEmitter._emit (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/hooks.py:238)
    Logger.debug (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/__init__.py:1476)
    Logger.isEnabledFor (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/__init__.py:1744)
    _acquireLock (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/__init__.py:235)


💤 Process 56995 🧵 Thread 0:7911054400

    <module> (/Users/dreid/src/pilot/zapgram/t.py:25)
    set_trace (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/ipdb/__main__.py:77)
    _init_pdb (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/ipdb/__main__.py:54)
    _get_debugger_cls (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/ipdb/__main__.py:34)
    catch_config_error.<locals>.inner (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:118)
    TerminalIPythonApp.initialize (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/IPython/terminal/ipapp.py:269)
    catch_config_error.<locals>.inner (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:118)
    BaseIPythonApplication.initialize (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/IPython/core/application.py:480)
    catch_config_error.<locals>.inner (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:118)
    Application.parse_command_line (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:880)
    Application._create_loader (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:791)
    TraitType.__get__ (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/traitlets.py:687)
    TraitType.get (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/traitlets.py:653)
    HasTraits._notify_observers (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/traitlets.py:1568)
    Application._observe_logging_default (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:287)
    Application._configure_logging (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:293)
    dictConfig (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/config.py:823)
    DictConfigurator.configure (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/config.py:546)
    _clearExistingHandlers (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/config.py:285)
    shutdown (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/__init__.py:2192)
    CloudWatchLogHandler.flush (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/watchtower/__init__.py:492)
    Queue.join (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/queue.py:90)
    Condition.wait (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/threading.py:320)

What is going on?

The top trace is the handler thread dequeuing messages, and the bottom is the main thread.

The main thread is waiting for the queue to be emptied, and the handler thread is waiting to acquire the logging module's global lock, which is already held by DictConfigurator.configure in the main thread.

The handler thread has to acquire the global lock in Logger.isEnabledFor because importing ipdb, as a side effect, calls Logger.setLevel(…), which invalidates the enabled-level cache of every Logger instance, and then calls logging.config.dictConfig, which tries to shut down all existing handlers.

This can also be reproduced if breakpoint() is replaced by:

print("setting level")
logger.setLevel(INFO)

print("changing logging config")
dictConfig({"version": 1})

This results in a pretty straightforward deadlock: the handler thread is not able to actually send the logs to CloudWatch, so it is never able to empty the queue.

Solutions?

The current behavior of flush is to wait for the queue to be empty using queue.Queue.join.

queue.Queue.join doesn't take a timeout; however, it is implemented as:

    def join(self):
        with self.all_tasks_done:
            while self.unfinished_tasks:
                self.all_tasks_done.wait()

and queue.Queue.all_tasks_done is a threading.Condition, and threading.Condition.wait does take a timeout. So it is possible to implement a join that times out as well.
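
A minimal sketch of such a join with a deadline (note that all_tasks_done and unfinished_tasks are CPython implementation details rather than public API, so this is illustrative, not the implementation watchtower shipped):

    import queue
    import time

    def join_with_timeout(q: queue.Queue, timeout: float) -> bool:
        """Wait up to `timeout` seconds for every queued task to be marked
        done; return True if the queue drained, False on timeout."""
        deadline = time.monotonic() + timeout
        with q.all_tasks_done:
            while q.unfinished_tasks:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                q.all_tasks_done.wait(remaining)
        return True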

flush could also be modified to send (self.FLUSH, flush_condition) into each queue, and then call flush_condition.wait(timeout) to wait for that specific flush message to have been processed.

This changes the behavior of flush under load, though: instead of waiting until the queue is empty, it waits until a specific flush is processed. In a system where flush is called from multiple threads, each caller would only wait until its own flush message was processed and then proceed.

But this might be the more straightforward behavior, since it could be counterintuitive, given a queue state of msg1, msg2, msg3, flush, msg4, msg5, for flush to block until msg5 is delivered. A rough sketch of the sentinel approach follows.
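
Here FLUSH and the worker loop are hypothetical stand-ins, not watchtower's actual internals:

    import queue
    import threading

    FLUSH = object()  # hypothetical flush sentinel

    def flush(q: queue.Queue, timeout: float) -> bool:
        """Enqueue a flush marker and wait for the worker to reach it."""
        done = threading.Event()
        q.put((FLUSH, done))
        return done.wait(timeout)  # False if the worker never got there

    def worker(q: queue.Queue):
        while True:
            item = q.get()
            if isinstance(item, tuple) and item[0] is FLUSH:
                # ... submit any buffered batch to CloudWatch here ...
                item[1].set()  # wake the caller blocked in flush()
            else:
                pass  # ... batch the log record ...
            q.task_done()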

Once flush/close are able to time out at all, they could also be taught to check whether self.sequence_tokens[log_stream_name] has changed in the timeout case, allowing them to keep waiting as long as batches are still being submitted.
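
Roughly, using the join_with_timeout helper sketched above and assuming a per-stream queue in handler.queues (an assumption about watchtower's internals):

    def flush_with_progress(handler, log_stream_name: str, timeout: float) -> bool:
        """Keep waiting in timeout-sized slices as long as the sequence token
        keeps changing, i.e. batches are still being submitted."""
        last_token = handler.sequence_tokens.get(log_stream_name)
        while True:
            if join_with_timeout(handler.queues[log_stream_name], timeout):
                return True  # queue drained
            current = handler.sequence_tokens.get(log_stream_name)
            if current == last_token:
                return False  # no progress within the timeout
            last_token = current  # progress made; wait another round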

kislyuk (Owner) commented Apr 14, 2024

Thanks for your proposed solution! I am definitely interested in adding timeouts and implementing them as you describe.

kislyuk (Owner) commented Jun 27, 2024

The queue wait timeout was implemented in ebf368f and released in v3.2.0.
