
[PoC] Cross-silo error broadcasting #175

Merged: 21 commits merged into main from cross_silo_err on Sep 19, 2023

Conversation

@NKcqx (Collaborator) commented Aug 23, 2023

Background

Before this PR, when DAG execution encountered an error in 'alice', the following happened (see the attached diagram):

In alice, both the main thread and the data-sending thread raised the error, and the process exited.
In bob, since it needed input from 'alice', it waited for 'alice' forever, regardless of whether 'alice' still existed.

Therefore, we need a mechanism to inform the other participants when DAG execution raises an error.

What's in this PR

After this PR, the following happens instead (see the attached diagram):

In alice, when the data-sending thread finds a RayTaskError indicating an execution failure, it wraps the error into a RemoteError object and sends that to bob in place of the original data object.
In bob, the main thread polls data from the receiver actor; on finding that the data is of type RemoteError, it re-raises it, producing the same exception as in alice.
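The two halves of this flow can be sketched roughly as follows; RemoteError's fields and the send_data/recv_data helpers here are simplified stand-ins for RayFed's actual internals, not its real API:

```python
class RemoteError(Exception):
    """Wraps an execution failure so it can be shipped across silos."""

    def __init__(self, src_party, cause):
        self._src_party = src_party
        self._cause = cause

    def __str__(self):
        return f'RemoteError occurred at {self._src_party} caused by {self._cause}'


def send_data(dest_party, src_party, data):
    """Sender side: replace a failed result with a RemoteError object."""
    if isinstance(data, Exception):  # e.g. a RayTaskError
        data = RemoteError(src_party, data)
    return data  # transmitted to dest_party in the real system


def recv_data(data):
    """Receiver side: re-raise the error as if it happened locally."""
    if isinstance(data, RemoteError):
        raise data
    return data
```

With this shape, bob's main thread does not need to distinguish local from remote failures: both surface as an exception at the point where the data would have been used.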

The threading model in this PR is shown in the attached diagram.

Explanation of the _atomic_shutdown_flag

When a failure happens, both the main thread and the data thread see the error and trigger shutdown, which would execute the failure handler twice. The typical way to ensure failure_handler runs only once is to set a flag recording whether it has already run, and to guard the flag with a threading.Lock, since it is a critical section.

However, this causes a deadlock, as shown in the attached diagram. The data thread triggers the shutdown stage by sending a SIGINT signal, which is delivered as a KeyboardInterrupt (step 8). To handle the exception, the runtime preserves the context of the current thread, including the lock acquired in step 6, and switches to the error handler, i.e. the signal handler, in step 9. Since the lock has not yet been released, acquiring the same lock again deadlocks (step 10).

The solution is to try the lock before sending the signal. That lock is the _atomic_shutdown_flag.
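This check-before-signal idea can be sketched as follows; the function names and module layout here are illustrative, not RayFed's exact implementation:

```python
import os
import signal
import threading

# The flag guarding the one-shot failure handler (assumed name).
_atomic_shutdown_flag = threading.Lock()


def try_acquire_shutdown_flag() -> bool:
    """Non-blocking acquire: returns True for exactly one caller."""
    return _atomic_shutdown_flag.acquire(blocking=False)


def trigger_failure_shutdown() -> None:
    # Take the flag BEFORE sending SIGINT. If another thread already
    # holds it, shutdown is in progress, and interrupting the main
    # thread could deadlock the signal handler on this very lock.
    if try_acquire_shutdown_flag():
        os.kill(os.getpid(), signal.SIGINT)
```

Because the acquire is non-blocking and happens before the interrupt is raised, the signal handler never has to contend for a lock that the interrupted thread still holds.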

paer added 7 commits on August 23, 2023 (each signed off by paer <chenqixiang.cqx@antgroup.com>)
@NKcqx NKcqx changed the title Cross-silo error broadcasting [PoC] Cross-silo error broadcasting Aug 24, 2023


class MessageQueue:
    def __init__(self, msg_handler, failure_handler=None, name=''):
Collaborator:
Is this thread-safe?

Collaborator Author:
The inner implementation, i.e. deque, is thread-safe, and MessageQueue is only used inside RayFed rather than as a public util.
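As a rough illustration of that point, a single-producer/single-consumer queue built on deque needs no extra lock, because deque.append() and deque.popleft() are atomic; this is a simplified sketch, not the actual RayFed class:

```python
from collections import deque


class MessageQueue:
    """Simplified sketch: one producer thread pushes, one consumer drains."""

    def __init__(self, msg_handler, failure_handler=None, name=''):
        self._queue = deque()
        self._msg_handler = msg_handler
        self._failure_handler = failure_handler
        self._name = name

    def push(self, msg):
        self._queue.append(msg)  # atomic on deque

    def process_one(self):
        """Handle one pending message; return False if the queue is empty."""
        if self._queue:
            msg = self._queue.popleft()  # atomic on deque
            self._msg_handler(msg)
            return True
        return False
```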

fed/cleanup.py Outdated
lambda msg: self._process_data_message(msg),
name='DataQueue')

self._sending_error_q = MessageQueue(
Collaborator:
Why do we need separate queues instead of sharing one queue?

Collaborator Author:
Errors should be sent out-of-band.

Collaborator:
How should "out-of-band" be understood in this context?

@@ -188,6 +188,11 @@ async def get_data(self, src_party, upstream_seq_id, curr_seq_id):
data = await self._proxy_instance.get_data(
src_party, upstream_seq_id, curr_seq_id
)
if isinstance(data, Exception):
Collaborator:
I think we only need to expose exc_type and src_party; the exception message should be private.

Collaborator Author:
Indeed, the exception will be wrapped into a RemoteException that contains only the type and src_party before being sent to other parties.

paer added 8 commits on August 29, 2023 (each signed off by paer <chenqixiang.cqx@antgroup.com>)
@NKcqx NKcqx requested review from a team and fengsp September 11, 2023 06:30
Signed-off-by: paer <chenqixiang.cqx@antgroup.com>
@NKcqx NKcqx added the enhancement New feature or request label Sep 11, 2023
@NKcqx NKcqx added this to the release0.1.1 milestone Sep 11, 2023
@fengsp (Collaborator) commented Sep 11, 2023

Good docs! Maybe it would be better to maintain a RayFed Enhancement Proposals directory and put this in it, so that these docs are better managed?

    self._cause = cause

def __str__(self):
    return f'RemoteError occurred at {self._src_party} caused by {str(self._cause)}'
Collaborator:
Is {str(self._cause)} only the type of the exception? We need to make sure this does not include error-message details.

Collaborator Author:
No, this includes the full error stack.
May I ask why details can't be included?

Collaborator:
Users' error messages may contain sensitive information, which is private.

Collaborator Author:
Got it. I'll add a flag to control whether the stack trace is exposed, defaulting to False.
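A minimal sketch of such a flag; expose_error_trace is a hypothetical name, and the real parameter may differ:

```python
class RemoteError(Exception):
    """Cross-silo error wrapper that hides message details by default."""

    def __init__(self, src_party, cause, expose_error_trace=False):
        self._src_party = src_party
        self._cause = cause
        self._expose_error_trace = expose_error_trace

    def __str__(self):
        if self._expose_error_trace:
            detail = str(self._cause)           # full message, opt-in only
        else:
            detail = type(self._cause).__name__  # type name only, no message
        return f'RemoteError occurred at {self._src_party} caused by {detail}'
```

Defaulting to False means a party never leaks another party's error text unless that party explicitly opted in.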

# can avoid shutdown twice, so that the failure handler can be
# executed only once.

if (get_global_context() is not None):
Collaborator:
Is get_global_context thread safe?

Collaborator Author:
No, it's not. The risk only exists when intended=False, because every function other than the user-defined failure_handler can be executed multiple times without side effects, and failure_handler is only called when intended=False.

As for the intended=False case:

  1. The entry into shutdown(intended=False) inside RayFed is thread-safe; other calls from driver code are not. There is indeed a risk that a user calls shutdown(intended=False) in the driver without acquiring the lock, which would accidentally execute the failure_handler twice. What do you think about exposing only shutdown(intended=True) as the API, and making shutdown(intended=False) a private method?

  2. As described in "The explanation of the _atomic_shutdown_flag", since shutdown (where get_global_context is called) can be entered from the OS interrupt's error handler, any lock inside shutdown may hang if it was acquired but not yet released before the interrupt. Therefore, making get_global_context() thread-safe is not recommended.

Collaborator:
What if the global context is set to None by other threads?

Collaborator Author:
This SHOULD only be done by calling shutdown.
Although there is a risk that a developer breaks this rule by mistake, shutdown indeed cannot use a lock to be made thread-safe.

fed/cleanup.py Outdated
if isinstance(e, RayError):
    logger.info(f"Sending error {e.cause} to {dest_party}.")
    from fed.proxy.barriers import send
    # TODO(NKcqx): Maybe broadcast to all parties?
Collaborator:
I believe we have to broadcast to all parties. For example, with A -> B -> C: if A triggers an error, then A and B exit, but won't C hang forever?

Collaborator Author:
I'll implement that in another PR, since this PR is already hard to understand.

paer added 2 commits on September 11, 2023 (each signed off by paer <chenqixiang.cqx@antgroup.com>)
@jovany-wang (Collaborator):
Solid work! I'm going to review soon.

def get_failure_handler(self) -> Callable[[], None]:
    return self._failure_handler

def acquire_shutdown_flag(self) -> bool:
Collaborator:
It's too weird to pass the method to the cleanup manager. A better practice, I believe, is to implement it in the cleanup manager and then use it wherever needed.

Collaborator Author:
I understand. The reason for doing it this way: the lock has to be maintained in GlobalContext, because it is part of the job context; but accessing it directly in CleanupManager would require importing global_context, which creates a circular import.
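The workaround being described, injecting the bound method instead of importing the module, might look like this (names are illustrative):

```python
import threading
from typing import Callable


class GlobalContext:
    """Owns the shutdown lock, since it is part of the job context."""

    def __init__(self):
        self._shutdown_flag = threading.Lock()

    def acquire_shutdown_flag(self) -> bool:
        # Non-blocking: only the first caller gets True.
        return self._shutdown_flag.acquire(blocking=False)


class CleanupManager:
    """Receives the bound method as a callback, so cleanup.py never
    imports global_context and the import cycle is avoided."""

    def __init__(self, acquire_shutdown_flag: Callable[[], bool]):
        self._acquire_shutdown_flag = acquire_shutdown_flag

    def signal_failure(self) -> bool:
        return self._acquire_shutdown_flag()
```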

fed/cleanup.py Outdated
lambda msg: self._process_data_message(msg),
name='DataQueue')

self._sending_error_q = MessageQueue(
Collaborator:
How should "out-of-band" be understood in this context?

paer added 2 commits on September 15, 2023 (each signed off by paer <chenqixiang.cqx@antgroup.com>)
@NKcqx (Collaborator, Author) commented Sep 18, 2023

In this context, "out-of-band" refers to separating control messages, including the error message, from data messages.
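A toy illustration of this separation (assumed, simplified behavior): errors travel on their own queue and are drained before any pending data, so they are never stuck behind a backlog of ordinary messages:

```python
from collections import deque

# Two channels, mirroring the separate DataQueue / error queue split.
data_q = deque()
error_q = deque()


def next_message():
    """Control messages (errors) are always delivered first."""
    if error_q:
        return ('error', error_q.popleft())
    if data_q:
        return ('data', data_q.popleft())
    return None
```

If both message kinds shared one FIFO queue, an error raised mid-run would wait behind every queued data message before the other party could learn about the failure.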

@jovany-wang (Collaborator) left a comment:

LGTM. Just left some minor comments.

@fengsp @zhouaihui Please take a look; do you think it's OK to add the flag in the next PR?

If False: forcibly kill the for-loop sub-thread.
"""
if threading.current_thread() == self._thread:
    logger.error(f"Can't stop the message queue in the message"
Collaborator:
I believe this should be an assertion, or it should raise an error instead of just logging, because reaching this code path indicates a bug.

@jovany-wang (Collaborator) left a comment:

And let's hold this merge until the REP is done.

Signed-off-by: paer <chenqixiang.cqx@antgroup.com>
@NKcqx (Collaborator, Author) commented Sep 19, 2023

> Good docs! Maybe it would be better to maintain a RayFed Enhancement Proposals directory and put this in it, so that these docs are better managed?

I've filed a RayFed enhancement proposal; please take a look: #179

cc @jovany-wang

@jovany-wang (Collaborator): Thanks. It got approved!

@NKcqx NKcqx merged commit b688dc9 into main Sep 19, 2023
12 checks passed
@jovany-wang jovany-wang deleted the cross_silo_err branch September 19, 2023 06:27
Labels
enhancement New feature or request