What happened + What you expected to happen
After manually setting up a Ray cluster on a fully connected network of nodes, we run an offline batch inference application. If one of the links between client and server is broken, the client cannot send information back to the server, so communication (and thus processing) stops until the link is brought up again. Sometimes, when the link is brought up again, communication does not resume and the actor starts logging a stale RPC error repeatedly.
Because the client and server are still reachable through other paths, communication should not break; it should be routed through a different path. I believe the way gRPC is set up in Ray prevents this. In this setup, communication should never stop.
As a test, I am running a 3-node fully connected network, where one node runs the server with num_cpus set to 0 and another is a Ray client. The remaining node is there to provide an alternate route. On the server, we run a simple remote counting script, which I have attached in the Reproduction script section. After removing the direct link, we verify with ping and traceroute that client and server can still communicate over the new path.
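Ping and traceroute confirm IP-level reachability, but gRPC runs over TCP, so it can also be useful to check that a TCP connection to the server opens over the alternate route. A minimal sketch using only the Python standard library; `tcp_reachable` is a hypothetical helper, and the host and port you pass are assumptions that depend on your own cluster (neither is stated in this report):

```python
import socket


def tcp_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within timeout_s."""
    try:
        # create_connection performs the TCP handshake; refusals, timeouts,
        # and unreachable routes all surface as OSError subclasses.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Running this from the client node against the server's address, once with the direct link up and once with it removed, shows whether TCP connections (and not only ICMP) survive the reroute. It only tests new connections; it says nothing about connections that were already open when the link went down.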
We sometimes get the following logging error repeatedly after the link is brought up again:
(Actor pid=37622, ip=172.16.0.4) [2024-06-27 15:42:52,883 E 37622 37622] actor_scheduling_queue.cc:135: Cancelling stale RPC with seqno 193476 < 193477
(Actor pid=37622, ip=172.16.0.4) [2024-06-27 15:42:53,883 E 37622 37622] actor_scheduling_queue.cc:135: Cancelling stale RPC with seqno 193476 < 193477
(Actor pid=37622, ip=172.16.0.4) [2024-06-27 15:42:54,884 E 37622 37622] actor_scheduling_queue.cc:135: Cancelling stale RPC with seqno 193476 < 193477
(Actor pid=37622, ip=172.16.0.4) [2024-06-27 15:42:55,885 E 37622 37622] actor_scheduling_queue.cc:135: Cancelling stale RPC with seqno 193476 < 193477
(Actor pid=37622, ip=172.16.0.4) [2024-06-27 15:42:56,888 E 37622 37622] actor_scheduling_queue.cc:135: Cancelling stale RPC with seqno 193476 < 193477
(Actor pid=37622, ip=172.16.0.4) [2024-06-27 15:42:57,887 E 37622 37622] actor_scheduling_queue.cc:135: Cancelling stale RPC with seqno 193476 < 193477
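To check whether it is always the same pair of sequence numbers that repeats, the log can be summarized with a small script. This is a sketch that relies only on the message shape visible in the excerpt above; `count_stale_rpcs` is a hypothetical helper, and no interpretation of what the two numbers mean inside Ray is assumed:

```python
import re
from collections import Counter

# Matches the message text shown in the log excerpt:
# "Cancelling stale RPC with seqno <a> < <b>"
STALE_RPC_RE = re.compile(r"Cancelling stale RPC with seqno (\d+) < (\d+)")


def count_stale_rpcs(lines):
    """Count how often each (left, right) seqno pair appears in the log lines."""
    counts = Counter()
    for line in lines:
        match = STALE_RPC_RE.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts
```

Feeding it the lines of a captured log file (for example `count_stale_rpcs(open(path))`, with `path` being wherever you saved the output) returns a `Counter`; a single key with a growing count matches the behavior described here, where the same message repeats about once per second.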
Versions / Dependencies
ray 2.20.0
python 3.11.0
Reproduction script
import ray
import time

ray.init()


# This actor kills itself after executing 10 tasks.
@ray.remote(max_restarts=4, max_task_retries=-1)
class Actor:
    def __init__(self):
        self.counter = 0

    def increment_and_possibly_fail(self):
        # Exit after every 10 tasks.
        # if self.counter == 10:
        #     os._exit(0)
        self.counter += 1
        # time.sleep(10)
        return self.counter


actor = Actor.remote()

while True:
    try:
        counter = ray.get(actor.increment_and_possibly_fail.remote(), timeout=None)
        print(counter)
    except ray.exceptions.GetTimeoutError:
        print("==== GetTimeoutError ====")
        ray.init()
        print("timed out...")

for _ in range(10):
    try:
        counter = ray.get(actor.increment_and_possibly_fail.remote(), timeout=20)
        print(counter)  # Unreachable.
    except ray.exceptions.RayActorError:
        print("FAILURE")  # Prints 10 times.
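Because the main loop above calls `ray.get` with `timeout=None`, a stalled link makes the script block silently and the `GetTimeoutError` branch is never taken. One way to make the stall observable is to use a finite timeout and count consecutive timeouts. A minimal sketch, written independently of Ray so it can be tried in isolation; `run_until_stalled` and its parameters are hypothetical names, not part of Ray:

```python
def run_until_stalled(call, timeout_error, max_consecutive_timeouts=3):
    """Call `call()` repeatedly until it times out several times in a row.

    Returns the results collected before the stall was declared.
    Like the original loop, this runs forever if no stall ever occurs.
    """
    results = []
    consecutive = 0
    while consecutive < max_consecutive_timeouts:
        try:
            results.append(call())
            consecutive = 0  # Progress was made, so reset the stall counter.
        except timeout_error:
            consecutive += 1
    return results
```

With Ray, `call` could be `lambda: ray.get(actor.increment_and_possibly_fail.remote(), timeout=20)` and `timeout_error` could be `ray.exceptions.GetTimeoutError`. Note that each attempt in that form submits a new actor task, so tasks may queue up while the link is down.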
Issue Severity
High: It blocks me from completing my task.
gerardPlanella added the bug and triage labels on Jun 27, 2024.