[Recoverable] Allow certain tasks to gracefully reconnect without crashing the entire cluster during transient errors (i.e. preemption).

PiperOrigin-RevId: 696927974
Google-ML-Automation committed Nov 15, 2024
1 parent 7280b9a commit 54fad16
Showing 7 changed files with 822 additions and 92 deletions.
3 changes: 3 additions & 0 deletions xla/tsl/distributed_runtime/coordination/BUILD
@@ -152,6 +152,7 @@ tsl_gpu_library(
"@com_google_absl//absl/base:core_headers",
"@com_google_absl//absl/container:flat_hash_map",
"@com_google_absl//absl/container:flat_hash_set",
+ "@com_google_absl//absl/flags:flag",
"@com_google_absl//absl/functional:bind_front",
"@com_google_absl//absl/log",
"@com_google_absl//absl/log:check",
@@ -247,6 +248,7 @@ tsl_cc_test(
name = "client_server_test",
size = "medium",
srcs = ["client_server_test.cc"],
+ shard_count = 4,
deps = [
"//xla/tsl/distributed_runtime/coordination:coordination_client",
"//xla/tsl/distributed_runtime/coordination:coordination_service",
@@ -261,6 +263,7 @@ tsl_cc_test(
"//xla/tsl/protobuf:distributed_runtime_payloads_proto_cc_impl",
"@com_google_absl//absl/log",
"@com_google_absl//absl/status",
+ "@com_google_absl//absl/status:statusor",
"@com_google_absl//absl/strings",
"@com_google_absl//absl/synchronization",
"@com_google_absl//absl/time",