
Finetuning on Ray and CPU causes Runtime error #242

premdass opened this issue May 31, 2024 · 12 comments

Ray version: ray 2.10
llm-on-ray: latest from main branch
Command used to run: llm_on_ray-finetune --config_file llm-on-ray/llm_on_ray/finetune/finetune.yaml

RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed

@harborn (Contributor) commented Jun 3, 2024

Maybe you need to set the oneCCL environment variables, by just calling:

source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
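
To double-check that those variables actually reach the Ray worker processes, something like the following can be run against the cluster (a rough sketch, not repo code; the variable names listed are illustrative):

import os
import ray

ray.init(address="auto")

# Hypothetical check: run a task on the workers and print the
# oneCCL/libfabric-related variables to confirm setvars.sh propagated.
@ray.remote
def show_ccl_env():
    keys = ("CCL_ROOT", "CCL_ATL_TRANSPORT", "LD_LIBRARY_PATH",
            "FI_PROVIDER", "FI_TCP_IFACE")
    return {k: os.environ.get(k) for k in keys}

print(ray.get([show_ccl_env.remote() for _ in range(2)]))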

@premdass (Author) commented Jun 3, 2024

The oneCCL environment has been sourced correctly before Ray starts (I can see it in the worker startup logs).

@harborn (Contributor) commented Jun 4, 2024

Which version of oneccl-bind-pt have you installed?
Here is the version I use:

oneccl-bind-pt              2.2.0+cpu
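
For reference, a quick way to check the installed version (assuming the package was installed under the distribution name oneccl_bind_pt):

import importlib.metadata as md

# Should print something like 2.2.0+cpu if the same build is installed.
print(md.version("oneccl_bind_pt"))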

@KepingYan (Contributor)

Hi @premdass, is the Ray cluster started on a single node or multiple nodes? Also, could you remove these two parameters

"FI_TCP_IFACE": "lo",
"FI_PROVIDER": "tcp",

in llm_on_ray/finetune/finetune.py and try again?
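
For illustration, the change amounts to deleting those two keys from the env_vars block that finetune.py passes to Ray (a minimal sketch, not the exact repo code; the other entry shown is a placeholder for whatever finetune.py already sets):

import ray

runtime_env = {
    "env_vars": {
        "OMP_NUM_THREADS": "56",      # placeholder for existing settings
        # "FI_TCP_IFACE": "lo",       # removed per the suggestion above
        # "FI_PROVIDER": "tcp",       # removed per the suggestion above
    }
}

ray.init(runtime_env=runtime_env)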

@premdass (Author) commented Jun 4, 2024

@harborn @KepingYan: Thanks for responding. Please find the env details below:

oneccl-bind-pt = 2.2.0+cpu
Ray = 2.10
K8s = 1.29

I have run finetune.py without the FI_TCP_IFACE and FI_PROVIDER params and am still seeing the same runtime error. All the ports between the worker nodes are open as well.

I enabled CCL debug logging and am seeing the error below on the worker nodes:

2024:06:04-01:54:41:( 3387) |CCL_DEBUG| datatype.cpp:69 ccl_datatype_storage: create datatype_storage
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| hwloc_wrapper.cpp:69 ccl_hwloc_wrapper: hwloc root object: type: Machine
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| internal_kvs.cpp:323 fill_local_host_ip: use ipv4: 100.64.141.71
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| communicator_impl.hpp:115 create_communicator: size 2, rank 0
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| atl_ofi_comm.cpp:265 init_transport: init atl, requested ep_count 1
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
2024:06:04-01:56:55:( 3161) |CCL_DEBUG| buffer_cache.cpp:60 clear: clear buffer cache: size: 0
2024:06:04-01:56:55:( 3161) |CCL_DEBUG| ofi_api_wrapper.cpp:48 ofi_api_fini: close OFI lib: handle: 0x7fc4543bca80
2024:06:04-01:56:55:( 3161) |CCL_DEBUG| mpi_api_wrapper.cpp:50 mpi_api_fini: close MPI lib: handle: 0x7fc4543bf040

@premdass (Author) commented Jun 4, 2024

Just to add context, I am running this in a Kubernetes/container environment, so the Ray workers are pods/containers. Do I have to do something specific to enable MPI in a container environment?

@xwu99 (Contributor) commented Jun 5, 2024

> Just to add context, I am running this in a Kubernetes/container environment, so the Ray workers are pods/containers. Do I have to do something specific to enable MPI in a container environment?

It looks like the error is oneCCL's init transport failing. The finetuning code works on physical nodes, so maybe the network interfaces are different in K8s and that causes the oneCCL failure. Could you try another FI_PROVIDER?
Otherwise, this could be an issue specific to oneCCL running on K8s.

@premdass could you share the full CCL_DEBUG log so that we can check what happened when oneCCL was initialized?

@mshiryaev Hi, is this something known to you? Do you know if torch-ccl needs special config on K8s?

@xwu99 (Contributor) commented Jun 5, 2024

@premdass I tried in my local K8s and it does not fail the way it does for you. Could you just set "FI_PROVIDER": "tcp" and remove "FI_TCP_IFACE": "lo"? Please make sure to rebuild the Docker images to include the code updates for K8s.

Could you also share your full log with CCL_DEBUG enabled so that we know which interface and provider were selected? How many network interfaces do you have for each container?
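
To answer the interface question, a small check like this can be run inside each worker pod, e.g. via kubectl exec (a hypothetical helper, not from the repo):

import socket

# List the network interfaces visible inside the container, so we can see
# which ones libfabric could bind to (e.g. lo, eth0).
print([name for _, name in socket.if_nameindex()])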

@premdass (Author) commented Jun 25, 2024

Apologies for the delayed response @xwu99. I have enabled debug logs for CCL, with tcp as FI_PROVIDER. Below are the logs when I grep for ccl on the Ray worker nodes:

2024:06:25-09:23:14:( 5098) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2024:06:25-09:23:14:( 5098) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2024:06:25-09:23:14:( 5098) |CCL_INFO| process launcher: hydra, local_proc_idx: -1, local_proc_count: -1
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| ofi_api_wrapper.cpp:38 ofi_api_init: OFI lib path: libfabric.so.1
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| mpi_api_wrapper.cpp:40 mpi_api_init: MPI lib path: libmpi.so.12
2024:06:25-09:23:14:( 5098) |CCL_INFO| OS info: { Linux ray-cpu-cluster-train-kuberay-worker-workergroup-8j2mb 5.10.218-208.862.amzn2.x86_64 #1 SMP Tue Jun 4 16:52:10 UTC 2024 x86_64 }
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| datatype.cpp:69 ccl_datatype_storage: create datatype_storage
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| hwloc_wrapper.cpp:69 ccl_hwloc_wrapper: hwloc root object: type: Machine
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| internal_kvs.cpp:323 fill_local_host_ip: use ipv4: 100.64.183.160
2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (130) >= limit (120)
2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs_server.hpp:66 put: read/write error: Broken pipe
2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs.cpp:108 kvs_get_value_by_name_key: client: get_value
2024:06:25-09:25:24:( 5098) |CCL_ERROR| pmi_resizable_simple_internal.cpp:319 get_local_kvs_id: failed to get local kvs id
2024:06:25-09:25:24:( 5098) |CCL_ERROR| pmi_resizable_simple_internal.cpp:65 pmrt_init: failed to get local id
2024:06:25-09:25:24:( 5098) |CCL_ERROR| atl_ofi_comm.cpp:268 init_transport: pmi init failed
2024:06:25-09:25:24:( 5098) |CCL_ERROR| atl_ofi_comm.cpp:79 atl_ofi_comm: condition init_transport(true) == ATL_STATUS_SUCCESS failed
2024:06:25-09:25:24:( 5098) |CCL_DEBUG| communicator_impl.hpp:115 create_communicator: size 2, rank 1
2024:06:25-09:25:24:( 5098) |CCL_DEBUG| atl_ofi_comm.cpp:265 init_transport: init atl, requested ep_count 1
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
2024:06:25-09:25:26:( 4890) |CCL_DEBUG| buffer_cache.cpp:60 clear: clear buffer cache: size: 0
2024:06:25-09:25:26:( 4890) |CCL_DEBUG| ofi_api_wrapper.cpp:48 ofi_api_fini: close OFI lib: handle: 0x7fab7ee19d20
2024:06:25-09:25:26:( 4890) |CCL_DEBUG| mpi_api_wrapper.cpp:50 mpi_api_fini: close MPI lib: handle: 0x7fab7ee1d2d0

@premdass (Author)

A bit more detail to add: I have two interfaces, lo and eth0, and I tried with both names and ended up with a similar error. I am trying to run distributed training with two Ray worker nodes. Does it need any entries in a hostfile or something similar to find the other worker node?

@xwu99 (Contributor) commented Jul 1, 2024

> 2024:06:25-09:23:14:( 5098) |CCL_DEBUG| hwloc_wrapper.cpp:69 ccl_hwloc_wrapper: hwloc root object: type: Machine
> 2024:06:25-09:23:14:( 5098) |CCL_DEBUG| internal_kvs.cpp:323 fill_local_host_ip: use ipv4: 100.64.183.160
> 2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (130) >= limit (120)

This shows oneCCL doesn't initialize correctly due to a connection timeout. It seems to be a network problem. How do you set up your Ray cluster in K8s? There is the KubeRay project to help set up a Ray cluster properly on K8s.
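
One rough way to confirm whether this is plain pod-to-pod connectivity (a hypothetical check, not from the repo; the oneCCL KVS port is not shown in the log, so the port below is a placeholder you would need to substitute):

import socket

# Try to reach the IP that the worker reported in fill_local_host_ip
# (100.64.183.160 in the log above) from the other worker pod; a timeout
# here points at the K8s network/CNI policy rather than oneCCL itself.
HOST = "100.64.183.160"
PORT = 12345  # placeholder port

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print("TCP connection succeeded")
except OSError as exc:
    print(f"TCP connection failed: {exc}")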

@premdass (Author) commented Jul 9, 2024

KubeRay is being used to set up the clusters in this case. I need to dig into why CCL cannot init. Any pointers, please?
