
Finetuning on Ray and CPU causes Runtime error #242

premdass opened this issue May 31, 2024 · 12 comments

Ray version: ray 2.10
llm-on-ray: latest from main branch
Command used to run: llm_on_ray-finetune --config_file llm-on-ray/llm_on_ray/finetune/finetune.yaml

RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed

@harborn (Contributor) commented Jun 3, 2024

Maybe you need to set the oneCCL environment variables, by just calling:

source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
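
To double-check that those variables actually reach the Ray worker processes, something like the following can be run against the cluster (a rough sketch, not repo code; the variable names listed are illustrative):

import os
import ray

ray.init(address="auto")

# Hypothetical check: run a task on the workers and print the
# oneCCL/libfabric-related variables to confirm setvars.sh propagated.
@ray.remote
def show_ccl_env():
    keys = ("CCL_ROOT", "CCL_ATL_TRANSPORT", "LD_LIBRARY_PATH",
            "FI_PROVIDER", "FI_TCP_IFACE")
    return {k: os.environ.get(k) for k in keys}

print(ray.get([show_ccl_env.remote() for _ in range(2)]))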

@premdass (Author) commented Jun 3, 2024

The oneCCL environment has been sourced correctly before Ray starts (I can see it in the worker startup logs).

@harborn (Contributor) commented Jun 4, 2024

Which version of oneccl-bind-pt have you installed?
Here is the version I use:

oneccl-bind-pt              2.2.0+cpu
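
For reference, a quick way to check the installed version (assuming the package was installed under the distribution name oneccl_bind_pt):

import importlib.metadata as md

# Should print something like 2.2.0+cpu if the same build is installed.
print(md.version("oneccl_bind_pt"))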

@KepingYan (Contributor)

Hi @premdass, is the Ray cluster started on a single node or multiple nodes? Also, could you remove these two parameters

"FI_TCP_IFACE": "lo",
"FI_PROVIDER": "tcp",

in llm_on_ray/finetune/finetune.py and try again?
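
For illustration, the change amounts to deleting those two keys from the env_vars block that finetune.py passes to Ray (a minimal sketch, not the exact repo code; the other entry shown is a placeholder for whatever finetune.py already sets):

import ray

runtime_env = {
    "env_vars": {
        "OMP_NUM_THREADS": "56",      # placeholder for existing settings
        # "FI_TCP_IFACE": "lo",       # removed per the suggestion above
        # "FI_PROVIDER": "tcp",       # removed per the suggestion above
    }
}

ray.init(runtime_env=runtime_env)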

@premdass (Author) commented Jun 4, 2024

@harborn @KepingYan: Thanks for responding. Please find the env details below:

oneccl-bind-pt = 2.2.0+cpu
Ray = 2.10
K8s = 1.29

I have run finetune.py without the FI_TCP_IFACE and FI_PROVIDER params and am still seeing the same runtime error. All the ports between the worker nodes are open as well.

I enabled CCL debug logging and am seeing the error below on the worker nodes:

2024:06:04-01:54:41:( 3387) |CCL_DEBUG| datatype.cpp:69 ccl_datatype_storage: create datatype_storage
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| hwloc_wrapper.cpp:69 ccl_hwloc_wrapper: hwloc root object: type: Machine
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| internal_kvs.cpp:323 fill_local_host_ip: use ipv4: 100.64.141.71
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| communicator_impl.hpp:115 create_communicator: size 2, rank 0
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| atl_ofi_comm.cpp:265 init_transport: init atl, requested ep_count 1
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
2024:06:04-01:56:55:( 3161) |CCL_DEBUG| buffer_cache.cpp:60 clear: clear buffer cache: size: 0
2024:06:04-01:56:55:( 3161) |CCL_DEBUG| ofi_api_wrapper.cpp:48 ofi_api_fini: close OFI lib: handle: 0x7fc4543bca80
2024:06:04-01:56:55:( 3161) |CCL_DEBUG| mpi_api_wrapper.cpp:50 mpi_api_fini: close MPI lib: handle: 0x7fc4543bf040

@premdass (Author) commented Jun 4, 2024

Just to add context, I am running this in a Kubernetes/container environment, so the Ray workers are pods/containers. Do I have to do something specific to enable MPI in a container environment?

@xwu99 (Contributor) commented Jun 5, 2024

> Just to add context, I am running this in a Kubernetes/container environment, so the Ray workers are pods/containers. Do I have to do something specific to enable MPI in a container environment?

It looks like the error is oneCCL's init transport failing. The finetuning code works on physical nodes, so maybe the network interfaces are different in K8s and that causes the oneCCL failure. Could you try another FI_PROVIDER?
Otherwise, this could be an issue specific to oneCCL running on K8s.

@premdass could you share the full CCL_DEBUG log so that we can check what happened when oneCCL was initialized?

@mshiryaev Hi, is this something known to you? Do you know if torch-ccl needs special config on K8s?

@xwu99 (Contributor) commented Jun 5, 2024

@premdass I tried in my local K8s and it does not fail the way it does for you. Could you just set "FI_PROVIDER": "tcp" and remove "FI_TCP_IFACE": "lo"? Please make sure to rebuild the Docker images to include the code updates for K8s.

Could you also share your full log with CCL_DEBUG enabled so that we know which interface and provider were selected? How many network interfaces do you have for each container?
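
To answer the interface question, a small check like this can be run inside each worker pod, e.g. via kubectl exec (a hypothetical helper, not from the repo):

import socket

# List the network interfaces visible inside the container, so we can see
# which ones libfabric could bind to (e.g. lo, eth0).
print([name for _, name in socket.if_nameindex()])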

@premdass (Author) commented Jun 25, 2024

Apologies for the delayed response @xwu99. I have enabled debug logs for CCL, with tcp as FI_PROVIDER. Below are the logs when I grep for ccl on the Ray worker nodes:

2024:06:25-09:23:14:( 5098) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2024:06:25-09:23:14:( 5098) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2024:06:25-09:23:14:( 5098) |CCL_INFO| process launcher: hydra, local_proc_idx: -1, local_proc_count: -1
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| ofi_api_wrapper.cpp:38 ofi_api_init: OFI lib path: libfabric.so.1
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| mpi_api_wrapper.cpp:40 mpi_api_init: MPI lib path: libmpi.so.12
2024:06:25-09:23:14:( 5098) |CCL_INFO| OS info: { Linux ray-cpu-cluster-train-kuberay-worker-workergroup-8j2mb 5.10.218-208.862.amzn2.x86_64 #1 SMP Tue Jun 4 16:52:10 UTC 2024 x86_64 }
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| datatype.cpp:69 ccl_datatype_storage: create datatype_storage
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| hwloc_wrapper.cpp:69 ccl_hwloc_wrapper: hwloc root object: type: Machine
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| internal_kvs.cpp:323 fill_local_host_ip: use ipv4: 100.64.183.160
2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (130) >= limit (120)
2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs_server.hpp:66 put: read/write error: Broken pipe
2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs.cpp:108 kvs_get_value_by_name_key: client: get_value
2024:06:25-09:25:24:( 5098) |CCL_ERROR| pmi_resizable_simple_internal.cpp:319 get_local_kvs_id: failed to get local kvs id
2024:06:25-09:25:24:( 5098) |CCL_ERROR| pmi_resizable_simple_internal.cpp:65 pmrt_init: failed to get local id
2024:06:25-09:25:24:( 5098) |CCL_ERROR| atl_ofi_comm.cpp:268 init_transport: pmi init failed
2024:06:25-09:25:24:( 5098) |CCL_ERROR| atl_ofi_comm.cpp:79 atl_ofi_comm: condition init_transport(true) == ATL_STATUS_SUCCESS failed
2024:06:25-09:25:24:( 5098) |CCL_DEBUG| communicator_impl.hpp:115 create_communicator: size 2, rank 1
2024:06:25-09:25:24:( 5098) |CCL_DEBUG| atl_ofi_comm.cpp:265 init_transport: init atl, requested ep_count 1
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
2024:06:25-09:25:26:( 4890) |CCL_DEBUG| buffer_cache.cpp:60 clear: clear buffer cache: size: 0
2024:06:25-09:25:26:( 4890) |CCL_DEBUG| ofi_api_wrapper.cpp:48 ofi_api_fini: close OFI lib: handle: 0x7fab7ee19d20
2024:06:25-09:25:26:( 4890) |CCL_DEBUG| mpi_api_wrapper.cpp:50 mpi_api_fini: close MPI lib: handle: 0x7fab7ee1d2d0

@premdass (Author)

A bit more detail to add: I have two interfaces, lo and eth0, and I tried with both names and ended up with a similar error. I am trying to run distributed training with two Ray worker nodes. Does it need any entries in a hostfile or something similar to find the other worker node?

@xwu99 (Contributor) commented Jul 1, 2024

> 2024:06:25-09:23:14:( 5098) |CCL_DEBUG| hwloc_wrapper.cpp:69 ccl_hwloc_wrapper: hwloc root object: type: Machine
> 2024:06:25-09:23:14:( 5098) |CCL_DEBUG| internal_kvs.cpp:323 fill_local_host_ip: use ipv4: 100.64.183.160
> 2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (130) >= limit (120)

This shows oneCCL doesn't initialize correctly due to a connection timeout. It seems to be a network problem. How do you set up your Ray cluster in K8s? There is the KubeRay project to help set up a Ray cluster properly on K8s.
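
One rough way to confirm whether this is plain pod-to-pod connectivity (a hypothetical check, not from the repo; the oneCCL KVS port is not shown in the log, so the port below is a placeholder you would need to substitute):

import socket

# Try to reach the IP that the worker reported in fill_local_host_ip
# (100.64.183.160 in the log above) from the other worker pod; a timeout
# here points at the K8s network/CNI policy rather than oneCCL itself.
HOST = "100.64.183.160"
PORT = 12345  # placeholder port

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print("TCP connection succeeded")
except OSError as exc:
    print(f"TCP connection failed: {exc}")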

@premdass (Author) commented Jul 9, 2024

KubeRay is being used to set up the clusters in this case. I need to dig into why CCL cannot init. Any pointers, please?
