Finetuning on Ray and CPU causes Runtime error #242
Comments
Maybe you should source the oneCCL environment variables before starting, just by calling: `source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh`
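A minimal sketch of the same idea from Python, only to locate the `setvars.sh` that ships with `oneccl_bindings_for_pytorch` so it can be sourced in each worker's startup (the `torch_ccl.cwd` attribute is the one used in the command above; where you source it is up to your entrypoint):

```python
# Locate the oneCCL setvars.sh bundled with oneccl_bindings_for_pytorch,
# so the printed `source ...` line can be added to the container entrypoint
# (or ray start script) before the Ray workers come up.
import os
import oneccl_bindings_for_pytorch as torch_ccl

setvars = os.path.join(torch_ccl.cwd, "env", "setvars.sh")
print(f"source {setvars}")
```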
The oneCCL environment has been sourced correctly before Ray starts (I can see it in the worker startup logs).
Which version of oneccl-bind-pt are you using — 2.2.0+cpu?
Hi @premdass, is the Ray cluster started on a single node or on multiple nodes? Also, could you remove the two parameters FI_TCP_IFACE and FI_PROVIDER in llm_on_ray/finetune/finetune.py and try again? (See the sketch below.)
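A hedged sketch of the suggested change — the exact structure of the runtime environment in llm_on_ray/finetune/finetune.py is assumed here; only the two keys named in this thread matter:

```python
# Assumed shape of the Ray runtime_env in finetune.py; other entries stay as they are.
runtime_env = {
    "env_vars": {
        # ... keep the existing entries unchanged ...
        # "FI_TCP_IFACE": "lo",   # <- remove this entry
        # "FI_PROVIDER": "tcp",   # <- remove this entry
    }
}
```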
@harborn @KepingYan: Thanks for responding. Please find the env details: oneccl-bind-pt = 2.2.0+cpu. I have run finetune.py without the FI_TCP_IFACE and FI_PROVIDER params and am still seeing the same runtime init error. All the ports between the worker nodes are open as well. I enabled the CCL debug logging and am seeing the below error on the worker nodes:
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| datatype.cpp:69 ccl_datatype_storage: create datatype_storage
Just to add context, I am running this in a Kubernetes/container environment, so the Ray workers are pods/containers. Do I have to do anything specific to enable MPI in a container environment?
It looks like the error is oneCCL failing to initialize its transport. The finetuning code works on physical nodes, so maybe the network interfaces are different in K8S and that causes the oneCCL failure. Could you try another FI_PROVIDER? @premdass, could you share the full CCL_DEBUG log so that we can check what happened when initializing oneCCL? @mshiryaev Hi, is this something known to you? Do you know if torch-ccl needs special configuration on K8S?
@premdass I tried in my local K8S and it does not fail the way yours does. Could you just set "FI_PROVIDER": "tcp" and remove "FI_TCP_IFACE": "lo"? Please make sure to rebuild the Docker images to include the code updates for K8S. Could you share your full log with CCL_DEBUG enabled so that we know which interface and provider were selected? How many network interfaces do you have for each container? (See the sketch below for one way to check.)
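A small hedged sketch for answering the interface question from inside the Ray cluster — it assumes `psutil` is available in the worker images (it normally ships with Ray) and simply lists the interface names each worker pod sees:

```python
# List the network interfaces visible inside the Ray worker pods.
import ray
import psutil

ray.init(address="auto")  # connect to the existing KubeRay cluster

@ray.remote
def interfaces():
    # psutil.net_if_addrs() is keyed by interface name, e.g. {"lo": ..., "eth0": ...}
    return list(psutil.net_if_addrs().keys())

# Two sample tasks; they may land on the same pod, so run more if needed.
print(ray.get([interfaces.remote() for _ in range(2)]))
```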
Apologies for the delayed response @xwu99. I have enabled debug logs for CCL and set tcp as the FI_PROVIDER. Below are the logs when I grep for ccl on the Ray worker nodes:
2024:06:25-09:23:14:( 5098) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
A bit more detail to add: I have two interfaces, lo and eth0, and I tried with both names and end up with a similar error. I am trying to run distributed training with two Ray worker nodes. Does it need any entries in a hostfile or something to find the other worker node?
This shows oneCCL does not initialize correctly due to a connection timeout, which looks like a network problem. How did you set up your Ray cluster in K8S? There is the KubeRay project to help set up a Ray cluster properly on K8S.
KubeRay is being used to set up the clusters in this case. I need to dig into why CCL cannot init. Any pointers, please?
Ray version: ray 2.10
llm-on-ray: latest from main branch
Command used to run: llm_on_ray-finetune --config_file llm-on-ray/llm_on_ray/finetune/finetune.yaml
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
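For context, a hedged minimal sketch (not the project's actual code) of the step where this error surfaces: importing the oneCCL bindings registers the "ccl" backend, and the `atl_ofi_comm: init transport failed` RuntimeError is raised while the process group initializes its OFI transport. The MASTER_ADDR/MASTER_PORT values below are placeholders; in the real run they are set by the Ray training launcher.

```python
# Minimal repro sketch of the failing step on each worker, under assumed env settings.
import os
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend

# Normally provided by the launcher; placeholder values for illustration only.
os.environ.setdefault("MASTER_ADDR", "head-node-ip")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="ccl",
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "2")),
)  # "atl_ofi_comm: init transport failed" is raised here if OFI cannot reach the peer
```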