
[BUG] oneapi/ccl.hpp: No such file or directory. #5653

Open
weiji14 opened this issue Jun 12, 2024 · 7 comments
Labels: bug (Something isn't working), training


weiji14 commented Jun 12, 2024

Describe the bug

The builds on conda-forge have been failing since deepspeed=0.14.1 for CUDA 11.8 and 12.0 with an error like fatal error: oneapi/ccl.hpp: No such file or directory. Originally reported at conda-forge/deepspeed-feedstock#56 (comment).

To Reproduce
Steps to reproduce the behavior:

  1. Go to deepspeed v0.14.2 conda-forge/deepspeed-feedstock#57 and clone the branch
  2. Run python build_locally.py and select the build variant with CUDA 11.8 and Python 3.9
  3. See error below

Expected behavior
CUDA builds work as expected.

ds_report output

Note: this isn't the exact report from the conda-forge CI machine; I copied it from the CPU build logs.

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [FAIL]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['$PREFIX/lib/python3.9/site-packages/torch']
torch version .................... 2.3.0.post101
deepspeed install path ........... ['$PREFIX/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.14.3, f492cfc, HEAD
deepspeed wheel compiled w. ...... torch 0.0 
shared memory (/dev/shm) size .... 64.00 MB

Screenshots

Truncated traceback from https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=953875&view=logs&j=bb1c2637-64c6-57bd-9ea6-93823b2df951&t=350df31b-3291-5209-0bb7-031395f0baa1&l=3486:

2024-06-12T22:49:04.3043574Z   building 'deepspeed.ops.comm.deepspeed_ccl_comm_op' extension
2024-06-12T22:49:04.3043994Z   creating build/temp.linux-x86_64-cpython-39
2024-06-12T22:49:04.3053293Z   creating build/temp.linux-x86_64-cpython-39/csrc
2024-06-12T22:49:04.3054113Z   creating build/temp.linux-x86_64-cpython-39/csrc/cpu
2024-06-12T22:49:04.3054729Z   creating build/temp.linux-x86_64-cpython-39/csrc/cpu/comm
2024-06-12T22:49:04.3071806Z   /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_build_env/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -fPIC -O2 -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work=/usr/local/src/conda/deepspeed-0.14.3 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p=/usr/local/src/conda-prefix -isystem /usr/local/cuda/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -isystem /usr/local/cuda/include -fPIC 
-I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work/csrc/cpu/includes -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.9/site-packages/torch/include -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.9/site-packages/torch/include/TH -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.9/site-packages/torch/include/THC -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include/python3.9 -c csrc/cpu/comm/ccl.cpp -o build/temp.linux-x86_64-cpython-39/csrc/cpu/comm/ccl.o -O2 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=deepspeed_ccl_comm_op -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
2024-06-12T22:49:08.2062484Z   csrc/cpu/comm/ccl.cpp:8:10: fatal error: oneapi/ccl.hpp: No such file or directory
2024-06-12T22:49:08.2067800Z       8 | #include <oneapi/ccl.hpp>
2024-06-12T22:49:08.2068222Z         |          ^~~~~~~~~~~~~~~~
2024-06-12T22:49:08.2068507Z   compilation terminated.
2024-06-12T22:49:08.2182741Z   error: command '/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_build_env/bin/x86_64-conda-linux-gnu-cc' failed with exit code 1
2024-06-12T22:49:08.6174937Z   error: subprocess-exited-with-error
2024-06-12T22:49:08.6176666Z   
2024-06-12T22:49:08.6188012Z   × python setup.py bdist_wheel did not run successfully.
2024-06-12T22:49:08.6227487Z   │ exit code: 1
2024-06-12T22:49:08.6240717Z   ╰─> See above for output.
2024-06-12T22:49:08.6252920Z   
2024-06-12T22:49:08.6264017Z   note: This error originates from a subprocess, and is likely not a problem with pip.
2024-06-12T22:49:08.6271330Z   full command: /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/bin/python -u -c '
2024-06-12T22:49:08.6272043Z   exec(compile('"'"''"'"''"'"'
2024-06-12T22:49:08.6277838Z   # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
2024-06-12T22:49:08.6283726Z   #
2024-06-12T22:49:08.6284428Z   # - It imports setuptools before invoking setup.py, to enable projects that directly
2024-06-12T22:49:08.6289287Z   #   import from `distutils.core` to work with newer packaging standards.
2024-06-12T22:49:08.6289949Z   # - It provides a clear error message when setuptools is not installed.
2024-06-12T22:49:08.6295383Z   # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
2024-06-12T22:49:08.6295837Z   #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
2024-06-12T22:49:08.6301077Z   #     manifest_maker: standard file '"'"'-c'"'"' not found".
2024-06-12T22:49:08.6307069Z   # - It generates a shim setup.py, for handling setup.cfg-only projects.
2024-06-12T22:49:08.6307810Z   import os, sys, tokenize
2024-06-12T22:49:08.6314125Z   
2024-06-12T22:49:08.6314907Z   try:
2024-06-12T22:49:08.6320956Z       import setuptools
2024-06-12T22:49:08.6321316Z   except ImportError as error:
2024-06-12T22:49:08.6325049Z       print(
2024-06-12T22:49:08.6326023Z           "ERROR: Can not execute `setup.py` since setuptools is not available in "
2024-06-12T22:49:08.6335095Z           "the build environment.",
2024-06-12T22:49:08.6335348Z           file=sys.stderr,
2024-06-12T22:49:08.6338543Z       )
2024-06-12T22:49:08.6338832Z       sys.exit(1)
2024-06-12T22:49:08.6339045Z   
2024-06-12T22:49:08.6339554Z   __file__ = %r
2024-06-12T22:49:08.6340070Z   sys.argv[0] = __file__
2024-06-12T22:49:08.6340336Z   
2024-06-12T22:49:08.6340562Z   if os.path.exists(__file__):
2024-06-12T22:49:08.6340835Z       filename = __file__
2024-06-12T22:49:08.6341059Z       with tokenize.open(__file__) as f:
2024-06-12T22:49:08.6341411Z           setup_py_code = f.read()
2024-06-12T22:49:08.6341621Z   else:
2024-06-12T22:49:08.6341993Z       filename = "<auto-generated setuptools caller>"
2024-06-12T22:49:08.6342280Z       setup_py_code = "from setuptools import setup; setup()"
2024-06-12T22:49:08.6342576Z   
2024-06-12T22:49:08.6342895Z   exec(compile(setup_py_code, filename, "exec"))
2024-06-12T22:49:08.6343569Z   '"'"''"'"''"'"' % ('"'"'/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' bdist_wheel -d /tmp/pip-wheel-v4pibtb1
2024-06-12T22:49:08.6344021Z   cwd: /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work/
2024-06-12T22:49:08.6344416Z   Building wheel for deepspeed (setup.py): finished with status 'error'
2024-06-12T22:49:08.6348857Z   ERROR: Failed building wheel for deepspeed
2024-06-12T22:49:08.6349126Z   Running setup.py clean for deepspeed
2024-06-12T22:49:08.6349635Z   Running command python setup.py clean
2024-06-12T22:49:10.9424015Z   No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2024-06-12T22:49:10.9498275Z   [WARNING] Torch did not find cuda available, if cross-compiling or running with cpu only you can ignore this message. Adding compute capability for Pascal, Volta, and Turing (compute capabilities 6.0, 6.1, 6.2)
2024-06-12T22:49:10.9509931Z   DS_BUILD_OPS=1
2024-06-12T22:49:17.9894660Z   Install Ops={'deepspeed_not_implemented': 1, 'deepspeed_ccl_comm': 1, 'deepspeed_shm_comm': 1, 'cpu_adam': 1, 'fused_adam': 1}
2024-06-12T22:49:18.0269777Z   version=0.14.3, git_hash=f492cfc, git_branch=HEAD
2024-06-12T22:49:18.0270870Z   install_requires=['hjson', 'ninja', 'numpy', 'nvidia-ml-py', 'packaging>=20.0', 'psutil', 'py-cpuinfo', 'pydantic', 'torch', 'tqdm']
2024-06-12T22:49:18.0278897Z   ext_modules=[<setuptools.extension.Extension('deepspeed.ops.comm.deepspeed_not_implemented_op') at 0x7ff248dbe460>, <setuptools.extension.Extension('deepspeed.ops.comm.deepspeed_ccl_comm_op') at 0x7ff248dbe4c0>, <setuptools.extension.Extension('deepspeed.ops.comm.deepspeed_shm_comm_op') at 0x7ff248dbe520>, <setuptools.extension.Extension('deepspeed.ops.adam.cpu_adam_op') at 0x7ff16dd51b80>, <setuptools.extension.Extension('deepspeed.ops.adam.fused_adam_op') at 0x7ff16dd51d90>]
2024-06-12T22:49:18.0651351Z   running clean
2024-06-12T22:49:18.0714575Z   removing 'build/temp.linux-x86_64-cpython-39' (and everything under it)
2024-06-12T22:49:18.0715596Z   removing 'build/lib.linux-x86_64-cpython-39' (and everything under it)
2024-06-12T22:49:18.1133919Z   'build/bdist.linux-x86_64' does not exist -- can't clean it
2024-06-12T22:49:18.1143315Z   'build/scripts-3.9' does not exist -- can't clean it
2024-06-12T22:49:18.1151899Z   removing 'build'
2024-06-12T22:49:18.1171105Z   deepspeed build time = 0.08735942840576172 secs
2024-06-12T22:49:18.5956605Z Failed to build deepspeed
2024-06-12T22:49:18.5973299Z ERROR: Could not build wheels for deepspeed, which is required to install pyproject.toml-based projects
2024-06-12T22:49:18.5973660Z Exception information:
2024-06-12T22:49:18.5986009Z Traceback (most recent call last):
2024-06-12T22:49:18.5993394Z   File "$PREFIX/lib/python3.9/site-packages/pip/_internal/cli/base_command.py", line 180, in exc_logging_wrapper
2024-06-12T22:49:18.5993851Z     status = run_func(*args)
2024-06-12T22:49:18.5994316Z   File "$PREFIX/lib/python3.9/site-packages/pip/_internal/cli/req_command.py", line 245, in wrapper
2024-06-12T22:49:18.5995399Z     return func(self, options, args)
2024-06-12T22:49:18.5999462Z   File "$PREFIX/lib/python3.9/site-packages/pip/_internal/commands/install.py", line 429, in run
2024-06-12T22:49:18.5999708Z     raise InstallationError(
2024-06-12T22:49:18.6006202Z pip._internal.exceptions.InstallationError: Could not build wheels for deepspeed, which is required to install pyproject.toml-based projects
2024-06-12T22:49:18.6010801Z Removed build tracker: '/tmp/pip-build-tracker-gvbib0oo'
2024-06-12T22:49:20.4656669Z Traceback (most recent call last):
2024-06-12T22:49:20.4664375Z   File "/opt/conda/bin/conda-build", line 11, in <module>
2024-06-12T22:49:20.4669893Z     sys.exit(execute())
2024-06-12T22:49:20.4670437Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/cli/main_build.py", line 590, in execute
2024-06-12T22:49:20.4677725Z     api.build(
2024-06-12T22:49:20.4678886Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/api.py", line 250, in build
2024-06-12T22:49:20.4685860Z     return build_tree(
2024-06-12T22:49:20.4691479Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 3638, in build_tree
2024-06-12T22:49:20.4708481Z     packages_from_this = build(
2024-06-12T22:49:20.4713969Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 2506, in build
2024-06-12T22:49:20.4714313Z     utils.check_call_env(
2024-06-12T22:49:20.4724506Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 405, in check_call_env
2024-06-12T22:49:20.4729616Z     return _func_defaulting_env_to_os_environ("call", *popenargs, **kwargs)
2024-06-12T22:49:20.4730205Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 381, in _func_defaulting_env_to_os_environ
2024-06-12T22:49:20.4735612Z     raise subprocess.CalledProcessError(proc.returncode, _args)
2024-06-12T22:49:20.4736400Z subprocess.CalledProcessError: Command '['/bin/bash', '-o', 'errexit', '/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work/conda_build.sh']' returned non-zero exit status 1.
2024-06-12T22:49:30.4784588Z 
2024-06-12T22:49:30.5793301Z ##[error]Bash exited with code '1'.
2024-06-12T22:49:30.5974127Z ##[section]Finishing: Run docker build

System info (please complete the following information):

  • OS: Ubuntu 22.04.4
  • GPU count and types: 1 NVIDIA GPU
  • Interconnects (if applicable): N/A
  • Python version: 3.9
  • Any other relevant info about your setup: None

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else? No

Docker context
Are you using a specific docker image that you can share?

quay.io/condaforge/linux-anvil-cuda:11.8

Additional context

The builds have been failing in these PRs as well:

@weiji14 weiji14 added bug Something isn't working training labels Jun 12, 2024
@weiji14 weiji14 changed the title [BUG] [BUG] oneapi/ccl.hpp: No such file or directory. Jun 12, 2024
loadams (Contributor) commented Jun 13, 2024

Thanks @weiji14 for opening this to track.

@loadams loadams self-assigned this Jun 17, 2024
tgkul commented Aug 9, 2024

Hello, any update on this issue?

SnzFor16Min commented Aug 11, 2024

Following the instructions here to install oneccl-devel from Intel:

conda install -c https://software.repos.intel.com/python/conda/ -c conda-forge oneccl-devel

solved this problem for me.
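For anyone verifying that the install actually fixed the missing header, a quick check like the following may help (a sketch: the candidate include directories are assumptions, not taken from the oneccl-devel package metadata):

```shell
# Look for oneapi/ccl.hpp in the likely include directories. The active
# conda env prefix is checked when CONDA_PREFIX is set; the Intel oneAPI
# system path is an assumed fallback location.
found=""
for dir in "${CONDA_PREFIX:-/usr}/include" /opt/intel/oneapi/ccl/latest/include; do
    if [ -f "$dir/oneapi/ccl.hpp" ]; then
        found="$dir"
    fi
done
if [ -n "$found" ]; then
    echo "oneapi/ccl.hpp found under: $found"
else
    echo "oneapi/ccl.hpp not found; the ccl_comm op build will fail"
fi
```

If the header is found, the `deepspeed_ccl_comm` extension shown in the traceback should compile past the `#include <oneapi/ccl.hpp>` line.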

weiji14 added a commit to regro-cf-autotick-bot/deepspeed-feedstock that referenced this issue Aug 11, 2024
Try to fix `fatal error: oneapi/ccl.hpp: No such file or directory` on CUDA builds using suggestion at microsoft/DeepSpeed#5653 (comment)
weiji14 added a commit to conda-forge/deepspeed-feedstock that referenced this issue Aug 11, 2024
* updated v0.14.4

* MNT: Re-rendered with conda-build 24.5.1, conda-smithy 3.36.2, and conda-forge-pinning 2024.06.21.08.07.40

* Remove ninja as runtime dependency

Xref #1

* Replace pynvml with nvidia-ml-py

Xref microsoft/DeepSpeed#5529.

Also added note about compatibility with pydantic 2.0.

* Reset build number to 0

* Add oneccl-devel to host dependencies

Try to fix `fatal error: oneapi/ccl.hpp: No such file or directory` on CUDA builds using suggestion at microsoft/DeepSpeed#5653 (comment)

* MNT: Re-rendered with conda-build 24.5.1, conda-smithy 3.38.0, and conda-forge-pinning 2024.08.11.18.23.17

---------

Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
weiji14 (Author) commented Aug 11, 2024

Thanks so much @SnzFor16Min for pointing me to that oneccl-devel package (which is also on the conda-forge channel at https://anaconda.org/conda-forge/oneccl-devel/files). As a temporary workaround, I've managed to build deepspeed=0.14.4 at conda-forge/deepspeed-feedstock#63 by adding oneccl-devel to the host dependencies.

That said, I'm still unsure whether this issue should be closed, because the Intel oneAPI Toolkit should only be needed for CPU builds, not CUDA (GPU) builds, no? As mentioned at conda-forge/deepspeed-feedstock#56 (comment):

Do we need to get that oneapi/ccl.hpp file from somewhere? Don't quite get it since these are CUDA (GPU) builds, not CPU builds.

@weiji14 - It looks like this comes from the Intel extensions for pytorch, but we shouldn't need that, and some DeepSpeed tests should have caught that. I'll take a look soon to see if I can tell why we are hitting this here.

Will leave this up to @loadams and the deepspeed team to resolve.

SnzFor16Min commented
I'm no expert in building DeepSpeed, but since I see DS_BUILD_OPS=1 in the traceback, perhaps @weiji14 you should check whether the build script was also pre-compiling the CPU ops (e.g., DS_BUILD_CPU_ADAM). This is mentioned in the DeepSpeed documentation; those ops might require oneAPI libraries even in a GPU build.
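To make that interaction concrete, here is a minimal sketch of the environment variables involved (the per-op names follow the DS_BUILD_<OP> convention from the DeepSpeed docs; treat the exact spellings as assumptions):

```shell
# DS_BUILD_OPS=1 asks setup.py to pre-compile every op, so CPU ops like
# cpu_adam and the oneCCL-backed comm op get built even for a GPU wheel.
export DS_BUILD_OPS=1
# Individual ops can be opted back out with per-op overrides, e.g.:
export DS_BUILD_CPU_ADAM=0
echo "DS_BUILD_OPS=$DS_BUILD_OPS DS_BUILD_CPU_ADAM=$DS_BUILD_CPU_ADAM"
```

With DS_BUILD_OPS=1 and no overrides, the `Install Ops={...}` line in the log above is consistent with all listed ops (including `deepspeed_ccl_comm`) being selected for pre-compilation.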

weiji14 (Author) commented Aug 12, 2024

Ah yes, the DS_BUILD_OPS=1 flag is set at https://github.com/conda-forge/deepspeed-feedstock/blame/b0193a708c3f1f6864e2a85f7cbdf92ee3bf39ff/recipe/build.sh#L4-L6, and DS_BUILD_CPU_ADAM might be enabled as a result (I haven't checked those build flags in almost a year). Maybe it doesn't hurt to compile with the CPU ops enabled even on CUDA?
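An alternative to adding oneccl-devel would be opting the oneCCL-dependent op out on CUDA variants. A rough sketch of what that could look like in the recipe's build script (the `cuda_compiler_version` check mirrors common conda-forge recipes, and the DS_BUILD_CCL_COMM name is my reading of the op name in the log; both are assumptions, not the actual build.sh):

```shell
# The recipe pre-builds all ops:
export DS_BUILD_OPS=1
# A CUDA-only variant could opt out of the op that needs oneapi/ccl.hpp
# (variable names for the CUDA check and the opt-out are assumptions):
if [ "${cuda_compiler_version:-None}" != "None" ]; then
    export DS_BUILD_CCL_COMM=0
fi
echo "DS_BUILD_OPS=$DS_BUILD_OPS DS_BUILD_CCL_COMM=${DS_BUILD_CCL_COMM:-unset}"
```

The trade-off is losing the pre-built CPU comm op in GPU packages, whereas the oneccl-devel host dependency keeps all ops buildable.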

loadams (Contributor) commented Aug 12, 2024

@weiji14 - this should be fine to add to the dependencies; it should not cause any issues on the CUDA builds.

Also, it should be fine to leave DS_BUILD_OPS=1; that enables all ops, including the CPU ops.

I'd say let's leave this open for now; I'll check back to confirm we have no issues reported from users, and we can also confirm the flow works with the next DeepSpeed release.

4 participants