Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Building on Void Linux #117

Closed
phriv opened this issue Jul 31, 2024 · 26 comments
Closed

Building on Void Linux #117

phriv opened this issue Jul 31, 2024 · 26 comments

Comments

@phriv
Copy link

phriv commented Jul 31, 2024

I am currently getting a patch for Void Linux up and running and haven't had any issues until I had to compile UCX.

So far I'm getting the following errors:

/media/LI/rocm_sdk_builder/src_projects/ucx/src/tools/perf/perftest.c:222: error: ignoring '#pragma omp barrier' [-Werror=unknown-pragmas] 222 | #pragma omp barrier

/media/LI/rocm_sdk_builder/src_projects/ucx/src/tools/perf/perftest.c:224: error: ignoring '#pragma omp master' [-Werror=unknown-pragmas] 224 | #pragma omp master

/media/LI/rocm_sdk_builder/src_projects/ucx/src/tools/perf/perftest.c:242: error: ignoring '#pragma omp barrier' [-Werror=unknown-pragmas] 242 | #pragma omp barrier

Searching through stackoverflow suggests something about pass '-fopenmp' to gcc but I'll admit I don't know automake/autoconf well enough to figure out which args I need to edit in order to do so.

@phriv phriv changed the title Build Fails While compiling UCX (openmp unknow pragma) Build Fails While compiling UCX (openmp unknown pragma) Jul 31, 2024
@lamikr
Copy link
Owner

lamikr commented Aug 1, 2024

Hmm, newer seen that error myself. One thing to try quickly is to add the "-fopenmp" to CFLAGS and LDFLAGS.
File binfo/envsetup.sh has some definitions for those in the end of the file. Could you try to add them there?

Another possibility is that the openmp libraries are not installed in your Linux distro? Do you have for example
ibgomp or ibgomp-dev in the Void Linux?

@jeroen-mostert
Copy link
Contributor

jeroen-mostert commented Aug 2, 2024

This is an automake usage fail; autoconf detects that OpenMP is not installed but the project fails to do anything with it. Remove your ucx builddir, issue xbps-install libgomp-devel and try again; in my VM this is sufficient. I'm working on getting a complete list of deps for a clean Void install so we can add this to the deps.

Building this revealed another issue that breaks the build on my machine; issued openucx/ucx#10041 to fix that.

@jeroen-mostert
Copy link
Contributor

jeroen-mostert commented Aug 2, 2024

Well, it's more than just deps, in fact. Void makes the bold choice of symlinking /bin/sh to dash. In theory, this is fine. In practice, there are a lot of nooks and crannies where people assume that bash is everywhere, even in the context of sh. I'll use this comment specifically to track all pull requests I've upstreamed for these fixes (I suggest renaming this one to track overall support for Void instead). Note that some may take quite a long time to get to us as they need to pass through two upstreams (when AMD forks things), so we may need to add local patches.

Of course a simpler alternative is to tell everyone with a distro that doesn't symlink /bin/sh to bash to please do that instead, but I'm guessing that's an even bigger uphill battle. 🙃

@lamikr
Copy link
Owner

lamikr commented Aug 2, 2024

I may have couple of bashism also, is the "declare -A DICTIONARY_PATCHED_PROJECTS" thing from babs.sh working ok with the dash?

@lamikr
Copy link
Owner

lamikr commented Aug 2, 2024

While you are working with the void linux dependencies, here is one problem to think.
I have thinked that the install_deps.sh would only handle the dependencies for the core apps that are build automatically and are listed in the binfo/binfo_list.sh and binfo/core

But as there are now the binfo/extra apps folder for addons, it would be nice to have some dependency handling support also for them. (I consider those apps like a one with little less support and not fully quarantee that they will support on everyblase).

For example binfo/extra/corectrl has quite a many dependencies and so-far I have only but the dependencies I found on mageia and fedora to comments of that binfo file and expect that the user will handle the install of those manually before building the app.

In addition for example on Mageia 9, I needed to backport one app by myself to newer version to be able to build it. Not sure should I print some advice for that or just let the builder to figure it out,

@jeroen-mostert
Copy link
Contributor

jeroen-mostert commented Aug 2, 2024

I may have couple of bashism also, is the "declare -A DICTIONARY_PATCHED_PROJECTS" thing from babs.sh working ok with the dash?

Your own bashisms are always fine, because... you're using bash. :P If the script uses #!/bin/bash at the top then bash is required. But (AFAIK) automake/autoconf can't be forced to use bash, and even if they could, it would be a bad idea since the whole idea is for them to generate portable files.

Note that bash is installed on Void Linux and is even the default shell, it just chooses not to use it for sh. To, uh, keep people honest, I guess.

@jeroen-mostert
Copy link
Contributor

jeroen-mostert commented Aug 2, 2024

For the extra deps, there are basically two choices:

  • Extend the binfo format to support a dictionary of dependencies for known distros, so they can be automatically installed when building. If the distro is not found/unknown, print a message to that effect to encourage people to figure out the deps and submit a PR. This seems like quite a bit of work, especially in terms of shell scripting, but does give the best experience. We can stuff the dependencies from install-deps.sh in the first core binfo, since it's not worth it to carefully track which project needs which dependency, nor do we want to invoke the package manager every time on every build.
  • Don't do any of that, throw up our hands and say "hey, if you want to build this stuff, you need to figure out yourself how to do it, maybe read the comments". Easy, but it also kind of defeats the purpose of including these projects in the repo in the first place if we don't plan on supporting them. The user may as well just clone and build them manually altogether in that case. :P

Note that for rolling distros like Arch and Manjaro, if there is no dependency on ROCm and there is a package for it it's basically not worth building from scratch (corectrl is a good example), since they are realistically close enough to upstream that you don't need to waste time building your own. So rather than figuring out dependencies there, you can just install the distro package and be done with it.

@phriv
Copy link
Author

phriv commented Aug 2, 2024

Well, it's more than just deps, in fact. Void makes the bold choice of symlinking /bin/sh to dash. In theory, this is fine. In practice, there are a lot of nooks and crannies where people assume that bash is everywhere, even in the context of sh.

I was about to come back to this thread to mention this. I did follow your advice @jeroen-mostert and libgomp-devel was enough to get ucx to compile without any issues. The following dependency, UCC, will fail during the autogen step. Like you said this comes down to the fact this distro uses dash instead of bash for /bin/sh (Wish they'd put than in the user manual). I'll apply your patch and report back any further issues.

Also yeah its best to rename this thread at this point.

@jeroen-mostert
Copy link
Contributor

jeroen-mostert commented Aug 2, 2024

For now the build fails on rocBLAS_Tensile/rocBLAS; the latter complains that it can't find Tensile (no module named 'Tensile'), even though the lines just above it say it installed. It seems like it's not building Tensile binaries at all, possibly some platform mismatch or upstream change. Unfortunately most of the rest depends on it (and it's one of the most time-consuming parts of the build).

For now this is the list of dependencies I've recorded (which is not complete, but may allow others to proceed on a build):

xbps-install base-devel cmake babeltrace-devel doxygen elfutils-devel expat-devel gcc-fortran \
    json-c++ gmp-devel libdrm-devel libglvnd-devel libgomp-devel libnuma-devel \
    mpfr-devel msgpack-cxx ncurses-devel openssl-devel protobuf protobuf-devel \
    python3-pybind11 sqlite-devel xxd zlib-devel

Note that although ccache is not formally a dependency of anything, anyone resembling a contributor is encouraged to install it anyway as it really helps speed up those endless rebuilds of llvm, among other things.

@phriv phriv changed the title Build Fails While compiling UCX (openmp unknown pragma) Building on Void Linux Aug 2, 2024
@phriv
Copy link
Author

phriv commented Aug 3, 2024

I see,

Here's the list of packages I installed. I took arch linux's list (in hindsight, not the best idea) and would replace dependencies that had different/wrong names. Whether or not this is an extensive list of packages remains to be seen.

sudo xbps-install -S libgcc make pkgconf numactl cmake doxygen libelf perl-rename perl-URI perl-File-BaseDir perl-File-Copy-Recursive perl-File-Listing wget gcc gcc-fortran fakeroot libgomp pciutils libdrm vim glew autoconf automake libtool bzip2 xz icu perl libmpack python-pip openssl python3-openssl libffi json-c++ texinfo extra-cmake-modules sqlite git git-lfs valgrind flex byacc gettext ninja texlive-basic ocl-icd protobuf python3-pybind11 libaio gmp mpfr libpng libjpeg-turbo msgpack msgpack-cxx sox ncurses expat babeltrace python3-pip libnuma-devel libgomp-devel

As both Ubuntu and arch call for this dependency in some way, I also added this line for Void (void has no official package).
pip3 install --break-system-packages --user CppHeaderParser

@jeroen-mostert
Copy link
Contributor

There's a lot of fluff in the list; for example all -devel packages for Void also install their base libs as dependencies, so adding those separately is not necessary (by contrast the opposite is true for Arch, all the non dev packages install the dev stuff). Of course there's no real competition going on for getting the smallest dependency lists (unless you're angling for a cloud build, I guess), so whatever works. :P

@jeroen-mostert
Copy link
Contributor

jeroen-mostert commented Aug 3, 2024

Also the rocBLAS build continues after installing sqlite3 and rebuilding Python (or possibly as a result of things other than sqlite3), showing the importance of having all the deps in place at least before Python gets built, as otherwise it will silently turn off core modules. Conceivably it could probably also be configured to fail the build if certain necessary libs are missing.

Note that some deps we still have left in install-deps, like CppHeaderParser, seem to be leftovers from when we didn't build and use our own Python (which pulls in this package, among others), because aside from pybind11 I haven't found any of them necessary. So far, at least, knock on wood. We shouldn't be installing them if our own packaged Python is doing the work, as it may mask problems where we incorrectly use the system Python. Also, users cannot expect the packages we build to work with their own native Python without sourcing the ROCm environment except by chance, since the versions can mismatch.

There's still something weird going on where rocBLAS builds Tensile (and succeeds) but then hipBLASLt and hipSPARSELt also try to build it, and fail, giving a very strange error involving gcc headers ("non-virtual member function marked 'override' hides virtual member function"). I dread having to do a full rebuild, maybe @phriv can catch up before I have to retry. If not, I'll get there when I get there, it's been a long day's work. :P

@lamikr
Copy link
Owner

lamikr commented Aug 3, 2024

Hipblaslt and hipsparselt use different version of Tensile (something called tensilelite) and these two projects are more for MI-level of cards at the moment.

@jeroen-mostert
Copy link
Contributor

Ah, so it's actually different code bases. That explains the seemingly redundant builds, though not why they fail (the errors have nothing to do with the gfx arches but are some include snag from pulling in <chrono>, of all things -- it seems like it's falling back to the system gcc when it should be using clang, but I've been wrong about that before).

Personally I would be fine pretending these cards don't exist and not build these projects either (let AMD have their proprietary builds for that) but I understand it would be nicer to have them anyway, as long as they want to work. :P

@jeroen-mostert
Copy link
Contributor

Well, it took a while, but I found it. There are two CMakeLists that specify CXX_STANDARD 20 (the top-level CMakeLists and one deep within the bowels of tensilelite), which does not work; both of these have to be changed to CXX_STANDARD 17, as rocBLAS uses, and then things work. It seems the code is not actually relying on any C++20 features and someone just bumped the standard for a laugh, so that this seems like a safe change. The general problem of having a Clang which needs to play nice with the system GCC's STL remains; there is no fully general solution to this, short of something like including libc++ with Clang, which doubtlessly has its own issues as this is not a common setup.

@jeroen-mostert
Copy link
Contributor

jeroen-mostert commented Aug 3, 2024

I have a successful build, with the caveat that the latest DeepSpeed doesn't build for my gfx1030 -- it chokes on a float16 divide. That is neither here nor there, though. As I've built this on a VM my ability to actually test it is very limited (only CPU operations), but I will package things up for consideration by others. This involves doing a second build to verify things, so it'll be a while yet.

@phriv
Copy link
Author

phriv commented Aug 4, 2024

Alright, after a day of non-stop compiling (rip my poor 7700k), I have a working SDK. All in-suite tests passed. There are two minor issues (probably related to whatever weird stuff void does to shell routing).

Invoking source /opt/rocm_sdk_611/bin/env_rocm.sh will just close the terminal, but running bash first seems to fix this. As far as I can tell I'm running bash (prior to invoking it again), so I'm not sure if this is regular/expected behavior or not.

The only other (possible) issue is I get when I run the pytorch test. The following is dumped to terminal before passing the test:
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1030
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1030
I'm using an RX6600 so this may be related to that but I'm not fully sure.
Also, I did make sure to export the HSA_OVERRIDE_GFX_VERSION variable.

@jeroen-mostert
Copy link
Contributor

jeroen-mostert commented Aug 4, 2024

What shell are you using? source should not close the terminal and it ought to work with other shells, but it hasn't been tested extensively with all possible shells. It obviously works for bash but also zsh, for example. Note that source is a shell built-in and you can't run source env_rocm.sh as an application, that would just close the shell. If you have this as a shortcut or somesuch, you probably need to wrap this in a script and invoke that with -c. If you are just running it from a fresh shell, it sounds like it might be related to whatever's in your profile. It could still be a bug in the script but we'd need more info.

The PyTorch warning is expected; you will get this no matter what card you use (I have an actual gfx1030). it doesn't inhibit acceleration and is due to the way PyTorch loads this stuff at runtime. It does look scary so it would be better if it didn't do that; I don't know if there's a way to circumvent it. Maybe there's one specific binary we shouldn't be compiling with hipcc (or maybe it's unavoidable because comgr checks Python itself -- we might get cheeky and build our Python with hipcc. :P)

@phriv
Copy link
Author

phriv commented Aug 4, 2024

Alright I have an update to issue. It is not a shell issue whatsoever. It's a terminal issue. I usually use tilix, which will close upon sourcing the file, however sourcing the file using xterm and xfce-terminal causes no issue. I'm looking into this. I know you have to export vars to get some of tilix's functionality to work on xfce, this may be related to that.

Regardless, the terminal closing looks like a local issue and not some sort of distro issue.

@jeroen-mostert
Copy link
Contributor

jeroen-mostert commented Aug 4, 2024

The file does have a guard at the top which exits the script if you're not sourcing it but attempting to directly execute it, which uses a bash-specific way of checking things (but it does have a shebang at the top that explicitly invokes bash, so that's OK). This should never cause the outer terminal to exit, because when you're sourcing it that stuff should be skipped, but it's the only thing I can think of that could force a premature exit -- maybe tilix is performing some sort of "optimization" where it fails to put bash in the right mood. The rest is just plain exports that shouldn't give anything any trouble.

jeroen-mostert added a commit to jeroen-mostert/rocm_sdk_builder that referenced this issue Aug 5, 2024
- Add the necessary dependencies
- Include necessary patches for systems where /bin/sh is not bash (including
  but not limited to Void Linux)
- Fix an incompatibility in hipBLASLt between Clang 17 and GCC 13
- Take latest version of ucc (latest release is compatible with ROCm 6.x)
- Remove building the "native" architecture from ucc -- this doesn't work in
  headless setups. The default list should be complete.

Signed-off-by: Jeroen Mostert <jeroen.mostert@cm.com>
@tvegas1
Copy link

tvegas1 commented Aug 5, 2024

Building this revealed another issue that breaks the build on my machine; issued openucx/ucx#10041 to fix that.

Update: It will be fixed with openucx/ucx#10043.

@lamikr
Copy link
Owner

lamikr commented Aug 5, 2024

Thanks for the update. Once openucx/ucx#10043 is applied in upstream we will jump to that version on rocm sdk builder.

lamikr pushed a commit that referenced this issue Aug 6, 2024
- Add the necessary dependencies
- Include necessary patches for systems where /bin/sh is not bash (including
  but not limited to Void Linux)
- Fix an incompatibility in hipBLASLt between Clang 17 and GCC 13
- Take latest version of ucc (latest release is compatible with ROCm 6.x)
- Remove building the "native" architecture from ucc -- this doesn't work in
  headless setups. The default list should be complete.

Signed-off-by: Jeroen Mostert <jeroen.mostert@cm.com>
@jeroen-mostert
Copy link
Contributor

I'm thinking we can close this one? The necessary patches have not all been integrated into upstream repos yet, but that doesn't prevent us from building (and will likely take a very long time for inactive projects like amd-fftw).

@lamikr
Copy link
Owner

lamikr commented Aug 10, 2024

What about the ucx patch? I think it's now merged? If you can verify that the upstream version now builds. we could jump to that version.

@jeroen-mostert
Copy link
Contributor

The thing with that is that we're currently checking out v1.17.0 explicitly. The necessary change was merged (openucx/ucx@e9fac97), but there are many other changes between 1.17.0 and that commit. So it's not just verifying that it builds, but also that it works, and going to an unreleased commit just for this change looks like a bad risk/reward ratio to me. It makes more sense to me to wait for v1.18.0 and remove the patch when we switch to that.

@lamikr
Copy link
Owner

lamikr commented Aug 11, 2024

ah, I mixed to ucc from which we are at the moment picking the latest version. You are right that better to wait 1.18.0. I will close this now as you suggested. (I have not been able to test void linux by myself)

@lamikr lamikr closed this as completed Aug 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants