generic/portable support for non-x86 systems #1
@jeffhammond thank you for spending time to review this package! Since most compute clusters use CPUs based on x86_64, I never even considered supporting other architectures. Writing a high-performance ARM implementation would be too much work for too little benefit (since publication-quality results will be produced on clusters anyway), but I like the idea of a scalar fallback implementation for easy testing. I'm happy to give a scalar implementation a go, but I have no ARM/AArch64 machine to test it on. Would you be willing to work together on this and use your Apple M1 as a guinea pig for compiling the code?
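A scalar fallback can usually be added without touching the tuned path at all. The sketch below is illustrative only (the kernel `xor_masks` is a made-up name, not from lattice-symmetries): the same function gets a vectorized x86 body and a portable ISO C++ body, selected at compile time.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__AVX2__)
#include <immintrin.h>

// AVX2 path: XOR 256 bits (four 64-bit words) per iteration.
static void xor_masks(uint64_t* out, const uint64_t* in,
                      const uint64_t* mask, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in + i));
        __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(mask + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_xor_si256(a, b));
    }
    for (; i < n; ++i) out[i] = in[i] ^ mask[i];  // scalar tail
}
#else
// Portable fallback: plain ISO C++, compiles on any architecture.
static void xor_masks(uint64_t* out, const uint64_t* in,
                      const uint64_t* mask, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = in[i] ^ mask[i];
}
#endif
```

The fallback will be slow, but it makes the library buildable (and testable) everywhere, which is exactly what a reviewer on an M1 needs.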
Yes, I would be happy to test a generic port of this code on all of my ARM systems. I have lots of them 😎 Below is more info about ARM adoption than you ever wanted to know...

You can buy a 64-bit Raspberry Pi 4 system for less than €50. They are not the most powerful systems, but they were just fine for porting many codes. I have developed LLVM and NWChem on Raspberry Pi or similar systems, both of which are millions of lines of code. AWS offers 64-core ARM servers that are competitive with the best x86 CPUs for both compute- and memory-bound workloads, and are, at the very least, useful for doing ARM porting if you don't want to own hardware locally. Ampere is building ARM servers for data centers, which are more powerful than many x86 servers (details). Oracle offers Ampere servers in their cloud right now. I recently presented on the performance of these CPUs versus Intel Xeon processors for the NWChem quantum chemistry workload.

The current fastest supercomputer in the world, Fugaku, is an ARM system. The next two after that are Power9+NVIDIA. Number 4 is a custom non-x86 chip. At #5, we see the first x86 supercomputer, which gets a large portion of its performance from NVIDIA GPUs. Top500 has the details. I can help you get access to the same CPUs as are found in Fugaku at Bristol or Georgia Tech if you like. I guess you're in Europe, so the ping times to Bristol should be better.

In time, you will see an increasing number of ARM-based HPC systems in the world, such as those based on the NVIDIA Grace processor, which are already planned at Los Alamos and CSCS. I didn't even mention what Apple, Qualcomm, and SiPearl are doing with ARM right now, but the internet has details.

Anyways, I work for NVIDIA and building ARM HPC software is a big part of my day job, so I'm biased, but even when I worked for Intel (2014-2021), I did not encourage people to write x86-specific code.
Wow! This is a lot of very good info. I wasn't aware of these developments... Thank you so much! This argument is so convincing that now I want a fast ARM implementation :D And then I can compare the performance of Ampere to AMD EPYC (which I've used for benchmarking). The easiest option is to swap vectorclass/version2 for e.g. simd-everywhere/simde and play around with CMake to make it compile on ARM as well. A completely different solution is to use Halide to generate kernels for various platforms and then do run-time dispatch, similarly to how it's done now. The upside is that performance tuning for a few specific architectures becomes way easier, and it would be trivial to get a proof-of-concept GPU version running. The downside is that it's much more work, and I'm not entirely sure that the performance will be on par with the current version. So I'm still thinking about the best way to approach this issue...
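For reference, run-time dispatch of the kind mentioned here usually boils down to selecting a kernel once and calling through a function pointer afterwards. This is only an illustrative sketch (all names are made up, not the actual lattice-symmetries API):

```cpp
#include <cstddef>
#include <cstdint>

using kernel_fn = uint64_t (*)(const uint64_t*, size_t);

// Portable baseline kernel: works on any architecture.
static uint64_t sum_generic(const uint64_t* xs, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i) acc += xs[i];
    return acc;
}

#if defined(__x86_64__) || defined(_M_X64)
// Stand-in for an architecture-tuned kernel (e.g. an AVX2/AVX-512
// implementation in the real library).
static uint64_t sum_unrolled(const uint64_t* xs, size_t n) {
    uint64_t a = 0, b = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) { a += xs[i]; b += xs[i + 1]; }
    if (i < n) a += xs[i];
    return a + b;
}
#endif

// Probe the CPU once and pick the best available kernel.
static kernel_fn select_kernel() {
#if (defined(__x86_64__) || defined(_M_X64)) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) return sum_unrolled;
#endif
    return sum_generic;  // portable default on every other platform
}

// Resolved once at startup; all later calls go through the pointer.
static const kernel_fn sum_kernel = select_kernel();
```

The nice property of this pattern is that adding an ARM-tuned kernel later only touches `select_kernel`; callers never change.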
Yeah, SIMDe is one of the recommended solutions here. An ARM friend mentioned he ported most of vectorclass to ARM with it, but there were a few small gaps (that might not affect you). Halide and other DSLs are very interesting. I looked at Halide a bit in the past but the learning curve was steep. I will have to look at your code more to see what other options exist. If there's a massive amount of task or data parallelism, there are many options. If one has to manage SIMD/SIMT and L1/SLMs explicitly, it's harder.
Hi! Porting vectorclass to Arm (Graviton 2, actually) wasn't too hard. SIMDe's not a drop-in replacement (see below) but it gets you nearly there. Most of the code changes were due to differences in the compilers between systems; you'd see the same sort of issues moving from x86 to POWER. Someone comfortable writing AVX-512 intrinsics is also comfortable leveraging Intel compiler intrinsics and inline x86 assembly, and SIMDe can't help with that. I would love to send you a pull request, but Arm's lawyers would get upset since I don't yet have the proper paperwork in place to contribute to this open source project. While I'm getting that sorted, I can explain the changes and let you take it from there, if you like. To get started, in version2/instrset.h:
Problems and solutions:
All this took less than an afternoon. Some tests fail due to known SIMDe caveats; see https://github.com/simd-everywhere/simde#caveats. I've attached the output and stats from both systems: an Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz and a Graviton 2.
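The SIMDe porting pattern described in this thread can be sketched roughly as follows: keep the x86 intrinsic calls unchanged, but route them through SIMDe where the native header is unavailable. The `SIMDE_ENABLE_NATIVE_ALIASES` macro and header path come from SIMDe's documentation; the kernel itself is a hypothetical example.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // native SSE2 intrinsics on x86
#else
#define SIMDE_ENABLE_NATIVE_ALIASES  // expose the _mm_* names via SIMDe
#include <simde/x86/sse2.h>          // translated to NEON/SVE/scalar
#endif

// The kernel is identical on both paths: add four 32-bit ints at once.
static void add4(int32_t* out, const int32_t* a, const int32_t* b) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                     _mm_add_epi32(va, vb));
}
```

As noted above, this only covers intrinsics that SIMDe implements; compiler-specific intrinsics and inline x86 assembly still need hand porting.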
Sorry for the delay, I was moving and had to take a few days off. @jlinford thank you for your insights, they're very helpful! I still haven't decided whether Halide or SIMDe is the right way to go. In lattice-symmetries, SIMD is currently used for two things:
I gave 1. a try in Halide (you can find some work-in-progress code here: https://github.com/twesterhout/lattice-symmetries-halide). It seems to produce reasonable C code, but I haven't analyzed the assembly yet. The reason why I'd like to use Halide here is that it's trivial to change how loops are unrolled and vectorized, and I'm afraid that simply processing 512 bits at a time (what the current implementation does) is sub-optimal when the natural vector size is 128 bits (i.e. I see that gcc spills to the stack rather than keeping everything in registers).
Everything else is pure C & C++ code and should just compile for other architectures. If I manage to get the Halide implementation of 1. to perform reasonably well, I'll generate ARM kernels, write some wrapper code, and ping you, @jeffhammond, to try compiling it on one of your machines.
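The vector-width concern raised above can be illustrated without Halide: make the block width a compile-time parameter instead of hard-coding 512 bits, so the same kernel can be instantiated at the target's natural width. This is a hypothetical sketch, not the library's actual code:

```cpp
#include <cstddef>
#include <cstdint>

// W = number of 64-bit words processed per block. With W = 8 the inner
// loop maps to one 512-bit vector on AVX-512; with W = 2 it maps to one
// 128-bit vector on NEON, avoiding spills to the stack.
template <size_t W>
static void xor_blocked(uint64_t* out, const uint64_t* in,
                        const uint64_t* mask, size_t n) {
    size_t i = 0;
    for (; i + W <= n; i += W)
        for (size_t j = 0; j < W; ++j)  // fixed-trip loop auto-vectorizes
            out[i + j] = in[i + j] ^ mask[i + j];
    for (; i < n; ++i) out[i] = in[i] ^ mask[i];  // remainder
}

// Pick the block width per target at compile time.
#if defined(__AVX512F__)
constexpr size_t kBlockWords = 8;  // 512-bit blocks
#else
constexpr size_t kBlockWords = 2;  // 128-bit blocks
#endif
```

Halide's schedules give the same knob (and more) without duplicating the algorithm, which is presumably the attraction here.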
Another update.
That's interesting. @jlinford will have better advice than me on SVE porting. Is the 512b case addressed by SIMDe support that maps AVX-512 w/ ZMM to SVE512? |
I am reviewing openjournals/joss-reviews#3537 and was unable to test on my Apple M1 (my machine is not set up to compile for x86, even if Rosetta 2 supports executing such binaries).
I see that the code has support for all the x86 SIMD instruction sets via Agner Fog's library, but there does not appear to be a generic scalar implementation that would allow the code to be used on any platform with an ISO C++ compiler.
If it were completely obvious to me, I would contribute this feature, but I do not think I am qualified to unwind this dependency and implement generic platform support.