generic/portable support for non-x86 systems #1

jeffhammond opened this issue Jul 26, 2021 · 8 comments


@jeffhammond

I am reviewing openjournals/joss-reviews#3537 and was unable to test on my Apple M1 (my machine is not set up to compile for x86, even though Rosetta 2 supports executing such binaries).

I see that the code supports all the x86 SIMD instruction sets via Agner Fog's vectorclass library, but there does not appear to be a generic scalar implementation that would allow the code to be used on any platform with an ISO C++ compiler.

If it were completely obvious to me how to do this, I would contribute the feature myself, but I do not think I am qualified to unwind this dependency and implement generic platform support.

@twesterhout
Owner

@jeffhammond thank you for taking the time to review this package!

Since most compute clusters use CPUs based on x86_64, I had never even considered supporting other architectures. Writing a high-performance ARM implementation would be too much work for too little benefit (publication-quality results will be produced on clusters anyway), but I like the idea of a scalar fallback implementation for easy testing.

I'm happy to give a scalar implementation a go, but I have no ARM/AArch64 machine to test it on. Would you be willing to work together on this and use your Apple M1 as a guinea pig for compiling the code?

@jeffhammond
Author

Yes, I would be happy to test a generic port of this code on all of my ARM systems. I have lots of them 😎

Below is more info about ARM adoption than you ever wanted to know...

You can buy a 64-bit Raspberry Pi 4 system for less than €50. They are not the most powerful systems but they were just fine for porting many codes. I have developed LLVM and NWChem on Raspberry Pi or similar systems, both of which are millions of lines of code.

AWS offers 64-core ARM servers that are competitive with the best x86 CPUs for both compute- and memory-bound workloads, and are, at the very least, useful for doing ARM porting if you don't want to own hardware locally.

Ampere is building ARM servers for data centers, which are more powerful than many x86 servers (details). Oracle offers Ampere servers in their cloud right now. I recently presented on the performance of these CPUs versus Intel Xeon processors for the NWChem quantum chemistry workload.

The current fastest supercomputer in the world, Fugaku, is an ARM system. The next two after that are Power9+NVIDIA. Number 4 is a custom non-x86 chip. At #5, we see the first x86 supercomputer, which gets a large portion of its performance from NVIDIA GPUs. Top500 has details.

I can help you get access to the same CPUs as are found in Fugaku at Bristol or Georgia Tech if you like. I guess you are in Europe, so the ping times to Bristol should be better.

In time, you will see an increasing number of ARM-based HPC systems in the world, such as those based on the NVIDIA Grace processor, which are already planned at Los Alamos and CSCS.

I didn't even mention what Apple, Qualcomm and SiPearl are doing with ARM right now, but the internet has details.

Anyways, I work for NVIDIA and building ARM HPC software is a big part of my day job, so I'm biased, but even when I worked for Intel (2014-2021), I did not encourage people to write x86-specific code.

@twesterhout
Owner

Wow! This is a lot of very good info. I wasn't aware of these developments... Thank you so much!

This argument is so convincing that now I want a fast ARM implementation :D And then compare the performance of Ampere to AMD EPYC (which I've used for benchmarking).

The easiest option is to swap vectorclass/version2 for e.g. simd-everywhere/simde and play around with CMake to make it compile on ARM as well.

A completely different solution is to use Halide to generate kernels for various platforms and then do run-time dispatch similarly to how it's done now. The upside is that performance tuning for a few specific architectures becomes way easier, and it would be trivial to get a proof-of-concept GPU version running. The downside is that it's much more work, and I'm not entirely sure that the performance will be on par with the current version.

So I'm still thinking about the best way to approach this issue...

@jeffhammond
Author

Yeah, SIMDe is one of the recommended solutions here. A friend at Arm mentioned he ported most of vectorclass to ARM with it, but there were a few small gaps (which might not affect you).

Halide and other DSLs are very interesting. I looked at Halide a bit in the past but the learning curve was steep. I will have to look at your code more to see what other options exist. If there's a massive amount of task or data parallelism, there are many options. If one has to manage SIMD/SIMT and L1/SLMs explicitly, it's harder.

@jlinford

Hi! Porting vectorclass to Arm (Graviton 2 actually) wasn't too hard. SIMDe's not a drop-in replacement (see below) but it gets you nearly there. Most of the code changes were due to differences in the compilers between systems; you'd see the same sort of issues moving from x86 to POWER. Someone comfortable writing AVX512 intrinsics is also comfortable leveraging Intel compiler intrinsics and inline x86 assembly, and SIMDe can't help with that.

I would love to send you a pull request, but Arm's lawyers would get upset since I don't yet have the proper paperwork in place to contribute to this open source project. While I'm getting that sorted, I can explain the changes and let you take it from there, if you like.

To get started:

In version2/instrset.h:

//#include <x86intrin.h>                 // Gcc or Clang compiler
#include <simde/x86/avx512.h>
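
Depending on how SIMDe is pulled in, you may also need to enable its native aliases so the original _mm_* / _mm512_* names keep resolving (by default SIMDe only exposes the simde_-prefixed versions). A minimal sketch, assuming the define isn't already passed on the command line:

#define SIMDE_ENABLE_NATIVE_ALIASES   // keep the original intrinsic names working
#include <simde/x86/avx512.h>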

Problems and solutions:

  • vectorclass assumes the signedness of char, e.g. ../version2/vectori128.h:1037:45: error: narrowing conversion of ‘-1’ from ‘int’ to ‘char’ inside { } [-Wnarrowing]. Either specify signed char in the appropriate place, or compile with -fsigned-char.

  • vectorclass macros don't always paren-protect their arguments, e.g. ../version2/vectorf128.h:2126:82: error: macro "_mm_castsi128_pd" passed 4 arguments, but takes just 1. Fixed by using double parens, e.g. _mm_castsi128_pd((args)).

  • vectorclass assumes your compiler supports Intel compiler intrinsics, e.g. ../version2/instrset.h:289:22: error: ‘_mm_popcnt_u32’ was not declared in this scope. Fixed by rewriting the code to not use Intel intrinsics.

  • vectorclass uses inline x86 assembly, e.g. __asm("bsrl %1, %0" : "=r"(r) : "r"(a) : );. Fixed by rewriting the code to use GCC builtin functions; a sketch of this and the popcount rewrite follows this list.

  • vectorclass doesn't know __aarch64__ is a 64-bit platform. Fixed by defining __x86_64__ when __aarch64__ is defined. Equating __x86_64__ with __aarch64__ is a dirty hack and shouldn't be done, but it's a first step.
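
For reference, the popcount and bit-scan rewrites were along these lines (a rough sketch assuming a GCC/Clang-compatible compiler, not the exact patch):

#include <cstdint>

static inline int popcnt_u32(uint32_t a) {
    return __builtin_popcount(a);                        // replaces _mm_popcnt_u32
}

static inline uint32_t bit_scan_reverse(uint32_t a) {
    // Replaces the inline "bsrl" assembly: index of the highest set bit.
    // Like bsr, the result is undefined for a == 0.
    return 31u - static_cast<uint32_t>(__builtin_clz(a));
}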

All this took less than an afternoon.

Some tests fail due to SIMDe's documented caveats; see https://github.com/simd-everywhere/simde#caveats. I've attached the output from ./runtest.sh test1.lst so you can see which tests are failing.

test1.out.gz

Stats:

Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz:
5490/5490 tests executed successfully in 62 minutes 5 seconds

Graviton 2:
5297/5490 tests executed successfully in 135 minutes 0 seconds

@twesterhout
Owner

Sorry for the delay, I was moving and had to take a few days off.

@jlinford thank you for your insights, they're very helpful!

I still haven't decided whether Halide or SIMDe is the right way to go. In lattice-symmetries, SIMD is currently used for two things:

  1. Computing representatives of basis configurations. This amounts to a bunch of bit manipulation intrinsics and multiple reductions (a rough scalar sketch follows this list).
  2. Searching for representatives, i.e. we have a vectorized binary search.
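
To give an idea of what 1. involves, here is a rough scalar sketch (the names are made up for illustration; the real code also tracks phases and has to handle configurations of up to 512 bits):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// For this sketch, a symmetry is just a permutation of spin indices.
using Permutation = std::vector<int>;

// Apply a permutation to the bits of a 64-bit configuration.
uint64_t apply(Permutation const& p, uint64_t bits) {
    uint64_t out = 0;
    for (std::size_t i = 0; i < p.size(); ++i)
        out |= ((bits >> i) & uint64_t{1}) << p[i];
    return out;
}

// The representative of a configuration is the smallest element of its orbit
// under the symmetry group.
uint64_t representative(std::vector<Permutation> const& group, uint64_t bits) {
    uint64_t r = bits;
    for (auto const& p : group)
        r = std::min(r, apply(p, bits));
    return r;
}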

I gave 1. a try in Halide (you can find some work-in-progress code here: https://github.com/twesterhout/lattice-symmetries-halide). It seems to produce reasonable C code, but I haven't analyzed the assembly yet. The reason I'd like to use Halide here is that it's trivial to change how loops are unrolled and vectorized, and I'm afraid that simply processing 512 bits at a time (which is what the current implementation does) is sub-optimal when the natural vector size is 128 bits (i.e. I see that gcc dumps stuff to the stack rather than doing everything in registers).
My plan now is to benchmark and optimize this Halide implementation and see how fast I can get it to work.
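
For reference, the overall Halide flow looks roughly like this (a toy stand-in kernel, not the actual work-in-progress code):

#include "Halide.h"
using namespace Halide;

int main() {
    // Toy stand-in for the real kernels: popcount of every 64-bit word.
    ImageParam bits(UInt(64), 1, "bits");
    Var i("i");
    Func counts("counts");
    counts(i) = popcount(bits(i));

    // The schedule is the part that is trivial to tweak: how the loop is
    // split, unrolled, and vectorized lives here, separate from the algorithm.
    counts.vectorize(i, 8);

    // Emit plain C (or a static library / object file) for the chosen target.
    counts.compile_to_c("counts.c", {bits}, "counts_kernel");
    return 0;
}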

  2. definitely needs to be implemented in SIMDe, but I think it's not too difficult.

Everything else is pure C & C++ code and should just compile for other architectures. If I manage to get the Halide implementation of 1. to perform reasonably well, I'll generate ARM kernels, write some wrapper code, and ping you @jeffhammond to try compiling it on one of your machines.

@twesterhout
Owner

Another update.

  • Halide is awesome for 64-bit stuff (i.e. when we're dealing with 64 spins at most). I have implemented kernels for is_representative and state_info. These are in principle sufficient for all ED stuff (if we replace the binary search with the one from the STL). The performance is also quite good: on my laptop with AVX2, the Halide kernels run up to 1.5 times faster than my hand-written ones.
  • 512-bit manipulation stuff sucks... I need to be able to bit-shift 512-bit lanes (mind you, not 64-bit lanes, but really all 512 bits together; a scalar sketch of what I mean follows this list). I failed to make Halide generate reasonable assembly for it, so I guess I'll need to fall back to SIMDe for this. I think I can get a version with 128-bit vectors (i.e. SSE2 and NEON) up and running, but I'll need help with fancier SVE versions since SIMDe doesn't yet support SVE very well.
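
For concreteness, the operation I need is roughly the following (a portable scalar sketch, with the 512-bit configuration stored as eight 64-bit words):

#include <array>
#include <cstdint>

using Bits512 = std::array<uint64_t, 8>;   // word 0 is the least significant

// Shift the whole 512-bit value left by count bits (0 <= count < 512),
// carrying bits across the 64-bit word boundaries.
Bits512 shift_left(Bits512 const& x, unsigned count) {
    Bits512 out{};
    int const word = static_cast<int>(count / 64);
    int const bit  = static_cast<int>(count % 64);
    for (int i = 7; i >= word; --i) {
        out[i] = x[i - word] << bit;
        if (bit != 0 && i > word)
            out[i] |= x[i - word - 1] >> (64 - bit);
    }
    return out;
}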

@jeffhammond
Author

That's interesting. @jlinford will have better advice than me on SVE porting. Is the 512b case addressed by SIMDe support that maps AVX-512 w/ ZMM to SVE512?
