Support NEON instruction set #12
Comments
Yes, it definitely would. I have no ARM hardware to test the implementation on, though. The process to port the ozz SIMD implementation is:
The whole library, including SoA implementation, is based on the functions from simd_math_*-inl.h, so there's nothing else needed. |
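As a rough illustration of what "everything is built on one set of per-platform functions" can look like, here is a minimal sketch. The `sketch` namespace and the names `Float4`, `Load`, `Add`, `MAdd`, `Store` and `ScaleAndOffset` are hypothetical and are not taken from ozz's `simd_math_*-inl.h` files; only the intrinsics themselves (`vld1q_f32`, `vmlaq_f32`, `_mm_set_ps`, ...) are real. The idea is that higher-level code is written once against the small function set, and only that set is re-implemented per instruction set.

```cpp
// Hypothetical sketch of a per-platform "-inl.h" style layer.
// Names are illustrative only, not ozz's actual API.
#include <cassert>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
namespace sketch {
typedef float32x4_t Float4;
inline Float4 Load(float x, float y, float z, float w) {
  const float v[4] = {x, y, z, w};
  return vld1q_f32(v);
}
inline Float4 Add(Float4 a, Float4 b) { return vaddq_f32(a, b); }
// Returns a * b + c. vmlaq_f32(acc, m0, m1) computes acc + m0 * m1.
inline Float4 MAdd(Float4 a, Float4 b, Float4 c) { return vmlaq_f32(c, a, b); }
inline void Store(Float4 v, float* out) { vst1q_f32(out, v); }
}  // namespace sketch
#elif defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
namespace sketch {
typedef __m128 Float4;
// _mm_set_ps takes its arguments from the highest lane to the lowest.
inline Float4 Load(float x, float y, float z, float w) {
  return _mm_set_ps(w, z, y, x);
}
inline Float4 Add(Float4 a, Float4 b) { return _mm_add_ps(a, b); }
// Returns a * b + c.
inline Float4 MAdd(Float4 a, Float4 b, Float4 c) {
  return _mm_add_ps(_mm_mul_ps(a, b), c);
}
inline void Store(Float4 v, float* out) { _mm_storeu_ps(out, v); }
}  // namespace sketch
#else
namespace sketch {
// Scalar reference fallback with the same interface.
struct Float4 { float f[4]; };
inline Float4 Load(float x, float y, float z, float w) {
  Float4 r = {{x, y, z, w}};
  return r;
}
inline Float4 Add(Float4 a, Float4 b) {
  Float4 r;
  for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] + b.f[i];
  return r;
}
inline Float4 MAdd(Float4 a, Float4 b, Float4 c) {
  Float4 r;
  for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] * b.f[i] + c.f[i];
  return r;
}
inline void Store(Float4 v, float* out) {
  for (int i = 0; i < 4; ++i) out[i] = v.f[i];
}
}  // namespace sketch
#endif

namespace sketch {
// Platform-independent code is written only against the functions above.
inline void ScaleAndOffset(const float* in, float scale, float offset,
                           float* out) {
  assert(in != nullptr && out != nullptr);
  const Float4 v = Load(in[0], in[1], in[2], in[3]);
  const Float4 s = Load(scale, scale, scale, scale);
  const Float4 o = Load(offset, offset, offset, offset);
  Store(MAdd(v, s, o), out);
}
}  // namespace sketch
```

With a layout like this, adding NEON support amounts to filling in one more branch of the function set; callers such as `ScaleAndOffset` do not change.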
I'm reopening the request, as I think it makes a lot of sense to implement it. |
https://github.com/scoopr/vectorial or even https://github.com/jratcliff63367/sse2neon would be good references for an SSE/NEON implementation. |
We're going to be starting on Switch soon. Expect a PR early next year, but if someone wants to do it before us, that would be nice! |
Awesome news @kylawl. Don't hesitate to reach me if you want to discuss this or need help/support. |
So it's been a while and I'm back looking at this again. As a first step, I thought I'd just try using sse2neon to see if there's any benefit from simply aliasing all the SSE intrinsics to NEON as-is. Performance is actually surprisingly poor going this route on Switch. The "sse reference implementation" takes about 1.2ms for our whole animation phase, while using sse2neon takes 2.7ms! Not exactly what I was expecting/hoping for. I've seen some discussion suggesting we could be limited by memory access overhead rather than computation. This is going to need some more investigation. |
If I remember correctly, Bullet physics had code contributed by Apple that made it very performant on ARM/iOS. Maybe that would be worth looking at? |
Welcome back! You say 1.2ms for "sse reference implementation". Do you mean the float/scalar reference implementation? If so, it could be worth checking the generated code to see how much the compiler auto-vectorizes it. All the SoA usages of the math library in ozz are very easy for the compiler to auto-vectorize, so maybe NEON is already in use. That doesn't mean 1.2ms cannot be optimized, but optimization expectations would be lower. Are the memory access overhead issues you mentioned specific to NEON? |
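To make the auto-vectorization point concrete, here is a minimal sketch of the kind of loop compilers usually vectorize without any intrinsics. `LerpSoa` and its parameters are hypothetical and not part of ozz; the assumption is only that each component lives in its own contiguous float array (SoA), so the loop body is the same independent operation on every element.

```cpp
// Hypothetical SoA-style loop (not ozz code): one contiguous float array per
// component, with no dependency between iterations.
#include <cassert>
#include <cstddef>

// Linearly interpolates count floats: out[i] = a[i] + (b[i] - a[i]) * t.
inline void LerpSoa(const float* a, const float* b, float t, float* out,
                    std::size_t count) {
  assert(a != nullptr && b != nullptr && out != nullptr);
  for (std::size_t i = 0; i < count; ++i) {
    out[i] = a[i] + (b[i] - a[i]) * t;
  }
}
```

One way to check whether the compiler vectorized such a loop is to ask it for a report, for example with GCC's `-O3 -fopt-info-vec` or Clang's `-O2 -Rpass=loop-vectorize`, or to inspect the disassembly for NEON register usage. Because the pointers may alias, the compiler may also emit a runtime overlap check ahead of the vector loop.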
You're probably right that the auto-vectorization is doing a decent job. One thing that sse2neon misses is the common shuffle operation that we use to splat the same value into all 4 components. For that particular shuffle, it uses a multi-instruction "generic" path even though ARM has a specific instruction for that operation. After spending some more time on Switch optimizations, I don't think this is a memory access issue. Needs further investigation for sure. |
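For reference, the splat in question can be sketched as below. The names `Vec4`, `SplatX`, `LoadU`, `StoreU` and `SplatXArray` are hypothetical; the intrinsics are real. On SSE the broadcast is written as a shuffle of a register with itself, while NEON offers a lane-duplicate intrinsic (`vdupq_lane_f32`) that typically maps to a single `DUP`/`VDUP` instruction. A translation layer that routes every `_mm_shuffle_ps` through a generic lane-by-lane path would miss that shortcut, which is the behaviour described above for the sse2neon version tested at the time.

```cpp
// Hypothetical sketch: broadcasting lane 0 ("x") to all four lanes.
#include <cassert>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
typedef float32x4_t Vec4;
inline Vec4 LoadU(const float* p) { return vld1q_f32(p); }
inline void StoreU(float* p, Vec4 v) { vst1q_f32(p, v); }
// Dedicated lane duplicate: take lane 0 of the low half, copy it everywhere.
inline Vec4 SplatX(Vec4 v) { return vdupq_lane_f32(vget_low_f32(v), 0); }
#elif defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
typedef __m128 Vec4;
inline Vec4 LoadU(const float* p) { return _mm_loadu_ps(p); }
inline void StoreU(float* p, Vec4 v) { _mm_storeu_ps(p, v); }
// SSE expresses the broadcast as a shuffle selecting lane 0 four times.
inline Vec4 SplatX(Vec4 v) {
  return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}
#else
// Scalar fallback with the same interface.
struct Vec4 { float f[4]; };
inline Vec4 LoadU(const float* p) {
  Vec4 r = {{p[0], p[1], p[2], p[3]}};
  return r;
}
inline void StoreU(float* p, Vec4 v) {
  for (int i = 0; i < 4; ++i) p[i] = v.f[i];
}
inline Vec4 SplatX(Vec4 v) {
  Vec4 r = {{v.f[0], v.f[0], v.f[0], v.f[0]}};
  return r;
}
#endif

// Convenience wrapper: reads 4 floats from in, writes in[0] to all of out.
inline void SplatXArray(const float* in, float* out) {
  assert(in != nullptr && out != nullptr);
  StoreU(out, SplatX(LoadU(in)));
}
```

The same pattern applies to the y, z and w splats by changing the lane index (and using `vget_high_f32` for lanes 2 and 3 on NEON).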
Hi, what did you end up doing on Switch? Did you need/implement NEON optimizations? Cheers, |
It's been a while, but if I remember correctly, the compiler was able to optimize the output sufficiently for us to use. We tried one of those SSE-to-NEON headers and it was significantly slower than just using the vanilla one. Bear in mind that our skeletons only had a small number of bones, maybe averaging 30 bones, on at most 5 characters at a time.
On Switch we were CPU-bound, but the minimal animation time was outstripped by the "open worldness" of the game.
Sorry we never got to completing that.
On Mon, Mar 4, 2024, 12:34 p.m. Guillaume Blanc wrote:
Hi, what did you end up doing on Switch? Did you need/implement neon optimizations? Cheers, Guillaume
|
No worries, thanks for the feedback. |
It would be great if NEON is supported :)