wuffs 0.4 significantly slower than 0.3 decoding PNGs #148
Comments
Thanks for the bug report. I don't run Windows or VS myself. Are you able to bisect a few Wuffs versions to see which commit caused the slow down? If you're using
That's pretty coarse-grained though. If you have the time, I'd find it more helpful if you can bisect using
Also, what compiler flags are you using? |
Tangentially, if this is your code: Then an explicit "Convert from BGR to RGB" shouldn't be necessary. Instead, I think that you can change line 276 from this:
to this:
|
I'd also like to know whether any of these
|
Yeah that's my code. In addition, despite being relatively unoptimised code, my BGRA->RGBA swizzling seems faster than Wuffs:
|
Just _M_X64. I'm not building with AVX. |
|
I suspect that the slow-down is due to SIMD code no longer being used. Comparing
Also, there's a
Out of curiosity (I'm not familiar with MSVC / Visual Studio), both
Did you notice that at all, when building Wuffs? |
That might not be true, though. Fortunately, there's not that many commits between
|
WUFFS_BASE__CPU_ARCH__X86_64 is not defined for me.
Yes, and it's kind of annoying :) Looking at the code, the issue seems to be
preventing SSE from being used. As mentioned before I'm not building with AVX. |
To be clear, are you saying "WUFFS_BASE__CPU_ARCH__X86_64 is not defined for me" for just Again, what does
Yeah, it's not ideal, but I don't know how to make it better. Wuffs ships as a "single file C library". And in gcc or clang, code can opt-in to "compile me with SIMD enabled" via an So, for VS, I'd like the single file C library to work out of the box (even if, by default, it's leaving significant performance on the table), and it does, but the Sort of tangential to the original post, but as you're concerned about PNG decode performance: if your CPUs are less than 10 years old, then I'm curious how that "8.4 milliseconds" number changes if you do pass |
Maybe you could add a preprocessor option to suppress the warning? something like |
|
It's not defined for me using v0.4.0-alpha.4. Not sure about alpha 3. |
I think I have done enough remote debugging for now sorry. I think you need a windows build machine :) |
OK, but in that case, I don't expect this bug to be fixed any time soon. Sorry. |
Isn't it clear what the issue is? The SSE code is only being used when AVX is defined. |
For Wuffs + Visual Studio, "SSE code is only being used when AVX is defined" was true for Wuffs v0.3, v0.4.0-alpha.3 and v0.4.0-alpha.4. All three versions have the same At least, I think it's the same across all three versions, and that's why I asked you previously if you could confirm that (for older versions not just v0.4.0-alpha.4). If so, "SSE only when AVX defined" doesn't explain why performance regressed between v0.3 and v0.4, or between v0.4.0-alpha.3 and v0.4.0-alpha.4. I don't think it's clear yet what the issue is. |
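To make the "SSE only when AVX is defined" point concrete, here is a minimal sketch of that kind of compile-time gate. It is not Wuffs' actual source: the `EXAMPLE_*` macro names and the `example_simd_level` function are hypothetical. The only facts relied on are that MSVC predefines `_M_X64` for 64-bit x86 targets and predefines `__AVX__` only when `/arch:AVX` (or higher) is passed, and that GCC and Clang predefine `__x86_64__`.

```c
// Hypothetical EXAMPLE_* macros; an illustration only, not Wuffs' real gate.
#if defined(_MSC_VER) && !defined(__clang__)
// MSVC: _M_X64 is predefined for 64-bit x86 targets. __AVX__ is predefined
// only when the build passes /arch:AVX (or a higher /arch level).
#if defined(_M_X64)
#define EXAMPLE_ARCH_X86_64 1
#if defined(__AVX__)
#define EXAMPLE_SIMD_ENABLED 1
#endif
#endif
#elif defined(__x86_64__)
// GCC / Clang: SIMD can be enabled per function, so this sketch does not
// require any global flag here.
#define EXAMPLE_ARCH_X86_64 1
#define EXAMPLE_SIMD_ENABLED 1
#endif

// Returns 0 (not x86_64), 1 (x86_64 but SIMD gated off at compile time)
// or 2 (x86_64 with SIMD code paths compiled in).
int example_simd_level(void) {
#if defined(EXAMPLE_SIMD_ENABLED)
  return 2;
#elif defined(EXAMPLE_ARCH_X86_64)
  return 1;
#else
  return 0;
#endif
}
```

Under a gate shaped like this, a default MSVC x64 build (no `/arch` flag) would report level 1: the target is known to be x86_64, yet every SIMD path is compiled out.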
But also, I don't think "SSE code is only being used when AVX is defined" has an obvious fix. SSE isn't a single thing, it's at least six different things: SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2. If all you have is On the other hand, Wuffs "SSE code" (in both version 0.3 and 0.4) uses intrinsics like |
There's a few things going on here I think. WUFFS_BASE__CPU_ARCH__X86_FAMILY and are only defined when The second thing that is going on is that I was working around this (I guess you could call it) by defining WUFFS_BASE__CPU_ARCH__X86_FAMILY myself before including the wuffs c file. This workaround stopped working well in wuffs 0.4 alpha 4, when WUFFS_BASE__CPU_ARCH__X86_FAMILY became used less (in fact, not used at all). |
Well currently just including wuffs by default on windows x64 wouldn't even use SSE2. Solution I think is to detect building on x64, then allow SSE1 and SSE2. Or you could call it something like |
I'd say more "poorly named" than "incorrect". WUFFS_BASE__APPLY_X86_SIMD_OPTIMIZATIONS_AND_YOU_CAN_ASSUME_ALL_OF_SSE_NOT_JUST_SSE1_AND_SSE2 is more accurate, but maybe not a better name. In terms of "SIMD capability granularities" and not "which WUFFS macros trigger which SIMD code paths", I'm only looking at MSVC documentation (I'm not running MSVC myself) but for e.g. SSE4.1 code, not SSE1 or SSE2, https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?view=msvc-170 says you're going to need |
Ah, "defining it yourself" is a crucial bit of information that wasn't obvious from the earlier conversation. Having a second look at your That macro is a private implementation detail and it's not designed for library users to configure. Wuffs' documentation could certainly be better, but only the macros starting with If you need an immediate workaround for I'll think about whether to introduce a I wouldn't have done so in the past as I didn't know how MSVC would behave when you try to compile SSE4.2 intrinsics without also passing the |
If you're curious about why Wuffs' "SSE-family enabled" behavior changed, from depending on Rather than dealing with any similar-to-#145 issues in the future, it seemed simpler for Wuffs to enable its SIMD code (both SSE-family and AVX) only on 64-bit x86, not both 32-bit and 64-bit x86. That involved the |
That won't help Wuffs' PNG performance. Here's a code snippet from
That all works fine if you assume SSE4.2 (and earlier SSEs). But what if you only have SSE1 and SSE2? It turns out that e.g. the
Wuffs does choose code at run time. Furthermore, Wuffs' compiler (it takes in Currently, Wuffs' "can I call this |
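As a rough illustration of the run-time side, the sketch below queries CPUID leaf 1 and tests ECX bit 20, which is the SSE4.2 feature flag. This is not Wuffs' detection code: `example_cpuid` and `example_cpu_has_sse42` are hypothetical names. It assumes `__cpuid` from `<intrin.h>` on MSVC and `__get_cpuid` from `<cpuid.h>` on GCC or Clang, and it simply reports "no" on other targets.

```c
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define EXAMPLE_HAVE_CPUID 1
// MSVC flavor: __cpuid fills {EAX, EBX, ECX, EDX} for the given leaf.
static void example_cpuid(uint32_t leaf, uint32_t out[4]) {
  int regs[4] = {0, 0, 0, 0};
  __cpuid(regs, (int)leaf);
  for (int i = 0; i < 4; i++) {
    out[i] = (uint32_t)regs[i];
  }
}
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define EXAMPLE_HAVE_CPUID 1
// GCC / Clang flavor: __get_cpuid leaves the outputs untouched (still zero
// here) if the requested leaf is not supported.
static void example_cpuid(uint32_t leaf, uint32_t out[4]) {
  unsigned int a = 0, b = 0, c = 0, d = 0;
  __get_cpuid(leaf, &a, &b, &c, &d);
  out[0] = a;
  out[1] = b;
  out[2] = c;
  out[3] = d;
}
#endif

// Returns 1 if the running CPU reports SSE4.2, otherwise 0.
int example_cpu_has_sse42(void) {
#if defined(EXAMPLE_HAVE_CPUID)
  uint32_t regs[4] = {0, 0, 0, 0};
  example_cpuid(1, regs);
  return (int)((regs[2] >> 20) & 1u);  // CPUID.01H:ECX bit 20 is SSE4.2.
#else
  return 0;  // Not an x86 target (or an unrecognized compiler).
#endif
}
```

A run-time check like this only decides which function to call. The SIMD function itself still has to be compiled with the relevant instructions enabled, which is the compile-time question the thread is about.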
Yeah, I'm not interested in just targeting SSE2. Targeting SSE 4.2 is much more reasonable these days. |
The x86-64-v3 micro-architecture level requires AVX2. Updates #148
If I understand the #148 discussion correctly, a user overrode WUFFS_BASE__CPU_ARCH__X86_FAMILY (without also providing /arch:AVX) to enable SSE4.2 code (even though, technically, x86_64 doesn't guarantee SSE4.2, only SSE2). This seemed to work fine in practice, but was unsupported by Wuffs and 'broke' by fixing #145, now looking for WUFFS_BASE__CPU_ARCH__X86_64 instead of WUFFS_BASE__CPU_ARCH__X86_FAMILY. WUFFS_BASE__CPU_ARCH__X86_FAMILY has since been renamed to WUFFS_PRIVATE_IMPL__CPU_ARCH__X86_FAMILY and then removed entirely. As SSE4.2 (roughly equivalent to x86-64-v2) seems to work fine at compile time (with the existing cpuid detection at runtime), enable it by default for MSVC, without needing /arch. Enabling x86-64-v3 too, by default, is held back for now, pending further confirmation from MSVC users. The #148 user was exercising Wuffs' PNG codec, which uses SSE4.2 but not AVX or AVX2. Currently, only Wuffs' JPEG codec (and YCbCr conversion) uses AVX or AVX2. GCC and Clang don't need this fiddliness, because they support "__attribute__((target(arg)))". Updates #148
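The `__attribute__((target(arg)))` mechanism mentioned above can be sketched as follows. This is an illustration under stated assumptions, not Wuffs code: all `example_*` names are hypothetical, and CRC-32C is used only because `_mm_crc32_u8` is a convenient SSE4.2 intrinsic (Wuffs' PNG decoder uses different intrinsics). The run-time dispatch uses the GCC / Clang builtin `__builtin_cpu_supports`. On MSVC, which has no per-function target attribute, only the scalar path is compiled in this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <nmmintrin.h>  // SSE4.2 intrinsics, including _mm_crc32_u8.
#define EXAMPLE_HAVE_SSE42_PATH 1

// Only this function is compiled with SSE4.2 enabled. The rest of the file
// needs no -msse4.2 flag.
__attribute__((target("sse4.2")))
static uint32_t example_crc32c_sse42(const uint8_t* p, size_t n) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; i++) {
    crc = _mm_crc32_u8(crc, p[i]);
  }
  return crc ^ 0xFFFFFFFFu;
}
#endif

// Portable bit-at-a-time fallback (reflected CRC-32C polynomial).
static uint32_t example_crc32c_scalar(const uint8_t* p, size_t n) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; i++) {
    crc ^= p[i];
    for (int k = 0; k < 8; k++) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Picks the SSE4.2 path only if it was compiled in and the CPU supports it.
uint32_t example_crc32c(const uint8_t* p, size_t n) {
  assert(p != NULL || n == 0);
#if defined(EXAMPLE_HAVE_SSE42_PATH)
  if (__builtin_cpu_supports("sse4.2")) {
    return example_crc32c_sse42(p, n);
  }
#endif
  return example_crc32c_scalar(p, n);
}
```

Callers would use `example_crc32c(data, len)` and never name a SIMD variant directly. Both paths should agree on every input, so the choice affects speed only.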
I have cut a new
If you try it, please let me know if you hit any problems (or performance regressions from v0.3). I don't have a Windows or MSVC install myself. |
This is possibly causing "Internal compiler error" problems with MSVC: see issue #151. Hmm... |
This recreates the case as of commit f169822 (tag v0.4.0-alpha.4), in that, by default (without #define'ing a macro or passing an /arch:ETC compiler flag), Wuffs does not use SIMD on MSVC x86_64. Commit b64a761 (after tag v0.4.0-alpha.4, before tag v0.4.0-alpha.5) changed the default so that x86_64_v2 (roughly equivalent to SSE4.2) was enabled by default, since the user from issue #148 was enabling that anyway (in an unsupported way, by #define'ing a macro that was a private implementation detail) with no problems (and better performance). However, another user later reported (in issue #151) that enabling SIMD on MSVC x86_64 somehow led to ICEs (Internal Compiler Errors). This commit restores the default to "no SIMD" and it is up to the MSVC user to opt in to the SIMD code paths. Clang and GCC are unaffected: SIMD remains enabled by default. Updates #148 Updates #151
I've just rolled Wuffs You can still get ICE if you opt in, but at least you should no longer see #151's ICE by default.
|
On Windows, Visual studio 2022, no AVX, AMD Ryzen 9 5950x.