Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revelator darkmod AVX* changes #631

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

revelator
Copy link

changes for adding DarkMods AVX* based culling and cpu detection as well as the changes nessesary to use them in the renderer.

Adds some additional math used for shadow generation / interaction using Advanced Vector Extensions if Cpu supports it falls back to generic or SSE if not.
@revelator
Copy link
Author

A good deal of formatting differences make these a bit larger than they have to be but i cannot do much about that.
should be no conflicts at least.

return "MMX & SSE & SSE2 & SSE3 & AVX";
}

#elif defined(_MSC_VER) && defined(_M_IX86)
Copy link
Member

@DanielGibson DanielGibson Nov 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this code (and the AVX2 code) only enabled when using MSVC (VisualC++) on 32bit(!) x86?

@DanielGibson
Copy link
Member

Did you do any benchmarks to see how much this improves performance?

@revelator
Copy link
Author

ah damn forgot to add the enums for the new cpu types (slaps head) :/

just forgot to add them for the other targets in the middle of correcting that :)

i did do a simple benchmark with vsync off in sikkmod with all the gruesome applied i got 130 fps in 4k.

@revelator
Copy link
Author

i dont currently have a linux box on hand to test for mistakes but i guess the buildbot in git does have one.
ill see if there are more things to look out for.

forgot these in the last pull request doh...
@DanielGibson
Copy link
Member

i did do a simple benchmark with vsync off in sikkmod with all the gruesome applied i got 130 fps in 4k.

and how much without these changes?

@@ -332,7 +327,9 @@ void TestAdd( void ) {
}
idLib::common->Printf( "====================================\n" );

// ======================
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the purpose of all these changes?

@@ -35,19 +35,6 @@ If you have questions concerning this license or the applicable additional terms
//
//===============================================================

#if defined(__GNUC__) && defined(__SSE3__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing all this probably won't work either, esp. on non-x86 CPUs

@revelator
Copy link
Author

hmm compiles here on x64 atleast on msvc but try it on a linux build and let me know if it fails.

forgot to add the type for ALIGNTYPE16 its basically the same as ALIGN16 with a small difference

#define ALIGN16( x )				__declspec( align( 16 ) ) x
#define ALIGNTYPE16				__declspec( align( 16 ) )

for msvc

#define ALIGN16( x )				x __attribute__ ( ( aligned ( 16 ) ) )
#define ALIGNTYPE16				__attribute__ ( ( aligned ( 16 ) ) )

for gnu and probably other unixes.

without the avx code changes sikkmod could barely reach above 60 fps with the heaviest settings (was playable but it did show)
my card is an older RTX 2080 ti

@revelator
Copy link
Author

happy i did not also add the GLSL renderer yet im a bit rusty in handling pull requests and coding in general, been retired for some time. Lets hope the last change clears up the build errors. :)

@revelator
Copy link
Author

looking at it now i see i could just have written it as
ALIGN16( unsigned char FXSaveArea[512] );

@DanielGibson
Copy link
Member

I think these "inlining failed in call to ‘always_inline’ ‘int _mm256_movemask_ps(__m256)’: target specific option mismatch" errors mean that the source file it's in needs to be built with the corresponding CPU extension enabled, so Simd_AVX.cpp (and only that source file!) must be built with -mavx. That must be setup in CMake

#define ID_MAX_ALLOCA_SIZE 1048576 // 1MB
// Linux has a 8MB stack by default, and so does macOS, at least for the main thread
// anyway, a 2MB limit alloca should be safe even when using it multiple times in the same function
#define ID_MAX_ALLOCA_SIZE 2097152 // 2MB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a specific reason you increased this to 2MB?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont remember changing that ? maybe something from TDM carried over :S

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no just checked this flag does not exist there and neither anywhere else...
no idea tbh.

@DanielGibson
Copy link
Member

I recommand using the git gui tool to make commits.
It allows you to commit only parts of your changes, by hunk or by line(s), so you can leave out changes that aren't relevant (empty lines, lines with // ======= etc)

@nbohr1more
Copy link

From a TDM perspective, I think the bulk of the AVX changes happened in 2.06 and 2.07 but it's hard to discern how much performance impact they had since we also went "Multi-Core" ( SMP ) with game code vs the renderer in 2.06. I can say that in 2.06 the 32-bit executable was at least 25% slower than the 64-bit because we hadn't yet ported SIMD optimizations to 32-bit.

Though I think 32-bit lost about 10% performance by virtue of going to a Dhewm3 codebase with legacy ( Pentium 4 ) SIMD features removed. ( I know that AMD 64 players instantly pounced on 2.06 as a long awaited vindication that their hardware was always better than the Pentium 4.)

So I guess if this is successfully back-ported, we should expect a 15 to 20% performance depending on the scene. ( This presumes that other engine optimizations like our new light culling system (etc), SMP, vertex buffer changes, aren't also included in here somewhere. ).

Maybe Rev will open a new PR to submit our asset loading optimizations? ( SIMD vectorized mipmap generation was one of the major wins in TDM 2.10 for loading time improvements. )

@DanielGibson
Copy link
Member

more optimizations from TDM are definitely welcome, preferably with one PR per feature :)

@revelator
Copy link
Author

inline problem with linux and mac intrinsics i see argh...

definatly need some more work for those but atleast the windows builds seem to go through now.
ill revert the ALIGNTYPE16 change as it is unessesary with ALIGN16.

@revelator
Copy link
Author

ahh yes need to add the corresponding instruction set with gcc / clang.
adding -mavx and -mavx2 should fix those errors.

@DanielGibson
Copy link
Member

By the way, you can also test building with GCC on Windows by using MinGW32-w64.
The Yamagi Quake II MinGW buildenv worked for dhewm3 last time I tried it, see https://github.com/yquake2/yquake2/blob/4e41dbf186d6d125b35ea338ec7472313071f0d5/doc/020_installation.md#prerequisites-on-windows-when-using-mingw
and the MinGW-specific parts of https://github.com/dhewm/dhewm3?tab=readme-ov-file#using-the-provided-windows-binaries
(you must run cmake in the MinGW shell so it can find its gcc executable etc)

@revelator
Copy link
Author

i do have msys2 with the mingw compilers on another pc.
will move current workdir there also to test with gcc / clang.

@revelator
Copy link
Author

calling it a night for now, its getting quite late here in denmark :)

@revelator
Copy link
Author

to late actually i have to go to the dentist in an hour xD.

reverting some macros for the SSE simd instructions so that it checks for the SSE flag and adding a few new ones for the AVX* ones checking for the AVX and AVX2 flag on gcc / clang since they are the ones making trouble atm.

for avx / avx2 the addition of -mavx -mavx2 might also be needed for gcc / clang i reckon unless checking for those flags are enough. ill try a test build with gcc when i get back :)

Except for shader gamma it works just fine with SDL2, might become a problem in SDL3 though unless gamma code and shader is added to the GLSL interaction routine.
optimized VBO code from michael hiney (InsideQC)
@DanielGibson
Copy link
Member

I'd prefer to have the GLSL code in a separate PR

Added SSE intrinsics to R_LocalPointToGlobal and myGlMultiMatrix and changed name of myGlMultiMatrix to R_MatrixMultiply so it does'nt feel out of place.

/* fast path use sse */
if ( cpuid & CPUID_SSE ) {
__m128 B0x = _mm_loadu_ps( &b[0] );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this won't even compile on non-x86 (or x86_64) machines, so it must be behind some kind of #if or #ifdef

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be fixed now havent comitted the new changes yet though (different machine so i need to setup a few things).

it does compile with mingw-w64 and clang now but i noticed i forgot to change the SSE2 macro to SSE which caused it to fail.

might need another guard for unix / mac lets see.


for ( i = 0 ; i < numPlanes ; i++ ) {
frust = planes + i;
d = frust->Distance( worldOrigin );

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't commit all these unrelated whitespace changes. use git gui or similar to only commit the relevant changes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry been hammering away at getting it to compile on gcc / clang.
mingw-w64 builds work now as well as windows clang builds.

there was one sorry bugger with current mingw-w64-abi the __cpuid function now uses the same syntax as msvc's so i had to change the #if defined(GNUC) && (defined(i386) || defined (x86_64)) guard into one specific for unix so the mingw builds dont use that one.

well atleast it was informal i reckon id make a new repo after getting the last build bugs out and start slowly adding in the changes, that way we can avoid all the whitespace changes from me mucking around in the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now lets use it as a draft to iron out build bugs on different arch's, bit easier than having multiple operating systems
:) when it is all up and running ill retire the fork and start from a fresh one adding only the changes nessesary without the formatting. also gives you the option to chime in before we make changes that may make or break something.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok seems to build on ubuntu now atleast in the one in wsl.
running it on wsl does not seem to work though the game window opens it crashes shortly after with a crash in imgui (not sure wsl is the best testing environment).

i packed up the binaries here https://sourceforge.net/projects/cbadvanced/files/Game%20Projects/dhewm3-ubuntu.7z/download

could you try them out on a real linux box maybe ?.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://sourceforge.net/projects/cbadvanced/files/Game%20Projects/dhewm3-git.tar.zst/download

my current source code ignore whitespace it is not ment to be merged as is only for testing if it works :).

https://sourceforge.net/projects/cbadvanced/files/Game%20Projects/dhewm3.tar.zst/download

i forgot linux sets some flags that are probably lost when zipping it with a windows tool so heres a tarball of the game executables made in ubuntu, i used the linux dist dir in the source as install target (not sure if correct ?).

@revelator
Copy link
Author

ok most whitespace changes are now gone in my local build tree.

one question though __cpuid on mac does something odd to the cpu registers i guess that was why your version exchanged the ebx value for esi, i had to revert it partially because even though the version from darkmod worked in linux it broke when trying to compile it for mac.

i hope to god once i uploaded the change to cpu.cpp that all build error's are gone so i can start seperating the different changes in a new fork.

void R_LocalPointToGlobal( const float modelMatrix[16], const idVec3 &in, idVec3 &out ) {
#if !( defined(MACOS_X) || defined(__APPLE__) )
#if !( defined(__ppc__) || defined(MACOS_X) || defined(__APPLE__) )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a very good check for this purpose. It doesn't do what you want on

  • x86 Macs (there it should be enabled)
  • non-x86 CPUs that aren't PPC (unless they happen to run OSX)

Can't you just use #ifdef __SSE__?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i newer coded for MAC before so was a bit unsure how to proceed :) ill see if just using SSE helps.


// Revelator: these work whether in gcc clang or msvc in x86 or x64 (no inline assembly used)
#if defined(_MSC_VER) && ( defined(_M_X64) || defined(_M_IX86) ) || \
defined(__GNUC__) && ( defined(__i386__) || defined (__x86_64__) ) && defined(__AVX__)
Copy link
Member

@DanielGibson DanielGibson Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With GCC, this currently disabled the code because __AVX__ is only defined if -mavx is used - which would have to be set in CMakeLists.txt, only for this file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, I guess __AVX__ is only set if __i386__ or __x86_64__ is set, so those checks aren't necessary

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i been working on some cmake changes to probe the users cpu and set the correct flags for gnu, still not ready but ill post the change when it is.

but is it only avx that needs a flag set or do we also need it for every SSE type ? like -sse -sse2 -sse3 -ssse3 -mavx -mavx2 etc. if so that would suck.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i been working on some cmake changes to probe the users cpu and set the correct flags for gnu

that is not the right approach - we want binaries that work on all x86 (_64) systems supporting at least SSE or SSE2 or whatever the current baseline is.

so almost all code of dhewm3 should be compiled just like before - only the Simd_FOO.cpp files should get their corresponding -mfoo flag.
That's what I meant with "only for this file".

At least for x86_64 (x64), SSE and SSE2 are always available, so -msse and -msse2 aren't needed (and neither -mmmx).
For SSE3 it would be needed in theory, but in practice Simd_SSE3.cpp uses MSVC-specific inline-assembly that doesn't work with GCC anyway - at least the version currently in dhewm3, don't know if TDM has switched that code to use the more portable intrinsics.

Copy link
Member

@DanielGibson DanielGibson Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm maybe this could also be handled in the source files instead of CMake, like

// add this at the beginning of the source file, under the big comment block with the license
#if defined(__i386__) || defined (__x86_64__)
#ifdef __GNUC__
#pragma GCC target ("avx")
#endif
#endif // // x86 or x86_64

https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5 has a list of valid target arguments

not sure if Clang supports this GCC extension, though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree i just thought you keept the SDL1 code for posterity or possibly some ancient machine ;)
guess it should be removed at some point.

hmm SDL_win32_main.cpp also only seems to be pulled in if you specifically request SDL1.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i foresee one problem atleast with my hybrid glsl/arb renderer since SDL3 no longer supports the old hardware gamma we would need shader code for the glsl interactions for gamma as well when moving on to SDL3.

but lets get the intrinsics working first :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm SDL_win32_main.cpp also only seems to be pulled in if you specifically request SDL1.

oh right, now I remember.
yes, that atexit(cleanup_output); line can be deleted, cleanup_output() has been moved to win_main.cpp and atexit(cleanup_output); is now called from SDL_main() in win_main.cpp (for all SDL versions)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I want GLSL at all at this point, but if it's added it definitely needs to support shader gamma.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normally would not be a problem but because of the way it is loaded directly in RB_ARB2_DrawInteraction it causes problems for my renderer because it does the housekeeping itself if told this one goes to arb then it goes to the arb side and vice versa but since it is loaded directly in the interaction renderer it bonks out if glsl is selected.
as an example i can run sikkmod with glsl interactions and still use all the arb effects :)

@DanielGibson
Copy link
Member

I created a short testcase for D3_PUSH_X86_TARGETS(): https://gist.github.com/DanielGibson/4e35947c548523e2ac42047a76df9239

it needs to be implemented like in that file, the code I suggested earlier didn't work

processorString[9] = pstring[9];
processorString[10] = pstring[10];
processorString[11] = pstring[11];
processorString[12] = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the point of copying pstring to processorString?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure either it's from darkmod stgatilov might know ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null terminating pstring ? but there are better ways.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry had to get some shuteye it was allmost 2.00 in the night here :)
not a young man anymore sadly xD

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for terminating some say 0 is sufficient if using C strings though i hear quite a lot prefer the special '\0' character.

tbh in a c++ project std::string would probably have been better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how a std::string would help here, but for 0-termination pstring[12] = 0; should suffice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this code is directly from the original Doom3 GPL release: https://github.com/TTimo/doom3.gpl/blob/aaa855815ab484d5bd095f347163194ac569abcc/neo/sys/win32/win_cpu.cpp#L199-L223

still no idea why they did that, and I also think that it doesn't really matter if the users CPU is from AMD (original Doom3 used that to decide what clockrate a CPU needs to qualify for medium/high quality, but dhwem3 doesn't do that. https://github.com/TTimo/doom3.gpl/blob/aaa855815ab484d5bd095f347163194ac569abcc/neo/framework/Common.cpp#L2746-L2749)

Copy link
Author

@revelator revelator Dec 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm could be right i admit i did not look at the original code release.
not sure myself why it was done like that myself.

yeah it probably makes less sense with the check today seing as ryzen supports anything intel does, not even sure if needed at the time of 3DNow since that was more like an extension of the MMX model.
probably could be used for determining unsupported cpu types but since there are few (none) of those left and the ones left dont support playing doom3 anyway it might be prudent to remove it unless users actually want to look at cpu information in the doom3 console :)

@DanielGibson
Copy link
Member

DanielGibson commented Nov 22, 2024

Patch for CMakeLists.txt, so it uses (demands) SSE2 on 32bit x86 when building with GCC/Clang (it was already the default for Visual C++):

diff --git a/neo/CMakeLists.txt b/neo/CMakeLists.txt
index b9805579..c97dca69 100644
--- a/neo/CMakeLists.txt
+++ b/neo/CMakeLists.txt
@@ -331,9 +331,22 @@ if(D3_COMPILER_IS_GCC_OR_CLANG)
        if(NOT CMAKE_CROSSCOMPILING AND ONATIVE)
                add_compile_options(-march=native)
        elseif(NOT APPLE AND cpu STREQUAL "x86")
+               # TODO: set this at all? has worked for years so it should be safe, though now that
+               #       we enable SSE2 by default, pentium4 would in theory be more suitable..
+               #       however, according to the GCC manpage -march is kinda dangerous and the binaries
+               #       may not work on any CPU besides the one specified..
                add_compile_options(-march=pentium3)
        endif()
 
+       if(cpu STREQUAL "x86")
+               # use SSE and SSE2 on 32bit x86 CPUs (on 64bit x86 CPUs they are used by default)
+               # NOTE: All our Windows release binaries of the last years have been built with VS2015 
+               #   or VS2017 which enable SSE2 by default for 32bit x86 and no one ever complained, 
+               #   so it seems like demanding SSE2 is fine (it's supported by Athlon64 and newer
+               #   and Pentium4 and newer)
+               add_compile_options(-msse2)
+       endif()
+
        if(FORCE_COLORED_OUTPUT)
                if(CMAKE_COMPILER_IS_GNUCC)
                   add_compile_options (-fdiagnostics-color=always)
@@ -446,6 +459,12 @@ if(D3_COMPILER_IS_GCC_OR_CLANG)
                set(sys_libs ${sys_libs} ${EXECINFO_LIBRARIES})
        endif()
 elseif(MSVC)
+
+       # NOTE: At least since VS2012, /arch:SSE2 is the default when targeting x86
+       # see https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2012/7t5yh4fd(v=vs.110)
+       # so don't set it here - that would be problematic anyway, because we can't detect the
+       # target CPU architecture when using MSVC, and /arch:SSE2 is invalid for ARM or even x64
+
        add_compile_options(/MP) # parallel build (use all cores, or as many as configured in VS)
        
        add_compile_options(/W3) # TODO: was /W4, caused trouble with VS2019 (and/or its integrated CMake? or only HarrieVG's setup?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants