Add option to change SVE vector length for current and children processes #101295

Merged: 18 commits, merged Jun 7, 2024

Changes from 2 commits
34 changes: 30 additions & 4 deletions src/coreclr/vm/codeman.cpp
@@ -4,7 +4,10 @@
//
// codeman.cpp - a management class for handling multiple code managers
//

#if defined(TARGET_LINUX)
#include <sys/prctl.h>
#include <sys/syscall.h>
#endif // defined(TARGET_LINUX)
#include "common.h"
#include "jitinterface.h"
#include "corjit.h"
@@ -42,6 +45,13 @@
#include "perfmap.h"
#endif

#if defined(TARGET_LINUX) && !defined(PR_SVE_GET_VL)
#define PR_SVE_SET_VL 50 /* set task vector length */
#define PR_SVE_GET_VL 51 /* get task vector length */
#define PR_SVE_VL_LEN_MASK 0xffff
#define PR_SVE_VL_INHERIT (1 << 17) /* inherit across exec */
#endif // defined(TARGET_LINUX) && !defined(PR_SVE_GET_VL)

// Default number of jump stubs in a jump stub block
#define DEFAULT_JUMPSTUBS_PER_BLOCK 32

@@ -1257,6 +1267,9 @@ void EEJitManager::SetCpuInfo()

int cpuFeatures = minipal_getcpufeatures();

// Get the maximum bitwidth of Vector<T>, rounding down to the nearest multiple of 128-bits
uint32_t maxVectorTBitWidth = (CLRConfig::GetConfigValue(CLRConfig::EXTERNAL_MaxVectorTBitWidth) / 128) * 128;

#if defined(TARGET_X86) || defined(TARGET_AMD64)

#if defined(TARGET_X86) && !defined(TARGET_WINDOWS)
@@ -1271,9 +1284,6 @@

CPUCompileFlags.Set(InstructionSet_VectorT128);

// Get the maximum bitwidth of Vector<T>, rounding down to the nearest multiple of 128-bits
uint32_t maxVectorTBitWidth = (CLRConfig::GetConfigValue(CLRConfig::EXTERNAL_MaxVectorTBitWidth) / 128) * 128;

if (((cpuFeatures & XArchIntrinsicConstants_VectorT256) != 0) && ((maxVectorTBitWidth == 0) || (maxVectorTBitWidth >= 256)))
{
// We allow 256-bit Vector<T> by default
@@ -1525,6 +1535,22 @@

if (((cpuFeatures & ARM64IntrinsicConstants_Sve) != 0) && CLRConfig::GetConfigValue(CLRConfig::EXTERNAL_EnableArm64Sve))
{
#if defined(TARGET_LINUX)
// prctl() expects vector length in bytes.
int maxVectorLength = (maxVectorTBitWidth >> 3);

// Limit the SVE vector length to 'maxVectorLength' if the underlying hardware offers longer vectors.
if ((prctl(PR_SVE_GET_VL, 0,0,0,0) & PR_SVE_VL_LEN_MASK) > maxVectorLength)
Member:

This will set the SVE vector length for the current thread only. It won't set it for threads that have already been started. Is that correct?

How does it work for new threads? Do new threads inherit the SVE vector length of the parent thread, or do they inherit the SVE vector length of the process?

Are we sure that none of the libraries that have been initialized by this point have cached the vector length?

https://www.man7.org/linux/man-pages/man2/prctl.2.html has an explicit warning to use this API only if you know what you are doing.

It seems that allowing the SVE vector length to be set only before process start is the only reliable option. Setting it here may break all sorts of random things. I am not convinced that we know what we are doing by setting it here.

Member:

Do you mean in the case of something like the C runtime or in the case of something else hosting the CLR?

Are you thinking the only valid thing for us to do here is to fail to launch for an SVE mismatch (AOT) and to just disable SVE usage otherwise (JIT)?

@jkotas (Member), Apr 24, 2024:

Right, this is obviously not compatible with hosting. (We have no reliable way to tell whether we are hosted.)

Even without hosting and external libraries outside our control in the picture, there are a number of our own threads created in the process by this point (PAL, EventPipe) that will have the wrong size configured. It is hard to guarantee that none of these threads is going to wander into managed code.

> Are you thinking the only valid thing for us to do here is to fail to launch for an SVE mismatch (AOT)

I do not see a better option. It suggests that the design we are working with is questionable, since it does not work well for AOT.

> just disable SVE usage otherwise (JIT)

Yes, if the JIT is not able to accommodate the SVE length that the process was started with.

Contributor:

> This will set the SVE vector length for the current thread only. It won't set it for threads that have already been started. Is that correct?

Correct. There shouldn't be any other threads running at this point? (It's possible this code is called much later than I expected, or there are hosting scenarios I'm not familiar with.)

> How does it work for new threads? Do new threads inherit the SVE vector length of the parent thread, or do they inherit the SVE vector length of the process?

It will use the parent thread's vector length.

> All other SVE state of a thread, including the currently configured vector length, the state of the PR_SVE_VL_INHERIT flag, and the deferred vector length (if any), is preserved across all syscalls, subject to the specific exceptions for execve() described in section 6.
> In particular, on return from a fork() or clone(), the parent and new child process or thread share identical SVE configuration, matching that of the parent before the call.

(from the kernel's sve.rst)

> Are we sure that none of the libraries that have been initialized by this point have cached the vector length?

Assuming this is happening early in the process, there should be no use of SVE at all yet. The vector length can easily be read using an instruction (e.g. CNT), but we can't guarantee what a library might do.
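For reference (not part of the PR), a minimal sketch of both ways to read the current vector length mentioned here, assuming a Linux/AArch64 host with SVE and a compiler targeting SVE; the helper names are illustrative, and the prctl result packs flags together with the length, so it is masked with PR_SVE_VL_LEN_MASK:

#include <sys/prctl.h>   // PR_SVE_GET_VL, PR_SVE_VL_LEN_MASK (Linux 4.15+)
#include <cstdint>

// OS route: ask the kernel for the calling thread's SVE vector length.
// The return value packs flags (e.g. PR_SVE_VL_INHERIT) with the length in bytes.
static int GetSveVectorLengthBytesViaPrctl()
{
    int ret = prctl(PR_SVE_GET_VL, 0, 0, 0, 0);
    return (ret == -1) ? -1 : (ret & PR_SVE_VL_LEN_MASK);
}

// Instruction route: CNTB yields the vector length in bytes directly, with no
// OS dependency; requires compiling with SVE enabled (e.g. -march=armv8-a+sve).
static uint64_t GetSveVectorLengthBytesViaCntb()
{
    uint64_t bytes;
    __asm__ volatile("cntb %0" : "=r"(bytes));
    return bytes;
}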

> https://www.man7.org/linux/man-pages/man2/prctl.2.html has an explicit warning to use this API only if you know what you are doing.
>
> It seems that allowing the SVE vector length to be set only before process start is the only reliable option. Setting it here may break all sorts of random things. I am not convinced that we know what we are doing by setting it here.

If this is only used for debugging/testing and never used in production, then I lean towards it being fine. Anything more, then maybe not.

> Are you thinking the only valid thing for us to do here is to fail to launch for an SVE mismatch (AOT) and to just disable SVE usage otherwise (JIT)?

Or should DOTNET_MaxVectorTBitWidth be X86 only?

Member:

> If this is only used for debugging/testing and never used in production, then I lean towards it being fine. Anything more, then maybe not.

In general these switches are primarily there for debugging/testing purposes. However, they also generally exist as a way to disable or limit intrinsic support if a library is found to have a blocking bug.

It's not great that we can't help set up the process to achieve success, but it's also not the end of the world and is something we can ideally give users guidance around.

> Or should DOTNET_MaxVectorTBitWidth be X86 only?

I think it's fine for us to respect it still; that's functionally what AOT code compiled for a particular SVE size would have to do, after all. It's just a different way to disable SVE support.

Member:

The documented switches have to be reliable. DOTNET_MaxVectorTBitWidth is a documented switch.

It does not sound like this switch can be reliable. That means it should have a different name, and ideally be a debug-only switch. We do not want to be dealing with inscrutable crashes caused by different parts of the process being configured for different vector sizes.

@tannergooding (Member), Apr 24, 2024:

> It does not sound like this switch can be reliable.

It's still reliable and working as documented, even with this change in direction. The switch was intentionally named MaxVectorTBitWidth because such complications could exist. All that's changed is that instead of us setting sizeof(Vector<T>) based on min(SveLength, DOTNET_MaxVectorTBitWidth), we simply set it based on (SveLength > DOTNET_MaxVectorTBitWidth) ? 16 : SveLength.

So, this minor change in direction is really no different than us limiting the maximum bit width to 256 by default on x64 hardware, or not considering 512 bits on certain first-gen AVX512 hardware unless users also opt into a hidden undocumented switch.

Which is to say, it still simply represents the maximum size a user wants to support (defaulting to 0 which means the system can decide). It can be smaller if the system doesn't support the size specified.
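For example (taking the 16 in that expression to be a byte count, i.e. 128 bits): on 256-bit SVE hardware, DOTNET_MaxVectorTBitWidth=512 would leave Vector<T> at the hardware's 256 bits, while DOTNET_MaxVectorTBitWidth=128 would drop Vector<T> to 128 bits rather than shrinking the hardware vector length via prctl.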

Member:

As proposed in this PR, it does more than just set the sizeof(Vector<T>). The PR_SVE_SET_VL call makes it unreliable.

It impacts the global state of the thread and process in a way that may be incompatible with other components in the process. That is what makes it unreliable. It is guaranteed to be broken for CoreCLR hosting scenarios, and it may have issues without hosting too, based on the documentation. It is very hard to audit what is loaded in the process.

I agree that it would be ok if the switch set sizeof(Vector<T>) only, without calling PR_SVE_SET_VL.

Member:

Right. I should have clarified: I meant that, given your input that we shouldn't take this PR's approach of calling prctl because it's unreliable, the alternative where we simply don't use SVE if it's larger than DOTNET_MaxVectorTBitWidth is fine and still in line with the currently documented behavior for that config switch.

If we provided anything around prctl (and it sounds like we're leaning towards no), it would need to be a separate undocumented switch, potentially debug only. -- I don't think we need to add that given our current testing needs and the known sizes (128 and 256) we'll want to support for existing SVE-capable hardware (both consumer and server/cloud).

Member:

Thanks @tannergooding for clarifying some of these things offline; it eventually boils down to:

int VectorTLength = 0;
int SystemVectorTLength = Get_System_VL();

if (DOTNET_MaxVectorTBitWidth == 0)
{
   // If we fix getVectorTByteLength() to return system length - then use system provided length
   // Otherwise use 128
   VectorTLength = SystemVectorTLength;
}
else if (DOTNET_MaxVectorTBitWidth >= SystemVectorTLength)
{ 
   // For a 256-bit machine, if user provides DOTNET_MaxVectorTBitWidth=512, we will use the
   // maximum available length of 256-bits
   VectorTLength = SystemVectorTLength;
}
else if (DOTNET_MaxVectorTBitWidth < SystemVectorTLength)
{
   // For a 256-bit machine, if user provides DOTNET_MaxVectorTBitWidth=128, we do not want
   // to update it using syscall because that has implications on already initialized components
   // as it might have stale vector length. In that case, disable SVE.
   //
   // If user really wants to downgrade the size, they can call `prctl` or windows `CreateProcess()`
   // to limit the size before launching the dotnet process (see the launcher sketch after this block)
   VectorTLength = 128;
   DisableSve();
}
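
For illustration only (not part of the PR): a minimal launcher sketch, assuming Linux/AArch64 with SVE, of how the vector length could be limited before the dotnet process starts; the 32-byte length and the plain "dotnet" path are placeholders:

#include <sys/prctl.h>
#include <unistd.h>
#include <cstdio>

// Shrink the SVE vector length to 32 bytes (256 bits) for this thread and,
// via PR_SVE_VL_INHERIT, for everything exec'd from it, then launch dotnet.
int main(int argc, char** argv)
{
    (void)argc;
    if (prctl(PR_SVE_SET_VL, 32 | PR_SVE_VL_INHERIT, 0, 0, 0) == -1)
    {
        perror("PR_SVE_SET_VL");
        return 1;
    }

    execvp("dotnet", argv); // placeholder: forwards this launcher's arguments to dotnet
    perror("execvp");       // only reached if exec fails
    return 1;
}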

To implement Get_System_VL(), we could use prctl with PR_SVE_GET_VL, but it would be better to use CNTB or an equivalent instruction because that way it will be OS agnostic. We will need that anyway for getVectorTByteLength() when DOTNET_MaxVectorTBitWidth is not set.

With that said, there is no need to introduce a different environment variable that is DEBUG only to support the downgrade scenario.
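
And for concreteness (again, not part of the PR), a small compilable sketch of the selection logic above; SelectVectorTBitWidth and the enableSve out-parameter are hypothetical stand-ins for getVectorTByteLength()/DisableSve(), and the system vector length is passed in (e.g. from CNTB, as suggested):

#include <cstdint>

// maxVectorTBitWidth: DOTNET_MaxVectorTBitWidth (0 = unset).
// systemVectorBits:   hardware SVE vector length in bits (e.g. CNTB result * 8).
// enableSve:          cleared when the requested width is below the hardware width,
//                     since shrinking the process's vector length via a syscall is off the table.
static uint32_t SelectVectorTBitWidth(uint32_t maxVectorTBitWidth,
                                      uint32_t systemVectorBits,
                                      bool* enableSve)
{
    *enableSve = true;

    // Unset, or at least as large as the hardware offers: use the hardware length.
    if (maxVectorTBitWidth == 0 || maxVectorTBitWidth >= systemVectorBits)
    {
        return systemVectorBits;
    }

    // Smaller than the hardware length: do not call PR_SVE_SET_VL;
    // disable SVE and fall back to 128-bit Vector<T>.
    *enableSve = false;
    return 128;
}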

{
if (prctl(PR_SVE_SET_VL, (maxVectorLength | PR_SVE_VL_INHERIT), 0, 0, 0) == -1)
{
LogErrorToHost("LoadAndInitializeJIT: prctl() FAILED - unable to set maxVectorLength to %d", maxVectorLength);
}
}
#elif defined(TARGET_WINDOWS)
// TODO-SVE: Add prctl() equivalent for windows.
#endif

CPUCompileFlags.Set(InstructionSet_Sve);
}
