-
-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metal (Apple) GPU back-end for Tracy #793
base: master
Are you sure you want to change the base?
Conversation
@wolfpld I'd like to request reviews from @nosferalatu and @JamesMcCarthy44, but I can't seem to be able to add reviewers. |
I don't know how assigning reviewers work on Github. Mentioning people should be enough to get their attention. |
Also pinging @theblackunknown for a code review. |
#ifndef __TRACYMETAL_HMM__ | ||
#define __TRACYMETAL_HMM__ | ||
|
||
/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work only on Apple Silicon, or does it work on the older non-Apple GPUs as well? (I personally only care about Apple devices with Apple Silicon, but a clarifying comment might be helpful to others)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should work on Intel-based Macs as well. I cant quite test it at the moment, though.
/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan, | ||
Direct3D and OpenGL. Specifically, TracyMetalZone() must be placed around the site where | ||
a command encoder is created. This is because not all hardware supports timestamps at | ||
command granularity, and can only provide timestamps around an entire command encoder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the TracyMetalZone() is placed at command granularity, what happens on hardware that doesn't support command granularity timestamps?
The comment might be revised to say something like "... must be placed around the site where a command encoder is created. This is .... . If running on hardware that doesn't support command granularity timestamps, then XXX happens."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you call TracyMetalZone()
, you need to pass the command encoder descriptor, so it's technically impossible to call it by passing a command encoder to it.
There will be a TracyMetalZone()
interface in the future that takes the command encoder, once I have updated hardware here to test the other granularities.
discarded and reallocated. In practice, the race condition will never materialize so long | ||
as TracyMetalCollect() is called frequently to keep the amount of unresolved queries low. | ||
Finally, there's a timeout mechanism during timestamp collection to detect "empty" command | ||
encoders and ensure progress. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice explanation! All three reasons a,b, and c are all worth commenting about.
return sentinel, | ||
"NextQueryId: FULL! too many pending timestamp queries. [%llu, %llu] (%u)", | ||
m_previousCheckpoint.load(), id, count | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that when this happens, TracyMetalCollect() should be called more frequently? It might be useful to add a comment about that before the panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point, I'll clarify that further.
We can also consider adding more buffers to collect timestamps, but that will complicate the implementation, so I'll leave it as future work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work Marcos.
I have made a first pass of code review without testing so far.
The PR description and analysis is greatly appreciated !
Can you also clarify if this code is supposed to be use with ObjC ARC or not ?
I know that codebase from codebase we can run into different expectations with this behavior.
} while(false); | ||
|
||
|
||
#define TRACY_METAL_DEBUG_MASK (0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this expected to unconditionally define this ?
did you meant the following so that it can be overridden on the client side ?
#define TRACY_METAL_DEBUG_MASK (0) | |
#ifndef TRACY_METAL_DEBUG_MASK | |
#define TRACY_METAL_DEBUG_MASK (0) | |
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, at this initial stage, I'd like to have the verbose debugging log available when needed.
Should someone report issues, I can ask them to set this define prior to including the header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is understood that you want people to possibly override the macro, but the right coding pattern to do so is to wrap the #define
with #ifndef ... #endif
which is missing as of writing to avoid compiler warnings (and thus maybe errors) in client codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, LOL, yeah, I did not realize this was a code change suggestion. You are absolutely right!
TracyMetalDebug(1<<0, TracyMetalPanic(, "MTLCounterErrorValue = 0x%llx", MTLCounterErrorValue)); | ||
TracyMetalDebug(1<<0, TracyMetalPanic(, "MTLCounterDontSample = 0x%llx", MTLCounterDontSample)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan on keeping all those debug points ?
Coming from the Tracy Vulkan background they look unfamiliar to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, at least at this initial stage. Should people start reporting issues, this can help me triage the problem.
if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtDrawBoundary]) | ||
{ | ||
TracyMetalPanic(, "WARNING: timestamp sampling at draw call boundary is not supported."); | ||
} | ||
if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtBlitBoundary]) | ||
{ | ||
TracyMetalPanic(, "WARNING: timestamp sampling at blit boundary is not supported."); | ||
} | ||
if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtDispatchBoundary]) | ||
{ | ||
TracyMetalPanic(, "WARNING: timestamp sampling at compute dispatch boundary is not supported."); | ||
} | ||
if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtTileDispatchBoundary]) | ||
{ | ||
TracyMetalPanic(, "WARNING: timestamp sampling at tile dispatch boundary is not supported."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is useful to have those logs but as you have explained TracyMetalZone
only works around command encoder as of writing, so I am unsure how those warnings are useful to clients.
But it may be worth it to keep them as TracyMetalDebug
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point, I'll do that.
In the future, you'd request the granularity expected, and if that fails, then a panic is issued.
TracyMetalDebug(1<<0, TracyMetalPanic(, "Calibration: CPU timestamp (Metal): %llu", cpuTimestamp)); | ||
TracyMetalDebug(1<<0, TracyMetalPanic(, "Calibration: GPU timestamp (Metal): %llu", gpuTimestamp)); | ||
|
||
cpuTimestamp = Profiler::GetTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this expected that you ditch the CPU timestamp returned by the MTLDevice ? Is this for consistency with other Tracy event messages ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what Tracy expects. The CPU timestamp reported when creating the GPU context must be the timestamp that Tracy understands. This is consistent with other backends as well.
//MemWrite(&item->gpuNewContext.flags, GpuContextCalibration); | ||
MemWrite(&item->gpuNewContext.flags, GpuContextFlags(0)); | ||
MemWrite(&item->gpuNewContext.type, GpuContextType::Metal); | ||
Profiler::QueueSerialFinish(); // TODO: DeferItem() for TRACY_ON_DEMAND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do address the TRACY_ON_DEMAND
it is rather easy to add here.
t_start = m_mostRecentTimestamp + 5; | ||
t_end = t_start + 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this means you try tp "patch" unresolved or not yet resolved timestamps ?
Can't we defer their resolution instead ?
SubmitZoneEndGpu(m_ctx, m_query.idx + 1); | ||
} | ||
|
||
TracyMetalZoneScopeWireTap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the use of this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a debugging technique. An application can define the macro prior to including TracyMetal.hmm and run application-specific code at that point. Will probably go away once I am confident the back-end is working well.
const bool m_active; | ||
|
||
MetalCtx* m_ctx; | ||
id<MTLComputeCommandEncoder> m_cmdEncoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently unused because the above code is within #if 0 ... #endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and I'll be working on that in a subsequent PR. I don't have hardware that supports command granularity right now.
MemWrite( &item->gpuZoneEnd.context, ctx->GetContextId() ); | ||
Profiler::QueueSerialFinish(); | ||
|
||
TracyMetalDebug(1<<2, TracyAllocN((void*)(uintptr_t)queryId, 1, "TracyMetalGpuZone")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are those not supposed to be pair of TracyAllocN
/TracyFreeN
?
private: | ||
const bool m_active; | ||
|
||
MetalCtx* m_ctx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively you could just store the context ID
{ | ||
auto checkTime = std::chrono::high_resolution_clock::now(); | ||
auto requestTime = m_timestampRequestTime[k]; | ||
auto ms_in_flight = std::chrono::duration<float>(checkTime-requestTime).count()*1000.0f; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wolfpld I want to remove uses of std::chrono
and use what's available in Tracy already, that is, Profiler::GetTime()
. I may be missing something obvious here, but how do you convert a time difference between two Profiler::GetTime()
samples and convert it to, say, seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#ifndef __TRACYMETAL_HMM__ | ||
#define __TRACYMETAL_HMM__ | ||
|
||
/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should work on Intel-based Macs as well. I cant quite test it at the moment, though.
/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan, | ||
Direct3D and OpenGL. Specifically, TracyMetalZone() must be placed around the site where | ||
a command encoder is created. This is because not all hardware supports timestamps at | ||
command granularity, and can only provide timestamps around an entire command encoder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you call TracyMetalZone()
, you need to pass the command encoder descriptor, so it's technically impossible to call it by passing a command encoder to it.
There will be a TracyMetalZone()
interface in the future that takes the command encoder, once I have updated hardware here to test the other granularities.
return sentinel, | ||
"NextQueryId: FULL! too many pending timestamp queries. [%llu, %llu] (%u)", | ||
m_previousCheckpoint.load(), id, count | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point, I'll clarify that further.
We can also consider adding more buffers to collect timestamps, but that will complicate the implementation, so I'll leave it as future work.
} while(false); | ||
|
||
|
||
#define TRACY_METAL_DEBUG_MASK (0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, at this initial stage, I'd like to have the verbose debugging log available when needed.
Should someone report issues, I can ask them to set this define prior to including the header.
ZoneValue(begin); | ||
ZoneValue(latestCheckpoint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup; will try to move them to TracyMetalDebug
macro.
//uintptr_t nextCheckpoint = m_queryCounter.load(); | ||
//if (nextCheckpoint != latestCheckpoint) | ||
//{ | ||
// // TODO: signal event / fence now? | ||
//} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some ideas for signaling in Metal, but I'd rather experiment with that in a future PR.
I'm leaving the comment with TODO there as a reminder.
if (ms_in_flight < timeout_ms) | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empirical choice based on the applications I tested.
This is reasonable for any interactive, game-like loop.
(I'll add a macro for that)
SubmitZoneEndGpu(m_ctx, m_query.idx + 1); | ||
} | ||
|
||
TracyMetalZoneScopeWireTap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a debugging technique. An application can define the macro prior to including TracyMetal.hmm and run application-specific code at that point. Will probably go away once I am confident the back-end is working well.
const bool m_active; | ||
|
||
MetalCtx* m_ctx; | ||
id<MTLComputeCommandEncoder> m_cmdEncoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and I'll be working on that in a subsequent PR. I don't have hardware that supports command granularity right now.
(I still need to update the manual, but I'm putting the code here for review to save some time).
The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan, Direct3D and OpenGL. Specifically,
TracyMetalZone()
must be placed around the site where a command encoder is created.This is because not all hardware supports timestamps at command granularity, and can only provide timestamps around an entire command encoder. This accommodates for all tiers of hardware; in the future, variants of
TracyMetalZone()
will be added to support the habitual command-level granularity of Tracy GPU back-ends.Metal also imposes a few restrictions that make the process of requesting and collecting queries more complicated in Tracy:
Because of the limitations above, two timestamp buffers are managed internally. Once one of the buffers fills up with requests, the second buffer can start serving new requests.
Once all requests in a buffer get resolved and collected, the entire buffer is discarded and a new one allocated for future requests. (Proper cycling through a ring buffer would require bookkeeping and completion handlers to collect only the known complete queries.)
In the current implementation, there is potential for a race condition when the buffer is discarded and reallocated. In practice, the race condition will never materialize so long as TracyMetalCollect() is called frequently to keep the amount of unresolved queries low.
Finally, there's a timeout mechanism during timestamp collection to detect "empty" command encoders and ensure progress.