Metal (Apple) GPU back-end for Tracy #793

slomp · 2024-05-17T05:35:11Z

(I still need to update the manual, but I'm putting the code here for review to save some time).

The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan, Direct3D and OpenGL. Specifically, TracyMetalZone() must be placed around the site where a command encoder is created.

This is because not all hardware supports timestamps at command granularity; some can only provide timestamps around an entire command encoder. Placing zones at the encoder level accommodates all tiers of hardware; in the future, variants of TracyMetalZone() will be added to support the command-level granularity that is usual in Tracy's other GPU back-ends.

Metal also imposes a few restrictions that make the process of requesting and collecting queries more complicated in Tracy:

- timestamp query buffers are limited to 4096 queries (32KB, where each query is 8 bytes)
- when a timestamp query buffer is created, Metal initializes all timestamps with zeroes, and there's no way to reset them back to zero after timestamps get resolved; the only way to clear the timestamps is by allocating a new timestamp query buffer
- if a command encoder records no commands and its corresponding command buffer ends up committed to the command queue, Metal will "optimize away" the encoder along with any timestamp queries associated with it (the timestamp will remain as zero and will never get resolved)

Because of the limitations above, two timestamp buffers are managed internally. Once one of the buffers fills up with requests, the second buffer can start serving new requests.

Once all requests in a buffer get resolved and collected, the entire buffer is discarded and a new one allocated for future requests. (Proper cycling through a ring buffer would require bookkeeping and completion handlers to collect only the known complete queries.)
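The two-buffer scheme described above can be sketched roughly as follows. This is a minimal model in plain C++, not the code in TracyMetal.hmm: `TimestampBuffer`, `QueryAllocator`, and `Collect(index, count)` are hypothetical names, and resetting a struct stands in for allocating a fresh Metal counter sample buffer. The only figure taken from the description is the 4096-query limit.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical model of two fixed-size timestamp buffers; not Tracy's actual code.
constexpr uint32_t kQueriesPerBuffer = 4096;  // Metal limit cited above (32KB / 8 bytes)

struct TimestampBuffer {
    uint32_t requested = 0;  // queries handed out so far
    uint32_t collected = 0;  // queries resolved and forwarded to the profiler
    bool Full() const { return requested == kQueriesPerBuffer; }
    bool Drained() const { return Full() && collected == requested; }
};

class QueryAllocator {
public:
    // Returns a query id, or nothing when both buffers are saturated
    // (the situation in which collection needs to run more often).
    std::optional<uint64_t> NextQueryId() {
        if (m_buffers[m_active].Full()) {
            const uint32_t other = m_active ^ 1u;
            if (m_buffers[other].requested != 0) return std::nullopt;  // other buffer still pending
            m_active = other;  // the second buffer starts serving new requests
        }
        TimestampBuffer& b = m_buffers[m_active];
        return uint64_t(m_active) * kQueriesPerBuffer + b.requested++;
    }

    // Marks `count` queries of buffer `index` as collected. A full buffer whose
    // queries were all collected is discarded and replaced with a fresh one.
    void Collect(uint32_t index, uint32_t count) {
        TimestampBuffer& b = m_buffers[index];
        b.collected += count;
        if (b.Drained()) b = TimestampBuffer{};  // stands in for reallocating the Metal buffer
    }

private:
    std::array<TimestampBuffer, 2> m_buffers{};
    uint32_t m_active = 0;
};
```

In this model, requesting more than 8192 queries without any collection makes `NextQueryId()` return nothing, which corresponds to the "too many pending timestamp queries" condition discussed in the review.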

In the current implementation, there is potential for a race condition when the buffer is discarded and reallocated. In practice, the race condition will never materialize so long as TracyMetalCollect() is called frequently enough to keep the number of unresolved queries low.

Finally, there's a timeout mechanism during timestamp collection to detect "empty" command encoders and ensure progress.
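A rough sketch of how such a timeout check could look, under stated assumptions: `PendingQuery`, `Classify`, and `PatchTimedOut` are hypothetical names, the timeout value is left as a parameter, and `std::chrono::steady_clock` is used here purely for illustration. The `+ 5` offsets mirror the lines quoted from the diff in the review below; everything else is illustrative rather than the actual implementation.

```cpp
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Hypothetical per-query state; names are illustrative, not Tracy's.
struct PendingQuery {
    uint64_t gpuStart = 0;          // 0 means "not resolved yet" (Metal zero-initializes)
    uint64_t gpuEnd = 0;
    Clock::time_point requestTime;  // when the query pair was handed out
};

enum class QueryStatus { Resolved, StillInFlight, TimedOut };

// Decides whether to keep waiting for a query or to give up on it.
inline QueryStatus Classify(const PendingQuery& q, Clock::time_point now,
                            std::chrono::milliseconds timeout) {
    if (q.gpuStart != 0 && q.gpuEnd != 0) return QueryStatus::Resolved;
    if (now - q.requestTime < timeout) return QueryStatus::StillInFlight;
    return QueryStatus::TimedOut;  // likely an "empty" encoder that Metal dropped
}

// For a timed-out query, fabricate a tiny zone right after the most recent
// real timestamp so that later queries are not blocked behind it.
inline void PatchTimedOut(PendingQuery& q, uint64_t mostRecentTimestamp) {
    q.gpuStart = mostRecentTimestamp + 5;
    q.gpuEnd = q.gpuStart + 5;
}
```

A collection loop would stop at the first `StillInFlight` query, patch any `TimedOut` one, and carry on, which is what guarantees progress when an encoder never produces timestamps.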

slomp · 2024-05-17T05:37:22Z

@wolfpld I'd like to request reviews from @nosferalatu and @JamesMcCarthy44, but I can't seem to be able to add reviewers.

wolfpld · 2024-05-17T10:15:04Z

I don't know how assigning reviewers works on GitHub. Mentioning people should be enough to get their attention.

slomp · 2024-05-18T04:59:17Z

Also pinging @theblackunknown for a code review.

nosferalatu · 2024-07-25T00:51:45Z

public/tracy/TracyMetal.hmm

+#ifndef __TRACYMETAL_HMM__
+#define __TRACYMETAL_HMM__
+
+/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan,


Does this work only on Apple Silicon, or does it work on the older non-Apple GPUs as well? (I personally only care about Apple devices with Apple Silicon, but a clarifying comment might be helpful to others)

It should work on Intel-based Macs as well. I can't quite test it at the moment, though.

nosferalatu · 2024-07-25T00:53:36Z

public/tracy/TracyMetal.hmm

+/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan,
+ Direct3D and OpenGL. Specifically, TracyMetalZone() must be placed around the site where
+ a command encoder is created. This is because not all hardware supports timestamps at
+ command granularity, and can only provide timestamps around an entire command encoder.


If the TracyMetalZone() is placed at command granularity, what happens on hardware that doesn't support command granularity timestamps?

The comment might be revised to say something like "... must be placed around the site where a command encoder is created. This is .... . If running on hardware that doesn't support command granularity timestamps, then XXX happens."

When you call TracyMetalZone(), you need to pass the command encoder descriptor, so it's technically impossible to call it by passing a command encoder to it.
There will be a TracyMetalZone() interface in the future that takes the command encoder, once I have updated hardware here to test the other granularities.

nosferalatu · 2024-07-25T00:54:38Z

public/tracy/TracyMetal.hmm

+ discarded and reallocated. In practice, the race condition will never materialize so long
+ as TracyMetalCollect() is called frequently to keep the amount of unresolved queries low.
+ Finally, there's a timeout mechanism during timestamp collection to detect "empty" command
+ encoders and ensure progress.


Nice explanation! All three reasons (a, b, and c) are worth commenting on.

nosferalatu · 2024-07-25T00:58:31Z

public/tracy/TracyMetal.hmm

+ return sentinel,
+ "NextQueryId: FULL! too many pending timestamp queries. [%llu, %llu] (%u)",
+ m_previousCheckpoint.load(), id, count
+ );


I assume that when this happens, TracyMetalCollect() should be called more frequently? It might be useful to add a comment about that before the panic.

Yeah, good point, I'll clarify that further.
We can also consider adding more buffers to collect timestamps, but that will complicate the implementation, so I'll leave it as future work.

theblackunknown

Very nice work Marcos.
I have made a first pass of code review without testing so far.
The PR description and analysis are greatly appreciated!

Can you also clarify whether this code is supposed to be used with ObjC ARC or not?
I know that from codebase to codebase we can run into different expectations about this behavior.

theblackunknown · 2024-07-25T16:52:02Z

public/tracy/TracyMetal.hmm

+ } while(false);
+
+
+#define TRACY_METAL_DEBUG_MASK (0)


Is it intended that this is defined unconditionally?

Did you mean the following, so that it can be overridden on the client side?

Suggested change

-#define TRACY_METAL_DEBUG_MASK (0)
+#ifndef TRACY_METAL_DEBUG_MASK
+#define TRACY_METAL_DEBUG_MASK (0)
+#endif

Yes, at this initial stage, I'd like to have the verbose debugging log available when needed.
Should someone report issues, I can ask them to set this define prior to including the header.

I understand that you want people to be able to override the macro, but the right pattern for that is to wrap the #define in #ifndef ... #endif, which is currently missing. Without it, a client that defines the macro first gets a redefinition warning (and possibly an error, if warnings are treated as errors) in their codebase.

Ah, LOL, yeah, I did not realize this was a code change suggestion. You are absolutely right!
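For illustration, here is a minimal single-file sketch of the override pattern being discussed. The "header" part is simulated inline, and `EXAMPLE_DEBUG` and `CountEnabled` are hypothetical helpers, not the actual TracyMetalDebug macro; only the `TRACY_METAL_DEBUG_MASK` name and the `#ifndef` guard come from the thread.

```cpp
// Client side: set the mask before the (simulated) header contents below.
#define TRACY_METAL_DEBUG_MASK (1 << 0)

// --- what the header would contain, following the suggested pattern ---
#ifndef TRACY_METAL_DEBUG_MASK
#define TRACY_METAL_DEBUG_MASK (0)  // default: all debug output disabled
#endif

// Hypothetical helper: runs `stmt` only when `bit` is enabled in the mask.
#define EXAMPLE_DEBUG(bit, stmt) \
    do { if ((TRACY_METAL_DEBUG_MASK) & (bit)) { stmt; } } while (false)

// Counts how many of the two debug points below are active.
inline int CountEnabled() {
    int hits = 0;
    EXAMPLE_DEBUG(1 << 0, ++hits);  // enabled by the client's mask
    EXAMPLE_DEBUG(1 << 1, ++hits);  // disabled
    return hits;
}
```

With the guard in place the client's definition wins without a redefinition warning; without it, the second `#define` would clash with the first.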

theblackunknown · 2024-07-25T16:53:48Z

public/tracy/TracyMetal.hmm

+ TracyMetalDebug(1<<0, TracyMetalPanic(, "MTLCounterErrorValue = 0x%llx", MTLCounterErrorValue));
+ TracyMetalDebug(1<<0, TracyMetalPanic(, "MTLCounterDontSample = 0x%llx", MTLCounterDontSample));


Do you plan on keeping all those debug points?
Coming from the Tracy Vulkan back-end, they look unfamiliar to me.

Yes, at least at this initial stage. Should people start reporting issues, this can help me triage the problem.

theblackunknown · 2024-07-25T16:57:17Z

public/tracy/TracyMetal.hmm

+ if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtDrawBoundary])
+ {
+ TracyMetalPanic(, "WARNING: timestamp sampling at draw call boundary is not supported.");
+ }
+ if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtBlitBoundary])
+ {
+ TracyMetalPanic(, "WARNING: timestamp sampling at blit boundary is not supported.");
+ }
+ if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtDispatchBoundary])
+ {
+ TracyMetalPanic(, "WARNING: timestamp sampling at compute dispatch boundary is not supported.");
+ }
+ if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtTileDispatchBoundary])
+ {
+ TracyMetalPanic(, "WARNING: timestamp sampling at tile dispatch boundary is not supported.");
+ }


These logs are useful, but as you explained, TracyMetalZone only works around command encoders as of writing, so I am unsure how these warnings help clients.
It may be worth keeping them as TracyMetalDebug, though?

Yeah, good point, I'll do that.
In the future, you'd request the granularity expected, and if that fails, then a panic is issued.

theblackunknown · 2024-07-25T17:00:13Z

public/tracy/TracyMetal.hmm

+ TracyMetalDebug(1<<0, TracyMetalPanic(, "Calibration: CPU timestamp (Metal): %llu", cpuTimestamp));
+ TracyMetalDebug(1<<0, TracyMetalPanic(, "Calibration: GPU timestamp (Metal): %llu", gpuTimestamp));
+
+ cpuTimestamp = Profiler::GetTime();


Is it expected that you discard the CPU timestamp returned by the MTLDevice? Is this for consistency with other Tracy event messages?

That's what Tracy expects. The CPU timestamp reported when creating the GPU context must be the timestamp that Tracy understands. This is consistent with other backends as well.

theblackunknown · 2024-07-25T17:00:48Z

public/tracy/TracyMetal.hmm

+ //MemWrite(&item->gpuNewContext.flags, GpuContextCalibration);
+ MemWrite(&item->gpuNewContext.flags, GpuContextFlags(0));
+ MemWrite(&item->gpuNewContext.type, GpuContextType::Metal);
+ Profiler::QueueSerialFinish(); // TODO: DeferItem() for TRACY_ON_DEMAND


Please do address the TRACY_ON_DEMAND TODO; it is rather easy to add here.

theblackunknown · 2024-07-25T17:13:15Z

public/tracy/TracyMetal.hmm

+ t_start = m_mostRecentTimestamp + 5;
+ t_end = t_start + 5;


Does this mean you try to "patch" unresolved or not-yet-resolved timestamps?
Can't we defer their resolution instead?

theblackunknown · 2024-07-25T17:20:04Z

public/tracy/TracyMetal.hmm

+ SubmitZoneEndGpu(m_ctx, m_query.idx + 1);
+ }
+
+ TracyMetalZoneScopeWireTap;


What is this used for?

It's a debugging technique. An application can define the macro prior to including TracyMetal.hmm and run application-specific code at that point. Will probably go away once I am confident the back-end is working well.
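A minimal sketch of how such a hook can work, with the header side simulated in the same file. `WiretapLog` and `ExampleScope` are hypothetical stand-ins; only the `TracyMetalZoneScopeWireTap` macro name comes from the diff, and its exact placement in the real zone scope is an assumption here.

```cpp
#include <vector>

// Application side: define the hook before including the header.
// WiretapLog() is a hypothetical application-owned log.
inline std::vector<int>& WiretapLog() {
    static std::vector<int> log;
    return log;
}
#define TracyMetalZoneScopeWireTap WiretapLog().push_back(__LINE__)

// --- simulated header side ---
#ifndef TracyMetalZoneScopeWireTap
#define TracyMetalZoneScopeWireTap  // expands to nothing by default
#endif

struct ExampleScope {
    ~ExampleScope() {
        // ... end-of-zone work would go here ...
        TracyMetalZoneScopeWireTap;  // application-specific code runs at this point
    }
};
```

When the application does not define the macro, the default expands to an empty statement, so the hook costs nothing in normal builds.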

theblackunknown · 2024-07-25T17:21:03Z

public/tracy/TracyMetal.hmm

+ const bool m_active;
+
+ MetalCtx* m_ctx;
+ id<MTLComputeCommandEncoder> m_cmdEncoder;


This is currently unused because the above code is within #if 0 ... #endif

Yes, and I'll be working on that in a subsequent PR. I don't have hardware that supports command granularity right now.

theblackunknown · 2024-07-25T17:22:28Z

public/tracy/TracyMetal.hmm

+ MemWrite( &item->gpuZoneEnd.context, ctx->GetContextId() );
+ Profiler::QueueSerialFinish();
+
+ TracyMetalDebug(1<<2, TracyAllocN((void*)(uintptr_t)queryId, 1, "TracyMetalGpuZone"));


Aren't those supposed to come in TracyAllocN/TracyFreeN pairs?

theblackunknown · 2024-07-25T17:23:14Z

public/tracy/TracyMetal.hmm

+private:
+ const bool m_active;
+
+ MetalCtx* m_ctx;


Alternatively, you could just store the context ID.

slomp · 2024-05-17T17:15:22Z

public/tracy/TracyMetal.hmm

+ {
+ auto checkTime = std::chrono::high_resolution_clock::now();
+ auto requestTime = m_timestampRequestTime[k];
+ auto ms_in_flight = std::chrono::duration<float>(checkTime-requestTime).count()*1000.0f;


@wolfpld I want to remove uses of std::chrono and use what's available in Tracy already, that is, Profiler::GetTime(). I may be missing something obvious here, but how do you convert the difference between two Profiler::GetTime() samples to, say, seconds?

Discord: https://discord.com/channels/585214693895962624/585214693895962630/1242530042366394409

slomp · 2024-07-25T16:40:50Z

public/tracy/TracyMetal.hmm

+#ifndef __TRACYMETAL_HMM__
+#define __TRACYMETAL_HMM__
+
+/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan,


It should work on Intel-based Macs as well. I can't quite test it at the moment, though.

slomp · 2024-07-25T16:42:54Z

public/tracy/TracyMetal.hmm

+/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan,
+ Direct3D and OpenGL. Specifically, TracyMetalZone() must be placed around the site where
+ a command encoder is created. This is because not all hardware supports timestamps at
+ command granularity, and can only provide timestamps around an entire command encoder.


When you call TracyMetalZone(), you need to pass the command encoder descriptor, so it's technically impossible to call it by passing a command encoder to it.
There will be a TracyMetalZone() interface in the future that takes the command encoder, once I have updated hardware here to test the other granularities.

slomp · 2024-07-25T16:44:09Z

public/tracy/TracyMetal.hmm

+ return sentinel,
+ "NextQueryId: FULL! too many pending timestamp queries. [%llu, %llu] (%u)",
+ m_previousCheckpoint.load(), id, count
+ );


Yeah, good point, I'll clarify that further.
We can also consider adding more buffers to collect timestamps, but that will complicate the implementation, so I'll leave it as future work.

slomp · 2024-07-25T18:19:39Z

public/tracy/TracyMetal.hmm

+ } while(false);
+
+
+#define TRACY_METAL_DEBUG_MASK (0)


Yes, at this initial stage, I'd like to have the verbose debugging log available when needed.
Should someone report issues, I can ask them to set this define prior to including the header.

slomp · 2024-07-25T18:28:25Z

public/tracy/TracyMetal.hmm

+ ZoneValue(begin);
+ ZoneValue(latestCheckpoint);


Yup; will try to move them to the TracyMetalDebug macro.

slomp · 2024-07-25T18:29:35Z

public/tracy/TracyMetal.hmm

+ //uintptr_t nextCheckpoint = m_queryCounter.load();
+ //if (nextCheckpoint != latestCheckpoint)
+ //{
+ // // TODO: signal event / fence now?
+ //}


I have some ideas for signaling in Metal, but I'd rather experiment with that in a future PR.
I'm leaving the comment with TODO there as a reminder.

slomp · 2024-07-25T18:32:06Z

public/tracy/TracyMetal.hmm

+ if (ms_in_flight < timeout_ms)
+ break;


Empirical choice based on the applications I tested.
This is reasonable for any interactive, game-like loop.
(I'll add a macro for that)

slomp · 2024-07-25T18:34:28Z

public/tracy/TracyMetal.hmm

+ SubmitZoneEndGpu(m_ctx, m_query.idx + 1);
+ }
+
+ TracyMetalZoneScopeWireTap;


It's a debugging technique. An application can define the macro prior to including TracyMetal.hmm and run application-specific code at that point. Will probably go away once I am confident the back-end is working well.

slomp · 2024-07-25T18:35:10Z

public/tracy/TracyMetal.hmm

+ const bool m_active;
+
+ MetalCtx* m_ctx;
+ id<MTLComputeCommandEncoder> m_cmdEncoder;


Yes, and I'll be working on that in a subsequent PR. I don't have hardware that supports command granularity right now.

Marcos Slomp added 22 commits May 10, 2024 10:36

Metal back-end WIP

c60311a

basing metal zone scopes on MTLComputePassDescriptor

5034e77

debugging timestamps...

96962fd

fixing cpu timestamp baseline

d69351b

giving up on calibration, for now

a327415

fixing timestamp mapping range

cb78c0e

stale comments

b263ce2

collecting/resolving timestamps in pairs

d123582

adding blit pass and render pass interfaces

4340e91

more debugging

cdff459

fixing collect wrap-around

45fa57d

debugging

7b18e0f

bugfixes

3c94014

improved panic macro (supports print args)

93683af

blargh

ab7f687

blarg again...

1e85d6b

blarg3

b3bd032

cleanup

5a1fe26

more cleanup

1e8c9b9

adding wiretap for debugging purposes

223a1c1

Collect pending timestamps during shutdown

beb9d53

comments about the decisions and behavior of the Metal back-end

62082e0

cleanup and comments

ea3610b

nosferalatu approved these changes Jul 25, 2024

theblackunknown reviewed Jul 25, 2024

slomp commented Jul 25, 2024
