Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metal (Apple) GPU back-end for Tracy #793

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

slomp
Copy link
Contributor

@slomp slomp commented May 17, 2024

(I still need to update the manual, but I'm putting the code here for review to save some time).

The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan, Direct3D and OpenGL. Specifically, TracyMetalZone() must be placed around the site where a command encoder is created.

This is because not all hardware supports timestamps at command granularity, and can only provide timestamps around an entire command encoder. This accommodates for all tiers of hardware; in the future, variants of TracyMetalZone() will be added to support the habitual command-level granularity of Tracy GPU back-ends.

Metal also imposes a few restrictions that make the process of requesting and collecting queries more complicated in Tracy:

  • timestamp query buffers are limited to 4096 queries (32KB, where each query is 8 bytes)
  • when a timestamp query buffer is created, Metal initializes all timestamps with zeroes, and there's no way to reset them back to zero after timestamps get resolved; the only way to clear the timestamps is by allocating a new timestamp query buffer
  • if a command encoder records no commands and its corresponding command buffer ends up committed to the command queue, Metal will "optimize-away" the encoder along with any timestamp queries associated with it (the timestamp will remain as zero and will never get resolved)

Because of the limitations above, two timestamp buffers are managed internally. Once one of the buffers fills up with requests, the second buffer can start serving new requests.

Once all requests in a buffer get resolved and collected, the entire buffer is discarded and a new one allocated for future requests. (Proper cycling through a ring buffer would require bookkeeping and completion handlers to collect only the known complete queries.)

In the current implementation, there is potential for a race condition when the buffer is discarded and reallocated. In practice, the race condition will never materialize so long as TracyMetalCollect() is called frequently to keep the amount of unresolved queries low.

Finally, there's a timeout mechanism during timestamp collection to detect "empty" command encoders and ensure progress.

@slomp
Copy link
Contributor Author

slomp commented May 17, 2024

@wolfpld I'd like to request reviews from @nosferalatu and @JamesMcCarthy44, but I can't seem to be able to add reviewers.

@wolfpld
Copy link
Owner

wolfpld commented May 17, 2024

I don't know how assigning reviewers work on Github. Mentioning people should be enough to get their attention.

@slomp
Copy link
Contributor Author

slomp commented May 18, 2024

Also pinging @theblackunknown for a code review.

#ifndef __TRACYMETAL_HMM__
#define __TRACYMETAL_HMM__

/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work only on Apple Silicon, or does it work on the older non-Apple GPUs as well? (I personally only care about Apple devices with Apple Silicon, but a clarifying comment might be helpful to others)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should work on Intel-based Macs as well. I cant quite test it at the moment, though.

/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan,
Direct3D and OpenGL. Specifically, TracyMetalZone() must be placed around the site where
a command encoder is created. This is because not all hardware supports timestamps at
command granularity, and can only provide timestamps around an entire command encoder.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the TracyMetalZone() is placed at command granularity, what happens on hardware that doesn't support command granularity timestamps?

The comment might be revised to say something like "... must be placed around the site where a command encoder is created. This is .... . If running on hardware that doesn't support command granularity timestamps, then XXX happens."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you call TracyMetalZone(), you need to pass the command encoder descriptor, so it's technically impossible to call it by passing a command encoder to it.
There will be a TracyMetalZone() interface in the future that takes the command encoder, once I have updated hardware here to test the other granularities.

discarded and reallocated. In practice, the race condition will never materialize so long
as TracyMetalCollect() is called frequently to keep the amount of unresolved queries low.
Finally, there's a timeout mechanism during timestamp collection to detect "empty" command
encoders and ensure progress.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice explanation! All three reasons a,b, and c are all worth commenting about.

return sentinel,
"NextQueryId: FULL! too many pending timestamp queries. [%llu, %llu] (%u)",
m_previousCheckpoint.load(), id, count
);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that when this happens, TracyMetalCollect() should be called more frequently? It might be useful to add a comment about that before the panic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point, I'll clarify that further.
We can also consider adding more buffers to collect timestamps, but that will complicate the implementation, so I'll leave it as future work.

Copy link
Contributor

@theblackunknown theblackunknown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work Marcos.
I have made a first pass of code review without testing so far.
The PR description and analysis is greatly appreciated !

Can you also clarify if this code is supposed to be use with ObjC ARC or not ?
I know that codebase from codebase we can run into different expectations with this behavior.

} while(false);


#define TRACY_METAL_DEBUG_MASK (0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expected to unconditionally define this ?

did you meant the following so that it can be overridden on the client side ?

Suggested change
#define TRACY_METAL_DEBUG_MASK (0)
#ifndef TRACY_METAL_DEBUG_MASK
#define TRACY_METAL_DEBUG_MASK (0)
#endif

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at this initial stage, I'd like to have the verbose debugging log available when needed.
Should someone report issues, I can ask them to set this define prior to including the header.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is understood that you want people to possibly override the macro, but the right coding pattern to do so is to wrap the #define with #ifndef ... #endif which is missing as of writing to avoid compiler warnings (and thus maybe errors) in client codebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, LOL, yeah, I did not realize this was a code change suggestion. You are absolutely right!

Comment on lines +113 to +114
TracyMetalDebug(1<<0, TracyMetalPanic(, "MTLCounterErrorValue = 0x%llx", MTLCounterErrorValue));
TracyMetalDebug(1<<0, TracyMetalPanic(, "MTLCounterDontSample = 0x%llx", MTLCounterDontSample));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan on keeping all those debug points ?
Coming from the Tracy Vulkan background they look unfamiliar to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at least at this initial stage. Should people start reporting issues, this can help me triage the problem.

Comment on lines +124 to +139
if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtDrawBoundary])
{
TracyMetalPanic(, "WARNING: timestamp sampling at draw call boundary is not supported.");
}
if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtBlitBoundary])
{
TracyMetalPanic(, "WARNING: timestamp sampling at blit boundary is not supported.");
}
if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtDispatchBoundary])
{
TracyMetalPanic(, "WARNING: timestamp sampling at compute dispatch boundary is not supported.");
}
if (![m_device supportsCounterSampling:MTLCounterSamplingPointAtTileDispatchBoundary])
{
TracyMetalPanic(, "WARNING: timestamp sampling at tile dispatch boundary is not supported.");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is useful to have those logs but as you have explained TracyMetalZone only works around command encoder as of writing, so I am unsure how those warnings are useful to clients.
But it may be worth it to keep them as TracyMetalDebug ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point, I'll do that.
In the future, you'd request the granularity expected, and if that fails, then a panic is issued.

TracyMetalDebug(1<<0, TracyMetalPanic(, "Calibration: CPU timestamp (Metal): %llu", cpuTimestamp));
TracyMetalDebug(1<<0, TracyMetalPanic(, "Calibration: GPU timestamp (Metal): %llu", gpuTimestamp));

cpuTimestamp = Profiler::GetTime();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expected that you ditch the CPU timestamp returned by the MTLDevice ? Is this for consistency with other Tracy event messages ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what Tracy expects. The CPU timestamp reported when creating the GPU context must be the timestamp that Tracy understands. This is consistent with other backends as well.

//MemWrite(&item->gpuNewContext.flags, GpuContextCalibration);
MemWrite(&item->gpuNewContext.flags, GpuContextFlags(0));
MemWrite(&item->gpuNewContext.type, GpuContextType::Metal);
Profiler::QueueSerialFinish(); // TODO: DeferItem() for TRACY_ON_DEMAND
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do address the TRACY_ON_DEMAND it is rather easy to add here.

Comment on lines +314 to +315
t_start = m_mostRecentTimestamp + 5;
t_end = t_start + 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this means you try tp "patch" unresolved or not yet resolved timestamps ?
Can't we defer their resolution instead ?

SubmitZoneEndGpu(m_ctx, m_query.idx + 1);
}

TracyMetalZoneScopeWireTap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use of this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a debugging technique. An application can define the macro prior to including TracyMetal.hmm and run application-specific code at that point. Will probably go away once I am confident the back-end is working well.

const bool m_active;

MetalCtx* m_ctx;
id<MTLComputeCommandEncoder> m_cmdEncoder;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently unused because the above code is within #if 0 ... #endif

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I'll be working on that in a subsequent PR. I don't have hardware that supports command granularity right now.

MemWrite( &item->gpuZoneEnd.context, ctx->GetContextId() );
Profiler::QueueSerialFinish();

TracyMetalDebug(1<<2, TracyAllocN((void*)(uintptr_t)queryId, 1, "TracyMetalGpuZone"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are those not supposed to be pair of TracyAllocN/TracyFreeN ?

private:
const bool m_active;

MetalCtx* m_ctx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively you could just store the context ID

{
auto checkTime = std::chrono::high_resolution_clock::now();
auto requestTime = m_timestampRequestTime[k];
auto ms_in_flight = std::chrono::duration<float>(checkTime-requestTime).count()*1000.0f;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wolfpld I want to remove uses of std::chrono and use what's available in Tracy already, that is, Profiler::GetTime(). I may be missing something obvious here, but how do you convert a time difference between two Profiler::GetTime() samples and convert it to, say, seconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#ifndef __TRACYMETAL_HMM__
#define __TRACYMETAL_HMM__

/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should work on Intel-based Macs as well. I cant quite test it at the moment, though.

/* The Metal back-end in Tracy operates differently than other GPU back-ends like Vulkan,
Direct3D and OpenGL. Specifically, TracyMetalZone() must be placed around the site where
a command encoder is created. This is because not all hardware supports timestamps at
command granularity, and can only provide timestamps around an entire command encoder.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you call TracyMetalZone(), you need to pass the command encoder descriptor, so it's technically impossible to call it by passing a command encoder to it.
There will be a TracyMetalZone() interface in the future that takes the command encoder, once I have updated hardware here to test the other granularities.

return sentinel,
"NextQueryId: FULL! too many pending timestamp queries. [%llu, %llu] (%u)",
m_previousCheckpoint.load(), id, count
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point, I'll clarify that further.
We can also consider adding more buffers to collect timestamps, but that will complicate the implementation, so I'll leave it as future work.

} while(false);


#define TRACY_METAL_DEBUG_MASK (0)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at this initial stage, I'd like to have the verbose debugging log available when needed.
Should someone report issues, I can ask them to set this define prior to including the header.

Comment on lines +240 to +241
ZoneValue(begin);
ZoneValue(latestCheckpoint);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup; will try to move them to TracyMetalDebug macro.

Comment on lines +246 to +250
//uintptr_t nextCheckpoint = m_queryCounter.load();
//if (nextCheckpoint != latestCheckpoint)
//{
// // TODO: signal event / fence now?
//}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some ideas for signaling in Metal, but I'd rather experiment with that in a future PR.
I'm leaving the comment with TODO there as a reminder.

Comment on lines +310 to +311
if (ms_in_flight < timeout_ms)
break;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empirical choice based on the applications I tested.
This is reasonable for any interactive, game-like loop.
(I'll add a macro for that)

SubmitZoneEndGpu(m_ctx, m_query.idx + 1);
}

TracyMetalZoneScopeWireTap;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a debugging technique. An application can define the macro prior to including TracyMetal.hmm and run application-specific code at that point. Will probably go away once I am confident the back-end is working well.

const bool m_active;

MetalCtx* m_ctx;
id<MTLComputeCommandEncoder> m_cmdEncoder;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I'll be working on that in a subsequent PR. I don't have hardware that supports command granularity right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants