
Fix ze_peak explicit scaling benchmark #88

Merged
merged 2 commits into oneapi-src:master from ze_peak_fix on Nov 6, 2024

Conversation

@lyu (Contributor) commented Oct 16, 2024

The explicit scaling code for ze_peak violates L0 spec and has no overlap between sub-devices. This PR corrects these issues.

@lyu (Contributor, Author) commented Oct 16, 2024

Original execution flow for each subdevice ID

  1. Assume current subdevice ID is N
  2. Reset cmdlist N
  3. Append 1 memcpy to cmdlist N
  4. Close cmdlist N
  5. For each warmup iteration: [submit cmdlist N to subdevice N, and if this is the last subdevice then synchronize the cmdqueue of all subdevices]
  6. For each benchmark iteration: do the same as step 5
  7. Measure and return the time taken to do step 6
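
In Level Zero calls, this buggy flow looks roughly like the sketch below (identifiers such as `cmd_lists`, `queues`, `dst`, `src`, and `size` are assumptions for illustration, not the literal ze_peak code; error checking is omitted):

```cpp
#include <cstdint>
#include <level_zero/ze_api.h>

// Sketch of the original (buggy) flow for subdevice N.
// cmd_lists/queues are assumed per-subdevice handles created elsewhere.
void buggy_flow(uint32_t N, uint32_t num_subdevices,
                ze_command_list_handle_t *cmd_lists,
                ze_command_queue_handle_t *queues,
                void *dst, const void *src, size_t size, int iters) {
  zeCommandListReset(cmd_lists[N]);
  zeCommandListAppendMemoryCopy(cmd_lists[N], dst, src, size, nullptr, 0, nullptr);
  zeCommandListClose(cmd_lists[N]);

  for (int i = 0; i < iters; i++) { // same shape for warmup and benchmark loops
    // Re-submits cmd_lists[N] while the device may still be executing it,
    // which the L0 spec forbids.
    zeCommandQueueExecuteCommandLists(queues[N], 1, &cmd_lists[N], nullptr);
    // Only the last subdevice ever synchronizes, so for subdevice 0 the
    // measurement captures submission cost alone.
    if (N == num_subdevices - 1)
      for (uint32_t j = 0; j < num_subdevices; j++)
        zeCommandQueueSynchronize(queues[j], UINT64_MAX);
  }
}
```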

Suppose we have two subdevices, 0 and 1. For subdevice 0 there is no synchronization at all; since the cmdqueue is asynchronous, we only measure the submission time, which is tiny. For subdevice 1, we call cmdqueue sync on subdevice 0 at step 5 (during the warmup iterations), before the benchmark actually runs on subdevice 1, so there is no overlap at all.

At the end we sum all the time measurements and calculate the BW. Although there was no overlap, we also never measured the execution on subdevice 0, so we got half of the actual time and thus double the BW, which is why this bug went undiscovered.
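
To make the arithmetic concrete (with assumed round numbers): say each subdevice copies B bytes in time T when running alone. A properly overlapped run moves 2B bytes in roughly T, giving BW ≈ 2B/T. The buggy flow actually serializes the subdevices (true elapsed time ≈ 2T) but only measures subdevice 1's half (≈ T), so it still reports BW ≈ 2B/T, which is indistinguishable from perfect scaling.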

Additionally, for subdevice 0 we do submit-submit-...(500 times)...-submit-sync using the same cmdlist & cmdqueue pair. This violates the L0 spec's description of zeCommandQueueExecuteCommandLists (ref):

> The application must ensure the device is not currently referencing the command list since the implementation is allowed to modify the contents of the command list for submission.

So we saw command buffer GPU page faults when running ze_peak on PVC.
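
For reference, a minimal spec-compliant way to keep reusing one command list is to attach a fence to each submission and wait on it before resubmitting, as sketched below (fence creation via zeFenceCreate and error checking are omitted; note the merged fix batches the copies into a single command list instead, which also preserves overlap):

```cpp
#include <cstdint>
#include <level_zero/ze_api.h>

// Fence each submission so the device is provably done with the command
// list before it is executed again. This serializes the submissions.
void resubmit_with_fence(ze_command_queue_handle_t queue,
                         ze_command_list_handle_t cmd_list,
                         ze_fence_handle_t fence, int iters) {
  for (int i = 0; i < iters; i++) {
    zeCommandQueueExecuteCommandLists(queue, 1, &cmd_list, fence);
    zeFenceHostSynchronize(fence, UINT64_MAX); // wait for completion
    zeFenceReset(fence);                       // make the fence reusable
  }
}
```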

Corrected execution flow for each subdevice ID

  1. Assume current subdevice ID is N
  2. If N == 0 then reset all cmdlists
  3. Append 1 memcpy and 1 barrier to cmdlist N
  4. Close cmdlist N
  5. Run warmup iterations just on subdevice N
  6. Reset cmdlist N
  7. Append 500 [memcpy + barrier] to cmdlist N
  8. Close cmdlist N
  9. Submit cmdlist N to subdevice N, and if this is the last subdevice then synchronize the cmdqueue of all subdevices
  10. Measure and return the time taken to do step 9

Basically, we now submit 1 cmdlist to each subdevice asynchronously and synchronize on all subdevices only once every cmdlist has been submitted. Some warmup & cmdlist operation overhead is mixed in, and the barriers add their own overhead too, but the measured BW is still very close to 2x the single-subdevice performance.
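
For concreteness, here is a condensed sketch of the corrected flow in Level Zero calls (identifiers such as `cmd_lists`, `queues`, `dst`, `src`, `size`, and `warmup_iters` are assumptions for illustration, not the literal ze_peak code; setup and error checking are omitted):

```cpp
#include <chrono>
#include <cstdint>
#include <level_zero/ze_api.h>

// Condensed sketch of the corrected flow for subdevice N out of
// num_subdevices. cmd_lists/queues are assumed per-subdevice handles
// created elsewhere; dst/src/size/warmup_iters are benchmark parameters.
double run_subdevice(uint32_t N, uint32_t num_subdevices,
                     ze_command_list_handle_t *cmd_lists,
                     ze_command_queue_handle_t *queues,
                     void *dst, const void *src, size_t size,
                     int warmup_iters) {
  if (N == 0) // step 2: reset every cmdlist once, up front
    for (uint32_t i = 0; i < num_subdevices; i++)
      zeCommandListReset(cmd_lists[i]);

  // Steps 3-5: one memcpy + barrier, warmed up on this subdevice only.
  zeCommandListAppendMemoryCopy(cmd_lists[N], dst, src, size, nullptr, 0, nullptr);
  zeCommandListAppendBarrier(cmd_lists[N], nullptr, 0, nullptr);
  zeCommandListClose(cmd_lists[N]);
  for (int i = 0; i < warmup_iters; i++) {
    zeCommandQueueExecuteCommandLists(queues[N], 1, &cmd_lists[N], nullptr);
    zeCommandQueueSynchronize(queues[N], UINT64_MAX); // no in-flight reuse
  }

  // Steps 6-8: batch all benchmark iterations into a single cmdlist.
  zeCommandListReset(cmd_lists[N]);
  for (int i = 0; i < 500; i++) {
    zeCommandListAppendMemoryCopy(cmd_lists[N], dst, src, size, nullptr, 0, nullptr);
    zeCommandListAppendBarrier(cmd_lists[N], nullptr, 0, nullptr);
  }
  zeCommandListClose(cmd_lists[N]);

  // Steps 9-10: submit asynchronously; only the last subdevice drains all
  // queues, so the copies overlap across subdevices.
  auto start = std::chrono::steady_clock::now();
  zeCommandQueueExecuteCommandLists(queues[N], 1, &cmd_lists[N], nullptr);
  if (N == num_subdevices - 1)
    for (uint32_t i = 0; i < num_subdevices; i++)
      zeCommandQueueSynchronize(queues[i], UINT64_MAX);
  return std::chrono::duration<double>(
             std::chrono::steady_clock::now() - start).count();
}
```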

Three review threads on perf_tests/ze_peak/src/ze_peak.cpp (resolved; one outdated).
The explicit scaling code for ze_peak violates L0 spec and has no
overlap between sub-devices. This PR corrects these issues.

Signed-off-by: Wenbin Lu <wenbin.lu@intel.com>
@nrspruit previously approved these changes Nov 6, 2024
@nrspruit self-requested a review Nov 6, 2024 17:08
@nrspruit merged commit ae0ea49 into oneapi-src:master Nov 6, 2024
13 checks passed
@lyu deleted the ze_peak_fix branch Nov 6, 2024 23:47