
Fix ze_peak explicit scaling benchmark #88

Merged
merged 2 commits into oneapi-src:master from ze_peak_fix on Nov 6, 2024

Conversation

@lyu (Contributor) commented Oct 16, 2024

The explicit scaling code for ze_peak violates L0 spec and has no overlap between sub-devices. This PR corrects these issues.

@lyu (Contributor, Author) commented Oct 16, 2024

Original execution flow for each subdevice ID

  1. Assume current subdevice ID is N
  2. Reset cmdlist N
  3. Append 1 memcpy to cmdlist N
  4. Close cmdlist N
  5. For each warmup iteration: [submit cmdlist N to subdevice N, and if this is the last subdevice then synchronize the cmdqueue of all subdevices]
  6. For each benchmark iteration: do the same as step 5
  7. Measure and return the time taken to do step 6
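
In Level Zero calls, this buggy flow looks roughly like the sketch below (identifiers such as `cmd_lists`, `queues`, `dst`, `src`, and `size` are assumptions for illustration, not the literal ze_peak code; error checking is omitted):

```cpp
#include <cstdint>
#include <level_zero/ze_api.h>

// Sketch of the original (buggy) flow for subdevice N.
// cmd_lists/queues are assumed per-subdevice handles created elsewhere.
void buggy_flow(uint32_t N, uint32_t num_subdevices,
                ze_command_list_handle_t *cmd_lists,
                ze_command_queue_handle_t *queues,
                void *dst, const void *src, size_t size, int iters) {
  zeCommandListReset(cmd_lists[N]);
  zeCommandListAppendMemoryCopy(cmd_lists[N], dst, src, size, nullptr, 0, nullptr);
  zeCommandListClose(cmd_lists[N]);

  for (int i = 0; i < iters; i++) { // same shape for warmup and benchmark loops
    // Re-submits cmd_lists[N] while the device may still be executing it,
    // which the L0 spec forbids.
    zeCommandQueueExecuteCommandLists(queues[N], 1, &cmd_lists[N], nullptr);
    // Only the last subdevice ever synchronizes, so for subdevice 0 the
    // measurement captures submission cost alone.
    if (N == num_subdevices - 1)
      for (uint32_t j = 0; j < num_subdevices; j++)
        zeCommandQueueSynchronize(queues[j], UINT64_MAX);
  }
}
```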

Suppose we have two subdevices, 0 and 1. For subdevice 0 there is no synchronization at all; since the cmdqueue is asynchronous, we only measure the submission time, which is tiny. For subdevice 1, we call cmdqueue sync on subdevice 0 at step 5 (during the warmup iterations), before the benchmark actually runs on subdevice 1, so there is no overlap at all.

At the end we sum all the time measurements and calculate the BW. Although there was no overlap, we also never measured the execution on subdevice 0, so we got half of the actual time and thus double the BW, which is why this bug went undiscovered.
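
To make the arithmetic concrete (with assumed round numbers): say each subdevice copies B bytes in time T when running alone. A properly overlapped run moves 2B bytes in roughly T, giving BW ≈ 2B/T. The buggy flow actually serializes the subdevices (true elapsed time ≈ 2T) but only measures subdevice 1's half (≈ T), so it still reports BW ≈ 2B/T, which is indistinguishable from perfect scaling.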

Additionally, for subdevice 0 we do submit-submit-...(500 times)...-submit-sync using the same cmdlist & cmdqueue pair. This violates the L0 spec's description of zeCommandQueueExecuteCommandLists (ref):

> The application must ensure the device is not currently referencing the command list since the implementation is allowed to modify the contents of the command list for submission.

So we saw command buffer GPU page faults when running ze_peak on PVC.
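
For reference, a minimal spec-compliant way to keep reusing one command list is to attach a fence to each submission and wait on it before resubmitting, as sketched below (fence creation via zeFenceCreate and error checking are omitted; note the merged fix batches the copies into a single command list instead, which also preserves overlap):

```cpp
#include <cstdint>
#include <level_zero/ze_api.h>

// Fence each submission so the device is provably done with the command
// list before it is executed again. This serializes the submissions.
void resubmit_with_fence(ze_command_queue_handle_t queue,
                         ze_command_list_handle_t cmd_list,
                         ze_fence_handle_t fence, int iters) {
  for (int i = 0; i < iters; i++) {
    zeCommandQueueExecuteCommandLists(queue, 1, &cmd_list, fence);
    zeFenceHostSynchronize(fence, UINT64_MAX); // wait for completion
    zeFenceReset(fence);                       // make the fence reusable
  }
}
```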

Corrected execution flow for each subdevice ID

  1. Assume current subdevice ID is N
  2. If N == 0 then reset all cmdlists
  3. Append 1 memcpy and 1 barrier to cmdlist N
  4. Close cmdlist N
  5. Run warmup iterations just on subdevice N
  6. Reset cmdlist N
  7. Append 500 [memcpy + barrier] to cmdlist N
  8. Close cmdlist N
  9. Submit cmdlist N to subdevice N, and if this is the last subdevice then synchronize the cmdqueue of all subdevices
  10. Measure and return the time taken to do step 9

Basically, we now submit 1 cmdlist to each subdevice asynchronously and synchronize on all subdevices only once every cmdlist has been submitted. Some warmup & cmdlist operation overhead is mixed in, and the barriers add their own overhead too, but the measured BW is still very close to 2x the single-subdevice performance.
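
For concreteness, here is a condensed sketch of the corrected flow in Level Zero calls (identifiers such as `cmd_lists`, `queues`, `dst`, `src`, `size`, and `warmup_iters` are assumptions for illustration, not the literal ze_peak code; setup and error checking are omitted):

```cpp
#include <chrono>
#include <cstdint>
#include <level_zero/ze_api.h>

// Condensed sketch of the corrected flow for subdevice N out of
// num_subdevices. cmd_lists/queues are assumed per-subdevice handles
// created elsewhere; dst/src/size/warmup_iters are benchmark parameters.
double run_subdevice(uint32_t N, uint32_t num_subdevices,
                     ze_command_list_handle_t *cmd_lists,
                     ze_command_queue_handle_t *queues,
                     void *dst, const void *src, size_t size,
                     int warmup_iters) {
  if (N == 0) // step 2: reset every cmdlist once, up front
    for (uint32_t i = 0; i < num_subdevices; i++)
      zeCommandListReset(cmd_lists[i]);

  // Steps 3-5: one memcpy + barrier, warmed up on this subdevice only.
  zeCommandListAppendMemoryCopy(cmd_lists[N], dst, src, size, nullptr, 0, nullptr);
  zeCommandListAppendBarrier(cmd_lists[N], nullptr, 0, nullptr);
  zeCommandListClose(cmd_lists[N]);
  for (int i = 0; i < warmup_iters; i++) {
    zeCommandQueueExecuteCommandLists(queues[N], 1, &cmd_lists[N], nullptr);
    zeCommandQueueSynchronize(queues[N], UINT64_MAX); // no in-flight reuse
  }

  // Steps 6-8: batch all benchmark iterations into a single cmdlist.
  zeCommandListReset(cmd_lists[N]);
  for (int i = 0; i < 500; i++) {
    zeCommandListAppendMemoryCopy(cmd_lists[N], dst, src, size, nullptr, 0, nullptr);
    zeCommandListAppendBarrier(cmd_lists[N], nullptr, 0, nullptr);
  }
  zeCommandListClose(cmd_lists[N]);

  // Steps 9-10: submit asynchronously; only the last subdevice drains all
  // queues, so the copies overlap across subdevices.
  auto start = std::chrono::steady_clock::now();
  zeCommandQueueExecuteCommandLists(queues[N], 1, &cmd_lists[N], nullptr);
  if (N == num_subdevices - 1)
    for (uint32_t i = 0; i < num_subdevices; i++)
      zeCommandQueueSynchronize(queues[i], UINT64_MAX);
  return std::chrono::duration<double>(
             std::chrono::steady_clock::now() - start).count();
}
```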

Three review threads on perf_tests/ze_peak/src/ze_peak.cpp (resolved; one outdated).
The explicit scaling code for ze_peak violates L0 spec and has no
overlap between sub-devices. This PR corrects these issues.

Signed-off-by: Wenbin Lu <wenbin.lu@intel.com>
@nrspruit previously approved these changes Nov 6, 2024
@nrspruit self-requested a review Nov 6, 2024 17:08
@nrspruit merged commit ae0ea49 into oneapi-src:master Nov 6, 2024
13 checks passed
@lyu deleted the ze_peak_fix branch Nov 6, 2024 23:47