Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] remove dependencies from memcpy, memset and kernel launch #2075

Draft
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

AuroraPerego
Copy link
Contributor

The requirements / dependencies were needed for the buffers, so now can be removed.
SYCL queues and events are thread-safe, the mutex in QueueGenericSyclBase.hpp has been removed as well.
When there is an event to wait for, instead of registering it as a dependency, the method ext_oneapi_submit_barrier(const std::vector<event> &) is used.

// Execute task
if constexpr(is_sycl_task<TTask> && !is_sycl_kernel<TTask>) // Copy / Fill
{
m_last_event = task(m_queue, m_dependencies); // Will call queue.{copy, fill} internally
m_last_event = task(m_queue); // Will call queue.{copy, fill} internally
}
else
{
m_last_event = m_queue.submit(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks dangerously like a race condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean between the call to enqueue() and a possible call to get_last_event() or empty() ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about the assignment here. What happens if two threads enqueue work concurrently? Do we have a guarantee that m_last_event will actually refer to the last event?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see.

You are right, if multiple threads enqueue a task to the same queue, there is no guarantee about which of the two events would be stored in m_last_event.

Possible solutions:

  1. add back the lock, and use it only in enqueue(), empty(), get_last_event() (and maybe wait()? probably not)
  2. remove the m_last_event member and
  • implement get_last_event() with a call to m_queue.ext_oneapi_submit_barrier()
  • implement empty() with a call to m_queue.ext_oneapi_empty()
    Unfortunately according to the extension itself, ext_oneapi_empty() is not supported by the OpenCL back0end :-(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the extension link:

Currently support for OpenCL backend is limited, API introduced by this extension can be called only for in-order queues which doesn’t have discard_events property. Exception is thrown if new API is called on other type of queue. OpenCL currently doesn’t have an API to get queue status.

We are using in-order SYCL queues internally, so I think we should be fine, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

I guess we can try and see if it works :-)

@j-stephan
Copy link
Member

Okay, the checks are looking good. Have you executed the runtime tests?

@AuroraPerego
Copy link
Contributor Author

Yes, i've built and run the tests for CPU and GPU with the usual result (all pass on GPU, 4 out of resources on CPU)

@AuroraPerego
Copy link
Contributor Author

(I'll check also the impact on the performance)

Copy link
Member

@j-stephan j-stephan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view this is good to go. @fwyzard feel free to merge once the performance analysis is done.

@j-stephan j-stephan marked this pull request as draft August 11, 2023 09:06
@j-stephan
Copy link
Member

Converting to draft until the performance analysis is ready.

@AuroraPerego
Copy link
Contributor Author

The situation is worse than expected, apparently there are many more synchronization introduced with these changes.
I compared the baseline with this PR and with another version that has only the changes to the Copy.hpp and Set.hpp files.

nThreads base@c9dbac0 base + PR base + Copy/Set changes
1 148.44 +/- 3.48 142.81 +/- 1.71 148.94 +/- 2.58
2 172.73 +/- 0.94 162.27 +/- 2.48 171.95 +/- 1.05
4 181.79 +/- 0.16 167.61 +/- 1.06 182.29 +/- 0.31
8 182.89 +/- 0.06 166.67 +/- 0.54 183.21 +/- 0.06

Doing some profiling, these are the main differences:

zeEventHostSynchronize zeCommandListAppendBarrier zeEventQueryStatus zeEventHostReset
base 5 5 2 ~60
pr 6 ~48 ~22 ~103

@j-stephan
Copy link
Member

What are those numbers? Runtime?

@AuroraPerego
Copy link
Contributor Author

Yes, sorry.
The tests have been done on pixeltrack-standalone@dcf2898.
These numbers are the throughput, i.e. the number of concurrent events processed per second, computed processing 10000 events and nThreads is the number of CPU threads used

@j-stephan
Copy link
Member

Ah, I see. So lower numbers are worse. That is indeed a change for the worse then. How did you ensure that the requirements handling is correct in your third approach (only changes to Copy and Set)?

@AuroraPerego
Copy link
Contributor Author

Not sure the handling it's correct, it was done mostly to remove the ext_oneapi_submit_barrier() and ext_oneapi_empty() calls to check the impact.
At this point I'm not sure if it's worth to change something in the current implementation.

@j-stephan
Copy link
Member

Hm, maybe open a bug report for Intel? I'm sure they would be interested to hear about this :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants