[EXP][CMDBUF] Fix dependency handling for large buffer fill ops in CUDA graph support

The CUDA backend does not handle buffer fills with a pattern larger than 32 bits.
The implemented workaround is to add several nodes, each performing a 32-bit buffer fill op.
In the previous implementation, every one of these nodes took all previous node(s) as predecessors.
In this version, only the first node takes all previous node(s) as predecessors; each subsequent node takes the node added just before it as its predecessor. This ensures that the whole fill op has completed once the last node in the sequence completes.
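
The dependency change described above can be sketched without the CUDA driver API. The following is a minimal illustration, not the adapter's code: `MockGraph` and `addChainedFillNodes` are hypothetical stand-ins, and nodes are plain indices rather than `CUgraphNode` handles. The first node added receives the caller's dependency list, and every later node depends only on the node added immediately before it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph: Preds[i] lists the predecessors of node i.
struct MockGraph {
  std::vector<std::vector<std::size_t>> Preds;

  // Adds a node that depends on Deps and returns its index.
  std::size_t addNode(const std::vector<std::size_t> &Deps) {
    Preds.push_back(Deps);
    return Preds.size() - 1;
  }
};

// Adds NumberOfSteps fill nodes as a chain. The first node depends on the
// incoming DepsList; each later node depends only on the previously added
// node. Returns the index of the last node, which a sync point would track.
std::size_t addChainedFillNodes(MockGraph &Graph,
                                std::vector<std::size_t> DepsList,
                                std::size_t NumberOfSteps) {
  assert(NumberOfSteps > 0 && "sketch assumes at least one fill step");
  std::size_t Last = 0;
  for (std::size_t Step = 0; Step < NumberOfSteps; ++Step) {
    Last = Graph.addNode(DepsList);
    // Mirror the commit: the next step depends on this node only.
    DepsList.clear();
    DepsList.push_back(Last);
  }
  return Last;
}
```

Because the nodes form a chain, waiting on the last node implies that every earlier step of the fill has finished.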
mfrancepillois committed Jan 16, 2024
1 parent c63ad9b commit 70682c4
Showing 1 changed file with 8 additions and 2 deletions.
10 changes: 8 additions & 2 deletions source/adapters/cuda/command_buffer.cpp
@@ -148,6 +148,9 @@ static ur_result_t enqueueCommandBufferFillHelper(

 size_t NumberOfSteps = PatternSize / sizeof(uint32_t);

+// List shared pointer that will point to the last node created
+std::shared_ptr<CUgraphNode> GraphNodePtr;
+
 // we walk up the pattern in 4-byte steps, and call cuMemset for each
 // 4-byte chunk of the pattern.
 for (auto Step = 0u; Step < NumberOfSteps; ++Step) {
@@ -173,9 +176,12 @@ static ur_result_t enqueueCommandBufferFillHelper(
 DepsList.size(), &NodeParamsStep,
 CommandBuffer->Device->getContext()));

+GraphNodePtr = std::make_shared<CUgraphNode>(GraphNode);
 // Get sync point and register the cuNode with it.
-*SyncPoint = CommandBuffer->AddSyncPoint(
-std::make_shared<CUgraphNode>(GraphNode));
+*SyncPoint = CommandBuffer->AddSyncPoint(GraphNodePtr);
+
+DepsList.clear();
+DepsList.push_back(*GraphNodePtr.get());
 }
 }
 } catch (ur_result_t Err) {
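
The loop comment in the diff says the pattern is walked in 4-byte steps, with one memset per 4-byte chunk. A host-side sketch of that idea follows. It is an illustration under assumptions, not the adapter's implementation: `fillInFourByteSteps` is a hypothetical name, the buffer lives in host memory, and each step is assumed to write its chunk at a stride of `PatternSize` bytes across the destination.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side illustration only: fills Dst (Size bytes) with a repeating
// pattern by making one strided pass per 4-byte chunk of the pattern.
// Assumes PatternSize is a non-zero multiple of 4 and Size is a multiple
// of PatternSize.
void fillInFourByteSteps(std::uint8_t *Dst, std::size_t Size,
                         const void *Pattern, std::size_t PatternSize) {
  assert(PatternSize > 0 && PatternSize % sizeof(std::uint32_t) == 0);
  assert(Size % PatternSize == 0);
  const std::size_t NumberOfSteps = PatternSize / sizeof(std::uint32_t);
  const std::size_t Repeats = Size / PatternSize;
  for (std::size_t Step = 0; Step < NumberOfSteps; ++Step) {
    // Take 4 bytes of the pattern.
    std::uint32_t Value;
    std::memcpy(&Value,
                static_cast<const std::uint8_t *>(Pattern) + Step * 4, 4);
    // Write that chunk into every repetition of the pattern in Dst.
    for (std::size_t Row = 0; Row < Repeats; ++Row)
      std::memcpy(Dst + Row * PatternSize + Step * 4, &Value, 4);
  }
}
```

Each pass touches only its own 4-byte column, so the buffer holds the complete pattern only after the final step has run, which is why the commit makes the last node the one a sync point tracks.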