Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[subgroups][non_uniform_broadcast] Fix broadcasting index generation #1680

Merged

Conversation

Nuullll
Copy link
Contributor

@Nuullll Nuullll commented Mar 23, 2023

Dynamically activate half of the work items in the current subgroup instead of hardcoding the number of active work items.

Otherwise, if subgroup size == NR_OF_ACTIVE_WORK_ITEMS == 4, then we will encounter "divide-by-zero" error when evaluating bcast_index % (n - NR_OF_ACTIVE_WORK_ITEMS).

Signed-off-by: Yilong Guo yilong.guo@intel.com

The subgroup size may not be greater than `NR_OF_ACTIVE_WORK_ITEMS`.
Broadcasting index needs to be reduced in that case.

Otherwise, if subgroup size == `NR_OF_ACTIVE_WORK_ITEMS` == 4, then we
will encounter "divide-by-zero" error when evaluating `bcast_index %
(n - NR_OF_ACTIVE_WORK_ITEMS)`.
@Nuullll
Copy link
Contributor Author

Nuullll commented Mar 27, 2023

gentle ping :-)

@Nuullll
Copy link
Contributor Author

Nuullll commented Mar 30, 2023

@StuartDBrady Can you please take a look?

@Nuullll
Copy link
Contributor Author

Nuullll commented Apr 12, 2023

ping @svenvh @StuartDBrady

// last workgroup last subgroup
if (last_subgroup_size && j == nj - 1
&& last_subgroup_size < NR_OF_ACTIVE_WORK_ITEMS)
{
bcast_if = bcast_index % last_subgroup_size;
bcast_elseif = bcast_if;
}
// reduce broadcasting index in case subgroup size <=
// NR_OF_ACTIVE_WORK_ITEMS (i.e. all items are active)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This highlights a problem with the broadcast tests: they currently are not meant to test subgroup sizes <= NR_OF_ACTIVE_WORK_ITEMS (i.e., 4). Your proposed fix causes the test to skip an important aspect of non-uniform subgroup operations, as the else in sub_group_non_uniform_broadcast_source will not be executed. That means the subgroup operation will not be tested properly when the subgroup size is <= 4, so I don't think we should commit this.

Instead, we should probably try to get rid of NR_OF_ACTIVE_WORK_ITEMS and use work-item masks to introduce divergence in the broadcast tests (as done for e.g. sub_group_non_uniform_any).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing. I agree. Will update the PR.

Dynamically activate half of the work items in the current subgroup
instead of hardcoding as `NR_OF_ACTIVE_WORK_ITEMS`.
@Nuullll Nuullll requested a review from svenvh September 27, 2023 09:42
@StuartDBrady
Copy link
Contributor

@Nuullll, would you be able to use the same mechanism as used for sub_group_non_uniform_any, instead? Without that, I don't think we test with all items active, for example.

@Nuullll
Copy link
Contributor Author

Nuullll commented Oct 12, 2023

@Nuullll, would you be able to use the same mechanism as used for sub_group_non_uniform_any, instead? Without that, I don't think we test with all items active, for example.

@StuartDBrady

I did try to use the uint4 work_item_mask_vector parameter to represent the active items. The problem is that we need to provide a corresponding broadcasting index for each mask, as the second parameter of sub_group_non_uniform_broadcast call. However, the current gen() implementation is independent of the actual mask:

// Generate the desired input for the kernel
test_params.subgroup_size = subgroup_size;
Fns::gen(idata.data(), mapin.data(), sgmap.data(), test_params);
test_status status;
if (test_params.divergence_mask_arg != -1)
{
for (auto &mask : test_params.all_work_item_masks)
{
test_params.work_items_mask = mask;
cl_uint4 mask_vector = bs128_to_cl_uint4(mask);
clSetKernelArg(kernel, test_params.divergence_mask_arg,
sizeof(cl_uint4), &mask_vector);
status = executor.run_and_check(test_params);
if (status == TEST_FAIL) break;
}
}

Of course, we can do the following (pseudo code):

__kernel void test(..., uint4 work_item_mask_vector) {
    ...
    if (elect_work_item & work_item_mask)
        out[gid] = sub_group_non_uniform_broadcast(x, get_index_of_one_bit(work_item_mask_vector));
    else
        out[gid] = sub_group_non_uniform_broadcast(x, get_index_of_one_bit(~work_item_mask_vector));

get_index_of_one_bit could be anything as long as it produces an index of a set bit according to the dynamic mask value. And we need to implement the same get_index_of_one_bit algorithm in host chk() function. I think it's just another form of hardcoding on the broadcasting index.

In my opinion, the current approach (hardcoding the divergence condition both in kernel code and gen() implementation, and taking the broadcast index from the random input) is much simpler. Or I could have missed something, could you please give some advice?

@Nuullll
Copy link
Contributor Author

Nuullll commented Feb 7, 2024

any comments?

Copy link
Member

@svenvh svenvh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the current approach (hardcoding the divergence condition both in kernel code and gen() implementation, and taking the broadcast index from the random input) is much simpler. Or I could have missed something, could you please give some advice?

As @StuartDBrady pointed out, we don't seem to test sub_group_non_uniform_broadcast with all work-items active, which is an existing gap in test coverage. It would be nice to address that, but if you'd rather only remove the hardcoded split of 4 in this PR, that should be fine too I suppose.

@Nuullll
Copy link
Contributor Author

Nuullll commented Mar 4, 2024

friendly ping.

@bashbaug
Copy link
Contributor

Merging as discussed in the March 12th teleconference.

@bashbaug bashbaug merged commit a045f76 into KhronosGroup:main Mar 12, 2024
7 checks passed
yanfeng3721 pushed a commit to yanfeng3721/OpenCL-CTS that referenced this pull request Oct 24, 2024
…hronosGroup#1680) (KhronosGroup#58)

cherry-pick
KhronosGroup@a045f76.

CMPLRLLVM-60752.

Co-authored-by: Yilong Guo <yilong.guo@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants