Consider the following GLSL snippet of a compute shader that attempts to reduce the number of atomic operations from one per invocation to one per subgroup:
// Free should be initialized to 0.
layout(set=0, binding=0) buffer BUFFER { uint free; uint data[]; } b;
void main() {
  bool needs_space = false;
  ...
  if (needs_space) {
    // gl_SubgroupSize may be larger than the actual subgroup size so
    // calculate the actual subgroup size.
    uvec4 mask = subgroupBallot(needs_space);
    uint size = subgroupBallotBitCount(mask);
    uint base = 0;
    if (subgroupElect()) {
      // "free" tracks the next free slot for writes.
      // The first invocation in the subgroup allocates space
      // for each invocation in the subgroup that requires it.
      base = atomicAdd(b.free, size);
    }
    // Broadcast the base index to other invocations in the subgroup.
    base = subgroupBroadcastFirst(base);
    // Calculate the offset from "base" for each invocation.
    uint offset = subgroupBallotExclusiveBitCount(mask);
    // Write the data in the allocated slot for each invocation that
    // requested space.
    b.data[base + offset] = ...;
  }
  ...
}
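There is a problem with the code that might lead to unexpected results. Vulkan only requires invocations to reconverge after the if statement that performs the subgroup election if all the invocations in the workgroup are converged at that if statement. Here only the invocations with needs_space set enter the outer if statement, so that condition generally does not hold. If the invocations don't reconverge, the subgroup operations after the election can run with only some of the subgroup's invocations active, so the broadcast and offset calculations will be incorrect and not all invocations would write their results to the correct index.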
VK_KHR_shader_subgroup_uniform_control_flow can be utilized to make the shader behave as expected in most cases. Consider the following rewritten version of the example:
// Free should be initialized to 0.
layout(set=0, binding=0) buffer BUFFER { uint free; uint data[]; } b;
// Note the addition of a new attribute.
void main() [[subgroup_uniform_control_flow]] {
  bool needs_space = false;
  ...
  // Note the change of the condition.
  if (subgroupAny(needs_space)) {
    // gl_SubgroupSize may be larger than the actual subgroup size so
    // calculate the actual subgroup size.
    uvec4 mask = subgroupBallot(needs_space);
    uint size = subgroupBallotBitCount(mask);
    uint base = 0;
    if (subgroupElect()) {
      // "free" tracks the next free slot for writes.
      // The first invocation in the subgroup allocates space
      // for each invocation in the subgroup that requires it.
      base = atomicAdd(b.free, size);
    }
    // Broadcast the base index to other invocations in the subgroup.
    base = subgroupBroadcastFirst(base);
    // Calculate the offset from "base" for each invocation.
    uint offset = subgroupBallotExclusiveBitCount(mask);
    if (needs_space) {
      // Write the data in the allocated slot for each invocation that
      // requested space.
      b.data[base + offset] = ...;
    }
  }
  ...
}
The differences from the original shader are relatively minor. First, the addition of the subgroup_uniform_control_flow attribute informs the implementation that stronger guarantees are required by this shader. Second, the first if statement no longer tests needs_space. Instead, all invocations in the subgroup enter the if statement if any invocation in the subgroup needs to write data. This keeps the subgroup uniform at the inner subgroup election, which is what the enhanced guarantees require. Third, because invocations that do not need space now enter the outer if statement as well, the write itself is guarded by a new inner test of needs_space.
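The listing above leaves out the shader preamble. As a minimal sketch, the directives below enable the GLSL extensions that supply the built-ins and the attribute used in the rewritten example. The `#version 450` line and the workgroup size of 64 are assumptions made for illustration and do not come from the original example.

```glsl
#version 450
// Assumed GLSL version, chosen only for illustration.

// subgroupElect()
#extension GL_KHR_shader_subgroup_basic : require
// subgroupAny()
#extension GL_KHR_shader_subgroup_vote : require
// subgroupBallot(), subgroupBallotBitCount(),
// subgroupBallotExclusiveBitCount(), subgroupBroadcastFirst()
#extension GL_KHR_shader_subgroup_ballot : require
// The [[subgroup_uniform_control_flow]] entry point attribute
#extension GL_EXT_subgroup_uniform_control_flow : require

// Assumed workgroup size, chosen only for illustration.
layout(local_size_x = 64) in;

void main() [[subgroup_uniform_control_flow]] {
  // Shader body as in the rewritten example.
}
```

On the API side, the device must also support VK_KHR_shader_subgroup_uniform_control_flow and have its shaderSubgroupUniformControlFlow feature enabled.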
There is a final caveat with this example. In order for the shader to operate correctly in all circumstances, the subgroup must be uniform (converged) prior to the first if statement.
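One way to satisfy this caveat is to keep the outer if statement at the top level of main, outside any branch or loop whose condition can differ between invocations of a subgroup. The sketch below illustrates that placement. The `src` input buffer, its `values` array, the non-zero test, the `#version` line and the workgroup size are all hypothetical, introduced only to make the example complete. It also assumes the subgroup is converged on entry to main.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_vote : require
#extension GL_KHR_shader_subgroup_ballot : require
#extension GL_EXT_subgroup_uniform_control_flow : require

// Assumed workgroup size, chosen only for illustration.
layout(local_size_x = 64) in;

// Free should be initialized to 0.
layout(set=0, binding=0) buffer BUFFER { uint free; uint data[]; } b;
// Hypothetical input buffer; not part of the original example.
layout(set=0, binding=1) readonly buffer INPUT { uint values[]; } src;

void main() [[subgroup_uniform_control_flow]] {
  uint id = gl_GlobalInvocationID.x;

  // This branch diverges, but it closes before the allocation starts.
  // Assuming the subgroup is converged on entry to main, the attribute
  // lets it reconverge at the closing brace.
  uint value = 0u;
  if (id < uint(src.values.length())) {
    value = src.values[id];
  }
  // Hypothetical criterion: only non-zero values are written out.
  bool needs_space = (value != 0u);

  // Top level of main: the whole subgroup reaches this test together.
  // Nesting it inside the bounds check above instead would mean that only
  // the in-range invocations reach it, and the guarantee would not apply.
  if (subgroupAny(needs_space)) {
    uvec4 mask = subgroupBallot(needs_space);
    uint size = subgroupBallotBitCount(mask);
    uint base = 0u;
    if (subgroupElect()) {
      base = atomicAdd(b.free, size);
    }
    base = subgroupBroadcastFirst(base);
    uint offset = subgroupBallotExclusiveBitCount(mask);
    if (needs_space) {
      b.data[base + offset] = value;
    }
  }
}
```

The bounds check is folded into needs_space rather than wrapped around the allocation, so out-of-range invocations still take part in the ballot and the election without writing anything.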