Intro to GPU Scalarization – Part 2 -Scalarize all the lights – Dreams of flashy pixels (on a white website)

Note: The following posts are going to be very GCN specific, I hadn’t made that clear enough with my first extension, hence why only this PSA at the top of the post! Sorry if you’ve already read and it created confusion.

This is the second part of my 2-part short series on scalarization. This assumes you either read the first part or you know what a wavefront is, what SGPR/VGPR, SALU/VALU, SMEM/VMEM are, you are aware of wave intrinsics and have an idea of what scalarization is.

Just a reminder, this mini-series is targeted at people approaching scalarization for the first time, so it won’t contain any groundbreaking new technique 🙂

Before I start, if the pace of the gifs I added to the post is off for you, this is the pptx file I used to do them. This way you can follow the steps at your own leisure!

Part 1 – Introduction to concepts and simple example.
Part 2 – Scalarizing a forward+ light loop.

The problem

For quite a while now tiled and clustered data structures have been the industry standard for lighting. In both cases we assign the lights in our scene to cells — screen aligned tiles or 3d clusters — and then for each pixel, we identify which cell contains the relevant lights and perform the shading.

Another common trend is to use a forward shading, however this comes with a performance problem: pixels within the same wave might have to access a different set of lights.

Screen Shot 2018-11-01 at 12.28.57.png — Orange object is 8×8 in size and touches 4 8×8 different tiles.

That of course means that we will have diverging loads of light data within a wave and from what we learned on previous post of the series that is suboptimal (see note [0]). But hey! We could scalarize!

Let’s have a look at a sample light loop, non-scalarized yet, with three divergent threads:

Nov-01-2018 17-15-17.gif — We are making the assumption that the sublists of lights are sorted from now on.

All is going to be in VGPRs and it is going to be vector loads! Not great.

Consider a typical scene, we could say that the following:

Pixels in a wave are all likely to access the same cell.
Nearby cells are likely to have similar content.

Both of these consideration scream coherency!

Strategy #1 – Looping clusters and lane masking

We noticed how most pixels will access the same cell, that also mean that if we have divergence, it’s unlikely that we will hit many different cells within a wave. So? So we could scalarize the cell index and, for each thread, loop through all the cells touched by the wave.
Let’s first see how (from [Drobot 2017] ) and then we’ll analyse what’s going on:

Nov-01-2018 22-13-11 — Note that ulong here is a 64bit uint, not natively supported in hlsl, but check how to handle them on your platform 🙂 Also note, in the visual example I only have 16 threads vs 64

Intimidating? It might be a bit! Hopefully the step-by step execution helps, but let’s dig deeper to see what’s going on by having a closer look at parts of the code:

ulong execMask = 0xffffffff;

In the previous post I mentioned the concept of exec mask. Here we are creating our own exec mask to determine which threads will need to process a cell. Of course, since we are at the start of the frame, we mark all threads as “alive”, hence the 0xffffffff.

uint v_laneID = WaveGetLaneIndex();
ulong curLaneMask = ulong(1) << ulong(v_laneID);

This bit is simple, we are creating a mask that has 0s for every thread, except the n-th bit for the n-th lane. E.g. lane 3 will have a mask 0000000…001000.

while( ( execMask & curLaneMask ) != 0 )

This is our loop condition. Plainly put, the lane will stop when its bit in the exec mask becomes unset. What it means is that we are going to loop in a scalar fashion through all the cells and the current thread will stop doing so as soon as soon as its cell is processed.

uint s_cellIdx = WaveReadFirstLane(v_cellIdx);
ulong laneMask = WaveBallot(v_cellIdx == s_cellIdx);

This is a crucial bit. We first get the cell index from the first active lane, this will be in an SGPR. Then, via Ballot, we create a mask telling us which other threads need to process lights from the cell of the first active thread…

execMask = execMask & ~laneMask;

By doing a logical and of the negated laneMask we just created, we are essentially saying: “These threads are about to process the cell they need, so we’ll be done with them”. Since we are done with them, we zero their bit in the exec mask so that they will exit the loop next iteration.

if( s_cellIdx == v_cellIdx )
    ProcessLightsInCell(s_cellIdx)

And finally, if indeed the current thread needed to process the same cell as the first active thread, we process such cell!
Now, since the cell address is scalar, the cell content can be loaded in SGPR via scalar loads. Hurray! We did scalarize the loop!

Now that we went through it bit by bit, let’s wrap it up at an high level in words again:

Until its relevant cell is processed, each thread will read the cell index of the first active thread. If this coincides with the its own cell index, the thread will process the cell exits the loop, ceasing to be active and waiting for the rest of the wave to complete the same process. Since all threads will go through all the cells in the same order, the process is scalar.

If it is still unclear, I suggest you to follow the step by step gif again. Use the powerpoint to go at your own pace 🙂

Can we do better? Possibly, yes.
The problem here is that we could end up doing lots of redundant work.
Remember our coherency assumptions? In particular: “Nearby cells are likely to have similar content.”.
Since we are scalarizing at a cell level, we are processing the whole content of each of the cells that touch any pixel in the wave. That means that if for example light number 42 appears in three cells that touch a wave, we are going to process that three separate times.

Nov-04-2018 11-54-56.gif

See Note 1 at the bottom for some clarifications on what I mean by “process three separate times”.

What if we scalarize per light rather than per cell? This is what [Drobot 2017] goes on to show, but to touch more sources and more diverse techniques I’ll follow an example from DOOM [Sousa et Geffroy 2016].

Strategy #2 – Scalarize single light indices

As I have done before, I’ll drop here a gif of the code executing step by step and then we’ll dig into it in more details. Note that [Sousa et Geffroy 2016] don’t provide code, so this is my implementation for it, hopefully it is correct 🙂

Kapture 2018-11-08 at 21.41.05
Conceptually this is the equivalent of flattening all the cells content into a single list and let all the threads go through that list. How do we achieve this? At every step, all threads in the wave process the unprocessed element with lowest index among all the ones that needs processing in any of the threads of said wave.

Let’s take a closer look.

uint lightOffset = 0;
while(lightOffset < lightCount)

lightOffset is our index through the cell light list associated with the current thread. We’ll loop until we processed all the lights in the list.

uint v_lightIdx = GetLightIdx(v_lightStart, lightOffset);
uint s_lightIdx = WaveActiveMin(v_lightIdx);

This is where we select which light the wave, as a whole, will process. We pick the first non-processed light index for each thread and then we select the minimum between them. This will give us an index in SGPR with which we can perform scalar loads!
Be careful and keep in mind that WaveActiveMin ignores inactive lanes, this is important, see note 2 if you are actually going to implement this for an important detail.


if(s_lightIdx == v_lightIdx)
{
     v_lightOffset++;
     LightData s_light = Lights[s_lightIdx];
     ProcessLight(s_light);
}

Obviously we want to shade with the light only if we need to! If current thread first unprocessed light has the same index as the minimum index across the whole wave, we will go ahead and process such light that is now in SGPRs!

All threads that had a light processed will advance in their list together by incrementing their v_lightOffset. This way, s_lightIdx will not be processed again.
If the light index for a thread does not correspond to the minimum in the wave, then we keep pointing at the same element for the next iteration of the loop and we branch over the shading and load.

YAY! We have a scalarised loop and we process each light only once! Success… right?
Right! But there are cases in which we can do even better.

Strategy #2 bis – Adding a fast path

The WaveActiveMin() we extensively used in the previous section is unfortunately not the cheapest of the operations. It maps quite a bunch of lane swizzles and min operators.

Remember one of our coherency assumptions? Pixels in a wave are all likely to access the same cell.

What if all pixels end up accessing the same cell? We can go even faster in this case!

uint v_cellIdx = GetCellIdx();
uint s_firstLaneCellIdx = WaveReadFirstLane(v_cellIdx);

// A mask marking threads that process
// the same cell as the first active thread.
ulong laneMask = WaveBallot(v_cellIdx == s_lane0CellIdx);
// True if all the active lanes need to process the same cell
bool fastPath = (laneMask == WaveBallot(true));

if(fastPath)
{
     // This will load lights in SGPRs since cell address is scalar!
     ProcessLightsInCell(s_firstLaneCellIdx)
}
else
{
    Slower-scalarization-path
}

So here you go, we now have our loop scalarized and we even have a fast path for a very common case!

What next? Now that we went together through a couple of examples, I strongly suggest you to go through the rest of [Drobot 2017], it is truly a wonderful presentation (and if you are a registered xbox developer, check through XFest archive 🙂 ). An implementation of its approach can be found in the excellent deferred texturing sample by Matt Pettineo.

The world is your oyster! Seek for scalarization opportunities in your code!

So that is it! I hope this shed some light, if something didn’t come across clear enough please let me know on twitter or in the comments.

‘Till the next one!

– Francesco (@FCifaCiar)

Notes

[0] – Tiled deferred doesn’t have the same problem as we could align waves and tiles, resulting in coherent reads. However, Clustered deferred could still exhibit divergence as threads in a wave could access different clusters on the z axis.

[1] – Keep in mind that by “evaluate” or “process” I include times in which a thread is inactive, but needs to go through the code as other threads in the wave are active on the same code path (remember from part 1 that threads in a wave go in lockstep 🙂 ). Do not worry, you’ll not end up shading with the same light multiple times.

[2] – Pixel shaders always run 2×2 shading quads to allow for gradient operations (ddx, ddy, mip selection etc.) to be possible. To enforce this, if a lane is inactive at the shader entry, but a neighbouring lane in the 2×2 quad is active, then what is called “whole quad mode” (WQM) is enabled.
This mode will have those needed, but inactive, lanes assume the role of “helper lanes”.
Now, helper lanes will go ahead and execute the code like other lanes but their result will be discarded at the end. So, we will have a vector light index for those, but WaveActiveMin() will not consider them as technically they are considered inactive.
What does it mean for us? Well, assume a helper lane has a light index that is not in any other of the active lanes, since WaveActiveMin will never set s_lightIdx to such index, the helper lane will never pass the following:

if(s_lightIdx == v_lightIdx)
{
v_lightOffset++;
......

And therefore they will be stuck in the loop forever, causing a hang! To avoid this, just change the check to

if(s_lightIdx >= v_lightIdx)

Similarly, note that if a helper lane is the only lane that is still going, then WaveActiveMin will return -1 !

I omitted this adjustment from the step-by-step execution as it could have made things slightly more confusing to parse. Now that you know, remember these details when implementing the algorithm, or you could end up with pesky GPU hangs.

9 thoughts on “Intro to GPU Scalarization – Part 2 -Scalarize all the lights”

Intro to GPU Scalarization – Part 1 – Dreams of flashy pixels (on a white website) says:

[…] Intro to GPU Scalarization – Part 2 -Scalarize all the lights […]

LikeLiked by 1 person

November 11, 2018 at 8:02 am Reply
didi says:

Great series of articles! Thank you!

Btw, the link to Part 1 at the beginning is not working.

LikeLike

November 11, 2018 at 5:37 pm Reply
1. Francesco Cifariello Ciardi says:
  
  Thank you! I should’ve fixed it now
  
  LikeLike
  
  November 11, 2018 at 7:02 pm Reply
PAOK says:

Bolches yarboclos.

LikeLike

April 1, 2019 at 2:21 am Reply
woodypwx says:

Awesome tutorial, thank you!

LikeLiked by 1 person

October 12, 2019 at 9:34 pm Reply
reg1773 says:

“bool fastPath = (laneMask == ~0);”
What if some lanes are already not active? Shouldn’t you rather compare with WaveBallot(true) instead of full mask ~0?

LikeLiked by 1 person

August 19, 2020 at 1:22 pm Reply
1. Francesco Cifariello Ciardi says:
  
  Indeed I believe you are right, I am out in vacation so editing via phone feels dodgy, but will soon make the correction. Thanks for the catch!
  
  LikeLiked by 1 person
  
  August 19, 2020 at 4:29 pm Reply
Andrew H. says:

I think the strategies you posted might have a problem with infinite loops when executed in a pixel shader, because of helper lanes? A helper lane might be in a different cluster than every other pixel, and have a different list of lights. In that case, all the active lanes might terminate but the helper lanes won’t, because WaveActiveMin() doesn’t include the helper lanes, so s_lightIdx == v_lightIdx will never be true. At least that was a problem that I ran into, and adding a check for WaveActiveAnyTrue fixed it.

LikeLike

June 25, 2023 at 3:34 am Reply
1. Andrew H. says:
  
  And… I just realized you wrote that at the very bottom. Sorry!
  
  LikeLike
  
  June 25, 2023 at 4:44 am Reply