Intro to GPU Scalarization – Part 1

Note: The following posts are going to be very GCN specific. I hadn't made that clear enough in the first version, hence this PSA at the top of the post! Sorry if you've already read it and this created confusion.

At work, I recently found myself scalarizing a light loop, and it was quite fun, so when I chatted with a friend this weekend about how the new work was going, I mentioned it. Doing so, I realised that, at least to my knowledge, there is no good "for dummies" document that can help a newcomer understand the process.

So hey, I am going to try something myself to fill the gap. By no means is this going to be an advanced post; my goal is to have something on the internet that can take someone by the hand through the process in a short, yet hopefully easy to understand, way. I'll split it into two parts:

Part 1 – Introduction to concepts and simple example.
Part 2 – Scalarizing a forward+ light loop.

If you want to follow the step-by-step gifs at your own pace: this is the pptx I used to make them.


Wavefronts

If you care about this topic at all, I assume you already know something about how GPUs work, but I'll quickly go through some basics needed to understand the rest of the post. Skip this section if you already know what a wave is and how the threads in one work.
For a much more thorough introduction to the GPU execution model, I refer you to this excellent series by Matthäus Chajdas.

GPUs are massively parallel processors, processing an incredible amount of data at the same time. A crucial aspect is that, most of the time, we have a lot of threads (or pixels, if you prefer to think about pixel shaders) running the same shader. To exploit this, threads are batched in groups that we'll call wavefronts or waves (or warps in NVIDIA lingo). Like the name, the number of threads in a wave is architecture dependent: 32 on NVIDIA GPUs, 64 on AMD's GCN, and variable on Intel cards.

[Figure: threads batched together into a wave]

Note that each thread in a wave can also be called a lane.

All the threads in a wave run the same shader in lockstep. This is why, if you have a branch that is divergent within a wave, you end up paying the cost of both paths.

[Gif: a wave executing both sides of a divergent branch, with the exec mask disabling inactive lanes]

You can see in the gif something called the exec mask. This is a bitmask with one bit per thread in the wave. If the bit for a thread is set to 1, the thread (or lane) is active; if set to 0, it is considered an inactive lane, and whatever gets executed on that lane is disregarded. This is why each thread ends up with the correct result even though both branches are executed.
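To make this concrete, here is a minimal sketch of such a branch (the function, names and branch condition are all made up for illustration):

// v_materialID varies per pixel, so it lives in a VGPR and the branch
// below can diverge within a wave.
float3 ShadePixel(uint v_materialID, float3 v_albedo)
{
    float3 color;
    if (v_materialID == 0)
    {
        // "Expensive" path stand-in: when a wave contains both material
        // types, every lane steps through these instructions, with the
        // exec mask disabling the lanes that took the other path...
        color = v_albedo * v_albedo * 2.0f;
    }
    else
    {
        // ...and then every lane steps through the "cheap" path too.
        color = v_albedo * 0.5f;
    }
    return color;
}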

Scalar vs Vector

As the wave executes, each thread of course needs registers. There are two types of registers:

Vector registers (VGPRs): used for anything whose value can diverge between the threads in a wave. Most of your local variables will probably end up in VGPRs.

Scalar registers (SGPRs): anything that is guaranteed to have the same value for all threads in a wave can be put in these. An easy example is values coming from constant buffers.

Here are a few examples of what goes in SGPRs and what in VGPRs (check note [0] at the bottom of the post):


struct SomeData
{
    float someField;
};

cbuffer MyValues
{
    float aValue;
};

Texture2D<float4> aTexture;
StructuredBuffer<SomeData> aStructuredBuffer;

float4 main(float4 svPosition : SV_Position) : SV_Target
{
    uint2 pixelCoord = uint2(svPosition.xy);

    // This will be in an SGPR
    float s_value = aValue;

    // This will be put in VGPRs via a VMEM load, as pixelCoord is in VGPRs
    float4 v_textureSample = aTexture.Load(int3(pixelCoord, 0));

    // This will be put in SGPRs via a SMEM load, as the address 0 is constant.
    SomeData s_someData = aStructuredBuffer.Load(0);

    // This should be an SALU op (output in SGPR) since both operands are in SGPRs
    // (see note [0])
    float s_someModifier = s_value + s_someData.someField;

    // This will be a VALU op (output in VGPR) since one operand is in a VGPR.
    float4 v_finalResult = s_someModifier * v_textureSample;

    return v_finalResult;
}

As I annotated in the code, depending on what registers the operands are in, arithmetic instructions are executed on different units: SALU or VALU. Similarly, there are both vector memory ops and scalar memory ops, depending on whether the address is in an SGPR or a VGPR (with some exceptions).

Now, why does this matter, you ask? Quite a few reasons; most importantly:

  • VGPRs are often the limiting resource for occupancy; the more we keep in SGPRs, the more we reduce VGPR pressure, which often results in increased occupancy (check note [1] for a few more details if hearing about occupancy confused you).
  • Scalar loads and vector loads have different caches (the scalar cache has lower latency); SMEM and VMEM are different paths, and it's good not to pile all the loads onto VMEM, to avoid longer waits for operands.
  • Making some diverging paths coherent among threads can be beneficial. For example, note how in the gif in the previous section both the expensive and cheap branches are executed by all threads.

So, well, this whole scalar deal sounds great, right? It really is! We should leverage scalar units and registers as much as possible, and sometimes we need to help the compiler do so.

Enter scalarization…

Getting wave invariant data

How do we force the use of the scalar unit? Well, we need to operate on wave-invariant data, of course.
Sometimes this can be achieved by operating on data that we know is going to be scalar (e.g. SV_GroupID with thread groups whose size is a multiple of the wave size), as in the sketch below. Sometimes, though, we really need to ensure that we operate on wave-invariant data ourselves.
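For instance, here is a minimal compute sketch (the buffer and values are made up) where, with 64 threads per group on GCN, each group maps exactly onto one wave, so SV_GroupID is the same for every lane:

RWStructuredBuffer<float> g_output;  // hypothetical output buffer

[numthreads(64, 1, 1)]  // 64 = GCN wave size, so one group == one wave
void main(uint3 groupID : SV_GroupID,
          uint3 dispatchID : SV_DispatchThreadID)
{
    // groupID is identical for all 64 lanes of the wave: it is
    // wave invariant, so it (and anything derived only from it)
    // can live in SGPRs.
    float s_groupValue = (float)groupID.x;
    g_output[dispatchID.x] = s_groupValue;
}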

When we need to guarantee wave invariance ourselves, we can leverage wave intrinsics! Wave intrinsics allow us to query information and perform operations at the wave level. What do I mean, you ask? Let me give you a few examples; it will make things much clearer (note that there are many more):

  • uint WaveGetLaneIndex(): returns the index of the lane within the current wave (in a VGPR, of course).
  • uint4 WaveActiveBallot(bool): returns a 64-bit mask containing the result of the passed predicate for all the active lanes. This mask will be in SGPRs.
  • bool WaveActiveAnyTrue(bool): probably implemented using ballot, it returns whether the passed predicate is true for any active lane. Result is in an SGPR.
  • bool WaveActiveAllTrue(bool): probably implemented using ballot, it returns whether the passed predicate is true for all active lanes. Result is in an SGPR.
  • <type> WaveReadLaneFirst(<type>): returns the value of the passed expression for the first active lane in the wave. Result is in SGPRs.
  • <type> WaveActiveMin(<type>): returns the minimum value of the passed expression across all active lanes in the wave. Result is in SGPRs.

Current-gen consoles expose these intrinsics, and so does Shader Model 6.0. If you are working on consoles, not all the SM6 intrinsics may be exposed as-is, but you can replicate the behaviour of any of them with what's available.
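As a small taste of how these can be used (a sketch with hypothetical light data), reading a potentially divergent index from the first active lane makes it wave invariant, so the subsequent load can go through the scalar path:

struct LightData { float3 position; float intensity; };
StructuredBuffer<LightData> g_lights;  // hypothetical buffer of lights

LightData LoadLightWaveInvariant(uint v_lightIndex)
{
    // v_lightIndex may differ per lane (VGPR). Reading it from the
    // first active lane yields a wave invariant value (SGPR)...
    uint s_lightIndex = WaveReadLaneFirst(v_lightIndex);
    // ...so this load can go via SMEM into SGPRs.
    return g_lights[s_lightIndex];
}

Of course, this alone only returns the correct light for the lanes whose index happens to match the first lane's; looping until every lane has been served is exactly what we'll do in Part 2.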

Let’s look at how we can use wave intrinsics to improve the example we had before:

[Gif: the divergent branch from before, scalarized so that the whole wave takes the expensive path together]
Note that this is a valid optimization only if both the fast and slow paths produce an acceptable result for the thread.

Notice how, by executing the more expensive path for all the threads, we actually end up executing fewer instructions than we would have with divergent paths. This alone could be a good win!
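In code, the transformation could look something like this sketch (same made-up shading function as before), using WaveActiveAnyTrue to make the branch condition wave-uniform:

float3 ShadePixelScalarized(uint v_materialID, float3 v_albedo)
{
    float3 color;
    // The predicate is now the same for the whole wave (SGPR), so the
    // branch cannot diverge: if ANY active lane needs the expensive
    // path, ALL lanes take it, and the cheap path is skipped entirely.
    if (WaveActiveAnyTrue(v_materialID == 0))
    {
        color = v_albedo * v_albedo * 2.0f;  // expensive path
    }
    else
    {
        color = v_albedo * 0.5f;             // cheap path
    }
    return color;
}

As the note above says, this is only valid because the expensive path produces an acceptable result for the lanes that would otherwise have taken the cheap one.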

Yay, we did our first scalarization! Now it's time to look at something more practical and a tad more complex. Join me in the second part of this series, where we are going to scalarize a forward+ light loop.

‘Till the next one!

– Francesco (@FCifaCiar)


Notes

[0] Note that actual code may not end up as I marked it in the given context, for various reasons. In particular, what I marked as SALU is going to be VALU, as there are no floating-point instructions in the SALU ISA (thanks to Andy Robbins in the comment section). Consider each instruction in isolation when matching it with the comments.

[1] GPUs have a crazy amount of bandwidth, but the latency of memory operations is also very high! To compensate for this, a GPU can suspend the execution of a wave that is waiting for data from memory and switch to another wave that can execute in the meantime. This is what you read about as "latency hiding".

[Gif: an example with an occupancy of 2]

Note in the example above how, instead of waiting for 5 units, by switching to another wave we managed to wait for only 2 units.

The number of waves we can switch back and forth between is limited. This number is what you see referred to as "occupancy", the maximum being 10. There are a few limiting factors determining occupancy; most of the time it is the number of VGPRs used by the shader (we have a finite number of them) and the amount of LDS used. So, since scalarizing often means reducing VGPR usage, it often translates to better occupancy.
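To put rough numbers on this (GCN figures; VGPR allocation granularity makes the real rule slightly coarser): each SIMD has 256 VGPRs per lane, so a shader using 40 VGPRs caps us at floor(256 / 40) = 6 waves in flight, while trimming it down to 32 VGPRs would allow 8.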

I am skipping over lots of important bits; way more details and more analysis on this topic can be found in this great GPUOpen post by Sebastian Aaltonen.
