Intro to GPU Scalarization – Part 2 – Scalarize all the lights

Note: The following posts are going to be very GCN specific. I hadn't made that clear enough in the first version of these posts, hence this PSA at the top! Sorry if you've already read them and this created confusion.

This is the second part of my 2-part short series on scalarization. It assumes you either read the first part or that you know what a wavefront is and what SGPRs/VGPRs, SALU/VALU and SMEM/VMEM are, that you are aware of wave intrinsics, and that you have an idea of what scalarization is.

Just a reminder: this mini-series is targeted at people approaching scalarization for the first time, so it won't contain any groundbreaking new techniques 🙂

Before I start: if the pace of the gifs I added to the post is off for you, this is the pptx file I used to make them. This way you can follow the steps at your own leisure!

Part 1 – Introduction to concepts and simple example.
Part 2 – Scalarizing a forward+ light loop.

The problem

For quite a while now, tiled and clustered data structures have been the industry standard for lighting. In both cases we assign the lights in our scene to cells (screen-aligned tiles or 3D clusters) and then, for each pixel, we identify which cell contains the relevant lights and perform the shading.

Another common trend is to use forward shading; however, this comes with a performance problem: pixels within the same wave might have to access different sets of lights.

The orange object is 8×8 in size and touches 4 different 8×8 tiles.

That of course means that we will have divergent loads of light data within a wave, and from what we learned in the previous post of the series, that is suboptimal (see note [0]). But hey! We could scalarize!

Let’s have a look at a sample light loop, non-scalarized yet, with three divergent threads:

From now on we make the assumption that the sublists of lights are sorted.

Everything is going to be in VGPRs and all the loads are going to be vector loads! Not great.
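To make this concrete, here is a sketch of what such a divergent loop might look like; the cell and light lookup helpers (GetCellIdx, CellLightStart, etc.) are hypothetical stand-ins, not the exact code from the gif:

```hlsl
// Each pixel fetches its own cell: v_cellIdx is divergent (VGPR).
uint v_cellIdx = GetCellIdx();

// Divergent addresses: these are vector (VMEM) loads into VGPRs.
uint v_lightStart = CellLightStart[v_cellIdx];
uint v_lightCount = CellLightCount[v_cellIdx];

for (uint i = 0; i < v_lightCount; ++i)
{
    // The light index differs per lane, so the light data is
    // also loaded through VMEM into VGPRs.
    uint v_lightIdx = LightIndices[v_lightStart + i];
    LightData v_light = Lights[v_lightIdx];
    ShadeLight(v_light);
}
```

Every load here flows through the vector memory path, which is exactly what the rest of the post sets out to fix.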

Considering a typical scene, we could say the following:

  • Pixels in a wave are all likely to access the same cell.
  • Nearby cells are likely to have similar content.

Both of these considerations scream coherency!

Strategy #1 – Looping clusters and lane masking

We noticed how most pixels will access the same cell; that also means that if we do have divergence, it's unlikely that we will hit many different cells within a wave. So? So we could scalarize the cell index and, for each thread, loop through all the cells touched by the wave.
Let’s first see how (from [Drobot 2017]) and then we’ll analyse what’s going on:

Note that ulong here is a 64-bit uint, not natively supported in HLSL, so check how to handle it on your platform 🙂 Also note that in the visual example I only have 16 threads instead of 64.
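Since the code itself lives in the gif, here is my reconstruction of the full loop in one piece; the cell and light fetch helpers are hypothetical, but the naming matches the snippets dissected below:

```hlsl
uint v_cellIdx = GetCellIdx();                   // Divergent cell index (VGPR)
ulong execMask = ~ulong(0);                      // Our own exec mask: all lanes alive
uint v_laneID = WaveGetLaneIndex();
ulong curLaneMask = ulong(1) << ulong(v_laneID); // Bit set for this lane only

while ((execMask & curLaneMask) != 0)
{
    // Scalar cell index, read from the first active lane (SGPR).
    uint s_cellIdx = WaveReadFirstLane(v_cellIdx);
    // Which lanes need this very cell?
    ulong laneMask = WaveBallot(v_cellIdx == s_cellIdx);
    // Those lanes are being served now: clear their bits.
    execMask = execMask & ~laneMask;

    if (s_cellIdx == v_cellIdx)
    {
        // The cell address is scalar, so these loads go through SMEM into SGPRs.
        uint s_lightStart = CellLightStart[s_cellIdx];
        uint s_lightCount = CellLightCount[s_cellIdx];
        for (uint i = 0; i < s_lightCount; ++i)
        {
            LightData s_light = Lights[LightIndices[s_lightStart + i]];
            ShadeLight(s_light);
        }
    }
}
```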

Intimidating? It might be a bit! Hopefully the step-by step execution helps, but let’s dig deeper to see what’s going on by having a closer look at parts of the code:

ulong execMask = ~ulong(0);

In the previous post I mentioned the concept of the exec mask. Here we are creating our own exec mask to determine which threads still need to process a cell. Since we are at the start of the loop, we mark all threads as “alive”, hence all 64 bits set (note that a literal 0xffffffff would only cover 32 of the 64 lanes).

uint v_laneID = WaveGetLaneIndex();
ulong curLaneMask = ulong(1) << ulong(v_laneID);

This bit is simple: we are creating a mask that has a 0 for every bit except the n-th bit for the n-th lane. E.g. lane 3 will have the mask 0000000…001000.

while( ( execMask & curLaneMask ) != 0 )

This is our loop condition. Plainly put, a lane will stop when its bit in the exec mask becomes unset. What it means is that we are going to loop in a scalar fashion through all the cells, and the current thread will stop doing so as soon as its cell is processed.

uint s_cellIdx = WaveReadFirstLane(v_cellIdx);
ulong laneMask = WaveBallot(v_cellIdx == s_cellIdx);

This is a crucial bit. We first get the cell index from the first active lane; this will be in an SGPR. Then, via ballot, we create a mask telling us which other threads need to process lights from the cell of the first active thread…

execMask = execMask & ~laneMask;

By ANDing with the negation of the laneMask we just created, we are essentially saying: “These threads are about to process the cell they need, so we’ll be done with them”. Since we are done with them, we zero their bits in the exec mask so that they will exit the loop at the next iteration.

if( s_cellIdx == v_cellIdx )

And finally, if the current thread indeed needed to process the same cell as the first active thread, we process that cell!
Now, since the cell address is scalar, the cell content can be loaded into SGPRs via scalar loads. Hurray! We did scalarize the loop!

Now that we went through it bit by bit, let’s wrap it up at a high level in words again:

Until its relevant cell is processed, each thread reads the cell index of the first active thread. If this coincides with its own cell index, the thread processes the cell and exits the loop, ceasing to be active and waiting for the rest of the wave to complete the same process. Since all threads go through all the cells in the same order, the process is scalar.

If it is still unclear, I suggest you follow the step-by-step gif again. Use the powerpoint to go at your own pace 🙂

Can we do better? Possibly, yes.
The problem here is that we could end up doing lots of redundant work.
Remember our coherency assumptions? In particular: “Nearby cells are likely to have similar content.”.
Since we are scalarizing at a cell level, we are processing the whole content of each of the cells that touch any pixel in the wave. That means that if, for example, light number 42 appears in three cells that touch a wave, we are going to process it three separate times.


See note [1] at the bottom for some clarifications on what I mean by “process three separate times”.

What if we scalarize per light rather than per cell? This is what [Drobot 2017] goes on to show, but to touch more sources and more diverse techniques I’ll follow an example from DOOM [Sousa and Geffroy 2016].

Strategy #2 – Scalarize single light indices

As I have done before, I’ll drop here a gif of the code executing step by step and then we’ll dig into it in more detail. Note that [Sousa and Geffroy 2016] don’t provide code, so this is my implementation of it; hopefully it is correct 🙂

Conceptually, this is the equivalent of flattening all the cells’ contents into a single list and letting all the threads go through that list. How do we achieve this? At every step, all threads in the wave process the unprocessed element with the lowest index among all the ones that need processing in any of the threads of said wave.
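Put together, my reconstruction of the whole loop looks like this (again, the list helpers are hypothetical):

```hlsl
uint v_lightStart = CellLightStart[v_cellIdx];   // Divergent list start (VGPR)
uint lightCount   = CellLightCount[v_cellIdx];   // Lights in this thread's cell
uint lightOffset  = 0;

while (lightOffset < lightCount)
{
    // First unprocessed light index for this thread (VGPR).
    uint v_lightIdx = GetLightIdx(v_lightStart, lightOffset);
    // Minimum across the wave: a wave-invariant index (SGPR).
    uint s_lightIdx = WaveActiveMin(v_lightIdx);

    // See note [2]: in a real pixel shader use >= instead of ==
    // so helper lanes cannot get stuck and hang the wave.
    if (s_lightIdx == v_lightIdx)
    {
        // Scalar index: the light data lands in SGPRs via a scalar load.
        LightData s_light = Lights[s_lightIdx];
        ShadeLight(s_light);
        lightOffset++;   // This light is done for this thread.
    }
}
```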

Let’s take a closer look.

uint lightOffset = 0;
while(lightOffset < lightCount)

lightOffset is our index into the cell light list associated with the current thread. We'll loop until we have processed all the lights in the list.

uint v_lightIdx = GetLightIdx(v_lightStart, lightOffset);
uint s_lightIdx = WaveActiveMin(v_lightIdx);

This is where we select which light the wave, as a whole, will process. We pick the first unprocessed light index for each thread and then select the minimum among them. This gives us an index in an SGPR with which we can perform scalar loads!
Be careful and keep in mind that WaveActiveMin ignores inactive lanes. This is important: see note [2] for a detail you must handle if you are actually going to implement this.

if(s_lightIdx == v_lightIdx)
LightData s_light = Lights[s_lightIdx];

Obviously we want to shade with the light only if we need to! If the current thread’s first unprocessed light has the same index as the minimum across the whole wave, we go ahead and process that light, which is now in SGPRs!

All threads that had a light processed advance in their lists together by incrementing their lightOffset. This way, s_lightIdx will not be processed again.
If the light index for a thread does not correspond to the minimum in the wave, then we keep pointing at the same element for the next iteration of the loop and we branch over the load and shading.


YAY! We have a scalarised loop and we process each light only once! Success… right?
Right! But there are cases in which we can do even better.

Strategy #2 bis – Adding a fast path

The WaveActiveMin() we used extensively in the previous section is unfortunately not the cheapest of operations: it maps to quite a few lane swizzles and min operations.

Remember one of our coherency assumptions?  Pixels in a wave are all likely to access the same cell.

What if all pixels end up accessing the same cell? We can go even faster in this case!

uint v_cellIdx = GetCellIdx();
uint s_firstLaneCellIdx = WaveReadFirstLane(v_cellIdx);

// A mask marking threads that process
// the same cell as the first active thread.
ulong laneMask = WaveBallot(v_cellIdx == s_firstLaneCellIdx);
// True if all the lanes need to process the same cell.
bool fastPath = (laneMask == ~ulong(0));

if (fastPath)
{
    // The cell index is wave invariant, so no WaveActiveMin is needed:
    // this will load lights in SGPRs since the cell address is scalar!
}

So here you go, we now have our loop scalarized and we even have a fast path for a very common case!

What next? Now that we went together through a couple of examples, I strongly suggest you go through the rest of [Drobot 2017]; it is truly a wonderful presentation (and if you are a registered Xbox developer, check the XFest archive 🙂). An implementation of this approach can be found in the excellent deferred texturing sample by Matt Pettineo.

The world is your oyster! Seek out scalarization opportunities in your code!

So that is it! I hope this shed some light; if something didn’t come across clearly enough, please let me know on twitter or in the comments.

‘Till the next one!

– Francesco (@FCifaCiar)


[0] – Tiled deferred doesn’t have the same problem, as we can align waves and tiles, resulting in coherent reads. However, clustered deferred could still exhibit divergence, as threads in a wave could access different clusters along the z axis.

[1] – Keep in mind that by “evaluate” or “process” I include times in which a thread is inactive but needs to go through the code because other threads in the wave are active on the same code path (remember from part 1 that threads in a wave run in lockstep 🙂). Do not worry, you’ll not end up shading with the same light multiple times.

[2] – Pixel shaders always run 2×2 shading quads to allow gradient operations (ddx, ddy, mip selection etc.) to be possible. To enforce this, if a lane is inactive at shader entry but a neighbouring lane in its 2×2 quad is active, what is called “whole quad mode” (WQM) is enabled.
This mode has those needed, but inactive, lanes assume the role of “helper lanes”.
Now, helper lanes go ahead and execute the code like the other lanes, but their results are discarded at the end. So we will have a vector light index for them, but WaveActiveMin() will not consider them, as technically they are considered inactive.
What does it mean for us? Well, assume a helper lane has a light index that no active lane has: since WaveActiveMin will never set s_lightIdx to that index, the helper lane will never pass the following check:

if(s_lightIdx == v_lightIdx)

And therefore it will be stuck in the loop forever, causing a hang! To avoid this, just change the check to

if(s_lightIdx >= v_lightIdx)

Similarly, note that if a helper lane is the only lane still going, WaveActiveMin will return -1!

I omitted this adjustment from the step-by-step execution as it could have made things slightly more confusing to parse. Now that you know, remember these details when implementing the algorithm, or you could end up with pesky GPU hangs.



Intro to GPU Scalarization – Part 1

Note: The following posts are going to be very GCN specific. I hadn’t made that clear enough in the first version of these posts, hence this PSA at the top! Sorry if you’ve already read them and this created confusion.

At work, I recently found myself scalarizing a light loop, and it was quite fun, so when I had a chat with a friend this weekend about how the new job was going, I mentioned it. Doing so I realised that, at least to my knowledge, there is no “for dummies” document that can help a newcomer understand the process.

So hey, I am going to try to fill the gap myself. By no means is this going to be an advanced post; my goal is to have something on the internet that can take someone by the hand through the process in a short, yet hopefully easy to understand, write-up. I’ll split it into two parts:

Part 1 – Introduction to concepts and simple example.
Part 2 – Scalarizing a forward+ light loop.

If you want to follow the step-by-step gifs at your own pace: this is the pptx I used to do them.


If you even care about this topic I assume you know something about how GPUs work, but I’ll quickly go through some basics needed to understand the rest of the post. Skip this section if you already know what a wave is and how threads in one work.
For a much more thorough introduction to the GPU execution model I refer you to this excellent series by Matthäus Chajdas.

GPUs are massively parallel processors, processing an incredible amount of data at the same time. A crucial aspect is that, most of the time, we have a lot of threads (or pixels, if you prefer to think about pixel shaders) running the same shader. To deal with this, threads are batched in groups that we’ll call wavefronts or waves (or warps in NVIDIA lingo). Like the name, the number of threads in a wave is architecture dependent: 32 on NVIDIA GPUs, 64 on AMD’s GCN and variable on Intel cards.


Note that each thread in a wave can also be called lane.

All the threads that are in a wave run the same shader in lockstep. This is why if you have a branch that is going to be divergent within a wave, you end up paying the cost of both cases.


You can see in the gif that I have something called an exec mask. This is a bitmask with one bit per thread in the wave. If the bit for a thread is set to 1, the thread (or lane) is active; if set to 0, it is considered an inactive lane, and whatever gets executed on that lane is disregarded. This is why each thread ends up with the correct result even though both branches are executed.

Scalar vs Vector

As the wave executes, each thread of course needs registers. There are two types of registers:

Vector registers (VGPRs): for everything that has values that diverge between threads in a wave. Most of your local variables will probably end up in VGPRs.

Scalar registers (SGPRs): everything that is guaranteed to have the same value for all threads in a wave will be put in these. An easy example is values coming from constant buffers.

Here are a few examples of what goes in SGPRs and what in VGPRs (check note [0] at the bottom of the post):

cbuffer MyValues
{
    float aValue;
};

Texture2D aTexture;
StructuredBuffer<SomeData> aStructuredBuffer;

float4 main(uint2 pixelCoord) : SV_Target
{
    // This will be in a SGPR
    float s_value = aValue;

    // This will be put in VGPRs via a VMEM load as pixelCoord is in VGPRs
    float4 v_textureSample = aTexture.Load(uint3(pixelCoord, 0));

    // This will be put in SGPRs via a SMEM load as 0 is constant.
    SomeData s_someData = aStructuredBuffer.Load(0);

    // This should be an SALU op (output in SGPR) since both operands are in SGPRs
    // (Note, check note [0])
    float s_someModifier = s_value + s_someData.someField;

    // This will be a VALU op (output in VGPR) since one operand is in a VGPR.
    float4 v_finalResult = s_someModifier * v_textureSample;

    return v_finalResult;
}

As I annotated in the code, depending on which registers the operands are in, arithmetic instructions are executed on different units: SALU or VALU. Similarly, there are both vector and scalar memory ops, depending on whether the address is in an SGPR or a VGPR (with some exceptions).

Now, why does it matter, you ask? Quite a few reasons, most importantly:

  • VGPRs are often the limiting resource for occupancy; the more we keep in SGPRs, the more we reduce VGPR pressure, often resulting in increased occupancy (check note [1] for a few more details if hearing about occupancy confused you).
  • Scalar and vector loads have different caches (the scalar one having lower latency); SMEM and VMEM are different paths, and it’s good not to pile all the loads onto VMEM to avoid longer waits for operands.
  • Making some divergent paths coherent among threads can be beneficial. For example, note how in the gif in the previous section both the expensive and cheap branches are executed by all threads.

So, well, this whole scalar deal sounds great, right? It really is! And we should leverage scalar units and registers as much as possible, and sometimes we need to help the compiler to do so.

Enter scalarization…

Getting wave invariant data

How do we force the usage of the scalar unit? Well, we need to operate on wave invariant data, of course.
Sometimes this can be achieved by operating on data that we know is going to be scalar (e.g. SV_GroupID with groups that are a multiple of the wave size). Sometimes, though, we need to ensure ourselves that we operate on wave invariant data.

To do so, we can leverage wave intrinsics! Wave intrinsics allow us to query information and perform operations at a wave level. What do I mean, you ask? Let me give you a few examples, it will make it much clearer (note that there are way more):

  • uint WaveGetLaneIndex(): returns the index of the lane within the current wave (in a VGPR, of course).
  • uint4 WaveActiveBallot(bool): returns a 64-bit mask containing the result of the passed predicate for all the active lanes. This mask will be in SGPRs.
  • bool WaveActiveAnyTrue(bool): probably using ballot, it returns whether the passed predicate is true for any active lane. Result is in an SGPR.
  • bool WaveActiveAllTrue(bool): probably using ballot, it returns whether the passed predicate is true for all active lanes. Result is in an SGPR.
  • <type> WaveReadLaneFirst(<type>): returns the value of the passed expression for the first active lane in the wave. Result is in SGPRs.
  • <type> WaveActiveMin(<type>): returns the minimum value of the passed expression across all active lanes in the wave. Result is in SGPRs.

Current-gen consoles expose these intrinsics, and so does Shader Model 6.0. If you are working on consoles, not all the SM6 intrinsics may be exposed as-is, but you can replicate the behaviour of any of them with what’s available.
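As a tiny, contrived sketch of mine (not from any of the referenced talks), these intrinsics can even be combined to detect wave invariant data at runtime:

```hlsl
// v_value: some per-thread value we suspect is often wave invariant.
uint s_first = WaveReadLaneFirst(v_value);
if (WaveActiveAllTrue(v_value == s_first))
{
    // Every active lane agrees: s_first lives in an SGPR and the
    // compiler can keep work depending on it on the scalar unit.
    result = DoWork(s_first);
}
else
{
    result = DoWork(v_value); // Divergent fallback.
}
```

DoWork is a placeholder; the pattern is what matters: turning a probably-uniform value into a provably-uniform one.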

Let’s look at how we can use wave intrinsics to improve the example we had before:

Note that this is a valid optimization only if both the fast and slow paths produce an acceptable result for the thread.

Notice how, by executing the more expensive path for all the threads, we actually end up executing fewer instructions than we would have done with divergent paths. This alone can be a good win!
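In code, the transformation shown in the gif boils down to something like this (my sketch; ExpensivePath/CheapPath are placeholders):

```hlsl
// Before: divergent branch, both paths execute with lanes masked out.
// if (v_condition) result = ExpensivePath(); else result = CheapPath();

// After: take the expensive path for the whole wave if any lane needs it.
// The branch condition is now scalar, so only one path ever executes.
if (WaveActiveAnyTrue(v_condition))
{
    result = ExpensivePath();
}
else
{
    result = CheapPath();
}
```

Again, this is only valid when the expensive path produces an acceptable result for every thread, including those that would have taken the cheap one.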

Yay we did our first scalarization! Now it is time to look at something more practical and a tad more complex. Join me in the second part of this series, where we are going to scalarize a forward+ light loop.

‘Till the next one!

– Francesco (@FCifaCiar)


[0] Note that actual code may not end up as I marked it in the given context, for various reasons. In particular, I believe that what I marked as SALU is going to be VALU, as there is no floating-point ISA for the SALU (thanks to Andy Robbins in the comment section). Consider the instructions in isolation when matching them with the comments.

[1] GPUs have a crazy amount of bandwidth, but the latency of memory operations is also very high! To compensate for this, a GPU can suspend the execution of a wave that is waiting for data from memory and switch to another wave that can execute in the meantime. This is what you read around as “latency hiding”.

An example with an occupancy of 2.

Note in the example above how instead of waiting for 5 units, by switching to another wave we managed to wait only 2 units.

The number of waves we can switch back and forth between is limited. This number is what you see referred to as “occupancy”, the maximum being 10. There are a few limiting factors determining our occupancy; most of the time it is the number of VGPRs used by the shader (we have a finite number of them) and the amount of LDS used. So, since scalarizing often means reducing VGPR usage, it often translates to better occupancy.

I am skipping over lots of important bits; way more details and more analysis on this topic can be found in this great GPUOpen post by Sebastian Aaltonen.


Derivative of Gaussians steerability

I am writing this short post on a small, interesting, old fact that I found is unknown to some (probably few): derivatives of Gaussian kernels are steerable… wait, stee-what?

Steerable filters

In computer vision and image processing it is not uncommon to find ourselves in need of oriented filters, to analyse responses under different orientations.

We can describe as “steerable filters” [Freeman and Adelson 1991] a class of filters in which we can synthesize a filter for an arbitrary orientation through a linear combination of basis filters. To put this into an equation, if a class of filters is steerable we can write:

$$f^{\theta}(x, y) = \sum_{i=1}^{N} k_i(\theta)\, f_i(x, y)$$

where $f^{\theta}$ is our filter oriented at the angle $\theta$, the various $f_i$ are the filters in the basis, and the factors $k_i(\theta)$ are what we can call “steering functions”.
Not only can we easily create filters with arbitrary orientation, we can actually skip that and directly obtain their responses if we already have the responses of the basis filters. This is because convolution is linear:

$$f^{\theta} * I = \sum_{i=1}^{N} k_i(\theta)\, (f_i * I)$$
This is great! It means that if we need outputs for lots of orientations, we don’t have to compute lots of expensive convolutions, but only as many as the number of filters in our basis.

Derivative of Gaussians steerability

One very useful class of filters that is steerable is the Derivative of Gaussians (DoG).

Derivatives of Gaussians are used a lot in vision/image processing as a way to detect edges (or features in general). I would write more about this, but it would turn this into a lengthier post than I want, so I’ll throw out only two things:

  • They are great to compute gradients and therefore to detect discontinuities. With a large standard deviation we have poor localization (bad at finding the exact location of the edge) but good detection (good at detecting proper edges). With a low standard deviation we have the opposite: good localization, but poor detection.
  • First order derivatives at different scales can be used to approximate a Laplacian of Gaussian (LoG), while the second order can be used to actually compute the Laplacian. LoGs are extremely useful to detect blobs.

This is what the first order derivatives of a Gaussian $G$ look like in the x and y directions:

$$G_x = \frac{\partial G}{\partial x} = -\frac{x}{\sigma^2}\, G(x, y), \qquad G_y = \frac{\partial G}{\partial y} = -\frac{y}{\sigma^2}\, G(x, y)$$
These two nice guys here are one possible choice of basis filters to compute a DoG oriented however we like. But what are the steering functions? It’s really simple: if we want the filter oriented along an angle $\theta$, the steering functions are

$$k_1(\theta) = \cos(\theta), \qquad k_2(\theta) = \sin(\theta)$$

Therefore, using the first equation in the post, our oriented DoG is obtained with:

$$G^{\theta} = \cos(\theta)\, G_x + \sin(\theta)\, G_y$$
I’ve put a short proof of this in Appendix A at the bottom of the post.

With these filters we can now compute, for example, the gradient along a certain direction:

$$\nabla_{\theta} I = G^{\theta} * I = \cos(\theta)\,(G_x * I) + \sin(\theta)\,(G_y * I)$$
which could be useful, for example, in feature detection scenarios, and I am sure in other cases that don’t come to my mind now 🙂
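As a tiny worked example (my own, not from the cited papers): the orientation that maximizes the response at a point can be recovered from the two basis responses alone, with no extra convolutions:

$$\theta^{*} = \operatorname{atan2}\left(G_y * I,\; G_x * I\right)$$

since the response $R(\theta) = \cos(\theta)\,R_x + \sin(\theta)\,R_y$ is maximized when $(\cos\theta, \sin\theta)$ is parallel to $(R_x, R_y)$.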

Note: All of the above extends to higher derivative orders.

Additional use cases and further readings

I wanted to keep this post short, as this is just a nice little property. Even so, I have probably treated it too briefly; there are other interesting facts and use cases. Here is a quick list of possible use cases for steerable filters (not limited to DoG) and links to papers you can read to learn more about them.

  • Feature detection: I have briefly illustrated how to use steerable filters to find edges in a specific direction. Steerable filters are also useful, for example, in junction detection [Perona 92] or in detecting the orientation of an image [Freeman and Adelson 1991]. Plenty of other papers in this field can be found (e.g. [Jacob and Unser 2004] and many many more).
  • Scene relighting: an interesting usage of steerable functions can be seen in surface re-lighting. One of the first examples of steerable light sources is from [Teo et al. 1997]; they steered light sources to relight a static scene, however using a very large (probably impractical) basis.
    Another very interesting (at least historically) example is the steerable illumination textures of [Ashikhmin and Shirley 2002]; they calculate a basis of “baked” illumination, steered to obtain the final result on bumpy surfaces from an arbitrary light position without any cross-fading. They use a smaller basis than [Teo et al. 1997].
  • Image stitching: steerable filters have been used to aid image stitching, where finding the orientation of corners can help. This has been done for “classical” image mosaics [Subramanyam 2013] or with a stronger focus on satellite images [Li et al. 2007].
  • Image filtering: [Freeman and Adelson 1991] shows how to remove noise from an image while enhancing oriented structures. [Gotsman 1994] used steerable Gaussians for texture filtering.
  • Motion detection: steerable filters are also used to aid motion detection algorithms. I don’t have a specific paper to link, but steerable filters are often used, implicitly or not.

That is all! It’s probably not the most interesting or incredibly obscure thing, but I figured I’d post it into the wild nonetheless 🙂 I hope you enjoyed it at least a bit.

’til the next one!

-Francesco  (@FCifaCiar)

Appendix A: Derivation of steerability functions for Derivative of Gaussians

Let’s first express the first order derivative of a Gaussian in the x direction in polar coordinates ($x = r\cos(\phi)$, $y = r\sin(\phi)$):

$$G_x = -\frac{r\cos(\phi)}{\sigma^2}\, G(r)$$

If we then rotate $\phi$ by $\pi/2$ we obviously obtain the derivative in the y direction:

$$G_y = -\frac{r\cos(\phi - \pi/2)}{\sigma^2}\, G(r) = -\frac{r\sin(\phi)}{\sigma^2}\, G(r)$$

Now, let’s rotate the filter by an arbitrary angle $\theta$; using trigonometric identities we obtain:

$$G^{\theta} = -\frac{r\cos(\phi - \theta)}{\sigma^2}\, G(r) = -\frac{r\left(\cos(\phi)\cos(\theta) + \sin(\phi)\sin(\theta)\right)}{\sigma^2}\, G(r) = \cos(\theta)\, G_x + \sin(\theta)\, G_y$$

which is what I mentioned before 🙂


Bilateral Mesh Denoising

A few days ago, while looking for an asset in my archive, I resurfaced all sorts of interesting things from the past. One thing that brought back lots of memories was a mesh I scanned some time ago. Not surprisingly for (poorly 😦) scanned data, it was very noisy, so I dug up my snippets on mesh denoising and figured I’d write a small and quick blog post.

Bilateral filter

Generally, smoothing (or blurring) filters are good to reduce the high frequency components of our data, and noise is most of the time unwanted high frequency information. But blindly reducing high frequencies is not the greatest idea: we would lose important features of our data that are not noise.

To smooth images whilst preserving edges, the seminal paper [Tomasi and Manduchi 1998] introduced the bilateral filter, which has since become a fundamental tool for anyone dealing with image filtering. The main idea is to have a kernel that adapts itself based on the content it is filtering; let’s first write down the original formula:

$$\hat{I}(u) = \frac{\sum_{p \in N(u)} G_s(\lVert p - u\rVert)\, G_r(\lvert I(p) - I(u)\rvert)\, I(p)}{\sum_{p \in N(u)} G_s(\lVert p - u\rVert)\, G_r(\lvert I(p) - I(u)\rvert)} \qquad (1)$$

Here $p$ represents a point in the neighbourhood $N(u)$ of $u$ (the pixel we are filtering), $I$ is the image data and the $G$ are Gaussian kernels. $G_s$ is nothing new, it’s just your classical blur kernel; the novel part is all in $G_r$, which is often referred to as the range filter.
This filter provides a scale to the kernel $G_s$: the higher the likelihood of an edge in the image (a big difference between neighbouring values), the smaller the weight becomes. The “minimum” amplitude of an edge to preserve can be tuned by changing the standard deviation of $G_r$: the bigger this standard deviation, the more the bilateral filter converges to a normal Gaussian blur. A good value for it is the mean gradient value of the image.
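As an illustration, here is a minimal sketch of equation (1) as shader code; the texture name, radius and sigmas are hypothetical placeholders, not tuned values:

```hlsl
Texture2D<float> Input;

float BilateralFilter(int2 coord)
{
    const int   radius = 3;    // Hypothetical kernel radius
    const float sigmaS = 2.0;  // Spatial standard deviation (Gs)
    const float sigmaR = 0.1;  // Range standard deviation (Gr)

    float center = Input.Load(int3(coord, 0));
    float sum = 0.0;
    float weightSum = 0.0;

    for (int y = -radius; y <= radius; ++y)
    for (int x = -radius; x <= radius; ++x)
    {
        float sample = Input.Load(int3(coord + int2(x, y), 0));
        // Spatial weight Gs: the classical Gaussian blur term.
        float ws = exp(-(x * x + y * y) / (2.0 * sigmaS * sigmaS));
        // Range weight Gr: shrinks across intensity edges.
        float dr = sample - center;
        float wr = exp(-(dr * dr) / (2.0 * sigmaR * sigmaR));
        sum += ws * wr * sample;
        weightSum += ws * wr;
    }
    return sum / weightSum; // Normalization, as in equation (1)
}
```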


You can see here how the image has been smoothed out, but the strong features are still very well preserved.

This concept of edge-aware filtering has been widely used in videogames too, where it is not uncommon to need to upsample low resolution buffers. Being a rendering engineer in the videogame industry I am very, very interested in these applications, but they are not in the scope of this blog post, so here are some excellent reads: Angelo Pesce’s blog, Bart Wronski’s blog, Padraic Hennessy’s GDC 2016 presentation (but you can find more 🙂).

Behold the third dimension: Bilateral mesh denoising 

The bilateral filter was introduced for images, but nothing stops us from using it in other dimensions, such as on meshes! Two papers independently introduced this idea: [Jones et al. 2003] and [Fleishman et al. 2003]. I have direct experience with the latter, so I am going to refer to that one for the rest of the post.

Now, to apply the 2D algorithm in 3D, we need to map the concepts to the new domain. Obviously the set of vertices is the equivalent of the image, but what about the image values needed to detect features? The main idea is to consider the height of the vertices over their tangent plane as the equivalent of the intensity levels in the image:


Using my mad Paint skills I tried to illustrate the point. In blue we have the plane tangent to the vertex we want to filter, p; in red the vector between p and a neighbour q, i.e. p − q; and finally in orange the projection of p − q onto the normal of p (i.e. the height w.r.t. the tangent plane). I will call this projection h from now on.
This nice orange-y bit is exactly what we need to use the bilateral filter on our noisy mesh! We can in fact assume that in areas with high h values we are more likely to be in the presence of a feature. Similarly, they assume that the noise-free mesh is well approximated by the tangent plane at p, so what we need to find for each vertex v (up to now called p, but I like v better now that we explicitly talk about vertices) is:

$$v' = v - d \cdot n$$

where n is the normal at v and d is the result of the bilateral filter (1) applied to the heights h. As the above implies, the algorithm is applied on a per-vertex basis and allows for multiple iterations.
Now, referring to the formula I mentioned in the previous section for images, in 3D the bilateral filter becomes:

$$d = \frac{\sum_{q \in N(v)} G_s(\lVert v - q\rVert)\, G_r(\lvert h_q\rvert)\, h_q}{\sum_{q \in N(v)} G_s(\lVert v - q\rVert)\, G_r(\lvert h_q\rvert)}, \qquad h_q = \langle n,\, v - q\rangle$$

where N(v) is the neighbourhood of the considered vertex. The size of this neighbourhood is best set in relation to the standard deviation we use for the range filter $G_r$ (the one that penalizes large variations), so that neighbours whose weight would be negligible are excluded.

Think about it keeping in mind the correlation between image intensities and h. In an image, a very strong difference in intensity usually indicates an edge, while noise that we can blur out usually shows up as smaller variations of the intensity. It all makes sense, so let’s apply (1) and DONE!
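Here is my sketch of one denoising step for a single vertex; the flat neighbour list is a hypothetical data layout, and the signs follow the h convention described above (h measured from the neighbour up to v):

```hlsl
// One bilateral denoising step for a single vertex (hypothetical layout).
float3 DenoiseVertex(float3 v, float3 n,
                     StructuredBuffer<float3> neighbours, uint neighbourCount,
                     float sigmaS, float sigmaR)
{
    float sum = 0.0;
    float normalizer = 0.0;
    for (uint i = 0; i < neighbourCount; ++i)
    {
        float3 q = neighbours[i];
        float t = length(v - q);  // Spatial distance, feeds Gs
        float h = dot(n, v - q);  // Height over the tangent plane, feeds Gr
        float ws = exp(-(t * t) / (2.0 * sigmaS * sigmaS));
        float wr = exp(-(h * h) / (2.0 * sigmaR * sigmaR));
        sum += ws * wr * h;
        normalizer += ws * wr;
    }
    // d is the weighted average height of v over its neighbours:
    // moving against it along the normal pulls v back towards the surface.
    float d = sum / normalizer;
    return v - d * n;
}
```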


A few issues are still there. Firstly, we are relying on the normal at the vertex of the noisy mesh, and this can lead to unwanted results. We can solve this by mollifying the normal, that is, averaging the normals of the faces of the 1-ring neighbourhood of v, using the areas of said faces as weights. In rare cases we might need to consider a k-ring neighbourhood when mollifying normals.

This is not the only issue: bilateral filtering is not energy preserving. In images this is not so noticeable, or not so jarring to the eye; in meshes, however, it manifests as shrinking, which can be quite evident. Also, because you will likely need to apply multiple iterations of the algorithm to obtain good results, you are also more likely to create visible and serious shrinkage of the mesh. [Fleishman et al. 2003] mention some volume preservation technique that is either not described or, well, I did not pay enough attention to while implementing the algorithm. I experienced noticeable shrinkage, but for my use case it was an acceptable trade-off, so I did not investigate much further.

Here is a half-decent result I got (displayed with curvature to highlight details and features):


On the left the noisy mesh, in the center the original and on the right the denoised result.
It could probably get better with more parameter tuning, but not too bad.

Damn you box! More problems 

To add to the problems, let's consider a mesh with corner-like features such as the one below (any cuboid would do). If we have noise on the edges, we are in for some trouble!

From [Fleishman et al. 2003]
The noise is not removed because it is wrongly classified as a feature. It's quite clear why if you consider that we are using the heights of the neighbours relative to the tangent plane at the vertex we are filtering (I call these h). On top of a very strong feature, the hs are very big because they are essentially the sum of the genuine big value from the feature and the smaller value caused by the actual noise. The algorithm doesn't know we have extra noise; it just sees that big value and immediately infers an edge to preserve.

Another issue I see is that the algorithm relies on the local tangent plane always approximating the denoised mesh well, since we push the vertices along the normal direction. I don't think this holds well at corners. As a result of this possibly wrong assumption, you can end up smoothing or bending sharp edges.

Note that the cube on the right is the result after lots of unnecessary iterations, shown here only to exaggerate the issue.

Bilateral Normal Filtering

This corners issue was a big deal for me, so I started thinking of solutions. I came to a conclusion that is apparently already exploited by many research papers: normals are a better first-order representation of the surface than the actual vertex positions in this context. With this idea and a bit of googling I found the paper [Lee and Wang 2005], which I used to improve on the basic algorithm.
In short, they still apply bilateral filtering as in [Fleishman et al. 2003], but they filter the normals:


Where i is the index of the considered face normal, N(i) are the neighbouring faces of i, c_i is the centroid of face i and finally d_ij is the equivalent of the h mentioned for version (1). In fact, here variations between normals are detected as the projection of the normal we are filtering onto the vector given by the difference between the normal itself and a sample in the neighbourhood.


Again, this makes sense, as a big d_ij will occur in the presence of features of the mesh.
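A sketch of one filtering step for a single face normal, following the description above (the Gaussian kernels and the spatial weight on centroid distances are my reading of it, so double check against [Lee and Wang 2005]):

```python
import numpy as np

def filter_face_normal(i, normals, centroids, neighbours, sigma_c, sigma_s):
    """One bilateral step for face normal n_i; neighbours are the
    face indices in N(i)."""
    ni, ci = normals[i], centroids[i]
    acc = np.zeros(3)
    for j in neighbours:
        t = np.linalg.norm(centroids[j] - ci)       # spatial term
        d_ij = float(np.dot(ni, ni - normals[j]))   # normal-variation term
        w = np.exp(-t * t / (2 * sigma_c ** 2)) \
          * np.exp(-d_ij * d_ij / (2 * sigma_s ** 2))
        acc += w * normals[j]
    return acc / np.linalg.norm(acc)   # re-normalize the filtered normal
```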

Ok, this is cool, we have nice filtered normals, but I needed vertices! How do we get the denoised mesh? Well, we know that by definition the normal of a face is perpendicular to each edge of that face. This gives us a system of equations (the dot product of each filtered normal n_i with each edge of its face must be zero) that we can solve with least-squares methods.


Where F is the set of neighbouring faces, λ the iteration step size and i* the set of neighbouring vertices. For a full derivation I refer you to [Lee and Wang 2005].
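I no longer have the exact update at hand, but a gradient-descent step consistent with the perpendicularity constraint can be sketched like this; the per-face residual I use, n·(c − v) with c the face centroid, is my simplification, not necessarily the paper's exact weighting:

```python
import numpy as np

def update_vertex(v, filtered_normals, centroids, lam=1.0):
    """One gradient step: move v along each filtered face normal by
    the residual of the 'normal perpendicular to edges' constraint."""
    delta = np.zeros(3)
    for n, c in zip(filtered_normals, centroids):
        delta += n * float(np.dot(n, c - v))   # residual along the normal
    return v + lam * delta / len(filtered_normals)
```

Iterating this over all vertices (and optionally re-filtering normals between iterations) converges towards a mesh consistent with the filtered normals.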

From [Lee and Wang 2005]. Center is the result from [Fleishman et al. 2003], right is the result from filtering normals ([Lee and Wang 2005]).
Still not perfect, but pretty good!

I’d love to better detail this algorithm, but I am afraid the post is already too long. If you are interested in it please let me know and I will write another short post (or edit this one).

This is not cutting-edge denoising tech, nor is it a fair review of all the methods based on bilateral filtering 🙂 I just found this seminal paper interesting and wanted to do a quick and dirty write-up of my very short journey into mesh denoising, during which the methods described above were good enough. Hopefully you found this remotely interesting and will keep on researching the plethora of available evolutions of the method; believe me, with the right keywords you'll find a massive quantity of papers. I might add a few links as an appendix later on.

Thanks for reading!

-Francesco  (@FCifaCiar)

Bilateral Mesh Denoising

Chronicles of a tour in Skin Shading – (Some) Useful biological aspects used in CG

Preamble:  This series of blog posts is a braindump of some parts of my short journey in real time skin shading for my Master thesis, supervised by Dr. Tim Weyrich. 

Part 1: Dipole approximation for dummies 

A part of my studies that fascinated me quite a lot was the biological background used in skin shading. I am not sure if this aspect receives much love in games (let me know if you actively use these concepts!), but I can see lots of potential, especially from an authoring point of view.

As a necessary disclaimer: I am not a biologist and although I love biology as much as any other science, my knowledge is fairly limited, so please correct me if I wrote something stupid 🙂


The main agents responsible for light absorption, and therefore for the skin colour, are chemical compounds called chromophores. There are two influential types of chromophore: melanin and haemoglobin.

From [Donner and Jensen 2006]. I will talk about the scattering structures later in the post.
Melanin is the pigment that most influences skin colour and it is mostly found in the upper layer of the skin, the epidermis.
We can identify two types: eumelanin, a dark brown pigment, and pheomelanin, which has a lighter, reddish-yellow colour.
Intuitively, a high concentration of melanin is found in dark skin, where it is mostly eumelanin; on the contrary, in lighter skins such as Asian or Caucasian ones, there is much less melanin and a higher percentage of pheomelanin.

Haemoglobin, found at the dermis level, is a protein responsible for the transport of oxygen and produces a pink-purple nuance on the skin. Small amounts of haemoglobin correspond to a pale skin colour, while a high concentration makes the skin appear more pink/red.

Chromophores in Computer Graphics

Wait, do we really care about this much biology? As a CG community we know that to correctly represent the world we have to study it thoroughly, so it should come as no surprise that looking at biological parameters can be useful for developing skin appearance models.

[Donner and Jensen 2006] proposed a spectral shading model which permits the manipulation of chromophores via three parameters: haemoglobin fraction, melanin fraction and melanin type blend (which describes the eumelanin/pheomelanin ratio), plus a parameter for skin oiliness for the specular contribution.

Increasing melanin fraction [Donner and Jensen 2006].
These parameters are then embedded in the multipole model [Donner and Jensen 2005]. I roughly described the model in the previous post, but the important bit to remember here is that it describes a multi-layer material where each layer has a characteristic absorption coefficient. The model proposed by [Donner and Jensen 2006] has two layers: the epidermis, with an absorption coefficient dependent on the melanin factors, and the dermis, where the absorption coefficient depends on the haemoglobin fraction. I've put the exact formulas at the bottom of the post.

Increasing haemoglobin fraction [Donner and Jensen 2006].
This is of great importance to me: we can define the biologically-correct diffusion profile for any target we want!

But hey, don’t we use albedo textures? Yes, and they also contain fine scale details we want to preserve. The issue is that albedo textures include the overall colour data, and therefore the melanin and haemoglobin pigments are already embedded. If we want to get our colour from chromophores, we must normalize the albedo colour by dividing each texel by the assumed total diffuse reflectance:


Where R(r) is the diffusion profile we found using the assumed chromophore information.
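As a sketch of this normalization (names and the truncation radius are mine): the total diffuse reflectance is the radial diffusion profile integrated over the plane, R_d = ∫ 2πr R(r) dr, and each albedo texel is divided by it.

```python
import numpy as np

def total_diffuse_reflectance(R, r_max=50.0, n=200000):
    """R_d = integral over the plane of the radial profile R(r),
    truncated at r_max (units must match those of R)."""
    r = np.linspace(1e-4, r_max, n)
    dr = r[1] - r[0]
    return float(np.sum(R(r) * 2.0 * np.pi * r) * dr)

def normalize_albedo(albedo, R):
    """Divide each texel by the assumed total diffuse reflectance."""
    return albedo / total_diffuse_reflectance(R)
```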

This work is taken forward by [Donner et al. 2008], where they introduce more physiological parameters and make them spatially-varying (whereas in [Donner and Jensen 2006] spatial variation is limited to the use of an albedo modulation texture). This paper is truly great; I am not going deeper into it for the sake of brevity, but do check it out, it is well worth a read.


But that is not all. Skin colour varies constantly, and often very rapidly. Think about it: we blush, we look pale, we get red after some exercise, etc. A great model for this variation is proposed by [Jimenez et al. 2010]. Even to a non-biologist like me, it is clear that all the variations I mentioned depend on the blood flow, and what in blood most influences the skin colour? Oh well, our dear haemoglobin of course! So the great contribution of [Jimenez et al. 2010] is the proposal of a colour appearance model with spatially-varying haemoglobin distribution that allows appearance transfer between characters.

The key observation behind this model is that for a wide range of appearances the haemoglobin distribution is close to a Gaussian. Starting from an acquired or painted neutral haemoglobin map, various emotions or states can be modelled simply with a bias and a scale. This way they transform a neutral distribution H0 into any desired combination:


With the same idea as usual blend shapes: each blend shape i has its own set of scale/bias textures, and the weights used for these are the same used for classic animation rigs.
Note that because of the low frequency nature of this data, we can efficiently store scales and biases in low resolution textures. How convenient!
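The blending can be sketched like this (a guess at the exact combination; the scale/bias texture names and the delta-from-neutral formulation are mine, not the paper's):

```python
import numpy as np

def haemoglobin_map(H0, weights, scales, biases):
    """Blend per-shape scale/bias textures on top of the neutral
    haemoglobin map H0, driven by the rig weights."""
    H = H0.astype(float).copy()
    for w, s, b in zip(weights, scales, biases):
        H_i = s * H0 + b          # shape i's full-weight distribution
        H += w * (H_i - H0)       # weighted delta from neutral
    return H
```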

Once we have all our chromophore information, [Jimenez et al. 2010] provides us with a great precomputed table that we can use to extract the albedo colour from melanin and haemoglobin maps/distributions:

From [Jimenez et al. 2010]
I strongly suggest you go and read the full paper; it has lots more details on the model and a great description of the acquisition phase.

SSS and specular from a biological POV

Subsurface Scattering

In the epidermis (i.e. the outermost layer), the cells' membranes, some elements in the nucleus and cytoplasmic organelles are the main scatterers; these scatterers are well separated from one another, therefore the whole epidermis contribution can be approximated with a single scattering event.

From [Igarashi et al. 2005]
Wait! I told you in the dipole post that we can ignore single scattering! Yes, I did, and I still stand my ground. As I said previously in this post, the epidermis is mainly responsible for absorption, but most of the scattering actually occurs in the dermis (the layer under the epidermis). Here the main scatterers are collagen fibres that scatter in the forward direction. These scatterers are so densely packed that the multiple scattering events can be seen as isotropic even if the single fibre scattering events are not. This should explain my claim in the dipole post that it is fine to approximate subsurface scattering as a diffusion phenomenon.

Surface reflection

On skin, surface reflection is primarily caused by a thin layer formed by an oily liquid called sebum and by lipids. The morphology of wrinkles also has an impact on the specular aspect of the skin.
The distribution of these factors changes across the face, causing varying specular properties. A thorough study of such variations can be found in [Weyrich et al. 2006]. For my thesis I used the information in the paper, baking the roughness and specular scaling factor ρs for the Cook-Torrance BRDF into a texture:


Sadly I don’t have great comparison pictures any more; this is the best I am left with:


The left image uses the same parameters (the cheeks’ ones) for the whole face, the right picture uses spatially varying parameters. Note that I think there is something a bit off with the lips in this picture, as they should have a slightly stronger specular.


That’s all from today’s biology lesson 🙂 This is a short overview of some of the many biologically-based skin appearance models, made by a lowly (non-biologist) programmer. I found these aspects fascinating when I studied them, so I hope someone else found this post moderately interesting 🙂

To the next one!

– Francesco (@FCifaCiar)


Absorption coefficients for dermis and epidermis ( [Donner and Jensen 2006] )

The wavelength-dependent absorption coefficients for the two types of melanin (eumelanin and pheomelanin) are:


Then we need to define the baseline absorption coefficient from small tissue in the epidermis as:


So that finally we can define the absorption coefficient for epidermis as:


Where B_m defines the melanin type blend (eumelanin/pheomelanin) and C_m is the melanin fraction.

For the dermis absorption coefficient we have:


Where γ is the blood oxygenation ratio between oxy- and deoxy-haemoglobin; this varies very little, and in [Donner and Jensen 2006] it is fixed at 0.75. Note that haemoglobin has different absorption behaviour depending on whether it is oxygenated or deoxygenated. C_h is the fractional amount of haemoglobin in the dermis.
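From memory, the two coefficients combine the terms above like this (a reconstruction of mine, please double check against [Donner and Jensen 2006]):

```latex
\sigma_a^{epi}(\lambda) = C_m \left( B_m\, \sigma_a^{em}(\lambda)
    + (1 - B_m)\, \sigma_a^{pm}(\lambda) \right)
    + (1 - C_m)\, \sigma_a^{baseline}(\lambda)

\sigma_a^{derm}(\lambda) = C_h \left( \gamma\, \sigma_a^{oxy}(\lambda)
    + (1 - \gamma)\, \sigma_a^{deoxy}(\lambda) \right)
    + (1 - C_h)\, \sigma_a^{baseline}(\lambda)
```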

Reference and Acknowledgements

An obligatory reference not directly mentioned in the post has to go to the technical report The Appearance of Human Skin [Igarashi et al. 2005]. It is a bit long, but it contains lots of interesting information.

[Donner and Jensen 2006]: DONNER, C., AND JENSEN, H. W. 2006. A spectral BSSRDF for shading human skin. In Rendering Techniques 2006, 409–417.

[Donner et al. 2008]: DONNER, C., WEYRICH, T., D'EON, E., RAMAMOORTHI, R., AND RUSINKIEWICZ, S. 2008. A layered, heterogeneous reflectance model for acquiring and rendering human skin. In ACM Trans. Graph. 27, 5, 140:1–140:12.

[Jimenez et al. 2010]: JIMENEZ, J., SCULLY, T., BARBOSA, N., DONNER, C., ALVAREZ, X., VIEIRA, T., MATTS, P., ORVALHO, V., GUTIERREZ, D., AND WEYRICH, T. 2010. A practical appearance model for dynamic facial color. In ACM Trans. Graph. 29, 6, 141:1–141:10.

[Weyrich et al. 2006]: WEYRICH, T., MATUSIK, W., PFISTER, H., BICKEL, B., DONNER, C., TU, C., MCANDLESS, J., LEE, J., NGAN, A., JENSEN, H. W., AND GROSS, M. 2006. Analysis of human faces using a measurement-based skin reflectance model. In ACM Trans. Graph. 25, 1013–1024.

Also, many thanks to my supervisor Tim Weyrich, who is certainly an authority in this field and who taught me a lot.



Chronicles of a tour in Skin Shading – Dipole approximation for dummies

Preamble: This series of blog posts is motivated by the fact that I apparently lost the pdf of my Master dissertation on real time skin shading. I still have most of the concepts in mind, so I will braindump here some parts of my short journey. Don't expect ultra high quality stuff; it was the work of a little guy approaching serious computer graphics for the first time.

It was over 2 years ago when I decided to venture into skin shading for my Master dissertation. At the time, having a sphere (slowly) rendered with a Cook-Torrance BRDF was my main achievement, and little did I know how challenging graphics programming can be.

All excited, I started my literature review, and one paper that always came up in references was A Practical Model for Subsurface Light Transport [Jensen et al. 2001] and its dipole approximation. Little me was very excited; I printed the paper (yes, I like to print stuff I need to study) and... I desperately stared at it for an hour. While it is a great paper, it is not an easy read at all.
Recently on Computer Graphics Stack Exchange, Nathan Reed posted a question about it and I had a chance to write down a short answer, keeping in mind the little me that was eye-balling the paper back then. To kick-start this short series of posts I am going to partially rework my answer here. It is going to be a very high level description; for details go back and read the paper, it is worth it 🙂

What it is

It is well known that a simple BRDF is not enough to model skin; we need to approximate a BSSRDF because of the subsurface scattering.
Often the subsurface scattering (SSS) is approximated as a diffusion phenomenon; this is a fairly good thing to do since in highly scattering materials such as skin the distribution of light loses angular dependency and tends to isotropy.

The dipole approximation used by [Jensen et al. 2001] is nothing else but a formulation of the diffusion problem used to model the multiple-scattering part of the BSSRDF.¹ This multiple-scattering term is defined in the paper as a function of incoming and outgoing points and directions:


Where the Ft are Fresnel transmittance terms and R is what is known as a diffusion profile. This profile is what actually describes the phenomenon as a function of the distance between the point where the light enters the medium and the point where it leaves.
Here, in R, is where the dipole comes into play.


The dipole approximates the reflectance as coming from two light sources, one with a positive contribution and one with a negative contribution.

The positive one is located at a distance z_r beneath the surface, while the negative one is at a distance that depends on z_r, on the Fresnel reflectivity of the considered slab and on the absorption and scattering coefficients of the medium. Note that z_r is not an arbitrary distance, but one "mean free path". The mean free path is given by one over the sum of the absorption and reduced scattering coefficients, and it represents the average distance travelled by a photon before it collides inside the medium.
The contribution of these lights to the diffusion profile R depends on their distance from the surface, on the distance between x_i and x_o, and on the scattering and absorption coefficients of the material.

Now that we have a definition of S_d, the last step is to integrate the BSSRDF over a certain area. In [Jensen et al. 2001] they render using a Monte Carlo ray tracer, randomly sampling around a point with a density that falls off exponentially, proportional to e^(−σ_tr d), where d is the distance from the point we are evaluating and σ_tr depends on the absorption and reduced scattering coefficients.
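The exponential-falloff sampling can be done with a standard inverse-CDF draw (a sketch of mine, not the paper's actual code):

```python
import math
import random

def sample_distance(sigma_tr, rng=random):
    """Draw a distance d with pdf p(d) = sigma_tr * exp(-sigma_tr * d)
    using the inverse CDF of the exponential distribution."""
    u = rng.random()
    return -math.log(1.0 - u) / sigma_tr
```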

I don’t want to add too many formulas to the core of the post, but at the bottom I’ve put an expanded formula for¬†R¬†with a short description of every parameter.

Why the dipole formulation?

That is all cool and well, but the little me of 2 years ago was very confused: what the heck? A dipole? Why?

Well, it turns out that the dipole approximation is common in solutions for diffusion problems. Let’s just take a small step back.

The diffusion equation is a Partial Differential Equation, and as such it has an infinite number of solutions! For this reason, when we solve light diffusion problems we need to take into account a boundary condition to find the solution we want (among the infinitely many possible ones).

Usually, the boundary condition used in transport theory is the so-called "extrapolated boundary condition", and the dipole approximation is used to handle it.
In our case, the condition is that the net flux is zero at the "extrapolated" distance of 2AD from the actual surface, where:


where F_dr is the Fresnel diffuse reflectance (of the considered slab) and σ_t' is the sum of the absorption and reduced scattering coefficients of the material.
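If I recall correctly, the two quantities are (a reconstruction from memory, verify against the paper):

```latex
A = \frac{1 + F_{dr}}{1 - F_{dr}}, \qquad D = \frac{1}{3 \sigma_t'}
```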

The dipole approximation satisfies this boundary condition by representing the scattering of the incoming light with the two light sources described in the previous section.

Its assumptions and subsequent works

[Jensen et al. 2001] was a seminal paper for the field, but it relies on a few assumptions that are addressed in subsequent papers. I will not undergo the massive task of listing the papers that improved on or relied upon the dipole approximation, but here are a couple of assumptions discussed after it:

  • The dipole is a good model as long as the object represented is (semi-infinitely) thick. This is not true for skin, which is a multi-layered material with thin slabs that need to be taken into account. This issue is tackled by the famous multipole model [Donner and Jensen 2005]. Their model assumes a multitude of dipoles instead of just one (hence the name multipole) and simulates reflectance and transmittance between thin slabs. The paper also proposes interesting acceleration techniques for the rendering.
  • The dipole method assumes that the incident light is directionally uniform; to account for scenarios where it is not, [Frisvad et al. 2015] recently introduced a directional dipole model.
  • [D'Eon and Irving 2011] proposed various improvements on the dipole and multipole models, introducing a modified formulation of the diffusion equation, a more accurate exitance calculation, a more accurate term A and different boundary conditions.


That is all for now. I was tempted multiple times to write a bit more, but I really wanted to keep this post as simple as possible. This is not sufficient for a deep grasp of the model, but hopefully it gives an overview and a good set of info to approach the actual paper 🙂

– Francesco (@FCifaCiar)



The actual R(r) formula 
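Here is the classic dipole profile as I remember it (a reconstruction, please double check it against [Jensen et al. 2001]):

```latex
R(r) = \frac{\alpha'}{4\pi}
  \left[ z_r \left( \sigma_{tr} d_r + 1 \right) \frac{e^{-\sigma_{tr} d_r}}{d_r^3}
       + z_v \left( \sigma_{tr} d_v + 1 \right) \frac{e^{-\sigma_{tr} d_v}}{d_v^3} \right]
```

Where α' = σ_s'/σ_t' is the reduced albedo, σ_tr = sqrt(3 σ_a σ_t') is the effective transport coefficient, z_r = 1/σ_t' is the depth of the positive source, z_v = z_r + 4AD is the depth of the negative (virtual) source, and d_r = sqrt(r² + z_r²), d_v = sqrt(r² + z_v²) are the distances from the exit point to the two sources.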


¹ Note that ignoring the single-scattering term has been shown to be acceptable for skin. For some translucent materials (e.g. smoke) it is fundamental.



Quick dump of GCN links

During my holidays I met an old friend of mine who is currently trying to learn a bit more about console GPUs and asked me for a few links. I knew a couple off the top of my head, but I didn't have a list saved anywhere, just vague memories of docs here and there.

This is not a blog post, this is not interesting, this is only a GCN links dump to send to friends asking for info (ciao Piero!) and maybe useful to the random internet surfer.

Official AMD documents

Blog Posts




Also, let me add that AMD's initiative GPUOpen is a great resource for GCN related stuff!

Let me know if you know any other interesting links!

Till the next one,
-Francesco  (@FCifaCiar)
