Thursday, June 30, 2005

Fast ambient occlusion

Ambient occlusion, of course, is essential to modern 3D rendering. It adds "life" or "atmosphere" to 3D models in environments, and once you've seen anything with it, you'll never settle for anything less.

On The Chronicles of Riddick we did our first ambient occlusion rendering, and while it worked fine, it was very slow to compute. So slow, in fact, that we did ambient occlusion as a pre-process on the model, and saved the model with the occlusion baked in. Fortunately, we were dealing with almost-rigid spaceships and props most of the time, and this worked fine.

I noticed that all commercial renderers now support ambient occlusion to some extent. I've done some testing with the Gelato beta and RenderMan 11.5 (both of course obsolete now). We also had an overly aggressive animator take home a shot and do the ambient occlusion on his home network using the Brazil renderer. My experience is that these obsolete versions of Gelato and RenderMan worked OK, but had significant artifacts and were pretty slow -- whereas the Brazil AO was amazingly fast and seemed artifact-free, but it doesn't run on Linux (yet!). Brazil's AO seemed so fast that I could hardly believe it; they must be doing some amazing acceleration. I couldn't figure it out. If you think about ambient occlusion, you realize that the close polygons are much more important than the far away ones, and the big ones are more important than the little ones, so there has to be some way of coming up with an approximate solution quickly.

We also thought about using hardware rendering to accelerate ambient occlusion. We hired a spectacular programmer, Hal Bertram, to experiment with this. He did it in an afternoon by rendering the scene with the light in 256 different directions and averaging these images together. Unfortunately, hardware at the time (this was almost two years ago) was still pretty limited, and there were serious artifacts (aliasing, limited Z resolution, banding...), but it was super fast (a few seconds a frame).

Finally I read of a nice acceleration for AO that is part of the GPU Gems 2 book. As with many things in that book, it's a technique that applies perfectly well to traditional computers too -- although it would be slower on traditional computers (and perhaps that delta will increase over time, as GPUs have been increasing in speed somewhat faster than CPUs). At this point, the article is available online, although you can always just go buy the book.

The trick in this article is to represent each vertex as a disk, and calculate how much each disk shadows each other disk -- independently! -- and add them up. This is quick and easy, but it's still an O(n^2) algorithm.

The acceleration comes from building a spatial tree. At each level of the tree, you have a list of disks in that part of the tree, plus some kind of approximation of all the disks from there on down. When calculating the occlusion for a particular vertex, you note how much shadow a particular subtree might contribute, and if it's less than a tolerance parameter, you use the approximate model.

The article in the book is just about that vague, leaving a lot of room for interpretation.

For my tree, I used a kd tree. The root of the tree contains the whole model, and the bounding box is divided into two subtrees by a median plane -- that is, half the discs are on one side of the plane, half on the other. I choose the axis of the plane (x, y, or z) so that the smaller of the two resulting halves is as large as possible. That is, if I have a box that has dimensions (10, 5, 3) and the median in each dimension is at (3, 2, 1) (say), then the x dimension would be partitioned into two parts 3 and 7 units wide, the y dimension into two parts 2 and 3 wide, and the z dimension into two parts 1 and 2 wide. The x dimension has the largest smaller half (3), so that cell is split in the x dimension. I think that probably any reasonable partition scheme would give almost identical results, but I thought that this scheme would yield somewhat-cubical boxes with equal numbers of discs reasonably quickly as you go down the tree.
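The axis choice can be sketched in a few lines of C. This is an illustration rather than my actual code: the function name is made up, and it assumes the per-axis median of the disc centres has already been computed.

```c
/* Pick the split axis (0 = x, 1 = y, 2 = z) for one kd-tree cell.
   On each axis the median cuts the box into two parts; choose the axis
   whose smaller part is the largest. Ties go to the lower axis index. */
int choose_split_axis(const float box_min[3], const float box_max[3],
                      const float median[3]) {
    int best_axis = 0;
    float best_small = -1.0f;
    for (int axis = 0; axis < 3; axis++) {
        float lo = median[axis] - box_min[axis];   /* width below the median */
        float hi = box_max[axis] - median[axis];   /* width above the median */
        float smaller = lo < hi ? lo : hi;
        if (smaller > best_small) {
            best_small = smaller;
            best_axis = axis;
        }
    }
    return best_axis;
}
```

For the box in the example (dimensions (10, 5, 3), medians at (3, 2, 1) measured from the minimum corner), the smaller halves are 3, 2, and 1, so the function returns 0, the x axis.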

At each level, I note the "large" discs, ones that cover a large part of the cell. As you traverse the tree, you have to evaluate the occlusion immediately for the large discs. These large discs are not included in the subtrees.

Finally, my approximation is that each node has an array of 24 structures. Imagine a cube, blow it up into a sphere, and imagine four equally-spaced normals emitted from each face. In each node in my kdtree, there are 24 entries containing the sum of the areas of the discs that correspond most closely to one of these normals, and the average position of the center of these discs.
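One way to assign a disc normal to one of the 24 entries is sketched below in C. The numbering is an assumption for illustration: the dominant axis and its sign pick one of the six cube faces, and the signs of the other two components pick one of the four quadrants on that face. That approximates "the closest of the 24 normals" when the four normals on a face sit in the middle of its quadrants; it is not necessarily the exact rule my program uses.

```c
#include <math.h>

/* Map a (not necessarily unit) normal to a bin index in 0..23.
   face     = 2 * dominant_axis + (1 if that component is negative)
   quadrant = sign bits of the two remaining components
   bin      = 4 * face + quadrant */
int normal_bin(const float n[3]) {
    int axis = 0;
    if (fabsf(n[1]) > fabsf(n[axis])) axis = 1;
    if (fabsf(n[2]) > fabsf(n[axis])) axis = 2;

    int face = axis * 2 + (n[axis] < 0.0f ? 1 : 0);
    int u = (axis + 1) % 3;   /* the two axes spanning the face */
    int v = (axis + 2) % 3;
    int quadrant = (n[u] < 0.0f ? 1 : 0) + (n[v] < 0.0f ? 2 : 0);
    return face * 4 + quadrant;
}
```

Building a node's table is then a matter of adding each disc's area into `area[normal_bin(normal)]` and accumulating its position for the average.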

At some point when traversing the tree, you decide that you can use the approximate value. You then build proxy discs from the array, and you don't have to traverse the subtrees below that node. Here's my tree cell structure, just to help explain what's going on.

typedef struct KDtree {
    BOX3F box;              // bounding box of this cell
    struct KDtree *child0;  // child trees -- dividing plane immaterial
    struct KDtree *child1;
    int n_vert;             // number of vertices at this level
    OBJ_VERT **vert;        // OBJ_VERT contains position, normal, area
    float area[24];         // sum of areas of disks
    V3F avg_pos[24];        // average position of disks with this normal
    float total_area;       // sum of area of all discs in subtree
} KDtree;

This all works great. On the 50,000 polygon model shown, I can calculate the ambient occlusion in less than 10 seconds with a simple 500 line program, with very good looking results. This compares to something like 10 minutes for my previous, heavily optimized, ray-tracing technique. Big win. If I spend some more time optimizing, I'll bet I can cut it down to 5 seconds.


One further note -- in the original GPU Gems 2 article, they recommend a very odd form factor calculation for the shadows

1 - (r * cos(theta_e) * max(1, 4 * cos(theta_r))) / sqrt(area_e/pi + r^2)

but later, when discussing radiance transfer, they use the more standard form. I use an equation similar to their radiance transfer equation

1 - (cos(theta_e) * cos(theta_r) * area_e / (pi * r^2))

I have to believe that the funky equation is designed to take into account some artifact of their approximation paradigm.

Tuesday, June 28, 2005

Hardware color correction update, continued

You have heard of the case of the dog that plays checkers? It's not that he plays too well; it's amazing that he can do it at all. It's sort of like that sometimes with graphics programming.

I can't believe that my previous code worked at all, as I was not doing some very basic things. In particular, I was not using the texture handles correctly. Before any call to glTexParameteri, glTexImage2D, glTexImage3D, glTexSubImage2D and so on, you need to make sure that you're working with the correct texture, so you have to call glBindTexture(target, handle) first. Doing this correctly gave the program a chance of working, and now I can reload the 3D color lookup table at will. I also had not called cgGLEnableTextureParameter() after calling cgGLSetTextureParameter(). Again, I can't imagine how the program worked at all before.

But anyway, now it all works. I had trouble when I enabled GL_DITHER, but that's a bit of an anachronism, anyway (back to the old SGI O2's, where double-buffered images were only 4-bits (dithered) per channel).

I had hoped to get some performance benefit by going to RGB instead of RGBA textures (less data to transfer, less data to look up), but whatever gain there is is minuscule at best. RGB textures require you to pad out each scanline to a 4-byte boundary (at least with the default unpack alignment), which is kind of annoying (and not documented in the glTexImage2D man page, but it is documented other places). Similarly, I saw no speedup by using glTexSubImage2D instead of glTexImage2D to reload textures for each frame of a moving sequence -- at least on my development machine. I still use the glTexSubImage2D call whenever the image doesn't change size, because it seems like the right thing to do.
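For anyone hitting the same padding issue: OpenGL's default GL_UNPACK_ALIGNMENT is 4, which means every row of the source image is expected to start on a 4-byte boundary. A small C helper (the name is hypothetical) that computes the row stride to allocate looks like this:

```c
#include <stddef.h>

/* Bytes to allocate per scanline so that each row starts on an
   `alignment`-byte boundary (OpenGL's default unpack alignment is 4). */
size_t padded_row_bytes(size_t width, size_t bytes_per_pixel, size_t alignment) {
    size_t raw = width * bytes_per_pixel;
    return (raw + alignment - 1) / alignment * alignment;
}
```

With 8-bit RGB, a 2047-pixel row is 6141 bytes of data but needs a 6144-byte stride, while a 2048-pixel row needs no padding at all. The alternative is to call glPixelStorei(GL_UNPACK_ALIGNMENT, 1) before uploading and keep the rows tightly packed.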

One minor annoyance is that it appears that there are limits to the size of texture images -- looking at 8k x 4k images, for example, doesn't seem to work -- where using glDrawPixels worked fine at that (and much larger) sizes. This is, so far, not much of a limitation -- but it could become one as the film industry moves to 4K images.

In the end, I'm getting the required 24 fps at 2048x1200 images with a 3D lookup table. Barely.

Friday, June 24, 2005

Hardware color correction update

I was somewhat devastated by the inability to re-write the 3D texture on the fly -- calling glTexImage3D again to replace the texture, in fact, caused a core dump. I can't believe that this is the spec, but I could be mistaken. In any case -- the solution is to call glTexSubImage3D to reload the texture data every frame. This seems to work perfectly.

Also, I should note for other people attempting to do this that in OpenGL, if you want to use the vertex shaders, you can't create polygons the old IrisGL way -- you cannot use the glBegin(GL_POLYGON); glVertex3f(); glTexCoord2f(); ... glEnd(); sequence that all us old guys are familiar with. You have to use the glDrawArrays(GL_POLYGON, first, count); call. This means, as well, that you have to load up the arrays in a way that your vertex shader can get to them. This is done with the terribly intuitive[1] calls (for a vertex position array called 'p' and texture coordinate array called 'uv')

cgGLSetParameterPointer(cgGetNamedParameter(vertexProgram, "Pobject"), 3, GL_FLOAT, 0, p);
cgGLSetParameterPointer(cgGetNamedParameter(vertexProgram, "TexUV"), 2, GL_FLOAT, 0, uv);

cgGLEnableClientState(cgGetNamedParameter(vertexProgram, "Pobject"));
cgGLEnableClientState(cgGetNamedParameter(vertexProgram, "TexUV"));

In any case, this ability to reload color correction tables is the second-to-last roadblock to releasing this image viewer. The last one is how to turn off the shaders to draw my annotations. This can't be too hard.

1. HTML really needs a tag.

Thursday, June 23, 2005

Hardware color correction

I've heard about people using the fragment shaders in current NVIDIA graphics cards to do color correction, and I thought I'd give it a whirl. By "color correction" here I mean making things on our monitors look like they do on film. (For future historians, 'film' was an ancient analogue medium used to project images onto a screen.)

The first thing I tried was just to write a fragment shader that would intercept pixels from our normal image viewing application "ras_view". My first attempted fragment shader was as follows:

// convert from color to l gb r space and back

half4 main(float3 Color : COLOR) : COLOR
{
    half l, gb, r, rad2;
    half red, grn, blu;

    l = dot(half3(0.33333, 0.33333, 0.33333), Color);
    gb = dot(half3(0.00000, 0.70711, -0.70711), Color);
    r = dot(half3(0.81650, -0.40825, -0.40825), Color);

    gb = gb * gb;
    r = r * r;

    rad2 = 1 - (gb + r) * (1 - 2 * l) * 16;

    return half4(Color * rad2, 1);
}

My observation about film is that the gray scale values track very nicely with the logarithmic curve specified in the Kodak manual on the Cineon format. But, it turns out that as colors get more saturated, they get dramatically darker. The most striking example is the reproduction on film of pure blue. Loading (0,0,1) into a framebuffer makes a piercing blue, but on print film it ends up being almost completely black. My belief is that this is intentional, or at least the result of some intentional mucking with chemistry that Kodak has done to make filmed images look "better". Whatever. In any case, we have to match that.

So, in this first fragment shader, I transform the image by a matrix that converts RGB into a "gray", "red" and "green-blue" color. The RGB value of the image is scaled down by the distance from the gray axis through the color cube.
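To see what that scale factor does to particular colors, here is the same arithmetic as the shader ported to plain C. The struct and function names are made up for this sketch; the coefficients are the ones in the shader.

```c
typedef struct { float r, g, b; } Rgb;

/* Scale factor the first fragment shader applies to a color:
   1 on the gray axis, dropping as the color moves away from gray. */
float saturation_darkening(Rgb c) {
    float l  = 0.33333f * (c.r + c.g + c.b);                      /* "gray" axis */
    float gb = 0.70711f * c.g - 0.70711f * c.b;                   /* "green-blue" */
    float r  = 0.81650f * c.r - 0.40825f * c.g - 0.40825f * c.b;  /* "red" */
    float dist2 = gb * gb + r * r;   /* squared distance from the gray axis */
    return 1.0f - dist2 * (1.0f - 2.0f * l) * 16.0f;
}
```

Any gray input gives exactly 1, so the gray scale is untouched. For pure blue (0, 0, 1) the factor comes out well below zero, so once the output is clamped the pixel should end up black, which is at least in the spirit of what print film does to it.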

This actually worked, after some typical cut-and-try hacking to get fragment shaders working under Linux (If you're easily frustrated, don't even go here. Probably doing this under Windows would be infinitely easier -- there are certainly many more examples to choose from. Not that I'd recommend using Windows for anything, of course.) Unfortunately, it was too slow. I was only getting some 12 frames/sec on 1000x500 pixel images. How terribly disappointing! The graphics card I was using, a (I know, I know, way too expensive) Quadro 3000, had some 16 pixel pipelines -- it should have been able to do this in a reasonable amount of time.

Insight was gained by using a null shader that just copied colors to the screen with no math at all. Just as slow. What can this mean?

Well, of course (you can all stop snickering now) the problem was that I was just using glDrawPixels to display my images. It turns out that if you avoid using fragment shaders you can get truly eye-watering glDrawPixels pixel transfers (on my box, now, I am getting 150M pixels/sec) But, sending them through any kind of fragment shader drops the speed to some 6M pixels/sec. A rather substantial difference.

Of course, the moral of the story is that these graphics cards are designed to draw texture mapped polygons (especially ones of splattering gore...) The right way to draw the image is to read the pictures, build a texture, load it, then render a polygon with that texture applied. Current graphics cards support TEXTURE_RECTANGLE, apparently exactly for this purpose. Rebuilding the program to work this way, and going through the shader pipeline, yielded a respectable 70M pixels/sec. One nice thing about using textures is that you can arbitrarily scale the image to do things like anamorphic projection and correction for non-square pixels (as on our plasma display, for example.)

Once I got to thinking about texture mapping, the obvious thing to do was to use 3D textures to specify the color transformation. This way you can get fairly arbitrary color transformations at respectable resolutions -- and the hardware does texture lookups incredibly quickly. Here are my current vertex and fragment shaders.

Vertex shader

void main(float4 Pobject : POSITION,
          float2 TexUV : TEXCOORD0,
          uniform float4x4 ModelViewProj,
          out float4 HPosition : POSITION,
          out float2 uv : TEXCOORD0)
{
    HPosition = mul(ModelViewProj, Pobject);
    uv = TexUV;
}

Fragment shader

half4 main(half2 uv : TEXCOORD0,
           uniform samplerRECT image2d,
           uniform sampler3D color3d) : COLOR
{
    return half4(tex3D(color3d, texRECT(image2d, uv).xyz).xyz, 1); // With film-look 3d lut
}

If you're thinking of trying this, it seems that the expense of the shader is independent of its size (up to 64x64x64 textures), so you can really get very detailed color correction volumes in real-time. Pushing all the information through the OpenGL calls into the shaders is a mess, perhaps especially under Linux, but it does work in the end. I had serious problems reloading the 3D texture -- if I tried to call glTexImage3D more than once, the program core dumped. I'm sure that there's a good reason for this, but I don't know what it is. This means that, at the moment, I cannot change the lookup table once the application has started. This is more than a little annoying. I'm certain that I'm doing something wrong, but it's classic edit-compile-coredump debugging, without the core file saying anything useful. I also was unable to get the borders working in the 3D texture, but the CLAMP_TO_EDGE filtering did the job without the need for border pixels. It's conceivable that 3D textures just aren't all that robust yet.
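For checking a lookup table away from the graphics card, the 3D texture lookup can be mimicked on the CPU. The following C sketch is an assumption-laden stand-in for what the hardware does: the table layout (red index varying fastest), the function names, and the mapping of 0 and 1 to the first and last table entries are all choices made here. The GPU addresses texel centers, so its results can differ slightly near the edges of the table.

```c
#include <stdlib.h>

static float clamp01(float v) { return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v); }

/* Build an n*n*n identity table (n >= 2): each entry stores its own color.
   Layout: 3 floats per entry, red index fastest. Caller frees the result. */
float *lut3d_identity(int n) {
    float *lut = malloc(sizeof(float) * 3 * (size_t)n * n * n);
    if (lut == NULL) return NULL;
    for (int b = 0; b < n; b++)
        for (int g = 0; g < n; g++)
            for (int r = 0; r < n; r++) {
                float *p = lut + 3 * (((size_t)b * n + g) * n + r);
                p[0] = (float)r / (float)(n - 1);
                p[1] = (float)g / (float)(n - 1);
                p[2] = (float)b / (float)(n - 1);
            }
    return lut;
}

/* Trilinearly interpolated lookup of in[] (RGB in 0..1) through the table. */
void lut3d_lookup(const float *lut, int n, const float in[3], float out[3]) {
    int i0[3];
    float f[3];
    for (int k = 0; k < 3; k++) {
        float x = clamp01(in[k]) * (float)(n - 1);
        i0[k] = (int)x;
        if (i0[k] > n - 2) i0[k] = n - 2;   /* keep the upper neighbor in range */
        f[k] = x - (float)i0[k];
    }
    out[0] = out[1] = out[2] = 0.0f;
    for (int corner = 0; corner < 8; corner++) {   /* 8 surrounding entries */
        float w = 1.0f;
        int idx[3];
        for (int k = 0; k < 3; k++) {
            int upper = (corner >> k) & 1;
            idx[k] = i0[k] + upper;
            w *= upper ? f[k] : 1.0f - f[k];
        }
        const float *p = lut + 3 * (((size_t)idx[2] * n + idx[1]) * n + idx[0]);
        for (int c = 0; c < 3; c++) out[c] += w * p[c];
    }
}
```

An identity table should return every color unchanged, which makes a handy sanity check before loading a real film-look table into the same layout.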

In any case, I now have a proof-of-concept for this technique. It works, it provides general 3D color correction, it's fast. It's not released (even within Hammerhead) yet, but soon it will be.