![]() |
| [[ Home | Forums | 3D Engines Database | Wiki | Articles/Tutorials | Game Dev Jobs | IRC Chat Network | Contact Us ]] |
|
|
#1 |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,086
|
It's quite surprising that the most elementary rendering operations, like rasterization, are so little documented. Graphics cards take care of most of this, so people have forgotten how it actually works. But there are many reasons left why understanding the inner workings of the graphics pipeline is still useful. Not only for software rendering, but for any situation where this type of operations is required. Knowledge of these algorithm simply make you a better graphics programmer.
Today I present you the theory and implementation of an advanced rasterizer. It is advanced in the sense that it has many nice properties the classical scanline conversion algorithm does not have. The main problem with the old algorithm is that it's hard to process pixels in parallel. It identifies filled scanlines, but this is only suited for processing one pixel at a time. A much more efficient approach is to process 2x2 pixels together. These are called quads. By sharing some setup cost per quad, and using advanced parallel instructions, this results in a significant speedup. Some of the current graphics hardware also uses quad pixel pipelines. Our starting point is the half-space function approach. Before I go into details, all you have to know is that a half-space function is positive on one side of a line, and negative on the other. So it splits the screen in half, hence the name. Here's a small illustration of it, where the edges of a triangle each split the screen in a positive and negative part: half-space.png. Locations where all three half-space functions are positive define the inside of the triangle. When a half-edge function is zero, we're on an edge. When any of them is negative, we are outside the triangle. So this can be used for rasterization. If we can find formulas for the half-space functions, and evaluate them for each pixel location, we know which pixels need to be filled. So what formula is positive on one side of a line and negative on the other? Surprisingly, the equation of a line is exactly what we're looking for. For a segment with starting coordinates (x1, y1) and end coordinates (x2, y2), the equation is: (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) = 0 You can easily verify that this equation is true for any point (x, y) on this line (for example the start and end points themselves). But the left hand side also behaves perfectly as a half-space fuction. It is positive for (x, y) coordinates on one side, and negative on the other. It's zero on the line itself, which effectively defines the line in the above equation. Verifying this behaviour is left as an excercise for the reader. Let's write some code! Code:
Unfortunately, this implementation isn't fast at all. For every pixel, six multiplications and no less than fifteen subtractions are required. But don't worry, this can be optimized greatly. Per horizontal line that we scan, only the x component in the formula changes. For the first edge, this means only (y1 - y2) gets added to the half-edge function value of the previous pixel. That's just an addition and subtraction per pixel! Furthermore, the vertex coordinates are 'constant' per triangle, so (y1 - y2) and all other pairs like this only have to be computed once per pixel. Also for stepping in the y direction it's just an addition of a constant. Let's implement this: Code:
Ok, so this algorithm is starting to show its usefulness. But there still are some serious issues to solve. Here's an actual image created using this code: first_try.png. If it isn't clear that this is actually a car, here's the shaded reference image: reference.png. As you can see, there are many gaps between the polygons. The cause is twofold: precision, and the fill convention. There are precision issues because floating-point numbers are quite limited. They have 24-bit of precision. If you look at the half-space function, you see that we do several subtractions and multiplications. That's the perfect recipe for precision problems. What most people would do in such situation is use double-precision floating-point numbers. While this will make the issue unnoticable, it isn't exactly perfect and also slower. The real solution might sound crazy: use integers! Integers have 32-bit of precision, and do not suffer from precision loss when subtracting. Instead of using a 'floating-point', we'll use fixed-point arithmetic to assure we get sub-pixel accuracy. The second cause of the gaps is the fill convention. The half-space functions can be positive or negative, but also zero. In this case, the pixel is exactly on the edge. Until now, we've ignored this case and treated it as a pixel outside. We could treat it as inside, using >= comparators, but this isn't correct either. What will happen is that pixels on the edge between two triangles will be drawn twice. While this may sound more acceptable than gaps, it causes some serious artifacts for things like transparency and stenciling. What we'll use here is a top-left fill convention, just like DirectX and OpenGL. This means that all edges which are on the left side of the triangle, or on a horizontal top, are treated as inside the triangle. This convention assures that no gaps will occur, nor drawing the same pixel twice. Let's first see how we can detect whether an edge is 'top-left'. For humans, it's really easy to recognise them. Here are a few examples: top-left.png. I'm sure nobody really had a problem to identify the top-left edges himself, it's really simple. But give yourself a minute to think about how you would detect this using code... Don't look before you tried it, but here's the answer: detect.png. The arrows indicate the counter-clockwise order of the vertices. Note how for left edges they all point downward! To detect this in code we merely have to check the Dy1-Dy3 values! The only exception is the top edge. There, Dy# is zero, but Dx# is positive. Now that we know how to detect top-left edges, we still have to deal with them correctly so pixels on these edges are treated as inside. What we actually want to do is change the Cx# > 0 statements into Cx# >= 0 statements. Testing for top-left edges per pixel is way to slow, and handling all these cases in separate loops is way too complex. But look at what Cx# > 0 fundamentally is. It is 'true' when Cx# is 1, 2, 3, etc, and 'false' when Cx# is 0, -1, -2, etc. All we want to do is make it true also when Cx# is zero. The solution is too simple for words: all we have to do is add 1 to Cx# beforehand. This can even be done sooner, with the C1-C3 variables! Ok, let's sum this all up: Code:
This implementation is perfect for small triangles. The setup cost per triangle is really low compared to the scanline conversion algorithm. Unfortunately, it's not optimal for big triangles. About half of all pixels will be outside the triangle, but we still pay the price of evaluating the half-space functions. Furthermore, until now I've only talked about drawing one pixel at a time, while the real benefits of this approach is the possibility for parallelism. I'll show you how to take advantage of this, and what other benefits we get from it. What we really want to do is quickly skip pixels outside the triangle. We don't want to waste any time evaluating half-space functions there. We also don't want to spend too much time inside the triangle. The really interesting part is around the edges. So what we'll do is quickly detect whether 8x8 blocks are not covered, partially covered, or fully covered. Not covered or fully covered blocks can quicly be respectively rejected or accepted. Partially covered blocks will be completely scanned. There's another reason to do this block-based approach: it is very useful for visibility determination algorithms. And an extra benefit is that memory accesses are more localized. So how do we detect coverage as fast as possible? With the half-space functions, it's really easy. All we have to do is evaluate the half-space functions in the corners of the blocks. A non-covered block has negative half-space values for the corners for at least one edge. A completely covered block has positive half-space values for all edges. Everything else is a partially covered block. So the final implementation is: Code:
Note that I scan partially covered blocks completely, all 8x8 pixels. This is for consistency with the other blocks so they can be processed by the same visibility algorithm. Also, this is extremely fast when done in an unrolled loop, using assembly instructions. All further optimizations to this algorithm are best done in assembly anyway. So now the rasterizer can output coverage masks for 8x8 pixels. This can then easily be processed by the pixel pipeline(s). It's easy to process them as 4x4 quads, and many calculations can even be done per block. Everything taken together, there is no reason left to use the old scanline conversion algorithm. Enjoy! Nicolas "Nick" Capens |
|
|
|
|
|
#2 |
|
Senior Member
Join Date: Jan 2003
Posts: 903
|
agreed with the most but:
As i know the Pentium and better CPUs work with the 80bit float numbers actually faster than with the 32 and 64bits. Of course i know few languages which make use of the extended format. There are other few stuff, which i'm not yet sure about.. but the article is really good! Like everything in the _nick_ style ![]() |
|
|
|
|
|
#3 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,086
|
Quote:
![]() |
|
|
|
|
|
|
#4 |
|
Senior Member
Join Date: Jan 2003
Location: ON, Canada
Posts: 524
|
this can be turned in an article
![]()
___________________________________________
"What ever happened to happily ever after?" |
|
|
|
|
|
#5 |
|
Senior Member
Join Date: Jan 2003
Posts: 903
|
Bah... too bad there ain't MMX/3DNow! on those damn nokias....
|
|
|
|
|
|
#6 |
|
Valued Member
Join Date: Sep 2004
Location: Copenhagen, Denmark
Posts: 101
|
Good work, Nick! Very interesting!
![]() |
|
|
|
|
|
#7 |
|
Member
Join Date: Jan 2003
Posts: 85
|
Impressive work Nick!
Could you provide some figures that reveals the performance (and compare them with traditional rasterizers)? How do graphic chips currently do this? I wonder if graphic chip vendors would benifit if your algorithm was implemented in hardware. Have you read the book by Andre LaMoth on: Tricks of the 3D Game Programming Gurus-Advanced 3D Graphics and Rasterization. What is your recommendation for using it as a learning reference for software rendering? |
|
|
|
|
|
#8 | ||||
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,086
|
Quote:
![]() Quote:
If you want numbers, I suggest using swShader, and plugging this code into it. Quote:
Quote:
Anyway, it isn't all bad. I'm sure that for someone who has no experience with software rendering it's quite good. He writes in a 'popular' way so it's easy to read and understand. It nicely bundles all the primary problems someone has to solve to write a software renderer. Since there are no other books like this, it is automatically the best you can get. I would have scratched the 'Advanced' out of the subtitle though... |
||||
|
|
|
|
|
#9 |
|
Senior Member
Join Date: Jan 2003
Posts: 903
|
Agreed with Nick on LaMothe's book 75% of it is code, wrong story about lightmaps, no correct organization no nothing!
|
|
|
|
|
|
#10 | |
|
DevMaster Staff
Join Date: Apr 2003
Location: Germany
Posts: 2,328
|
Quote:
i was suprised how many errors (i mean real mathematical errors not typos) it contains. this has to be confusing for somebody whithout any experience.
___________________________________________
If Prolog is the answer, what is the question ? |
|
|
|
|
|
|
#11 |
|
Senior Member
Join Date: Jan 2003
Location: Switzerland
Posts: 1,333
|
very nice done, nick. i think this shows again, premature optimisation isn't everything, optimizing the algorithm is what counts.
now rastericing works on a completely different level somehow, it's not about the individual pixel anymore, but really just the triangle. it reminds me much of raytracing, this algorithm. the logic is the same, just with tons of restrictions on what rays you trace but it's the same, with the same optimisation possiblities. process full blocks of rays in parallel, skip blocks that are useless, etc..it all sounds familiar.. can't wait to see another performance boost in sw-shader ![]()
___________________________________________
davepermen.net -Loving a Person is having the wish to see this Person happy, no matter what that means to yourself. -No matter what it means to myself.... |
|
|
|
|
|
#12 |
|
Senior Member
Join Date: Jan 2003
Location: Switzerland
Posts: 1,333
|
hehe.. you know what? your rastericer would rock quite a bit on this system:
![]() i mean.. yeah! 16 parallel processors (actually 4 processors with 2 cores and 2 hyperthreads each). there, your parallelisation will get onto a next, higher level..yehah! ![]()
___________________________________________
davepermen.net -Loving a Person is having the wish to see this Person happy, no matter what that means to yourself. -No matter what it means to myself.... |
|
|
|
|
|
#13 |
|
Valued Member
Join Date: Aug 2004
Posts: 120
|
Very interesting article. Confirmed how I thought this worked, and then extended on it with optimizations that are very nice toknow. Thanks.
@davepermen: Where did you get that screen? I would be interested to know who has the cash for that many processors, but not for a monitor large enough to show all the CPU usage graphs ![]() |
|
|
|
|
|
#14 |
|
Senior Member
Join Date: Jan 2003
Location: Switzerland
Posts: 1,333
|
Intel Developer Forum report at anandtech (just right-click and check the file properties to verify).
they aren't that expensive actually. both amd and intel start to show off their new multicore cpu's, for home and servers. this one is actually an ITANIUM dual core at 1.7ghz each, i think. each core has 2 hyperthreads, just as the newer p4 have, and the board is a 4x itanium board.. hence 4x2x2 hyperthreads, 4x2 cores, 4 cpus. right now, it's just a "we have it" proof demo. but it's planned for next year to start shipping. ITANIUMs, Xeons, and Pentiums, all with dual cores. AMD on it's side will show off Opterons with multicore (2x, 4x), and athlon 64, too.. should be available next year, too.. (they showed off a first sample, too). so yes, it's some years till every home end user has one. but it's right today important to start thinking about designing with multi-threading, to support hyperthreading by today, multicore by tomorrow, and multi-cpu for the highest-end-systems in the future. hehe ![]()
___________________________________________
davepermen.net -Loving a Person is having the wish to see this Person happy, no matter what that means to yourself. -No matter what it means to myself.... |
|
|
|
|
|
#15 |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,086
|
I can't wait for a 4 GHz dual-core 64-bit (double amount of registers), Hyper-Threading CPU with 1200 MHz dual-channel FSB and build-in memory controller.
![]() |
|
|
|
|
|
#16 |
|
Senior Member
Join Date: Jan 2003
Posts: 903
|
And when i think that only 12 years has passed since the graphical 'wonders' of Wolf3D...
|
|
|
|
|
|
#17 | |
|
Senior Member
Join Date: Jan 2003
Location: Switzerland
Posts: 1,333
|
Quote:
hehe.. will you one day consider writing some raytracing-stuff? i mean, your renderer gets more and more into the right direction anyways ![]() you'll have to at least write a protocoll how each and every 'tile' gets submitted to the right cpu. btw, if you're quick, you'll be the first one with an api that wrapps well onto ps3! ![]()
___________________________________________
davepermen.net -Loving a Person is having the wish to see this Person happy, no matter what that means to yourself. -No matter what it means to myself.... |
|
|
|
|
|
|
#18 |
|
Member
Join Date: Jan 2003
Posts: 85
|
not only does Nick's swShader beat quake's software rendering and other , but it beats the crap out of DirectX's REF rastersizer. There's no point in comparing against quake's s/w anyways. Good job Nick! You're going to be a millionare someday. You're pretty much garenteed a job at any company you apply for. What's the point of life after that? hehe (j/k) As far as I know, there isn't a single commercial or free product which emulates shaders except for swShader (isn't that right?).
I guess there wasn't any incentive for Microsoft to optimize their software renderer since they expect hardware vendors to implement everything in hardware, and the software part is only used as a reference model to compare against. |
|
|
|
|
|
#19 | |
|
DevMaster Staff
Join Date: Apr 2003
Location: Germany
Posts: 2,328
|
Quote:
american ?
___________________________________________
If Prolog is the answer, what is the question ? |
|
|
|
|
|
|
#20 |
|
DevMaster Staff
Join Date: Jan 2003
Posts: 1,203
|
For those who haven't noticed, the code spotlight submissions are categorized now
![]() |
|
|
|
|
|
#21 |
|
DevMaster Staff
Join Date: Apr 2003
Location: Germany
Posts: 2,328
|
nice... maybe the categories should be seperated more clearly ?
___________________________________________
If Prolog is the answer, what is the question ? |
|
|
|
|
|
#22 | ||||
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,086
|
Quote:
Well, in its own context that ain't bad at all... But anyway, it's Quake I's renderer that really inspired me, and at some points it still does.Quote:
Quote:
Quote:
![]() |
||||
|
|
|
|
|
#23 |
|
Senior Member
Join Date: Jan 2003
Posts: 903
|
Hey John, check out Muli3D, it also supports shaders in easy and understanable way!
|
|
|
|
|
|
#24 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,086
|
Quote:
I wouldn't really say it has shader support though. You can derive a class from a PixelShader class, and implement the execute() function. This allows to simply write the 'shader' in C++. In that sense every software renderer would have shader support even before one line of it is implemented.Anyway, it seems like it could become an open-source 'reference rasterizer'... |
|
|
|
|
|
|
#25 |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,086
|
The article is now also available on the swShader site:
swShader Documentation If there's any part that can be improved, like things that are unclear and need more explanation, please let me know! |
|
|
|
![]() |
| Thread Tools | Search this Thread |
| Display Modes | |
|