| DevMaster.net |
| Home | Forums | 3D Engines Database | Wiki | Articles/Tutorials | Game Dev Jobs | IRC Chat Network | Contact Us | |
| Modern Fixed-point Tricks and Optimizations |
|
|
| Developing more robust and faster fixed-point routines with today's compilers | ||
|
06/04/2007
|
||
IntroductionFixed-point arithmetic has become a lost art among PC programmers. Since the introduction of the 486 CPU, a fast floating-point processor is an integrated part of the CPU, and hence there is rarely a need to work with fixed point numbers anymore. However, in the embedded world and on mobile devices, you rarely have a Floating-Point Unit (FPU) that does the dirty work for you. Also, fixed point arithmetic has some striking advantages over floating point when it comes to portability and precision. In this article I want to revise some fixed-point issues that we face with today's compilers, and presents an approach to create more robust and faster routines. There are still various reasons to use fixed-point numbers. Not only on mobile devices or embedded systems, where a FPU is simply not available, but also for performance reasons. Fixed point arithmetic has its uses on the desktop PC as well. If you need arithmetic with a guaranteed error bound, or you need to do calculations that give exactly the same results on any machine and any operation system, then floats or doubles are a bad choice. There are even differences on the Athlon and Pentium processors when it comes to floating point calculations. However, little has been written in the previous years about fixed point arithmetic. Programmers these days just write down their code in C or C++ and hope that the compiler will generate fast code. We use the patterns and algorithms that have been developed more than a decade ago, on machines with different constraints. There is a lot of potential for speedup on the new CPU, and some of the methods and algorithms need to be adjusted. I won't explain any basics here (there's enough introductory material on the web on how fixed-point arithmetic works). Also I'll keep the use of assembly to a minimum. How Compilers Handle Fixed-point OptimizationsI'd like to start with a little fact that might surprise even old-timers who have grown up with fixed point. Compilers did a bad job optimizing fixed point code 10 years ago. If you thought the good old Watcom C++ was bad you haven't seen what you get these days. The root of all evil is the 64-bit integer datatype and how 32-bit compilers handle it. Let's start with the good old fixed point multiply with 16 bit precision. We cast one multiplicant to 64-bit, so the compiler does a 64-bit multiply, then we shift back the result by 16 bit and return it. i32 fixmul (i32 a, i32 b)
{
return ((i64)a * b)>>16;
}
On the x86, this line of code can be translated into two instructions (most other CPU architectures have similar instructions): IMUL ebx // eax = a, ebx = b SHRD eax, edx, 16 // shift 64 bit result to eax Compilers will generate something like this (I spare us the assembler dump):
To be fair, the newest Visual C++ compiler is - after decades - smart enough to collapse the first three steps, but it still calls a subroutine for the shift. Other compilers I work with generate code that looks like it comes from the stone ages of compiler history. This is really bad: the subroutine call causes all kinds of penalties. The CPU has to do the call, execute the function itself, and parameter passing overhead. Above all, the optimizer will stop optimizing, as it can't optimize across subroutine calls. Also, the instruction scheduler gets no opportunity to hide the latencies of the IMUL and SHRD instructions. Even worse is the problem of aliasing. Consider this code: void MatrixMultiply (i32 * out, i32 * l, i32 * r) { out[0] = fixmul (l[0], r[0]) + fixmul (l[3], r[1]) + fixmul (l[6], right[2]); out[1] = fixmul (l[0], r[3]) + fixmul (l[3], r[4]) + fixmul (l[6], right[5]); out[2] = fixmul (l[0], r[6]) + fixmul (l[3], r[7]) + fixmul (l[6], right[8]); // ... } We know that the library routine just multiplies numbers. For the compiler however, it is a function like any other function. It could do anything, including writes to memory. This causes the integers l[0], l[3], l[6] to be reloaded from memory for each access. After all, the runtime library could have changed the content. However, on the x86 architecture, the compiler would load them from memory anyways, since there aren't that many registers to work with. On the ARM or MIPS architecture, where more registers are available the problem is more serious. On a side-note: I know that "out", "l" and "r" could alias anyways, but I left out the C99 restrict keyword to not confuse our visual c++ friends :-) What can we do to solve this: well, not much. In-line assembly can sometimes do the trick, but it will scare away the optimizer. For a non-inlined function, it makes no difference; but if you inline it into a larger function, it can do more harm than good. If your compiler generates two calls for fixmul, it could as well call your optimized fixmul routine, which is a saving for sure. If you work with GCC, the inline assembler pragmas are somewhat better. The compiler will optimize around them but will not perform instruction scheduling of your code -- still better than nothing. There is a trick to get rid of the 64-bit shifts: do them yourself! It's not obvious that doing the shift "on your own" will generate better code, but the output looks pretty good across a wide range of compilers. Not as optimal as it could be, but a lot better than the call for sure. // shift a to the right by 16, return the // return the lower 32 bits: i32 __inline shift_right_16 (i64 a) { union { ui32 dword[2]; i64 qword; } c; c.qword = a; return (c.dword[1]<<16)|(c.dword[0]>>16); } How much does this improve the code? I've did a little test using a 64-bit square root (taken and extended from Paul Hsieh's site). All I did was replace the compiler-generated 64-bit shift. The improved version is roughly 30% faster than what the compiler comes up with (on my Athlon 1.3Ghz). Not bad for such a little low-level tweak, ain't it? And it didn't even need a single line of assembly or a change in the algorithm. Here are the functions for reference: // shift a to the left by n bits (between 0 and 31), // return the result as 64 bit. ui64 __inline shift_left (ui32 a, ui32 n) { union { ui32 ui[2]; i64 ll; } c; if (n) { c.ui[0] = a <<n; c.ui[1] = a >>(32-n); return c.ll; } else return a; } ui32 isqrt64 (ui64 value) { ui32 g = 0; ui32 bshft = 31; ui32 b = 1<<31; do { // Old version: ui64 temp = ((ui64)(g+g+b)<<bshft); // New version: 30% faster. ui64 temp = shift_left (g+g+b, bshft); if (value >= temp) { g += b; value -= temp; } b>>=1; } while (bshft--); return g; } Enough compiler hacking for now. But as a last word of warning: if you do 64-bit integer arithmetic on a 32-bit machine, be skeptical and don't trust your compiler. I've seen compilers go totally braindead when it comes to 64-bit integers. The old Visual-Studio.Net 2003 compiler for example was bold enough not to turn this multiplication into a shift or an add: i64 somefunction (i64 a)
{
// watch out for dumb compilers: This innocent code
// might be slower than you maybe expect:
return a * 2 + somethingelse;
}
Fixed-point Algorithmic OptimizationsCPU's are very capable these days, but we programmers tend to just cut'n'paste our tried-and-trusted routines without adjusting them to the architecture. Let me introduce you to a new friend of mine: the x86 BSR instruction. Also known as CLZ (on some ARM and MIPS) or LMBD on the TMS320C64+ DSP (my current playground). It counts the number of leading zero bits in a integer, and it's a powerful instruction to improve fixed point code for performance and precision. It's my weapon of choice when I write fixed point code (which I do a lot recently). Unfortunately there has never been something like a CountLeadingZero function in the C-libraries, so very few textbook algorithms use it. On the x86, back in the days when programmers wrote assemblers for critical parts without being ashamed to do so, BSR was strongly avoided; it was very slow. At least this has changed as BSR is nice and dandy now. Those who work with GCC can consider themselves lucky: GCC has built-in cross-platform support for it. The relevant function is called __builtin_clz and will expand into few assembler instructions or emulated if the platform does not support it. Under Windows CE, the Microsoft compilers offer _CountLeadingZeros, which is hidden in the Cmnintrin.h header file. However, Visual C++ users are out of luck as there is no support for such a function, as far as I know. They can only emulate it using a C-function or using inline assembly. Many good and portable CLZ-algorithms are easy to find on the net. Here's my clz implementation for Visual C, provided as a quick starter for those who want to play around with it a bit. Did I mention that I want to stay away from assembly instructions in this article? Yes, I feel guilty now. ui32 clz (ui32 a)
{
_asm {
mov eax, [a]
bsr eax, eax
xor eax, 31
mov [a], eax
}
return a;
}
A word of warning: don't call clz with an argument of zero. The handling of this special case is undefined and varies from platform to platform. It won't crash your program if you do, but don't rely on the value you get from it. Now let us put what we've learnt into action: here is the square root function again, this time in 32-bit to make things a bit simpler: ui32 isqrt32 (ui32 value)
{
ui32 g = 0;
ui32 bshft = 15;
ui32 b = 1<<bshift;
do {
ui32 temp = (g+g+b)<<bshft;
if (value >= temp) {
g += b;
value -= temp;
}
b>>=1;
} while (bshft--);
return g;
}
In short, this function examines two bits, starting at the most significant bit, and generates one bit of result per iteration. Only 1 bit is relevant. Zero bits will just update the bshft and b value, which are easy to calculate. If you calculate the square root of 1 using this function, the first 15 iterations will not do much. We can make use of CLZ here to skip all these useless iterations: ui32 isqrt32 (ui32 value)
{
if (value) {
ui32 g = 0;
ui32 bshft = (31-clz(value))>>1; // spot the difference!
ui32 b = 1<<bshft;
do {
ui32 temp = (g+g+b)<<bshft;
if (value >= temp) {
g += b;
value -= temp;
}
b>>=1;
} while (bshft--);
return g;
}
else
return 0;
}
Little change -- some improvement. The function is now slightly slower if you pass UINT_MAX and faster for all others cases. And now comes a real treat: CLZ is not only very useful to speed up such iterative algorithms, but it can also improve the precision and performance of calculations as well. Consider the problem of calculating the winding order of a triangle in 2D. This involves a cross product, and fixed point tend to have problems with precision due to precision loss after multiplications. Going to full 64-bit is one way to tackle this problem, but as we have seen, this approach has its own problems. Here is a different way to handle the situation: i32 winding (vec *a, vec *b, vec *c)
{
// get two side vectors of the triangle:
i32 ux = a->x - c->x;
i32 uy = a->y - c->y;
i32 vx = b->x - c->x;
i32 vy = b->y - c->y;
// We're interested in the highest significiant bit of our input values:
ui32 mask = abs(ux)|abs(uy)|abs(vx)|abs(vx);
// Handle the special-case of a zero area triangle:
if (!mask) return 0;
// get the highest bit position:
ui32 bits = 31-clz(mask);
if (bits > 14) {
// scale all variables, so we can calculate with
// the 14 most significant bits only:
ux >>= (bits-14);
uy >>= (bits-14);
vx >>= (bits-14);
vy >>= (bits-14);
}
// cross product is cheap now! Can't overflow anymore,
// and does not even need 64 bit arithmetic. Horray!
return ux * vy - uy * vx;
}
We look for the highest bit position of all 4 values at once. For small triangles, the prime candidates for fixed point trouble this will give exact results. Larger triangles have enough significant bits to pass as well. But there is a drawback: Extreme sliver triangles might now give wrong results. As an extra bonus, we don't have to use fixed point multiplies at all since the result can be maximal 30 bits in size (limit for a signed integer). This works around the compiler runtime library calls once again. This method to bias multiple input values is universal. You can use it to normalize vectors, calculate barycentric coordinates, do texture setup for polyfillers, and lots more. You know how many bits you've shifted your values prior to do any calculations, and you can factor this shift out at the end. This way of dealing with precision is somewhat halfway between fixed point and floating point. Instead of calculating an exponent for each value during each calculation we do this once for a bunch of values. Fixed point arithmetic has still its places, and there is no excuse to use inefficient or imprecise methods. Compiler oddities can be worked around and algorithms can be modified to be robust, precise and fast at the same time. Moving away from the fixed radix representation of fixed point numbers is one step forward to close the gap between the precision and speed of fixed point and the ease of use of floating point arithmetic. About the AuthorNils Pipenbrinck has been programming computers since the late 80s. After working in various fields in computer science and in the game biz, he's now employed as an embedded software engineer with an international company. His current main focus is developing high quality graphic algorithms and APIs. When not polishing pixels, he spends his time behind his e-guitar or enjoys to design analog circuits just for fun. |
|
|
| © 2003-2004 DevMaster.net. All Rights Reserved. Terms of Use & Privacy Policy | Want to write for us? Click here |