![]() |
| [[ Home | Forums | 3D Engines Database | Wiki | Articles/Tutorials | Game Dev Jobs | IRC Chat Network | Contact Us ]] |
|
|
#1 |
|
DevMaster Editor
Join Date: Jan 2005
Posts: 54
|
Modern Fixed-point Tricks and Optimizations
Author: Nils Pipenbrinck Description: The article reviews some fixed-point issues that developers face with today's compilers, and presents an approach on creating more robust and faster routines. |
|
|
|
|
|
#2 |
|
New Member
Join Date: Mar 2006
Location: France
Posts: 8
|
interresting article, i learnt things today
thank you |
|
|
|
|
|
#3 |
|
Senior Member
Join Date: Aug 2004
Location: Thornhill, TO
Posts: 850
|
I never really thought about fixed-point calculations until I read this. It would have been nice to see some benchmarks with this though. Just to get an idea of what to expect when dealing between FPU and CPU, and when using vanilla compiler approaches versus optimized code.
___________________________________________
http://www.nutty.ca - Being a nut has its advantages. |
|
|
|
|
|
#4 |
|
Valued Member
Join Date: Aug 2004
Location: Norway
Posts: 200
|
The only thing I'm missing from the article is the fact that fixed point number doesn't have adaptive presicion like floating points. This is why you want to use fixed point for anything that should have uniform precision (so if you're engine doesn't use fixed point numbers to store position it's crap and can't be used for your next MMORPG).
Great article though (I didn't really read it carefully, but it seems very nice). A friend of mine talked to a guy making software that computed weather forcasts. My friend asked what they did for compensating for floating point round-off and limited precisions for large numbers, and the guy answered that they didn't consider that after they changed to doubles *shrug*. No wonder the weather forcasts is so bad. I don't think double's are the standard in Canada, though. Since the weather forecasts here are even less reliable than in Norway. |
|
|
|
|
|
#5 | |
|
Senior Member
Join Date: Sep 2005
Location: .nl
Posts: 504
|
A nice article indeed. Glad that it is finally uploaded
![]() Quote:
That sounds like a nice scientific approach ![]() |
|
|
|
|
|
|
#6 | |
|
Senior Member
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
|
Hi all, thanks for the comments.
Quote:
Oh well - I left a benchmark out because performance always depends on the CPU and context. I know there are not as much embedded guys here, but the console developers are facing the same problems. (there is not that much difference between embedded development and low level jobs in the console world anyways). Take the square root for example. I showed how to make the number of iterations adaptive to the input range. I could aswell unroll the loop and always do 16 iterations. The later version would be much faster if benchmarked by calculating half a million square roots in a tight loop. In reality that does not happen that often. The unrolled but longer (more code) version would be slower albeit it benchmarks faster in isolation. Why this? The working set of code will have a negative impact on the performance. The code might not fit into the code cache anymore, and your performance will go down. Ever had the "wtf, my program executes faster when optimized for size"-experience? It's the same effect. I had my wtf moment some years ago, when i optimized a engine on the playstation 1 just by moving functions around. That increased the performance by almost a factor of two. Also it all depends on the architecture: The modern x86 CPU's for example do an incredible job to execute bad code fast. This is due to the fact that we have historically only a very limited register-file. The modern x86 works with a lot more registers inside, and some serious magic is going on to transform the x86 code to something risc-like. Also we have tons of out of order and speculative execution things going on. This can totally mask the effect of an optimization on one x86 variant while the same optimization might give a nice performance boost on some other x86. On a different architecture, like the StrongARM for example things are different again. The StrongARM is pipelined and calculations can be performed in parallel, but it does not has speculative execution which might mask a call to a simple subroutine. All architectures are different, and they all have different spots where an optimization might make a difference or not. (side-rant: I like it that the CPU's are that fast and efficient, but sometimes I which bad code would crawl and be as painfull slow as possible. How can new programmers develop a feeling for efficiency on the x86 these days? They will never have the same "Ahh"-experience we old guys had in the pre-pentium days). Comparing Float against Integer: That's like comparing apples and oranges. Float is nice and dandy on some machines (x86) and slow as hell on others. (Ever tried float on a PocketPC of Mobile Phone?) Float is always adaptive to the input range. That's cool. You'll get some kind of result back as long as you don't do stupid things like dividing by zero, or mixing extreme large and extreme small values in calculation. But you have a hard time to predict the loss of precision. All floating point results are mere rough estimates. Not so with integers, you know exactly where and how many bits you throw away, and you can calculate anything up the the precision you want. Here's a practical example where float would fail: At work, I'm working on a antialiased graphic library (once again), and I need to calculate the distance of any screen-pixel to an edge (side of a triangle for example). The maximum render surface size is 4096x4096 so I need 12 bits for the integral position. To express subpixels, I use 16 bits fraction. The upper 8 bits of these are used or antialiasing, and the lower 8 bits are needed since I have to interpolate the edge-distance across the primitive. That sums up to 12+16 bits = 28 bits to represent distances. Add one additional bit for the sign (I have negative distances as well). A vanilla floating point square root gives me 23 bits of precision. That's not enough in the worst case and would yield visible rendering errors. Even if floating point would be available (it's not on that hardware) I had to use a fixed point square root anyways. In the end I use a 64 bit square root and It works like a charm. It will not break even if the maximum render surface size changes on day (that makes me feel good). In reality a 56 bit square root would be sufficient, but thanks to the CountLeadingZero operation using a 64 bit square root is just as fast (or slow). Nils Btw - when I find the time (doing a major garbage collection in my flat since two days, and there are still some heaps left, waiting to be deallocated) I'll write up some little benchmarks anyways and run them on my 1.3Ghz Athlon and on a StrongARM. The numbers don't tell much, but they'll show that the performance gain of optimizations does not translate well from one machine to another ![]()
___________________________________________
My music: http://myspace.com/planetarchh <-- my music My stuff: torus.untergrund.net <-- some diy electronic stuff and more. Last edited by Nils Pipenbrinck : 04-09-2007 at 08:10 AM. |
|
|
|
|
|
|
#7 |
|
Senior Member
Join Date: Oct 2005
Location: Pensacola, FL
Posts: 1,028
|
Thanks for the contribution! You've written a nice appendix to the article in that last post too.
I'm going to be doing some more fixed-point math on an ARM9 soon. So, the delay in getting this posted hadn't bother me yet. ![]() |
|
|
|
|
|
#8 |
|
Senior Member
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
|
ARM9? which one exactly? Some of them have DSP extensions (which are a bit laughable if you've worked on a real DSP but nevertheless they are helpfull).
___________________________________________
My music: http://myspace.com/planetarchh <-- my music My stuff: torus.untergrund.net <-- some diy electronic stuff and more. |
|
|
|
|
|
#9 |
|
Valued Member
Join Date: Jan 2004
Location: England
Posts: 265
|
Just want to post a quick message saying it's good to see some new articles here on DevMaster, and hopefully other people will see this as an opportunity to submit their own
![]() Spree |
|
|
|
|
|
#10 |
|
New Member
Join Date: Mar 2007
Posts: 8
|
Do you know if such helpful extensions exist on the TI OMAP 1710 or the Intel XScale ARM920T PXA27x.
|
|
|
|
|
|
#11 |
|
Senior Member
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
|
The OMAP has a DSP as and the ARM9 on a single chip, so you might go multiprocessor in your software and do part of the work on the dsp and on the arm.
XScales are already pretty efficient ARM CPU's and some versions of them have a powerfull mmx like instruction-set. Better than the PC mmx btw. If it's available depends on the exact XScale used (it's a family, not a single silicon). All XScales have the count leading zero instruction and some other extra instructions, saturating adds and such things. They are pipelined and can execute code quite fast. Nils The ARM920T is just a plain risc-core without any funky specials btw. Nothing to get excited about, but not to shabby either.
___________________________________________
My music: http://myspace.com/planetarchh <-- my music My stuff: torus.untergrund.net <-- some diy electronic stuff and more. |
|
|
|
|
|
#12 |
|
Senior Member
Join Date: Oct 2005
Location: Pensacola, FL
Posts: 1,028
|
It's a ARM926EJ-S and has 16-bit DSP extensions amongst other features that I haven't really looked at yet. It does have a CLZ instruction though.
![]() |
|
|
|
|
|
#13 |
|
New Member
Join Date: Apr 2007
Posts: 3
|
Hello!
There is an ugly workround I have to use for intel compiler to multiply fixed point numbers. Code:
Code:
surprisingly, ICL compiles this function w/o __allmul calls and it works faster than inline assembly alternative. This workaround is ridiculous, there must be some compiler setting to use instead. Do you guys have any ideas about that? I use this code for tests Code:
Last edited by Reedbeta : 04-12-2007 at 08:53 AM. Reason: added code tags |
|
|
|
|
|
#14 |
|
DevMaster Staff
Join Date: Oct 2004
Location: Seattle, WA
Posts: 3,707
|
(By the way, .alexander., please use the code tags when posting code! Thanks.)
___________________________________________
Currently working at Sucker Punch reedbeta.com - OpenGL demos and other projects Luabridge - a lightweight, dependency-free C++/Lua binding library. CD Lite - an unobtrusive, minimal CD player application for Windows. |
|
|
|
|
|
#15 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
|
Quote:
Just out of curiosity, what does this function do and how did you came up with this name? |
|
|
|
|
|
|
#16 | |
|
Senior Member
Join Date: Jan 2003
Posts: 868
|
Quote:
Jesus Lord Christ... That thing, sir, is a 100% winner of this year's International Obfuscated C Coding Contest! |
|
|
|
|
|
|
#17 |
|
New Member
Join Date: Apr 2007
Posts: 3
|
Oh, i felt I should post completly functional chuck of code. Here it is
Code:
|
|
|
|
|
|
#18 |
|
New Member
Join Date: Apr 2007
Posts: 3
|
I hope now it's clear
![]() But if you still feel obfuscated: the function i posted first is part of fast implementation of DCT4 transform of size 64. I got it from code generator (http://www.spiral.net/codegenerator.html). What about intel C complier settings? Does anyone know how to tune it in this case? |
|
|
|
|
|
#19 | |
|
Senior Member
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
|
Quote:
Hi Alexander, That's a nice one! Thanks for posting it. I'm still a bit puzzled how it works, but it looks like you're breaking up the 32*32=64 bit multiplication into two c-compatible 32*32 bit multiplications. I'd expect that you need three multiplications to get an exact result. I can only guess, but I think your last two binary logic lines do that. It's a shame that this code works as fast as it does. After all, it's nearly three times slower than doing a single imul instructions. Add the overhead of my self-made 64 bit shift macro to that and we end up beeing roughly 5 times slower than an IMUL / SHRD instruction pair (estimated). I'm also pessimistic that we'll see any improvements of 32 bit compilers. Don't they spend all their resources on the x86 64 bit cpu's these days? The only way to get this part globally optimized would be to somehow get some fixed point arithmetic into important benchmarks, and this is near to impossible. Nils
___________________________________________
My music: http://myspace.com/planetarchh <-- my music My stuff: torus.untergrund.net <-- some diy electronic stuff and more. |
|
|
|
|
|
|
#20 |
|
Senior Member
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
|
Btw - thanks for posting the link to the code-generator site. looks interesting. I like their constant multiplication generator a lot!
___________________________________________
My music: http://myspace.com/planetarchh <-- my music My stuff: torus.untergrund.net <-- some diy electronic stuff and more. |
|
|
|
|
|
#21 |
|
New Member
Join Date: Mar 2007
Posts: 8
|
Hey there I finally got around to checking the assembly generated by some of my fixed-point routines.
I'm generating ARMv5 code using RVCT 2.2. I'm not very strong with ARM assembly so I'd like to know if there are things that can be done to help. (note code assembly for C++ code instantiated with M = 16, F = 16, base_type = int, next_type = int64, and fractional_bits = 16) Fixed point multiply routine: Code:
Fixed point divide routine: Code:
How does this compare to other's code. Is there room for improvement? Thanks, Michael Marcin |
|
|
|
|
|
#22 |
|
Senior Member
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
|
Code:
Looks good. Code is as tight as it can be. The compiler will most probably inline these into the surrounding code. For the fixed point division: The ARM does not has dividers, so they have to do a function call. What you have here is the worst case: It casts booth numbers to 64 bits and calls a 64bit /64bit divide routine. You might get something between 20% and 50% speedup if you roll your own division that is specialized to the fixed point case, but that would require assembly programming. Nils
___________________________________________
My music: http://myspace.com/planetarchh <-- my music My stuff: torus.untergrund.net <-- some diy electronic stuff and more. |
|
|
|
|
|
#23 |
|
New Member
Join Date: Mar 2007
Posts: 8
|
FWIW VC8 has _BitScanReverse in <intrin.h>
So you can write something like inline unsigned long count_leading_zeros( unsigned long a ) { unsigned long b; _BitScanReverse( &b, a ); return b^31; } It doesn't appear to exist for VC7.1 |
|
|
|
|
|
#24 |
|
New Member
Join Date: Jul 2006
Posts: 4
|
Yo Nils, I'm Matteo of AmanithVG!
Nice article, some tricks helped us to speedup our incoming software rasterizer! Thx again (i wish to hear to you via email, how are you?). |
|
|
|
|
|
#25 |
|
New Member
Join Date: Mar 2007
Posts: 8
|
On VC8 how about
Code:
Last edited by MikeJM : 07-08-2007 at 05:45 PM. |
|
|
|
![]() |
| Thread Tools | Search this Thread |
| Display Modes | |
|