DevMaster.net Forums
[[ Home | Forums | 3D Engines Database | Wiki | Articles/Tutorials | Game Dev Jobs | IRC Chat Network | Contact Us ]]

Go Back   DevMaster.net Forums > Site Discussions > Articles Discussion
User Name
Password
Register FAQ Members List Search Today's Posts Mark Forums Read

Reply
 
Thread Tools Search this Thread Display Modes
Old 04-06-2007, 03:32 PM   #1
DmEditor
DevMaster Editor
 
Join Date: Jan 2005
Posts: 54
Default Modern Fixed-point Tricks and Optimizations

Modern Fixed-point Tricks and Optimizations
Author: Nils Pipenbrinck
Description: The article reviews some fixed-point issues that developers face with today's compilers, and presents an approach on creating more robust and faster routines.
DmEditor is offline   Reply With Quote
Old 04-07-2007, 10:06 AM   #2
Guard][an
New Member
 
Join Date: Mar 2006
Location: France
Posts: 8
Default Re: Modern Fixed-point Tricks and Optimizations

interresting article, i learnt things today thank you
Guard][an is offline   Reply With Quote
Old 04-07-2007, 07:46 PM   #3
TheNut
Senior Member
 
TheNut's Avatar
 
Join Date: Aug 2004
Location: Thornhill, TO
Posts: 850
Default Re: Modern Fixed-point Tricks and Optimizations

I never really thought about fixed-point calculations until I read this. It would have been nice to see some benchmarks with this though. Just to get an idea of what to expect when dealing between FPU and CPU, and when using vanilla compiler approaches versus optimized code.
___________________________________________
http://www.nutty.ca - Being a nut has its advantages.
TheNut is offline   Reply With Quote
Old 04-07-2007, 09:42 PM   #4
SigKILL
Valued Member
 
Join Date: Aug 2004
Location: Norway
Posts: 200
Default Re: Modern Fixed-point Tricks and Optimizations

The only thing I'm missing from the article is the fact that fixed point number doesn't have adaptive presicion like floating points. This is why you want to use fixed point for anything that should have uniform precision (so if you're engine doesn't use fixed point numbers to store position it's crap and can't be used for your next MMORPG).

Great article though (I didn't really read it carefully, but it seems very nice).

A friend of mine talked to a guy making software that computed weather forcasts. My friend asked what they did for compensating for floating point round-off and limited precisions for large numbers, and the guy answered that they didn't consider that after they changed to doubles *shrug*. No wonder the weather forcasts is so bad. I don't think double's are the standard in Canada, though. Since the weather forecasts here are even less reliable than in Norway.
SigKILL is offline   Reply With Quote
Old 04-08-2007, 04:55 AM   #5
roel
Senior Member
 
roel's Avatar
 
Join Date: Sep 2005
Location: .nl
Posts: 504
Default Re: Modern Fixed-point Tricks and Optimizations

A nice article indeed. Glad that it is finally uploaded

Quote:
Originally Posted by SigKILL
A friend of mine talked to a guy making software that computed weather forcasts. My friend asked what they did for compensating for floating point round-off and limited precisions for large numbers, and the guy answered that they didn't consider that after they changed to doubles *shrug*. No wonder the weather forcasts is so bad. I don't think double's are the standard in Canada, though. Since the weather forecasts here are even less reliable than in Norway.

That sounds like a nice scientific approach
roel is offline   Reply With Quote
Old 04-09-2007, 08:04 AM   #6
Nils Pipenbrinck
Senior Member
 
Nils Pipenbrinck's Avatar
 
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
Default Re: Modern Fixed-point Tricks and Optimizations

Hi all, thanks for the comments.

Quote:
Originally Posted by TheNut
It would have been nice to see some benchmarks with this though. Just to get an idea of what to expect when dealing between FPU and CPU, and when using vanilla compiler approaches versus optimized code.

Oh well - I left a benchmark out because performance always depends on the CPU and context. I know there are not as much embedded guys here, but the console developers are facing the same problems. (there is not that much difference between embedded development and low level jobs in the console world anyways).


Take the square root for example. I showed how to make the number of iterations adaptive to the input range. I could aswell unroll the loop and always do 16 iterations.

The later version would be much faster if benchmarked by calculating half a million square roots in a tight loop. In reality that does not happen that often. The unrolled but longer (more code) version would be slower albeit it benchmarks faster in isolation. Why this? The working set of code will have a negative impact on the performance. The code might not fit into the code cache anymore, and your performance will go down.

Ever had the "wtf, my program executes faster when optimized for size"-experience? It's the same effect. I had my wtf moment some years ago, when i optimized a engine on the playstation 1 just by moving functions around. That increased the performance by almost a factor of two.



Also it all depends on the architecture:

The modern x86 CPU's for example do an incredible job to execute bad code fast. This is due to the fact that we have historically only a very limited register-file. The modern x86 works with a lot more registers inside, and some serious magic is going on to transform the x86 code to something risc-like. Also we have tons of out of order and speculative execution things going on.

This can totally mask the effect of an optimization on one x86 variant while the same optimization might give a nice performance boost on some other x86.

On a different architecture, like the StrongARM for example things are different again. The StrongARM is pipelined and calculations can be performed in parallel, but it does not has speculative execution which might mask a call to a simple subroutine.
All architectures are different, and they all have different spots where an optimization might make a difference or not.

(side-rant: I like it that the CPU's are that fast and efficient, but sometimes I which bad code would crawl and be as painfull slow as possible. How can new programmers develop a feeling for efficiency on the x86 these days? They will never have the same "Ahh"-experience we old guys had in the pre-pentium days).



Comparing Float against Integer:

That's like comparing apples and oranges. Float is nice and dandy on some machines (x86) and slow as hell on others. (Ever tried float on a PocketPC of Mobile Phone?)

Float is always adaptive to the input range. That's cool. You'll get some kind of result back as long as you don't do stupid things like dividing by zero, or mixing extreme large and extreme small values in calculation. But you have a hard time to predict the loss of precision. All floating point results are mere rough estimates.

Not so with integers, you know exactly where and how many bits you throw away, and you can calculate anything up the the precision you want.


Here's a practical example where float would fail: At work, I'm working on a antialiased graphic library (once again), and I need to calculate the distance of any screen-pixel to an edge (side of a triangle for example).

The maximum render surface size is 4096x4096 so I need 12 bits for the integral position. To express subpixels, I use 16 bits fraction. The upper 8 bits of these are used or antialiasing, and the lower 8 bits are needed since I have to interpolate the edge-distance across the primitive.

That sums up to 12+16 bits = 28 bits to represent distances. Add one additional bit for the sign (I have negative distances as well). A vanilla floating point square root gives me 23 bits of precision. That's not enough in the worst case and would yield visible rendering errors.

Even if floating point would be available (it's not on that hardware) I had to use a fixed point square root anyways. In the end I use a 64 bit square root and It works like a charm. It will not break even if the maximum render surface size changes on day (that makes me feel good). In reality a 56 bit square root would be sufficient, but thanks to the CountLeadingZero operation using a 64 bit square root is just as fast (or slow).

Nils

Btw - when I find the time (doing a major garbage collection in my flat since two days, and there are still some heaps left, waiting to be deallocated) I'll write up some little benchmarks anyways and run them on my 1.3Ghz Athlon and on a StrongARM. The numbers don't tell much, but they'll show that the performance gain of optimizations does not translate well from one machine to another
___________________________________________
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

Last edited by Nils Pipenbrinck : 04-09-2007 at 08:10 AM.
Nils Pipenbrinck is offline   Reply With Quote
Old 04-09-2007, 07:11 PM   #7
monjardin
Senior Member
 
Join Date: Oct 2005
Location: Pensacola, FL
Posts: 1,028
Default Re: Modern Fixed-point Tricks and Optimizations

Thanks for the contribution! You've written a nice appendix to the article in that last post too.

I'm going to be doing some more fixed-point math on an ARM9 soon. So, the delay in getting this posted hadn't bother me yet.
___________________________________________
monjardin's JwN Meter (1,2,3,4,5,6):
|----|----|----|----|----|----|----|----|----|----|
*
monjardin is offline   Reply With Quote
Old 04-10-2007, 06:42 AM   #8
Nils Pipenbrinck
Senior Member
 
Nils Pipenbrinck's Avatar
 
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
Default Re: Modern Fixed-point Tricks and Optimizations

ARM9? which one exactly? Some of them have DSP extensions (which are a bit laughable if you've worked on a real DSP but nevertheless they are helpfull).
___________________________________________
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
Nils Pipenbrinck is offline   Reply With Quote
Old 04-10-2007, 09:14 AM   #9
SpreeTree
Valued Member
 
SpreeTree's Avatar
 
Join Date: Jan 2004
Location: England
Posts: 265
Default Re: Modern Fixed-point Tricks and Optimizations

Just want to post a quick message saying it's good to see some new articles here on DevMaster, and hopefully other people will see this as an opportunity to submit their own

Spree
SpreeTree is offline   Reply With Quote
Old 04-10-2007, 11:40 AM   #10
MikeJM
New Member
 
Join Date: Mar 2007
Posts: 8
Default Re: Modern Fixed-point Tricks and Optimizations

Do you know if such helpful extensions exist on the TI OMAP 1710 or the Intel XScale ARM920T PXA27x.
MikeJM is offline   Reply With Quote
Old 04-10-2007, 01:15 PM   #11
Nils Pipenbrinck
Senior Member
 
Nils Pipenbrinck's Avatar
 
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
Default Re: Modern Fixed-point Tricks and Optimizations

The OMAP has a DSP as and the ARM9 on a single chip, so you might go multiprocessor in your software and do part of the work on the dsp and on the arm.

XScales are already pretty efficient ARM CPU's and some versions of them have a powerfull mmx like instruction-set. Better than the PC mmx btw. If it's available depends on the exact XScale used (it's a family, not a single silicon). All XScales have the count leading zero instruction and some other extra instructions, saturating adds and such things. They are pipelined and can execute code quite fast.

Nils

The ARM920T is just a plain risc-core without any funky specials btw. Nothing to get excited about, but not to shabby either.
___________________________________________
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
Nils Pipenbrinck is offline   Reply With Quote
Old 04-11-2007, 04:25 PM   #12
monjardin
Senior Member
 
Join Date: Oct 2005
Location: Pensacola, FL
Posts: 1,028
Default Re: Modern Fixed-point Tricks and Optimizations

It's a ARM926EJ-S and has 16-bit DSP extensions amongst other features that I haven't really looked at yet. It does have a CLZ instruction though.
___________________________________________
monjardin's JwN Meter (1,2,3,4,5,6):
|----|----|----|----|----|----|----|----|----|----|
*
monjardin is offline   Reply With Quote
Old 04-12-2007, 05:05 AM   #13
.alexander.
New Member
 
Join Date: Apr 2007
Posts: 3
Default Re: Modern Fixed-point Tricks and Optimizations

Hello!

There is an ugly workround I have to use for intel compiler to multiply fixed point numbers.

Code:
__inline int MULSHIFT32 (int x, int y) { return (__int64) x * (__int64) y >> 32; }
leads to __allmul calls if MULSHIFT32 is used many times in a function. One can use inline assembly, but there is another solution
Code:
const __int64 bm64 = 0x00000000FFFFFFFF; __inline int MULSHIFT32 (int x, int y) { int z; z = (bm64 & x) * (bm64 & y) >> 32; z -= x & (y >> 31); z -= y & (x >> 31); return z; }

surprisingly, ICL compiles this function w/o __allmul calls and it works faster than inline assembly alternative. This workaround is ridiculous, there must be some compiler setting to use instead. Do you guys have any ideas about that?


I use this code for tests
Code:
#define FLTTOQ31(c) ((int)((double)(c) * (double)(1<<31))) void S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(int *Y, int *X, int R25, int b25, const int *D259) { int a5975, a5976, s9045, s9046, s9047, s9048; s9045 = ((MULSHIFT32(D259[0],X[0])) - (MULSHIFT30(D259[1],X[1]))); s9046 = ((MULSHIFT32(D259[1],X[0])) + (MULSHIFT30(D259[0],X[1]))); s9047 = ((MULSHIFT32(D259[2],X[32])) - (MULSHIFT30(D259[3],X[33]))); s9048 = ((MULSHIFT32(D259[3],X[32])) + (MULSHIFT30(D259[2],X[33]))); Y[b25] = (s9045 + s9047); Y[((R25 - 64) - b25)] = (s9046 - s9048); a5975 = (MULSHIFT32(FLTTOQ31(0.70710678118654757),(s9045 - s9047))); a5976 = (MULSHIFT32(FLTTOQ31(0.70710678118654757),(s9046 + s9048))); Y[(32 + b25)] = (a5975 - a5976); Y[((R25 - 64) - (b25 + 32))] = (a5976 + a5975); }

Last edited by Reedbeta : 04-12-2007 at 08:53 AM. Reason: added code tags
.alexander. is offline   Reply With Quote
Old 04-12-2007, 08:54 AM   #14
Reedbeta
DevMaster Staff
 
Join Date: Oct 2004
Location: Seattle, WA
Posts: 3,707
Default Re: Modern Fixed-point Tricks and Optimizations

(By the way, .alexander., please use the code tags when posting code! Thanks.)
___________________________________________
Currently working at Sucker Punch
reedbeta.com - OpenGL demos and other projects
Luabridge - a lightweight, dependency-free C++/Lua binding library.
CD Lite - an unobtrusive, minimal CD player application for Windows.
Reedbeta is offline   Reply With Quote
Old 04-12-2007, 10:38 AM   #15
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Modern Fixed-point Tricks and Optimizations

Quote:
Originally Posted by .alexander.
S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2
Good lord! Just out of curiosity, what does this function do and how did you came up with this name?
Nick is offline   Reply With Quote
Old 04-12-2007, 12:33 PM   #16
Mihail121
Senior Member
 
Mihail121's Avatar
 
Join Date: Jan 2003
Posts: 868
Default Re: Modern Fixed-point Tricks and Optimizations

Quote:
Originally Posted by .alexander.
Code:
#define FLTTOQ31(c) ((int)((double)(c) * (double)(1<<31))) void S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(int *Y, int *X, int R25, int b25, const int *D259) { int a5975, a5976, s9045, s9046, s9047, s9048; s9045 = ((MULSHIFT32(D259[0],X[0])) - (MULSHIFT30(D259[1],X[1]))); s9046 = ((MULSHIFT32(D259[1],X[0])) + (MULSHIFT30(D259[0],X[1]))); s9047 = ((MULSHIFT32(D259[2],X[32])) - (MULSHIFT30(D259[3],X[33]))); s9048 = ((MULSHIFT32(D259[3],X[32])) + (MULSHIFT30(D259[2],X[33]))); Y[b25] = (s9045 + s9047); Y[((R25 - 64) - b25)] = (s9046 - s9048); a5975 = (MULSHIFT32(FLTTOQ31(0.70710678118654757),(s9045 - s9047))); a5976 = (MULSHIFT32(FLTTOQ31(0.70710678118654757),(s9046 + s9048))); Y[(32 + b25)] = (a5975 - a5976); Y[((R25 - 64) - (b25 + 32))] = (a5976 + a5975); }

Jesus Lord Christ...

That thing, sir, is a 100% winner of this year's International Obfuscated C Coding Contest!
Mihail121 is offline   Reply With Quote
Old 04-13-2007, 05:39 AM   #17
.alexander.
New Member
 
Join Date: Apr 2007
Posts: 3
Default Re: Modern Fixed-point Tricks and Optimizations

Oh, i felt I should post completly functional chuck of code. Here it is

Code:
typedef long long Word64; const Word64 bm64 = 0x00000000FFFFFFFF; __inline int MULSHIFT30(int x, int y) { int z; z = (bm64 & x) * (bm64 & y) >> 32; z -= x & (y >> 31); z -= y & (x >> 31); return z << 2; } #define FLTTOQ31(c) ((int)((double)(c) * (double)(1<<31))) void S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(int *Y, int *X, int R25, int b25, const int *D259) { int a5975, a5976, s9045, s9046, s9047, s9048; s9045 = ((MULSHIFT30(D259[0],X[0])) - (MULSHIFT30(D259[1],X[1]))); s9046 = ((MULSHIFT30(D259[1],X[0])) + (MULSHIFT30(D259[0],X[1]))); s9047 = ((MULSHIFT30(D259[2],X[32])) - (MULSHIFT30(D259[3],X[33]))); s9048 = ((MULSHIFT30(D259[3],X[32])) + (MULSHIFT30(D259[2],X[33]))); Y[b25] = (s9045 + s9047); Y[((R25 - 64) - b25)] = (s9046 - s9048); a5975 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),(s9045 - s9047))); a5976 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),(s9046 + s9048))); Y[(32 + b25)] = (a5975 - a5976); Y[((R25 - 64) - (b25 + 32))] = (a5976 + a5975); } void Sx_H1_I2_PRDFT3_32_1_D_BHN32_G_BH4(int *Y, int *X, int R26, int b26) { int a6073, a6074, a6075, a6076, a6077, a6078, a6079, a6080, s9129, s9130, s9131, s9132, s9133, s9134, s9135, s9136, s9137, s9138, s9139, s9140, s9141, s9142, s9143, s9144, s9145, s9146, s9147, s9148, s9149, s9150, s9151, s9152, s9153, s9154, s9155, s9156, s9157, s9158, s9159, s9160, s9161, s9162, s9163, s9164, s9165, s9166, s9167, s9168, s9169, s9170, s9171, s9172, s9173, s9174, s9175, s9176, s9177, s9178, s9179, s9180, s9181, s9182, s9183, s9184, s9185, s9186, s9187, s9188, s9189, s9190, s9191, s9192, s9193, s9194, s9195, s9196, s9197, s9198, s9199, s9200, t17957, t17958, t17959, t17960, t17961, t17962, t17963, t17964, t17965, t17966, t17967, t17968, t17969, t17970, t17971, t17972, t17973, t17974, t17975, t17976, t17977, t17978, t17979, t17980, t17981, t17982, t17983, t17984, t17985, t17986, t17987, t17988, t17989, t17990, t17991, t17992, t17993, t17994, t17995, t17996, t17997, t17998, t17999, t18000, t18001, t18002, t18003, t18004, t18005, t18006, t18007, t18008, t18009, t18010, t18011, t18012, t18013, t18014, t18015, t18016, t18017, t18018, t18019, t18020, t18021, t18022, t18023, t18024, t18025, t18026, t18027, t18028, t18029, t18030, t18031, t18032, t18033, t18034, t18035, t18036, t18037, t18038, t18039, t18040, t18041, t18042, t18043, t18044, t18045, t18046, t18047, t18048, t18049, t18050, t18051, t18052; s9129 = X[b26]; a6073 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),X[((R26 - 64) - (b26 + 32))])); a6074 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),X[(32 + b26)])); s9130 = (a6074 + a6073); t17957 = (s9129 + s9130); t17958 = (s9129 - s9130); s9131 = X[((R26 - 64) - b26)]; s9132 = (a6074 - a6073); t17959 = (s9132 - s9131); t17960 = (s9131 + s9132); s9133 = X[(16 + b26)]; s9134 = X[((R26 - 64) - (b26 + 16))]; s9135 = X[(48 + b26)]; s9136 = X[((R26 - 64) - (b26 + 48))]; s9137 = ((MULSHIFT30(FLTTOQ31(0.92387953251128674),s9133)) + (MULSHIFT30(FLTTOQ31(0.38268343236508978),s9134))); s9138 = ((MULSHIFT30(FLTTOQ31(0.38268343236508978),s9135)) + (MULSHIFT30(FLTTOQ31(0.92387953251128674),s9136))); t17961 = (s9137 - s9138); t17962 = (s9137 + s9138); s9139 = ((MULSHIFT30(FLTTOQ31(0.38268343236508978),s9133)) - (MULSHIFT30(FLTTOQ31(0.92387953251128674),s9134))); s9140 = ((MULSHIFT30(FLTTOQ31(0.92387953251128674),s9135)) - (MULSHIFT30(FLTTOQ31(0.38268343236508978),s9136))); t17963 = (s9139 - s9140); t17964 = (s9139 + s9140); t17965 = (t17957 + t17962); s9141 = X[((R26 - 64) - (b26 + 24))]; s9142 = X[(24 + b26)]; s9143 = ((MULSHIFT30(FLTTOQ31(0.38268343236508978),s9142)) - (MULSHIFT30(FLTTOQ31(0.92387953251128674),s9141))); s9144 = X[((R26 - 64) - (b26 + 56))]; s9145 = X[(56 + b26)]; s9146 = ((MULSHIFT30(FLTTOQ31(0.92387953251128674),s9145)) - (MULSHIFT30(FLTTOQ31(0.38268343236508978),s9144))); t17966 = (s9143 + s9146); s9147 = X[((R26 - 64) - (b26 + 8))]; a6075 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),X[((R26 - 64) - (b26 + 40))])); a6076 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),X[(40 + b26)])); s9148 = (a6076 - a6075); t17967 = (s9148 - s9147); t17968 = (t17967 + t17966); s9149 = ((MULSHIFT30(FLTTOQ31(0.92387953251128674),s9142)) + (MULSHIFT30(FLTTOQ31(0.38268343236508978),s9141))); s9150 = ((MULSHIFT30(FLTTOQ31(0.38268343236508978),s9145)) + (MULSHIFT30(FLTTOQ31(0.92387953251128674),s9144))); t17969 = (s9149 + s9150); s9151 = X[(8 + b26)]; s9152 = (a6076 + a6075); t17970 = (s9151 + s9152); t17971 = (t17970 + t17969); s9153 = ((MULSHIFT30(FLTTOQ31(0.98078528040323043),t17971)) - (MULSHIFT30(FLTTOQ31(0.19509032201612825),t17968))); t17972 = (t17965 + s9153); t17973 = (t17965 - s9153); t17974 = (t17959 + t17964); s9154 = ((MULSHIFT30(FLTTOQ31(0.19509032201612825),t17971)) + (MULSHIFT30(FLTTOQ31(0.98078528040323043),t17968))); t17975 = (t17974 + s9154); t17976 = (t17974 - s9154); t17977 = (t17957 - t17962); t17978 = (t17967 - t17966); t17979 = (t17970 - t17969); s9155 = ((MULSHIFT30(FLTTOQ31(0.19509032201612825),t17979)) + (MULSHIFT30(FLTTOQ31(0.98078528040323043),t17978))); t17980 = (t17977 + s9155); t17981 = (t17977 - s9155); t17982 = (t17959 - t17964); s9156 = ((MULSHIFT30(FLTTOQ31(0.98078528040323043),t17979)) - (MULSHIFT30(FLTTOQ31(0.19509032201612825),t17978))); t17983 = (s9156 - t17982); t17984 = (t17982 + s9156); t17985 = (t17958 - t17963); t17986 = (s9143 - s9146); t17987 = (s9151 - s9152); t17988 = (t17987 - t17986); t17989 = (s9149 - s9150); t17990 = (s9147 + s9148); t17991 = (t17989 - t17990); s9157 = ((MULSHIFT30(FLTTOQ31(0.55557023301960218),t17988)) - (MULSHIFT30(FLTTOQ31(0.83146961230254524),t17991))); t17992 = (t17985 - s9157); t17993 = (t17985 + s9157); t17994 = (t17961 - t17960); s9158 = ((MULSHIFT30(FLTTOQ31(0.83146961230254524),t17988)) + (MULSHIFT30(FLTTOQ31(0.55557023301960218),t17991))); t17995 = (t17994 - s9158); t17996 = (t17994 + s9158); t17997 = (t17958 + t17963); t17998 = (t17990 + t17989); t17999 = (t17987 + t17986); s9159 = ((MULSHIFT30(FLTTOQ31(0.83146961230254524),t17999)) - (MULSHIFT30(FLTTOQ31(0.55557023301960218),t17998))); t18000 = (t17997 - s9159); t18001 = (t17997 + s9159); t18002 = (t17960 + t17961); s9160 = ((MULSHIFT30(FLTTOQ31(0.55557023301960218),t17999)) + (MULSHIFT30(FLTTOQ31(0.83146961230254524),t17998))); t18003 = (t18002 - s9160); t18004 = (t18002 + s9160); s9161 = X[(4 + b26)]; a6077 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),X[((R26 - 64) - (b26 + 36))])); a6078 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),X[(36 + b26)])); s9162 = (a6078 + a6077); t18005 = (s9161 - s9162); t18006 = (s9161 + s9162); s9163 = X[((R26 - 64) - (b26 + 4))]; s9164 = (a6078 - a6077); t18007 = (s9163 + s9164); t18008 = (s9164 - s9163); s9165 = X[(20 + b26)]; s9166 = X[((R26 - 64) - (b26 + 20))]; s9167 = X[(52 + b26)]; s9168 = X[((R26 - 64) - (b26 + 52))]; s9169 = ((MULSHIFT30(FLTTOQ31(0.92387953251128674),s9165)) + (MULSHIFT30(FLTTOQ31(0.38268343236508978),s9166))); s9170 = ((MULSHIFT30(FLTTOQ31(0.38268343236508978),s9167)) + (MULSHIFT30(FLTTOQ31(0.92387953251128674),s9168))); t18009 = (s9169 + s9170); t18010 = (s9169 - s9170); s9171 = ((MULSHIFT30(FLTTOQ31(0.38268343236508978),s9165)) - (MULSHIFT30(FLTTOQ31(0.92387953251128674),s9166))); s9172 = ((MULSHIFT30(FLTTOQ31(0.92387953251128674),s9167)) - (MULSHIFT30(FLTTOQ31(0.38268343236508978),s9168))); t18011 = (s9171 + s9172); t18012 = (s9171 - s9172); t18013 = (t18006 + t18009); t18014 = (t18008 + t18011); t18015 = (t18006 - t18009); t18016 = (t18008 - t18011); t18017 = (t18005 - t18012); t18018 = (t18010 - t18007); t18019 = (t18005 + t18012); t18020 = (t18007 + t18010); s9173 = X[(12 + b26)]; a6079 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),X[((R26 - 64) - (b26 + 44))])); a6080 = (MULSHIFT30(FLTTOQ31(0.70710678118654757),X[(44 + b26)])); s9174 = (a6080 + a6079); t18021 = (s9173 - s9174); t18022 = (s9173 + s9174); s9175 = X[((R26 - 64) - (b26 + 12))]; s9176 = (a6080 - a6079); t18023 = (s9175 + s9176); t18024 = (s9176 - s9175); s9177 = X[(28 + b26)]; s9178 = X[((R26 - 64) - (b26 + 28))]; s9179 = X[(60 + b26)]; s9180 = X[((R26 - 64) - (b26 + 60))]; s9181 = ((MULSHIFT30(FLTTOQ31(0.92387953251128674),s9177)) + (MULSHIFT30(FLTTOQ31(0.38268343236508978),s9178))); s9182 = ((MULSHIFT30(FLTTOQ31(0.38268343236508978),s9179)) + (MULSHIFT30(FLTTOQ31(0.92387953251128674),s9180))); t18025 = (s9181 - s9182); t18026 = (s9181 + s9182); s9183 = ((MULSHIFT30(FLTTOQ31(0.38268343236508978),s9177)) - (MULSHIFT30(FLTTOQ31(0.92387953251128674),s9178))); s9184 = ((MULSHIFT30(FLTTOQ31(0.92387953251128674),s9179)) - (MULSHIFT30(FLTTOQ31(0.38268343236508978),s9180))); t18027 = (s9183 + s9184); t18028 = (s9183 - s9184); t18029 = (t18022 + t18026); t18030 = (t18024 + t18027); t18031 = (t18022 - t18026); t18032 = (t18024 - t18027); t18033 = (t18021 - t18028); t18034 = (t18025 - t18023); t18035 = (t18021 + t18028); t18036 = (t18023 + t18025); s9185 = ((MULSHIFT30(FLTTOQ31(0.99518472667219693),t18013)) - (MULSHIFT30(FLTTOQ31(0.098017140329560604),t18014))); s9186 = ((MULSHIFT30(FLTTOQ31(0.95694033573220882),t18029)) - (MULSHIFT30(FLTTOQ31(0.29028467725446233),t18030))); t18037 = (s9185 + s9186); t18038 = (s9185 - s9186); s9187 = ((MULSHIFT30(FLTTOQ31(0.098017140329560604),t18013)) + (MULSHIFT30(FLTTOQ31(0.99518472667219693),t18014))); s9188 = ((MULSHIFT30(FLTTOQ31(0.29028467725446233),t18029)) + (MULSHIFT30(FLTTOQ31(0.95694033573220882),t18030))); t18039 = (s9187 + s9188); t18040 = (s9187 - s9188); Y[0] = (t17972 + t18037); Y[1] = (t17975 + t18039); Y[30] = (t17972 - t18037); Y[31] = (t18039 - t17975); Y[16] = (t17973 - t18040); Y[17] = (t17976 + t18038); Y[14] = (t17973 + t18040); Y[15] = (t18038 - t17976); s9189 = ((MULSHIFT30(FLTTOQ31(0.95694033573220882),t18019)) - (MULSHIFT30(FLTTOQ31(0.29028467725446233),t18020))); s9190 = ((MULSHIFT30(FLTTOQ31(0.63439328416364549),t18035)) - (MULSHIFT30(FLTTOQ31(0.77301045336273699),t18036))); t18041 = (s9189 - s9190); t18042 = (s9189 + s9190); s9191 = ((MULSHIFT30(FLTTOQ31(0.29028467725446233),t18019)) + (MULSHIFT30(FLTTOQ31(0.95694033573220882),t18020))); s9192 = ((MULSHIFT30(FLTTOQ31(0.77301045336273699),t18035)) + (MULSHIFT30(FLTTOQ31(0.63439328416364549),t18036))); t18043 = (s9191 - s9192); t18044 = (s9191 + s9192); Y[2] = (t18001 + t18042); Y[3] = (t18004 + t18044); Y[28] = (t18001 - t18042); Y[29] = (t18044 - t18004); Y[18] = (t18000 - t18043); Y[19] = (t18003 + t18041); Y[12] = (t18000 + t18043); Y[13] = (t18041 - t18003); s9193 = ((MULSHIFT30(FLTTOQ31(0.88192126434835505),t18017)) - (MULSHIFT30(FLTTOQ31(0.47139673682599764),t18018))); s9194 = ((MULSHIFT30(FLTTOQ31(0.098017140329560604),t18033)) - (MULSHIFT30(FLTTOQ31(0.99518472667219693),t18034))); t18045 = (s9193 + s9194); t18046 = (s9193 - s9194); s9195 = ((MULSHIFT30(FLTTOQ31(0.47139673682599764),t18017)) + (MULSHIFT30(FLTTOQ31(0.88192126434835505),t18018))); s9196 = ((MULSHIFT30(FLTTOQ31(0.99518472667219693),t18033)) + (MULSHIFT30(FLTTOQ31(0.098017140329560604),t18034))); t18047 = (s9195 + s9196); t18048 = (s9195 - s9196); Y[4] = (t17993 + t18045); Y[5] = (t17996 + t18047); Y[26] = (t17993 - t18045); Y[27] = (t18047 - t17996); Y[20] = (t17992 - t18048); Y[21] = (t17995 + t18046); Y[10] = (t17992 + t18048); Y[11] = (t18046 - t17995); s9197 = ((MULSHIFT30(FLTTOQ31(0.77301045336273699),t18015)) + (MULSHIFT30(FLTTOQ31(0.63439328416364549),t18016))); s9198 = ((MULSHIFT30(FLTTOQ31(0.88192126434835505),t18032)) - (MULSHIFT30(FLTTOQ31(0.47139673682599764),t18031))); t18049 = (s9197 + s9198); t18050 = (s9197 - s9198); s9199 = ((MULSHIFT30(FLTTOQ31(0.63439328416364549),t18015)) - (MULSHIFT30(FLTTOQ31(0.77301045336273699),t18016))); s9200 = ((MULSHIFT30(FLTTOQ31(0.88192126434835505),t18031)) + (MULSHIFT30(FLTTOQ31(0.47139673682599764),t18032))); t18051 = (s9199 + s9200); t18052 = (s9199 - s9200); Y[6] = (t17980 + t18049); Y[7] = (t17983 + t18051); Y[24] = (t17980 - t18049); Y[25] = (t18051 - t17983); Y[22] = (t17981 - t18052); Y[23] = (t18050 - t17984); Y[8] = (t17981 + t18052); Y[9] = (t17984 + t18050); } const int dat18[64] = { 0x7ffd885a, 0x01921d20, 0x7ff62182, 0x03242abf, 0x7fe9cbc0, 0x04b6195d, 0x7fa736b4, 0x096a9049, 0x7fc25596, 0x07d95b9e, 0x7f0991c4, 0x0fab272b, 0x7f872bf3, 0x0afb6805, 0x7e1d93ea, 0x15e21445, 0x7f3857f6, 0x0e1bc2e4, 0x7ce3ceb2, 0x1c0b826a, 0x7ed5e5c6, 0x1139f0cf, 0x7b5d039e, 0x2223a4c5, 0x7e5fe493, 0x145576b1, 0x798a23b1, 0x2826b928, 0x7dd6668f, 0x176dd9de, 0x776c4edb, 0x2e110a62, 0x7d3980ec, 0x1a82a026, 0x7504d345, 0x33def287, 0x7c894bde, 0x1d934fe5, 0x72552c85, 0x398cdd32, 0x7bc5e290, 0x209f701c, 0x6f5f02b2, 0x3f1749b8, 0x7aef6323, 0x23a6887f, 0x6c242960, 0x447acd50, 0x7a05eead, 0x26a82186, 0x68a69e81, 0x49b41533, 0x7909a92d, 0x29a3c485, 0x64e88926, 0x4ebfe8a5, 0x77fab989, 0x2c98fbba, 0x60ec3830, 0x539b2af0, 0x76d94989, 0x2f875262, 0x5cb420e0, 0x5842dd54 }; void DCT4_64(int *Y, int *X) { int buf20[64]; Sx_H1_I2_PRDFT3_32_1_D_BHN32_G_BH4(buf20, X, 127, 0); Sx_H1_I2_PRDFT3_32_1_D_BHN32_G_BH4(buf20 + 32, X, 127, 1); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20, 127, 0, dat18); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 2, 127, 1, (dat18 + 4)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 4, 127, 2, (dat18 + 8)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 6, 127, 3, (dat18 + 12)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 8, 127, 4, (dat18 + 16)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 10, 127, 5, (dat18 + 20)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 12, 127, 6, (dat18 + 24)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 14, 127, 7, (dat18 + 28)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 16, 127, 8, (dat18 + 32)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 18, 127, 9, (dat18 + 36)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 20, 127, 10, (dat18 + 40)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 22, 127, 11, (dat18 + 44)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 24, 127, 12, (dat18 + 48)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 26, 127, 13, (dat18 + 52)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 28, 127, 14, (dat18 + 56)); S_BH32_D_BHN4_PRDFT3_4_7T_RD_FDataOfs_I2_Gx_H16_I2(Y, buf20 + 30, 127, 15, (dat18 + 60)); }
.alexander. is offline   Reply With Quote
Old 04-13-2007, 05:53 AM   #18
.alexander.
New Member
 
Join Date: Apr 2007
Posts: 3
Default Re: Modern Fixed-point Tricks and Optimizations

I hope now it's clear

But if you still feel obfuscated: the function i posted first is part of fast implementation of DCT4 transform of size 64.
I got it from code generator (http://www.spiral.net/codegenerator.html).

What about intel C complier settings? Does anyone know how to tune it in this case?
.alexander. is offline   Reply With Quote
Old 04-15-2007, 04:25 PM   #19
Nils Pipenbrinck
Senior Member
 
Nils Pipenbrinck's Avatar
 
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
Default Re: Modern Fixed-point Tricks and Optimizations

Quote:
Originally Posted by .alexander.
Code:
const __int64 bm64 = 0x00000000FFFFFFFF; __inline int MULSHIFT32 (int x, int y) { int z; z = (bm64 & x) * (bm64 & y) >> 32; z -= x & (y >> 31); z -= y & (x >> 31); return z; }

surprisingly, ICL compiles this function w/o __allmul calls and it works faster than inline assembly alternative. This workaround is ridiculous, there must be some compiler setting to use instead. Do you guys have any ideas about that?

Hi Alexander,

That's a nice one! Thanks for posting it. I'm still a bit puzzled how it works, but it looks like you're breaking up the 32*32=64 bit multiplication into two c-compatible 32*32 bit multiplications. I'd expect that you need three multiplications to get an exact result. I can only guess, but I think your last two binary logic lines do that.

It's a shame that this code works as fast as it does. After all, it's nearly three times slower than doing a single imul instructions. Add the overhead of my self-made 64 bit shift macro to that and we end up beeing roughly 5 times slower than an IMUL / SHRD instruction pair (estimated).


I'm also pessimistic that we'll see any improvements of 32 bit compilers. Don't they spend all their resources on the x86 64 bit cpu's these days?

The only way to get this part globally optimized would be to somehow get some fixed point arithmetic into important benchmarks, and this is near to impossible.


Nils
___________________________________________
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
Nils Pipenbrinck is offline   Reply With Quote
Old 04-15-2007, 04:26 PM   #20
Nils Pipenbrinck
Senior Member
 
Nils Pipenbrinck's Avatar
 
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
Default Re: Modern Fixed-point Tricks and Optimizations

Btw - thanks for posting the link to the code-generator site. looks interesting. I like their constant multiplication generator a lot!
___________________________________________
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
Nils Pipenbrinck is offline   Reply With Quote
Old 04-26-2007, 06:59 PM   #21
MikeJM
New Member
 
Join Date: Mar 2007
Posts: 8
Default Re: Modern Fixed-point Tricks and Optimizations

Hey there I finally got around to checking the assembly generated by some of my fixed-point routines.

I'm generating ARMv5 code using RVCT 2.2. I'm not very strong with ARM assembly so I'd like to know if there are things that can be done to help.

(note code assembly for C++ code instantiated with M = 16, F = 16, base_type = int, next_type = int64, and fractional_bits = 16)

Fixed point multiply routine:
Code:
template<std::size_t M, std::size_t F> inline fixed<M,F> fixed<M,F>::operator * ( const fixed& rhs ) const { fixed nrv; nrv.data = static_cast<base_type>( (next_type(data) * rhs.data) >> fractional_bits); return nrv; } _Z8fix_testN6method5fixedILj16ELj16EEES1_ PROC SMULL r12,r1,r0,r1 LSR r0,r12,#16 ORR r0,r0,r1,LSL #16 BX lr ENDP

Fixed point divide routine:
Code:
template<std::size_t M, std::size_t F> inline fixed<M,F> fixed<M,F>::operator / ( const fixed& rhs ) const { fixed nrv; nrv.data = static_cast<base_type>( (next_type(data) << fractional_bits) / rhs.data ); return nrv; } _Z8fix_testN6method5fixedILj16ELj16EEES1_ PROC MOV r2,r1 ASR r3,r1,#31 ASR r1,r0,#31 LSL r1,r1,#16 ORR r1,r1,r0,LSR #16 LSL r0,r0,#16 B __aeabi_ldivmod ENDP

How does this compare to other's code. Is there room for improvement?

Thanks,

Michael Marcin
MikeJM is offline   Reply With Quote
Old 04-27-2007, 03:41 AM   #22
Nils Pipenbrinck
Senior Member
 
Nils Pipenbrinck's Avatar
 
Join Date: Sep 2005
Location: Hamburg / Germany
Posts: 597
Default Re: Modern Fixed-point Tricks and Optimizations

Code:
_Z8fix_testN6method5fixedILj16ELj16EEES1_ PROC SMULL r12,r1,r0,r1 LSR r0,r12,#16 ORR r0,r0,r1,LSL #16 BX lr ENDP

Looks good. Code is as tight as it can be. The compiler will most probably inline these into the surrounding code.


For the fixed point division: The ARM does not has dividers, so they have to do a function call. What you have here is the worst case: It casts booth numbers to 64 bits and calls a 64bit /64bit divide routine.

You might get something between 20% and 50% speedup if you roll your own division that is specialized to the fixed point case, but that would require assembly programming.

Nils
___________________________________________
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
Nils Pipenbrinck is offline   Reply With Quote
Old 05-21-2007, 09:16 PM   #23
MikeJM
New Member
 
Join Date: Mar 2007
Posts: 8
Default Re: Modern Fixed-point Tricks and Optimizations

FWIW VC8 has _BitScanReverse in <intrin.h>

So you can write something like

inline unsigned long count_leading_zeros( unsigned long a )
{
unsigned long b;
_BitScanReverse( &b, a );
return b^31;
}

It doesn't appear to exist for VC7.1
MikeJM is offline   Reply With Quote
Old 07-02-2007, 04:50 AM   #24
muratmat
New Member
 
Join Date: Jul 2006
Posts: 4
Default Re: Modern Fixed-point Tricks and Optimizations

Yo Nils, I'm Matteo of AmanithVG!
Nice article, some tricks helped us to speedup our incoming software rasterizer!

Thx again (i wish to hear to you via email, how are you?).
muratmat is offline   Reply With Quote
Old 07-08-2007, 04:43 PM   #25
MikeJM
New Member
 
Join Date: Mar 2007
Posts: 8
Default Re: Modern Fixed-point Tricks and Optimizations

On VC8 how about

Code:
#include <intrin.h> #pragma intrinsic(__ll_rshift) int fixmul( int a, int b ) { return (int)__ll_rshift( (__int64)a*b, 16 ); }

Last edited by MikeJM : 07-08-2007 at 05:45 PM.
MikeJM is offline   Reply With Quote
Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Forum Jump


All times are GMT -7. The time now is 10:18 PM.


Powered by vBulletin
Copyright ©2000 - 2009, Jelsoft Enterprises Ltd.