![]() |
| [[ Home | Forums | 3D Engines Database | Wiki | Articles/Tutorials | Game Dev Jobs | IRC Chat Network | Contact Us ]] |
|
|
#1 |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
|
Hi all,
In some situations you need a good approximation of sine and cosine that runs at very high performance. One example is to implement dynamic subdivision of round surfaces, comparable to those in Quake 3. Or to implement wave motion, in case there are no vertex shaders 2.0 available. The standard C sinf() and cosf() functions are terribly slow and offer much more precision than what we actually need. What we really want is an approximation that offers the best compromise between precision and performance. The most well known approximation method is to use a Taylor series about 0 (also known as a Maclaurin series), which for sine becomes: Code:
We could add another term, which in fact reduces the error significantly, but this makes the formula pretty lengthy. We already need 7 multiplications and 3 additions for the 4-term version. The taylor series just can't give us the precision and performance we're looking for. We did however note that we need sine(pi) = 0. And there's another thing we can see from taylor_zoom.gif: this looks very much like a parabola! So let's try to find the formula of a parabola that matches it as closely as possible. The generic formula for a parabola is A + B x + C x^2. So this gives us three degrees of freedom. Obvious choises are that we want sine(0) = 0, sine(pi/2) = 1 and sine(pi) = 0. This gives us the following three equations: Code:
Unfortunately it's not very practical yet. This is what we get in the [-pi, pi] range: negative.gif. It's quite obvious we want at least a full period. But it's also clear that it's just another parabola, mirrored around the origin. The formula for it is 4/pi x + 4/pi^2 x^2. So the straightforward (pseudo-C) solution is: Code:
Now let's look at cosine. Basic trigonometry tells us that cosine(x) = sine(pi/2 + x). Is that all, add pi/2 to x? No, we actually get the unwanted part of the parabola again: shift_sine.gif. What we need to do is 'wrap around' when x > pi/2. This can be done by subtracting 2 pi. So the code becomes: Code:
Code:
Now, a maximum error of 0.056 is nice but clearly the 4-term Taylor series still has a smaller error on average. Recall what our sine looked like: abs.gif. So isn't there anything we can do to further increase precision at minimal cost? Of course the current version is already applicable for many situations where something that looks like a sine is just as good as the real sine. But for other situations that's just not good enough. Looking at the graphs you'll notice that our approximation always overestimates the real sine, except at 0, pi/2 and pi. So what we need is to 'scale it down' without touching these important points. The solution is to use the squared parabola, which looks like this: squared.gif. Note how it preserves those important points but it's always lower than the real sine. So we can use a weighted average of the two to get a better approximation: Code:
It shouldn't come as a big surprise that to make it work for negative x as well we need a second abs() operation. The final C code for the sine becomes: Code:
Last but not least, I wouldn't be Nick if I didn't include the SIMD optimized assembly version. It allows to perform the wrap operation very efficiently so I'll give you the cosine: Code:
So, the conclusion is don't ever again use a Taylor series to approximate a sine or cosine! To add a useful discussion to this article, I'd love to hear if anyone knows good approximations for other transcendental functions like the exponential, logarithm and power functions. Cheers, Nick Last edited by Nick : 04-23-2006 at 09:56 AM. |
|
|
|
|
|
#2 |
|
Valued Member
Join Date: Aug 2005
Posts: 189
|
one word: interesting =)
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :) Bramz' warehouse | LiAR isn't a raytracer |
|
|
|
|
|
#3 |
|
Senior Member
Join Date: Jan 2003
Location: Switzerland
Posts: 1,331
|
hehe, i've went trough this quite a while ago, while sitting in school, bored by the math lessons i knew already. i tried different ways to approximate sine, cosine, and all the others. but i don't remember. i haven't had your idea to use a lerp between the parabola and the squared parabola, though. i think i used x^3 because it looks more like sine by default.
exp was great, as you could split the value into integer and fractional part, do the integer with bitshifting, and the fraction with some nice approx. something like that i remember.. man, thats years back ![]()
___________________________________________
davepermen.net -Loving a Person is having the wish to see this Person happy, no matter what that means to yourself. -No matter what it means to myself.... |
|
|
|
|
|
#4 |
|
New Member
Join Date: Aug 2005
Location: San Francisco, MA
Posts: 24
|
Nice job Nick, very clever.
Max |
|
|
|
|
|
#5 |
|
Senior Member
Join Date: Jan 2003
Location: Switzerland
Posts: 1,331
|
hm.. you should present a fast way to fit a number into the -pi/2 - +pi/2 range.
is there a fractional() instruction in SSE? or some modulus directly?
___________________________________________
davepermen.net -Loving a Person is having the wish to see this Person happy, no matter what that means to yourself. -No matter what it means to myself.... |
|
|
|
|
|
#6 |
|
New Member
Join Date: Apr 2006
Posts: 1
|
Hello,
This is my first post here, but I've visited here for a while now. I did a quick test on accuracies and found the fast sin function to have about 12.9% error without added precision and -0.2% error with added precision. Here's the code, if you'd like to have a look to make sure I didn't do something that might be giving false figures. Code:
Slaru |
|
|
|
|
|
#7 | ||
|
Valued Member
Join Date: Mar 2005
Posts: 135
|
Note: Slaru, i'm sure you could gain a couple ulp by adding more digits to your constants.
Quote:
http://gcc.gnu.org/ml/gcc-patches/2006-03/msg01611.html Further down the thread there's also some rather amusing remarks from Intel drones. Quote:
Last edited by tbp : 04-24-2006 at 12:58 AM. |
||
|
|
|
|
|
#8 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
|
Quote:
By the way, the relative error of the Taylor series is infinity at x = pi. :o Absolute error makes more sense here. |
|
|
|
|
|
|
#9 |
|
Valued Member
Join Date: Aug 2005
Posts: 189
|
how can you have a -0.2% error?
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :) Bramz' warehouse | LiAR isn't a raytracer |
|
|
|
|
|
#10 | |
|
Valued Member
Join Date: Aug 2005
Posts: 189
|
Quote:
Relative errors do make sense when talking about signals, because the amplitude isn't always equal to one. And it also makes sense to compare the error for different angles ...
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :) Bramz' warehouse | LiAR isn't a raytracer |
|
|
|
|
|
|
#11 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
|
Quote:
|
|
|
|
|
|
|
#12 | |
|
Valued Member
Join Date: Aug 2005
Posts: 189
|
Quote:
In particular, it also makes sense to talk about the total harmonic distortion
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :) Bramz' warehouse | LiAR isn't a raytracer |
|
|
|
|
|
|
#13 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
|
Quote:
So what we really need is a function to weigh the error absolutely near zero, and somewhat relative towards infinity. exp(-x) seems very suited. This way the maximum 'exponential' error of the parabola is 0.037, compared to 0.075 for the Taylor series. The squared parabola gives me 0.0006 with adjusted Q = 0.7775, P = 0.2225. No matter how you look at it, this is a fast and accurate approximation. ![]() |
|
|
|
|
|
|
#14 | |
|
Valued Member
Join Date: Aug 2005
Posts: 189
|
Quote:
Let us not mix two kinds of error measurments, ok? =) The reason why it goes towards infinity there, is because ... well ... because you're making an infinitely large error, no matter how you look at it. However, we don't need to panic either. For sines, we can get a nice figure of overall relative error by using the total harmonic distortion as I mentioned in the other post. This is in particular true if you're using this function to generate signals. And btw, I _do_ like your approximation ![]()
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :) Bramz' warehouse | LiAR isn't a raytracer Last edited by bramz : 04-24-2006 at 06:34 AM. |
|
|
|
|
|
|
#15 | ||
|
Valued Member
Join Date: Mar 2005
Posts: 135
|
Quote:
Even if you only care about ]0, inf[ latency has already gone through the roof. Anyway, you're right i forgot to mention them ![]() Quote:
![]() I stand by my initial point, that is paying for the latency, decoding bandwitdth etc for 3 ops (andn, and, or) for something as mundane as a cond move is silly. Even more so when you strive to avoid branches when dealing with SSE to begin with, branches that you remove with... you know what. Last edited by tbp : 04-24-2006 at 07:35 AM. |
||
|
|
|
|
|
#16 |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
|
By the way, for those who think the squared parabola factor fell out of thin air, expand the formula and compare it with the Taylor series about x = pi/2:
Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2 = Q 4/pi - (Q 4/pi^2 + P 16/pi^2) x^2 - P 32/pi^3 x^3 + P 16/pi^4 x^4 ~= 0.986 x + 0.0507 x^2 - 0.232 x^3 + 0.0370 x^4 sin(x ~ pi/2) ~= 1 - 1/2 (x - 1/2 Pi)^2 + 1/24 (x - 1/2 Pi)^4 ~= 0.0200 + 0.925 x + 0.117 x^2 - 0.262 x^3 + 0.0417 x^4 Note they are both polynomials with roughly comparable coefficients. The parabola method just has 'corrected' coefficients to make sure sin(0) = 0, sin(pi/2) = 1, sin(pi) = 0 and the erros are minimal. In fact you can also approach the problem starting from a generic order 4 polynomial: A + B x + C x^2 + D x^3 + E x^4 Using the restrictions sin(0) = 0, sin(pi/2) = 1, sin(pi) = 0 yields: A = 0, B = 1/(4 pi) (16 + 2 D pi^3 + 3 E pi^4), C = -1/4 (16 + 6 D pi^3 + 7 E pi^4) / pi^2. D and E are still free parameters. So we can choose two more restrictions, for example, the first and second derivative in pi/2 should be 0 and -1 respectively. This yields A = 0, B = -1/2 (-16 + pi^2) / pi, C = 1/2 (-48 + 5 pi^2) / pi^2, D = -4 (-8 + pi^2) / pi^3, E = 2 (-8 + pi^2) / pi^4. Which gives us: 0.976 x + 0.0683 x^2 - 0.241 x^3 + 0.0384 x^4 This is pretty close to the squared parabola method and has some nice properties, but the maximum error is a little bigger. By the way, for order 3 and 4 polynomials we both get the simple parabola. Anyway, the ad-hoc explanation is a bit easier to understand I think, but this gives it a bit of theoretical background. It's nothing more than fitting a function with a polynomial using some restrictions (exact points and derivatives). Minimizing the error is another very useful restriction... Last edited by Nick : 04-24-2006 at 11:50 AM. |
|
|
|
|
|
#17 | |||
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
|
Quote:
![]() Quote:
Quote:
![]() |
|||
|
|
|
|
|
#18 | ||
|
Valued Member
Join Date: Aug 2005
Posts: 189
|
Quote:
ok, fair enough ![]() Quote:
Well, basically, you'll need to decompose your 'signal' (-pi..pi) in its harmonics ... you know, thinks like y(theta) = \sum_{k=1}^inf A_k sin(k * theta). (Normally, you won't have cosine terms, since you have an odd function. Except maybe for a DC term, though I don't think you have one. If you would have cosine terms, you better write it down using the exponential form with complex amplitudes) Then, the THD is defined as the ratio of the RMS of the harmonics to the fundamental: \frac {\sqrt \sum_{k=2}^inf A_k^2} {\abs A_1}. One thingy though: it silently ignores the DC term ... so if you simply would add a number like 0.123456 to y, the THD would not change. edit: need to use the total RMS, not total power
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :) Bramz' warehouse | LiAR isn't a raytracer Last edited by bramz : 04-25-2006 at 04:37 AM. |
||
|
|
|
|
|
#19 | ||
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
|
Quote:
Quote:
So, I understand your frustration about modern CPU performance characteristics, but Intel and AMD really know what they're doing. x86 ain't perfect but it's legacy (tons of support for it) and it's flexible (new instructions get added every generation). It's certainly possible to create faster chips with a new instruction set tuned for today's applications but if it's not flexible enough it's not going to survive. Intel beated Alpha and PowerPC with affordable and fast processors. |
||
|
|
|
|
|
#20 | |||
|
Valued Member
Join Date: Mar 2005
Posts: 135
|
Quote:
Quote:
So we have to remove branches, and for that more often than not we'll get into the the cond move pattern. Now let's see what we had for ages on the integer side of those same CPU. Surprise! There's both those "binary logic" ops and... tada ... cmov. I'm sure dudes at Intel thought they had encoding bits and silicon to waste and they were like let's add another instruction just for the fun. Really i'm asking for consistency. And common sense. Quote:
And allow me to put the "they know what they're doing" back into proper historical context: that's the same guys that have gone with a segmented memory model at the same time ie Motorola was going the flat way. Or those that later on saw the light again when they thought stack based fpu was the best thing since sliced bread when everyone and his brother was using registers. ![]() Please save me that kind of PR, because even if my memory may not be what it used to be, i was there. |
|||
|
|
|
|
|
#21 | |||||
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
|
Quote:
Quote:
And I read almost every page of the Intel manuals (plus some chapters in The Unabridged Pentium 4) so it's hard to get my attention with that, sorry. I also got several courses about processor architecture and digital design if that matters...Quote:
Implementing cmov in the integer pipeline is easy since flags have always been there. They did the right thing for SSE by giving it a cleaner modern architecture. That means less consistency but I don't mind. A single scalar SSE version of cmov would probably have been doable but then there is no vector equivalent (no consistency) and it's pretty useless anyway. For predictable branches you can still use comiss. So I'm not sure what you're asking for exactly... ![]() Quote:
Quote:
Anyway, I have good hopes for the future of x86. Intel finally realized clock frequency isn't the only thing that determines performance. By increasing issue width, making SIMD units 4-wide instead of only 2-wide, and using reverse hyper-threading, floating-point performance can still increase a lot. |
|||||
|
|
|
|
|
#22 | ||||
|
Valued Member
Join Date: Mar 2005
Posts: 135
|
Quote:
Fact is i need both of them, fast and robust and if possible with a bit pattern that doesn't look like i got it from a RNG. Plus i wasn't addressing relative latency of 3DNow & SSE but functionnalities. It seems you've missed the wildcard in my previous post, so please check "3DNow! Technology Manual" for PFRCPIT1, PFRCPIT2, PFRSQIT1 & co. Compare that to: Code:
Quote:
From a quick searh through IA-32 Intel Architecture Optimization Reference Manual, Order Number:248966-012 2-85, Vectorization "User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches" 2-89, User/Source Coding Rules "User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches. 2-86" 3-37, Instruction Selection "The following section gives some guidelines for choosing instructions to complete a task. One barrier to SIMD computation can be the existence of data-dependent branches. Conditional moves can be used to eliminate data-dependent branches. Conditional moves can be emulated in SIMD computation by using masked compares and logicals, as shown in Example3-21." 4-2, General Rules on SIMD Integer Code "Emulate conditional moves by using masked compares and logicals instead of using conditional branches." Should i quote some more or did you get the point by now? Quote:
I was kind enough to restrict my example to x86 and i've never ever alluded to flags. I've merely pointed out the inconsistency where on one hand you have cmov* on the ALU side, fcmov* on the FPU side and nothing like that for SSE. You'll notice those cmov* and fcmov* could also be emulated. I would have expected a single op doing exactly what a andn/and/or sequence would do; admitedly ternary ops aren't the norm on x86, but anything would have done the trick: use a blessed register as the implicit mask (aka accumulator) or encode it as a prefix or whatever. It makes no sense when decoding bandwitdh is scarce and you're struggling with troughoutput to say the norm is replace to conditional branches with a 3 freaking op sequence. Quote:
![]() You know there was a (computing) world before Intel and it will still be there when everybody will have forgotten about them. Repeat after me: stack based fpu stink, they never ever made sense. Ever. Again it might come as a surprise but others got it mostly right: http://en.wikipedia.org/wiki/Motorola_68882 Now better than rephrasing what the marketing department wrote, get the story about IEEE 754 from the Man himself. Please. And remove those rosy teinted glasses. http://www.cs.berkeley.edu/~wkahan/i.../754story.html Last edited by tbp : 04-24-2006 at 08:17 PM. |
||||
|
|
|
|
|
#23 | ||||||||||||
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
|
Quote:
Quote:
Quote:
Quote:
Quote:
Besides, it's not because you see a use for certain instructions for your specific application, that they need to be added to the instruction set. They constantly have to evaluate transistor cost and overhead versus speedup over all existing software. It simply pays off more to improve the architecture and not touch the instruction set. Even though it's CISC you can't go adding instructions like mad. Quote:
Ternary operations would require register files with extra read ports. The area of a register file is proportional to the square of the number of ports. It also has a direct influence on latencies. The only time that can be acceptable is if all operations are ternary (which saves move operations). Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
As a matter of fact I got cool new glasses last Saturday, black with copper orange. Not rosy teinted though, crystal clear. |
||||||||||||
|
|
|
|
|
#24 |
|
DevMaster Staff
Join Date: Oct 2004
Location: Seattle, WA
Posts: 4,089
|
Guys,
This is rapidly becoming pointless. Let's stay on the topic please.
___________________________________________
Currently working at Sucker Punch reedbeta.com - OpenGL demos and other projects Luabridge - a lightweight, dependency-free C++/Lua binding library. CD Lite - an unobtrusive, minimal CD player application for Windows. |
|
|
|
|
|
#25 |
|
Valued Member
Join Date: Aug 2005
Posts: 189
|
So, I did a quick test on the THD, and this is what I've found:
taylor: 2.4% parabolic: 3.8% extra precision: 0.078% So, you can see that your version with extra precision performs much better than the others. Your parabolic version is actually worse than the taylor one. That's of course because you're making rather large errors over almost the complete domain. Taylor only gets (very) bad towards pi. There is however one major fact in favour of the parabolic one: it's continuous. The parabolic also has only odd harmonics (k = 2n+1), while taylor has both even and odd harmonics. The latter however, might be in favour of the Taylor series when it comes to psychoacoustics (how we experience the signal when we here it) because the third harmonic gives a 'covered' sound, while having the 2nd, 3rd, 4th, ... all together gives a 'brassy' sound. It's a bit like transistors vs tubes ... It might be interesting to take the general formula with four coefficients, and to optimize it for minimal THD. Bramz BTW, that other one you proposed: 0.976 x + 0.0683 x^2 - 0.241 x^3 + 0.0384 x^4, it's THD is 0.32%.
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :) Bramz' warehouse | LiAR isn't a raytracer Last edited by bramz : 04-25-2006 at 05:49 AM. |
|
|
|
![]() |
| Thread Tools | Search this Thread |
| Display Modes | |
|