DevMaster.net Forums
[[ Home | Forums | 3D Engines Database | Wiki | Articles/Tutorials | Game Dev Jobs | IRC Chat Network | Contact Us ]]

Go Back   DevMaster.net Forums > Site Discussions > Code & Snapshot Discussion
User Name
Password
Register FAQ Members List Search Today's Posts Mark Forums Read

Reply
 
Thread Tools Search this Thread Display Modes
Old 04-23-2006, 07:00 AM   #1
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
Default

Hi all,

In some situations you need a good approximation of sine and cosine that runs at very high performance. One example is to implement dynamic subdivision of round surfaces, comparable to those in Quake 3. Or to implement wave motion, in case there are no vertex shaders 2.0 available.

The standard C sinf() and cosf() functions are terribly slow and offer much more precision than what we actually need. What we really want is an approximation that offers the best compromise between precision and performance. The most well known approximation method is to use a Taylor series about 0 (also known as a Maclaurin series), which for sine becomes:
Code:
x - 1/6 x^3 + 1/120 x^5 - 1/5040 x^7 + ...
When we plot this we get: taylor.gif. The green line is the real sine, the red line is the first four terms of the Taylor series. This seems like an acceptable approximation, but lets have a closer look: taylor_zoom.gif. It behaves very nicely up till pi/2, but after that it quickly deviates. At pi, it evaluates to -0.075 instead of 0. Using this for wave simulation will result in jerky motion which is unacceptable.

We could add another term, which in fact reduces the error significantly, but this makes the formula pretty lengthy. We already need 7 multiplications and 3 additions for the 4-term version. The taylor series just can't give us the precision and performance we're looking for.

We did however note that we need sine(pi) = 0. And there's another thing we can see from taylor_zoom.gif: this looks very much like a parabola! So let's try to find the formula of a parabola that matches it as closely as possible. The generic formula for a parabola is A + B x + C x^2. So this gives us three degrees of freedom. Obvious choises are that we want sine(0) = 0, sine(pi/2) = 1 and sine(pi) = 0. This gives us the following three equations:
Code:
A + B 0 + C 0^2 = 0 A + B pi/2 + C (pi/2)^2 = 1 A + B pi + C pi^2 = 0
Which has as solution A = 0, B = 4/pi, C = -4/pi^2. So our parabola approximation becomes 4/pi x - 4/pi^2 x^2. Plotting this we get: parabola.gif. This looks worse than the 4-term Taylor series, right? Wrong! The maximum absolute error is 0.056. Furthermore, this approximation will give us smooth wave motion, and can be calculated in only 3 multiplications and 1 addition!

Unfortunately it's not very practical yet. This is what we get in the [-pi, pi] range: negative.gif. It's quite obvious we want at least a full period. But it's also clear that it's just another parabola, mirrored around the origin. The formula for it is 4/pi x + 4/pi^2 x^2. So the straightforward (pseudo-C) solution is:
Code:
if(x > 0) { y = 4/pi x - 4/pi^2 x^2; } else { y = 4/pi x + 4/pi^2 x^2; }
Adding a branch is not a good idea though. It makes the code significantly slower. But look at how similar the two parts really are. A subtraction becomes an addition depending on the sign of x. In a naive first attempt to eliminate the branch we can 'extract' the sign of x using x / abs(x). The division is very expensive but look at the resulting formula: 4/pi x - x / abs(x) 4/pi^2 x^2. By inverting the division we can simplify this to a very nice and clean 4/pi x - 4/pi^2 x abs(x). So for just one extra operation we get both halves of our sine approximation! Here's the graph of this formula confirming the result: abs.gif.

Now let's look at cosine. Basic trigonometry tells us that cosine(x) = sine(pi/2 + x). Is that all, add pi/2 to x? No, we actually get the unwanted part of the parabola again: shift_sine.gif. What we need to do is 'wrap around' when x > pi/2. This can be done by subtracting 2 pi. So the code becomes:
Code:
x += pi/2; if(x > pi) // Original x > pi/2 { x -= 2 * pi; // Wrap: cos(x) = cos(x - 2 pi) } y = sine(x);
Yet another branch. To eliminate it, we can use binary logic, like this:
Code:
x -= (x > pi) & (2 * pi);
Note that this isn't valid C code at all. But it should clarify how this can work. When x > pi is false the & operation zeroes the right hand part so the subtraction does nothing, which is perfectly equivalent. I'll leave it as an excercise to the reader to create working C code for this (or just read on). Clearly the cosine requires a few more operations than the sine but there doesn't seem to be any other way and it's still extremely fast.

Now, a maximum error of 0.056 is nice but clearly the 4-term Taylor series still has a smaller error on average. Recall what our sine looked like: abs.gif. So isn't there anything we can do to further increase precision at minimal cost? Of course the current version is already applicable for many situations where something that looks like a sine is just as good as the real sine. But for other situations that's just not good enough.

Looking at the graphs you'll notice that our approximation always overestimates the real sine, except at 0, pi/2 and pi. So what we need is to 'scale it down' without touching these important points. The solution is to use the squared parabola, which looks like this: squared.gif. Note how it preserves those important points but it's always lower than the real sine. So we can use a weighted average of the two to get a better approximation:
Code:
Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2
With Q + P = 1. You can use exact minimization of either the absolute or relative error but I'll save you the math. The optimal weights are Q = 0.775, P = 0.225 for the absolute error and Q = 0.782, P = 0.218 for the relative error. I will use the former. The resulting graph is: average.gif. Where did the red line go? It's almost entirely covered by the green line, which instantly shows how good this approximation really is. The maximum error is about 0.001, a 50x improvement! The formula looks lengthy but the part between parenthesis is the same value from the parabola, which needs to be computed only once. In fact only 2 extra multiplications and 2 additions are required to achieve this precision boost.

It shouldn't come as a big surprise that to make it work for negative x as well we need a second abs() operation. The final C code for the sine becomes:
Code:
float sine(float x) { const float B = 4/pi; const float C = -4/(pi*pi); float y = B * x + C * x * abs(x); #ifdef EXTRA_PRECISION // const float Q = 0.775; const float P = 0.225; y = P * (y * abs(y) - y) + y; // Q * y + P * y * abs(y) #endif }
So we need merely 5 multiplications and 3 additions; still faster than the 4-term Taylor if we neglect the abs(), and much more precise! The cosine version just needs the extra shift and wrap operations on x.

Last but not least, I wouldn't be Nick if I didn't include the SIMD optimized assembly version. It allows to perform the wrap operation very efficiently so I'll give you the cosine:
Code:
// cos(x) = sin(x + pi/2) addps xmm0, PI_2 movaps xmm1, xmm0 cmpnltps xmm1, PI andps xmm1, PIx2 subps xmm0, xmm1 // Parabola movaps xmm1, xmm0 andps xmm1, abs mulps xmm1, xmm0 mulps xmm0, B mulps xmm1, C addps xmm0, xmm1 // Extra precision movaps xmm1, xmm0 andps xmm1, abs mulps xmm1, xmm0 subps xmm1, xmm0 mulps xmm1, P addps xmm0, xmm1
This code computes four cosines in parallel, resulting in a peak performance of about 9 clock cycles per cosine for most CPU architectures. Sines take only 6 clock cycles ideally. The lower precision version can even run at 3 clock cycles per sine... And lets not forget that all input between -pi and pi is valid and the formula is exact at -pi, -pi/2, 0, pi/2 and pi.

So, the conclusion is don't ever again use a Taylor series to approximate a sine or cosine! To add a useful discussion to this article, I'd love to hear if anyone knows good approximations for other transcendental functions like the exponential, logarithm and power functions.

Cheers,

Nick

Last edited by Nick : 04-23-2006 at 09:56 AM.
Nick is offline   Reply With Quote
Old 04-23-2006, 08:30 AM   #2
bramz
Valued Member
 
Join Date: Aug 2005
Posts: 189
Default Re: Fast and accurate sine/cosine

one word: interesting =)
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :)
Bramz' warehouse | LiAR isn't a raytracer
bramz is offline   Reply With Quote
Old 04-23-2006, 08:54 AM   #3
davepermen
Senior Member
 
davepermen's Avatar
 
Join Date: Jan 2003
Location: Switzerland
Posts: 1,331
Default Re: Fast and accurate sine/cosine

hehe, i've went trough this quite a while ago, while sitting in school, bored by the math lessons i knew already. i tried different ways to approximate sine, cosine, and all the others. but i don't remember. i haven't had your idea to use a lerp between the parabola and the squared parabola, though. i think i used x^3 because it looks more like sine by default.

exp was great, as you could split the value into integer and fractional part, do the integer with bitshifting, and the fraction with some nice approx. something like that i remember..

man, thats years back
___________________________________________
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
davepermen is offline   Reply With Quote
Old 04-23-2006, 12:31 PM   #4
m4x0r
New Member
 
Join Date: Aug 2005
Location: San Francisco, MA
Posts: 24
Default Re: Fast and accurate sine/cosine

Nice job Nick, very clever.

Max
m4x0r is offline   Reply With Quote
Old 04-23-2006, 02:02 PM   #5
davepermen
Senior Member
 
davepermen's Avatar
 
Join Date: Jan 2003
Location: Switzerland
Posts: 1,331
Default Re: Fast and accurate sine/cosine

hm.. you should present a fast way to fit a number into the -pi/2 - +pi/2 range.

is there a fractional() instruction in SSE? or some modulus directly?
___________________________________________
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
davepermen is offline   Reply With Quote
Old 04-23-2006, 07:58 PM   #6
Slaru
New Member
 
Join Date: Apr 2006
Posts: 1
Default Re: Fast and accurate sine/cosine

Hello,

This is my first post here, but I've visited here for a while now.

I did a quick test on accuracies and found the fast sin function to have about 12.9% error without added precision and -0.2% error with added precision.

Here's the code, if you'd like to have a look to make sure I didn't do something that might be giving false figures.
Code:
#include <iostream> #include <cstdlib> #include <ctime> #include <cmath> const float PI = 3.14159265358979323846264338327950288f; const float PI2 = 6.28318530717958647692528676655900577f; const float PID2 = 1.57079632679489661923132169163975144f; const float PI_SQR = 9.86960440108935861883449099987615114f; template <typename T, bool accurate> T FastSin(const T& theta) { const T B = 4 / PI; const T C = -4 / PI_SQR; T y = B * theta + C * theta * abs(theta); if (accurate) { // const float Q = 0.775; const T P = 0.225; y = P * (y * abs(y) - y) + y; // Q * y + P * y * abs(y) } return y; } template <typename T> T randf() { int randi = rand(); return (T)(((double)(randi % RAND_MAX))/((double)RAND_MAX)); } void CheckAccuracy() { const size_t checks = 10000; typedef double real; std::cout << "Checking accuracy randomly." << std::endl; srand((unsigned)std::time(0)); real Sin[checks], fastSin[checks], fastAccurSin[checks]; for (int i = 0; i < checks; i++) { real randFloat = randf<real>(); Sin[i] = sin(randFloat); fastSin[i] = FastSin<real, false>(randFloat); fastAccurSin[i] = FastSin<real, true>(randFloat); } real SinToFastSinAccurs[checks], SinToFastAccurSinAccurs[checks]; for (int i = 0; i < checks; i++) { // (measured - accurate)/accurate SinToFastSinAccurs[i] = (fastSin[i] - Sin[i])/Sin[i]; SinToFastAccurSinAccurs[i] = (fastAccurSin[i] - Sin[i])/Sin[i]; } real FinalSinToFastSin = ((real)0), FinalSinToFastAccurSin = ((real)0); for (int i = 0; i < checks; i++) { FinalSinToFastSin += SinToFastSinAccurs[i]; FinalSinToFastAccurSin += SinToFastAccurSinAccurs[i]; } FinalSinToFastSin /= (((real)checks) / ((real)100.0)); FinalSinToFastAccurSin /= (((real)checks) / ((real)100.0)); std::cout << "FinalSinToFastSin %: " << FinalSinToFastSin << "\nFinalSinToFastAccurSin %: " << FinalSinToFastAccurSin << std::endl; } int main() { CheckAccuracy(); system("Pause"); return 0; }

Slaru
Slaru is offline   Reply With Quote
Old 04-24-2006, 12:45 AM   #7
tbp
Valued Member
 
Join Date: Mar 2005
Posts: 135
Default Re: Fast and accurate sine/cosine

Note: Slaru, i'm sure you could gain a couple ulp by adding more digits to your constants.


Quote:
Originally Posted by Nick
I'd love to hear if anyone knows good approximations for other transcendental functions like the exponential, logarithm and power functions.
See this patch for gcc contributed by AMD. All vectorized. May be too precise and slow for your purpose tho.
http://gcc.gnu.org/ml/gcc-patches/2006-03/msg01611.html
Further down the thread there's also some rather amusing remarks from Intel drones.


Quote:
Originally Posted by davepermen
is there a fractional() instruction in SSE? or some modulus directly?
No. There's some basic arithmetic (+,-,*,/) and bitwise ops (or, and, nand, xor) and that's it. Heck, even the usual cond move is missing.

Last edited by tbp : 04-24-2006 at 12:58 AM.
tbp is offline   Reply With Quote
Old 04-24-2006, 04:20 AM   #8
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by Slaru
I did a quick test on accuracies and found the fast sin function to have about 12.9% error without added precision and -0.2% error with added precision.

Here's the code, if you'd like to have a look to make sure I didn't do something that might be giving false figures...
You're measuring relative errors, which are a bigger than absolute errors. You can however improve relative accuracy a tiny bit by using Q = 0.782, P = 0.218, as mentioned in the article.

By the way, the relative error of the Taylor series is infinity at x = pi. :o Absolute error makes more sense here.
Nick is offline   Reply With Quote
Old 04-24-2006, 04:29 AM   #9
bramz
Valued Member
 
Join Date: Aug 2005
Posts: 189
Default Re: Fast and accurate sine/cosine

how can you have a -0.2% error?
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :)
Bramz' warehouse | LiAR isn't a raytracer
bramz is offline   Reply With Quote
Old 04-24-2006, 04:58 AM   #10
bramz
Valued Member
 
Join Date: Aug 2005
Posts: 189
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by Nick
By the way, the relative error of the Taylor series is infinity at x = pi. :o Absolute error makes more sense here.

Relative errors do make sense when talking about signals, because the amplitude isn't always equal to one. And it also makes sense to compare the error for different angles ...
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :)
Bramz' warehouse | LiAR isn't a raytracer
bramz is offline   Reply With Quote
Old 04-24-2006, 05:12 AM   #11
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by tbp
No. There's some basic arithmetic (+,-,*,/) and bitwise ops (or, and, nand, xor) and that's it. Heck, even the usual cond move is missing.
It also has very fast approximations of reciproke and reciproke square root (12-bit mantissa precision). For fractions we can convert to integer and convert back fairly fast. Conditional move can be done with bitwise logic.
Nick is offline   Reply With Quote
Old 04-24-2006, 05:45 AM   #12
bramz
Valued Member
 
Join Date: Aug 2005
Posts: 189
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by bramz
Relative errors do make sense when talking about signals, because the amplitude isn't always equal to one. And it also makes sense to compare the error for different angles ...

In particular, it also makes sense to talk about the total harmonic distortion
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :)
Bramz' warehouse | LiAR isn't a raytracer
bramz is offline   Reply With Quote
Old 04-24-2006, 05:46 AM   #13
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by bramz
Relative errors do make sense when talking about signals, because the amplitude isn't always equal to one. And it also makes sense to compare the error for different angles ...
Sure, but when the signal is around zero the relative error can explode even when the approximation is almost coincident with the real curve.

So what we really need is a function to weigh the error absolutely near zero, and somewhat relative towards infinity. exp(-x) seems very suited. This way the maximum 'exponential' error of the parabola is 0.037, compared to 0.075 for the Taylor series. The squared parabola gives me 0.0006 with adjusted Q = 0.7775, P = 0.2225.

No matter how you look at it, this is a fast and accurate approximation.
Nick is offline   Reply With Quote
Old 04-24-2006, 06:30 AM   #14
bramz
Valued Member
 
Join Date: Aug 2005
Posts: 189
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by Nick
So what we really need is a function to weigh the error absolutely near zero, and somewhat relative towards infinity. exp(-x) seems very suited.

Let us not mix two kinds of error measurments, ok? =) The reason why it goes towards infinity there, is because ... well ... because you're making an infinitely large error, no matter how you look at it. However, we don't need to panic either. For sines, we can get a nice figure of overall relative error by using the total harmonic distortion as I mentioned in the other post. This is in particular true if you're using this function to generate signals.

And btw, I _do_ like your approximation
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :)
Bramz' warehouse | LiAR isn't a raytracer

Last edited by bramz : 04-24-2006 at 06:34 AM.
bramz is offline   Reply With Quote
Old 04-24-2006, 07:31 AM   #15
tbp
Valued Member
 
Join Date: Mar 2005
Posts: 135
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by Nick
It also has very fast approximations of reciproke and reciproke square root (12-bit mantissa precision).
... which are pretty much useless by themselves. You'll need a Newton-Raphson step to get within a couple ulp. And then you'll need to fix NaN blackholes at 0 and infinity.
Even if you only care about ]0, inf[ latency has already gone through the roof.

Anyway, you're right i forgot to mention them

Quote:
Originally Posted by Nick
For fractions we can convert to integer and convert back fairly fast. Conditional move can be done with bitwise logic.
And someone posted code to get something looking like a sin/cos with a bunch of adds & muls... so that doesn't say much
I stand by my initial point, that is paying for the latency, decoding bandwitdth etc for 3 ops (andn, and, or) for something as mundane as a cond move is silly. Even more so when you strive to avoid branches when dealing with SSE to begin with, branches that you remove with... you know what.

Last edited by tbp : 04-24-2006 at 07:35 AM.
tbp is offline   Reply With Quote
Old 04-24-2006, 07:51 AM   #16
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
Default Re: Fast and accurate sine/cosine

By the way, for those who think the squared parabola factor fell out of thin air, expand the formula and compare it with the Taylor series about x = pi/2:

Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2
= Q 4/pi - (Q 4/pi^2 + P 16/pi^2) x^2 - P 32/pi^3 x^3 + P 16/pi^4 x^4
~= 0.986 x + 0.0507 x^2 - 0.232 x^3 + 0.0370 x^4

sin(x ~ pi/2)
~= 1 - 1/2 (x - 1/2 Pi)^2 + 1/24 (x - 1/2 Pi)^4
~= 0.0200 + 0.925 x + 0.117 x^2 - 0.262 x^3 + 0.0417 x^4

Note they are both polynomials with roughly comparable coefficients. The parabola method just has 'corrected' coefficients to make sure sin(0) = 0, sin(pi/2) = 1, sin(pi) = 0 and the erros are minimal.

In fact you can also approach the problem starting from a generic order 4 polynomial:

A + B x + C x^2 + D x^3 + E x^4

Using the restrictions sin(0) = 0, sin(pi/2) = 1, sin(pi) = 0 yields: A = 0, B = 1/(4 pi) (16 + 2 D pi^3 + 3 E pi^4), C = -1/4 (16 + 6 D pi^3 + 7 E pi^4) / pi^2. D and E are still free parameters. So we can choose two more restrictions, for example, the first and second derivative in pi/2 should be 0 and -1 respectively. This yields A = 0, B = -1/2 (-16 + pi^2) / pi, C = 1/2 (-48 + 5 pi^2) / pi^2, D = -4 (-8 + pi^2) / pi^3, E = 2 (-8 + pi^2) / pi^4. Which gives us:

0.976 x + 0.0683 x^2 - 0.241 x^3 + 0.0384 x^4

This is pretty close to the squared parabola method and has some nice properties, but the maximum error is a little bigger. By the way, for order 3 and 4 polynomials we both get the simple parabola.

Anyway, the ad-hoc explanation is a bit easier to understand I think, but this gives it a bit of theoretical background. It's nothing more than fitting a function with a polynomial using some restrictions (exact points and derivatives). Minimizing the error is another very useful restriction...

Last edited by Nick : 04-24-2006 at 11:50 AM.
Nick is offline   Reply With Quote
Old 04-24-2006, 08:01 AM   #17
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by bramz
The reason why it goes towards infinity there, is because ... well ... because you're making an infinitely large error, no matter how you look at it.
Not me, Taylor.
Quote:
However, we don't need to panic either. For sines, we can get a nice figure of overall relative error by using the total harmonic distortion as I mentioned in the other post. This is in particular true if you're using this function to generate signals.
How does that work exactly? The Wikipedia explanation is a bit cryptic...
Quote:
And btw, I _do_ like your approximation
I wasn't doubting that.
Nick is offline   Reply With Quote
Old 04-24-2006, 08:13 AM   #18
bramz
Valued Member
 
Join Date: Aug 2005
Posts: 189
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by Nick
Not me, Taylor.

ok, fair enough

Quote:
How does that work exactly? The Wikipedia explanation is a bit cryptic...

Well, basically, you'll need to decompose your 'signal' (-pi..pi) in its harmonics ... you know, thinks like
y(theta) = \sum_{k=1}^inf A_k sin(k * theta).

(Normally, you won't have cosine terms, since you have an odd function. Except maybe for a DC term, though I don't think you have one. If you would have cosine terms, you better write it down using the exponential form with complex amplitudes)

Then, the THD is defined as the ratio of the RMS of the harmonics to the fundamental:
\frac {\sqrt \sum_{k=2}^inf A_k^2} {\abs A_1}.

One thingy though: it silently ignores the DC term ... so if you simply would add a number like 0.123456 to y, the THD would not change.

edit: need to use the total RMS, not total power
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :)
Bramz' warehouse | LiAR isn't a raytracer

Last edited by bramz : 04-25-2006 at 04:37 AM.
bramz is offline   Reply With Quote
Old 04-24-2006, 11:23 AM   #19
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by tbp
... which are pretty much useless by themselves. You'll need a Newton-Raphson step to get within a couple ulp. And then you'll need to fix NaN blackholes at 0 and infinity.
Even if you only care about ]0, inf[ latency has already gone through the roof.
That's really the best possible. Remember that a divps/sqrtps is 39 clock cycles non-pipelined. So if you only need 12-bit or 24-bit mantissa precision these instructions are great. What more could you expect? They can't break the laws of semiconductor physics.
Quote:
I stand by my initial point, that is paying for the latency, decoding bandwitdth etc for 3 ops (andn, and, or) for something as mundane as a cond move is silly. Even more so when you strive to avoid branches when dealing with SSE to begin with, branches that you remove with... you know what.
Why is that silly? For many situations, like for the cosine wrap, you can simplify the binary logic. Also, latency by itself is not that big a problem thanks to pipelining, out-of-order execution, and jump prediction. Ok, the Pentium 4 took that idea a little too far, but seriously, if you know a magical architecture that combines high clock frequency with low latency then let me know.

So, I understand your frustration about modern CPU performance characteristics, but Intel and AMD really know what they're doing. x86 ain't perfect but it's legacy (tons of support for it) and it's flexible (new instructions get added every generation). It's certainly possible to create faster chips with a new instruction set tuned for today's applications but if it's not flexible enough it's not going to survive. Intel beated Alpha and PowerPC with affordable and fast processors.
Nick is offline   Reply With Quote
Old 04-24-2006, 01:45 PM   #20
tbp
Valued Member
 
Join Date: Mar 2005
Posts: 135
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by Nick
That's really the best possible. ... What more could you expect?
1998 has called and wants the 3DNow pfrcp*/pfrsq* back.


Quote:
Originally Posted by Nick
Why is that silly? For many situations, like for the cosine wrap, you can simplify the binary logic. Also, latency by itself is not that big a problem thanks to pipelining, out-of-order execution, and jump prediction. Ok, the Pentium 4 took that idea a little too far, but seriously, if you know a magical architecture that combines high clock frequency with low latency then let me know.
You can spin it till hell freezes over, Intel & AMD optimization manuals clearly say: thou shall remove thy ugly branches. If you don't lend me any credit maybe that should get your attention, i think.

So we have to remove branches, and for that more often than not we'll get into the the cond move pattern.
Now let's see what we had for ages on the integer side of those same CPU. Surprise! There's both those "binary logic" ops and... tada ... cmov.
I'm sure dudes at Intel thought they had encoding bits and silicon to waste and they were like let's add another instruction just for the fun.

Really i'm asking for consistency. And common sense.

Quote:
Originally Posted by Nick
So, I understand your frustration about modern CPU performance characteristics, but Intel and AMD really know what they're doing. x86 ain't perfect but it's legacy (tons of support for it) and it's flexible (new instructions get added every generation).
Err, correct me if i'm wrong but i'm pretty sure there was like 0 legacy issues when Intel first introduced SSE.

And allow me to put the "they know what they're doing" back into proper historical context: that's the same guys that have gone with a segmented memory model at the same time ie Motorola was going the flat way. Or those that later on saw the light again when they thought stack based fpu was the best thing since sliced bread when everyone and his brother was using registers.

Please save me that kind of PR, because even if my memory may not be what it used to be, i was there.
tbp is offline   Reply With Quote
Old 04-24-2006, 06:06 PM   #21
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by tbp
1998 has called and wants the 3DNow pfrcp*/pfrsq* back.
pfrcp has 14-bit mantissa precision, pfsqrt 15-bit. That's not a whole lot better than the SSE versions. I mean, especially for the purposes these instructions are used for it shouldn't matter that much (e.g. vector normalization). And the 3DNow! instructions have quite high latencies on Athlon 64 as well.
Quote:
You can spin it till hell freezes over, Intel & AMD optimization manuals clearly say: thou shall remove thy ugly branches. If you don't lend me any credit maybe that should get your attention, i think.
Sorry, what credit are you asking for then? And I read almost every page of the Intel manuals (plus some chapters in The Unabridged Pentium 4) so it's hard to get my attention with that, sorry. I also got several courses about processor architecture and digital design if that matters...
Quote:
So we have to remove branches, and for that more often than not we'll get into the the cond move pattern.
Now let's see what we had for ages on the integer side of those same CPU. Surprise! There's both those "binary logic" ops and... tada ... cmov.
I'm sure dudes at Intel thought they had encoding bits and silicon to waste and they were like let's add another instruction just for the fun.

Really i'm asking for consistency. And common sense.
My point was that using a couple binary logic instructions is quite effective to remove some branches. Sure, there is no cmov equivalent, but that's not so terrible. Besides, what would that instruction look like anyway? There is no vector of compare flags. And that's a good thing because they would add a whole lot of hardware complexity (making the latencies even bigger). The current SSE architecture is pretty clean in my opinion.

Implementing cmov in the integer pipeline is easy since flags have always been there. They did the right thing for SSE by giving it a cleaner modern architecture. That means less consistency but I don't mind. A single scalar SSE version of cmov would probably have been doable but then there is no vector equivalent (no consistency) and it's pretty useless anyway. For predictable branches you can still use comiss.

So I'm not sure what you're asking for exactly...
Quote:
And allow me to put the "they know what they're doing" back into proper historical context: that's the same guys that have gone with a segmented memory model at the same time ie Motorola was going the flat way. Or those that later on saw the light again when they thought stack based fpu was the best thing since sliced bread when everyone and his brother was using registers.
Intel has always chosen the budget solution, and look where they are now. Sure, x86 ain't perfect, but it's what makes my PC tick and almost everyone else's. As for x87, yes the register stack can be annoying, but its design started in 1976 and uses the robust IEEE 754 standard (also created by Intel) we still use today.
Quote:
Please save me that kind of PR, because even if my memory may not be what it used to be, i was there.
I'm sorry but what PR? I have an AMD destop and Intel laptop. And I'm willing to change to something else if it's fast, affordable, easy to develop for and flexible enough for the future.

Anyway, I have good hopes for the future of x86. Intel finally realized clock frequency isn't the only thing that determines performance. By increasing issue width, making SIMD units 4-wide instead of only 2-wide, and using reverse hyper-threading, floating-point performance can still increase a lot.
Nick is offline   Reply With Quote
Old 04-24-2006, 08:03 PM   #22
tbp
Valued Member
 
Join Date: Mar 2005
Posts: 135
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by Nick
pfrcp has 14-bit mantissa precision, pfsqrt 15-bit. That's not a whole lot better than the SSE versions. I mean, especially for the purposes these instructions are used for it shouldn't matter that much (e.g. vector normalization). And the 3DNow! instructions have quite high latencies on Athlon 64 as well.
Reciprocals and square roots do happen outside of vector normalization you know.
Fact is i need both of them, fast and robust and if possible with a bit pattern that doesn't look like i got it from a RNG.

Plus i wasn't addressing relative latency of 3DNow & SSE but functionnalities. It seems you've missed the wildcard in my previous post, so please check "3DNow! Technology Manual" for PFRCPIT1, PFRCPIT2, PFRSQIT1 & co.
Compare that to:
Code:
static __m128 rcp_sqrt_nr_robust_ps(const __m128 arg) { const __m128 mask = cmpeq(arg, all_zero()), nr = rsqrtps(arg), muls = mulps(mulps(nr, nr), arg), filter = andnps(mask, muls), beta = mulps(rt::cst::section.half, nr), gamma = subps(rt::cst::section.three, filter), final = mulps(beta, gamma); return final; }
I could have spammed the same kind of uglyness for related ops like 1/x etc... Not so attractive anymore eh? (or robust in fact)


Quote:
Originally Posted by Nick
Sorry, what credit are you asking for then? And I read almost every page of the Intel manuals (plus some chapters in The Unabridged Pentium 4) so it's hard to get my attention with that, sorry. I also got several courses about processor architecture and digital design if that matters...
That's fine & dandy, but one got to wonder if we're reading the same manuals.

From a quick searh through IA-32 Intel Architecture Optimization Reference Manual, Order Number:248966-012
2-85, Vectorization
"User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches"

2-89, User/Source Coding Rules
"User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches. 2-86"

3-37, Instruction Selection
"The following section gives some guidelines for choosing instructions to complete a task. One barrier to SIMD computation can be the existence of data-dependent branches. Conditional moves can be used to eliminate data-dependent branches. Conditional moves can be emulated in SIMD computation by using masked compares and logicals, as shown in Example3-21."

4-2, General Rules on SIMD Integer Code
"Emulate conditional moves by using masked compares and logicals instead of using conditional branches."

Should i quote some more or did you get the point by now?

Quote:
Originally Posted by Nick
My point was that using a couple binary logic instructions is quite effective to remove some branches. Sure, there is no cmov equivalent, but that's not so terrible. Besides, what would that instruction look like anyway? There is no vector of compare flags. And that's a good thing because they would add a whole lot of hardware complexity (making the latencies even bigger). The current SSE architecture is pretty clean in my opinion.
Excuse me but isn't one of the main point of a CISC arch to provide code compression via byzantine instruction encoding?
I was kind enough to restrict my example to x86 and i've never ever alluded to flags. I've merely pointed out the inconsistency where on one hand you have cmov* on the ALU side, fcmov* on the FPU side and nothing like that for SSE. You'll notice those cmov* and fcmov* could also be emulated.

I would have expected a single op doing exactly what a andn/and/or sequence would do; admitedly ternary ops aren't the norm on x86, but anything would have done the trick: use a blessed register as the implicit mask (aka accumulator) or encode it as a prefix or whatever.
It makes no sense when decoding bandwitdh is scarce and you're struggling with troughoutput to say the norm is replace to conditional branches with a 3 freaking op sequence.


Quote:
Originally Posted by Nick
Intel has always chosen the budget solution, and look where they are now. Sure, x86 ain't perfect, but it's what makes my PC tick and almost everyone else's. As for x87, yes the register stack can be annoying, but its design started in 1976 and uses the robust IEEE 754 standard (also created by Intel) we still use today.
Revisionist kids these days...
You know there was a (computing) world before Intel and it will still be there when everybody will have forgotten about them.

Repeat after me: stack based fpu stink, they never ever made sense. Ever.
Again it might come as a surprise but others got it mostly right: http://en.wikipedia.org/wiki/Motorola_68882

Now better than rephrasing what the marketing department wrote, get the story about IEEE 754 from the Man himself. Please. And remove those rosy teinted glasses.
http://www.cs.berkeley.edu/~wkahan/i.../754story.html

Last edited by tbp : 04-24-2006 at 08:17 PM.
tbp is offline   Reply With Quote
Old 04-24-2006, 11:26 PM   #23
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,145
Default Re: Fast and accurate sine/cosine

Quote:
Originally Posted by tbp
Reciprocals and square roots do happen outside of vector normalization you know.
Indeed I know.
Quote:
Fact is i need both of them, fast and robust and if possible with a bit pattern that doesn't look like i got it from a RNG.
Then please explain what you're using it for exactly and what's the source of your frustration. Because right now you're barking up the wrong tree.
Quote:
Plus i wasn't addressing relative latency of 3DNow & SSE but functionnalities. It seems you've missed the wildcard in my previous post, so please check "3DNow! Technology Manual" for PFRCPIT1, PFRCPIT2, PFRSQIT1 & co.
Compare that to:
Code:
static __m128 rcp_sqrt_nr_robust_ps(const __m128 arg) { const __m128 mask = cmpeq(arg, all_zero()), nr = rsqrtps(arg), muls = mulps(mulps(nr, nr), arg), filter = andnps(mask, muls), beta = mulps(rt::cst::section.half, nr), gamma = subps(rt::cst::section.three, filter), final = mulps(beta, gamma); return final; }
I could have spammed the same kind of uglyness for related ops like 1/x etc... Not so attractive anymore eh? (or robust in fact)
I already know how Newton-Rhapson refinement works and the related 3DNow! instructions, I use it myself, now what's your point? That SSE intrinsics can be unattractive? I got hundreds of lines of run-time intrinsics that are very tricky and it's not all SSE code.
Quote:
Should i quote some more or did you get the point by now?
No need to quote anything, I know the optimization manual by heart. I still don't get your point though. You want an SSE cmov? Show me how that would work.
Quote:
Excuse me but isn't one of the main point of a CISC arch to provide code compression via byzantine instruction encoding?
As I'm sure you know, every modern processor is RISC behind the decoder. So even if they added every exotic instruction you can think of, they wouldn't execute much faster than when you have to implement them yourself. That's also why SSE is very RISC-like and adding more instructions would just be a waste of opcodes and microcode ROM. Compact code ain't that important any more.

Besides, it's not because you see a use for certain instructions for your specific application, that they need to be added to the instruction set. They constantly have to evaluate transistor cost and overhead versus speedup over all existing software. It simply pays off more to improve the architecture and not touch the instruction set. Even though it's CISC you can't go adding instructions like mad.
Quote:
I would have expected a single op doing exactly what a andn/and/or sequence would do; admitedly ternary ops aren't the norm on x86, but anything would have done the trick: use a blessed register as the implicit mask (aka accumulator) or encode it as a prefix or whatever.
Sorry, that's not a good idea. You would be giving a nice modern architecture a special-purpose accumulator register and have instructions with side effects. That's actually worse than having an FPU with a register stack.

Ternary operations would require register files with extra read ports. The area of a register file is proportional to the square of the number of ports. It also has a direct influence on latencies. The only time that can be acceptable is if all operations are ternary (which saves move operations).
Quote:
It makes no sense when decoding bandwitdh is scarce and you're struggling with troughoutput to say the norm is replace to conditional branches with a 3 freaking op sequence.
There is nothing we can change about the masking sequence. But that's not the actual problem anyway. Future x86 CPUs will just need better decoders, higher issue width and wider execution units, like I said before. That's infinitely better than breaking the architecture for that one application where it could add a few percent performance.
Quote:
Revisionist kids these days...
Don't mock me. I'm only giving explanations, not excuses. I too wish that x86 was a cleaner architecture or something superiour took over. For the time being we're just going to have to live with it and understand its limitiations so we still get the best out of it.
Quote:
You know there was a (computing) world before Intel and it will still be there when everybody will have forgotten about them.
Indeed I know.
Quote:
Repeat after me: stack based fpu stink, they never ever made sense. Ever.
Early compiler technology was stack based for most part. So the FPU's register stack made it easier to write compilers. Nowadays it makes no sense to write FPU code manually, compilers generate perfect code, so why bother. Also, fxch has zero latency so it's pretty much like accessing a flat register file anyway.
Quote:
Again it might come as a surprise but others got it mostly right: http://en.wikipedia.org/wiki/Motorola_68882
That's not a surprise at all considering it was designed 10 years later.
Quote:
Now better than rephrasing what the marketing department wrote, get the story about IEEE 754 from the Man himself. Please. And remove those rosy teinted glasses.
http://www.cs.berkeley.edu/~wkahan/i.../754story.html
I have already read that article. Not so long ago actually. I was looking for a way to detect problems caused by NaNs, you remember. The answer is not in the article but it was an interesting read. But what's your point? It's Intel who initiated the standard. It would be foolish to ignore their influence on the digital world.

As a matter of fact I got cool new glasses last Saturday, black with copper orange. Not rosy teinted though, crystal clear.
Nick is offline   Reply With Quote
Old 04-24-2006, 11:40 PM   #24
Reedbeta
DevMaster Staff
 
Join Date: Oct 2004
Location: Seattle, WA
Posts: 4,089
Default Re: Fast and accurate sine/cosine

Guys,

This is rapidly becoming pointless. Let's stay on the topic please.
___________________________________________
Currently working at Sucker Punch
reedbeta.com - OpenGL demos and other projects
Luabridge - a lightweight, dependency-free C++/Lua binding library.
CD Lite - an unobtrusive, minimal CD player application for Windows.
Reedbeta is online now   Reply With Quote
Old 04-25-2006, 04:35 AM   #25
bramz
Valued Member
 
Join Date: Aug 2005
Posts: 189
Default Re: Fast and accurate sine/cosine

So, I did a quick test on the THD, and this is what I've found:

taylor: 2.4%
parabolic: 3.8%
extra precision: 0.078%

So, you can see that your version with extra precision performs much better than the others. Your parabolic version is actually worse than the taylor one. That's of course because you're making rather large errors over almost the complete domain. Taylor only gets (very) bad towards pi. There is however one major fact in favour of the parabolic one: it's continuous.

The parabolic also has only odd harmonics (k = 2n+1), while taylor has both even and odd harmonics. The latter however, might be in favour of the Taylor series when it comes to psychoacoustics (how we experience the signal when we here it) because the third harmonic gives a 'covered' sound, while having the 2nd, 3rd, 4th, ... all together gives a 'brassy' sound. It's a bit like transistors vs tubes ...

It might be interesting to take the general formula with four coefficients, and to optimize it for minimal THD.

Bramz

BTW, that other one you proposed: 0.976 x + 0.0683 x^2 - 0.241 x^3 + 0.0384 x^4, it's THD is 0.32%.
___________________________________________
hi, i'm a signature viruz, plz set me as your signature and help me spread :)
Bramz' warehouse | LiAR isn't a raytracer

Last edited by bramz : 04-25-2006 at 05:49 AM.
bramz is offline   Reply With Quote
Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Forum Jump


All times are GMT -7. The time now is 07:34 PM.


Powered by vBulletin
Copyright ©2000 - 2010, Jelsoft Enterprises Ltd.