DevMaster.net Forums
[[ Home | Forums | 3D Engines Database | Wiki | Articles/Tutorials | Game Dev Jobs | IRC Chat Network | Contact Us ]]

Go Back   DevMaster.net Forums > Programming & Development > General Development
User Name
Password
Register FAQ Members List Search Today's Posts Mark Forums Read

Reply
 
Thread Tools Search this Thread Display Modes
Old 01-16-2007, 08:18 AM   #1
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Approximate Math Library

Hi all,

I'm looking for the Intel Approximate Math Library. It used to be at www.intel.com/design/Pentium4/devtools/ but that page moved and I couldn't locate the library.

So, does anyone know where to find it, or can send me a copy? That would be much appreciated. I'm also interested in other approximation libraries/functions. The more options between precision and performance the better...

Thanks a lot,

Nick
Nick is offline   Reply With Quote
Old 01-17-2007, 09:21 AM   #2
Kenneth Gorking
Senior Member
 
Kenneth Gorking's Avatar
 
Join Date: Aug 2004
Location: Århus, Denmark
Posts: 688
Default Re: Approximate Math Library

I have never heard of this library, but AMD has a library of their. I'm sure you know this already, but I'll post it anyway just in case: http://developer.amd.com/acml.jsp
___________________________________________
"Stupid bug! You go squish now!!" - Homer Simpson
Kenneth Gorking is offline   Reply With Quote
Old 01-17-2007, 10:22 AM   #3
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

AMD's library doesn't really have what I'm looking for. I seek fast implementations of transcendental functions (exp, log, sin, etc). Thanks anyway, it was worth checking out.

So the Intel Approximate Math Library contains approximations of transcendental functions, optimized for SSE. For some time I have used my own implementations (sin/cos) but now I'm looking for more precision with minimal performance impact...
Nick is offline   Reply With Quote
Old 01-17-2007, 11:17 AM   #4
Kenneth Gorking
Senior Member
 
Kenneth Gorking's Avatar
 
Join Date: Aug 2004
Location: Århus, Denmark
Posts: 688
Default Re: Approximate Math Library

The CRT has a few functions implemented using SSE2 (see here), but they are most likely built for precision. The CRT source does however come with VS2005, so you could dig up those files and enhance them for speed
___________________________________________
"Stupid bug! You go squish now!!" - Homer Simpson
Kenneth Gorking is offline   Reply With Quote
Old 01-18-2007, 03:29 AM   #5
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

Interesting! It seems pretty hard to derive a faster (single-precision) version from it, but it's always good to have a reference.
Nick is offline   Reply With Quote
Old 01-18-2007, 10:46 AM   #6
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

Since the Intel library is still nowhere to be found, I started writing my own approximations. Here's a 2^x implementation (first attempt):
Code:
float exp2(float x) { static const float one = 1.0f; static const float half = 0.5f; static const float lim_two = 1.9999999f; static const float C0[256] = {.547404266246f, .548061348142f, .548725333807f, .549398399167f, .550075922598f, .550760241024f, .551454706663f, .552154084359f, .552858883435f, .553574967496f, .554297163206f, .555026093302f, .555760495106f, .556505137841f, .557256046635f, .558017083881f, .558781729040f, .559558847539f, .560337005182f, .561128551344f, .561926919212f, .562731360892f, .563548838504f, .564371785305f, .565202403662f, .566041175842f, .566885148265f, .567741995681f, .568603271104f, .569474540778f, .570356611167f, .571248305001f, .572143490366f, .573051865152f, .573965673107f, .574892291723f, .575822287504f, .576764054387f, .577713794250f, .578675521802f, .579646563842f, .580622066161f, .581609491764f, .582606143121f, .583612970496f, .584629192183f, .585655505258f, .586690228227f, .587733473599f, .588788523879f, .589851795687f, .590925279595f, .592009141821f, .593102901511f, .594205792172f, .595321583429f, .596446634193f, .597577914584f, .598723732752f, .599879808291f, .601047527774f, .602222182051f, .603410536625f, .604607039548f, .605814947112f, .607033833104f, .608266709013f, .609506083140f, .610760886405f, .612025725384f, .613302240932f, .614585463030f, .615883610698f, .617191217751f, .618511752134f, .619845226500f, .621187546929f, .622542829136f, .623914626393f, .625290774906f, .626682429570f, .628088718711f, .629505305967f, .630928111994f, .632369622796f, .633820437496f, .635288816972f, .636766410268f, .638255823256f, .639759511520f, .641275658394f, .642806215397f, .644347352744f, .645904478891f, .647470860480f, .649053754624f, .650645737483f, .652253552238f, .653875847907f, .655506204107f, .657158529182f, .658824473497f, .660502429383f, .662188430285f, .663896943735f, .665614322456f, .667344320807f, .669091861220f, .670854384730f, .672633065213f, .674418265280f, .676229143542f, .678048816547f, .679881705149f, .681730321995f, .683593925186f, .685478115237f, .687372988183f, .689280701315f, .691208589068f, .693153490063f, .695110683979f, .697086170558f, .699075467598f, .701082480523f, .703098111035f, .705136926814f, .707188064752f, .709255034137f, .711344041865f, .713449959172f, .715567879355f, .717699596625f, .719857427125f, .722028307800f, .724216235913f, .726413096106f, .728639561732f, .730875455600f, .733132357639f, .735405868218f, .737702530438f, .740009520055f, .742332281356f, .744679668196f, .747037600146f, .749420315605f, .751820376930f, .754243487814f, .756683762441f, .759133959936f, .761605643019f, .764099729889f, .766614919527f, .769146093637f, .771693145114f, .774267141070f, .776858393243f, .779466882269f, .782096835996f, .784742415436f, .787410123430f, .790101570173f, .792815520645f, .795542431752f, .798294870351f, .801063242968f, .803849366342f, .806665638513f, .809494760355f, .812353037567f, .815226352727f, .818124218289f, .821037431411f, .823977286773f, .826945359586f, .829920082804f, .832926883189f, .835951538712f, .839006712543f, .842080344842f, .845177216775f, .848284765471f, .851423322415f, .854588979701f, .857776791512f, .860984910999f, .864215879261f, .867469536588f, .870748683758f, .874051542963f, .877378717421f, .880731452223f, .884101609574f, .887502615964f, .890925768871f, .894372845122f, .897843135408f, .901337846858f, .904862544340f, .908410015848f, .911984035138f, .915576759348f, .919196625760f, .922845709434f, .926529189028f, .930225124843f, .933945305924f, .937700927963f, .941479193749f, .945282181080f, .949117765062f, .952980179587f, .956873412255f, .960779020905f, .964718014705f, .968687508318f, .972685203315f, .976705892948f, .980760930721f, .984836290213f, .988951866922f, .993080243943f, .997245671924f, 1.00144630340f, 1.00567323430f, 1.00991602499f, 1.01419270811f, 1.01850230704f, 1.02285032212f, 1.02721871206f, 1.03161636643f, 1.03604653022f, 1.04050699395f, 1.04499850272f, 1.04951852054f, 1.05407459978f, 1.05865921545f, 1.06326856892f, 1.06791602645f, 1.07259384061f, 1.07729671165f, 1.08203878536f, 1.08680950686f, 1.09161248555f, 1.09645085758f, 1.10132609310f, 1.10623766041f, 1.11114750589f, 1.11612095277f, 1.12113714360f, 1.12616425808f, 1.13122641447f, 1.13631798693f, 1.14146805990f, 1.14663629983f}; static const float C1[256] = {.212044759469f, .210735702190f, .209418015443f, .208087475652f, .206753275519f, .205410853969f, .204053729978f, .202692211082f, .201325322173f, .199941790541f, .198551704706f, .197153917470f, .195750881257f, .194333564491f, .192909615777f, .191471793032f, .190032461265f, .188575010632f, .187120933209f, .185647204506f, .184166170344f, .182679245802f, .181173672353f, .179663467005f, .178144635485f, .176616349360f, .175084037800f, .173533838064f, .171981116491f, .170415875372f, .168836776605f, .167246022818f, .165654589240f, .164045289244f, .162431953647f, .160801620569f, .159170951952f, .157525268876f, .155871296686f, .154202124016f, .152522495456f, .150840826888f, .149144311266f, .147437665224f, .145719346014f, .143990758331f, .142250789144f, .140502354043f, .138745315445f, .136974218191f, .135195151140f, .133404847135f, .131603099825f, .129790785291f, .127969230244f, .126132298677f, .124286064322f, .122435531964f, .120567182576f, .118688092798f, .116796095743f, .114898862963f, .112985536212f, .111065131312f, .109132474142f, .107188321781f, .105227969151f, .103263383910f, .101280483950f, .99287883117e-1f, .97283061235e-1f, .95273867908e-1f, .93247492774e-1f, .91212558811e-1f, .89163726456e-1f, .87101072766e-1f, .85030989812e-1f, .82947185421e-1f, .80844315542e-1f, .78741073956e-1f, .76620455758e-1f, .74483906833e-1f, .72338082587e-1f, .70189189871e-1f, .68018435591e-1f, .65840082252e-1f, .63641802316e-1f, .61436186248e-1f, .59219384845e-1f, .56987821889e-1f, .54744275251e-1f, .52485937632e-1f, .50218520627e-1f, .47934149447e-1f, .45642766126e-1f, .43333822826e-1f, .41018216474e-1f, .38686201364e-1f, .36339829169e-1f, .33988431451e-1f, .31612040002e-1f, .29222787772e-1f, .26823040651e-1f, .24418495167e-1f, .21988609993e-1f, .19552890922e-1f, .17106039552e-1f, .14641181181e-1f, .12162020545e-1f, .9666998685e-2f, .7169663523e-2f, .4643309730e-2f, .2111596679e-2f, -.431598824e-3f, -.2989690662e-2f, -.5561577943e-2f, -.8154880822e-2f, -.10755886294e-1f, -.13367524684e-1f, -.15999743658e-1f, -.18648123903e-1f, -.21306166293e-1f, -.23981953789e-1f, -.26669330748e-1f, -.29373502333e-1f, -.32082162062e-1f, -.34814814297e-1f, -.37556805223e-1f, -.40312772251e-1f, -.43090887470e-1f, -.45884226576e-1f, -.48686220012e-1f, -.51499213346e-1f, -.54339341155e-1f, -.57189308948e-1f, -.6005430929e-1f, -.6292367512e-1f, -.6582431504e-1f, -.6872984049e-1f, -.7165524503e-1f, -.7459473297e-1f, -.7755666791e-1f, -.8052444140e-1f, -.8350502578e-1f, -.8650968268e-1f, -.8952031711e-1f, -.9255503332e-1f, -.9560425465e-1f, -.9867513068e-1f, -.10176011141f, -.10485000844f, -.10795935979f, -.11108920339f, -.11423780756f, -.11739868933f, -.12057166547f, -.12377042647f, -.12698282286f, -.13020878013f, -.13345344254f, -.13670954099f, -.13998500792f, -.14328171413f, -.14659803793f, -.14992226374f, -.15326963975f, -.15662841273f, -.16000074335f, -.16340152411f, -.16680978718f, -.17024508632f, -.17369037027f, -.17715697361f, -.18063381546f, -.18413430337f, -.18766017755f, -.19118577445f, -.19474116819f, -.19830943646f, -.20190541522f, -.20551481061f, -.20914316989f, -.21277573100f, -.21643619837f, -.22011987890f, -.22382091824f, -.22753710375f, -.23127131148f, -.23502327513f, -.23879613644f, -.24258775687f, -.24639874443f, -.25023043386f, -.25407345559f, -.25794304031f, -.26182918290f, -.26573382630f, -.26965609165f, -.27359726148f, -.27756351145f, -.28154663028f, -.28555077208f, -.28956708486f, -.29360493441f, -.29766652900f, -.30175749946f, -.30585341675f, -.30996731111f, -.31411145322f, -.31827160793f, -.32245000439f, -.32665519190f, -.33088073838f, -.33513090577f, -.33938551392f, -.34366739446f, -.34797328516f, -.35230059730f, -.35664361881f, -.36101452016f, -.36539810091f, -.36981565376f, -.37423767670f, -.37869008365f, -.38317075577f, -.38767009071f, -.39217694488f, -.39671041296f, -.40126933906f, -.40585941045f, -.41046149518f, -.41508490180f, -.41973294431f, -.42440320388f, -.42909637229f, -.43380970842f, -.43855098623f, -.44331228018f, -.44808957363f, -.45289663497f, -.45772533171f, -.46257013252f, -.46744551144f, -.47234051123f, -.47725875550f, -.48220335041f, -.48717569030f, -.49217510643f, -.49716286338f, -.50220524890f, -.50728090192f, -.51235755815f, -.51745955485f, -.52258113148f, -.52775139108f, -.53292971984f}; static const float C2[256] = {.240550974386f, .241202959893f, .241856701004f, .242514265929f, .243171103188f, .243829457574f, .244492480331f, .245155117376f, .245817854428f, .246486128053f, .247155039241f, .247825134276f, .248495241281f, .249169654431f, .249844713280f, .250523830778f, .251201163433f, .251884509323f, .252563787406f, .253249747676f, .253936606109f, .254623711603f, .255316926931f, .256009778763f, .256704095772f, .257400249962f, .258095764201f, .258796915826f, .259496734037f, .260199725451f, .260906458608f, .261615922477f, .262323223326f, .263035992751f, .263748082312f, .264465202209f, .265180012111f, .265898946252f, .266619044658f, .267343298130f, .268069619098f, .268794376437f, .269523080941f, .270253687830f, .270986837738f, .271721917527f, .272459385827f, .273197995608f, .273937798870f, .274681077710f, .275425261899f, .276171707452f, .276920486684f, .277671219387f, .278423348235f, .279179385328f, .279936813823f, .280693582599f, .281455206542f, .282218776699f, .282985153731f, .283751228711f, .284521373431f, .285291943163f, .286065008210f, .286840250780f, .287619522875f, .288398060167f, .289181428600f, .289966204157f, .290753368791f, .291539837984f, .292330618012f, .293122323170f, .293917023197f, .294714665277f, .295512769094f, .296313753856f, .297119642359f, .297923266738f, .298731122037f, .299542626941f, .300355242358f, .301166621541f, .301983850061f, .302801532724f, .303624282505f, .304447368111f, .305272224846f, .306100168085f, .306930150725f, .307763195441f, .308597186775f, .309435004672f, .310272993070f, .311114998221f, .311957035187f, .312802638726f, .313651045018f, .314498876947f, .315353314930f, .316209965286f, .317067971632f, .317925302128f, .318789261216f, .319652892561f, .320518078729f, .321387232845f, .322259028085f, .323133993420f, .324007380664f, .324888508986f, .325769102908f, .326651295684f, .327536259333f, .328423600216f, .329315922731f, .330208493193f, .331102319414f, .332000786889f, .332902364781f, .333804829061f, .334710914663f, .335618521194f, .336529395375f, .337439388370f, .338355041149f, .339271424585f, .340190082999f, .341113718319f, .342040006406f, .342966760909f, .343894760691f, .344829302014f, .345764673269f, .346702572919f, .347639507987f, .348584246701f, .349528173351f, .350476153452f, .351426290935f, .352381270258f, .353335726296f, .354291903376f, .355253394261f, .356214397492f, .357180679851f, .358149170924f, .359122122768f, .360097126490f, .361071279072f, .362049162020f, .363031076715f, .364016460615f, .365003272896f, .365991452648f, .366985245318f, .367980853922f, .368978251405f, .369979013823f, .370980890602f, .371986312192f, .372995831811f, .374008931067f, .375022025929f, .376039753227f, .377058524791f, .378078994230f, .379105644931f, .380132134704f, .381164337027f, .382197114688f, .383233855329f, .384271234863f, .385313243354f, .386360369799f, .387404990922f, .388456010655f, .389508406520f, .390566534906f, .391626171784f, .392688937114f, .393750505484f, .394817797166f, .395889415356f, .396963639187f, .398039817634f, .399118775070f, .400200421917f, .401285649507f, .402373826907f, .403465115321f, .404559881707f, .405655440162f, .406756120792f, .407859059092f, .408964796284f, .410073072953f, .411184239170f, .412300018644f, .413418085218f, .414539592392f, .415662053303f, .416788077635f, .417918262200f, .419054147477f, .420188943978f, .421326262837f, .422469476617f, .423614637508f, .424762353579f, .425914955357f, .427070660204f, .428230616859f, .429389315399f, .430552969348f, .431720668445f, .432891695257f, .434064493914f, .435242336232f, .436421114570f, .437606536221f, .438790675206f, .439980464677f, .441175310294f, .442372633912f, .443569472135f, .444770890044f, .445976559606f, .447187960487f, .448400030952f, .449615218011f, .450834377285f, .452056858160f, .453282828803f, .454511559547f, .455745060764f, .456981256106f, .458219094270f, .459462130949f, .460708242561f, .461955996149f, .463209104131f, .464464733125f, .465723802255f, .466987089818f, .468254934518f, .469527143677f, .470793869973f, .472071940011f, .473355895138f, .474637566791f, .475923104541f, .477211044044f, .478508675487f, .479805784911f}; #define EXP2_CHECK_FOR_LIMITS 1 __asm { movss xmm0, x movss xmm1, xmm0 subss xmm1, half cvtss2si ecx, xmm1 cvtsi2ss xmm2, ecx add ecx, 127 #if EXP2_CHECK_FOR_LIMITS test ecx, 0xFFFFFF00 jnz overflow #endif shl ecx, 23 movd xmm7, ecx subss xmm0, xmm2 addss xmm0, one minss xmm0, lim_two movd eax, xmm0 and eax, 0x007F8000 shr eax, 13 movss xmm1, C2[eax] mulss xmm1, xmm0 addss xmm1, C1[eax] mulss xmm1, xmm0 addss xmm1, C0[eax] mulss xmm1, xmm7 movss x, xmm1 #if EXP2_CHECK_FOR_LIMITS jmp end overflow: comiss xmm0, one jp not_a_number jb underflow mov x, 0x7F800000 jmp end underflow: mov x, 0x00000000 jmp end not_a_number: movss x, xmm0 end: #endif } return x; }
It's accurate to 2 ulps. Basically it uses the 8 upper mantissa bits to lookup the coefficients of a quadratic polynomial, determined with the minimax algorithm. It handles all inputs (denormals, infinity, NaN), but flushes denormal results to 0.0.

All ideas to make it more accurate and/or faster are welcome!
Nick is offline   Reply With Quote
Old 01-18-2007, 03:46 PM   #7
Reedbeta
DevMaster Staff
 
Join Date: Oct 2004
Location: Seattle, WA
Posts: 3,707
Default Re: Approximate Math Library

What are ulps? (Couldn't find the term with google...)
___________________________________________
Currently working at Sucker Punch
reedbeta.com - OpenGL demos and other projects
Luabridge - a lightweight, dependency-free C++/Lua binding library.
CD Lite - an unobtrusive, minimal CD player application for Windows.
Reedbeta is offline   Reply With Quote
Old 01-18-2007, 04:43 PM   #8
Dia Kharrat
DevMaster Staff
 
Join Date: Jan 2003
Posts: 1,201
Default

Nick: Is this what you're looking for?

btw, ulps stands for Units in the Last Place.
Dia Kharrat is offline   Reply With Quote
Old 01-18-2007, 05:11 PM   #9
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

Quote:
Originally Posted by Dia Kharrat
Is this what you're looking for?
Thanks!
Nick is offline   Reply With Quote
Old 01-20-2007, 02:48 AM   #10
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

The accuracy/performance balance of the Intel library failed to impress me. At least for the exp2 implementation...

Here's my own vector implementation:
Code:
#include <emmintrin.h> __m128 exp2(__m128 x) { static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000}; // 129.00000e+0f static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF}; // -126.99999e+0f static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000}; // 0.5e+0f static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F}; // 127 static const __declspec(align(16)) int C5[4] = {0x3AF61905, 0x3AF61905, 0x3AF61905, 0x3AF61905}; // 1.8775767e-3f static const __declspec(align(16)) int C4[4] = {0x3C134806, 0x3C134806, 0x3C134806, 0x3C134806}; // 8.9893397e-3f static const __declspec(align(16)) int C3[4] = {0x3D64AA23, 0x3D64AA23, 0x3D64AA23, 0x3D64AA23}; // 5.5826318e-2f static const __declspec(align(16)) int C2[4] = {0x3E75EAD4, 0x3E75EAD4, 0x3E75EAD4, 0x3E75EAD4}; // 2.4015361e-1f static const __declspec(align(16)) int C1[4] = {0x3F31727B, 0x3F31727B, 0x3F31727B, 0x3F31727B}; // 6.9315308e-1f static const __declspec(align(16)) int C0[4] = {0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF}; // 9.9999994e-1f __asm { movaps xmm0, x minps xmm0, lim1 maxps xmm0, lim2 movaps xmm1, xmm0 subps xmm1, half cvtps2dq xmm2, xmm1 cvtdq2ps xmm1, xmm2 paddd xmm2, base pslld xmm2, 23 subps xmm0, xmm1 movaps xmm1, C5 mulps xmm1, xmm0 addps xmm1, C4 mulps xmm1, xmm0 addps xmm1, C3 mulps xmm1, xmm0 addps xmm1, C2 mulps xmm1, xmm0 addps xmm1, C1 mulps xmm1, xmm0 addps xmm1, C0 mulps xmm1, xmm2 movaps x, xmm1 } return x; }
Unlike the previous version it requires SSE2. It doesn't use any lookup tables, but instead uses a polynomial of degree five. This creates a fairly long dependency chain, wich means a high latency. But at 21 instructions this is compensated with a high throughput. In English this means it's very suited for computing lots of 2x values in a loop.

It has a relative accuracy of 2-22.33. It returns infinity on overflow, 0.0 on underflow, and keeps it accracy with denormals. For NaN inputs it returns infinity.

Last edited by Nick : 01-20-2007 at 01:08 PM.
Nick is offline   Reply With Quote
Old 01-20-2007, 07:29 AM   #11
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

For those interested:

Here's a version with a weighted minimax polynomial of order four, yielding 2-18.54 relative precision:
Code:
#include <emmintrin.h> __m128 exp2(__m128 x) { static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000}; // 129.00000e+0f static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF}; // -126.99999e+0f static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000}; // 0.5e+0f static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F}; // 127 static const __declspec(align(16)) int C4[4] = {0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69}; // 1.3534167e-2f static const __declspec(align(16)) int C3[4] = {0x3D5509F9, 0x3D5509F9, 0x3D5509F9, 0x3D5509F9}; // 5.2011464e-2f static const __declspec(align(16)) int C2[4] = {0x3E773CC5, 0x3E773CC5, 0x3E773CC5, 0x3E773CC5}; // 2.4144275e-1f static const __declspec(align(16)) int C1[4] = {0x3F3168B3, 0x3F3168B3, 0x3F3168B3, 0x3F3168B3}; // 6.9300383e-1f static const __declspec(align(16)) int C0[4] = {0x3F800016, 0x3F800016, 0x3F800016, 0x3F800016}; // 1.0000026f __asm { movaps xmm0, x minps xmm0, lim1 maxps xmm0, lim2 movaps xmm1, xmm0 subps xmm1, half cvtps2dq xmm2, xmm1 cvtdq2ps xmm1, xmm2 paddd xmm2, base pslld xmm2, 23 subps xmm0, xmm1 movaps xmm1, C4 mulps xmm1, xmm0 addps xmm1, C3 mulps xmm1, xmm0 addps xmm1, C2 mulps xmm1, xmm0 addps xmm1, C1 mulps xmm1, xmm0 addps xmm1, C0 mulps xmm1, xmm2 movaps x, xmm1 } return x; }
Here's a version with a weighted minimax polynomial of order three, yielding 2-13.70 relative precision:
Code:
#include <emmintrin.h> __m128 exp2(__m128 x) { static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000}; // 129.00000e+0f static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF}; // -126.99999e+0f static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000}; // 0.5e+0f static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F}; // 127 static const __declspec(align(16)) int C3[4] = {0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52}; // 7.8024521e-2f static const __declspec(align(16)) int C2[4] = {0x3E677E26, 0x3E677E26, 0x3E677E26, 0x3E677E26}; // 2.2606716e-1f static const __declspec(align(16)) int C1[4] = {0x3F322226, 0x3F322226, 0x3F322226, 0x3F322226}; // 6.9583356e-1f static const __declspec(align(16)) int C0[4] = {0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19}; // 9.9992520e-1f __asm { movaps xmm0, x minps xmm0, lim1 maxps xmm0, lim2 movaps xmm1, xmm0 subps xmm1, half cvtps2dq xmm2, xmm1 cvtdq2ps xmm1, xmm2 paddd xmm2, base pslld xmm2, 23 subps xmm0, xmm1 movaps xmm1, C3 mulps xmm1, xmm0 addps xmm1, C2 mulps xmm1, xmm0 addps xmm1, C1 mulps xmm1, xmm0 addps xmm1, C0 mulps xmm1, xmm2 movaps x, xmm1 } return x; }
Here's a version with a weighted minimax polynomial of order two, yielding 2-9.17 relative precision:
Code:
#include <emmintrin.h> __m128 exp2(__m128 x) { static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000}; // 129.00000e+0f static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF}; // -126.99999e+0f static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000}; // 0.5e+0f static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F}; // 127 static const __declspec(align(16)) int C2[4] = {0x3EACA418, 0x3EACA418, 0x3EACA418, 0x3EACA418}; // 3.3718944e-1f static const __declspec(align(16)) int C1[4] = {0x3F285ADA, 0x3F285ADA, 0x3F285ADA, 0x3F285ADA}; // 6.5763628e-1f static const __declspec(align(16)) int C0[4] = {0x3F803884, 0x3F803884, 0x3F803884, 0x3F803884}; // 1.0017247f __asm { movaps xmm0, x minps xmm0, lim1 maxps xmm0, lim2 movaps xmm1, xmm0 subps xmm1, half cvtps2dq xmm2, xmm1 cvtdq2ps xmm1, xmm2 paddd xmm2, base pslld xmm2, 23 subps xmm0, xmm1 movaps xmm1, C2 mulps xmm1, xmm0 addps xmm1, C1 mulps xmm1, xmm0 addps xmm1, C0 mulps xmm1, xmm2 movaps x, xmm1 } return x; }
Every floating-point number has been tested, using an exhaustive loop. For the record, Intel's library claims an average relative error of 0.0018, but it peaks at a terrible 0.29 relative error.

Last edited by Nick : 01-20-2007 at 12:58 PM.
Nick is offline   Reply With Quote
Old 01-20-2007, 08:56 AM   #12
davepermen
Senior Member
 
davepermen's Avatar
 
Join Date: Jan 2003
Location: Switzerland
Posts: 1,333
Default Re: Approximate Math Library

Hehe, looks cool. If i would work currently in C++, I'd sure plug this into my code directly and look how it affects the image, depending on the quality of the function

But I don't think I'll touch C++ in the sooner time at all anymore... And I don't have my code lying around, else, it would be a simple copy-paste.

But I'll remember
___________________________________________
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
davepermen is offline   Reply With Quote
Old 01-20-2007, 11:09 AM   #13
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

Quote:
Originally Posted by davepermen
But I don't think I'll touch C++ in the sooner time at all anymore...
Totally fallen in love with C#?
Nick is offline   Reply With Quote
Old 01-25-2007, 09:25 AM   #14
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

A log2 implementation (first attempt):
Code:
__m128 log2(__m128 x) { static const __declspec(align(16)) int exp[4] = {0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000}; static const __declspec(align(16)) int one[4] = {0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000}; // 1.0f static const __declspec(align(16)) int off[4] = {0x3FBF8000, 0x3FBF8000, 0x3FBF8000, 0x3FBF8000}; // 1.4960938f static const __declspec(align(16)) int mant[4] = {0x007FFFFF, 0x007FFFFF, 0x007FFFFF, 0x007FFFFF}; static const __declspec(align(16)) int sign[4] = {0x80000000, 0x80000000, 0x80000000, 0x80000000}; static const __declspec(align(16)) int base[4] = {0x43800000, 0x43800000, 0x43800000, 0x43800000}; // 256.0f static const __declspec(align(16)) int C5[4] = {0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5}; // -3.4436006e-2f static const __declspec(align(16)) int C4[4] = {0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD}; // 3.1821337e-1f static const __declspec(align(16)) int C3[4] = {0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9}; // -1.2315303f static const __declspec(align(16)) int C2[4] = {0x4026537B, 0x4026537B, 0x4026537B, 0x4026537B}; // 2.5988452f static const __declspec(align(16)) int C1[4] = {0xC054bFAD, 0xC054bFAD, 0xC054bFAD, 0xC054bFAD}; // -3.3241990f static const __declspec(align(16)) int C0[4] = {0x4047691A, 0x4047691A, 0x4047691A, 0x4047691A}; // 3.1157899f __asm { movaps xmm0, x movaps xmm1, xmm0 andps xmm1, exp psrld xmm1, 8 orps xmm1, one subps xmm1, off mulps xmm1, base andps xmm0, mant orps xmm0, one movaps xmm2, C5 mulps xmm2, xmm0 addps xmm2, C4 mulps xmm2, xmm0 addps xmm2, C3 mulps xmm2, xmm0 addps xmm2, C2 mulps xmm2, xmm0 addps xmm2, C1 mulps xmm2, xmm0 addps xmm2, C0 subps xmm0, one mulps xmm0, xmm2 addps xmm1, xmm0 movaps x, xmm1 } return x; }
Unfortunately logarithms don't converge very fast. It's also not that easy to get good behaviour for the limits.
Nick is offline   Reply With Quote
Old 01-25-2007, 09:41 AM   #15
Kenneth Gorking
Senior Member
 
Kenneth Gorking's Avatar
 
Join Date: Aug 2004
Location: Århus, Denmark
Posts: 688
Default Re: Approximate Math Library

Any benchmarks to go with that? Not that I'm doubting it will be faster, but it's always nice to know how much faster it is.
___________________________________________
"Stupid bug! You go squish now!!" - Homer Simpson
Kenneth Gorking is offline   Reply With Quote
Old 01-25-2007, 01:33 PM   #16
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

Quote:
Originally Posted by Kenneth Gorking
Any benchmarks to go with that? Not that I'm doubting it will be faster, but it's always nice to know how much faster it is.
I'm not so sure it's faster. The dependency chains are quite bad, while the Intel library appears to use properly scheduled instructions.

Anyway, I'm considering making it an article for the 'Daily Code Gem', including benchmark results. I can't promise when I have the time for it though...
Nick is offline   Reply With Quote
Old 01-25-2007, 05:23 PM   #17
davepermen
Senior Member
 
davepermen's Avatar
 
Join Date: Jan 2003
Location: Switzerland
Posts: 1,333
Default Re: Approximate Math Library

Quote:
Originally Posted by Nick
Totally fallen in love with C#?

Yes. You and you're work are the only reason to regret this... But sorry, thats not enough...
___________________________________________
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
davepermen is offline   Reply With Quote
Old 01-25-2007, 08:40 PM   #18
Nautilus
Valued Member
 
Join Date: Nov 2004
Location: Milan -ITALY-
Posts: 297
Default Re: Approximate Math Library

There's one thing I don't understand.
I'll quote from "AMaths.pdf", page 5:
Quote:
Packed vs. Scalar AM Library Functions
Compared to the packed AM Library functions, the scalar ones are slower per operand but
faster per call. Therefore, when just one result is needed, it is faster to call the scalar version.
Otherwise (for two or more results), the packed version should be used.
What does it mean, exactly?

Regards,
Ciao ciao : )
___________________________________________
-Nautilus

1.551640271931635485e+1292913986 ?
Why, that's: 2 ^ ((2 ^ (2 ^ ((2 ^ 2) + (2 ^ (2 - 2))))) - (2 ^ (2 - 2))). Now verify, please...
Nautilus is offline   Reply With Quote
Old 01-25-2007, 11:27 PM   #19
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

Quote:
Originally Posted by davepermen
Yes. You and you're work are the only reason to regret this... But sorry, thats not enough...
Yeah, C# is a very nice language. I'm actually considering writing my high-level code in it, and leave the performance critical parts in C++/assembly. I'm only afraid it could become messier than just sticking with C++. For the things I work on, high-level code can become low-level code the next day.

A native C# compiler with inline assembly would convince me in a second though...
Nick is offline   Reply With Quote
Old 01-25-2007, 11:47 PM   #20
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
Default Re: Approximate Math Library

Quote:
Originally Posted by Nautilus
What does it mean, exactly?
Take for example am_sin_ss and am_sin_ps. The first computes one sine (a scalar), while the second computes four sines (a vector). The scalar version takes 48 clock cycles, and the vector version 66 clock cycles. So the scalar version is faster, but it only computes one sine! If you need only one sine (for example for constructing a rotation matrix), then the scalar version is optimal. But if you need to compute sines for a whole array of angles (for example for a Fourier series), then the vector version is more efficient. Computing four sines separately with the scalar version is slower than computing four in parallel with the vector version.
Nick is offline   Reply With Quote
Old 01-26-2007, 06:42 AM   #21
Nautilus
Valued Member
 
Join Date: Nov 2004
Location: Milan -ITALY-
Posts: 297
Default Re: Approximate Math Library

Thank you very much, Nick
___________________________________________
-Nautilus

1.551640271931635485e+1292913986 ?
Why, that's: 2 ^ ((2 ^ (2 ^ ((2 ^ 2) + (2 ^ (2 - 2))))) - (2 ^ (2 - 2))). Now verify, please...
Nautilus is offline   Reply With Quote
Old 12-15-2007, 06:13 PM   #22
eigma
New Member
 
Join Date: Dec 2007
Location: Ottawa, Canada
Posts: 1
Thumbs up Re: Approximate Math Library

Hi Nick,

I'm a novice at SSE programming -- I'm teaching myself SSE by using a mix of SSE and FP to implement a linear algebra algorithm. I'm using FP because the algorithm involves the logarithm function. I've determined that this is the main bottleneck (according to gprof, I spend 86% of the time in the logarithm function), so I'm looking for ways to implement the logarithm function in pure SSE.

I'm reluctant to reuse your log2 code verbatim; I think that by redesigning the algorithm with my particular application in mind (for example, my required interval of convergence, precision, etc.), I can better optimize the overall performance of my application.

Do you have any pointers on how to design a logarithm approximator?

From what I can see in your code, you're extracting the exponent field and approximating the logarithm of the mantissa using a 5th-order polynomial.

How did you arrive at the coefficients? I would either use 5 terms of the Taylor series expansion for log, or pick 5 points (perhaps evenly-spaced) on the interval of interest and obtain a 5th-order polynomial that passes through those points (and perhaps visibly verify that the error between those points is acceptable).

Anyway, I'm not very experienced in this domain, which is why I'm hoping you could provide some pointers.

Thanks,

Catalin Patulea
eigma is offline   Reply With Quote
Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Forum Jump


All times are GMT -7. The time now is 04:45 PM.


Powered by vBulletin
Copyright ©2000 - 2009, Jelsoft Enterprises Ltd.