#1 |
|
New Member
Join Date: Mar 2006
Posts: 26
|
Hi everyone!
I wrote a function to invert a 4 by 4 matrix using SSE intrinsics. I used Cramer's rule, avoided computing any coefficient twice, and processed two 2 by 2 sub-factors at a time. It involves quite a lot of _mm_shuffle_ps, but considering that this instruction costs 1 cycle on Core i7 I thought it was a fair trade.
There are probably some ways to improve my code, but the result seems quite nice to me: on a Core 2 Q6600, built with VC2008, I get 162 cycles, whereas my original FPU implementation cost 918 cycles. Using _mm_rcp_ps instead of _mm_div_ps, it goes down to 135 cycles, but with some loss of accuracy. I would love to see the number of cycles needed on a Core i7!
I have added my mat4 product as well: 63 cycles instead of 378 cycles. I bet it could be improved further, so I am waiting for your comments!
Matrix inverse: Code:
Matrix product: Code:
-- EDIT -- Dot product: Code:
Last edited by Groove : 08-12-2009 at 03:15 AM. |
|
|
|
|
|
#2 |
|
Member
Join Date: Apr 2006
Location: Latvia
Posts: 72
|
Have you compared performance with matrix inversion from XNA Math Library from latest DirectX SDK?
http://msdn.microsoft.com/en-us/libr...2(loband).aspx |
|
|
|
|
|
#3 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
|
Quote:
I'll do some actual timing when I get the chance... By the way, to speed up the division without losing a great deal of precision, try using a Newton-Raphson iteration after using _mm_rcp_ps for the first approximation. |
|
|
|
|
|
|
#4 |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
|
Your code appears to be missing the _mm_dot_ps function (or is this an intrinsic that isn't recognised by Visual C++ 2005)? Also the 'one' variable isn't defined.
I tried benchmarking it on my Core i7, but the results weren't consistent. The problem is that it has Turbo Boost, which can increase the clock frequency temporarily. I could disable it to get accurate readings though. Anyway, could you post your benchmarking code so we use the exact same thing? The results can differ depending on code and memory layout... |
|
|
|
|
|
#5 | |
|
New Member
Join Date: Mar 2006
Posts: 26
|
Quote:
That's exactly why I would love to see the number of cycles on a Core i7. The code contains 43 _mm_shuffle_ps, which take 2 cycles each on the Q6600 if I remember correctly and only 1 cycle on the i7. So it may save 43 cycles... not bad at all. That's only theory, though. |
|
|
|
|
|
|
#6 |
|
Valued Member
Join Date: Mar 2008
Location: Finland
Posts: 225
|
I tested this on an i7 920, and it took 13.051580s (using QPC for timing) to execute 800 million matrix inverses (100 million iterations doing 8 inverses each on different matrices). I changed the code to use _mm_rcp_ps() and also replaced __m128 Det0 = _mm_dot_ps(in[0], Row2); with __m128 Det0 = _mm_dp_ps(in[0], Row2, 0xff); which is an SSE4.1 instruction though. I also changed inline to __forceinline, because otherwise the inverse code contained some function calls that significantly degraded the results (down to ~30s).
After each inverse I added the result to a static matrix to make sure the inverse had a side effect and the compiler (MSVC 2008) didn't optimize the code away, so that added 4 extra SIMD adds per inverse. The results were pretty consistent over the 10 runs I did (only 0.02% variation), and the 13.051580s result was from the best run.
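For anyone wanting to reproduce this kind of measurement, here is a rough sketch of the loop structure described above. It is not the code I ran: `kernel` is a placeholder standing in for the inverse, `run_bench` and `BenchResult` are names made up for this sketch, and std::chrono::steady_clock is used instead of QPC to keep it portable.

```cpp
#include <chrono>
#include <xmmintrin.h>

// Placeholder for the function under test (assumption: the real test
// called the matrix inverse here, which is not reproduced in this sketch).
inline __m128 kernel(__m128 v) { return _mm_mul_ps(v, v); }

struct BenchResult { double seconds; float checksum; };

// Runs `iterations` passes over 8 different inputs and accumulates every
// result, so the work has a visible side effect and cannot be discarded.
inline BenchResult run_bench(unsigned iterations)
{
    __m128 inputs[8];
    for (int k = 0; k < 8; ++k)
        inputs[k] = _mm_set1_ps(static_cast<float>(k + 1));

    __m128 acc = _mm_setzero_ps();
    auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iterations; ++i)
        for (int k = 0; k < 8; ++k)
            acc = _mm_add_ps(acc, kernel(inputs[k])); // one extra add per call
    auto t1 = std::chrono::steady_clock::now();

    float out[4];
    _mm_storeu_ps(out, acc);
    return { std::chrono::duration<double>(t1 - t0).count(), out[0] };
}
```

Printing or otherwise consuming `checksum` after the run is what keeps the accumulated work alive. An optimizer may still simplify a kernel as trivial as this placeholder, so the timing only means something once the real function is substituted.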
___________________________________________
www.spinxengine.com - Spin-X Engine |
|
|
|
|
|
#7 |
|
New Member
Join Date: Mar 2006
Posts: 26
|
How much benefit did you get using _mm_dp_ps instead of my really basic _mm_dot_ps?
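For reference, a shuffle-based dot product that needs only SSE1 could look something like the sketch below. It is an illustration of the general approach rather than the exact code from my first post, and the name `dot_ps` is used here only for this sketch. Like _mm_dp_ps with the 0xff mask, it leaves the sum in all four lanes.

```cpp
#include <xmmintrin.h>

// Illustrative SSE1 dot product (hypothetical name): returns a.b in every lane.
inline __m128 dot_ps(__m128 a, __m128 b)
{
    __m128 mul = _mm_mul_ps(a, b);                                  // (x, y, z, w) products
    __m128 swp = _mm_shuffle_ps(mul, mul, _MM_SHUFFLE(2, 3, 0, 1)); // (y, x, w, z)
    __m128 sum = _mm_add_ps(mul, swp);                              // (x+y, x+y, z+w, z+w)
    __m128 rev = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(0, 1, 2, 3)); // (z+w, z+w, x+y, x+y)
    return _mm_add_ps(sum, rev);                                    // x+y+z+w in all lanes
}
```

That is one multiply, two shuffles and two adds, versus a single instruction for _mm_dp_ps on SSE4.1 hardware.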
|
|
|
|
|
|
#8 |
|
Valued Member
Join Date: Mar 2008
Location: Finland
Posts: 225
|
Ah, didn't notice you edited your original post and added the _mm_dot_ps. With _mm_dot_ps the best run was 14.532731s. Without inlining, the times for _mm_dp_ps and _mm_dot_ps are 29.669986s and 30.244751s respectively. I don't know how much code the compiler is able to omit because of inlining, but I noticed, for example, that the inlined version had only 11 shuffles per inverse (vs 33 in the non-inlined version), so the non-inlined versions probably represent real-life performance more closely.
___________________________________________
www.spinxengine.com - Spin-X Engine |
|
|
|