|
|
#1 |
|
New Member
Join Date: Dec 2007
Posts: 9
|
Hi there, I have been trying to learn more about writing SSE code, but for some reason I cannot get my code much faster than the normal CPU code.
Is there someone out there who can take a look at my code and hopefully tell me what's wrong with it? Here is the link: http://codepad.org/plz5aEW6 I'm using MS Visual Studio 2005 with all optimizations set to the highest level. Thanks in advance, Christian |
|
|
|
|
|
#2 |
|
Valued Member
Join Date: Mar 2008
Location: Finland
Posts: 225
|
It's because in your test code most of the overhead comes from cache misses and looping. Unroll your test loop a few times and read from a single index and you'll see the expected improvement.
___________________________________________
www.spinxengine.com - Spin-X Engine |
|
|
|
|
|
#3 |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
|
Looks fairly OK to me:
FPU result: 4571 ms
SSE result: 2543 ms
Although there is indeed some overhead from the loop, you can clearly see a significant benefit from using SSE. Anyway, what CPU do you have? Pentium 4 and older Athlons don't have 128-bit SSE execution units, so it's a lot harder to get any speedup out of them. |
|
|
|
|
|
#4 |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
|
64-bit version:
FPU results: 4587 ms
SSE results: 2277 ms |
|
|
|
|
|
#5 |
|
New Member
Join Date: Dec 2007
Posts: 9
|
Thanks for all of your replies. I have significantly improved the SSE code. Basically, rearranging the loops has helped a lot.
I have an Intel Core 2 Quad Q9400, but I'm using an old MS compiler from 2005, which probably isn't very good at optimizing code anymore. Thanks for your remarks, Christian |
|
|
|
|
|
#6 |
|
Valued Member
Join Date: Mar 2008
Location: Finland
Posts: 225
|
For the SSE code there isn't much for the compiler to optimize, though. Anyway, the operation you perform on the SoA data is very simple (only a dot product), so cache misses contribute quite a bit to the figure, on my PC at least, and the improvement isn't close to the 4x you would expect. If I change the code so that it performs the operation only at a fixed index, essentially making all the reads come from L1 cache, then the improvement is close to 4x. So you should really try to do more work in a single pass over the data to get better SSE improvements.
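For readers following along, here is a minimal sketch of the kind of SoA dot product being discussed, processing four vertex pairs per iteration with SSE intrinsics. It is not the code from the codepad link; the `Vec3SoA` struct and `dot_soa` function are hypothetical names, and a scalar fallback is included for builds without SSE.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical structure-of-arrays view: each pointer addresses n floats.
struct Vec3SoA {
    const float* x;
    const float* y;
    const float* z;
};

// Writes out[i] = dot(a[i], b[i]) for i in [0, n). n must be a multiple of 4.
void dot_soa(const Vec3SoA& a, const Vec3SoA& b, float* out, std::size_t n) {
    assert(n % 4 == 0);
#if SKETCH_HAS_SSE
    for (std::size_t i = 0; i < n; i += 4) {
        // Unaligned loads and stores keep the sketch valid for any pointer.
        __m128 d = _mm_mul_ps(_mm_loadu_ps(a.x + i), _mm_loadu_ps(b.x + i));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_loadu_ps(a.y + i), _mm_loadu_ps(b.y + i)));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_loadu_ps(a.z + i), _mm_loadu_ps(b.z + i)));
        _mm_storeu_ps(out + i, d);  // four dot products at once
    }
#else
    // Scalar fallback for targets without SSE.
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
#endif
}
```

With four lanes per instruction the arithmetic can approach 4x the scalar rate, but memory traffic caps the gain once the arrays stop fitting in cache.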
___________________________________________
www.spinxengine.com - Spin-X Engine |
|
|
|
|
|
#7 |
|
New Member
Join Date: Dec 2007
Posts: 9
|
Jarkkol, I have been doing some benchmarking.
Here are the results. NV - Number of Vertices; NL - Number of Loops
NV = 1,000; NL = 100,000,000: CPU - 394 sec, SSE - 65 sec, Ratio: 6.1
NV = 10,000; NL = 10,000,000: CPU - 461 sec, SSE - 146 sec, Ratio: 3.2
NV = 100,000; NL = 1,000,000: CPU - 461 sec, SSE - 146 sec, Ratio: 1.04
Basically, I think I'm just testing the cache misses. But how does the math go? CPU-Z tells me that my L1 D-Cache is 32 KB, where each line is 64 bytes. I can squeeze in 32 * 1024 bytes / 4 bytes per float = 8192 floats. That would mean I can have 8192 floats / 3 floats per vertex = 2730 vertices in my L1 cache. Now, I have 2 float arrays, v_1 and v_2. Does it make sense to compute the dot product in batches of 1300 floats from each array? Is that a recipe for maximum performance? Thanks, Christian |
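The arithmetic in the post can be written down as compile-time constants. This is only a sketch: it assumes the CPU-Z figures quoted above and ignores the output array, the stack, and cache associativity, so the capacity that is usable in practice is lower. All constant names are made up for illustration.

```cpp
#include <cstddef>

// Figures quoted in the post (as reported by CPU-Z for that CPU).
constexpr std::size_t kL1Bytes       = 32 * 1024;  // L1 data cache size
constexpr std::size_t kLineBytes     = 64;         // cache-line size
constexpr std::size_t kFloatsPerVert = 3;          // x, y, z

static_assert(sizeof(float) == 4, "the post assumes 4-byte floats");

constexpr std::size_t kFloatsInL1    = kL1Bytes / sizeof(float);      // 8192
constexpr std::size_t kVertsInL1     = kFloatsInL1 / kFloatsPerVert;  // 2730
constexpr std::size_t kVertsPerArray = kVertsInL1 / 2;                // 1365, two input arrays
constexpr std::size_t kFloatsPerLine = kLineBytes / sizeof(float);    // 16
```

Under those assumptions about 1365 vertices from each of the two arrays fit at once, which is close to the batch size of 1300 proposed in the post if that figure is read as vertices rather than floats.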
|
|
|
|
|
#8 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
|
Quote:
In my experience it's not really worth trying to achieve absolute maximum performance. Change one small thing and it no longer performs optimally. So just make sure you have tightly packed SSE-friendly data structures and you access them linearly whenever possible. |
|
|
|
|
|
|
#9 | ||
|
Member
Join Date: Nov 2008
Location: Germany
Posts: 32
|
What you have written here is essentially a micro-benchmark; in real-world code, calculating a compile-time-constant number of dot products in a row is not very common. And even then, you should organize your data so that you don't need so many pointer casts, which produce aliasing (and aliasing massively gets in the way of compiler optimization).
Let the compiler decide whether to inline or not; inlined functions can pollute the instruction cache, even when not executed. Inlining or not is not relevant in your code, but it might be very relevant in real applications. Further: before diving into making everything SSE, find your bottlenecks first, and see what the compiler already does for you. Often, you'll find that the compiler generates code that is already unrolled, autovectorized and whatnot, and better than what you would have coded: Code:
Quote:
This vectorized and loop-unrolled code is not necessarily better than your version (I haven't benchmarked it), but people often underestimate what modern compilers are able to do, possibly enough to shift the bottlenecks to remote and unrelated points in your application, making the optimization of would-be bottlenecks a needless loss of time. edit: Little update after tweaking the above code from array-of-structs to struct-of-arrays: Quote:
I guess performance is now pretty much equivalent to your explicit SSE code (care to post some asm-dumps?).
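A sketch of the plain C++ style being argued for here: struct-of-arrays storage and a simple loop with no pointer casts, left for the compiler to vectorize. This is not the listing from the post; `Points` and `dot_all` are made-up names, and whether a particular compiler and flag set actually emits packed instructions has to be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical struct-of-arrays container; all three vectors share one length.
struct Points {
    std::vector<float> x, y, z;
};

// One dot product per element, written as an ordinary scalar loop.
void dot_all(const Points& a, const Points& b, std::vector<float>& out) {
    const std::size_t n = a.x.size();
    assert(a.y.size() == n && a.z.size() == n);
    assert(b.x.size() == n && b.y.size() == n && b.z.size() == n);
    out.resize(n);

    // Hoist the raw pointers so the loop body is a simple linear access pattern.
    const float* ax = a.x.data();
    const float* ay = a.y.data();
    const float* az = a.z.data();
    const float* bx = b.x.data();
    const float* by = b.y.data();
    const float* bz = b.z.data();
    float* o = out.data();

    for (std::size_t i = 0; i < n; ++i)
        o[i] = ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i];
}
```

Because every stream is read linearly and nothing is reinterpreted through casts, the loop gives an optimizer a straightforward candidate for vectorization.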
___________________________________________
[~/www/ — picogen.org]
Last edited by phresnel : 11-02-2009 at 09:54 AM. |
||
|
|
|
|
|
#10 |
|
New Member
Join Date: Dec 2007
Posts: 9
|
Hi phresnel, quick question: how can I get assembler output with VC++ 8?
In any event, thanks for your reply. Thanks, Christian |
|
|
|
|
|
#11 |
|
DevMaster Staff
Join Date: Sep 2005
Location: The Netherlands
Posts: 1,442
|
It's somewhere in the compiler options for each file, but what I usually do is just put a breakpoint on the line of code I want to view the assembly for, then run, and when it breaks do ctrl-F11 to go to the disassembly window.
___________________________________________
C++ addict - Currently working on: the 3D engine for Tomb Raider: Underworld and Deus Ex 3. |
|
|
|
|
|
#12 |
|
Valued Member
Join Date: Mar 2008
Location: Finland
Posts: 225
|
File (or project) properties => C/C++ => Output Files => Assembler Output => Assembly With Source Code (/FAs)
___________________________________________
www.spinxengine.com - Spin-X Engine |
|
|
|
|
|
#13 |
|
Member
Join Date: Nov 2008
Location: Germany
Posts: 32
|
Little update:
Code:
For size=1000, more=1000000: Code:
Running your explicit sse code, ported to g++, yields: Code:
To ensure that the outer loop is not a nop, I looked at the assemblies. Yours: Code:
Mine: Code:
g++ 4.4 wins this match, and all we had to do was align the data; everything else is john-doe-readable and standards-conforming C++. And when I look at the latest result, I think it is hard to code or generate relevantly better codepaths.
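The post credits data alignment for the result. One way to request 16-byte alignment in standard C++11 and later is `alignas`. This is a sketch with hypothetical names (`AlignedPoints`, `kCount`), not necessarily what was used in the post.

```cpp
#include <cstddef>

constexpr std::size_t kCount = 1024;  // hypothetical number of points

// Each array starts on a 16-byte boundary, the alignment that aligned
// 128-bit SSE loads and stores require.
struct AlignedPoints {
    alignas(16) float x[kCount];
    alignas(16) float y[kCount];
    alignas(16) float z[kCount];
};

// Compile-time checks that the layout is what the comments claim.
static_assert(alignof(AlignedPoints) == 16, "struct must be 16-byte aligned");
static_assert(offsetof(AlignedPoints, y) % 16 == 0, "y must start on a 16-byte boundary");
```

Objects of this type with static or automatic storage then get the requested alignment without any compiler-specific attributes.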
___________________________________________
[~/www/ — picogen.org]
|
|
|
|
|
|
#14 |
|
New Member
Join Date: Dec 2007
Posts: 9
|
Hi there, I have added my SSE calculation to your code ( http://codepad.org/nBhb3C23 ). It's slightly faster than your code. It could be made a lot faster without the cache misses. Restructuring the for loops would help a lot. I'll let you know how it goes.
Christian |
|
|
|
|
|
#15 | |
|
Member
Join Date: Nov 2008
Location: Germany
Posts: 32
|
Quote:
I don't see where there would be cache misses on a modern CPU. Then there is a bug: Code:
Further: have you looked at the assembly? What does MSVC produce for those simple dot products? Then: on g++, your code crashes. I'll try to fix it.
___________________________________________
[~/www/ — picogen.org]
|
|
|
|
|
|
|
#16 |
|
Member
Join Date: Nov 2008
Location: Germany
Posts: 32
|
There seems to be an issue w.r.t. alignment in g++-4.4 ((edit: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41950)), but I got it to compile again with small tweaks:
Code:
In one test, your code (+ tweaks by me) is faster (sse: 7.892, scalar: 7.922), in another, mine is faster (scalar: 7.922, sse: 7.953). The reason becomes clear when we look at the assemblies: Here is what g++-4.4 gives me for your (tweaked by me) code: Code:
For my john-doe-c++ I reap: Code:
Both are nearly identical, the only real difference is that your code runs in xmm0-xmm3, whereas mine runs in only xmm0 and xmm1. I don't know what MSVC produces, but it is clear that optimising such microbenchmark code with explicit sse is a pure waste of time with g++, because the latter does it all for you.
___________________________________________
[~/www/ — picogen.org]
Last edited by phresnel : 11-05-2009 at 07:56 AM. |
|
|
|
|
|
#17 |
|
Member
Join Date: Apr 2006
Location: Latvia
Posts: 72
|
MSVC produces bad code for such a loop. It doesn't use packed instructions; it just generates a bunch of movss, mulss, and addss instructions.
|
|
|
|
|
|
#18 | |
|
DevMaster Staff
Join Date: Sep 2005
Location: The Netherlands
Posts: 1,442
|
Quote:
___________________________________________
C++ addict - Currently working on: the 3D engine for Tomb Raider: Underworld and Deus Ex 3. |
|
|
|
|
|
|
#19 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,056
|
Quote:
So it's not that simple to get a cache miss out of a modern CPU. I suspect you might be running into bandwidth limitations, or your random test isn't fully random, or a significant part of the array still fits in L2 cache. |
|
|
|
|
|
|
#20 |
|
DevMaster Staff
Join Date: Sep 2005
Location: The Netherlands
Posts: 1,442
|
Well, I'm purposely wasting cycles so as not to be bandwidth limited, and my random access pattern basically involves unrolling the forward loop using a 1:16 divider and then taking a random permutation of statements. Also, the array I'm working on is 64MB. Does the prefetching involve larger prefetches than a single cache line?
Here's my code: Code:
___________________________________________
C++ addict - Currently working on: the 3D engine for Tomb Raider: Underworld and Deus Ex 3. Last edited by .oisyn : 11-06-2009 at 05:10 AM. |
|
|
|
|
|
#21 | |
|
DevMaster Staff
Join Date: Sep 2005
Location: The Netherlands
Posts: 1,442
|
Hmm, perhaps that random was not random enough. I enlarged it to 64 statements, and also unrolled the loops for forward and backward to make them more similar (since I figured that the whopping 2k code size of the loop itself might clobber the results as well)
I've posted it here Results: Quote:
.edit: Well duh, I don't know where I got the idea from that an SSE register was a single cache line, but obviously it is only a quarter of a cache line.
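To make the sizes concrete: a 128-bit SSE register holds 16 bytes, so four of them cover one 64-byte cache line. The sketch below also shows one way to compare a forward walk with a walk that visits line-sized blocks in shuffled order. It is an independent illustration with hypothetical names, not the benchmark linked above. Integers are used so that both sums are exactly equal, the 64-byte line size is an assumption, and the vector's storage is not guaranteed to start on a line boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

constexpr std::size_t kLineBytes    = 64;  // assumed cache-line size
constexpr std::size_t kRegBytes     = 16;  // one 128-bit SSE register
constexpr std::size_t kRegsPerLine  = kLineBytes / kRegBytes;              // 4
constexpr std::size_t kWordsPerLine = kLineBytes / sizeof(std::uint32_t);  // 16

// Sums the data front to back, the pattern hardware prefetchers handle best.
std::uint64_t sum_forward(const std::vector<std::uint32_t>& data) {
    std::uint64_t s = 0;
    for (std::uint32_t v : data) s += v;
    return s;
}

// Sums the same data, but visits the line-sized blocks in shuffled order.
std::uint64_t sum_shuffled_lines(const std::vector<std::uint32_t>& data, unsigned seed) {
    assert(data.size() % kWordsPerLine == 0);
    std::vector<std::size_t> order(data.size() / kWordsPerLine);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    std::uint64_t s = 0;
    for (std::size_t block : order) {
        const std::uint32_t* p = data.data() + block * kWordsPerLine;
        for (std::size_t i = 0; i < kWordsPerLine; ++i) s += p[i];
    }
    return s;
}
```

Timing each function with `std::chrono::steady_clock` on an array much larger than the last-level cache (the post used 64MB) is left out, since the numbers depend heavily on the machine.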
___________________________________________
C++ addict - Currently working on: the 3D engine for Tomb Raider: Underworld and Deus Ex 3. Last edited by .oisyn : 11-06-2009 at 07:38 AM. |
|
|
|
|