View Full Version : Faster sort.
Hawkwind
02-19-2008, 01:00 AM
I wrote some code a year or two back for sorting objects before render (particularly semi-transparent stuff).
I'm wondering if anybody has something faster than this ?
Cheers.
//------------------------------------------------------------------------------------
// Sort
// ----
// Binary radix sort - used for ordering partially transparent entities, etc.
//
// Sorts an array of CSortEntry.
//
// When adding entries to the array:
//
// iMaxKey |= key; // This is therefore an upper bound on
// // unsigned values.
// Initially: iMaxKey = 0;
//
//------------------------------------------------------------------------------------
#pragma warning( push )
#pragma warning( disable : 4731 ) // 4731 = frame pointer register 'ebp' modified by inline assembly code
struct CSortEntry //Used for sorting. Total of 32bits per entry
{
unsigned short key;
unsigned short idx;
};
void CGeneric::Sort( CSortEntry * pArray, int iCount, unsigned int iMaxKey)
{
if( iCount == 0 ) // Early out if required
return;
/*
unsigned int iMask = 1;
while( iMask < iMaxKey )
{
unsigned int Count1 = 0;
CSortEntry * inPtr = pArray;
CSortEntry * outPtr0 = inPtr; // Output 0 is always to the original array
CSortEntry * outPtr1 = CSortArrayTemp; // Output 1 is always to a temporary store
for( int i = 0; i < iCount; i++ )
{
if( (inPtr->key & iMask)==0)
{
*outPtr0++ = *inPtr++;
}
else
{
*outPtr1++ = *inPtr++;
Count1++;
}
};
// Weld the two parts together
memcpy( outPtr0, CSortArrayTemp, (Count1)*sizeof(CSortEntry) );
// Advance mask, to next bit
iMask += iMask;
};
*/
// Use local stack space vars for older compilers not recognising
// parameter vars in _asm{}
unsigned int maxMask = iMaxKey;
CSortEntry * inPtr = pArray;
CSortEntry * outPtr1 = CSortArrayTemp; // A temporary buffer at least as big as pArray.
unsigned int totalEntries = iCount;
_asm
{
pusha
push ebp
push maxMask
push totalEntries
push inPtr
push outPtr1
mov eax, 1 // eax holds the mask
jmp L0002
mov ebx, [esp + 0 ]
mov esi, [esp + 4 ]
L0003a:
mov ecx, [esp + 8 ]
mov edi, esi
lea esi, [esi + ecx*4]
neg ecx
//--------------------------------------------------------------------------
// Main loop -
// ---------
// branchless decision on destination buffer
//--------------------------------------------------------------------------
L0000:
mov ebp, [esi + ecx*4]
test ebp, eax
cmovne edx, ebx
cmove edx, edi
mov [edx], ebp
lea edx, [edx + 4]
cmovne ebx, edx
cmove edi, edx
inc ecx
jl L0000
//--------------------------------------------------------------------------
// End of Main Loop
//--------------------------------------------------------------------------
mov esi, [esp + 0] // Top of stack (push ebx)
sub ebx, esi // Total * 8
jz L0004 // if none then early exit - copy loop won't work for zero!
add esi, ebx
add edi, ebx
neg ebx
L0001:
mov ecx, [esi + ebx]
mov [edi + ebx], ecx
add ebx, 4 // Seperate from jl L001 so that there are no dependencies
jl L0001
L0004:
add eax, eax
L0002:
// Interlace the first two instructions at the destination loop label - this breaks up dependencies
// and is effective since the destination label is always taken except the final time
mov esi, [esp + 4 ]
cmp eax, [esp + 12] //maxMask
mov ebx, [esp + 0 ]
jle L0003a
add esp, 16 // Account for top 4 stack dwords
pop ebp
popa
};
};
#pragma warning( pop )
Reedbeta
02-19-2008, 09:14 AM
Please use the ... tags when posting code. I've added them for you this time.
Hawkwind
02-19-2008, 09:15 AM
Thanks. It looks clearer!
CheshireCat
02-19-2008, 09:23 AM
Might be faster apply temporal coherency and use insertion sort on the sorted list from previous frame. I doubt this is bottleneck in your rendering code though.
Hawkwind
02-19-2008, 09:56 AM
Cheshire Cat - interesting idea, problem is with a rapidly moving camera I don't know how it would work - how would you know whether something has potentially changed its order in the list ? possibly using bounding spheres ?
I'd be interested to know how this could be done if you've managed it....
CheshireCat
02-19-2008, 10:42 AM
Insertion sort will work regardless how fast the camera moves. It'll just become slower when the initial list becomes less sorted, but the order of transparent objects should be pretty much the same from previous frame. Another nice thing is that when there are relatively small updates to the order it's very cache friendly algorithm.
Nils Pipenbrinck
02-19-2008, 12:28 PM
Hi there.
First off: pusha and popa are 16 bit instructions and only save the lower 16 bit of your 32 bit registers. Use pushad and popad instead (and thank MSVC that it analyzed your code and saved the upper 16 bit for your).
To your soring example: Here is my code.. it is roughly 2.5 times faster than your code on my core2 duo (up to factor 4 for more than half a million items). Main difference is, that I do radix of 8 bits instead of 16 bits. My code expects ints, but since your element-size is 32 bits it will just work. My code sorts for the lower 16 bits unsigned (just like your code does it...)
#include <mmintrin.h>
void radix16_mmx (unsigned int amount, unsigned int * __restrict src, unsigned int * __restrict tmp)
{
unsigned int i;
__declspec(align(16)) unsigned int count[512];
__m64 a0, a1, zero;
__m64 * cnt64;
// fill frequency table with zeros:
cnt64 = (__m64 *) count;
zero = _mm_setzero_si64();
for (i=0; i<256; i+=8)
{
cnt64[i+0]=zero; cnt64[i+1]=zero; cnt64[i+2]=zero; cnt64[i+3]=zero;
cnt64[i+4]=zero; cnt64[i+5]=zero; cnt64[i+6]=zero; cnt64[i+7]=zero;
}
// build byte-frequency table:
for (i=0; i<amount; i++)
{
unsigned int s = src[i];
count[2*((s>> 0)&255)+0]++;
count[2*((s>> 8)&255)+1]++;
}
// turn frequency table into offset table:
// process four dwords per iteration
a0 = a1 = zero;
for (i=0; i<256; i++)
{
__m64 c0 = cnt64[i*2];
cnt64[i*2+0] = a0;
a0 = _m_paddd (a0, c0);
}
// sort from lowest byte to highest byte:
for (i=0; i<amount; i++) tmp[count[((src[i]>> 0)&0xff)*2+0]++] = src[i];
for (i=0; i<amount; i++) src[count[((tmp[i]>> 8)&0xff)*2+1]++] = tmp[i];
_m_empty();
}
Btw - I do use MMX for this function, but not for speed but for code-size. The code runs approximately at the same speed if you do it with ordinary integers.
Nils
Btw - I do use MMX for this function, but not for speed but for code-size. The code runs approximately at the same speed if you do it with ordinary integers.
Heh.. Nice to see somebody optimizing for code-size.. you never see that anymore (except for 4/64k intros maybe.. and embedded devices.. well maybe you do see it -- but still..) everybody always think about speed -- and often where it doesnt make any sense to do so.
Nils Pipenbrinck
02-19-2008, 02:30 PM
I'm an embedded programmer, but I don't care if my application needs a megabyte of code-size or not. Nowadays we don't need to squeeze everything into 64kb anymore (there are exceptions, but ...).
The point is: optimizing for size if you don't sacrifice speed for it *is* optimizing for speef. There exist something called the instruction cache, and it can make a big difference to the overall speed of an application.
The big problem with the i-cache is: you can't really measure the performance gain of small versus bloated code. But it makes a difference. Otherwise the instruction-cache won't be there :-)
I didnt really think a lot about the instruction cache eventhough I should have since I know its there. Very good points you bring up there Nils as usual :-)
Hawkwind
02-20-2008, 01:30 AM
Nils - Thanks for this. 2.5x is a great improvement, I'm just looking through it so I understand how it works....
Nils Pipenbrinck
02-20-2008, 02:17 AM
To be fair, I run your code with a iMaxKey value of 0xffff. If you have much less keys I wouldn't be surprised if your code wins performance wise.
I could do the same and write a special version that handles the cases where only the lower 8 bits are used as a key, but well - performance matters most when you have to process lots of data.
It would be interesting to see how good std::sort compares to your and my function.
Btw - your asm-code looks pretty good. I like the conditional move to get rid of the branches.
You might get some additional speedup if you make sure that your branch labels aren't as near to each other as they are now. The modern CPU's build a branch history based on the address of the branch. In the worst case you can have two unrelated branches messing up their branch-prediction because one branches likely and the other does not.
The optimization guides at agner.org have more details on this.
About the algorithm itself: It's just plain old radix-sort with a 8-bit radix. You can find the details here: http://www.cubic.org/docs/radix.htm I removed the work done for the upper 16 bits of the key. I restructured the byte-history table building by doing it once for all sort-passes in parallel. That increases the stack-usage by a kilobyte, but enables me to do the mmx/sse trick during the history to index conversion step.
Hawkwind
02-20-2008, 03:45 AM
I don't know much about MMX, I tried compiling your code in Studio2008 but I keep getting a runtime error of stack corruption around count.... do I need to 'enable' amd 'disable' MMX to stop floating point problems ? I read something somewhere about this....
Nils Pipenbrinck
02-20-2008, 01:23 PM
MMX shouldn't be the problem. There a re issues, but you only have to take care of them after you've done MMX stuff. I checked the code, and it should not overflow.
Anyway, here is the non-mmx version:
void radix16 (unsigned int amount, unsigned int * __restrict src, unsigned int * __restrict tmp)
{
unsigned int i;
unsigned int a0, a1;
unsigned int count[512];
// fill frequency table with zeros:
memset (count, 0, sizeof (count));
// build byte-frequency table:
for (i=0; i<amount; i++)
{
unsigned int s = src[i];
count[2*((s>> 0)&255)+0]++;
count[2*((s>> 8)&255)+1]++;
}
// turn frequency table into offset table:
// process four dwords per iteration
a0 = a1 = 0;
for (i=0; i<256; i++)
{
int c0 = count[i*2+0];
int c1 = count[i*2+1];
count [i*2+0] = a0;
count [i*2+1] = a1;
a0 += c0;
a1 += c1;
}
// sort from lowest byte to highest byte:
for (i=0; i<amount; i++) tmp[count[((src[i]>> 0)&0xff)*2+0]++] = src[i];
for (i=0; i<amount; i++) src[count[((tmp[i]>> 8)&0xff)*2+1]++] = tmp[i];
}
from numbers-of-elements > 1024 it's more than 5 times faster on my core2 btw...
SamuraiCrow
02-21-2008, 10:17 AM
Also, there's another sort algorithm known as Burst Sort that is in O(n) time. You just insert each element into a trie and then do a traversal of the trie. It's mainly useful for DNA sequencing where each element is constructed of 2-bit genes. The Burst Sort hogs memory like there's no tomorrow but it is a bit faster than the radix sort since there are only 2 passes.
Nils Pipenbrinck
02-21-2008, 03:15 PM
Burst Sort sounds interesting. Do you have a link to a website that gives some details on it? My google search didn't had any good results.
Anyway,
I benchmarked my and Hawkwinds code against the stl-sort.
Results are interesting, and I want to share them with you. To bad that I have misplaced the password for my webspace, so I can't show the cool statistics I've done.
I tested with several amounts from 512 to 65536 in steps of 1024. The performance scales almost linear with the element-count The O(n log n) complexity of the std sort does not even shows up! It nearly looks like a linear sort algorithm. (I checked that, it's merge-sort, so no O(n)).
Here are some numbers, calculated as an average of all test runs, and expressed as cycles per element (I measured using RDTSC):
radix8 sort: 34
radix1 sort: 171
std::sort: 220
clib qsort: 420
The lower ranges - where I measured less than 4000 elements - are interesting as well.
std::sort is faster than radix1 for less than 3000 and faster than radix8 for less than 64 elements.
I would have expected std::sort to perform better, and regarding the clib qsort function: That one is slow as hell. I will not use it anymore.
starstutter
02-21-2008, 07:59 PM
maybe this has been mentioned (I didn't read through all the post) but I hear the Radix sort is extremley fast.
Also, a word of advice, unless under special conditions, you only need to resort once a second, if that even. Unless your strapping the camera to a rocket, they just don't move fast enough to need a re-sort every frame.
Hawkwind
02-22-2008, 01:43 AM
Nils - I got the sort to compile. Thanks its cool. I must admit that I had
dismissed the idea of using a radix > 2 since I just assumed that the extra
space and copy complexity would make it slower. Guess it proves I should assume nothing. I ditched the built in quicksort long ago since it chugs.
SamuraiCrow - I'll google for Burst sort (not heard of this one). I'm currently
playing around on paper with the coherency-insertion idea and looking at methods of predicting 'potential' order changes in the list.
Cheers.
Mihail121
02-22-2008, 04:27 AM
Also, there's another sort algorithm known as Burst Sort that is in O(n) time. You just insert each element into a trie and then do a traversal of the trie. It's mainly useful for DNA sequencing where each element is constructed of 2-bit genes. The Burst Sort hogs memory like there's no tomorrow but it is a bit faster than the radix sort since there are only 2 passes.
Ehm, wouldn't consider that very effective, since the lower bound of sort algorithms has been found and O(n) is on a computer NOT POSSIBLE. I've heard of some hardware dependant algos, that come close to O(n), but I guess that's not very effective.
Reedbeta
02-22-2008, 08:44 AM
Mihail: the lower bound is O(n log n) only for *comparison* sorts - that is, sorts in which calling some less-than function is the only way to compare elements. You can truly do the sort in O(n) time when sorting integers. For instance, radix sort is also O(n). This may seem like dark magic (it did to me when I learned it), but basically it's just taking advantage that integers have bits inside them you can look at and that the number of bits is O(1).
Also, asymptotic time isn't everything. Someone suggested the insertion sort above, and that algorithm is O(n^2) in the worst case...but it could well be the fastest thing in the special case of an almost-sorted list. Also, you might already know that quicksort has O(n^2) worst case time (though O(n log n) average case time), yet it's used over heapsort and mergesort, which have O(n log n) worst case time, because quicksort's faster in terms of *real* time in many situations.
SamuraiCrow
02-22-2008, 10:59 AM
http://sourceforge.net/projects/burstsort/ is the link to it. Apparently there's a new version of it called the Copy Burstsort that maintains cache coherency.
-edit- I see that the example source is GPL licenced so maybe "rolling your own" version of it would be practical for closed-source projects.
@Mihail121
Reedbeta is right. The only thing limiting sorting algorithms to O(n log n) is comparisons. When inserting a string into a trie, you typecast the characters as ints and use them to index into the pointer array in the current node to reach the next node. Indexing is O(1) operation times the constant number of characters in a string is still O(1) to insert into a trie. Iterating over the trie to read back the data is O(n) operation so inserting n strings at constant time each is O(n) and retrieving is a O(n) operation so the net result is O(n+n*K) where K is the constant length of all the strings and therefore, reducing out the constants, is O(n).
vBulletin, Copyright ©2000-2010, Jelsoft Enterprises Ltd.