Vertex shader slower than fixed function pipeline?


lynedavid
03-28-2006, 01:08 PM
Hi,

I've read that shaders are just as fast as the fixed-function pipeline, but I decided to run my own tests to be sure.

I found some interesting results. In short, fragment shaders are at least as fast as the fixed-function pipeline, and if anything a little faster.

However, vertex shaders seem significantly slower.

I tested with the following code. I used a display list to take data-transfer bottlenecks out of the equation.


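#define ISURFACES 6000

// DL, xg, basetex, glslshader and getTime() are defined elsewhere in the program.
glFinish();                 // drain pending GL work before starting the timer
double t1 = getTime();

glslshader->begin();        // bind the GLSL program

static int first=1;         // nonzero until the display list has been compiled
glBindTexture ( GL_TEXTURE_2D, basetex );
for (int i=0;i<ISURFACES;i++)
{
    float quadwidth = 20.0;
    float tx=0.0;
    float ty=0.0;
    float h = -10.0;
    glPushMatrix();
    glTranslatef(xg,0.0,0.0);

    if (first)
    {
        // First pass only: record a 5x5 grid of textured quads.
        // GL_COMPILE records the commands without drawing them.
        glNewList(DL,GL_COMPILE);
        glPushMatrix();
        glBegin(GL_QUADS);
        for (float x = -50.0 ; x < 50.0 ; x+= quadwidth)
        {
            tx=0.0;
            for (float y = -50.0 ; y < 50.0 ; y+= quadwidth)
            {
                glTexCoord2f(tx,ty);
                glVertex3f(x,h,y);
                glTexCoord2f(tx,ty+0.2);
                glVertex3f(x,h,y+quadwidth);
                glTexCoord2f(tx+0.2,ty+0.2);
                glVertex3f(x+quadwidth,h,y+quadwidth);
                glTexCoord2f(tx+0.2,ty);
                glVertex3f(x+quadwidth,h,y);
                tx+=0.2;
            }
            ty+=0.2;
        }
        glEnd();
        glPopMatrix();
        glEndList();
        first=0;
    }
    else
    {
        glCallList(DL);     // every later pass replays the compiled list
    }
    glPopMatrix();
    xg+=100.0;              // place the next surface further along +x
}

glslshader->end();
glFinish();                 // wait for the GPU to finish before stopping the timer
fprintf(stderr,"Time elapsed=%0.2f\n",(getTime() - t1) * 1000.0);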



On my GF 6800 Ultra, the 6000 primitives above took 4 milliseconds to render with the fixed-function pipeline.

With a vertex shader enabled, on the other hand, rendering took over 10 milliseconds!

The GLSL vertex shader looks like this (pretty simple):

void main()
{
    gl_TexCoord[0] = gl_MultiTexCoord0;
    gl_Position = ftransform();
}

I have also tried Cg and ARB assembly shaders, all with the same results.

Is there an optimization I am missing here?

Alex
03-29-2006, 03:04 AM
Well, it certainly depends on the shader and on whether your hardware implements the FF pipeline in dedicated hardware or emulates it using shaders.

If there is a hardware implementation of the FF pipe and your shader has lots of (possibly expensive) instructions, then the FF pipe will execute more quickly.

Finally, if you measure these things by wrapping the code in timeGetTime() or similar, you will mostly measure (OpenGL) API overhead.
Also be careful when you actually want to measure CPU time spent. Some of the timers have resolutions way above 1 ms.

Your gfx hardware runs in parallel with your code. You'd need a special tool to see how much time the gfx processor spends (NVperf or winpix or something). Alternatively, if you're 100% sure you're not CPU bound, you can look at the change in fps to get a rough idea.

So to get useful data:

- preload all data (textures, shaders, vertex buffers, index buffers etc.) into video RAM
- use lots of data (but not too much) so you can actually measure the time deltas
- preset all states and minimize all API calls (they tend to spend lots of time in drivers, and setting states isn't that fast and interferes with the hw)
- make sure you max out the gfx card at the resource you're interested in
(that'd be the TnL path, so don't use mad texture sizes, pixel shader stuff etc.)
- run an appropriate perf tool

Alex
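
The points about timer resolution and noisy measurements can be sketched in a few lines of C++. This is only an illustration, not code from the thread: `clock_tick_ms`, `median` and `median_time_ms` are hypothetical helper names, and `std::chrono::steady_clock` stands in for whatever timer the application actually uses. The idea is to check how coarse the clock is first, then time the same work several times and keep the median rather than trusting a single sample.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <utility>
#include <vector>

// Smallest step the clock is observed to make, in milliseconds.
// If this is close to the times being measured, the timer is too coarse.
inline double clock_tick_ms() {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto next = clock::now();
    while (next == start) next = clock::now();  // spin until the clock advances
    return std::chrono::duration<double, std::milli>(next - start).count();
}

// Median of a non-empty sample set (the upper middle element for even counts).
inline double median(std::vector<double> samples) {
    assert(!samples.empty());
    const auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}

// Runs `work` `runs` times and returns the median wall-clock time in ms.
// In a GL benchmark, `work` would issue the draw calls and end with
// glFinish() so that the GPU has finished before the timer stops.
template <class Work>
double median_time_ms(Work&& work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    std::vector<double> samples;
    for (int i = 0; i < runs; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return median(std::move(samples));
}
```

Note that this still measures CPU-side wall-clock time around the calls. It reduces noise, but it does not separate driver overhead from GPU time.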

lynedavid
03-29-2006, 09:52 AM
> Well, it certainly depends on the shader and on whether your hardware implements the FF pipeline in dedicated hardware or emulates it using shaders.

I have heard that the NV40/G70 does not have a special FF pipeline implementation. Isn't dedicated FF hardware only found on chips older than NV35?

> If there is a hardware implementation of the FF pipe and your shader has lots of (possibly expensive) instructions, then the FF pipe will execute more quickly.

The shader source above only transforms the vertices by the modelview and projection matrices. It can't get much simpler than that, can it?

> Finally, if you measure these things by wrapping the code in timeGetTime() or similar, you will mostly measure (OpenGL) API overhead.

I am calling glFinish after the above, before the end time is taken, to let the graphics card finish its rendering, so I don't see why this is inaccurate.

Also, if I am measuring API overhead, then my question becomes: why do vertex shaders require so much more API overhead?

> Also be careful when you actually want to measure CPU time spent. Some of the timers have resolutions way above 1 ms.

I am using the QueryPerformanceCounter high-resolution timing function.

> Your gfx hardware runs in parallel with your code. You'd need a special tool to see how much time the gfx processor spends (NVperf or winpix or something). Alternatively, if you're 100% sure you're not CPU bound, you can look at the change in fps to get a rough idea.

I have also measured fps. The results are as follows:
With vertex shader: 59fps
Without vertex shader: 170fps

> - preload all data (textures, shaders, vertex buffers, index buffers etc.) into video RAM

As you can see in the above example, this is the case. All geometry is in a display list, and the shader and textures are loaded and bound before the loop actually starts.

> - use lots of data (but not too much) so you can actually measure the time deltas

The example shown renders 300000 polygons per frame. Is that sufficient?

> - preset all states and minimize all API calls (they tend to spend lots of time in drivers, and setting states isn't that fast and interferes with the hw)

The only API calls I am using in the above test are a translate, a push and a pop matrix per display list.

> - make sure you max out the gfx card at the resource you're interested in
> (that'd be the TnL path, so don't use mad texture sizes, pixel shader stuff etc.)

The above test renders 300000 polygons per frame and uses display lists,
which should rule out bandwidth issues.

One 512x512 texture is applied to the entire test, so there is not much texture binding there.

Most of the geometry is located off-screen (and I have a massive far clipping plane) so the test will not become fill-limited.
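
The figures quoted in this post can be cross-checked from the loop bounds in the original listing. The sketch below is not from the thread and rests on two labelled assumptions: each GL_QUADS quad is counted as two triangles, and each quad contributes four vertices exactly as submitted through glVertex3f (no sharing between neighbouring quads). The names `loop_steps`, `frame_stats`, `frame_ms` and `vertices_per_second` are made up for illustration.

```cpp
#include <cassert>

// Counts the iterations of `for (float v = lo; v < hi; v += step)`,
// mirroring the x and y loops in the original listing.
inline int loop_steps(float lo, float hi, float step) {
    assert(step > 0.0f);
    int n = 0;
    for (float v = lo; v < hi; v += step) ++n;
    return n;
}

struct FrameStats {
    long quads;      // quads drawn per frame
    long triangles;  // assumption: two triangles per quad
    long vertices;   // assumption: four submitted vertices per quad
};

// Per-frame geometry for `lists` display-list calls, each replaying the
// grid built by the nested loops from -50 to 50 in steps of 20.
inline FrameStats frame_stats(int lists) {
    const int per_axis = loop_steps(-50.0f, 50.0f, 20.0f);
    const long quads = static_cast<long>(lists) * per_axis * per_axis;
    return {quads, quads * 2, quads * 4};
}

// Frame time in milliseconds for a given frame rate.
inline double frame_ms(double fps) { return 1000.0 / fps; }

// Vertex throughput implied by a frame rate.
inline double vertices_per_second(long vertices_per_frame, double fps) {
    return static_cast<double>(vertices_per_frame) * fps;
}
```

Under those assumptions, 6000 lists of 5 x 5 quads give 150000 quads, which is 300000 triangles and 600000 vertices per frame. The reported 59 fps and 170 fps correspond to roughly 16.9 ms and 5.9 ms per frame, or about 35 million versus 102 million vertices per second.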

SigKILL
03-29-2006, 01:11 PM
> Most of the geometry is located off-screen (and I have a massive far clipping plane) so the test will not become fill-limited.

A wild guess, but do the numbers change when all of the geometry is visible? It might just be that the cards are using some early-out culling in fixed function. A few years ago at least, there was some keyword you could use to simply pass through vertex transformations that was faster than doing the transformation in a VP. Maybe you should try setting this if it is still in...

-si

lynedavid
03-29-2006, 04:15 PM
> A wild guess, but do the numbers change when all of the geometry is visible? It might just be that the cards are using some early-out culling in fixed function. A few years ago at least, there was some keyword you could use to simply pass through vertex transformations that was faster than doing the transformation in a VP. Maybe you should try setting this if it is still in...
>
> -si

An interesting guess.

But I've just tried this, and no, it makes no difference to the results.

Blaxill
03-29-2006, 05:24 PM
Have you tried it without this line?

gl_TexCoord[0] = gl_MultiTexCoord0;

That is, have you tried just using

void main()
{
    gl_Position = ftransform();
}

lynedavid
03-30-2006, 10:19 AM
> Have you tried it without this line?
>
> gl_TexCoord[0] = gl_MultiTexCoord0;
>
> That is, have you tried just using
>
> void main()
> {
>     gl_Position = ftransform();
> }

No noticeable change (apart from losing the textures) after removing that line.

Alex
03-30-2006, 04:04 PM
Hm, you're right. I didn't spot the glFinish() command. Still, the API/driver overhead of binding the shader might be as much as or more than the actual rendering.

Did you try timing only the rendering (with the vshader bind outside the timed section)?

Does a compiled list guarantee that your vertex data is pulled from local video RAM rather than from anywhere else?

Alex

elengyel
03-30-2006, 09:08 PM
Copying my response from GameDev.net:

On a GeForce 6800/7800, your vertex shader generates the following *native* code for the GPU:

401F9C6C 01CD400D 8106C0C3 60411F80 DP4 o[HPOS].x, v[OPOS], c[212];
401F9C6C 01CD500D 8106C0C3 60409F80 DP4 o[HPOS].y, v[OPOS], c[213];
401F9C6C 01CD600D 8106C0C3 60405F80 DP4 o[HPOS].z, v[OPOS], c[214];
401F9C6C 01CD700D 8106C0C3 60403F80 DP4 o[HPOS].w, v[OPOS], c[215];
401F9C6C 00400808 0106C083 60419F9D MOV o[TEX0].xy, v[TEX0].xyxx;

(The hexcodes are the 128-bit instruction words.) The "fixed-function" path produces exactly the same code, but with different constant register indexes.

BTW, in case you're curious, your vertex shader produces the following native code on Radeon X800/X1800.

00100201 00D10002 00D10001 00D10005 DP4 o[0].x, c[0], v[0];
00200201 00D10022 00D10001 00D10005 DP4 o[0].y, c[1], v[0];
00400201 00D10042 00D10001 00D10005 DP4 o[0].z, c[2], v[0];
00800201 00D10062 00D10001 00D10005 DP4 o[0].w, c[3], v[0];
00F02203 01648000 01248000 01248005 MOV o[1], R0.0001;
00304203 00D10041 01248041 01248045 MOV o[2].xy, v[2];

The ATI driver inserts an extra instruction to move (0,0,0,1) into the primary color interpolant, but otherwise, it's the same native instruction sequence that Nvidia hardware uses.
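
For readers less used to assembly listings, here is a CPU-side sketch of what those instructions compute: four 4-component dot products for the position and one copy for the texture coordinate. It is an illustration only. It assumes the four constant registers hold the rows of the combined modelview-projection matrix, which the listings above do not state, and the names `Vec4`, `Mat4`, `dp4` and `run_vertex` are made up.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;  // four rows, one per constant register

// DP4: 4-component dot product.
inline float dp4(const Vec4& a, const Vec4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

struct VertexOut {
    Vec4 hpos;                  // clip-space position (o[HPOS])
    std::array<float, 2> tex0;  // texture coordinate (o[TEX0].xy)
};

// One DP4 per output component, then a MOV for the texture coordinate.
// mvp_rows is assumed to hold the rows of the modelview-projection matrix.
inline VertexOut run_vertex(const Mat4& mvp_rows, const Vec4& opos,
                            const std::array<float, 2>& tex0) {
    VertexOut out{};
    for (int i = 0; i < 4; ++i) {
        out.hpos[i] = dp4(opos, mvp_rows[i]);
    }
    out.tex0 = tex0;
    return out;
}
```

With an identity matrix the position passes through unchanged, and a row of (1, 0, 0, 100) shifts x by 100 for any vertex with w = 1, which matches the glTranslatef step between surfaces in the benchmark.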

-------
If you want to do very accurate timing on the GPU, you can use the EXT_timer_query extension. It defines the following enums and functions.

#define GL_TIME_ELAPSED_EXT 0x88BF

typedef __int64 GLint64EXT;
typedef unsigned __int64 GLuint64EXT;

void glGetQueryObjecti64vEXT(GLuint id, GLenum pname, GLint64EXT *params);
void glGetQueryObjectui64vEXT(GLuint id, GLenum pname, GLuint64EXT *params);

Use the glBeginQuery/glEndQuery mechanism with the GL_TIME_ELAPSED_EXT target to specify a timing interval. A call to glGetQueryObjectui64vEXT with <pname> GL_QUERY_RESULT returns the elapsed time in nanoseconds.
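
The call sequence described above can be sketched as follows. To keep the example runnable without a GL context, the GL entry points are reached through a template parameter: `FakeGL` is a hypothetical stand-in whose members mirror glGenQueries, glBeginQuery, glEndQuery, glGetQueryObjectui64vEXT and glDeleteQueries, and the value it returns is made up, not a measurement. In a real program those members would forward to the driver's functions, and `gpu_elapsed_ns` is an illustrative name rather than part of the extension.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kTimeElapsedExt = 0x88BF;  // GL_TIME_ELAPSED_EXT
constexpr unsigned kQueryResult = 0x8866;     // GL_QUERY_RESULT

// Hypothetical stand-in for the GL entry points (no real GPU involved).
struct FakeGL {
    int begins = 0;
    int ends = 0;
    std::uint64_t fake_ns = 4000000;  // made-up result for demonstration only

    void GenQueries(int n, unsigned* ids) {
        for (int i = 0; i < n; ++i) ids[i] = static_cast<unsigned>(i + 1);
    }
    void DeleteQueries(int, const unsigned*) {}
    void BeginQuery(unsigned target, unsigned) {
        assert(target == kTimeElapsedExt);
        ++begins;
    }
    void EndQuery(unsigned target) {
        assert(target == kTimeElapsedExt);
        ++ends;
    }
    void GetQueryObjectui64vEXT(unsigned, unsigned pname, std::uint64_t* out) {
        assert(pname == kQueryResult);
        *out = fake_ns;
    }
};

// Brackets `draw` with a GL_TIME_ELAPSED_EXT query and returns the elapsed
// GPU time in nanoseconds, as reported for GL_QUERY_RESULT.
template <class GL, class Draw>
std::uint64_t gpu_elapsed_ns(GL& gl, Draw&& draw) {
    unsigned id = 0;
    gl.GenQueries(1, &id);
    gl.BeginQuery(kTimeElapsedExt, id);
    draw();  // e.g. the glCallList loop from the benchmark
    gl.EndQuery(kTimeElapsedExt);
    std::uint64_t ns = 0;
    gl.GetQueryObjectui64vEXT(id, kQueryResult, &ns);
    gl.DeleteQueries(1, &id);
    return ns;
}
```

Because the query is evaluated on the GPU, the result should exclude the CPU-side driver overhead that a glFinish-based wall-clock measurement includes.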

Axel
03-31-2006, 12:12 AM
I'm very curious how you get that info. I always thought that the actual GPU instruction sets are a secret of the IHVs.

Reedbeta
03-31-2006, 12:30 AM
Yeah - how'd you get it?

elengyel
03-31-2006, 12:33 AM
> I'm very curious how you get that info. I always thought that the actual GPU instruction sets are a secret of the IHVs.

You'd be surprised by what you can figure out just by watching what the driver writes to memory...

lynedavid
03-31-2006, 03:39 AM
> Hm, you're right. I didn't spot the glFinish() command. Still, the API/driver overhead of binding the shader might be as much as or more than the actual rendering.
>
> Did you try timing only the rendering (with the vshader bind outside the timed section)?
>
> Does a compiled list guarantee that your vertex data is pulled from local video RAM rather than from anywhere else?
>
> Alex

I have timed it without the binding of the shader, and it does not change things.

As for the display list, I'm really just assuming that is the case. It's a great deal slower in direct rendering mode.

Reedbeta
03-31-2006, 11:34 AM
Have you tried using vertex buffers instead? Also, have you tried adding a simple fragment program when you test with shaders, just to ensure that the slowdown isn't due to some trouble interfacing the vertex shader with fixed-functionality fragment processing? (It shouldn't be, but you never know.)