DevMaster.net Forums

Old 05-20-2006, 02:02 PM   #1
ogo
New Member
 
Join Date: May 2006
Posts: 5
Default Vertex/pixel shader instruction clock cycles

First of all, hi guys, my first post here. Not sure how active this board is, but I used to hang around at Flipcode many years ago (under a slightly different nickname, though), and I was told this is where many ol' Flipcoders moved to, so I thought I'd give it a try...

So, anyway, on to my question. I am just getting into vertex and pixel shader programming and I am looking for a reference that lists the execution speed of the individual assembler instructions. For example, one that says the "mov" instruction takes so-and-so many GPU clock cycles to execute while the "crs" instruction takes so many clock cycles, etc. If possible, for different GPUs.

None of the books, tutorials or online resources I have found so far contain this kind of information. As I said, I am just starting out with shaders and have not dealt with the detailed hardware architecture yet, so if I missed an important point (like, every instruction takes the same number of clock cycles to execute, or "1 instruction slot = 1 clock cycle", or something like that) then I would be glad if someone could point it out to me.

Thanks in advance
ogo
Old 05-20-2006, 02:43 PM   #2
Jare
Valued Member
 
Join Date: Oct 2005
Posts: 247
Default Re: Vertex/pixel shader instruction clock cycles

Quote:
Originally Posted by ogo
I am looking for a reference that lists the execution speed of the individual assembler instructions.
The APIs are moving to 100% HLSL and dropping shader assembly code completely, so you might want to think about other things. Precise timing in a shader is complex and mostly impossible, but yeah, I suppose as a rule of thumb you can think in terms of 1 instruction = 1 cycle and then add texture lookup overhead which is HIGHLY variable.
___________________________________________
http://www.iguanademos.com/Jare
Old 05-20-2006, 04:35 PM   #3
Reedbeta
DevMaster Staff
 
Join Date: Oct 2004
Location: Seattle, WA
Posts: 4,015
Default Re: Vertex/pixel shader instruction clock cycles

Not to mention that every piece of 3D hardware has its own machine language that it compiles shaders into. That language doesn't necessarily correspond to ARB assembly, which is more like the machine language for a virtualized device. So cards from different manufacturers, and different chipsets, will all have different cycle counts, latencies and such.
___________________________________________
Currently working at Sucker Punch
reedbeta.com - OpenGL demos and other projects
Luabridge - a lightweight, dependency-free C++/Lua binding library.
CD Lite - an unobtrusive, minimal CD player application for Windows.
Old 05-20-2006, 05:36 PM   #4
elengyel
Member
 
Join Date: Oct 2004
Location: Roseville, CA
Posts: 42
Default Re: Vertex/pixel shader instruction clock cycles

The details of the native instructions for each GPU and how long they take to execute are closely guarded secrets of the chip vendors. It's not impossible to write a program that will time fragment programs -- if you know the clock frequency of your GPU and how many pixel pipes it has, you can time how long it takes to fill the screen a few thousand times and then do a simple calculation to determine how long it took to process each pixel. (Use the EXT_timer_query extension if available for the best timing.)
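
As a rough sketch of the calculation described above: the function and all the numbers below are made up for illustration (a hypothetical 400 MHz GPU with 16 pixel pipes), and the elapsed time would come from your own measurement. It assumes the run is limited by fragment processing rather than by bandwidth or the CPU.

```python
def cycles_per_pixel(elapsed_seconds, clock_hz, pixel_pipes, width, height, passes):
    """Estimate the average GPU clock cycles one pipe spends on one pixel."""
    total_pixels = width * height * passes
    # Every pipe works on its own pixels in parallel, so the cycles
    # available to the whole chip are clock * pipes * elapsed time.
    total_pipe_cycles = elapsed_seconds * clock_hz * pixel_pipes
    return total_pipe_cycles / total_pixels


# Hypothetical measurement: a 1024x768 screen filled 1000 times in 0.98304 s.
estimate = cycles_per_pixel(0.98304, 400e6, 16, 1024, 768, 1000)
print(round(estimate, 3))  # about 8 cycles per pixel
```

Timing the same fill with a trivial fragment program and subtracting gives a feel for what the extra instructions cost.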

The NVShaderPerf tool will tell you how many cycles your fragment programs should take on various Nvidia GPUs, but it won't give you an instruction by instruction breakdown. (The native instructions on Nvidia hardware are almost identical to what you'd find in the NV_fragment_program* extension specs.) The fragment shading hardware is extraordinarily complex, and often many instructions are executed in a single cycle. For example, Nvidia hardware can execute up to six native instructions at once if they're the right ones (one TEX, one normalize, and two pairs of dual-issued arithmetic instructions). In addition to that, Nvidia and ATI hardware both have special operations that you get for free, such as multiplying or dividing by 2, 4, or 8.
Old 05-21-2006, 03:28 AM   #5
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,135
Default Re: Vertex/pixel shader instruction clock cycles

ogotay?!

On current hardware every instruction has a latency of 1 clock cycle if I'm not mistaken. But macro instructions can take several clock cycles. For example m4x3 is actually three dp4's (hence it takes 3 instruction slots).
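
To make the macro expansion concrete, here is a small Python sketch (not shader code) of m4x3 written as three dp4 operations, one per matrix row. The function names mirror the assembly mnemonics; the sample vector and matrix rows are arbitrary.

```python
def dp4(a, b):
    """4-component dot product: one instruction slot."""
    return sum(x * y for x, y in zip(a, b))


def m4x3(v, rows):
    """Macro: one dp4 per matrix row, so three rows cost three slots."""
    if len(rows) != 3:
        raise ValueError("m4x3 expects exactly three matrix rows")
    return tuple(dp4(v, row) for row in rows)


# Translate the point (1, 2, 3) by (5, 6, 7) using a 4x3 matrix.
v = (1.0, 2.0, 3.0, 1.0)
rows = [(1, 0, 0, 5), (0, 1, 0, 6), (0, 0, 1, 7)]
print(m4x3(v, rows))  # (6.0, 8.0, 10.0)
```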

But a lot also depends on the architecture. Modern graphics cards tend to be bandwidth limited but have impressive shader processing capabilities. The Radeon X1900 for example has 48 shader units, each with a vector ALU, scalar ALU, branching logic and possibly other execution units.

So all you can do is just limit the number of instructions and rely on the driver to optimize it. Also minimize texture accesses to minimize bandwidth.
Old 05-21-2006, 04:09 AM   #6
kusma
Valued Member
 
Join Date: Aug 2005
Posts: 162
Default Re: Vertex/pixel shader instruction clock cycles

Quote:
Originally Posted by Nick
On current hardware every instruction has a latency of 1 clock cycle if I'm not mistaken. But macro instructions can take several clock cycles. For example m4x3 is actually three dp4's (hence it takes 3 instruction slots).

uhm, the other way around, isn't it? they have a throughput of one per clock cycle; the latency can be really high, but it's quite easy to mask with the really long pipelines. basically you have a shitload of pixels working on different stages of the same instruction at the same time, spitting out one pixel every clock cycle.

edit: duh, you're of course speaking of cycles as seen by the programmer. my bad.

Last edited by kusma : 05-21-2006 at 04:12 AM.
Old 05-21-2006, 06:32 AM   #7
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,135
Default Re: Vertex/pixel shader instruction clock cycles

Quote:
Originally Posted by kusma
uhm, the other way around, isn't it? they have a throughput of one per clock cycle; the latency can be really high, but it's quite easy to mask with the really long pipelines. basically you have a shitload of pixels working on different stages of the same instruction at the same time, spitting out one pixel every clock cycle.
Latencies are also one clock cycle (for arithmetic instructions) as far as I know. GPUs are not like CPUs, where latency can be several clock cycles for a single integer multiplication. Their clock frequency is low enough to perform things like a dot product in one clock cycle.

And it definitely doesn't output one pixel every clock cycle (per pipeline). Long shaders can take many clock cycles between pixels. This is also why modern graphics cards have fewer ROP units (Raster OPeration - for alpha blending and such) than shader units. The maximum fillrate is only reached with trivial shaders.
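
A deliberately simplified model of that trade-off, ignoring bandwidth and texture fetches: the chip's pixel output per clock is capped either by the ROPs or by how fast the shader units finish pixels. The function name is made up, and the figure of 16 ROP units next to 48 shader units is an assumption for illustration.

```python
def pixels_per_clock(shader_units, rop_units, cycles_per_pixel):
    """Peak pixels written per clock under a simple two-stage model."""
    # Each shader unit finishes one pixel every `cycles_per_pixel` clocks.
    shader_rate = shader_units / cycles_per_pixel
    # The ROPs cannot write more than one pixel each per clock.
    return min(rop_units, shader_rate)


# Assumed configuration: 48 shader units feeding 16 ROPs.
for cycles in (1, 3, 12, 48):
    print(cycles, pixels_per_clock(48, 16, cycles))
```

In this model the ROPs are the limit only while the shader takes three cycles or fewer per pixel; anything longer leaves them idle part of the time, which is why adding more of them would not help.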
Old 05-21-2006, 07:52 AM   #8
ogo
New Member
 
Join Date: May 2006
Posts: 5
Default Re: Vertex/pixel shader instruction clock cycles

Hey, Nick, yeah, it's "Ogotay" here, didn't think anybody would remember this name. Nice to see you're still around, how ya doing man???

Thanks for all your responses guys, they helped me a good deal. However, I've got one follow-up question.

Quote:
Originally Posted by Jare
The APIs are moving to 100% HLSL and dropping shader assembly code completely, so you might want to think about other things.

What do you mean?? In the future I won't be able to upload shaders written in asm to the graphics hardware through APIs like DirectX, only shaders written in HLSL? Where did you get this info? I can hardly imagine this is going to happen. I see the benefits of writing shaders in a high-level language: it's easier and faster, and you don't have to deal with hardware-dependent low-level optimizations yourself (especially since there are already so many different graphics cards on the market); you let the API/whatever do it for you automatically before your shader is loaded and executed. But I can't imagine support for asm will be dropped any time soon, at least not as long as HLSL itself is compiled into asm. Execution speed is crucial for shaders and I don't see that changing in the near future, and I know I don't have to tell you that well-written asm code is at least as fast as, and usually faster than, automatically optimized code.

Last edited by ogo : 05-21-2006 at 07:55 AM.
Old 05-21-2006, 08:11 AM   #9
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,135
Default Re: Vertex/pixel shader instruction clock cycles

Quote:
Originally Posted by ogo
Hey, Nick, yeah, it's "Ogotay" here, didn't think anybody would remember this name. Nice to see you're still around, how ya doing man???
Great. I work for TransGaming now, on the SwiftShader project. So I'm still a software rendering nut.
Quote:
I see the benefits of writing shaders in a high-level language: it's easier and faster, and you don't have to deal with hardware-dependent low-level optimizations yourself (especially since there are already so many different graphics cards on the market); you let the API/whatever do it for you automatically before your shader is loaded and executed. But I can't imagine support for asm will be dropped any time soon, at least not as long as HLSL itself is compiled into asm. Execution speed is crucial for shaders and I don't see that changing in the near future, and I know I don't have to tell you that well-written asm code is at least as fast as, and usually faster than, automatically optimized code.
It's true. Direct3D 10 will put all focus on HLSL. The runtime compiles shaders directly into a form of binary code. So you can still 'precompile' them, but accessing the assembly will be harder. There's still an assembly-to-binary compiler, but it won't be as easy to use. Anyway, there's still a form of inline assembly in HLSL using intrinsics, so it's not like you'd lose all control. Besides, GPUs use a RISC-like instruction set (though 3D-specific), so it's hard to do any optimizations manually. The driver will absolutely do a good job at optimizing the code.
Old 05-21-2006, 08:32 AM   #10
ogo
New Member
 
Join Date: May 2006
Posts: 5
Default Re: Vertex/pixel shader instruction clock cycles

Quote:
Originally Posted by Nick
Great. I work for TransGaming now, on the SwiftShader project. So I'm still a software rendering nut.

I see, I see, nice. And people looked at you strangely when you started working on your software renderer back then. So what's next, a realtime raytracer?

Quote:
Originally Posted by Nick
It's true. Direct3D 10 will put all focus on HLSL (...)

So are you really advising me to drop asm and concentrate solely on HLSL and the like? Is this the future? Darn, the chance to code in assembly on actually practical projects was the main thing that got me back to programming, and now you guys say it's a waste of time?

You've got some articles or whatever where I could read more about these plans?
Old 05-21-2006, 09:56 AM   #11
kusma
Valued Member
 
Join Date: Aug 2005
Posts: 162
Default Re: Vertex/pixel shader instruction clock cycles

Quote:
Originally Posted by Nick
Latencies are also one clock cycle (for arithmetic instructions) as far as I know. GPUs are not like CPUs, where latency can be several clock cycles for a single integer multiplication. Their clock frequency is low enough to perform things like a dot product in one clock cycle.

...the thing is that the multi-threaded nature of pixel shaders makes the programming model practically insensitive to latency, as the waiting time can be used for other useful work (like other pixels). i know that at least some gpus exploit this and have each functional unit heavily pipelined (let's face it, the texture mapper HAS to be).

Quote:
Originally Posted by Nick
And it definitely doesn't output one pixel every clock cycle (per pipeline). Long shaders can take many clock cycles between pixels. This is also why modern graphics cards have fewer ROP units (Raster OPeration - for alpha blending and such) than shader units. The maximum fillrate is only reached with trivial shaders.

yeah. that's why i never claimed the -entire- pipeline squeezes out one pixel per clock, only each pipeline for each instruction.
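
The latency-versus-throughput distinction in this exchange can be put in numbers with a toy model. It assumes an idealized pipeline that accepts one new pixel per clock and never stalls; the function names and the latency of 10 cycles are made up for illustration.

```python
def cycles_unpipelined(pixels, latency):
    """Each pixel must finish before the next one starts."""
    return pixels * latency


def cycles_pipelined(pixels, latency):
    """First result after `latency` clocks, then one result per clock."""
    if pixels == 0:
        return 0
    return latency + (pixels - 1)


pixels, latency = 1000, 10
print(cycles_unpipelined(pixels, latency))         # 10000
print(cycles_pipelined(pixels, latency))           # 1009
print(cycles_pipelined(pixels, latency) / pixels)  # 1.009 cycles per pixel
```

With enough independent pixels in flight, the cost per pixel approaches one clock per instruction regardless of the latency, which is how both views in the thread can hold at once.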
Old 05-22-2006, 12:12 PM   #12
Jare
Valued Member
 
Join Date: Oct 2005
Posts: 247
Default Re: Vertex/pixel shader instruction clock cycles

Quote:
Originally Posted by ogo
You've got some articles or whatever where I could read more about these plans?
"To access the Direct3D 10 documentation, click the Start Menu, choose All Programs, Microsoft DirectX SDK (April 2006), and select 'Documentation for Direct3D 10'."

Obviously you'll have to download the DX9 April SDK: http://msdn.microsoft.com/directx/sdk/
___________________________________________
http://www.iguanademos.com/Jare
Old 05-22-2006, 01:57 PM   #13
ogo
New Member
 
Join Date: May 2006
Posts: 5
Default Re: Vertex/pixel shader instruction clock cycles

Quote:
Originally Posted by Jare
"To access the Direct3D 10 documentation, click the Start Menu, choose All Programs, Microsoft DirectX SDK (April 2006), and select 'Documentation for Direct3D 10'."

Obviously you'll have to download the DX9 April SDK: http://msdn.microsoft.com/directx/sdk/

Oh. I've actually had it lying around on my hard drive all the time. Thanks.
Old 05-22-2006, 03:18 PM   #14
moe
Valued Member
 
Join Date: Sep 2004
Posts: 226
Default Re: Vertex/pixel shader instruction clock cycles

ogo(tay):
Quote:
I don't have to tell you that well written asm code is at least as fast and usually faster than automatically optimized code.
Nick:
Quote:
The driver will absolutely do a good job at optimizing the code.
I have actually never heard or read of someone "hand optimising" the generated output. But that doesn't mean much.
So, if someone has a sample where the output generated from an HLSL file was optimised by hand, please let me know.
Old 05-22-2006, 03:33 PM   #15
Axel
Valued Member
 
Join Date: Sep 2005
Location: Germany
Posts: 119
Default Re: Vertex/pixel shader instruction clock cycles

The "assembly" you work with in current APIs is more like Java bytecode than real instructions. The GPU driver will recompile it to the hardware's native format anyway.

Don't waste your time with it and just use a high-level language.
Old 05-24-2006, 01:50 AM   #16
Hyde
New Member
 
Join Date: Feb 2006
Location: California
Posts: 22
Default Re: Vertex/pixel shader instruction clock cycles

I agree it is a waste of time playing with asm shaders. I also think it is sad, as the chance of writing some cool asm was part of my motivation for starting programming again after ten years of sobbing over the death of the Amiga.

However, after giving up on asm shaders, I still have one nifty shader that does not convert to HLSL. Whenever I try to implement it in HLSL, the compiled result will not fit (too many instructions), whereas the hand-written asm is very compact. I also believe that it would look rather unreadable in HLSL.

In the end, since the 'asm' code is not really native asm and just some kind of bytecode, I am really not that sad.

I recently read a paper which included some shader code (I think I downloaded it from nvidia's site). The code was written mostly in HLSL, but also included one of the shaders in asm. The reason was that the HLSL version compiled into too many instructions for the lower shader models.

Of course, these instruction limits are quickly disappearing as cards rapidly get more powerful, so the problem (or motivation for sticking to asm, if you like) should be considered temporary.
___________________________________________
0, 1/2, 2/3, 3/4, 4/5, ...
Old 05-24-2006, 05:53 AM   #17
ogo
New Member
 
Join Date: May 2006
Posts: 5
Default Re: Vertex/pixel shader instruction clock cycles

Ok, thanks guys, you all definitely helped me make up my mind. So, yeah, gonna drop asm, but I'm not too sad about it either. It would be cool to have the opportunity to code in asm after so many years, but I made my peace years ago with it becoming more and more pointless. Seeing how HLSL is something you can pick up very quickly, letting you concentrate on the actual algorithms instead of "byte shifting", I think I will go this route. After all, in computer graphics it's all about the result, not how you got there, and if HLSL delivers the same speed and functionality with less work than asm, then why not use it.

Last edited by ogo : 05-24-2006 at 05:57 AM.
Old 05-25-2006, 05:34 PM   #18
wolf
New Member
 
Join Date: Oct 2005
Posts: 11
Default Re: Vertex/pixel shader instruction clock cycles

There are articles in ShaderX3 and ShaderX4 that give you an idea of how fast single instructions are. It depends heavily on a number of things, and it might be very different for the two main graphics vendors...
There was never a good reason to believe that the cycle numbers in the documentation have anything to do with real-world numbers :-)
All times are GMT -7.

