|
|
#1 |
|
New Member
Join Date: May 2006
Posts: 5
|
First of all, hi guys, my first post here. Not sure how active this board is, but I used to hang around at Flipcode many years ago (under a slightly different nickname, though), and I was told this is where many ol' Flipcoders moved to, so I thought I'd give it a try...
So, anyway, on to my question. I am just getting into vertex and pixel shader programming and I am looking for a reference that lists the execution speed of the individual assembler instructions. For example, one that says the "mov" instruction takes so-and-so many GPU clock cycles to execute while the "crs" instruction takes so many clock cycles, etc. If possible, for different GPUs.
None of the books, tutorials or online resources I have found so far contain this kind of information. As I said, I am just starting out with shaders and have not dealt with their detailed hardware architecture yet, so if I missed an important point (like every instruction taking the same number of clock cycles to execute, or "1 instruction slot = 1 clock cycle", or something like that) then I would be glad if someone could point it out to me.
Thanks in advance
ogo |
|
|
|
|
|
#2 | |
|
Valued Member
Join Date: Oct 2005
Posts: 247
|
Quote:
___________________________________________
http://www.iguanademos.com/Jare |
|
|
|
|
|
|
#3 |
|
DevMaster Staff
Join Date: Oct 2004
Location: Seattle, WA
Posts: 4,015
|
Not to mention that every piece of 3D hardware has its own machine language that it compiles shaders into. This doesn't necessarily correspond to ARB assembly, which is more like the machine language for a virtualized device. So cards from different manufacturers, and different chipsets, will all have different cycle counts, latencies and such.
___________________________________________
Currently working at Sucker Punch reedbeta.com - OpenGL demos and other projects Luabridge - a lightweight, dependency-free C++/Lua binding library. CD Lite - an unobtrusive, minimal CD player application for Windows. |
|
|
|
|
|
#4 |
|
Member
Join Date: Oct 2004
Location: Roseville, CA
Posts: 42
|
The details of the native instructions for each GPU and how long they take to execute are closely guarded secrets of the chip vendors. It's not impossible to write a program that will time fragment programs -- if you know the clock frequency of your GPU and how many pixel pipes it has, you can time how long it takes to fill the screen a few thousand times and then do a simple calculation to determine how long it took to process each pixel. (Use the EXT_timer_query extension if available for the best timing.)
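The timing approach described above boils down to simple arithmetic. A minimal sketch in Python, assuming every pixel pipe stays busy for the whole measurement (the function name `cycles_per_pixel` and all the numbers are hypothetical, not measurements from any real GPU):

```python
def cycles_per_pixel(elapsed_s, clock_hz, pixel_pipes, width, height, frames):
    """Estimate the average number of GPU clock cycles one pipe spends per pixel.

    Assumption: all pipes are fully occupied by the fragment program for the
    whole measured interval, so overhead such as state changes is ignored.
    """
    total_pixels = width * height * frames
    # Every pipe ticks clock_hz times per second, and the pipes run in parallel.
    total_pipe_cycles = elapsed_s * clock_hz * pixel_pipes
    return total_pipe_cycles / total_pixels


# Hypothetical measurement: 2000 full-screen fills at 1024x768 took 2 seconds
# on an imaginary 400 MHz part with 16 pixel pipes.
estimate = cycles_per_pixel(
    elapsed_s=2.0, clock_hz=400e6, pixel_pipes=16,
    width=1024, height=768, frames=2000,
)
print(f"about {estimate:.1f} cycles per pixel")
```

The result is only an average over the whole fragment program. It says nothing about the cost of any single instruction, which is why it cannot replace a per-instruction table.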
The NVShaderPerf tool will tell you how many cycles your fragment programs should take on various Nvidia GPUs, but it won't give you an instruction-by-instruction breakdown. (The native instructions on Nvidia hardware are almost identical to what you'd find in the NV_fragment_program* extension specs.) The fragment shading hardware is extraordinarily complex, and often many instructions are executed in a single cycle. For example, Nvidia hardware can execute up to six native instructions at once if they're the right ones (one TEX, one normalize, and two pairs of dual-issued arithmetic instructions). In addition to that, Nvidia and ATI hardware both have special operations that you get for free, such as multiplying or dividing by 2, 4, or 8. |
|
|
|
|
|
#5 |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,135
|
ogotay?!
On current hardware every instruction has a latency of 1 clock cycle if I'm not mistaken. But macro instructions can take several clock cycles. For example m4x3 is actually three dp4s (hence it takes 3 instruction slots). But a lot also depends on the architecture. Modern graphics cards tend to be bandwidth limited but have impressive shader processing capabilities. The Radeon X1900 for example has 48 shader units, each with a vector ALU, scalar ALU, branching logic and possibly other execution units. So all you can do is just limit the number of instructions and rely on the driver to optimize it. Also minimize texture accesses to minimize bandwidth. |
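To illustrate why a macro instruction like m4x3 occupies three slots, here is a small Python sketch that emulates it as three 4-component dot products. The functions `dp4` and `m4x3` are illustrative stand-ins named after the asm instructions, not part of any real API, and the matrix is made up:

```python
def dp4(a, b):
    """4-component dot product, standing in for one dp4 instruction slot."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def m4x3(v, rows):
    """Multiply a 4-component vector by a 4x3 matrix given as three rows.

    Each output component costs one dp4, so the macro expands to three
    instruction slots.
    """
    return [dp4(v, rows[0]), dp4(v, rows[1]), dp4(v, rows[2])]


# Hypothetical example: translate the point (1, 2, 3) by (10, 20, 30).
point = [1.0, 2.0, 3.0, 1.0]
translate = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 30.0],
]
print(m4x3(point, translate))
```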
|
|
|
|
|
#6 | |
|
Valued Member
Join Date: Aug 2005
Posts: 162
|
Quote:
uhm, the other way around, isn't it? they have a throughput of one instruction per clock cycle; the latency can be really high, but it's quite easy to mask away with the really long pipelines. basically you have a shitload of pixels working on different stages of the same instruction at the same time, spitting out one pixel every clock cycle.
edit: duh, you're of course speaking of cycles as seen by the programmer. my bad.
Last edited by kusma : 05-21-2006 at 04:12 AM. |
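The difference between latency and throughput can be made concrete with a toy model. This Python sketch assumes an idealized, fully pipelined unit that accepts one new pixel per cycle and never stalls; the function names and the 20-cycle latency are invented for illustration:

```python
def pipelined_cycles(num_pixels, latency_cycles):
    """Cycles to finish num_pixels in a unit that starts one pixel per cycle.

    The first result appears after latency_cycles; every further pixel
    completes one cycle after the previous one.
    """
    if num_pixels == 0:
        return 0
    return latency_cycles + (num_pixels - 1)


def unpipelined_cycles(num_pixels, latency_cycles):
    """Cycles needed if each pixel had to finish before the next could start."""
    return num_pixels * latency_cycles


# Hypothetical instruction with a 20-cycle latency applied to 1000 pixels.
print(pipelined_cycles(1000, 20))    # latency paid once, then one pixel per cycle
print(unpipelined_cycles(1000, 20))  # latency paid for every pixel
```

With enough pixels in flight, the pipelined cost per pixel approaches one cycle regardless of the latency, which is the masking effect described in the post.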
|
|
|
|
|
|
#7 | |
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,135
|
Quote:
And it definitely doesn't output one pixel every clock cycle (per pipeline). Long shaders can take many clock cycles between pixels. This is also why modern graphics cards have fewer ROP units (Raster OPeration units, for alpha blending and such) than shader units. The maximum fillrate is only reached with trivial shaders. |
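A rough Python sketch of that trade-off, under strong simplifying assumptions: each ROP writes at most one pixel per clock, each shader unit needs a fixed number of cycles per pixel, and nothing else (bandwidth, texture fetches) limits the rate. The function name and the unit counts are illustrative only:

```python
def pixels_per_second(clock_hz, shader_units, rop_units, shader_cycles_per_pixel):
    """Estimate fillrate as the slower of the shader stage and the ROP stage."""
    # Assumption: every shader unit spends shader_cycles_per_pixel on each pixel.
    shader_rate = clock_hz * shader_units / shader_cycles_per_pixel
    # Assumption: every ROP can retire one pixel per clock.
    rop_rate = clock_hz * rop_units
    return min(shader_rate, rop_rate)


# Imaginary 600 MHz chip with 48 shader units and 16 ROPs.
trivial = pixels_per_second(600e6, 48, 16, shader_cycles_per_pixel=1)
heavy = pixels_per_second(600e6, 48, 16, shader_cycles_per_pixel=30)
print(f"trivial shader: {trivial / 1e9:.2f} Gpixels/s (ROP bound)")
print(f"30-cycle shader: {heavy / 1e9:.2f} Gpixels/s (shader bound)")
```

In this model the ROPs only become the bottleneck when the shader is very short, so having more shader units than ROPs costs little for realistic shaders.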
|
|
|
|
|
|
#8 | |
|
New Member
Join Date: May 2006
Posts: 5
|
Hey, Nick, yeah, it's "Ogotay" here, didn't think anybody would remember this name. Nice to see you're still around, how ya doing man???
Thanks for all your responses guys, they helped me a good deal. However, I've got one follow-up question.
Quote:
What do you mean?? In the future I won't be able to upload any shaders written in asm to the graphics hardware using APIs like DirectX, only shaders written in HLSL? Where did you get this info? I can hardly imagine this is going to happen. I see the benefits of writing shaders in a high-level language: it's easier and faster, you don't have to deal with hardware-dependent low-level optimizations yourself (especially since there are already so many different graphics cards on the market), and the API (or whatever) does it for you automatically before your shader is loaded and executed. But I can't imagine support for asm will be dropped any time soon, at least not as long as HLSL itself is compiled into asm. Execution speed is crucial for shaders and I don't see that changing in the near future, and I know I don't have to tell you that well-written asm code is at least as fast as, and usually faster than, automatically optimized code.
Last edited by ogo : 05-21-2006 at 07:55 AM. |
|
|
|
|
|
|
#9 | ||
|
Senior Member
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 1,135
|
Quote:
I work for TransGaming now, on the SwiftShader project. So I'm still a software rendering nut.
Quote:
|
||
|
|
|
|
|
#10 | ||
|
New Member
Join Date: May 2006
Posts: 5
|
Quote:
I see, I see, nice. And people looked at you strangely when you started working on your software renderer back then. So what's next, a realtime raytracer?
Quote:
So are you really advising me to drop asm and concentrate solely on HLSL and the like? Is this the future? Darn, the chance to code in assembly on actually practical projects was the main thing that got me back into programming, and now you guys say it's a waste of time? Do you have any articles or the like where I could read more about these plans? |
||
|
|
|
|
|
#11 | ||
|
Valued Member
Join Date: Aug 2005
Posts: 162
|
Quote:
...the thing is that the multithreaded nature of pixel shaders makes the programming model practically insensitive to latency, as that time can be used for other good things (like other pixels). i know that at least some GPUs exploit this and have each functional unit heavily pipelined (let's face it, the texture mapper HAS to be).
Quote:
yeah. that's why i never claimed the -entire- pipeline squeezes out one pixel per clock, only each pipeline for each instruction. |
||
|
|
|
|
|
#12 | |
|
Valued Member
Join Date: Oct 2005
Posts: 247
|
Quote:
Obviously you'll have to download the DX9 April SDK: http://msdn.microsoft.com/directx/sdk/
|
|
|
|
|
|
#13 | |
|
New Member
Join Date: May 2006
Posts: 5
|
Quote:
Oh. I've actually had it lying around on my hard drive all the time. Thanks. |
|
|
|
|
|
|
#14 | ||
|
Valued Member
Join Date: Sep 2004
Posts: 226
|
ogo(tay):
Quote:
Quote:
So, if someone has a sample where the generated output from an HLSL file was optimised (through changes by hand), please let me know. |
||
|
|
|
|
|
#15 |
|
Valued Member
Join Date: Sep 2005
Location: Germany
Posts: 119
|
The "assembly" you come into contact with in current APIs is more like Java bytecode than real instructions. The GPU driver will recompile it to the hardware's native format anyway.
Don't waste your time with it and just use a high-level language. |
|
|
|
|
|
#16 |
|
New Member
Join Date: Feb 2006
Location: California
Posts: 22
|
I agree it is a waste of time playing with asm shaders. I also think it is sad, as the chance of writing some cool asm was partly my motivation for starting programming again after ten years of sobbing over the death of the Amiga.
However, after giving up on asm shaders, I still have one nifty shader that does not convert to HLSL. It just will not fit (too many instructions) whenever I try to implement it, since the asm code is so compact. I also believe that it would look rather unreadable in HLSL. In the end, since the 'asm' code is not really native asm and just some kind of bytecode, I am really not that sad.
I recently read a paper which included some shader code (I think I downloaded it from Nvidia's site). The code was written mostly in HLSL, but also included one of the shaders in asm. The reason was that the HLSL version compiled into too many instructions for low-level shader models. Of course, these instruction limits are quickly disappearing as the cards rapidly get more powerful, so the problem (or motivation for sticking to asm, if you like) should be considered temporary.
___________________________________________
0, 1/2, 2/3, 3/4, 4/5, ... |
|
|
|
|
|
#17 |
|
New Member
Join Date: May 2006
Posts: 5
|
Ok, thanks guys, you all definitely helped me make up my mind. So, yeah, gonna drop asm, but I'm not too sad about this either. It would be cool to have the opportunity to code in asm after so many years, but I already made my peace years ago with it becoming more and more pointless. Seeing how HLSL is something you can pick up very quickly and then concentrate on the actual algorithms instead of "byte shifting", I think I will go this route. After all, in computer graphics it's all about the result, not how you got there, and if HLSL delivers the same speed and functionality with less work than asm, then why not use it?
Last edited by ogo : 05-24-2006 at 05:57 AM. |
|
|
|
|
|
#18 |
|
New Member
Join Date: Oct 2005
Posts: 11
|
There are articles in ShaderX3 and ShaderX4 that give you an idea of how fast single instructions are. It depends heavily on a number of things, and it might be very different for the two main graphics vendors...
There was never a good reason to believe that the cycle numbers in the documentation have anything to do with real-world numbers :-) |
|
|
|