freaky
Advance Member
 

Joined: Jan 2002
Posts: 449
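I happened to come across an Intel Gen9 Skylake discussion thread with comments from Intel's Andrew Lauritzen. This is how someone who actually understands the situation talks about it: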

"From an API point of view, async compute is a way to provide an implementation with more potential parallelism to exploit. It is pretty analogous to SMT/hyper-threading: the API (multiple threads) are obviously supported on all hardware and depending on the workload and architecture it can increase performance in some cases where the different threads are using different hardware resources. However there is some inherent overhead to multithreading and an architecture that can get high performance with fewer threads (i.e. high IPC) is always preferable from a performance perspective."

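From the API's point of view, async compute is a way to give an implementation more potential parallelism to exploit. It is much like SMT/hyper-threading: every piece of hardware obviously supports the API (multiple threads), and depending on the workload and architecture it can improve performance in the cases where different threads use different hardware resources. Multithreading carries inherent overhead, though, so from a performance standpoint an architecture that reaches high performance with fewer threads (that is, high IPC) is always preferable.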

"When someone says that an architecture does or doesn't support "async compute/shaders" it is already an ambiguous statement (particularly for the latter). All DX12 implementations must support the API (i.e. there is no caps bit for "async compute", because such a thing doesn't really even make sense), although how they implement it under the hood may differ. This is the same as with many other features in the API."

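When someone says an architecture does or does not support "async compute/shaders", the statement is already ambiguous (particularly the latter). All DX12 implementations must support the API (there is no caps bit for "async compute", because such a thing wouldn't even make sense), although how they implement it under the hood may differ. The same is true of many other features in the (D3D12) API.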

"From an architecture point of view, a more well-formed question is "can a given implementation ever be running 3D and compute workloads simultaneously, and at what granularity in hardware?" Gen9 cannot run 3D and compute simultaneously, as we've referenced in our slides. However what that means in practice is entirely workload dependent, and anyone asking the first question should also be asking questions about "how much execution unit idle time is there in workload X/Y/Z", "what is the granularity and overhead of preemption", etc. All of these things - most of all the workload - are relevant when determining how efficiently a given situation maps to a given architecture."

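From an architectural point of view, a better-formed question is "can a given implementation ever run 3D and compute workloads simultaneously, and at what hardware granularity?" Gen9 cannot run 3D and compute at the same time, as noted in our slides. What that means in practice, however, depends entirely on the workload, and anyone asking the first question should also ask "how much execution-unit idle time is there in workload X/Y/Z?", "what are the granularity and overhead of preemption?" and so on. All of these factors, the workload above all, matter in determining how efficiently a given situation maps to a given architecture.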
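To make the workload dependence concrete, here is a toy back-of-the-envelope model in Python. It is not from the quoted thread and does not describe any real GPU: the function name `frame_time`, its parameters, and the fixed `overlap_overhead_ms` cost are all assumptions made for illustration. It only shows that the gain from overlapping compute with 3D work shrinks as execution-unit idle time shrinks, and can turn negative once overhead is counted.

```python
def frame_time(gfx_ms, gfx_utilization, compute_ms, overlap_overhead_ms=0.0):
    """Toy model (hypothetical): return (serial_ms, overlapped_ms) for one frame.

    gfx_ms              -- wall time of the 3D work on its own
    gfx_utilization     -- fraction of execution-unit time the 3D work keeps busy (0..1)
    compute_ms          -- time the compute work needs with the whole GPU to itself
    overlap_overhead_ms -- assumed fixed cost of synchronisation/preemption
    """
    serial = gfx_ms + compute_ms
    # Execution-unit time left idle while the 3D work runs.
    idle_capacity = gfx_ms * (1.0 - gfx_utilization)
    # Compute work that fits into that idle time; the rest runs afterwards.
    absorbed = min(compute_ms, idle_capacity)
    overlapped = gfx_ms + (compute_ms - absorbed) + overlap_overhead_ms
    return serial, overlapped


for util in (0.5, 0.9, 1.0):
    serial, overlapped = frame_time(10.0, util, 4.0, overlap_overhead_ms=0.5)
    print(f"utilization {util:.0%}: serial {serial:.1f} ms, overlapped {overlapped:.1f} ms")
```

With the 3D work keeping only half the units busy, the overlapped frame is clearly shorter than the serial one; with the units already fully busy, the assumed overhead makes overlapping slightly slower than running the two back to back.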

"Without that context you're effectively in making claims like 8 cores are always better than 4 cores (regardless of architecture) because they can run 8 things simultaneously. Hopefully folks on this site understand why that's not particularly useful."

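Without that context, you are effectively claiming that 8 cores are always better than 4 (regardless of architecture) because they can run 8 things at once. Hopefully people on this site understand why that isn't particularly useful.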

"... and if anyone starts talking about numbers of hardware queues and ACEs and whatever else you can pretty safely ignore that as marketing/fanboy nonsense that is just adding more confusion rather than useful information."

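... and if anyone starts talking about the number of hardware queues, ACEs and so on, you can pretty safely ignore it as marketing/fanboy nonsense that adds confusion rather than useful information.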

"Arun said: ↑
Yes, I'm honestly curious what the benefits to having multiple compute kernels in parallel really are (ala AMD's >2 ACEs)... This is beneficial if you cannot overlap an independent graphics workload and you have multiple independent compute workloads to run, but I'm not sure how important that is in practice."

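Arun wrote:
Yes, I'm honestly curious what the real benefit is of running multiple compute kernels in parallel (as with AMD's more than two ACEs)... It helps when you cannot overlap an independent graphics workload and have several independent compute workloads to run, but I'm not sure how much that matters in practice.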

"Right so the bit people get confused with is that "I want multiple semantically async queues for convenience/middleware in the API" does *not* imply you need some sort of independent hardware queue resources to handle this, or even that they are an advantage. I hate to beat a dead horse here but it really is similar to multithreading and SMT... you don't need one hardware thread per software thread that you want to run - the OS *schedules* the software threads onto the available hardware resources and while there are advantages to hardware-based scheduling at the finer granularity, you're on thin ice arguing that you need any more than 2-3 hardware-backed "queues" here."

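Right, so what confuses people is that "I want multiple semantically asynchronous queues in the API for convenience/middleware" does *not* imply that you need some kind of independent hardware queue resources to handle them, or even that such resources are an advantage. I don't want to beat a dead horse, but this really is similar to multithreading and SMT: you don't need one hardware thread for every software thread you want to run. The OS *schedules* software threads onto the available hardware resources, and although hardware-based scheduling has advantages at finer granularity, the argument that you need more than 2-3 hardware-backed "queues" is on thin ice.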
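The threading analogy can be sketched in a few lines of Python. This is only an illustration of many software queues being multiplexed onto a small, fixed number of workers by a scheduler; it is not GPU code, and `run_queue`, `run_all` and `hardware_slots` are hypothetical names chosen for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_queue(queue_id, commands):
    """Run one software queue's commands in order on whichever worker picks it up."""
    worker = threading.get_ident()
    # Stand-in "work": square each command value, preserving submission order.
    return queue_id, [cmd * cmd for cmd in commands], worker


def run_all(software_queues, hardware_slots=2):
    """Multiplex any number of software queues onto a fixed number of workers."""
    with ThreadPoolExecutor(max_workers=hardware_slots) as pool:
        futures = [pool.submit(run_queue, qid, cmds)
                   for qid, cmds in software_queues.items()]
        return [f.result() for f in futures]


queues = {f"queue{i}": [i, i + 1, i + 2] for i in range(8)}
results = run_all(queues, hardware_slots=2)
workers_used = {worker for _, _, worker in results}
print(len(results), "software queues ran on", len(workers_used), "worker thread(s)")
```

All eight queues complete, each with its commands in order, even though no more than two workers ever exist. The application-facing abstraction (many queues) and the resources behind it (few workers) are decoupled by the scheduler.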

"Arun said: ↑
Certainly a lot depends on the workload, the developer, *and* the API's ability to expose that parallelism in the first place."

Arun wrote:
Certainly a lot depends on the workload, the developer, *and* the API's ability to expose that parallelism in the first place.

"Absolutely, and that's another point that people miss here. GPUs are *heavily* pipe-lined and already run many things at the same time. Every GPU I know of for quite a while can run many simultaneous and unique compute kernels at once. You do not need async compute "queues" to expose that - pipelining + appropriate barrier APIs already do that just fine and without adding heavy weight synchronization primitives that multiple queues typically require. Most DX11 drivers already make use of parallel hardware engines under the hood since they need to track dependencies anyways... in fact it would be sort of surprising if AMD was not taking advantage of "async compute" in DX11 as it is certainly quite possible with the API and extensions that they have."

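Absolutely, and that's another point people miss. GPUs are *heavily* pipelined and already run many things at once. Every GPU I have known of for quite some time can run many distinct compute kernels simultaneously. You don't need async compute "queues" to expose that: pipelining plus appropriate barrier APIs already do the job well, without the heavyweight synchronization primitives that multiple queues typically require. Most DX11 drivers already use parallel hardware engines under the hood, since they have to track dependencies anyway. In fact it would be somewhat surprising if AMD were not already taking advantage of "async compute" in DX11, because it is certainly possible with the (D3D11) API and the extensions they have.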
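As a rough illustration of the dependency tracking mentioned above, the Python sketch below takes a single in-order command stream and works out which commands have no hazards between them and could therefore overlap. It is a simplified assumption about the general idea, not how any actual driver is implemented; `schedule_levels` and the resource names in `stream` are hypothetical.

```python
def schedule_levels(commands):
    """Assign each command in an in-order stream the earliest level it can run at.

    commands: list of (name, reads, writes), with reads/writes as sets of
    resource names. Commands on the same level have no hazards between them.
    """
    levels = []
    for i, (_, reads, writes) in enumerate(commands):
        level = 0
        for j in range(i):
            _, prev_reads, prev_writes = commands[j]
            hazard = (
                reads & prev_writes       # read-after-write
                or writes & prev_writes   # write-after-write
                or writes & prev_reads    # write-after-read
            )
            if hazard:
                # Must wait until the conflicting earlier command has finished.
                level = max(level, levels[j] + 1)
        levels.append(level)
    return levels


stream = [
    ("shadow_pass",    set(),             {"shadow_map"}),
    ("particle_sim",   {"particles_in"},  {"particles_out"}),
    ("lighting",       {"shadow_map"},    {"hdr_target"}),
    ("draw_particles", {"particles_out"}, {"hdr_target"}),
    ("tonemap",        {"hdr_target"},    {"backbuffer"}),
]
levels = schedule_levels(stream)
for (name, _, _), level in zip(stream, levels):
    print(level, name)
```

Here the 3D `shadow_pass` and the compute-style `particle_sim` both land on level 0 because they touch different resources, so the two can overlap even though they were submitted through one stream and no second queue was involved.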

"Now obviously I'm all for making the API more explicit like they have in DX12. But don't confuse that with mapping one-to-one with some hardware feature on some GPU. That's simply a misunderstanding of how this all works."

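Obviously I'm all for making the API more explicit, as DX12 does. But don't confuse that with a one-to-one mapping onto some hardware feature of some GPU. That is simply a misunderstanding of how all this works.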

"Arun said: ↑
Another thing to consider is that if you have enough parallelism on one workload, then running a second one at the same time risks trashing your cache, and arbitration may also be non-trivial. Again I have never done any performance analysis of GCN so I don't know how well they handle that but it's certainly something that I expect will benefit from gradual improvement between hardware generations."

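Arun wrote:
Another thing to consider is that if one workload already has enough parallelism, running a second one at the same time risks trashing the cache, and arbitration may be non-trivial too. Again, I have never done any performance analysis of GCN, so I don't know how well it handles this, but I certainly expect it to benefit from gradual improvement across hardware generations.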

"Yes, the scheduling is non-trivial and not really something an application can do well either, but GCN tends to leave a lot of units idle from what I can tell, and thus it needs this sort of mechanism the most. I fully expect applications to tweak themselves for GCN/consoles and then basically have that all undone by the next architectures from each IHV that have different characteristics. If GCN wasn't in the consoles I wouldn't really expect ISVs to care about this very much. Suffice it to say I'm not convinced that it's a magical panacea of portable performance that has just been hiding and waiting for DX12 to expose it."

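Yes, scheduling is non-trivial and not really something an application can do well either, but from what I can tell GCN tends to leave a lot of units idle, so it needs this kind of mechanism the most. I fully expect applications to tune themselves for GCN/consoles and then see all of that undone by each IHV's next architectures, which will have different characteristics. If GCN weren't in the consoles, I wouldn't expect ISVs to care much about this. Suffice it to say I'm not convinced it is a magical panacea of portable performance that was just hiding, waiting for DX12 to expose it.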

"Anyways this is going a bit off topic so I'll leave it at that I think I've answered the question in any case and hopefully made those asking it think a little deeper."

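Anyway, this is drifting off topic, so I'll leave it at that :) I think I've answered the question in any case, and I hope I've made the people asking it think a little deeper.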
 
Posted 2015-09-05, 07:32 PM #52