PCDVD Digital Technology Discussion Forum
(https://www.pcdvd.com.tw/index.php)
- System Components
(https://www.pcdvd.com.tw/forumdisplay.php?f=19)
- - Intel 32nm Sandy Bridge, a single-chip processor fully integrating graphics and the northbridge, has taped out :shock:
(https://www.pcdvd.com.tw/showthread.php?t=859799)
Quote:
The bigger the cache, the higher the associativity can go, so of course the hit rate goes up too...... the bigger the warehouse, the more stuff it can hold. Are you confusing hit rate with something else? |
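A minimal sketch of the point under dispute, assuming an invented access pattern and LRU replacement: two caches of equal capacity, one direct-mapped and one 2-way set-associative, can see very different hit rates when two hot addresses map to the same set. Everything below (sizes, addresses, trace) is made up for illustration only.

# Minimal sketch: same-capacity caches, different associativity, on a
# hypothetical access pattern that thrashes a single set when direct-mapped.
# All sizes and addresses here are invented for illustration only.

def simulate(num_sets, ways, accesses):
    """LRU set-associative cache; returns hit rate over the access trace."""
    sets = [[] for _ in range(num_sets)]      # each set holds up to `ways` tags
    hits = 0
    for addr in accesses:
        block = addr // 64                    # 64-byte lines
        index, tag = block % num_sets, block // num_sets
        s = sets[index]
        if tag in s:
            hits += 1
            s.remove(tag)                     # move to MRU position
        elif len(s) == ways:
            s.pop(0)                          # evict LRU
        s.append(tag)
    return hits / len(accesses)

# Two addresses chosen so they land in the same set in both configurations.
trace = [0x0000, 0x8000] * 1000

# Same total capacity (8 KiB of 64-byte lines), different associativity.
print("direct-mapped:", simulate(num_sets=128, ways=1, accesses=trace))
print("2-way LRU:    ", simulate(num_sets=64,  ways=2, accesses=trace))

With this invented trace the direct-mapped cache thrashes (hit rate near 0) while the 2-way cache keeps both lines resident (hit rate near 1), so associativity can raise the hit rate even at fixed capacity; whether it does in practice depends entirely on the workload.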
This feels like the start of heterogeneous multi-core~
NV's motherboard chipset business can barely hang on any longer |
What I really want to ask is...
So where is AMD's Fusion? |
Quote:
Very well said +++++1111111. But only people who really understand processor architecture will agree with you... Users who only half-understand it will probably cook up some twisted arguments... and leave you unable to argue your case. Still, I think it's great.. a good refresher.. :D |
Quote:
We probably won't see a finished product until AMD's 32nm parts arrive; maybe in a year and a half :stupefy: |
Quote:
Twisted arguments? I see :jolin: So clearly you yourself are the "person who really understands processor architecture" you speak of :jolin: :jolin: :jolin: When it comes down to it, you're just the type who grabs any chance to praise yourself while putting others down :stupefy: |
Quote:
You're confusing the CPU's cache with the OS's memory management |
Just this short passage on caches from the wiki should be enough to show that a real comparison involves more than just size and speed (see the worked sketches after the excerpt below).

"Multi-level caches

Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger slower caches. Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is checked, and so on, before external memory is checked.

As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. For example, the Alpha 21164 (1995) had a 96 KB on-die L3 cache, the IBM POWER4 (2001) had a 256 MB L3 cache off-chip, shared among several processors, the Itanium 2 (2003) had a 6 MB unified level 3 (L3) cache on-die, Intel's Xeon MP product code-named "Tulsa" (2006) features 16 MB of on-die L3 cache shared between two processor cores, the AMD Phenom II (2008) has up to 6 MB on-die unified L3 cache and the Intel Core i7 (2008) has an 8 MB on-die unified L3 cache that is inclusive, shared by all cores. The benefits of an L3 cache depend on the application's access patterns.

Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software—typically by a compiler, as it allocates registers to hold values retrieved from main memory. (See especially loop nest optimization.) Register files sometimes also have hierarchy: The Cray-1 (circa 1976) had 8 address "A" and 8 scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.)

Exclusive versus inclusive

Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache. These caches are called strictly inclusive. Other processors (like the AMD Athlon) have exclusive caches — data is guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors (like the Intel Pentium II, III, and 4), do not require that data in the L1 cache also reside in the L2 cache, although it may often do so. There is no universally accepted name for this intermediate policy, although the term mainly inclusive has been used.[citation needed]

The advantage of exclusive caches is that they store more data. This advantage is larger when the exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just copying a line from L2 to L1, which is what an inclusive cache does.

One advantage of strictly inclusive caches is that when external devices or other processors in a multiprocessor system wish to remove a cache line from the processor, they need only have the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache must be checked as well.
As a drawback, there is a correlation between the associativities of L1 and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the effective associativity of the L1 caches is restricted. Another advantage of inclusive caches is that the larger cache can use larger cache lines, which reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the same size cache lines, so that cache lines can be swapped on a L1 miss, L2 hit). If the secondary cache is an order of magnitude larger than the primary, and the cache data is an order of magnitude larger than the cache tags, this tag area saved can be comparable to the incremental area needed to store the L1 cache data in the L2.

Example: the K8

To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core in the AMD Athlon 64 CPU.[7]

(Figure: example of hierarchy, the K8)

The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data cache. Each of these caches is specialized:

* The instruction cache keeps copies of 64 byte lines of memory, and fetches 16 bytes each cycle. Each byte in this cache is stored in ten bits rather than 8, with the extra bits marking the boundaries of instructions (this is an example of predecoding). The cache has only parity protection rather than ECC, because parity is smaller and any damaged data can be replaced by fresh data fetched from memory (which always has an up-to-date copy of instructions).
* The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction fetch has its virtual address translated through this TLB into a physical address. Each entry is either 4 or 8 bytes in memory. Each of the TLBs is split into two sections, one to keep PTEs that map 4 KiB, and one to keep PTEs that map 4 MiB or 2 MiB. The split allows the fully associative match circuitry in each section to be simpler. The operating system maps different sections of the virtual address space with different size PTEs.
* The data TLB has two copies which keep identical entries. The two copies allow two data accesses per cycle to translate virtual addresses to physical addresses. Like the instruction TLB, this TLB is split into two kinds of entries.
* The data cache keeps copies of 64 byte lines of memory. It is split into 8 banks (each storing 8 KiB of data), and can fetch two 8-byte data each cycle so long as those data are in different banks. There are two copies of the tags, because each 64 byte line is spread among all 8 banks. Each tag copy handles one of the two accesses per cycle.

The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which store only PTEs mapping 4 KiB. Both instruction and data caches, and the various TLBs, can fill from the large unified L2 cache. This cache is exclusive to both the L1 instruction and data caches, which means that any 8-byte line can only be in one of the L1 instruction cache, the L1 data cache, or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in one of the TLBs—the operating system is responsible for keeping the TLBs coherent by flushing portions of them when the page tables in memory are updated. The K8 also caches information that is never stored in memory—prediction information. These caches are not shown in the above diagram.
As is usual for this class of CPU, the K8 has fairly complex branch prediction, with tables that help predict whether branches are taken and other tables which predict the targets of branches and jumps. Some of this information is associated with instructions, in both the level 1 instruction cache and the unified secondary cache.

The K8 uses an interesting trick to store prediction information with instructions in the secondary cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by an alpha particle strike) by either ECC or parity, depending on whether those lines were evicted from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC code, lines from the instruction cache have a few spare bits. These bits are used to cache branch prediction information associated with those instructions. The net result is that the branch predictor has a larger effective history table, and so has better accuracy."

http://en.wikipedia.org/wiki/CPU_ca...ti-level_caches |
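To put rough numbers on the latency/hit-rate tradeoff the excerpt opens with, here is a small sketch computing average memory access time (AMAT) for a hypothetical two-level hierarchy; every latency and hit rate below is an invented illustration, not a measurement of Sandy Bridge, the K8, or any other part.

# Hypothetical numbers only: illustrates why a small fast L1 backed by a
# larger slower L2 can beat either a tiny cache alone or one big slow cache.

def amat(l1_hit_ns, l1_rate, l2_hit_ns, l2_rate, mem_ns):
    """Average memory access time for an L1 -> L2 -> DRAM lookup chain."""
    l1_miss = 1.0 - l1_rate
    l2_miss = 1.0 - l2_rate          # L2 hit rate is local (fraction of L1 misses)
    return l1_hit_ns + l1_miss * (l2_hit_ns + l2_miss * mem_ns)

# Small fast L1 only (pass a zero-latency, always-missing "L2" so misses go to DRAM):
print("L1 only :", amat(1.0, 0.90, 0.0, 0.0, 60.0), "ns")
# One big cache with a higher hit rate but higher latency, used alone:
print("big only:", amat(6.0, 0.97, 0.0, 0.0, 60.0), "ns")
# Small fast L1 backed by the big cache as an L2:
print("L1 + L2 :", amat(1.0, 0.90, 6.0, 0.70, 60.0), "ns")

Under these made-up numbers the combined hierarchy (3.4 ns) beats either cache used alone (7.0 ns and 7.8 ns), which is the tradeoff the excerpt describes: you cannot make one cache both large and fast, so you stack them.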
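The excerpt's last paragraph says the K8 stashes "branch prediction information" in the spare bits of parity-protected L2 lines, but does not say what that information looks like. A common textbook form is a 2-bit saturating counter per branch; the sketch below is that generic scheme, offered only as an illustration, not as the K8's actual predictor tables.

# Generic 2-bit saturating-counter branch predictor: two bits of state per
# branch, i.e. the kind of compact prediction data a CPU could keep in spare
# bits. A textbook scheme, not a description of the K8's real hardware.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}                      # branch address -> counter 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2    # 2 or 3 means "predict taken"

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

# Hypothetical loop branch: taken 7 times, then falls through, repeatedly.
pred, correct, history = TwoBitPredictor(), 0, ([True] * 7 + [False]) * 100
for outcome in history:
    correct += (pred.predict(0x400) == outcome)
    pred.update(0x400, outcome)
print(f"accuracy: {correct / len(history):.2%}")

The two-bit hysteresis is what lets the predictor stay at "taken" across the single not-taken loop exit, which is why even this tiny amount of per-branch state is worth caching.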
Quote:
First, architecture is certainly worth debating, but whatever the architecture, AMD and Intel will only ever keep making caches "faster" and "bigger"... Second, although English makes my head spin, I still have not forgotten the L3 TLB bug in AMD's debut parts; and "Exclusive versus inclusive" only confirms that AMD, to make the most of its limited cache capacity, had no choice but to adopt an exclusive multi-level cache design... |
Quote:
So the conclusion is still that just this short passage on caches from the wiki should be enough to show that a real comparison involves more than just size and speed. |
All times are GMT +8. The time now is 08:23 PM. |
vBulletin Version 3.0.1
Powered by vBulletin, 2025.