Master's thesis: A Quantitative Analysis of Wiring Lengths in 2D and 3D VLSI Implementation of 2D Systolic Array
Aleksandar Milenkovic, School of Electrical Engineering, University of Belgrade, Serbia, 1997.

Abstract

Performance and cost of the widely used submicron 2D VLSI technology are primarily determined by interconnection delays and on-chip area. One possibility for overcoming this problem is the use of innovative 3D VLSI technology. In such structures, interconnection wires can be shortened, resulting in better performance and packing density. In this research, 3D structures of the systolic array for matrix multiplication are proposed and compared with the corresponding 2D structures, considering the distribution of interconnection wire lengths, the average interconnection length, and the on-chip VLSI area. This type of research is important because interconnection delays and on-chip area primarily determine chip complexity and performance: gate delays decrease with scaling, whereas interconnection delays remain constant, so circuit speed becomes dominated by interconnection delays rather than device delays.

The analysis assumes an existing 3D channel routing methodology, based on the 2D channel routing methodology for standard-cell VLSI. Interconnection wire lengths in the 3D and 2D structures are compared for several examples of systolic arrays. The 2D VLSI chips and each active layer of the 3D VLSI chips are designed using the Tanner Research tools for 2D VLSI design (NetTran, GateSim, L-Edit), running on low-cost workstations. The output file, in CIF format, contains the mask geometry information for each mask level. The quantitative analysis of interconnection wire lengths is done using MARS, a program package developed as part of this work. MARS includes: (a) a program that extracts the geometric primitives relevant for interconnections (input/output terminals of standard cells and geometric primitives from the interconnection layers), (b) a program that instantiates the CIF symbols by applying the specified transformations, and (c) a program that determines the sets of connected geometric primitives, extracts the interconnections, and calculates their lengths in CIF units.

The distribution of interconnection wire lengths, the on-chip area, and the average interconnection wire length are measured in all experiments. For 3D implementations in two active layers, the average interconnection length is between 43% and 48% of that of the 2D implementations, while the overall on-chip area is between 45% and 60% of the 2D on-chip area; these ranges hold across all conducted experiments. For 3D implementations in four active layers, the average interconnection length is between 18% and 30% of that of the 2D implementations, while the overall on-chip area is between 20% and 40% of the 2D on-chip area.
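The grouping of connected geometric primitives and the summing of their lengths described in step (c) can be sketched as a small connected-components computation. The C code below is only an illustration of that idea, not the MARS implementation: the rectangle-overlap connectivity rule, the crude per-rectangle length estimate, and the sample data are assumptions made for this example (a real extractor also joins layers through vias and contacts).

/* Minimal sketch of the connectivity/length step of a MARS-like analysis.
 * Assumptions not taken from the thesis: primitives are axis-aligned
 * rectangles in CIF units, two primitives are connected if they lie on
 * the same layer and overlap, and a net's length is approximated by the
 * sum of the longer sides of its rectangles. */
#include <stdio.h>

typedef struct { int layer; long x1, y1, x2, y2; } Rect;   /* CIF units */

#define NPRIM 4
static Rect prim[NPRIM] = {
    {1,   0,   0, 400,  40},   /* metal-1 horizontal segment           */
    {1, 360,   0, 400, 300},   /* metal-1 vertical segment, same net   */
    {2, 100, 500, 600, 540},   /* metal-2 segment, separate net        */
    {1, 800, 800, 900, 840}    /* isolated metal-1 segment             */
};

/* Union-find over primitives: connected primitives end up in one set. */
static int parent[NPRIM];
static int find(int i) { return parent[i] == i ? i : (parent[i] = find(parent[i])); }
static void unite(int a, int b) { parent[find(a)] = find(b); }

static int touches(const Rect *a, const Rect *b) {
    if (a->layer != b->layer) return 0;
    return a->x1 <= b->x2 && b->x1 <= a->x2 && a->y1 <= b->y2 && b->y1 <= a->y2;
}

static long length(const Rect *r) {
    long w = r->x2 - r->x1, h = r->y2 - r->y1;
    return w > h ? w : h;          /* crude per-rectangle length estimate */
}

int main(void) {
    for (int i = 0; i < NPRIM; i++) parent[i] = i;
    for (int i = 0; i < NPRIM; i++)
        for (int j = i + 1; j < NPRIM; j++)
            if (touches(&prim[i], &prim[j])) unite(i, j);

    long net_len[NPRIM] = {0};
    for (int i = 0; i < NPRIM; i++) net_len[find(i)] += length(&prim[i]);

    for (int i = 0; i < NPRIM; i++)
        if (find(i) == i)
            printf("net %d: total length %ld CIF units\n", i, net_len[i]);
    return 0;
}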



PhD thesis: Cache injection in bus-based shared memory multiprocessors
Aleksandar Milenkovic, School of Electrical Engineering, University of Belgrade, Serbia, 1999.

Abstract

The problem of high memory latency is the most critical performance issue in cache-coherent shared memory multiprocessors. One way to cope with this problem is to tolerate high memory latency by overlapping memory accesses with computation. The importance of techniques for tolerating high memory latency in multiprocessor systems is increasing for several reasons: (a) the widening gap in speed between CPU and memory, (b) high contention on interconnection networks, (c) interconnect operations caused by data sharing between processors, and (d) the increasing physical distances between processors and memory.

Software-controlled cache prefetching is a widely accepted consumer-initiated technique for tolerating memory latency in multiprocessors, as well as in uniprocessors. In software-controlled cache prefetching, a CPU executes a special prefetch instruction that moves data (expected to be used by that CPU) into its cache before it is actually needed. In the best case, the data arrives at the cache before it is needed by the CPU, and the CPU load instruction results in a hit. For many programs and sharing patterns (e.g., producer-consumer), producer-initiated data transfers are a natural style of communication. Producer-initiated primitives are known as data forwarding, delivery, remote writes, and software-controlled updates. With data forwarding, when a CPU produces the data, in addition to updating its cache, it sends a copy of the data to the caches of the processors identified by the compiler or programmer as its future consumers. Therefore, when the consumer processors access the data, they find it in their caches. Comparing cache prefetching and data forwarding, prefetching can eliminate any kind of read miss (cold, conflict, and coherence), while forwarding can eliminate only coherence misses and the related cases of cold misses. However, prefetching is inapplicable or insufficient when the location to be accessed is not known sufficiently early or when the value to be read is not produced sufficiently early (synchronization, producer-consumer sharing patterns, etc.). In addition, prefetching can negatively affect data sharing, and a prefetch issued too early degrades performance. For coherence misses, forwarding can be more effective than prefetching for the following reasons: (a) forwarding delivers data to consumers as soon as it is produced, (b) it has lower latency, and (c) it can deliver the same data to several consumers with a single instruction. However, data forwarding requires more sophisticated compiler support than prefetching: for prefetching, the consumer processor does not need to know the identity of the producer processor, while for forwarding the producer processor needs to know the identity of the consumers.

Most studies have shown high effectiveness of cache prefetching and data forwarding in CC-NUMA or CC-UMA architectures. However, it has been shown that cache prefetching is not as effective in bus-based shared memory multiprocessors, for the following reasons. First, prefetching increases bus traffic; since bus-based architectures are very sensitive to changes in bus traffic, prefetching can result in performance degradation. Second, cache prefetching can negatively affect data sharing, especially when prefetches are initiated too early. Last, current prefetching algorithms are not very effective at predicting coherence misses. Indeed, cache misses caused by data sharing represent the biggest challenge for designers, especially as caches become larger and coherence misses come to dominate the performance of parallel programs. On the other hand, data forwarding has not yet been examined in bus-based architectures; the complexity of its implementation and of the required compiler algorithms restricts its applicability there.
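The contrast between the two primitives can be illustrated on a simple 1-producer-1-consumer loop. In the C sketch below, prefetch_shared() and forward_to() are hypothetical stand-ins for a prefetch instruction and a forwarding/remote-write primitive; they are not instructions from any real ISA, and the loop structure and constants are invented for the example.

/* Illustrative contrast between consumer-initiated prefetching and
 * producer-initiated forwarding on a 1-producer-1-consumer pattern. */
#define N        1024
#define AHEAD    8            /* prefetch distance, in iterations */
#define CONSUMER 1            /* id of the consuming processor    */

static double shared_buf[N];

/* Stubs so the sketch compiles; a real system maps these to hardware. */
static void prefetch_shared(volatile void *addr)     { (void)addr; }
static void forward_to(int cpu, volatile void *addr) { (void)cpu; (void)addr; }
static double compute(int i)                         { return (double)i; }
static void use(double v)                            { (void)v; }

/* Consumer-initiated: the consumer hides miss latency by requesting the
 * data AHEAD iterations before the load that actually uses it. */
void consumer_with_prefetch(void) {
    for (int i = 0; i < N; i++) {
        if (i + AHEAD < N)
            prefetch_shared(&shared_buf[i + AHEAD]);
        use(shared_buf[i]);
    }
}

/* Producer-initiated: the producer pushes each value toward the known
 * consumer as soon as it is written, so the consumer's later load hits. */
void producer_with_forwarding(void) {
    for (int i = 0; i < N; i++) {
        shared_buf[i] = compute(i);
        forward_to(CONSUMER, &shared_buf[i]);
    }
}

The sketch also makes the compiler-support asymmetry visible: the prefetch version needs only the consumer's own access pattern, while the forwarding version has to name the consumer explicitly.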

In this dissertation, a novel technique called cache injection is proposed. Cache injection aims to overcome some of the shortcomings of the existing techniques: (a) bus and memory contention, (b) the negative impact on data sharing and the instruction overhead in the case of cache prefetching, (c) the compiler complexity of identifying future consumers, and (d) the complexity of implementation in the case of data forwarding. In cache injection, a consumer predicts its future needs for shared data by executing an OpenWin instruction. This instruction does not initiate any bus transaction; it only stores the first and last addresses of a range of successive cache blocks in a special local injection table. This address range is called an address window. During specific bus transactions, each processor snoops the bus. If the current bus address belongs to one of the open address windows in a processor's injection table (an injection hit), that processor "catches" the data and stores it in its cache.

There are two main scenarios in which cache injection can happen: during a bus read transaction (injection on first read) or during a software-initiated write-back bus transaction (injection on write-back). Injection on first read is applicable when there is more than one consumer. Each consumer initializes its injection table according to its future needs. When the first of the consumers executes a load instruction referencing the shared data, it sees a cache miss and initiates a bus read transaction. During this transaction, each processor snoops the bus and, on an injection hit, stores the block in its cache. Hence, in the case of multiple consumers, only one bus read transaction is needed to update all consumers, provided they have all initialized their injection tables before this transaction. Injection on write-back is applicable when shared data exhibit 1-Producer-1-Consumer or 1-Producer-Multiple-Consumers sharing patterns. In these scenarios, each consumer also initializes its injection table. On the producer side, after producing the data, the producer executes an Update instruction to initiate write-back bus transactions that update memory. During the write-back bus transaction, all consumers snoop the bus and, on an injection hit, catch the data from the data bus and store it in their caches.
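A simplified model of the consumer-side logic may make the mechanism more concrete. The C sketch below is not the hardware described in the dissertation: the table size, the block size, and the cache_fill() interface are invented details. Only the overall flow follows the description above: OpenWin fills a local table without a bus transaction, the snooper compares bus addresses against open windows, and an injection hit copies the block riding on the data bus into the local cache.

/* Simplified model of the consumer-side injection logic described above.
 * Table size, block size, and the cache interface are made-up details. */
#include <stdint.h>
#include <stdbool.h>

#define INJ_ENTRIES 8
#define BLOCK_SIZE  64                 /* assumed cache block size in bytes */

typedef struct {
    bool     open;
    uint64_t first;                    /* address of first block in the window */
    uint64_t last;                     /* address of last block in the window  */
} InjWindow;

static InjWindow inj_table[INJ_ENTRIES];

/* Stand-in for the local cache fill performed on an injection hit. */
static void cache_fill(uint64_t block_addr, const uint8_t *data) {
    (void)block_addr; (void)data;
}

/* OpenWin: the consumer announces its future needs locally; no bus
 * transaction is issued, only a table entry is written. */
void open_window(int slot, uint64_t first, uint64_t last) {
    inj_table[slot] = (InjWindow){ .open = true, .first = first, .last = last };
}

/* Called by the snooping logic for every read or write-back transaction
 * observed on the bus; returns true on an injection hit, in which case
 * the block riding on the data bus is copied into the local cache. */
bool snoop_for_injection(uint64_t bus_addr, const uint8_t *bus_data) {
    uint64_t blk = bus_addr & ~(uint64_t)(BLOCK_SIZE - 1);
    for (int i = 0; i < INJ_ENTRIES; i++) {
        if (inj_table[i].open &&
            blk >= inj_table[i].first && blk <= inj_table[i].last) {
            cache_fill(blk, bus_data);
            return true;               /* injection hit: catch the data */
        }
    }
    return false;
}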

We evaluate the performance impact of cache injection using Limes [12], a tool for execution-driven simulation of shared memory multiprocessors. Two synchronization kernels, three parallel applications well suited to demonstrating various data sharing patterns, and four applications from the SPLASH-2 suite [13] are used in the evaluation. The proposed cache injection instructions are hand-inserted into the synchronization kernels and parallel applications. A detailed simulator of the memory subsystem is developed. For each application, we compare the performance of the base system and one or more systems that include cache injection. The experiments vary memory subsystem parameters in order to explore their influence on cache injection. In the case of the synchronization kernels, cache injection shows gains of 12% to 96% over the base system. In the case of the applications, cache injection improves performance by 2% to 90% relative to the base system, depending on the number of processors and the parameters of the memory subsystem. The experiments show that the efficiency of cache injection increases with the number of processors in the system, the cache size, and the memory latency.

Key Words and Phrases: cache injection, cache prefetching, data forwarding, shared memory multiprocessors.
