A Dataflow Execution Core Prototype

Ronny Krashinsky

ronny@mit.edu
www.cag.lcs.mit.edu/~ronny

Introduction to VLSI Systems (6.371)
12/11/2002

Overview

In this document I describe the architecture and VLSI implementation of a Dataflow Execution Core (DXC) prototype. DXC is a unit intended to serve as the computational core of a larger processor design. In this project, I developed a cycle-accurate behavioral model of a prototype DXC design. This model loads a user test program into simulated instruction issue buffers, and then tracks the values of all major nets in the design while simulating the execution of these instructions. I implemented a subset of this DXC model in full-custom layout. To achieve this, I obtained a library of 32-bit datapath modules, and used a procedural layout library to place and route the design. To verify the layout, I drove an extracted netlist with vectors from the behavioral model, and checked that all the net values matched on every cycle. As the emphasis of this project was on implementing a functional prototype, I did not conduct detailed timing simulations. The tool-flow for my project is shown in Figure 1, and referred to in the remainder of this report. All figures can be clicked on to see a larger version of the image.

Figure 1: Tool-flow.

Architecture

Figure 2 shows an overview of the DXC architecture. The DXC contains an array of interconnected arithmetic units and register files. Each arithmetic unit and register file pair is called a cluster. A vertical array of four clusters comprises a lane. Within a lane, each cluster has a dedicated output bus, and each of these lane transport busses is an input to every cluster. Across lanes, each cluster has a do-across transport bus that connects to its neighbor in a unidirectional ring.

Figure 2: DXC Architecture Overview.

Each cluster in the DXC executes instructions which can operate on local data values, and also send and receive values on the transport network. A central feature of the Dataflow Execution Core is the dynamic resolution of dependencies when two clusters communicate. The dynamic dataflow is accomplished using dedicated send and receive signals to connect every pair of communicating clusters. When a cluster is ready to send data to another cluster, it broadcasts the data on its output bus and asserts the appropriate send signal; likewise when a cluster is ready to receive data from another it asserts the appropriate receive signal. When both signals are asserted, the two clusters simultaneously detect that the data transfer has completed.

Figure 3 shows a pipeline diagram of the cluster. The pipeline has three stages: decode (D), execute (X), and writeback (W); and each stage has two phases: clock high (h), and clock low (l). A new instruction is clocked into the instruction register (IR) at the beginning of the decode stage. During Dh the register file bit-lines precharge while the source register specifiers are decoded, and two registers are read and latched (in X and Y) during Dl. During Xh, a mux selects the ALU/shifter inputs, and these are latched (in S and T) and sent to the ALU to initialize the carry chain. The ALU's dynamic carry chain evaluates during Xl, and a mux selects between the ALU and shifter for the X stage result latch (R). This value can be bypassed to the following instruction, or written back to the register file during Wh (in time for the second following instruction to read the register during Dl). A result value can be latched (in Xo) at the beginning of Wh to be transported to another cluster during Wh and Wl. A received transport value enters the pipeline (in Xi) during Dl for input to arithmetic operations. Therefore, the pipeline allocates almost a full clock cycle to allow transport operations to complete.

Figure 3: Cluster pipeline diagram.

The DXC pipeline has two stall conditions. An instruction stalls in stage D if it has a transport receive operation for which the data is not yet available. The stall is accomplished by not clocking a new instruction into the instruction register, and sending a pipeline bubble into stage X. If a transport send operation has not completed by the end of stage W, then the instructions in D and X must be stalled if the instruction in stage X also contains a transport send operation. In this case, the result of the stage X instruction is preserved in the R latch by gating its clock on the following cycle. Writeback operations to the register file are never stalled after an instruction enters stage X.

I created a cycle-accurate behavioral DXC simulator using the Sychosys simulation framework. The Sychosys structural netlist files [ cluster.sychonet, lane.sychonet, DXC.sychonet ], behavioral C++ models [ SychoBlock_DXC.h, SychoBlock_alu.h, DXC_Block_ALU.h, SychoBlock_shifter.h ], and simulation driver code [ testdxc.cc, DXC_Instr.h, DXC_Instr.cc ] are provided for reference.

VLSI

I implemented a subset of the DXC design in custom layout targeting the TSMC 0.25μm process. I obtained a library of 32-bit datapath modules, including an ALU, shifter, muxes, and latches, and used these to construct the cluster datapath. The register file, instruction latches, and control logic were not included in the layout. I instantiated an array of 16 clusters to form the DXC, and routed the lane and do-across transport busses. The placement, routing, and labeling for the cluster and DXC layout was accomplished using C++ code and the Spongepaint procedural layout library.

The datapath uses a bit-slice layout strategy in which each bit has a height of 54λ, enough space for 6 metal-3 routing tracks. Figure 4 shows 3 bits of a 4-input mux as an example. The datapath cells use metal-1 and metal-2 for internal routing, and metal-3 is used to connect the datapath modules together. To simplify the routing task, I extended the datapath modules with input and output "landing pads". As shown in Figure 4, these are strips of metal-2 which extend across the height of the bit-slices so that a connection can be made to any metal-3 track by simply adding a via.

Figure 4: Mux4 cells.

Figure 5 shows the placement and routing track allocation strategy for the cluster datapath. The datapath cells shown represent a single bit-slice, and the routing above the cells shows the track allocation. Connections between neighboring cells can use a "sneak path" and avoid using a metal-3 track. Figure 6 shows the placed and routed cluster layout with different sub-figures for each metal layer. Note that this layout is rotated 90 degrees to the right compared to Figure 5. Figure 7 shows a floorplan of the entire DXC array of clusters together with the lane and do-across transport routing. The dimensions of a cluster are 2022λ x 3675λ, or 0.25mm x 0.46 mm; the entire DXC is 1.0 mm x 1.8 mm. Each cluster contains 10,328 transistors, for a total of 165,248 transistors in the entire DXC.

Figure 5: Cluster datapath module placement and routing track allocation.


(a) Top-level	(b) Cells	(c) Metal-5	(d) Metal-4	(e) Metal-3	(f) Metal-2	(g) Metal-1	(h) Transistors

Figure 6: Cluster layout.

Figure 7: DXC layout.

Due to the limited number of metal-3 tracks in the datapath cells, the four lane transport busses use metal-5, as shown in Figure 5. To make these input and output connections, a short segment of a metal-3 track is used to jump up to metal-4 and create a landing pad directly above the metal-2 landing pad for the datapath cell. Then, any metal 5 track can connect to the pad by adding a via. Figure 8 shows the connections between the lane transport busses and the input mux. One connection for each of the four lane transport bus is highlighted in red (in four adjacent bit slices).

The do-across transport busses between neighboring clusters use metal-4. The connections between Lane 0 and 1, Lane 1 and 2, and Lane 2 and 3, all use the same 32 metal-4 tracks; and the connection from Lane 3 back to Lane 0 uses an additional 32 metal-4 tracks. As shown in Figure 5, the incoming do-across bus to a cluster connects down to metal-3 on track 0 (the left-most track in Figure 7). The outgoing do-across bus connects to the appropriate lane transport bus on metal-5; the lane-transport busses use tracks 2 through 5 so that the incoming and outgoing do-across busses can use the same metal-4 tracks without conflicting. An example section of the do-across bus ripping is shown in Figure 9. In this figure, the incoming do-across bus from the left connects to metal-3 on track 0, and the outgoing do-across bus connects to metal-5 on track 3.

Figure 8: Lane transport bus connections to input mux.

Figure 9: Do-across transport bus rip.

The C++ code [ SP_build_cluster.cc, SP_build_lane.cc, SP_build_dxc.cc ] used along with the Spongepaint library to generate layout for the cluster, lane, and DXC, is provided for reference.

Testing

The behavioral DXC simulator takes a program text file as input and loads it into the simulated cluster instruction queues to drive the simulation. The programs are written in a simple ASCII assembly code format. Each instruction specifies a cluster number and an instruction type. Three types of instructions are defined: vector instructions execute on the specified cluster in every lane, dox-begin (do-across begin) instructions execute on the specified cluster in only the last lane (Lane 3), and dox-end (do-across end) instructions execute on the specified cluster in only the first lane (Lane 0). All instructions specify either an ALU or shifter operation to perform. Instructions specify one or two sources which can come from the register file, an instruction immediate, the X stage bypass latch, another cluster in the same lane, or the previous cluster in the do-across network. Instructions also specify zero, one, or two outputs destinations which can be a local register, another cluster in the same lane, or the following cluster in the do-across network. Four test programs are provided for reference: basic.txt tests ALU and shift operations on cluster 0, lane.txt tests lane transport operations, xadd.txt tests do-across transport operations, and pascal.txt uses both do-across and lane transport operations to compute diagonals in a Pascal triangle.

The test programs were run using the behavioral DXC simulator. Correct behavior was verified by analyzing a dump of all the register values at the end of each simulation. These simulations produce VCD (value-change dump) trace output which tracks the values on all the nets in the behavioral DXC model for every simulated clock phase (twice per cycle). The VCD output can be displayed in a waveform viewer, and examples are shown in Figure 10 for the xadd.txt program, and in Figure 11 for the pascal.txt program. Figure 10 shows the send_done and recv_done control signals and the output of the X state result latch for cluster 0 on all the lanes. The computation can be seen to ripple around the do-across network as the instructions' dataflow dependencies resolve dynamically. Figure 11 shows the same signals for all the clusters. The results show that each row of clusters in the DXC compute a diagonal of Pascal's triangle ({1, 1, 1, 1, ...} on cluster 0, {1, 2, 3, 4, ...} on cluster 1, {1, 3, 6, 10, ...} on cluster 2, and {1, 4, 10, 20, ...} on cluster 3).

Figure 10: Sychosys behavioral simulation test results for xadd.txt test program.

Figure 11: Sychosys behavioral simulation test results for pascal.txt test program.

To test the layout, a spice netlist was extracted from the final layout representation. A script was used to convert the VCD trace from the behavioral simulator into stimulus files to drive the external inputs of the extracted layout, and to check that all the internal nets had the correct values every cycle. The NanoSim tool from Synopsys was used to run this simulation, and it flagged any differences between the behavior of the extracted layout and the expected net values. NanoSim also generated a trace of the net values which could be compared to the behavioral simulation. Figure 12 shows an example for the pascal.txt program, and the values can be verified to match those of Figure 11. The send_done and recv_done control signals are not shown because the extracted layout does not include any control logic.

Figure 12: NanoSim extracted netlist simulation test results for pascal.txt test program.

Using this infrastructure, the extracted layout of the ALU and shifter were verified by running several thousand cycles of directed tests. Both the extracted layout for a single cluster and the extracted layout for the entire DXC were verified by running the four test programs listed above. For all test cases, the behavior of the layout matches the behavioral model.

Acknowledgments

Chris Batten developed the Spongepaint layout manipulation toolset.
Seongmoo Heo designed and layed out the datapath modules.