FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks

Sheng-Chun Kao
Georgia Institute of Technology
Atlanta, USA
felix@gatech.edu

Suvinay Subramanian
Google
Mountain View, USA
suvinay@google.com

Gaurav Agrawal
Microsoft
Seattle, USA
gaagrawal@microsoft.com

Amir Yazdanbakhsh
Google Research, Brain Team
Mountain View, USA
ayazdan@google.com

Tushar Krishna
Georgia Institute of Technology
Atlanta, USA
tushar@ece.gatech.edu

ABSTRACT
Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adaptation comes at the cost of prohibitively large memory requirements and computational complexity, especially at higher number of input elements. This limitation is due to inherently limited data reuse opportunities and quadratic growth in memory footprints, leading to severe memory-boundness and limited scalability of input elements. This work addresses these challenges by devising a tailored dataflow optimization, called FLAT, for attention mechanisms without altering their functionality. This dataflow processes costly attention operations through a unique fusion mechanism, transforming the memory footprint quadratic growth to merely a linear one. To realize the full potential of this bespoke mechanism, we propose a tiling approach to enhance the data reuse across attention operations. Our method both mitigates the off-chip bandwidth bottleneck as well as reduces the on-chip memory requirement. FLAT delivers 1.94× (1.76×) speedup and 49% and (42%) of energy savings compared to the state-of-the-art Edge (Cloud) accelerators with no customized dataflow optimization. When on-chip resources are scarce (20 KB-200 KB), FLAT yields, on average, 1.5× end-to-end latency reduction across a diverse range of conventional attention-based models with input sequence lengths ranging from 512-token to 64K-token. Our evaluations demonstrate that state-of-the-art DNN dataflows applied to attention operations reach the efficiency limit for inputs above 512 elements. In contrast, FLAT unblocks transformer models for inputs with up to 64K elements.

CCS CONCEPTS
• Computer systems organization → Dataflow architectures.

KEYWORDS
Transformer, Attention, Dataflow, DNN Accelerators

1 INTRODUCTION
Attention mechanisms, the key building block of transformer models, have enabled state-of-the-art results across a wide range of machine learning (ML) tasks—from natural language processing (NLP) [17, 53, 90], to object detection [6, 82, 111], image classification [28, 100, 108, 110], image generation [7, 21, 66], and music synthesis [34, 35].

This exponential growth of transformer models are expected to serve as the foundation of a new bread of machine learning models in the upcoming years. A key attribute of attention-based models is the sequence length (\(N\)) defining the number of input elements for which a pairwise correlation scores is computed. Intuitively, increasing sequence length enables the attention-based models to better capture the context of input sentences or the relation between image segments. The demand for leveraging long-sequence (e.g. \(N = 8K\) to \(N = 69K\)) attention-based models has already emerged in ML communities [87], beyond natural language understanding [70] into protein folding [14] and text summarization [47] and audio generation [57]. Employing long sequences is pivotal in these algorithms because the property of input emerges from the global context. For example, two proteins may look identical if we examine identical sequence fragments, but when the entire sequence is considered, the differences in their function arise. We observe an analogous phenomenon in text summarization, where context can drastically alter the meaning of the selected text subset. In that instance, the subset represents a shorter sequence while the entire context refers to the full-length one.

Compared to existing neural network accelerators [9, 11, 20, 67, 78, 105], architecting accelerators for attention-based models poses different design challenges, attributed to their soaring demand for on-chip memory and compute complexities. Recent accelerators for attention-based models [30, 31] have mainly relied on algorithmic optimizations, often with negative repercussion on model accuracy. Algorithmic techniques in practice include sparsification or compression [5, 12–14, 16, 17, 43, 47, 56, 66, 68, 70, 73, 80, 86, 94, 104] and/or leveraging lossy approximation [30, 31, 93].

In this work, we identify that the conventional dataflow/mapping methods for CONV and FC layers [9, 20, 40, 63] are inadequate for
attention layers. This is because the main operators within attention layers exhibit distinct compute and memory characteristics posing notable bottlenecks on off-chip memory bandwidth compared to CONV and FC. We identify the following challenges in devising dataflow optimizations for attention layers:

(1) Significantly low operational intensity. Inherently low data reuse in activation-activation operators significantly reduces the operational density of such operators in attention layers. This inherently low operational density subsequently makes the activation-activation operators fundamentally memory-bound. While prior work on intra-operator dataflow optimization, such as loop transformation and scheduling techniques \([11, 20, 49, 51, 63, 65]\), targets CONV and batched FC operators by leveraging the ample intrinsic data reuse, which are not well-suited for activation-activation operators lacking data reuse opportunity.

(2) Complex many-to-many operators. The main attention operators have many-to-many relation, obscuring opportunities to use operator fusion \([62]\) in attention operators. That is because operator fusion in conventional ML compilers \([8, 27, 72]\) mainly target operations such as CONV and FC with one-to-one (i.e. element-wise) relation.

(3) Prohibitively large intermediate tensors. The size of intermediate tensors in attention layers grows quadratically—\(\mathcal{O}(N^2)\) \([13, 14, 17, 43, 90, 94]\)—with the sequence length. This quadratic growth imposes a significant pressure on on-chip memory capacity and prohibits opportunities to store the intermediate results on-chip and improve the compute utilization, a common practice in CNN accelerators \([9]\).

This paper fundamentally tackles the challenges associated with attention layers by devising a first in its class many-to-many inter-operator dataflow optimization mechanism, called Fused Logit Attention Tiling. This optimization particularly fuses multiple many-to-many tensor operator, while systematically preserving their inter-operator data dependencies, leading to a significant reduction on off-chip memory bandwidth pressure. In addition, to fully realizing the performance benefit of this inter-operator fusing mechanism, FLAT performs a new tiling approach across the fused operators. This tiling enables efficient staging of quadratically growing intermediate tensors of attention operations on tight-budgeted on-chip memories, leading to higher performance and energy savings and elevates the scalability of transformer models up to 64K inputs. These benefits are unlocked with only modest hardware changes, integrating into a platform deployable on off-the-shelf DNN accelerators. In summary, this paper presents the following specific contributions for attention-based models:

- We systematically study the operational intensity of different operators within attention layers and characterize the fundamental roadblocks imposed by limited hardware resources to improve the overall realized performance of attention accelerators (§3).
- Based on the resulting findings, we explore fusion opportunities between different operators in attention layers and justify our proposed many-to-many inter-operator fusion (§4). While beneficial, this fusion inflicts a fundamental challenge of preserving the inter-operator data dependencies, imposed by the softmax operation. To address, we expound our tailored dataflow optimization approach for attention layers, enabling higher data reuse of the quadratically growing intermediate attention tensors from low-capacity but high-bandwidth on-chip memory. We show that this dataflow optimization efficiently mitigates the pressure on off-chip memory bandwidth, leading to a higher performance and energy efficiency in accelerators (§5).
- We develop a map-space exploration framework to efficiently search for optimal loop orders across fused operators and tiling sizes. This framework optimizes performance metrics of interest subject to different hardware resource constraints, such as number of processing elements and on-chip memory capacity (§6).

We evaluate FLAT on a variety of Attention-based models, including BERT \([90]\), T	extsc{xl} \([53]\), Flau	extsc{bert} \([54]\), T5 \([71]\), and XLM \([53]\), for both Edge and Cloud accelerators. Compared to a range of state-of-the-art dataflow optimizers, FLAT delivers 1.75\times and 1.65\times speedup and 44\% and 55\% energy savings for recent Edge and Cloud accelerators, respectively. When on-chip resources are scarce (in the order of 10KB-100KB), FLAT yields 1.5x end-to-end latency reduction and 1.4x end-to-end energy savings for the Attention-models with conventionally-sized input sequence length (512-token). Furthermore, our results show that while the conventional DNN dataflow optimizers for attention operations bumps up against the efficiency limit for inputs above 512 tokens, our dataflow optimization tailored for attention operations unblocks the scalability of transformer models for significantly larger input sizes, up to 64K tokens.

2 BACKGROUND

Terminology. Transformer models \([54, 83, 90]\) generally share similar architectures. In a top-down view (Fig. 1), an attention-based “model” comprises multiple (often identically parameterized) attention “blocks”\(^4\). An attention block comprises multiple “layers”; an attention layer, a normalization layer, followed by multiple (typically two) fully connected layers. Finally, each layer comprises one or more “operations” or “operators”.

Computation operators. The attention mechanism measures how closely two tokens are related in an input sequence. Each token in the input sequence is represented as a vector of dimension \(D\), each sequence has \(N\) tokens, and an input batch to an attention layer comprises a batch of \(B\) sequences; thus the input to the attention layer is a tensor of dimension \([B,N,D]\). Fig. 1 (bottom) highlights the flow of a single token vector (of dimension \(D\)) through the attention mechanism. Step ① three vectors are derived for each token’s vector in the input: called Key (K), Query (Q), Value (V). This is achieved by multiplying the input tensor with learnable weight matrices. Attention mechanisms often use multiple heads to generate \(H\) such K, Q, V vectors for each token vector in the input. Thus each of K, Q, V generates a tensor of dimension \([B,H,N,d]\). Step ②, we compute the logit score (L) which captures how strongly each token is related to each other token in the sequence. This is done by a \((d\text{-dim})\) dot product of each vector of the key with the corresponding vector of the query. Each dot product yields a single scalar score, but this score is computed for each token against all other tokens in the sequence yielding an output tensor of dimension \([B,H,N,N]\). Step ③, the logit scores then needs to be normalized. While there are several normalization functions, they all share key traits. The normalization

\(^4\)Models may include other blocks: an embedding block with positional encoding and masking, and a few task-specific FC or CONV layers.
Fig. 1: The structure of attention-based models. Green matrix notation shows the size of weight tensors; black matrix notation shows the size of activation tensors; Softmax is applied on output of Logit.

is carried out across a row of N logits scores in each sequence. To generate a row of N normalized scores, the normalization operator reads N input scores, performs a reduction of these N scores, and scales each of the N input scores by the reduced value. We use softmax since it is the most commonly used normalization function.

Step 2 involves performing a weighted sum of the value vectors with the corresponding weights from the logits, which yields an output tensor $[B, H, N, d]$. Finally, step 3 concatenates the attention outputs from the $H$ heads and with its weight matrix computes a output tensor $[B, N, D]$. These complete an attention layer.

This can be represented succinctly as the computational graph in Fig. 1, comprising the following operators: i) Query (Q), Key (K), and Value (V) operators that perform a projection of the input tensor, ii) Logit (L) and Attend (A) operators that compute the logits scores and weighted-sum of values respectively, and iii) Output (O) operator that performs an output projection. We categorize them into two: (i) activation-weight operators (Q, K, V, O), which operate on activation tensors (from previous operators) and weight tensors (model parameters) and perform a GEMM computation as conventional fully connected operators (FCs), and ii) activation-activation operators (L, A), which operate on two activations from different previous operators and perform a GEMM computation. The L and A operators often dominate the latency and power consumption while running the model [36], and even more so at long sequence lengths, as shown in our evaluations (Fig. 13).

### 2.1 DNN Accelerators - Performance

We consider spatial DNN accelerators [9, 26, 49] in this work (Fig. 7 provides more details). We discuss the key factors that determine the realized performance when running a DNN model on an accelerator with specific hardware resources (PEs, on-chip memory size, and off-chip memory BW).

**Operational intensity | Roeline performance.** Operational intensity is a proxy metric to gauge the maximum possible performance of an individual operator given a set of hardware resources. The operational intensity of an operator is defined as the number of arithmetic operations divided by the number of memory accesses. A lower operational intensity implies an operator has fewer opportunities for data reuse and is more likely to be memory bandwidth (BW)-bounded. This directly decides the roeline (or best achievable) performance of the operator on the underlying accelerator.

\[
I = \frac{\text{# of Operations}}{\text{# of Memory Accesses}}
\]  

**Dataflow | Realized performance.** Dataflow refers to the mechanisms to stage data from the off-chip memory through the on-chip memory hierarchy to the compute PEs, over space and time [49]. It determines the actual achieved performance. Since memory access is often the bottleneck in executing DNN operators [34], the dataflow exposes data reuse opportunities across operands that can be exploited in hardware via buffering and data forwarding/broadcast [49]. Formally, the dataflow encompasses: (i) tiling (how tensors are sliced, stored and fetched across the memory hierarchy), (ii) compute order (order in which loop iterations are performed), and (iii) parallelism (how compute is mapped across PEs spatially). The dataflow along with specific tile sizes is often called a mapping [49, 51].

### 3 CHALLENGES WITH ATTENTION LAYERS

#### 3.1 Challenge 1: Low Operational Intensity of L/A

**Activation-Weight operators (Q/K/V/O).** Following the notation in Fig. 1, the number of operations in these operators is $O(BND^2)$. The number of memory accesses for the input (activations), weight (parameters), and output (activations) tensors are $O(BND), O(D^2)$, $O(BND)$, respectively. Therefore the operational intensity is $O(\frac{BND^2}{BND+D+BN})$. We see that increasing the batch size (B) can increase the operational intensity—the same weight value can be reused by multiple activations, leading to lower BW pressure. This is a typical technique used in activation-weight operators (e.g., CONV and FC, the staple in most DNN models) as it makes better use of the scarce memory bandwidth in accelerators and enables higher utilization of the provisioned compute FLOPs, leading to improved throughput. The specific mechanism to exploit the operational intensity is called dataflow [9, 49].

**Activation-Activation operators (L/A).** For L and A operators, the number of operations is $O(BN^2D)$. The number of memory access for the two input-activations and the output-activations are $O(BND), O(BND), O(BN^2D)$, respectively. Therefore the operational intensity is $O(\frac{BND^2}{2BND+BN})$. Embedding size (D) is decided by the model, and sequence length (N) is decided by the application. Furthermore, multi-head attention is an often-used variant of the attention mechanism: it leads to higher accuracy in many tasks [90]. It splits the output of the Q/K/V operator along a hidden dimension, reshaping it from size $[N, D]$ to $[H, N, d]$, where $d=D/H$. The operational intensity of L, A becomes $O(\frac{BND^2}{2BND+BN})$. For these operators, one can not simply increase the batch size to increase the operational intensity.
intensity increases and can become compute-bound. This demonstrates why batching is a popular technique for FC layers. In contrast, L and A operators sit at memory-bound and low-performance region, and batch size increase is not effective in these operators (Fig. 3.1).

Low operational intensity of the L/A operators makes them fundamentally memory-bound, and any dataflow/mapping exploration at the individual operator level cannot further improve performance.

### 3.2 Challenge 2: Complexity of Op Fusion for L/A

Given the low operational intensity for L/A operators, fusion is an attractive technique to stage the intermediate tensor data on-chip and leverage the higher on-chip memory bandwidth. Operator fusion is an optimization that schedules back-to-back operators together such that the producer’s output directly feeds the consumer; thus avoiding materialization of full intermediate tensor in memory. By avoiding off-chip data movement of the intermediate tensor, we can use the higher on-chip bandwidth to enable improved performance for the fused operator (as opposed to executing the operators individually).

When exploring operation fusion opportunities, we can either fuse among Element-wise Operators (E.O.) or Tensor-wise Operators (T.O.), as shown in Fig. 3. Element-Element fusion (E-E) is the simplest fusion optimization. With increased interest in operation fusion, more ML compilers/frameworks today support Tensor-Element (T-E) or Element-Tensor (E-T) fusion [4, 8, 27, 48, 72] where MatMul operators (i.e., CONV or FC) are often fused with element-wise operators (such as ReLu or Add), reshapes, or shuffling operators [62]. However, Tensor-Tensor fusion (T-T) is not done automatically. The key reason is that T.O. is a many-to-many operator (Fig. 3(a)). While it is often straightforward and a simple engineering exercise to fuse a T.O. with one-to-one operator like E.O., how to fuse many-to-many with other many-to-many and whether it is beneficial to fuse them is still a research question [62]. Indeed, DNNFusion [62] which studied T-T as recently as PLDI 2021, reported it to be either too complicated or unprofitable. The key reason is that the additional complexity to maintain dependence and stage data (grey intermediate data as shown in Fig. 3(b)) could end up negatively impacting register and cache usage [62].

Some previous research papers [2, 97] have discussed Tensor-Fusion in CNNs with the fusion pattern T-(one-to-one)-T such as CONV-Relu-CONV and shown huge potential gain with a well-designed inter-operator dataflow for Tensor-Fusion. This is highlighted in Fig. 3(c). However, Tensor-Fusion for attention-based models has not been explored to date, to the best of our knowledge. The complexity of its fusion pattern, T-(many-to-many)-T, makes it much more challenging than for CNNs.

To address this challenge, we design a specialized inter-operator dataflow that not only considers the data dependency of two large tensor operators but also tackles the complex data dependency incurred by the many-to-many intermediate activation (§5.2).

### 3.3 Challenge 3: Tensor Footprint of L/A

There is one other challenge that is unique to L/A when considering Tensor-Fusion—namely a quadratic intermediate tensor footprint. From Fig. 1 we can calculate the intermediate tensor between L...
Table 1: Intermediate tensor size between $L$ and $A$ operators in BERT-(base) [90], TrXL-(large) [17], and XLM(xlm-mem)-en) [53].

<table>
<thead>
<tr>
<th>Seq len:512</th>
<th>Seq len:1K</th>
<th>Seq len:16K</th>
</tr>
</thead>
<tbody>
<tr>
<td>BERT</td>
<td>TrXL</td>
<td>XLM</td>
</tr>
<tr>
<td>Seq len:512</td>
<td>10 (MB)</td>
<td>10 (MB)</td>
</tr>
<tr>
<td>Seq len:1K</td>
<td>136 (MB)</td>
<td>136 (MB)</td>
</tr>
<tr>
<td>Seq len:16K</td>
<td>8.2 (GB)</td>
<td>8.2 (GB)</td>
</tr>
</tbody>
</table>

Fig. 4: Potential of Tensor-Tensor Fusion. The operation intensity ($Z$) of single and fused operators in attention layers in TrXL-(large) [17] using batch size=1. The notations are from Fig. 1. F(X, Y) implies fusion between X and Y operator. The red dashed bar indicates the minimum operation intensity for the operator to become compute-bound. FC$_1$ and FC$_2$ indicate operator K/Q/V/O and FF1/FF2, respectively. F(L/A, FC$_1$) specifies fusion between “A” and “O”, whereas F(FC$_1$, L/A) expresses the fusion between “Q-L”, “V-A”, and “K-L”.

Fig. 5
Size
Sequence Length = 512
Sequence Length = 1K
Sequence Length = 2K

4.2 Challenges with Tensor Fusion
Fusing $L$ and $A$ operators introduces two key challenges that we discuss here. §5 presents implementation details.

Challenge 1: Respecting data dependencies across operators. Fusing $L$ and $A$ causes its unique challenge of data dependency owing to the many-to-many Softmax operation between them (§3.2). Softmax requires a reduction along a specific dimension of the tensor before scaling individual elements. Arbitrary inter-loop tiling as employed by prior CONV/FC fusion techniques [2, 97] violates this data dependency constraint.

Challenge 2: Effectively handling large intermediate tensors that do not fit in on-chip memory. Recall that the intermediate tensor between $L$ and $A$ has size $O(BHN^2)$. As discussed in §3.3, this can easily exceed the on-chip memory capacity of DNN accelerators. Further, the specific size of the on-chip memory may be highly variable across different accelerators. Owing to the above challenges, conventionally, we often do not apply tensor fusion to attention layer and stick to operator-by-operator operation scheme, as shown in Fig. 6(a). In this work, we use FLAT to enable L-A fusion operation scheme, as in Fig. 6(b).
Fig. 5: For loop of fused L-Softmax-A (or shortened as L-A in the paper) and the choice of granularity.

( interleaved), and iterate through the shared outer-loop. Considerations for tile sizes to address the data dependence and on-chip memory constraints (§4.2) are discussed in this section.

5.1 FLAT-Tile and Execution Granularity

FLAT employs two levels of tiling: intra-operator tiling and inter-operator tiling. We name each tile in inter-operator tiling, a FLAT-tile. FLAT computes FLAT-tile activations from \( \mathbf{L} \) and feeds it through Softmax and to a FLAT-tiles, the inner-loop in Fig. 5, essentially specifies how many slices of the partial intermediate tensor are calculated in one pass of the fused-operator in Fig. 6(b). The minimum granularity of the FLAT-tile is determined by the data dependence constraint of Softmax and called row-granularity (discussed in §5.2), for effectively collecting a group/tile of (input data) that fulfills the Many-to-Many dependency pattern of Softmax. We progressively build larger (coarser-grain) tiles, namely, tiling multiple number of rows at a time (\( R_{x} \)), multiple number of heads (\( H_{x} \)), and finally, multiple number of (micro-)batches (\( B_{y} \)) in the tile. We refer to these as Row (R-Gran), Head (H-Gran), and Batch (B-Gran) granularity respectively (discussed in §5.3). Further, the most intuitive baseline of moving the entire intermediate tensor (namely the entire output of \( \mathbf{L} \)) on-chip is referred to as Batch-Multi-Head granularity (M-Gran), as shown in Fig. 5.

5.2 Constraints from Data Dependence

Basic execution unit → "Row-granularity". The Softmax reduction is along the key dimension: this effectively captures the relative weight of each token in the query sequence against other tokens in the key sequence. The minimum Softmax execution requires an [1, N] input array, which in turn requires a query of [1, D] and a key of [D, N], as illustrated in Fig. 1 Step 2 and Step 3. This forms our basic tiling unit (finest granularity)—row-granularity, which respects the data dependency introduced by the Softmax while keeping minimum number of elements to pass between L and A. FLAT restricts the tile sizes to operate in multiples of this row-granularity.

5.3 Constraints from On-Chip Memory

M-Gran, B-Gran, H-Gran: Leveraging Limited Reuse of \( f(\mathbf{L}, \mathbf{A}) \). Coarser granularities require staging larger tiles in the on-chip memory. As sequence lengths increase this can increase rapidly (recall the \( O(N^3) \) growth). To fit into the limited on-chip memory, one may target finer granularities, e.g., moving from M-Gran to B-Gran (i.e., effectively tiling micro-batches). In general, while this helps reduce the size of the tile, when we are tiling two operators at finer granularity at the outer-loop, we may trade-off the reuse opportunity at the inner-loops. For example, for \( f(\mathbf{FC}, \mathbf{FC}) \) and \( f(\mathbf{CONV}, \mathbf{CONV}) \), when decreasing the batch size (i.e., micro-batching), we directly reduce the number of times a weight can be reused. The weights need to be re-fetched again and again for each micro-batch. This effect is exacerbated when considering finer granularities such as H-Gran for the weight-activation K/Q/V/O operators. The reduced reuse opportunity by inter-operator tiling reduces the achievable performance, even though the fused operator has large operational intensity (Fig. 4). In contrast, L and A are activation-activation operations (Fig. 3.1). Each new activation of \( L \) needs to compute with a new activation of \( A \), i.e., there are no reuse opportunities at the algorithmic level. Decreasing the tiling granularity (M-Gran to B-Gran to H-Gran), does not preclude any reuse opportunity, since there are no reuse opportunities at the algorithmic level. Thus, the finer M-Gran, B-Gran, H-Gran in FLAT are well-suited for \( f(\mathbf{L}, \mathbf{A}) \).

R-Gran: Extreme Large Sequence Range. To enable very long sequence lengths [14, 47, 70], but with limited on-chip memory resources [41, 89], we need to tile at even finer granularity, namely R-Gran. However, finer granularities come with an associated trade-off: when we reduce the number of rows (\( R_{x} \)), we will also reduce the reuse opportunity in the matrix multiplication itself. For example, even for L/A fusion, using fewer rows means the same key vectors need to be fetched multiple times across the interleaved cross-operator outer loops. Further, reducing number of rows at the outer-loop could also decrease the achievable performance at the inner-loop, e.g., not enough dimension size to fully utilize PE array. Thus, FLAT co-explores inter-operator (optimizing the outer-loop) and intra-operator dataflow (optimizing the inner-loop) to mitigate these potential sources of inefficiencies.

On-chip buffer requirement. Table 2 lists the required on-chip buffer size using FLAT. We derive the R-Gran value here (others follow similar reasoning). L operator consumes (Rd+Nd)x2 size of the on-chip buffer (2 to account for double buffering), and A consumes (Nd+Rd)x2. RN for buffering the intermediate tensor (FLAT-tile) (no double buffering since it does not interact with off-chip memory), whose on-chip buffer requirement is shown in Table 2.

HW support to implement FLAT. FLAT requires minimal HW support: (1) controller to recognize the proposed fine-grained dataflow and (2) on-chip buffer to be software-addressable to support tiling. These features are supported by most recent industry and academic accelerators [3, 41, 78, 89].

6 EVALUATION METHODOLOGY

Accelerator modeling methodology. We developed a detailed analytical cost model to estimate the performance and energy consumption of FLAT across a range of hardware accelerators configurations, following similar methodology as prior work [51, 65]. We
Conventionally, a tile of L is staged on-chip. Softmax directly works with staged data. All data dependency from L, S, and A is followed.

**Fig. 6:** (a) Baseline and (b) FLAT dataflow. FLAT performs inter-operator fusion of L, A while respecting data dependencies introduced by Softmax. FLAT tile enables staging slices of the logits tensor in the on-chip scratchpad increasing effective memory bandwidth. This fused, interleaved execution of L, A yields higher compute utilization and improved performance.

Fig. 7: Map space exploration framework. (Special Function Unit: for computing non-linear operations, e.g., softmax, activation.)

meticulously model the major microarchitectural blocks commonly shared by most DNN accelerators as outlined in Figure 7. Based on this model, we collect relevant architectural details, which are used to compute the accelerator performance metrics.

**Compute model.** We model the compute array as a collection of processing elements with configurable bandwidth from/to the global on-chip buffers. The compute array model supports common intra-operator dataflow, including weight, input, and output stationary. In addition, we model various data distribution and reduction NoCs, including systolic, tree, or crossbar structures to study the the trade-offs between compute bandwidth and distribution-collection time [51, 52]. Following this methodology, we model TPU [40] (systolic-array) as well as other spatial array accelerators, such as Eyeriss v2 [11] and MAERI [52]). We also carefully model the overhead of switching tiles for filling and draining data to reflect the cold start and tailing effect. Finally, we account for softmax operation runtime in all the evaluations.

**Buffering model.** Studying dataflow optimization techniques demand for a detailed modeling of buffers. To achieve this objective, we model PE arrays with local scratchpad for input, weight, intermediate results, and output storage. We add the on-chip global buffer to store the intra- and inter-operator tiles. The performance model also includes the data spilling overhead. That is when the live memory footprint (buffer requirement for staging data on-chip) is larger than the on-chip global buffer capacity.

**Memory bandwidth.** Since there are multiple microarchitectural units that access the on-chip and off-chip memories, we model them as limited bandwidth shared-hardware resources. That is, if the access rate to a shared memory resource exceeds a pre-defined bandwidth, the data accesses are throttled. This overhead manifests as longer runtime. A key feature of our simulation methodology is the detailed modeling of the accelerator memory hierarchy to systematically assess the memory-boundness of attention operators and their pressure on off-chip memory bandwidth.

**Energy model.** Collecting the detailed activity counts from the analytical model, we use Accelergy [101] framework to estimate the energy consumption for the major microarchitectural blocks. That includes compute, on-chip memory, and off-chip memory communications. Note that FLAT neither alters the total number of computations nor the total number of accesses to the on-chip global buffer. Instead, it optimizes the number of off-chip memory accesses, which is the major contributor to the overall accelerator energy consumption [9, 84].

**Map-space exploration workflow.** We also integrate a map-space exploration (MSE) workflow (Fig. 7) into our simulation framework. The main purpose of this exploration workflow is to carry out a search algorithm in a predefined map space governed by the cost model. In this work, we use exhaustive search to find the optimal design point uniformly across all the dataflow optimizations. MSE includes both intra- and inter-operator dataflow optimization space (enabling optimal dataflow comparisons with and without FLAT technique later in Table 4 and §7). The relevant architectural parameters for this optimization space are outlined in Fig. 7.
Table 3: The HW compute resource and BW configuration of Edge and Cloud platforms in the evaluation sections. The on-chip memory capacity is varied across explored design-points.

<table>
<thead>
<tr>
<th>Platform</th>
<th># of PEs</th>
<th>On-Chip BW</th>
<th>Off-Chip BW</th>
</tr>
</thead>
<tbody>
<tr>
<td>Edge</td>
<td>32×32</td>
<td>1 TB/Sec</td>
<td>50 GB/Sec</td>
</tr>
<tr>
<td>Cloud</td>
<td>256×256</td>
<td>8 TB/Sec</td>
<td>400 GB/Sec</td>
</tr>
</tbody>
</table>

Comparison to prior accelerator modeling tools. There are several popular open-sourced DNN accelerator modeling frameworks [18, 51, 60, 65, 75, 103]. However, none of them offer support for cross-layer performance (and reuse) modeling, assuming layer-by-layer execution. In contrast, our framework evaluates the performance of DNN models in both single-layer and cross-layer manner, enabling various cross-operator fusion studies. To ensure the integrity and correctness of our framework, we compared the simulation results from our framework under single-layer modeling to MAESTRO [51]. The performance metrics are within 1% difference to MAESTRO’s results.

Table 4: Comparisons dataflow configurations.

<table>
<thead>
<tr>
<th>Design Point</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Naive (Intra-Operator)</td>
<td>Intra-operator weight-stationary dataflow with fixed tile size, similar to [20, 26]</td>
</tr>
<tr>
<td>Flex (Intra-Operator)</td>
<td>We exhaustively search for optimal intra-operator dataflow, reflecting the optimal solution can be found in existing intra-operator mappers [1, 4, 62, 63]</td>
</tr>
<tr>
<td>Flat (Intra-/Inter-Operator)</td>
<td>We exhaustively search for optimal intra-operator as well as inter-operator dataflow</td>
</tr>
</tbody>
</table>

Table 5: Run time performance improvement of FLAT over Naive and Flex, using sequence length=512 and Edge platform compute + BW configurations (Table 3) with varying on-chip buffer sizes (200K, 20M and 2GB). FLAT shows its advantage when buffer sizes are limited.

<table>
<thead>
<tr>
<th>Run Time Improvement</th>
<th>L/A layer</th>
<th>End-to-End</th>
</tr>
</thead>
<tbody>
<tr>
<td>FLAT over Naive</td>
<td>1.7X</td>
<td>1.5X</td>
</tr>
<tr>
<td>FLAT over Flex</td>
<td>1.02X</td>
<td>1.02X</td>
</tr>
</tbody>
</table>

7 EVALUATION I: FLAT DATAFLOW EFFICACY

In this section, we fix the “headline” – HW resources (i.e., FLOPs and off-chip memory bandwidth) as outlined in Table 3 and sweep-and-explore other microarchitectural parameters relevant to the dataflow efficiency, including on-chip memory size and dataflow variations (Naive, Flex and FLAT as explained in Table 4). The on-chip memory size assesses the dataflow optimizations associated to the large intermediate tensor size in the attention layers. The goal of this section is demonstrate the benefits of FLAT across a range of hardware and dataflow configurations, without biasing to any specific design point.

Runtime performance. Table 5 shows the run time performance improvement of FLAT (FLAT-Opt) over Naive (the baseline dataflow without any optimizations) and Flex (Flex-Opt), under commonly observed on-chip buffer sizes and sequence length (512-token). We observe 1) providing huge on-chip buffer resources (2GB), FLAT with its fused operation and improved data-reuse can improves Naive and Flex; 2) at limited buffer resources, FLAT becomes handy owing to its reduced on-chip buffer footprint. For example, at 200KB, FLAT-Opt improves the current state-of-the-art baseline – Flex-Opt – by 1.7X on the focusing L-A operations, which contributes to 1.1X improvement of end-to-end performance. Note that owing to the quadratic complexity of L-A operations, L-A operations will become more dominant at larger sequence length. For e.g., in this evaluation with 512-token, L-A operation contributes only 10% of the computes to the end-to-end model; however, it increases to 47% and 78% at 4K-token and 16K-token. Next, we show the efficacy of FLAT under a sweep of buffer sizes, sequence length, and with perspectives of different granularity of the models (operations, layers, and end-to-end).

Sensitivity to sequence length. As described in Fig. 6, we use U$t_{Dataflow}$, a normalized run time performance metric to show the performance difference of different sequence lengths, buffer size, and model granularity in the same plot. As Sa–c shows, FLAT-Opt consistently outperforms Naive and Flex-Opt. Analyzing the results indicate that though tensor-tensor fusion seems to be complicated and deemed as non-profitable, FLAT can efficiently execute tensor-tensor fusion in attention layers and harvest the highest performance gains. In 8a–c, as the sequence length increases, the on-chip buffer requirement increases quickly (Table 2). Under this scenario, most
Fig. 8: Comparisons of compute utilization of BERT under Edge platform with different input sequence lengths. We sweep the size of available on-chip buffer from 20KB to 2GB. The figures demonstrate three different performance analysis, (first bar) L-A → focusing on performance difference at the L, A operators; (second bar) Block→ consider all operators in the attention block; and (third bar) Model → a model-wise (end-to-end) performance.

of the accelerator design points in Flex design space starts to hit the memory boundedness. However, applying the Flat technique, we can effectively reduce the memory requirement and thus providing a better scalability to sequence length. At the optimal design point (Flat-Opt) reaches nearly 1.0 compute utilization with 10x-100x less on-chip buffer requirement in Edge accelerator as shown in 8a-c, and 100x-1000x in Cloud accelerator as shown in 10a-c, a scarce and critical hardware resource for accelerators, compared to Flex-Opt. **Sensitivity to end-to-end performance.** So far, we analyze the performance for only L and A operators, while not considering other operators within the model. As shown in the first row of Fig. 8 from left to right, we observe that the effect of L/A operators are diluted as more operators are considered. In attention-based models, FC/GEMM and attention operators, namely L and A, are generally the most dominant computation. For FC/GEMM, the typical single (intra-)operator dataflow is often sufficient to reach a high compute utilization, and hence Flat-Opt and Flex-Opt performs equally well for these operators. As we can see, for the sequence length below 512, both Block-level and Model-level (i.e., End-to-End) performance is dominated by FC/GEMM operators. Therefore, the gains from Flex-Opt and Flat-Opt are immaterial. The significant gains from our approach emerge when the sequence length increases beyond 512 to 4K, 16K, and to 64K. Under these long-sequence lengths, the runtime contribution of L and A operators grows from 12% to 49%, 79%, and 94%, respectively. This increase causes our proposed Flat-Opt to outperform Flex-Opt significantly even in Block and Model level scenarios.

Fig. 9: Comparisons of energy consumption of BERT under Edge platform with different input sequence length. We normalize the energy consumption results to the largest value in each sub-plot. Each bar represents the same analysis as described in Figure 8.
Table 6: End-to-End speedup and energy-consumption ratio of Flat-Opt-E(edge) over Flex-Opt-E(edge) and Flat-Opt-C(cloud) over Flex-Opt-C(cloud) on different models.

<table>
<thead>
<tr>
<th>Edges</th>
<th>Geomean Speedup = 1.75X</th>
<th>Geomean Consumption Ratio = 0.56X</th>
</tr>
</thead>
<tbody>
<tr>
<td>Seq. Length</td>
<td>512</td>
<td>4K</td>
</tr>
<tr>
<td>BERT</td>
<td>1.02</td>
<td>1.27</td>
</tr>
<tr>
<td>TrXL</td>
<td>1.02</td>
<td>1.23</td>
</tr>
<tr>
<td>FlashBERT</td>
<td>1.81</td>
<td>1.11</td>
</tr>
<tr>
<td>T5</td>
<td>1.03</td>
<td>1.34</td>
</tr>
<tr>
<td>XLM</td>
<td>1.00</td>
<td>1.05</td>
</tr>
<tr>
<td>Average</td>
<td>1.02</td>
<td>1.20</td>
</tr>
</tbody>
</table>

Table 7: Flat on the TrEMBL protein sequencing dataset [15] on a target accelerator (Tesla T4 GPU [88]) with 16GB memory, using sequence length = 8K.

<table>
<thead>
<tr>
<th>Memory Req.(GB)</th>
<th>Number of Attention Blocks (Sequence Length=8K)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>1 2 3 4 5 6</td>
</tr>
<tr>
<td>Flat</td>
<td>4.6 9.1 13.7 18.2 -OOM -OOM</td>
</tr>
<tr>
<td></td>
<td>27.3</td>
</tr>
</tbody>
</table>

Table 8: Flat on the TrEMBL protein sequencing dataset [15] on a target accelerator (Tesla T4 GPU [88]) with 16GB memory, using sequence length = 16K.

<table>
<thead>
<tr>
<th>Memory Req.(GB)</th>
<th>Number of Attention Blocks (Sequence Length=16K)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>1 2 3 4 5 6</td>
</tr>
<tr>
<td>Flat</td>
<td>17.5 35 72.5 70.0 -OOM -OOM</td>
</tr>
<tr>
<td></td>
<td>87.5 109.0 -OOM</td>
</tr>
<tr>
<td></td>
<td>2.8 5.6 8.5 11.3 -OOM</td>
</tr>
<tr>
<td></td>
<td>14.1 16.9 -OOM</td>
</tr>
</tbody>
</table>

8 EVALUATION II: COMPARISON

This section contrasts the performance of specific accelerator design points with and without the Flat dataflow.

8.1 Cloud and Edge Accelerators

We start by selecting two specific hardware design points, namely a Cloud and an Edge accelerator, with headline HW resources that closely resemble a TPU-v3 [41] and Edge TPU [26, 77], as shown in Table 3. We fix the on-chip buffer capacity to 512KB [26] and 32MB [41] for Edge and Cloud accelerators, respectively. Analyzing these accelerators across different dataflow spaces (Table 4), namely Naïve, Flex, and Flat, forms a concrete and reasonably realistic accelerator design space. Similar to previous sections, we name the optimal accelerator design point in each design space: Naïve Edge, Flex-Opt-Edge, and Flat-Opt-Edge for edge accelerator, and Naïve-Cloud, Flex-Opt-Cloud, and Flat-Opt-Cloud for cloud accelerator, respectively.
Fig. 10: Comparisons of compute utilization of XLM under Cloud platform with different input sequence length. We sweep the size of available on-chip buffer from 20KB to 2GB. Each bar represents the same analysis as described in Figure 8.

Accelerator performance. As show in 13a, FLEX-Opt-Edge and FLAT-Opt-Edge share the same normalized runtime for K/Q/V/O and FF1/FF2. This similarity in performance is because in FLAT-Opt-Edge, both K/Q/V/O and FF1/FF2 are treated as non-fused operators, and hence the map space for them are the same as the one in FLEX-Opt-Edge. In edge accelerator, when the sequence length is 512, FLAT-Opt-Edge and FLEX-Opt-Edge both reach a near optimal performance. However, when the sequence length increases to 4K, 16K, and 64K, the performance gap between FLAT-Opt-Edge and FLEX-Opt-Edge widen. For example, at sequence length of 64K, FLAT-Opt-Edge runs 2.8X faster than FLEX-Opt-Edge, showing the efficiency of our dataflow optimization. In the cloud accelerator (Fig. 13b), the performance difference between FLAT-Opt-Cloud and FLEX-Opt-Cloud exaggerates even further. For example, at sequence length of 64K, FLAT-Opt-Cloud runs 3.07X faster than FLEX-Opt-Cloud. That is partly because of the larger model size for the cloud accelerator that enables FLAT-Opt-Cloud to better utilize the on-chip hardware resources.

Comparisons across different models. Table 6 compares the performance of different dataflow optimizations across various transformer models. Compared to FLEX-Opt-Edge, FLAT-Opt-Edge delivers 1.75X speedup in edge accelerator, while significantly reducing the energy consumption by 44%. In cloud accelerator, FLAT-Opt-Cloud achieves 1.65X speedup and 55% energy savings over FLEX-Opt-Cloud. These results show the broad application of FLAT in improving the performance of various attention-based models under different design constraints.

Memory bandwidth requirement. Effectively using a limited off-chip bandwidth is an critical factor in the scalability of the hardware accelerator. That is because most DNN operations are often memory-bound and the off-chip memory bandwidth is often shared across
different microarchitectural components in the system. In Fig. 14, we show the peak off-chip bandwidth requirement to achieve a compute utilization over 0.95 for L and A attention operators. The left hand side of the U-shape of Fig. 14 comes from the increase in the operational intensity and thus decrease of the bandwidth-boundness as sequence length increases (§3.1).

The right hand side of the U-shape of Fig. 14 is caused by the quadratic and linear increase of on-chip memory requirement as sequence length increases for Flex and Flat, respectively. On average, Flat-Opt-Cloud reduces the off-chip bandwidth requirement by 82% against Flex-Opt-Cloud. Similarly, when evaluated under the edge scenario running BERT, Flat-Opt-Edge achieves 71% reduction, on average, in the off-chip bandwidth requirement against Flex-Opt-Edge. In summary, we demonstrate that Flat with its advantage of lowering on-chip buffer footprint can improve attention performance under existing Edge [26] and Cloud [41] DNN accelerators.

### 8.2 Flat Compatibility with Other Accelerators

**Flat compatibility on GPU.** We implement and evaluate Flat on Nvidia Tesla-T4 [88] with 16GB memory (Fig. 6). We use the BERT-Edge model and perform two experiments: (1) We fix the sequence length to 256 and sweep the batch size (Table 9), and (2) we fix the batch size to one and sweep the sequence length (Table 10). Table 9 shows that Flat can run faster than baseline and supports larger batch sizes, whereas Table 10 demonstrates that Flat runs faster than baseline and supports up to 64K-word.

**Table 9: Runtime improvement of attention layer on Tesla-T4 [88] GPU under different batch sizes.**

<table>
<thead>
<tr>
<th>Runtime (ms)</th>
<th>Batch Size (Sequence Length=256)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>1  45  125 256 512 1024 2048 2K</td>
</tr>
<tr>
<td>Flat</td>
<td>28 480 1,870 3,740 7,560 14,016 26K</td>
</tr>
<tr>
<td></td>
<td>OOM</td>
</tr>
</tbody>
</table>

**Table 10: Runtime improvement of attention layer on Tesla-T4 [88] GPU under different sequence length.**

<table>
<thead>
<tr>
<th>Runtime (ms)</th>
<th>Sequence Length (Batch Size=1)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>12  74 697 OOM  OOM  OOM  OOM</td>
</tr>
<tr>
<td>Flat</td>
<td>11  83 175 424 4,599 64,359 OOM</td>
</tr>
</tbody>
</table>

**Flat compatibility with sparse-attention accelerators.** Recent sparse-attention accelerators such as ELSA [31] and Sanger [58]...
FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks

9 RELATED WORKS

Dataflow and mapping. Most work on DNN hardware dataflow techniques focus on individual CONV [9, 18, 20, 23, 24, 38, 42, 59, 63, 65, 78, 81, 91, 92, 103, 105, 107], GEMM [40, 50, 98] operators, or loop reordering for transformer operations [69]. Some recent works consider fusion of multiple CONV operator [2, 97]. Andrei et al. [37] merely targets operation fusion between MatMul operators and element-wise operators. Fusing multiple heads of the attention operators [61, 64] primarily involves adding an additional loop over the H independent heads, which is already captured by Flex. They, however, do not explore dependent MatMul-Softmax-MatMul fusion, which is more complicated. FLAT targets such fusion and enables significantly higher performance.

Algorithmic optimization. Techniques such as quantization [45, 79, 106, 109], pruning [29, 56, 74, 93, 96, 104], and distillation [39, 76, 83, 95] are used for compressing Attention-based models. There are a large body of algorithmic changes to attention mechanism [5, 12, 66, 68, 80], learned sparsity [16, 47, 73, 86] low-rank and kernel methods [13, 14, 43, 94], and others [5, 17, 70]. These techniques impact model quality and are orthogonal to the ideas developed in this paper. FLAT can be leveraged in association with these techniques when deployed on DNN accelerators to further improve run time and energy.

Matrix-Matrix fusion accelerators. The core of Graph Neural Networks (GNNs) includes two consecutive matrix computations ("aggregate" and "combine"). GCNAX [55], GRIP [46], HyGCN [102] and others [25] form a matrix-matrix loop fusion dataflow to optimize the throughput and energy efficiency. There are different challenge and focus for matrix-matrix fusion in GNNs and attentions. 1) The dataflows of GNN accelerators [25, 46, 55, 102] optimize matrix-matrix fusion (Fig. 3(c)-left column), whereas the dataflows of attention optimizes matrix-Softmax-matrix fusion (Fig. 3(c)-right column). 2) GNN includes one activation-weight and one activation-activation matrix computation, whereas attention has both matrix computations as activation-activation. 3) The key challenge of GNN is the sparsity in the "aggregate" matrix, whereas attention is challenged by the quadratic complexity of intermediate activation matrix between two matrix-multiplies.

Attention accelerators. A³ [30] and ELSA [31] propose dedicated attention accelerators and leverage approximate computation to accelerate attention layers. Sanger [58] uses quantized query and key to predict attention matrix and rearranges the sparse attention matrix for better utilization. These technique trade-off performance with model quality. FLAT, by contrast, does not impact model quality, and is a generic yet efficient dataflow technique that can be leveraged on most existing accelerators. SM6 [85] is an attention accelerator for RNN-based networks, which exposes different challenge and is orthogonal to this work.

Compiler optimizations. Fusion is a classic compiler technique [1, 4, 8, 19, 22, 27, 44, 48, 99]. However, machine learning compilers employ fusion in a limited fashion to fuse matrix operators with element-wise operators [62].

10 CONCLUSION

We identify that running attention-based models with long sequences is challenging because of low reuse in certain attention operators and quadratic growth of intermediate memory footprint, both of which compound memory bandwidth requirements. We propose FLAT, a novel dataflow for attention layers employing inter-operator fusion (the first work to investigate this for attention layers), interleaved execution, and efficient tiling to enhance the operational intensity and provide high compute utilization, reduced off-chip bandwidth requirements and scalability to long sequence lengths.

ACKNOWLEDGMENTS

We extend our gratitude towards Parthasarathy Ranganathan, Nischant Patil, James Laudon, Stella Aslibekyan, and extended Google Research, Brain Team for their invaluable feedback and comments.
We also thank Prasanth Chatarasi for feedback on early drafts. This work was supported in-part by NSF Award #1909900.

REFERENCES


Received 2022-07-07; accepted 2022-09-22