Reducing FPGA Algorithm Area by Avoiding Redundant Computation

Abstract

We develop a new paradigm for designing fully streaming, area-efficient FPGA implementations of common vision algorithm building blocks. By focusing on avoiding redundant computation, we achieve a one-to-two order of magnitude reduction in design area utilization compared to previous implementations. We demonstrate that our designs work in practice by fitting five high-resolution Harris corner detection cores, each running at 325 frames per second, onto a single FPGA.

Main Idea

Directly implementing algorithms designed for a CPU on an FPGA does not work well. The underlying issue is that on an FPGA one must worry about gate complexity in addition to time complexity. Algorithms with heavy branching (conditional statements such as if) may have low time complexity, but a synthesized circuit must instantiate the hardware for every branch arm and select among the results, so such algorithms require a very large circuit on an FPGA.
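As a concrete illustration, consider the following hypothetical HLS-style C++ fragment (our own sketch, not code from the paper). A CPU evaluates only one side of the branch per call, but a synthesis tool typically builds both datapaths in hardware and selects between them with a multiplexer, so each branch arm costs gates whether or not it is taken:

```cpp
#include <cstdint>

// Hypothetical HLS-style kernel. A CPU executes one branch arm
// per call; synthesized hardware typically instantiates BOTH
// datapaths and a 2:1 multiplexer driven by the condition.
uint32_t pick(uint32_t a, uint32_t b, bool cond) {
    if (cond) {
        return a * a + b;  // datapath 1: one multiplier, one adder
    } else {
        return a + b * b;  // datapath 2: another multiplier and adder
    }
    // Net area: two multipliers, two adders, one mux -- the branch
    // saves no hardware even though it would save time on a CPU.
}
```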

In this paper we examine two common computer vision operations: non-maximum suppression and convolution. By modeling each algorithm as a directed graph with computation at the nodes, we gain insight into its gate complexity. Minimizing the number of nodes in this graph lets us design fast streaming implementations of both operations using far fewer gates. For a window of size \(n\), we obtain resource complexities of \(O(n)\) for low-rank convolution and \(O(\log n)\) for non-maximum suppression. We demonstrate the designs by implementing a high-performance visual odometry pipeline.
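To make these complexity claims concrete, here are two minimal C++ sketches. Both are our own illustrations of the underlying ideas under stated assumptions, not the paper's actual hardware designs. First, assuming the convolution kernel is rank-1 (separable), an \(n \times n\) convolution factors into a horizontal and a vertical 1-D pass, requiring \(2n\) multiplies per pixel instead of \(n^2\) -- in a fully pipelined core, \(O(n)\) multipliers instead of \(O(n^2)\):

```cpp
#include <cstddef>
#include <vector>

// Rank-1 (separable) n x n convolution: the kernel factors as
// K = v * h^T, so the 2-D convolution splits into a horizontal
// 1-D pass followed by a vertical 1-D pass. Each pixel costs
// 2n multiplies instead of n*n -- O(n) multipliers in hardware.
std::vector<float> separable_conv(const std::vector<float>& img,
                                  std::size_t width, std::size_t height,
                                  const std::vector<float>& h,   // horizontal taps
                                  const std::vector<float>& v) { // vertical taps
    const std::size_t n = h.size();
    std::vector<float> tmp(img.size(), 0.0f), out(img.size(), 0.0f);

    // Horizontal pass: n multiplies per pixel.
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x + n <= width; ++x) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                acc += img[y * width + x + k] * h[k];
            tmp[y * width + x] = acc;
        }

    // Vertical pass over the intermediate: n more multiplies.
    for (std::size_t y = 0; y + n <= height; ++y)
        for (std::size_t x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                acc += tmp[(y + k) * width + x] * v[k];
            out[y * width + x] = acc;
        }
    return out;
}
```

Second, the core primitive behind streaming non-maximum suppression is a sliding-window maximum. The following sketch uses a doubling scheme (assuming a power-of-two window; the general case can be covered by combining two overlapping dyadic windows): level \(j\) computes the maximum over the last \(2^j\) samples from level \(j-1\) with a single comparator and a delay line, so a window of \(n\) samples needs only \(\log_2 n\) comparators rather than a per-output tree of \(n-1\):

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

// Streaming sliding-window maximum via doubling: level j produces
// the max over the most recent 2^j samples from level j-1 using one
// comparator and a delay line of length 2^(j-1). A window of
// n = 2^L samples costs L = log2(n) comparators -- O(log n) area.
class SlidingMax {
public:
    explicit SlidingMax(std::size_t log2_n) : fifos_(log2_n) {}

    // Feed one sample; returns the max over the last 2^L samples
    // once enough samples have arrived (nullopt during warm-up).
    std::optional<float> push(float x) {
        float cur = x;
        std::size_t delay = 1;                         // 2^(j-1) at level j
        for (auto& fifo : fifos_) {
            fifo.push_back(cur);
            if (fifo.size() <= delay) return std::nullopt; // still warming up
            float delayed = fifo.front();
            fifo.pop_front();
            cur = std::max(cur, delayed);              // one comparator per level
            delay *= 2;
        }
        return cur;
    }

private:
    std::vector<std::deque<float>> fifos_;             // per-level delay lines
};
```

Full non-maximum suppression would additionally compare each pixel against the maximum over its surrounding window, but the window-maximum computation dominates the resource cost, which is where the \(O(\log n)\) figure comes from.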