Berkin Akin will be presenting our work on bandwidth-optimized large-size 2D FFTs at the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2012) this week. This work was performed with Berkin Akin, Franz Franchetti, and James C. Hoe.
Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes. Berkin Akin, Peter Milder, Franz Franchetti, and James C. Hoe. FCCM 2012.
Abstract: Prevailing VLSI trends point to a growing gap between the scaling of on-chip processing throughput and off-chip memory bandwidth. An efficient use of memory bandwidth must become a first-class design consideration in order to fully utilize the processing capability of highly concurrent processing platforms like FPGAs. In this paper, we present key aspects of this challenge in developing FPGA-based implementations of two-dimensional fast Fourier transform (2D-FFT) where the large datasets must reside off-chip in DRAM. Our scalable implementations address the memory bandwidth bottleneck through both (1) algorithm design to enable efficient DRAM access patterns and (2) datapath design to extract the maximum compute throughput for a given level of memory bandwidth. We present results for double-precision 2D-FFT up to size 2,048-by-2,048. On an Altera DE4 platform our implementation of the 2,048-by-2,048 2D-FFT can achieve over 19.2 Gflop/s from the 12 GByte/s maximum DRAM bandwidth available. The results also show that our FPGA-based implementations of 2D-FFT are more efficient than 2D-FFT running on state-of-the-art CPUs and GPUs in terms of the bandwidth and power efficiency.
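For readers unfamiliar with why DRAM access patterns matter here, the following is a minimal sketch of the textbook row-column decomposition of a 2D FFT, written in plain Python. It is not the algorithm or datapath from the paper; the function names (`fft`, `transpose`, `fft2d_row_column`, `dft2d_direct`) are hypothetical and chosen only for illustration. The point it shows is that the second stage walks the data column by column, which in a row-major layout means strided accesses, presumably the kind of pattern a bandwidth-efficient design has to reorganize.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def fft2d_row_column(a):
    """2D FFT computed as 1D FFTs over all rows, then over all columns."""
    # Stage 1: each row is contiguous in a row-major layout.
    rows_done = [fft(row) for row in a]
    # Stage 2: columns are strided in a row-major layout; here the
    # stride is hidden behind an explicit transpose.
    cols_done = [fft(col) for col in transpose(rows_done)]
    return transpose(cols_done)


def dft2d_direct(a):
    """Direct 2D DFT from the definition, used only as a reference."""
    n, m = len(a), len(a[0])
    return [
        [
            sum(
                a[r][c] * cmath.exp(-2j * cmath.pi * (u * r / n + v * c / m))
                for r in range(n)
                for c in range(m)
            )
            for v in range(m)
        ]
        for u in range(n)
    ]
```

On a small input, `fft2d_row_column` should agree with `dft2d_direct` up to floating-point rounding. At the sizes discussed in the abstract the data no longer fits on chip, so how the column stage touches memory becomes the dominant concern.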
My article entitled Computer Generation of Hardware for Linear Digital Signal Processing Transforms has been published in ACM Transactions on Design Automation of Electronic Systems.
This paper (co-written with Franz Franchetti, James C. Hoe, and Markus Püschel) presents an overview of my work on the Spiral hardware generation framework, a high-level synthesis and optimization engine that produces highly customized hardware implementations of linear DSP transforms such as the FFT.
A subset of this system’s functionality is used in my online FFT IP Core Generator, which allows you to create customized FFT cores directly from your web browser, and download the result as synthesizable RTL Verilog.
Abstract: Linear signal transforms such as the discrete Fourier transform (DFT) are very widely used in digital signal processing and other domains. Due to high performance or efficiency requirements, these transforms are often implemented in hardware. This implementation is challenging due to the large number of algorithmic options (e.g., fast Fourier transform algorithms or FFTs), the variety of ways that a fixed algorithm can be mapped to a sequential datapath, and the design of the components of this datapath. The best choices depend heavily on the resource budget and the performance goals of the target application. Thus, it is difficult for a designer to determine which set of options will best meet a given set of requirements.
In this article we introduce the Spiral hardware generation framework and system for linear transforms. The system takes a problem specification as input as well as directives that define characteristics of the desired datapath. Using a mathematical language to represent and explore transform algorithms and datapath characteristics, the system automatically generates an algorithm, maps it to a datapath, and outputs a synthesizable register transfer level Verilog description suitable for FPGA or ASIC implementation. The quality of the generated designs rivals the best available handwritten IP cores.