Our new paper “Fused Layer CNN Accelerators” by Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder has been accepted to appear at MICRO 2016.
A preprint is available here.
In this work, we observe that a previously unexplored dimension exists in the design space of CNN accelerators that focuses on the dataflow across convolutional layers. We find that we are able to fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip, enabling caching of intermediate data between the evaluation of adjacent CNN layers. We demonstrate the effectiveness of our approach by constructing a fused-layer CNN accelerator for the first five convolutional layers of the VGGNet-E network, and find that, by using 362KB of on-chip storage, our fused-layer accelerator minimizes off-chip feature map data transfer, reducing the total transfer by 95%, from 77MB down to 3.6MB per image.
Our recent paper on CNN accelerator hardware: “Overcoming Resource Underutilization in Spatial CNN Accelerators” has been accepted to appear at the International Conference on Field-Programmable Logic and Applications (FPL) 2016. This paper was co-written with Yongming Shen and Michael Ferdman. A pre-print is available here.
A new overview paper that I co-authored with with Marcela Zuluaga and Markus Püschel of ETH Zurich has been published in ACM Transactions on Design Automation of Electronics Systems (TODAES). In this paper, we present new hardware structures for sorting that we call streaming sorting networks, which we derive through a mathematical formalism that we introduce, and an accompanying domain-specific hardware generator that translates our formal mathematical description into synthesizable RTL Verilog.
You can read the paper here, and see also our online sorting network generator, which allows you to use the tool described in this paper in your web browser.
As a preview, the following graph shows the cost of implementing various sorters with 16-bit fixed point input values that fit on a Xilinx Virtex-6 FPGA. The x-axis indicates the input size n, the y-axis indicates the number of FPGA con- figurable slices used, and the size of the marker quantifies the number of BRAMs used (BRAMs are blocks of on-chip memory available in FPGAs). The implementations using Batcher’s and Stone’s architectures can only sort up to 128 or 256 elements, respectively, on this FPGA. Conversely, our streaming sorting networks with streaming width w = 2 can sort up to 219 elements on this FPGA, and our smallest fully streaming design can sort up to 216 elements.
The following graph shows all 256-element sorting networks that we generate with our framework (using 16-bits per element) that fit onto the Virtex-6 FPGA. The x-axis indicates the number of configurable FPGA slices used, the y-axis indicates the maximum achievable throughput in giga samples per second, and the size of the marker indicates the number of BRAMs used. This plot shows that we can generate a wide range of design trade-offs that outperform previous implementations, such as that of Stone and the linear sorter (Batcher’s is omitted due to the high cost). For practical applications, only the Pareto-optimal ones (those toward the top left) would be considered.