Our recent paper on CNN accelerator hardware: “Overcoming Resource Underutilization in Spatial CNN Accelerators” has been accepted to appear at the International Conference on Field-Programmable Logic and Applications (FPL) 2016. This paper was co-written with Yongming Shen and Michael Ferdman. A pre-print is available here.
A new overview paper that I co-authored with with Marcela Zuluaga and Markus Püschel of ETH Zurich has been published in ACM Transactions on Design Automation of Electronics Systems (TODAES). In this paper, we present new hardware structures for sorting that we call streaming sorting networks, which we derive through a mathematical formalism that we introduce, and an accompanying domain-specific hardware generator that translates our formal mathematical description into synthesizable RTL Verilog.
As a preview, the following graph shows the cost of implementing various sorters with 16-bit fixed point input values that fit on a Xilinx Virtex-6 FPGA. The x-axis indicates the input size n, the y-axis indicates the number of FPGA con- figurable slices used, and the size of the marker quantifies the number of BRAMs used (BRAMs are blocks of on-chip memory available in FPGAs). The implementations using Batcher’s and Stone’s architectures can only sort up to 128 or 256 elements, respectively, on this FPGA. Conversely, our streaming sorting networks with streaming width w = 2 can sort up to 219 elements on this FPGA, and our smallest fully streaming design can sort up to 216 elements.
The following graph shows all 256-element sorting networks that we generate with our framework (using 16-bits per element) that fit onto the Virtex-6 FPGA. The x-axis indicates the number of configurable FPGA slices used, the y-axis indicates the maximum achievable throughput in giga samples per second, and the size of the marker indicates the number of BRAMs used. This plot shows that we can generate a wide range of design trade-offs that outperform previous implementations, such as that of Stone and the linear sorter (Batcher’s is omitted due to the high cost). For practical applications, only the Pareto-optimal ones (those toward the top left) would be considered.
The National Science Foundation program on Exploiting Parallelism and Scalability (XPS) has funded our project focused on using clouds of FPGAs for deep learning algorithms. This work is in collaboration with Mike Ferdman (Stony Brook CS) and Alex Berg (UNC Chapel Hill CS).
A new paper in collaboration with Michael Papamichael and James C. Hoe of Carnegie Mellon is accepted for publication at the 2015 Design Automation Conference.
Michael Papamichael, Peter Milder, and James C. Hoe. “Nautilus: Fast Automated IP Design Space Search Using Guided Genetic Algorithms.”
Abstract: Today’s crop of parameterized hardware IP generators permit very high degrees of performance and implementation customization. Nevertheless, it is ultimately still left to the IP users to set IP parameters to achieve the desired tuning effects. For the average IP user, the knowledge and effort required to navigate a complex IP’s design space can significantly offset the productivity gain from using the IP. This paper presents an approach that builds into an IP generator an extended genetic algorithm (GA) to perform automatic IP parameter tuning. In particular, we propose extensions that allow IP authors to embed pertinent designer knowledge to improve GA performance. In the context of several IP generators, our evaluations show that (1) GA is an effective solution to this problem and (2) our modified IP-author guided GA can reach the same quality of results up to eight times faster compared to the basic GA.
My recent paper describing the Spiral Hardware Generation system has been awarded the 2014 ACM TODAES Best Paper Award.
This paper (co-written with Franz Franchetti, James C. Hoe, and Markus Püschel) presents an overview of my work on the Spiral hardware generation framework, a high-level synthesis and optimization engine that produces highly-customized hardware implementations of linear DSP transforms such as the FFT. This award was presented during the awards session at DAC 2014.
You can read the paper here.
In the Fall 2014 semester I will be teaching two courses:
- ESE-305 (Deterministic Signals and Systems) on Tuesdays and Thursdays from 10:00–11:20am.
- ESE-507 (Advanced Digital System Design and Generation) on Mondays and Wednesdays from 4:00–5:20pm.
Jonah Caplan will soon present our work on execution signature compression at DATE 2014.
The talk will be in session “4.7 Dependable System Design” on Tuesday 3/25 at 6:00pm.