Energy efficiency has become the limiting factor of current and future computing performance, affecting computing systems of all kinds, from mobile devices to datacenters. Meanwhile, modern applications continue to grow more complex and computationally expensive, while relying on larger amounts of data. This presents a considerable challenge: how can we continue to improve our computational capabilities in spite of these limitations?
A key technique for improving energy efficiency while achieving high performance is hardware specialization. Recently, there has been much interest in using field-programmable gate arrays (FPGAs) as accelerators in general-purpose computing environments. Their fine-grained parallel structures allow them to exploit the benefits of hardware-level customization while retaining reprogrammability.
However, the biggest obstacle limiting the growth of FPGAs is the difficulty of implementing algorithms in hardware and integrating that hardware into real-world computer systems. My research aims to address these difficulties by combining digital hardware design with compilers, tools, and domain-specific languages. More specifically, my work explores how we can use computer-based tools to make digital hardware more efficient, how we can reduce the effort needed to design, optimize, and verify digital systems, and how these technologies can be exploited to address key challenges in modern computing.
Below you will find high-level descriptions of my current research projects along with information on a few selected papers. For a full list of papers, please see my Publications page.
Deep learning and convolutional neural networks (CNNs) have revolutionized machine learning, leading to recent advances in several areas such as natural language processing and computer vision, and to widespread interest from industry and academia. However, these advances come at a steep computational cost. The goal of this project is to enable the implementation of large-scale deep learning applications on a scalable parallel “cloud” of FPGAs by automating the translation from straightforward algorithmic specifications of deep learning problems into optimized hardware, parallelized across many interconnected FPGAs.
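To give a sense of what such an algorithmic specification looks like, the sketch below shows a direct convolution loop nest, the textbook starting point for CNN accelerator design. This is an illustrative example only, not code from the project; the array shapes and names are assumptions chosen for clarity.

```python
def conv_layer(ifmap, weights):
    """Direct convolution loop nest (illustrative only, not project code).

    ifmap:   [C][H][W]     input feature maps (C channels)
    weights: [M][C][K][K]  M output channels, each with C KxK kernels
    returns: [M][H-K+1][W-K+1] output feature maps (no padding, stride 1)
    """
    C, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    M, K = len(weights), len(weights[0][0])
    OH, OW = H - K + 1, W - K + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    for m in range(M):              # output channels
        for y in range(OH):         # output rows
            for x in range(OW):     # output columns
                for c in range(C):          # input channels
                    for ky in range(K):     # kernel rows
                        for kx in range(K): # kernel columns
                            out[m][y][x] += (weights[m][c][ky][kx]
                                             * ifmap[c][y + ky][x + kx])
    return out
```

An accelerator generator's task is to take a specification at roughly this level and decide how to tile, reorder, and parallelize these loops across hardware resources.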
This work is funded by the National Science Foundation’s Exploiting Parallelism and Scalability (XPS) program through award 1533739.
Selected papers:
Argus: An End-to-End Framework for Accelerating CNNs on FPGAs [info] [PDF]
Yongming Shen, Tianchu Ji, Michael Ferdman, and Peter Milder
Accepted; to appear in the IEEE Micro special issue on Machine Learning Acceleration, 2019.
Medusa: A Scalable Interconnect for Many-Port DNN Accelerators and Wide DRAM Controller Interfaces [PDF] [extended arXiv version]
Yongming Shen, Tianchu Ji, Michael Ferdman, and Peter Milder
Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2018. See also the extended version, arXiv:1807.04013.
Maximizing CNN Accelerator Efficiency Through Resource Partitioning [PDF]
Yongming Shen, Michael Ferdman, and Peter Milder
Proceedings of the 44th International Symposium on Computer Architecture (ISCA), 2017.
Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer [PDF]
Yongming Shen, Michael Ferdman, and Peter Milder
Proceedings of the 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017.
Fused Layer CNN Accelerators [PDF]
Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
Overcoming Resource Underutilization in Spatial CNN Accelerators [PDF]
Yongming Shen, Michael Ferdman, and Peter Milder
Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2016.
To reduce the difficulty of implementing FPGA and ASIC accelerators, researchers have proposed a number of automated systems. Some take the form of parameterized IP (intellectual property) cores: implementations of a given problem created by an expert, with a small amount of flexibility exposed through parameters. At the other end of the spectrum are “high-level synthesis” (HLS) tools that aim to convert C or C++ code directly into hardware. In practice, typical parameterized IPs are too restrictive, forcing designers into a “one-size-fits-all” approach; meanwhile, HLS is too open-ended: by trying to work well for all problems, it becomes difficult for such tools to produce good solutions for any particular one.
My work aims to address these problems through domain-specific hardware generation tools. These tools target a specific domain of problems (e.g., linear DSP transforms), providing enough flexibility to work well for a variety of problems in the domain, while being focused enough to produce very good results with little effort from the end user. One example is my work on the Spiral hardware generation framework, a domain-specific hardware generation tool for linear signal processing transforms such as the fast Fourier transform (FFT). This system uses a mathematical domain-specific language (DSL) to optimize transform hardware; its results are competitive with (and often more efficient than) hand-designed systems.
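As a rough software analogy (a hypothetical sketch, not Spiral's actual representation or output), the kind of algorithmic structure such a DSL captures for the FFT is the recursive Cooley-Tukey factorization, which a generator can then map to hardware in many different ways:

```python
import cmath

def dft_naive(x):
    """Direct O(n^2) DFT, used here only as a correctness reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT (n must be a power of two).

    The even/odd split and twiddle-factor multiplication below are the
    recursive algebraic structure that a transform DSL can represent and
    restructure symbolically before committing to a hardware datapath.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of even-indexed inputs
    odd = fft(x[1::2])    # DFT of odd-indexed inputs
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each way of unrolling, streaming, or reusing this recursion corresponds to a different point in the cost/performance tradeoff space that a hardware generator can explore automatically.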
My ongoing work aims to create a flexible framework for creating domain-specific hardware generators, improving their usability, and using the results to study new application domains.
Selected papers:
Computer Generation of Hardware for Linear Digital Signal Processing Transforms [info] [PDF]
Peter Milder, Franz Franchetti, James C. Hoe, and Markus Püschel
ACM Transactions on Design Automation of Electronic Systems, Vol. 17, No. 2, Article 15, April 2012.
Winner, 2014 ACM TODAES Best Paper Award.
Streaming Sorting Networks [info] [preprint PDF]
Marcela Zuluaga, Peter Milder, and Markus Püschel
ACM Transactions on Design Automation of Electronic Systems, Vol. 21, No. 4, Article 55, May 2016.
Nautilus: Fast Automated IP Design Space Search Using Guided Genetic Algorithms [info] [PDF]
Michael K. Papamichael, Peter Milder, and James C. Hoe
Proceedings of the Design Automation Conference (DAC), 2015.
"Smart" Design Space Sampling to Predict Pareto-Optimal Solutions [info] [PDF]
Marcela Zuluaga, Andreas Krause, Peter A. Milder, and Markus Püschel
Proceedings of the Conference on Languages, Compilers, Tools and Theory for Embedded Systems (LCTES), 2012.
See also the Spiral DFT/FFT hardware generator, which produces high-quality designs over a very wide tradeoff space, allowing users to choose the designs that best match their implementation-specific goals, balancing cost (power, energy, area) against performance (throughput, latency). The system produces cores that compare well with existing designs in the literature and in IP libraries, and it enables higher performance/cost design points than otherwise available.
Datacenters (large-scale computing facilities comprising many servers) have become ubiquitous in modern computing, but they are severely power constrained. Although typical datacenter applications are not traditional targets for hardware acceleration, their strict power limits have made FPGA acceleration attractive. However, typical datacenter applications can be considerably challenging to accelerate with FPGAs. The goal of this work is to study how FPGAs can improve the efficiency and speed of large-scale datacenters and their applications.
This work is supported by the Semiconductor Research Corporation.
With the explosive growth of wireless communications, the RF spectrum is a more important, and more limited, resource than ever. However, monitoring use of the RF spectrum across space and time, whether to patrol for unauthorized access or to exploit under-utilized bandwidth, is extremely difficult. The goal of this collaborative project is to bring “spectrum sensing to the masses” by studying how to design efficient low-power spectrum sensing hardware that can be distributed across a region of interest, and pairing it with intelligent centralized algorithms that aggregate and interpolate the sensed data. Specifically, we are studying how automatic generation techniques can help create efficient hardware for sensing and detecting usage of the frequency spectrum. The goal is to produce a domain-specific hardware generation framework that will allow users to quickly create hardware designs adapted to different tradeoff scenarios.
This work is supported by the National Science Foundation EARS (“Enhancing Access to the Radio Spectrum”) program, under award 1642965.
In the near future, edge devices (including smartphones, connected vehicles, road-side units with sensors and radios, and Internet-of-Things (IoT) devices) will be densely distributed and pervasively embedded in the world around us. Edge devices will sense and control our physical environments, processing sensor data to allow objects and computers to understand our surroundings. This new edge sensing/computing paradigm utilizes pervasively embedded sensing and computing resources, distributing data processing, decision making, and intelligence throughout the environment. Edge computing devices pose significant challenges, often requiring high computational capabilities across a range of algorithm types (e.g., signal processing, feature detection, machine learning) within limited power budgets. For these reasons, FPGAs represent an attractive platform for edge computing (especially for research in this area), but the difficulty of working with hardware dissuades many researchers. The goal of this recent project is to create a hardware/software platform for FPGA-based edge sensing and computing devices, with an accompanying domain-specific hardware generator that will allow practitioners to easily prototype edge computing systems with FPGAs.
This work is supported by the National Science Foundation under award 1730291.