Our new paper on improving the efficiency of hardware accelerators for convolutional neural networks has been accepted for publication at the 44th International Symposium on Computer Architecture (ISCA), 2017.
This paper, co-authored with Yongming Shen (Stony Brook CS PhD student) and Stony Brook CS professor Mike Ferdman, proposes a new Convolutional Neural Network (CNN) accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers.
Yongming Shen, Michael Ferdman, and Peter Milder. “Maximizing CNN Accelerator Efficiency Through Resource Partitioning.” To appear at The 44th International Symposium on Computer Architecture (ISCA), 2017.
You can read a pre-print here.
Our new paper on bandwidth-efficient hardware accelerators for convolutional neural networks will appear at FCCM 2017. This paper, co-authored with Stony Brook CS PhD student Yongming Shen and Stony Brook CS professor Mike Ferdman, proposes a new method to efficiently balance between the transfer costs of CNN data and CNN parameters and describes a new flexible architecture that is able to reduce the overall communication requirement.
Abstract—Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. Interest in CNNs has led to the design of CNN accelerators to improve CNN evaluation throughput and efficiency. Importantly, the bandwidth demand from weight data transfer for modern large CNNs causes CNN accelerators to be severely bandwidth bottlenecked, prompting the need for processing images in batches to increase weight reuse. However, existing CNN accelerator designs limit the choice of batch sizes and lack support for batch processing of convolutional layers.
We observe that, for a given storage budget, choosing the best batch size requires balancing the input and weight transfer. We propose Escher, a CNN accelerator with a flexible data buffering scheme that ensures a balance between the input and weight transfer bandwidth, significantly reducing overall bandwidth requirements. For example, compared to the state-of-the-art CNN accelerator designs targeting a Virtex-7 690T FPGA, Escher reduces the accelerator peak bandwidth requirements by 2.4× across both fully-connected and convolutional layers on fixed-point AlexNet, and reduces convolutional layer bandwidth by up to 10.5× on fixed-point GoogleNet.
Yongming Shen, Michael Ferdman, and Peter Milder. “Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer.” To appear at The 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017.
You can read a preprint here.