New work on using FPGAs to accelerate homomorphic encryption to appear

Our recent work in collaboration with a team from CEA (The French Alternative Energies and Atomic Energy Commission) LIST institute has been accepted to appear at the Conference on Cryptographic Hardware and Embedded Systems 2018.

The paper, “Data Flow Oriented Hardware Design of RNS-based Polynomial Multiplication for SHE Acceleration” was authored by Joël Cathébras, Alexandre Carbon, Peter Milder, Renaud Sirdey, and Nicolas Ventroux. In this work, we use an FPGA to residue polynomial multiplication. To do so, we adapted Spiral to generate efficient hardware implementations of the Number Theoretic Transform (NTT), and added a new hardware structure that changes its twiddle factors on the fly.

Please check back soon for a pre-print.

New paper on FPGA-based model checking to appear at FPL2018

A new paper on using FPGAs for swarm-based model checking, co-authored with Shenghsun Cho and Mike Ferdman, will appear at FPL 2018.

A preprint is available here.

Abstract—Explicit state model checking has been widely used to discover difficult-to-find errors in critical software and hard- ware systems by exploring all possible combinations of control paths to determine if any input sequence can cause the system to enter an illegal state. Unfortunately, the vast state spaces of modern systems limit the ability of current general-purpose CPUs to perform explicit state model checking effectively due to the computational complexity of the model checking process. Complex software may require days or weeks to go through the formal verification phase, making it impractical to use model checking as part of the regular software development process.

In this work, we explore the possibility of leveraging FPGAs to overcome the performance challenges of model checking. We designed FPGASwarm, an FPGA model checker based on the concept of Swarm verification. FPGASwarm provides the necessary parallelism, performance, and flexibility to achieve high-throughput and reconfigurable explicit state model check- ing. Our experimental results show that, using a Xilinx Virtex- 7 FPGA, the FPGASwarm can achieve near three orders of magnitude speedup over the conventional software approach to state exploration.

Citation: Shenghsun Cho, Michael Ferdman, and Peter Milder. “FPGASwarm: High Throughput Model Checking on FPGAs.” To appear at the 28th International Conference on Field Programmable Logic and Applications (FPL), 2018.

New paper on Scalable Memory Interconnects for DNN Accelerators to appear at FPL 2018

A new paper, co-authored with Yongming Shen, Tianchu Ji, and Mike Ferdman has been accepted to appear at FPL2018.

A preprint is available here, and an extended version is available on arXiv.

Abstract—To cope with the increasing demand and computational intensity of deep neural networks (DNNs), industry and academia have turned to accelerator technologies. In particular, FPGAs have been shown to provide a good balance between performance and energy efficiency for accelerating DNNs. While significant research has focused on how to build efficient layer processors, the computational building blocks of DNN accelerators, relatively little attention has been paid to the on-chip interconnects that sit between the layer processors and the FPGA’s DRAM controller.

We observe a disparity between DNN accelerator interfaces, which tend to comprise many narrow ports, and FPGA DRAM controller interfaces, which tend to be wide buses. This mismatch causes traditional interconnects to consume significant FPGA resources. To address this problem, we designed Medusa: an optimized FPGA memory interconnect which transposes data in the interconnect fabric, tailoring the interconnect to the needs of DNN layer processors. Compared to a traditional FPGA interconnect, our design can reduce LUT and FF use by 4.7x and 6.0x, and improves frequency by 1.8x.

Citation: Yongming Shen, Tianchu Ji, Michael Ferdman, and Peter Milder. “Medusa: A Scalable Memory Interconnect for Many-Port DNN Accelerators and Wide DRAM Controller Interfaces.” To appear at the 28th International Conference on Field Programmable Logic and Applications (FPL), 2018.

New paper on VM-HDL co-simulation framework to appear at FPGA18

Our new paper, which describes our recent work on creating a framework that allows co-simulation of server systems with PCIe-connected FPGAs, has been accepted to appear at FPGA 2018. We are also planning an open source release of this framework.

“A Full-System VM-HDL Co-Simulation Framework for Servers with PCIe-Connected FPGAs.” Shenghsun Cho, Mrunal Patel, Han Chen, Peter Milder, and Michael Ferdman. To appear at FPGA 2018.

Abstract: The need for high-performance and low-power acceleration technologies in servers is driving the adoption of PCIe-connected FPGAs in datacenter environments. However, the co-development of the application software, driver, and hardware HDL for server FPGA platforms remains one of the fundamental challenges standing in the way of wide-scale adoption. The FPGA accelerator development process is plagued by a lack of comprehensive full-system simulation tools, unacceptably slow debug iteration times, and limited visibility into the software and hardware at the time of failure.

In this work, we develop a framework that pairs a virtual machine and an HDL simulator to enable full-system co-simulation of a server system with a PCIe-connected FPGA. Our framework enables rapid development and debugging of unmodi ed application software, operating system, device drivers, and hardware design.

Once the system is debugged, neither the software nor the hardware requires any changes before being deployed in a production environment. In our case studies, we nd that the co-simulation framework greatly improves debug iteration time while providing invaluable visibility into both the software and hardware components.

Please click here for a pre-print [pdf].

New paper on CNN accelerator architectures to appear at ISCA 2017

Our new paper on improving the efficiency of hardware accelerators for convolutional neural networks has been accepted for publication at the 44th International Symposium on Computer Architecture (ISCA), 2017.

This paper, co-authored with Yongming Shen (Stony Brook CS PhD student) and Stony Brook CS professor Mike Ferdman, proposes a new Convolutional Neural Network (CNN) accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers.

Yongming Shen, Michael Ferdman, and Peter Milder. “Maximizing CNN Accelerator Efficiency Through Resource Partitioning.” To appear at The 44th International Symposium on Computer Architecture (ISCA), 2017.

You can read a pre-print here.

New Paper on Bandwidth-Efficient CNN accelerators to appear at FCCM 2017

Our new paper on bandwidth-efficient hardware accelerators for convolutional neural networks will appear at FCCM 2017. This paper, co-authored with Stony Brook CS PhD student Yongming Shen and Stony Brook CS professor Mike Ferdman, proposes a new method to efficiently balance between the transfer costs of CNN data and CNN parameters and describes a new flexible architecture that is able to reduce the overall communication requirement.

Abstract—Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. Interest in CNNs has led to the design of CNN accelerators to improve CNN evaluation throughput and efficiency. Importantly, the bandwidth demand from weight data transfer for modern large CNNs causes CNN accelerators to be severely bandwidth bottlenecked, prompting the need for processing images in batches to increase weight reuse. However, existing CNN accelerator designs limit the choice of batch sizes and lack support for batch processing of convolutional layers.

We observe that, for a given storage budget, choosing the best batch size requires balancing the input and weight transfer. We propose Escher, a CNN accelerator with a flexible data buffering scheme that ensures a balance between the input and weight transfer bandwidth, significantly reducing overall bandwidth requirements. For example, compared to the state-of-the-art CNN accelerator designs targeting a Virtex-7 690T FPGA, Escher reduces the accelerator peak bandwidth requirements by 2.4× across both fully-connected and convolutional layers on fixed-point AlexNet, and reduces convolutional layer bandwidth by up to 10.5× on fixed-point GoogleNet.

Yongming Shen, Michael Ferdman, and Peter Milder. “Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer.” To appear at The 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017.

You can read a preprint here.

New article on hardware reliability to appear in ACM TECS

A new article focusing on hardware implementation of execution stream compression will appear in ACM Transactions on Embedded Computing Systems, in a special issue on Secure and Fault-tolerant Embedded Computing. This paper was co-authored with Maria Isabel Mera (a Stony Brook ECE MS alum, currently a PhD student at NYU), Jonah Caplan and Seyyed Hasan Mozafari (graduate students at McGill University), and Prof. Brett Meyer from McGill. This work was based in part on Maria Isabel Mera’s MS thesis.

“Area, Throughput and Power Trade-offs for FPGA- and ASIC-based Execution Stream Compression.” Maria Isabel Mera, Jonah Caplan, Seyyed Hasan Mozafari, Brett H. Meyer, and Peter Milder. To appear in ACM Trans. on Embedded Computing Systems, 2017.

Abstract: An emerging trend in safety-critical computer system design is the use of compression, e.g., using cyclic redundancy check (CRC) or Fletcher Checksum (FC), to reduce the state that must be compared to verify correct redundant execution. We examine the costs and performance of CRC and FC as compression algorithms when implemented in hardware for embedded safety-critical systems. To do so, we have developed parameterizable hardware generation tools targeting CRC and two novel FC implementations. We evaluate the resulting designs implemented for FPGA and ASIC and analyze their efficiency; while CRC is often best, FC dominates when high throughput is needed.

Please check back later for a pre-print.

Poster on neural network hardware to appear at FPGA 2017

Yongming Shen will be presenting a poster on our current work to implement bandwidth-efficienct fully-connect neural network layers next month.

Yongming Shen, Michael Ferdman, and Peter Milder. “Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers.” Poster to appear at FPGA 2017.