News

New paper on Programmable Model Checking Hardware to Appear at FPL 2019

A new paper describing our recent work on programmable pipelines for accelerating model checking hardware has been accepted to appear at the International Conference on Field-Programmable Logic and Applications (FPL 2019).

This paper, titled “Runtime Programmable Pipelines for Model Checkers on FPGAs,” was co-authored with PhD students Mrunal Patel and Shenghsun Cho and my collaborator Prof. Mike Ferdman.

Abstract: Software verification is an important stage of the software development process, particularly for mission-critical systems. As the traditional methodology of using unit tests falls short of verifying complex software, developers are increasingly relying on formal verification methods, such as explicit state model checking, to automatically verify that the software functions properly. However, due to the ever-increasing complexity of software designs, model checking software running on general purpose cores cannot be performed in a reasonable amount of time, leading to the exploration of hardware-accelerated model checking. FPGAs have been demonstrated as a promising accelerator because of their high throughput, inherent parallelism, and flexibility. Unfortunately, the “FPGA programmability wall,” particularly the long synthesis and place-and-route times, blocks the general adoption of FPGAs for model checking.

To address this problem, we designed a runtime-programmable pipeline specifically for model checkers on FPGAs to minimize the “preparation time” before a model can be checked. Our runtime-programmable pipeline design of the successor state generator and the state validator modules enables FPGA acceleration of model checking without incurring the time-consuming FPGA implementation stages. Our experimental results show that the runtime-programmable pipeline reduces the preparation time before checking a new or modified model from multiple hours to less than a minute, while maintaining similar throughput as FPGA model checkers with model-specific pipelines.

Runtime-programmable successor state generator pipeline
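For readers new to explicit state model checking, here is a minimal software sketch of the two roles the abstract names: a successor state generator and a state validator. It is only an illustration under stated assumptions, not the pipeline from the paper. The function `check` and the toy counter model (`toy_successors`, `toy_is_valid`) are hypothetical names made up for this example.

```python
from collections import deque


def check(initial, successors, is_valid):
    """Breadth-first explicit state search.

    Returns the shortest trace from `initial` to a state that fails
    `is_valid`, or None if every reachable state is valid.
    """
    parent = {initial: None}      # visited set, also records how we got there
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        if not is_valid(state):   # "state validator" role
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state):   # "successor state generator" role
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None


# Hypothetical toy model: a counter that can be incremented or doubled,
# bounded below 20, where reaching the value 13 counts as an illegal state.
def toy_successors(n):
    return [m for m in (n + 1, 2 * n) if m < 20]


def toy_is_valid(n):
    return n != 13


counterexample = check(0, toy_successors, toy_is_valid)
```

In this sketch, changing the model means passing different Python functions. The hardware analogue of that flexibility (loading a new model without re-running synthesis and place-and-route) is what the paper targets.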

Please check back later for a preprint.

This entry was posted on May 17, 2019.


Poster on Sorting Hardware at FCCM

Han Chen will be presenting a poster at FCCM 2019 on our recent work on accelerating the sorting of large sets of data using FPGAs.

Han Chen, Sergey Madaminov, Michael Ferdman and Peter Milder. “Sorting Large Data Sets with FPGA-Accelerated Samplesort.” Poster, FCCM 2019.

Han will be presenting during poster session 3 on Tuesday 4/30/19.

This entry was posted on April 27, 2019.


New Graduate Course on Deep Learning Hardware

This spring, I will be teaching my new graduate course: ESE 587 Hardware Architectures for Deep Learning. The course focuses on the design and implementation of specialized digital hardware systems for deep learning algorithms. The course will include hands-on FPGA experience with Xilinx Zynq.

A syllabus for the course is available here.

This entry was posted on November 04, 2018.


Poster Presentation at MobiCom 2018

Mohammed Elbadry will be presenting a poster at MobiCom 2018 on our recent work on data-centric vehicular networking.

Mohammed Elbadry, Bing Zhou, Fan Ye, Peter Milder, and YuanYuan Yang. “Poster: A Raspberry Pi Based Data-Centric MAC for Robust Multicast in Vehicular Network.” MobiCom 2018.

You can read the poster’s abstract here.

This entry was posted on October 26, 2018.


New work on using FPGAs to accelerate homomorphic encryption to appear

Our recent work in collaboration with a team from CEA (The French Alternative Energies and Atomic Energy Commission) LIST institute has been accepted to appear at the Conference on Cryptographic Hardware and Embedded Systems 2018.

The paper, “Data Flow Oriented Hardware Design of RNS-based Polynomial Multiplication for SHE Acceleration,” was authored by Joël Cathébras, Alexandre Carbon, Peter Milder, Renaud Sirdey, and Nicolas Ventroux. In this work, we use an FPGA to accelerate residue polynomial multiplication. To do so, we adapted Spiral to generate efficient hardware implementations of the Number Theoretic Transform (NTT), and added a new hardware structure that changes its twiddle factors on the fly.
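To give a feel for why the twiddle factors must change, here is a small software sketch of NTT-based polynomial multiplication. It is a toy under stated assumptions, not the hardware design from the paper: it uses a naive quadratic-time transform, cyclic (mod x^n - 1) multiplication, and tiny hand-picked parameters (the primes 17 and 13 with 4th roots of unity 4 and 5). The function names are hypothetical.

```python
def ntt(a, root, p):
    """Naive O(n^2) number theoretic transform of `a` modulo prime `p`.

    `root` must be a primitive n-th root of unity modulo p, n = len(a).
    The powers of `root` play the role of the twiddle factors.
    """
    n = len(a)
    return [sum(a[j] * pow(root, i * j, p) for j in range(n)) % p
            for i in range(n)]


def cyclic_multiply(a, b, root, p):
    """Multiply polynomials a and b modulo (x^n - 1) and modulo p."""
    n = len(a)
    fa, fb = ntt(a, root, p), ntt(b, root, p)
    pointwise = [(x * y) % p for x, y in zip(fa, fb)]
    inv_root = pow(root, p - 2, p)   # modular inverse via Fermat's little theorem
    inv_n = pow(n, p - 2, p)
    return [(x * inv_n) % p for x in ntt(pointwise, inv_root, p)]


def schoolbook(a, b, p):
    """Reference cyclic convolution, used only to check the NTT version."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % p
    return c


a, b = [1, 2, 3, 4], [5, 6, 7, 8]
# In a residue number system, the same product is computed once per modulus,
# and each modulus needs its own root of unity (its own twiddle factors).
residues = [cyclic_multiply(a, b, root, p) for p, root in [(17, 4), (13, 5)]]
```

Each modulus in the list uses a different root, so a single transform datapath serving several moduli has to swap its twiddle factors between them.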

Please check back soon for a pre-print.

This entry was posted on July 06, 2018.


New paper on FPGA-based model checking to appear at FPL 2018

A new paper on using FPGAs for swarm-based model checking, co-authored with Shenghsun Cho and Mike Ferdman, will appear at FPL 2018.

A preprint is available here.

Abstract—Explicit state model checking has been widely used to discover difficult-to-find errors in critical software and hardware systems by exploring all possible combinations of control paths to determine if any input sequence can cause the system to enter an illegal state. Unfortunately, the vast state spaces of modern systems limit the ability of current general-purpose CPUs to perform explicit state model checking effectively due to the computational complexity of the model checking process. Complex software may require days or weeks to go through the formal verification phase, making it impractical to use model checking as part of the regular software development process.

In this work, we explore the possibility of leveraging FPGAs to overcome the performance challenges of model checking. We designed FPGASwarm, an FPGA model checker based on the concept of Swarm verification. FPGASwarm provides the necessary parallelism, performance, and flexibility to achieve high-throughput and reconfigurable explicit state model checking. Our experimental results show that, using a Xilinx Virtex-7 FPGA, the FPGASwarm can achieve near three orders of magnitude speedup over the conventional software approach to state exploration.

Citation: Shenghsun Cho, Michael Ferdman, and Peter Milder. “FPGASwarm: High Throughput Model Checking on FPGAs.” To appear at the 28th International Conference on Field Programmable Logic and Applications (FPL), 2018.

This entry was posted on May 22, 2018.


New paper on Scalable Memory Interconnects for DNN Accelerators to appear at FPL 2018

A new paper, co-authored with Yongming Shen, Tianchu Ji, and Mike Ferdman, has been accepted to appear at FPL 2018.

A preprint is available here, and an extended version is available on arXiv.

Abstract—To cope with the increasing demand and computational intensity of deep neural networks (DNNs), industry and academia have turned to accelerator technologies. In particular, FPGAs have been shown to provide a good balance between performance and energy efficiency for accelerating DNNs. While significant research has focused on how to build efficient layer processors, the computational building blocks of DNN accelerators, relatively little attention has been paid to the on-chip interconnects that sit between the layer processors and the FPGA’s DRAM controller.

We observe a disparity between DNN accelerator interfaces, which tend to comprise many narrow ports, and FPGA DRAM controller interfaces, which tend to be wide buses. This mismatch causes traditional interconnects to consume significant FPGA resources. To address this problem, we designed Medusa: an optimized FPGA memory interconnect which transposes data in the interconnect fabric, tailoring the interconnect to the needs of DNN layer processors. Compared to a traditional FPGA interconnect, our design can reduce LUT and FF use by 4.7x and 6.0x, and improves frequency by 1.8x.

Citation: Yongming Shen, Tianchu Ji, Michael Ferdman, and Peter Milder. “Medusa: A Scalable Memory Interconnect for Many-Port DNN Accelerators and Wide DRAM Controller Interfaces.” To appear at the 28th International Conference on Field Programmable Logic and Applications (FPL), 2018.

This entry was posted on May 22, 2018.


New paper on VM-HDL co-simulation framework to appear at FPGA 2018

Our new paper, which describes our recent work on creating a framework that allows co-simulation of server systems with PCIe-connected FPGAs, has been accepted to appear at FPGA 2018. We are also planning an open source release of this framework.

“A Full-System VM-HDL Co-Simulation Framework for Servers with PCIe-Connected FPGAs.” Shenghsun Cho, Mrunal Patel, Han Chen, Peter Milder, and Michael Ferdman. To appear at FPGA 2018.

Cosimulation overview

Abstract: The need for high-performance and low-power acceleration technologies in servers is driving the adoption of PCIe-connected FPGAs in datacenter environments. However, the co-development of the application software, driver, and hardware HDL for server FPGA platforms remains one of the fundamental challenges standing in the way of wide-scale adoption. The FPGA accelerator development process is plagued by a lack of comprehensive full-system simulation tools, unacceptably slow debug iteration times, and limited visibility into the software and hardware at the time of failure.

In this work, we develop a framework that pairs a virtual machine and an HDL simulator to enable full-system co-simulation of a server system with a PCIe-connected FPGA. Our framework enables rapid development and debugging of unmodified application software, operating system, device drivers, and hardware design.

Once the system is debugged, neither the software nor the hardware requires any changes before being deployed in a production environment. In our case studies, we find that the co-simulation framework greatly improves debug iteration time while providing invaluable visibility into both the software and hardware components.

Please click here for a pre-print [pdf].

This entry was posted on December 01, 2017.


NSF funds work on hardware and software for edge computing

The National Science Foundation has funded our work that aims to create a flexible hardware and software framework for next generation edge computing devices.

Please read more in this article at the Stony Brook College of Engineering and Applied Sciences website.

This entry was posted on July 02, 2017.


New paper on CNN accelerator architectures to appear at ISCA 2017

Our new paper on improving the efficiency of hardware accelerators for convolutional neural networks has been accepted for publication at the 44th International Symposium on Computer Architecture (ISCA), 2017.

This paper, co-authored with Yongming Shen (Stony Brook CS PhD student) and Stony Brook CS professor Mike Ferdman, proposes a new Convolutional Neural Network (CNN) accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers.

Yongming Shen, Michael Ferdman, and Peter Milder. “Maximizing CNN Accelerator Efficiency Through Resource Partitioning.” To appear at The 44th International Symposium on Computer Architecture (ISCA), 2017.

You can read a pre-print here.

This entry was posted on March 08, 2017.


New Paper on Bandwidth-Efficient CNN accelerators to appear at FCCM 2017

Our new paper on bandwidth-efficient hardware accelerators for convolutional neural networks will appear at FCCM 2017. This paper, co-authored with Stony Brook CS PhD student Yongming Shen and Stony Brook CS professor Mike Ferdman, proposes a new method to efficiently balance between the transfer costs of CNN data and CNN parameters and describes a new flexible architecture that is able to reduce the overall communication requirement.

Abstract—Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. Interest in CNNs has led to the design of CNN accelerators to improve CNN evaluation throughput and efficiency. Importantly, the bandwidth demand from weight data transfer for modern large CNNs causes CNN accelerators to be severely bandwidth bottlenecked, prompting the need for processing images in batches to increase weight reuse. However, existing CNN accelerator designs limit the choice of batch sizes and lack support for batch processing of convolutional layers.

We observe that, for a given storage budget, choosing the best batch size requires balancing the input and weight transfer. We propose Escher, a CNN accelerator with a flexible data buffering scheme that ensures a balance between the input and weight transfer bandwidth, significantly reducing overall bandwidth requirements. For example, compared to the state-of-the-art CNN accelerator designs targeting a Virtex-7 690T FPGA, Escher reduces the accelerator peak bandwidth requirements by 2.4× across both fully-connected and convolutional layers on fixed-point AlexNet, and reduces convolutional layer bandwidth by up to 10.5× on fixed-point GoogleNet.

Yongming Shen, Michael Ferdman, and Peter Milder. “Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer.” To appear at The 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017.

You can read a preprint here.

This entry was posted on March 06, 2017.


New article on hardware reliability to appear in ACM TECS

A new article focusing on hardware implementation of execution stream compression will appear in ACM Transactions on Embedded Computing Systems, in a special issue on Secure and Fault-tolerant Embedded Computing. This paper was co-authored with Maria Isabel Mera (a Stony Brook ECE MS alum, currently a PhD student at NYU), Jonah Caplan and Seyyed Hasan Mozafari (graduate students at McGill University), and Prof. Brett Meyer from McGill. This work was based in part on Maria Isabel Mera’s MS thesis.

“Area, Throughput and Power Trade-offs for FPGA- and ASIC-based Execution Stream Compression.” Maria Isabel Mera, Jonah Caplan, Seyyed Hasan Mozafari, Brett H. Meyer, and Peter Milder. To appear in ACM Trans. on Embedded Computing Systems, 2017.

Abstract: An emerging trend in safety-critical computer system design is the use of compression, e.g., using cyclic redundancy check (CRC) or Fletcher Checksum (FC), to reduce the state that must be compared to verify correct redundant execution. We examine the costs and performance of CRC and FC as compression algorithms when implemented in hardware for embedded safety-critical systems. To do so, we have developed parameterizable hardware generation tools targeting CRC and two novel FC implementations. We evaluate the resulting designs implemented for FPGA and ASIC and analyze their efficiency; while CRC is often best, FC dominates when high throughput is needed.
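For reference, here are minimal software versions of the two compression functions the abstract compares. This is a sketch under stated assumptions, not the parameterizable hardware generators from the article: it fixes the widths to Fletcher-16 and a 16-bit CRC with the CCITT polynomial 0x1021 and initial value 0xFFFF, which are common textbook choices picked for this example.

```python
def fletcher16(data: bytes) -> int:
    """Fletcher-16 checksum: two running sums, each reduced modulo 255."""
    sum1 = sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bit-serial CRC-16 with polynomial 0x1021, MSB first, no final XOR."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Two redundant executions would each compress their output stream and
# compare only the short signatures instead of the full state.
signature_a = crc16_ccitt(b"execution stream")
signature_b = crc16_ccitt(b"execution stream")
```

One known weakness of this Fletcher variant is easy to see in the sketch: because the sums are taken modulo 255, a 0x00 byte and a 0xFF byte contribute the same amount, so `fletcher16(b"\x00\x00")` and `fletcher16(b"\xff\xff")` collide, while the CRC tells them apart.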

Please check back later for a pre-print.

This entry was posted on February 14, 2017.


New paper to appear at ICASSP in special session on signal processing education

A new paper, entitled “Practical Matlab Experience in Lecture-Based Signals and Systems Courses,” which I co-authored with Prof. Mónica Bugallo, will appear at the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), in the special session on Advances in Signal Processing Education.

This entry was posted on January 09, 2017.


Poster on neural network hardware to appear at FPGA 2017

Next month, Yongming Shen will present a poster on our current work on implementing bandwidth-efficient fully-connected neural network layers.

Yongming Shen, Michael Ferdman, and Peter Milder. “Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers.” Poster to appear at FPGA 2017.

This entry was posted on January 08, 2017.


NSF Funds our work on efficient spectrum sensing

The National Science Foundation’s Enhancing Access to the Radio Spectrum program has funded our group’s work on efficient distributed spectrum sensing. The goal of this work is to enable crowd-sourced collaborative spectrum sensing including low-cost low-power FPGA-based hardware and novel interpolation and optimization techniques to aggregate and analyze data.

This work is a collaboration with Samir Das and Himanshu Gupta (Stony Brook CS), and Petar Djurić (Stony Brook ECE).

You can read more at the NSF website.

This entry was posted on September 19, 2016.


Announcing the 2016 MEMOCODE design contest

I am organizing the 2016 MEMOCODE Design Contest, which begins today and lasts through September 13.

This year’s contest problem is k-means clustering. You can read the contest description here, and read more about MEMOCODE 2016 here.

This entry was posted on August 15, 2016.


“Fused Layer CNN Accelerators” to appear at MICRO 2016

Our new paper “Fused Layer CNN Accelerators” by Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder has been accepted to appear at MICRO 2016.

A preprint is available here.

In this work, we observe that a previously unexplored dimension exists in the design space of CNN accelerators that focuses on the dataflow across convolutional layers. We find that we are able to fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip, enabling caching of intermediate data between the evaluation of adjacent CNN layers. We demonstrate the effectiveness of our approach by constructing a fused-layer CNN accelerator for the first five convolutional layers of the VGGNet-E network, and find that, by using 362KB of on-chip storage, our fused-layer accelerator minimizes off-chip feature map data transfer, reducing the total transfer by 95%, from 77MB down to 3.6MB per image.
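The reordering idea can be illustrated by asking how much input a small output tile depends on when several layers are evaluated back to back. The sketch below is only a back-of-the-envelope illustration, not the method from the paper. The layer list is an assumption about a VGG-style stack (five 3x3 convolutions with stride 1 and two 2x2 pooling layers with stride 2 between them), padding is ignored, and `input_tile_size` is a hypothetical helper name.

```python
def input_tile_size(output_tile, layers):
    """Walk backward through (kernel, stride) layers to find the width of the
    input region that one output tile of width `output_tile` depends on."""
    size = output_tile
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


# Assumed VGG-style stack: conv, conv, pool, conv, conv, pool, conv.
ASSUMED_LAYERS = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1)]

# Input width needed to produce a single output pixel of the last layer.
region = input_tile_size(1, ASSUMED_LAYERS)
```

Under these assumptions, one output pixel of the fifth convolution depends on a 24x24 input region. If the input is brought on chip in such regions, the intermediate values can stay in on-chip buffers instead of being written to and read back from off-chip memory between layers.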

 

This entry was posted on June 28, 2016.


New paper on CNN accelerator efficiency to appear at FPL 2016

Our recent paper on CNN accelerator hardware, “Overcoming Resource Underutilization in Spatial CNN Accelerators,” has been accepted to appear at the International Conference on Field-Programmable Logic and Applications (FPL) 2016. This paper was co-written with Yongming Shen and Michael Ferdman. A pre-print is available here.

This entry was posted on June 23, 2016.


New paper on streaming sorting networks published in ACM TODAES

A new overview paper that I co-authored with Marcela Zuluaga and Markus Püschel of ETH Zurich has been published in ACM Transactions on Design Automation of Electronic Systems (TODAES). In this paper, we present new hardware structures for sorting that we call streaming sorting networks, which we derive through a mathematical formalism that we introduce, and an accompanying domain-specific hardware generator that translates our formal mathematical description into synthesizable RTL Verilog.

You can read the paper here, and see also our online sorting network generator, which allows you to use the tool described in this paper in your web browser.

As a preview, the following graph shows the cost of implementing various sorters with 16-bit fixed point input values that fit on a Xilinx Virtex-6 FPGA. The x-axis indicates the input size n, the y-axis indicates the number of FPGA configurable slices used, and the size of the marker quantifies the number of BRAMs used (BRAMs are blocks of on-chip memory available in FPGAs). The implementations using Batcher’s and Stone’s architectures can only sort up to 128 or 256 elements, respectively, on this FPGA. Conversely, our streaming sorting networks with streaming width w = 2 can sort up to 2^19 elements on this FPGA, and our smallest fully streaming design can sort up to 2^16 elements.

The cost of implementing various sorters with 16-bit fixed point input values that fit on a Xilinx Virtex-6 FPGA

The following graph shows all 256-element sorting networks that we generate with our framework (using 16-bits per element) that fit onto the Virtex-6 FPGA. The x-axis indicates the number of configurable FPGA slices used, the y-axis indicates the maximum achievable throughput in giga samples per second, and the size of the marker indicates the number of BRAMs used. This plot shows that we can generate a wide range of design trade-offs that outperform previous implementations, such as that of Stone and the linear sorter (Batcher’s is omitted due to the high cost). For practical applications, only the Pareto-optimal ones (those toward the top left) would be considered.

All 256-element sorting networks that we generate with our framework (using 16-bits per element) that fit onto the Virtex-6 FPGA
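For readers who have not seen a sorting network before, here is a small software sketch of a bitonic sorting network, one of Batcher's classic constructions. It is a generic textbook illustration and an assumption on my part as a stand-in for the Batcher baseline in the plots; it is not one of the streaming sorting networks from the paper. The function names are hypothetical.

```python
def bitonic_network(n):
    """Return the stages of a bitonic sorting network for n = 2^k inputs.

    Each stage is a list of (low, high) wire pairs. A comparator places the
    smaller value on `low` and the larger on `high`. All comparators within
    one stage touch disjoint wires, so hardware can run them in parallel.
    """
    stages = []
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    stage.append((i, partner) if ascending else (partner, i))
            stages.append(stage)
            j //= 2
        k *= 2
    return stages


def apply_network(stages, values):
    """Run the fixed comparator sequence on a copy of `values`."""
    data = list(values)
    for stage in stages:
        for low, high in stage:
            if data[low] > data[high]:
                data[low], data[high] = data[high], data[low]
    return data


stages8 = bitonic_network(8)
result = apply_network(stages8, [5, 2, 7, 0, 6, 1, 4, 3])
```

The comparator pattern is fixed and does not depend on the data, which is what makes such networks attractive for hardware. A fully parallel network needs n/2 comparators in every stage, which is one reason fully parallel designs stop fitting on the FPGA at fairly small n.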

This entry was posted on June 08, 2016.


NSF Funds work on Deep Learning with Clouds of FPGAs

The National Science Foundation program on Exploiting Parallelism and Scalability (XPS) has funded our project focused on using clouds of FPGAs for deep learning algorithms. This work is in collaboration with Mike Ferdman (Stony Brook CS) and Alex Berg (UNC Chapel Hill CS).

 

This entry was posted on August 15, 2015.


Announcing the MEMOCODE 2015 Design Contest

I am organizing the 2015 MEMOCODE Design Contest, which begins today and lasts through the month of June.

This year’s contest problem is Continuous Skyline Computation. You can read the contest description here, and read more about MEMOCODE 2015 here.

This entry was posted on June 01, 2015.


New paper on IP Design Space Search to Appear at DAC 2015

A new paper in collaboration with Michael Papamichael and James C. Hoe of Carnegie Mellon has been accepted for publication at the 2015 Design Automation Conference.

Michael Papamichael, Peter Milder, and James C. Hoe. “Nautilus: Fast Automated IP Design Space Search Using Guided Genetic Algorithms.”

This entry was posted on March 09, 2015.


NSF Funds work on Deep Learning with FPGAs

The National Science Foundation has provided new funding for my work on Deep Learning for Computer Vision with FPGAs (in collaboration with Michael Ferdman, Stony Brook CS, and Alex Berg, UNC Chapel Hill CS).

See also press coverage at Gigaom.

This entry was posted on August 08, 2014.


ACM TODAES Best Paper Award

My recent paper describing the Spiral Hardware Generation system has been awarded the 2014 ACM TODAES Best Paper Award.

This paper (co-written with Franz Franchetti, James C. Hoe, and Markus Püschel) presents an overview of my work on the Spiral hardware generation framework, a high-level synthesis and optimization engine that produces highly-customized hardware implementations of linear DSP transforms such as the FFT. This award was presented during the awards session at DAC 2014.

You can read the paper here.

This entry was posted on June 09, 2014.


MEMOCODE 2014 Design Contest

I am pleased to announce the 2014 MEMOCODE Design Contest, which begins today and lasts through the month of June.

This year’s contest will be k-Nearest Neighbors with Mahalanobis distance metric. You can read the contest description here, and read more about MEMOCODE 14 here.

This entry was posted on June 01, 2014.


Teaching Fall 2014: ESE-305 and ESE-507

In the Fall 2014 semester I will be teaching two courses: ESE-305 and ESE-507.

This entry was posted on March 31, 2014.


Execution signature compression paper to be presented next week at DATE 2014

Jonah Caplan will soon present our work on execution signature compression at DATE 2014.

The talk will be in session “4.7 Dependable System Design” on Tuesday 3/25 at 6:00pm.

This entry was posted on March 21, 2014.


Work on symbol synchronization to be presented this week at OFC 2014

Robert Killey will be presenting our work on symbol synchronization for optical OFDM systems this week at the Optical Fiber Communication Conference.

The talk will be on Thursday 3/13 at 2:00pm in room 133, in the “Direct Detection” session.

This entry was posted on March 11, 2014.


New co-authored paper published in Optics Express

A new co-authored paper has been published in the Optics Express journal. The paper is on symbol synchronization for optical OFDM systems, and it is an extension of the work that will be presented in March at OFC.

You can read the paper at this link.

R. Bouziane, P. A. Milder, S. Erkılınç, L. Galdino, S. Kilmurray, B. C. Thomsen, P. Bayvel, and R. I. Killey. “Experimental demonstration of 30 Gb/s direct-detection optical OFDM transmission with blind symbol synchronisation using virtual subcarriers.” Optics Express, Vol. 22, Issue 4, pp. 4342–4348, 2014.

Abstract: The paper investigates the performance of a blind symbol synchronisation technique for optical OFDM systems based on virtual subcarriers. The test-bed includes a real-time 16-QAM OFDM transmitter operating at a net data rate of 30.65 Gb/s using a single OFDM band with a single FPGA-DAC subsystem and demonstrates transmission over 23.3 km SSMF with direct detection at a BER of 10^-3. By comparing the performance of the proposed synchronisation scheme with that of the Schmidl and Cox algorithm, it was found that the two approaches achieve similar performance for large numbers of averaging symbols, but the performance of the proposed scheme degrades as the number of averaging symbols is reduced. The proposed technique has lower complexity and bandwidth overhead as it does not rely on training sequences. Consequently, it is suitable for implementation in high speed optical OFDM transceivers.

This entry was posted on February 19, 2014.


ESE-507 enrollment full (Spring 2014)

Unfortunately, ESE-507 for this semester is full. If you were not able to enroll, please note that I will be offering this course again in Fall 2014.

This entry was posted on January 27, 2014.


Paper to appear at Optical Fiber Communication Conference (OFC)

A recent co-authored paper on symbol synchronization for optical OFDM systems has been accepted for publication at the 2014 Optical Fiber Communication Conference (OFC).

Rachid Bouziane, Peter A. Milder, Sean Kilmurray, Benn C. Thomsen, Stephan Pachnicke, Polina Bayvel, and Robert I. Killey. “Blind symbol synchronisation in direct-detection optical OFDM using virtual subcarriers.”

Abstract: We investigate the performance of a novel blind symbol synchronisation technique using a 30.65Gb/s real-time 16-QAM OFDM transmitter with direct detection. The proposed scheme exhibits low complexity and does not have any bandwidth overhead.


This entry was posted on December 22, 2013.


Reliability paper to appear at DATE 2014

My recently coauthored paper on execution signature compression has been accepted to Design, Automation and Test in Europe (DATE) 2014.

Jonah Caplan, Maria Isabel Mera, Peter Milder, and Brett H. Meyer. “Trade-offs in Execution Signature Compression for Reliable Processor Systems.”

Preprint available here.

Abstract—As semiconductor processes scale, making transistors more vulnerable to transient upset, a wide variety of microarchitectural and system-level strategies are emerging to perform efficient error detection and correction in computer systems. While these approaches often target various application domains and address error detection and correction at different granularities and with different overheads, an emerging trend is the use of state compression, e.g., cyclic redundancy check (CRC), to reduce the cost of redundancy checking. Prior work in the literature has shown that Fletcher’s checksum (FC), while less effective where error detection probability is concerned, is less computationally complex when implemented in software than the more-effective CRC. In this paper, we reexamine the suitability of CRC and FC as compression algorithms when implemented in hardware for embedded safety-critical systems. We have developed and evaluated parameterizable implementations of CRC and FC in FPGA, and we observe that what was true for software implementations does not hold in hardware: CRC is more efficient than FC across a wide variety of target input bandwidths and compression strengths.

Results

 

This entry was posted on December 14, 2013.


Work on smart NICs funded by SRC

My collaborative research with Mike Ferdman (Stony Brook CS) on smart NICs has been funded by the Semiconductor Research Corporation (SRC).

This entry was posted on December 13, 2013.


New course, Spring 2014

My new course ESE-507, “Advanced Digital System Design and Generation” will be offered this Spring 2014 semester.

The field of digital system design has entered a new and complicated era. Digital designers now have increasingly large amounts of chip area to exploit, but they are strictly limited by the amount of power that can be consumed per transistor (the so-called “power wall”). Modern design practices must carefully balance a variety of system tradeoffs such as power, energy, area, throughput, latency, bandwidth, and reusability/customization of digital systems. This course will study how new design abstractions, languages, and tools can help address these problems from the system designer’s perspective.

This entry was posted on December 10, 2013.


Fall 2013: ESE 305

The fall semester has started. This semester I am teaching ESE-305 Deterministic Signals and Systems. Please note that the location of this course has changed to Javits room 103.

This entry was posted on August 25, 2013.


ESE-670 Final Projects

Yesterday in class, our students presented their final projects for ESE-670 Digital System Design and Generation. We saw some excellent projects on topics such as high-level synthesis and customized design generation tools. Congratulations to all the students for their hard work!

This entry was posted on May 09, 2013.


Optics Express paper

Our follow-up journal article, extending our ECOC 2012 paper, has been published in Optics Express. You can read the article here.

“Real-time OFDM or Nyquist Pulse Generation — Which Performs Better with Limited Resources?” R. Schmogrow, R. Bouziane, M. Meyer, P. A. Milder, P. C. Schindler, R. I. Killey, P. Bayvel, C. Koos, W. Freude, and J. Leuthold. Optics Express, Vol. 20, Issue 26, pp. B543–B551. 2012.

This paper was chosen by the review committee as one of three highlights of the special issue on ECOC 2012.

Abstract: We investigate the performance and DSP resource requirements of digitally generated OFDM and sinc-shaped Nyquist pulses. The two multiplexing techniques are of interest as they offer highest spectral efficiency. The comparison aims at determining which technology performs better with limited processing capacities of state-of-the-art FPGAs. It is shown that a novel Nyquist pulse shaping technique, based on look-up tables requires lower resource count than equivalent IFFT-based OFDM signal generation while achieving similar performance with low inter-channel guard-bands in ultra-dense WDM. Our findings are based on a resource assessment of selected DSP implementations in terms of both simulations and experimental validations. The experiments were performed with real-time software-defined transmitters using a single or three optical carriers.


This entry was posted on December 14, 2012.


Paper at ECOC 2012

On Monday, Rene Schmogrow will be presenting a collaborative paper at the European Conference on Optical Communication (ECOC 2012). This work was performed as a collaboration with Rene Schmogrow, Matthias Meyer, Philipp Schindler, Wolfgang Freude, and Juerg Leuthold at Karlsruhe Institute of Technology and with Rachid Bouziane, Polina Bayvel, and Robert Killey at University College London.

“Real-Time Digital Nyquist-WDM and OFDM Signal Generation: Spectral Efficiency versus DSP Complexity.” Rene Schmogrow, Rachid Bouziane, Matthias Meyer, Peter A. Milder, Philipp Schindler, Polina Bayvel, Robert Killey, Wolfgang Freude, and Juerg Leuthold.

Rene has been nominated for the Best Student Paper award for this work!

Abstract: We investigated the performance of Nyquist WDM and OFDM with respect to required DSP complexity. We demonstrate Nyquist pulse-shaping requiring less resources than IFFT-based OFDM for a similar performance. Tests are performed with QPSK/16QAM in a three-carrier WDM scenario.

Edit 9/20: Rene was one of two students awarded the Best Student Honorary Award for this work. Congratulations!


This entry was posted on September 15, 2012.


Optics Express article published

We have published a new collaborative paper on the real-time generation of 85.4 Gb/sec optical OFDM signals using FPGAs, with transmission over 400 km of standard single-mode fiber using a single polarization.

You can read the paper here.

“Generation and Transmission of 85.4 Gb/s Real-time 16QAM Coherent Optical OFDM Signals Over 400 km SSMF with Preamble-less Reception.” Rachid Bouziane, Rene Schmogrow, David Hillerkuss, Peter Milder, Christian Koos, Wolfgang Freude, Juerg Leuthold, Polina Bayvel, and Robert I. Killey. Optics Express, Vol. 20, Issue 19, pp. 21612–21617. 2012.

Abstract: This paper presents a real-time, coherent optical OFDM transmitter based on a field programmable gate array implementation. The transmitter uses 16QAM mapping and runs at 28 GSa/s achieving a data rate of 85.4 Gb/s on a single polarization. A cyclic prefix of 25% of the symbol duration is added enabling dispersion-tolerant transmission over up to 400 km of SSMF. This is the first transmission experiment performed with a real-time OFDM transmitter running at data rates higher than 40 Gb/s. A key aspect of the paper is the introduction of a novel method for OFDM symbol synchronization without relying on training symbols. Unlike conventional preamble-based synchronization methods which perform cross-correlations at regular time intervals and let the system run freely in between, the proposed method performs synchronization in a continuous manner ensuring correct symbol alignment at all times.

opex-results

This entry was posted on September 05, 2012.


First day at Stony Brook

Today marks the first day of my first semester at Stony Brook. This semester I will be teaching ESE-305 Deterministic Signals and Systems.

This entry was posted on August 27, 2012.


Paper at LCTES 2012

This week, Marcela Zuluaga will present our paper at Languages, Compilers, Tools and Theory for Embedded Systems (LCTES 2012). The paper describes a machine-learning approach to predicting the Pareto-optimal designs produced by high-level hardware generation tools such as Spiral. This work was performed with Marcela Zuluaga, Andreas Krause, and Markus Püschel at ETH Zurich.

“Smart” Design Space Sampling to Predict Pareto-Optimal Solutions. Marcela Zuluaga, Andreas Krause, Peter Milder, and Markus Püschel. LCTES 2012.

You can find the paper here. This paper has been nominated for the best paper award (3 papers nominated out of 18). You can read about prior work on generating sorting networks here, and prior work on generating linear transforms with Spiral here.

Abstract: Many high-level synthesis tools offer degrees of freedom in mapping high-level specifications to Register-Transfer Level descriptions. These choices do not affect the functional behavior but span a design space of different cost-performance tradeoffs. In this paper we present a novel machine learning-based approach that efficiently determines the Pareto-optimal designs while only sampling and synthesizing a fraction of the design space. The approach combines three key components: (1) A regression model based on Gaussian processes to predict area and throughput based on synthesis training data. (2) A “smart” sampling strategy, GP-PUCB, to iteratively refine the model by carefully selecting the next design to synthesize to maximize progress. (3) A stopping criterion based on assessing the accuracy of the model without access to complete synthesis data. We demonstrate the effectiveness of our approach using IP generators for discrete Fourier transforms and sorting networks. However, our algorithm is not specific to this application and can be applied to a wide range of Pareto front prediction problems.

LCTES results

This entry was posted on June 12, 2012.


Paper on generating sorting networks at DAC 2012

This week I will be traveling to the Design Automation Conference (DAC) in San Francisco, where Marcela Zuluaga will be presenting our collaborative work (with Markus Püschel) on generating sorting network hardware. You can find the paper here. The work described in this paper forms the basis for the Spiral Online Sorting Network IP Generator.

“Computer Generation of Streaming Sorting Networks.” Marcela Zuluaga, Peter Milder, and Markus Püschel. Design Automation Conference (DAC), 2012.

Abstract: Sorting networks offer great performance but become prohibitively expensive for large data sets. We present a domain-specific language and compiler to automatically generate hardware implementations of sorting networks with reduced area and optimized for latency or throughput. Our results show that the generator produces a wide range of Pareto-optimal solutions that both compete with and outperform prior sorting hardware.

DAC12 results

This entry was posted on June 01, 2012.


FCCM 2012

Berkin Akin will be presenting our work on bandwidth-optimized large size 2D FFTs at the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2012) this week. This work was performed with Berkin Akin, Franz Franchetti, and James C. Hoe.

Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes. Berkin Akin, Peter Milder, Franz Franchetti, and James C. Hoe. FCCM 2012.

Abstract: Prevailing VLSI trends point to a growing gap between the scaling of on-chip processing throughput and off-chip memory bandwidth. An efficient use of memory bandwidth must become a first-class design consideration in order to fully utilize the processing capability of highly concurrent processing platforms like FPGAs. In this paper, we present key aspects of this challenge in developing FPGA-based implementations of two-dimensional fast Fourier transform (2D-FFT) where the large datasets must reside off-chip in DRAM. Our scalable implementations address the memory bandwidth bottleneck through both (1) algorithm design to enable efficient DRAM access patterns and (2) datapath design to extract the maximum compute throughput for a given level of memory bandwidth. We present results for double-precision 2D-FFT up to size 2,048-by-2,048. On an Altera DE4 platform our implementation of the 2,048-by-2,048 2D-FFT can achieve over 19.2 Gflop/s from the 12 GByte/s maximum DRAM bandwidth available. The results also show that our FPGA-based implementations of 2D-FFT are more efficient than 2D-FFT running on state-of-the-art CPUs and GPUs in terms of the bandwidth and power efficiency.

2D FFT results

This entry was posted on April 27, 2012.


ACM TODAES article published

My article entitled Computer Generation of Hardware for Linear Digital Signal Processing Transforms has been published in ACM Transactions on Design Automation of Electronic Systems.

This paper (co-written with Franz Franchetti, James C. Hoe, and Markus Püschel) presents an overview of my work on the Spiral hardware generation framework, a high-level synthesis and optimization engine that produces highly customized hardware implementations of linear DSP transforms such as the FFT.

A subset of this system’s functionality is used in my online FFT IP Core Generator, which allows you to create customized FFT cores directly from your web browser, and download the result as synthesizable RTL Verilog.

Abstract: Linear signal transforms such as the discrete Fourier transform (DFT) are very widely used in digital signal processing and other domains. Due to high performance or efficiency requirements, these transforms are often implemented in hardware. This implementation is challenging due to the large number of algorithmic options (e.g., fast Fourier transform algorithms or FFTs), the variety of ways that a fixed algorithm can be mapped to a sequential datapath, and the design of the components of this datapath. The best choices depend heavily on the resource budget and the performance goals of the target application. Thus, it is difficult for a designer to determine which set of options will best meet a given set of requirements.

In this article we introduce the Spiral hardware generation framework and system for linear transforms. The system takes a problem specification as input as well as directives that define characteristics of the desired datapath. Using a mathematical language to represent and explore transform algorithms and datapath characteristics, the system automatically generates an algorithm, maps it to a datapath, and outputs a synthesizable register transfer level Verilog description suitable for FPGA or ASIC implementation. The quality of the generated designs rivals the best available handwritten IP cores.

This entry was posted on April 05, 2012.


Poster presentation at ICASSP 2012

I’m attending ICASSP 2012 in Kyoto, Japan. Bob Koutsoyannis will be presenting our paper Improving Fixed-Point Accuracy of FFT Cores in O-OFDM Systems on Thursday 3/29 in Poster Area D.

This entry was posted on March 27, 2012.


Poster presentation at FPGA 2012

FPGA 2012 is underway. Berkin Akin will be presenting a poster on our recent work on large-size 2D FFTs on FPGA (Poster Session 4, Friday at 3pm).

This entry was posted on February 22, 2012.