

Project Title: Automatic Generation of HLS-based Hardware Accelerator with Data Prefetching and Access/Execute Decoupling
Track Code: 7547
Short Description

The core of this invention is a framework that transforms and optimizes High-Level-Synthesis (HLS)-based hardware accelerators, eliminating the manual efforts typically required to create logic and manage the data supply.


Our design framework targets accelerators that are generated using High-Level Synthesis (HLS), removing the need for manually created logic and costly latency-hiding mechanisms. To obtain optimal performance, a hardware accelerator must manage its incoming data supply. This typically requires either additional designer time to create the data caching logic by hand or expensive latency-hiding mechanisms such as dynamic scheduling. The framework avoids these design-time and fabrication costs by automatically optimizing hardware accelerators so that they effectively hide the long, variable memory latencies of a System on a Chip (SoC) memory hierarchy by preloading data in parallel with computation.



Effective data preloading is achieved through hardware prefetching and design transformations that decouple memory accesses from computations. The framework is broadly applicable to stand-alone accelerator designs that are attached to the memory bus or the last-level cache and have their own memory access logic. This accelerator design style is widely adopted in both industry and the research community. When used with high-level synthesis, applying the framework requires minimal manual effort. While the principle of access/execute decoupling has been explored in various contexts, we have systematically applied it to the design of stand-alone accelerators and demonstrated how to enable decoupling automatically and efficiently.
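To make the idea of access/execute decoupling concrete, the following is a minimal software sketch in C++ and not the framework's actual output. The kernel (an indirect sum), the function names, and the use of `std::queue` as a stand-in for a hardware FIFO are all assumptions chosen for illustration. The coupled version loads and computes in the same loop iteration, so computation waits on every memory access. The decoupled version splits the loop into an access stage that only issues loads and an execute stage that only computes on values taken from the FIFO.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Coupled kernel (hypothetical example): each iteration performs the load
// and the computation back to back, so a slow load stalls the computation.
int64_t sum_indirect_coupled(const std::vector<int32_t>& idx,
                             const std::vector<int32_t>& data) {
    int64_t acc = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        acc += static_cast<int64_t>(data[idx[i]]) * 2;
    }
    return acc;
}

// Access stage: contains only address generation and memory reads.
// Loaded values are pushed into a FIFO (std::queue stands in for a
// bounded hardware queue; this is an illustrative assumption).
void access_stage(const std::vector<int32_t>& idx,
                  const std::vector<int32_t>& data,
                  std::queue<int32_t>& fifo) {
    for (int32_t i : idx) {
        fifo.push(data[i]);
    }
}

// Execute stage: contains only computation and never touches memory
// directly. It consumes operands from the FIFO in program order.
int64_t execute_stage(std::queue<int32_t>& fifo) {
    int64_t acc = 0;
    while (!fifo.empty()) {
        acc += static_cast<int64_t>(fifo.front()) * 2;
        fifo.pop();
    }
    return acc;
}

// Decoupled kernel: same result as the coupled one. In this software
// model the two stages run one after the other; in hardware they would
// run concurrently, letting the access stage run ahead of the execute stage.
int64_t sum_indirect_decoupled(const std::vector<int32_t>& idx,
                               const std::vector<int32_t>& data) {
    std::queue<int32_t> fifo;
    access_stage(idx, data, fifo);
    return execute_stage(fifo);
}
```

Calling `sum_indirect_coupled` and `sum_indirect_decoupled` on the same inputs returns the same value. The sketch shows only the structural split. The latency benefit comes from the two stages running concurrently in hardware, which this sequential model does not capture.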


Potential Applications

· High performance cache-based hardware accelerator design optimization



· Reduced design costs

· Supported by experimental results

· Average speedup of 2.28x across eight accelerators

· Reduced energy consumption (average of 15%).

Tags: Industrial Nanofabrication, Nanofabrication, chip design, High Level Synthesis, hardware accelerator
Posted Date: Aug 16, 2017 1:07 PM


Gookwon Suh
Tao Chen

Additional Information


· Tao Chen and G. Edward Suh, “Efficient data supply for hardware accelerators with prefetching and access/execute decoupling,” 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). DOI: 10.1109/MICRO.2016.7783749


· Slides from Presentation:

Licensing Contact

Martin Teschl, Technology Commercialization & Liaison Officer
(607) 254-4454

