Presented by David Neto of Altera Corporation

Download (PDF, 5.5MB)

FPGAs offer a radically different compute architecture from CPUs and GPUs. Getting the most performance from OpenCL on an FPGA has much in common with optimizing for traditional targets, but fully exploiting the flexibility and extremely fine-grained parallelism of FPGAs sometimes calls for different techniques.

This tutorial describes:

  • Relevant FPGA architecture fundamentals
  • How the Altera SDK for OpenCL maps code to the FPGA fabric
    – “The kernel-specific machine”
    – Emphasis on pipelined execution over SIMD
  • Optimization strategies for FPGAs
    – Divergent control flow is cheap
    – Flexible vectorization and compute unit replication
    – Local memory is abundant
    – Memory access optimization via coalescing, caching, and manual partitioning
    – Reducing data dependencies in serial loop execution
    – Exploiting unusual function mix, and avoiding expensive operations
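
As a rough illustration of the "reducing data dependencies in serial loop execution" item, the sketch below contrasts a single-accumulator sum with one that spreads the work over several independent partial sums. It is plain C rather than OpenCL kernel code, it is not taken from the tutorial slides, and the names `sum_serial`, `sum_lanes`, and the constant `LANES` are hypothetical choices made for this example. The idea it assumes is that, in a pipelined datapath, each partial sum is revisited only every `LANES` iterations, which loosens the dependency between consecutive iterations.

```c
#include <stddef.h>

/* Hypothetical number of independent accumulators; in practice this
 * would be tuned to the latency of the accumulate operation. */
#define LANES 8

/* Baseline: each iteration depends on the sum produced by the
 * previous iteration (a loop-carried dependency of distance 1). */
static float sum_serial(const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i];
    }
    return sum;
}

/* Relaxed version: iteration i only depends on iteration i - LANES,
 * because each element is added into one of LANES partial sums. */
static float sum_lanes(const float *x, size_t n) {
    float partial[LANES] = {0.0f};
    for (size_t i = 0; i < n; ++i) {
        partial[i % LANES] += x[i];
    }
    /* Short final reduction over the partial sums. */
    float sum = 0.0f;
    for (size_t k = 0; k < LANES; ++k) {
        sum += partial[k];
    }
    return sum;
}
```

Note that the two functions add the values in a different order, so for general floating-point inputs their results can differ slightly due to rounding; whether that is acceptable depends on the application.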

We will also preview capabilities beyond standard OpenCL. Altera extensions enable the natural expression of data flow graphs, with concurrent kernels communicating over low-latency, fine-grained channels. In combination with the above optimization techniques, the massive parallelism and concurrency of FPGAs, previously available only to low-level hardware designers, are now accessible in a software-based design flow.