

# Intel<sup>®</sup> FPGA host pipe extension for OpenCL<sup>™</sup> applications

Michael Kinsner, Dirk Seynhaeve

**IWOCL 2018** 

Topics

1. FPGA overview
2. Motivating application classes
3. Host pipes
4. Some data







## FPGA: Fine-grained Massive Parallelism



Intel<sup>®</sup> Stratix<sup>®</sup> 10 FPGA: Over 5 Million Basic Elements!





## Many Design Entry Options



## How Can You Compile Software to Hardware?



Instruction Level

- An instruction sequence can either be executed sequentially in time (temporal computing) or in parallel in space (spatial computing)






## Kernels consume "space"

Conceptually map to regions of the FPGA in Intel's OpenCL implementation

Pipeline, data, and task parallelism



- Efficient use of FPGA architecture
- "Concurrent execution"

- Data flow processing
- Fine grained on-chip communication



## OpenCL 1.x – Host/Device Bulk Communication





# **Motivating Classes of Applications**


## Long Running Kernel for Bursting or Massive Data

### Long running / persistent kernel



- Processing more data than fits in global memory, using a single kernel instance
  - Or for lower latency processing of data arriving piecemeal on host
- Reduced latency processing of bursting data
  - Avoid launch overhead and state reconstruction



## **Network Routing / Processing**

Routing table updates – low rate, non-periodic



### Network Data

- FPGAs have rich I/Os
- Often want long-running kernels
- Polling for memory-based host updates expensive
  - Plus memory consistency challenges
- FIFO semantics ideal

## **Streaming Content Analysis**

Low rate, non-periodic detection events signaled to host





- Long running kernels
- Data consistency challenges
- FIFO semantics ideal



### Two Use Models

High throughput streaming



### Asynchronous signaling/control





## Two Use Models – OpenCL 1.x Challenges






Data availability

- Data availability
- Cost of memory polling


## **OpenCL 2.0 Pipes – A Reminder**



OpenCL 2.0 pipes: Communication is **between kernels** 





## Intel<sup>®</sup> Host Pipes

Allow pipes to be read/written from the host program as well as in kernels

## Host Pipe Extension

Small extension to OpenCL 2.x pipe API: cl\_intel\_fpga\_host\_pipe

New flags legal in clCreatePipe():

- CL\_MEM\_HOST\_READ\_ONLY, CL\_MEM\_HOST\_WRITE\_ONLY: set host visibility
- CL\_MEM\_READ\_ONLY, CL\_MEM\_WRITE\_ONLY: optional, from the device perspective

Host program:

    cl_mem read_pipe = clCreatePipe(
        context,
        CL_MEM_HOST_READ_ONLY,
        sizeof(cl_int),
        128,   // Number of packets that can be buffered
        NULL,
        &error );

New query / status enums:

| API Enum                                 | Parent Function        |
|------------------------------------------|------------------------|
| CL_KERNEL_ARG_HOST_ACCESSIBLE_PIPE_INTEL | clGetKernelArgInfo()   |
| CL_DEVICE_MAX_HOST_READ_PIPES_INTEL      | clGetDeviceInfo()      |
| CL_DEVICE_MAX_HOST_WRITE_PIPES_INTEL     | clGetDeviceInfo()      |
| CL_PIPE_FULL                             | clWritePipeIntelFPGA() |
| CL_PIPE_EMPTY                            | clReadPipeIntelFPGA()  |
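
The full/empty status codes imply non-blocking FIFO behavior: a write to a full pipe and a read from an empty pipe return immediately with a status instead of waiting. The following is a minimal host-only model of that behavior in plain C, not the extension itself. The names `model_pipe`, `model_pipe_write`, `model_pipe_read` and the `MODEL_*` status values are hypothetical, and the 128-packet depth simply mirrors the clCreatePipe() example above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for this model only. They stand in for
   success, CL_PIPE_FULL and CL_PIPE_EMPTY; the real enum values are
   not given in this document. */
enum { MODEL_OK = 0, MODEL_PIPE_FULL = -1, MODEL_PIPE_EMPTY = -2 };

#define MODEL_MAX_PACKETS 128          /* packets that can be buffered */
#define MODEL_PACKET_SIZE sizeof(int)  /* bytes per packet */

typedef struct {
    unsigned char data[MODEL_MAX_PACKETS][MODEL_PACKET_SIZE];
    size_t head;   /* index of the oldest buffered packet */
    size_t count;  /* number of packets currently buffered */
} model_pipe;

static void model_pipe_init(model_pipe *p)
{
    assert(p != NULL);
    p->head = 0;
    p->count = 0;
}

/* Non-blocking write of one packet; reports "full" instead of waiting. */
static int model_pipe_write(model_pipe *p, const void *packet)
{
    assert(p != NULL && packet != NULL);
    if (p->count == MODEL_MAX_PACKETS)
        return MODEL_PIPE_FULL;
    size_t tail = (p->head + p->count) % MODEL_MAX_PACKETS;
    memcpy(p->data[tail], packet, MODEL_PACKET_SIZE);
    p->count++;
    return MODEL_OK;
}

/* Non-blocking read of one packet; reports "empty" instead of waiting. */
static int model_pipe_read(model_pipe *p, void *packet)
{
    assert(p != NULL && packet != NULL);
    if (p->count == 0)
        return MODEL_PIPE_EMPTY;
    memcpy(packet, p->data[p->head], MODEL_PACKET_SIZE);
    p->head = (p->head + 1) % MODEL_MAX_PACKETS;
    p->count--;
    return MODEL_OK;
}
```

In this model, packets come out in the order they went in, the 129th consecutive write reports full, and a read after the pipe drains reports empty.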



## **Kernel Interface**

Used like normal OpenCL 2.x pipes

- Additional kernel argument attribute
- No reservation functionality (the OpenCL 2.x feature)

C kernel language:

```
kernel void foo( __attribute__((intel_host_accessible)) write_only pipe int p ) { .... }

read_pipe( p, &val )

write_pipe( p, &val )
```

C++ kernel language:

```
kernel void foo( [[cl::intel_host_accessible]] cl::pipe<int, cl::pipe_access::write> p ) { .... }

p.write( val )

p.read( &val )
```



## Host Interface – Low Rate Signaling

### Simple interface

Single word read/write

Data transferred "as soon as possible"

cl\_int clReadPipeIntelFPGA( cl\_mem pipe, void \*ptr );

cl\_int clWritePipeIntelFPGA( cl\_mem pipe, const void \*ptr );

```
// Create pipes, kernels, other startup code
. . . .
// Bind pipes to kernels
clSetKernelArg(read_kern, 0, sizeof(cl_mem), (void *)&write_pipe);
clSetKernelArg(write_kern, 0, sizeof(cl_mem), (void *)&read_pipe);
// Enqueue kernels
. . . .
cl_float2 val;  // Type assumed from the use of val.x and val.y
// A zero return value means a packet was read
if (!clReadPipeIntelFPGA( read_pipe, &val )) {
  // The write call takes a pointer to the packet, not the value itself
  cl_int sum = (cl_int)(val.x + val.y);
  cl_int result = clWritePipeIntelFPGA( write_pipe, &sum );
  // Check write success/failure and handle
  . . . .
}
```
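
The slide leaves the "check write success/failure and handle" step open. One possible policy, sketched here in plain C under stated assumptions, is to retry a non-blocking write a bounded number of times while the pipe reports full. Everything in this block is hypothetical: `write_with_retry`, the `try_write_fn` signature, the `SKETCH_*` codes and the stub pipe are illustrations, not part of the extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch only. */
enum { SKETCH_OK = 0, SKETCH_PIPE_FULL = -1 };

/* Hypothetical stand-in for a non-blocking, single-packet host write. */
typedef int (*try_write_fn)(void *pipe, const void *packet);

/* Retry a non-blocking write up to max_attempts times.
   Returns the last status and stores the number of attempts made. */
static int write_with_retry(try_write_fn try_write, void *pipe,
                            const void *packet, int max_attempts,
                            int *attempts_made)
{
    assert(try_write != NULL && attempts_made != NULL && max_attempts > 0);
    int status = SKETCH_PIPE_FULL;
    *attempts_made = 0;
    while (*attempts_made < max_attempts) {
        status = try_write(pipe, packet);
        (*attempts_made)++;
        if (status != SKETCH_PIPE_FULL)
            break;  /* success, or an error that retrying will not fix */
    }
    return status;
}

/* Stub pipe for demonstration: reports "full" a fixed number of times,
   then accepts one int packet. */
typedef struct {
    int full_responses_left;
    int stored;
} stub_pipe;

static int stub_try_write(void *pipe, const void *packet)
{
    stub_pipe *s = (stub_pipe *)pipe;
    if (s->full_responses_left > 0) {
        s->full_responses_left--;
        return SKETCH_PIPE_FULL;
    }
    s->stored = *(const int *)packet;
    return SKETCH_OK;
}
```

A bounded retry keeps the host thread from spinning indefinitely if the kernel side stops draining the pipe. Whether to spin, sleep, or drop the packet after the limit depends on the application.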



## Host Interface – High Throughput



## **FIFO Access Within Kernels**

Checking FIFO for data availability is cheap

Implicit control signals (ready/full), and low latency

```
kernel void
foo ( global int *G, ... ) {
     if (G[ get_local_id(0) ]) { ... }
}
```

```
kernel void
foo (read_only pipe int4 P ... ) {
    int4 val;
    if (0 == read_pipe(P, &val)) { ... }
}
```





## Visibility and Latency

### Additional memory model guarantee

- Data written to a pipe will eventually be visible on the read endpoint, without an OpenCL synchronization point. It is understood that an OpenCL implementation will make the data visible to the read endpoint "as soon as possible"
- No synchronization side effects with other pipes or memory

### The host pipe API supports low latency communication

- An OpenCL extension is not enough to guarantee latency
  - Board support package
  - Drivers/OS
  - System load
- The host pipe API was designed to enable latency-sensitive applications
  - Talk to board and system provider if guarantees are required





## Host Pipe Microbenchmark

- Results from an Intel<sup>®</sup> Arria<sup>®</sup> 10 GX FPGA Development Kit
  - <u>https://www.altera.com/products/boards\_and\_kits/dev-kits/altera/kit-a10-gx-fpga.html</u>
- Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz



Host pipes require platform (BSP) support

- Two CPU threads, each managing one host pipe direction. Loopback kernel
- 'aocl diagnose': the buffer transfer speed microbenchmark that ships with the Intel<sup>®</sup> FPGA SDK for OpenCL<sup>™</sup>
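
A loopback measurement of this kind reduces to two host-side checks: the echoed data matches what was sent, and throughput is bytes moved divided by elapsed time. The helpers below are a hypothetical sketch of that arithmetic only. They contain no measured numbers, the function names are invented for illustration, and 1 MB is assumed to mean 1,000,000 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 when the data echoed by the loopback matches what was sent. */
static int loopback_matches(const void *sent, const void *received,
                            size_t bytes)
{
    assert(sent != NULL && received != NULL);
    return memcmp(sent, received, bytes) == 0;
}

/* Throughput in MB/s, assuming 1 MB = 1,000,000 bytes.
   Returns 0 when the elapsed time is not positive. */
static double throughput_mb_per_s(size_t bytes, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return ((double)bytes / 1e6) / seconds;
}
```

For example, moving 500,000,000 bytes in 0.25 s works out to 2000 MB/s under this definition.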



Host pipe microbenchmark

## Now Available!



Host pipes are shipping in the Intel<sup>®</sup> FPGA SDK for OpenCL<sup>™</sup> 18.0

- Reference platform has some minor restrictions that will be relaxed in the future
  - Number of host pipes, width of host pipes
  - Some queries
- https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html





## Legal Notice and Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit <u>http://www.intel.com/benchmarks</u>.

Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. \*Other names and brands may be claimed as the property of others.

© 2018 Intel Corporation.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.




## Legal Disclaimer and Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804