Opening Address

  • Wednesday, 13 May
  • Schedule: 09:30-09:40
  • Li Ka Shing Center, Stanford

Keynote Address
Leveraging OpenCL to Create Differentiation

Salvatore De Dominicis, Imagination Technologies.

In the last year a number of phone and tablet manufacturers have successfully used OpenCL to offload computationally-intensive image processing algorithms from CPU and DSP cores to embedded GPUs. By leveraging the inherent low-power parallelism of the GPU, these OEMs are able to support new use cases that require sustained video-rate processing of HD content in areas such as computer vision and computational photography, all within the tight thermal and power envelope of a modern SoC.

In this talk Imagination will present an Imaging Framework that we have developed in collaboration with lead partners. The framework comprises a set of extensions to the OpenCL and EGL application programming interfaces (APIs) that enable efficient allocation of memory buffers for sharing between the GPU and other system components such as a camera sensor, ISP (Image Signal Processor), CPU and video encoder/decoder. This makes it possible to construct a range of programmable image processing pipelines, which can form the basis for many advanced multimedia apps. Extensions are also provided to enable the GPU to directly access camera data in memory in its native YUV or RGB colour space. Another extension allows the PowerVR GPU to be configured to convert YUV pixels to RGB format dynamically when sampled. Low-level functions are provided for tightly integrating these extensions within Android’s camera and video HALs (Hardware Abstraction Layers).

  • Wednesday, 13 May
  • Schedule: 09:40-10:00
  • Li Ka Shing Center, Stanford

A Look at the OpenCL 2.0 Execution Model

Benedict Gaster, University of the West of England.

A popular approach to programming manycore GPUs is the Single Instruction Multiple Thread (SIMT) abstraction. SIMT has the benefit of presenting a ‘single thread’ view, alleviating the complexity of explicitly vectorizing the source code. However, due to the SIMD nature of the underlying hardware, it is often difficult to fully hide all aspects from the developer. An example of such a ‘leak’ is OpenCL’s barrier, which requires all work-items (i.e. threads) to reach and execute the ‘same’ barrier. Using a set of examples, we show, sometimes surprisingly, that common transformations often performed by traditional scalar compilers are not, in general, valid when applied to OpenCL code containing work-group (or sub-group) collective operations. Additionally, we introduce a mathematical notion of work-group and sub-group uniformity and outline an execution model for OpenCL 2.0 which enables these traditional compiler transformations to be applied, even in the presence of collective operations, for all valid OpenCL programs. The model clearly describes when it is and is not valid to apply these transformations.
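As a toy illustration of the barrier ‘leak’ described above, the following Python sketch (purely illustrative; `run_work_group` and the kernel names are hypothetical, not the OpenCL API) models each work-item as a generator that yields at every barrier. The scheduler advances the work-group in lockstep and rejects any execution in which a barrier is reached by only some of the work-items, which is undefined behaviour in real OpenCL.

```python
# Toy CPU model of an OpenCL work-group barrier (illustrative only, not the
# real OpenCL runtime). Each work-item is a Python generator that yields
# whenever it reaches a barrier; the "scheduler" advances all work-items in
# lockstep and requires that every work-item hit the same barriers.

def run_work_group(kernel, local_size):
    items = [kernel(i) for i in range(local_size)]
    while True:
        states = []
        for it in items:
            try:
                next(it)                 # run until the next barrier
                states.append("barrier")
            except StopIteration:
                states.append("done")
        if all(s == "done" for s in states):
            return                       # all work-items finished together
        if any(s == "done" for s in states):
            # Divergent barrier: some work-items finished while others wait.
            raise RuntimeError("barrier not reached by all work-items")

def uniform_kernel(lid):
    yield              # barrier() reached by every work-item: valid

def divergent_kernel(lid):
    if lid < 2:
        yield          # barrier() inside a non-uniform branch: invalid
```

Running `run_work_group(uniform_kernel, 4)` completes normally, while `run_work_group(divergent_kernel, 4)` raises, mirroring why a compiler may not hoist or sink a barrier across non-uniform control flow.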

  • Wednesday, 13 May
  • Schedule: 10:00-10:20
  • Li Ka Shing Center, Stanford

Exploring the Features of OpenCL 2.0

Saoni Mukherjee, Xiang Gong, Leiming Yu, Carter McCardwell, Yash Ukidave, Tuan Dao, Fanny Nina-Paravecino and David Kaeli. Northeastern University.

The growth in demand for heterogeneous accelerators has stimulated the development of cutting-edge features in newer accelerators. Heterogeneous programming frameworks such as OpenCL have matured over the years and have introduced new software features for developers. We explore one of these programming frameworks, OpenCL 2.0. To drive our study, we consider a number of new features in OpenCL 2.0 using four popular applications from a range of computing domains, including signal processing, cybersecurity and machine learning. These applications are: 1) the AES-128 encryption standard, 2) Finite Impulse Response filtering, 3) Infinite Impulse Response filtering, and 4) a Hidden Markov Model. In this work, we introduce the latest runtime features enabled in OpenCL 2.0, and discuss how well our sample applications can benefit from some of these features.

  • Wednesday, 13 May
  • Schedule: 10:20-10:40
  • Li Ka Shing Center, Stanford

Achieving Performance with OpenCL 2.0 on Intel Processor Graphics

Robert Ioffe, Sonal Sharma and Michael Stoner. Intel.

OpenCL 2.0 is here, supported for the first time on the 5th Generation Intel® Core Processors with Intel® Processor Graphics. We will talk about the things we have learned in the past year developing workloads for OpenCL 2.0, and the speedups achieved over comparable OpenCL 1.2 implementations. We will cover the following features of OpenCL 2.0 supported in our released driver:

(1) Shared Virtual Memory (SVM) provides the capability to create SVM buffers (coarse-grained and fine-grained) and to share pointers between host and device, without the need to create device memory objects and without copying data between the device and the host. SVM improves both the programming experience and performance. We will discuss ways of allocating SVM buffers, how SVM buffers interact with devices, synchronization methods, and ways to maintain memory consistency. A number of new API calls were added to the OpenCL 2.0 specification to support this feature and perform the above tasks. The advantages of SVM will be showcased with the help of samples.

(2) Nested Parallelism allows GPU kernels to enqueue other GPU kernels, or to self-enqueue, without any interaction with the CPU host. We will present GPU-Quicksort for OpenCL 2.0, a high-performance example that benefits from nested parallelism and work-group functions. In our experience, nested parallelism lets developers build high-performance iterative and recursive algorithms and move housekeeping and scheduling operations previously reserved for the CPU onto the GPU.

(3) Work-group scan/reduce functions facilitate scan and reduce operations across the work-items of a work-group. These functions are correct, performant and heavily optimized for Intel Architecture; they simplify your code and improve performance.
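To make the work-group scan semantics concrete, here is a CPU sketch in Python (illustrative only; the real OpenCL 2.0 built-in is `work_group_scan_inclusive_add`, executed per work-item on the device): every work-item at local index i receives the sum of the values held by work-items 0..i within its own work-group.

```python
# CPU sketch of the semantics of OpenCL 2.0's work_group_scan_inclusive_add
# (plain Python, not the OpenCL built-in): an inclusive prefix sum applied
# independently within each work-group of `local_size` work-items.

from itertools import accumulate

def work_group_scan_inclusive_add(values, local_size):
    """Inclusive prefix sum within each work-group, groups scanned independently."""
    out = []
    for start in range(0, len(values), local_size):
        group = values[start:start + local_size]
        out.extend(accumulate(group))   # work-item i gets sum of items 0..i
    return out
```

For example, eight work-items each holding 1, in work-groups of 4, produce `[1, 2, 3, 4, 1, 2, 3, 4]`: the scan restarts at each work-group boundary.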

  • Wednesday, 13 May
  • Schedule: 10:40-11:00
  • Li Ka Shing Center, Stanford

Morning Break & Table-Top Demonstrations

  • Wednesday, 13 May
  • Schedule: 11:00-11:30
  • Li Ka Shing Center, Stanford

Platinum Sponsor Invited Talk
Heterogeneous Computing: The Rise of Open Programming Frameworks

JC Baratault, AMD Global GPU Computing.

Code portability and scalability are the next big challenge on the road to Exascale computing. Although GPU-based accelerators deliver great performance, most users are still reluctant to modify their codes using proprietary frameworks. AMD’s Heterogeneous Computing strategy relies on open programming standards such as C++, OpenCL, OpenMP and OpenACC. With AMD, users are not locked in to a single vendor and can benefit from AMD GPU accelerators in servers from multiple vendors, such as the HP DL380 G9 and SL250a. In this session, AMD will outline the evolving technologies that show how best to enable tomorrow’s business, today. From GPU and APU technologies to server technologies, come learn about AMD’s vision of tomorrow’s data center and how you can help your customers succeed with AMD solutions.

  • Wednesday, 13 May
  • Schedule: 11:30-11:50
  • Li Ka Shing Center, Stanford

Mapping C++ AMP to OpenCL / HSA

Jack Chung, Curtis Davis and Jayram Ramachandran, MulticoreWare.

C++ AMP is a parallel programming extension to C++, and MulticoreWare has contributed to Clamp, an open-source implementation. The compiler is based on Clang/LLVM and can target multiple platforms such as OpenCL, SPIR and HSA. We present some important implementation techniques in this compiler, and show how shared virtual memory and platform atomics allow more generic C++ code to leverage multi-core architectures.

  • Wednesday, 13 May
  • Schedule 11:50-12:10
  • Li Ka Shing Center, Stanford

Update on the SYCL for OpenCL Open Standard to Enable C++ Meta Programming on Top of OpenCL

Andrew Richards, Codeplay.

SYCL is a royalty-free, open-standard, higher-level C++ programming model for OpenCL. C++ developers can produce easy-to-use template libraries for OpenCL devices, as well as easily port C++ applications to OpenCL. By providing ease of use, high performance and modern C++ techniques, SYCL enables a wide range of developers to accelerate their applications and libraries. Two provisional specifications of SYCL have previously been released, and we hope, subject to approval, to present exciting news about SYCL for developers at IWOCL. This presentation will take developers through what SYCL is, our latest news, and the new possibilities SYCL opens up for developers. We are particularly keen to talk about how SYCL can work well with other C++ standards and libraries, such as C++17, to bring the high performance and widespread device support of OpenCL to a whole new community of developers. Andrew Richards is Chair of the SYCL working group.

  • Wednesday, 13 May
  • Schedule: 12:10-12:30
  • Li Ka Shing Center, Stanford

Kernel Composition in SYCL

Ralph Potter, Paul Keir, Russell J. Bradford and Alastair Murray. University of Bath, University of the West of Scotland and Codeplay.

Parallel primitives libraries reduce the knowledge required for developers to begin developing parallel applications and accelerating them with OpenCL. Unfortunately, some current libraries implement each primitive as an individual kernel, and so incur a high performance cost in off-chip memory operations for intermediate variables. We describe a methodology for creating efficient domain-specific embedded languages on top of the SYCL for OpenCL standard for parallel programming. Using this approach, we developed a small example language that provides an environment for composing image processing pipelines from a library of more primitive operations, while retaining the capability to generate a single kernel from a complex expression and so eliminate unnecessary intermediate loads and stores to global memory. This elimination of global memory accesses leads to a 2.75x speedup over implementing an unsharp mask in OpenCLIPP. We give details of our domain-specific embedded language, and provide experimental performance measurements of both primitive performance and an unsharp mask operation composed of multiple primitives.
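The fusion idea can be sketched in plain Python (illustrative only; the stage names and helpers below are hypothetical and stand in for the paper's SYCL primitives): composing per-pixel stages into one function yields a single pass over the image with no intermediate buffer between stages, which is the source of the reported speedup.

```python
# Sketch of kernel composition (plain Python, not SYCL): per-pixel stages
# are fused into a single per-pixel function, so the "fused kernel" makes
# one pass over the image with no intermediate buffer between stages.

def brighten(p):
    return min(p + 10, 255)      # hypothetical per-pixel stage

def invert(p):
    return 255 - p               # hypothetical per-pixel stage

def fuse(*stages):
    """Compose per-pixel stages into one function (one fused kernel)."""
    def fused(p):
        for stage in stages:
            p = stage(p)
        return p
    return fused

def run_unfused(image, stages):
    # One pass (and one intermediate buffer in "global memory") per stage.
    for stage in stages:
        image = [stage(p) for p in image]
    return image

def run_fused(image, stages):
    # A single pass; intermediates stay in locals ("registers").
    kernel = fuse(*stages)
    return [kernel(p) for p in image]
```

Both paths compute the same result; the fused path simply avoids materialising the output of `brighten` before `invert` runs, which is the analogue of eliminating intermediate global-memory traffic.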

  • Wednesday, 13 May
  • Schedule: 12:30-13:00
  • Li Ka Shing Center, Stanford

Lunch Break & Table-Top Demonstrations

  • Wednesday, 13 May
  • Schedule: 13:00-14:00
  • Li Ka Shing Center, Stanford

Platinum Sponsor Invited Talk
GPU Compute on Snapdragon

Eric Demers, Vice President of Engineering, Qualcomm.

Eric will provide an overview of current Qualcomm activities related to OpenCL and GPU Compute, and talk about the best ways for developers to approach Adreno GPUs for differentiating their products. Eric is responsible for Qualcomm’s graphics hardware, including graphics technology research and development and delivery of the Adreno™ graphics cores. He has over 20 years of experience in the GPU industry, having worked at companies including AMD, ATI and SGI. He holds a Master’s in Engineering from Cornell University in Ithaca, New York, specializing in computer architecture and signal processing. Eric Demers is a member of the Association for Computing Machinery and SIGGRAPH.

  • Wednesday, 13 May
  • Schedule: 14:00-14:20
  • Li Ka Shing Center, Stanford

Oclgrind: An Extensible OpenCL Device Simulator

James Price and Simon McIntosh-Smith. University of Bristol.

We describe Oclgrind, a platform designed to enable the creation of developer tools for analysis and debugging of OpenCL programs. Oclgrind simulates how OpenCL kernels execute with respect to the OpenCL standard, adhering to the execution and memory models that it defines. A simple plugin interface allows developer tools to observe the simulation and collect execution information, to provide useful analysis or to catch bugs that would otherwise be difficult to spot when running the application on a real device. We give details about the implementation of the simulator, and describe how it can be extended with plugins that provide useful developer tools. We also present several example use-cases that have already been created using this platform, motivated by real-world problems that OpenCL developers face.
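The observer-style plugin idea can be sketched as follows (a hypothetical Python model, not Oclgrind's actual C++ plugin interface): the simulator notifies registered plugins of each memory access, and a plugin can flag accesses that fall outside an allocated buffer.

```python
# Toy model of a simulator with a plugin interface (hypothetical Python,
# not Oclgrind's real API): the simulator notifies plugins of every memory
# load, and a bounds-checking plugin records out-of-bounds accesses.

class Plugin:
    def memory_load(self, address, size):
        pass                              # default: observe nothing

class BoundsChecker(Plugin):
    """Plugin that records loads falling outside the allocated buffer."""
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.errors = []

    def memory_load(self, address, size):
        if address + size > self.buffer_size:
            self.errors.append(f"out-of-bounds load at {address}")

class Simulator:
    def __init__(self):
        self.plugins = []

    def register(self, plugin):
        self.plugins.append(plugin)

    def load(self, address, size):
        for plugin in self.plugins:       # notify observers before "executing"
            plugin.memory_load(address, size)

sim = Simulator()
checker = BoundsChecker(buffer_size=16)
sim.register(checker)
sim.load(8, 4)    # within the 16-byte buffer: no error recorded
sim.load(14, 4)   # overruns the buffer: recorded by the plugin
```

The key design point, as in Oclgrind, is that the simulator itself stays tool-agnostic: each analysis lives in a plugin that merely observes the simulated execution.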

  • Wednesday, 13 May
  • Schedule: 14:20-14:50
  • Li Ka Shing Center, Stanford

Asynchronous OpenCL/MPI Numerical Simulations of Conservation Laws

Philippe Helluy, Thomas Strub, Michel Massaro and Malcolm Roberts. Univ. Strasbourg, Inria, IRMIA and AxesSim.

Hyperbolic conservation laws are important mathematical models for describing many phenomena in physics and engineering.
The Finite Volume (FV) method and the Discontinuous Galerkin (DG) method are two popular methods for solving conservation laws on computers. Both are good candidates for parallel computing: they require a large amount of uniform and simple computation, they rely on explicit time-integration, and they present regular and local data access patterns. In this paper, we present several FV and DG numerical simulations that we have carried out with the OpenCL and MPI paradigms.
First, we compare two optimized implementations of the FV method on a regular grid: an OpenCL implementation and a more traditional OpenMP implementation. We compare the efficiency of the approach on several CPU and GPU architectures from different vendors. Then we give a short presentation of the DG method. Finally, we present how we have implemented this DG method in the OpenCL/MPI framework in order to achieve high efficiency. The implementation relies on splitting the DG mesh into sub-domains and sub-zones, with different kernels compiled according to the zones’ properties. In addition, we rely on the OpenCL asynchronous task graph to overlap OpenCL computations, memory transfers and MPI communications.
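The compute core of such a simulation can be sketched in a few lines (a minimal Python illustration, not the authors' OpenCL/MPI code): one explicit upwind finite-volume step for the scalar advection equation u_t + a u_x = 0 on a periodic 1D grid, the kind of uniform, local update that maps well to GPU kernels.

```python
# Minimal 1D finite-volume sketch (plain Python, illustrative only): one
# explicit upwind step for the advection equation u_t + a*u_x = 0 on a
# periodic grid. Conservative form: each cell loses its right-face flux
# and gains its left-face flux.

def fv_step(u, a, dt, dx):
    """One upwind finite-volume step; assumes a > 0 and periodic boundaries."""
    n = len(u)
    # For a > 0 the upwind flux at the right face of cell i is a*u[i];
    # the flux at its left face is a*u[i-1] (periodic wrap-around).
    return [u[i] - (dt / dx) * (a * u[i] - a * u[(i - 1) % n])
            for i in range(n)]
```

Because each cell's update reads only its own value and one neighbour, the stencil is local and regular, which is exactly why such schemes parallelise well and why only the sub-domain boundary cells need to travel over MPI between steps.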

  • Wednesday, 13 May
  • Schedule: 14:50-15:10
  • Li Ka Shing Center, Stanford

The Great Beyond: Higher Productivity, Parallel Processors and the Extraordinary Search for a Theory of Expression

Alan Ward. Texas Instruments.

Embedded system-on-a-chip (SoC) vendors today are perpetually challenged with the following goals: provide more compute capability, and reduce cost and power. Unfortunately, these goals compete rather than cooperate. In this presentation, I will explore the use of OpenCL, beyond its more typical use cases on GPU and CPU systems, as a unifying agent between the demands on SoC software teams and the demands on SoC design teams. I will discuss OpenCL use on a multicore CPU+DSP SoC, and how we have extended OpenCL to increase its acceptance and adoption within TI’s software community. I will also discuss additional use cases, features and requirements of embedded software that OpenCL does not cover today, but perhaps could!

  • Wednesday, 13 May
  • Schedule: 15:10-15:30
  • Li Ka Shing Center, Stanford

Afternoon Break & Table-Top Demonstrations

  • Wednesday, 13 May
  • Schedule: 15:30-16:00
  • Li Ka Shing Center, Stanford

CHO: Towards a Benchmark Suite for OpenCL FPGA Accelerators

Geoffrey Ndu, Javier Navaridas and Mikel Lujan, University of Manchester.

Programming FPGAs with OpenCL-based high-level synthesis frameworks is gaining attention, with a number of commercial and research frameworks announced. However, there are no benchmarks for evaluating these frameworks. To this end, we present CHO, a benchmark suite for OpenCL that extends CHStone, a commonly used C-based high-level synthesis benchmark suite. We characterise CHO at various levels and use it to investigate compiling non-trivial software to FPGAs. CHO is a work in progress, and more benchmarks will be added over time.

  • Wednesday, 13 May
  • Schedule: 16:00-16:30
  • Li Ka Shing Center, Stanford

Performance Optimization for a SHA-1 Cryptographic Workload Expressed in OpenCL for FPGA Execution

Spenser Gilliland and Fernando Martinez Vallina, Xilinx.

The introduction of Field Programmable Gate Array (FPGA) based devices for OpenCL applications provides an opportunity to develop kernels that execute on application-specific compute units optimized for specific workloads such as encryption. This work examines the optimization of the SHA-1 hashing algorithm, developed in OpenCL, for an FPGA-based implementation. The implementation starts from the freely available SHA-1 implementation in OpenSWAN, ports it to OpenCL, and optimizes the kernel for FPGA implementation using the Xilinx SDAccel development environment for OpenCL applications. At each stage, the implementation is benchmarked to examine latency, throughput and power usage on FPGA, Graphics Processing Unit (GPU) and Central Processing Unit (CPU) systems.

  • Wednesday, 13 May
  • Schedule: 16:30-16:50
  • Li Ka Shing Center, Stanford

IWOCL 2015 Wrap-up and a Look Forward to IWOCL 2016!

Simon McIntosh-Smith, University of Bristol.

  • Wednesday, 13 May
  • Schedule: 16:50-17:00
  • Li Ka Shing Center, Stanford