Conference Program

2019-03-20T10:51:12+00:00


The authors marked in bold will be presenting.  Schedule subject to change.

Monday 13 May – Tutorial, Workshop and DHPCC++ Conference

Advanced Hands-On OpenCL

Prof. Simon McIntosh-Smith, University of Bristol

The “Advanced Hands-On OpenCL Tutorial” focuses on advanced OpenCL concepts and is an extension of the highly successful “Hands-On OpenCL” course, which has received over 6,500 downloads from GitHub. Simon McIntosh-Smith, Professor in High Performance Computing at the University of Bristol and one of the tutorial authors, will lead the sessions along with members of his research team.

“I’m delighted to offer developers the opportunity to extend their OpenCL knowledge by running this advanced version of the open source Hands-On OpenCL tutorial,” said Simon McIntosh-Smith. “The course is based on my extensive ‘HandsOnOpenCL’ online material, which has been incredibly popular on GitHub. Anyone looking to extend their OpenCL skills beyond the introductory level should benefit from this one-day tutorial.”

Course Outline

The tutorial format is a 50/50 split between lectures and exercises and uses a mix of OpenCL C and C++ host APIs. Attendees will require their own laptop to log onto a server running OpenCL 1.0 through OpenCL 2.0. Alternatively, students can run the exercises on their laptops using their preferred OpenCL SDK.

  • Aimed at all developers looking to use OpenCL on any platform
  • Attendees should have written at least one OpenCL program
  • There will be plenty of time for attendees to have specific OpenCL questions addressed.
  • Course Outline
    • Shipping kernel code
    • Portable binaries with SPIR
    • OpenCL kernel compiler options
    • Kernel metaprogramming
    • Optimised host-device communications
    • Using multiple OpenCL devices
    • Performance portability
    • Coalesced memory accesses
    • Tuning work-group sizes
    • Vectorisation
    • OpenCL / OpenGL interoperability
    • The OpenCL ecosystem:
      • OpenCL 2.0 and future versions
      • OpenCL SPIR
      • OpenCL SYCL
      • OpenCL libraries
    • Other OpenCL resources

About the Presenter

Simon McIntosh-Smith is a leading OpenCL trainer, having taught the subject since 2009. He has run many OpenCL training courses at conferences such as Supercomputing and HiPEAC, and has provided OpenCL training for the UK’s national supercomputing service and for the Barcelona Supercomputing Center. With OpenCL training experience ranging from half-day on-site introductions within companies to three-day intensive hands-on workshops, Simon provides standard and customized OpenCL training courses. The tutorial will also be supported by members of Simon’s research team, all of whom are experienced OpenCL software developers. Follow Simon on Twitter: @simonmcs

Track 1

09:30 – 17:30


Optimizing OpenCL for Intel FPGAs

Karl Qi, Intel

FPGAs are reconfigurable silicon used to create custom circuits for accelerating algorithms. This hands-on workshop will cover how to use OpenCL to implement high performance solutions on the FPGA using the latest version of Intel® FPGA SDK for OpenCL. We will examine how kernels are converted to custom dataflow circuits and how executions of the OpenCL kernels are mapped onto the FPGAs. We will experiment with various debug and analysis tools available in the SDK to help us optimize our OpenCL kernels with regard to both FPGA resource consumption and performance. We will examine how loops in kernels can be effectively optimized for deep pipelined-parallel execution. We will practice streaming data in and out of the kernels using pipes and channels from the host, external interfaces, and other kernels for effective inline acceleration. We will guide the compiler to make performance and area trade-offs through the use of attributes, pragmas and arbitrary-precision data types. Lastly, we will discuss how local memory systems can be generated on the FPGA for effective stall-free accesses from kernels.

Attendees will be provided with remote access to a development node and should bring a non-Linux laptop.  A basic knowledge of OpenCL would be an advantage, but nothing too sophisticated is required.

Track 2

09:30 – 17:30


DHPCC++ 2019

4th Distributed & Heterogeneous Programming for C/C++

About DHPCC++

In response to the demand for heterogeneous programming models for C/C++, and the interest in driving these models in ISO C++, Distributed & Heterogeneous Programming in C/C++ includes all the programming models that have been designed to support heterogeneous programming in C and C++.

Many models now exist, including SYCL, HPX, Kokkos, RAJA, C++ AMP, HCC, Boost.Compute, and CUDA, to name a few.

This conference aims to address the needs of both HPC and the consumer/embedded community where a number of C++ parallel programming frameworks have been developed to address the needs of multi-threaded and distributed applications. The C++11/14/17 International Standards have introduced new tools for parallel programming to the language, and the ongoing standardization effort is developing additional features which will enable support for heterogeneous and distributed parallelism into ISO C++ 20/23.

DHPCC++ is an ideal place to discuss research in this domain, consolidate usage experience, and share new directions to support new hardware and memory models with the aim of passing that experience to ISO C and C++.

Program Coming Soon
Track 3

09:30 – 17:30


Conference Sessions on Tuesday 14 May

10:00 - 10:30

Profiling OpenCL Kernels Using Wavefront Occupancy with Radeon GPU Profiler

Perhaad Mistry and Budirijanto Purnomo |  AMD

Profiling OpenCL applications on modern GPUs is usually limited to gathering timestamps from the host side or gathering performance counter data for a complete GPU kernel. In this presentation, we will show the limitations of existing performance counter-based methods with respect to optimizing complex applications. Existing performance counter-based methods of profiling only provide information aggregated over a kernel’s lifetime and do not provide insight into load balancing across shader engines or the behavior of a GPU kernel over time.

In this presentation, we will discuss the Radeon GPU Profiler (RGP)[2]. RGP is a modern GPU performance optimization tool that leverages hardware support unique to AMD GPU platforms[1]. RGP brings to OpenCL developers low-level performance information previously only available to game engine developers. Optimization using RGP is based on a new metric called “Wavefront Occupancy” which is a measure of the wavefront capacity utilization in a GPU at a point in time.

By visualizing wavefront occupancy, RGP provides a fine grained view of the behavior of a kernel as it is executed on a GPU. This allows developers to see load balancing across the different shader engines on the GPU to understand how their OpenCL workgroups are distributed across the GPU.

RGP also provides new insights into overall application throughput for OpenCL applications. RGP shows the data dependencies between consecutively enqueued GPU kernels and how the device driver groups GPU kernels into submissions on the device. RGP also allows us to understand the interaction and synchronization between the GPU and the host CPU driving it. This enables new optimizations by application developers to improve performance by reordering dispatches.

We will present results based on optimizing OpenCL kernels and improving application throughput for a professional ray tracing application. We will also show how RGP can be used to optimize Vulkan applications using async compute. This presentation will show how RGP can be used to optimize an OpenCL application for GCN and Radeon Vega GPUs.

10:30 - 11:00

Advances in the OpenCL Offload Support in GROMACS

Szilard Pall  |  PDC Center for High Performance Computing at the KTH Royal Institute of Technology
Roland Schulz | Intel

GROMACS is a molecular dynamics (MD) simulation package widely used in research and education on machines ranging from laptops to workstations to the largest supercomputers. Built on a highly portable free and open source codebase, GROMACS is known as one of the fastest simulation engines thanks to highly tuned kernels for more than a dozen processor architectures. For CPUs it relies on SIMD intrinsics-based code, while for GPUs, OpenCL is also supported on NVIDIA, AMD and Intel GPUs and is actively developed alongside the dominant CUDA platform.

This talk aims to present the recent advances in improved offload capabilities and broader platform support of the GROMACS OpenCL codebase. With a long history of CUDA-based GPU offload, in an effort to maintain the portability and vendor-neutral GPU offload, an OpenCL port was developed four years ago and has been successfully used predominantly on AMD GPUs. Despite the modest user-base, recent efforts have focused on both closing the feature gap with the CUDA codebase and broadening platform support. The offload of additional computation (the particle mesh Ewald solver) aims to compensate for the shift in the performance advantage of GPUs as well as to better support dense accelerator nodes. Thanks to improved CPU-GPU balance, performance improvement of up to 1.5x can be seen on workstations equipped with AMD Vega GPUs.

Additionally, platform support has been expanded to Intel iGPUs. Thanks to the flexibility of the pair-interaction algorithm developed for wide SIMD-style execution, parameter tuning for this new architecture was done to reach good performance. We observe a 5-25% performance benefit in an asynchronous offload scenario running concurrently on both the CPU cores and the iGPU compared to only using the highly tuned SIMD intrinsics code on the CPU cores. We observe that when a larger fraction of the limited power budget of a mobile processor is left for the iGPU, application performance improves. This suggests that a configurable TDP allocation to better match the inherent computational load of the workload with the hardware balance would be beneficial. The results of our study of an HPC workload on a power-optimized SoC with both throughput and latency cores are particularly interesting as most future high performance processor architectures will feature on-chip heterogeneity with increased integration of different components more or less well suited for different parts of an HPC application.


Coffee Break  |  11:00 to 11:30  |  Including Poster Session and Table-top Demos

11:30 - 12:00

Comparative Performance Analysis of Vulkan Implementations of Computational Applications

Nikolaos Bellas, Maria Rafaela Gkeka and Christos Antonopoulos  |  University of Thessaly, Greece

The recent introduction of the Vulkan API and the SPIR-V intermediate-level language by the Khronos Group provides a new GPU programming model in an effort to combine the advantages of its predecessors, OpenGL for 3D graphics and OpenCL for computing. Vulkan’s low-level and more direct control over the underlying GPU hardware as well as its support for explicit multi-threaded execution offers opportunities for better performance at the cost of higher programming effort.

Most of the previous work associated with Vulkan has targeted the graphics pipeline. The fact that Vulkan also supports the compute pipeline has motivated us to examine it from the GPGPU perspective, by porting a number of realistic applications to a desktop GPU and evaluating their Vulkan implementations in terms of performance and programmability. Specifically, we consider the Laplacian filter which is used in image processing to detect areas of rapid change (edges) in images. Also, we consider a Visual Odometry (VO) application used to track the position and pose of a robot by analyzing a sequence of camera frames. VO is part of a Simultaneous Localization and Mapping (SLAM) application used in autonomous navigation systems to build a map of surrounding environments and to determine the location of a moving robot inside this map. These applications require advanced pixel-level processing at different levels of pyramid-based granularity, and may even require real-time performance (when, for example, SLAM is used in a robot navigation system). We ported the original implementations (written in C for Laplacian filter and in CUDA for SLAM) to OpenCL, OpenGL and Vulkan and evaluated their performance on a desktop NVIDIA GPGPU.
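As a rough illustration of the Laplacian filter mentioned above, the following Python sketch applies a 4-neighbour Laplacian stencil to a small grayscale image stored as nested lists. The function name `laplacian_filter`, the choice of the 4-neighbour kernel, and the zeroed border are assumptions made for illustration; the abstract does not state which kernel variant or boundary policy the authors used.

```python
def laplacian_filter(image):
    """Apply a 4-neighbour Laplacian stencil to a 2D list of numbers.

    Border pixels are left at 0 (an assumed boundary policy).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Sum of the four direct neighbours minus four times the centre.
            out[y][x] = (image[y - 1][x] + image[y + 1][x]
                         + image[y][x - 1] + image[y][x + 1]
                         - 4 * image[y][x])
    return out


# A flat region produces no response; a vertical step edge does.
flat = [[5] * 4 for _ in range(4)]
step = [[0, 0, 10, 10] for _ in range(4)]
flat_response = laplacian_filter(flat)
step_response = laplacian_filter(step)
```

Each output pixel depends only on a fixed neighbourhood of the input, which is why this kind of filter maps naturally onto one GPU work-item per pixel.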

We show that Vulkan performance is comparable (within 10%) with the performance attained by OpenCL and higher than the performance attained by OpenGL compute shader implementations. By exploiting Vulkan synchronization primitives using the command buffer, we can eliminate the overhead of launching multiple kernel invocations in iterative applications and improve performance of Vulkan implementations by up to 30%. However, the OpenCL compiler seems to be more mature than the SPIR-V compiler used in Vulkan implementations resulting in slightly faster OpenCL kernel execution.

On the other hand, the low-level semantics of Vulkan demand higher programming effort compared with OpenCL/OpenGL, which can be a burden if Vulkan is to be used as a GPGPU programming model. Most of the additional effort, however, is boilerplate code that can be reused in more than one Vulkan application.

Our work is one of the first to consider Vulkan compute as an implementation language for larger scale applications (and not just for small kernels as in previous work).

12:00 - 12:30

Developing Performance-Portable OpenCL Code via Multi-Dimensional Homomorphisms

Ari Rasch, Richard Schulze and Sergei Gorlatch |  University of Münster

A key challenge in programming high-performance applications is achieving portable performance, such that the same program code can reach a consistent level of performance over the variety of modern parallel processors, including multi-core CPU and many-core Graphics Processing Units (GPU), and over the variety of problem sizes.

Popular approaches to parallel programming are either restricted to the hardware of a particular vendor (like CUDA for NVIDIA) or, even if they provide code portability (like OpenCL), performance portability is usually not available: for example, a parallel program achieving high performance on a GPU often yields poor performance on a CPU, or even on another GPU model. The reason is that hardware architectures differ significantly in their characteristics, e.g., GPUs provide a high number of cores but small caches while CPUs have a low number of cores and big caches; also, GPUs from different vendors (e.g., NVIDIA vs. AMD) pose different or even contradictory requirements on the code for achieving the full performance potential of the corresponding architecture. Performance also differs across input sizes. For example, a high-performance implementation of GEneral Matrix-Matrix Multiplication (GEMM) targeting big input matrices differs significantly from a GEMM implementation optimized for small matrices, e.g., as used in deep learning. This is because high performance on big matrices is achieved by computing all elements of the resulting matrix simultaneously and each of them sequentially, whereas for high performance on small matrices, the computation of each element should be parallelized as well.

The lack of performance portability often requires re-designing program code for every new target architecture and/or another problem size.

In this talk, we address an approach to performance portability based on patterns of parallelism and auto-tuning. We extend the functional formalism of Multi-Dimensional Homomorphisms (MDH) that allows us to express a wide range of applications (including the popular BLAS routines and stencil computations) as MDH-instances. For MDH, we develop a generic OpenCL implementation schema. This schema is performance-portable: it is parametrized with the performance-critical parameters of OpenCL’s platform and memory model, such that, for each particular MDH-instance, particular problem size and particular target architecture, we can automatically find the well-performing parameter values using our novel Auto-Tuning Framework (ATF), and thereby adapt the OpenCL code correspondingly.

Our experiments with linear algebra routines (BLAS) and stencil applications demonstrate that we reach competitive and often even significantly better performance than the related work — e.g., speedup factors of up to 5x over the hand-implemented, vendor-provided BLAS libraries Intel MKL and NVIDIA cuBLAS — on representative parallel architectures and for important input sizes that are used in deep learning.

12:30 - 13:00

Evaluating Portability and Performance of OpenCL FPGA Kernels on Intel HARPv2

Anthony M. Cabrera and Roger Chamberlain |  Washington University in St. Louis

As the end of Moore’s law draws nearer, researchers across disciplines are looking beyond relying on performance increases through faster CPU clock speeds and advances in semiconductor process technologies. FPGAs offer a heterogeneous compute solution to this problem by enabling the creation of application-specific hardware that accelerates computation. While the barrier to entry has historically been steep, advances in High Level Synthesis (HLS) are making FPGAs more accessible. Specifically, the Intel FPGA OpenCL SDK allows software designers to abstract away low level details of architecting hardware on an FPGA and allows them to author computational kernels in higher level languages. Furthermore, Intel has developed a system that incorporates both a multicore Xeon CPU and Arria 10 FPGA into the same chip package, as part of the Heterogeneous Accelerator Research Program (HARP), that can be targeted by their SDK.

In this work, we target the second iteration of the HARP platform (HARPv2) using HLS by porting OpenCL kernels written for FPGAs connected via PCIe card. We evaluate their performance against previously reported results, explore the portability of kernels through a hardware design space search, and empirically show the benefits of using the SVM abstraction over explicit reads and writes to the FPGA. Additionally, all code will be made available via GitHub and all FPGA images and raw data will be available via WashU OpenScholarship.


Lunch Break  |  13:00 to 14:00  |  Including Poster Session and Table-top Demos

14:00 - 15:30

Khronos Update - OpenCL, SYCL and SPIR - The Next Steps

The session is awaiting confirmation from the speaker.

Coffee Break  |  15:30 to 16:00  |  Including Poster Session and Table-top Demos

16:00 - 17:30

Khronos Panel Discussion

Chaired by Simon McIntosh-Smith, with panelists selected from the Khronos OpenCL, SYCL and SPIR working groups and leading members of the OpenCL development community.

This session is always a favourite amongst attendees and we don’t expect it to be any different this year.  This panel session provides the opportunity for delegates to quiz the Khronos working group members and other panelists on any topic related to OpenCL.  Don’t hold back!


IWOCL 2019 Conference Dinner  |  18:30 to 21:00  |  Northeastern University Faculty Club

Conference Sessions on Wednesday 15 May

09:00 - 09:30

Blurring the Boundary between CPU and GPU

INVITED TALK by: Jerome Glisse, Karol Herbst |  Red Hat

OpenCL 2.0 defines various levels of shared virtual memory (SVM), a feature that is still not widely adopted by end users.  This talk aims to provide insight into the ways SVM can help OpenCL programmers in their applications. It will also look at some of today’s pitfalls and limitations and at the work under way inside the Linux kernel to address them and improve the usability of this feature.  The talk will reference the work undertaken to add support for OpenCL to Nouveau through SPIR-V/NIR in order to be able to use HMM (Heterogeneous Memory Management).

09:30 - 10:00

Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN

Rod Burns, John Lawson, Duncan McBain and Daniel Soutar |  Codeplay

Over the past few years machine learning has seen a renewed explosion of interest, following a number of studies showing the effectiveness of neural networks in a range of tasks which had previously been considered incredibly hard. Neural networks’ effectiveness in the fields of image recognition and natural language processing stems primarily from the vast amounts of data available to companies and researchers, coupled with the huge amounts of compute available in modern accelerators such as GPUs, FPGAs and ASICs. There are a number of approaches available to developers for utilizing GPGPU technologies such as SYCL, OpenCL and CUDA, however many applications require the same low level mathematical routines. Libraries dedicated to accelerating these common routines allow developers to easily make full use of the available hardware without requiring low level knowledge of the hardware themselves, however such libraries are often provided by hardware manufacturers for specific hardware such as cuDNN for Nvidia hardware or MIOpen for AMD hardware.

SYCL-DNN is a new open-source library dedicated to providing accelerated routines for neural network operations which are hardware and vendor agnostic. Built on top of the SYCL open standard and written entirely in standard C++, SYCL-DNN allows a user to easily accelerate neural network code for a wide range of hardware using a modern C++ interface. The library is tested on AMD’s OpenCL for GPU, Intel’s OpenCL for CPU and GPU, ARM’s OpenCL for Mali GPUs as well as ComputeAorta’s OpenCL for RCar CVEngine and host CPU. In this talk we will present performance figures for SYCL-DNN on this range of hardware, and discuss the requirements for achieving high performance on such a varied set of accelerators with such different hardware features.

For additional information visit:

10:00 - 10:30

How to Deploy AI Software to Self Driving Cars

Rod Burns, Gordon Brown, Meenakshi Ravindran and Nicolas Miller |  Codeplay

The automotive industry is embracing new challenges to deliver self-driving cars, and this in turn requires increasingly complex hardware and software. Software developers are leveraging artificial intelligence, and in particular machine learning, to deliver the capabilities required for an autonomous vehicle to operate. This has driven the integration of heterogeneous hardware into automotive systems offering multi-core processors capable of performing the intense algorithms required for artificial intelligence and machine learning. These multi-core processors can be used to vastly speed up common operations used in AI and machine learning algorithms.

This session will demonstrate how artificial intelligence software can be developed and accelerated using SYCL and OpenCL on Yocto Linux, then targeted at a range of hardware including the Renesas R-Car IMP range of automotive processors. The OpenCL model enables extensive usage of heterogeneous hardware, including fully programmable IP, efficient data transfer using the DMA and on-chip memory, and fixed-function IP blocks, such as CNN for enabling high throughput convolution operations, via OpenCL built-in kernels. We will look at how memory mapping improves efficiency, and at software pipelining and parallelism. These hardware architectures include AI accelerator processors specifically designed to be used in the next generation of vehicles. In particular, the processors are designed to tackle complex algorithms whilst limiting the overall consumption of power. Benchmarks will be presented to show how portable code can also deliver performance for developers using this hardware.

As well as enabling developers to choose OpenCL or SYCL, we will talk about how these standards enable additional high-level frameworks that can be used to target this hardware. These include libraries for deep neural networks and linear algebra operations.

10:30 - 11:00

Breaking the Last Line of Performance Border.

Michal Mrozek |  Intel

In this talk we will present various techniques that we used to optimize performance of our clDNN libraries and obtain top notch performance.

We will provide details about the following techniques:

  • offloading execution
  • combining primitives into graphs to apply graph level optimizations
  • leveraging proper layout of data
  • primitive fusing
  • using memory padding
  • aggregating independent primitives to be executed concurrently
  • selecting dedicated kernels, written to serve specific use cases efficiently
  • utilizing proper memory pool
  • auto tuning

We will provide details about those techniques, how much performance can be gained with those and how to apply them to OpenCL programs.


Coffee Break  |  11:00 to 11:30  |  Including Poster Session and Table-top Demos

11:30 - 12:00

Performance Evaluation of OpenCL Standard Support (and Beyond)

Tyler Sorensen | Princeton University
Sreepathi Pai | University of Rochester
Alastair F. Donaldson  |  Imperial College London and Google

In this talk, we will discuss how support for a diverse set of OpenCL features affects performance in the domain of graph applications executing on GPU platforms. Given that adoption of OpenCL features varies widely across vendors, these results can help quantify the performance benefits of, and potentially motivate, the timely adoption of these OpenCL features.

Our findings are drawn from the experience of developing an OpenCL backend for a state-of-the-art graph application DSL, originally developed with a CUDA backend. This DSL allows competitive algorithms for applications such as breadth-first-search, page-rank, and single-source-shortest-path to be written at a high level. A series of optimisations can then be applied by the compiler and executable OpenCL code can be generated. These optional optimisations exercise various features of OpenCL: on one end of the spectrum, applications compiled without optimisations require only core OpenCL features provided in version 1.1 of the standard; on the other end, a certain optimisation requires inter-workgroup forward progress guarantees, which are yet to be officially supported by OpenCL, but have been empirically validated. Other optimisations require OpenCL features such as: fine-grained memory consistency guarantees (added in OpenCL 2.0) and subgroup primitives (added to core in OpenCL 2.1).

Our compiler can apply 6 independent optimisations. For each optimisation, we determine the minimum version of OpenCL required to support the optimisation. We find that the relevant OpenCL versions, and the number of optimisations they support, are: 1.1 (2 optimisations are supported), 2.0 (adds 1 additional optimisation), and 2.1 (adds 2 more optimisations). We additionally create the notion of version FP (forward-progress) that adds support for unofficial forward progress guarantees, which are required for the final optimisation. Clearly, as support increases, so does the number of supported optimisations. For each optimisation, we will discuss the OpenCL features required for support and the idioms in which the features are used. Use-case discussions of these features (i.e. memory consistency and subgroup primitives) are valuable as there appear to be very few open-source examples, e.g. a GitHub search shows only a small number of examples.

The compiler infrastructure enables us to carry out a large and controlled study, in which the performance benefit of various levels of OpenCL support can be evaluated. We gather runtime data exhaustively on all combinations across: all optimisations, 17 applications, 3 graph inputs, 6 different GPUs (spanning 4 vendors: Nvidia, AMD, Intel and ARM). Our results show that if feature support is limited to OpenCL 2.0 (and below), the available optimisations fail to achieve any speedup in over 70% of the cases. If support for OpenCL 2.1 is added, then this number drops to 60%; however, in all of these cases, observed application speedup is modest, rarely exceeding 2x. Finally, if unsupported forward progress guarantees can be assumed, then speedups can be observed in over half of the cases, including impressive speedups of over 14x for AMD and Intel GPUs. We believe this provides compelling evidence for forward progress properties to be considered for adoption for a future OpenCL version.

12:00 - 12:30

OpenCL vs: Accelerated Finite-Difference Digital Synthesis

Harri Renney, Benedict Gaster and Tom Mitchell  |  University of the West of England

Digital audio synthesis has become an important component of modern music production with techniques that can produce realistic simulations of real instruments. Physical modelling sound synthesis is a category of audio synthesis that uses mathematical models to emulate the physical phenomena of acoustic musical instruments including drum membranes, air columns and strings. The synthesis of physical phenomena can be expressed as discrete variants of Newton’s laws of motion, using, for example, the Finite-Difference Time-Domain method or FDTD.

FDTD is notoriously computationally expensive and the real-time demands of sound synthesis in a live setting have led implementers to consider offloading to GPUs. In this paper we present multiple OpenCL implementations of FDTD for real-time simulation of a drum membrane. Additionally, we compare against an AVX-optimized CPU implementation and an OpenGL version that utilizes a careful mapping to the GPU texture cache. We find, using a discrete, laptop-class AMD GPU, that for all but the smallest mesh sizes the OpenCL implementation outperforms the others. To our surprise, however, we found that optimizing for workgroup local memory provided only a small performance benefit.
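To make the FDTD scheme more concrete, here is a minimal Python sketch of one time step of the standard explicit finite-difference update for the 2D wave equation on a membrane with a clamped edge. It is a generic textbook formulation, not the authors' implementation; the name `fdtd_step`, the parameter `courant_sq` (the squared Courant number, which should not exceed 0.5 for stability in 2D), and the fixed-boundary choice are assumptions for illustration.

```python
def fdtd_step(u_prev, u_curr, courant_sq):
    """Advance a 2D membrane by one time step.

    u_prev, u_curr: displacement grids at the previous and current steps.
    courant_sq: assumed squared Courant number (c * dt / dx) ** 2.
    Boundary points stay at 0.0, modelling a clamped membrane edge.
    """
    rows, cols = len(u_curr), len(u_curr[0])
    u_next = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Discrete Laplacian from the four direct neighbours.
            laplacian = (u_curr[i + 1][j] + u_curr[i - 1][j]
                         + u_curr[i][j + 1] + u_curr[i][j - 1]
                         - 4.0 * u_curr[i][j])
            u_next[i][j] = (2.0 * u_curr[i][j] - u_prev[i][j]
                            + courant_sq * laplacian)
    return u_next


# Strike the centre of a 5x5 membrane and advance one step.
size = 5
u_prev = [[0.0] * size for _ in range(size)]
u_curr = [[0.0] * size for _ in range(size)]
u_curr[2][2] = 1.0
u_next = fdtd_step(u_prev, u_curr, 0.25)
```

Every interior point is updated independently from the two previous grids, so the inner loops correspond to the per-work-item body of a GPU kernel, and the whole grid must be recomputed for every audio sample.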

12:30 - 13:00

The Challenge of Targeting Scratch-pad Memory Devices with OpenCL

Zvi Rackover  |  Mobileye

A Scratch-pad memory (SPM) DSP or programmable accelerator is characterized by small local memories, which are filled and flushed using asynchronous DMA operations, and a small number of execution units with high instruction level parallelism (ILP). The Data Parallel Programming Model does not fit this kind of device well due to the lack of abundant compute elements that can execute the same program in parallel.

In order to maximize utilization of the device’s execution resources, techniques such as double-buffering are employed to implement overlapping of compute and DMA operations.
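The double-buffering pattern can be sketched in a few lines of Python. This is only a sequential model under stated assumptions: the function name `process_tiles_double_buffered` is hypothetical, list copies stand in for DMA transfers, and no real overlap takes place. On an actual device the load of the next tile would be issued asynchronously and would run while the current tile is being processed.

```python
def process_tiles_double_buffered(data, tile_size, compute):
    """Process `data` tile by tile using two alternating buffers.

    While buffers[current] is being consumed, the next tile is placed in
    buffers[1 - current]. The list copy models a DMA transfer into
    scratch-pad memory (assumption: sequential here, asynchronous on hardware).
    """
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    if not tiles:
        return []
    buffers = [None, None]
    buffers[0] = list(tiles[0])  # prefetch the first tile
    results = []
    for k in range(len(tiles)):
        current = k % 2
        if k + 1 < len(tiles):
            # Start "loading" the next tile into the idle buffer.
            buffers[1 - current] = list(tiles[k + 1])
        # Compute on the tile that is already resident.
        results.extend(compute(x) for x in buffers[current])
    return results


squares = process_tiles_double_buffered(list(range(10)), 4, lambda x: x * x)
```

The two buffers swap roles on every iteration, so only two tiles' worth of local memory is needed regardless of the total data size.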

In this talk I will review the challenges faced in an attempt to create an OpenCL kernel that employs double-buffering on a device with banked scratch-pad memories. I will point out the deficiencies of the OpenCL kernel language (and of SPIR) in expressing the essential semantics the program is required to convey in order to achieve a correct and performant solution.


Lunch Break  |  13:00 to 14:00  |  Including Poster Session

14:00 - 14:30

Exploring Integer Sum Reduction using Atomics on Intel CPU

Zheming Jin and Hal Finkel |  Argonne National Lab

Atomic functions are useful in updating a shared variable by multiple threads, barrier synchronizations, constructing complex data structures, and building high-level frameworks. In this paper, we focus on the evaluation and analysis of integer sum reduction, a common data parallel primitive. We convert the sequential reduction into parallel OpenCL implementations on the CPU. We also develop three micro kernels, which allow us to understand the relationships between the kernel performance and the operations involved in reduction. The results of the micro kernels show that increasing the work-group size linearly can linearly improve the kernel performance. There is a sweet spot in the relationship between the work-group size and barrier synchronization overhead. The performance of the atomics over local memory is not sensitive to the work-group size. The sum reduction kernel with vectorized memory accesses can improve the performance of the baseline kernel for a wide range of work-group sizes. However, the vectorization efficiency shrinks with the growing work-group size.
We also find that the vendor’s default OpenCL kernel optimization does not improve the kernel performance. On average, disabling the optimization can reduce the execution time of the kernel with vectorized memory accesses by 15%. We attribute the performance drop to the fact that the default kernel optimizations instantiate a large number of atomics over global memory when implicitly vectorizing the kernel computation.
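
For readers unfamiliar with the pattern, the following Python sketch mimics one common way to structure a sum reduction: a tree reduction inside each work-group, followed by one accumulation per work-group into a shared total. It is a sequential simulation under assumed names (`tree_reduce`, `sum_reduction`), not the kernels evaluated in the paper; on a device, the final accumulation would be an atomic add.

```python
def tree_reduce(local):
    """Tree reduction within one work-group.

    Each pass of the while loop corresponds to one barrier-separated step.
    """
    local = list(local)
    n = len(local)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            local[i] += local[i + stride]
        stride *= 2
    return local[0] if local else 0


def sum_reduction(values, work_group_size):
    """Reduce `values` one work-group at a time."""
    total = 0  # stands in for an accumulator in global memory
    for start in range(0, len(values), work_group_size):
        partial = tree_reduce(values[start:start + work_group_size])
        total += partial  # on a device: one atomic add per work-group
    return total


print(sum_reduction(list(range(100)), 8))  # prints 4950
```

A larger `work_group_size` means fewer accumulations into the shared total but more barrier-separated steps per group, which is the kind of trade-off the abstract refers to.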

14:30 - 15:00

MGSim: a Flexible High-Performance Simulator for Multi-GPU Systems

Yifan Sun, Trinayan Baruah, Shi Dong, and David Kaeli

GPUs can provide both high performance and energy efficiency in processing data-parallel workloads. Today, GPUs are accelerating a wide range of applications, spanning large-scale physics simulations to deep neural network training. However, faced with the ever-increasing amounts of data in many of these applications, a single GPU can no longer satisfy the compute and memory demands of these workloads. In response, industry has started to offer multi-GPU systems, designing high-performance platforms with an impressive amount of raw computing power. The introduction of new multi-GPU systems comes with a number of new design challenges, including innovations in scalable distributed shared-memory design, new cache coherency policies and high-throughput GPU interconnects. Presently, there is no open-source simulation framework that can support detailed simulation and architectural exploration of multi-GPU tradeoffs.

Modifications to existing simulators to support multi-GPU system modeling require a complete redesign of a framework, and result in poorly architected simulation infrastructure. Instead, we believe it is time for a new class of simulator, one that addresses many of the current issues present in architectural simulators. The time is right for a flexible simulator infrastructure that satisfies these design requirements and provides a rich framework to support multi-GPU

To enable multi-GPU architectural modeling, we introduce MGSim, a new open-source, cycle-level multi-GPU simulator. MGSim runs AMD GCN3 binaries that are compiled from OpenCL kernels using the official ROCm drivers. MGSim natively supports running parallel simulation without compromising simulation accuracy. MGSim also features a flexible modular design, allowing users to create a wide variety of system configurations. We developed MGSim using the Go programming language, primarily based on Go’s simplicity, tool support, and language-level multi-threading support.

MGSim represents the next generation in GPU simulation. In terms of accuracy, MGSim simulations differ by 5.5% on average from GPU hardware execution. Exploiting the multi-threaded capabilities of our simulation, on a 4-core CPU we can achieve a 3.5X speedup running functional emulation and a 2.5X speedup running detailed timing simulation, as compared to single-threaded simulation.

15:00 - 15:30

To be announced

The session is awaiting confirmation from the speaker.

Coffee Break  |  15:30 to 16:00  |  Including Poster Session

16:30 - 17:30

Open Discussion

This session provides an opportunity, after the main presentations have finished, to network with other delegates and members of the Khronos working group.

Posters – Tuesday 14 and Wednesday 15, May


Data Integration Tasks on Heterogeneous Systems Using OpenCL

Clayton Faber, Anthony Cabrera, Orondé Booker, and Roger Chamberlain | Washington University in St. Louis
Gabe Maayan | Rensselaer Polytechnic Institute

In the era of big data, many new programs and algorithms are developed to find the most efficient way to perform computations on massive amounts of data. However, the preprocessing step for many of these applications is often overlooked. Surprisingly, data integration tasks can sometimes take an inordinate amount of time compared to the running time of the actual algorithm. The Data Integration Benchmark Suite (DIBS) [1] was designed to understand the characteristics of dataset transformations from a hardware-agnostic point of view. While on the surface these applications have a high degree of data parallelism, there are caveats in their specification that can potentially affect this characteristic. Even so, OpenCL can be an effective deployment environment for these applications, where we can decompose the data transformations into data points for individual work items and in turn those work items into queues. In this work we take a subset of the data transformations from each category presented in DIBS and implement them in OpenCL to evaluate their performance on heterogeneous systems. To target heterogeneous systems, we will take a common application and attempt to deploy it to three platforms targetable by OpenCL (CPU, GPU, and FPGA). The applications will be evaluated by their total transformation time, which will not include the time taken to bring data from disk into main memory. Through this we will illustrate the advantages of each compute device in the data integration space, along with the different communication schemes available for host/device communication in the OpenCL platform. The primary distinguishing factor among the applications we consider is whether or not there is an apparent sequential dependency in the specification of the data integration task to be performed.
Several of the applications have no such dependency (i.e., they are embarrassingly parallel at the level of individual data elements), and subsequently perform quite well on each of the target platforms. The more interesting cases are those for which there is a sequential dependency (e.g., parsing comma-separated fields), and considerably more effort must be expended to enable these applications to perform well. The greatest success is seen when the sequential dependencies can be expressed in terms of prefix operators that can then be executed by parallel prefix mechanisms.
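
To make the prefix-operator idea concrete, here is a small Python sketch for the comma-separated example. It assumes a simplified CSV dialect in which a comma is a field separator only when an even number of double quotes precedes it; the function names `field_separators` and `split_fields` are hypothetical. `itertools.accumulate` runs sequentially here, but because XOR is associative, the same running parity could be computed by a parallel prefix scan.

```python
from itertools import accumulate
from operator import xor


def field_separators(line):
    """Return the positions of commas that lie outside quoted regions."""
    is_quote = [1 if ch == '"' else 0 for ch in line]
    # Running parity of quotes seen so far: 1 means "inside quotes".
    inside = list(accumulate(is_quote, xor))
    return [i for i, ch in enumerate(line) if ch == "," and inside[i] == 0]


def split_fields(line):
    """Cut the line at the separator positions found above."""
    cuts = [-1] + field_separators(line) + [len(line)]
    return [line[a + 1:b] for a, b in zip(cuts, cuts[1:])]


print(field_separators('a,"b,c",d'))  # prints [1, 7]
print(split_fields('a,"b,c",d'))      # prints ['a', '"b,c"', 'd']
```

Once the parity array is available, each character can be classified independently, which removes the apparent sequential dependency from the per-element work.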

[1] Cabrera, A., Faber, C., Cepeda, K., et al. 2018. DIBS: A Data Integration Benchmark Suite. In Proc. of ACM/SPEC International Conference on Performance Engineering Companion (ICPE ’18). ACM, New York, NY, USA, 25-28.


Optimization of Fast Fourier Transform (FFT) on Qualcomm Adreno GPUs

Hongqiang Wang, Alex Bourd and Vaibhav Rajesh Gandhi | QUALCOMM Technologies Inc.

As a classical algorithm for computing the discrete Fourier transform of a sequence, the fast Fourier transform (FFT) has been widely used in many applications, including traditional multimedia signal processing and machine learning.

In this poster, we illustrate how to accelerate the FFT algorithm on Qualcomm’s Adreno GPUs by using OpenCL, a general-purpose, royalty-free API that is widely available on desktop and mobile GPUs. We provide a high-level overview of Adreno’s compute architecture and of OpenCL optimization for mobile GPUs.

We show that decent FFT performance with good power and energy efficiency can be achieved by using various optimization techniques, such as use of on-chip memory, sophisticated design of memory access patterns and good parallelism.
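
As background for the algorithm being optimized, the following is a textbook recursive radix-2 Cooley-Tukey FFT in Python. It is a reference sketch only, assumes the input length is a power of two, and does not reflect the Adreno-specific optimizations presented in the poster; the function name `fft` is chosen for this example.

```python
import cmath


def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


spectrum = fft([1, 1, 1, 1])
print([round(abs(v), 6) for v in spectrum])  # prints [4.0, 0.0, 0.0, 0.0]
```

The butterflies within one stage are independent of each other, which is where the parallelism and the memory access patterns mentioned above come into play.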


Parsing CUDA for Transformation to SYCL in an IDE

Tobias Stauber and Peter Sommerlad | IFS Institute for Software at FHO-HSR Rapperswil

This poster presents a master's thesis that created extensible CUDA® support for Cevelop, IFS’ C++ IDE based on Eclipse CDT. One component, ReSYCLator, uses this infrastructure to interactively transform existing CUDA C++ parallel computation solutions to SYCL, an open specification by the Khronos® Group. In contrast to CUDA, SYCL allows expressing heterogeneous parallel programs in standard C++ syntax.

A good foundation for CUDA to SYCL transformation needs its own parser for CUDA C++ code. This parser enhances its AST with detailed contextual information concerning a piece of code’s execution-space, denoting if a given function runs on the GPU device or the host computer’s CPU. This information greatly simplifies transforming CUDA C++ code to SYCL.

Using this CUDA parsing infrastructure and the AST transformation and rewriting engine of Eclipse CDT, formerly developed by IFS, the transformation happens within the IDE. This has the additional benefit that limitations of the transformation are easily visualized and can be adjusted by the developer. At the same time, tedious manual transformation steps are automated, saving time.


Sparse-Matrix Compression Primitives with OpenCL Framework to Support Halide

Chao-Lin Lee, Chen-Ting Chao, Jenq-Kuen Lee | National Tsing Hua University
Chung Wen Huang and Ming-Yu Hung  |  MediaTek Inc

Halide and OpenCL now play important roles in heterogeneous multi-core computing. OpenCL provides vendor-level support, and Halide provides domain-specific support for areas such as vision processing and AI models (TVM Halide IR). Halide also provides flexible scheduling for applications on target machines, and OpenCL plays a supporting role for Halide environments. In this work, we investigate the research issues in supporting sparse computation with Halide and the corresponding OpenCL support. We present sparse matrix compression primitives on Halide for sparse matrix-matrix multiplication (SpMM) with the OpenCL framework. Halide is a programming language designed to process images and arrays, with numerous algorithms and scheduling primitives, to achieve state-of-the-art performance, including SIMD and heterogeneous computation. Given an m-by-k sparse matrix A and a k-by-n dense matrix B, SpMM computes an m-by-n dense matrix C = AB. We choose the two most common sparse matrix formats: the coordinate (COO) format and the compressed sparse row (CSR) format. The COO format stores data in a list of tuples with three elements: row, column, and value. The first element is the row index, the second is the column index, and the third is the value stored at that row and column. All values stored in COO tuples are non-zero elements. We parallelized the reading of the 2-D array index with OpenCL support in Halide to reduce the traversal time of the array index. When the traversal meets non-zero elements, we store them in COO format. As a result, SpMM can be sped up with the compressed COO matrix. The shortcoming of the COO format is that the row index array contains repeated identical entries. The CSR format improves on this by replacing the array of row indices with a shorter array of row offsets. We also integrate recent related work on sparse matrix compression, hybrid CSR and DCSR. DCSR (Doubly Compressed Sparse Row) and CSR split the sparse matrix into clustered row segments and light row segments. 
Heavily clustered row segments of the DCSR format can be used as a basis for tiling. This method enables higher data reuse than the COO format. The design of experiments includes Halide primitives for sparse matrix compression and matrix computations. The experiments were performed on an AMD Radeon R9 GPU with the OpenCL 2.0 framework. Our experiments use Trefethen20000, ACTIVSg10K, G67, and ACTIVSg2000 as data sets from the SuiteSparse Matrix Collection. The experimental results for computation with compressed matrices show that performance is improved by more than 85% compared to the baseline without compression. Our work also gives detailed methods for using OpenCL to implement sparse matrix compression for Halide.
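
The COO and CSR formats and the SpMM operation described above can be sketched in plain Python as follows. This is an illustrative sequential version, not the Halide/OpenCL primitives from the poster; the function names `dense_to_coo`, `coo_to_csr`, and `spmm_csr` are hypothetical, and `coo_to_csr` assumes the tuples are ordered by row, which `dense_to_coo` guarantees.

```python
def dense_to_coo(a):
    """Collect the non-zero entries of a dense matrix as (row, col, value)."""
    return [(i, j, v) for i, row in enumerate(a)
            for j, v in enumerate(row) if v != 0]


def coo_to_csr(coo, n_rows):
    """Replace the per-entry row indices with an array of row offsets."""
    row_ptr = [0] * (n_rows + 1)
    for i, _, _ in coo:
        row_ptr[i + 1] += 1          # count the entries in each row
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]  # running sum gives the offsets
    cols = [j for _, j, _ in coo]
    vals = [v for _, _, v in coo]
    return row_ptr, cols, vals


def spmm_csr(row_ptr, cols, vals, b):
    """Compute C = A * B with A in CSR form and B dense."""
    n_rows, n = len(row_ptr) - 1, len(b[0])
    c = [[0] * n for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            for j in range(n):
                c[i][j] += vals[p] * b[cols[p]][j]
    return c


a = [[1, 0, 2], [0, 0, 0], [0, 3, 0]]
b = [[1, 2], [3, 4], [5, 6]]
row_ptr, cols, vals = coo_to_csr(dense_to_coo(a), len(a))
print(row_ptr)                          # prints [0, 2, 2, 3]
print(spmm_csr(row_ptr, cols, vals, b))  # prints [[11, 14], [0, 0], [9, 12]]
```

In this example the COO form stores three row indices (0, 0, 2), while the CSR form stores four row offsets for three rows; the saving grows with the number of non-zero entries per row.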


Case Study: Support OpenCL Complex Class for Baseband Computing

Jenq-Kuen Lee, Chia-Hsuan Chang, Chun-Chieh Yang | National Tsing Hua University
Yung-Chia Lin  |  MediaTek Inc

With the growing computing complexity of 5G baseband, parallel computing with OpenCL has become a promising solution to the computing needs of baseband processing. In this paper, we investigate the issues in supporting a complex class library with OpenCL C++ classes for baseband computing. The library is developed on the basis of the OpenCL C Complex Math library and also uses the complex libraries of LLVM and libclcxx as references. Templates and overloading, as C++ features of the proposed complex class, allow it to work with many data types, such as complex half, float, double, and fixed point. This work also addresses the issue of fast math. Our work supports standard C++ math with complex numbers as well as a fast-math option in the complex class, in which the checks for Not a Number (NaN) and infinity (INF) are removed. This is supported by the observation that, in 5G computation, value ranges are bounded and do not reach INF. Our OpenCL complex C++ class is also evaluated with a minimum mean-square error (MMSE) application. MMSE is considered computationally expensive in baseband computing for finding the maximum likelihood detection. MMSE is one of the linear precoding/detection methods in MIMO. MMSE attempts to strike a balance between amplifying the signal and reducing the interference. Our work is evaluated in two environments. The first is an Intel OpenCL 2.1 environment, in which we experiment with complex float and complex half. In the second environment, we experiment with our SPIR-V parser to evaluate language support for possibly incorporating complex fixed-point. The experiment also runs an MMSE equalizer application in Massive MIMO. Compared to the sequential C++ code, the complex class in OpenCL can reduce total execution time. Based on the preliminary results, the kernel execution time could be reduced by about 5% to 45% with fast math omitting the NaN and INF checks. In the future, the C++ class header file will also be illustrated to fit the C++ complex class to OpenCL C++ language features. 
Furthermore, other complex types will be added to the proposed complex class library to support more applications and further optimizations.


Application of OpenCL to Numerical Study of the Abrikosov Vortex Energy in a Superconductor with Cylindrical Hole

Petr Kartsev | Moscow Engineering-Physics Institute

We numerically solve the system of Ginzburg-Landau equations describing a superconductor containing Abrikosov vortices in a special geometry essential for practical applications: bulk material with a cylindrical hole. The interaction between the vortex and the hole (the absence of superconductor) defines the potential well needed for the correct description and simulation of modern superconducting materials. Solving these nonlinear equations is still a hard and time-consuming operation. We apply the GPU OpenCL solver developed especially for vortex-type problems (IWOCL 2018 proceedings). Depending on the hole radius, we can address such phenomena as vortex pinning (small radius) and interaction with a curved surface (large radius). Given the nonlinearity of the equations, a numerical approach is considered the only possible way to describe complex geometries, as well as the interaction of several vortices. We describe the OpenCL-specific tricks and optimizations used for maximal GPU utilization and overall calculation performance.