CONFERENCE PROGRAM 2020

The conference has now finished but the discussions continue.  We have set up a dedicated Slack workspace so that authors and attendees can discuss the content of the program.  There are active channels covering the full program, including the SYCL tutorials, the Khronos panel discussion and the main program (papers, technical presentations and posters).  Registration will remain open until May 29th, 2020.

JOIN THE SLACK DISCUSSIONS

Tutorials

An Introduction to SYCL

By Codeplay, Heidelberg University, Intel and Xilinx
Live streamed on Monday 27th April, 2020

SYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to a platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++. In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language.

The format will be short video presentations followed by short coding exercises.

Course Outline

  • Introduction to SYCL
  • The basic structure of every SYCL program
  • Device topology and configuring a queue
  • Handling SYCL errors
  • The host-device interface and launching a kernel
  • The fundamental concepts of a SYCL kernel
  • Managing Data in SYCL
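
To make the outline concrete, here is a minimal sketch of the kind of complete SYCL program the tutorial builds up to, written in SYCL 1.2.1 style as supported by the implementations listed below (the kernel name vec_add and the array size are our own illustrative choices):

    #include <CL/sycl.hpp>
    #include <iostream>
    #include <vector>

    int main() {
      constexpr size_t N = 1024;
      std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
      try {
        // Configure a queue on the default device (CPU or GPU).
        cl::sycl::queue q{cl::sycl::default_selector{}};
        {
          // Buffers manage host<->device data movement automatically.
          cl::sycl::buffer<float, 1> bufA{a.data(), cl::sycl::range<1>{N}};
          cl::sycl::buffer<float, 1> bufB{b.data(), cl::sycl::range<1>{N}};
          cl::sycl::buffer<float, 1> bufC{c.data(), cl::sycl::range<1>{N}};
          // Submit a command group that launches a data-parallel kernel.
          q.submit([&](cl::sycl::handler& cgh) {
            auto A = bufA.get_access<cl::sycl::access::mode::read>(cgh);
            auto B = bufB.get_access<cl::sycl::access::mode::read>(cgh);
            auto C = bufC.get_access<cl::sycl::access::mode::write>(cgh);
            cgh.parallel_for<class vec_add>(
                cl::sycl::range<1>{N},
                [=](cl::sycl::id<1> i) { C[i] = A[i] + B[i]; });
          });
        } // Buffer destruction waits for the kernel and copies results back.
        std::cout << "c[0] = " << c[0] << "\n"; // expect 3
      } catch (const cl::sycl::exception& e) {
        // Synchronous SYCL errors are reported as exceptions.
        std::cerr << "SYCL exception: " << e.what() << "\n";
      }
    }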

Coding Exercises

There will be time allocated for hands-on coding exercises during the tutorial, and we recommend that you set up your machine or the Intel DevCloud before the tutorial. Since there are various implementations of the SYCL standard, you have some choices.

We’ll provide instructions during the exercises for all these options.

Use SYCL in the Cloud

The simplest choice is to use DPC++ and the Intel DevCloud because it requires no setup on your machine.
You can access the cloud machine through your web browser or via an SSH terminal and run SYCL code on an Intel CPU.
Register for access at https://devcloud.intel.com/oneapi/connect/

ComputeCpp

Hardware Supported: Intel CPU/GPU, Arm GPU
Go to https://developer.codeplay.com/products/computecpp/ce/guides for setup instructions

DPC++

Hardware Supported: Intel CPU/GPU/FPGA, NVIDIA GPU
Go to https://software.intel.com/en-us/oneapi/dpc-compiler for setup instructions

hipSYCL

Hardware Supported: AMD GPU, NVIDIA GPU, and any CPU for which an OpenMP-capable C++ compiler exists
Go to https://github.com/illuhad/hipSYCL for setup instructions

Course Schedule (Pacific Time)

8:00  | 20 min | Lecture  | Introduction to SYCL                         | Codeplay
8:20  | 25 min | Hands-on | Getting set up                               | Codeplay
8:45  | 20 min | Lecture  | Topology discovery and queue configuration   | Codeplay
9:05  | 25 min | Hands-on | Configuring a queue                          | Codeplay
9:30  | 20 min | Lecture  | Defining kernels                             | Codeplay
9:50  | 25 min | Hands-on | Hello world                                  | Codeplay
10:15 | 20 min | Lecture  | Managing data                                | Codeplay
10:35 | 25 min | Hands-on | Vector add                                   | Codeplay
11:00 | 1 hour | Q&A      | Continued discussion and Q&A on chat channel | TBD
View the Presentations

Application Development with SYCL

By Codeplay, Heidelberg University, Intel and Xilinx
Live streamed on Wednesday 29th April, 2020

As a programming model, SYCL is interesting, but its real value comes from its use in writing applications. This second part of the SYCL tutorial covers the ecosystem available to SYCL developers, offers an insight into some of the features being added to the standard, and goes into more detail on some of the advanced features. There will also be time during the panel session to pose questions to some of the developers involved in defining and implementing the SYCL standard.

The format will be short video presentations followed by live Q&A sessions with experts from Codeplay, Heidelberg University, Intel and Xilinx.

Course Outline

  • Understanding the SYCL implementations
  • Migrating from CUDA to SYCL
  • Unified Shared Memory in SYCL
  • Host Tasks
  • FPGA extensions in SYCL
  • SYCL Implementer Panel and Q&A

Course Schedule (Pacific Time)

8:00  | 40 min | Lecture  | Overview of SYCL implementations | Codeplay, Intel, Heidelberg University, Xilinx

  • ComputeCpp
  • DPC++
  • hipSYCL
  • triSYCL

8:40  | 15 min | Live Q&A
8:55  | 45 min | Lecture  | Expanding the ecosystem | Codeplay, Intel

  • DPC++ Compatibility Tool: Migrating from CUDA* to DPC++
  • Offload Advisor
  • Nvidia support

9:40  | 15 min | Live Q&A
9:55  | 45 min | TBD      | Future direction of SYCL | Codeplay, Intel, Xilinx

  • Unified Shared Memory
  • Host Task
  • FPGA Extensions

10:40 | 15 min | Live Q&A
View the Presentations

Khronos Announcements and Panel Discussion

This live webinar took place on April 28 and featured some significant announcements and updates from the Khronos Group on both OpenCL and SYCL. These were followed by an open panel discussion, a session that is always a favorite with our audience. This lively and informative session put leading members of the Khronos OpenCL, SYCL and SPIR Working Groups on our ‘virtual stage’ alongside experts from the OpenCL development community to debate the issues of the day and answer questions from the online audience.

Panel Chair

  • Simon McIntosh-Smith, Conference Chair. Professor of High-Performance Computing and Head of the HPC Research Group, University of Bristol.

Khronos Announcements

  • Neil Trevett, Khronos President and OpenCL Working Group Chair, VP NVIDIA
  • Michael Wong, SYCL Working Group Chair. VP Research & Development, Codeplay Software

Panelists

Our panelists are all leading figures in the OpenCL and SYCL community and play key roles within the Khronos working groups and the wider development community.

  • Alastair Murray, Codeplay
  • Ben Ashbaugh, Intel
  • Dennis Adams, Sony Creative Software
  • Eric Berdahl, Adobe
  • Hal Finkel, Argonne National Laboratory
  • Jeremy Kemp, Imagination
  • Kévin Petit, Arm
  • Martin Schreiber, Technical University of Munich
  • Ronan Keryell, Xilinx

SYCL 2020 Update

Program of Talks – Papers and Technical Presentations

A “Virtual” Welcome to IWOCL & SYCLcon 2020

This year we had a record number of submissions.  The quality was very high and competition was fierce in all categories: research papers, technical presentations, tutorials and posters. The following submissions were accepted; our thanks to all the authors for creating the video presentations below.  The Slack workspace accompanying the event will be open until at least the end of May if you would like to discuss any of the submissions. Register to Join the Slack Workspace.

Simon McIntosh-Smith, General Chair.  University of Bristol.

Martin Schreiber, Local Co-Chair. TUM
Christoph Riesinger, Local Co-Chair. Intel

ACM Digital Proceedings

This year’s conference proceedings are now available to anyone with a subscription to the ACM Digital Library.

ACM Digital Library

KEYNOTE PRESENTATION:
Preparing to program Aurora at Exascale: Early experiences and future directions

Hal Finkel (Argonne National Laboratory)

Argonne National Laboratory’s Leadership Computing Facility will be home to Aurora, our first exascale supercomputer. Aurora promises to take scientific computing to a whole new level, and scientists and engineers from many different fields will take advantage of Aurora’s unprecedented computational capabilities to push the boundaries of human knowledge. In addition, Aurora’s support for advanced machine-learning and big-data computations will enable scientific workflows incorporating these techniques along with traditional HPC algorithms.

Programming the state-of-the-art hardware in Aurora will be accomplished using state-of-the-art programming models. Some of these models, such as OpenMP, are long-established in the HPC ecosystem. Other models, such as Intel’s oneAPI, based on SYCL, are relatively-new models constructed with the benefit of significant experience. Many applications will not use these models directly, but rather, will use C++ abstraction libraries such as Kokkos or RAJA. Python will also be a common entry point to high-performance capabilities. As we look toward the future, features in the C++ standard itself will become increasingly relevant for accessing the extreme parallelism of exascale platforms.

This presentation will summarize the experiences of our team as we prepare for Aurora, exploring how to port applications to Aurora’s architecture and programming models, and distilling the challenges and best practices we’ve developed to date. oneAPI/SYCL and OpenMP are both critical models in these efforts, and while the ecosystem for Aurora has yet to mature, we’ve already had a great deal of success.

Importantly, we are not passive recipients of programming models developed by others. Our team works not only with vendor-provided compilers and tools, but also develops improved open-source LLVM-based technologies that feed both open-source and vendor-provided capabilities. In addition, we actively participate in the standardization of OpenMP, SYCL, and C++. To conclude, I’ll share our thoughts on how these models can best develop in the future to support exascale-class systems.

Taking memory management to the next level – Unified Shared Memory in action

Michal Mrozek, Ben Ashbaugh and James Brodman (Intel)
When adding OpenCL or SYCL acceleration to existing code bases it is desirable to represent memory allocations using pointers. This aligns naturally with the Shared Virtual Memory (SVM) capabilities in standard OpenCL, but we found many properties of SVM that are either cumbersome to use or inefficient.

This talk will describe Unified Shared Memory (USM), which is designed to address these shortcomings and take SVM to the next level. Unified Shared Memory is intended to provide:
– Representation of memory allocations as pointers, with full support for pointer arithmetic.
– Fine-grain control over ownership and accessibility of memory allocations, to optimally choose between performance and programmer convenience.
– A simpler programming model, by automatically migrating some allocations between devices and the host.

We will explain how we accomplished these goals by introducing three new memory allocation types:
– Host allocations, which are owned by the host and are intended to be allocated out of system memory. They are not expected to migrate between system memory and device local memory, they have address equivalence, and they trade off wide accessibility and transfer benefits for potentially higher per-access costs, such as over PCI express.
– Device allocations, which are owned by a specific device and are intended to be allocated out of device local memory. They trade off access limitations for higher performance. They are not accessible by the host.
– Shared allocations, which share ownership and are intended to migrate between the host and one or more devices, they are accessible by the host and at least one associated device, and trade off transfer costs for per-access benefits.
We have implemented USM as an extension to OpenCL (for CPU and GPU) and SYCL (in our DPC++ Beta compiler), and are recommending that USM be included in future versions of both standards.

We will compare SVM and USM and explain how USM addresses the shortcomings we found in SVM. We will show how USM can be more convenient to use and provide quick prototyping capabilities. We will describe expected USM usage, including performance suggestions and best practices. We will share success stories and feedback from our users, who switched their applications to USM and noticed tremendous benefits.

The latest draft OpenCL USM extension specification can be found here:
https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc

A SYCL proposal for USM can be found here:
https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc

Working examples demonstrating how to use OpenCL USM can be found here:
https://github.com/intel/compute-samples
https://github.com/bashbaug/SimpleOpenCLSamples
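
For illustration, here is a minimal sketch of the shared-allocation style described above, following the SYCL USM proposal linked above as implemented in the DPC++ beta (exact namespaces and signatures varied between beta releases, so treat this as an approximation rather than the authors' exact API):

    #include <CL/sycl.hpp>
    namespace sycl = cl::sycl;

    int main() {
      sycl::queue q;
      constexpr size_t N = 1024;
      // Shared allocation: jointly owned, may migrate between host and device.
      float* data = sycl::malloc_shared<float>(N, q);
      for (size_t i = 0; i < N; ++i) data[i] = float(i);
      // Pointer-based kernel: no buffers or accessors required.
      q.parallel_for(sycl::range<1>{N},
                     [=](sycl::id<1> i) { data[i] *= 2.0f; }).wait();
      // The host dereferences the same pointer after synchronization.
      float first = data[0]; // 0.0f
      (void)first;
      sycl::free(data, q);
    }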

The C++ for OpenCL Programming Language

Anastasia Stulova, Neil Hickey, Sven van Haastregt, Marco Antognini and Kevin Petit (Arm)
The OpenCL programming model has traditionally been C-based. On the host side, however, C++ has gained considerable popularity lately, with C++ bindings becoming available [1]. The kernel-side language has been primarily C99-based for a very long time, up until last year, when experimental support for C++ for OpenCL was released in Clang 9 [2].

The main motivation for adding this language mode is to allow developers to leverage modern C++ features when writing complex applications for GPUs and other accelerators. In contrast to the OpenCL C++ language released as a public Khronos specification in 2015, C++ for OpenCL does not break backwards compatibility with OpenCL C. It therefore opens up a gradual transition path for existing applications and allows reusing existing libraries written in OpenCL C, in just the same way that C++ relates to C.

The C++ for OpenCL language combines C++17 and OpenCL C v2.0. It can be used to compile kernel code to SPIR-V, which can then be loaded by OpenCL v2.0-compatible drivers, with some limitations. However, driver updates are also planned to allow taking full advantage of the C++ functionality in the future.

In this presentation we explain the main design philosophy and features of the new language, along with its restrictions. The documentation is open, hosted on the Khronos GitHub repository [3], and available for everyone to contribute to. We plan to explain how Clang and other open-source tools can be used to compile code written in the new language and how the compiled output can be used with existing OpenCL-capable devices. In conclusion, we will also draw a comparison with other similar languages, such as SYCL and CUDA, and try to provide some recommendations on language choice depending on the use case.

Additionally, we will present some early experiments with the new language used for OpenCL applications and invite developers for evaluation and feedback about future directions for this development.
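
To give a flavor of the language, here is a small hedged example of our own (not taken from the talk): an OpenCL C-style kernel whose body calls a C++ template, which Clang 9 accepts in the experimental mode enabled with -cl-std=clc++:

    // A C++ template usable directly from kernel code.
    template <typename T>
    T twice(T x) { return x + x; }

    // A standard OpenCL C kernel signature; the body uses C++ features.
    __kernel void double_elements(__global float* buf) {
      size_t i = get_global_id(0);
      buf[i] = twice(buf[i]);
    }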

[1] https://github.khronos.org/OpenCL-CLHPP/
[2] https://clang.llvm.org/docs/UsersManual.html#cxx-for-opencl
[3] https://github.com/KhronosGroup/Khronosdotorg/blob/master/api/opencl/assets/CXX_for_OpenCL.pdf

SYCL 2020: More than meets the eye

Ruyman Reyes, Gordon Brown, Rod Burns and Michael Wong (Codeplay Software)

In the first quarter of 2020, the SYCL™ working group of the Khronos® Group will seek to ratify a new provisional revision of the SYCL specification. This revision will bring a number of significant improvements to the portability and capabilities of the standard, continuing the SYCL programming model’s goal of providing a modern C++ programming model that is portable across the widest range of hardware.

Since the last revision of SYCL, four implementations of the standard have emerged: ComputeCpp, the first conformant commercial implementation; DPC++, Intel’s open-source SYCL implementation that is part of oneAPI; triSYCL, an open-source implementation developed by Xilinx; and hipSYCL, an open-source implementation based on AMD’s HIP stack. The SYCL working group has also grown and now includes a number of users, including members from the US Department of Energy national laboratories, who are helping to shape future revisions of SYCL to meet the needs of users and to align better with the evolution of the ISO C++ standard.

Over recent years the SYCL working group has gathered feedback from users and implementors across various domains and is introducing a number of flagship features to support expanding platform support and user requirements.

The first of these is SYCL generalization, which removes the restriction that SYCL must be implemented on top of OpenCL, allowing conformant SYCL implementations to target a number of other back-ends. The change also introduces new APIs that allow users to query the available back-ends of a SYCL implementation, make use of back-end-specific extensions, and interoperate with the underlying API of a back-end. With this change, SYCL becomes a high-level programming model that can target a great variety of hardware, enabling both generic programming and back-end-specific optimizations.

The next is a new API for managing the process of compiling and linking SYCL kernels from various sources, including offline compilation, which will now be a first-class feature of SYCL. This new API is designed to be flexible, so that it can support back-ends other than OpenCL, and to provide a more strictly typed and thread-safe interface for compiling and linking kernels. The new interface has built-in support for specialization constants, a feature already present in SPIR-V, Vulkan and OpenCL that so far has only been available as an extension provided by Codeplay. We expect the usage of specialization constants in SYCL applications to become one of the key aspects of achieving performance on different hardware.

Another feature is the introduction of host tasks, which provide users with a way to submit arbitrary C++ code as tasks scheduled within a SYCL task graph. These tasks can be used as callbacks triggered after device events, but will also provide the ability to interoperate with the underlying back-end API from within the task.

This presentation will give an overview of all the different features above and many more, explaining their motivation and how to use them with example code.
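
As one hedged illustration of the host task feature described above, a host task in the style of the SYCL 2020 provisional interface might be scheduled like this (details such as handler::host_task and handler::depends_on were still being finalized at the time of the talk):

    #include <CL/sycl.hpp>
    #include <cstdio>
    namespace sycl = cl::sycl;

    int main() {
      sycl::queue q;
      // A device kernel, tracked by an event.
      sycl::event e = q.submit([&](sycl::handler& cgh) {
        cgh.single_task<class device_work>([=]() { /* device work */ });
      });
      // A host task: arbitrary C++ code scheduled as a node in the SYCL
      // task graph, running only after the kernel above completes.
      q.submit([&](sycl::handler& cgh) {
        cgh.depends_on(e);
        cgh.host_task([]() { std::puts("device kernel finished"); });
      });
      q.wait();
    }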

Bringing performant support for Nvidia hardware to SYCL

Ruyman Reyes, Gordon Brown and Rod Burns. (Codeplay Software)

The SYCL™ programming model provides developers with the ability to write standard C++ code for heterogeneous systems, and accelerate execution using a range of different processors including CPUs, GPUs and FPGAs from a range of manufacturers.

This talk will describe how Codeplay™ has collaborated with Intel® to deliver an open source SYCL back-end implementation for the Clang LLVM compiler project that provides full support for Nvidia GPUs. We will present details of how the implementation has been designed and the current status of the implementation. This is an ongoing project from Codeplay that will see further progress over the year, including support for more CUDA features, integration with libraries and performance improvements.

Software developers will be able to use this implementation with Intel’s DPC++ framework to target Nvidia GPUs using any SYCL code, without any porting or special tricks required. If you have existing SYCL code, or if you are writing new SYCL code, you can compile it and target Nvidia GPUs without modifications.

A number of SYCL 1.2.1 extensions are provided to expose most capabilities of the CUDA programming model, and they have been used to provide feedback for the SYCL 2020 specification that provides support for multiple backends.

Data Parallel C++: Enhancing SYCL Through Extensions for Productivity and Performance

James Brodman, Michael Kinsner, Ben Ashbaugh, Jeff Hammond, Alexey Bader, John Pennycook, Jason Sewall and Roland Schulz (Intel)

SYCL is a heterogeneous programming framework built on top of modern C++. Data Parallel C++, recently introduced as part of Intel’s oneAPI project, is Intel’s implementation of SYCL. Data Parallel C++, or DPC++, is being developed as an open-source project on top of Clang and LLVM. It combines C++, SYCL, and new extensions to improve programmer productivity when writing highly performant code for heterogeneous architectures.

This talk will describe several extensions that DPC++ has proposed and implemented on top of SYCL. While many of the extensions can help to improve application performance, all of them work to improve programmer productivity by both enabling easy integration into existing C++ applications, and by simplifying common patterns found in SYCL and C++. DPC++ is a proving ground where the value of extensions can be demonstrated before being proposed for inclusion in future versions of the SYCL specification. Intel contributes DPC++ extensions back to SYCL, to enable a unified standards-based solution.

The extensions that this talk will cover include:

Unified Shared Memory, which adds support for pointer-based programming to SYCL and provides a shared-memory programming model that significantly improves upon the shared virtual memory (SVM) model defined in OpenCL

Unnamed Kernel Lambdas, which simplify development for applications and libraries

In-order Queues, which simplify the common pattern of kernels that execute in sequence

Subgroups, which enable efficient execution of specific collective operations across work items

Reductions, which allow easily expressing an important computational pattern across subgroups, workgroups, and entire devices

Language and API simplifications, which include C++ improvements such as template argument deduction guides, type aliases, and additional overloads of methods to reduce the verbosity of code
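
As a hedged sketch of how two of these extensions read in practice, the following combines an in-order queue with unnamed kernel lambdas, in DPC++-style syntax circa early 2020 (property::queue::in_order and the USM helpers are the extension APIs being illustrated; exact spellings varied across beta releases):

    #include <CL/sycl.hpp>
    namespace sycl = cl::sycl;

    int main() {
      // In-order queue: submitted kernels execute in submission order,
      // without explicit event dependencies between them.
      sycl::queue q{sycl::property::queue::in_order{}};
      float* x = sycl::malloc_shared<float>(16, q);
      // Unnamed kernel lambdas: no <class KernelName> template argument.
      q.parallel_for(sycl::range<1>{16}, [=](sycl::id<1> i) { x[i] = 1.0f; });
      q.parallel_for(sycl::range<1>{16}, [=](sycl::id<1> i) { x[i] += 2.0f; });
      q.wait();
      sycl::free(x, q);
    }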

The DPC++ compiler project is located at http://github.com/intel/llvm.

Definitions for the extensions can be found both in the compiler and runtime code as well as in a repository located at: https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions

RTX-RSim: Accelerated Vulkan room response simulation for time-of-flight imaging

Peter Thoman, Markus Wippler and Thomas Fahringer (University of Innsbruck) and Robert Hranitzky (tofmotion)

Time-of-Flight camera systems are an essential component in 3D scene analysis and reconstruction for many modern computer vision applications. The development and validation of such systems requires testing in a large variety of scenes and situations. Accurate room impulse response simulation greatly speeds up development and validation, as well as reducing its cost, but large computational overhead has so far limited its applicability.

While the overall algorithmic requirements of this simulation differ significantly from 3D rendering, the recently introduced hardware raytracing support in GPUs nonetheless provides an interesting new implementation option. In this paper, we present a new room impulse simulation method, implemented with Vulkan compute shaders and leveraging VKRay hardware raytracing. We also extend this method to asynchronous streaming with low overhead in order to overcome the limitations of on-board GPU memory when simulating very large scenes.

Our implementation is, to the best of our knowledge, the first ever application of Vulkan hardware raytracing in a non-rendering simulation setting. Compared to a state-of-the-art multicore CPU implementation running on 12 CPU cores, we achieve an overall speedup of factor 20.9 when streaming is not required, and 17.6 with streaming, on a single consumer GPU.

Evaluating the performance of HPC-style SYCL applications

Tom Deakin and Simon McIntosh-Smith (University of Bristol)

SYCL is a parallel programming model for developing single-source programs that run on heterogeneous platforms. To this end, it allows a single code base to be written that can run on different architectures. For this study, we develop applications in SYCL which are representative of those often used in High-Performance Computing. Their performance is benchmarked on a variety of CPU and GPU architectures from multiple vendors, and compared to well-optimised versions written in OpenCL and other parallel programming models.

Modeling heterogeneous computing performance with offload advisor

Cedric Andreolli, Zakhar Matveev and Vladimir Tsymbal (Intel)

Programming heterogeneous platforms requires thorough analysis of applications at the design stage to determine the best data and work decomposition between the CPU and the accelerating hardware. In many cases the application already exists in a conventional CPU programming language such as C++, and the main problem is to determine which parts of it would benefit from being offloaded to an accelerating device. An even bigger problem is to estimate how much of a performance increase one should expect from acceleration on a particular heterogeneous platform. Each platform has its own limitations that affect the performance of offloaded compute tasks, e.g. data transfer costs, task initialization overhead, and memory latency and bandwidth constraints. To take those constraints into account, software architects and developers need tooling that collects the right information and produces recommendations, so that they can make the best design and optimization decisions.

In this presentation we will introduce the basics of offload performance estimation analysis and the Offload Advisor tool, which is intended to help with the application design process. Offload Advisor is an extended version of the Intel® Advisor, a code modernization, programming guidance, and performance estimation tool that supports the OpenCL and SYCL/Data Parallel C++ languages on CPU and GPU. It provides co-design, performance modeling, analysis, and characterization features for industry-size applications written in C, C++, Fortran, and mixed Python*.

Offload Advisor analysis helps to determine which sections of a code can be offloaded to a GPU, accelerating the performance of a CPU-based application. It provides metrics and performance data, such as projected speedup and a call tree showing offloaded and accelerated regions, and identifies key bottlenecks (algorithmic, compute, caches, memory, throughput/latency) and more. It considers not only compute and memory limitations, but also the time required to transfer data and execute the code on the target hardware.

Performance estimates provided by the tool are not limited to GPUs. It uses Accelerator Performance Models (APMs) to model a target accelerator. Although to date only APMs for the Intel Gen architecture (the graphics module integrated into the CPU) and the Intel Xe architecture (a discrete GPU board) are available, the approach is extensible to future architectures.

The tool is also flexible enough to analyze applications that are already written in a language dedicated to heterogeneous platforms, such as SYCL, but currently run on the CPU. In this case the result of the analysis is a projection of the performance increase if the application were executed on CPU+GPU.

If an application is already designed for a heterogeneous platform, i.e. written in OpenCL and executing compute tasks on an iGPU, Intel Advisor offers a GPU Roofline analysis. The GPU Roofline analysis helps estimate and visualize the actual performance of GPU kernels, using benchmarks and hardware metric profiling against hardware-imposed performance ceilings, and determines the main limiting factor. With GPU profiling it collects OpenCL™ kernel timings and memory data, measures the hardware limitations, and collects floating-point and integer operation counts, similarly to Intel Advisor for CPU.

Offload Advisor is a new tool that is being actively developed alongside Intel’s new acceleration architectures.

ComputeAorta: A toolkit for implementing heterogeneous programming models

Alastair Murray and Ewan Crawford (Codeplay Software)

The modern heterogeneous programming landscape has a wide variety of programming models targeting a range of hardware that is equally diverse. ComputeAorta from Codeplay Software Ltd is designed to provide implementations of heterogeneous APIs, such as OpenCL or Vulkan Compute, on hardware ranging from DSPs to large machine-learning accelerators. This talk is about how the design of ComputeAorta has evolved over the years to enable engineers to quickly implement industry-standard APIs for such devices, exposing their full performance capabilities with minimal effort.

ComputeAorta exists within Codeplay’s ComputeSuite stack along with the ComputeCpp implementation of SYCL and SYCL libraries such as SYCL-BLAS and SYCL-DNN. Thus, ComputeAorta is designed to provide a standards-compliant interface to custom, heterogeneous hardware that is used by applications further up the ComputeSuite stack. Key design goals of ComputeAorta are to provide mechanisms to support SPIR-V and similar technologies that enable high-level programming models; to be able to map the parallelism inherent in data-parallel programming models to hardware via compiler optimizations; to expose irregular hardware features via language or API extensions so that programmers can reliably achieve top performance; and to minimize the amount of effort required to create a correct implementation of a heterogeneous API on a new hardware device. These design goals are achieved using an internal specification for a very low level programming model called “Core”. Standardized programming models, such as OpenCL and Vulkan Compute, are implemented in terms of the Core specification, and Core is implemented for each unique hardware device. This separation of concerns allows a dedicated customer team to focus on each new device, implementing the specification in whatever way is necessary to achieve top performance on that hardware. To aid customer teams, the ComputeAorta toolkit contains a reference CPU implementation of Core, example OpenCL extensions, a set of compiler passes, a math library, and carefully maintained build and test infrastructure.

In this talk we will cover the challenges of supporting multiple heterogeneous APIs in a single codebase and the implications of implementing public APIs in terms of another abstraction layer. The talk will reference related or alternative approaches, such as Intel’s Level Zero [1], clvk [2], and POCL [3]. We will also cover how separating different aspects of the implementation via a specification allows the project to scale to many varied customer hardware devices and continuously adapt in the ever-evolving fields of heterogeneous compute architectures and programming models, all the while retaining centralized test suites that enforce correctness across all API implementations. We will use our experience implementing multiple heterogeneous programming models to comment on future-proofing for upcoming APIs, such as OpenCL Next, and our experience implementing on a variety of hardware to explain the rationale behind our design decisions. This will all help the audience understand the key concerns in implementing heterogeneous programming models, and what to consider should they too embark on such an engineering project.

[1] https://spec.oneapi.com/oneL0/core_INTRO.html
[2] https://github.com/kpet/clvk
[3] https://doi.org/10.1007/s10766-014-0320-y

SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL

Aksel Alpay and Vincent Heuveline (Heidelberg University)

The SYCL ecosystem currently contains four major SYCL implementations: Codeplay’s ComputeCpp, Intel’s LLVM/Clang implementation, triSYCL led by Xilinx, and hipSYCL led by Heidelberg University. hipSYCL, an open-source project available at https://github.com/illuhad/hipSYCL, is the first SYCL implementation that does not build on OpenCL, providing instead CUDA/HIP and OpenMP backends that allow it to target CPUs, NVIDIA GPUs and AMD GPUs. Since hipSYCL builds on the CUDA/HIP Clang frontend, augmented with a Clang plugin to add support for SYCL constructs, it is inherently interoperable with existing CUDA or HIP codebases and vendor-optimized libraries, and can expose the latest hardware features, such as intrinsics, as soon as they become available in the Clang CUDA/HIP toolchain.

In this presentation, we will review the architecture of hipSYCL, as well as the current state of the implementation, including performance and limitations. We will also discuss future directions and recent and ongoing work, in particular the development of an entirely new runtime which among other features, introduces a generalization to multiple backends, a batched kernel submission model, and ahead-of-time scheduling capabilities with the ability to reason about the consequences of scheduling decisions. This will allow us to experiment in hipSYCL with scheduling models that are more general than what is currently defined in the SYCL specification.

Evaluating the Performance of the hipSYCL toolchain for HPC kernels on NVIDIA V100 GPUs

Brian Homerding and John Tramm (Argonne National Laboratory)

Future HPC leadership computing systems for the United States Department of Energy will utilize GPUs for acceleration of scientific codes. These systems will utilize GPUs from various vendors, which places a large focus on the performance portability of the programming models used by scientific application developers. In the HPC domain, SYCL is an open C++ standard for heterogeneous computing that is gaining support. This is fueling a growing interest in understanding the performance of SYCL toolchains for the various GPU vendors.

In this paper, we produce SYCL benchmarks and mini-apps whose performance on the NVIDIA Volta GPU is analyzed. We utilize the RAJA Performance Suite to evaluate the performance of the hipSYCL toolchain, followed by a more detailed investigation of the performance of two HPC mini-apps. We find that the SYCL kernels compiled directly to CUDA perform at a competitive level with their CUDA counterparts when comparing straightforward implementations.

Study of OpenCL atomic functions and their acceleration using the subgroup extensions

Hongqiang Wang, Jeng-Hau Lin and Alex Bourd (QUALCOMM) and Jiahui Huang (University of California, Riverside)

In this work we study and compare the performance of a number of OpenCL 1.2 atomic functions on GPUs from various vendors, including an Intel HD520 GPU, an Nvidia Jetson TX2 GPU, and Adreno GPUs. Tests are designed and optimized for different use cases, including atomics with a uniform address, atomics with non-uniform addresses, atomics with return values, and atomics with no return values. We collect and analyze the results, and also try to interpret them using the publicly available technical documents.

We then attempt to improve atomic performance by exploiting the newly standardized OpenCL subgroup extensions from the Khronos Group. For example, by using sub_group_reduce_op we can perform a parallel reduction within each subgroup first, then issue just one atomic operation per subgroup for some simple cases. For more complex cases, e.g. atomic operations requiring the old value to be returned, a more sophisticated scheme using the sub_group_scan_exclusive_op subgroup function is designed. The most challenging scenario, non-uniform atomics requiring the old value to be returned, can also be accelerated by using a subgroup ballot operation within a nested loop.
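
A hedged sketch of the simplest scheme described above, for the uniform-address, no-return case (OpenCL C with the cl_khr_subgroups extension; the kernel and argument names are ours, not the authors’):

    #pragma OPENCL EXTENSION cl_khr_subgroups : enable

    __kernel void sum_all(__global int* counter, __global const int* vals) {
      int v = vals[get_global_id(0)];
      // Reduce within the subgroup first...
      int partial = sub_group_reduce_add(v);
      // ...then issue one atomic per subgroup instead of one per work-item.
      if (get_sub_group_local_id() == 0)
        atomic_add(counter, partial);
    }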

We will present and analyze the performance numbers. These experiments were designed on Qualcomm’s Adreno GPUs, but can easily be extended to GPUs from other vendors with atomic and subgroup support.

Presentation Slides – Pending

SYCL-Bench: A Versatile Single-Source Benchmark Suite for Heterogeneous Computing

Aksel Alpay and Vincent Heuveline (Heidelberg University), Sohan Lal and Nicolai Stawinoga (TU Berlin), Philip Salzmann, Peter Thoman and Thomas Fahringer (University of Innsbruck) and Biagio Cosenza (University of Salerno)

SYCL is a royalty-free open standard of the Khronos Group that enables heterogeneous programming using pure C++, targeting a broad range of parallel devices, including multicore CPUs, GPUs, and FPGAs, without any additional attributes or pragmas. While SYCL kernels follow a data-parallel model, they are implicitly organized in a task graph built by the runtime from data access specifications. Scheduling, data management and synchronization between different tasks are handled implicitly by the SYCL runtime, which varies depending on the implementation. While this simultaneously preserves programmer productivity and allows the SYCL runtime to automatically perform optimizations such as overlapping data transfers and kernel execution, it is not apparent whether a SYCL implementation actually employs such optimizations for a particular code pattern. Benchmarks are therefore necessary to characterize the performance of programming models such as SYCL that rely heavily on implicit operations. To this end, we present SYCL-Bench, a versatile benchmark suite written in SYCL. SYCL-Bench not only contains benchmarks to characterize the hardware, but also SYCL-specific benchmarks that present optimization opportunities to the SYCL runtime and test how well a particular SYCL implementation capitalizes on those opportunities. The SYCL-Bench benchmarking methodology includes: 109 codes suited for hardware characterization; 24 codes based on data-parallel patterns such as reduction; and 9 codes to evaluate SYCL-specific runtime features. We experimentally demonstrate the effectiveness of SYCL-Bench by performing a device characterization on NVIDIA GeForce GTX Titan X, GeForce 1080 Ti, and AMD Radeon VII GPUs, and by evaluating the runtime efficiency of the hipSYCL and ComputeCpp SYCL implementations.

Characterizing optimizations to memory access patterns using architecture-independent program features

Aditya Chilukuri, Josh Milthorpe and Beau Johnston (Australian National University)

HPC developers are faced with the challenge of optimizing OpenCL workloads for high performance on novel architectures.
The Architecture Independent Workload Characterisation (AIWC) tool is a plugin for the Oclgrind OpenCL simulator that gathers metrics of OpenCL programs that can be used to understand and predict program performance on an arbitrary given hardware architecture.

However, AIWC metrics are not always easily interpreted and do not reflect some important memory access patterns affecting efficiency across architectures.
We propose the concept of parallel spatial locality — the closeness of memory accesses simultaneously issued by OpenCL work-items (threads).

We implement the parallel spatial locality metric in the AIWC framework, and analyse gathered results on matrix multiply and the Extended OpenDwarfs OpenCL benchmarks.

The differences in the obtained parallel spatial locality metric across implementations of matrix multiply reflect the optimizations performed.
The new metric can be used to distinguish between the OpenDwarfs benchmarks based on the memory access patterns affecting their performance on various architectures.

The improvements suggested to AIWC will help HPC developers better understand memory access patterns of complex codes, a crucial step in guiding optimization of codes for arbitrary hardware targets.
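
To illustrate the kind of pattern the proposed metric captures, consider two hypothetical OpenCL kernels of our own (not taken from the paper): in the first, work-items simultaneously access adjacent addresses (high parallel spatial locality); in the second, a large stride scatters the simultaneous accesses (low parallel spatial locality):

    __kernel void copy_coalesced(__global const float* in, __global float* out) {
      size_t i = get_global_id(0);
      out[i] = in[i];  // work-item i reads element i: simultaneous accesses are adjacent
    }

    __kernel void copy_strided(__global const float* in, __global float* out,
                               int stride) {
      size_t i = get_global_id(0);
      out[i] = in[i * (size_t)stride];  // simultaneous reads land far apart in memory
    }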

Experiences with OpenCL in PyFR: 2014—Present

Freddie Witherden (Texas A&M University) and Peter Vincent (Imperial College London)

PyFR is an open source high-performance computing (HPC) framework for performing scale-resolving computational fluid dynamics simulations [1]. The algorithmic core of PyFR is the flux reconstruction (FR) approach of Huynh [2], which combines the geometric flexibility of finite volume schemes with the high-order accuracy and efficiency of spectral schemes. Primarily written in Python, PyFR aims to be performance portable across a range of hardware platforms. This is accomplished through the use of a bespoke domain specific language based around the Mako templating engine and a range of run-time code generation backends. Our approach enables PyFR to target platforms through OpenMP annotated C kernels, CUDA kernels, and OpenCL kernels [3].

In this talk I will discuss our experiences with OpenCL in PyFR. This will include the current role of the OpenCL backend within PyFR and our plans for the future apropos SYCL and Intel’s oneAPI initiative. The performance of the OpenCL backend will be compared and contrasted with that of the ‘native’ backends in PyFR. Furthermore, we will also highlight the limitations of OpenCL and related standards, specifically in the areas of MPI-awareness and the availability of performance primitives. Implementation quality on the part of hardware vendors will also be discussed.

References
[1] F. D. Witherden, A. M. Farrington, and P. E. Vincent, PyFR: An Open Source Framework for Solving Advection-Diffusion Type Problems on Streaming Architectures Using the Flux Reconstruction Approach. Computer Physics Communications, 185(11), 2014, 3028–3040.
[2] H. T. Huynh, A flux reconstruction approach to high-order schemes including discontinuous Galerkin methods. 18th AIAA Computational Fluid Dynamics Conference, 2011.
[3] F. D. Witherden, B. C. Vermeire, and P. E. Vincent, Heterogeneous computing on mixed unstructured grids with PyFR. Computers & Fluids, 120, 2015, 173–186.

Evaluation of modern GPGPU technologies for image processing

Joachim Meyer (Stemmer Imaging)

Image processing has high computational requirements. As many papers have indicated, a number of image processing operations can be optimized by heavy parallelization of the computation. Currently, one of the best options for parallelized image processing is using GPGPU. In the area of GPGPU, a rather wide range of APIs is available and finding an appropriate choice for a project is difficult.

To provide guidance on selecting the right API in an image-processing context, four GPU programming models were compared according to their platform independence, usability, and performance. To gather information on usability and performance, a test project was created, implementing a number of image-processing tasks in each of the four GPU programming models. The reference algorithms form a pipeline for processing images from polarization cameras. The four investigated APIs are CUDA, OpenCL 1.2, Vulkan and SYCL. While CUDA, OpenCL, and OpenMP are compared in a wide number of papers, the more recent standards SYCL and Vulkan are not yet included in as many comparisons, and it is very rare to find both of them in a single investigation. The aim of this work is to provide an overview of the most important factors to consider when choosing a GPGPU API for a project, taking the new Khronos standards into account.

The direct comparison of the APIs done for this work culminates in a decision matrix to aid organizations in their process of selecting an offloading API. It shows, for example, that CUDA and SYCL are comparable in terms of development cost, support for modern C++, and ease of integration into existing code bases. When considering modern standard adoption and GPU portability, Vulkan has advantages, while OpenCL and SYCL can also run on CPUs and FPGAs. In the tooling category, CUDA has the most specialized and effective tools, whereas Vulkan has almost no compute-specific tools available. CUDA and Vulkan provide the best performance (on the tested devices), and Vulkan was able to outperform CUDA on an Nvidia Jetson.

It becomes clear that all four APIs have their advantages and may be suitable for different applications. Vulkan is the API of choice in mobile consumer applications. CUDA is convenient for fast development but only if it is feasible to support Nvidia GPUs only. OpenCL might be chosen if a considerable number of different desktop GPUs and FPGAs should be supported. Although some metrics still show room for improvement, SYCL seems to grow into a good all-rounder with all the available implementations. It could benefit from weakening its tight integration with OpenCL to improve support on some platforms. A Vulkan implementation, for example, would open opportunities to target mobile and Apple devices, while hipSYCL could gain the status of a standard-conforming implementation.

Debugging SYCL programs on heterogeneous Intel architectures

Baris Aktemur, Markus Metzger, Natalia Saiapova and Mihails Strasuns (Intel)

Intel recently announced a large initiative named oneAPI that provides a direct programming model based on SYCL. As part of the oneAPI distribution, we developed a debugger that can be used to debug SYCL programs that offload kernels to CPU, GPU, or FPGA emulator devices. The debugger is based on GDB. It allows programmers to inspect the host and kernel portions of their SYCL programs seamlessly in the same debug session. To realize the debugger, we made enhancements to GDB, including SIMD-based thread views and C++-related improvements. In this work we present the general architecture of the debugger, provide a sample session showing how it can be used to debug a SYCL kernel running on a GPU, and discuss the challenges encountered and anticipated during the development phase. Currently a beta version of the debugger is publicly available.

Automated OpenCL GPU kernel fusion for Stan Math

Tadej Ciglarič, Rok Češnovar and Erik Štrumbelj (University of Ljubljana)

We developed an OpenCL GPU kernel fusion library for the Stan software for Bayesian statistics. The library automatically combines kernels, optimizes computation, and is simple to use. The practical utility of the library is that it speeds up the development of new GPU kernels while keeping the performance of automatically combined kernels comparable to hand crafted kernels. We demonstrate this with experiments on basic operations and a linear regression model likelihood.