Tutorials

An Introduction to SYCL

By Codeplay, Heidelberg University, Intel and Xilinx
Live streamed on Monday 27th April, 2020

SYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to a platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++. In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language.
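To give a flavour of that single-source style, here is a minimal sketch of a SYCL vector addition, written against a SYCL 1.2.1-style implementation (header paths and kernel-naming requirements can vary slightly between implementations):

```cpp
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

  {
    // Buffers manage data movement between the host and the device.
    cl::sycl::buffer<float, 1> bufA(a.data(), cl::sycl::range<1>(N));
    cl::sycl::buffer<float, 1> bufB(b.data(), cl::sycl::range<1>(N));
    cl::sycl::buffer<float, 1> bufC(c.data(), cl::sycl::range<1>(N));

    cl::sycl::queue q;  // queue on a default-selected device

    // Submit a command group: request access to the buffers and launch a kernel.
    q.submit([&](cl::sycl::handler& cgh) {
      auto A = bufA.get_access<cl::sycl::access::mode::read>(cgh);
      auto B = bufB.get_access<cl::sycl::access::mode::read>(cgh);
      auto C = bufC.get_access<cl::sycl::access::mode::write>(cgh);
      cgh.parallel_for<class vector_add>(
          cl::sycl::range<1>(N),
          [=](cl::sycl::id<1> i) { C[i] = A[i] + B[i]; });
    });
  }  // buffers go out of scope here, so the results are copied back to the host

  std::cout << "c[0] = " << c[0] << std::endl;  // expect 3
}
```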

The format will be short video presentations followed by short coding exercises.

Course Outline

  • Introduction to SYCL
  • The basic structure of every SYCL program
  • Device topology and configuring a queue
  • Handling SYCL errors
  • The host-device interface and launching a kernel
  • The fundamental concepts of a SYCL kernel
  • Managing Data in SYCL

Coding Exercises

There will be time allocated for hands-on coding exercises during the tutorial, and we recommend that you set up your machine or the Intel DevCloud for these exercises before the tutorial. Since there are various implementations of the SYCL standard, you have several options.

We’ll provide instructions during the exercises for all these options.

Use SYCL in the Cloud

The simplest choice is to use DPC++ and the Intel DevCloud because it requires no setup on your machine.
You can access the cloud machine through your web browser or via an SSH terminal and run SYCL code on an Intel CPU.
Register for access at https://devcloud.intel.com/oneapi/connect/

ComputeCpp

Hardware Supported: Intel CPU/GPU, Arm GPU
Go to https://developer.codeplay.com/products/computecpp/ce/guides for setup instructions

DPC++

Hardware Supported: Intel CPU/GPU/FPGA, NVIDIA GPU
Go to https://software.intel.com/en-us/oneapi/dpc-compiler for setup instructions

hipSYCL

Hardware Supported: AMD GPU, NVIDIA GPU, and any CPU for which a C++ compiler with OpenMP support exists
Go to https://github.com/illuhad/hipSYCL for setup instructions

Course Schedule (Pacific Time)

8:00 | 20 min | Lecture | Introduction to SYCL | Codeplay
8:20 | 25 min | Hands on | Getting set up | Codeplay
8:45 | 20 min | Lecture | Topology discovery and queue configuration | Codeplay
9:05 | 25 min | Hands on | Configuring a queue | Codeplay
9:30 | 20 min | Lecture | Defining kernels | Codeplay
9:50 | 25 min | Hands on | Hello world | Codeplay
10:15 | 20 min | Lecture | Managing data | Codeplay
10:35 | 25 min | Hands on | Vector add | Codeplay
11:00 | 1 hour | Q&A | Continued discussion and Q&A on chat channel | TBD

Application Development with SYCL

By Codeplay, Heidelberg University, Intel and Xilinx
Live streamed on Wednesday 29th April, 2020

As a programming model, SYCL is interesting, but its real value comes from its use in writing applications. This second part of the SYCL tutorial covers the ecosystem available to SYCL developers, with an insight into some of the features that are being added to the standard and more detail on some of the advanced features. There will also be time during the panel session to pose questions to some of the developers involved in defining and implementing the SYCL standard.

The format will be short video presentations followed by live Q&A sessions with experts from Codeplay, Heidelberg University, Intel and Xilinx.

Course Outline

  • Understanding the SYCL implementations
  • Migrating from CUDA to SYCL
  • Unified Shared Memory in SYCL
  • Host Tasks
  • FPGA extensions in SYCL
  • SYCL Implementer Panel and Q&A

Course Schedule (Pacific Time)

8:00 | 40 min | Lecture | Overview of SYCL implementations | Codeplay, Intel, Heidelberg University, Xilinx

  • ComputeCpp
  • DPC++
  • hipSYCL
  • triSYCL

8:40 | 15 min | Live Q&A
8:55 | 45 min | Lecture | Expanding the ecosystem | Codeplay, Intel

  • DPC++ Compatibility Tool: Migrating from CUDA* to DPC++
  • Offload Advisor
  • Nvidia support

9:40 | 15 min | Live Q&A
9:55 | 45 min | TBD | Future direction of SYCL | Codeplay, Intel, Xilinx

  • Unified Shared Memory
  • Host Tasks
  • FPGA Extensions

10:40 | 15 min | Live Q&A

Khronos Announcements and Panel Discussion

This live webinar took place on April 28 and featured some significant announcements and updates from the Khronos Group on both OpenCL and SYCL. These were followed by an open panel discussion, a session that is always a favorite with our audience. This lively and informative session put leading members of the Khronos OpenCL, SYCL and SPIR Working groups on our ‘virtual stage’ alongside experts from the OpenCL development community to debate the issues of the day and answer questions from the online audience. 

Panel Chair

  • Simon McIntosh-Smith, Conference Chair. Professor of High-Performance Computing and Head of the HPC Research Group, University of Bristol.

Khronos Announcements

  • Neil Trevett, Khronos President and OpenCL Working Group Chair, VP NVIDIA
  • Michael Wong, SYCL Working Group Chair, VP Research & Development, Codeplay Software

Panelists

Our panel of experts are all leading figures in the OpenCL and SYCL community and play key roles within the Khronos Working Groups and the wider development community.

  • Alastair Murray, Codeplay
  • Ben Ashbaugh, Intel
  • Dennis Adams, Sony Creative Software
  • Eric Berdahl, Adobe
  • Hal Finkel, Argonne National Laboratory
  • Jeremy Kemp, Imagination
  • Kévin Petit, Arm
  • Martin Schreiber, Technical University of Munich
  • Ronan Keryell, Xilinx

SYCL 2020 Update

Program of Talks – Papers and Technical Presentations

A “Virtual” Welcome to IWOCL & SYCLcon 2020

This year we had a record number of submissions. The quality was very high and competition was fierce in all categories: research papers, technical presentations, tutorials and posters. The following submissions were accepted; thanks to all our authors for creating the video presentations below. The Slack workspace accompanying the event will be open until at least the end of May if you would like to discuss any of the submissions. Register to Join the Slack Workspace.

Simon McIntosh-Smith, General Chair.  University of Bristol.

Martin Schreiber, Local Co-Chair. TUM
Christoph Riesinger, Local Co-Chair. Intel

ACM Digital Proceedings

This year’s conference proceedings are now available to anyone with a subscription to the ACM Digital Library.

KEYNOTE PRESENTATION:
Preparing to program Aurora at Exascale: Early experiences and future directions

Hal Finkel (Argonne National Laboratory)

Argonne National Laboratory’s Leadership Computing Facility will be home to Aurora, our first exascale supercomputer. Aurora promises to take scientific computing to a whole new level, and scientists and engineers from many different fields will take advantage of Aurora’s unprecedented computational capabilities to push the boundaries of human knowledge. In addition, Aurora’s support for advanced machine-learning and big-data computations will enable scientific workflows incorporating these techniques along with traditional HPC algorithms.

Programming the state-of-the-art hardware in Aurora will be accomplished using state-of-the-art programming models. Some of these models, such as OpenMP, are long established in the HPC ecosystem. Other models, such as Intel’s oneAPI, based on SYCL, are relatively new models constructed with the benefit of significant experience. Many applications will not use these models directly, but rather will use C++ abstraction libraries such as Kokkos or RAJA. Python will also be a common entry point to high-performance capabilities. As we look toward the future, features in the C++ standard itself will become increasingly relevant for accessing the extreme parallelism of exascale platforms.

This presentation will summarize the experiences of our team as we prepare for Aurora, exploring how to port applications to Aurora’s architecture and programming models, and distilling the challenges and best practices we’ve developed to date. oneAPI/SYCL and OpenMP are both critical models in these efforts, and while the ecosystem for Aurora has yet to mature, we’ve already had a great deal of success.

Importantly, we are not passive recipients of programming models developed by others. Our team works not only with vendor-provided compilers and tools, but also develops improved open-source LLVM-based technologies that feed both open-source and vendor-provided capabilities. In addition, we actively participate in the standardization of OpenMP, SYCL, and C++. To conclude, I’ll share our thoughts on how these models can best develop in the future to support exascale-class systems.

Taking memory management to the next level – Unified Shared Memory in action

Michal Mrozek, Ben Ashbaugh and James Brodman (Intel)
When adding OpenCL or SYCL acceleration to existing code bases it is desirable to represent memory allocations using pointers. This aligns naturally with the Shared Virtual Memory (SVM) capabilities in standard OpenCL, but we found many properties of SVM that are either cumbersome to use or inefficient.

This talk will describe Unified Shared Memory (USM), which is designed to address these shortcomings and take SVM to the next level. Unified Shared Memory is intended to provide:
– Representation of memory allocations as pointers, with full support for pointer arithmetic.
– Fine-grain control over ownership and accessibility of memory allocations, to optimally choose between performance and programmer convenience.
– A simpler programming model, by automatically migrating some allocations between devices and the host.

We will explain how we accomplished these goals by introducing three new memory allocation types:
– Host allocations, which are owned by the host and are intended to be allocated out of system memory. They are not expected to migrate between system memory and device local memory, they have address equivalence, and they trade off wide accessibility and transfer benefits for potentially higher per-access costs, such as over PCI express.
– Device allocations, which are owned by a specific device and are intended to be allocated out of device local memory. They trade off access limitations for higher performance. They are not accessible by the host.
– Shared allocations, which share ownership and are intended to migrate between the host and one or more devices, they are accessible by the host and at least one associated device, and trade off transfer costs for per-access benefits.
We have implemented USM as an extension to OpenCL (for CPU and GPU) and SYCL (in our DPC++ Beta compiler), and are recommending that USM be included in future versions of both standards.
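As a rough illustration of the three allocation types, the sketch below uses the USM free functions from the proposal linked further down (malloc_host, malloc_device, malloc_shared); the spelling follows the DPC++ beta, and the exact namespaces and overloads may differ in later revisions:

```cpp
#include <CL/sycl.hpp>

int main() {
  using namespace cl::sycl;
  queue q;
  const size_t N = 1024;

  // Host allocation: owned by the host; the device may access it remotely (e.g. over PCIe).
  float* h = static_cast<float*>(malloc_host(N * sizeof(float), q.get_context()));

  // Device allocation: lives in device-local memory; not accessible from the host.
  float* d = static_cast<float*>(
      malloc_device(N * sizeof(float), q.get_device(), q.get_context()));

  // Shared allocation: migrates between host and device on demand.
  float* s = static_cast<float*>(
      malloc_shared(N * sizeof(float), q.get_device(), q.get_context()));

  for (size_t i = 0; i < N; ++i) s[i] = 1.0f;   // host writes the shared allocation directly
  q.memcpy(d, s, N * sizeof(float)).wait();     // explicit copy into device memory

  q.parallel_for<class scale>(range<1>(N), [=](id<1> i) {
    s[i] = d[i] * 2.0f;                         // pointers are captured by value, no accessors
  }).wait();

  free(h, q.get_context());
  free(d, q.get_context());
  free(s, q.get_context());
}
```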

We will compare SVM and USM and explain how USM addresses the shortcomings we found in SVM. We will show how USM can be more convenient to use and provide quick prototyping capabilities. We will describe expected USM usage, including performance suggestions and best practices. We will share success stories and feedback from our users, who switched their applications to USM and noticed tremendous benefits.

The latest draft OpenCL USM extension specification can be found here:
https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc

A SYCL proposal for USM can be found here:
https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc

Working examples demonstrating how to use OpenCL USM can be found here:
https://github.com/intel/compute-samples
https://github.com/bashbaug/SimpleOpenCLSamples

The C++ for OpenCL Programming Language

Anastasia Stulova, Neil Hickey, Sven van Haastregt, Marco Antognini and Kevin Petit (Arm)
The OpenCL programming model has traditionally been C based. On the host side, however, C++ has gained quite a lot of popularity lately, with C++ bindings becoming available [1]. The kernel-side language had been primarily C99 based for a very long time, until last year, when experimental support for C++ for OpenCL was released in Clang 9 [2].

The main motivation for adding this language mode is to allow developers to leverage modern C++ features when writing complex applications for GPUs and other accelerators. In contrast to the OpenCL C++ language released under a public Khronos specification in 2015, C++ for OpenCL does not break backwards compatibility with OpenCL C, and therefore opens up a gradual transition path for existing applications and allows existing libraries written in OpenCL C to be reused. It relates to OpenCL C in much the same way that C++ relates to C.

The C++ for OpenCL language combines C++17 and OpenCL C v2.0. It can be used to compile kernel code to SPIR-V that can then be loaded by OpenCL v2.0 compatible drivers with some limitations. However, driver updates are also planned in the future to allow taking full advantage of C++ functionality.

In this presentation we explain the main design philosophy and features of the new language along with its restrictions. The documentation is open and it is hosted on the Khronos github repository [3]. It is available for everyone to contribute to. We plan to explain how Clang and other open source tools can be used to compile the code written in the new language and how the compiled output can be used with existing OpenCL capable devices. In the conclusion we will also draw a comparison with other similar languages, such as SYCL or CUDA, and try to provide some recommendations of language choice depending on use cases.

Additionally, we will present some early experiments with the new language used for OpenCL applications and invite developers for evaluation and feedback about future directions for this development.
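By way of illustration, the following hypothetical kernel mixes a C++ template with an otherwise OpenCL C-style kernel; with Clang 9 or later it can be compiled in the experimental C++ for OpenCL mode (for example with -cl-std=clc++, though exact invocations may vary between releases):

```cpp
// A templated helper: plain C++ that is not expressible in OpenCL C.
template <typename T>
T axpy(T a, T x, T y) {
  return a * x + y;
}

// The kernel itself keeps the familiar OpenCL C form, so existing host code
// and OpenCL C libraries continue to work unchanged.
kernel void vadd(global const float* x, global const float* y,
                 global float* out, float a) {
  const size_t i = get_global_id(0);
  out[i] = axpy(a, x[i], y[i]);
}
```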

[1] https://github.khronos.org/OpenCL-CLHPP/
[2] https://clang.llvm.org/docs/UsersManual.html#cxx-for-opencl
[3] https://github.com/KhronosGroup/Khronosdotorg/blob/master/api/opencl/assets/CXX_for_OpenCL.pdf

SYCL 2020: More than meets the eye

Ruyman Reyes, Gordon Brown, Rod Burns and Michael Wong (Codeplay Software)

In the first quarter of 2020 the SYCL™ working group of the Khronos® Group will seek to ratify a new provisional revision of the SYCL specification. This new revision brings a number of significant improvements to the portability and capabilities of the standard, continuing the goal of the SYCL programming model: to provide a modern C++ programming model that is portable across the widest range of hardware.

Since the last revision of SYCL there are now four implementations of the standard: ComputeCpp, the first conformant commercial implementation; DPC++, Intel’s open-source SYCL implementation that is part of oneAPI; triSYCL, an open-source implementation developed by Xilinx; and hipSYCL, an open-source implementation based on AMD’s HIP stack. The SYCL working group has also grown and now includes a number of users, including members from the US Department of Energy national laboratories, who are helping to shape future revisions of SYCL to meet the needs of users and to align better with the evolution of the ISO C++ standard.

Over recent years the SYCL working group has gathered feedback from users and implementors across various domains and is introducing a number of flagship features that expand platform support and address user requirements.

The first of these is SYCL generalization, which removes the restriction that SYCL must be implemented on top of OpenCL, allowing conformant SYCL implementations to target a number of other back-ends. The change also introduces new APIs which allow users to query the available back-ends of a SYCL implementation, make use of back-end-specific extensions, and interoperate with the underlying API of a back-end. With this change, SYCL becomes a high-level programming model that can target a great variety of hardware, enabling both generic programming and back-end-specific optimizations.

The next is a new API for managing the process of compiling and linking SYCL kernels from various different sources, including offline compilation, which will now be a first-class feature of SYCL. This new API is designed to be flexible enough to support back-ends other than OpenCL and to provide a more strictly typed and thread-safe interface for compiling and linking kernels. The new interface has built-in support for specialization constants, a feature already present in SPIR-V, Vulkan and OpenCL that has so far been available in SYCL only as an extension provided by Codeplay. We expect the usage of specialization constants in SYCL applications to become one of the key aspects of achieving performance on different hardware.

Another feature is the introduction of host tasks, which provide users with a way to submit arbitrary C++ code as tasks scheduled within a SYCL task graph. These tasks can be used as callbacks triggered after device events, but will also provide the ability to interoperate with the underlying backend API from within the task.
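A minimal sketch of what a host task might look like, based on the host task interface proposed for SYCL 2020 (names and exact signatures may change as the provisional specification is finalized):

```cpp
#include <CL/sycl.hpp>
#include <cstdio>

int main() {
  using namespace cl::sycl;
  queue q;
  constexpr size_t N = 16;
  int* data = malloc_shared<int>(N, q);   // USM shared allocation

  // A device kernel producing some data.
  event e = q.parallel_for<class fill>(range<1>(N),
                                       [=](id<1> i) { data[i] = static_cast<int>(i[0]); });

  // A host task scheduled within the same SYCL task graph: arbitrary C++ code
  // that the runtime runs on the host once the kernel above has completed.
  q.submit([&](handler& cgh) {
    cgh.depends_on(e);
    cgh.host_task([=]() {
      std::printf("first element computed on the device: %d\n", data[0]);
    });
  });

  q.wait();
  free(data, q.get_context());
}
```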

This presentation will give an overview of all the different features above and many more, explaining their motivation and how to use them with example code.

Bringing performant support for Nvidia hardware to SYCL

Ruyman Reyes, Gordon Brown and Rod Burns. (Codeplay Software)

The SYCL™ programming model provides developers with the ability to write standard C++ code for heterogeneous systems, and accelerate execution using a range of different processors including CPUs, GPUs and FPGAs from a range of manufacturers.

This talk will describe how Codeplay™ has collaborated with Intel® to deliver an open source SYCL back-end implementation for the Clang LLVM compiler project that provides full support for Nvidia GPUs. We will present details of how the implementation has been designed and the current status of the implementation. This is an ongoing project from Codeplay that will see further progress over the year, including support for more CUDA features, integration with libraries and performance improvements.

Software developers will be able to use this implementation with Intel’s DPC++ framework to target Nvidia GPUs using any SYCL code, without any porting or special tricks required. If you have existing SYCL code, or if you are writing new SYCL code, you can compile it and target Nvidia GPUs without modifications.

A number of SYCL 1.2.1 extensions are provided to expose most capabilities of the CUDA programming model, and they have been used to provide feedback for the SYCL 2020 specification that provides support for multiple backends.

Data Parallel C++: Enhancing SYCL Through Extensions for Productivity and Performance

James Brodman, Michael Kinsner, Ben Ashbaugh, Jeff Hammond, Alexey Bader, John Pennycook, Jason Sewall and Roland Schulz (Intel)

SYCL is a heterogeneous programming framework built on top of modern C++. Data Parallel C++, recently introduced as part of Intel’s oneAPI project, is Intel’s implementation of SYCL. Data Parallel C++, or DPC++, is being developed as an open-source project on top of Clang and LLVM. It combines C++, SYCL, and new extensions to improve programmer productivity when writing highly performant code for heterogeneous architectures.

This talk will describe several extensions that DPC++ has proposed and implemented on top of SYCL. While many of the extensions can help to improve application performance, all of them work to improve programmer productivity by both enabling easy integration into existing C++ applications, and by simplifying common patterns found in SYCL and C++. DPC++ is a proving ground where the value of extensions can be demonstrated before being proposed for inclusion in future versions of the SYCL specification. Intel contributes DPC++ extensions back to SYCL, to enable a unified standards-based solution.

The extensions that this talk will cover include:

Unified Shared Memory, which adds support for pointer-based programming to SYCL and provides a shared-memory programming model that significantly improves upon the shared virtual memory (SVM) model defined in OpenCL

Unnamed Kernel Lambdas, which simplify development for applications and libraries

In-order Queues, which simplify the common pattern of kernels that execute in sequence

Subgroups, which enable efficient execution of specific collective operations across work items

Reductions, which allow easily expressing an important computational pattern across subgroups, workgroups, and entire devices

Language and API simplifications, which include C++ improvements such as template argument deduction guides, type aliases, and additional overloads of methods to reduce the verbosity of code
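As a small illustration of how several of these extensions combine, the hypothetical sketch below uses an in-order queue, a USM shared allocation, and unnamed kernel lambdas; it follows the DPC++ beta spelling, which may evolve as the extensions are proposed for SYCL:

```cpp
#include <CL/sycl.hpp>

int main() {
  using namespace cl::sycl;

  // In-order queue: kernels execute in submission order, so no explicit
  // dependency management is needed for this simple pipeline.
  queue q{property::queue::in_order{}};

  constexpr size_t N = 1 << 20;
  float* x = malloc_shared<float>(N, q);   // USM shared allocation

  // Unnamed kernel lambdas: no <class KernelName> template argument required
  // (enabled in DPC++ with -fsycl-unnamed-lambda).
  q.parallel_for(range<1>(N), [=](id<1> i) { x[i] = 1.0f; });
  q.parallel_for(range<1>(N), [=](id<1> i) { x[i] *= 2.0f; });  // runs after the fill

  q.wait();
  free(x, q.get_context());
}
```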

The DPC++ compiler project is located at http://github.com/intel/llvm.

Definitions for the extensions can be found both in the compiler and runtime code as well as in a repository located at: https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions

RTX-RSim: Accelerated Vulkan room response simulation for time-of-flight imaging

Peter Thoman, Markus Wippler and Thomas Fahringer (University of Innsbruck) and Robert Hranitzky (tofmotion)

Time-of-Flight camera systems are an essential component in 3D scene analysis and reconstruction for many modern computer vision applications. The development and validation of such systems requires testing in a large variety of scenes and situations. Accurate room impulse response simulation greatly speeds up development and validation, as well as reducing its cost, but large computational overhead has so far limited its applicability.

While the overall algorithmic requirements of this simulation differ significantly from 3D rendering, the recently introduced hardware raytracing support in GPUs nonetheless provides an interesting new implementation option. In this paper, we present a new room impulse simulation method, implemented with Vulkan compute shaders and leveraging VKRay hardware raytracing. We also extend this method to asynchronous streaming with low overhead in order to overcome the limitations of on-board GPU memory when simulating very large scenes.

Our implementation is, to the best of our knowledge, the first ever application of Vulkan hardware raytracing in a non-rendering simulation setting. Compared to a state-of-the-art multicore CPU implementation running on 12 CPU cores, we achieve an overall speedup of factor 20.9 when streaming is not required, and 17.6 with streaming, on a single consumer GPU.

Evaluating the performance of HPC-style SYCL applications

Tom Deakin and Simon McIntosh-Smith (University of Bristol)

SYCL is a parallel programming model for developing single-source programs for heterogeneous platforms. To this end, it allows a single code base to be written which can run on different architectures. For this study, we develop applications in SYCL which are representative of those often used in High-Performance Computing. Their performance is benchmarked on a variety of CPU and GPU architectures from multiple vendors, and compared to well-optimised versions written in OpenCL and other parallel programming models.

Modeling heterogeneous computing performance with offload advisor

Cedric Andreolli, Zakhar Matveev and Vladimir Tsymbal (Intel)

Programming of heterogeneous platforms requires thorough analysis of applications at the design stage to determine the best data and work decomposition between the CPU and the accelerating hardware. In many cases the application already exists in a conventional CPU programming language such as C++, and the main problem is to determine which parts of the application would benefit from being offloaded to an accelerating device. An even bigger problem is to estimate how much of a performance increase one should expect from acceleration on a particular heterogeneous platform. Each platform has its own limitations that affect the performance of offloaded compute tasks, e.g. data transfer costs, task initialization overhead, and memory latency and bandwidth constraints. In order to take those constraints into account, software architects and developers need tooling for collecting the right information and producing recommendations to make the best design and optimization decisions.

In this presentation we will introduce the basics of offload performance estimation analysis and the Offload Advisor tool, which is intended to help with the application design process. Offload Advisor is an extended version of Intel® Advisor, a code modernization, programming guidance, and performance estimation tool that supports OpenCL and SYCL/Data Parallel C++ on CPU and GPU. It provides codesign, performance modeling, analysis, and characterization features for industry-size applications written in C, C++, Fortran, and mixed Python*.

Offload Advisor analysis helps to determine which sections of a code can be offloaded to a GPU, accelerating the performance of a CPU-based application. It provides metrics and performance data such as the projected speedup and a call tree showing offloaded and accelerated regions, identifies key bottlenecks (algorithmic, compute, caches, memory, throughput/latency), and more. It considers not only compute and memory limitations, but also the time required to transfer data and to execute the code on the target hardware.

Performance estimates provided by the tool are not limited to GPUs. It uses Accelerator Performance Models (APMs) to model a target accelerator. Although, to date, APMs are available only for the Intel Gen architecture (the graphics module integrated into the CPU) and the Intel Xe architecture (a discrete GPU board), the approach is extensible to future architectures.

The tool is also flexible enough to analyze applications that are already written in a language intended for heterogeneous platforms, such as SYCL, but currently running on the CPU. In this case the result of the analysis is a projection of the performance increase if the application were executed on CPU+GPU.

If an application is already designed for a heterogeneous platform, i.e. written in OpenCL and executing compute tasks on the integrated GPU, Intel Advisor offers a GPU Roofline analysis. The GPU Roofline analysis helps estimate and visualize the actual performance of GPU kernels, using benchmarks and hardware metric profiling against hardware-imposed performance ceilings, and determines the main limiting factor. With GPU profiling it collects OpenCL™ kernel timings and memory data, measures the hardware limitations, and collects floating-point and integer operation counts, similarly to Intel Advisor for CPU.
Offload Advisor is a new tool which is being actively developed alongside the new acceleration architectures at Intel.

ComputeAorta: A toolkit for implementing heterogeneous programming models

Alastair Murray and Ewan Crawford (Codeplay Software)

The modern heterogeneous programming landscape has a wide variety of programming models targeting a range of hardware that is equally diverse. ComputeAorta from Codeplay Software Ltd is designed to be able to provide implementations of heterogeneous APIs, such as OpenCL or Vulkan Compute, on hardware ranging from DSPs to large machine learning accelerators. This talk is about how the design of ComputeAorta has evolved over the years to enable engineers to quickly implement industry-standard APIs for such devices, exposing their full performance capabilities with minimal effort.

ComputeAorta exists within Codeplay’s ComputeSuite stack along with the ComputeCpp implementation of SYCL and SYCL libraries such as SYCL-BLAS and SYCL-DNN. Thus, ComputeAorta is designed to provide a standards-compliant interface to custom, heterogeneous hardware that is used by applications further up the ComputeSuite stack. Key design goals of ComputeAorta are to provide mechanisms to support SPIR-V and similar technologies that enable high-level programming models; to be able to map the parallelism inherent in data-parallel programming models to hardware via compiler optimizations; to expose irregular hardware features via language or API extensions so that programmers can reliably achieve top performance; and to minimize the amount of effort required to create a correct implementation of a heterogeneous API on a new hardware device. These design goals are achieved using an internal specification for a very low level programming model called “Core”. Standardized programming models, such as OpenCL and Vulkan Compute, are implemented in terms of the Core specification, and Core is implemented for each unique hardware device. This separation of concerns allows a dedicated customer team to focus on each new device, implementing the specification in whatever way is necessary to achieve top performance on that hardware. To aid customer teams, the ComputeAorta toolkit contains a reference CPU implementation of Core, example OpenCL extensions, a set of compiler passes, a math library, and carefully maintained build and test infrastructure.

In this talk we will cover the challenges of supporting multiple heterogeneous APIs in a single codebase and the implications of implementing public APIs in terms of another abstraction layer. The talk will reference related or alternative approaches, such as Intel’s Level Zero [1], clvk [2], and POCL [3]. We will also cover how separating different aspects of the implementation via a specification allows the project to scale to many varied customer hardware devices and continuously adapt in the ever-evolving fields of heterogeneous compute architectures and programming models, all the while retaining centralized test suites that enforce correctness across all API implementations. We will use our experience implementing multiple heterogeneous programming models to provide comments on future-proofing for upcoming APIs, such as OpenCL Next, and our experience implementing on a variety of hardware to explain the rationale of our design decisions. This will all help the audience to understand what the key concerns for implementing heterogeneous programming models are, and what to consider should they too end up embarking on such an engineering project.

[1] https://spec.oneapi.com/oneL0/core_INTRO.html
[2] https://github.com/kpet/clvk
[3] https://doi.org/10.1007/s10766-014-0320-y

SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL

Aksel Alpay and Vincent Heuveline (Heidelberg University)

The SYCL ecosystem currently contains four major SYCL implementations: Codeplay’s ComputeCpp, Intel’s LLVM/clang implementation, triSYCL led by Xilinx, and hipSYCL led by Heidelberg University. HipSYCL, an open source project available at https://github.com/illuhad/hipSYCL, is the first SYCL implementation that does not build on OpenCL, providing instead CUDA/HIP and OpenMP backends that allow it to target CPUs, NVIDIA GPUs and AMD GPUs. Since hipSYCL builds on the CUDA/HIP clang frontend, augmented with a clang plugin to add support for SYCL constructs, it is inherently interoperable with existing CUDA or HIP code bases and vendor-optimized libraries, and can expose the latest hardware features, such as intrinsics, as soon as they become available in the clang CUDA/HIP toolchain.

In this presentation, we will review the architecture of hipSYCL, as well as the current state of the implementation, including performance and limitations. We will also discuss future directions and recent and ongoing work, in particular the development of an entirely new runtime which among other features, introduces a generalization to multiple backends, a batched kernel submission model, and ahead-of-time scheduling capabilities with the ability to reason about the consequences of scheduling decisions. This will allow us to experiment in hipSYCL with scheduling models that are more general than what is currently defined in the SYCL specification.

Evaluating the Performance of the hipSYCL toolchain for HPC kernels on NVIDIA V100 GPUs

Brian Homerding and John Tramm (Argonne National Laboratory)

Future HPC leadership computing systems for the United States Department of Energy will utilize GPUs for acceleration of scientific codes. These systems will utilize GPUs from various vendors which places a large focus on the performance portability of the programming models used by scientific application developers. In the HPC domain, SYCL is an open C++ standard for heterogeneous computing that is gaining support. This is fueling a growing interest in understanding the performance of SYCL toolchains for the various GPU vendors.

In this paper, we produce SYCL benchmarks and mini-apps whose performance on the NVIDIA Volta GPU is analyzed. We utilize the RAJA Performance Suite to evaluate the performance of the hipSYCL toolchain, followed by a more detailed investigation of the performance of two HPC mini-apps. We find that the SYCL kernels compiled directly to CUDA perform at a competitive level with their CUDA counterparts when comparing the straightforward implementations.

Study of OpenCL atomic functions and their acceleration using the subgroup extensions

Hongqiang Wang, Jeng-Hau Lin and Alex Bourd (QUALCOMM) and Jiahui Huang (University of California-Riverside)

In this work we study and compare the performance of a number of OpenCL 1.2 atomic functions on GPUs from various vendors, including the Intel HD 520 GPU, the Nvidia Jetson TX2 GPU, and Adreno GPUs. Tests are designed and optimized for different use cases, including atomics with a uniform address, atomics with non-uniform addresses, atomics with return values, and atomics with no return value. We collect and analyze the results, and also try to interpret them using the publicly available technical documents.

We then attempt to improve atomic performance by exploiting the OpenCL subgroup extensions newly standardized by the Khronos Group. For example, by using sub_group_reduce_op, we can perform a parallel reduction within each subgroup first and then issue just one atomic operation per subgroup for some simple cases. For more complex cases, e.g., atomic operations requiring the return of the old value, a more sophisticated scheme using the subgroup_scan_exclusive_op subgroup function is designed. The most challenging scenario, non-uniform atomics requiring the return of the old value, can also be accelerated by using a subgroup ballot operation within a nested loop.
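For the simple uniform-address case, the approach described above might look like the following hypothetical kernel sketch (OpenCL C, requiring subgroup support such as cl_khr_subgroups):

```c
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

// Reduce within each subgroup first, then issue a single atomic per subgroup
// instead of one atomic per work-item.
kernel void count_matches(global const int* flags, global int* counter) {
  const size_t gid = get_global_id(0);

  // Per-work-item contribution (1 if this element matches, 0 otherwise).
  const int contribution = flags[gid] ? 1 : 0;

  // Sum the contributions across the subgroup without touching global memory.
  const int partial = sub_group_reduce_add(contribution);

  // Only one work-item per subgroup performs the atomic update.
  if (get_sub_group_local_id() == 0) {
    atomic_add(counter, partial);
  }
}
```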

We will present and analyze the performance numbers. These experiments were designed on Qualcomm’s Adreno GPUs, but can easily be extended to GPUs from other vendors with atomic and subgroup support.

Presentation Slides – Pending

SYCL-Bench: A Versatile Single-Source Benchmark Suite for Heterogeneous Computing

Aksel Alpay and Vincent Heuveline (Heidelberg University), Sohan Lal and Nicolai Stawinoga (TU Berlin), Philip Salzmann, Peter Thoman and Thomas Fahringer (University of Innsbruck) and Biagio Cosenza (University of Salerno)

SYCL is a royalty-free Khronos Group open standard that enables heterogeneous programming using pure C++, targeting a broad range of parallel devices, including multicore CPUs, GPUs, and FPGAs, without any additional attributes or pragmas. While SYCL kernels follow a data-parallel model, they are implicitly organized in a task graph built by the runtime from data access specifications. Scheduling, data management and synchronization between different tasks are handled implicitly by the SYCL runtime, which varies depending on the implementation. While this simultaneously preserves programmer productivity and allows the SYCL runtime to automatically perform optimizations such as overlapping data transfers and kernel execution, it is not apparent whether the SYCL implementation actually employs such optimizations for a particular code pattern. Benchmarks are therefore necessary to characterize the performance of programming models such as SYCL that rely heavily on implicit operations. To this end, we present SYCL-Bench, a versatile benchmark suite written in SYCL. SYCL-Bench not only contains benchmarks to characterize the hardware but also SYCL-specific benchmarks that present optimization opportunities to the SYCL runtime and test how well a particular SYCL implementation capitalizes on those opportunities. The SYCL-Bench benchmarking methodology includes: 109 codes suited for hardware characterization; 24 codes based on data-parallel patterns such as reduction; and 9 codes to evaluate SYCL-specific runtime features. We experimentally demonstrate the effectiveness of SYCL-Bench by performing a device characterization on NVIDIA GeForce GTX Titan X, GeForce 1080 Ti, and AMD Radeon VII GPUs, and by evaluating the runtime efficiency of the hipSYCL and ComputeCpp SYCL implementations.

Characterizing optimizations to memory access patterns using architecture-independent program features

Aditya Chilukuri, Josh Milthorpe and Beau Johnston (Australian National University)

HPC developers are faced with the challenge of optimizing OpenCL workloads for high performance on novel architectures.
The Architecture Independent Workload Characterisation (AIWC) tool is a plugin for the Oclgrind OpenCL simulator that gathers metrics of OpenCL programs that can be used to understand and predict program performance on an arbitrary given hardware architecture.

However, AIWC metrics are not always easily interpreted and do not reflect some important memory access patterns affecting efficiency across architectures.
We propose the concept of parallel spatial locality — the closeness of memory accesses simultaneously issued by OpenCL work-items (threads).

We implement the parallel spatial locality metric in the AIWC framework, and analyse gathered results on matrix multiply and the Extended OpenDwarfs OpenCL benchmarks.

The differences in the obtained parallel spatial locality metric across implementations of matrix multiply reflect the optimizations performed.
The new metric can be used to distinguish between the OpenDwarfs benchmarks based on the memory access patterns affecting their performance on various architectures.

The improvements suggested to AIWC will help HPC developers better understand memory access patterns of complex codes, a crucial step in guiding optimization of codes for arbitrary hardware targets.

Experiences with OpenCL in PyFR: 2014—Present

Freddie Witherden (Texas A&M University) and Peter Vincent (Imperial College London)

PyFR is an open source high-performance computing (HPC) framework for performing scale-resolving computational fluid dynamics simulations [1]. The algorithmic core of PyFR is the flux reconstruction (FR) approach of Huynh [2], which combines the geometric flexibility of finite volume schemes with the high-order accuracy and efficiency of spectral schemes. Primarily written in Python, PyFR aims to be performance portable across a range of hardware platforms. This is accomplished through the use of a bespoke domain specific language based around the Mako templating engine and a range of run-time code generation backends. Our approach enables PyFR to target platforms through OpenMP annotated C kernels, CUDA kernels, and OpenCL kernels [3].

In this talk I will discuss our experiences with OpenCL in PyFR. This will include the current role of the OpenCL backend within PyFR and our plans for the future apropos SYCL and Intel’s oneAPI initiative. The performance of the OpenCL backend will be compared and contrasted to that of the ‘native’ backends in PyFR. Furthermore, we will also highlight the limitations of OpenCL and related standards, specifically in the areas of MPI-awareness and the availability of performance primitives. Implementation quality on the part of hardware vendors will also be discussed.

References
[1] F. D. Witherden, A. M. Farrington, and P. E. Vincent, PyFR: An Open Source Framework for Solving Advection-Diffusion Type Problems on Streaming Architectures Using the Flux Reconstruction Approach. Computer Physics Communications, 185(11), 2014, 3028–3040.
[2] H. T. Huynh, A flux reconstruction approach to high-order schemes including discontinuous Galerkin methods. 18th AIAA Computational Fluid Dynamics Conference, 2011.
[3] F. D. Witherden, B. C. Vermeire, and P. E. Vincent, Heterogeneous computing on mixed unstructured grids with PyFR. Computers & Fluids, 120, 2015, 173–186.

Evaluation of modern GPGPU technologies for image processing

Joachim Meyer (Stemmer Imaging)

Image processing has high computational requirements. As many papers have indicated, a number of image processing operations can be optimized by heavy parallelization of the computation. Currently, one of the best options for parallelized image processing is using GPGPU. In the area of GPGPU, a rather wide range of APIs is available and finding an appropriate choice for a project is difficult.

To provide guidance on selecting the right API in an image processing context, four GPU programming models were compared according to their platform independence, usability, and performance. To gain information on the usability and performance metrics, a test project was created, implementing a number of image processing tasks in each of the four GPU programming models. The reference algorithms form a pipeline for processing images from polarization cameras. The four investigated APIs are CUDA, OpenCL 1.2, Vulkan and SYCL. While CUDA, OpenCL, and OpenMP are compared in a wide number of papers, the more recent standards SYCL and Vulkan are not yet included in as many comparisons, and it is very rare to find both of them included in a single investigation. The aim of this work is to provide an overview of the most important factors to consider when choosing a GPGPU API for a project, taking the new Khronos standards into account.

The direct comparison of the APIs carried out for this work culminates in a decision matrix to aid organizations in their process of selecting an offloading API. It shows, for example, that CUDA and SYCL are comparable in terms of development cost, support for modern C++ and ease of integration in existing code bases. When considering modern standard adoption and GPU portability, Vulkan has advantages, while OpenCL and SYCL can run on CPUs and FPGAs. In the tooling category, CUDA has the most specialized and effective tools, whereas Vulkan has almost no compute-specific tools available. CUDA and Vulkan provide the best performance (on the tested devices), while Vulkan was able to outperform CUDA on an Nvidia Jetson.

It becomes clear that all four APIs have their advantages and may be suitable for different applications. Vulkan is the API of choice in mobile consumer applications. CUDA is convenient for fast development but only if it is feasible to support Nvidia GPUs only. OpenCL might be chosen if a considerable number of different desktop GPUs and FPGAs should be supported. Although some metrics still show room for improvement, SYCL seems to grow into a good all-rounder with all the available implementations. It could benefit from weakening its tight integration with OpenCL to improve support on some platforms. A Vulkan implementation, for example, would open opportunities to target mobile and Apple devices, while hipSYCL could gain the status of a standard-conforming implementation.

Debugging SYCL programs on heterogeneous Intel architectures

Baris Aktemur, Markus Metzger, Natalia Saiapova and Mihails Strasuns (Intel)

Intel recently announced a large initiative named oneAPI that provides a direct programming model based on SYCL. As part of the oneAPI distribution, we developed a debugger that can be used to debug SYCL programs that offload kernels to CPU, GPU, or FPGA emulator devices. The debugger is based on GDB. It allows programmers to inspect the host and kernel portions of their SYCL programs seamlessly in the same debug session. To realize the debugger, we made enhancements to GDB including SIMD-based thread views and C++-related improvements. In this work we present the general architecture of the debugger, provide a sample session of how it can be used to debug a SYCL kernel running on a GPU, and discuss the challenges encountered and anticipated during the development phase. Currently a beta version of the debugger is publicly available.

Automated OpenCL GPU kernel fusion for Stan Math

Tadej Ciglarič, Rok Češnovar and Erik Štrumbelj (University of Ljubljana)

We developed an OpenCL GPU kernel fusion library for the Stan software for Bayesian statistics. The library automatically combines kernels, optimizes computation, and is simple to use. The practical utility of the library is that it speeds up the development of new GPU kernels while keeping the performance of automatically combined kernels comparable to hand crafted kernels. We demonstrate this with experiments on basic operations and a linear regression model likelihood.

Program of Talks: Posters

Accelerating NNEF framework on OpenCL devices using clDNN

Meng-Shiun Yu, Tai-Liang Chen and Jenq-Kuen Lee (National Tsing Hua University)

In recent years, the rapid development of artificial intelligence (AI) applications has promoted the emergence of various AI frameworks, such as TensorFlow, Keras, Caffe, and PyTorch. However, these frameworks each have their own file formats and inference engine flows, and they are not compatible with each other. This forces developers to expend extra effort redesigning AI models. Therefore, the Khronos Group has proposed an open, well-defined intermediate representation called the Neural Network Exchange Format (NNEF). NNEF can work as a high-level abstraction layer for exchanging AI models between frameworks. Nevertheless, the execution performance of AI models in NNEF format is an important issue. This paper focuses on improving the efficiency of executing AI models in NNEF format. We propose a solution that uses the Compute Library for Deep Neural Networks (clDNN) to accelerate the NNEF framework on OpenCL devices. The clDNN is an open-source library for AI applications, which aims to accelerate the execution of AI inference on the Intel hardware platform. Our enabling flow consists of the following steps: (1) the NNEF parser, a software tool provided by the Khronos NNEF working group that parses NNEF files, including the graph architecture (graph.nnef) and data (.dat); (2) our NNEF code generator, which extends and maps NNEF to clDNN; and (3) connecting and executing the clDNN API to generate inference results on the Intel hardware platform. Our experiments are performed on an Intel Core i7-7700 CPU at 3.60GHz with a built-in HD Graphics 630 graphics card, which means that we can accelerate AI models through clDNN with OpenCL support. The experiment measures the inference of Mobilenet_v1. Our result achieves more than a 6x speedup in execution time over the C reference implementation of the NNEF specification.

Multi-platform SYCL profiling with TAU

Nicholas Chaimov, Sameer Shende and Allen Malony (ParaTools)

A major challenge in high-performance computing is performance portability. Using abstraction layers like SYCL, applications can be developed which can target, with the same code base, different execution environments, such as traditional CPUs, many-core accelerators, and GPUs of various architectures, such as those from NVIDIA, AMD, and Intel. However, cross-platform code produced in that way will not necessarily provide acceptable performance on multiple platforms. Performance analysis tools are needed to evaluate performance of SYCL applications.

Vendors provide platform-specific performance analysis tools which provide access to vendor-specific hardware performance counters. By measuring these counters, developers can determine how their applications can be optimized for a specific vendor or specific hardware platform. However, each vendor’s tool works only on that vendor’s hardware. OpenCL and SYCL provide portable profiling interfaces through which timing data can be collected on kernel executions and data transfers, but which do not provide a mechanism for accessing hardware performance counter data.
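For reference, the portable timing interface mentioned above looks roughly like this in SYCL (a sketch against SYCL 1.2.1; the OpenCL equivalent queries clGetEventProfilingInfo on events from a profiling-enabled command queue):

```cpp
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // The queue must be created with profiling enabled for events to carry timestamps.
  cl::sycl::queue q{cl::sycl::property::queue::enable_profiling{}};

  cl::sycl::buffer<float, 1> buf(cl::sycl::range<1>(1 << 20));

  cl::sycl::event e = q.submit([&](cl::sycl::handler& cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class fill_kernel>(cl::sycl::range<1>(1 << 20),
                                        [=](cl::sycl::id<1> i) { acc[i] = 0.5f; });
  });
  e.wait();

  // Kernel execution time from the event's profiling information (in nanoseconds).
  const auto start = e.get_profiling_info<cl::sycl::info::event_profiling::command_start>();
  const auto end   = e.get_profiling_info<cl::sycl::info::event_profiling::command_end>();
  std::cout << "kernel took " << (end - start) * 1e-6 << " ms\n";
}
```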

The TAU Performance System is a framework for performance instrumentation, measurement, analysis and visualization of codes running on large-scale parallel computing systems. It supports multiple parallel programming models, including MPI, OpenMP, and OpenACC, as well as monitoring of kernel execution and hardware performance counters from CPUs and GPUs from multiple vendors, including GPUs from NVIDIA, AMD, and Intel. This poster describes the development of features within TAU for the capture of hardware performance counter data from SYCL applications.

We describe 1) an interface to ingest timing data from the OpenCL/SYCL profiling interface into TAU; 2) our experience developing an interface to collect hardware performance counter data on NVIDIA GPUs using NVIDIA’s CUPTI library for SYCL runtimes which target a CUDA backend (hipSYCL); and 3) our experience developing an interface to collect hardware performance counter data on AMD GPUs using AMD’s rocprofiler library. Lastly, we describe 4) our early experiences with the collection of hardware performance counter data from SYCL applications running in Intel’s oneAPI runtime using the oneAPI Level 0 tracing and metrics interfaces.

Tensor Virtual Machine (TVM) on Qualcomm Adreno GPUs

Hongqiang Wang, Li He, Adarsh Golikeri and Alex Bourd (QUALCOMM)

In this poster, we introduce our work on the Tensor Virtual Machine (TVM) with Qualcomm Adreno GPUs and present some preliminary results. TVM is a popular machine learning (ML) compiler stack that targets a wide range of computing devices running ML networks. Inspired by Halide, TVM can auto-generate highly optimized CUDA and OpenCL kernels with little information provided by developers, and in many cases it can beat kernels that are hand-optimized by experts. TVM has been well tuned for many desktop and mobile devices. However, TVM support for Qualcomm’s Adreno GPUs is rather limited; for example, there is no optimized schedule for Adreno GPUs. In this work we show how we optimized the mainline TVM to get better performance on Adreno GPUs.

After enabling TVM on Adreno GPUs as a device target, we obtained the performance of MobileNet V1 generated by the mainline TVM, whose templates and schedules for mobile GPUs are primarily based on Arm’s Mali GPUs. Several optimizations for Adreno GPUs were then enabled, such as better local memory and cache strategy tuning, and adding image 1D and image 1D buffer support to TVM, which currently supports only buffers.

With these optimizations, the customized TVM achieves up to a 2.6x performance boost over the mainline TVM for MobileNet V1 on Adreno GPUs. Though this is still behind the hand-optimized kernels, it is very promising that TVM could eventually beat them if more advanced features are enabled.

We will discuss the results more in the poster, share the challenges TVM is facing from our experience, and our thoughts on the future of TVM.

Presentation Slides – Pending

Accelerating PP-distance algorithms on FPGA, with OpenCL parallelization directives and data transfer optimizations

César González (BSC / UB / IQAC-CSIC), Simone Balocco (UB) and Ramon Pons (IQAC-CSIC)

The particle-particle distance problem (pp-distance) appears in several applications such as molecular dynamics. In particular, pp-distance calculation is a computationally demanding task required to calculate X-ray spectra. With the aim of using this problem as a test bench for current FPGA technology, we evaluate how the parallel computation capability of FPGAs can be exploited to reduce the computation time. The baseline implementation was based on a float (32-bit) representation of the data, which we use to transmit information between the CPU and the FPGA. The behavior of the performance as a function of the directive values suggested that the speed was limited by the data transmission channel. Therefore, we encapsulate the data using an unsigned short (16-bit) format for the transmission. The results show that the same algorithm with 16-bit transmission runs almost twice as fast as the 32-bit one, because it can benefit from the unroll factors used on the FPGA. The main C program uses OmpSs for task invocation, while the kernel is built with OpenCL. Benchmarks have been done on an Intel discrete Arria 10 GX 1150 platform, computing a model of 2 million particles in 2,616 seconds.

Poster (PDF)

PySAP-ComSET: an accelerated Python package for compressed sensing electron tomography (CS-ET) reconstruction

Jyh-Miin Lin, Martin Jacob and Zineb Saghi (University Grenoble Alpes)

We present PySAP-ComSET: an accelerated Python package for compressed sensing electron tomography (CS-ET) reconstruction. CS-ET has proved successful for reconstructing scanning transmission electron microscopy (STEM) data acquired at a reduced number of tilt angles. Although CS-ET is computationally expensive, parallel computing can significantly accelerate the reconstruction. The Python ecosystem provides a growing number of packages for developing CS-ET.

The Python non-uniform fast Fourier transform (PyNUFFT) is a fast and accurate NUFFT/NFFT implementation on heterogeneous architectures. Thus, Fourier-based Radon projections and back-projections can be carried out rapidly. In addition, various sparsity transforms can be integrated into the reconstruction pipelines. This heterogeneous software architecture significantly simplifies the development of CS-ET algorithms on OpenCL/CUDA devices. In this work, we evaluate the performance of CS-ET on different OpenCL devices, using 6 CS algorithms, including the total-variation (TV), the total-generalised variation (TGV), the à trous algorithm (WT), the combined WT-TV, polar TV and ridgelet.

The CS-ET reconstruction pipeline can be seen in Fig. 1. We use a data set obtained from a porous silicon sample, consisting of 181 projections acquired from -90 degrees to +90 degrees with a tilt angle increment of 1 degree. The matrix size for the reconstructions is 600 × 600. The regularisation parameter was set to 0.01 for TV and TGV, 0.001 for ridgelet and 0.0001 for the wavelet transform. All algorithms are executed with 100 iterations on each device.

The benchmarks of the PySAP-ComSET package are performed on a CentOS 7 Linux server (see Table 1) equipped with dual Intel Xeon CPUs (each one has 12 cores running at 3.0-3.7GHz with 24.75MB L3 Cache) and one NVIDIA Tesla V100-PCIE 32GB (5120 cores, 32GB HBM2, 4096-bit, 897.0 GB/s, PCI Express 3.0×16). The installed drivers consist of the NVIDIA driver 435.21 and the CUDA version 10.1. The Intel SDK OpenCL (build 2020.0.270) provides OpenCL 2.1 capability. The PySAP-ComSET package was installed and tested on Python 3.7 (Anaconda3-20190529 with Intel MKL). For accelerated Radon transforms, the astra-toolbox-1.8.3, PyNUFFT 2019.2.3, Reikna-0.7.4 and PyOpenCL-2019.1.2 are configured for this work.

Fig. 2 shows the speedups of the implemented CS-ET algorithms. OpenCL provides significant speedups on the NVIDIA V100 GPU (5.2×-16.7× faster) and 3×-8× speedups on the Intel Xeon Gold 6136 processor. NVIDIA V100 OCL1.2 outperforms the astra-toolbox using CUDA in Polar TV and Ridgelet.

In future studies, we will include more dedicated sparse transforms, such as curvelets and shearlets on the accelerators. We will also test the reconstruction accuracies of different CS-ET algorithms for subsampled data acquired with 18 projections.

HIPCL: Tool for porting CUDA applications to advanced OpenCL platforms through HIP

Michal Babej and Pekka Jääskeläinen (Tampere University)

Heterogeneous-compute Interface for Portability (HIP) is an open-source C++ runtime API and kernel language. It is designed to be compatible with CUDA and to deliver close-to-native performance on CUDA platforms while exposing additional low-level hardware features. A key use case of HIP is in providing a portability route out of the NVIDIA CUDA platform, which is highlighted by an automated tool that can convert CUDA applications to HIP programs.

In this work we describe HIPCL, a new tool which allows running HIP programs on OpenCL platforms with sufficient capabilities. HIPCL thus expands the scope of the CUDA portability route from AMD ROCm platform supported targets to platforms with advanced OpenCL support. We highlight the implementation challenges of HIPCL related to the feature mismatches between CUDA and OpenCL and exemplify its runtime overheads in comparison to directly executing OpenCL applications.

Presentation Slides – Pending

High-performance micromagnetic simulations using OpenCL

Petr Kartsev (NRNU MEPhI - Moscow Engineering-Physics Institute)

Micromagnetic simulations based on Landau-Lifshitz-Gilbert (LLG) type equations allow numerical study of current and future magnetic and spintronic devices based on magnetic films or single-domain magnetic particles. Effects of the problem geometry, the particle size and orientation, and the short field pulses used for writing are usually simulated to determine the performance of magnetic memory and its stability before experimental realization [1]. In the simulation of large enough systems, the GPU approach is limited by the amount of RAM accesses during the simulation [2]. However, even smaller systems (10-20 magnetic particles) can be interesting for a preliminary search for the required behaviour under specific interaction parameters and geometry.

In this work, we present a GPGPU OpenCL approach to studying the remagnetization of a magnetic particle in a thermal noise field, using parallel calculations to meet the requirement for large statistics.

Factors to achieve the best performance in this case include (a kernel sketch illustrating this structure follows the list):

  • a simple equation and few variables fit into registers, reaching the maximum possible FLOPS;
  • a parallel version of the RANLUX random number generator [3] makes it easier to expand the volume of statistics;
  • parallel simulation of different starting orientations of the magnetization vector, reusing the same external noise signal (loaded from a global array);
  • parallel study of the same system with differing model parameters (anisotropy, orientation of the easy axis, distance between particles, shape of the writing field pulse, etc.) or a different stream of random numbers;
  • the results of the simulations are combined into statistical characteristics (for example, the average remagnetization time over all runs), which also reduces the amount of memory transfers.
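The structure implied by these points is sketched below (our illustration, not the author's code): one work-item per replica, per-replica state kept in registers, the shared noise array read from global memory, and only a summary statistic written back. The kernel body is a deliberately simplified LLG-like update, shown as OpenCL C source held in a C++ string; names such as nsteps, dt and anisotropy are placeholders.

    // Schematic ensemble kernel: each work-item integrates one replica (its own
    // starting orientation or parameter set), reuses the shared noise samples and
    // writes back a single statistic. The physics update is intentionally simplified.
    static const char* kEnsembleKernelSrc = R"CLC(
    __kernel void ensemble_llg(__global const float4* noise,       /* shared thermal-noise samples */
                               __global float*        switch_time, /* one result per replica      */
                               const int nsteps,
                               const float dt,
                               const float anisotropy)
    {
        const int replica = (int)get_global_id(0);

        /* Per-replica state lives entirely in registers. */
        float3 m = (float3)(0.0f, 0.0f, 1.0f);   /* initial magnetization (could vary per replica) */
        float  t_switch = -1.0f;

        for (int step = 0; step < nsteps; ++step) {
            /* All replicas reuse the same noise sample for this time step. */
            float3 h = noise[step].xyz;

            /* Toy effective field: uniaxial anisotropy along z plus noise. */
            float3 heff = (float3)(0.0f, 0.0f, anisotropy * m.z) + h;

            /* Simplified explicit LLG-like update, then renormalize |m| = 1. */
            float3 dm = cross(m, heff);
            m = normalize(m - dt * dm);

            if (t_switch < 0.0f && m.z < 0.0f)
                t_switch = step * dt;            /* record the remagnetization time */
        }

        /* Only one scalar per replica goes back to global memory. */
        switch_time[replica] = t_switch;
    }
    )CLC";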

Bibliography:
[1] doi:10.1088/1361-6463/aa7c04
[2] doi:10.1088/1361-6463/aaab1c
[3] doi:10.1016/j.cpc.2010.12.008

Presentation Slides – Pending

POCL-R: Distributed OpenCL runtime for low latency remote offloading

Jan Solanti, Michal Babej, Julius Ikkala and Pekka Jääskeläinen (Tampere University)

Running complex applications on mobile devices stresses their performance and power-consumption limits. As new wireless networking technologies that promise lower communication latencies appear, it is interesting to study whether dynamic offloading from mobile devices across a network to a nearby compute cluster is feasible when quality and complexity can be traded off dynamically. This type of scheme calls for a lightweight heterogeneous distributed programming layer.

Here we describe our work-in-progress distributed OpenCL runtime optimized for low latency quality-complexity trade-off cases. We call it POCL-R, since it is implemented on top of the Portable Computing Language (POCL) as a device layer implementation that exposes remote compute devices accessible over a network connection in a transparent manner on the local OpenCL platform. In order to improve the latency, we expand upon our earlier work on exploiting OpenCL-described heterogeneous task parallelism for intelligent cross-device event synchronization and buffer management.
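From the application's point of view this transparency means no code changes: the standard OpenCL platform and device enumeration, sketched below as a generic host snippet (not POCL-R source), simply lists the remote devices alongside the local ones, and buffers, kernels and events are then used on them through the usual API.

    // Generic OpenCL host snippet: a runtime such as POCL-R makes remote devices
    // show up in exactly this enumeration, with no application-side changes.
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_uint num_platforms = 0;
        clGetPlatformIDs(0, nullptr, &num_platforms);
        std::vector<cl_platform_id> platforms(num_platforms);
        clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

        for (cl_platform_id p : platforms) {
            cl_uint num_devices = 0;
            if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices) != CL_SUCCESS ||
                num_devices == 0)
                continue;
            std::vector<cl_device_id> devices(num_devices);
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, num_devices, devices.data(), nullptr);

            for (cl_device_id d : devices) {
                char name[256] = {0};
                clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
                // A device served over the network appears here like any other;
                // contexts, queues and kernels are then created on it as usual.
                std::printf("device: %s\n", name);
            }
        }
        return 0;
    }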

Presentation Slides – Pending

High-performance GPGPU OpenCL simulation of quantum Boltzmann equation

Petr Kartsev (NRNU MEPhI - Moscow Engineering-Physics Institute)

The quantum Boltzmann equation (QBE) is a universal approach of theoretical physics to describing the behaviour of complex quantum systems: electrons and holes in semiconductors; Cooper pairs and excited states in superconductors; photons in a nonlinear optical medium; and so on. The QBE approach generates an infinite chain of interconnected time-dependent differential equations for particle numbers and various correlators of increasing order. Limiting the maximum order of the correlators, we cut the chain of equations and arrive at a closed system of nonlinear time-dependent differential equations.

The kinetic equations of the lowest possible order are not always accurate, which requires increasing the approximation order; however, numerical study of realistic problems with the QBE is usually problematic due to the large number of equations and simulated values.

In this work, we report a GPGPU OpenCL solver for the system of QBE equations written in the form (1)-(2).
(a) For a model of the general form (1)-(2), the sparse matrix of coefficients is calculated in advance and loaded into GPU RAM. With a reasonable choice of system size and equation order, the values can fit into fast local or constant memory.
(b) For a sufficiently simple model, the coefficients can be calculated on the fly, removing the need to read the connections from memory. This approach is preferable, as integer calculations are much cheaper than reading from a pre-calculated array (a schematic sketch follows below).
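Since equations (1)-(2) are not reproduced here, the sketch below is only a schematic illustration of option (b): each work-item integrates one unknown, and the coupling coefficients are reconstructed from integer index arithmetic inside the kernel instead of being streamed from a pre-computed sparse array. The coefficient rule and all names are placeholders.

    // Schematic only: option (b), with coefficients generated on the fly. In option (a)
    // the same values would instead be read from a sparse matrix prepared on the host.
    static const char* kQbeRhsKernelSrc = R"CLC(
    float coeff_from_indices(int i, int j)      /* placeholder coupling rule */
    {
        return (i == j) ? -1.0f : 1.0f / (float)(1 + abs(i - j));
    }

    __kernel void qbe_rhs(__global const float* n_in,   /* particle numbers / correlators */
                          __global float*       dn_out, /* right-hand side dn_i/dt        */
                          const int             N)
    {
        const int i = (int)get_global_id(0);
        if (i >= N) return;

        float rhs = 0.0f;
        for (int j = 0; j < N; ++j) {
            /* Integer arithmetic replaces a memory read of the coefficient. */
            rhs += coeff_from_indices(i, j) * n_in[j];
        }
        dn_out[i] = rhs;
    }
    )CLC";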

The benchmarks for several GPUs of different architectures and generations will be presented.

Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL

Philip Heinisch, Katharina Ostaszewski and Hendrik Ranocha (TU Braunschweig)

When considering different hardware platforms, not only the time-to-solution can be of importance but also the energy necessary to reach it. This is the case not only for battery-powered and mobile devices but also for high-performance parallel cluster systems, due to financial and practical limits on power consumption and cooling. Recent developments in hardware and software have given programmers the ability to run the same code on a range of different devices, giving rise to the concept of heterogeneous computing. Many of these devices are optimized for certain types of applications. To showcase the differences and give a basic outlook on the applicability of different architectures to specific problems, the cross-platform OpenCL framework was used to compare both time- and energy-to-solution. A large set of devices, ranging from ARM processors to server CPUs and consumer- and enterprise-level GPUs, was used with different benchmark test cases taken from applied research applications. While the results show the overall advantages of GPUs in terms of both runtime and energy efficiency compared to CPUs, ARM devices show potential for certain applications in massively parallel systems. This study also highlights how OpenCL enables the use of the same codebase on many different systems and hardware platforms without specific code adaptations.
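As a concrete reading of "energy-to-solution" (our illustration, with made-up numbers): it is simply the average power draw multiplied by the time-to-solution, so a slower device that draws proportionally less power can still come out ahead on energy.

    // Illustrative arithmetic only: energy-to-solution = average power x time-to-solution.
    // The figures are hypothetical and serve only to show the trade-off.
    #include <cstdio>

    int main() {
        // Hypothetical measurements of the same benchmark on two devices.
        const double cpu_time_s = 30.0,  cpu_power_w = 150.0;   // faster, power-hungry
        const double arm_time_s = 120.0, arm_power_w = 10.0;    // slower, frugal

        const double cpu_energy_j = cpu_power_w * cpu_time_s;   // 4500 J
        const double arm_energy_j = arm_power_w * arm_time_s;   // 1200 J

        std::printf("CPU: %.0f s, %.0f J\n", cpu_time_s, cpu_energy_j);
        std::printf("ARM: %.0f s, %.0f J\n", arm_time_s, arm_energy_j);
        return 0;
    }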

Training machine learning network on Adreno Mobile GPUs using OpenCL

Hongqiang Wang, Bolan Jiang, Jeng-Hau Lin, Adarsh Golikeri, Alex Bourd and Li He (QUALCOMM)

Training machine learning networks on mobile devices has recently become an interesting topic, thanks to rising concerns about privacy, power, and latency in the traditional centralized training process, and the need for customized networks for different end users. In this poster, we present our early work on enabling network training on Qualcomm's advanced Adreno mobile GPUs using OpenCL. We have implemented layers using OpenCL, trained with stochastic gradient descent (SGD), to demonstrate the effectiveness and efficiency of network training on these GPUs. The network structures can be defined in a fashion similar to conventional deep-learning libraries and initialized through the Xavier initialization method to control the variance and accelerate convergence.
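To illustrate the kind of kernel involved (our sketch, not Qualcomm's implementation), the plain SGD weight update is a simple element-wise OpenCL kernel enqueued after backpropagation; the layer forward and backward passes that produce the gradients are assumed to exist elsewhere, and learning_rate is a placeholder hyper-parameter.

    // Illustrative only: the vanilla SGD update w <- w - lr * grad, written as an
    // element-wise OpenCL kernel held in a C++ string.
    static const char* kSgdUpdateKernelSrc = R"CLC(
    __kernel void sgd_update(__global float*       weights,
                             __global const float* grads,
                             const float           learning_rate,
                             const int             n)
    {
        const int i = (int)get_global_id(0);
        if (i < n)
            weights[i] -= learning_rate * grads[i];
    }
    )CLC";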

In addition, various OpenCL optimization techniques for Adreno GPUs to accelerate the training process are presented, including how to choose the optimal data layout and format and how to pack data. We also discuss the resulting performance and the challenges of training on mobile devices.

An Android demo app using OpenCL to illustrate the training process of the well-known LeNet network has been developed and will be presented alongside the poster. Besides forward and backward propagation, the demo app also supports dataset shuffling, validation, and testing.

Presentation Slides – Pending

Making banking secure via a biometrics application built using oneAPI and DPC++ based on SYCL/C++

Alessandro De Oliveira Faria (OITI / NETi Technology) and Sujata Tibrewala (Intel)

In this presentation we will look at oneVPL (the oneAPI Video Processing Library) and how it is used in the Certiface technology, which is designed to combat fraud and protect honest people through the ability to differentiate between a live person and a recorded video. Certiface is built to harness heterogeneous computing architectures, including CPUs and GPUs, from servers to notebooks. It combines software tools such as oneVPL, computer vision techniques with OpenCV, OpenVINO and deep-learning technologies built on Intel libraries such as TBB, IPP and MKL, and high-performance computing. This technology processes millions of faces per second in the cloud, making banking transactions in Brazil secure, fast and effective.

The oneAPI Video Processing Library lets developers add high-speed, real-time transcoding, decoding, and encoding to their applications. Its single video API provides direct access to advanced CPU and GPU instructions, and gives developers total control of the video hardware for their processing needs.

The library is well suited to applications spanning broadcasting, OTT and VOD, in-cloud gaming, and remote desktop solutions:
  • it includes high-performance, hardware-accelerated AVC, HEVC, and AV1 codecs;
  • it supports deployment on CPUs and GPUs;
  • its flexible API enables developers to maximize application exposure to hardware optimization.

oneAPI, and libraries such as oneVPL, derive much of their strength from being built on SYCL.

In this presentation, we will introduce oneAPI and oneVPL and provide programmers with a model of how to use these libraries to solve their own problems.

Performance portability of the MG-CFD mini-app with SYCL

Szilniczky-Erőss Botond (Pázmány Péter Catholic University)

In this poster, we carry out the comparative performance analysis of different SYCL compilers and libraries on a range of CPU and GPU architectures using the MG-CFD Mini-App.

MG-CFD is a multi-grid computational fluid dynamics proxy application, representative of a large industrial CFD code, Rolls-Royce Hydra, a nonlinear Reynolds-Averaged compressible Navier-Stokes solver, used to design turbomachinery components. MG-CFD uses an unstructured mesh discretisation, meaning the connectivity between mesh elements is explicitly described using mapping arrays. The application is built using the OP2 domain specific library, which automates parallel execution, resolving potential race conflicts. OP2 has been extended to automatically generate SYCL parallelisations of unstructured mesh loops, and to manage data between memory spaces.
SYCL promises good portability between different hardware architectures, although at this point the coverage of individual compilers and libraries is still fairly limited. As with OpenCL, performance portability of the same code is not guaranteed; changes are often required in the algorithms and the parallelisation. We evaluate a range of compilers, including Intel's LLVM SYCL compiler, hipSYCL, ComputeCpp, triSYCL and sycl-gtx, and compare them to reference versions using OpenMP and CUDA. Functioning combinations of these are studied on Broadwell, Skylake, and Cascade Lake generation Intel CPUs, ARM ThunderX2 and IBM Power9 CPUs, as well as NVIDIA V100, AMD Radeon VII, and Intel HD 530 GPUs.

We measure the execution time and achieved bandwidth of the application and its computational steps on different devices, and evaluate the differences between the compilers and libraries. Two parallelization strategies will be investigated to handle the race conditions that commonly occur in unstructured meshes. One strategy is coloring: to ensure there are no data races between elements, we color them based on potential conflicts and then execute them color by color. The other is the use of atomic operations, which are relatively expensive hardware operations for handling race conditions and are not available on all platforms. We will examine the fraction of the theoretical maximum performance of the hardware that we are able to reach, and calculate the performance portability metric for the SYCL version of this application.
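The two strategies are sketched below in SYCL 2020-style C++ (our illustration, not the OP2-generated code): an edge loop that scatters a flux into per-node residuals either with device atomics or colour by colour, relying on the fact that no two edges of the same colour share a node. All names are placeholders.

    // Illustrative SYCL sketch of the two strategies for indirect increments from an
    // edge loop into node data. Assumes a SYCL 2020 compiler (<sycl/sycl.hpp>).
    #include <sycl/sycl.hpp>
    #include <vector>

    // Strategy 1: atomics. Every edge may update any node, so updates are atomic.
    void edge_loop_atomics(sycl::queue& q,
                           sycl::buffer<int, 1>&   edge_to_node,   // 2 node ids per edge
                           sycl::buffer<float, 1>& edge_flux,
                           sycl::buffer<float, 1>& node_res,
                           size_t num_edges) {
        q.submit([&](sycl::handler& h) {
            sycl::accessor map(edge_to_node, h, sycl::read_only);
            sycl::accessor flux(edge_flux, h, sycl::read_only);
            sycl::accessor res(node_res, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(num_edges), [=](sycl::id<1> idx) {
                const size_t e = idx[0];
                const float  f = flux[e];
                for (int v = 0; v < 2; ++v) {
                    sycl::atomic_ref<float, sycl::memory_order::relaxed,
                                     sycl::memory_scope::device,
                                     sycl::access::address_space::global_space>
                        r(res[map[2 * e + v]]);
                    r.fetch_add(v == 0 ? f : -f);   // scatter the edge flux to its two nodes
                }
            });
        });
    }

    // Strategy 2: colouring. Edges are pre-sorted by colour so that edges of one
    // colour never share a node; each colour runs as its own kernel, and the
    // read-write accessor on node_res makes the colours execute in order.
    void edge_loop_coloured(sycl::queue& q,
                            sycl::buffer<int, 1>&   edge_to_node,
                            sycl::buffer<float, 1>& edge_flux,
                            sycl::buffer<float, 1>& node_res,
                            const std::vector<size_t>& colour_offsets) { // num_colours + 1 entries
        for (size_t c = 0; c + 1 < colour_offsets.size(); ++c) {
            const size_t first = colour_offsets[c];
            const size_t count = colour_offsets[c + 1] - first;
            q.submit([&](sycl::handler& h) {
                sycl::accessor map(edge_to_node, h, sycl::read_only);
                sycl::accessor flux(edge_flux, h, sycl::read_only);
                sycl::accessor res(node_res, h, sycl::read_write);
                h.parallel_for(sycl::range<1>(count), [=](sycl::id<1> idx) {
                    const size_t e = first + idx[0];
                    const float  f = flux[e];
                    res[map[2 * e]]     += f;   // safe: no other edge of this colour
                    res[map[2 * e + 1]] -= f;   // touches these two nodes
                });
            });
        }
    }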

Presentation and Slides – Pending

SYCL-based Monte Carlo simulation of neutron transport

Dmitry Savin (Dukhov Automatics Research Institute)

Geant4 is a C++ toolkit for the simulation of the passage of particles through matter. Its development is mainly supported for the simulation of the experiments at the Large Hadron Collider (LHC) at CERN, but its areas of application also include nuclear physics and studies in medical and space science.

The Monte Carlo simulation of the detectors is becoming the bottleneck in the analysis of the LHC data due to the increased luminosity and granularity.
The most resource-consuming part of the Monte Carlo simulation of a particle detector is the electromagnetic shower, because it involves a large number of particles while being governed by relatively simple physics.

A GeantV vector prototype capable of full electromagnetic shower simulation was developed and tested for physics results and performance. It employs a more granular parallelism than the Geant4 simulation, which allows similar tracks to be grouped into baskets and processed out of order. The core findings were that the basketizing overhead is significant and that the performance gain due to vectorization is much less than expected (about ten to twenty percent instead of four to eight times), while the improved data locality and the avoidance of virtual calls and branching lead to a 2-3x speedup due to better usage of both the data and instruction caches. The use of accelerators and Machine Learning (ML) based models was also found to be necessary to meet the computing performance goals.

The second most resource-consuming part is the simulation of neutron interactions, due to the large number of interactions of each particle. While precise tracking of neutrons is not always needed in High Energy Physics (HEP) applications, it is more important in nuclear medicine and radiation protection, which deal with lower energies. We are developing the CHIPS-TPT package for the Monte Carlo simulation of the transport of neutrons with energies up to 20 MeV, with strict conservation of energy, momentum and quantum numbers in each interaction, which makes it possible to reproduce single-scattering effects and avoid unphysical fluctuations. In the absence of a sufficient theoretical model, most neutron transport simulations use a large amount of evaluated nuclear data; they are memory-bound and thus sensitive to data layout and access patterns. Therefore a data-oriented design, with a set of computing kernels specialized at compile time for maximum utilization of the hardware, can significantly benefit performance.

All of the above makes SYCL a candidate framework for the simulation. SYCL for the Monte Carlo simulation of neutron transport was assessed at Los Alamos National Laboratory (LANL), where its applicability was found to be rather limited due to the restrictions imposed by the programming model (LA-UR-19-25636). Nevertheless, since those restrictions appear to benefit performance regardless of the framework, we are developing a SYCL-based prototype for the simulation of neutron transport in a simplified geometry and assess the overall feasibility and the effect on performance.
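A minimal sketch of what "kernels specialized at compile time" can mean in SYCL (our illustration, not CHIPS-TPT code): the interaction channel is a template parameter, so each instantiation becomes a separate, branch-free kernel with no virtual dispatch on the device. ElasticScattering and RadiativeCapture are placeholder policy types with toy physics.

    // Illustration only: compile-time specialization of a transport step by channel.
    // Assumes a SYCL 2020 compiler (<sycl/sycl.hpp>).
    #include <sycl/sycl.hpp>

    struct ElasticScattering {
        static float sample(float energy) { return 0.9f * energy; }   // toy physics
    };
    struct RadiativeCapture {
        static float sample(float)        { return 0.0f; }            // neutron absorbed
    };

    template <typename Channel>
    void step_neutrons(sycl::queue& q, sycl::buffer<float, 1>& energies) {
        q.submit([&](sycl::handler& h) {
            sycl::accessor e(energies, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(e.size()), [=](sycl::id<1> i) {
                // The channel is fixed at compile time: no virtual calls and no
                // branching on the interaction type inside the kernel.
                e[i] = Channel::sample(e[i]);
            });
        });
    }

    // step_neutrons<ElasticScattering>(q, buf) and step_neutrons<RadiativeCapture>(q, buf)
    // then instantiate two independent, fully specialized kernels.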

Presentation and Slides – Pending