IWOCL & SYCLCON 2023 Award Winners

Outstanding Poster

Leveraging MLIR for Better SYCL Compilation. Víctor Pérez-Carrasco, Codeplay Software

Outstanding Presentation

Comparing the Performance of SYCL Runtimes for Molecular Dynamics Applications. Andrey Alekseenko, KTH Royal Institute of Technology

Outstanding Paper

Standardizing Complex Numbers in SYCL. Nevin Liber, Argonne National Laboratory

Conference Program and Recordings

Including links to the video presentations, slides and the ACM proceedings.

Program Quick Links: Tuesday Tutorials | Wednesday Talks & Panels | Thursday Talks | Posters

Tuesday 18 April, 2023 | Full Day Tutorials

Welcome Refreshments
08:30 – 09:15 GMT
Tutorial 1: Introduction to SYCL [1996]
Course Leader: Christopher Edsall, University of Cambridge.
09:15 – 17:00 GMT

Tutorial Format: A Full Day Workshop

Tutorial Outline: This introductory workshop explores the challenges of programming for heterogeneous high performance computing systems. In order to grow the SYCL community, we need not just experts but new users to try out the technology.

In this hands-on training, learners will be guided by Intel-qualified Research Software Engineers from Research Computing Services at the University of Cambridge. The team have delivered this training to multiple audiences over the last several years. The workshop will explain how SYCL can solve the challenges of programming in a heterogeneous world. It will help learners to use tools like Intel’s oneAPI to enable their research workflows. The workshop will provide an introduction to the SYCL language and programming model with an emphasis on practical exercises. Learners will be led through short presentations, followed by exercises giving them a solid foundation to build on. Learners will gain experience in using the SYCL programming language to target different types of accelerator devices (e.g., CPUs, GPUs, FPGAs). Research Computing Services will provide access to the CSD3 supercomputer at Cambridge for the purpose of the workshop. The session will include familiarisation with the use of Jupyter notebooks and a programming challenge towards the end.

The modules to be covered include:

  • Introduction to oneAPI and SYCL
  • SYCL Program Structure
  • SYCL Unified Shared Memory
  • SYCL Sub Groups
  • SYCL Kernel Reductions
  • SYCL Task Scheduling and Data Dependency.
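
As a flavour of the exercises these modules build toward, here is a minimal sketch of a SYCL 2020 program using Unified Shared Memory. It is our own illustration, not taken from the workshop materials.

    // Minimal sketch of a SYCL 2020 vector addition using USM.
    #include <sycl/sycl.hpp>

    int main() {
      sycl::queue q;  // default selector: picks an available CPU/GPU/FPGA

      constexpr size_t n = 1024;
      float *a = sycl::malloc_shared<float>(n, q);
      float *b = sycl::malloc_shared<float>(n, q);
      float *c = sycl::malloc_shared<float>(n, q);
      for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

      // One work-item per element; wait() completes the kernel before the
      // host reads the results back.
      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
         c[i] = a[i] + b[i];
       }).wait();

      sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    }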

Delegates attending this tutorial should bring their own laptop to run the hands-on sessions. Remote access to the Cambridge supercomputers will be provided using the Cambridge Service for Data-Driven Discovery (CSD3) and Open OnDemand. These services require multi-factor authentication, so attendees will need a smartphone to complete the installation.

Tutorial 2: SYCL Techniques and Best Practices [2791]
Tutorial Lead: Rod Burns, Codeplay Software.  | Co-presenters (provisional): Hugh Delaney, Codeplay Software. Aksel Alpay, University of Heidelberg. Ronan Keryell, AMD. Igor Vorobtsov, Intel.
09:15 – 17:00 GMT

The tutors involved in this tutorial have presented multiple times at IWOCL and SYCLcon and have helped to expand and develop the SYCL Academy open source learning materials. This tutorial brings a more advanced set of lessons.

The SYCL programming model means heterogeneous programming using C++ is now more accessible than ever. SYCL uses modern standard C++, and it’s a programming model that lets developers support a wide variety of devices (CPUs, GPUs, FPGAs, and more) from a single code base. The growing popularity of this programming model means that developers are eager to understand how to use all the features of SYCL and how to achieve great performance for their code.

Gain further expertise in SYCL in a practical environment focused more on writing code than on PowerPoint, with help from experts in the SYCL community.

This tutorial assumes existing knowledge and some experience of using SYCL to develop code for accelerators such as GPUs.

The concepts introduced to attendees will build on the fundamentals of SYCL, covering strategies for optimizing code, managing data flow, using different memory access patterns, understanding work-group sizes, using vectorization, the importance of ND-ranges, and making the most of the multiple devices available in your architecture. A short sketch of one such topic follows.
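For instance, work-group sizes and ND-ranges come together in kernels like the following reduction sketch. This is our own illustration (assuming n is a multiple of the chosen work-group size), not taken from the tutorial materials.

    // Illustrative ND-range kernel with an explicit work-group size of 256
    // and a SYCL 2020 reduction over USM memory.
    #include <sycl/sycl.hpp>

    float sum(sycl::queue &q, const float *in, size_t n) {
      float *out = sycl::malloc_shared<float>(1, q);
      *out = 0.0f;
      q.parallel_for(
         sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{256}},
         sycl::reduction(out, sycl::plus<float>()),
         [=](sycl::nd_item<1> it, auto &acc) {
           acc += in[it.get_global_id(0)];
         })
       .wait();
      float result = *out;
      sycl::free(out, q);
      return result;
    }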

The majority of the tutorial consists of hands-on coding activities; these exercises have been tried and tested at previous conferences and are regularly updated to improve the learning experience. We anticipate significant improvements to these this year. Almost every short presentation is accompanied by a session where attendees will run through a coding exercise. The tutorial organizers will provide several ways to run through the code examples: on the attendee’s own laptop where possible, or in a cloud environment giving access to a hosted multi-GPU system. Attendees can choose to use multiple SYCL compilers and target different processors from multiple vendors.

Wednesday 19 April, 2023 | Conference Sessions

Registration and Welcome Refreshments
08:30 – 09:15 GMT
Welcome
Simon McIntosh-Smith, Conference Chair. University of Bristol.
09:15 – 09:30 GMT
SYCL State of the Union
Michael Wong, Distinguished Engineer at Codeplay. Khronos SYCL Working Group Chair.
09:30 – 10:00 GMT
Experiences Migrating CUDA to SYCL: A Molecular Docking Case Study [2679]
Leonardo Solis-Vasquez, Technical University of Darmstadt.  | Co-authors: Edward Mascarenhas, Intel. Andreas Koch, Technical University of Darmstadt.
10:00 – 10:30 GMT
In recent years, Intel introduced oneAPI as a unified and cross-architecture programming model based on the Data Parallel C++ (DPC++) language, which, in turn, is based on the standard C++ and SYCL languages. In order to facilitate the migration of legacy CUDA code originally written for NVIDIA GPUs, developers can employ the Intel DPC++ Compatibility Tool, which aims to automatically migrate code from CUDA to SYCL. While this tool-assisted code migration is a good starting point for leveraging the Intel oneAPI ecosystem, manual steps for code completion and tuning are still required. In this paper, we present our experiences migrating AutoDock-GPU, a widely-used molecular docking application, from CUDA to SYCL. Our discussion focuses on: (1) the use of this automated source-code migration tool, (2) the required manual code refinement for functionality and optimization, and (3) a comparison of the performance achieved in this manner on multi-core CPUs as well as on high-end GPUs, such as the NVIDIA A100 and the recently launched Intel Data Center Max Series 1550 device.
Coffee Break and Poster Session
10:30 – 11:00 GMT
SYCLomatic Compatibility Library: Making Migration to SYCL Easier [2501]
Andy Huang, Intel.
11:00 – 11:30 GMT

SYCL[1] is a royalty-free, cross-platform C++ programming model for heterogeneous computing. SYCL provides the necessary programming interfaces, such as device, queue, kernel, and memory interfaces (including buffer and accessor), as well as features like USM. As a programming model for heterogeneous computing, Intel oneAPI[2] provides a SYCL compiler and runtime to support SYCL kernel-based programming, and a set of optimized libraries to support API-based programming.
SYCLomatic[3] is a source-to-source migration tool which helps migrate existing CUDA application source code to SYCL source code by leveraging SYCL interfaces and the optimized libraries provided by Intel oneAPI. One of the major challenges for SYCLomatic is that, in some cases, due to differences in programming APIs, expressing the semantics of a single line of CUDA code in SYCL requires additional data structures or multiple lines of operations. To assist the migration and make the migrated code performant and maintainable, SYCLomatic implements a compatibility library, which consists of additions to SYCL interfaces and a set of compatible APIs for popular libraries. Without any dependency on SYCLomatic, the compatibility library can also be used as a standalone library for SYCL programming.
In this talk, we share the reasons for creating the compatibility library and its design.
Closing Semantic Gaps:
The first part of the compatibility library closes semantic gaps with CUDA code by introducing new classes that add functionality to SYCL interfaces such as device, queue, malloc and image accessors.
1. Utility features to access queues across devices and threads (see the sketch after this list):
Keeping and passing around sycl::device pointers between host functions is tedious. The compatibility library therefore introduces a singleton device manager class that tracks the usage of each device across CPU threads.
With the device manager class, the following features are easy to achieve:
a. Get the “current” device in a thread: the class keeps a map between threads and the last device used in each thread, making it easy to access the desired device from a host function.
b. Get the default queue for a device: when offloading a task to a device, SYCL requires the developer to create a new queue on the device if a pointer to a previously created queue is not available. The class keeps a globally available default queue for each device and provides a convenient interface to retrieve it.
c. Device-level operations (create queue, synchronize, reset): the class records every queue creation and maps queues to devices, so device-level synchronization can be achieved easily.
2. Pointer-like memory operations for non-USM mode:
Since managing memory through pointer operations is more natural for programmers, emulating pointer operations on top of sycl::buffer provides pointer-like memory operations (malloc, free, arithmetic, etc.) for devices that do not support USM.
3. Flexible interface to fetch image data:
The compatibility library introduces a class which simplifies fetching image data, e.g., extracting one or two channels from an image accessor.
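
The abstract describes the device manager only in prose; a minimal sketch of how such a singleton might look is given below. All names here are our own invention for illustration, not SYCLomatic's actual API.

    // Hypothetical sketch of the singleton device manager described above.
    #include <sycl/sycl.hpp>
    #include <map>
    #include <mutex>
    #include <thread>
    #include <vector>

    class dev_mgr {
    public:
      static dev_mgr &instance() {  // singleton access point
        static dev_mgr mgr;
        return mgr;
      }
      // (a) The "current" device of the calling thread (defaults to 0).
      sycl::device &current_device() {
        std::lock_guard<std::mutex> lock(m_);
        auto it = current_.find(std::this_thread::get_id());
        return devices_[it == current_.end() ? 0 : it->second];
      }
      void select_device(unsigned idx) {  // remember the per-thread choice
        std::lock_guard<std::mutex> lock(m_);
        current_[std::this_thread::get_id()] = idx;
      }
      // (b) A globally available default in-order queue per device.
      sycl::queue &default_queue(unsigned idx) {
        std::lock_guard<std::mutex> lock(m_);
        auto it = queues_.find(idx);
        if (it == queues_.end())
          it = queues_.emplace(idx,
                 sycl::queue{devices_[idx],
                             sycl::property::queue::in_order{}}).first;
        return it->second;
      }
      // (c) Device-level synchronization would iterate queues_ and call
      // wait() on every recorded queue of the given device.
    private:
      dev_mgr() : devices_(sycl::device::get_devices()) {}
      std::vector<sycl::device> devices_;
      std::map<std::thread::id, unsigned> current_;
      std::map<unsigned, sycl::queue> queues_;
      std::mutex m_;
    };
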
Compatible APIs:
The second part of the compatibility library provides syntactic sugar for frequently used API calls.
1. Free functions for atomic operations (a sketch follows after this list):
With sycl::atomic_ref, performing an atomic operation requires the following two steps:
a. Constructing the sycl::atomic_ref
b. Executing the atomic operation on the sycl::atomic_ref
The compatibility library introduces a set of templated atomic free functions to help developers simplify their code.
2. Utility classes to simplify device memory allocation:
Since sycl::malloc cannot be used to allocate a multi-dimensional array and creating a device-accessible static or global variable requires multiple steps, a device memory class performs the memory allocation, keeps the dimension information, and provides the following features:
a. A simple interface to allocate a multi-dimensional array and pass it to the device
b. A simple interface to create a static or global variable which can be accessed on the device
3. 2D and 3D memory operations (USM and non-USM):
SYCL does not provide 2D and 3D memory operations such as allocation, memcpy and memset, so free functions providing this functionality save a lot of effort for developers.
4. Compatible APIs for popular CUDA libraries:
Libraries such as BLAS (Basic Linear Algebra Subprograms), CCL (Collective Communication Library), DNN (Deep Neural Network Library), STL algorithms and FFT (Fast Fourier Transform) are widely used in heterogeneous applications. While the Intel oneAPI package provides SYCL-based implementations of these libraries, libraries from different implementations that provide similar core functionality can differ in API design. The compatibility library contains APIs that bridge this usage gap and let developers implement SYCL applications with the interfaces they are familiar with.
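
As an illustration of the first item above, a templated atomic free function might look like the following sketch; the name and default template arguments are our assumptions, not necessarily those of the actual library.

    // Illustrative templated atomic free function wrapping the two-step
    // sycl::atomic_ref pattern described above.
    #include <sycl/sycl.hpp>

    template <typename T>
    T atomic_fetch_add(T *addr, T operand) {
      // Step (a): construct the atomic_ref over a global-memory location.
      sycl::atomic_ref<T, sycl::memory_order::relaxed,
                       sycl::memory_scope::device,
                       sycl::access::address_space::global_space>
          ref(*addr);
      // Step (b): execute the atomic operation.
      return ref.fetch_add(operand);
    }

    // Inside a kernel: atomic_fetch_add(&counter[0], 1.0f);
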
Since SYCL is a relatively young language specification, many existing heterogeneous computing applications, libraries and frameworks may not have a SYCL implementation. With the compatibility library closing some of the syntactic and semantic gaps between SYCL and other heterogeneous computing languages, developers should be able to create SYCL-based libraries and frameworks with less effort.
To improve the functionality and usability of the compatibility library, there is still work to do, such as enabling the compatibility library to co-exist with SYCL-implemented components with respect to device selection, queue activation and task synchronization, and bridging gaps with more APIs from popular CUDA libraries.

Reference:
1. https://www.khronos.org/sycl/
2. https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html
3. https://github.com/oneapi-src/SYCLomatic


Porting SYCL accelerated neural network frameworks to edge devices [1949]
Dylan Angus, Codeplay Software.  
11:30 – 12:00 GMT

Portable hardware acceleration has become increasingly necessary with the rise in popularity of edge computing. Edge computing, the distributed computing paradigm that encourages data to be processed and stored as close to its source as possible, is needed in areas where bandwidth and latency are restricted and where network stability, privacy or security cannot be relied upon. Examples of such situations are autonomous mobile robots, such as autonomous tractors, which often have numerous cameras connected to the host, all needing processing in areas where there may be no reliable connection to a cloud-based platform. Additionally, bridge-surveying drones, where mapping and path-planning are needed with low latency, can benefit from a lightweight, compact, low-powered device, especially when there are size and energy consumption requirements.

Thus, edge devices, which work as small but compact computers, leverage onboard accelerators to tackle various robotics, computer vision and AI tasks directly on the device without needing an external connection. These accelerators often take the popular form of a GPU, as in Nvidia’s Jetson development kit series, which is driven by the same workflows as Nvidia’s AI software and cloud-native frameworks while staying lean, compact and less energy-demanding. However, with the increasing popularity of FPGAs, in the future we could see more edge devices like AMD/Xilinx’s KR260 robotics development kit that operate at low power.

Hence, with the surge in the usefulness of edge devices and the variety in the brand and type of accelerators, the need for hardware portability in edge devices expands as well. Thus, as we will show in this talk, SYCL, an open-standard, high-level parallel programming model that provides portability not only at the API level but also at the compiler level, delivers this hardware portability by enabling the same software to run on CPU-, GPU- and FPGA-based edge devices. Additionally, we will show how we maintain performance through device-specific kernel specialisation.

The Open Neural Network Exchange (ONNX) is an open-source artificial intelligence ecosystem of technology companies and research organizations that establishes open standards for representing machine learning algorithms and software tools. ONNX is available on GitHub. This presentation will explain how we used DPC++, an open-source SYCL implementation, to compile the SYCL backend of the ONNX runtime to target NVIDIA’s Jetson series architecture. DPC++ allows us to compile the ONNX runtime SYCL backend and use the Jetson’s onboard GPU, and also to use ComputeAorta, Codeplay’s multi-target, multi-platform framework, as an OpenCL implementation to target the Jetson’s onboard CPU. We will show the performance we get using the ONNX runtime CPU backend and the SYCL backend targeting the Jetson’s GPU and CPU. The ONNX runtime SYCL backend is implemented using the lightweight templated SYCL-BLAS and SYCL-DNN libraries, which include kernels with tuning parameters such as cache size, work-group size and local memory size based on the device-specific hardware. Once tuned for the Jetson, the SYCL backend showed comparable performance with the native CUDA backend used by ONNX.

Finally, using the ONNX runtime SYCL backend and an Nvidia Jetson Xavier NX edge device, we were able to perform tests demonstrating inference for drone-to-drone detection using LIDAR data. The tests use a trained PointNet model with the SYCL backend to segment and classify the LIDAR points between a drone and the background.

For future work, we hope to enable and tune SYCL-DNN/SYCL-BLAS for other Jetson devices as well as FPGA and RISC-V-based edge devices.

Performance Evolution of Different SYCL Implementations based on the Parallel Least Squares Support Vector Machine Library [1602]
Marcel Breyer, University of Stuttgart.  | Co-authors: Alexander Van Craen and Dirk Pflüger, University of Stuttgart.
12:00 – 12:30 GMT

In machine learning and scientific computing, some of the biggest challenges are efficient and performant portable computing. With our Parallel Least Squares Support Vector Machine (PLSSVM) library, we have not only developed an unrivaled Support Vector Machine (SVM) implementation for huge dense data sets, but we have also created a representative benchmark for a frequently encountered task in scientific computing, a (modified) matrix-vector multiplication. PLSSVM supports multiple backends—OpenMP, CUDA, HIP, OpenCL, and SYCL—to be able to target the most widely used hardware platforms in machine learning and scientific computing.

In this paper, we use PLSSVM to compare different DPC++ compiler and hipSYCL library versions over the period of one year. Furthermore, we compare two versions (one from February and the other from November 2022) with each other and report their respective performance evolution in depth. We also put these results in relation to our other implemented backends and report their performance portability on three different hardware platforms: an NVIDIA GPU, an AMD GPU and an Intel CPU.

Our results show that installing new compiler or library versions can have surprisingly vast impacts in both directions. In our case, the nd_range kernel runtimes were up to 83% faster on an NVIDIA GPU when using a newer DPC++ compiler. Also for hipSYCL, using the new omp.accelerated compilation flow improves the nd_range performance on CPUs by over 90%. Compared to OpenCL, in our results SYCL also offers better performance portability while being easier to use, as indicated by the drastically fewer lines of code needed in our PLSSVM library. While OpenCL only has a performance portability of 50%, DPC++ achieved the highest value, 70%, within the performance metric provided by Pennycook et al.

The code, utility scripts, and documentation are all publicly available on GitHub: https://github.com/SC-SGS/PLSSVM.

Panel Discussion: Machine Learning with OpenCL and SYCL [ML Panel]
  • Panel Chair: Tom Deakin, University of Bristol
  • Chris Gearing, Mobileye
  • Marcel Breyer, University of Stuttgart
  • Dylan Angus, Codeplay Software
  • Michael Wong, Codeplay
12:30 – 13:00 GMT
Lunch Break and Poster Session
13:00 – 14:00 GMT
One Pass to Bind Them: The First Single-Pass SYCL Compiler with Unified Code Representation Across Backends [2566]
Aksel Alpay, Heidelberg University.  | Co-author: Vincent Heuveline, Heidelberg University.
14:00 – 14:30 GMT

Current SYCL implementations rely on multiple compiler invocations to generate code for host and device, and typically even employ one compiler invocation per required backend code format such as SPIR-V, PTX or amdgcn. This makes generating “universal” binaries that can run on all devices supported by a SYCL implementation very time-consuming, or outright impractical. The ability to generate such universal binaries is however important, e.g. when a software vendor wishes to distribute binaries to users that rely on unknown hardware configurations.

To address this issue, we present the very first SYCL implementation with a single-source, single compiler pass (SSCP) design and a unified code representation across backends. This allows a single compiler invocation to generate a binary that can execute kernels on all supported devices, dramatically reducing both compile times as well as the user effort required to generate such universal binaries. Our work is publicly available as part of Open SYCL, the SYCL implementation formerly known as hipSYCL, and supports Intel GPUs through SPIR-V, NVIDIA GPUs through CUDA PTX and AMD GPUs through ROCm amdgcn code.

Our new compiler operates in two phases: at compile time, during the regular host compilation pass, it extracts the LLVM IR of kernels. This IR is stored in a backend-independent fashion in the host binary. At runtime, the embedded LLVM IR is lowered to the format required by the backend drivers (e.g. PTX, SPIR-V, amdgcn). This approach enables portability of a single code representation even if backends do not support a common code format, while still allowing interoperability with vendor-specific optimized libraries.

We find that our new compiler can generate highly portable binaries that run on any NVIDIA, Intel or AMD ROCm GPU with only 20% additional compilation time compared to a regular clang host compilation. On our test system, this is roughly 2.2x faster than compiling with the existing Open SYCL compiler for just three AMD GPUs.

We also show that the cost of the additional runtime compilation steps can be expected to be approximately comparable to the cost of runtime compilation that backend drivers already perform today, e.g. to lower SPIR-V to machine code.

Lastly, we present early performance results on four different GPUs from three vendors. We find that performance is usually within 10% of current multi-pass SYCL compiler techniques, with the maximum deviations ranging from a performance regression of 13% to a speedup of 27%. This implies that compared to current SYCL compilation techniques, our new compiler achieves similar performance while substantially decreasing compile times and increasing the portability of generated binaries.


Implementation Techniques for SPMD Kernels on CPUs [966]
Joachim Meyer, Saarland University.  | Co-authors: Aksel Alpay, Holger Fröning and Vincent Heuveline, Heidelberg University. Sebastian Hack, Saarland University.
14:30 – 15:00 GMT

More and more frameworks and simulations are developed using heterogeneous programming models such as OpenCL, SYCL, CUDA, or HIP. A significant hurdle to mapping these models to CPUs in a performance-portable manner is that implementing work-group barriers for such kernels requires providing forward-progress guarantees so that all work-items can reach the barrier.

This work provides guidance for implementations of single-program multiple-data (SPMD) programming models, such as OpenCL, SYCL, CUDA, or HIP, on non-SPMD devices, such as CPUs. We discuss the trade-offs of multiple approaches to handling work-group-level barriers. We present our experience with the integration of two known compiler-based approaches for low-overhead work-group synchronization on CPUs. In doing so, we discuss a general design flaw in deep loop fission approaches, as used in the popular Portable Computing Language (PoCL) project, that makes them miscompile certain kernels.
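
To make the core idea concrete, the following is our own deliberately simplified caricature of how loop fission executes a barrier kernel on a CPU without one thread per work-item. Real implementations must also handle barriers inside arbitrary control flow, which is where the design flaw discussed above arises.

    // Original SPMD kernel, per work-item i:  A(i); barrier(); B(i);
    // Fissioned execution of one work-group of size wg on a single thread:
    #include <cstdio>

    void A(int i) { std::printf("A(%d)\n", i); }  // work before the barrier
    void B(int i) { std::printf("B(%d)\n", i); }  // work after the barrier

    void run_work_group(int wg) {
      for (int i = 0; i < wg; ++i) A(i);  // all work-items reach the barrier
      for (int i = 0; i < wg; ++i) B(i);  // only then does any pass it
    }

    int main() { run_work_group(4); }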

For our evaluation, we integrate PoCL’s “loopvec” kernel compiler into Open SYCL and implement continuation-based synchronization (CBS) in the same. We compare both against Open SYCL’s library-only fiber implementation using diverse hardware: we use recent AMD Rome and Intel Icelake server CPUs but also two Arm server CPUs, namely Fujitsu’s A64FX and Marvell’s ThunderX2. We show that compiler-based approaches outperform library-only implementations by up to multiple orders of magnitude. Further, we adapt our CBS implementation into PoCL and compare it against the loopvec approach in both PoCL and Open SYCL. We find that our implementation of CBS, while being more general than PoCL’s approach, gives comparable performance in PoCL and even surpasses it in Open SYCL. We therefore recommend its use in general.

Accelerating Simulink/Matlab projects with SYCL [8638]
Uwe Dolinsky, Codeplay Software.
15:00 – 15:30 GMT

Matlab/Simulink is a very popular software development tool which is widely used in academia and industry to build and evaluate complex dynamical systems. It combines graphical modelling with the ability to develop algorithms directly in the Matlab language and offers various toolboxes targeting a wide range of applications in automotive, robotics, image processing, machine learning and other areas. For deployment on various platforms the Matlab/Simulink projects are typically translated into C/C++ by MathWorks’ Simulink/Matlab Coder tool and subsequently built by the C/C++ toolchain for the targeted platform.

In this talk we present a new tool flow to accelerate Simulink/Matlab projects with SYCL. This enables Matlab/Simulink projects to take advantage of the growing open-source SYCL ecosystem to accelerate complex Simulink models on a wide range of diverse platforms in a standards-based way, and to benefit directly from performance-optimized SYCL algorithms and tools for tuning the performance of Simulink models on different hardware. The presented tool flow does not require Matlab/Simulink to be installed, and no Matlab/Simulink dependencies are required. The approach is non-disruptive in that Simulink/Matlab developers do not need to know SYCL and do not need to adapt their Simulink/Matlab solutions to take advantage of it.

In this work, Simulink/Matlab solutions are translated with the help of open-source tools into C++ code calling an API which can be customized to use different algorithms and libraries as backends. For example, vector/matrix operations are typically performed in this flow by the Eigen library, while scalar operations are performed by the standard C++ library. API functions for specific Matlab/Simulink operations or operands can be implemented or overloaded/specialized to use different libraries if needed. For example, operations on large matrices or vectors can be implemented using SYCL-BLAS to take advantage of highly parallel linear algebra operations that can be autotuned to maximize performance for a given platform.

The presented tool flow converts entire Simulink solutions into C++ code. It does so by reading in Simulink solution files (in .slx or .mdl format) and extracting the models and their associated submodels and blocks. It also reads in and integrates data files (*.mat files), data dictionary files, and Matlab files to initialize the workspace. The blocks are then scheduled and translated into C++ code constituting the model step, which is run to execute the model. Models can contain submodels, model references and subsystems, which can contain embedded Matlab files that are executed when the associated blocks are executed. These Matlab files are converted into C++ code and integrated with the code generated from the containing Simulink block. The tool flow solves various challenges in converting Matlab code into C++ and supports most Matlab language features.

The presented tool flow has already been applied to accelerate complex Matlab/Simulink models in automotive projects: developed as part of a collaboration project with Williams Advanced Engineering, the tool flow was used to accelerate large Simulink models of novel hybrid battery types for electric vehicles via SYCL. These models run on an embedded platform directly on the battery, and acceleration enables more complex and accurate models to be run. This presentation shows the tool flow applied to open-source Simulink/Matlab code. We will present the results at SYCLcon.


Coffee Break and Poster Session
15:30 – 16:00 GMT
Evaluation of SYCL Suitability for High-Performance Critical Systems [9555]
Leonidas Kosmidis, Barcelona Supercomputing Center.  | Co-authors: Matina Maria Trompouki, Barcelona Supercomputing Center. Cristina Peralta, Universitat Politècnica de Catalunya and BSC.
16:00 – 16:30 GMT

Upcoming safety-critical systems require high-performance processing, which can be provided by the multi-cores and embedded GPUs found in several systems-on-chip (SoCs) targeting these domains. So far, only low-level programming models and APIs, such as CUDA or OpenCL, have been evaluated. In this paper, we evaluate the effectiveness of a higher-level programming model, SYCL, for critical applications executed on such embedded platforms. In particular, we are interested in two aspects: performance and programmability.

In order to conduct our study, we use the open-source GPU4S Bench benchmarking suite for space and an open-source pedestrian detection application representing the automotive sector, which we port to SYCL and whose behavior we analyze. We perform our evaluation on a high-performance platform featuring an NVIDIA GTX 1080 Ti as well as on a representative embedded platform, the NVIDIA Xavier AGX, which is considered a good candidate for future safety-critical systems in both domains, and we compare our results with other programming models. Our results show that in several cases SYCL is able to obtain performance close to highly optimised code using CUDA or NVIDIA libraries, with significantly lower development effort and complexity, which confirms the suitability of SYCL for programming high-performance safety-critical systems.

What’s New in SYCL for Safety Critical Systems [9678]
Erik Tomusk, Codeplay Software.  | Co-author: Verena Beckham, Codeplay Software.
16:30 – 17:00 GMT

In April 2022, Codeplay and CoreAVI initiated the SYCL SC Exploratory Forum within Khronos to evaluate industry interest in a new Khronos API based on SYCL and targeted at safety-critical industries. A year later, we take stock of the progress the Exploratory Forum has made on defining SYCL for Safety-Critical Systems, and we share some of the insights we have gained.

Safety-critical industries, like avionics, automotive, nuclear, and rail, require their software to be compliant with safety standards such as ISO 26262, ISO 21448/SOTIF, DO-178C, and UL 4600, as well as to adhere to guidelines such as those defined by AUTOSAR and MISRA. While safety-critical industries have traditionally been cautious about adopting new or unproven technologies, interest by these industries in C++ and heterogeneous programming has increased significantly in recent years. This is driven, in large part, by the need for AI technologies to implement advanced features, such as autonomous behavior. Compute-heavy workloads like AI require high-level programming frameworks as well as considerable computing power, which can only be achieved by a heterogeneous system design.

SYCL’s single-source C++ programming model has already become popular in the HPC industry. The proposed SYCL for Safety-Critical Systems API aims to open up high-level heterogeneous compute to safety-critical industries by introducing modifications and extensions to SYCL to make both SYCL applications and SYCL implementations easier to certify to industry safety standards.

In this talk, we will give an overview of what certification to a safety standard implies for a compiler and runtime based on SYCL.

Khronos Exploratory Forums are designed to be open to companies and individuals who are not yet Khronos participants. A key aim of the SYCL SC Exploratory Forum was to hear from “end-user” companies in safety-critical domains, and to evaluate the market for a safety-critical API based on SYCL. The talk will give an overview of the companies that participated and their general feedback.

In the initial phase, the SYCL SC Exploratory Forum heard presentations from its participants and collated a “wish list” of features for a high-level heterogeneous compute API. The talk will give an overview of features that were requested and a discussion of some of the more interesting points.

In the second stage, the members of the Forum analyzed these “wishes” according to their relevance to a safety-critical standard specifically based on SYCL. A list of core requirements for the SYCL for Safety-Critical Systems API was distilled from the wish list and will act as a guide during the definition of the new standard. The talk will include an overview of the requirements, background on the finer technical points, and some of the technical discussions that were had around these topics.

The presentation will also describe some of the open questions that are still to be answered during the design of the SYCL for Safety-Critical Systems API.

The presentation will close with a call to join the discussions and help define the details of this new standard, which promises to open up the SYCL programming model to safety-critical industries.

Panel Discussion: OpenCL and SYCL [Panel Discussion]
  • Panel Chair: Simon McIntosh-Smith, University of Bristol
  • Leonidas Kosmidis, Barcelona Supercomputing Center
  • Aksel Alpay, University of Heidelberg
  • Nevin Liber, Argonne National Laboratory
  • Michael Wong, Codeplay
17:00 – 18:00 GMT
Close Day 2
18:00 GMT
Conference Dinner – Grand Hall, Sidney Sussex College
19:00 – 22:00 GMT

Thursday 20 April, 2023 | Conference Sessions

Welcome Refreshments
08:30 – 09:15 GMT
Welcome and An Introduction to the use of SYCL at the Cambridge Open Zettascale Lab
Dr Paul J. Calleja, Research Computing Services, University of Cambridge.
09:15 – 10:00 GMT
Our local host will welcome everyone to Thursday’s sessions and provide an overview of the Research Computing Services at Cambridge, the Cambridge Open Zettascale Lab and all the work they are doing on SYCL.
Transforming Fortran weather and climate applications to OpenCL using PSyclone [5971]
Sergi Siso, Hartree Centre STFC UKRI.  | Co-authors: Andrew Porter and Rupert Ford, Hartree Centre STFC UKRI.
10:00 – 10:30 GMT
Specialized hardware accelerators have gained popularity in high-performance computing as a way to increase the performance and power efficiency characteristics of the systems. However, a large number of scientific applications running on these systems are written in Fortran, which does not provide a performance-portable vendor-agnostic API to target heterogeneous architectures. In this paper, we use the PSyclone source-to-source code generation and transformation system to automatically translate a subset of the Fortran language to OpenCL for weather and climate applications conforming to the PSyKAl kernel-based parallelism model. This allows Fortran applications to take advantage of the mature accelerator portability and runtime compilation capabilities of OpenCL. However, a direct translation does not always produce optimal code for each device. To improve the performance portability of the generated code, the presented approach also allows application developers to provide a list of additional code transformations needed to make the generated OpenCL kernels appropriate for each target architecture. We tested the system with the NEMOLite2D Fortran application, which produced competitive performance portability results for CPU and GPU platforms from different vendors and provided an initial port of the application to an FPGA system.
Coffee Break and Poster Session
10:30 – 11:00 GMT
Comparing the Performance of SYCL Runtimes for Molecular Dynamics Applications [2519]
Andrey Alekseenko, KTH Royal Institute of Technology.  | Co-author: Szilárd Páll, KTH Royal Institute of Technology.
11:00 – 11:30 GMT

SYCL is a cross-platform, royalty-free standard for programming a wide range of hardware accelerators. It is a powerful and convenient way to write standard C++17 code that can take full advantage of available devices. There are already multiple SYCL implementations targeting a wide range of platforms, from embedded to HPC clusters. Since several implementations can target the same hardware, application developers and users must know how to choose the most fitting runtime for their needs. In this talk, we will compare the runtime performance of two major SYCL runtimes targeting GPUs, DPC++ and hipSYCL, for the purposes of GROMACS, a high-performance molecular dynamics engine.

Molecular dynamics (MD) applications were among the earliest adopters of GPU acceleration, with force calculations being an obvious target for offloading. MD is an iterative algorithm where, in its most basic form, on each step the forces acting between particles are computed and then the equations of motion are integrated. As the computational power of GPUs grew, the strong scaling problem became apparent: the biophysical systems modeled with molecular dynamics typically have fixed sizes, and the goal is to perform more time steps, each taking less than a millisecond of wall time. This places high demands on the underlying GPU framework, requiring it to efficiently schedule multiple small tasks with minimal overhead, achieving overlap between CPU and GPU work for large systems and keeping the GPU occupied for smaller ones. Another requirement is that application developers have control over the scheduling to optimize for external dependencies, such as MPI communication.
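
To illustrate why launch overhead matters, here is a deliberately simplified sketch of such an iteration loop in SYCL (our own illustration, not GROMACS code): each step launches short kernels, so per-launch runtime overhead directly limits the achievable steps per second.

    // Simplified MD step loop: a force kernel followed by an integration
    // kernel on each iteration; 'pos' and 'force' are USM device pointers.
    #include <sycl/sycl.hpp>

    void md_loop(sycl::queue &q, float *pos, float *force, size_t n,
                 int steps) {
      for (int s = 0; s < steps; ++s) {
        auto f = q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
          force[i] = 0.0f;  // stand-in for the pairwise force evaluation
        });
        q.parallel_for(sycl::range<1>{n}, f, [=](sycl::id<1> i) {
          pos[i] += force[i];  // stand-in for integrating equations of motion
        });
      }
      q.wait();  // until here, queued steps can overlap with CPU work
    }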

GROMACS is a widely-used MD engine, supporting a wide range of hardware and software platforms, from laptops to the largest supercomputers. Thus, performance portability is crucial to keep the code not only efficient but also maintainable. The initial support for NVIDIA accelerators, using CUDA, was added to GROMACS in 2010. Since then, heterogeneous parallelization has been a major target for performance optimization, not limited to NVIDIA devices but later adding support for GPUs of other vendors, as well as Xeon Phi accelerators. GROMACS initially adopted SYCL in its 2021 release as a replacement for its previous GPU portability layer, OpenCL. In further releases, the number of offloading modes supported by the SYCL backend steadily increased. As of GROMACS 2023, SYCL support in GROMACS has achieved near feature parity with CUDA while allowing a single code base to target the GPUs of all three major vendors with minimal specialization.

While this clearly supports the portability promise of modern SYCL implementations, the performance of such portable code remains an open question, especially given the strict requirements of MD algorithms. In this talk, we compare the performance of GROMACS across a wide range of system sizes when using the DPC++ and hipSYCL runtimes on high-performance NVIDIA, AMD, and Intel GPUs. Besides the analysis of individual kernel performance, we focus on the runtime overhead and the efficiency of task scheduling when compared to a highly optimized implementation using the native frameworks, and discuss the possible sources of suboptimal performance and the number of vendor-specific code branches required to achieve optimal performance.

Particle track reconstruction on heterogeneous platforms with SYCL [1533]
Bartosz Sobol, Jagiellonian University.  | Co-author: Grzegorz Korcyl, Jagiellonian University.
11:30 – 12:00 GMT

With the SYCL programming model comes the promise of relatively easy parallel and accelerated code development as well as out-of-the-box portability between various hardware platforms from different vendors.

One of the areas which can highly benefit from these characteristics of the programming model is particle physics experiments, where large amounts of data need to be processed in multiple stages by a wide variety of algorithms of different profiles. Such a data processing pipeline is often required to consume streaming data from the detectors in an online manner. Modern hardware platforms, accelerators, and their increasing performance are an opportunity for collaborations to collect and analyze more data, more effectively and with better accuracy.

On the other hand, building a complex software stack by teams with a limited number of developers becomes more and more challenging in a multi-vendor landscape and with new programming models and APIs emerging.

As the physics experiments are designed and computing solutions evaluated many years ahead of the actual run, there is also a need for the codebase of this kind of scientific software to be future-proof, e.g., being able to run on a next-generation computing cluster that uses GPU accelerators from different vendors or entirely different platforms like upcoming powerful APU devices.

In this project, we begin with a simple single-threaded implementation of a particle track reconstruction algorithm proposed for one of the subdetectors in the PANDA experiment, under development as a part of the FAIR facility at GSI, Darmstadt, Germany.

We start with the task of porting the algorithm to SYCL with minimal effort, i.e., trying to keep the kernel code as close to the original implementation as possible, while attempting to maintain good parallelization and competitive performance in an accelerated environment.

After many iterations, experimenting with different memory layouts as well as various approaches to expressing parallelism and data flow to tame the memory-bound characteristics of the algorithm, we came up with a final version that is still similar in terms of code structure to the original implementation and can achieve satisfying performance across all kinds of different targets.

This ultimate implementation, comprising 7 kernels and multiple auxiliary accelerated functions, was evaluated using major SYCL implementations: hipSYCL and DPC++. Benchmarks were conducted on a wide variety of platforms from leading vendors including NVIDIA V100, NVIDIA A100, and AMD MI250 GPUs as well as AMD EPYC Rome and Intel Cascade Lake CPUs, and finally AMD/Xilinx Alveo U280 FPGA accelerator card. For the latter, an experimental AMD/Xilinx compiler based on Intel’s LLVM version was used.

We also compare the performance with a CUDA implementation built in the same manner as the final SYCL one, showing that the SYCL version can achieve performance comparable to the native one.

We show that developing performant and portable code with truly single source code for CPU and GPU is possible and accessible for developers with an intermediate understanding of parallelization and how to effectively interact with GPU-based accelerators.

Finally, for more exotic types of devices, like FPGA-based accelerators, some host code modifications are required to successfully compile and execute the software on such platforms. While not competitive in terms of performance, we believe that the ability to run this kind of algorithm on FPGA without significant adjustments is an achievement in itself.

Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL [3335]
Gregor Daiß, University of Stuttgart.  | Co-authors: Patrick Diehl and Hartmut Kaiser, Louisiana State University. Dirk Pflüger, University of Stuttgart.
12:00 – 12:30 GMT

Ranging from NVIDIA GPUs to AMD GPUs and Intel GPUs: Given the heterogeneity of available accelerator cards within current supercomputers, portability is a key aspect for modern HPC applications. In Octo-Tiger, an astrophysics application simulating binary star systems and stellar mergers, we rely on Kokkos and its various execution spaces for portable compute kernels. In turn, we use HPX, a distributed task-based runtime system, to coordinate kernel launches, CPU tasks, and communication. This combination allows us to have a fine interleaving between portable CPU/GPU computations and communication, enabling scalability on various supercomputers.

However, for HPX and Kokkos to work together optimally, we need to be able to treat Kokkos kernels as HPX tasks. Otherwise, instead of integrating asynchronous Kokkos kernel launches into HPX’s task graph, we would have to actively wait for them with fence commands, which wastes CPU time better spent otherwise. Using an integration layer called HPX-Kokkos, treating Kokkos kernels as tasks already works for some Kokkos execution spaces (like the CUDA one), but not for others (like the SYCL one).

In this work, we started making Octo-Tiger and HPX itself compatible with SYCL. To do so, we introduce numerous software changes, most notably an HPX-SYCL integration. This integration allows us to treat SYCL events as HPX tasks, which in turn allows us to better integrate Kokkos by extending HPX-Kokkos to fully support Kokkos’ SYCL execution space.

We show two ways to implement this HPX-SYCL integration and test them using Octo-Tiger and its Kokkos kernels, on both an NVIDIA A100 and an AMD MI100. We find modest, yet noticeable, speedups (1.11x to 1.15x for the relevant configurations) by enabling this integration, even when just running simple single-node scenarios with Octo-Tiger where communication and CPU utilization are not yet an issue. We further find that the integration using event polling within the HPX scheduler works far better than the alternative implementation using SYCL host tasks.
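
The event-polling variant can be pictured with a small sketch (ours, not the actual HPX integration code): the scheduler periodically performs a non-blocking status check like the one below and releases dependent HPX tasks once the event completes, instead of blocking a worker thread in a fence.

    // Non-blocking completion check for a SYCL event, as a polling loop
    // inside a task scheduler might use it.
    #include <sycl/sycl.hpp>

    bool event_completed(const sycl::event &ev) {
      return ev.get_info<sycl::info::event::command_execution_status>() ==
             sycl::info::event_command_status::complete;
    }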


VkFFT and beyond – a platform for runtime GPU code generation [4934]
Dmitrii Tolmachev, ETH Zurich.
12:30 – 13:00 GMT

This talk will present VkFFT version 1.3 and the new platform for runtime GPU code generation it is based on. The main reason for this update is to make the algorithms implemented in VkFFT available to many other GPU applications and to standardize the way code is generated in it.

The platform presented allows fine-tuning of the algorithms for a particular GPU and API they are executed on at runtime. It aims to make it easier for competent GPU programmers to express themselves to different APIs, as the design logic of modern GPUs is fairly similar between all vendors. This is the main difference between the platform and other existing API-independent ways to write code, as they usually aim at fast prototyping and simple optimizations under the hood for beginner-level GPU programmers.

The platform has a hierarchical structure: Application -> Plan -> Code. At the application stage, the platform performs all interactions with the user and resource management. This includes configuration parsing and calls to application initialization, update, dispatch and deletion, with optional binary caching. The plan stage is the internal configuration stage that constructs the intermediate representation of the problem to be solved. This includes all algorithmic decision-making, resource allocation, calls to the code generator and code compilation. The code generation stage produces a string that holds GPU code for a particular API, which can later be compiled and used. It is further divided into multiple levels: level 2 subkernels, a clear description of the problem via a sequence of calls to lower levels; level 1 subkernels, simple routines such as matrix-vector multiplication, FFT, pre- and post-processing, and R2C/R2R mappings; level 0 subkernels, memory management, basic math, function inlining and API-dependent definitions. The code generator operates on special data containers that can hold either integer/float values known during plan creation or strings of variable names. Taking a multiplication A=B*C as an example: if all containers have known values, A can be precomputed during plan creation; if A, B and C are register names, we print a multiplication operation into the kernel to be executed.
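
A minimal sketch of such a container and its constant-folding multiplication might look as follows; the types and names here are our own illustration, not VkFFT's actual implementation.

    // Container holding either a value known at plan time or the name of a
    // kernel variable; mul() folds constants or emits GPU code as a string.
    #include <sstream>
    #include <string>

    struct Container {
      bool known;        // is the value known during plan creation?
      double value;      // used when known == true
      std::string name;  // register/variable name otherwise
    };

    // A = B * C
    void mul(std::ostringstream &kernel, Container &A, const Container &B,
             const Container &C) {
      if (B.known && C.known) {
        A.known = true;
        A.value = B.value * C.value;  // precomputed at plan creation
      } else {
        A.known = false;
        kernel << A.name << " = "
               << (B.known ? std::to_string(B.value) : B.name) << " * "
               << (C.known ? std::to_string(C.value) : C.name) << ";\n";
      }
    }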

This talk will also discuss multiple algorithms implemented with this platform. Using VkFFT as an example, we will demonstrate the overall platform structure and the general GPU application design guidelines, mainly related to optimization of memory layout, such as having no CPU-GPU transfers during execution except for asynchronous downloads from the GPU, minimization of dedicated GPU memory-L2-L1 communication, and maximization of on-chip memory usage. To go even further, we will demonstrate how a finite difference solver can be implemented with the help of the platform using only low-level warp shuffling instructions to perform on-chip data transfers, instead of using the shared memory of the streaming multiprocessor (on-chip memory accessible by all threads). This considerably reduces the number of communications between threads, which can be a performance-limiting factor for high-order schemes. We will demonstrate a benchmark comparison of the warp communication performance of modern GPUs, including high-end HPC GPUs from Nvidia and AMD and consumer-level solutions.

Lunch Break and Poster Session
13:00 – 14:00 GMT
Towards Deferred Execution of a SYCL Command Graph [4608]
Ewan Crawford, Codeplay Software.  | Co-authors: Pablo Reble and Julian Miller, Intel. Ben Tracy, Codeplay Software.
14:00 – 14:30 GMT

A key concept in SYCL’s execution model is the use of command groups that create a directed acyclic graph of kernel executions at runtime. A command group object defines a set of dependencies or edges that must be satisfied for kernels or nodes to be executed. However, because command group submission is tied to execution on the queue, without a prior construction step before execution starts, optimization opportunities can be missed because the runtime is not made aware of the defined dependency graph ahead of execution. This de facto represents a built-in eager execution mode in SYCL, in contrast to a lazy execution mode in which the definition and submission of work are decoupled.
We propose an extension to the SYCL 2020 specification which closes this gap by introducing the concept of a command graph. We add new mechanisms for the user to build a command graph for later execution. Commands are added to a graph, finalized to prepare for execution, and finally executed on a queue. The extension decouples the overhead associated with submission by performing expensive operations and optimizations at finalize time and by allowing commands to be batched at submission time. This command batching is supported by many SYCL backends but not exposed to users through the SYCL API.
In addition to the benefits to the SYCL runtime, there are also advantages for the user developing SYCL applications. Repetitive workloads no longer have to redundantly issue the same sequence of commands. Instead, a graph is constructed once and submitted for execution as many times as necessary, changing only the data in input buffers or USM (Unified Shared Memory) allocations. For applications from specific domains, such as machine learning and computer vision, where the same command group pattern is run repeatedly for different inputs, this is particularly useful.

This talk is presented in two sections. First, we provide an overview of the specification for the extension. This includes two distinct mechanisms for graph building: An explicit API that provides a new set of functions for expressing a command graph directly in SYCL code, and the “Record & Replay” API that is designed to retrofit existing codebases and enable the use of existing libraries and frameworks with minor modifications. We discuss the mechanisms available for modifying a graph after construction and the motivation for the API design compared to other similar mechanisms in use today in other programming models.
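
As a flavour of the “Record & Replay” mechanism, the following sketch follows the draft extension as described here; since the extension is still evolving, the exact namespaces and entry points shown are assumptions based on the experimental DPC++ proposal.

    // Record a queue's submissions into a command graph once, then replay
    // the finalized graph many times with minimal submission overhead.
    #include <sycl/sycl.hpp>
    namespace sycl_ext = sycl::ext::oneapi::experimental;

    void run(sycl::queue &q, float *data, size_t n, int iterations) {
      sycl_ext::command_graph graph{q.get_context(), q.get_device()};

      graph.begin_recording(q);  // submissions now become graph nodes
      q.parallel_for(sycl::range<1>{n},
                     [=](sycl::id<1> i) { data[i] *= 2.0f; });
      graph.end_recording(q);

      auto exec = graph.finalize();  // expensive optimization happens once

      for (int it = 0; it < iterations; ++it)
        q.ext_oneapi_graph(exec);  // cheap repeated submission
      q.wait();
    }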
In the second section of our talk, we detail the work in progress for implementing the extension in Intel’s DPC++ runtime, in particular the early-stage prototype available at [4]. We will show execution traces demonstrating the potential overhead reduction that is possible, as well as current limitations, and what we’ve learned from implementing it so far. This includes an overview of how our implementation maps to the various backends available and how to address situations where there is no backend support.
We also examine plans for the future of our proposal and implementation and the optimization possibilities that it enables such as inter-node memory reuse and interactions with other relevant SYCL extensions.

A SYCL Extension for User-Driven Online Kernel Fusion [6453]
Víctor Pérez-Carrasco, Codeplay Software.  | Co-authors: Lukas Sommer, Victor Lomüller, Kumudha Narasimhan and Mehdi Goli, Codeplay Software.
14:30 – 15:00 GMT

Heterogeneous programming models such as SYCL allow developers to integrate a variety of accelerators found in today’s heterogeneous systems into an application with ease. However, while offloading specific tasks to specialized accelerators can deliver significant performance improvements for many applications, short-running device kernels remain a challenge for most heterogeneous programming models.

Each invocation of a device kernel is linked to some overhead, caused by the necessary data-transfers, kernel launch and synchronization between host and device. In particular, for a sequence of short-running kernels, this can lead to an unfavourable ratio of overhead and actual computation, resulting in performance degradation.

One potential solution to address this problem is to merge multiple small, memory-bound, short-running kernels into a single larger kernel. This leads to better use of the device’s resources and amortizes the device launch overhead. Yet, manually creating fused kernels can be an error-prone, challenging task for developers, and the resulting kernels are less reusable and maintainable.

The extension to the SYCL API presented in this talk aims to automate the creation of fused kernels. It provides a mechanism for users or software frameworks using SYCL to instruct the runtime to automatically fuse multiple device kernels at runtime, without the need for manual implementation of the fused kernel. Users or software frameworks can use their application and domain knowledge, as well as runtime context information, to determine when fusion of kernels is legal and profitable, while the actual process of creating a fused kernel is automated by the SYCL runtime.
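
From the user's perspective, the extension as contributed to DPC++ looks roughly like the sketch below; the exact names stem from the experimental proposal and may differ between versions, so treat this as an assumption-laden illustration rather than the definitive API.

    // Wrap a queue in a fusion_wrapper, delay submissions, and let the
    // runtime JIT-compile the two kernels into one fused kernel.
    #include <sycl/sycl.hpp>

    void fused(sycl::queue &q, const float *a, float *b, float *tmp,
               size_t n) {
      // The queue may need to be created with the extension's
      // enable_fusion property for fusion to take effect.
      sycl::ext::codeplay::experimental::fusion_wrapper fw{q};

      fw.start_fusion();  // subsequent submissions are held back
      q.parallel_for(sycl::range<1>{n},
                     [=](sycl::id<1> i) { tmp[i] = a[i] + 1.0f; });
      q.parallel_for(sycl::range<1>{n},
                     [=](sycl::id<1> i) { b[i] = tmp[i] * 2.0f; });
      fw.complete_fusion();  // one fused kernel; tmp's dataflow can be
                             // internalized instead of using global memory
      q.wait();
    }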

Reducing the kernel launch overhead is however not the only way kernel fusion can improve application performance. The LLVM-based JIT compiler integrated into the SYCL runtime implementation for automatic creation of fused kernels can perform further optimizations.

One such optimization is the internalization of dataflow. Intermediate results that originally needed to be communicated via global memory between the different kernels now become internal dataflow of the fused kernel. Replacing slow global memory accesses for this internalized dataflow with faster accesses to local memory or even registers can yield significant performance improvements for many applications.

The extension presented in this talk is currently an experimental vendor extension, targeting SYCL version 2020. The initial proof-of-concept implementation was based on Codeplay’s ComputeCpp SYCL implementation and has also been contributed and open-sourced as part of the DPC++ SYCL implementation.

To demonstrate the performance improvements unlocked by the extension, two different types of workloads are evaluated on Intel CPUs and integrated Intel GPUs.

For a set of sixteen typical operator sequences from neural networks with various input sizes, kernel fusion achieves speedups between 0.9x and 2.26x on GPU (geo.-mean 1.35x), and between 1.02x and 3.2x on CPU (geo.-mean 1.78x). For complete neural networks, this translates to 1.19x (Resnet 50) and 1.68x (VGG 16) speedup on CPU, and 1.15x (Resnet 50) and 1.02x (VGG 16) speedup on GPU.

For the six benchmarks 3mm, bicg, correlation, covariance, fdtd2d and gramschmidt from the SYCL Bench benchmark suite with different input sizes, fusion achieves speedups between 0.98x and 4.91x on GPU (geo.-mean 1.34x), and speedups between 0.82x and 3.28x on CPU (geo.-mean 1.06x).

In summary, this talk presents a SYCL extension automating the creation of fused kernels on user request and shows the potential performance benefits of such an extension on different workloads.

Towards a SYCL API for Approximate Computing [3599]
Lorenzo Carpentieri, University of Salerno.  | Co-authors: Biagio Cosenza, University of Salerno.
15:00 – 15:30 GMT
show / hide abstract

Approximate computing exploits the gap between the accuracy provided by a system and the accuracy required by an application. Many applications, such as image processing and neural networks, are tolerant of a certain amount of error and have the potential for significant improvements in execution time and energy consumption if a small amount of error can be accepted. Among the most advanced software approximation techniques are mixed precision, which uses a lower-precision data representation for both integer and floating-point variables; perforation, which exploits data locality by skipping some iterations and assuming that nearby data have similar values; and relaxed synchronization, which trades strict synchronization for performance by tolerating occasional stale or inconsistent values.

In this technical talk, we present SYprox, a SYCL-based API supporting a broad set of approximation techniques in modern C++.
SYprox introduces a set of semantics that extend SYCL’s buffers and accessors to provide a high-level easy-to-use programming API.
SYprox’s approximation framework seamlessly adds three stages to standard computational processing, implementing perforation, input reconstruction and output reconstruction. It supports data perforation and elision patterns for efficient approximation, as well as signal-reconstruction algorithms for error mitigation.
Finally, we present our preliminary results, showing improvements in performance and accuracy over state-of-the-art approaches while maintaining a high-level programming approach.
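Since SYprox's API is not shown in this abstract, the following is a purely hypothetical sketch of what such buffer-level approximation semantics might look like; every syprox:: name is invented for illustration:

  // Hypothetical sketch (all syprox:: names invented): a buffer carries a
  // perforation policy, and the framework inserts the perforation and
  // reconstruction stages around the user's kernel.
  syprox::perforation policy{/*stride=*/4,
                             syprox::reconstruct::linear_interpolation};
  syprox::buffer<float> in{host_data, sycl::range<1>{N}, policy};

  q.submit([&](sycl::handler &cgh) {
    syprox::accessor acc{in, cgh, sycl::read_only};
    cgh.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
      // acc[i] transparently returns a reconstructed value when element i
      // was skipped (perforated) rather than computed or loaded.
      float v = acc[i];
      // ... use v ...
    });
  });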

Coffee Break and Poster Session
15:30 – 16:00 GMT
Towards Alignment of Parallelism in SYCL and ISO C++ [2381]
John Pennycook, Intel.  | Co-authors: Ben Ashbaugh, James Brodman, Michael Kinsner, Steffen Larsen, Greg Lueck, Roland Schulz and Michael Voss, Intel.
16:00 – 16:30 GMT
show / hide abstract

The SYCL and ISO C++ specifications use different terminology to describe parallelism, which is confusing to developers and hinders the SYCL community’s efforts to influence the direction of C++ through experiments and proof points. Critically, SYCL does not provide mechanisms for developers to reason about specific device behaviors that may impact the execution of parallel programs, such as the forward progress guarantees at various levels of the execution model hierarchy. The N-dimensional range (ND-range) execution model currently defined by SYCL extends the C++ model, but does not relate it to concepts or formalisms of C++ parallelism.

This talk presents: (1) a detailed analysis of parallelism terminology in SYCL and ISO C++; (2) proposed modifications to the SYCL standard, to align with C++17; and (3) a generalized abstract ND-range execution model introducing the notion of hierarchical forward progress guarantees. To demonstrate the potential impact of these changes, we outline a new extension to SYCL enabling developers to understand and potentially control device behavior across the hierarchy. Although discussed in the context of SYCL, the changes outlined in this paper have broader implications for all languages building upon an ND-range model (e.g. OpenCL). Our abstract hierarchical execution model applies generally to modern data-parallel languages, many of which do not yet comprehend the hierarchical nature of the hardware architectures that they target.
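To see the terminology gap, compare how ISO C++ and SYCL express the same data-parallel loop. C++17 speaks of execution policies and forward-progress guarantees on anonymous threads of execution, while SYCL speaks of work-items and work-groups, with no standard way to query the progress guarantees at each level:

  #include <algorithm>
  #include <execution>
  #include <vector>
  #include <sycl/sycl.hpp>

  // ISO C++17: parallelism via an execution policy. The standard defines
  // (weakly) parallel forward progress for the element access functions,
  // but says nothing about how iterations map onto hardware.
  void scale_cpp(std::vector<float> &v) {
    std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                  [](float &x) { x *= 2.0f; });
  }

  // SYCL: the ND-range explicitly exposes a work-group hierarchy, but the
  // forward-progress guarantees at each level of that hierarchy are
  // implementation-defined and not queryable in SYCL 2020.
  void scale_sycl(sycl::queue &q, float *v, size_t n) {
    q.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{128}},
                   [=](sycl::nd_item<1> it) {
                     v[it.get_global_id(0)] *= 2.0f;
                   });
  }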

Standardizing complex numbers in SYCL [1187]
Nevin Liber, Argonne National Laboratory.  | Co-authors: Thomas Applencourt, Brice Videau, Kevin Harms, and Bryce Allen, Argonne National Laboratory. Amanda Dufek, National Energy Research Scientific Computing Center. Jefferson le Quellec and Aiden Belton-Schure, Codeplay.
16:30 – 17:00 GMT
show / hide abstract
Complex numbers are used in many high-performance computing applications for scientific simulations. They were missing from the SYCL 2020 specification, resulting in fragmented and inconsistent implementations across the SYCL ecosystem. To address this, we devised an extension to the standard providing a sycl::complex type together with operators and math functions, and developed a header-only implementation of this extension, with liberal open-source licensing, that can be used with any SYCL implementation.
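A hedged sketch of the kind of code the extension enables; the header name and namespace here are assumptions based on the open-source header-only implementation, and std::complex itself is not guaranteed to work in device code:

  #include <sycl/sycl.hpp>
  // Header-only implementation of the proposed extension; the header name
  // and namespace are assumptions based on the open-source project.
  #include "sycl_ext_complex.hpp"

  namespace cplx = sycl::ext::cplx;

  // Multiply every element by i (a 90-degree phase rotation) on the device.
  void rotate_phases(sycl::queue &q, cplx::complex<double> *z, size_t n) {
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
      z[i] = z[i] * cplx::complex<double>{0.0, 1.0};
    });
  }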
Awards and Closing [Awards]
Simon McIntosh-Smith, University of Bristol.
17:00 – 17:15 GMT
Close of Conference
17:15 GMT

Posters

The poster sessions will take place during the breaks and lunch on Wednesday and Thursday.

Leveraging MLIR for Better SYCL Compilation [9693]
Víctor Pérez-Carrasco, Codeplay Software.  | Co-authors: Ettore Tiotto, Whitney Tsang, Arnamoy Bhattacharyya and James Brodman, Intel. Lukas Sommer, Victor Lomüller and Jefferson le Quellec, Codeplay Software.
show / hide abstract

Recent years have raised awareness of the fact that many optimizing C++ compilers, such as Clang/LLVM, miss optimization opportunities due to the lack of a suitable high-level intermediate representation. The typical compilation flow of such a compiler would lower from a representation close to the original or pre-processed source code, e.g., an abstract syntax tree (AST), directly to a low-level, CFG- and SSA-based intermediate representation such as LLVM IR.

However, this lowering loses much of the high-level information and structure of the original source code as it cannot always be represented accurately in the low-level intermediate representation. Compiler optimization passes working on this low-level IR try to recover relevant parts of the high-level information (e.g., loops) and programmer’s intent to optimize the code. If they fail to recover the necessary information, important optimization opportunities might be missed.

This insight about loss of high-level information in compilers has driven the creation of the MLIR framework. The MLIR framework, through its core abstraction called “dialect”, enables the creation of a set of multiple intermediate representations capturing high-level semantics and domain-specific information. The progressive lowering process from source code to executable then happens in much smaller steps and allows optimization passes to operate at the appropriate level of abstraction to leverage high-level information.

With SYCL being strongly based on C++, SYCL compilers suffer from the same problem as other C++ compilers, which naturally raises the question of whether the MLIR framework can be used to improve SYCL compilation.

Our poster will present important insights from our ongoing investigation into this question. It will present an overview of an architecture for an MLIR-based SYCL compiler, demonstrating how MLIR can be integrated into the typical compilation flow for SYCL applications and how the resulting compilation output interacts with existing SYCL runtime implementations.

The poster will also report on the status of an ongoing collaboration project between Codeplay and Intel, developing an MLIR-based SYCL compiler as an open-source project based on Intel’s existing DPC++ SYCL compiler and runtime implementation. At the time of writing, the MLIR-based device compiler can already compile a substantial portion of the SYCL application tests in Intel’s fork of the LLVM test-suite, and we seek to further improve coverage and extend the compilation to the host-part of SYCL applications.

The design principles and core abstractions of the MLIR dialect for SYCL, developed as part of the project, will be discussed in detail, demonstrating how MLIR enables compiler optimization passes to better understand the semantics of SYCL applications.

The poster will outline several opportunities for MLIR to significantly improve the code generated for SYCL applications over existing, LLVM-based compilation flows. One instance of such a potential improvement is MLIR’s ability to represent host and device code together in its nested intermediate representation. The ability to access host and device code at the same time from compiler optimization passes will enable transformations across the boundary between host and device code for SYCL applications. An example of such a transformation is the propagation of constants from host code into device code, an ability not found in DPC++’s current LLVM-based compilation flow. This may also pave a new avenue to single-pass compilation of SYCL applications in the future.
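For illustration, consider a host-side value that is fixed at a kernel's submission site; a compiler that sees host and device code together could specialize the kernel on it, which a split compilation flow cannot do:

  #include <sycl/sycl.hpp>

  void saxpy(sycl::queue &q, float *x, float *y, size_t n) {
    const float a = 2.0f; // constant in host code, fixed at the call site
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
      // In today's split compilation, 'a' is an opaque kernel argument.
      // A compiler representing host and device code together could
      // propagate a = 2.0f into the kernel and fold the multiply.
      y[i] = a * x[i] + y[i];
    });
  }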

Overall, the poster will provide a clear vision of how MLIR can allow SYCL programs to be better optimized during compilation in the future, and of the steps necessary to implement such a compilation flow.

Open Standard Software Stack for Low Latency Offloading from Lightweight Devices to Remote Heterogeneous Platforms [5988]
Jan Solanti, Tampere University.  | Co-authors: Topi Leppänen and Pekka Jääskeläinen, Tampere University.
show / hide abstract

Mobile computing devices have become ubiquitous, but emerging applications frequently have performance requirements that mobile, battery-powered devices have difficulty meeting. One solution is to concentrate a number of high-powered compute devices and accelerators in clusters and provide shared remote access to them for a fleet of mobile devices. To minimize the introduced network latency, clusters are located at network access points. This approach is also known as Multi-access Edge Computing (MEC). [1]

To this end, we developed a prototype of a full offloading software stack centered around the OpenCL implementation PoCL [2]. As a widely available standard API, OpenCL allows for easy development and porting of applications. The key components of the stack are:
• AlmaIF, a unified memory-mapped interface for black-box accelerator designs [3].
• PoCL-Remote, a backend for PoCL that forwards OpenCL commands to a compute server over TCP. When multiple servers are used together, PoCL-Remote optimizes synchronization and data transfers to happen directly between the involved servers without a network round-trip to the client [4].
• Nano-PoCL, a minified configuration of PoCL that is small enough to be usable on microprocessors with limited memory and only a basic operating system (FreeRTOS [5]).
• A proof-of-concept implementation of the OpenVX API [6], built on top of OpenCL, to make writing portable computer vision applications more productive for engineers.
• The client application using the OpenVX API.

Since the entire stack is centered around standard OpenCL, the different components do not need to be aware of each other’s internal details. This poster showcases the software stack by demonstrating how all of its components work together to deliver an easy-to-use, extensible platform for edge compute offloading. We implemented a demonstrator which offloads the computation of an OpenVX application from a small drone to a remote FPGA server. The input from a camera is transmitted to the remote server alongside the processing commands, automatically, as OpenCL buffers and commands, in the same fashion as if the computation were performed with any local OpenCL implementation.

2 DEMONSTRATOR APPLICATION

The application for this demonstrator was a simple OpenVX application running on a nano-drone. The application gets its input from the drone’s built-in camera, feeds the image to an OpenVX
node called MinMaxLoc to find the position of the brightest pixel in the image and commands the drone to adjust its hovering altitude and yaw angle to face directly towards this point.
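A sketch of the corresponding graph construction, using the standard OpenVX 1.x C API and omitting error handling; the mapping onto OpenCL kernels happens inside the OpenVX-over-OpenCL layer described below:

  #include <VX/vx.h>

  // Build a minimal OpenVX graph that finds the brightest pixel of a
  // greyscale camera frame (error handling omitted for brevity).
  vx_graph build_graph(vx_context ctx, vx_image frame, vx_scalar min_val,
                       vx_scalar max_val, vx_array max_loc) {
    vx_graph graph = vxCreateGraph(ctx);
    vxMinMaxLocNode(graph, frame, min_val, max_val,
                    /*minLoc=*/nullptr, max_loc,
                    /*minCount=*/nullptr, /*maxCount=*/nullptr);
    vxVerifyGraph(graph); // verification lowers nodes onto the OpenCL backend
    return graph;
  }
  // Per frame: vxProcessGraph(graph), then steer the drone toward the
  // coordinates stored in max_loc.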

2.1 Experimental Setup

The laboratory setup for the demonstrator consists of the Crazyflie nano-drone [7] with the AI Deck add-on [8] to run the demonstrator application, a WiFi 6 router, and a PYNQ-Z1 FPGA SoC board.
The AI Deck attached to the drone includes a RISC-V processor running the demonstrator application in a FreeRTOS environment, a greyscale camera, and a WiFi controller. The PYNQ-Z1 board is connected to the router with an Ethernet cable. It acts as a compute server with a small FPGA and an ARM CPU that executes the PoCL-Remote daemon. The daemon uses the OpenCL API to forward commands to a local OpenCL implementation, in this case PoCL, which in turn manages the FPGA accelerator via the AlmaIF interface.

3 OPENCL-BASED OPENVX IMPLEMENTATION

In order to enhance application developer productivity, we support a subset of OpenVX for describing computer vision domain applications. This raises the abstraction level from the more verbose OpenCL-based programming model, while adding practically no performance overhead. A heterogeneous task graph is produced from the OpenVX model, which can be cached, e.g. using the OpenCL command buffer mechanism [9], for low-latency relaunches. This could be used, for example, to process multiple frames with the same compute pipeline.

4 NANO-POCL

Nano-PoCL is a cut-down version of the PoCL runtime library for embedded microprocessors. In our demonstrator application, the final firmware image, consisting of the main application, FreeRTOS, Nano-PoCL and some board-support libraries, measures slightly below 200 kilobytes, roughly 40 kilobytes of which are PoCL diagnostic messages that can optionally be disabled at compile time.
This implementation relies on POSIX APIs for threading and networking, of which FreeRTOS provides customizable implementations. Runtime kernel compilation has been removed from the build, as the PoCL-Remote backend is the main target for the demonstrator application, but prebuilt kernel binaries could be loaded instead for local execution on the RISC-V core.

5 POCL-REMOTE

PoCL-Remote consists of two parts: a new backend to the PoCL library and a server daemon. The backend connects to the server daemon via TCP. OpenCL devices available on the machine running the daemon are exposed to applications as part of the PoCL OpenCL platform. This way, applications can be switched from using local compute devices to offloading to shared remote accelerators with a simple configuration change. As PoCL-Remote is built around the OpenCL API, servers do not need any knowledge about individual applications.
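Because PoCL-Remote hides behind the standard platform model, an unmodified OpenCL host program simply sees extra devices; which remote server they come from is named through PoCL's environment configuration (the exact variable syntax depends on the PoCL release). A minimal enumeration in standard OpenCL:

  #include <CL/cl.h>
  #include <cstdio>

  // Standard OpenCL device enumeration: with PoCL-Remote configured in the
  // environment, remote accelerators show up here like any local device.
  int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    cl_uint n = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, nullptr, &n);
    cl_device_id devs[16];
    cl_uint count = n < 16 ? n : 16;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, count, devs, nullptr);

    for (cl_uint i = 0; i < count; ++i) {
      char name[256];
      clGetDeviceInfo(devs[i], CL_DEVICE_NAME, sizeof(name), name, nullptr);
      std::printf("device %u: %s\n", i, name); // local or remote, same API
    }
  }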

Applications that are designed with remote devices in mind can optionally identify OpenCL devices as either remote or local. This information can then be used to partition compute tasks so as to optimize data transfers and the utilization of the various heterogeneous accelerators. PoCL-Remote also optimizes buffer transfers and command synchronization between remote devices: direct server-to-server communication is preferred when possible, as servers are expected to have better connectivity to each other than to the mobile client, and transfers between devices on the same server are delegated to the server’s local OpenCL runtime [4].

6 ALMAIF
AlmaIF [3] is a memory-mapped interface for integrating customized accelerators into PoCL. The interface supports both programmable and fixed-function accelerators, and doesn’t restrict the implementation method of the accelerator. This allows implementing the accelerator either on FPGA fabric or as custom silicon. For this demonstrator, we used an FPGA SoC, where the host processor runs the PoCL-Remote daemon. The daemon calls PoCL’s AlmaIF backend to execute OpenCL programs on the FPGA. On the FPGA, we implemented an application-specific instruction-set processor (ASIP) as a soft processor [10]. The processor was specialized with customized function units to accelerate the single OpenCL built-in kernel used in this demonstrator. The AlmaIF backend has been open-sourced as part of the PoCL codebase, and the method for generating AlmaIF-compatible soft processors has been open-sourced as part of the OpenASIP framework [10].

Toward a CAD Tool for SYCL programming [5550]
Erwan Fabiani, Université de Bretagne Occidentale.  | Co-authors: Chiara Relevat, Erell Cottour and Paul Allaire, Université de Bretagne Occidentale. Loic Lagadec, ENSTA Bretagne.
show / hide abstract

This poster presents a work in progress which aims to design and exploit a CAD tool for SYCL programming. We present the motivations and objectives of the design framework and describe its architecture, built around an abstract model of a SYCL program, the ways an application can be specified, and the code-generation and program-transformation functionality. The availability of heterogeneous platforms combining multicore, FPGA, GPGPU, manycore and CGRA devices on the same chip is a reality that will become increasingly important. Exploiting the available heterogeneous computing power requires mastering the complexity of the various underlying computation models and their interactions.
The SYCL standard is a response to this challenge: it standardizes the definition of computation kernels and memory transfers, and it simplifies the portability of kernels (between devices within the same platform) and their reuse on different platforms.

Nevertheless, while this considerably simplifies the programming of heterogeneous platforms, the size of the design space, in particular for the choice of kernel granularity and execution device, remains a major concern.
Moreover, expressing the dependencies between kernels through memory transfers can be a source of error, and can benefit from automatic code generation based on directly expressed relations between kernels.

The objective of our framework is to design methods and tools to solve SYCL program design issues by following a model-driven engineering methodology.
Two main issues are highlighted here. The first one concerns the match between the abstract model of kernel dependencies and their expression in SYCL. The definition of dependencies via memory transfers limits the analysis of existing code for its evolution and the continuous verification of its structural conformity to the initial specifications. Conversely, the automatic generation of interaction code from a high-level model is a way to increase productivity and reliability.

The second one concerns the exploration of the space of possible solutions for an application, since a kernel can be implemented on several hardware targets. This raises the question of choosing the best solutions for a case study via performance measurement of the different variants. Having an environment that allows these variants to be generated automatically is part of the solution.

The central element of the framework is a high-level object model of all kernels of an application and their dependencies. This model is made up of classes that follow the structure of a SYCL program, seen as a graph whose nodes are the command groups that depend on the data they manipulate.
The Data class, which characterizes an accessor, is associated with a variable (type, number of dimensions, size of each dimension) and provides its name and access mode (read, write, read_write). In the case of a dynamic construction, the access mode is automatically inferred from the dependencies expressed.
The Command_group class characterizes a command group, designating a kernel and a device on which it must run. An instance also contains a list of Data objects which correspond to the accessors used by the kernel.

The model does not express the internal code of the kernels, which are seen as black boxes. Their SYCL code (in the form of a lambda function) is stored in independent files, which are referenced by a unique name. A constraint, of course, is that the instances of Data associated with a command group must match the parameters used in the kernel’s code.
The Node class encapsulates a Command_group and makes it possible to describe the dependencies between command groups within a directed graph (class Graph).
The Graph class has an instance method to automatically label the nodes in a possible order of execution respecting the dependencies expressed. This is necessary for the code generation phase which requires sequentializing the declaration of command groups.
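A hypothetical reduction of this model to C++; the class and member names below are illustrative, inferred from the description above, and are not the framework's actual code:

  #include <string>
  #include <vector>

  // Illustrative sketch of the abstract model (names invented).
  enum class AccessMode { read, write, read_write };

  struct Data {              // characterizes an accessor
    std::string name;
    std::string type;        // element type of the associated variable
    std::vector<size_t> dims;
    AccessMode mode;
  };

  struct Command_group {     // one SYCL command group
    std::string kernel_name; // references the kernel lambda, stored externally
    std::string device;      // device the kernel must run on
    std::vector<Data> accessors;
  };

  struct Node { Command_group cg; std::vector<Node*> successors; };

  struct Graph {
    std::vector<Node> nodes;
    // Labels the nodes with a valid execution order respecting the
    // expressed dependencies (a topological sort), which the code
    // generator uses to sequentialize command-group declarations.
    void label_execution_order();
  };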

There are two ways to express a SYCL program in our framework and generate the associated object: (1) by using a SYCL code parser (on a restricted subset of the language), which makes it possible to reuse existing code and integrate it into the framework — the parser isolates the kernel codes in independent files, analyzes the dependencies between kernels and expresses them in the generated graph; or (2) by constructing the model directly in a C++ program using the instance-creation methods, in particular to express the dependencies between command groups.
The framework is equipped with a utility for viewing the dependency graph of command-group executions (based on Graphviz), either as the complete graph or per device. This helps to check the program’s conformity against the specifications, and it allows analyzing the structure of an existing SYCL program imported via the parser.
One of the objectives of the project is to allow automatic code transformation through the abstract model. Currently, this is done in two cases. The first is the modification of the structure and characteristics of an existing program (dependencies, kernels used, devices, accessors) without having to edit the code directly; this is implemented via an HMI (human-machine interface). The second is the addition of monitoring functionality, for example to measure performance or integrate assertions. In the current version, the user can request code generation with measurement of the execution time of each command group, implemented by the automatic creation of events.

SYCL code generation is performed by a method that extracts all the information contained in a Graph object and integrates the code of the kernels. The order of declaration of command groups is determined by the automatically computed labelling.

Prospects for further development of the framework include:
• adapting standard SYCL code to device-specific primitives in an implementation, e.g. the pipe primitive for FPGAs in DPC++;
• the automatic generation of variants of a program for different buffer sizes, in order to characterize the performance of granularity choices;
• the automatic addition of dependency constraints, distinct from the functional dependencies of the application, to guarantee temporal properties.

Profile Guided Optimization Transfer-Learning for OpenCL/SYCL Kernel Compilation and Runtime [5658]
Wenju He, Intel.  | Co-authors: Maosu Zhao, Yuxin Zou and Feng Zou, Intel.
show / hide abstract

Reducing SYCL kernel compilation time and runtime overhead are important topics for heterogeneous computing performance. Profile-Guided Optimization (PGO) is a technique widely used in compilers to better optimize code. We apply PGO to both SYCL kernel compilation and the backend runtime. Two experiments are conducted on several OpenCL/SYCL benchmark suites, including Rodinia. The first experiment demonstrates transfer learning: profiling data collected from the SPEC CPU® 2006 benchmark can benefit SYCL kernel compilation on SYCL benchmarks. The second experiment likewise demonstrates transfer learning: profiling data collected from some OpenCL/SYCL benchmarks can be used to reduce CPU backend runtime overhead on unseen benchmarks.

Optimizing compiler code requires expertise from compiler developers, and hand-crafted tuning can be counterproductive. The profiling information in PGO tracks how the application actually runs; with it, the compiler can enable better optimizations, reordering basic blocks and inlining functions according to their execution frequency. The LLVM community reports that overall compile time can be reduced by about 20% by applying PGO to Clang and LLVM.
The Intel SYCL CPU backend executes SYCL kernels in parallel on CPU cores. We tune both device-kernel compilation and the runtime on the CPU device. The CPU backend supports both just-in-time (JIT) and ahead-of-time (AOT) compilation of device kernels. AOT performs device-kernel compilation during application compilation and thus saves application execution time. Kernel compilation time can be reduced using the PGO approach.
The CPU backend implements the SYCL programming-model APIs, including platform, device, context, queue, event and memory management. Runtime overhead is evident for lightweight kernels, which run much faster on the host. The traditional approach to reducing runtime overhead is to use a profiler to identify hot functions and then optimize them. For instance, we identified that the memory-copy command was slow and implemented a parallel approach that enqueues a copy kernel. This approach is limited, since the runtime consists of many functions, and less-hot functions also have optimization headroom.

Our PGO experiment consists of the following four steps:
1. Base build: the reference compiler with the default configuration.
2. PGO-gen: run the reference compiler on selected benchmarks to generate and collect profiling data.
3. PGO-use: build a new compiler using the profiling data.
4. Test the new compiler on benchmarks that were not used in PGO-gen.

We use several standard benchmarks, e.g. SPEC CPU® 2006, SPEC ACCEL OpenCL and Rodinia. A couple of internal benchmarks are also used for testing. These benchmarks cover many typical heterogeneous-computing applications. There is no intersection between the benchmarks used for PGO-gen and those used for testing gains.
The reference is the Intel SYCL CPU backend without PGO enabled. We compared the performance of the PGO-optimized CPU backend with the reference CPU backend.

SYCL kernel compilation consists of frontend device-code generation, SYCL early optimization, middle-end optimization and backend code generation. Our experiment targets only the combined time of the middle-end optimization passes and the backend codegen passes; we instrument the compiler code to measure this time.
We conduct two stages of experiments:
1. Use SPEC CPU® 2006 for PGO-gen and measure the change in kernel compilation time on OpenCL/SYCL benchmarks.
2. Use both SPEC CPU® 2006 and a SYCL benchmark suite for PGO-gen and measure the change in kernel compilation time on OpenCL/SYCL benchmarks. The SYCL benchmark used for PGO-gen is excluded from testing.
After PGO optimization, the kernel compilation time of almost all cases is reduced by more than 5%. In some cases, such as backprop, nw and b+tree in the Rodinia benchmark, the compilation time is reduced by more than 10%. Stage 1 uses only SPEC CPU® 2006, which optimizes the generic LLVM optimization passes. Stage 2 outperforms stage 1, because the CPU backend's compiler-specific transformation passes are also optimized thanks to the SYCL benchmark.
The five testing benchmark suites are not involved in the PGO-gen stage of profiling-data generation. This experiment therefore demonstrates transfer learning: PGO profiling data collected from one test suite can be utilized for new OpenCL/SYCL applications.

The second experiment applies PGO to the CPU backend runtime. We integrate the runtime into the LLVM build infrastructure and build it with profiling instrumentation. The instrumented runtime is then used to run a few benchmarks to collect profiling data.
No runtime gain can be obtained from stage 1 of the first experiment, because SPEC CPU® 2006 does not use the SYCL CPU backend runtime. To achieve the best coverage of the runtime code, it is better to use standard benchmark suites for profiling-data collection. We therefore run the SPEC ACCEL OpenCL benchmark and an internal standard SYCL benchmark to collect PGO profiling data, which is used to build an optimized runtime. These two benchmarks represent typical high-performance computing (HPC) applications.
The time measured in this experiment is the overall elapsed time of a benchmark application. These benchmarks run in JIT mode, so the overall time includes kernel build time, kernel execution time and runtime overhead. Kernel execution time is not expected to improve, since we have not collected kernel-code profiling data yet. Kernel build time is generally much smaller than the overall time, since most benchmarks run kernels for many iterations to obtain stable performance results. The overall time reduction in this experiment therefore reflects the reduction in runtime overhead.

The second experiment likewise demonstrates transfer learning: PGO profiling data from two benchmark suites generalizes to other benchmarks. This is an interesting finding that could benefit other SYCL backends as well. In OpenCL/SYCL heterogeneous-computing runtimes, the runtime code is shared by all applications, which explains why unseen benchmarks can benefit from PGO transfer learning.

It is worthwhile to analyze in detail how PGO reduces kernel compilation time and runtime overhead. Such analysis could in turn help develop a PGO approach that benefits heterogeneous computing even more.
Kernel execution time could also be reduced if PGO were enabled for kernel-code profiling; our next step is to implement this in our SYCL CPU backend. This feature requires two compilation stages for device code, i.e. PGO-gen and PGO-use, and compiler users should be aware of both stages. Transfer learning is not applicable here, since different kernels have different branch probabilities.

Machine Learning for Vectorization Decision in OpenCL/SYCL Kernel [8926]
Wenju He, Intel.  | Co-authors: Yuxin Zou and Feng Zou, Intel.
show / hide abstract
Streamline Ahead-of-Time SYCL CPU Device Implementation through Bypassing SPIR-V [6506]
Wenju He, Intel.  | Co-authors: Yilong Guo, Xinmin Tian, Hideki Saito, Wenwan Xing, Feng Zou, Chunyang Dai, Maosu Zhao and Haonan Yang, Intel.
show / hide abstract
SYCL is an emerging parallel programming language built on top of modern C++ language standards, enabling heterogeneous computing for various devices such as CPUs, GPUs, FPGAs and accelerators. Implementing SYCL for the CPU as a compute device poses many challenges for achieving optimal performance, as several SYCL language constructs, such as ND-range, work-group, sub-group and barrier, are not natively supported by CPU hardware. Our SYCL CPU device implementation aims to utilize CPU cores and memory-system resources effectively to achieve optimal performance. In this paper, we present the design and implementation of a non-SPIR-V path for the SYCL CPU device that aims to improve both performance and debuggability, in which SPIR-V friendly LLVM IR is used as the intermediate representation for device code instead of SPIR-V. SPIR-V generation is bypassed, reducing compilation overhead as well as the potential IR information loss caused by IR translation. The new CPU device implementation supports ahead-of-time (AOT) compilation, which can be enabled via the option “-fsycl-targets=x86_64” in the Intel® oneAPI Toolkit. Furthermore, we provide insights into the OpenCL CPU backend components that are fully leveraged for the new compiler implementation, including the OpenCL plugin and runtime, which manage work scheduling and CPU resources, as well as the device compiler, which performs specific transformations to support the SYCL/OpenCL execution model.
Portability and Scaling of the DPEcho GR-MHD SYCL code: What’s new for numerical Astrophysics in SYCL2020 [7387]
Salvatore Cielo, Leibniz Supercomputing Centre.  | Co-authors: Margarita Egelhofer, Leibniz Supercomputing Centre. Alexander Pöppl, Intel.
show / hide abstract

We present the first performance measurements on HPC-grade hardware of DPEcho, our MPI-parallel SYCL porting of the General-Relativistic Magneto-Hydrodynamics (GR-MHD) code ECHO for numerical simulations of Astrophysical Black Holes. We take advantage of the recent additions in SYCL2020 to benchmark GPUs of
different vendors with a representative workload from numerical astrophysics with DPEcho.

Numerical experiments are paramount for astrophysics, as the modelled systems feature physical processes, sizes and timescales which cannot be investigated experimentally, as well as complex interactions often impossible to describe analytically. This has led to the creation of countless simulation codes, based on various hydrodynamics and gravity schemes. These numerical methods are highly computationally- and memory-intensive. Some codes have been developed and maintained for multiple decades, and thus it is of vital importance both to tune and optimise them for new architectures and to modernise their codebases to conform to current best practices of the field.

Computing centres such as the Leibniz Supercomputing Centre (LRZ), part of the Gauss Centre for Supercomputing (GCS), devote a large part of their resources and specialised support to such ends, also bridging the gaps between science users, HPC developers and hardware/software vendors. Concerning the authors specifically, partnerships between LRZ and Intel in support of the numerical sciences have often proven successful. The DPEcho code is the result of such a collaboration. Together with domain scientists, we decided to rewrite the ECHO Fortran legacy code using SYCL and modern C++. In prior work, we introduced DPEcho’s main features and usage model. The code is available on GitHub as free software under the Apache License. Having chosen SYCL, we can take advantage of the unprecedented power offered by heterogeneous computing systems, including the latest HPC-grade GPUs. Unlike proprietary programming models such as NVIDIA’s CUDA, SYCL is a cross-platform layer, not limited to a single vendor’s hardware: this alleviates the need for redundant development, maintenance and coordination of multiple GPU ports. At the same time, SYCL allows for more fine-grained control than other, more general programming methods such as OpenMP offloading.

For the present work, we updated DPEcho to SYCL2020. This newest iteration of the standard contains several changes that make the API more accessible to domain scientists (illustrated in the sketch after this list):
• device selection has been simplified with the inclusion of device aspects;
• several simplifications to the parallel_for construct enable the implementation of simple kernels with reductions which previously required nd_range kernels;
• the inclusion of Unified Shared Memory (USM) in the standard enables a programming model more familiar to domain scientists, who usually have a working knowledge of lower-level programming languages such as C or Fortran.
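A minimal sketch combining these three features (aspect-based device selection, a plain-range parallel_for with a reduction, and USM), assuming a recent SYCL2020 implementation such as oneAPI:

  #include <sycl/sycl.hpp>
  #include <numeric>

  int main() {
    // Aspect-based selection: pick any device with USM shared allocations.
    sycl::queue q{[](const sycl::device &d) {
      return d.has(sycl::aspect::usm_shared_allocations) ? 1 : -1;
    }};

    constexpr size_t N = 1024;
    double *x = sycl::malloc_shared<double>(N, q);
    double *sum = sycl::malloc_shared<double>(1, q);
    std::iota(x, x + N, 0.0);
    *sum = 0.0;

    // SYCL2020: a plain range (no nd_range) combined with a reduction.
    q.parallel_for(sycl::range<1>{N},
                   sycl::reduction(sum, sycl::plus<>()),
                   [=](sycl::id<1> i, auto &acc) { acc += x[i]; })
        .wait();

    sycl::free(x, q);
    sycl::free(sum, q);
  }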

Furthermore, the recently released, free-to-use plugins developed by Codeplay for use with Intel’s oneAPI toolkit enable the generation of code for NVIDIA’s or AMD’s GPUs without the need to compile LLVM oneself. Thus, we were able to execute DPEcho on hardware from all three major vendors: the Intel Data Center GPU Max Series (formerly Ponte Vecchio), the NVIDIA H100 and A100, the AMD Instinct MI100, and a number of Intel CPUs. In all cases, we utilised the oneAPI SYCL implementation (version 2023.0.0 for Intel GPUs and CPUs; the Intel LLVM fork and/or the Codeplay oneAPI plugin for NVIDIA and AMD GPUs), which implements a large subset of the SYCL2020 standard.

For the evaluation, we set up a representative GR-MHD test: the propagation of plane-parallel Alfvén waves (magnetic waves powered by magnetic-field tension) along the 3D diagonal of a cubic simulation box (Figure 1), with periodic boundary conditions along all directions. Alfvén waves are dispersionless and thus especially suitable for testing the accuracy of numerical schemes. We aim to fill the GPU memory and execution units efficiently, in order to minimise the relative SYCL initialisation and scheduling overhead, and to test production-ready configurations. We therefore perform a scaling test over the problem size (at different resolutions), from 64 to 512 grid elements per side. Given the relatively high resolution we achieve, we make the test more demanding by fitting multiple wave crests along each side. Preliminary performance measurements are presented in Figure 2. For tile-able GPUs, we show both single-tile (ST in the figure) and implicit-scaling (i.e. using the full hardware, IS in the figure) results. Besides a predictable variability due to this fact, all the tested hardware brings considerable speedup over the pure CPU runs. Despite the minimal data transfer granted by USM and malloc_device, all GPUs reach a performance plateau only for larger boxes (>192^3), due to scheduling overhead and device utilisation.

In our experience, developing simulation codes with SYCL presents several advantages over proprietary GPU programming models and OpenMP programming. It unlocks true heterogeneous computing, inclusive of GPUs of different vendors and generations; this is especially useful as domain scientists are not limited in their hardware choices for prototyping, compute-time applications and purchases. Furthermore, it allows HPC professionals to retain a single code base and to benchmark varied hardware with performant, portable code, and it takes advantage of the SYCL development and user community, as in the case of the recent SYCL2020. Finally, we note that GPUs and accelerators come in different sizes and numbers per node; quantities besides raw performance (e.g. energy efficiency) would therefore be desirable for a fair comparison. Discussing how to present such information, i.e. how to expose the true value of an accelerator for HPC, is among the goals of our contribution to SYCLcon.

Edge Acceleration for Machine Learning-based Motion Artifact Detection on fNIRS Dataset [800]
Shufan Yang, Edinburgh Napier University.  | Co-authors: Yunjia Xia, Rui Loureiro, Yunyi Zhao and Hubin Zhao, University College London. Uwe Dolinsky, Codeplay Software.
show / hide abstract

Machine learning has potential applications across a wide spectrum of devices. However, current approaches for domain-specific accelerators have encountered difficulties in satisfying the most recent computational demands of machine learning applications. This work aims to create an adaptive acceleration framework for fNIRS motion-artifact detection, specifically designed for wearable devices. We evaluate the performance of an SVM classifier implemented using SYCL on our fNIRS dataset across diverse devices, and discuss the potential to accelerate more advanced motion-artifact classifiers at the edge.

Parallel Algorithm for a Hidden Markov Model with an Indefinite Number of States and Heterogeneous Observation Data [8064]
Varvara Roubtsova, Hydro-Quebec.
show / hide abstract

In addition to being a modern technique used in speech-recognition applications, Hidden Markov Models (HMMs) are widely used in other areas, for example to predict equipment life cycles and optimize maintenance. Problems of this type have a very limited and fragmented set of observable data, as well as limited information on the possible states of the system. This article proposes a strategy for organizing parallel HMM learning, which is effectively implemented using OpenCL on GPU devices. The originality of this approach lies in the parallel implementation of the learning algorithm for a model with an indefinite number of states and heterogeneous observed data: sometimes only the observed signal is available, and sometimes the state of the system is known. The code presented in this article is parallelized across several GPU devices.

Introduction to TTL, an Open Source Tensor and Tiling Library for OpenCL [4336]
Christopher Gearing, Mobileye. | Co-authors: Ayal Zaks, Mobileye.
show / hide abstract

OpenCL has an Image data type to represent multi-dimensional arrays of image data for processing in device kernels. Whilst images provide powerful functionality for image processing, they are designed around the capabilities of GPUs and provide closed, opaque functionality and data formats. This poster presents the recently released open-source Tensor and Tiling Library (TTL), created to support multi-dimensional data on non-GPUs. Like the GPU Image type, the TTL tensor type has built-in attributes to describe its shape, layout and underlying data. The library provides methods for reading, writing, reasoning about the relative positioning of related tensors, and attributes of the data beyond the extent of the tensor.

OpenCL C provides asynchronous data-copy functions built into the language, allowing data to be moved between the global and local memory systems. Many usage patterns, where devices have limited memory, require the data to be tiled and then pipelined through the device. Whilst the OpenCL primitives make this pipelining possible, the pipelining code can be a significant part of the development effort, when the actual value added is the algorithm implementation itself. The TTL library offloads the tiling and pipelining boilerplate code, allowing the programmer to focus on algorithm development.

The poster presents the Tensor and Tiling Library, which is hosted at https://github.com/KhronosGroup/OpenCL-TTL

ABOUT CAMBRIDGE AND LOGISTICS

Location, Travel & Hotels

IWOCL and SYCLCON 2023 was hosted at the University of Cambridge, UK (Sidney Sussex College) close to the historic centre of Cambridge, UK.

  • Sidney Sussex College, Sidney Street, Cambridge CB2 3HU, UK
  • Get Directions
  • Visit Cambridge Tourism Guide for additional city and travel information.
  • Accommodation: A limited number of college rooms are available for conference attendees.

Cambridge is a university city located 89 km north of London and sits at the heart of the high-technology ‘Silicon Fen’, which contains industries such as software and bioscience, and many start-up companies born out of the university. Cambridge is well connected by road, rail and air, with London Stansted Airport (STN) 48 km south of the city offering direct flights across Europe and internationally.