Proceedings

The ACM Digital Proceedings are now available.

The IWOCL 2024 Call for Papers was posted before the January 1, 2024 deadline, so the proceedings are not published under the ACM’s new Open Access Model. We expect IWOCL 2025 to be Open Access.

Award Winners

This year’s awards went to:

Location and Times

All the sessions took place on the 6th Floor of the Student Center East Tower (SCE). See Location and Travel for additional information. All times are US Central Daylight Time (CDT).

Quick Links: SYCL HackFest | SYCL Tutorial | Conference (Wed and Thu) | Posters (Wed / Thu)

Monday April 8, 2024

SYCL HackFest

Tuesday April 9, 2024

Adv. SYCL Tutorial

Wednesday 10 – Thursday 11 April, 2024

OpenCL and SYCL Conference Sessions

Monday April 08, 2024 | HackFest Day

This session will take place in Room 605. Registration and refreshments will be in Room 603.

  • Session Chair: Sri Ramkrishna, oneAPI Community Manager, Intel.
  • Mentors: Tom Deakin, University of Bristol and SYCL Working Group Chair, Aksel Alpay, Heidelberg University and creator of AdaptiveCpp (formerly hipSYCL), Thomas Applencourt, Argonne National Laboratory and SYCL Advisory Panel Chair, and Hugh Delaney, Codeplay Software.

Agenda
08:00 – Registration and Welcome Coffee
08:30 – Start of HackFest
11:00 – Coffee Break
12:00 – HackFest session
13:30 – Lunch Break (including viewing of the solar eclipse)
14:30 – HackFest session
16:00 – Coffee Break
16:30 – HackFest session
19:00 – Session closes

The SYCL HackFest @ IWOCL 2024 is sponsored by Intel and Codeplay Software.

Tuesday April 09, 2024 | Advanced SYCL Tutorial Day

Advanced SYCL Programming Techniques and Best Practices: A Hands-On Tutorial
08:30 – 17:30

This session will take place in Room 605. Registration and refreshments will be in Room 603.

  • Tutorial Chair: Hugh Delaney, Codeplay
  • Tutors: Thomas Applencourt, Argonne National Laboratory, Abhishek Bagusetty, Argonne National Laboratory, and Aksel Alpay, Heidelberg University.

This advanced tutorial goes beyond the basic concepts of SYCL to offer participants a dynamic and interactive learning experience through a blend of short lectures followed by hands-on coding exercises that reinforce the advanced concepts introduced.

Attendance at this tutorial is included in the 3-Day IWOCL Conference Pass.

Agenda
08:30 – Registration and Welcome Coffee
09:00 – Start of Tutorial
11:00 – Coffee Break
12:00 – Tutorial session
13:00 – Lunch Break
14:00 – Tutorial session
16:00 – Coffee Break
16:30 – Tutorial session
17:30 – Session closes

Tutorial Format: A full-day, hands-on tutorial.

Tutorial Outline:

The material builds on the fundamentals of SYCL to provide participants with a deeper understanding. Topics covered will include strategies for:

  • Optimizing code
  • Managing data flow
  • Using different memory access patterns
  • Understanding work group sizes
  • Using vectorization
  • Working with ND ranges, and their importance
  • Making the most of the multiple devices available in your system

The hands-on coding exercises are designed to progressively challenge participants, starting with fundamental SYCL constructs and advancing to complex scenarios.
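
To give a flavour of those exercises, a kernel like the following touches two of the listed topics, ND ranges and work-group sizes (a minimal sketch assuming a generic SYCL 2020 setup; it is not taken from the tutorial materials):

    #include <sycl/sycl.hpp>

    // Minimal sketch (not tutorial material): an ND-range kernel with an
    // explicit work-group size and shared local memory, two of the topics
    // listed above.
    int main() {
        sycl::queue q;
        constexpr size_t N = 1024, WG = 128;
        float* data = sycl::malloc_shared<float>(N, q);
        for (size_t i = 0; i < N; ++i) data[i] = float(i);

        q.submit([&](sycl::handler& h) {
            sycl::local_accessor<float, 1> tile(sycl::range<1>(WG), h);
            // The nd_range fixes both the global size (N) and the work-group
            // size (WG); how that choice affects performance is one of the
            // tutorial's themes.
            h.parallel_for(sycl::nd_range<1>(sycl::range<1>(N), sycl::range<1>(WG)),
                           [=](sycl::nd_item<1> it) {
                size_t g = it.get_global_id(0), l = it.get_local_id(0);
                tile[l] = data[g];                  // stage through local memory
                sycl::group_barrier(it.get_group());
                data[g] = tile[WG - 1 - l];         // reverse within each group
            });
        }).wait();
        sycl::free(data, q);
    }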

Throughout the tutorial, instructors will guide participants in developing a deeper appreciation for SYCL’s role in heterogeneous computing and its potential to unlock significant performance gains. By the end of the session, participants will have honed their SYCL programming skills and gained the confidence to leverage SYCL in their projects.

Drinks Reception & Conference T-Shirts Collection
17:30 – 19:00 – University of Illinois Chicago, Student Center East, East Terrace (2nd floor)
All SYCL Tutorial, Two-Day and Three-Day pass holders are welcome to join us for this informal drinks reception. Drinks and light snacks will be provided and the reception will close at 19:00. The official IWOCL conference dinner is on Wednesday evening.

Wednesday – Conference Sessions

Sessions will take place in Room 605. Registration and refreshments will be in Room 603. Presenters are identified in bold.

Registration and Coffee
08:30 – 09:00
Welcome and Introduction [Welcome]
Tom Deakin, University of Bristol and IWOCL 2024 Conference Chair
09:00
Video Recording – Pending | View Slides

KEYNOTE PRESENTATIONS #1

OpenCL: An Update from the Khronos Working Group
Kévin Petit, Arm
09:15
Video Recording – Pending | View Slides
SYCL: An Update from the Khronos Working Group
Tom Deakin, University of Bristol and SYCL Working Group Chair
09:30
Video Recording – Pending | View Slides
SYCL SC: An Update from the Khronos Working Group
Victor Perez and Hugh Delaney, Codeplay
09:45
Video Recording – Pending | View Slides
INVITED TALK
OpenCL on Mobile [1001]
Kévin Petit, Arm.
10:00
Video Recording – Pending | View Slides
Performance Improvement of Meta’s LLaMa on Adreno GPU with MLC LLM Framework [1494]
Authors: Siva Rama Krishna Reddy B, Hongqiang Wang, Alex Bourd, Krishna Raju Vegiraju, Li He, Elina Kamenetskaya and Amir Momeni, Qualcomm.
10:30

This technical presentation details industry-leading, first-of-its-kind performance for Meta’s LLaMa-7b model on Adreno GPUs using the MLC LLM (Machine Learning Compilation for Large Language Models) framework. MLC LLM (https://llm.mlc.ai/) is an open-source universal solution that allows any LLM (Large Language Model) to be deployed natively on a diverse set of hardware backends and native applications; it has demonstrated LLM models running on iOS, Android, Windows, Linux, macOS and web browsers. Under the hood, MLC uses TVM (the Tensor Virtual Machine) for model compilation, optimization and device execution.

We detail the baseline performance MLC LLM achieves across Adreno GPU generations and the improvements Qualcomm has made on top of it. Baseline decode performance was 6 tokens per second on Snapdragon Gen 2 and 8 tokens per second on Snapdragon Gen 3.

MLC LLM’s early implementation for Android is driven by hand-crafted dispatch IR (Intermediate Representation) that generates the corresponding OpenCL kernels; more recent versions use a Dlite-based scheduling mechanism. Both approaches leave room for improvement on Adreno GPUs, and we chose the dispatch-based code generation as the baseline for our work.

Early analysis and hardware-level profiling of the generated OpenCL kernels revealed under-utilization of Adreno’s threading capability, with too few threads in flight and some cores left idle; inappropriate vectorization, resulting in inefficient loads and stores; and excess local memory utilization per work-group. We also identified a few network-level opportunities to avoid unnecessary layout transformations.

Improvements started by increasing the work-group size by 2x and 4x to compute partial sums followed by a reduction; this fully occupied the cores and improved kernel-level performance by nearly 35%. Vectorization was corrected to enable coalesced data loads, improving load throughput, which matters most in the data-bound decode multiplications (vector-to-matrix multiplication). Local memory utilization per work-group was reduced to relieve pressure on local memory. The network-level optimizations include removing unnecessary transposes by using a modified MatMul that works on the original data layouts; together these changes reduced global memory bandwidth along with some savings on the transposes themselves.

These optimizations improved the overall decode performance of LLaMa-7b-v1/v2 to 11 tokens per second on Snapdragon Gen 2 and 14 tokens per second on Snapdragon Gen 3. Most of the improvements are generic and also significantly improved prompt processing.

LLaMa-7b support on Android platforms at this level of performance is a first. In this talk, we will cover the baseline dispatch design and the improvements we made to achieve it.
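
The two generic kernel-level fixes described above, work-group partial sums followed by a reduction and vectorized coalesced loads, correspond to a pattern like the following hypothetical OpenCL C kernel (illustrative only; names and shapes are invented and this is not Qualcomm’s code):

    // Hypothetical OpenCL C kernel (as a C++ string literal) showing the two
    // generic fixes: float4 vectorized loads and work-group partial sums
    // reduced in local memory. Illustrative only; not Qualcomm's code.
    static const char* kGemvSrc = R"CLC(
    __kernel void gemv_partial(__global const float4* mat,  /* row-major weights */
                               __global const float4* vec,  /* activations */
                               __global float* out,
                               __local float* scratch,
                               const int cols4) {           /* columns / 4 */
        int row = get_group_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);
        float acc = 0.0f;
        /* Strided, vectorized (coalesced) partial dot product per work-item. */
        for (int c = lid; c < cols4; c += lsz)
            acc += dot(mat[row * cols4 + c], vec[c]);
        scratch[lid] = acc;
        barrier(CLK_LOCAL_MEM_FENCE);
        /* Tree reduction of the partial sums within the work-group. */
        for (int s = lsz / 2; s > 0; s >>= 1) {
            if (lid < s) scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0) out[row] = scratch[0];
    }
    )CLC";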

Video Recording – Pending | View Slides
Coffee Break and Networking
11:00 – 11:30
Emulating Command Buffer Extensions with OpenCL Layers [6895]
Authors: Ewan Crawford, Codeplay Software, and James Brodman and Ben Ashbaugh, Intel
11:30

When a new OpenCL extension is defined, it typically takes several months to several years before implementations of the extension are widespread and available for general usage by applications. In unfortunate cases, an extension may never be implemented for some devices or operating systems. This lack of availability can hinder adoption of an extension by shifting complexity to application developers, who may need to develop two different code paths, one with support for the extension and one without.

In our technical presentation we will describe how OpenCL layers were used to implement a command buffer emulation layer with support for the provisional cl_khr_command_buffer and cl_khr_command_buffer_mutable_dispatch extensions. The command buffer emulation layer brings functional support for these extensions to most OpenCL devices in the market, allowing developers to develop OpenCL applications using both extensions even when they are not available natively.

We will describe how the emulation layer works with a focus on how existing OpenCL APIs provide the functionality used to emulate command buffers. We will describe OpenCL features that were helpful for emulation, such as built-in reference counting for OpenCL objects and the ability to clone OpenCL kernels and their arguments. We will describe some command buffer features that were tricky to emulate, such as event profiling and certain types of error checking. These features required additional work and creativity, though we were ultimately able to emulate them. We will also describe a small handful of command buffer features that we currently do not emulate properly, such as the command buffer pending state.

Our presentation will include success stories, such as how the command buffer emulation layer was used to develop the OpenCL conformance test suite (CTS) for command buffers. We will describe how the command buffer emulation layer is currently being used by the OpenCL working group to rapidly prototype new command buffer features and layered extensions that add new command buffer functionality. We will describe how the command buffer emulation layer is being used to develop and test SYCL graphs, which generate OpenCL command buffers. Even for implementations that support command buffers natively, the command buffer emulation layer provides an alternative implementation that may be useful when debugging applications using command buffers.

Our presentation will include a brief evaluation of the performance of the command buffer emulation layer. We will compare the performance of the command buffer emulation layer against a command buffer-like implementation in the application itself, approximating a traditional alternative code path to support OpenCL implementations without support for the command buffer extensions. We will also compare the performance of the command buffer emulation layer against implementations with native support for command buffers, such as POCL, and any other native implementations we can identify and acquire prior to the conference.

Our presentation will close with a broader discussion of layers in the OpenCL ecosystem, including how OpenCL can better support and evangelize layers like the command buffer emulation layer.

The source code for the command buffer emulation layer is available under a permissive license (MIT).
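
For context, the application-side flow the layer enables looks roughly as follows (a hedged sketch against the provisional extension; real code resolves the entry points via clGetExtensionFunctionAddressForPlatform, and signatures may still change):

    #include <CL/cl_ext.h>   // declares the provisional cl_khr_command_buffer API

    // Hedged sketch: record a kernel launch into a command buffer once, then
    // replay it many times. Signatures follow the provisional extension and
    // may differ across revisions.
    void record_and_replay(cl_command_queue queue, cl_kernel kernel, int iterations) {
        cl_int err = CL_SUCCESS;
        cl_command_buffer_khr cmdbuf =
            clCreateCommandBufferKHR(1, &queue, nullptr, &err);

        // Record an ND-range launch instead of enqueuing it directly.
        size_t gws = 1024;
        err = clCommandNDRangeKernelKHR(cmdbuf, nullptr, nullptr, kernel,
                                        1, nullptr, &gws, nullptr,
                                        0, nullptr, nullptr, nullptr);

        // Finalize once; each replay then skips per-command construction cost.
        err = clFinalizeCommandBufferKHR(cmdbuf);
        for (int i = 0; i < iterations; ++i)
            err = clEnqueueCommandBufferKHR(0, nullptr, cmdbuf, 0, nullptr, nullptr);
        clFinish(queue);
        clReleaseCommandBufferKHR(cmdbuf);
    }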

Video Recording – Pending | View Slides
Towards Efficient OpenCL Pipe Specification for Hardware Accelerators [7099]
Authors: Topi Leppänen, Joonas Multanen, Leevi Leppänen and Pekka Jääskeläinen, Tampere University.
12:00
FPGAs are programmable devices that are interesting for streaming-style applications. Using vendor-independent programming models such as OpenCL for FPGAs can aid the development effort and prevent vendor lock-in. OpenCL pipes included in the OpenCL standard offer a natural way to describe fine-grained task pipelines. However, the current use of the OpenCL pipe in FPGA OpenCL implementations is either non-compliant or not performance optimized due to several implementation challenges. In this paper we pinpoint the key implementation complexities and suggest possible specification updates and implementation choices that enable description of efficient task pipelines in a portable, vendor- and device-independent manner. We design a performance-optimized hardware pipe prototype, tackling a key challenge (runtime-defined connectivity) in going towards an OpenCL-compliant, yet portable pipe implementation. The evaluation of our prototype on an FPGA shows that in a computer vision application, the proposed dynamically connected pipe component is 2.5x faster than an OpenCL buffer-based design. We evaluate the cost of the flexibility offered by the dynamic pipe prototype to be 4.4% in area utilization out of the total device resources and a 1.6x latency overhead compared to a fixed connectivity design.
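
For reference, the construct under discussion in its standard OpenCL 2.0 form, a pipe connecting a producer and a consumer kernel, looks like this (an illustrative sketch, not the authors’ modified specification):

    // Standard OpenCL C pipe usage (as a C++ string literal); the paper
    // studies how to implement and connect such pipes efficiently on FPGAs.
    static const char* kPipeSrc = R"CLC(
    __kernel void producer(__write_only pipe int out_pipe) {
        int v = get_global_id(0);
        /* write_pipe returns non-zero if the pipe is full; real code retries. */
        write_pipe(out_pipe, &v);
    }
    __kernel void consumer(__read_only pipe int in_pipe, __global int* dst) {
        int v;
        if (read_pipe(in_pipe, &v) == 0)
            dst[get_global_id(0)] = v;
    }
    )CLC";
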
Video Recording – Pending | View Slides
Optimisation and Evaluation of Breadth First Search with oneAPI/SYCL on Intel FPGAs: from Describing Algorithms to Describing Architectures [8468]
Authors: Kaan Olgu and Simon McIntosh-Smith, University of Bristol, Tobias Kenter, Paderborn University, and Jose Nunez-Yanez, Linkoping University.
12:30
FPGAs are often used in scientific fields to process graph algorithms due to their energy efficiency, reconfigurability, and fine-grained parallelism. However, these algorithms face challenges in memory access patterns, scalability, and programmability. The SYCL 2020 implementation in the Intel oneAPI toolchain supports FPGA targets alongside SYCL 2020 features like modern C++ with single-source offloading to improve programmability. This study analysed the Breadth-First Search algorithm on a Stratix 10 FPGA with the Intel oneAPI toolchain. The implementation was done in two phases. At first, we applied the typical optimisations proposed in the official guidelines alongside an automatic cache to achieve proper pipelining and improve random memory access performance. However, limitations remained with fine-grained parallelism, and the design was competitive with only some of the related work that utilised hardware-description languages or established high-level synthesis tools. For the second phase, we added bit-level representations of data in memory, banking in on-chip memory, and fine-grained control over parallel data streams. The second implementation was generally superior or on par with all compared designs, outperforming other works on 10 out of 15 tested datasets, including various synthetic RMAT and real-world datasets, with a peak performance of 1021 MTEPS.
Video Recording – Pending | View Slides
Lunch and Networking
13:00 – 14:00
Intel SHMEM: an OpenSHMEM Runtime with GPU-initiated Operations using SYCL [8996]
Authors: Md Rahman, David Ozog and Lawrence Stewart, Intel.
14:00
Modern high performance systems are increasingly heterogeneous, providing users options to use general purpose Graphics Processing Units (GPU) and other accelerators for additional performance benefits. High Performance Computing (HPC) applications are often carefully designed to introduce overlap between communications and computation for increased efficiency. With greater adoption of GPUs, there is an emerging need to overlap communications both with host computation as well as with GPU computation. This has led to efforts to extend popular communication paradigms and specific libraries to support GPU aware communication and more recently, GPU-initiated communication operations. Here, we present Intel® SHMEM, a library that enables users to write programs that are GPU aware, in that API calls support GPU memory, and which support GPU initiated communications, by embedding OpenSHMEM style calls within GPU kernels. We also propose generic extensions to the OpenSHMEM standard that can enable users to fully utilize GPUs.
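
The core idea, an OpenSHMEM-style call issued from inside a SYCL kernel, can be sketched as follows (hedged: the ishmem.h header name and the ishmem_float_p entry point follow OpenSHMEM naming conventions and should be treated as assumptions):

    #include <sycl/sycl.hpp>
    #include <ishmem.h>   // assumed header name for Intel SHMEM

    // Hedged sketch of a GPU-initiated put: each work-item writes one element
    // into the symmetric heap on a peer PE, overlapping fine-grained
    // communication with device computation. Assumes the runtime has been
    // initialized (ishmem_init) and sym_dst is a symmetric allocation.
    void exchange(sycl::queue& q, float* sym_dst, const float* src,
                  size_t n, int peer_pe) {
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            ishmem_float_p(sym_dst + i, src[i], peer_pe);
        }).wait();
    }
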
Video Recording – Pending | View Slides
SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs [2186]
Authors: Luigi Crisci, Lorenzo Carpentieri and Biagio Cosenza, University of Salerno, Peter Thoman, University of Innsbruck, and Aksel Alpay and Vincent Heuveline, University of Heidelberg.
14:30
Today, the SYCL standard represents the most advanced programming model for heterogeneous computing, delivering productivity, portability, and performance in pure C++17. SYCL 2020, in particular, represents a major enhancement that pushes the boundaries of heterogeneous programming by introducing a number of new features. As these features are implemented by existing compilers, it becomes critical to assess the maturity of each implementation through accurate and specific benchmarking. This paper presents SYCL-Bench 2020, an extended benchmark suite specifically designed to evaluate six key features of SYCL 2020: unified shared memory, reduction kernels, specialization constants, group algorithms, in-order queues, and atomics. We experimentally evaluate SYCL-Bench 2020 on GPUs from the three major vendors, i.e., AMD, Intel, and NVIDIA, and on two different SYCL implementations, AdaptiveCpp and oneAPI DPC++.
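
As an example of the surface being benchmarked, a SYCL 2020 reduction kernel over a unified-shared-memory allocation, exercising two of the six features, looks like this (a minimal sketch, not SYCL-Bench code):

    #include <sycl/sycl.hpp>

    // Minimal sketch (not SYCL-Bench code): a SYCL 2020 reduction over a
    // unified-shared-memory allocation, two of the six benchmarked features.
    int main() {
        sycl::queue q;
        constexpr size_t N = 1024;
        float* data = sycl::malloc_shared<float>(N, q);
        float* sum  = sycl::malloc_shared<float>(1, q);
        for (size_t i = 0; i < N; ++i) data[i] = 1.0f;
        *sum = 0.0f;
        q.parallel_for(sycl::range<1>(N),
                       sycl::reduction(sum, sycl::plus<float>()),
                       [=](sycl::id<1> i, auto& acc) { acc += data[i]; }).wait();
        // *sum is now 1024.0f
        sycl::free(data, q);
        sycl::free(sum, q);
    }
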
Video Recording – Pending | View Slides
Evaluation of SYCL’s Different Data Parallel Kernels [7601]
Authors: Marcel Breyer, Alexander Van Craen and Dirk Pflüger, University of Stuttgart.
15:00
SYCL provides programmers with four, and in the case of AdaptiveCpp even five, ways for calling and writing a device kernel. This paper analyzes the performance of these diverse kernel invocation types for DPC++ and AdaptiveCpp as SYCL implementations on an NVIDIA A100 GPU, an AMD Instinct MI210 GPU, and a dual-socket AMD EPYC 9274F CPU. Using the example of a kernel matrix assembly, we show why the performance can differ by a factor of 100 in the worst case on the same hardware for the same problem using different SYCL implementations and kernel invocation types.
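
Two of the compared invocation types, the basic range form and the nd_range form with explicit work-groups, differ only slightly at the call site (a minimal sketch, not the paper’s benchmark code):

    #include <sycl/sycl.hpp>

    // Basic range form: the implementation picks the work-group geometry.
    void twice_range(sycl::queue& q, const float* in, float* out, size_t n) {
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            out[i] = 2.0f * in[i];
        }).wait();
    }

    // nd_range form: the programmer fixes the work-group size (here 128).
    void twice_ndrange(sycl::queue& q, const float* in, float* out, size_t n) {
        q.parallel_for(sycl::nd_range<1>(sycl::range<1>(n), sycl::range<1>(128)),
                       [=](sycl::nd_item<1> it) {
            size_t i = it.get_global_id(0);
            out[i] = 2.0f * in[i];
        }).wait();
    }
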
Video Recording – Pending | View Slides
Improving Performance Portability of the Procedurally Generated High Energy Physics Event Generator MadGraph Using SYCL [0147]
Authors: Nathan S. Nichols, J. Taylor Childers and Tyler James Burch, Argonne National Laboratory, and Laurence Field, CERN.
15:30
Event generators are essential tools for simulating Standard Model particle interactions, representing the initial step in modeling proton-proton collisions in the Large Hadron Collider (LHC) at CERN. The field has traditionally relied on a few generators such as MadGraph5_aMC@NLO, and there is a pressing need for enhanced portability across diverse computing platforms. In response, MadGraph5_aMC@NLO was expanded to support Nvidia GPUs using CUDA. This paper explores further advancements by porting MadGraph to SYCL, aiming to improve portability and leverage the heterogeneous architectures of modern supercomputers. A comprehensive performance comparison of MadGraph on Nvidia, AMD, and Intel GPUs and on Intel CPUs is presented using physics figures-of-merit, including benchmarks against the CUDA (on GPU) and OpenMP (on CPU) implementations, offering a holistic view of performance across platforms. Weak scaling results on the Polaris (Nvidia GPUs) and Aurora (Intel GPUs) systems demonstrate that SYCL not only ensures portability but also maintains competitive performance. The adoption of SYCL achieves this without the complexities of managing multiple code bases or extensive use of preprocessor directives, presenting a promising approach for future developments in high-energy physics simulations.
Video Recording – Pending | View Slides
Coffee Break and Networking
16:00 – 16:30
Powering Amber Molecular Dynamics Simulations on GPUs with SYCL [3381]
Authors: Andreas Goetz, San Diego Supercomputer Center and Guoquan Chen, Ram Ramanujam and Kittur Ganesh, Intel.
16:30

Amber is a biomolecular simulations software package that is used by thousands of scientists in academia, national labs, and industry for computational drug discovery and related research. Amber is particularly well known for its high-performance molecular dynamics (MD) program, which features a very efficient CUDA implementation for Nvidia graphics processing units (GPUs) that has been continuously optimized since its initial release over a decade ago. A feature of this implementation is that it makes extensive use of FP32 floating point arithmetic while controlling numerical stability by employing 64-bit fixed-point integer arithmetic for the accumulation of energy and force contributions. To maximize performance, the CUDA implementation runs entirely on the GPU with data transfer between CPU and GPU only at program start and for checkpointing and MD trajectory output. This enables efficient MD simulations with millions of atoms both on datacenter and consumer grade hardware. Amber is therefore used on hardware ranging from small desktop workstations to supercomputers for large-scale ensemble simulations. Recently, a HIP/ROCm implementation has become available to enable Amber MD simulations on AMD devices.

In this technical presentation we present our experiences in developing a SYCL implementation of the Amber MD engine by porting the CUDA code using Intel oneAPI software development tools and Intel Xe architecture GPUs. The goals of this effort are twofold. First, to enable Amber MD simulations on Intel Xe architecture GPUs, and second to achieve performance portability across devices from different vendors by employing a single-source model based on a modern C++ standard. In the long run we hope to increase developer productivity by avoiding having to maintain multiple code bases for different hardware architectures.

This porting effort was non-trivial as the Amber MD code is quite complex, containing a large set of highly optimized C++ CUDA kernels for particle force calculations, time stepping, temperature and pressure control and enhanced sampling algorithms. The CUDA code alone consists of 58 header files and 23 source files with 459 global CUDA kernels and 292 CUDA device functions, including Nvidia PTX assembly instructions to optimize performance on different Nvidia GPU hardware generations. In addition, MPI is used for distributed memory parallelization to enable multi-GPU parallelization and large-scale ensemble simulations. We will give an overview of this porting effort and discuss some relevant optimizations of the SYCL implementation such as the replacement of subgroup shuffles with explicit shared local memory operations, memory space casting in global atomics, use of ESIMD-based sorting algorithms in the oneDPL library, and the use of MPI-based explicit scaling for simulations that execute on two tiles of Intel GPU Max 1550 devices. Numerical tests demonstrate that the accuracy of the SYCL implementation is on par with the CUDA reference code and benchmark simulations demonstrate strong performance on Intel Data Center GPU Max 1550 hardware. We will conclude with a discussion of our plans for future releases with additional features for more complex MD algorithms and efforts to investigate performance portability of this SYCL code across different vendors. This work will enable Amber MD simulations on a wide range of accelerator hardware, thus enabling scientists to gain deeper understanding of structure-function relationships of biomacromolecules and advance fields such as computational drug design where high-performance MD simulations are critical.

Video Recording – Pending | View Slides
Experience of Porting LAMMPS Application with KOKKOS/SYCL to Aurora [4642]
Authors: Yasaman Ghadar and Chris Knight, Argonne National Laboratory, Stan Moore, Sandia National Laboratory, Daniel Arndt, Oak Ridge National Laboratory, and Renzo Bustamante and Michael Brown, Intel.
17:00

This talk provides an overview of the work that has been done by a team of experts at multiple national labs and Intel to improve the performance of the Spectral Neighbor Analysis Potential (SNAP), a machine-learning potential in the LAMMPS molecular dynamics code on Aurora, Argonne’s exascale machine. The SNAP potential relates the total potential energy of the system (composed of the sum of energies from individual atoms) to weighted bi-spectrum (descriptor) components.

Aurora’s main programming model is SYCL, using Intel’s DPC++ implementation. This work was part of a larger effort to enable the EXAALT project, part of DOE’s exascale computing project, to run performantly on DOE’s exascale machines. The EXAALT ECP project targets molecular dynamics simulation of materials related to nuclear fusion and fission. LAMMPS supports potentials for a wide range of systems, including solid-state materials, soft matter, liquids, and coarse-grained or mesoscopic systems. LAMMPS runs on a single processor or in parallel using message-passing techniques and a spatial decomposition of the simulation domain. Many of its models have versions that provide accelerated performance on CPUs and GPUs. The code is designed to be easily modified or extended with new functionalities. LAMMPS is written in C++ and includes support for OpenMP on CPUs, a GPU package with native CUDA, HIP, and OpenCL backends, and a Kokkos performance portability package. Kokkos is the primary programming model for EXAALT and provides a SYCL backend used on Aurora.

A mini-app called “TestSNAP” was developed to represent the SNAP potential and its bottleneck kernels and to enable quick code modifications and evaluate the impact on performance. Today, we use both the mini-app and the full LAMMPS application for benchmarking. Three kernels in SNAP were identified as the most time-consuming: Compute_FusedDeiDrj, Compute_Ui, and Compute_Yi. Intel VTune and Advisor tools were used to profile the code. Several improvements were implemented, including changes to both the LAMMPS implementation of SNAP and the Kokkos SYCL backend spanning exploration of subgroup sizes and work-item distributions, the index calculation for multidimensional Kokkos views, and manually specifying memory pointers for address spaces.

These optimizations (plus a few others) were benchmarked on ALCF Aurora and compared to the original code on OLCF Frontier and ALCF Polaris. One independent 2000-atom replica was run on each tile, GCD, or GPU simultaneously, mimicking the EXAALT workflow. Two nodes were used on each machine, giving a total of 24 replicas on Aurora, 16 replicas on Frontier, and 8 replicas on Polaris. On Aurora, these optimizations gave a 1.49x speedup over the code in the public LAMMPS repository. The average cumulative performance of one PVC GPU (both tiles) is now 1.04x higher than an A100 GPU on Polaris and 1.08x higher than an MI250X GPU (both GCDs) on Frontier. We are currently scaling up the runs to use 1024 nodes on Aurora.

This work was done on a pre-production supercomputer with early versions of the Aurora software development kit. This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory and is based on research supported by the U.S. DOE Office of Science-Advanced Scientific Computing Research Program, under Contract No. DE-AC02-06CH11357. SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525. This manuscript has been authored by UT-Battelle, LLC, under Grant DE-AC05-00OR22725 with the U.S. Department of Energy (DOE).

Video Recording – Pending | View Slides
Unlocking Performance Portability on LUMI-G Supercomputer: A Virtual Screening Case Study [3461]
Authors: Gianmarco Accordi, Davide Gadioli and Gianluca Palermo, Politecnico di Milano, Luigi Crisci, Lorenzo Carpentieri and Biagio Cosenza, University of Salerno, and Andrea R. Beccari, EXSCALATE, Dompé Farmaceutici.
17:30
High-Performance Computing is the target system for virtual screening applications, which aim to suggest which candidates to test in the drug discovery process. The HPC heterogeneity of modern systems raises the functional and performance portability challenge. LiGen is a well-known virtual screening application that can offload the most demanding computation on GPUs. It has been used to perform extreme-scale virtual screening campaigns on HPC systems equipped with NVIDIA cards using a CUDA implementation. This paper reports the experience of running its SYCL implementation on the LUMI-G HPC system that leverages AMD GPUs. Based on the experimental results, the LiGen SYCL implementation performs well on AMD GPUs, enabling LiGen to run a virtual screening campaign on LUMI-G HPC infrastructure.
Video Recording – Pending | View Slides
Preparing HACC for Exascale Cosmology on Aurora Using SYCL [6020]
Authors: Esteban M. Rangel, Adrian Pope, and Nicholas Frontiere, Argonne National Laboratory, S. John Pennycook, Zhiqiang Ma and Varsha Madananth, Intel.
18:00
show / hide abstract

The Aurora supercomputer at the Argonne Leadership Computing Facility (ALCF) is one of the first exascale computing resources being put forth by the US Department of Energy (DOE). Driven by 63,744 Intel Data Center GPU Max Series and 21,248 Intel Xeon CPU Max Series processors, Aurora’s theoretical peak compute capability is more than two exaflops. Many of the scientific applications targeting Aurora will be using SYCL; in this presentation, we detail our efforts to prepare one such application for Aurora, providing insight into the porting and optimization process.

HACC (Hardware/Hybrid Accelerated Cosmology Code) is an extreme-scale cosmological simulation code developed and optimized for DOE supercomputers since the first petascale systems. The simulation’s solvers are algorithmically tuned for compute architectures, with the GPU implementation first emerging to target OLCF-3 (Titan). An extension of HACC with physics for resolving gas hydrodynamics uses a new Conservative Reproducing Kernel (CRK) formulation of Smoothed Particle Hydrodynamics (SPH). The additional physics in CRK-HACC makes simulations significantly more computationally demanding and drives the need for GPU acceleration on current exascale systems. CRK-HACC is under active development and has CUDA, HIP, and SYCL implementations – maintaining all three of these implementations with minimal developer effort is of significant interest to the HACC team, and this introduces more challenges than a more straightforward migration from CUDA to SYCL.

In the first part of this talk, we will discuss how we developed the initial SYCL implementation of CRK-HACC, paying close attention to our efforts to retain a common host code shared across CUDA, HIP, and SYCL kernels. These efforts include a customized CUDA-to-SYCL migration pipeline that builds upon SYCLomatic to produce SYCL kernels described as function objects (as opposed to lambda expressions). We will also detail our approach to achieving high levels of “performance portability” across AMD, NVIDIA and Intel GPUs using SYCL, which required us to develop an abstraction capable of compiling to multiple different implementations of “shuffle” operations: the sycl::select_from_group function from SYCL 2020, a shuffle operation emulated via work-group local memory, and a highly specialized shuffle operation implemented for Intel GPUs in assembly (vISA). The combination of these techniques across host and device code carefully balances code maintainability and performance portability, and we believe these techniques will generalize well to other applications from other domains.
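
The portable leg of such a shuffle abstraction can be pictured as follows (a hypothetical sketch with invented names, not CRK-HACC code; only the SYCL 2020 path is shown, with the local-memory and vISA variants sitting behind the same wrapper, selected per target at compile time):

    #include <sycl/sycl.hpp>

    // Hypothetical sketch: one call site that could compile down to any of
    // the three shuffle implementations described above. Shown here is the
    // portable SYCL 2020 path, sycl::select_from_group.
    template <typename T>
    T shuffle_lane(sycl::sub_group sg, T value, unsigned src_lane) {
        return sycl::select_from_group(sg, value, sycl::id<1>(src_lane));
    }

    void broadcast_lane0(sycl::queue& q, float* out, size_t n) {
        q.parallel_for(sycl::nd_range<1>(sycl::range<1>(n), sycl::range<1>(64)),
                       [=](sycl::nd_item<1> it) {
            float x = float(it.get_global_id(0));
            // Every work-item receives lane 0's value of its sub-group.
            out[it.get_global_id(0)] = shuffle_lane(it.get_sub_group(), x, 0u);
        }).wait();
    }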

In the second part of this talk, we will provide an update on our (currently in progress) efforts to further reduce the effort associated with maintaining separate CUDA, HIP, and SYCL variants of CRK-HACC’s kernels. Specifically, we will detail the design considerations and performance characteristics of an embedded domain-specific language (eDSL) that we are developing for CRK-HACC. This eDSL is essentially a thin C++ wrapper around CUDA, HIP and SYCL which allows a single source code to target any backend, and which is intended to provide CRK-HACC with greater control over CUDA and HIP code generation than relying on SYCL implementations that are themselves still in development.

Finally, we demonstrate the scaling of the SYCL version of CRK-HACC running on Aurora at large scale (anticipated to be approximately 2000 nodes, or 12000 GPUs). To the best of our knowledge, this presentation will be one of the first to present SYCL performance results at such a scale.

Video Recording – Pending | View Slides
Day 1 Close
18:30
Conference Dinner
19:00 – 21:00
Tufano’s Vernon Park Tap, 1073 W. Vernon Park Pl Chicago, IL 60607 | 8 min walk from the venue | See directions

Thursday – Conference Sessions

Sessions will take place in Room 605. Registration and refreshments will be in Room 603. Presenters are identified in bold.

Welcome Coffee and Networking
08:30 – 09:00
Welcome from Argonne National Laboratory [Welcome]
Brice Videau, Computer Scientist, Argonne National Laboratory (a long-time OpenCL contributor and part of the Performance Engineering team at the Argonne Leadership Computing Facility).
09:00
Video Recording – Pending | View Slides

KEYNOTE PRESENTATION #2

Rusticl: Compute for the Linux desktop?
Karol Herbst, Red Hat, Linux GPU Driver Software Engineer.
09:15
In the HPC space, OpenCL plays an important role in GPU compute workloads. On the desktop, however, the situation isn’t all that great. I’d like to share my thoughts on where I think the challenges are, what we in Mesa can do about them, and what could be changed in the overall ecosystem to make GPU compute more dominant on the desktop as well. I’d also like to share our own experience with OpenCL, both from implementing it and from using it within Mesa, and what our plans are moving forward with regard to OpenCL.
Video Recording – Pending | View Slides
AdaptiveCpp Stdpar: C++ Standard Parallelism Integrated Into A SYCL Compiler [0367]
Authors: Aksel Alpay and Vincent Heuveline, Universität Heidelberg
10:00

Expressing data parallel programs using C++ standard parallelism is attractive not only due to the simplicity of the model, but also due to its highly idiomatic nature. This programming model, commonly referred to as stdpar, can also be used for accelerator programming by offloading calls to standard algorithms, and is supported by multiple vendors, such as NVIDIA with nvc++, AMD with roc-stdpar, and Intel with the new ICPX -fsycl-pstl-offload flag.

We present AdaptiveCpp stdpar, a novel stdpar implementation that is part of the AdaptiveCpp SYCL implementation. AdaptiveCpp stdpar is the very first open-source stdpar implementation based on SYCL, and allows users to start developing applications using C++ standard algorithms, and then progressively move to SYCL as more control is needed.

Our solution supports CPUs as well as GPUs from NVIDIA, Intel and AMD. We find that it outperforms all vendor stdpar compilers on HPC GPUs in the majority of tested applications, in some configurations by up to an order of magnitude. Furthermore, we show how AdaptiveCpp outperforms nvc++ in a latency-bound code for all tested problem sizes by up to 80% on NVIDIA A100 due to novel optimizations. Our stdpar implementation deviates from existing implementations by relying on a tighter integration with compiler and runtime, including e.g. dedicated optimization passes to elide synchronization, automatically prefetching required allocations, and an offloading heuristic.
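
The programming model itself is plain C++; under stdpar, a standard algorithm like the one below becomes an offload candidate when compiled with AdaptiveCpp’s stdpar mode (a minimal sketch, not the paper’s benchmark code):

    #include <algorithm>
    #include <execution>
    #include <vector>

    // Plain C++ standard parallelism: no SYCL constructs in user code. With
    // AdaptiveCpp's stdpar mode enabled (the --acpp-stdpar compiler flag),
    // calls like this std::transform may be offloaded to a GPU.
    int main() {
        std::vector<float> x(1 << 20, 1.0f), y(1 << 20);
        std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                       [](float v) { return 2.0f * v + 1.0f; });
    }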

Video Recording – Pending | View Slides
Leveraging Standard C++ for Better SYCL Libraries [4331]
Authors: Mikhail Klimenko, Intel.
10:30
SYCL is an emerging heterogeneous programming model that provides an easy entry point for users to develop programs that can be executed on a plurality of accelerator architectures. However, due to its design choices, the SYCL approach works well for examples or end-user applications but involves much more work for library developers, whose code must be more generic and handle multiple use cases.

This paper introduces a user-friendly approach to creating generic libraries and reducing boilerplate code by leveraging std::variant and std::visit from standard C++. The approach was tested in an in-house high-performance kernel library for server-grade GPUs, where it demonstrated a significant uplift in productivity and improved code maintainability.
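
The core pattern is a library entry point that accepts one of several supported types through std::variant and dispatches to a typed kernel with std::visit (a minimal sketch with invented names, not the in-house library’s code):

    #include <cstdint>
    #include <type_traits>
    #include <variant>
    #include <sycl/sycl.hpp>

    // Minimal sketch (invented names): one type-erased entry point replaces a
    // family of templated overloads. std::visit recovers the static type and
    // instantiates the matching SYCL kernel.
    using DevicePtr = std::variant<float*, sycl::half*, std::int8_t*>;

    void scale(sycl::queue& q, DevicePtr data, size_t n, float factor) {
        std::visit([&](auto* ptr) {
            using T = std::remove_pointer_t<decltype(ptr)>;
            q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                ptr[i] = static_cast<T>(factor * static_cast<float>(ptr[i]));
            }).wait();
        }, data);
    }
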
Coffee Break and Networking
11:00 – 11:30
Towards a Unified Group Abstraction for SYCL [1215]
Authors: James Brodman, John Pennycook, Ben Ashbaugh, Michael Kinsner, Steffen Larsen, Greg Lueck, and Roland Schulz, Intel Corporation and Gordon Brown, Codeplay Software.
11:30

SYCL 2020 introduced several new and notable features related to groups of work-items, including a new group type (sub-groups) for performance tuning and a new library of group functions and algorithms (such as reductions) to improve programmer productivity. This new functionality was deliberately designed such that the work-group and sub-group classes would share a common interface, leaving the door open to explore additional group types in future versions of SYCL.

The experiences of developers working on complex SYCL applications have shown that although the SYCL 2020 group functionality is useful, there is room to improve usability and a need to address emerging use cases. For example, the requirement that group algorithms must be called in converged control flow by all members of a group often forces developers to structure their code in ways that may feel unnatural, and which may negatively impact the readability and maintainability of their code. Furthermore, our experience teaching SYCL has highlighted several challenges that can complicate learning for new developers, such as using the term “group” to mean both “any group of work-items” and “a work-group” (i.e., sycl::group). SYCL’s group functionality must evolve to address these concerns but must do so in a way that prioritizes backwards compatibility.

In this talk, we propose several additions to the SYCL specification designed to address these issues, including six new group types that we are exploring in the context of the Data Parallel C++ compiler. The first group type, a root group, contains all the work-items in an ND-range and provides access to device-wide barriers and algorithms within a kernel. The second group type, a fixed size partition of an existing group, enables a programming model similar to hierarchical parallelism. The third group type, a logical partition of an existing group, provides explicit control over predicated work-item execution in diverged control flow. The fourth group type, a scoped partition of an existing group, is intended to permit SYCL implementations to expand the group hierarchy without breaking user code, using an interface inspired by sub-device creation. The fifth group type, a tangle, represents all work-items in the same control flow and introduces convergence guarantees aligned with the expectations of many developers. The final group type, an opportunistic group, represents the set of work-items that an implementation guarantees to be executing together. Collectively, these six group types significantly expand the use cases that can be addressed by SYCL’s group abstraction.

We also explore options for supporting generic programming with groups of work-items, with an aim to unify a developer’s view of work-groups, sub-groups, and all our proposed groups using C++20 concepts. Such unification is necessary to ensure that SYCL’s group functions and algorithms can expand support to new group types. Aiming for unification should also help to ensure that the designs of any future group types are compatible with existing SYCL 2020 conventions and developer expectations.
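
As a rough illustration of what such unification could look like, the common group interface from SYCL 2020 can already be captured as a C++20 concept (a hypothetical sketch, not the proposal’s actual wording):

    #include <concepts>
    #include <cstddef>
    #include <sycl/sycl.hpp>

    // Hypothetical sketch of a C++20 concept over SYCL's common group
    // interface; work-groups and sub-groups both satisfy it, and future group
    // types presumably would too. Not the proposal's actual definition.
    template <typename G>
    concept GroupLike = requires(G g) {
        typename G::id_type;
        { g.get_group_linear_id() } -> std::convertible_to<std::size_t>;
        { g.get_local_linear_id() } -> std::convertible_to<std::size_t>;
        { g.get_local_linear_range() } -> std::convertible_to<std::size_t>;
        { g.leader() } -> std::convertible_to<bool>;
    };

    // A group algorithm constrained generically rather than per group type.
    template <GroupLike G, typename T>
    T sum_over(G g, T v) {
        return sycl::reduce_over_group(g, v, sycl::plus<T>());
    }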

We conclude with a discussion of problems that remain unsolved and that will need to be addressed in future work, including difficulties related to exposing coordination and communication primitives for arbitrary (and user-defined) collections of work-items.

Video Recording – Pending | View Slides
Using SYCL Joint Matrix Extension for Fast and Portable Matrix Operations [9394]
Authors: Bing Yu, James Brodman, Dounia Khaldi, Dmitry Sidorov, Mateusz Belicki and Yury Plyakhin, Intel.
12:00
Joint Matrix is a new SYCL extension for matrix hardware programming that unifies targets such as Intel Advanced Matrix Extensions (Intel AMX), Intel Xe Matrix Extensions (Intel XMX), NVIDIA Tensor Cores, and AMD Matrix Cores. Matrix operations are at the core of generic Machine Learning (ML) frameworks like TensorFlow and libraries such as the oneAPI Deep Neural Network Library (oneDNN), which rely heavily upon matrix hardware acceleration; these software platforms are the go-to solutions for many users and applications that want high performance from such hardware. However, for users who need to build their own neural network applications, these libraries and frameworks become (1) too high-level, since users cannot add custom-made optimizations, and (2) too heavyweight, since the size of these libraries is large. Moreover, new operations are often introduced in the ML domain for which such frameworks and libraries do not provide timely and efficient solutions. When users introduce a new operation, e.g., a new activation function that is not handled by these libraries and frameworks, they have little to no choice in how to experiment with it and must accept the performance they get, because there is no convenient mechanism to exert direct control over the implementation of such operations on currently supported hardware. Users looking for more performance currently resort to assembly or intrinsics that are target-specific, time-consuming to learn, difficult to experiment with, and error prone. For these reasons, APIs are needed to write custom workload-specific optimizations, and this is where Joint Matrix can help.

In this work, we introduce a middle-level abstraction that guarantees one-to-one mapping to typical intrinsics for maximum performance while allowing higher abstraction of the hardware specifics to ensure portability and productivity. This solution, Joint Matrix, is a new SYCL extension that enables users to write their code once and run it everywhere, and faster. CUDA wmma fragments and Vulkan cooperative matrices are similar to the SYCL joint_matrix extension but are specific to NVIDIA Tensor Cores. Joint Matrix is portable across different CPUs and GPUs from multiple vendors. Moreover, Joint Matrix has a lower level of abstraction than the ML frameworks and libraries, enabling it to provide performance, productivity, and fusion capabilities while at the same time offering portability by using one single code to target different matrix hardware. Intel AMX and Intel XMX are two examples of such matrix hardware units for which we showcase the application of Joint Matrix. For code generation and optimization, we extend both LLVM IR and SPIR-V with matrix abstractions. The initial performance evaluation, validating the SYCL extensions we developed on Intel AMX and Intel XMX, shows results comparable to those based on the oneAPI Deep Neural Network Library (oneDNN) General Matrix Multiplication (GEMM) library function, which is highly optimized and written in assembly.

This technical presentation focuses on the following topics:

  • the design and development of the new SYCL matrix extension, Joint Matrix, for programming matrix hardware, including an API for both matrix multiply and element-wise operations for pre- and post-processing fusion;
  • the compilation and code generation of Joint Matrix-based programs for Intel AMX and Intel XMX hardware;
  • the application and validation of this extension using the GEMM benchmark and the ability to fuse kernels such as GEMM and GELU.
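
The API shape, per the DPC++ experimental extension, is roughly as follows (a hedged sketch; the interface has evolved across revisions, so treat the exact signatures as provisional rather than authoritative):

    #include <sycl/sycl.hpp>
    namespace xmx = sycl::ext::oneapi::experimental::matrix;

    // Hedged sketch of the joint_matrix pattern: one sub-group cooperatively
    // computes an 8x16 (half) x 16x16 (half) -> 8x16 (float) tile. Signatures
    // follow one revision of the DPC++ experimental extension and may differ
    // in others (earlier revisions returned the mad result, for example).
    void tile_mad(sycl::nd_item<2> it, sycl::half* A, sycl::half* B, float* C) {
        auto sg = it.get_sub_group();
        xmx::joint_matrix<sycl::sub_group, sycl::half, xmx::use::a,
                          8, 16, xmx::layout::row_major> a;
        xmx::joint_matrix<sycl::sub_group, sycl::half, xmx::use::b,
                          16, 16, xmx::layout::row_major> b;
        xmx::joint_matrix<sycl::sub_group, float, xmx::use::accumulator, 8, 16> c;

        auto pA = sycl::address_space_cast<sycl::access::address_space::global_space,
                                           sycl::access::decorated::no>(A);
        auto pB = sycl::address_space_cast<sycl::access::address_space::global_space,
                                           sycl::access::decorated::no>(B);
        auto pC = sycl::address_space_cast<sycl::access::address_space::global_space,
                                           sycl::access::decorated::no>(C);

        xmx::joint_matrix_fill(sg, c, 0.0f);
        xmx::joint_matrix_load(sg, a, pA, 16);   // stride = row length
        xmx::joint_matrix_load(sg, b, pB, 16);
        xmx::joint_matrix_mad(sg, c, a, b, c);   // c = a * b + c
        xmx::joint_matrix_store(sg, c, pC, 16, xmx::layout::row_major);
    }
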
Video Recording – Pending | View Slides
SYCL Properties with Compile-Time Information [2778]
Authors: Steffen Larsen, Ben Ashbaugh, James Brodman, Michael Kinsner, Greg Lueck, John Pennycook and Roland Schulz, Intel and Gordon Brown, Codeplay Software.
12:30

In this presentation we give an overview of a SYCL extension introducing property lists with compile-time information about the contained properties. Additionally, we present a selection of extensions that use these properties, both to give alternatives to existing SYCL 2020 features as well as introducing new functionality.

In contrast to existing SYCL property lists, this compile-time properties extension allows SYCL programmers to encode the presence of the contained properties as part of the property list’s type. This allows the existence of properties in a property list to be queried during compilation of a SYCL application rather than during execution. Additionally, properties usable in these new property lists can contain data that can be queried at either the time of compilation or the time of execution.

Where the strength of these new property lists comes into play is not only in allowing static assertions on invalid properties being applied to objects, but also in how the compile-time information can be used to customize SYCL objects, kernel functions, kernel arguments, etc. Since it relies fully on C++ templates, the ways in which these customizations can be applied include, but are not limited to, SFINAE on SYCL object member functions and specializations of SYCL objects based on the properties in the supplied property list. DPC++ additionally uses this compile-time available information to simplify the flow of information through the compiler, making it easier for the implementation developers to add new properties that have special meaning in intermediate kernel code.

Currently, SYCL 2020 allows programmers to specify a selection of traits on kernels such as required group and sub-group sizes for the kernel, which aspects are required, as well as hints about vector computational width and appropriate work-group sizes. These are applied using SYCL-specific C++ attributes, meaning the compiler needs to be taught about these new attributes to be able to honor them. This requirement becomes problematic when compiling SYCL code with host compilers that do not know these attributes. However, using an extension to the command submission interfaces and shortcuts allowing them to take compile-time properties as an additional argument, these kernel traits can be expressed using properties with the kernel traits encoded as new properties with compile-time information. By doing this, kernel traits no longer need to be expressed through an extension of C++ but can instead be specified using regular C++.

Not only do these properties allow for regular C++ alternatives to existing SYCL 2020 features, but they also enable a plethora of new extension features. One of these new extension features adds the ability for SYCL programmers to annotate kernel arguments. These annotations are tied to kernel arguments through a wrapper class, associating properties with an underlying object. This wrapper class can then be captured by the kernels, allowing the programmer to communicate additional information about these arguments to the compiler through the properties in the property list. An example of such additional information is the “restrict” property, which informs the compiler that a given annotated pointer kernel argument does not alias any other kernel arguments, following the semantics of the “restrict” keyword in C99.

In conclusion, the compile-time property list extension presented here offers a flexible and extendible way of enabling SYCL programmers to apply properties to SYCL objects, kernel functions, kernel arguments, etc. Not only creating alternative implementations of existing SYCL 2020 features, but also giving rise to further extensions to SYCL 2020. This compile-time property list extension and extensions building upon it are available as experimental extensions in DPC++.
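
As shipped experimentally in DPC++, the flavor is roughly the following (a hedged sketch; spellings vary between extension revisions):

    #include <sycl/sycl.hpp>
    namespace syclex = sycl::ext::oneapi::experimental;

    // Hedged sketch of the DPC++ experimental compile-time properties: the
    // required work-group size is an ordinary C++ property object passed at
    // submission, not a SYCL-specific attribute the host compiler must know.
    void launch(sycl::queue& q, float* data, size_t n) {
        syclex::properties props{syclex::work_group_size<128>};
        q.parallel_for(sycl::nd_range<1>(sycl::range<1>(n), sycl::range<1>(128)),
                       props,
                       [=](sycl::nd_item<1> it) {
            data[it.get_global_id(0)] *= 2.0f;
        }).wait();
    }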

Video Recording – Pending | View Slides
Lunch and Networking
1:00pm – 2:00pm
Events Events Events [2787]
Authors: James Brodman, Ben Ashbaugh, Michael Kinsner, Steffen Larsen, Greg Lueck, John Pennycook and Roland Schulz, Intel Corporation and Gordon Brown, Codeplay Software.
14:00

The SYCL programming model supports expressing complex dependences through a directed acyclic graph, or DAG. Graphs in SYCL may be expressed in two ways: either through declaring data dependences using accessor objects or through declaring task dependences using event objects. Events in SYCL seem like a simple abstraction at first glance, but a closer inspection reveals a depth of complexity and nuance that programmers may not immediately appreciate. This talk will examine these issues as well as present ideas for new mechanisms in SYCL focused on events that address feature requests and concerns from programmers. Earlier proposed extensions such as the discard_events property in DPC++ or coarse-grained events in AdaptiveCpp attempted to address these issues by altering the semantics of event objects instead of eliminating unwanted events entirely.

The classic use of events in SYCL is for synchronization. Events form the backbone for building task-based DAGs when using out-of-order queues or multiple queues. Command submissions return events that future commands may depend on. This mechanism is essential when using out-of-order queues or synchronizing commands on multiple SYCL queues. Additionally, events are one way that host code may synchronize with device code without waiting on every command in a queue individually. However, events may seem superfluous for applications that primarily use an in-order queue where the primary method of synchronization is intrinsic to the semantics of the queue itself. Such applications are often encountered when porting codes to SYCL from other accelerator programming models.

While events are a powerful abstraction, the current semantics of SYCL require them to be created for every command submission. While useful, event creation is not free, and the cost may be backend-dependent. Event creation can lead to unnecessary and unexpected host overheads for applications that do not require their use to ensure correct program execution, such as applications that use in-order queues or applications which do not use the events provided when commands are submitted. A SYCL backend implementation may wish to elide the creation and usage of native events. However, if the SYCL interface always returns an event, then they must be created without knowing whether the user will use them or not. This talk will propose adding new launch mechanisms to SYCL that add several capabilities, including managing whether an event is always created upon command submission using an opt-in capability.

Events may also carry profiling information if the event comes from a command submission to a queue that was created with the `enable_profiling` property. Event profiling is a useful feature that allows programmers to query when commands are submitted, begin execution, and complete. However, this feature also has a cost, and applications may not require this information for every command submitted to a queue. Queues also lack flexibility to toggle this behavior on and off during runtime, only accepting the property at queue creation. This talk will propose new APIs to support more fine-grained event profiling that allow applications to measure only those commands of interest.
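
For reference, today’s queue-level mechanism looks like this (standard SYCL 2020; a minimal sketch):

    #include <iostream>
    #include <sycl/sycl.hpp>

    // Standard SYCL 2020 event profiling: opt-in is per queue, at creation.
    // The talk proposes finer-grained, per-command control instead.
    int main() {
        sycl::queue q{sycl::property::queue::enable_profiling{}};
        constexpr size_t N = 1 << 20;
        float* p = sycl::malloc_shared<float>(N, q);
        sycl::event e = q.parallel_for(sycl::range<1>(N),
                                       [=](sycl::id<1> i) { p[i] = float(i); });
        e.wait();
        auto t0 = e.get_profiling_info<sycl::info::event_profiling::command_start>();
        auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
        std::cout << "kernel took " << (t1 - t0) << " ns\n";
        sycl::free(p, q);
    }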

Events are a key part of the SYCL programming model that power many different capabilities including task scheduling and command profiling. However, application experience has shown that events can introduce undesired and unnecessary overheads. Additionally, not all applications wish to use the full DAG capabilities of out-of-order queues, preferring the simpler semantics of in-order queues. We believe that the new extensions presented in this talk grant programmers the flexibility to use the model that fits their applications best while only incurring the performance penalties introduced by events when desired or necessary. Providing the right abstractions for the job will ensure the long-term viability and success of SYCL.

Video Recording – Pending | View Slides
SYCL Bindless Images [9301]
Authors: Sean Stirling, Przemyslaw Malon, Isaac Ault, Duncan Brawley, Chedy Najjar and Peter Žužek, Codeplay Software.
14:30

SYCL has proven lately to be a very useful standard for heterogeneous computing, being used in both embedded hardware as well as in High Performance Computing. One area where the standard is a bit lacking, though, is support for processing images, which hasn’t changed much from the original design that reflected OpenCL images. To address this disparity, our work focuses on the development and implementation of Bindless Images, a significant update to SYCL images, aimed at aligning image support more closely with contemporary graphics APIs.

During this presentation we will introduce images, touching on how they’re used and then the areas that SYCL images fall short. Next, we will introduce our work on Bindless Images, provide a comparative analysis on the differences with respect to SYCL 2020 images, and highlight the interoperability support they provide. We’ll finish by outlining real-world use cases, their impact on the project’s design and future, and finally the future work we have planned.

We start off with an introduction into the importance of images, how and where they’re used, and explore a bit of the history of bindless textures.

We continue with a deep dive into why SYCL images are insufficient, talking about lackluster flexibility, tenuous control, feature sparsity, and difficulties translating other image APIs into SYCL.

Then comes the main part, which introduces our work on Bindless Images as a means to provide developers with an environment that is both more flexible and versatile. Bindless Images are currently an experimental DPC++ vendor extension, available in the open source Intel LLVM repository. At the moment they are only available for the CUDA backends, with certain minor limitations.

A comparative analysis is presented, shedding light on the limitations of the existing SYCL 2020 images and showcasing the newfound flexibility offered by Bindless Images. This includes, but is not limited to, being able to separate image memory allocation from the actual image, which among other things enables using existing Unified Shared Memory (USM) allocations as image memory.

An important aspect of Bindless Images is the ability to provide interoperability with other graphics APIs. SYCL 2020 interoperability support is limited to APIs that are also used as a SYCL backend, so-called SYCL backend interop. However, SYCL doesn’t yet have a way to interop with non-backend APIs. In our work we present the current focus of interoperability with Vulkan and DirectX 12.

The interoperability section covers not only being able to import image memory from other graphics APIs, but also the importing of synchronization primitives like semaphores.

There are ongoing efforts to improve Bindless Images for real-world use cases, such as integrating them into the SYCL backend for Blender, and providing support for automatic translation of CUDA images to Bindless Images via SYCLomatic. We discuss a bit about these efforts and how they’re influencing the future direction of the project.

Lastly, we further discuss future work, which includes adding support for more backends and preparing an official Khronos extension.

Through this presentation, we aim to highlight the transformative impact of Bindless Images on the SYCL standard, marking a significant step toward empowering developers with a more adaptable and feature-rich environment for heterogeneous computing. As we delve into each aspect, attendees will gain insights into the intricacies of Bindless Images, fostering a deeper understanding of its implications for the SYCL programming model and its potential for bringing flexible image manipulation across a range of devices in a single standard. This research not only addresses existing limitations but also sets the stage for continued collaboration and innovation in the realm of heterogeneous computing.

Video Recording – Pending | View Slides
An Online Compiler for SYCL Kernels and Some Related Ideas [3019]
Authors: Greg Lueck, Ben Ashbaugh, James Brodman, Michael Kinsner, Steffen Larsen, John Pennycook and Roland Schulz, Intel and Gordon Brown, Codeplay Software.
15:00
show / hide abstract

This technical presentation will describe two related extensions to the SYCL language that DPC++ is developing. The main motivation is to provide the ability to "online" compile (i.e., just-in-time compile) a SYCL kernel from a string while the application runs. OpenCL provides the ability to online compile a kernel from an OpenCL C string, and CUDA provides a similar feature with its NVRTC library. DPC++ users have been requesting similar functionality.

SYCL's specialization constants provide a similar feature, but they are more limited than online kernel compilation. Specialization constants allow an application to tune values used within a kernel, but they do not provide a good way to dynamically change the algorithm a kernel uses. Online compilation can therefore provide a benefit beyond specialization constants: applications can dynamically construct a string that defines a kernel, customizing it according to the input data set or the features available on the target device.

The online compiler requires a change in the way parameters are passed to a SYCL kernel. For example, capturing variables through a lambda expression no longer makes sense when the kernel is defined at runtime. Instead, the application must set kernel argument values explicitly via an API like "set_arg", where each argument is identified by its index in the parameter list. This is problematic for variables captured through a lambda expression, because C++ does not define an order for the captures. This led us to propose a new syntax for defining a SYCL kernel as a plain function, where each function argument corresponds to one kernel argument. The application can thus dynamically create a string that defines a function; once the string is online compiled, the application invokes the kernel by explicitly setting values for each of the function arguments.
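As an illustration of the proposed workflow, the hedged sketch below follows the shape of the in-progress DPC++ kernel-compiler extension; names such as create_kernel_bundle_from_source and ext_oneapi_get_kernel are taken from the draft and may change between revisions.

#include <string>
#include <sycl/sycl.hpp>
namespace syclexp = sycl::ext::oneapi::experimental;

void run(sycl::queue &q, float *buf, size_t n, float factor) {
  // The kernel text can be assembled at runtime, e.g. to swap algorithms
  // based on the input data set (here, an OpenCL C kernel for brevity).
  std::string src = R"(
    __kernel void scale(__global float *data, float factor) {
      data[get_global_id(0)] *= factor;
    })";
  auto kb_src = syclexp::create_kernel_bundle_from_source(
      q.get_context(), syclexp::source_language::opencl, src);
  auto kb_exe = syclexp::build(kb_src);  // online (just-in-time) compile
  sycl::kernel k = kb_exe.ext_oneapi_get_kernel("scale");
  q.submit([&](sycl::handler &h) {
    h.set_args(buf, factor);             // arguments set by position
    h.parallel_for(sycl::range<1>(n), k);
  }).wait();
}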

This new kernel syntax led us to propose a second SYCL extension – free function kernels. If the online compiler allows the user to define a kernel as a plain function, why not allow this syntax even for SYCL applications that do not use the online compiler? We think this syntax may be more familiar to users migrating from CUDA, so it may help attract new users from the CUDA ecosystem. This syntax raises some interesting questions, though. For example, how does the kernel get the invocation ID if the function arguments are the kernel arguments? How does the language distinguish between “range” kernels, “nd-range” kernels, and “single-task” kernels? Our presentation explains how we solved these problems.
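One possible shape of such a free-function kernel is sketched below; the property macro and the invocation-ID query are modeled on the draft extension and may differ in the final version.

#include <sycl/sycl.hpp>
namespace syclexp = sycl::ext::oneapi::experimental;

// The property declares up front what kind of kernel this is
// (an "nd-range" kernel, 1-dimensional), answering the dispatch
// question raised above.
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((syclexp::nd_range_kernel<1>))
void scale(float *data, float factor) {
  // No nd_item parameter: every function argument is a kernel argument,
  // so the invocation ID comes from an in-kernel query instead.
  size_t i = syclexp::this_work_item::get_nd_item<1>().get_global_id(0);
  data[i] *= factor;
}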

We also think the online compiler can be extended to other languages, providing new opportunities for interoperability between SYCL and other languages. For example, a SYCL application could dynamically create a string that defines a kernel, where the string uses CUDA syntax or OpenCL C syntax instead of SYCL syntax. We think this could also help users migrate from CUDA. For example, a CUDA application using the NVRTC library could migrate the “main” part of the application to SYCL while still dynamically generating a string that defines kernels in CUDA (for use on backends that support CUDA).

Another possibility is to allow the application to dynamically load a kernel using SPIR-V. For example, the application could read a file defining a SPIR-V module and then invoke kernels from that module using the same APIs we propose for online kernel compilation. We think this would enable new types of interoperability in SYCL. A user could define a kernel in any language that can be compiled to SPIR-V and then invoke the kernel from a SYCL application. Ninja users could even hand-craft kernels by coding them directly in SPIR-V.

These extensions are in early stages of development in DPC++, so the goal of our presentation is to raise awareness of the idea and gather community feedback.

Video Recording – Pending | View Slides
SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance [8509]
Authors: Fabian Knorr and Peter Thoman, University of Innsbruck, and Luigi Crisci, University of Salerno
15:30
show / hide abstract
The open SYCL standard has established itself as a cross-vendor, cross-platform means of developing software that benefits from GPU and accelerator parallelism. Inherent difficulties remain, however, in the portability and debuggability of programs for these targets.

As we demonstrate, the SYCL specification lends itself to being implemented purely in software, in a manner that is accessible to debuggers and that can be employed to simulate the characteristics of various hardware targets. We introduce SimSYCL, a novel library-only SYCL implementation with extensive simulation and verification capabilities. By executing all SYCL commands synchronously on the host CPU, it is able to diagnose various manifestations of undefined behavior within kernels, and it grants developers the ability to step into kernels with an ordinary debugger to discover other logic errors. We demonstrate that the reduced complexity of this approach, combined with an implementation focus on fast compilation, considerably speeds up the edit-compile-debug cycle compared to other SYCL implementations, while maintaining reasonable runtime performance. Furthermore, we show how SimSYCL's simulation capabilities allow unit-testing user code for cross-platform portability, and that its comprehensive validation detects and reports several classes of user errors which remain undiagnosed by performance-focused implementations.
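The workflow the authors describe can be pictured with an ordinary SYCL program: under a library-only implementation such as SimSYCL, the kernel below runs synchronously on the host, so a breakpoint placed inside the lambda is hit like any other C++ function (a minimal sketch, not SimSYCL-specific code).

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q;
  std::vector<int> data(16, 1);
  {
    sycl::buffer<int> buf(data.data(), sycl::range<1>(16));
    q.submit([&](sycl::handler &h) {
      auto acc = buf.get_access<sycl::access_mode::read_write>(h);
      h.parallel_for(sycl::range<1>(16), [=](sycl::id<1> i) {
        acc[i] += 1;  // step in here with gdb/lldb when running on SimSYCL
      });
    });
  }  // buffer destructor writes results back to data
  return data[0] == 2 ? 0 : 1;
}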
Video Recording – Pending | View Slides
Coffee Break and Networking
16:00 – 16:30
Enabling RAJA on Intel GPUs with SYCL [7615]
Authors: Brian Homerding, Argonne National Laboratory, and Arturo Vargas, Tom Scogland, Robert Chen, Mike Davis and Rich Hornung, Lawrence Livermore National Laboratory.
16:30
show / hide abstract

To achieve high performance, modern HPC systems take advantage of heterogeneous GPU architectures. Often these GPUs are utilized through a vendor-preferred parallel programming model. Unfortunately, this results in application code that is not portable across vendors.

To address this issue, open parallel programming models have been introduced. One such model is provided by the RAJA Portability Suite. RAJA is a portability layer that provides an abstract application-developer API as a library through modern C++. In RAJA, computational kernels are lowered to a backend language appropriate for the target architecture. Additionally, RAJA is designed to give the application developer control over kernel execution while minimizing modification to the application source code.

In this work, we describe the SYCL backend implementation in RAJA for Intel GPU execution. We discuss the implementation of various features in the SYCL backend along with the challenges and lessons learned. Finally, we investigate the performance impact of executing several HPC kernels through RAJA when compared to direct SYCL implementations.
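For readers unfamiliar with RAJA, the pattern looks like the hedged sketch below; the sycl_exec policy name follows RAJA's backend naming convention (cuda_exec, hip_exec, ...), and the exact spelling and template parameters should be checked against the RAJA documentation.

#include "RAJA/RAJA.hpp"

// daxpy through RAJA: the kernel body is backend-agnostic, and the
// execution policy selects the SYCL backend with a work-group size of 256.
void daxpy(double *x, double *y, double a, int N) {
  RAJA::forall<RAJA::sycl_exec<256>>(RAJA::RangeSegment(0, N),
    [=](int i) {
      y[i] += a * x[i];
    });
}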

Video Recording – Pending | View Slides
Experiences with Implementing Kokkos’ SYCL Backend [0702]
Authors: Daniel Arndt, Damien Lebrun-Grandie, Oak Ridge National Laboratory and Christian Trott, Sandia National Laboratories.
17:00
show / hide abstract
With the recent diversification of the hardware landscape in the high-performance computing community, performance-portability solutions are becoming more and more important. One of the most popular choices is Kokkos. In this paper, we describe how Kokkos maps to SYCL 2020, how SYCL had to evolve to enable a full Kokkos implementation, and where we still rely on extensions provided by Intel's oneAPI implementation. Furthermore, we describe how applications can use Kokkos and its ecosystem to explore upcoming C++ features even when using the SYCL backend. Finally, we provide performance benchmarks comparing native SYCL and Kokkos, and discuss hierarchical parallelism in the SYCL 2020 interface.
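The portability claim is easiest to see in code: the minimal Kokkos example below compiles unchanged against the SYCL backend (the execution space is selected at configure time), which is the property the paper examines in detail. A minimal sketch, not code from the paper.

#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1 << 20;
    Kokkos::View<double *> x("x", N), y("y", N);  // device allocations
    Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
      y(i) += 2.0 * x(i);  // same body runs on CUDA, HIP, SYCL, OpenMP...
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
}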
Video Recording – Pending | View Slides
Workshop Closes
18:30

Posters

The posters will be on display during the breaks and lunch on Wednesday and Thursday.

Unified AddressSanitizer Framework for SYCL Kernel on CPU and GPU Device [8015]
Authors: Yang Zhao, Wenju He, Yingcong Wu, Maosu Zhao, Ge Jin and Chunyang Dai, Intel
show / hide abstract

In response to the increasing demand for incorporating LLVM sanitizers into heterogeneous computing environments, this poster outlines the development of a unified sanitizer framework for SYCL kernel code. The focus of this framework is on detecting memory access errors, memory leaks, uninitialized memory usage, undefined behavior, and multi-threading issues in SYCL kernel code.

The challenge lies in designing a framework that caters to the diverse features supported by each device. For example, CPU kernel code shares the same address space with host code, while GPU kernel code has a separate address space distinct from host code, with additional complexities introduced by the GPU's memory hierarchy. Directly migrating LLVM sanitizers to GPU devices becomes intricate due to these variations.

Our ongoing implementation of a unified framework leverages AddressSanitizer (ASan), a runtime memory error detector for C/C++, specifically tailored for SYCL kernels. Extending ASan's instrumentation pass to SPIR-V IR facilitates compatibility with the latest LLVM in the SYCL frontend compiler and mitigates differences in backend compilers. Additionally, a new layer named the Sanitizer Layer is added to the Unified Runtime (UR); it is implemented with standard UR APIs and provides common ASan runtime support for UR adapters.

Currently, the implementation includes the detection of out-of-bounds errors on Unified Shared Memory (USM) and static local memory for both OpenCL CPU and Level-Zero backends. Key steps involve extending the AddressSanitizer pass from LLVM to instrument SYCL kernels, enabling the sanitizer layer in the UR loader, and implementing memory access checks based on shadow memory within kernel code.
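The shadow-memory check at the heart of this scheme can be sketched conceptually as below; the real instrumentation operates on LLVM IR / SPIR-V rather than C++ source, and the shadow offset shown is the x86-64 host value, used here only for illustration.

#include <cstddef>
#include <cstdint>

constexpr uintptr_t kShadowOffset = 0x7fff8000;  // illustrative mapping
constexpr unsigned kShadowScale = 3;  // 8 application bytes per shadow byte

inline bool is_poisoned(uintptr_t addr, size_t access_size /* <= 8 */) {
  int8_t shadow = *reinterpret_cast<const int8_t *>(
      (addr >> kShadowScale) + kShadowOffset);
  if (shadow == 0) return false;  // whole 8-byte granule is addressable
  // Non-zero shadow: a redzone (negative) or a partially addressable
  // granule (1..7 = number of valid leading bytes).
  int8_t last_byte = static_cast<int8_t>((addr & 7) + access_size - 1);
  return last_byte >= shadow;
}
// Before every load/store in the kernel, the pass effectively emits:
//   if (is_poisoned(addr, size)) report_out_of_bounds(addr, ...);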

Upon execution, the framework reports detailed error information, including work-item ID, kernel name, source code line details, and data access details. The partial open-source release in the GitHub intel/llvm repository demonstrates the capability to identify out-of-bounds memory access errors in SYCL kernel code.

The presented solution significantly reduces the effort required for supporting diverse offload scenarios and empowers users with comprehensive error information for efficient debugging and resolution.

Moreover, this solution can easily be extended to accommodate other SYCL backends such as CUDA and HIP, and can also streamline support for OpenMP offload. Furthermore, it provides a foundation for incorporating additional sanitizers such as MemorySanitizer and ThreadSanitizer.

Smoothing the Migration from CUDA to SYCL: SYCLomatic Utility Features [7810]
Authors: Ziran Zhang, Zhiming Wang, Chenwei Sun and Andy Huang, Intel
show / hide abstract

SYCLomatic is an open-source tool that assists customers in migrating existing heterogeneous computing projects written in CUDA to SYCL. Typically, 90%-95% of CUDA code migrates to SYCL code automatically, and inline comments are provided to help developers complete and tune the code.

When migrating a CUDA project, developers can leverage SYCLomatic with the workflow in Fig.1. Given a compilation database, SYCLomatic can automatically migrate most of the CUDA code, including CUDA APIs, CUDA types, and kernels, to SYCL code. Inline comments are inserted into the migrated source code to flag imperfect migrations, optimization opportunities, or unsupported CUDA APIs; developers may need to address these comments to complete the migration or improve the run-time performance of the migrated code. Besides migrating the CUDA source code, SYCLomatic can generate Makefiles for the migrated project using the information in the compilation database, which saves substantial manual effort in building the migrated project. After migration, the SYCL project can be built with SYCL compilers[3] and run on multiple devices: CPUs, GPUs, FPGAs, and other accelerators.

Beyond code migration, SYCLomatic has introduced further utility features to improve the user experience: evaluating the engineering effort a migration requires (Analysis Mode), automatically migrating build scripts (CMake script migration), and debugging/verifying the migrated code (CodePin).
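For orientation, the following before/after pair illustrates the kind of mechanical translation involved (hand-simplified; real SYCLomatic output routes through its dpct helper headers and inserts the inline comments described above).

// --- CUDA input ---
// __global__ void vadd(const float *a, const float *b, float *c) {
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
//   c[i] = a[i] + b[i];
// }
// vadd<<<blocks, threads>>>(a, b, c);

// --- SYCL output (simplified) ---
#include <sycl/sycl.hpp>

void vadd(const float *a, const float *b, float *c, sycl::nd_item<1> item) {
  int i = item.get_global_id(0);  // replaces blockIdx/blockDim/threadIdx
  c[i] = a[i] + b[i];
}

void launch(sycl::queue &q, const float *a, const float *b, float *c,
            size_t blocks, size_t threads) {
  q.parallel_for(sycl::nd_range<1>(blocks * threads, threads),
                 [=](sycl::nd_item<1> item) { vadd(a, b, c, item); });
}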

Acceleration of Quantum Transport Simulations with OpenCL [6158]
Authors: Yosang Jeong and Hoon Ryu, Korea Institute of Science and Technology Information
show / hide abstract
The Recursive Green's Function (RGF) is a subset of methods for calculating the Non-Equilibrium Green's Function (NEGF), which is popularly used for simulations of quantum transport in nanoscale devices. The computation of RGF involves repeated multiplication of dense, complex matrix blocks as illustrated in Figure 1 (particularly Step 3), so enhancing the corresponding performance is critical to accelerating NEGF simulations. This work introduces acceleration techniques for RGF computations using GPUs and validates the practicality of the suggested techniques with solid sets of benchmark results. The code was developed with OpenCL and CLBlast, using an in-house tight-binding simulation code package as a baseline. Performance validation has been conducted with the test cases summarized in Table 1. As Figure 2 shows, our code modernization, conducted with OpenCL and CLBlast ZGEMM against NVIDIA and AMD GPU devices, can speed up the whole RGF computation and its hotspot by up to 18.63 and 48.73 times, respectively, compared to the case where the computation is done on the host with hybrid parallelization based on MPI-OpenMP.
Optimization of Fast Fourier Transform (FFT) for Qualcomm Adreno GPU [1048]
Authors: Skyler Szot, Hongqiang Wang and Alexander Angus, Qualcomm
show / hide abstract

The Fast Fourier Transform (FFT) is a widely used algorithm in digital signal processing. The FFT computes the discrete Fourier transform (DFT) of a sequence, converting from the temporal or spatial domain to the frequency domain. The DFT is useful for many signal processing applications, but computing it directly from the definition is too slow to be practical. An FFT algorithm reduces the complexity from O(N^2) to O(N log N), where N is the data size.

This work describes how to accelerate the FFT algorithm for Qualcomm’s Adreno GPUs using OpenCL. We discuss optimization of one-dimensional FFT implementations such as the Cooley-Tukey FFT and Stockham FFT.
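For orientation, a reference radix-2 Cooley-Tukey in plain C++ is sketched below (recursive, N a power of two); GPU implementations such as those discussed replace the recursion with batched, stage-by-stage kernels, but the O(N log N) recurrence T(N) = 2T(N/2) + O(N) is the same.

#include <cmath>
#include <complex>
#include <vector>

void fft(std::vector<std::complex<double>> &a) {
  const size_t n = a.size();
  if (n <= 1) return;
  std::vector<std::complex<double>> even(n / 2), odd(n / 2);
  for (size_t i = 0; i < n / 2; ++i) {
    even[i] = a[2 * i];      // even-indexed subsequence
    odd[i] = a[2 * i + 1];   // odd-indexed subsequence
  }
  fft(even);  // two half-size DFTs
  fft(odd);
  const double pi = std::acos(-1.0);
  for (size_t k = 0; k < n / 2; ++k) {
    // The twiddle factor e^(-2*pi*i*k/n) combines the half-size results.
    std::complex<double> t =
        std::polar(1.0, -2.0 * pi * double(k) / double(n)) * odd[k];
    a[k] = even[k] + t;
    a[k + n / 2] = even[k] - t;
  }
}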

Ray Tracer Based Lidar Simulation Using SYCL [7219]
Authors: Peizhao Qiu and Danial Chitnis, University of Edinburgh
show / hide abstract
Lidar plays an increasingly important role in the world [6]. By providing distance information about the surrounding environment to the perception system, Lidar can significantly enhance and correct the actions and decisions made by the underlying control or artificial intelligence system. However, empirical experimental data alone cannot supply the required insight into the propagation of light at the individual photon level. Simulations can effectively generate the entire path of photons from a laser source to a photon receiver, aiding researchers in understanding light propagation more thoroughly [1]. Ray tracing is a common method used in these simulations, treating light as a collection of photon particles moving linearly through space. This technique is grounded in the foundational principles of light behaviour and can produce high-fidelity, physics-based representations of photon paths. However, as ray tracing is based on the Monte Carlo algorithm, its computational speed is often constrained; in some extreme scenarios, millions or even billions of simulations may be required to achieve a single reliable result. Ray tracing algorithms are ideally suited to acceleration through parallel processing, since the propagation of each ray can be computed separately [2, 5]. This characteristic makes it an excellent choice for simulating Lidar photons. Additionally, ray tracing algorithms are fairly established in computer graphics; hence, efforts can be focused on parallelizing the Lidar simulation.
In the ray tracing algorithm, each ray is independent of the others and can be easily distributed across different computational units. This characteristic allows the algorithm to efficiently utilize all available computational resources. As a heterogeneous programming framework, SYCL is capable of compiling and executing on various hardware backends from a single source code, including parallel computing hardware like GPUs [4]. This heterogeneity makes ray tracing particularly compatible with Lidar simulation. Furthermore, SYCL's versatility as a heterogeneous framework makes it an ideal choice for developing scientific simulators, which often need to operate across diverse platforms from different vendors to maximize computational resources.
This work showcases the implementation of a ray tracer, which is integral to developing a Lidar simulator for design and performance-verification purposes. In this implementation, both the buffer/accessor model and the Unified Shared Memory (USM) model are employed for data management; USM is the preferred approach due to its more explicit method of memory access. Additionally, the USM implementation enhances code readability through its CUDA-style, pointer-based notation, which is more intuitive for researchers and engineers who may not have a strong background in computer science. Compared to other parallel programming frameworks, SYCL runs by default on the host device, significantly easing the development and debugging of SYCL programs. In contrast to CUDA, a SYCL implementation can execute on the machine's CPU, automatically harnessing the CPU's computational resources via multi-threading.
In this study, a comparison was conducted between a single-threaded C++ implementation and a SYCL implementation that offloads computations to different hardware (GPU and CPU). Both implementations utilize the same underlying functions for performing ray tracing. On our hardware setup (AMD Ryzen Threadripper 64-core, NVIDIA RTX 4090, and Intel GPU MAX 1100), the SYCL CPU implementation is approximately four times faster than the traditional C++ implementation. Furthermore, by offloading to a GPU, the same SYCL implementation achieves a performance boost of about 11 times compared to the initial C++ implementation. It should be noted that performance varies significantly with different hardware configurations.
Generally, SYCL is highly effective at leveraging the capabilities of modern CPUs, which possess many computational cores. Our implementation has provided valuable insights into the practicalities of ray tracing in SYCL. We found that while a basic form of ray tracing can be implemented relatively easily, SYCL's inherent limitations pose challenges for certain data structures and algorithms commonly used in ray tracing. For instance, due to the restriction on dynamic memory allocation within SYCL kernel code, our program operates within a pre-defined memory size. Consequently, certain assumptions were made, such as limiting the number of bounces a ray can make to 50. While this may introduce a theoretical bias, it is a reasonable compromise in practical scenarios, unlikely to significantly impact simulation accuracy, considering the diminishing energy of photons after multiple bounces. Another challenge was the inability to use recursion in SYCL kernel code, which required us to restructure part of the ray tracing method. For instance, an axis-aligned bounding box (AABB) [3] was utilized for our acceleration structure, storing objects in a binary tree without recursion. The tree's maximum depth was pre-defined at 50 levels, allowing for a theoretical maximum of 2^50 objects in a scene. Although SYCL's limitations impose theoretical upper bounds on the number of ray bounces and scene objects, these limits are sufficiently high to be negligible for most practical purposes.
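The recursion-free pattern described above looks roughly like the following sketch (Ray, the bounce step, and the energy cutoff are illustrative placeholders, not the authors' code):

#include <sycl/sycl.hpp>

struct Ray {
  sycl::float3 origin, dir;
  float energy;
};

void trace(sycl::queue &q, Ray *rays /* USM */, size_t n_rays) {
  constexpr int kMaxBounces = 50;  // fixed budget instead of recursion
  q.parallel_for(sycl::range<1>(n_rays), [=](sycl::id<1> i) {
    Ray r = rays[i];
    // An iterative loop replaces the recursive formulation: SYCL kernels
    // allow neither recursion nor dynamic allocation.
    for (int bounce = 0; bounce < kMaxBounces; ++bounce) {
      r.energy *= 0.5f;             // stand-in for one scatter/bounce step
      if (r.energy < 1e-3f) break;  // photon effectively absorbed
    }
    rays[i] = r;  // USM: plain pointer reads/writes, CUDA-style
  }).wait();
}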
Moving forward, we plan to apply this photon propagation framework in our next phase of Lidar simulation development. In the future, we will further explore the potential of our implementation and aim to overcome the current limitations, potentially leading to more versatile and robust simulation models.
CodePin: An Instrumentation-Based Debug Tool of SYCLomatic [2710]
Authors: Andy Huang, Zhiming Wang, Wenhui Ni, Sheng Chen and Nithin George, Intel.
show / hide abstract

SYCL is a royalty-free, cross-platform programming model for heterogeneous computing based on C++. Intel oneAPI[1] provides a SYCL compiler implementation and runtime library to support SYCL kernel-based programming and a set of optimized libraries to enable API-based programming.

SYCLomatic[2] is an open-source project to assist developers migrating existing code written in other programming languages into the SYCL C++ heterogeneous programming model. Currently, SYCLomatic supports source-to-source migration of applications in CUDA into SYCL source code that leverages SYCL interfaces and the optimized libraries provided by Intel oneAPI.
Due to differences in the software and hardware stacks targeted by the two languages, the run-time behavior of the migrated SYCL program can sometimes be inconsistent with the original CUDA version. Among other causes, this can be attributed to:

• Differences in arithmetic precision between hardware
• Semantic differences between the CUDA and SYCL APIs
• Issues caused by differences in data type implementation and size between CUDA and SYCL
• Errors introduced during the automatic migration

In such cases, identifying the points of divergence between the CUDA and the SYCL programs could be difficult, especially since there is no debugging facility to monitor and compare the execution status of the CUDA and the SYCL version simultaneously.

Normally, to debug inconsistent runtime behavior of an application, a user needs to add instrumentation to implement data checkpoints in both the CUDA and SYCL code. At each checkpoint, this instrumentation should perform the following steps:

• Synchronize unfinished device tasks
• Copy memory from device to host
• Dump the target objects into a format suitable for comparison

In general, the first two steps can be done with simple API calls. The major effort is in the third step, since the user needs to implement serialization functions to dump both the CUDA and SYCL objects, as sketched below.
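A hand-written analogue of these steps for the SYCL side might look like the minimal sketch below (the CUDA side mirrors it with cudaDeviceSynchronize and cudaMemcpy); it works for stream-printable types, and the per-class serializers it would otherwise need are exactly the effort CodePin automates.

#include <ostream>
#include <vector>
#include <sycl/sycl.hpp>

template <typename T>
void checkpoint(sycl::queue &q, const T *dev_ptr, size_t count,
                const char *tag, std::ostream &report) {
  q.wait();                                                  // 1. synchronize
  std::vector<T> host(count);
  q.memcpy(host.data(), dev_ptr, count * sizeof(T)).wait();  // 2. device-to-host
  report << tag;                                             // 3. dump for comparison
  for (const T &v : host) report << ' ' << v;
  report << '\n';
}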

To reduce the effort of manual instrumentation, we introduce CodePin, a sub-feature of SYCLomatic that automatically injects instrumentation to implement data checkpoints before and after specific API calls and kernel calls. Fig. 1 details the general workflow of using CodePin to perform program comparison. As shown in the figure, the user can use CodePin to generate instrumented CUDA code and the migrated, instrumented SYCL code with data checkpoints injected. When these programs are executed, the instrumentation dumps the values of data structures into report files at each checkpoint. The user can then compare the reports to identify the checkpoint where the behavior of the two programs starts to diverge.

As shown in Fig.2, if CodePin is enabled when migrating a CUDA project to SYCL, extra ASTMatchers are enabled to match selected CUDA APIs and all CUDA kernel calls. When the ASTMatchers match the selected API or kernel calls in the Abstract Syntax Tree (AST), instrumentation is generated and injected around these calls in both the CUDA and SYCL code. Furthermore, a static analysis function is invoked to analyze the memory layout of all the related CUDA classes. The result of the memory layout analysis, including the name, type, and offset of every field member, is saved in a schema file as JSON strings.

While the CUDA schema can be acquired by analyzing the information provided by the AST, obtaining the SYCL schema is trickier, since the AST of the migrated SYCL code is not available. CodePin maintains a list of schemas for native SYCL classes and constructs the SYCL schema of a user-defined class by adjusting the CUDA schema according to the type-migration logic in SYCLomatic.
Another challenge of dumping data is establishing the size of dynamically allocated objects. Since the size of such an allocation can only be determined at run time, CodePin adds instrumentation after each device malloc function, such as cudaMalloc(). This instrumentation records the pointer address and the size of the allocated memory in a global map. The data checkpoint API can then reference the map to retrieve the number of objects behind the pointer.
Fig.3 shows a usage example of CodePin generating the instrumented CUDA program and the migrated SYCL program from the original CUDA program. As highlighted, the original CUDA code hardcodes the size of "int3" as 12 in line 35, and this causes incorrect execution results in the migrated SYCL code because the size of "sycl::int3" is 16. To debug the issue, CodePin instruments the functions gen_prolog_API_CP() and gen_epilog_API_CP() to collect data checkpoints before and after the kernel call. After execution, the two data checkpoints are generated and compared as shown in Fig.4. The comparison reveals the divergence between the CUDA and SYCL versions at the highlighted checkpoint collected before the kernel call, which hints that the data passed into the kernel is incorrect.
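The size mismatch behind this example follows directly from the two types' layouts, as stated above: CUDA's int3 is three packed ints, while sycl::int3 is a three-element sycl::vec, which the SYCL specification sizes like a four-element one.

#include <sycl/sycl.hpp>

static_assert(sizeof(sycl::int3) == 16, "sycl::int3 is padded to 16 bytes");
// On the CUDA side, sizeof(int3) == 12, so a hardcoded element size of 12
// under-sizes every element in the migrated SYCL program.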

In conclusion, CodePin can reduce the debugging and verification effort when migrating CUDA programs to SYCL with SYCLomatic. Given CUDA source code, CodePin generates instrumented CUDA code as well as its migrated SYCL counterpart. When executed, these instrumented programs generate reports at each data checkpoint which can be used to identify points of divergence between the CUDA and SYCL programs. CodePin can analyze most user-defined CUDA classes and generate the schema file for CUDA and SYCL classes, except for classes with virtual inheritance. New features, such as auto-comparing the data reports and loading a data checkpoint from the CUDA report on the SYCL side to verify subsequent behavior, are under development. CodePin is distributed as open source and is available in the SYCLomatic repository.

Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame [2721]
Authors: Jolly Chen, University of Amsterdam, CERN, Monica Dessole, CERN and Ana Lucia Varbanescu, University of Twente.
show / hide abstract

The world’s largest particle accelerator, located at CERN, produces petabytes of data that need to be analysed efficiently, to study the fundamental structures of our universe. ROOT is an open-source C++ data analysis framework, developed for this purpose. Its high-level data analysis interface, RDataFrame, currently only supports CPU parallelism. Given the increasing heterogeneity in computing facilities, it becomes crucial to efficiently support GPGPUs to take advantage of the available resources. SYCL allows for a single-source implementation, which enables support for different architectures.

In this poster, we describe a CUDA implementation and the migration process to SYCL, focusing on a core high energy physics operation in RDataFrame — histogramming. We detail the challenges that we faced when integrating SYCL into a large and complex code base. Furthermore, we perform an extensive comparative performance analysis of two SYCL compilers, AdaptiveCpp and DPC++, and the reference CUDA implementation. We highlight the performance bottlenecks that we encountered, and the methodology used to detect these. Based on our findings, we provide actionable insights for developers of SYCL applications.
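As a point of reference, the core histogramming operation can be written in SYCL as the minimal sketch below (USM plus device-scope atomics; this is illustrative, not the RDataFrame implementation):

#include <sycl/sycl.hpp>

void histogram(sycl::queue &q, const float *values, size_t n,
               unsigned *bins, size_t n_bins, float lo, float hi) {
  const float scale = float(n_bins) / (hi - lo);
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    int b = static_cast<int>((values[i] - lo) * scale);
    if (b < 0 || b >= static_cast<int>(n_bins)) return;  // under/overflow
    sycl::atomic_ref<unsigned, sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space>
        counter(bins[b]);
    counter.fetch_add(1u);  // contended atomic update is the hot spot
  }).wait();
}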

Accelerating Machine Learning Inference on GPUs with SYCL [3098]
Authors: Ioanna-Maria Panagou and Nikolaos Bellas, University of Thessaly, and Lorenzo Moneta and Sanjiban Sengupta, CERN.
show / hide abstract
Recently, machine learning has established itself as a valuable tool for researchers to analyze their data and draw conclusions in various scientific fields, such as High Energy Physics (HEP). Commonly used machine learning libraries, such as Keras and PyTorch, might provide functionality for inference, but they only support their own models, are constrained by heavy dependencies, and often provide only a Python API rather than a C++ one. SOFIE (System for Optimized Fast Inference code Emit), part of the ROOT project developed at CERN, creates standalone C++ inference code from an input model in one of the popular machine learning formats. This code is directly invokable from other C++ projects and has minimal dependencies. In this work, we extend SOFIE to generate SYCL code for machine learning model inference that can run on various GPU platforms and depends only on the Intel MKL BLAS and portBLAS libraries, achieving speedups of up to 258x for large convolutional models.
SYCLomatic: SYCL Adoption for Everyone – Moving from CUDA to SYCL Gets Progressively Easier [3317]
Authors: Robert Mueller-Albrecht, Intel.
show / hide abstract
A growing number of diverse AI and high-performance computing (HPC) workloads need efficient and cost-effective use of accelerators such as GPUs, support for an ever more varied set of hardware configurations, and easier, portable multiarchitecture programming.

Compute-intensive applications, scientific and medical research workloads, advanced image processing workloads, and AI-assisted data science use cases all have one thing in common: moving software from one hardware platform to another has long been cumbersome and costly. Single-vendor architecture programming limits code reuse and portability, which in turn slows the adoption and proliferation of the latest technology advances that could speed up computing, drive innovation, and scale solutions with greater productivity.

In early 2022, the open-source SYCLomatic project went from being an ambitious effort on GitHub to being a released product, complete with its binary distribution counterpart, the Intel DPC++ Compatibility Tool.

It provides a solution for the many workloads whose accelerated offload compute implementation is still based only on CUDA. SYCLomatic takes a Visual Studio solution or a makefile-based project, analyzes it, optionally captures the build configuration in a Clang JSON compilation database, and then processes all the CUDA kernel, C++ source, and header files in the project.

The result is a new, fully migrated project with an annotated C++ source providing warnings and guidance on additional steps to complete the migration, where automation alone is not sufficient to make the right heuristic decisions with 100% accuracy.
The level of completeness of the initial migration has improved considerably over the last two years, as has the comprehensiveness of the guidance the tool provides.

And the evolution has not stopped there. SYCLomatic automatically migrates an ever-growing set of CUDA library calls from cuBLAS, cuFFT, cuRAND, cuSOLVER, cuSPARSE, cuDNN, Thrust, CUB, and NCCL to equivalent open-standards SYCL-based library calls covered by the Unified Acceleration Foundation (UXL) and its underlying oneAPI specification. It thus takes full advantage of the open SYCL abstraction layer, as well as the comprehensive set of compilers, libraries, tools, and code samples provided by the UXL Foundation and its vision of a multiarchitecture, multivendor software ecosystem for all accelerators based on open standards.

The work presented will cover a step-by-step approach to adding SYCL support to your C++ code and migrating existing CUDA-based implementations to SYCL.

On Demand Specialization of SYCL Kernels with Specialization Constants and Spec Constant Length Arrays (SCLA) [3484]
Authors: Víctor Pérez, Victor Lomüller, Lukas Sommer, Julian Oppermann, Romain Biessy, Tadej Ciglarič and Mehdi Goli, Codeplay Software.
show / hide abstract

SYCL is a heterogeneous programming model that allows users to specialize their code using standard C++ features such as templates. Specializing code to take available hardware features into account can enable optimizations for different targets. Yet specialization using C++ templates may increase the resulting binary size or, in the worst case, may not be feasible at all. In these cases, SYCL programmers can find a powerful tool in specialization constants, which specialize code at runtime, potentially optimizing binary size and compilation time.

Specialization constants (sycl::specialization_id) are values that are guaranteed to be constant during the execution of a kernel and whose value is only provided at kernel launch time. This approach enables SYCL implementations to specialize kernels at runtime, generating code on the fly using the now-known constant values. This simple mechanism allows programmers to write complex kernels with specialization-constant-driven control flow that is simplified at kernel launch time. And because specialization happens only at runtime, specialization constants have no impact on binary size at application compile time.
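In standard SYCL 2020 code this mechanism looks as follows (a minimal sketch): the value is set on the handler at submission, and the JIT can fold it into the generated code.

#include <sycl/sycl.hpp>

constexpr sycl::specialization_id<int> vec_width;

void run(sycl::queue &q, float *data, size_t n) {
  q.submit([&](sycl::handler &h) {
    h.set_specialization_constant<vec_width>(4);  // chosen at runtime
    h.parallel_for(sycl::range<1>(n / 4),
                   [=](sycl::id<1> i, sycl::kernel_handler kh) {
      int w = kh.get_specialization_constant<vec_width>();
      for (int j = 0; j < w; ++j)  // bound becomes a constant after JIT
        data[i * w + j] *= 2.0f;
    });
  }).wait();
}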

The case of SPIR-V targets is of special interest, as this intermediate language provides native support for spec constants. SPIR-V spec constants work similarly to SYCL's, while adding support for two additional constructs: specialization operations, which yield a specialization constant from an operation taking specialization constants or constants as arguments; and specialization-constant-length arrays, whose length is given by a specialization constant. In this talk, we will show how the latter capability can be leveraged in SYCL code to write more size-efficient and, in specific cases, even better-performing kernels. Thus, we propose a new SYCL extension, Spec Constant Length Arrays (SCLA).

The SYCL device memory model defines three distinct address spaces: global memory, accessible to all work-items in all work-groups; local memory, shared among all work-items in a work-group; and private memory, private to each work-item. As private memory does not need to be shared between work-items, SYCL implementations may use faster—but more limited—physical device memory to map it, e.g., registers. In the current SYCL specification, there is no way of expressing dynamic private memory allocations, so we propose SCLA as an extension to enable this feature.

The current experimental SCLA implementation in DPC++ is built on top of Clang's C++ variable-length array (VLA) extension, replacing the integral size argument with a SYCL specialization constant. This syntax was chosen to prove the concept; in the future, we plan to evolve the extension toward a more idiomatic SYCL syntax and make further improvements.
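Based on the description above, SCLA usage might look like the following; this syntax is purely illustrative of the VLA-based prototype, not the final extension interface.

#include <sycl/sycl.hpp>

constexpr sycl::specialization_id<size_t> buf_len;

// Inside a kernel that takes a sycl::kernel_handler kh, the prototype
// behaves like a VLA whose length becomes a JIT-time constant:
//
//   float scratch[kh.get_specialization_constant<buf_len>()];
//
// so the backend can size private memory (e.g. registers) exactly,
// instead of baking a template-instantiated worst case into the binary.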

We use portDNN as a use-case scenario for SCLA. portDNN is an open-source SYCL library providing several heavily-specialized kernels implementing deep neural network operators. Experiments were performed on a system with an Intel Core i7-8700 CPU with 31GiB RAM and an Intel UHD Graphics 630, running Ubuntu 22.04.3 LTS (Linux kernel 6.2.0) and Intel Level Zero driver version 1.3.26918.

portDNN specializes its kernels using different vector widths. In our experiments, we replaced vectors with SCLA allocations in the kernels by providing a vector-like interface built with spec constants on top of our extension.
For now, this change was performed for two of the thirteen available operators (binaryop and pointwise), but would be applicable to more operators.

This change reduced the library size by 10% while maintaining performance and generating the same JIT code. The reported geometric mean speedup was 1.01x, taking kernel execution time as the metric, when running the portDNN-provided binaryop and pointwise benchmarks.

As specialization now happens at runtime, the number of files compiled at application compile time is reduced by 20%, leading to a 1.19x compile-time speedup and a 1.11x link-time speedup, for a total build-time speedup of 1.18x. Interestingly, compilation of the new individual files saw a 0.54x slowdown, but the reduction in the number of files being compiled paid off. Moving compilation to runtime does, however, affect JIT compilation time: in our experiments we measured a 0.91x slowdown in JIT compilation, which is expected to run just once and be cached afterwards.

With use-cases such as the replacement of explicit vectorization by arrays which have their size given through specialization constants, SCLA could be the first in a series of SYCL extensions building new features on top of spec constants. One example of this would be a vector-like interface that would allow creating dynamically sized vectors specialized for the hardware vector width for algorithms such as those in portDNN, while greatly optimizing the resulting binary size.

SCLA was also tested on portFFT, an open-source library implementing Fast Fourier Transforms using SYCL.
In the absence of the ability to dynamically allocate private memory, the library had to make conservative allocations to cope with different parameter sizes.
Using SCLA in place of these allocations, always allocating only the necessary size, leads to a 1.09x speedup on the same system with the same metrics. As the only change in the generated code was the private memory allocation size, we attribute the speedup to register spilling in the original code due to the larger allocation size.

These results demonstrate how specialization constants and SCLA allow specialization of SYCL programs to be delayed until runtime, when full information about the target hardware is available. This can significantly reduce binary size, an important factor for shipping libraries. It also allows such algorithms to run in more memory-constrained environments while keeping a high degree of specialization, and thus facilitates the portability of SYCL code.

In the poster, we will present the current SCLA extension and how it can be used instead of templates to specialize code in libraries such as portDNN or portFFT at runtime, leading to significant reductions in binary size and compile time. We will also share ideas on how the extension can evolve toward a more mature, SYCL-like interface.

An Efficient Approach to Resolving Stack Overflow of SYCL Kernel on Intel® CPUs [8471]
Authors: Wenwan Xing, Wenju He and Xinmin Tian, Intel.
show / hide abstract
SYCL is a parallel programming language that enables heterogeneous computing on various devices. The SYCL CPU device uses the CPU as a device to run SYCL kernels, distributing the execution of SYCL work-groups across CPU threads. When a kernel uses a large amount of private or local memory, a stack overflow can occur, since a thread's stack size cannot be changed after thread creation. Previously, we used context swapping to resolve the issue; however, it carried a significant performance penalty. This paper proposes a new approach that replaces alloca instructions with accesses to heap memory. Experimental results on 21 SYCL workloads show that the new approach has no performance penalty. It could also be adopted by other SYCL or SPMD CPU devices, e.g., the SYCL Native CPU device, since they all face the same problem.
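Conceptually, the transformation is the source-level analogue sketched below; the actual change is performed by the compiler on LLVM IR (replacing alloca with heap-backed storage), so this is only an illustration.

#include <cstddef>

constexpr size_t kN = 1 << 20;  // a 4 MiB per-work-item array

void kernel_body_before(int idx) {
  float big[kN];        // large alloca: risks thread stack overflow
  big[idx % kN] = 1.0f;
}

void kernel_body_after(int idx, float *heap_pool) {
  // Each work-item indexes a pre-allocated slice of a heap buffer,
  // sidestepping the fixed thread stack size.
  float *big = heap_pool + size_t(idx) * kN;
  big[idx % kN] = 1.0f;
}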