Location and Times
All the sessions took place on the 5th floor of the Heidelberg University Interdisciplinary Centre for Scientific Computing.
Award Winners
This year’s awards went to:
Outstanding Full Paper: Achieving High-throughput Strided Data Movement Across GPUs.
Outstanding Short Paper or Technical Talk: Write once, deploy on several IHV-GPUs – 3D Rendering with SYCL cross-vendor support and performance using Blender Cycles.
Outstanding Poster: Performance-Portable implementation of the shallow water equations on CPUs, GPUs, and FPGAs.
SYCL Hackathon and Tutorials
- Dates: Monday 7 – Tuesday 8 April, 2025
- Times: 08:30 – 17:30
- Hackathon Chair: Tom Deakin, University of Bristol and Khronos SYCL Working Group Chair
- Tutors: Thomas Applencourt and Brice Videau, Argonne National Laboratory; Aksel Alpay, University of Heidelberg; Igor Vorobtsov, Intel; Lukas Sommer and Duncan Brawley, Codeplay.
- SYCL Walkthrough: 9 Circles of SYCL. This talk will take you from zero to hero in SYCL. We’ll introduce key concepts of SYCL and provide you with the essential tools to get started—or refresh your knowledge in preparation for the follow-up hackathon. Presented by Thomas Applencourt, Argonne National Laboratory.
- AdaptiveCpp Best Practices. This lecture on SYCL programming with AdaptiveCpp will cover best practices for optimizing performance, ensuring portability, and writing efficient parallel code. Presented by Aksel Alpay, University of Heidelberg.
- Migrating AI Applications to SYCL with SYCLomatic. This tutorial will show how SYCLomatic, an open-source tool, can be used to migrate existing accelerated compute projects written in CUDA language to SYCL. Presented by Igor Vorobtsov, Intel.
- Advanced SYCL Academy Lectures. A selection of micro lectures that will help you take advantage of some of SYCL’s more advanced features that can be applied to your own code. Various presenters.
- More to be announced …
*Attendees are expected to have access to their own SYCL development environment (local or remote). A limited number of logins to a SYCL development environment can be made available to anyone without access. Please contact [email protected] to request access.
International Workshop on OpenCL and SYCL
The person presenting the talk is highlighted in bold.
Wednesday 9 April
The main performance bottleneck of the prefill stage on Adreno GPUs occurs when the model computes the keys and values for each token in the input sequence. This computation involves many large matrix multiplications between the input embeddings and the weight matrices of the model, which requires careful optimization to achieve good performance. GPUs are ideal for accelerating highly parallelizable operations. However, matrix multiplication is a unique case in that it necessitates significant data sharing between individual computing work-items. For example, each element of the input matrices contributes many times to different components of the output matrix. Therefore, optimizing a matrix multiplication algorithm for Adreno GPUs involves leveraging the GPU memory subsystem. The main difficulty of optimizing matrix multiplication on GPU is that the larger the matrix, the more likely it is that a value has been replaced in the cache and must be fetched from a higher-latency memory such as Double Data Rate (DDR) memory. We introduce a tiled matrix multiplication approach to minimize repetitive reading of the same matrix elements from higher-latency memories. We do this by grouping memory reads and writes so they are close together in the address space. Grouping is accomplished by splitting input and output matrices into sub-matrices called tiles. The resultant dot products are partially computed from the entire tile before moving on to read pointers outside of the tile boundaries.
Following the prefill stage is the decode stage, where the model generates output tokens. Each token is predicted based on the previous tokens generated and the information stored during the prefill stage. The decode stage differs from prefill because each token must be generated sequentially, rather than processing an entire prompt in parallel. The decode stage is dominated by matrix-vector multiplications, which are a memory-bound problem on Adreno GPUs. We introduce a tuned matrix-vector multiplication kernel for Adreno GPUs, which achieves high memory throughput. The key idea is that every output element of a matrix-vector multiplication is a dot product between a single row of the input matrix and the entire input vector. This means the input vector is completely reusable and can be cached in low-latency memory. Our optimized kernel splits and assigns partial workloads to multiple SIMT lanes (or subgroups in OpenCL) to hide memory latency. Next, the vector is loaded only once and shared with all work-items operating on the same column of the input matrix. Finally, a reduction of the partial results returns the final output.
These kernel optimizations result in a high-performance OpenCL backend implementation for llama.cpp. The OpenCL backend makes it practical for developers to easily deploy models for edge inference on devices such as the Snapdragon X Elite and Snapdragon 8 Elite mobile platforms. In the presentation, we will share more technical details on porting, optimization, and issues, as well as the roadmap for enabling more features in llama.cpp, including more quantization schemes, flash attention, int8 enablement, etc.
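For readers unfamiliar with the tiling idea described above, a minimal sketch follows. It is written in SYCL (not the talk's OpenCL kernels, and much simpler than the llama.cpp implementation) purely to illustrate how tiles in work-group local memory reduce re-fetching from high-latency memory; the tile size and function name are placeholders, and the sketch assumes the matrix dimensions are multiples of the tile size.

```cpp
#include <sycl/sycl.hpp>

// Illustrative tiled matrix multiplication: each work-group cooperatively
// loads TILE x TILE blocks of A and B into local memory so that elements are
// reused from low-latency storage instead of being re-read from DDR.
constexpr int TILE = 16;

void tiled_matmul(sycl::queue& q, const float* A, const float* B, float* C,
                  int M, int N, int K) {  // assumes M, N, K are multiples of TILE
  q.submit([&](sycl::handler& h) {
    sycl::local_accessor<float, 2> a_tile(sycl::range<2>(TILE, TILE), h);
    sycl::local_accessor<float, 2> b_tile(sycl::range<2>(TILE, TILE), h);
    h.parallel_for(
        sycl::nd_range<2>{{size_t(M), size_t(N)}, {TILE, TILE}},
        [=](sycl::nd_item<2> it) {
          const int row = it.get_global_id(0);
          const int col = it.get_global_id(1);
          const int lr  = it.get_local_id(0);
          const int lc  = it.get_local_id(1);
          float acc = 0.0f;
          for (int t = 0; t < K; t += TILE) {
            // Cooperative load of one tile of A and one tile of B.
            a_tile[lr][lc] = A[row * K + (t + lc)];
            b_tile[lr][lc] = B[(t + lr) * N + col];
            sycl::group_barrier(it.get_group());
            for (int k = 0; k < TILE; ++k)
              acc += a_tile[lr][k] * b_tile[k][lc];
            sycl::group_barrier(it.get_group());
          }
          C[row * N + col] = acc;
        });
  });
}
```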
This technical presentation will provide a comprehensive update on our efforts to enhance the Tensor Virtual Machine (TVM), a renowned open-source compiler framework for artificial intelligence, and MLC (Machine Learning Compilation), a derivative framework based on TVM that primarily targets large language models (LLM) and large vision models (LVM). Historically, we have improved open-source solutions for the Adreno GPU through MLC and TVM on Android and Linux environments. In this presentation, we will begin with a brief overview of the TVM and MLC projects, followed by an update on their current status, including OpenCL backend support. We will also discuss the future directions of the TVM/MLC community, focusing on features that are important for GPU vendors. Additionally, we will review our previous contributions to the TVM/MLC projects, including the generative AI work presented at the last IWOCL conference. Our latest contributions will be highlighted, showcasing the various AI workloads we have recently enabled, such as Gemma, Qwen 7B, and Phi.
Our previous work on TVM and MLC primarily targeted Android platforms. However, the Copilot PC powered by the Windows on Snapdragon (WoS) platform, running Adreno GPUs, is gaining significant momentum. Adreno GPUs on Windows support the latest OpenCL 3, allowing us to maintain functional parity with Linux and Android platforms. We will discuss the details of enabling LLM and other AI workloads using TVM/MLC on WoS, including the infrastructure and additional repository that offer a prebuilt Python SDK with comprehensive documentation to help developers get started quickly.
This technical presentation will cover the available resources (repository, SDK, documentation, etc.), current model support, performance metrics for various models (Generative AI, Vision, etc.), and technical compatibility (OpenCL ML support, Qualcomm-specific tooling, etc.). We will also provide a summary of well-known Generative AI models supported by the Adreno GPU on Snapdragon X Elite platforms using the MLC/TVM solution, along with recent performance data on the latest Snapdragon Android platforms. Additionally, we have traditional vision and image model benchmarks to share.
We will also discuss ongoing enhancements to leverage the OpenCL ML extension for Generative AI solutions, which significantly boost prompt performance. We aim to have these features available to the developer community, with performance details shared by the time of the conference.
We present Shamrock, a native SYCL framework for astrophysics, designed to implement various numerical methods for modelling hydrodynamic flows, in particular Smoothed Particle Hydrodynamics (SPH). At the core of Shamrock lies a fast radix tree building algorithm that allows the tree to be rebuilt at each timestep with minimal cost, eliminating the need for tree communications or updates.
Additionally, a domain decomposition method is used on top of the radix tree, allowing for nearly linear multi-GPU weak scalability, resulting in 92% weak scaling efficiency on 1024 AMD MI250X graphical accelerators for large SPH simulations.
In this talk, we will present our progress in developing the SYCL-Graph API, and the use of this API in GROMACS, a molecular dynamics engine. SYCL-Graph provides a portable interface across multiple backends and devices that improves the performance of SYCL applications with specific characteristics. These characteristics include workloads with small kernels where execution time is dominated by launch overhead, and workloads that exhibit a repetitive pattern of execution.
Molecular dynamics is a domain of applications where offloading work to the GPU to achieve performance is key. GROMACS is a widely used molecular dynamics engine, supporting a broad range of hardware and software platforms, from laptops to the largest supercomputers. GROMACS maintains a SYCL backend for GPU acceleration across multiple devices including NVIDIA, AMD, and Intel GPUs. The SYCL backend is a more recent addition than the preexisting CUDA and OpenCL backends, which are limited in their portability and performance, respectively. To achieve the aim of the GROMACS SYCL backend being a single code that performantly targets the GPUs of all three major vendors with minimal specialization, it is important that SYCL has feature parity with the alternative proprietary vendor-specific APIs. One of the CUDA features used by GROMACS for which an equivalent is missing from the SYCL 2020 specification is CUDA Graphs.
The graph abstraction defines an explicitly lazy execution model by decoupling command creation from submission. These concepts are potentially tied together in SYCL implementations that eagerly submit work to a device when a command-group is submitted to a queue. By providing a prior construction stage before a workload of commands begins execution on the device, submission of multiple commands can happen in a single submission call instead of many, reducing the overhead of multiple independent command submissions. Deferral of command submission can also allow for the SYCL implementation to optimize dependencies within the user defined graph, which may improve concurrency and other performance metrics.
SYCL-Graph is a oneAPI vendor extension to the SYCL 2020 specification which defines such a graph abstraction. The extension sycl_ext_oneapi_graph maps well to equivalent graph abstractions in the CUDA and HIP backends of SYCL, and also to other heterogeneous APIs such as OpenCL command-buffers, Level-Zero command-lists, Vulkan command-buffers, and DX12 command-lists. The Intel oneAPI DPC++ compiler has support for SYCL Graph with Level-Zero, CUDA, HIP, and OpenCL backends.
GROMACS algorithms for calculating inter-particle forces place high demands on the underlying GPU framework, requiring it to efficiently schedule multiple small tasks with minimal overhead, attempting to achieve overlap between CPU and GPU work for large systems and allowing the GPU to stay occupied for smaller systems. Integrating the SYCL-Graph extension allows GROMACS to reduce the overhead of scheduling these short running tasks. In this talk we will elaborate on how the SYCL-Graph abstraction is used by GROMACS, and the kind of workloads it is most effective for. The performance benefits on multiple oneAPI SYCL backends when using SYCL-Graph in GROMACS compared to core SYCL execution will be presented.
Advances in the SYCL-Graph extension API and its implementation in the DPC++ runtime since it was last presented will also be highlighted in the talk, including dynamic graph update, as well as future plans for the extension to enhance GROMACS usage and its usage in other key applications that have started to integrate the SYCL-Graph extension into their SYCL backends. This talk also discusses future plans for promoting the SYCL-Graph extension from experimental status to a fully supported and stable API.
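For orientation, a minimal sketch of the record-and-replay pattern provided by the experimental sycl_ext_oneapi_graph extension is shown below. The kernel bodies and sizes are placeholders (not GROMACS code), and the exact API spellings follow the extension proposal as currently published and may evolve.

```cpp
#include <sycl/sycl.hpp>
namespace sycl_exp = sycl::ext::oneapi::experimental;

// Record a queue into a command graph once, then replay the whole graph with
// a single submission per iteration instead of many small submissions.
void run_with_graph(sycl::queue& q, float* x, float* y, size_t n) {
  sycl_exp::command_graph graph{q.get_context(), q.get_device()};

  graph.begin_recording(q);
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { x[i] *= 2.0f; });
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { y[i] += x[i]; });
  graph.end_recording(q);

  auto exec = graph.finalize();        // one-off graph construction cost
  for (int step = 0; step < 1000; ++step)
    q.ext_oneapi_graph(exec);          // single submission per iteration
  q.wait();
}
```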
The addition of the oneAPI-SYCL backend to Cycles, introduced by Intel in 2022 and shipping since the Blender 3.3 LTS release, opens the possibility to compare SYCL’s cross-vendor capabilities and performance against native implementations using a well-established user application. While the initial oneAPI-SYCL backend was fully functional on non-Intel GPUs such as NVIDIA and AMD, its performance was behind the native vendor backends (roughly 70% of CUDA performance on NVIDIA and 90% of HIP performance on AMD). Since then, a lot of work has been invested by Intel and Codeplay in SYCL itself and in the Cycles oneAPI-SYCL backend to improve its performance not only on Intel but also on NVIDIA and AMD GPUs.
In this talk, we will present details about the initial implementation of the oneAPI-SYCL backend for Cycles, starting with a short introduction to Cycles’ code structure, which includes 66 GPU kernels of varying code complexity, where the largest one includes around 200K instructions.
After that, we give insights on the work done by Intel and Codeplay to add cross-vendor SYCL extensions required to support features, such as global device data and bindless textures, that are heavily used by Cycles’ native CUDA and HIP backends and were not supported by SYCL at the time of the initial oneAPI-SYCL integration.
Bindless textures were used in the CUDA backend to leverage hardware texturing units. This fixed-function hardware performs texture lookups and filtering and often comes with optimized caches. The new ‘sycl_ext_oneapi_bindless_images’ extension now allows the same in SYCL, and unlike ‘sycl::image’, it does not require the number of textures to be known at compile time.
Pointers to scene data arrays are stored in constant memory in the CUDA backend, allowing for fast and easy access through global variables. Our initial SYCL implementation had to work around the lack of global variables with a helper object in global or private memory that holds all these pointers. This led to an increase in code complexity and adds an extra level of pointer dereferencing. With the ‘sycl_ext_oneapi_device_global’ extension, the SYCL backend can now use the same method as the CUDA backend, and we saw performance improvements over the previous approach.
The addition of these extensions to SYCL and their implementation in the oneAPI-SYCL backend of Cycles was crucial to improving its performance not only on Intel but also on NVIDIA and AMD GPUs. During the talk, we will give a detailed analysis of how Cycles uses these extensions and how they improve performance. In its current state, the Cycles oneAPI-SYCL backend runs on Intel, NVIDIA, and AMD GPUs while achieving close to 95% of the performance of the native CUDA backend when running on NVIDIA.
These performance results show that SYCL can be a serious single-source and cross-vendor alternative to traditional vendor-specific multi-source/backend solutions, not only in high-performance computing (HPC) but also in complex client/consumer software systems.
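To illustrate the device-global pattern mentioned above, here is a minimal sketch of sycl_ext_oneapi_device_global usage. The structure and names (SceneData, kg_data) are invented for illustration and are not Cycles code; the extension API spelling follows the experimental proposal and may differ in detail.

```cpp
#include <sycl/sycl.hpp>
namespace sycl_exp = sycl::ext::oneapi::experimental;

// A table of scene-data pointers stored in a device-scope global, instead of
// being passed through a per-kernel helper object.
struct SceneData { const float* vertices; const int* indices; };

static sycl_exp::device_global<SceneData> kg_data;

void render(sycl::queue& q, const SceneData& host_scene, float* out, size_t n) {
  // Copy the pointer table into the device global once, before launching kernels.
  q.copy(&host_scene, kg_data).wait();

  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    // Device code reads the global directly, similar to CUDA constant memory,
    // avoiding an extra level of pointer dereferencing.
    const SceneData& kg = kg_data.get();
    out[i] = kg.vertices[kg.indices[i]];
  });
}
```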
- Venue: Kulturbrauerei Hotel and Brauhaus
- Address: Leyergasse 6, 69117 Heidelberg
- https://www.heidelberger-kulturbrauerei.de/
Thursday 10 April
The SYCL 2020 specification says that all features are available through a single <sycl/sycl.hpp> header, which is great for simplicity of use, but at the same time has the strong disadvantage of making users pay for what they are not using.
Considering the expected growth of the SYCL standard by incorporating various extensions into future versions of the core specification, the problem will be even more apparent going forward.
Extensions by themselves can also introduce a lot of unnecessary compile-time overhead by being implicitly included into <sycl/sycl.hpp>. This could be done for simplicity of use and for the simplicity of the implementation. For example, all extensions in our (Intel’s) SYCL implementation are designed that way. An even more concerning fact is that all proposed KHR extensions (available at the moment of preparing this abstract) are also designed this way, thus broadcasting this practice and enforcing it on other implementations.
In this technical talk we will explore the possibility of splitting the <sycl/sycl.hpp> header file into a few smaller headers dedicated to specific features, similar to how it is done in C++. Finer header granularity would allow SYCL users to be in control of what they include, thus positively impacting the compilation times of their applications.
We prepared an overview of the features provided by core SYCL 2020 and analyzed their usage (using zjin-lcf/HeCBench to gather the data): how often each feature is used, which other features it is used together with, etc. Accompanied by compile-time cost measurements for every feature, this analysis provides a list of the features that would have the biggest impact if they were not included by default.
However, splitting <sycl/sycl.hpp> is not as simple as it may seem at first glance, even if we are just talking about providing extension functionality through a separate header file. There are many points of view to consider and many questions to answer.
For example, some of the SYCL runtime classes like queue, handler and range/nd_range are often used together. Providing dedicated headers for them all could be overkill, negatively impacting the user experience.
Some other SYCL features could be used together less often, even though they are interconnected, like “stream::operator<<(const stream&, const half&)” – if “stream” and “half” are to be placed in separate headers, do we need them both to have this operator available? Or should we treat one of them as the “main” one, which automatically includes the other? Similar questions could be asked about math built-in functions and half data type and there are more examples like that.
When we look into outlining functionality provided by SYCL extensions into separate header files, there are some unique questions as well. For example, what if a SYCL extension modifies some core SYCL class by adding a new method? And even if it only adds a standalone function, it may add an “aspect” and still modify core SYCL functionality. How should cases like this be handled? Would it still be reasonable to ask the user to perform a separate “#include” to get access to that functionality?
We would like to present our thoughts and feedback on the topic as one of the SYCL standard implementors and propose a possible split of <sycl/sycl.hpp> header, including some recommendations about how to implement and design SYCL extensions taking into account the questions above.
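To make the idea concrete, the snippet below contrasts today's all-in-one include with what user code could look like under a finer-grained split. The fine-grained header names in the comments are purely hypothetical illustrations, not a layout proposed by the talk.

```cpp
// Today: the single header pulls in the whole runtime interface, including
// implicitly included extensions.
#include <sycl/sycl.hpp>

// Hypothetical future alternative, where users include only what they use,
// e.g. (names are illustrative only):
//   #include <sycl/queue.hpp>   // queue, handler, range/nd_range
//   #include <sycl/usm.hpp>     // malloc_device / free

int main() {
  sycl::queue q;
  float* data = sycl::malloc_device<float>(1024, q);
  q.parallel_for(sycl::range<1>{1024},
                 [=](sycl::id<1> i) { data[i] = float(i); }).wait();
  sycl::free(data, q);
}
```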
The stated goal of SYCL, namely to support and develop heterogeneous computing, and the fact that both host and device code are to be compiled and linked into a single application binary determine, to some extent, the objective complexity of the compilation process in current SYCL implementations. Regardless of how an implementation distributes its transformation actions along the data flow and balances compiler and runtime architecture design to account for the implications of targeting multiple backends, we should expect some reasoning in terms of a unified representation of the source code, the materialization of compute kernels across a wide range of concrete backend APIs, and hence concerns of when, how and what is to be converted on the way from backend-independent to device-specific.
There is no simple solution to the intricacy emerging from interacting standards and architectures; however, certain models and representations prove helpful in dealing with this complexity. In this presentation, we discuss SPIR-V and its native LLVM backend as a combination of a format and a transformation step that manages to simplify what is both complex and complicated in a SYCL implementation. SPIR-V helps here in the same way as LLVM IR, abstracting developers away from vendor-specific instruction sets and languages (from the frontend perspective), and the corresponding LLVM backend provides a previously missing link in the chain of the llvm-to-backend infrastructure for the case when we would like to keep it internal to LLVM, without external dependencies.
Existing SYCL implementations employ the Khronos LLVM/SPIR-V Translator to convert LLVM IR code to SPIR-V for ingestion by Intel GPUs. The open-source version of the Khronos Translator and its multiple patched versions are widely used to transform generic device code IR into the SPIR-V intermediate representation, and the translator has de facto become a critical part of many toolchains. Expressed in that way, however, this approach is both less universal and more problematic in a technical sense than it could be. SPIR-V, being an intermediate language for representing graphical-shader stages and compute kernels for multiple Khronos APIs, deserves, and hopefully is gradually gaining, wider acceptance beyond purely Intel-specific use cases. This intermediate representation defined by the Khronos Group provides a portable and standardized form of a program for a wide range of hardware, serving as a cross-vendor unifying IR for programming heterogeneous accelerators such as GPUs, FPGAs, and NPUs. The dependency on the Khronos Translator may, in practice, turn out to be a less-than-ideal solution, overcomplicating the processes involved. It is not upstreamable with respect to LLVM, which means that an upstreamed SYCL implementation would need to rely on this external dependency, a step that complicates toolchains and is unlikely to gain quick and happy acceptance from the LLVM community.
The last year added substantial activity, growth, and development to the existing state of affairs. In this presentation, we demonstrate how our recent progress in implementing SYCL conformance in the LLVM SPIR-V backend helps all involved actors, promoting and widening the use of the SPIR-V standard, improving the overall maturity of the SPIR-V backend, and simplifying relevant toolchains, not only in SYCL but also in the AI compilers scope.
We describe new features of the SPIR-V backend and show how the very concrete goal of replacing the Khronos Translator in the DPC++ compiler with the LLVM backend has necessarily been driving positive adjustments in its general correctness and usefulness, thus causing tangible changes in the ecosystem of related software systems. Functional improvements allowed the SPIR-V backend to be added to the DPC++ compilation flow as a tool that converts LLVM IR to SPIR-V. Progress in the LLVM SPIR-V backend’s translation logic and overall stability, applicable to both compute and graphical flavors and not specific to any single vendor, activated the community and unlocked a way for the backend to break out of its prior experimental status within LLVM.
We explain our testing strategy and, given that constant progress is best guaranteed by established processes, the steps we are taking to ensure the SPIR-V backend preserves the required level of quality. The SPIR-V backend’s presence within the upstream LLVM codebase ensures tighter integration and simplifies maintenance of the hosting SYCL implementation, that is, DPC++. The stability of the LLVM SPIR-V backend, and even its mere presence within the LLVM project, also has a positive impact on DPC++ quality assurance processes. The SPIR-V backend makes it easier to influence LLVM development by blocking or addressing major breaking changes until they properly accommodate the needs of the computational flavor of SPIR-V, ensuring compatibility and stability for OpenCL and SYCL users.
With the recent diversification of the hardware landscape in the high-performance computing (HPC) community, performance-portability solutions are becoming more and more important. One of the most popular choices is Kokkos, which recently became a Linux Foundation project. Most of its development is supported by the US Department of Energy and the French Alternative Energies and Atomic Energy Commission.
Kokkos is implemented as a C++ library with multiple backends to support CPUs as well as various GPU architectures. These backends include OpenMP, CUDA, HIP, and also SYCL. This approach enables users to leverage the preferred vendor toolchain for the respective platform (e.g. CUDA, ROCm, oneAPI). The SYCL backend is used to target Intel GPUs, in particular to support the Aurora exascale supercomputer.
However, SYCL itself also offers a large degree of portability, and in fact Kokkos’ CI for SYCL has been running on NVIDIA hardware due to a lack of access to Intel GPUs.
In this presentation, we describe our experience with using the Kokkos SYCL backend on AMD GPUs, targeting the Frontier supercomputer at Oak Ridge National Laboratory. The two major SYCL implementations are DPC++ and AdaptiveCpp. While the Kokkos SYCL backend has been implemented using the former, the latter was the first implementation to target AMD GPUs. We will discuss the experience with both of these SYCL implementations in terms of functionality and performance.
Using Kokkos to evaluate SYCL toolchains has a number of benefits. Kokkos’ use of SYCL is fairly complex, exercising features such as graphs, relocatable device functions, atomics (including for non-arithmetic types), as well as pinned and page-migratable memory allocations. Kokkos also needs to implement capabilities, such as Kokkos’ hierarchical parallelism, that do not map straightforwardly to SYCL capabilities (see the sketch after this abstract).
Furthermore, a large number of libraries and applications that represent diverse use cases are implemented in Kokkos, providing readily available test cases for a toolchain evaluation.
Preliminary results show that support for AMD GPUs in DPC++ is much less mature than for NVIDIA GPUs or Intel GPUs. While the situation has improved significantly over the last year, we still encounter many runtime failures, dispatching problems, and code generation issues. With AdaptiveCpp the challenges arise even earlier in the evaluation process. Since Kokkos’ SYCL implementation is largely focused on supporting Intel GPUs, we opted to leverage SYCL extensions which are available in DPC++ but not in AdaptiveCpp. Furthermore, AdaptiveCpp appears to be less conformant with the SYCL 2020 standard, which Kokkos relies on. In some cases, we are able to work around the lack of feature support; in other cases, we have to disable certain Kokkos capabilities to evaluate the toolchain.
Our evaluation will leverage Kokkos unit tests to establish basic functionality and feature completeness. We will then use ArborX, Trilinos, and LAMMPS to evaluate usability and performance in end-user scenarios. ArborX is a library focused on geometric search, Trilinos is a collection of scientific libraries that in particular includes distributed linear algebra solvers, while LAMMPS is a widely used molecular dynamics code. We will use the canonical benchmarks and test suites of these software packages as a measure of the usability and performance of the SYCL toolchains.
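As an example of the kind of construct that stresses a SYCL toolchain, the sketch below shows Kokkos hierarchical parallelism (a league of teams with a nested team-level reduction), which the SYCL backend must map onto nd_range work-groups and group reductions. The code is a generic illustration, not taken from ArborX, Trilinos, or LAMMPS.

```cpp
#include <Kokkos_Core.hpp>

// Sum each row of a 2D view using one team per row. Teams map to SYCL
// work-groups; the TeamThreadRange reduction maps to a group reduction.
void team_sums(Kokkos::View<double**> data, Kokkos::View<double*> sums) {
  const int league_size = data.extent(0);
  const int row_len     = data.extent(1);
  Kokkos::parallel_for(
      "team_sums", Kokkos::TeamPolicy<>(league_size, Kokkos::AUTO),
      KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
        const int row = team.league_rank();
        double sum = 0.0;
        Kokkos::parallel_reduce(
            Kokkos::TeamThreadRange(team, row_len),
            [&](const int j, double& local) { local += data(row, j); }, sum);
        // Only one thread per team writes the result.
        Kokkos::single(Kokkos::PerTeam(team), [&]() { sums(row) = sum; });
      });
}
```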
Efficient execution of kernels on CPUs is crucial for the performance portability of heterogeneous programming models. SYCL, one such heterogeneous programming model, provides a hierarchical SPMD model to leverage the hierarchical structure of the many computation units on GPUs.
Each instantiation of an SPMD kernel is called a work-item. Work-items are arranged into sub-groups, while work-groups join together a number of sub-groups. To implement algorithms that are not embarrassingly parallel, SYCL provides barriers at the sub-group and work-group level. If a work-item reaches a work-/sub-group barrier, it stops progressing until all other work-items in its work-/sub-group have reached the barrier.
Previous work presented continuation-based synchronization (CBS) as a method to map work-groups onto CPUs. However, in this setting, how to also implement sub-groups efficiently has not been sufficiently explored.
CBS splits kernels into barrier-free regions. Barrier-free regions are regions of the kernel that are between consecutive barriers. Thus, a barrier-free region of a kernel can be mapped onto a CPU by putting a work-item loop around it. A work-item loop iterates over all the work-items in a work-group. To map the whole kernel onto a CPU, CBS puts work-item loops around each barrier-free region and uses a state machine to switch between these regions.
We compare two approaches to map sub-groups onto CPUs: Whole-function vectorization (WFV) and hierarchical CBS. Hierarchical CBS uses CBS once on the work-group level, and then on each created work-group barrier-free region to take care of the sub-group level. In contrast, WFV vectorizes the work-item loops, created by CBS on the work-group level, with the sub-group size as the vector width. Vectorizing the loops using WFV maps each work-item in a sub-group to a vector lane. We achieve synchronization because WFV guarantees that every instruction in a WFV vectorized loop is executed for every vector lane in lockstep. Hence, no instruction is executed after a sub-group barrier before all work-items in the sub-group have executed all instructions in front of the sub-group barrier.
One major difference between these two approaches is their vectorization strategy. WFV always vectorizes the work-item loops with the same vector width. Thus, WFV performs outer-loop vectorization if the work-group barrier-free region contains a loop. While with hierarchical CBS, LLVM can choose for each loop whether or not to vectorize, and which vector width to use for vectorization. However, LLVM is best equipped to vectorize the innermost loop. Whether innermost loop or outer loop vectorization leads to better performance depends on loop dependencies, trip count, and memory access patterns.
To improve the performance of hierarchical CBS, we implement a multi-versioning scheme that multi-versions the kernel on the condition that there are no incomplete sub-groups. Hence, the sub-group work-item loop has a static trip count: the sub-group size.
Thus, in many cases, LLVM completely unrolls the sub-group work-item loop as part of its simplification pipeline, which enables more aggressive optimizations in the optimization pipeline. Additionally, LLVM knows the vectorized loop does not need a scalar epilogue if the implementation-defined sub-group size is divisible by the vectorization width. Lastly, LLVM might choose to vectorize loops that it would not have vectorized otherwise. On the other hand, when using AdaptiveCpp’s JIT compiler the work-group variables get replaced by constants. Thus, even without the multiversioning scheme, the sub-group work-item loop has a static trip count if the work-group has no incomplete sub-groups.
We implemented these two approaches in AdaptiveCpp, an open-source SYCL implementation. Our performance evaluation shows that both approaches outperform AdaptiveCpp’s current sub-group-size-of-one approach on benchmarks that are amenable to using sub-groups, and WFV slightly outperforms hierarchical CBS on a diverse set of benchmarks.
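The following is a conceptual sketch of the CBS transformation described above, not AdaptiveCpp's actual generated code: a kernel with one work-group barrier is split into two barrier-free regions, each wrapped in a work-item loop, with cross-barrier values kept in per-work-item storage. Function names and signatures are illustrative.

```cpp
#include <vector>

// Original SPMD kernel (pseudocode):
//   v = f(work_item_id);
//   work_group_barrier();
//   out[work_item_id] = g(v, neighbour_v);
// After CBS, each barrier-free region becomes a loop over the work-items of
// one work-group, and values live across the barrier in a per-item buffer.
void run_work_group(int wg_size, float* out,
                    float (*f)(int), float (*g)(float, float)) {
  std::vector<float> v(wg_size);        // cross-barrier storage

  // Region 1: everything before the barrier, for all work-items.
  for (int wi = 0; wi < wg_size; ++wi)
    v[wi] = f(wi);

  // The barrier is now implicit: region 1 has finished for every work-item.

  // Region 2: everything after the barrier.
  for (int wi = 0; wi < wg_size; ++wi)
    out[wi] = g(v[wi], v[(wi + 1) % wg_size]);
}
```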
Specializing kernels by including runtime information during just-in-time (JIT) compilation can improve performance at the expense of potentially generating more kernels.
In this work, we contribute the runtime adaptivity framework that we have implemented in AdaptiveCpp. This framework can automatically generate specialized kernels at JIT-time, automatically taking into account various information about the kernel invocation, such as work group sizes, data alignments of pointer kernel arguments, or the kernel argument values themselves.
While similar approaches have already been investigated for other programming models, to our knowledge, AdaptiveCpp is the first SYCL implementation that can automatically leverage such information for the purpose of generating highly optimized kernels.
Our solution is available and enabled by default in the AdaptiveCpp SYCL implementation and supports CPUs, Intel GPUs, NVIDIA GPUs and AMD GPUs.
Using a set of mini-apps and benchmarks on NVIDIA, AMD and Intel hardware, we find that AdaptiveCpp with our new framework outperforms CUDA by 30% in the geometric mean, and HIP and oneAPI by 44% and 23%, respectively. We find that our framework is highly effective, achieving performance gains for SYCL code in excess of 5x in the most extreme cases.
We also discuss the impact of each individual implemented optimization technique, and find that for the tested NVIDIA hardware, the combination of all techniques is important. On AMD and Intel, specializing work group size and kernel argument values was most important.
Furthermore, we show how a combination of a persistent on-disk JIT-cache, careful design and choice of optimization techniques, as well as categorization of optimization techniques can mitigate overheads due to the additional JIT compilations to the point where they are generally no longer of concern for most applications.
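As an analogy for the kind of value specialization described above, the sketch below uses standard SYCL 2020 specialization constants, which are a manual way to bake a runtime value into the JIT-compiled kernel. The framework in this talk applies comparable specializations automatically and without source changes; this snippet is only an illustration of the principle, not its API.

```cpp
#include <sycl/sycl.hpp>

// A runtime value is exposed to the JIT compiler as a constant, so the
// backend compiler can unroll or vectorize based on it.
constexpr sycl::specialization_id<int> scale_factor;

void scale(sycl::queue& q, float* data, size_t n, int factor_at_runtime) {
  q.submit([&](sycl::handler& h) {
    h.set_specialization_constant<scale_factor>(factor_at_runtime);
    h.parallel_for(sycl::range<1>{n},
                   [=](sycl::id<1> i, sycl::kernel_handler kh) {
      // In the final device binary this reads as a compile-time constant.
      const int f = kh.get_specialization_constant<scale_factor>();
      data[i] *= float(f);
    });
  });
}
```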
Debugging heterogeneous applications requires specialized tools that are aware of the complexities of parallel computing and GPUs. Intel® Distribution for GDB is an enhancement of GDB, providing a solution for debugging SYCL applications targeting Intel GPUs. It allows developers to inspect the state of the program running on a GPU via a textual user interface. GDB also has a textual machine interface (MI) that Integrated Development Environments (IDEs) can utilize to communicate with GDB as a debug engine behind the curtain and implement a graphical user interface on top. In this presentation we demonstrate two IDE plugins that we developed, one for Visual Studio on Windows and the other for VS Code on Linux, that bring the capabilities of Intel® Distribution for GDB to users with the convenience of a GUI. We cover a detailed overview of setting up the debugging environment for heterogeneous applications, followed by a demo of debugging a SYCL application with compute kernels offloaded to Intel GPUs. We show how to inspect the state of threads and lanes, and how to query device information. We present how to utilize the various views available in Visual Studio and VS Code, such as disassembly, call stack, memory, locals, and watch views. We demonstrate controlling code execution using regular, conditional, and lane-specific breakpoints, and we show how to control threads via the scheduler-locking features for stepping and resuming. Altogether, this presentation will provide insights into leveraging Intel® Distribution for GDB and its various features to enhance productivity and accelerate development cycles. Attendees will leave equipped with practical knowledge to optimize their heterogeneous debugging workflows.
The Intel SYCL CPU device supports native debugging of device code, which makes it essential to handle debugging information properly. Work-group barriers complicate this process due to work-group loop fission on the device. When all work-items in a work-group execute instructions up to a barrier, a special buffer is used to save divergent cross-barrier values for recovery in the subsequent loop. Managing debug information for these cross-barrier values is challenging. In this paper, we propose an efficient approach to manage debug information for divergent values. Additionally, we introduce a novel barrier region algorithm that accurately identifies the correct insertion points to update the debugging information of divergent values in each fissioned work-group loop. This minimizes the number of store instructions needed to update the debug information. Our approaches correctly preserve debugging information while achieving minimal code size.
Friday 11 April
SYCL 2020 is a versatile framework for heterogeneous computing, and while it can be layered over various API backends, OpenCL remains a favored choice for many implementers. In this talk, we will describe how to implement SYCL 2020 atop a modern OpenCL foundation. We will discuss OpenCL features that, while optional, become indispensable for a robust and capable SYCL implementation. Additionally, we will explore OpenCL advancements currently in development, designed to support existing and emerging SYCL use-cases. Whether you’re a seasoned developer or new to the world of SYCL and OpenCL, this talk will equip you with valuable insights and open-source resources to understand why SYCL and OpenCL are better together.
Intel recently proposed the kernel compiler extension for SYCL, which allows applications to build kernels at runtime from source code and intermediate languages. The extension enables full specialisation of kernels even if the compile-time enumeration of all variants is intractable, for example a GEMM kernel tailored to specific shapes and the target device.
The kernel compiler extension adds the option to create SYCL 2020 kernel bundles from source strings (currently, OpenCL and SYCL) or binary data (SPIR-V), and query kernel objects from the bundle by their name instead of a C++ identifier. We focus here on the SYCL language support, which we refer to as “SYCL-RTC” (for runtime compilation) for short.
To define kernels, developers use the free-function kernel extension for SYCL. This extension provides a user-friendly way to define SYCL device kernels as C++ free functions. Additionally, in contrast to the undefined order of lambda captures, the extension guarantees a stable order of kernel arguments, which is required to later set the arguments for kernel launch via set_args. Inside the kernel itself, virtually no code changes are required compared to regular SYCL code.
Our initial prototype wrote the SYCL code and additional headers to temporary files, invoked DPC++ as a separate process, and loaded the SPIR-V binary data from the produced file. While this approach conformed to the specification, we identified several shortcomings. First, to obtain an executable SPIR-V representation of the kernels, at least four processes that read and write files are launched. Secondly, due to a limitation in DPC++, an unnecessary host compilation pass is run. Lastly, reading and writing temporary files is detrimental for security.
To overcome these limitations and reduce the compilation overhead at runtime, we present a new lean compilation pipeline that fully executes in memory, and thereby puts the modularity of the compiler to the test. By lazily loading a JIT compiler library, we can invoke LLVM and Clang APIs from the SYCL runtime. We use Clang’s LibTooling interface to compile the SYCL source code and any additional header files from a virtual file system overlay to an LLVM IR module containing only the device code. From there, we call the library version of the LLVM linker to link device libraries, and apply a subset of post-processing actions, such as device code splitting and property extraction, by running passes from DPC++’s SYCLLowerIR library. The resulting IR modules are translated to SPIR-V binaries. After wrapping the binaries and properties into an internal format handled by DPC++’s program manager class, execution continues the same way as regular SYCL kernels.
We plan to make the new pipeline the default implementation for SYCL-RTC in a future oneAPI release. Preliminary experiments hint at the in-memory pipeline having half the overhead compared to the subprocess-based prototype, due to the elimination of the host compilation and file IO, while supporting more complex compilation requests. Still, compiling SYCL code is computationally expensive, with a substantial part of the compilation time being spent on processing the SYCL headers. We believe runtime compilation is a suitable testbed for using pre-compiled headers as well as lighter-weight versions of sycl.hpp that support only a subset of the functionality. Improvements in this area would then also benefit the regular toolchain.
This technical presentation should appeal to future SYCL-RTC users and compiler enthusiasts alike. We will start by giving an overview of the SYCL-RTC API and how applications can benefit from runtime compilation. Next, we will provide a detailed overview of the implementation and discuss the challenges involved in compiling SYCL code at runtime and using Clang and LLVM libraries for in-memory compilation. Lastly, we will discuss the impact on compilation time at runtime and options to further improve this in the future.
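For readers new to SYCL-RTC, a rough sketch of the usage pattern follows. It is based on the kernel_compiler and free-function kernel extension proposals as published; the exact spellings (source_language::sycl, SYCL_EXT_ONEAPI_FUNCTION_PROPERTY, ext_oneapi_get_kernel, and the this_work_item helper) are evolving and may differ from the shipped oneAPI release.

```cpp
#include <sycl/sycl.hpp>
#include <string>
namespace sycl_exp = sycl::ext::oneapi::experimental;

// Compile a SYCL free-function kernel from a source string at run time,
// look it up by name, set its arguments in declaration order, and launch it.
void rtc_example(sycl::queue& q, float* in, float* out, size_t n) {
  std::string src = R"(
    #include <sycl/sycl.hpp>
    namespace sycl_exp = sycl::ext::oneapi::experimental;
    extern "C" SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((sycl_exp::nd_range_kernel<1>))
    void scale(const float* in, float* out) {
      size_t i = sycl_exp::this_work_item::get_nd_item<1>().get_global_id(0);
      out[i] = 2.0f * in[i];
    }
  )";

  auto bundle_src = sycl_exp::create_kernel_bundle_from_source(
      q.get_context(), sycl_exp::source_language::sycl, src);
  auto bundle_exe = sycl_exp::build(bundle_src);      // runtime compilation
  sycl::kernel k = bundle_exe.ext_oneapi_get_kernel("scale");

  q.submit([&](sycl::handler& h) {
    h.set_args(in, out);                              // stable argument order
    h.parallel_for(sycl::nd_range<1>{{n}, {64}}, k);
  });
}
```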
SYCL aims to be as close to standard C++ as possible to lower the bar to entry and increase its adoption. However, there are still limitations and restrictions as to which C++ functionality is available to be used by SYCL kernels.
Virtual function support is one of those features: virtual functions can be used in the host part of a SYCL application, but they cannot be invoked from a SYCL kernel or from a device function executed on a SYCL device.
Virtual functions allow us to use dynamic polymorphism, which is very important for the flexibility and reusability of code, allowing objects to be treated uniformly while exhibiting different behaviors.
The fact that other offloading programming models and technologies like OpenMP and CUDA do support virtual functions puts SYCL into a disadvantageous position. We (at Intel) have a SYCL extension proposal (https://github.com/intel/llvm/pull/10540) which adds support of virtual functions to SYCL. The extension bridges the gap to CUDA & OpenMP, making SYCL closer to feature parity with other offloading models.
Note that dynamic polymorphism can also be implemented using raw function pointers, but they can’t be used in SYCL device code either. Even though virtual functions and function pointers are very similar under the hood, the extension presented in this talk only covers the former.
In this talk we would like to give an overview of that extension, walking through examples of how it can be used and showcasing its key elements with the motivation for their design. We will also cover related topics such as support for function pointers in SYCL, the SPIR-V extension we developed as part of this work, and some details of how the extension is implemented in our SYCL implementation.
Adding virtual function support to SYCL is quite tricky. One reason is that many backends also don’t support virtual functions: for example, OpenCL is based on C rather than C++ and doesn’t support function pointers either, so some OpenCL extensions need to be written for that.
Another reason why this is tricky is the design of SYCL itself and the features the core specification has.
For example, SYCL has the term “device function”, which is any SYCL kernel function and, recursively, the functions it calls. The problem with virtual functions is that the compiler doesn’t statically know which exact function is called when a virtual call is performed. What if a virtual function uses features that are illegal in device code (like exceptions)? Some extra markup, proposed by the extension, is needed to help the compiler classify virtual functions; this will be discussed in the talk.
Another challenge is the presence of optional kernel features like fp64 or fp16 (also known as aspects). An application may contain kernels which are specialized for devices with or without support for certain aspects. When the call graph is statically known, a compiler can easily detect and highlight any misuses of optional kernel features. However, presence of virtual functions once again makes it impossible to understand (in general case) which exact function is being called, which means that an implementation needs some extra information supplied by the end user to help it understand which optional kernel features may be in use by kernels that perform indirect calls through virtual functions.
By giving this talk we would like to raise awareness of the extension and gather feedback about it, to better shape it and make it more useful.
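To give a feel for the markup the talk refers to, the snippet below is a rough sketch modeled on the public extension proposal. The property spellings (indirectly_callable, assume_indirect_calls) and their placement are our reading of the proposal and should be treated as illustrative rather than final syntax.

```cpp
#include <sycl/sycl.hpp>
namespace sycl_exp = sycl::ext::oneapi::experimental;

// Virtual functions intended for device-side calls are marked so the
// implementation knows they must be compiled for the device.
struct Filter {
  SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((sycl_exp::indirectly_callable))
  virtual float apply(float x) const { return x; }
};
struct Clamp : Filter {
  SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((sycl_exp::indirectly_callable))
  float apply(float x) const override { return x < 0.f ? 0.f : x; }
};

void run(sycl::queue& q, Filter* f /* object constructed in device USM */,
         float* data, size_t n) {
  q.submit([&](sycl::handler& h) {
    // The kernel declares that it performs indirect (virtual) calls.
    h.parallel_for(sycl::range<1>{n},
                   sycl_exp::properties{sycl_exp::assume_indirect_calls},
                   [=](sycl::id<1> i) { data[i] = f->apply(data[i]); });
  });
}
```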
Bindless Images is a DPC++ vendor extension to SYCL designed to modernize the usage of images and provide greater control over their memory in SYCL. We have presented Bindless Images before, where we briefly mentioned the ability to import external memory from graphics APIs, namely from DirectX 12 and Vulkan. In this presentation, our objectives are twofold: firstly, to provide an overview of the recent advancements in Bindless Images since our last discourse; and secondly, to undertake an in-depth exploration of interoperability aspects.
Since our last presentation the main focus has been on implementing new image types, as well as interoperability with those types. A significant amount of time has been spent on ensuring Bindless Images work not only on the CUDA backend, but also on Level Zero and on HIP.
A very important aspect has been integrating Bindless Images into a larger project, Blender, and finding performance optimizations there. The talk will touch briefly on the integration of Bindless Images into Blender, as this topic will be covered more extensively in a separate presentation. Instead, we will spend some more time discussing existing and potential future Blender interoperability with graphics APIs and how it relates to Bindless Images.
The primary focus of discussion regarding interoperability centers on the integration of external memory into SYCL. This includes its interaction with the SYCL API, the resulting calls made to the Unified Runtime, and ultimately, the alignment of these processes with CUDA and Level Zero. We showcase a few usage examples of how this works with DirectX and with Vulkan. Additionally, we explore strategies for extending the interoperability to other APIs. This includes a discussion on how the extension is easily mapped to the general notion of external memory interoperability in other APIs, but encounters issues when trying to map it to an API that works differently, with the motivating example being CUDA graphics interoperability API.
A portion of the discussion is dedicated to the opposite scenario of exporting memory from SYCL for use with external APIs such as DirectX or Vulkan.
A complementary feature is the use of semaphores to synchronize data between SYCL and DirectX/Vulkan. We talk about this scenario, how we initially implemented it for the CUDA backend, and how it has been extended to Level Zero with joint effort from multiple teams.
We will then discuss the option of separating the interoperability features in Bindless Images into their own dedicated extension, with motivating examples of how these features can be more generic.
We conclude the presentation with a glimpse into the future of Bindless Images and the work we have set out for the next year.
Leveraging the performance benefits of modern parallel hardware, including GPUs or accelerators, is crucial for high-throughput data parallel applications. Frequently, multiple accelerators, each featuring a separate, dedicated memory space, are used to perform a single computation, either on a single node or across a cluster. In these cases, depending on the type of computation, single columns of matrices or other highly strided, sparse data often need to be efficiently transferred between devices.
While this may seem like a rather trivial operation, the current version of the vendor-agnostic SYCL standard does not provide a straightforward and effective means of accomplishing this goal. Even more surprisingly, using low-level vendor APIs specifically designed for this purpose can still lead to suboptimal performance in relatively common use cases.
In this paper, we investigate high-performance strided data movement in detail, including a formalization of copy operations and various strategies which can be employed to enact them. We have implemented a library which provides a simple interface for requesting a copy, which internally performs operations such as chunking, staging, asynchronous execution, and linearization using optimized device kernels. We evaluate the performance of this library on a variety of hardware platforms and demonstrate that it can achieve significantly higher throughput for strided data movement than the current state-of-the-art.
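As a concrete illustration of the linearization step mentioned in this abstract (a sketch only, not the paper's library interface): a strided source, such as one column of a row-major matrix, is packed into a contiguous staging buffer by a small SYCL kernel before the dense block is transferred between devices.

```cpp
#include <sycl/sycl.hpp>

// Pack a strided source into a contiguous staging buffer on the device, so
// the subsequent device-to-device copy moves one dense block instead of many
// small strided pieces. Names and the interface are illustrative.
sycl::event pack_strided(sycl::queue& q, const float* src, float* staging,
                         size_t count, size_t stride_in_elems) {
  return q.parallel_for(sycl::range<1>{count}, [=](sycl::id<1> i) {
    staging[i] = src[i * stride_in_elems];
  });
}

// Typical use: pack on the source device, copy the dense staging buffer,
// then unpack with the inverse kernel on the destination device.
```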
Posters
Posters will be on display during the breaks Wednesday through Friday.
Graph analytics play a crucial role in a wide range of fields, including social network analysis, bioinformatics, and scientific computing, due to their ability to model and explore complex relationships. However, optimizing graph algorithms is inherently difficult due to their memory-bound constraints, often resulting in poor performance on modern massively parallel hardware. In addition, most state-of-the-art implementations are designed for NVIDIA GPUs, limiting their applicability on supercomputers equipped with AMD and Intel GPUs, for example. To address these challenges, we propose SYgraph, a portable heterogeneous graph analytics framework written in SYCL. SYgraph provides an efficient two-layer bitmap data layout optimized for GPU memory, eliminates the need for pre- or post-processing steps, and abstracts the complexity of working with diverse target platforms. Experimental results demonstrate that SYgraph delivers competitive performance on NVIDIA GPUs while being able to target any SYCL-supported device, such as AMD and Intel GPUs.
Setting kernel arguments in OpenCL, including Shared Virtual Memory (SVM) buffers and arguments passed by value (such as scalars and structs), traditionally requires multiple API calls, making the process cumbersome and error-prone. This work introduces an automated solution that parses the OpenCL kernel at compile time to extract parameter types, names, and kernel names. Using this information, a specialized smart object class is generated for each kernel, encapsulating all necessary functionality, including a `setArgs` method. For example, given a kernel `myKernel(int a, float* buffer, MyStruct structArg)`, a corresponding class is generated, allowing users to set all arguments in one call, such as `kernelObj->setArgs(10, bufferPtr, myStruct)`. This approach simplifies kernel interaction by seamlessly handling SVM buffers and all types of by-value arguments, reduces boilerplate code, and ensures error-free argument configuration for a wide range of use cases.
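To make the idea tangible, here is a hand-written sketch of what a generated wrapper for the `myKernel` example could look like. The abstract does not show the tool's actual output, so the class name and structure below are illustrative; the OpenCL host calls are the standard ones for by-value and SVM arguments.

```cpp
#include <CL/cl.h>

struct MyStruct { int a; float b; };

// Illustrative wrapper for:
//   __kernel void myKernel(int a, __global float* buffer, MyStruct structArg)
// The real tool derives such a class automatically from the kernel source.
class MyKernel {
  cl_kernel kernel_;
public:
  explicit MyKernel(cl_program program)
      : kernel_(clCreateKernel(program, "myKernel", nullptr)) {}

  // One call sets every argument, picking the right API per parameter kind:
  // by-value arguments via clSetKernelArg, SVM pointers via
  // clSetKernelArgSVMPointer.
  void setArgs(int a, float* buffer, const MyStruct& structArg) {
    clSetKernelArg(kernel_, 0, sizeof(int), &a);
    clSetKernelArgSVMPointer(kernel_, 1, buffer);
    clSetKernelArg(kernel_, 2, sizeof(MyStruct), &structArg);
  }

  cl_kernel get() const { return kernel_; }
};
```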
The shallow water equations describe fluid flows where the horizontal length scales are much larger than the vertical scales. Applications for the shallow water equations can be found in the modelling of tides, estuaries, tsunamis, floods, or atmospheric flows. We present an implementation of the shallow water equations for coastal ocean domains, running on CPUs, GPUs, and FPGAs by utilizing SYCL as a platform-agnostic programming model.
The implementation was first presented at PASC 2024 and uses a discontinuous Galerkin discretization on unstructured, triangular meshes. As each triangle can be computed independently from all other triangles, the discretization is well-suited for parallelization and execution on CPUs and GPUs. Additionally, we also target FPGAs with a fundamentally different execution model. There, the numerical algorithm is translated into a deep pipeline, such that we can process one triangle per clock cycle in the ideal case.
Efficient memory access patterns are essential for good performance on all architectures. While GPUs achieve their peak memory throughput if data like the mesh information is stored in a struct of arrays layout, an array of structs layout performs better on FPGAs. Because FPGAs do not have a memory hierarchy similar to CPUs and GPUs, we implemented static caches to reduce memory bottlenecks. These static caches enable high throughput for the smaller mesh sizes, as all required data resides in on-chip memory buffers. FPGAs with high-bandwidth memory are less limited by the memory bandwidth, but require a manual distribution of the data buffers across different memory channels.
To accommodate the different data layouts, FPGA caches, and the manual distribution over memory channels on FPGAs with high-bandwidth memory, all data accesses must be separated from the numerical algorithm. By leveraging C++ templates, we are able to share the numerical algorithm across all architectures. An architecture-dependent SYCL layer is then responsible for providing memory access classes suitable for CPUs and GPUs or FPGAs.
We compare the performance of the different architectures by computing throughput in terms of degrees of freedom per second (DOF/s). Tested hardware includes current-generation CPUs like the Intel Sapphire Rapids, AMD Genoa-X or Nvidia Grace chips, the high-end Nvidia H100 and AMD MI210 GPUs, as well as FPGAs with and without high-bandwidth memory. Software counters also provide energy consumption on this hardware, and we utilize this to also compare the tested hardware in terms of energy efficiency (degrees of freedom per joule).
Our results show that for larger sizes, GPUs with their high memory bandwidth and floating-point performance yield the highest throughput of all architectures. CPUs are slower by approximately a factor of ten, and the performance on the FPGAs is severely limited by their memory bandwidth. However, for meshes that do fit into the custom-built caches, FPGAs have the same performance as GPUs or can even be faster, as CPUs and GPUs are typically underutilized by the small meshes. In terms of energy efficiency, GPUs are typically 4x to 10x more energy efficient than CPUs. FPGAs are able to compete with the CPUs in terms of energy efficiency, as their lower performance is also reflected in lower power consumption.
Preliminary results of an MPI+SYCL implementation of the solver show good scalability for systems with four Nvidia A100 or AMD MI210 GPUs, and on eight AMD EPYC CPUs with a total of 512 cores.
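The pattern of separating data access from the numerical kernel, as described in this abstract, can be sketched as follows. The class and function names are invented for illustration and are not the solver's actual code; the point is that the algorithm is templated on an accessor type, and each architecture supplies a layout-specific implementation (struct of arrays for CPUs/GPUs, array of structs for FPGAs).

```cpp
#include <cstddef>

struct SoAAccessor {            // CPU/GPU-friendly: struct of arrays
  const double* h;  const double* hu;
  double get_h (std::size_t i) const { return h[i];  }
  double get_hu(std::size_t i) const { return hu[i]; }
};

struct AoSAccessor {            // FPGA-friendly: array of structs
  struct Cell { double h, hu; };
  const Cell* cells;
  double get_h (std::size_t i) const { return cells[i].h;  }
  double get_hu(std::size_t i) const { return cells[i].hu; }
};

// The numerical update is written once against the accessor interface and
// instantiated per architecture with the appropriate layout.
template <typename Access>
double momentum_flux(const Access& a, std::size_t i) {
  const double h = a.get_h(i);
  return a.get_hu(i) * a.get_hu(i) / h;   // hu^2 / h term of the SWE flux
}
```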
In this poster, we introduce device offloading[1] MemorySanitizer for SYCL, a dynamic debugging tool designed to detect uses of uninitialized memory (UUM) within the device code of SYCL in Intel oneAPI DPC++. Building upon the core algorithm of LLVM MemorySanitizer[2], which is traditionally used for host C/C++ code, our enhanced sanitizer extends its capabilities to identify UUM in kernel code for both CPU and GPU backends. This represents the first memory debugging tool capable of identifying UUM in SYCL kernel code. Our initial implementation focuses on device Unified Shared Memory (USM) to address common use cases with minimal effort. We detail the algorithmic enhancements, including shadow memory mapping, device code instrumentation, and the development of host and offloading runtime libraries. Our tool seamlessly integrates with SYCL, providing robust debugging support for heterogeneous computing environments. Through a practical example, we demonstrate the effectiveness of our MemorySanitizer in identifying and reporting UUM, thereby enhancing the debugging capabilities for developers. This work advances the state of memory debugging tools for SYCL, facilitating more reliable and efficient development in heterogeneous computing.
Over the past few years, two standards have gained significant momentum spurred by the growing interest in the design and development of customized, domain-specific hardware accelerators for AI and data analytics. On the hardware side, RISC-V has emerged as an open and extensible architecture with flexible vector extensions (RVV) for scaling computationally-intensive workloads. On the software side, SYCL has emerged as a standardized, cross-architecture programming model that can provide performance portability across several accelerators. In this work, we describe our efforts to bring these two standards together by (i) developing an FPGA-based hardware acceleration platform that integrates CVA6 RISC-V core with the ARA2 RVV implementation, and (ii) customizing the platform to enable SYCL kernel offload via the oneAPI Construction Kit. We evaluate several SYCL applications using this platform to demonstrate the benefit of RVV and make our platform publicly available as a standards-based testbed for evaluating SYCL applications on RISC-V vector accelerators.