THIS PAGE IS IN THE PROCESS OF BEING UPDATED – PLEASE CHECK BACK REGULARLY

The person presenting the talk is highlighted in bold.

Wednesday 6 May

08:30-09:00 | Registration and Coffee
09:00-09:30
Welcome and Introduction
Speakers: TBD
09:30-10:00 – KEYNOTE PRESENTATION
Session Title
Ben Ashbaugh, Intel
show / hide abstract
Pending
10:00-10:30
Session Title
Hu He, Tsinghua University
show / hide abstract
Pending
10:30-11:15 | Morning Coffee Break & Posters
11:15-11:45
Advancing OpenCL-Based LLM Inference in llama.cpp: Optimizing Heterogeneous GPU Performance and Enabling Next‑Generation Model Architectures
Hongqiang Wang, Li He, Shangqing Gu, Shaofei Qi, Yunjie Xu and Alex Bourd, Qualcomm
show / hide abstract

Over the past year, we have significantly advanced the OpenCL backend of llama.cpp, expanding its performance, portability, and model coverage across a range of Qualcomm Adreno GPUs and other heterogeneous devices. Building on our initial OpenCL integration—designed to provide broad compatibility and a unified compute-focused execution path—we have developed new optimizations that substantially improve inference throughput for both small and large models. In particular, we introduce targeted kernel enhancements that accelerate small-model workloads, device-customized optimizations for diverse Qualcomm GPUs, and substantial improvements for Mixture-of-Experts (MoE) architectures. For OpenAI’s GPT-OSS-20B model with MoE on the latest Qualcomm Snapdragon X2 Elite device, we achieved a more than fourfold speedup in prefill performance, increasing throughput from approximately 120 tokens/s to over 500 tokens/s through specialized kernel restructuring, memory‑access tuning, and load-balancing strategies across experts.

This work also highlights the unique opportunities and constraints of using OpenCL as a backend for modern LLMs. Unlike APIs with native cooperative matrix or tensor core abstractions, OpenCL currently lacks a standardized cooperative matrix interface, requiring us to engineer portable yet highly optimized GEMM implementations for both dense and MoE workloads. We discuss how this API-level limitation influences kernel design, and we present several strategies—such as adaptive tiling, subgroup‑aware parallelization, and device‑specific kernel variants—to achieve high utilization without relying on dedicated matrix-acceleration instructions. Additionally, we examine the growing diversity of LLM architectures, including models that incorporate dynamic routing, sparsity patterns, or emerging attention mechanisms, and how these trends increase pressure on backend flexibility, programmability, and kernel specialization.

Finally, we outline our roadmap for the OpenCL backend in llama.cpp, including support for additional quantization schemes, FlashAttention-style kernels, expanded INT8 inference paths, and improvements to portability and auto-tuning across more GPU vendors. These enhancements continue to make OpenCL a viable and efficient choice for deploying LLM inference on a broad spectrum of edge and client devices, including Snapdragon X Elite, Snapdragon 8 Elite, and other GPUs where portability and performance must coexist.

11:45-12:15
Optimizing AI Workloads on Intel GPUs with OpenCL
Michal Mrozek, Intel
show / hide abstract

Artificial Intelligence (AI) workloads are growing at an unprecedented pace, both in complexity and scale. From large language models to real-time inference in edge devices, the demand for high-performance, low-latency execution on heterogeneous platforms has never been greater. GPUs have become the backbone of AI acceleration, but achieving optimal performance is not simply a matter of raw compute power. It requires a deep understanding of how workloads interact with the underlying hardware and software stack, and how to fine-tune every layer for efficiency. This talk focuses on Intel GPUs and the OpenCL programming model, exploring how driver-level optimizations and advanced submission strategies can unlock new levels of performance for AI pipelines.

The talk will consist of three main sections:

1. Queue Model Design and Challenges
Efficient task submission is the foundation of GPU performance. We begin by examining the design of queue models for AI workloads, which differ significantly from conventional compute pipelines. Attendees will learn:

  • How to structure queues to balance throughput and latency.
  • Strategies for minimizing synchronization overhead.
  • Techniques for reducing resource contention in multi-queue environments.

We will also discuss heuristics for deciding when to submit tasks immediately versus batching them, and how these decisions impact overall execution efficiency. Real-world examples will illustrate how improper queue design can lead to bottlenecks, and how optimized models can deliver measurable gains.

2. Ultra Low Latency Submission (ULLS)
For latency-sensitive AI applications—such as real-time inference or interactive workloads—traditional submission models are insufficient. Enter Ultra Low Latency Submission (ULLS), a technology designed to minimize dispatch overhead and enable near-instantaneous task execution. This section will cover:

  • The architectural principles behind ULLS.
  • How ULLS changes the decision-making process for task submission.
  • Trade-offs between ULLS and conventional batching strategies.

We will present how ULLS reduces jitter and improves predictability, making it ideal for scenarios where every microsecond counts.

3. Memory Allocation and Optimization

Memory management is often the silent performance killer in AI workloads. Frequent allocations, fragmentation, and unpredictable access patterns can introduce significant overhead. This session will explore:

  • How AI workloads allocate memory and why traditional approaches fail.
  • Driver-level optimizations such as memory pooling and resource recycling.
  • Techniques for reducing allocation latency and improving predictability.

We will show how these optimizations not only improve raw performance but also enhance stability under heavy load. Practical examples will demonstrate how memory pooling can reduce allocation overhead by orders of magnitude.

12:15-12:45
Open Source Deep Learning Compiler powering GenAI on Adreno GPU
Siva Rama Krishna Reddy B, Krishna Raju Vegiraju, Hongqiang Wang and Alex Bourd; Qualcomm and Visvesvaraya Technological University
show / hide abstract

This technical presentation will provide a comprehensive update on our recent efforts to power Generative AI solutions on Adreno GPUs using open-source software. For many years we have been contributing to and enhancing OpenCL backend support for Adreno GPUs in projects like TVM (Tensor Virtual Machine) and, in recent years, MLC (Machine Learning Compilation). Recently, Qualcomm has begun contributing to newer, evolving projects such as llama.cpp as well.

In this presentation, we will share the past year's enhancements to TVM and MLC that significantly advance Generative AI performance on Adreno GPUs across mobile and compute products. To be precise, this presentation covers the following aspects:

  1. Although the TVM community has deprecated Relay in favor of the more advanced Relax compiler, the Adreno GPU enhancements were initially missing from Relax. We have now up-streamed Adreno GPU enhancements such as texture paths, specialized layouts, memory management, and OpenCLML extension support, so the mainline Relax solution includes everything we support for Adreno GPUs.
  2. Generative AI performance is significantly improved by adding new accelerated ops to OpenCLML. We now have additional BYOC compilation passes that can offload these new ops. Prefill performance of all GenAI models has been improved by 2x with this enhancement. This acceleration is available from OpenCLML extension 5 onwards.
  3. Going forward, we will support a Vulkan backend in parallel with OpenCL. Thanks to TVM's target-independent lowering, more than 90% of the optimizations can be reused across both backends, leaving only codegen and a few runtime enhancements as a one-time effort. This path additionally enables Vulkan-specific acceleration, such as cooperative matmul, to deliver performance close to OpenCLML-extension-driven acceleration in a purely open-source, specification-based way.

Additionally, we will share performance numbers and point the audience to the various resources available for the above solutions.

12:45-14:15 | Lunch Break & Posters
14:15-14:45
Session Title
Pekka Jääskeläinen, Tampere University
show / hide abstract
Pending
14:45-15:15
CLVizulayer
Ewan Crawford, StreamHPC
show / hide abstract

OpenCL allows application developers to compile their OpenCL code against an Installable Client Driver (ICD) loader, a shared library that contains symbols for the OpenCL entry points. At runtime the ICD loader then forwards the OpenCL entry-point calls on to the chosen OpenCL implementation (ICD). This arrangement not only allows developers to build code in a vendor-agnostic way but also enables the existence of a layer mechanism[1]. Layers are a mechanism for intercepting OpenCL calls made to the ICD loader before they are forwarded on to the ICD. This concept has been proven in other APIs such as Vulkan, and enables the creation of layers for application debugging, tracing, or even implementing additional functionality[2]. A user running an OpenCL application can then choose which layers to load when they invoke the executable.

This presentation will debut the CLVizulayer tool[3], an OpenCL ICD loader layer for printing the graph of asynchronous device tasks to a file in Graphviz DOT[4] format. This tool allows an OpenCL user to graphically see the directed acyclic graph (DAG) of OpenCL device submissions in the application, with nodes representing clEnqueue commands, and edges based on the prerequisites in the OpenCL execution model[5] that constrain the execution of those commands.

Being able to visualize the full device scheduling graph of an application provides a different perspective to timeline tracing, such as you would get from the OpenCL Intercept Layer[6] or other profiling tools, which shows the ordering of command execution. For example, if two commands don't share any dependencies they may happen to execute sequentially due to implementation-defined behavior, and would be represented sequentially in a timeline trace. Although a user diving into the trace detail may find the events returned and the wait list for each command, the complete graph of execution constraints is not as immediately apparent as in the visual format output by CLVizulayer.

For large applications the full scheduling graph can be huge, so finer-grained mechanisms are also defined by CLVizulayer to give developers more control over which commands appear in the DOT file, at the expense of having to modify their application. Two extensions, cl_ext_dot_print & cl_ext_command_buffer_dot_print, are defined by the tool to give users this extra control. cl_ext_dot_print allows a user to express the specific OpenCL command-queues to start capturing, and an explicit end point at which to write out the DOT graph of captured commands. cl_ext_command_buffer_dot_print lets a user print out the commands of a cl_command_buffer_khr object, a mechanism that already has equivalents in CUDA Graph's cudaGraphDebugDotPrint()[7] and sycl_ext_oneapi_graph's command_graph::print_graph()[8].

In the technical talk we will give an overview of the layer, covering how it works and how it can be used, including modifying code to use the two extensions defined in the project. Additionally, we will show traces captured from ML and HPC workloads, including llama.cpp[9], Leela Chess Zero, GROMACS, and LAMMPS, to illustrate the value of using the layer.

[1] https://github.com/Kerilk/OpenCL-Layers-Tutorial
[2] https://www.iwocl.org/wp-content/uploads/6895-James-Brodman-Intel.pdf
[3] https://github.com/EwanC/CLVizulayer
[4] https://graphviz.org/doc/info/lang.html
[5] https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_execution_model
[6] https://dl.acm.org/doi/10.1145/3204919.3204933
[7] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1gbec177c250000405c570dc8c4bde20db
[8] https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
[9] https://www.iwocl.org/wp-content/uploads/iwocl-2025-hongqiang-wang-lamacpp-backend-update.pdf

15:15-15:45
Towards OpenCL in Safety Critical Systems: Lessons Learnt from a RISC-V Space GPU Platform
Marc Solé i Bonet, Jannis Wolf and Leonidas Kosmidis; Universitat Politècnica de Catalunya (UPC) & Barcelona Supercomputing Center (BSC)
show / hide abstract
Khronos’ OpenCL has been the de facto open API for GPU compute acceleration in numerous embedded systems for more than a decade. With the recent need for high-performance processing and high autonomy in safety-critical systems, GPUs are in the process of being adopted in new domains with specific requirements. In this short research paper, we describe our experience in the process of enabling OpenCL support in a RISC-V based GPU platform we have developed for space systems. We discuss the limitations we have identified for supporting the full OpenCL specification, which prevent a conformant implementation and call for a safety-critical subset. We explain the workaround we have adopted using a lower-level API, demonstrating the feasibility of GPU computations in safety-critical systems.
15:45-16:30 | Afternoon Coffee Break & Posters
00:00-00:00
Session Title
Paulius, Organisation
show / hide abstract
Pending
17:30 | Close

Thursday 7 May

08:30 | Welcome Tea and Coffee
09:00-09:30
Session Title
Thomas Applencourt, Argonne National Laboratory
show / hide abstract
Pending
09:30-10:00
ProtoSYCL: A Sample Implementation of a SYCL Compiler for Conformance Test Suite Development
Michael Aziz, Intel. Yvan Labiche, Carleton University
show / hide abstract
The SYCL Conformance Test Suite (CTS) is critical for ensuring an implementation conforms to the specification. However, verifying its correctness and completeness is limited by the time it takes for production-grade SYCL implementations to implement new features: defects and coverage gaps may go undetected until an implementation supporting the new features becomes available. Detecting specification and testing defects early is preferred, as that is when they are easier to resolve. To address this, we introduce ProtoSYCL, a sample implementation of a SYCL compiler designed specifically for test suite development. ProtoSYCL is a single-source single compiler pass (SSCP) implementation of the SYCL 2020 specification that prioritizes conformance over performance. ProtoSYCL currently passes 68 of the 72 CTS test categories and has already led to the discovery of several defects therein. Furthermore, we show that the coverage achieved by subsets of the CTS on ProtoSYCL correlates with that achieved by those subsets on existing production-grade SYCL implementations. These results indicate that ProtoSYCL can serve as an effective proxy for finding defects and identifying coverage gaps in the CTS.
10:00-10:30
SYrtos: Extending SYCL for Real-Time Programming
Biagio Cosenza & Giovanni De Pierro, University of Salerno. Federico Terraneo, Daniele Cattaneo & Giovanni Agosta, Politecnico di Milano
show / hide abstract
In real-time programming, it is important to distinguish between events that require high responsiveness and those with fewer restrictions. This work introduces SYrtos, an extension to the SYCL standard that supports real-time applications. SYrtos enhances the SYCL execution model by adding the capability to specify priorities for queues and kernels. This feature, coupled with the ability to perform task preemption, allows different applications to be co-scheduled on the same hardware device while meeting real-time guarantees.
The SYrtos specification is designed for implementation on a variety of hardware and software platforms, ranging from microcontrollers with dedicated real-time operating systems to more powerful Linux-based systems. This enables true application portability across the entire embedded computing spectrum. This short paper presents a proof-of-concept implementation of the SYrtos interface on top of the Linux POSIX API, as well as preliminary experimental results on an Intel CPU.
10:30-11:15 | Morning Break and Posters
11:15-11:45
Safety-Oriented GPU Programming via Language Restriction
Marcos Rodriguez, Jon Pedernales, and Irune Yarza, Ikerlan Research Center. Leonidas Kosmidis, Barcelona Supercomputing Center
show / hide abstract
Emerging safety-critical applications increasingly rely on GPU-based computation to meet performance requirements; however, existing GPU software development workflows remain largely performance-driven and are not fully aligned with functional safety standards such as IEC 61508 and ISO 26262.
In particular, current approaches provide limited support for deterministic, analysable, and verifiable behaviour, which are essential for safety compliance.
A key factor in achieving compliance is the selection of programming languages and memory models that enable deterministic, analysable, and verifiable behaviour. This paper presents a functional-safety-oriented development flow for GPU applications, starting from CUDA code automatically generated from Simulink models. The generated code is restructured using multiple memory models to show the language and pipeline flexibility. A state-of-the-art tool – SYCLomatic – is then used to migrate the CUDA code to SYCL, enhancing portability and hardware abstraction. However, as standard SYCL does not directly satisfy functional safety requirements, Clang-based static analysis is applied to identify language constructs that are considered unsafe under applicable safety standards. These constructs are systematically replaced with safer alternatives, establishing a compliant code base and providing a foundation for future adoption of a safety-certified language subset such as Safety-Critical SYCL (SYCL SC).
11:45-12:15
AdaptiveCpp Portable CUDA: A SYCL-Compatible CUDA Compiler for CPUs and GPUs from Multiple Vendors
Aksel Alpay and Vincent Heuveline, Heidelberg University
show / hide abstract
We present AdaptiveCpp portable CUDA (PCUDA), a new compiler and runtime which enables compilation and execution of HIP/CUDA code on CPUs, NVIDIA GPUs, AMD GPUs and Intel GPUs. PCUDA is implemented as part of the AdaptiveCpp compiler project, and is designed to be fully interoperable on the source and runtime level with other programming models that AdaptiveCpp supports, in particular SYCL.
To our knowledge, AdaptiveCpp PCUDA is the first HIP/CUDA compiler that has full interoperability with SYCL, while also inheriting SYCL’s portability. This can be beneficial, e.g., for iterative or partial porting of CUDA projects to SYCL.
We also argue that using PCUDA, SYCL and HIP/CUDA programming models can be compared more accurately, because the exact same compiler can be used for comparisons. We describe PCUDA’s design, and demonstrate its capabilities using a set of benchmarks on an AMD Ryzen 9 9950X CPU, as well as Intel Arc B580, NVIDIA A100 and AMD MI300X GPUs. We find that the CUDA versions of these benchmarks, when compiled with PCUDA, closely match performance of the AdaptiveCpp-compiled SYCL versions. Larger deviations can be attributed to differences in the input code. This indicates that the SYCL abstractions do not inherently introduce performance penalties compared to HIP/CUDA’s lower abstraction level. The overall performance of PCUDA is found to be highly competitive, in the geometric mean across benchmarks outperforming vendor GPU compilers between 7% and 31%, depending on platform and configuration. We also show that PCUDA compiles the CloverLeaf application roughly 10x faster than SYCL, indicating a severe compile time disadvantage of the SYCL model.
Comparing kernel submission throughput between SYCL and PCUDA reveals submission overheads in SYCL that are primarily caused by unnecessary event creation.
12:15-12:45
Enabling Vendor-Portable GPU Acceleration for Python-Based Quantum Chemistry with SYCL
Abhishek Bagusetty, Alvaro Vazquez-Mayagoitia, and Brice Videau, Argonne National Laboratory. Qiming Sun, Bytedance Seed
show / hide abstract
gpu4PySCF is a GPU-accelerated extension of PySCF that enables quantum chemistry simulations (e.g., density functional theory, Hartree–Fock, and post-Hartree–Fock methods) using a two-layer acceleration design: (i) Python-level array/tensor operations implemented with CuPy, and (ii) custom CUDA kernels for integral evaluation and related primitives. While effective on NVIDIA GPUs, this design creates practical portability barriers for heterogeneous HPC systems increasingly composed of Intel, AMD, and NVIDIA accelerators. This paper presents a methodology-focused approach for porting gpu4PySCF from CUDA to a vendor-portable module: using CuPy with DPNP at the Python layer, translating CUDA kernels to SYCL at the kernel layer, and adding a thin compatibility header to minimize kernel source changes. We summarize concrete migration patterns, integration mechanisms between Python and SYCL kernels, expected challenges (e.g., sub-group semantics, local memory declaration scope, atomic memory model), and build/deployment practices for multi-backend compilation with DPC++. The result is a reusable blueprint for porting Python-driven scientific GPU applications to SYCL without rewriting the full software stack.
12:45-14:15 | Lunch Break and Posters
14:15-14:45
Evaluating the AdaptiveCpp Single-Pass (SSCP) SYCL compiler for GROMACS on Modern AMD Accelerators
Bálint Soproni, Aksel Alpay and Vincent Heuveline, Heidelberg University
show / hide abstract

The SYCL specification allows for multiple implementation strategies, in particular SSCP (single-source, single compiler pass) and SMCP (single-source, multiple compiler passes). The default compiler of the AdaptiveCpp SYCL implementation is an SSCP JIT compiler, which has previously been shown to deliver substantial speedups for certain applications, while also reducing compilation times. However, systematic performance evaluations of that compiler have focused mostly on small or medium-sized applications. Additionally, to our knowledge, the impact of supporting both SSCP as well as SMCP compilers in large production code bases has not yet been thoroughly studied.

In this work, we explore the applicability of the AdaptiveCpp JIT compiler to a highly-optimized, production code base: GROMACS – a widely used molecular dynamics software package that currently relies on SYCL and the AdaptiveCpp SMCP compiler to target AMD GPUs. We evaluate the ported application across a variety of input problems covering common simulation scenarios on MI210, MI300A, and MI300X AMD GPUs. We show that the SSCP JIT compiler outperforms the currently used SMCP AdaptiveCpp compiler in high-atom-count workload configurations by 10-25% and increases the peak simulation throughput of each tested GPU by up to 10%, measured in terms of simulated atoms per second. These findings confirm that the performance advantages of the SSCP JIT compiler also translate to production applications like GROMACS.

14:45-15:15
Using Intel Shared System USM to ease porting applications to run on GPUs with SYCL – a GROMACS case study
Mark Abraham and Francois Dugast, Intel
show / hide abstract

Porting mature scientific applications to heterogeneous hardware remains a key challenge for the scientific community, particularly for codes that rely heavily on complex pointer-based data structures and CPU-centric memory layouts. This talk presents how Intel’s newly implemented Shared System USM smoothly overcomes many of these legacy barriers and enables a natural migration path to GPU execution. By providing a single, coherent virtual address space shared between host and device, shared system USM allows developers to preserve familiar pointer semantics, reuse existing data structures, and incrementally introduce SYCL kernels without extensive refactoring or buffer-access boilerplate. Unlike traditional SYCL Unified Shared Memory, memory pages are inherently cross-device. They can be transferred automatically by the OS kernel, which often performs better than the runtime used in other shared-memory approaches. This builds upon new capabilities available in the Linux kernel and Intel GPU drivers.

We will present a practical, stepwise adoption strategy tailored to real-world HPC codes:

  1. Profile the application on the workload of interest to identify regions with high computational density suitable for a GPU port, e.g. using Intel Advisor from the oneAPI Base Toolkit.
  2. Preserve traditional host allocations with system new or malloc().
  3. Retain idiomatic C++ containers and RAII.
  4. Port the target kernel(s) to SYCL, e.g. by mapping the outermost loop to a sycl::parallel_for() construct. Care should still be taken to avoid I/O. The needs of concurrent access must be considered.
  5. Profile the code, e.g. with VTune from the oneAPI Base Toolkit and
    1. Exploit selective prefetching mechanisms where indicated, e.g. for large allocations
    2. Evolve to asynchronous command graphs, e.g. for overlapping with concurrent CPU tasks.
    3. Optimize memory allocation where indicated, e.g. by grouping related data on the same memory page or aligning allocations to page boundaries.
    4. Optimize memory layout, e.g. so concurrent GPU threads can access similar data, and
    5. Optimize memory access patterns, e.g. so more computational work is done for each memory access.

We will illustrate this strategy using case studies in, and mini-apps based on, the molecular-dynamics application GROMACS running on currently available hardware and software. This application already has mature GPU ports but still has a simple CPU-only “reference” implementation that we can port for this illustration. It includes complex data structures, such as N-body pair lists, that can be problematic to use on GPUs without shared system USM. It contains examples of the above optimization approaches that can be understood readily in this context. We will demonstrate how much of the available performance can be obtained for limited effort.
Overall, we demonstrate how shared system USM shortens time-to-GPU, reduces code churn, makes it easier to implement both CPU and GPU ports in the same code base, supports rapid experimentation, and can even provide better performance than explicit-transfer constructs — key themes for the IWOCL community’s focus on performance portability and developer productivity. Intel’s shared system USM implementation provides a compelling pathway for bringing established scientific software to SYCL-enabled GPUs, aligning cleanly with IWOCL’s mission to advance open, portable, and maintainable heterogeneous programming models.

00:00-00:00
Session Title
Speaker, Organisation
show / hide abstract
Pending
15:45-16:30 | Afternoon Break and Posters
16:30-18:00
Panel Discussion
Speakers TBD
show / hide abstract
Pending
17:30 | Close

Friday 8 May

08:30 | Welcome Tea and Coffee
09:00-09:30
Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL
Tim Thüring, Alexander Strack and Dirk Pflüger, University of Stuttgart
show / hide abstract
Many important real-world applications, such as System Identification with Gaussian Processes, involve solving linear systems with symmetric positive-definite matrices. The iterative conjugate gradient (CG) method and direct solvers based on the Cholesky decomposition are two popular methods that can be applied in this case. Since very large systems often have to be solved when dealing with such real-world scenarios, GPUs are commonly used to accelerate the computations. However, homogeneous approaches that only leverage the GPU in the system do not take full advantage of the often powerful CPUs found in modern HPC systems.
In this work we present multi-vendor, heterogeneous implementations of the CG method and the Cholesky decomposition that leverage the CPU and GPU of a heterogeneous system simultaneously using SYCL. Furthermore, we compare their runtime behavior to traditional, homogeneous approaches. The results show that for large matrices our heterogeneous implementation is up to 32 percent faster for the CG method and up to 29 percent faster for the Cholesky decomposition compared to the corresponding GPU-only implementations.
In addition, for large matrices our heterogeneous implementation of the Cholesky decomposition can achieve at least 12 percent faster runtimes across several systems with GPUs from NVIDIA, AMD, and Intel.
09:30-10:00
Lightweight Tracing Interface for SYCL’s USM Model Implemented in AdaptiveCpp
Jakob Niessner and Aksel Alpay, Heidelberg University. Thomas Applencourt, Argonne National Laboratory
show / hide abstract
SYCL is a royalty-free, cross-platform abstraction layer that enables code for heterogeneous compute architectures to be written in a “single-source” style using nearly complete standard C++ (except for some restrictions). Due to its versatility, its fields of application range from scientific computing to telecommunications. However, unlike most other heterogeneous programming models, such as Kokkos, OpenMP target, and OpenCL, the SYCL standard lacks a universal tooling utility. While Intel has recognized this shortcoming and introduced XPTI for their own SYCL compiler, icpx, this tool is vendor-specific and was not designed to be ported to other compilers. In this paper, we introduce an alternative approach for tracing the SYCL API. To do this, we introduced a set of entry and exit points for SYCL’s USM model, and implemented them in AdaptiveCpp as a proof of concept.
10:00-10:30
Update on SYCL Usage in the Shamrock Framework and Exascale Scalability
Timothée David Cléris, IPAG, University Grenoble Alpes
show / hide abstract

At IWOCL 2025, we introduced Shamrock, a native SYCL-MPI multi-GPU framework designed for particle- and mesh-based methods, providing scalable neighbor finding and load balancing through layered abstractions. Shamrock solvers are internally expressed as a graph structure consisting of a DAG extended with subcycles within subgraphs. The very large number of possible execution paths makes explicit kernel scheduling with streams impractical. Consequently, the initial version of Shamrock primarily relied on sycl::buffers, except for direct GPU-aware MPI communications.

While convenient, sycl::buffers can introduce unexpected overheads such as submission latency and increased register pressure, and they are incompatible with GPU-aware MPI, which requires USM pointers. Over the past year, we therefore migrated Shamrock from sycl::buffers to USM using a lightweight in-house wrapper designed to minimize code changes during the transition.

In parallel, we adapted Shamrock to run and scale efficiently on the Aurora supercomputer. The original version already achieved 93% parallel efficiency on 2,048 GPU tiles on the Adastra supercomputer. However, scaling beyond 10,000 GPUs revealed several edge cases in SYCL–MPI interactions, including unknown memory costs associated with large numbers of messages, message loss between senders and receivers, and performance degradation in large collective operations. Addressing these issues required substantial adaptations to the code.

As a result, Shamrock now achieves up to 220 billion SPH particles processed per second on 24,000 GPU tiles, using 768 TB of GPU memory in a weak-scaling experiment, while sustaining 83% parallel efficiency.

In this presentation, we detail our experience migrating from sycl::buffers to USM and discuss our recent large-scale results on Aurora. Finally, we outline future directions for Shamrock’s usage of SYCL, including tighter integration of MPI communication into the SYCL queue DAG, the use of modern GPU-specific accelerators, and the need for fully asynchronous memory management beyond the current SYCL 2020 specification.

10:30-11:15 | Morning Break and Posters
11:15-11:45
A GPU Backend for the SYCL-Based 2D Stencil Framework StencilStream and a Comparison with its FPGA Backends
Tim Stöhr, Jan-Oliver Opdenhövel, Christian Plessl and Tobias Kenter, Paderborn University
show / hide abstract
2D stencil codes are generally easy to define, but often nontrivial to execute efficiently on highly parallel computing architectures. The SYCL-based stencil simulation framework StencilStream solves this problem by offering performance-portable abstraction layers that separate the concerns of application developers and performance engineers. With this work, we present a new GPU backend for the previously FPGA-focused framework, as well as improvements to its FPGA backends and a comparison between them. Through the use of automatic memory layout transformations, we can exploit up to 88.8% of the global memory throughput available on an Nvidia A100 GPU and achieve throughputs of up to 131.6 GCells/s for the 5-point Jacobi benchmark. In addition, the introduction of spatial parallelism to StencilStream's FPGA backends leads to speedups of up to 307% relative to the previous StencilStream version for the HotSpot benchmark and allows us to achieve up to 176.1 GCells/s for the 5-point Jacobi benchmark on an Altera/Intel Stratix 10 FPGA.
11:45-12:15
Porting ThunderKittens from CUDA to SYCL for Intel GPU: Process, Challenges, and Lessons Learned
Yehong Jiang, Shen Chen, Fangwen Fu, Vincent Lu, Y K Chen, Hong Wang and Xinmin Tian, Intel
show / hide abstract
This paper presents our experience porting ThunderKittens, a CUDA-based GPU programming framework for writing fast deep learning kernels, to SYCL for Intel GPUs. We detail a three-stage methodology: automated conversion, manual adaptation, and architecture-specific optimization. Our work focuses on translating ThunderKittens' C++ template library, which implements the framework's domain-specific language (DSL), replacing PTX template implementations with Intel GPU-compatible alternatives. We address fundamental architectural adaptations, including transitioning from CUDA's register data layout and shared memory-based execution model to Intel GPU's optimal solutions, which feature different register data layouts and L1 cache prefetch-based memory latency coverage schemes. Our work reveals both the capabilities and limitations of current CUDA-to-SYCL translation tools. This case study shows that while 100% performance portability remains challenging due to architectural differences, structural performance portability is achievable through consistent abstractions that maintain familiar programming patterns across platforms. Our methodology provides valuable guidance for developers undertaking similar cross-platform GPU programming efforts, particularly when adapting complex framework-level abstractions across different GPU architectures.
12:15-12:45
A Cross-Vendor Implementation of PopSift Using SYCL
Mohammad Fadel Al Khafaji and Carsten Griwodz, University of Oslo; Håkon Kvale Stensland, Simula Research Laboratory
show / hide abstract
PopSift is an open-source CUDA implementation of Lowe's SIFT algorithm, used to extract and describe keypoints in 2D images. It is meant to be a faithful SIFT implementation and serves as the natural feature extraction stage of the photogrammetry pipeline in AliceVision Meshroom, a 3D computer vision framework. It is used in applications ranging from object recognition to 3D reconstruction. However, because it is written in CUDA, it is vendor-locked to Nvidia GPUs, limiting its use and exposure. We therefore reimplement the application in SYCL to make it available on a wider range of platforms. PopSift-SYCL is a portable, cross-vendor, and efficient port of the original PopSift that achieves both good speed and correctness. It can be compiled by both Intel's DPC++ compiler and the open-source AdaptiveCpp compiler. We present the stages of PopSift-SYCL and its performance.
12:45-14:00 | Lunch Break
14:00 | Close

Posters

Posters will be on display during the breaks Wednesday through Friday.