Wednesday 6 May
Over the past year, we have significantly advanced the OpenCL backend of llama.cpp, improving its performance, portability, and model coverage across a range of Qualcomm Adreno GPUs and other heterogeneous devices. Building on our initial OpenCL integration—designed to provide broad compatibility and a unified compute-focused execution path—we have developed new optimizations that substantially improve inference throughput for both small and large models. In particular, we introduce targeted kernel enhancements that accelerate small-model workloads, device-customized optimizations for diverse Qualcomm GPUs, and substantial improvements for Mixture-of-Experts (MoE) architectures. For OpenAI’s GPT-OSS-20B MoE model on the latest Qualcomm Snapdragon X2 Elite device, we achieved a more than fourfold speedup in prefill performance, increasing throughput from approximately 120 tokens/s to over 500 tokens/s through specialized kernel restructuring, memory‑access tuning, and load-balancing strategies across experts.
This work also highlights the unique opportunities and constraints of using OpenCL as a backend for modern LLMs. Unlike APIs with native cooperative matrix or tensor core abstractions, OpenCL currently lacks a standardized cooperative matrix interface, requiring us to engineer portable yet highly optimized GEMM implementations for both dense and MoE workloads. We discuss how this API-level limitation influences kernel design, and we present several strategies—such as adaptive tiling, subgroup‑aware parallelization, and device‑specific kernel variants—to achieve high utilization without relying on dedicated matrix-acceleration instructions. Additionally, we examine the growing diversity of LLM architectures, including models that incorporate dynamic routing, sparsity patterns, or emerging attention mechanisms, and how these trends increase pressure on backend flexibility, programmability, and kernel specialization.
Finally, we outline our roadmap for the OpenCL backend in llama.cpp, including support for additional quantization schemes, FlashAttention-style kernels, expanded INT8 inference paths, and improvements to portability and auto-tuning across more GPU vendors. These enhancements continue to make OpenCL a viable and efficient choice for deploying LLM inference on a broad spectrum of edge and client devices, including Snapdragon X Elite, Snapdragon 8 Elite, and other GPUs where portability and performance must coexist.
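Because OpenCL lacks a cooperative-matrix abstraction, GEMM performance has to come from classical blocking. As a rough illustration of the tiling idea only — in plain C++ rather than OpenCL C, and not the actual llama.cpp kernels, whose tile sizes and subgroup mapping are device-specific — a cache-blocked GEMM looks like:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy tiled GEMM: C += A * B for row-major M x K and K x N matrices.
// TILE stands in for the adaptive, per-device tile size described in the
// abstract; real kernels would also exploit subgroup-level parallelism.
template <std::size_t TILE = 4>
void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t K,
                std::size_t N) {
  for (std::size_t i0 = 0; i0 < M; i0 += TILE)
    for (std::size_t k0 = 0; k0 < K; k0 += TILE)
      for (std::size_t j0 = 0; j0 < N; j0 += TILE)
        // Accumulate one TILE x TILE block so operands stay hot in
        // registers/cache instead of being re-fetched per element.
        for (std::size_t i = i0; i < std::min(i0 + TILE, M); ++i)
          for (std::size_t k = k0; k < std::min(k0 + TILE, K); ++k) {
            const float a = A[i * K + k];
            for (std::size_t j = j0; j < std::min(j0 + TILE, N); ++j)
              C[i * N + j] += a * B[k * N + j];
          }
}
```

Choosing TILE per device — and per workload shape, as in the dense vs. MoE split — is what "adaptive tiling" amounts to in the absence of a standardized cooperative-matrix interface.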
Artificial Intelligence (AI) workloads are growing at an unprecedented pace, both in complexity and scale. From large language models to real-time inference in edge devices, the demand for high-performance, low-latency execution on heterogeneous platforms has never been greater. GPUs have become the backbone of AI acceleration, but achieving optimal performance is not simply a matter of raw compute power. It requires a deep understanding of how workloads interact with the underlying hardware and software stack, and how to fine-tune every layer for efficiency. This talk focuses on Intel GPUs and the OpenCL programming model, exploring how driver-level optimizations and advanced submission strategies can unlock new levels of performance for AI pipelines.
The talk will consist of three main sections:
1. Queue Model Design and Challenges
Efficient task submission is the foundation of GPU performance. We begin by examining the design of queue models for AI workloads, which differ significantly from conventional compute pipelines. Attendees will learn:
- How to structure queues to balance throughput and latency.
- Strategies for minimizing synchronization overhead.
- Techniques for reducing resource contention in multi-queue environments.
We will also discuss heuristics for deciding when to submit tasks immediately versus batching them, and how these decisions impact overall execution efficiency. Real-world examples will illustrate how improper queue design can lead to bottlenecks, and how optimized models can deliver measurable gains.
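The submit-now versus batch decision can be sketched as a simple policy. The batch limit and latency-sensitivity flag below are illustrative stand-ins, not Intel driver heuristics:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy submission model: latency-sensitive tasks flush immediately, while
// other tasks are batched so that per-submission overhead is amortized.
class Submitter {
 public:
  explicit Submitter(std::size_t batch_limit) : batch_limit_(batch_limit) {}

  void submit(std::function<void()> task, bool latency_sensitive) {
    pending_.push_back(std::move(task));
    if (latency_sensitive || pending_.size() >= batch_limit_) flush();
  }

  // One "real" submission covering the whole accumulated batch.
  void flush() {
    for (auto& t : pending_) t();
    pending_.clear();
    ++flushes_;
  }

  std::size_t flushes() const { return flushes_; }

 private:
  std::vector<std::function<void()>> pending_;
  std::size_t batch_limit_;
  std::size_t flushes_ = 0;
};
```

The trade-off is visible even in this sketch: a larger batch limit means fewer flushes (higher throughput) at the cost of delaying the tasks that sit in the batch (higher latency).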
2. Ultra Low Latency Submission (ULLS)
For latency-sensitive AI applications—such as real-time inference or interactive workloads—traditional submission models are insufficient. Enter Ultra Low Latency Submission (ULLS), a technology designed to minimize dispatch overhead and enable near-instantaneous task execution. This section will cover:
- The architectural principles behind ULLS.
- How ULLS changes the decision-making process for task submission.
- Trade-offs between ULLS and conventional batching strategies.
We will present how ULLS reduces jitter and improves predictability, making it ideal for scenarios where every microsecond counts.
3. Memory Allocation and Optimization
Memory management is often the silent performance killer in AI workloads. Frequent allocations, fragmentation, and unpredictable access patterns can introduce significant overhead. This session will explore:
- How AI workloads allocate memory and why traditional approaches fail.
- Driver-level optimizations such as memory pooling and resource recycling.
- Techniques for reducing allocation latency and improving predictability.
We will show how these optimizations not only improve raw performance but also enhance stability under heavy load. Practical examples will demonstrate how memory pooling can reduce allocation overhead by orders of magnitude.
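The pooling idea can be sketched in a few lines. This toy pool is not the Intel driver's allocator — real driver-level pools also handle size classes, alignment, and cross-queue lifetimes — but it shows why recycling beats repeated allocation: a reused block never touches the system allocator.

```cpp
#include <cstddef>
#include <vector>

// Minimal fixed-size block pool. allocate() reuses a released block when
// one is available and only falls back to the system allocator otherwise.
class BlockPool {
 public:
  explicit BlockPool(std::size_t block_bytes) : block_bytes_(block_bytes) {}
  ~BlockPool() {
    for (void* p : free_) ::operator delete(p);
  }

  void* allocate() {
    if (!free_.empty()) {  // recycle: no allocator call, no new page faults
      void* p = free_.back();
      free_.pop_back();
      return p;
    }
    ++fresh_allocations_;
    return ::operator new(block_bytes_);
  }

  void release(void* p) { free_.push_back(p); }  // return to pool, don't free

  std::size_t fresh_allocations() const { return fresh_allocations_; }

 private:
  std::size_t block_bytes_;
  std::vector<void*> free_;
  std::size_t fresh_allocations_ = 0;
};
```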
This technical presentation will provide a comprehensive update on our recent efforts to power Generative AI solutions on Adreno GPUs using open-source software. For many years we have been contributing to and enhancing OpenCL backend support for Adreno GPUs in projects such as TVM (Tensor Virtual Machine) and, in recent years, MLC (Machine Learning Compilation). Recently, Qualcomm has also been contributing to newer projects such as llama.cpp.
In this presentation, we will share the past year's enhancements to TVM and MLC that significantly advance Generative AI performance on Adreno GPUs across mobile and compute products. Specifically, this presentation covers the aspects below.
- Although the TVM community has deprecated Relay and moved to the more advanced Relax compiler, the Adreno GPU enhancements were initially absent from it. We have now upstreamed Adreno GPU enhancements such as texture paths, specialized layouts, memory management, and OpenCLML extension support, so the mainline Relax solution now has everything we support for Adreno GPUs.
- Generative AI performance is significantly improved by adding new accelerated ops to OpenCLML. We now have additional BYOC compilation passes that can offload these new ops. Prefill performance of all GenAI models has improved by 2x with this enhancement, which is available from OpenCLML extension 5 onwards.
- Going forward, we will support a Vulkan backend in parallel with OpenCL. TVM's target-independent lowering allows reuse of more than 90% of optimizations across both backends, leaving only codegen and a few runtime enhancements as a one-time effort. This path additionally enables Vulkan-specific acceleration, such as cooperative matmul, to deliver performance close to OpenCLML-extension-driven acceleration in a purely open-source, specification-based way.
Additionally, we will share performance numbers and point the audience to the various resources available for the above solutions.
OpenCL allows application developers to compile their OpenCL code against an Installable Client Driver (ICD) loader, a shared library that contains symbols for the OpenCL entry-points. At runtime the ICD loader then forwards the OpenCL entry-point calls on to the chosen OpenCL implementation (ICD). This arrangement not only allows developers to build code in a vendor-agnostic way but also enables the existence of a layer mechanism[1]. Layers are a mechanism for intercepting OpenCL calls made to the ICD loader before they are forwarded on to the ICD. This concept has been proven in other APIs such as Vulkan, and it enables the creation of layers for application debugging, tracing, or even implementing additional functionality[2]. A user running an OpenCL application can then choose which layers to load when they invoke the executable.
This presentation will debut the CLVizulayer tool[3], an OpenCL ICD loader layer for printing the graph of asynchronous device tasks to a file in Graphviz DOT[4] format. This tool allows an OpenCL user to graphically see the directed acyclic graph (DAG) of OpenCL device submissions in the application, with nodes representing clEnqueue commands, and edges based on the prerequisites in the OpenCL execution model[5] that constrain the execution of those commands.
Being able to visualize the full device scheduling graph of an application provides a different perspective from timeline tracing, such as you would get from the OpenCL Intercept Layer[6] or other profiling tools, which shows the ordering of command execution. For example, if two commands share no dependencies they may still happen to execute sequentially due to implementation-defined behavior, and would be represented sequentially in a timeline trace. Although a user diving into the trace detail may find the events returned and the wait list for each command, the complete graph of execution constraints is not as immediately apparent as in the visual format output by CLVizulayer.
For large applications the full scheduling graph can be huge, so finer-grained mechanisms are also defined by CLVizulayer to give developers more control over which commands appear in the DOT file, at the expense of having to modify their application. Two extensions, cl_ext_dot_print and cl_ext_command_buffer_dot_print, are defined by the tool to give users this extra control. cl_ext_dot_print allows a user to specify the OpenCL command-queues on which to start capturing, and an explicit end point at which to write out the DOT graph of captured commands, while cl_ext_command_buffer_dot_print lets a user print out the commands of a cl_command_buffer_khr object, a mechanism that already has equivalents in CUDA Graph's cudaGraphDebugDotPrint()[7] and sycl_ext_oneapi_graph's command_graph::print_graph()[8].
In the technical talk we will give an overview of the layer, covering how it works and how it can be used, including modifying code to use the two extensions defined in the project. Additionally, we will show traces captured from ML and HPC workloads, including llama.cpp[9], Leela Chess Zero, GROMACS, and LAMMPS, to illustrate the value of using the layer.
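The flavor of output such a layer produces can be sketched with a small DOT emitter. This is an illustration of the format, not CLVizulayer's actual implementation; the function signature and command names below are hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Emit a Graphviz DOT digraph where nodes are enqueued commands and
// edges (from -> to) are dependencies via event wait lists.
std::string to_dot(const std::vector<std::string>& commands,
                   const std::vector<std::pair<int, int>>& deps) {
  std::ostringstream out;
  out << "digraph tasks {\n";
  for (std::size_t i = 0; i < commands.size(); ++i)
    out << "  n" << i << " [label=\"" << commands[i] << "\"];\n";
  for (const auto& d : deps)
    out << "  n" << d.first << " -> n" << d.second << ";\n";
  out << "}\n";
  return out.str();
}
```

Rendering the resulting text with `dot -Tpng` gives the DAG view described above: commands with no connecting path are revealed as independent, even if a timeline trace happened to show them executing back-to-back.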
[1] https://github.com/Kerilk/OpenCL-Layers-Tutorial
[2] https://www.iwocl.org/wp-content/uploads/6895-James-Brodman-Intel.pdf
[3] https://github.com/EwanC/CLVizulayer
[4] https://graphviz.org/doc/info/lang.html
[5] https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_execution_model
[6] https://dl.acm.org/doi/10.1145/3204919.3204933
[7] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1gbec177c250000405c570dc8c4bde20db
[8] https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
[9] https://www.iwocl.org/wp-content/uploads/iwocl-2025-hongqiang-wang-lamacpp-backend-update.pdf
Thursday 10 May
The SYrtos specification is designed for implementation on a variety of hardware and software platforms, ranging from microcontrollers with dedicated real-time operating systems to more powerful Linux-based systems. This enables true application portability across the entire embedded computing spectrum. This short paper presents a proof-of-concept implementation of the SYrtos interface on top of the Linux POSIX API, as well as preliminary experimental results on an Intel CPU.
To our knowledge, AdaptiveCpp PCUDA is the first HIP/CUDA compiler that has full interoperability with SYCL, while also inheriting SYCL’s portability. This can be beneficial e.g. for iterative or partial porting of CUDA projects to SYCL.
Comparing kernel submission throughput between SYCL and PCUDA reveals submission overheads in SYCL that are primarily caused by unnecessary event creation.
The SYCL specification allows for multiple implementation strategies, in particular SSCP (single-source, single compiler pass) and SMCP (single-source, multiple compiler passes). The default compiler of the AdaptiveCpp SYCL implementation is an SSCP JIT compiler, which has previously been shown to deliver substantial speedups for certain applications, while also reducing compilation times. However, systematic performance evaluations of that compiler have focused mostly on small or medium-sized applications. Additionally, to our knowledge, the impact of supporting both SSCP as well as SMCP compilers in large production code bases has not yet been thoroughly studied.
In this work, we explore the applicability of the AdaptiveCpp JIT compiler to a highly optimized, production code base: GROMACS – a widely used molecular dynamics software package that currently relies on SYCL and the AdaptiveCpp SMCP compiler to target AMD GPUs. We evaluate the ported application across a variety of input problems covering common simulation scenarios on MI210, MI300A, and MI300X AMD GPUs. We show that the SSCP JIT compiler outperforms the currently used SMCP AdaptiveCpp compiler in high-atom-count workload configurations by 10-25% and increases the peak simulation throughput of each tested GPU by up to 10%, measured in terms of simulated atoms per second. These findings confirm that the performance advantages of the SSCP JIT compiler also translate to production applications like GROMACS.
Porting mature scientific applications to heterogeneous hardware remains a key challenge for the scientific community, particularly for codes that rely heavily on complex pointer-based data structures and CPU-centric memory layouts. This talk presents how Intel’s newly implemented Shared System USM smoothly overcomes many of these legacy barriers and enables a natural migration path to GPU execution. By providing a single, coherent virtual address space shared between host and device, shared system USM allows developers to preserve familiar pointer semantics, reuse existing data structures, and incrementally introduce SYCL kernels without extensive refactoring or buffer-access boilerplate. Unlike traditional SYCL Unified Shared Memory, memory pages are inherently cross-device. They can be transferred automatically by the OS kernel, which often performs better than the runtime used in other shared-memory approaches. This builds upon new capabilities available in the Linux kernel and Intel GPU drivers.
We will present a practical, stepwise adoption strategy tailored to real-world HPC codes:
- Profile the application on the workload of interest to identify regions with high computational density suitable for a GPU port, e.g. using Intel Advisor from the oneAPI Base Toolkit.
- Preserve traditional host allocations with system new or malloc().
- Retain idiomatic C++ containers and RAII.
- Port the target kernel(s) to SYCL, e.g. by mapping the outermost loop to a sycl::parallel_for() construct. Care should still be taken to avoid I/O. The needs of concurrent access must be considered.
- Profile the code, e.g. with VTune from the oneAPI Base Toolkit, and
- Exploit selective prefetching mechanisms where indicated, e.g. for large allocations
- Evolve to asynchronous command graphs, e.g. for overlapping with concurrent CPU tasks.
- Optimize memory allocation where indicated, e.g. by grouping related data on the same memory page or aligning allocations to page boundaries.
- Optimize memory layout, e.g. so concurrent GPU threads can access similar data, and
- Optimize memory access patterns, e.g. so more computational work is done for each memory access.
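Several of the steps above revolve around page granularity, since the OS kernel migrates shared system USM data page by page. As a small illustration of the page-alignment step only (the 4 KiB page size is an assumption; real code should query the platform), C++17 aligned allocation can be used:

```cpp
#include <cstddef>
#include <new>

// Assumed page size for illustration; query the OS in real code
// (e.g. sysconf(_SC_PAGESIZE) on Linux).
constexpr std::size_t kPageSize = 4096;

// Allocate a float array starting on a page boundary, so related data
// grouped into it shares pages and can be migrated as a unit.
float* alloc_page_aligned(std::size_t count) {
  void* p = ::operator new[](count * sizeof(float),
                             std::align_val_t{kPageSize});
  return static_cast<float*>(p);
}

// Matching aligned deallocation.
void free_page_aligned(float* p) {
  ::operator delete[](p, std::align_val_t{kPageSize});
}
```

With pointers like these, the same allocation can be dereferenced on host and, under shared system USM, inside a SYCL kernel, without explicit transfers.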
We will illustrate this strategy using both case studies in, and mini-apps based on, the molecular-dynamics application GROMACS, running on currently available hardware and software. This application already has mature GPU ports but still has a simple CPU-only “reference” implementation that we can port for this illustration. It includes complex data structures such as N-body pair lists that can be problematic to use on GPUs without shared system USM. It contains examples of the above optimization approaches that can be understood readily in this context. We will demonstrate how much of the available performance can be obtained for limited effort.
Overall, we demonstrate how shared system USM shortens time-to-GPU, reduces code churn, makes it easier to implement both CPU and GPU ports in the same code base, supports rapid experimentation, and can even provide better performance than explicit-transfer constructs — key themes for the IWOCL community’s focus on performance portability and developer productivity. Intel’s shared system USM implementation provides a compelling pathway for bringing established scientific software to SYCL-enabled GPUs, aligning cleanly with IWOCL’s mission to advance open, portable, and maintainable heterogeneous programming models.
Friday 11 May
In addition, for large matrices our heterogeneous implementation of the Cholesky decomposition can achieve at least 12 percent faster runtimes across several systems with GPUs from NVIDIA, AMD, and Intel.
At IWOCL 2025, we introduced Shamrock, a native SYCL-MPI multi-GPU framework designed for particle- and mesh-based methods, providing scalable neighbor finding and load balancing through layered abstractions. Shamrock solvers are internally expressed as a graph structure consisting of a DAG extended with subcycles within subgraphs. The very large number of possible execution paths makes explicit kernel scheduling with streams impractical. Consequently, the initial version of Shamrock primarily relied on sycl::buffers, except for direct GPU-aware MPI communications.
While convenient, sycl::buffers can introduce unexpected overheads such as submission latency and increased register pressure, and they are incompatible with GPU-aware MPI, which requires USM pointers. Over the past year, we therefore migrated Shamrock from sycl::buffers to USM using a lightweight in-house wrapper designed to minimize code changes during the transition.
In parallel, we adapted Shamrock to run and scale efficiently on the Aurora supercomputer. The original version already achieved 93% parallel efficiency on 2,048 GPU tiles on the Adastra supercomputer. However, scaling beyond 10,000 GPUs revealed several edge cases in SYCL–MPI interactions, including unknown memory costs associated with large numbers of messages, message loss between senders and receivers, and performance degradation in large collective operations. Addressing these issues required substantial adaptations to the code.
As a result, Shamrock now achieves up to 220 billion SPH particles processed per second on 24,000 GPU tiles, using 768 TB of GPU memory in a weak-scaling experiment, while sustaining 83% parallel efficiency.
In this presentation, we detail our experience migrating from sycl::buffers to USM and discuss our recent large-scale results on Aurora. Finally, we outline future directions for Shamrock’s usage of SYCL, including tighter integration of MPI communication into the SYCL queue DAG, the use of modern GPU-specific accelerators, and the need for fully asynchronous memory management beyond the current SYCL 2020 specification.
Posters
Posters will be on display during the breaks Wednesday through Friday.
