Wednesday 6 May
Over the past year, we have significantly advanced the OpenCL backend of llama.cpp, improving its performance, portability, and model coverage across a range of Qualcomm Adreno GPUs and other heterogeneous devices. Building on our initial OpenCL integration—designed to provide broad compatibility and a unified compute-focused execution path—we have developed new optimizations that substantially improve inference throughput for both small and large models. In particular, we introduce targeted kernel enhancements that accelerate small-model workloads, device-customized optimizations for diverse Qualcomm GPUs, and substantial improvements for Mixture-of-Experts (MoE) architectures. For OpenAI’s GPT-OSS-20B MoE model on the latest Qualcomm Snapdragon X2 Elite device, we achieved a more than fourfold speedup in prefill performance, increasing throughput from approximately 120 tokens/s to over 500 tokens/s through specialized kernel restructuring, memory‑access tuning, and load-balancing strategies across experts.
This work also highlights the unique opportunities and constraints of using OpenCL as a backend for modern LLMs. Unlike APIs with native cooperative matrix or tensor core abstractions, OpenCL currently lacks a standardized cooperative matrix interface, requiring us to engineer portable yet highly optimized GEMM implementations for both dense and MoE workloads. We discuss how this API-level limitation influences kernel design, and we present several strategies—such as adaptive tiling, subgroup‑aware parallelization, and device‑specific kernel variants—to achieve high utilization without relying on dedicated matrix-acceleration instructions. Additionally, we examine the growing diversity of LLM architectures, including models that incorporate dynamic routing, sparsity patterns, or emerging attention mechanisms, and how these trends increase pressure on backend flexibility, programmability, and kernel specialization.
Finally, we outline our roadmap for the OpenCL backend in llama.cpp, including support for additional quantization schemes, FlashAttention-style kernels, expanded INT8 inference paths, and improvements to portability and auto-tuning across more GPU vendors. These enhancements continue to make OpenCL a viable and efficient choice for deploying LLM inference on a broad spectrum of edge and client devices, including Snapdragon X Elite, Snapdragon 8 Elite, and other GPUs where portability and performance must coexist.
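Because OpenCL lacks a cooperative-matrix abstraction, GEMM performance has to come from classical blocking. As a rough illustration of the tiling idea only — in plain C++ rather than OpenCL C, and not the actual llama.cpp kernels, whose tile sizes and subgroup mapping are device-specific — a cache-blocked GEMM looks like:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy tiled GEMM: C += A * B for row-major M x K and K x N matrices.
// TILE stands in for the adaptive, per-device tile size described in the
// abstract; real kernels would also exploit subgroup-level parallelism.
template <std::size_t TILE = 4>
void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t K,
                std::size_t N) {
  for (std::size_t i0 = 0; i0 < M; i0 += TILE)
    for (std::size_t k0 = 0; k0 < K; k0 += TILE)
      for (std::size_t j0 = 0; j0 < N; j0 += TILE)
        // Accumulate one TILE x TILE block so operands stay hot in
        // registers/cache instead of being re-fetched per element.
        for (std::size_t i = i0; i < std::min(i0 + TILE, M); ++i)
          for (std::size_t k = k0; k < std::min(k0 + TILE, K); ++k) {
            const float a = A[i * K + k];
            for (std::size_t j = j0; j < std::min(j0 + TILE, N); ++j)
              C[i * N + j] += a * B[k * N + j];
          }
}
```

Choosing TILE per device — and per workload shape, as in the dense vs. MoE split — is what "adaptive tiling" amounts to in the absence of a standardized cooperative-matrix interface.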
Artificial Intelligence (AI) workloads are growing at an unprecedented pace, both in complexity and scale. From large language models to real-time inference in edge devices, the demand for high-performance, low-latency execution on heterogeneous platforms has never been greater. GPUs have become the backbone of AI acceleration, but achieving optimal performance is not simply a matter of raw compute power. It requires a deep understanding of how workloads interact with the underlying hardware and software stack, and how to fine-tune every layer for efficiency. This talk focuses on Intel GPUs and the OpenCL programming model, exploring how driver-level optimizations and advanced submission strategies can unlock new levels of performance for AI pipelines.
The talk will consist of three main sections:
1. Queue Model Design and Challenges
Efficient task submission is the foundation of GPU performance. We begin by examining the design of queue models for AI workloads, which differ significantly from conventional compute pipelines. Attendees will learn:
- How to structure queues to balance throughput and latency.
- Strategies for minimizing synchronization overhead.
- Techniques for reducing resource contention in multi-queue environments.
We will also discuss heuristics for deciding when to submit tasks immediately versus batching them, and how these decisions impact overall execution efficiency. Real-world examples will illustrate how improper queue design can lead to bottlenecks, and how optimized models can deliver measurable gains.
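The submit-now versus batch decision can be sketched as a simple policy. The batch limit and latency-sensitivity flag below are illustrative stand-ins, not Intel driver heuristics:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy submission model: latency-sensitive tasks flush immediately, while
// other tasks are batched so that per-submission overhead is amortized.
class Submitter {
 public:
  explicit Submitter(std::size_t batch_limit) : batch_limit_(batch_limit) {}

  void submit(std::function<void()> task, bool latency_sensitive) {
    pending_.push_back(std::move(task));
    if (latency_sensitive || pending_.size() >= batch_limit_) flush();
  }

  // One "real" submission covering the whole accumulated batch.
  void flush() {
    for (auto& t : pending_) t();
    pending_.clear();
    ++flushes_;
  }

  std::size_t flushes() const { return flushes_; }

 private:
  std::vector<std::function<void()>> pending_;
  std::size_t batch_limit_;
  std::size_t flushes_ = 0;
};
```

The trade-off is visible even in this sketch: a larger batch limit means fewer flushes (higher throughput) at the cost of delaying the tasks that sit in the batch (higher latency).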
2. Ultra Low Latency Submission (ULLS)
For latency-sensitive AI applications—such as real-time inference or interactive workloads—traditional submission models are insufficient. Enter Ultra Low Latency Submission (ULLS), a technology designed to minimize dispatch overhead and enable near-instantaneous task execution. This section will cover:
- The architectural principles behind ULLS.
- How ULLS changes the decision-making process for task submission.
- Trade-offs between ULLS and conventional batching strategies.
We will present how ULLS reduces jitter and improves predictability, making it ideal for scenarios where every microsecond counts.
3. Memory Allocation and Optimization
Memory management is often the silent performance killer in AI workloads. Frequent allocations, fragmentation, and unpredictable access patterns can introduce significant overhead. This session will explore:
- How AI workloads allocate memory and why traditional approaches fail.
- Driver-level optimizations such as memory pooling and resource recycling.
- Techniques for reducing allocation latency and improving predictability.
We will show how these optimizations not only improve raw performance but also enhance stability under heavy load. Practical examples will demonstrate how memory pooling can reduce allocation overhead by orders of magnitude.
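The pooling idea can be sketched in a few lines. This toy pool is not the Intel driver's allocator — real driver-level pools also handle size classes, alignment, and cross-queue lifetimes — but it shows why recycling beats repeated allocation: a reused block never touches the system allocator.

```cpp
#include <cstddef>
#include <vector>

// Minimal fixed-size block pool. allocate() reuses a released block when
// one is available and only falls back to the system allocator otherwise.
class BlockPool {
 public:
  explicit BlockPool(std::size_t block_bytes) : block_bytes_(block_bytes) {}
  ~BlockPool() {
    for (void* p : free_) ::operator delete(p);
  }

  void* allocate() {
    if (!free_.empty()) {  // recycle: no allocator call, no new page faults
      void* p = free_.back();
      free_.pop_back();
      return p;
    }
    ++fresh_allocations_;
    return ::operator new(block_bytes_);
  }

  void release(void* p) { free_.push_back(p); }  // return to pool, don't free

  std::size_t fresh_allocations() const { return fresh_allocations_; }

 private:
  std::size_t block_bytes_;
  std::vector<void*> free_;
  std::size_t fresh_allocations_ = 0;
};
```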
This technical presentation will provide a comprehensive update on our recent efforts to power Generative AI solutions on Adreno GPUs using open-source software. For many years we have been contributing to and enhancing OpenCL backend support for Adreno GPUs in projects such as TVM (Tensor Virtual Machine) and, in recent years, MLC (Machine Learning Compilation). Recently, Qualcomm has also been contributing to newer projects such as llama.cpp.
In this presentation, we will share the past year's enhancements to TVM and MLC that significantly advance Generative AI performance on Adreno GPUs across mobile and compute products. Specifically, this presentation covers the aspects below.
- Although the TVM community has deprecated Relay and moved to the more advanced Relax compiler, the Adreno GPU enhancements were initially absent from it. We have now upstreamed Adreno GPU enhancements such as texture paths, specialized layouts, memory management, and OpenCLML extension support, so the mainline Relax solution now has everything we support for Adreno GPUs.
- Generative AI performance is significantly improved by adding new accelerated ops to OpenCLML. We now have additional BYOC compilation passes that can offload these new ops. Prefill performance of all GenAI models has improved by 2x with this enhancement, which is available from OpenCLML extension 5 onwards.
- Going forward, we will support a Vulkan backend in parallel with OpenCL. TVM's target-independent lowering allows reuse of more than 90% of optimizations across both backends, leaving only codegen and a few runtime enhancements as a one-time effort. This path additionally enables Vulkan-specific acceleration, such as cooperative matmul, to deliver performance close to OpenCLML-extension-driven acceleration in a purely open-source, specification-based way.
Additionally, we will share performance numbers and point the audience to the various resources available for the above solutions.
OpenCL allows application developers to compile their OpenCL code against an Installable Client Driver (ICD) loader, a shared library that contains symbols for the OpenCL entry-points. At runtime the ICD loader then forwards the OpenCL entry-point calls on to the chosen OpenCL implementation (ICD). This arrangement not only allows developers to build code in a vendor-agnostic way but also enables the existence of a layer mechanism[1]. Layers are a mechanism for intercepting OpenCL calls made to the ICD loader before they are forwarded on to the ICD. This concept has been proven in other APIs such as Vulkan, and it enables the creation of layers for application debugging, tracing, or even implementing additional functionality[2]. A user running an OpenCL application can then choose which layers to load when they invoke the executable.
This presentation will debut the CLVizulayer tool[3], an OpenCL ICD loader layer for printing the graph of asynchronous device tasks to a file in Graphviz DOT[4] format. This tool allows an OpenCL user to graphically see the directed acyclic graph (DAG) of OpenCL device submissions in the application, with nodes representing clEnqueue commands, and edges based on the prerequisites in the OpenCL execution model[5] that constrain the execution of those commands.
Being able to visualize the full device scheduling graph of an application provides a different perspective from timeline tracing, such as you would get from the OpenCL Intercept Layer[6] or other profiling tools, which shows the ordering of command execution. For example, if two commands share no dependencies they may still happen to execute sequentially due to implementation-defined behavior, and would be represented sequentially in a timeline trace. Although a user diving into the trace detail may find the events returned and the wait list for each command, the complete graph of execution constraints is not as immediately apparent as in the visual format output by CLVizulayer.
For large applications the full scheduling graph can be huge, so finer-grained mechanisms are also defined by CLVizulayer to give developers more control over which commands appear in the DOT file, at the expense of having to modify their application. Two extensions, cl_ext_dot_print and cl_ext_command_buffer_dot_print, are defined by the tool to give users this extra control. cl_ext_dot_print allows a user to specify the OpenCL command-queues on which to start capturing, and an explicit end point at which to write out the DOT graph of captured commands, while cl_ext_command_buffer_dot_print lets a user print out the commands of a cl_command_buffer_khr object, a mechanism that already has equivalents in CUDA Graph's cudaGraphDebugDotPrint()[7] and sycl_ext_oneapi_graph's command_graph::print_graph()[8].
In the technical talk we will give an overview of the layer, covering how it works and how it can be used, including modifying code to use the two extensions defined in the project. Additionally, we will show traces captured from ML and HPC workloads, including llama.cpp[9], Leela Chess Zero, GROMACS, and LAMMPS, to illustrate the value of using the layer.
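The flavor of output such a layer produces can be sketched with a small DOT emitter. This is an illustration of the format, not CLVizulayer's actual implementation; the function signature and command names below are hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Emit a Graphviz DOT digraph where nodes are enqueued commands and
// edges (from -> to) are dependencies via event wait lists.
std::string to_dot(const std::vector<std::string>& commands,
                   const std::vector<std::pair<int, int>>& deps) {
  std::ostringstream out;
  out << "digraph tasks {\n";
  for (std::size_t i = 0; i < commands.size(); ++i)
    out << "  n" << i << " [label=\"" << commands[i] << "\"];\n";
  for (const auto& d : deps)
    out << "  n" << d.first << " -> n" << d.second << ";\n";
  out << "}\n";
  return out.str();
}
```

Rendering the resulting text with `dot -Tpng` gives the DAG view described above: commands with no connecting path are revealed as independent, even if a timeline trace happened to show them executing back-to-back.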
[1] https://github.com/Kerilk/OpenCL-Layers-Tutorial
[2] https://www.iwocl.org/wp-content/uploads/6895-James-Brodman-Intel.pdf
[3] https://github.com/EwanC/CLVizulayer
[4] https://graphviz.org/doc/info/lang.html
[5] https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_execution_model
[6] https://dl.acm.org/doi/10.1145/3204919.3204933
[7] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1gbec177c250000405c570dc8c4bde20db
[8] https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
[9] https://www.iwocl.org/wp-content/uploads/iwocl-2025-hongqiang-wang-lamacpp-backend-update.pdf
Thursday 10 May
The SYrtos specification is designed for implementation on a variety of hardware and software platforms, ranging from microcontrollers with dedicated real-time operating systems to more powerful Linux-based systems. This enables true application portability across the entire embedded computing spectrum. This short paper presents a proof-of-concept implementation of the SYrtos interface on top of the Linux POSIX API, as well as preliminary experimental results on an Intel CPU.
To our knowledge, AdaptiveCpp PCUDA is the first HIP/CUDA compiler that has full interoperability with SYCL, while also inheriting SYCL’s portability. This can be beneficial e.g. for iterative or partial porting of CUDA projects to SYCL.
Comparing kernel submission throughput between SYCL and PCUDA reveals submission overheads in SYCL that are primarily caused by unnecessary event creation.
The SYCL specification allows for multiple implementation strategies, in particular SSCP (single-source, single compiler pass) and SMCP (single-source, multiple compiler passes). The default compiler of the AdaptiveCpp SYCL implementation is an SSCP JIT compiler, which has previously been shown to deliver substantial speedups for certain applications, while also reducing compilation times. However, systematic performance evaluations of that compiler have focused mostly on small or medium-sized applications. Additionally, to our knowledge, the impact of supporting both SSCP as well as SMCP compilers in large production code bases has not yet been thoroughly studied.
In this work, we explore the applicability of the AdaptiveCpp JIT compiler to a highly optimized, production code base: GROMACS – a widely used molecular dynamics software package that currently relies on SYCL and the AdaptiveCpp SMCP compiler to target AMD GPUs. We evaluate the ported application across a variety of input problems covering common simulation scenarios on MI210, MI300A, and MI300X AMD GPUs. We show that the SSCP JIT compiler outperforms the currently used SMCP AdaptiveCpp compiler in high-atom-count workload configurations by 10-25% and increases the peak simulation throughput of each tested GPU by up to 10%, measured in terms of simulated atoms per second. These findings confirm that the performance advantages of the SSCP JIT compiler also translate to production applications like GROMACS.
Porting mature scientific applications to heterogeneous hardware remains a key challenge for the scientific community, particularly for codes that rely heavily on complex pointer-based data structures and CPU-centric memory layouts. This talk presents how Intel’s newly implemented Shared System USM smoothly overcomes many of these legacy barriers and enables a natural migration path to GPU execution. By providing a single, coherent virtual address space shared between host and device, shared system USM allows developers to preserve familiar pointer semantics, reuse existing data structures, and incrementally introduce SYCL kernels without extensive refactoring or buffer-access boilerplate. Unlike traditional SYCL Unified Shared Memory, memory pages are inherently cross-device. They can be transferred automatically by the OS kernel, which often performs better than the runtime used in other shared-memory approaches. This builds upon new capabilities available in the Linux kernel and Intel GPU drivers.
We will present a practical, stepwise adoption strategy tailored to real-world HPC codes:
- Profile the application on the workload of interest to identify regions with high computational density suitable for a GPU port, e.g. using Intel Advisor from the oneAPI Base Toolkit.
- Preserve traditional host allocations with system new or malloc().
- Retain idiomatic C++ containers and RAII.
- Port the target kernel(s) to SYCL, e.g. by mapping the outermost loop to a sycl::parallel_for() construct. Care should still be taken to avoid I/O. The needs of concurrent access must be considered.
- Profile the code, e.g. with VTune from the oneAPI Base Toolkit, and
- Exploit selective prefetching mechanisms where indicated, e.g. for large allocations
- Evolve to asynchronous command graphs, e.g. for overlapping with concurrent CPU tasks.
- Optimize memory allocation where indicated, e.g. by grouping related data on the same memory page or aligning allocations to page boundaries.
- Optimize memory layout, e.g. so concurrent GPU threads can access similar data, and
- Optimize memory access patterns, e.g. so more computational work is done for each memory access.
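Several of the steps above revolve around page granularity, since the OS kernel migrates shared system USM data page by page. As a small illustration of the page-alignment step only (the 4 KiB page size is an assumption; real code should query the platform), C++17 aligned allocation can be used:

```cpp
#include <cstddef>
#include <new>

// Assumed page size for illustration; query the OS in real code
// (e.g. sysconf(_SC_PAGESIZE) on Linux).
constexpr std::size_t kPageSize = 4096;

// Allocate a float array starting on a page boundary, so related data
// grouped into it shares pages and can be migrated as a unit.
float* alloc_page_aligned(std::size_t count) {
  void* p = ::operator new[](count * sizeof(float),
                             std::align_val_t{kPageSize});
  return static_cast<float*>(p);
}

// Matching aligned deallocation.
void free_page_aligned(float* p) {
  ::operator delete[](p, std::align_val_t{kPageSize});
}
```

With pointers like these, the same allocation can be dereferenced on host and, under shared system USM, inside a SYCL kernel, without explicit transfers.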
We will illustrate this strategy using both case studies in, and mini-apps based on, the molecular-dynamics application GROMACS, running on currently available hardware and software. This application already has mature GPU ports but still has a simple CPU-only “reference” implementation that we can port for this illustration. It includes complex data structures such as N-body pair lists that can be problematic to use on GPUs without shared system USM. It contains examples of the above optimization approaches that can be understood readily in this context. We will demonstrate how much of the available performance can be obtained for limited effort.
Overall, we demonstrate how shared system USM shortens time-to-GPU, reduces code churn, makes it easier to implement both CPU and GPU ports in the same code base, supports rapid experimentation, and can even provide better performance than explicit-transfer constructs — key themes for the IWOCL community’s focus on performance portability and developer productivity. Intel’s shared system USM implementation provides a compelling pathway for bringing established scientific software to SYCL-enabled GPUs, aligning cleanly with IWOCL’s mission to advance open, portable, and maintainable heterogeneous programming models.
Friday 11 May
In addition, for large matrices our heterogeneous implementation of the Cholesky decomposition can achieve at least 12 percent faster runtimes across several systems with GPUs from NVIDIA, AMD, and Intel.
At IWOCL 2025, we introduced Shamrock, a native SYCL-MPI multi-GPU framework designed for particle- and mesh-based methods, providing scalable neighbor finding and load balancing through layered abstractions. Shamrock solvers are internally expressed as a graph structure consisting of a DAG extended with subcycles within subgraphs. The very large number of possible execution paths makes explicit kernel scheduling with streams impractical. Consequently, the initial version of Shamrock primarily relied on sycl::buffers, except for direct GPU-aware MPI communications.
While convenient, sycl::buffers can introduce unexpected overheads such as submission latency and increased register pressure, and they are incompatible with GPU-aware MPI, which requires USM pointers. Over the past year, we therefore migrated Shamrock from sycl::buffers to USM using a lightweight in-house wrapper designed to minimize code changes during the transition.
In parallel, we adapted Shamrock to run and scale efficiently on the Aurora supercomputer. The original version already achieved 93% parallel efficiency on 2,048 GPU tiles on the Adastra supercomputer. However, scaling beyond 10,000 GPUs revealed several edge cases in SYCL–MPI interactions, including unknown memory costs associated with large numbers of messages, message loss between senders and receivers, and performance degradation in large collective operations. Addressing these issues required substantial adaptations to the code.
As a result, Shamrock now achieves up to 220 billion SPH particles processed per second on 24,000 GPU tiles, using 768 TB of GPU memory in a weak-scaling experiment, while sustaining 83% parallel efficiency.
In this presentation, we detail our experience migrating from sycl::buffers to USM and discuss our recent large-scale results on Aurora. Finally, we outline future directions for Shamrock’s usage of SYCL, including tighter integration of MPI communication into the SYCL queue DAG, the use of modern GPU-specific accelerators, and the need for fully asynchronous memory management beyond the current SYCL 2020 specification.
Posters
Posters will be on display during the breaks Wednesday through Friday.
