Proceedings
The ACM Digital Proceedings are now available.
The IWOCL 2024 Call for Papers was posted before the January 1, 2024 deadline, so these proceedings are not published under the ACM’s new Open Access Model. We expect IWOCL 2025 to be Open Access.
Award Winners
This year’s awards went to:
- Outstanding Poster: Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame – Jolly Chen, University of Amsterdam, CERN. Co-authors: Monica Dessole, CERN and Ana Lucia Varbanescu, University of Twente.
- Outstanding Short Paper or Technical Talk: Evaluation of SYCL’s Different Data Parallel Kernels – Marcel Breyer, University of Stuttgart. Co-authors: Alexander Van Craen and Dirk Pflüger, University of Stuttgart.
- Outstanding Full Paper: SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance – Fabian Knorr, University of Innsbruck. Co-authors: Peter Thoman, University of Innsbruck, and Luigi Crisci, University of Salerno.
Location and Times
All the sessions took place on the 6th Floor of the Student Center East Tower (SCE). See Location and Travel for additional information. All times are US Central Time.
Quick Links: SYCL HackFest | SYCL Tutorial | Conference (Wed and Thu) | Posters (Wed / Thu)
Monday April 8, 2024
SYCL HackFest
Tuesday April 9, 2024
Adv. SYCL Tutorial
Wednesday April 10 – Thursday April 11, 2024
OpenCL and SYCL Conference Sessions
Monday April 8, 2024 | HackFest Day
This session will take place in Room 605. Registration and refreshments will be in Room 603.
- Session Chair: Sri Ramkrishna, oneAPI Community Manager, Intel.
- Mentors: Tom Deakin, University of Bristol and SYCL Working Group Chair, Aksel Alpay, Heidelberg University and creator of AdaptiveCpp (formerly hipSYCL), Thomas Applencourt, Argonne National Laboratory and SYCL Advisory Panel Chair, and Hugh Delaney, Codeplay Software.
Agenda
08:00 – Registration and Welcome Coffee
08:30 – Start of HackFest
11:00 – Coffee Break
12:00 – HackFest session
13:30 – Lunch Break (including viewing of Solar eclipse)
14:30 – HackFest session
16:00 – Coffee Break
16:30 – HackFest session
19:00 – Session closes
The SYCL HackFest @ IWOCL 2024 is sponsored by Intel and Codeplay Software.
Tuesday April 9, 2024 | Advanced SYCL Tutorial Day
This session will take place in Room 605. Registration and refreshments will be in Room 603.
- Tutorial Chair: Hugh Delaney, Codeplay Software.
- Tutors: Thomas Applencourt, Argonne National Laboratory, Abhishek Bagusetty, Argonne National Laboratory, and Aksel Alpay, Heidelberg University.
This advanced tutorial goes beyond the basic concepts of SYCL to offer participants a dynamic and interactive learning experience through a blend of short lectures followed by hands-on coding exercises that reinforce the advanced concepts introduced.
Attendance at this tutorial is included in the 3-Day IWOCL Conference Pass.
Agenda
08:30 – Registration and Welcome Coffee
09:00 – Start of Tutorial
11:00 – Coffee Break
12:00 – Tutorial session
13:00 – Lunch Break
14:00 – Tutorial session
16:00 – Coffee Break
16:30 – Tutorial session
17:30 – Session closes
Tutorial Format: A Full-Day Hands-On Tutorial
Tutorial Outline:
The material builds on the fundamentals of SYCL to provide participants with a deeper understanding. Topics covered will include strategies for:
- Optimizing code
- Managing data flow
- Using different memory access patterns
- Understanding work group sizes
- Using vectorization
- Working with ND ranges, and their importance
- Making the most of the multiple devices available on your architecture
The hands-on coding exercises are designed to progressively challenge participants, starting with fundamental SYCL constructs and advancing to complex scenarios.
Throughout the tutorial, instructors will guide participants in developing a deeper appreciation for SYCL’s role in heterogeneous computing and its potential to unlock significant performance gains. By the end of the session, participants will have honed their SYCL programming skills and gained the confidence to leverage SYCL in their projects.
Wednesday – Conference Sessions
This session will take place in Room 605. Registration and refreshments will be in Room 603. Presenters are identified in bold.
KEYNOTE PRESENTATION #1
This technical presentation details first-of-its-kind, industry-best performance of Meta’s LLaMa-7b model on Adreno GPUs with the MLC LLM (Machine Learning Compilation for Large Language Models) framework. MLC LLM (https://llm.mlc.ai/) is an open-source universal solution that allows any LLM (Large Language Model) to be deployed natively on a diverse set of hardware backends and native applications. MLC LLM has demonstrated LLM models working on iOS, Android, Windows, Linux, Mac and web browsers. Under the hood, MLC uses TVM (Tensor Virtual Machine – tvma.ai) for model compilation, optimization and device execution.
In this submission we detail the baseline performance MLC LLM achieved on various Adreno GPU generations and the improvements Qualcomm made to reach industry-best performance. MLC LLM’s baseline decode performance was 6 tokens per second on Snapdragon Gen 2 and 8 tokens per second on Snapdragon Gen 3.
MLC LLM’s early Android implementation is driven by hand-crafted dispatch IR (Intermediate Representation) that generates the corresponding OpenCL kernels. Recent amendments use a Dlite-based scheduling mechanism. Both approaches have scope for improvement on Adreno GPUs. We chose dispatch-based code generation as the baseline for our improvements.
Our early analysis and hardware-level profiling of the generated OpenCL kernels revealed underutilization of Adreno’s threading capability, with too few threads in flight and some cores left idle. We also identified inappropriate vectorization that resulted in inefficient loads and stores, as well as excess local memory utilization per work-group. Finally, a few network-level enhancements were needed to avoid unnecessary layout transformations.
Improvements started by increasing the work-group size by 2x and 4x to compute partial sums followed by a reduction. This approach occupied the cores completely and improved kernel-level performance by nearly 35%. Vectorization was corrected to enable coalesced data loads, improving load throughput, which matters most in the data-bound decode multiplications (vector-to-matrix multiplication). Local memory utilization per work-group was reduced to relieve pressure on local memory. The network-level optimizations removed unnecessary transposes by using a modified MatMul that works on the original data layouts. These approaches reduced global memory bandwidth and saved the cost of the transposes.
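The partial-sum-then-reduce pattern described above can be sketched in plain C++ (a simulation of the idea only, not the authors’ kernels; the function name and the work-group simulation are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch: each "work-item" in a work-group of size G
// accumulates a strided partial sum of a dot product, then the partials
// are combined with a tree reduction, mirroring the 2x/4x work-group
// enlargement described in the talk.
double dot_partial_sum(const std::vector<double>& x,
                       const std::vector<double>& w, int G) {
    std::vector<double> partial(G, 0.0);
    for (int lane = 0; lane < G; ++lane)                 // simulated work-items
        for (std::size_t i = lane; i < x.size(); i += G) // coalesced-style stride
            partial[lane] += x[i] * w[i];
    // tree reduction over the work-group's partial sums
    for (int step = G / 2; step > 0; step /= 2)
        for (int lane = 0; lane < step; ++lane)
            partial[lane] += partial[lane + step];
    return partial[0];
}
```

On a GPU, each simulated lane would be a real work-item and the reduction would use work-group local memory; the strided access pattern is what makes the loads coalesce.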
The optimizations above improved the overall performance of LLaMa-7b-v1/v2 to 11 tokens per second on Snapdragon Gen 2 and 14 tokens per second on Snapdragon Gen 3. Most of the improvements are generic and have also significantly improved prompt processing.
LLaMa-7b support on Android platforms at this level of performance is a first. In this talk, we detail the baseline dispatch design and the improvements we made to achieve this performance.
When a new OpenCL extension is defined, it typically takes several months to several years before implementations of the extension are widespread and available for general usage by applications. In unfortunate cases, an extension may never be implemented for some devices or operating systems. This lack of availability can hinder adoption of an extension by shifting complexity to application developers, who may need to develop two different code paths, one with support for the extension and one without.
In our technical presentation we will describe how OpenCL layers were used to implement a command buffer emulation layer with support for the provisional cl_khr_command_buffer and cl_khr_command_buffer_mutable_dispatch extensions. The command buffer emulation layer brings functional support for these extensions to most OpenCL devices in the market, allowing developers to develop OpenCL applications using both extensions even when they are not available natively.
We will describe how the emulation layer works with a focus on how existing OpenCL APIs provide the functionality used to emulate command buffers. We will describe OpenCL features that were helpful for emulation, such as built-in reference counting for OpenCL objects and the ability to clone OpenCL kernels and their arguments. We will describe some command buffer features that were tricky to emulate, such as event profiling and certain types of error checking. These features required additional work and creativity, though we were ultimately able to emulate them. We will also describe a small handful of command buffer features that we currently do not emulate properly, such as the command buffer pending state.
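The core recording-and-replay idea behind such an emulation layer can be sketched in a few lines of C++ (a minimal illustration under assumed names, not the actual layer’s API or the cl_khr_command_buffer interface):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Illustrative sketch of command-buffer emulation: commands are recorded
// as callables while the buffer is open, then replayed in submission
// order on every enqueue, as the real layer does with OpenCL commands.
class EmulatedCommandBuffer {
    std::vector<std::function<void()>> commands_;
    bool finalized_ = false;
public:
    void record(std::function<void()> cmd) {
        if (!finalized_) commands_.push_back(std::move(cmd));
    }
    void finalize() { finalized_ = true; }  // no further recording allowed
    void enqueue() const {                  // replay all recorded work
        for (const auto& cmd : commands_) cmd();
    }
};
```

The real layer additionally clones kernels and their arguments at record time (one of the OpenCL features the presentation calls out as helpful), so later changes to the kernel object do not affect the recorded command.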
Our presentation will include success stories, such as how the command buffer emulation layer was used to develop the OpenCL conformance test suite (CTS) for command buffers. We will describe how the command buffer emulation layer is currently being used by the OpenCL working group to rapidly prototype new command buffer features and layered extensions that add new command buffer functionality. We will describe how the command buffer emulation layer is being used to develop and test SYCL graphs, which generate OpenCL command buffers. Even for implementations that support command buffers natively, the command buffer emulation layer provides an alternative implementation that may be useful when debugging applications using command buffers.
Our presentation will include a brief evaluation of the performance of the command buffer emulation layer. We will compare the performance of the command buffer emulation layer against a command buffer-like implementation in the application itself, approximating a traditional alternative code path to support OpenCL implementations without support for the command buffer extensions. We will also compare the performance of the command buffer emulation layer against implementations with native support for command buffers, such as POCL, and any other native implementations we can identify and acquire prior to the conference.
Our presentation will close with a broader discussion of layers in the OpenCL ecosystem, including how OpenCL can better support and evangelize layers like the command buffer emulation layer.
The source code for the command buffer emulation layer is available under a permissive license (MIT).
Pipes in the OpenCL standard offer a natural way to describe fine-grained task pipelines. However, the current use of the OpenCL pipe in FPGA OpenCL implementations is either non-compliant or not performance optimized due to several implementation challenges. In this paper we pinpoint the key implementation complexities and suggest possible specification updates and implementation choices that enable description of efficient task pipelines in a portable, vendor- and device-independent manner. We design a performance-optimized hardware pipe prototype, tackling a key challenge (runtime-defined connectivity) on the way to an OpenCL-compliant, yet portable, pipe implementation. The evaluation of our prototype on an FPGA shows that in a computer vision application, the proposed dynamically connected pipe component is 2.5x faster than an OpenCL buffer-based design. We evaluate the cost of the flexibility offered by the dynamic pipe prototype at 4.4% of the total device area and a 1.6x latency overhead compared to a fixed-connectivity design.
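Functionally, an OpenCL pipe behaves like a bounded FIFO with non-blocking reads and writes between producer and consumer kernels. A minimal single-threaded C++ model of that behavior (illustrative only; the class name and interface are assumptions, not the paper’s hardware prototype):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <queue>

// Illustrative model of an OpenCL-style pipe: a bounded FIFO where a
// write fails when the pipe is full and a read returns nothing when no
// packet is available, as with OpenCL's write_pipe/read_pipe.
template <typename T>
class Pipe {
    std::queue<T> fifo_;
    std::size_t capacity_;
public:
    explicit Pipe(std::size_t capacity) : capacity_(capacity) {}
    bool write(const T& v) {              // non-blocking write
        if (fifo_.size() >= capacity_) return false;
        fifo_.push(v);
        return true;
    }
    std::optional<T> read() {             // non-blocking read
        if (fifo_.empty()) return std::nullopt;
        T v = fifo_.front();
        fifo_.pop();
        return v;
    }
};
```

The paper’s contribution is making exactly this abstraction efficient in FPGA hardware while the producer/consumer connectivity is decided at runtime rather than fixed at synthesis time.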
SYCL 2020, in particular, represents a major enhancement that pushes the boundaries of heterogeneous programming by introducing a number of new features. As the new features are implemented by existing compilers, it becomes critical to assess the maturity of the implementations through accurate and specific benchmarking. This paper presents SYCL-Bench 2020, an extended benchmark suite specifically designed to evaluate six key features of SYCL 2020: unified shared memory, reduction kernels, specialization constants, group algorithms, in-order queues, and atomics.
We experimentally evaluate SYCL-Bench 2020 on GPUs from the three major vendors, i.e., AMD, Intel, and NVIDIA, and on two different SYCL implementations, AdaptiveCpp and oneAPI DPC++.
Amber is a biomolecular simulations software package that is used by thousands of scientists in academia, national labs, and industry for computational drug discovery and related research. Amber is particularly well known for its high-performance molecular dynamics (MD) program, which features a very efficient CUDA implementation for Nvidia graphics processing units (GPUs) that has been continuously optimized since its initial release over a decade ago. A feature of this implementation is that it makes extensive use of FP32 floating point arithmetic while controlling numerical stability by employing 64-bit fixed-point integer arithmetic for the accumulation of energy and force contributions. To maximize performance, the CUDA implementation runs entirely on the GPU with data transfer between CPU and GPU only at program start and for checkpointing and MD trajectory output. This enables efficient MD simulations with millions of atoms both on datacenter and consumer grade hardware. Amber is therefore used on hardware ranging from small desktop workstations to supercomputers for large-scale ensemble simulations. Recently, a HIP/ROCm implementation has become available to enable Amber MD simulations on AMD devices.
In this technical presentation we present our experiences in developing a SYCL implementation of the Amber MD engine by porting the CUDA code using Intel oneAPI software development tools and Intel Xe architecture GPUs. The goals of this effort are twofold. First, to enable Amber MD simulations on Intel Xe architecture GPUs, and second to achieve performance portability across devices from different vendors by employing a single-source model based on a modern C++ standard. In the long run we hope to increase developer productivity by avoiding having to maintain multiple code bases for different hardware architectures.
This porting effort was non-trivial as the Amber MD code is quite complex, containing a large set of highly optimized C++ CUDA kernels for particle force calculations, time stepping, temperature and pressure control, and enhanced sampling algorithms. The CUDA code alone consists of 58 header files and 23 source files with 459 global CUDA kernels and 292 CUDA device functions, including Nvidia PTX assembly instructions to optimize performance on different Nvidia GPU hardware generations. In addition, MPI is used for distributed memory parallelization to enable multi-GPU parallelization and large-scale ensemble simulations. We will give an overview of this porting effort and discuss some relevant optimizations of the SYCL implementation such as the replacement of subgroup shuffles with explicit shared local memory operations, memory space casting in global atomics, use of ESIMD-based sorting algorithms in the oneDPL library, and the use of MPI-based explicit scaling for simulations that execute on two tiles of Intel GPU Max 1550 devices. Numerical tests demonstrate that the accuracy of the SYCL implementation is on par with the CUDA reference code, and benchmark simulations demonstrate strong performance on Intel Data Center GPU Max 1550 hardware. We will conclude with a discussion of our plans for future releases with additional features for more complex MD algorithms and efforts to investigate performance portability of this SYCL code across different vendors. This work will enable Amber MD simulations on a wide range of accelerator hardware, thus enabling scientists to gain deeper understanding of structure-function relationships of biomacromolecules and advance fields such as computational drug design where high-performance MD simulations are critical.
This talk provides an overview of the work that has been done by a team of experts at multiple national labs and Intel to improve the performance of the Spectral Neighbor Analysis Potential (SNAP), a machine-learning potential in the LAMMPS molecular dynamics code on Aurora, Argonne’s exascale machine. The SNAP potential relates the total potential energy of the system (composed of the sum of energies from individual atoms) to weighted bi-spectrum (descriptor) components.
Aurora’s main programming model is SYCL, using Intel’s DPC++ implementation. This work was part of a larger effort to enable the EXAALT project, part of DOE’s exascale computing project, to run performantly on DOE’s exascale machines. The EXAALT ECP project targets molecular dynamics simulation of materials related to nuclear fusion and fission. LAMMPS supports potentials for a wide range of systems, including solid-state materials, soft matter, liquids, and coarse-grained or mesoscopic systems. LAMMPS runs on a single processor or in parallel using message-passing techniques and a spatial decomposition of the simulation domain. Many of its models have versions that provide accelerated performance on CPUs and GPUs. The code is designed to be easily modified or extended with new functionalities. LAMMPS is written in C++ and includes support for OpenMP on CPUs, a GPU package with native CUDA, HIP, and OpenCL backends, and a Kokkos performance portability package. Kokkos is the primary programming model for EXAALT and provides a SYCL backend used on Aurora.
A mini-app called “TestSNAP” was developed to represent the SNAP potential and its bottleneck kernels and to enable quick code modifications and evaluation of their impact on performance. Today, we use both the mini-app and the full LAMMPS application for benchmarking. Three kernels in SNAP were identified as the most time-consuming: Compute_FusedDeiDrj, Compute_Ui, and Compute_Yi. Intel VTune and Advisor tools were used to profile the code. Several improvements were implemented, including changes to both the LAMMPS implementation of SNAP and the Kokkos SYCL backend, spanning exploration of sub-group sizes and work-item distributions, the index calculation for multidimensional Kokkos views, and manual specification of address spaces for memory pointers.
These optimizations (plus a few others) were benchmarked on ALCF Aurora and compared to the original code on OLCF Frontier and ALCF Polaris. One independent 2000-atom replica was run on each tile, GCD, or GPU simultaneously, mimicking the EXAALT workflow. Two nodes were used on each machine, giving a total of 24 replicas on Aurora, 16 replicas on Frontier, and 8 replicas on Polaris. On Aurora, these optimizations gave a 1.49x speedup over the code in the public LAMMPS repository. The average cumulative performance of one PVC GPU (both tiles) is now 1.04x higher than an A100 GPU on Polaris and 1.08x higher than an MI250X GPU (both GCDs) on Frontier. We are currently scaling up the runs to use 1024 nodes on Aurora.
This work was done on a pre-production supercomputer with early versions of the Aurora software development kit. This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory and is based on research supported by the U.S. DOE Office of Science-Advanced Scientific Computing Research Program, under Contract No. DE-AC02-06CH11357. SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525. This manuscript has been authored by UT-Battelle, LLC, under Grant DE-AC05-00OR22725 with the U.S. Department of Energy (DOE).
The Aurora supercomputer at the Argonne Leadership Computing Facility (ALCF) is one of the first exascale computing resources being put forth by the US Department of Energy (DOE). Driven by 63,744 Intel Data Center GPU Max Series and 21,248 Intel Xeon CPU Max Series processors, Aurora’s theoretical peak compute capability is more than two exaflops. Many of the scientific applications targeting Aurora will be using SYCL; in this presentation, we detail our efforts to prepare one such application for Aurora, providing insight into the porting and optimization process.
HACC (Hardware/Hybrid Accelerated Cosmology Code) is an extreme-scale cosmological simulation code developed and optimized for DOE supercomputers since the first petascale systems. The simulation’s solvers are algorithmically tuned for compute architectures, with the GPU implementation first emerging to target OLCF-3 (Titan). An extension of HACC with physics for resolving gas hydrodynamics uses a new Conservative Reproducing Kernel (CRK) formulation of Smoothed Particle Hydrodynamics (SPH). The additional physics in CRK-HACC makes simulations significantly more computationally demanding and drives the need for GPU acceleration on current exascale systems. CRK-HACC is under active development and has CUDA, HIP, and SYCL implementations – maintaining all three of these implementations with minimal developer effort is of significant interest to the HACC team, and this introduces more challenges than a more straightforward migration from CUDA to SYCL.
In the first part of this talk, we will discuss how we developed the initial SYCL implementation of CRK-HACC, paying close attention to our efforts to retain a common host code shared across CUDA, HIP, and SYCL kernels. These efforts include a customized CUDA-to-SYCL migration pipeline that builds upon SYCLomatic to produce SYCL kernels described as function objects (as opposed to lambda expressions). We will also detail our approach to achieving high levels of “performance portability” across AMD, NVIDIA and Intel GPUs using SYCL, which required us to develop an abstraction capable of compiling to multiple different implementations of “shuffle” operations: the sycl::select_from_group function from SYCL 2020, a shuffle operation emulated via work-group local memory, and a highly specialized shuffle operation implemented for Intel GPUs in assembly (vISA). The combination of these techniques across host and device code carefully balances code maintainability and performance portability, and we believe these techniques will generalize well to other applications from other domains.
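The “shuffle emulated via work-group local memory” fallback mentioned above can be modeled in a few lines of plain C++ (illustrative only; the function name is an assumption, and a scratch array stands in for work-group local memory, not CRK-HACC’s vISA or SYCL code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of emulating sycl::select_from_group via local memory: every
// work-item stores its value to a shared scratch array (followed by a
// barrier on a real device), then every work-item reads lane `src`.
std::vector<int> emulated_select_from_group(const std::vector<int>& lane_values,
                                            std::size_t src) {
    std::vector<int> local = lane_values;  // "store to local memory" + barrier
    std::vector<int> result(lane_values.size());
    for (std::size_t lane = 0; lane < lane_values.size(); ++lane)
        result[lane] = local[src];         // broadcast from the source lane
    return result;
}
```

An abstraction layer like the one described in the talk would select between this emulated path, the native sycl::select_from_group, and the Intel vISA specialization at compile time, keeping kernel code identical across backends.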
In the second part of this talk, we will provide an update on our (currently in progress) efforts to further reduce the effort associated with maintaining separate CUDA, HIP, and SYCL variants of CRK-HACC’s kernels. Specifically, we will detail the design considerations and performance characteristics of an embedded domain-specific language (eDSL) that we are developing for CRK-HACC. This eDSL is essentially a thin C++ wrapper around CUDA, HIP and SYCL which allows a single source code to target any backend, and which is intended to provide CRK-HACC with greater control over CUDA and HIP code generation than relying on SYCL implementations that are themselves still in development.
Finally, we demonstrate the scaling of the SYCL version of CRK-HACC running on Aurora at large scale (anticipated to be approximately 2000 nodes, or 12000 GPUs). To the best of our knowledge, this presentation will be one of the first to present SYCL performance results at such a scale.
Thursday – Conference Sessions
This session will take place in Room 605. Registration and refreshments will be in Room 603. Presenters are identified in bold.
KEYNOTE PRESENTATION #2
Expressing data parallel programs using C++ standard parallelism is attractive not only due to the simplicity of the model, but also due to its highly idiomatic nature. This programming model, commonly referred to as stdpar, can also be used for accelerator programming by offloading calls to standard algorithms, and is supported by multiple vendors, such as NVIDIA with nvc++, AMD with roc-stdpar, and Intel with the new icpx -fsycl-pstl-offload flag.
We present AdaptiveCpp stdpar, a novel stdpar implementation that is part of the AdaptiveCpp SYCL implementation. AdaptiveCpp stdpar is the very first open-source stdpar implementation based on SYCL, and allows users to start developing applications using C++ standard algorithms, and then progressively move to SYCL as more control is needed.
Our solution supports CPUs as well as GPUs from NVIDIA, Intel and AMD. We find that it outperforms all vendor stdpar compilers on HPC GPUs in the majority of tested applications, in some configurations by up to an order of magnitude. Furthermore, we show how AdaptiveCpp outperforms nvc++ in a latency-bound code for all tested problem sizes by up to 80% on NVIDIA A100 due to novel optimizations. Our stdpar implementation deviates from existing implementations by relying on a tighter integration with compiler and runtime, including e.g. dedicated optimization passes to elide synchronization, automatically prefetching required allocations, and an offloading heuristic.
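To show what stdpar code looks like, here is a dot product written purely with C++ standard algorithms (a generic illustration, not code from the AdaptiveCpp paper). Under an stdpar implementation, passing an execution policy such as std::execution::par_unseq to the same call is what enables offload; the policy is omitted here so the sketch compiles with any C++17 toolchain:

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// A dot product expressed as a standard algorithm call. An stdpar
// compiler (nvc++, roc-stdpar, icpx, or AdaptiveCpp) can offload the
// policy-taking overload of this same algorithm to a GPU.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0,
                                 std::plus<>{}, std::multiplies<>{});
}
```

Because the algorithm call itself is unchanged between CPU and GPU execution, applications can start this way and, as the abstract describes, progressively move to explicit SYCL only where finer control is needed.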
This paper introduces a user-friendly approach to creating generic libraries and reducing boilerplate code by leveraging std::variant and std::visit from standard C++. The approach was tested in an in-house high-performance kernel library for server-grade GPUs, where it demonstrated a significant uplift in productivity and improved code maintainability.
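The std::variant/std::visit pattern the abstract refers to can be sketched as follows (illustrative only; the kernel types and names are invented for the example, not the in-house library’s API):

```cpp
#include <cassert>
#include <variant>

// Sketch of boilerplate reduction with std::variant + std::visit: one
// dispatch site covers every supported kernel type, instead of a
// hand-written overload or switch per configuration.
struct GemmKernel   { int run() const { return 1; } };
struct Conv2dKernel { int run() const { return 2; } };

using AnyKernel = std::variant<GemmKernel, Conv2dKernel>;

int launch(const AnyKernel& k) {
    // std::visit generates the dispatch for every alternative in the variant
    return std::visit([](const auto& kernel) { return kernel.run(); }, k);
}
```

Adding a new kernel type then only requires extending the variant; the generic `launch` path and any other visitors keep working unchanged, which is where the maintainability gain comes from.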
SYCL 2020 introduced several new and notable features related to groups of work-items, including a new group type (sub-groups) for performance tuning and a new library of group functions and algorithms (such as reductions) to improve programmer productivity. This new functionality was deliberately designed such that the work-group and sub-group classes would share a common interface, leaving the door open to explore additional group types in future versions of SYCL.
The experiences of developers working on complex SYCL applications have shown that although the SYCL 2020 group functionality is useful, there is room to improve usability and a need to address emerging use cases. For example, the requirement that group algorithms must be called in converged control flow by all members of a group often forces developers to structure their code in ways that may feel unnatural, and which may negatively impact the readability and maintainability of their code. Furthermore, our experience teaching SYCL has highlighted several challenges that can complicate learning for new developers, such as using the term “group” to mean both “any group of work-items” and “a work-group” (i.e., sycl::group). SYCL’s group functionality must evolve to address these concerns but must do so in a way that prioritizes backwards compatibility.
In this talk, we propose several additions to the SYCL specification designed to address these issues, including six new group types that we are exploring in the context of the Data Parallel C++ compiler. The first group type, a root group, contains all the work-items in an ND-range and provides access to device-wide barriers and algorithms within a kernel. The second group type, a fixed size partition of an existing group, enables a programming model similar to hierarchical parallelism. The third group type, a logical partition of an existing group, provides explicit control over predicated work-item execution in diverged control flow. The fourth group type, a scoped partition of an existing group, is intended to permit SYCL implementations to expand the group hierarchy without breaking user code, using an interface inspired by sub-device creation. The fifth group type, a tangle, represents all work-items in the same control flow and introduces convergence guarantees aligned with the expectations of many developers. The final group type, an opportunistic group, represents the set of work-items that an implementation guarantees to be executing together. Collectively, these six group types significantly expand the use cases that can be addressed by SYCL’s group abstraction.
We also explore options for supporting generic programming with groups of work-items, with an aim to unify a developer’s view of work-groups, sub-groups, and all our proposed groups using C++20 concepts. Such unification is necessary to ensure that SYCL’s group functions and algorithms can expand support to new group types. Aiming for unification should also help to ensure that the designs of any future group types are compatible with existing SYCL 2020 conventions and developer expectations.
We conclude with a discussion of problems that remain unsolved and that will need to be addressed in future work, including difficulties related to exposing coordination and communication primitives for arbitrary (and user-defined) collections of work-items.
In this presentation we give an overview of a SYCL extension introducing property lists with compile-time information about the contained properties. Additionally, we present a selection of extensions that use these properties, both to give alternatives to existing SYCL 2020 features as well as introducing new functionality.
In contrast to existing SYCL property lists, this compile-time properties extension allows SYCL programmers to encode the presence of the contained properties as part of the property list’s type. This allows the existence of properties in a property list to be queried during compilation of a SYCL application rather than during execution. Additionally, properties usable in these new property lists can contain data that can be queried at either the time of compilation or the time of execution.
Where the strength of these new property lists comes into play is not only in allowing static assertions on invalid properties being applied to objects, but also in how the compile-time information can be used to customize SYCL objects, kernel functions, kernel arguments, etc. Since the mechanism relies fully on C++ templates, the ways in which these customizations can be applied include, but are not limited to, SFINAE on SYCL object member functions and specializations of SYCL objects based on the properties in the supplied property list. DPC++ additionally uses this compile-time information to simplify the flow of information through the compiler, making it easier for implementation developers to add new properties that have special meaning in intermediate kernel code.
Currently, SYCL 2020 allows programmers to specify a selection of traits on kernels, such as required work-group and sub-group sizes, required device aspects, and hints about vector computational width and appropriate work-group sizes. These are applied using SYCL-specific C++ attributes, meaning the compiler needs to be taught about these new attributes to be able to honor them. This requirement becomes problematic when compiling SYCL code with host compilers that do not know these attributes. However, using an extension to the command submission interfaces and shortcuts that allows them to take compile-time properties as an additional argument, these kernel traits can instead be encoded as properties with compile-time information. By doing this, kernel traits no longer need to be expressed through an extension of C++ but can be specified using regular C++.
Not only do these properties allow for regular C++ alternatives to existing SYCL 2020 features, but they also enable a plethora of new extension features. One of these new extension features adds the ability for SYCL programmers to annotate kernel arguments. These annotations are tied to kernel arguments through a wrapper class, associating properties with an underlying object. This wrapper class can then be captured by the kernels, allowing the programmer to communicate additional information about these arguments to the compiler through the properties in the property list. An example of such additional information is the “restrict” property, which informs the compiler that a given annotated pointer kernel argument does not alias any other kernel arguments, following the semantics of the “restrict” keyword in C99.
In conclusion, the compile-time property list extension presented here offers a flexible and extensible way for SYCL programmers to apply properties to SYCL objects, kernel functions, kernel arguments, etc., not only creating alternative implementations of existing SYCL 2020 features but also giving rise to further extensions to SYCL 2020. This compile-time property list extension, and the extensions building upon it, are available as experimental extensions in DPC++.
show / hide abstract
The SYCL programming model supports expressing complex dependences through a directed acyclic graph, or DAG. Graphs in SYCL may be expressed in two ways: either through declaring data dependences using accessor objects or through declaring task dependences using event objects. Events in SYCL seem like a simple abstraction at first glance, but a closer inspection reveals a depth of complexity and nuance that programmers may not immediately appreciate. This talk will examine these issues as well as present ideas for new mechanisms in SYCL focused on events that address feature requests and concerns from programmers. Earlier proposed extensions such as the discard_events property in DPC++ or coarse-grained events in AdaptiveCpp attempted to address these issues by altering the semantics of event objects instead of eliminating unwanted events entirely.
The classic use of events in SYCL is for synchronization. Events form the backbone for building task-based DAGs when using out-of-order queues or multiple queues. Command submissions return events that future commands may depend on. This mechanism is essential when using out-of-order queues or synchronizing commands on multiple SYCL queues. Additionally, events are one way that host code may synchronize with device code without waiting on every command in a queue individually. However, events may seem superfluous for applications that primarily use an in-order queue where the primary method of synchronization is intrinsic to the semantics of the queue itself. Such applications are often encountered when porting codes to SYCL from other accelerator programming models.
While events are a powerful abstraction, the current semantics of SYCL require them to be created for every command submission. Although useful, event creation is not free, and its cost may be backend-dependent. Event creation can lead to unnecessary and unexpected host overheads for applications that do not require events to ensure correct program execution, such as applications that use in-order queues or that never use the events returned when commands are submitted. A SYCL backend implementation may wish to elide the creation and usage of native events; however, if the SYCL interface always returns an event, then events must be created without knowing whether the user will ever use them. This talk will propose adding new launch mechanisms to SYCL that add several capabilities, including an opt-in capability for managing whether an event is created upon command submission.
Events may also carry profiling information if the event comes from a command submission to a queue that was created with the `enable_profiling` property. Event profiling is a useful feature that allows programmers to query when commands are submitted, begin execution, and complete. However, this feature also has a cost, and applications may not require this information for every command submitted to a queue. Queues also lack flexibility to toggle this behavior on and off during runtime, only accepting the property at queue creation. This talk will propose new APIs to support more fine-grained event profiling that allow applications to measure only those commands of interest.
Events are a key part of the SYCL programming model, powering many different capabilities including task scheduling and command profiling. However, application experience has shown that events can introduce undesired and unnecessary overheads. Additionally, not all applications wish to use the full DAG capabilities of out-of-order queues, preferring the simpler semantics of in-order queues. We believe that the new extensions presented in this talk grant programmers the flexibility to use the model that fits their applications best, while only incurring the performance penalties introduced by events when desired or necessary. Providing the right abstractions for the job will ensure the long-term viability and success of SYCL.
show / hide abstract
SYCL has lately proven to be a very useful standard for heterogeneous computing, used in embedded hardware as well as in High Performance Computing. One area where the standard is lacking, though, is support for processing images, which has not changed much from the original design that reflected OpenCL images. To address this disparity, our work focuses on the development and implementation of Bindless Images, a significant update to SYCL images aimed at aligning image support more closely with contemporary graphics APIs.
During this presentation we will introduce images, touching on how they are used and the areas where SYCL images fall short. Next, we will introduce our work on Bindless Images, provide a comparative analysis of the differences with respect to SYCL 2020 images, and highlight the interoperability support they provide. We will finish by outlining real-world use cases, their impact on the project’s design, and the future work we have planned.
We start off with an introduction to the importance of images, how and where they are used, and a brief history of bindless textures.
We continue with a deep dive into why SYCL images are insufficient, discussing their limited flexibility, coarse control, sparse feature set, and the difficulties of translating other image APIs into SYCL.
Then comes the main part, which introduces our work on Bindless Images as a means to provide developers with an environment that is both more flexible and versatile. Bindless Images are currently an experimental DPC++ vendor extension, available in the open source Intel LLVM repository. At the moment they are only available for the CUDA backends, with certain minor limitations.
A comparative analysis is presented, shedding light on the limitations of the existing SYCL 2020 images and showcasing the newfound flexibility offered by Bindless Images. This includes, but is not limited to, being able to separate image memory allocation from the actual image, which among other things enables using existing Unified Shared Memory (USM) allocations as image memory.
An important aspect of Bindless Images is the ability to provide interoperability with other graphics APIs. SYCL 2020 interoperability support is limited to APIs that are also used as a SYCL backend, so-called SYCL backend interop; SYCL does not yet have a way to interoperate with non-backend APIs. In our work we present our current focus on interoperability with Vulkan and DirectX 12.
The interoperability section covers not only being able to import image memory from other graphics APIs, but also the importing of synchronization primitives like semaphores.
There are ongoing efforts to improve Bindless Images for real-world use cases, such as integrating them into the SYCL backend for Blender, and providing support for automatic translation of CUDA images to Bindless Images via SYCLomatic. We discuss a bit about these efforts and how they’re influencing the future direction of the project.
Lastly, we further discuss future work, which includes adding support for more backends and preparing an official Khronos extension.
Through this presentation, we aim to highlight the transformative impact of Bindless Images on the SYCL standard, marking a significant step toward empowering developers with a more adaptable and feature-rich environment for heterogeneous computing. As we delve into each aspect, attendees will gain insights into the intricacies of Bindless Images, fostering a deeper understanding of its implications for the SYCL programming model and its potential for bringing flexible image manipulation across a range of devices in a single standard. This research not only addresses existing limitations but also sets the stage for continued collaboration and innovation in the realm of heterogeneous computing.
show / hide abstract
This technical presentation will describe two related extensions to the SYCL language that DPC++ is developing. The main motivation is to provide the ability to “online” compile (i.e., just-in-time compile) a SYCL kernel from a string while the application runs. OpenCL provides the ability to online compile a kernel from an OpenCL C string, and CUDA provides a similar feature with its NVRTC library. DPC++ users have been requesting similar functionality.
SYCL’s specialization constants provide a similar feature, but it is more limited than online kernel compilation. Specialization constants allow an application to tune values used within a kernel, but they do not provide a good way to dynamically change the algorithm a kernel uses. Therefore, online compilation can provide a benefit beyond specialization constants. Applications can dynamically construct a string that defines a kernel, customizing the kernel according to the input data set or according to the features available on the target device.
The online compiler requires a change in the way parameters are passed to a SYCL kernel. For example, capturing variables through a lambda expression no longer makes sense when the kernel is defined at runtime. Instead, the application must set kernel argument values explicitly via an API like “set_arg”, where each argument is identified by its index in the parameter list. This is problematic for variables captured through a lambda expression, though, because C++ does not define an order to the captures. This led us to propose a new syntax for defining a SYCL kernel as a plain function, where each function argument corresponds to one kernel argument. Thus, the application can dynamically create a string that defines a function. Once the string is online compiled, the application invokes the kernel by explicitly setting values for each of the function arguments.
This new kernel syntax led us to propose a second SYCL extension – free function kernels. If the online compiler allows the user to define a kernel as a plain function, why not allow this syntax even for SYCL applications that do not use the online compiler? We think this syntax may be more familiar to users migrating from CUDA, so it may help attract new users from the CUDA ecosystem. This syntax raises some interesting questions, though. For example, how does the kernel get the invocation ID if the function arguments are the kernel arguments? How does the language distinguish between “range” kernels, “nd-range” kernels, and “single-task” kernels? Our presentation explains how we solved these problems.
We also think the online compiler can be extended to other languages, providing new opportunities for interoperability between SYCL and other languages. For example, a SYCL application could dynamically create a string that defines a kernel, where the string uses CUDA syntax or OpenCL C syntax instead of SYCL syntax. We think this could also help users migrate from CUDA. For example, a CUDA application using the NVRTC library could migrate the “main” part of the application to SYCL while still dynamically generating a string that defines kernels in CUDA (for use on backends that support CUDA).
Another possibility is to allow the application to dynamically load a kernel using SPIR-V. For example, the application could read a file defining a SPIR-V module and then invoke kernels from that module using the same APIs we propose for online kernel compilation. We think this would enable new types of interoperation in SYCL. A user could define a kernel in any language that can be compiled to SPIR-V and then invoke the kernel from a SYCL application. Ninja users could even hand-craft kernels by coding them directly in SPIR-V.
These extensions are in early stages of development in DPC++, so the goal of our presentation is to raise awareness of the idea and gather community feedback.
show / hide abstract
However, as we demonstrate, the SYCL specification lends itself to be implemented purely in software in a manner that is accessible to debuggers and which can be employed to simulate the characteristics of various hardware targets. We introduce SimSYCL, a novel library-only SYCL implementation with extensive simulation and verification capabilities. By executing all SYCL commands synchronously on the host CPU, it is able to diagnose various manifestations of undefined behavior within kernels, and grants developers the ability to step into kernels with an ordinary debugger to discover other logic errors. We demonstrate that the reduced complexity of this approach, combined with an implementation focus on fast compilation, considerably speeds up the edit-compile-debug cycle compared to other SYCL implementations, while maintaining reasonable runtime performance. Furthermore, we show how SimSYCL’s simulation capabilities allow unit-testing user code for cross-platform portability, and that its comprehensive validation detects and reports several classes of user errors which remain undiagnosed by performance-focused implementations.
show / hide abstract
To achieve high performance, modern HPC systems take advantage of heterogeneous GPU architectures. Often these GPUs are utilized through a vendor-preferred parallel programming model. Unfortunately, this results in application code that is not portable across vendors.
To address this issue, open parallel programming models have been introduced. One such parallel programming model is provided by the RAJA Portability Suite. RAJA is a portability layer that provides an abstract application developer API as a library through modern C++. In RAJA, computational kernels are lowered down to a backend language appropriate for the target architecture. Additionally, RAJA is designed to provide control to the application developer over kernel execution while minimizing modification to the application source code.
In this work, we describe the SYCL backend implementation in RAJA for Intel GPU execution. We discuss the implementation of various features in the SYCL backend along with the challenges and lessons learned. Finally, we investigate the performance impact of executing several HPC kernels through RAJA when compared to direct SYCL implementations.
show / hide abstract
Posters
The posters will be on display during the breaks and lunch on Wednesday and Thursday.
show / hide abstract
In response to the increasing demand for incorporating LLVM sanitizers into heterogeneous computing environments, this poster outlines the development of a unified sanitizer framework for SYCL kernel code. The focus of this unified sanitizer framework is on detecting memory access errors, memory leaks, uninitialized memory usage, undefined behavior, and multi-threading issues in SYCL kernel code. The challenge lies in designing a framework that caters to the diverse features supported by each device. For example, CPU kernel code shares the same address space with host code, while GPU kernel code has a separate address space distinct from host code, with additional complexities introduced by the GPU’s memory hierarchy. Directly migrating LLVM sanitizers to GPU devices becomes intricate due to these variations.
Our ongoing implementation of a unified framework leverages AddressSanitizer (ASan), a runtime memory error detector for C/C++ languages, specifically tailored for SYCL kernels. The extension of ASan’s instrumentation pass to SPIR-V IR facilitates compatibility with the latest LLVM in the SYCL frontend compiler and mitigates differences in backend compilers. Additionally, a new layer named the Sanitizer Layer is added to the Unified Runtime (UR), implemented using standard UR APIs, providing common ASan runtime support for UR Adapters.
Currently, the implementation includes the detection of out-of-bounds errors on Unified Shared Memory (USM) and static local memory for both OpenCL CPU and Level-Zero backends. Key steps involve extending the AddressSanitizer pass from LLVM to instrument SYCL kernels, enabling the sanitizer layer in the UR loader, and implementing memory access checks based on shadow memory within kernel code.
Upon execution, the framework reports detailed error information, including work-item ID, kernel name, source code line details, and data access details. The partial open-source release in the GitHub intel/llvm repository demonstrates the capability to identify out-of-bounds memory access errors in SYCL kernel code.
The presented solution significantly reduces the effort required for supporting diverse offload scenarios and empowers users with comprehensive error information for efficient debugging and resolution.
Moreover, this solution can not only be easily extended to accommodate other SYCL backends such as CUDA and HIP, but can also streamline support for OpenMP offload. Furthermore, it provides a foundation for the incorporation of additional sanitizers like MemorySanitizer and ThreadSanitizer.
show / hide abstract
SYCLomatic is an open-source tool that assists developers in migrating existing heterogeneous computing projects written in CUDA to SYCL. Typically, 90%–95% of CUDA code migrates to SYCL code automatically, and inline comments are provided to help developers complete and tune the code. When migrating a CUDA project, developers can use SYCLomatic following the workflow in Fig. 1. SYCLomatic can migrate most of the CUDA code, including CUDA APIs, CUDA types, and kernels, to SYCL code automatically using the provided compilation database. Inline comments are inserted into the migrated source code to flag imperfect migrations, optimization opportunities, or unsupported CUDA APIs. Developers may need to spend some effort addressing these comments to complete the migration or improve the run-time performance of the migrated code. Besides migrating the CUDA source code, SYCLomatic can generate Makefiles for the migrated project from the information in the compilation database, saving substantial manual effort in building the migrated project. After migration, the SYCL project can be built with SYCL compilers[3] and run on multiple devices: CPUs, GPUs, FPGAs, and other accelerators.
Besides code migration, SYCLomatic has introduced further utility features to improve the user experience: evaluating the required migration engineering effort (Analysis Mode), automatically migrating build scripts (CMake script migration), and debugging/verifying the migrated code (CodePin).
show / hide abstract
The Fast Fourier Transform (FFT) is a widely used algorithm in digital signal processing. The FFT computes the discrete Fourier transform (DFT) of a sequence, converting it from the temporal or spatial domain to the frequency domain. The DFT operation is useful for many signal processing applications, but computing it directly from the definition is too slow to be practical. An FFT algorithm reduces the complexity from O(N^2) to O(N log N), where N is the data size.
This work describes how to accelerate the FFT algorithm for Qualcomm’s Adreno GPUs using OpenCL. We discuss optimization of one-dimensional FFT implementations such as the Cooley-Tukey FFT and Stockham FFT.
show / hide abstract
SYCL is a royalty-free, cross-platform programming model for heterogeneous computing based on C++. Intel oneAPI[1] provides a SYCL compiler implementation and runtime library to support SYCL kernel-based programming and a set of optimized libraries to enable API-based programming.
SYCLomatic[2] is an open-source project to assist developers migrating existing code written in other programming languages into the SYCL C++ heterogeneous programming model. Currently, SYCLomatic supports source-to-source migration of applications in CUDA into SYCL source code that leverages SYCL interfaces and the optimized libraries provided by Intel oneAPI.
Due to differences in the software and hardware stacks targeted by the two languages, the run-time behavior of the migrated SYCL program can sometimes be inconsistent with the original CUDA version. Among other causes, this can be attributed to:
• Differences in arithmetic precision between hardware
• Semantic differences between the CUDA and SYCL APIs
• Differences in data type implementation and size between CUDA and SYCL
• Errors introduced during the automatic migration
In such cases, identifying the points of divergence between the CUDA and SYCL programs can be difficult, especially since there is no debugging facility to monitor and compare the execution status of the CUDA and SYCL versions simultaneously.
Normally, to debug inconsistent runtime behavior of an application, a user needs to add instrumentation implementing data checkpoints in both the CUDA and SYCL code. At these checkpoints, the instrumentation should perform the following steps:
• Synchronize unfinished device tasks
• Copy memory from device to host
• Dump the target objects into a format suitable for comparison
In general, the first two steps can be done with simple API calls. The major effort is in the third step, since the user needs to implement serialization functions to dump both the CUDA and SYCL objects.
To reduce the effort of manual instrumentation, we introduce CodePin, a sub-feature of SYCLomatic that automatically injects instrumentation to realize data checkpoints before/after specific API calls and kernel calls. Fig. 1 details the general workflow of using CodePin to perform program comparison. As shown in the figure, the user can use CodePin to generate CUDA code and migrated SYCL code with instrumentation injected at data checkpoints. When these programs are executed, the instrumentation dumps the values of data structures into report files at each data checkpoint. The user can then compare the reports to identify the data checkpoint at which the behavior of the two programs begins to diverge.
As shown in Fig. 2, if CodePin is enabled when migrating a CUDA project to SYCL, extra ASTMatchers are enabled to match selected CUDA APIs and all CUDA kernel calls. When the ASTMatchers match the selected API or kernel calls in the Abstract Syntax Tree (AST), instrumentation is generated and injected around these calls in the CUDA and SYCL code. Furthermore, a static analysis function is invoked to analyze the memory layout of the CUDA classes of all related variables. The result of the memory layout analysis, including the field name, type, and offset of all field members, is saved in a schema file in the form of JSON strings.
While the CUDA schema can be acquired by analyzing the information provided by AST, obtaining the SYCL schema is relatively tricky since the AST of the migrated SYCL code is not available. CodePin maintains a list of schemas of native SYCL classes and constructs the SYCL schema of a user-defined class by adjusting the CUDA schema according to the type migration logic in SYCLomatic.
Another challenge of dumping data is determining the size of dynamically allocated objects. Since the size of such an allocation can only be determined at run-time, CodePin adds instrumentation after each device malloc function, such as cudaMalloc(). This instrumentation records the address of the pointer and the size of the allocated memory in a global map. The data checkpoint API can then consult the map to retrieve the number of objects behind the pointer.
Fig. 3 shows a usage example of CodePin generating the instrumented CUDA program and the migrated SYCL program from the original CUDA program. As highlighted, the original CUDA code hardcodes the size of “int3” as 12 in line 35, which causes incorrect execution results in the migrated SYCL code because the size of “sycl::int3” is 16. To debug the issue, CodePin instruments the functions gen_prolog_API_CP() and gen_epilog_API_CP() to collect data checkpoints before and after the kernel call. After execution, the two data checkpoints are generated and compared as shown in Fig. 4. The comparison reveals the divergence in behavior between the CUDA and SYCL versions at the highlighted checkpoint collected before the kernel call, hinting that the data passed into the kernel is incorrect.
In conclusion, CodePin can reduce the debugging/verification effort when migrating CUDA programs to SYCL with SYCLomatic. Given CUDA source code, CodePin generates instrumented CUDA code as well as its migrated SYCL code. When executed, these instrumented programs generate reports at each data checkpoint, which can be used to identify points of divergence between the CUDA and SYCL programs. CodePin can analyze most user-defined CUDA classes and generate the schema file for CUDA and SYCL classes, except for classes with virtual inheritance. New features, such as auto-comparing the data reports and loading a data checkpoint from a CUDA report on the SYCL side to verify subsequent behavior, are under development. CodePin is distributed open-source and available in the SYCLomatic repository.
show / hide abstract
The world’s largest particle accelerator, located at CERN, produces petabytes of data that need to be analysed efficiently, to study the fundamental structures of our universe. ROOT is an open-source C++ data analysis framework, developed for this purpose. Its high-level data analysis interface, RDataFrame, currently only supports CPU parallelism. Given the increasing heterogeneity in computing facilities, it becomes crucial to efficiently support GPGPUs to take advantage of the available resources. SYCL allows for a single-source implementation, which enables support for different architectures.
In this poster, we describe a CUDA implementation and the migration process to SYCL, focusing on a core high energy physics operation in RDataFrame — histogramming. We detail the challenges that we faced when integrating SYCL into a large and complex code base. Furthermore, we perform an extensive comparative performance analysis of two SYCL compilers, AdaptiveCpp and DPC++, and the reference CUDA implementation. We highlight the performance bottlenecks that we encountered, and the methodology used to detect these. Based on our findings, we provide actionable insights for developers of SYCL applications.
show / hide abstract
Compute-intensive applications, scientific and medical research workloads, advanced image processing workloads, and AI-assisted data science use cases all have one thing in common: moving their software from one hardware platform to another has long been cumbersome and costly. Single-vendor architecture programming limits code reuse and portability, hampering the adoption and proliferation of the latest technology advances that could speed up computing, drive innovation, and scale solutions with greater productivity.
In early 2022, the open-source SYCLomatic project went from being an ambitious effort on GitHub to being a released product, complete with its binary distribution counterpart, the Intel DPC++ Compatibility Tool.
It provides a solution for many workloads whose accelerated offload compute implementation is still based only on CUDA. SYCLomatic takes a Visual Studio solution or a makefile-based project, analyzes it, captures the build configuration in a CLANG JSON compilation database file if desired, and then processes all the CUDA kernel and C++ source and header files in the project.
The result is a new, fully migrated project with an annotated C++ source providing warnings and guidance on additional steps to complete the migration, where automation alone is not sufficient to make the right heuristic decisions with 100% accuracy.
The initial level of completeness for the migration has improved considerably over the last two years, as well as the level of comprehensive guidance the tool provides.
But the evolution has not stopped there. SYCLomatic automatically migrates an ever-growing set of CUDA library calls from cuBLAS, cuFFT, cuRAND, cuSolver, cuSparse, cuDNN, Thrust, CUB, and NCCL to equivalent open-standards SYCL-based library calls covered by the Unified Acceleration Foundation (UXL) and its underlying oneAPI specification. It thus takes full advantage of the open SYCL abstraction layer, as well as the comprehensive set of compilers, libraries, tools, and code samples provided by the UXL Foundation in pursuit of its vision of a multiarchitecture, multivendor software ecosystem for all accelerators, based on open standards.
The work presented will cover a step-by-step approach to adding SYCL support to your C++ code and migrating existing CUDA-based implementations to SYCL.
show / hide abstract
SYCL is a heterogeneous programming model that allows users to specialize their code using standard C++ features such as templates. Specializing code and taking available hardware features into account can enable optimizations for different targets. Yet specialization using C++ templates may lead to an increase in the resulting binary size or, in the worst case, may not be feasible at all. In these cases, SYCL programmers might find a powerful tool in specialization constants to specialize their code at runtime, potentially optimizing binary size and compilation time.
Specialization constants (sycl::specialization_id) are values that are guaranteed to be constant during the execution of a kernel and whose value is only given at kernel launch time. This approach enables SYCL implementations to specialize kernels at runtime, generating code on the fly using the now-known constant values. This simple mechanism allows programmers to write complex kernels with specialization constant–driven control flow that will be simplified at kernel launch time. Also, as the specialization only happens at runtime, specialization constants have no impact on binary size at application compile time.
The case of SPIR-V targets is of special interest, as this intermediate language provides native support for spec constants. SPIR-V spec constants work similarly to SYCL’s, while adding support for two additional constructs: specialization operations, yielding a specialization constant from an operation taking specialization constants or constants as arguments; and specialization constant length arrays, arrays whose length can be given by a specialization constant. In this talk, we will show how this latter capability can be leveraged in SYCL code to write more size-efficient and, in specific cases, even better-performing kernels. Thus, we propose a new SYCL extension, Spec Constant Length Arrays (SCLA).
The SYCL device memory model defines three distinct address spaces: global memory, accessible to all work-items in all work-groups; local memory, shared among all work-items in a work-group; and private memory, private to each work-item. As private memory does not need to be shared between work-items, SYCL implementations may use faster—but more limited—physical device memory to map it, e.g., registers. In the current SYCL specification, there is no way of expressing dynamic private memory allocations, so we propose SCLA as an extension to enable this feature.
The current SCLA experimental implementation in DPC++ is built on top of Clang’s C++ variable length arrays (VLA) extension, replacing the integral size argument with a SYCL specialization constant.
This syntax was chosen to prove the concept. In the future, we plan to evolve the extension to use a more idiomatic syntax for SYCL and make further improvements.
We use portDNN as a use-case scenario for SCLA. portDNN is an open-source SYCL library providing several heavily-specialized kernels implementing deep neural network operators. Experiments were performed on a system with an Intel Core i7-8700 CPU with 31GiB RAM and an Intel UHD Graphics 630, running Ubuntu 22.04.3 LTS (Linux kernel 6.2.0) and Intel Level Zero driver version 1.3.26918.
portDNN specializes its kernels using different vector widths. In our experiments, we replaced vectors with SCLA allocations in the kernels by providing a vector-like interface built with spec constants on top of our extension.
For now, this change was made for two of the thirteen available operators (binaryop and pointwise), but it is applicable to more.
This change reduced the library size by 10% while maintaining performance and generating identical JIT-compiled code. Taking kernel execution time as the metric, the geometric mean speedup was 1.01x when running the portDNN-provided binaryop and pointwise benchmarks.
As specialization now happens at run time, the number of files compiled at application build time drops by 20%, yielding a 1.19x compile-time speedup and a 1.11x link-time speedup, for a total build-time speedup of 1.18x. Interestingly, compilation of the new individual files slowed down to 0.54x, but the reduction in the number of files being compiled more than paid off. Moving specialization to run time does, however, affect JIT compilation: we measured a 0.91x slowdown in JIT compilation time, but this cost is expected to be incurred only once, with the result cached afterwards.
With use cases such as replacing explicit vectorization with arrays whose size is given through specialization constants, SCLA could be the first in a series of SYCL extensions building new features on top of spec constants. One example would be a vector-like interface for creating dynamically sized vectors specialized for the hardware vector width, for algorithms such as those in portDNN, while greatly reducing the resulting binary size.
SCLA was also tested on portFFT, an open-source library implementing Fast Fourier Transforms using SYCL.
In the absence of the ability to dynamically allocate private memory, the library had to make conservative allocations to cope with different parameter sizes.
Using SCLA in place of these allocations, and always allocating only the necessary size, yields a 1.09x speedup on the same system with the same metrics. As the only change in the generated code was the private memory allocation size, we attribute the speedup to register spilling in the original code caused by the larger allocations.
These results demonstrate how specialization constants and SCLA allow specialization of SYCL programs to be delayed until run time, when full information about the target hardware is available. This can significantly reduce binary size, an important factor when shipping libraries, allows such algorithms to run in more memory-constrained environments while keeping a high degree of specialization, and thus facilitates the portability of SYCL code.
In the poster, we will present the current SCLA extension and show how it can be used instead of templates to specialize code in libraries such as portDNN or portFFT at run time, leading to significant reductions in binary size and compile time. We will also share ideas on how the extension can evolve toward a more mature, SYCL-like interface.