The 2022 conference will be a blend of live tutorials and panel sessions, alongside on-demand presentations of the papers, technical talks and posters, all supported by a dedicated Slack workspace for asking questions and networking.
- Instructions for joining the live sessions and Slack channels, and for viewing the presentations, will be included in the registration confirmation.
- The on-demand presentations will be available to registered delegates from Wednesday May 11 at 9:00 CET.
- If you are missing a registration email please see: What to check when you don’t receive Eventbrite emails
Tutorials
Application Development with SYCL
This tutorial gives developers a practical way to gain expertise with SYCL, in an environment focused more on writing code than on PowerPoint slides.
Attendees will gain a background in how the designers of the SYCL standard have addressed heterogeneous programming in C++ through industry collaboration. SYCL has gained widespread support in recent years and is available on exascale systems, desktops, embedded systems, FPGAs, and automotive platforms. Regardless of how particular constructs evolve in the future, the material in this course will prove timeless.
This course starts with the fundamentals of heterogeneous parallelism using SYCL, then shows how to use modern C++ and the SYCL programming model to build parallel algorithms for heterogeneous devices. Most of the programming focus will be on GPUs, but some time will be spent applying the techniques to simple FPGA examples. You will also learn how to apply some common GPU optimizations.
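As a taste of the material – a minimal sketch, not part of the official course exercises – a complete SYCL program using the buffer/accessor model looks roughly like this:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024);

  sycl::queue q;  // selects a default device, e.g. a GPU if one is available
  {
    // Buffers manage the host data for the duration of this scope.
    sycl::buffer bufA{a}, bufB{b}, bufC{c};

    q.submit([&](sycl::handler& h) {
      sycl::accessor A{bufA, h, sycl::read_only};
      sycl::accessor B{bufB, h, sycl::read_only};
      sycl::accessor C{bufC, h, sycl::write_only};
      // One work-item per element; the runtime maps them onto the device.
      h.parallel_for(sycl::range<1>{1024},
                     [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });
    });
  }  // buffer destruction waits for the kernel and copies results back to c
}
```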
Agenda
Times are approximate and are shown in US Pacific (PDT), UK (BST) and Central European (CET) time.
- Introduction – Aksel Alpay, U of Heidelberg
- PDT 9am – 9:30am, UK 5pm-5:30pm, CET 6pm – 6:30pm
- Enqueuing a SYCL Kernel – Igor Vorobstov, Intel
- PDT 9:30am – 10:15am, UK 5:30pm – 6:15pm, CET 6:30pm – 7:15pm
- SYCL Data Management – Peter Zuzek, Codeplay
- PDT 10:15am – 11:00am, UK 6:15pm – 7:00pm, CET 7:15pm – 8:00pm
- ND Range Kernels – Ronan Keryell, AMD
- PDT 11:00am – 12:00pm, UK 7:00pm – 8:00pm, CET 8:00pm – 9:00pm
- Image Convolution Case Study – Peter Zuzek, Codeplay
- PDT 12:00pm – 1:00pm, UK 8:00pm – 9:00pm, CET 9:00pm – 10:00pm
- Date: Tuesday May 10, 2022
- Start time: 09:00 PDT | 18:00 CET | Duration: Approx 4 Hours
- SYCL | Paper ID: 008
- View Presentation Recording (no registration required)
- Accompanying slides and exercises: https://github.com/codeplaysoftware/syclacademy/tree/iwocl22
oneAPI Developer Summit
On day one of this Intel-hosted virtual summit you will hear from a range of speakers from industry and academia working on innovative xPU solutions developed on oneAPI. On day two, Ben Odom from the Intel Developer Evangelist team will present a ‘oneAPI SYCL Essentials’ tutorial.
- Developer Summit: Monday May 9, 2022 – 9:00 CST Duration: 3:45 hours
- oneAPI SYCL Essentials: Tuesday May 10, 2022 – 9:00 CST Duration: 2:00 hours
- This sponsored tutorial is run by Intel in association with SYCLcon.
- Registration and additional information: Intel’s oneAPI Developer Summit
Live Panel Discussions
OpenCL Panel Discussion
Panel Chair: Simon McIntosh-Smith, University of Bristol
- Date: Wednesday May 11, 2022
- Start time: 16:00 BST, 17:00 CET, 11:00 EST, 08:00 PDT
- Duration: Approx. 60 mins
- LIVE Zoom Webinar
- Panel Members
- Neil Trevett, Khronos and NVIDIA
- Balaji Calidas, Qualcomm
- Moritz Lehmann, University of Bayreuth
- Nagy-Egri Máté Ferenc, Stream HPC
- Ben Ashbaugh, Intel
- Alastair Murray, Codeplay
- Kevin Petit, Arm
- View Presentation Recording
- No registration required
OpenCL | Paper ID: P1
SYCL Panel Discussion
Panel Chair: Tom Deakin, University of Bristol
- Date: Thursday May 12, 2022
- Start time: 16:00 BST, 17:00 CET, 11:00 EST, 08:00 PDT
- Duration: Approx. 60 mins
- LIVE Zoom Webinar
- Panel Members
- Michael Wong, Codeplay
- Aksel Alpay, Heidelberg University
- Ronan Keryell, AMD
- Roland Schulz, Intel
- Verena Beckham, Codeplay
- Nevin Liber, Argonne National Laboratory
- View Presentation Recording
- No registration required
SYCL | Paper ID: P2
NEW Accelerating Machine Learning with OpenCL
This special Khronos Group live webinar is co-located with IWOCL as part of their new Machine Learning Forum. During the webinar members of the OpenCL working group at Khronos will share the latest updates to the OpenCL language and ecosystem that can directly benefit Machine Learning workflow performance.
- Date: Wednesday May 11, 2022
- Start time: 10:00 PDT, 13:00 EST, 19:00 CEST
- Duration: Approx. 60 mins
- LIVE Zoom Webinar
- Talks / Presenters
- Introduction and Invitation to Join the Machine Learning Forum – Neil Trevett, Khronos and NVIDIA
- Qualcomm Extensions for Advancing Machine Learning Acceleration – Balaji Calidas, Director of Engineering, Qualcomm
- A Case Study on OpenCL vs GPU Assembly for Machine Learning Performance – Roy Oursler, Intel
- Ask the Experts Q&A and Panel Discussion
- View presentation slides and videos on the Khronos Events page.
OpenCL | Paper ID: P3
Keynote and Invited Talks
Keynote
10 Years of IWOCL and SYCLcon: A Decade of Breakthroughs, But is the Best Yet to Come?
Speaker: Simon McIntosh-Smith | University of Bristol
Speaker Biography – Professor Simon McIntosh-Smith is a computer scientist focused on high performance computing (HPC), parallel programming languages and fault-tolerant computing. He is Head of the HPC research group at the University of Bristol and lead researcher at the Isambard GW4 Tier 2 HPC Centre.
SYCL & OpenCL | Paper ID: K2 | View YouTube Playlist | View Slides
Keynote
Pushing the boundaries of SYCL with hipSYCL
Speaker: Aksel Alpay | Heidelberg University
From its inception, a key goal of the hipSYCL project has been to explore other interpretations of SYCL, and to leverage options provided by the specification that have not yet been widely employed by other implementations, in order to understand their impact. This talk will discuss how hipSYCL has repeatedly questioned established implementation approaches and experimented with alternatives – some of which are now well established in the SYCL world, such as non-OpenCL backends, while others are still hipSYCL-specific, such as the scoped parallelism programming model, buffer-USM interoperability, automatic memory management below buffer granularity, multi-device queues, asynchronous buffers, library-only device backends and compiler-accelerated CPU support without OpenCL.
Throughout the talk, we will highlight how these approaches can impact the ecosystem, performance or usability of SYCL code, and raise important questions: What is SYCL? What is a queue? What is really needed to implement SYCL?
Speaker Biography – In 2015 Aksel obtained a bachelor’s degree in physics, followed in 2018 by a master’s degree in physics with a specialization in computational astrophysics (both from Heidelberg University). During his studies, he designed and developed teralens, an OpenCL-based tree code for gravitational quasar microlensing which, to this day, is the fastest known code in its field. Afterwards, he focused on developing tools for HPC, most notably creating hipSYCL. hipSYCL is one of the four major SYCL implementations in use today, and the very first SYCL implementation to pioneer an interpretation of SYCL without OpenCL. In late 2018, he joined Heidelberg University Computing Centre’s HPC team and EMCL.
Since 2019, he has been a member of the Khronos SYCL working group.
SYCL & OpenCL | Paper ID: K1 | View YouTube Playlist | View Slides
Invited Talk
OpenCL Working Group – A State of the Union
Speaker: Neil Trevett, Khronos Group President and OpenCL Working Group Chair | NVIDIA
Speaker Biography – Neil Trevett is the President of The Khronos Group (and OpenCL working group chair), where he has helped initiate and evolve royalty-free open standards such as OpenCL, OpenVX, NNEF, OpenXR, Vulkan, OpenGL ES, WebGL and glTF, many of which are widely deployed in the graphics and embedded vision industries. At NVIDIA, Neil works to drive and develop the developer ecosystem that enables applications to take advantage of advanced GPU and silicon acceleration.
OpenCL | Paper ID: K4 | View YouTube Playlist | View Slides
Invited Talk
SYCL Working Group – A State of the Union
Speaker: Michael Wong, Khronos Group, SYCL Working Group Chair | Codeplay Software
Speaker Biography – Michael is a Distinguished Engineer at Codeplay, Chair of the ISO C++ SG19 Machine Learning study group, and Head of the Canadian Delegation to the C++ Standards Committee. Michael has rich experience in C++ parallel computing, high performance computing and machine learning. He has led the development of the SYCL standard, a C++ heterogeneous programming language for GPGPU application development, and has deep insight into the underlying performance optimization of TensorFlow. His work covers parallel programming, neural networks, computer vision, autonomous driving and other fields. Michael was previously a senior IBM technical expert who led the development of the IBM XL C++ and XL C compilers.
SYCL | Paper ID: K3 | View YouTube Playlist | View Slides
Research Papers and Technical Talks
OpenCL Command-buffer Extension: Design & Implementation
The cl_khr_command_buffer [1] extension, provisionally released in November 2021 as part of OpenCL 3.0.10, provides such a solution. This extension introduces the concept of a command-buffer, which is recorded once with a graph of commands, finalized for submission, and then dispatched for execution many times. Separating command setup from dispatch means that for repetitive workloads the command recording overheads are only incurred once. Additionally, optimization opportunities are introduced at the point of finalization, after which no more commands can be recorded and the command-buffer is made ready for execution. After finalization, the command-buffer can be asynchronously dispatched with minimal runtime overhead. This separation of concerns enables the pipelined workflows common in machine learning applications by eliminating the latency of having to wait on the host to construct commands again for a similar workload.
In the first half of this technical presentation, we give an overview of the provisionally ratified command-buffer extension and dive into key points of its design. This includes a comparison with the Vulkan command-buffer abstraction [2], which shows that this approach is successful in the real world. The design decision to introduce new entry-points, rather than reuse existing command-queue entry-points with begin/end markers, is also covered, as is the reasoning for omitting host-side synchronization mechanisms from the new entry-points. We also discuss the intended layering of future extensions on top of cl_khr_command_buffer, and why it was decided to split the functionality this way: cl_khr_command_buffer is designed as the base layer applicable to a wide range of vendors. Plans for the upcoming extensions layered on top will also be outlined in broad terms; these remove the restriction tying a command-buffer to a single command-queue and provide mutability of the command-buffer between submissions.
The second half of the presentation relays our experience implementing the command-buffer extension in ComputeAorta [3], Codeplay’s OpenCL implementation, and how this fed back into the extension specification – for example, implementing the simultaneous-use capability that allows more than one submission of a command-buffer instance to be in flight at once. We provide a high-level overview of how command-buffers in ComputeAorta are implemented using the same machinery as regular command enqueues via Codeplay’s proprietary ComputeMux API, and provide details of some of the common pitfalls and gotchas a vendor may face when implementing command-buffers vs. regular OpenCL commands.
[1] https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer
[2] https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkCommandBuffer.html
[3] Alastair Murray and Ewan Crawford. 2020. ComputeAorta: A toolkit for implementing heterogeneous programming models. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 6, 1–2. DOI: https://doi.org/10.1145/3388333.3388652
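To make the pattern concrete, here is a minimal sketch of the record/finalize/replay flow, assuming the provisional cl_khr_command_buffer entry points (in a real program these would be fetched via clGetExtensionFunctionAddressForPlatform, and errors would be checked):

```cpp
#include <CL/cl_ext.h>  // provisional cl_khr_command_buffer entry points

// Record a kernel dispatch once, then replay it many times.
// 'queue', 'kernel' and 'gws' are assumed to be created elsewhere.
void record_and_replay(cl_command_queue queue, cl_kernel kernel, size_t gws) {
  cl_int err;
  cl_command_buffer_khr cmdbuf =
      clCreateCommandBufferKHR(1, &queue, nullptr, &err);

  // Record: the NDRange command is stored in the command-buffer rather than
  // executed. The queue argument is NULL because the command-buffer was
  // created with a single queue.
  clCommandNDRangeKernelKHR(cmdbuf, nullptr, nullptr, kernel, 1, nullptr,
                            &gws, nullptr, 0, nullptr, nullptr, nullptr);

  // Finalize: recording ends; the implementation may now optimize the graph.
  clFinalizeCommandBufferKHR(cmdbuf);

  // Replay: setup costs were paid once above, so each dispatch is cheap.
  for (int i = 0; i < 100; ++i)
    clEnqueueCommandBufferKHR(0, nullptr, cmdbuf, 0, nullptr, nullptr);

  clFinish(queue);
  clReleaseCommandBufferKHR(cmdbuf);
}
```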
Speaker: Ewan Crawford, Codeplay
Co-Authors: Jack Frankland, Codeplay
OpenCL | Paper ID: 102 | View YouTube Playlist | View Slides
Rapid Prototyping With Combined Scientific CFD Simulation and Real Time Raytracing Implementation in OpenCL
A much faster approach is to implement both simulation and rendering in OpenCL. The rendering kernels have direct read-only access to the raw simulation data that resides in ultra-fast GPU memory. This eliminates all PCIe data transfer except for camera parameters and finished frames, allowing for interactive visualization of simulation results in real time while the simulation is running. This is an invaluable tool for rapid prototyping.
Although OpenCL, being a general compute language, has no built-in functionality for graphical rendering, it allows an entire graphics engine to be implemented such that no data has to be moved to the CPU during rendering. On top of that, specific low-level optimizations make this OpenCL graphics engine outperform any existing rendering solution for this scenario, enabling drawing billions of lines per second and fluid raytracing in real time on even non-RTX GPUs.
This combination of simulation and rendering in OpenCL is demonstrated with the software FluidX3D [1] – a lattice Boltzmann method (LBM) fluid dynamics solver.
The first part will briefly introduce the numerical method for simulating fluid flow in a physically accurate manner. After introducing the LBM, the optimizations that make it run at peak efficiency are discussed: being a memory-bound algorithm, coalesced memory access is key. This is achieved through a structure-of-arrays data layout as well as the one-step-pull scheme, a particular variant of the LBM streaming step. One-step-pull leverages the fact that the misaligned read penalty is much smaller than the misaligned write penalty on almost all GPUs. Roofline analysis shows that with these optimizations, the LBM runs at 100% efficiency on the fastest data-center and gaming GPUs [2].
To simulate free surface flows, the LBM is extended with the Volume-of-Fluid (VoF) model. An efficient algorithm has been designed to vastly accelerate the challenging surface tension computation [3]. This extremely efficient VoF-LBM GPU implementation allows covering new ground in science: FluidX3D has been used to simulate more than 1600 raindrop impacts to statistically evaluate how microplastics transition from the ocean surface into the atmosphere when spray droplets are generated during drop impact [4]. At the same power consumption, compute time with existing CPU-parallelized codes would have been several years, whilst with FluidX3D it was about a week.
The second part will focus on real-time rendering with OpenCL, especially raytracing. Rasterization on the GPU is parallelized not over pixels but over lines/triangles instead, making runtime mostly independent of screen resolution and lightning fast. Each line/triangle is transformed with the camera parameters from 3D to 2D screen coordinates and then rasterized onto the frame (an integer array) with the Bresenham algorithm [5] and a z-buffer. The raytracing graphics are based on a combination of fast ray-grid traversal and marching-cubes, leveraging the fact that the computational grid from the LBM is already an ideal acceleration structure for raytracing. The idea of raytracing is simple: through each pixel on the screen, shoot a reverse light ray out of the camera and see where it intersects with a surface in the scene. Then (recursively) calculate reflected/refracted rays and mix the colors. If a ray doesn’t intersect with anything, its color is determined by the skybox image via UV mapping and bilinear pixel interpolation. With mesh surfaces consisting of many triangles, computation time quickly becomes a problem, as every triangle has to be tested for intersection with each ray. To overcome this, an acceleration structure is required. While computer games often use a bounding volume hierarchy, the LBM already provides an ideal alternative: the simulation grid. The corresponding algorithm is called ray-grid traversal: when a ray shoots through the 3D grid, intersections with the surface only have to be checked at each traversed grid cell rather than across the entire grid. In each traversed grid cell, the 0-5 surface triangles are generated on the fly with the marching-cubes algorithm, and ray-triangle intersections are checked with the Möller-Trumbore algorithm. Only once an intersection has been found are the normals calculated on the 8 grid points spanning the cell and trilinearly interpolated to the intersection coordinates. The interpolated surface normal makes the raytraced surface appear perfectly smooth.
On the GPU, the ray(s) for each pixel on screen are computed in parallel, vastly speeding up rendering. It is of key importance how to align the OpenCL workgroups on the 2D array of screen pixels: best performance is achieved for 8×4 pixel tiles; this is about 50% faster than 32×1 tiles, because with small, square-ish tiles, all rays of the workgroup are more likely to traverse the same grid cells, greatly improving memory broadcasting. In ray-grid traversal, 8 isovalues spanning a cell have to be loaded from GPU memory for each traversed cell. Once the triangle intersection has been found, the gradient on each of the 8 cell isovalues is calculated with central differences. Instead of loading an additional 6 isovalues for each of the 8 grid points, their isovalues are reused such that only 24 additional isovalues are loaded. For marching-cubes, the algorithm by Paul Bourke [6] is implemented in OpenCL. With 16-/8-bit integers and bit-packing, the tables are reduced to 14% of their original size and stored in constant memory space. For computing the cube index, branching is eliminated by bit operations. The Möller-Trumbore algorithm [7] is implemented in an entirely branchless manner.
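To illustrate the last point – an illustrative sketch, not the FluidX3D source – a Möller-Trumbore ray-triangle test can be written so that the only “decision” is a final select, which GPU compilers lower to predication rather than a divergent branch:

```cpp
struct Vec3 { float x, y, z; };
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 cross(Vec3 a, Vec3 b) {
  return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Möller-Trumbore ray-triangle intersection. Returns the ray parameter t of
// the hit, or -1.0f on a miss. All validity conditions are folded into a
// single predicate; the final ?: compiles to a select, not a branch.
float intersect(Vec3 origin, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2) {
  const Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
  const Vec3 p = cross(dir, e2);
  const float det = dot(e1, p);   // |det| near 0: ray parallel to triangle
  const float inv = 1.0f / det;   // may be inf; masked out by 'hit' below
  const Vec3 s = sub(origin, v0);
  const float u = dot(s, p) * inv;
  const Vec3 q = cross(s, e1);
  const float v = dot(dir, q) * inv;
  const float t = dot(e2, q) * inv;
  // Bitwise & on bools avoids short-circuit branching.
  const bool hit = (det > 1e-6f || det < -1e-6f) &
                   (u >= 0.0f) & (v >= 0.0f) & (u + v <= 1.0f) & (t > 0.0f);
  return hit ? t : -1.0f;
}
```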
This raytracing implementation is fast enough to run in real time for even the largest lattice dimensions that fit into the memory of a GPU. Finally, the combined VoF-LBM simulation and raytracing implementation is demonstrated on the most realistic simulation of an impacting raindrop ever done [8].
[1] https://doi.org/10.15495/EPub_UBT_00005400
[2] https://arxiv.org/abs/2112.08926
[3] https://doi.org/10.3390/computation10020021
[4] https://doi.org/10.1186/s43591-021-00018-8
[5] https://doi.org/10.1147/sj.41.0025
[6] http://paulbourke.net/geometry/polygonise/
[7] https://doi.org/10.1080/10867651.1997.10487468
[8] https://youtu.be/HrwWoIsZG1c
Speaker: Moritz Lehmann, University of Bayreuth
OPENCL | Paper ID: 106 | View YouTube Playlist | View Slides
On the Compilation Performance of Current SYCL Implementations
In this work we set out to study the relative compile-time performance and the impact of various SYCL features on compilation times across a selection of the most widely-used SYCL implementations. To this end, we introduce a code generator which creates SYCL kernels stressing various API features and instruction types, either in isolation or in combination, as well as an infrastructure to largely automate related experiments. We apply this infrastructure in a large-scale synthetic evaluation totaling 96000 compiler runs, which also includes a study of the compilation performance over time of the most widespread implementations. In addition to these synthetic experiments, we validate the applicability of our findings by measuring the compile times of two real-world industrial SYCL applications.
On the basis of these experiments, we point out particularly impactful – in terms of compile-time performance – changes during the development of some SYCL implementations, and formulate suggestions for SYCL implementation developers as well as users. We have made both the code generator and all the tools we developed to carry out the experiments in this paper available as open source.
Speaker: Peter Thoman (University of Innsbruck)
Co-Authors: Facundo Molina Heredia and Thomas Fahringer (University of Innsbruck)
SYCL | Paper ID: 115 | View YouTube Playlist | View Slides
Celerity: How (Well) Does the SYCL API Translate to Distributed Clusters?
In this talk, we will review the SYCL API from the perspective of Celerity and distributed memory programming in general. We will highlight challenges encountered and opportunities for future improvement of the SYCL API.
We will begin our presentation by giving an overview of the Celerity programming model, highlighting its similarities to SYCL and introducing core additions to the API. We will showcase how a typical Celerity program is structured, and how an existing SYCL application can be converted to Celerity. Additionally, we will give a brief overview of how Celerity itself uses SYCL internally to power its distributed execution semantics.
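For orientation – a minimal sketch, not taken from the talk, with API names following the public Celerity examples (they may differ between versions) – a Celerity program looks much like SYCL, except that accessors carry a range mapper telling the runtime which buffer region each node needs:

```cpp
#include <celerity.h>

int main() {
  celerity::distr_queue q;  // one logical queue spanning all cluster nodes
  celerity::buffer<float, 1> buf{celerity::range<1>{1024}};

  q.submit([=](celerity::handler& cgh) {
    // The range mapper (one_to_one) tells the runtime which part of the
    // buffer each node's chunk of the kernel needs, so transfers between
    // nodes can be scheduled implicitly.
    celerity::accessor acc{buf, cgh, celerity::access::one_to_one{},
                           celerity::write_only, celerity::no_init};
    cgh.parallel_for(celerity::range<1>{1024}, [=](celerity::item<1> it) {
      acc[it] = static_cast<float>(it.get_linear_id());
    });
  });
}
```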
The main portion of this presentation will concern itself with investigating important features of SYCL and how well they translate to distributed clusters. We will begin by examining core features such as the high-level data-driven APIs of queues, buffers, command groups and accessors in a distributed context. Next, we will highlight newer additions to SYCL such as host tasks and reductions. Finally, we will take a look at APIs that may be considered problematic from a distributed memory perspective, such as unified shared memory (USM).
We will conclude our presentation with an outlook on what future versions of SYCL could bring to the table to further improve compatibility with distributed memory clusters. We will review HPC use cases that may not yet be fully covered by SYCL and present several potential improvements that would enhance the experience both for us as library developers and for users of the traditional MPI + SYCL approach.
Speaker: Philip Salzmann (University of Innsbruck)
Co-Authors: Fabian Knorr and Peter Thoman (University of Innsbruck), and Biagio Cosenza (University of Salerno)
SYCL | Paper ID: 117 | View YouTube Playlist | View Slides
Untangling Modern Parallel Programming Models
This talk strives to untangle the landscape of modern parallel programming models, to help developers understand how the models and options relate to each other, and to frame how to think about their specific algorithms when expressing them in code. Although experienced developers typically understand much of the terminology and the relationships between models, a holistic presentation of the material is of strong value, as evidenced by feedback from parallel programming experts who have seen previews of this presentation.
To begin, a brief overview will be presented to frame parallel programming and offload compute programming models, followed by characterization of the Single Program Multiple Data (SPMD) abstract model and the power it exhibits when mapping to multiple classes of architecture. We will discuss how fundamental design decisions within a compiler impact the mapping from source code to an underlying programming model, highlighting that the same code can be lowered to multiple models. This is particularly relevant in the presence of vector data types, which permit multiple interpretations and are a common cause of confusion. A core element of the presentation is decomposition of how programming model and design assumptions of a compiler are ideally understood concurrently by developers to streamline the creation and tuning of performant code.
SPMD and explicit Single Instruction Multiple Data (SIMD) programming models will be discussed relative to the Khronos OpenCL and SYCL standards, as well as to OpenMP and CUDA, with the aim of clarifying the concepts and models for developers working in specific languages.
The talk will conclude with an overview of an experimental extension to SYCL that proposes a mechanism for mixing SPMD and explicit SIMD programming styles with clear semantics and boundaries in code. The talk will show that providing clear points of transition with clear semantics can enable expert tuning at the granularity of a single line of code, without breaking the SPMD programming abstraction used by the rest of a kernel.
Parallel programming models such as SPMD and SIMD are critical in the modern landscape of heterogeneous compute architectures. When coupled with decisions made during the implementation of specific compilers, developers are left with a complex task when working to understand how concepts and hardware mappings interact. This talk describes the most common programming models exposed through SYCL, OpenCL, OpenMP, and CUDA, with the intent of clarifying misconceptions and confusion about the mapping of software to hardware. Attendees will leave the presentation with a holistic understanding of how SPMD and SIMD-like programming models fit together, and how they relate to the code that many of us write from day to day.
Speaker: Michael Kinsner (Intel)
Co-Authors: Ben Ashbaugh, James Brodman, Greg Lueck, John Pennycook and Roland Schulz (Intel)
OPENCL & SYCL | Technical Presentation ID: 118 | Register to View Presentation Recording | View Slides
Exploring the Possibility of a hipSYCL-based Implementation of oneAPI
Speaker: Aksel Alpay (Heidelberg University)
Co-Authors: Bálint Soproni, Holger Wünsche and Vincent Heuveline (Heidelberg University)
SYCL | Paper ID: 123 | Register to View Presentation Recording | View Slides
Performance Analysis of Matrix-free Conjugate Gradient Kernels Using SYCL
Speaker: Igor Baratta (University of Cambridge)
Co-Authors: Chris Richardson and Garth Wells (University of Cambridge)
SYCL | Paper ID: 128 | Register to View Presentation Recording | View Slides
A Proof-of-Concept SYCL FFT – Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library
Speaker: Vincent R. Pascuzzi (Brookhaven National Laboratory)
Co-Authors: Mehdi Goli (Codeplay Software Ltd.)
SYCL | Paper ID: 136 | Register to View Presentation Recording | View Slides
Improved Address Space Inference for SYCL Programs
Hardware devices can have multiple memory regions with different levels of visibility and performance. Similar to OpenCL C, SYCL abstracts them into a global memory visible to all work-items, a local memory visible to a single work-group, and a private memory only visible to a single work-item. In OpenCL C, the programmer expresses address spaces using type qualifiers in order to statically encode the memory region addressed by pointers, ensuring that when a programmer does specify an address space, the compiler can check whether the program is well-formed. But requiring programs to be written with explicit address spaces comes at the expense of usability, as these need to be integrated into the program design and are a barrier to integrating code not written with this in mind. Thus, in OpenCL C 2.x/3.0, programmers can use the unnamed generic address space instead. SYCL, on the other hand, does not extend the C++ language, so programmers cannot express address spaces using a type qualifier (the C++ standard does not define them). Thus in SYCL, pointers and references can be lowered by the device compiler to this unnamed generic address space, a virtual address space that can represent several overlapping address spaces at the same time.
The memory being addressed is no longer statically known by the compiler frontend, and the SYCL implementation relies on the hardware, or on software emulation, to correctly dispatch loads and stores to the correct memory. On some hardware targets this flexibility comes with a performance cost, but this can be avoided when the compiler can infer a single address space for a given memory access. Additionally, the low-level compute APIs that are often used as backends to a SYCL 2020 implementation do not guarantee support for a generic address space; for example, it is an optional feature in OpenCL 3.0 and non-existent in Vulkan. This means that a SYCL compiler that can infer all address spaces for a large set of programs can achieve better performance and target a wider range of backend compute APIs. Moreover, recent efforts to bring safety-critical development to SYCL mean it will also need to run on top of Vulkan SC. This makes a well-defined specification for inferring address spaces still relevant for SYCL.
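To illustrate the core issue – a hedged sketch, not from the talk – the same operation in OpenCL C carries the address space in the pointer type, while in SYCL the pointer is plain C++ and the compiler must infer where it points or fall back to the generic address space:

```cpp
#include <sycl/sycl.hpp>

// OpenCL C version: the address space is statically part of the type.
//   __kernel void inc(__global int* data) {
//     __global int* p = &data[get_global_id(0)];  // checkable up front
//     *p += 1;
//   }

// SYCL version: no address-space qualifiers exist in standard C++, so 'p'
// below is a plain int*. The device compiler must either infer that it
// points to global memory, or lower it to the generic address space.
void inc(sycl::queue& q, sycl::buffer<int, 1>& buf) {
  q.submit([&](sycl::handler& h) {
    sycl::accessor data{buf, h, sycl::read_write};
    h.parallel_for(buf.get_range(), [=](sycl::id<1> i) {
      int* p = &data[i];  // unqualified pointer into global memory
      *p += 1;
    });
  });
}
```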
The rules introduced by SYCL 1.2.1 impose significant restrictions on user code. One striking example is the “defaulting rule”: when a pointer declaration has no initializer, the pointer is assumed to address private memory, even if it is initialized in the very next statement. As a consequence, you cannot declare a pointer in a structure without it defaulting to the private address space. In practice, however, these restrictions were not a significant barrier in the context of 1.2.1: large applications such as Eigen were ported to run with SYCL, and new ones such as SYCL-BLAS and SYCL-DNN were built.
SYCL 2020 brought significant changes and added flexibility for users, among them the unnamed generic address space and unified shared memory (USM) pointers. The generic address space lifted the restrictions imposed by 1.2.1, making programs written against SYCL 2020 and the generic address space unlikely to compile under the old inference rules. USM encourages the use of raw pointers instead of the accessor containers, which quickly leads to passing these pointers via structures. As a USM pointer in fact addresses the global memory region, this creates a conflict with the inference rules.
This talk will describe an experimental compiler for ComputeCpp, Codeplay’s SYCL implementation. This compiler employs an improved address space inference method that can efficiently cope with SYCL 2020 features such as the generic address space and unified shared memory (USM) pointers. The talk will also cover the limitations of this approach.
Speaker: Victor Lomuller (Codeplay Software)
Co-Authors: Ross Brunton (Codeplay Software)
SYCL | Technical Presentation ID: 140 | Register to View Presentation Recording | View Slides
A Comparison of SYCL, OpenCL, CUDA, and OpenMP for Massively Parallel Support Vector Machine Classification on Multi-Vendor Hardware
In this paper, we compare the competing programming frameworks OpenMP, CUDA, OpenCL, and SYCL, paying special attention to the two SYCL implementations hipSYCL and DPC++. We investigate the frameworks with respect to their usability, performance, and performance portability on a variety of hardware platforms from different vendors, i.e., GPUs from NVIDIA, AMD, and Intel, and Central Processing Units (CPUs) from AMD and Intel. Besides discussing the runtimes of these frameworks on the different hardware platforms, we also focus our comparison on the differences between the nd_range kernel formulation and the SYCL-specific hierarchical kernels.
Our Parallel Least Squares Support Vector Machine (PLSSVM) library implements backends for the four previously mentioned programming frameworks for a Least Squares Support Vector Machine (LS-SVM). Using it as an example, we show which of the frameworks is best suited for a standard workload frequently employed in scientific computing and AI, depending on the target hardware: the most computationally intensive part of our PLSSVM library is solving a system of linear equations using the Conjugate Gradient (CG) method. Specifically, we parallelize the implicit matrix-vector multiplication inside the CG method, a workload common in many scientific codes.
The PLSSVM code, utility scripts, and documentation are all available on GitHub: https://github.com/SC-SGS/PLSSVM.
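For readers unfamiliar with the two kernel styles being compared, here is an illustrative sketch (not taken from the PLSSVM code; x and y are assumed to be USM device pointers and n a multiple of 256):

```cpp
#include <sycl/sycl.hpp>

// nd_range formulation: the work-group size (here 256) is explicit, and the
// kernel is written from a single work-item's perspective.
void saxpy_nd_range(sycl::queue& q, float a, const float* x, float* y,
                    size_t n) {
  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{256}},
                 [=](sycl::nd_item<1> it) {
                   const size_t i = it.get_global_id(0);
                   y[i] = a * x[i] + y[i];
                 });
}

// Hierarchical formulation: the outer lambda runs once per work-group, the
// inner one once per work-item of that group.
void saxpy_hierarchical(sycl::queue& q, float a, const float* x, float* y,
                        size_t n) {
  q.submit([&](sycl::handler& h) {
    h.parallel_for_work_group(
        sycl::range<1>{n / 256}, sycl::range<1>{256}, [=](sycl::group<1> g) {
          g.parallel_for_work_item([&](sycl::h_item<1> it) {
            const size_t i = it.get_global()[0];
            y[i] = a * x[i] + y[i];
          });
        });
  });
}
```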
Speaker: Marcel Breyer (University of Stuttgart)
Co-Authors: Alexander Van Craen and Dirk Pflüger (University of Stuttgart)
OPENCL & SYCL | Paper: 105 | Register to View Presentation Recording | View Slides
How Much SYCL Does a Compiler Need? Experiences from the Implementation of SYCL as a Library for nvc++
However, in principle, devices other than the host could also be targeted by library-only implementations, provided they operate as libraries written in heterogeneous programming models that can accept regular C++ kernel code, as required by SYCL.
We have implemented the first library-only device backend in a major SYCL implementation by adding support for hipSYCL’s CUDA backend to operate as a library for NVIDIA’s nvc++ compiler, which is part of the NVIDIA HPC SDK.
Being able to run SYCL code with hipSYCL as a library for nvc++, a vendor-supported compiler, guarantees immediate SYCL support for new NVIDIA hardware, compatibility with latest CUDA releases and easy deployment on existing NVIDIA-based HPC systems where the NVIDIA HPC SDK might already be preinstalled.
In this talk, we will describe hipSYCL’s nvc++-based compilation flow, how it is designed and how it operates compared to hipSYCL’s existing clang-based support for CUDA devices. We will also discuss limitations of the current nvc++ support compared to clang.
While it was shown repeatedly that SYCL implementations in general, and hipSYCL in particular, can often deliver competitive performance compared to CUDA, when performance discrepancies with nvcc-compiled code are found it is often difficult to pinpoint whether these are due to the additional C++ layers of the SYCL programming model itself, or due to the different code generation backends of the compilers.
We will show performance results comparing hipSYCL with nvc++ to CUDA compiled with nvc++, thereby eliminating the switch between different compilers as a variable.
Finally, we will discuss library-only implementations from the point of view of the SYCL 2020 specification, and highlight challenges for the implementation of library-only device backends.
Speaker: Aksel Alpay (Heidelberg University)
Co-Authors: Vincent Heuveline (Heidelberg University)
SYCL | Technical Presentation: 124 | Register to View Presentation Recording | View Slides
Experiences Porting NAMD to the Data Parallel C++ Programming Model
heterogeneous computing resources with a vendor-neutral programming paradigm. Data Parallel C++ (DPC++) is a programming language based on the open SYCL standard, providing a vendor-neutral solution. We describe our experiences porting the NAMD molecular dynamics application with its GPU-offload force kernels to DPC++. Results are shown that demonstrate the correctness of the porting effort.
Speaker: David Hardy (University of Illinois at Urbana-Champaign)
Co-Authors: Jaemin Choi and Emad Tajkhorshid (University of Illinois at Urbana-Champaign), and Wei Jiang (Argonne National Laboratory)
SYCL | Paper: 129 | Register to View Presentation Recording | View Slides
Exploring SYCL SC (Safety-Critical)
Safety-critical industries like avionics, automotive, nuclear and rail require their software to be compliant with safety standards such as ISO 26262, DIS 21448/SOTIF, DO-178C and UL 4600, as well as adhering to guidelines defined by AUTOSAR and MISRA. Historically, software written for safety-critical applications has been written in C or entirely generated by modeling tools, but this is changing rapidly.
Software and hardware complexity is growing, and in particular the demand for Artificial Intelligence (AI) has skyrocketed, for which higher levels of abstraction are required. Possibly the most high-profile example of using AI is to enable the development of semi-autonomous and autonomous vehicles, which manufacturers are competing to be the first to develop safely.
Khronos already has two safety-critical variants of APIs: OpenGL SC for graphics and Vulkan SC for graphics and low-level compute. In addition, OpenVX 1.3 has defined a safety-critical feature set for AI/Vision applications. However, in the safety space there is currently no high-level compute API to develop the complex algorithms of tomorrow.
By implementing SYCL SC on top of Vulkan SC and building on safety certified drivers and hardware, the whole stack, all the way from hardware to application, can be safety certified.
SYCL SC will also unlock, for the safety-critical domain, the ecosystem of libraries and tools that already exists for SYCL. Higher-level frameworks, such as AI/ML frameworks, can be built on top of SYCL SC in the future to provide even higher levels of abstraction.
This presentation will talk about the aims of the new standard, which are aligned with Vulkan SC. It will also touch on some initial design ideas, with a focus on deterministic rather than dynamic behavior. We suggest removing some SYCL features that are unlikely to be used in a safety context, to facilitate safety certification of the runtime itself, and adding extensions that provide functionality useful in safety-critical contexts, such as control over the timing of kernels.
We will discuss the importance of the MISRA C++ guidelines, particularly the upcoming MISRA C++ 202x standard, for applications and hence APIs, whilst acknowledging the need to remain as compatible as possible with standard SYCL.
We set up the Exploratory Forum to collect feedback from potential runtime implementers, library developers, researchers and users on what their requirements for a high-level compute API in a safety-critical context are. The Exploratory Forum is open to non-Khronos-members under Khronos NDA and we actively encourage non-members to participate.
Once a wide range of requirements has been collected, the next step is the creation of a new Khronos group, which would work towards a specification.
This presentation will describe the issues that organizations are facing that can be solved through the new standard and provoke discussion on how to develop an API that will meet the needs of the safety-critical community.
After the presentation we invite the audience to join the Exploratory Forum to talk about their own requirements and experiences as well as collaborate to develop a framework for the new standard to be defined.
Speaker: Verena Beckham (Codeplay Software)
Co-Authors: Ken Wenger (CoreAVI)
SYCL | Technical Presentation: 134 | Register to View Presentation Recording | View Slides
Optimize AI Pipelines with SYCL and OpenVINO
pipelines still use feature-engineered approaches that do not rely on deep neural nets.
Embedded compute platforms in robotics systems are inherently heterogeneous, often with a variety of CPUs, (integrated) GPUs, VPUs, and so on. This means there is a growing need to implement “mixed” pipelines on heterogeneous platforms that include a variety of xPUs. We want such pipeline implementations to benefit from the latest advancements in data- and thread-parallel computation, as well as the state of the art in optimized inference of AI DNN models. SYCL and OpenVINO are two open, industry-supported APIs that allow a developer to do so.
It is not only important to optimize the individual components of the processing pipeline – it is at least as important to also optimize the data flow and minimize data copies. This provides a way to benefit from the efficiencies in inference runtime and compute graph optimizations provided by OpenVINO, in combination with the extensibility that SYCL brings in implementing custom or non-DNN components. Similarly, the use of compatible synchronization primitives allows the different runtimes to schedule work more efficiently on the hardware and avoid execution hiccups.
In this talk, we will demonstrate the mechanisms and primitives provided by both SYCL and OpenVINO to optimize the dataflow between, and efficient execution of the workloads implemented in the respective APIs. We will provide an example and show the impact on the overall throughput and latency of the end-to-end processing pipeline. The audience will learn to recognize inefficiencies in their pipelines using profiling tools, and understand how to optimize those inefficiencies using an easy-to-follow optimization recipe. Finally, we will provide guidance to developers of inference engines other than OpenVINO on how to integrate similar interoperability features into their APIs, so that they too can offer optimized SYCL-enabled AI pipelines to their users.
Speaker: Nico Galoppo (Intel)
SYCL | Technical Presentation: 131 | Register to View Presentation Recording | View Slides
Using Interoperability Mode in SYCL 2020
While previous versions of the SYCL standard were layered only on top of the OpenCL standard to control accelerators, starting with SYCL 2020 the standard is independent of OpenCL and can target different APIs through the concept of backends. Some SYCL implementations can thus today target various lower-level APIs, such as OpenCL, CUDA, Level Zero, HIP, XRT and Vulkan, possibly with different backends used at the same time in the same application.
Even though the SYCL standard strives to abstract the generic principles of heterogeneous programming with C++ classes and functions, real applications often need to use specific details of a given architecture to benefit fully from an accelerator, or need to be integrated into a wider framework that includes parts implemented in other languages and other heterogeneous-computing APIs. This is possible in SYCL through the less-known but powerful concept of interoperability, which is introduced at different levels.
On one hand, by accessing native backend objects from SYCL objects, it is possible to use the native API within a SYCL program, for example to call existing optimized libraries (mathematical libraries, machine learning, video codecs, etc.), simplifying application development and reaching maximum performance. In that case it is possible, for example, to obtain from a sycl::queue the backend’s native queue and use it to enqueue a library function.
On the other hand, it is possible to use a part of an application written in SYCL from another part using a different API, by using SYCL interoperability functions to construct SYCL objects like sycl::device or sycl::queue from the equivalent native objects of the lower-level backend API used in the main part of the program.
Another feature of SYCL 2020 interoperability is the ability to schedule backend API operations within the SYCL task DAG using host task interoperability. In SYCL, host tasks allow the user to enqueue an arbitrary C++ function within the SYCL DAG and host tasks have an optional interoperability handle which provides access to the native backend queue, device and memory objects at that point in the DAG. This feature is very powerful as it allows a SYCL application to interoperate with backend-specific libraries such as BLAS or DNN libraries.
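A minimal sketch of the host-task interoperability described above, assuming an OpenCL backend (the exact native types returned are backend-defined):

```cpp
#include <sycl/sycl.hpp>

void call_native_library(sycl::queue& q, sycl::buffer<float, 1>& buf) {
  q.submit([&](sycl::handler& h) {
    sycl::accessor acc{buf, h, sycl::read_write};
    // The host task runs on the host at this point in the SYCL DAG, once
    // all dependencies of 'acc' are satisfied.
    h.host_task([=](sycl::interop_handle ih) {
      auto native_queue = ih.get_native_queue<sycl::backend::opencl>();
      auto native_mem = ih.get_native_mem<sycl::backend::opencl>(acc);
      // ... hand native_queue / native_mem to a backend-specific library
      // (e.g. a BLAS or DNN library) here ...
      (void)native_queue;
      (void)native_mem;
    });
  });
}
```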
Finally, SYCL interoperability allows calling backend-specific kernel functions written in the backend kernel language, such as OpenCL or CUDA, via backend-specific functions for generating a kernel_bundle, which can then be invoked via a SYCL queue. Some implementations also go beyond the standard and provide native functions directly callable from a plain SYCL kernel.
SYCL can also be used to simplify the direct use of a lower-level API, like a higher-level C++ wrapper, to remove a lot of the boilerplate code otherwise needed to use the lower-level API. Since it is possible to use the interoperability mode with sycl::buffer and sycl::accessor, some code using the native API can benefit from the implicit data dependency task graph and automatic overlap of computation and implicit communications provided by the SYCL programming model.
Having all these interoperability modes in SYCL allows leveraging other existing interoperability modes and building complex interoperability paths between several frameworks or standards in a single application. For example, in HPC a SYCL application can interoperate with an OpenMP library through a common backend to use parallelism in a cooperative way, or could use the OpenCL backend to reach Vulkan through OpenCL-Vulkan interoperability for high-performance graphics rendering. A multimedia application could use a SYCL-OpenCL-OpenGL-DX12 path to do image processing of native images.
Speaker: Aksel Alpay (Heidelberg University)
Co-Authors: Thomas Applencourt (Argonne National Laboratory), Gordon Brown (Codeplay Software), Ronan Keryell (AMD) and Greg Lueck (Intel)
SYCL | Technical Presentation: 139 | Register to View Presentation Recording | View Slides
Interfacing SYCL and Python for XPU Programming
of Python extension generator tools to build SYCL-based extensions. Details of ongoing work are presented, and we demonstrate the development of a performance-portable Python native extension that relies on the SYCL-based oneMKL specification.
Speaker: Diptorup Deb (Intel)
Co-Authors: Oleksandr Pavlyk (Intel)
SYCL | Paper: 122 | Register to View Presentation Recording | View Slides
SYCLops: A SYCL Specific LLVM to MLIR Converter
Speaker: Alexandre Singer (Huawei Canada Research Centre)
Co-Authors: Frank Gao and Kai-Ting Amy Wang (Huawei Canada Research Centre)
SYCL | Paper: 127 | Register to View Presentation Recording | View Slides
A Source-to-Source Migration Tool: Intel DPC++ Compatibility Tool
The Intel® DPC++ Compatibility Tool is included in the Intel® oneAPI Toolkit. It assists developers with source-to-source migration, e.g., migrating code written in CUDA to SYCL/DPC++ code [2][3], enabling it to run on multiple platforms. The tool generates human-readable and maintainable code whenever possible and provides inline comments to help developers complete their code. On average, about 80-90% of the CUDA code in an application can be migrated by the tool; completion and verification of the final code is expected to be a manual process done by the developers. The goal of the Intel® DPC++ Compatibility Tool is to make it as easy as possible for developers to migrate their existing CUDA codebase to SYCL/DPC++, facilitating more hardware choices and access to the advantages of oneAPI and DPC++.
The Intel® DPC++ Compatibility Tool is based on LLVM/Clang [4] and mainly contains three functional components:
- The intercept-build tool: this collects the compilation options of the input project – build options, macro definitions, include folders and so on – by intercepting its build process. During source-to-source migration, these compilation options are used to identify the active code paths and the header files depended on, and finally to build the right abstract syntax tree for the input project.
- The DPCT binary tool: the main migration tool, which performs source-to-source migration based on compiler front-end technology. It implements a set of migration rules that map source-language elements such as types, APIs and macros to their target-language equivalents. C++ code that is the same in the source and target languages is kept unchanged. The tool also lets users define their own migration rules in a rule-description file to guide customized migration.
- Helper header library: this provides helper functions and macros to assist the migration of the input source code. These header files are intended to become part of the migrated code generated by the Intel® DPC++ Compatibility Tool. Users can copy these header files if needed and include them with the generated code.
The Intel® DPC++ Compatibility Tool can assist developers in migrating code written in CUDA to SYCL/DPC++ with proper performance while minimizing developer effort. The tool can enrich the oneAPI ecosystem by assisting developers in migrating more applications to SYCL/DPC++ running on oneAPI.
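As a hedged illustration of the kind of mapping the tool performs – not actual tool output – a CUDA kernel and its launch typically become a SYCL parallel_for over an nd_range:

```cpp
#include <sycl/sycl.hpp>

// CUDA original (as a comment):
//   __global__ void scale(float* d, float f) {
//     int i = blockIdx.x * blockDim.x + threadIdx.x;
//     d[i] *= f;
//   }
//   scale<<<blocks, threads>>>(d, f);

// SYCL/DPC++ code of roughly the shape such a migration produces
// ('d' is assumed to be a USM device pointer):
void scale(sycl::queue& q, float* d, float f, size_t blocks, size_t threads) {
  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{blocks * threads},
                                   sycl::range<1>{threads}},
                 [=](sycl::nd_item<1> it) {
                   const size_t i = it.get_global_id(0);  // block*dim+thread
                   d[i] *= f;
                 });
}
```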
References
[1] https://software.intel.com/content/www/us/en/develop/tools/oneapi.html
[2] https://software.intel.com/content/www/us/en/develop/tools/oneapi/data-parallel-c-plus-plus.html
[3] https://www.khronos.org/sycl/
[4] https://llvm.org/
[5] https://software.intel.com/content/www/us/en/develop/documentation/intel-dpcpp-compatibility-tool-user-guide/top.html
Speaker: Zhiming Wang (Intel)
SYCL | Technical Presentation: 132 | Register to View Presentation Recording | View Slides
FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL
Speaker: Kamalavasan Kamalakkannan (University of Warwick)
Co-Authors: Gihan Mudalige (University of Warwick), István Reguly (Pázmány Péter Catholic University) and Suhaib Fahmy (King Abdullah University of Science and Technology)
SYCL | Paper: 135 | Register to View Presentation Recording | View Slides
How to Optimize Compute Drivers? Let’s Start with Writing Good Benchmarks!
This created a need for the driver team to write a tool that would make performance work on the driver easier, so we created Compute Benchmarks. In this suite we test all aspects of the driver stack to see whether they have any bottlenecks. Each test checks only one thing and does so in isolation, so it is very easy to work on optimizing it, and it doesn’t require any extensive setup.
The benchmarks focus on subtle aspects of every driver, such as: API overhead of every call, submission latencies, resource creation costs, transfer bandwidths, multi-threaded contention, multi-process execution and many others.
The framework supports multiple backends – currently OpenCL and Level Zero implementations are in place – so it is very easy to compare how the same scenario is serviced by different drivers. It is also very easy to compare driver implementations between vendors, as tests written in OpenCL simply work across different GPU implementations.
We also use this code to present good and bad coding practices; this is very useful for showcasing how simple changes can drastically improve performance, and users can run those scenarios to see how performance changes on their own setups. It is also a great tool for prototyping new extensions and proposing them as part of the OpenCL standard.
We plan to open-source this project in Q2 2022; it is expected to be available by IWOCL.
Speaker: Michal Mrozek (Intel)
OPENCL & SYCL | Paper: 142 | Register to View Presentation Recording | View Slides
TAU Performance System
Speaker: Sameer Shende (University of Oregon)
OPENCL & SYCL | Technical Presentation: 125 | Register to View Presentation Recording | View Slides
Towards Performance Portability of AI Models Using SYCL-DNN
SYCL provides a C++-based portable parallel programming model to target various devices such as CPUs, GPUs, DSPs and FPGAs. Enabling a SYCL backend for an AI framework can thus lead to a hardware-agnostic model for heterogeneous systems, and also allows reuse of existing optimized library implementations.
SYCL-DNN already supports an OpenCL backend, and in this paper we extend SYCL-DNN to support NVIDIA and RISC-V architectures. The results provide a detailed analysis of the performance portability of SYCL-based AI frameworks on various architectures with respect to state-of-the-art optimized vendor-specific libraries.
Speaker: Muhammad Tanvir (Codeplay Software)
Co-Authors: Kumudha Narasimhan, Mehdi Goli, Ouadie El Farouki, Svetlozar Georgiev and Isaac Ault (Codeplay Software)
OPENCL & SYCL | Technical Presentation: 141 | Register to View Presentation Recording | View Slides
Towards a Portable Drug Discovery Pipeline with SYCL 2020
In this talk, we discuss the challenges of making the LiGen docking pipeline portable among different accelerators and GPUs by porting the original codebase from CUDA to SYCL. The code has been refactored by replacing critical CUDA semantics with portable ones, and by exploiting several features from the SYCL 2020 standard [5], including sub-groups, group algorithms, and Unified Shared Memory. For comparison, we have developed two versions based on accessor-based and USM-based memory accesses, respectively. Particular effort has been spent on kernel tuning, in particular to optimize those kernels with high register pressure. The final SYCL code base, comprising more than 20 SYCL kernels, has been evaluated on several architectures including NVIDIA V100, NVIDIA A100, AMD MI100 and Intel Xeon, using both hipSYCL and the Intel DPC++ compiler. In terms of performance portability, the SYCL implementation achieves performance similar to the native CUDA version on NVIDIA V100 and AMD MI100, with minimal modification needed.
Speaker: Luigi Crisci (University of Salerno)
Co-Authors: Majid Salimi Beni and Biagio Cosenza (University of Salerno), Nicolò Scipione, Davide Gadioli, Emanuele Vitali and Gianluca Palermo (Politecnico di Milano), and Andrea Beccari (Dompé Farmaceutici)
SYCL | Technical Presentation: 111 | Register to View Presentation Recording | View Slides
Posters
Reaching Even Richer C++ in OpenCL Kernels with use of libclcxx
We would like to present a brief overview of this project and the new type traits, highlighting how previous work enabling type traits from standard C++ was reused and successfully merged into the new project, and how it facilitated easy development of new features. We would also like to demonstrate how developers can take advantage of the new libraries, along with the clang compiler, to use more C++ language features when developing OpenCL kernels.
At the end, we would like to invite the OpenCL community to evaluate and give feedback through the libclcxx GitHub repo channels, especially regarding the features they would like to see added next, or even the features they would be interested in contributing to this project.
Speaker: Anastasia Stulova (Arm)
Co-Authors: Ishfaq Wardag (Arm)
OPENCL | Poster: 204 | Register to View Presentation Recording | View Slides
OpenCLML integration with TVM
After enabling mainline TVM for Adreno GPUs, which we presented at IWOCL 2020 and 2021, we made further progress by adding an OpenCLML SDK backend to TVM for accelerating machine-learning operations on Adreno GPUs. This integration uses TVM’s high-level frontends and graph-level optimizations, followed by the Adreno GPU’s accelerated proprietary machine-learning operators. OpenCLML integration into TVM is achieved through TVM’s BYOC (Bring-Your-Own-Codegen), a framework that allows integration of an accelerator-specific kernel library into TVM’s compiler flow. Please refer to TVM’s documentation https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm for more details about BYOC. BYOC offers a hybrid mode of kernel execution, with some subgraphs executing on the vendor-specific kernel library and the remaining (or unsupported) operators using any of TVM’s backends as a fallback option. This makes the platform capable of running all models without worrying about new operators.
To enable OpenCLML over TVM, we implemented an OpenCLML frontend, which helps split the given DNN workload based on the OpenCLML operator inventory, and a new codegen that extends TVM’s JSON codegen. The compiler output here is a platform-independent JSON representation. For execution, we implemented an OpenCLML runtime on top of TVM’s JSONRuntime that initializes and invokes the OpenCLML machine-learning CL API. We are in the process of upstreaming these efforts; you may refer to the RFC at https://github.com/apache/tvm-rfcs/blob/main/rfcs/0052-OpenCLML-integratio-as-BYOC.md.
By integrating OpenCLML into the TVM compiler framework, we have seen significant improvements for most well-known models on the Adreno platform in comparison to TVM’s default OpenCL backend.
In the poster we will discuss our approach and the various challenges we faced during these efforts, along with performance numbers and our thoughts on the future of OpenCLML with TVM.
Speaker: Siva Rama Krishna Reddy B (Qualcomm)
Co-Authors: Hongqiang Wang, Alex Bourd, Adarsh Golikeri and Balaji Calidas (Qualcomm)
OPENCL | Poster: 216 | Register to View Presentation Recording | View Slides
SYCL Concurrency on GPU Platforms: Empirical Measurement
By executing commands concurrently, developers hope that their code will run faster. This poster uses that expectation as an empirical metric to assess whether a computing environment lives up to it: we run each command serially to generate a baseline, and then check whether the same commands run faster when scheduled in a way that allows concurrency.
The SYCL specification allows concurrent execution of independent commands when they are scheduled in an out-of-order queue, or when they are scheduled to multiple, possibly in-order, queues.
We tested four different kinds of independent commands, in both the “multiple in-order queues” and “single out-of-order queue” modes:
- Two compute kernels, each with low occupancy
- One compute kernel and one memory copy from system-allocated memory to a device buffer (M2D)
- One compute kernel and one memory copy from a device buffer to system-allocated memory (D2M)
- One M2D copy and one D2M copy
The poster’s contribution is twofold:
Firstly, the source code used for these experiments has been made open source (https://github.com/argonne-lcf/HPC-Patterns/tree/main/concurency) so that others can evaluate these different approaches to concurrency. Our code uses USM for the memory transfers and relies on a clpeak-like kernel for the compute part (https://github.com/krrishnarraj/clpeak/blob/master/src/kernels/compute_dp_kernels.cl). The memory buffers used are as large as the USM allocation limit allows (`sycl::info::device::max_mem_alloc_size`) to minimize runtime overhead relative to execution time. The number of FMAs in the compute kernel is chosen so that the execution times of the compute kernel and the data transfers are similar.
Secondly, we tested multiple SYCL compilers, targeting multiple backends, on multiple hardware platforms (at the time this abstract was written: DPC++/OpenCL/Gen, DPC++/L0/Gen9, DPC++/CUDA/A100, hipSYCL/HIP/MI100; we plan to measure more). Results are mixed, with some environments achieving concurrency in most tests and others in none. It is also interesting to note that enabling profiling in queues serializes commands in some environments.
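For reference, a minimal SYCL 2020 sketch (our own, simplified from the experiment described above) of the two scheduling modes with one compute kernel and one M2D copy as the independent commands:

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::device dev{sycl::default_selector_v};
    sycl::context ctx{dev};

    // "Single out-of-order queue" mode: one queue without the in_order property.
    sycl::queue ooo{ctx, dev};

    // "Multiple in-order queues" mode: two in-order queues on the same device.
    sycl::queue q1{ctx, dev, sycl::property::queue::in_order{}};
    sycl::queue q2{ctx, dev, sycl::property::queue::in_order{}};

    constexpr size_t n = 1 << 24;
    float* host_src = sycl::malloc_host<float>(n, ctx); // stand-in for system memory
    float* dev_dst  = sycl::malloc_device<float>(n, q1);
    float* dev_out  = sycl::malloc_device<float>(n, q1);

    // Two independent commands, eligible to overlap: a low-occupancy,
    // clpeak-like FMA kernel on q1 and an M2D copy on q2.
    q1.parallel_for(sycl::range<1>{64}, [=](sycl::id<1> i) {
        float x = static_cast<float>(i[0]);
        for (int k = 0; k < (1 << 20); ++k)
            x = sycl::fma(x, 1.000001f, 0.5f);
        dev_out[i[0]] = x;
    });
    q2.memcpy(dev_dst, host_src, n * sizeof(float));
    q1.wait();
    q2.wait();

    sycl::free(host_src, ctx);
    sycl::free(dev_dst, ctx);
    sycl::free(dev_out, ctx);
}
```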
Speaker: Thomas Applencourt (Argonne National Laboratory)
Co-Authors: Abhishek Bagusetty (Argonne National Laboratory) and Aksel Alpay (Heidelberg University)
SYCL | Poster: 221 | Register to View Presentation Recording | View Slides
C++OpenCL4TVM: Support C++OpenCL Kernel for TVM NN Operators
In this paper, we propose a solution in which TVM generates design patterns and abstractions of OpenCL C++ with the flavor of Kokkos layouts and views. This solution provides two C++-specific abstraction features, a sparse layout and an unseq execution policy, for TVM’s OpenCL code generator. The sparse layout can be described abstractly, Kokkos-style, effectively reorganizing the matrix structure of convolution to achieve code reuse and memory optimization. The unseq execution policy serves as a vectorization request to, for example, for_each in the C++ for OpenCL kernel (a sketch follows the two steps below). There are two steps in the design process:
(1) Analyze the sparse structure of the convolution operator in TVM’s OpenCL kernel and design the sparse layout for TVM’s OpenCL kernel code generator accordingly.
(2) Put compiler directives in the function overloads that take the vectorization request.
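As a rough illustration of step (2), the following hedged sketch shows what such an overload could look like in a C++ for OpenCL kernel; the policy type, pragma choice, and kernel are our assumptions, not the authors’ actual implementation:

```cpp
// Hypothetical sketch: an unseq-style overload carrying a vectorization request.
struct unseq_policy {};

template <typename It, typename F>
void for_each(unseq_policy, It first, It last, F f) {
    // The compiler directive lives in the overload that takes the request.
    #pragma clang loop vectorize(enable)
    for (; first != last; ++first)
        f(*first);
}

kernel void scale(global float* x, int n) {
    // The unseq policy asks the compiler to vectorize the traversal.
    for_each(unseq_policy{}, x, x + n, [](global float& v) { v *= 2.0f; });
}
```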
Currently, this is ongoing work. We will report further experimental results on TVM-generated OpenCL C++ design patterns at the workshop.
Speaker: Po-Yao Chang (National Tsing Hua University)
Co-Authors: Tai-Liang Chen, Yu-Tse Huang, Meng-Shiun and Jenq-Kuen Lee (National Tsing Hua University)
OPENCL | Poster: 209 | Register to View Presentation Recording | View Slides
Embedding a DSL in SYCL for Productive and Performant Tensor Computing on Heterogeneous Devices
Although these domains all utilize tensor computing, each individual application varies in its overall implementation and requirements, creating a diverse and unique range of applications within tensor computing.
Along with the popularity of tensor computing arose the demand for more optimized tensor applications. In response to this growing demand came a proliferation of new hardware platforms for more efficient tensor execution, from Google Brain’s Tensor Processing Units (TPUs) and NVIDIA’s general-purpose Graphics Processing Units (GPUs) to Microsoft’s Neural Processing Units (NPUs) and Field-Programmable Gate Arrays (FPGAs), for both local and cloud platforms. Many of these architectures utilize matrix-vector units, systolic arrays, or a novel hardware architecture to exploit the parallelism found within tensor computations.
To support many of these new and developing hardware architectures, industry and academia alike have produced Domain-Specific Languages (DSLs) and programming models to help realize each unique hardware architecture’s potential. This requires a very close understanding of individual hardware characteristics and their pitfalls, and well-engineered software to navigate them effectively. Accompanying these unique hardware architectures and their supporting software comes the caveat of inflexibility. As software improves, or as experimental implementations of a tensor application are created for one unique hardware architecture, they may not translate easily to another architecture. Applications must be re-implemented for any novel architecture by researchers and developers alike, forcing difficult trade-offs between hardware, software, and engineering effort. Such an inflexible software ecosystem becomes a hurdle for the development of new algorithms that utilize tensor computation.
This project focuses on combining the data parallelism provided by Data Parallel C++ (DPC++) with Temporal To Spatial Programming (T2SP) to provide a hardware-agnostic programming model for constructing new tensor computations.
DPC++ is an open-source compiler project based on SYCL, an industry-driven Khronos standard adding data parallelism to C++ for heterogeneous systems.
- T2SP is both a novel programming framework and compiler that enables tensor computation for both vector and spatial architectures, such as CPUs/GPUs and FPGAs respectively.
- T2SP does this by decoupling the functional specification from the spatial mapping.
- T2SP is based on several observations, notably that spatial architectures favor optimized dataflow and partitioning the computation into many sub-computations distributed over the spatial architecture.
- T2SP allows programmers to describe the computation separately from the spatial mapping, partitioning, and dataflow of a spatial architecture.
- This allows programmers to quickly develop various spatial optimizations without having to reconstruct an application’s core functional implementation between architectures such as CPUs, GPUs, and FPGAs.
- Through the employment of both DPC++ and T2SP, users are able to create platform-agnostic implementations of novel algorithms, without the restriction of architecture-dependent software.
- This provides portability, efficient utilization of hardware resources, and ease of development for tensor applications.
Initial evaluations were performed using General Matrix Multiply (GEMM), 2-Dimensional Convolution (CONV), and Capsule Convolution (CAPSULE) for an Arria 10 FPGA on Intel’s FPGA DevCloud platform.
Results show that this project has been able to achieve, on average, over 60% of the original T2SP’s performance. With minor adjustments, we are confident that this combination of DPC++ and T2SP can provide competitive performance for tensor applications across spatial and vector architectures without extra effort from the end user.
Speaker: Abenezer Wudenhe (University of California Riverside)
Co-Authors: Hongbo Rong (Intel)
SYCL | Poster: 219 | Register to View Presentation Recording | View Slides
C++ for OpenCL 2021
To continue evolving alongside the OpenCL standard, a new language version, C++ for OpenCL 2021, has been announced, providing compatibility with OpenCL 3.0. The differences between the previous version, C++ for OpenCL 1.0, and the new version 2021 arise from the differences between OpenCL 2.0 and OpenCL 3.0, with which they are respectively compatible.
To support more devices, OpenCL 3.0 makes features of OpenCL 2.0 optional [3]. We would like to present an overview of the new language version along with its ongoing support in Clang. C++ for OpenCL 2021 is being actively developed side by side with OpenCL 3.0, and experimental support for it is going to appear in Clang 14. Most of the work conducted on C++ for OpenCL 2021 has been related to the optionality of features; in its current state, C++ for OpenCL 2021 supports all optional features of OpenCL 3.0.
By implementing the new language version inside the Clang compiler, as part of the LLVM project, we could reuse substantial portions of the code already written during the development of OpenCL C 3.0. This way we avoided unnecessary duplication of effort in places where C++ for OpenCL behaves similarly to OpenCL C. Additionally, the Clang compiler and its community helped carry out the proof-of-concept stage for experimental features. More specifically, the address-space removal utility __remove_address_space [4] was included in the official language documentation after being implemented in Clang and gathering feedback from the Clang community. The largest novelty of this new language version and its implementation lies in the optionality of the generic address space, on which C++ for OpenCL 1.0 relied heavily in multiple C++-specific constructs, e.g., implicit object parameters or special member functions.
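As a hedged illustration of this utility (our example, modeled on the language documentation [4], compiled as C++ for OpenCL 2021 with, e.g., clang -cl-std=clc++2021):

```cpp
// Illustrative sketch; see [4] for the documented behavior.
template <typename T>
void bump(T* ptr) {
    // T may be deduced together with an address space (e.g. __global int);
    // declaring a plain local variable of that type would be ill-formed,
    // so the address space is stripped first:
    __private typename __remove_address_space<T>::type tmp = *ptr;
    *ptr = tmp + 1;
}

kernel void increment(global int* g) {
    bump(g); // T is deduced as __global int; tmp is a __private int
}
```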
In this submission, we plan to provide an overview of the key language changes in C++ for OpenCL 2021 and present the latest status of its development in Clang. Although not yet in its final state, this new language version already has most of the intended functionality implemented inside the Clang compiler. The provisional language documentation [5] should be consulted for detailed information on all new language features. We thus encourage everyone to start experimenting and to give us feedback on how to improve C++ for OpenCL 2021 even further. You can check the latest support for C++ for OpenCL 2021 directly from the browser using Compiler Explorer [6].
[1] https://www.iwocl.org/wp-content/uploads/06-iwocl-syclcon-2021-strohm-slides.pdf
[2] https://www.youtube.com/watch?v=DBknc1pRB9E
[3] https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#features
[4] https://www.khronos.org/opencl/assets/CXX_for_OpenCL.html#remove-addrspace
[5] https://www.khronos.org/opencl/assets/CXX_for_OpenCL.html
[6] https://godbolt.org/z/q8PeWY3hn
Speaker: Justas Janickas (Arm)
Co-Authors: Anastasia Stulova (Arm)
OPENCL | Poster: 207 | Register to View Presentation Recording | View Slides
An Overview of the OpenCL vendor extensions supported in Qualcomm Adreno GPUs
Speaker: Hongqiang Wang (Qualcomm)
Co-Authors: Balaji Calidas (Qualcomm)
OPENCL | Poster: 213 | Register to View Presentation Recording | View Slides
Exploring Compiler-aided nd-range Parallel-for Implementations on CPU in hipSYCL
The hipSYCL implementation of the SYCL standard, presented in [Alpay and Heuveline 2020], currently provides unified access to CPUs and GPUs. However, typical GPU kernels using the nd-range parallel-for paradigm are far from performance-portable, as mentioned in [Alpay and Heuveline 2020, slide 20]. This is due to the forward-progress guarantees that have to be made so that barriers can be implemented correctly, making typical GPU kernels with a low compute-to-barrier ratio prohibitively slow. One major component is the overhead induced by using either dedicated threads or fibers for every work-item. This problem is common to most heterogeneous programming models that also target CPUs. A more favorable execution model for CPUs is the hierarchical parallel-for in SYCL, which can be implemented by parallelizing over the work-groups and iterating sequentially over the work-items. [Deakin et al. 2021] discuss the issues of SYCL’s current kernel submission mechanisms in depth.
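For context, a generic nd-range kernel of the kind in question (our own example, not taken from the cited benchmarks), where a work-group barrier synchronizes work-items sharing local memory:

```cpp
#include <sycl/sycl.hpp>

// Every work-item of a group must reach the group barrier; this is the
// forward-progress guarantee that forces per-work-item threads or fibers
// in a library-only CPU implementation.
void reverse_tiles(sycl::queue& q, sycl::buffer<float, 1>& buf, size_t wg) {
    q.submit([&](sycl::handler& h) {
        sycl::accessor data{buf, h, sycl::read_write};
        sycl::local_accessor<float, 1> tile{sycl::range<1>{wg}, h};
        h.parallel_for(sycl::nd_range<1>{buf.size(), wg}, [=](sycl::nd_item<1> it) {
            size_t l = it.get_local_id(0);
            size_t g = it.get_global_id(0);
            tile[l] = data[g];
            sycl::group_barrier(it.get_group()); // work-group barrier
            data[g] = tile[wg - 1 - l];
        });
    });
}
```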
Earlier work in the context of OpenCL CPU implementations had to solve the same barrier issue.
In [Kaeli et al. 2015], the authors describe the implementation of the AMD APP SDK’s OpenCL CPU runtime. It uses custom lightweight threads that allow optimizing stack location and alignment. Vectorization is only supported when using OpenCL’s explicit floatN vector data types, whose overloaded mathematical operations map directly to vector instructions. This is similar to the Boost.Fiber-based, library-only nd-range parallel-for implementation in hipSYCL 0.9.0 and later, and thus leaves performance to be gained.
In [Jääskeläinen et al. 2010], an OpenCL implementation is presented that aims to enable the use of OpenCL on application-specific processors. [Jääskeläinen et al. 2015] continues this project under the name Portable OpenCL (POCL), with a focus on high-performance execution on CPUs. The work-group barrier issue is solved by a custom set of passes on LLVM IR inside the kernel compiler: after inlining all functions in the kernel, it performs barrier-tail replication, loop-barrier insertion, and further transformations to form barrier-free regions, which are replicated, selected at runtime by the first work-item, and then iterated over by a work-item loop. With a few technical adaptations due to compilation-flow limitations in SYCL, this kernel compiler was adopted into the hipSYCL Clang plugin for evaluation.
A different approach to the work-group barrier problem was presented as continuation-based synchronization (CBS) in [Karrenberg and Hack 2012]. The kernel is divided into barrier-free sub-CFGs that are identified by their entry barrier and surrounded by a while loop containing a switch statement, which selects the next sub-CFG to execute based on the last encountered barrier. An adaptation of CBS into hipSYCL is compared to the POCL approach in this work.
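The following hedged C++ sketch (names are ours; the actual transformation operates on LLVM IR) illustrates the loop structure CBS produces: each barrier-free sub-CFG runs to completion for all work-items of a group before the next one is selected, so the barrier needs no per-work-item thread or fiber:

```cpp
enum SubCFG { kEntry, kAfterBarrier, kExit };

// One work-group: execute each barrier-free sub-CFG for every work-item
// before moving on; the switch picks the next region based on the last
// barrier encountered.
void run_work_group(int wg_size) {
    SubCFG next = kEntry;
    while (next != kExit) {
        const SubCFG cur = next;
        for (int wi = 0; wi < wg_size; ++wi) { // work-item loop
            switch (cur) {
            case kEntry:        /* kernel code up to the barrier */ next = kAfterBarrier; break;
            case kAfterBarrier: /* kernel code after the barrier */ next = kExit;         break;
            default:            break;
            }
        }
    }
}
```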
In hipSYCL, an improved variant of the uniformity analysis presented in [Rosemann et al. 2021] was used.
Most SYCL implementations rely on OpenCL drivers to provide access to CPUs; performance and semantics are then largely defined by the underlying runtime, which might impact portability. hipSYCL has no such requirement for an OpenCL driver: it supports, in a library-only fashion, all CPUs for which an OpenMP compiler exists. To provide good CPU performance for the nd-range parallel-for paradigm as well, without adding a dependency on an OpenCL runtime, we implement the compiler-based POCL and CBS approaches to the work-group barrier problem in hipSYCL’s Clang plugin and compare their suitability for the SYCL programming model.
The approach POCL has chosen allows for more hand-crafted compiler-based optimizations, whereas CBS is more general. One benefit of CBS is that it only requires that if a barrier is reached by any work-item, all work-items must reach it; this is how barrier semantics are defined by the SYCL and the newer OpenCL C standards. POCL additionally assumes that if a barrier is nested inside a loop, the loop executes exactly the same number of iterations for all work-items, even if the barrier would not be reached due to protecting conditionals. This can impact functional correctness.
Both approaches outperform hipSYCL’s current fiber implementation in relevant benchmarks [Deakin et al. 2021, 2016; Jin 2021; Lal et al. 2020], in most cases by one or two orders of magnitude. See https://gitlab.com/fodinabor/ma-hipsycl-bench/-/tags/iwocl22 for the benchmark versions used, the reproduction steps, and the raw data. While the CBS pipeline currently results in faster kernels overall, the POCL pipeline could likely close the gap by porting over improvements made to the CBS one. Using the DGEMM kernel as an example, we show that on the tested x86 systems the SYCL implementation is able to outperform the POCL OpenCL implementation. By upstreaming the CBS variant, hipSYCL becomes the first SYCL implementation that, without relying on an OpenCL runtime, achieves competitive performance and consistent semantics for the nd-range parallel-for paradigm on any CPU that has an LLVM backend.