The 2022 conference will be a blend of live tutorials and panel sessions, alongside on-demand presentations of the papers, technical talks and posters, all supported by a dedicated Slack workspace for asking questions and networking.

  • Instructions for joining the live sessions and Slack channels, and for viewing the presentations, will be included in the registration confirmation.
  • The on-demand presentations will be available to registered delegates from Wednesday May 11 at 9:00 CET.
  • If you are missing a registration email, please see: What to check when you don’t receive Eventbrite emails

Tutorials

Application Development with SYCL

Parallel programming with heterogeneous architectures has gained a reputation for being difficult, but is it really? Modern C++ has come a long way toward making parallel programming easier, and the SYCL programming model means heterogeneous programming using C++ is now more accessible than ever. SYCL uses modern standard C++, and it’s a programming model that lets developers support a wide variety of devices (CPUs, GPUs, FPGAs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to an open standard, platform-independent model (without vendor lock-in) is essential for modern software developers. There are multiple implementations of SYCL available, including open source projects, and in this tutorial you will join instructors who are developing some of these implementations, alongside experienced developers from academic institutions who work on complex SYCL code bases.

This tutorial will provide a way for developers to gain expertise with SYCL in a practical environment focused more on writing code than on PowerPoint slides.
Attendees will gain an understanding of how the designers of the SYCL standard have addressed heterogeneous programming in C++ through industry collaboration. SYCL has gained widespread support in recent years and is available on exascale systems, desktops, embedded systems, FPGAs, and automotive platforms. Whatever particular constructs the future brings, the material in this course will prove timeless.

This course will start by teaching the fundamentals of heterogeneous parallelism using SYCL. It will also teach you how to make use of modern C++ and the SYCL programming model to build parallel algorithms for heterogeneous devices. Most of the programming focus will be on GPUs, but some time will be spent applying the techniques to simple FPGA examples. The course will teach you how to apply some common GPU optimizations.
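
As a flavour of the style of code attendees will write, a minimal SYCL 2020 vector addition might look like the following sketch (illustrative only, not actual course material):

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        sycl::queue q;  // selects a default device (a GPU if one is available)
        std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024);
        {
            sycl::buffer bufA{a}, bufB{b}, bufC{c};
            q.submit([&](sycl::handler& h) {
                sycl::accessor A{bufA, h, sycl::read_only};
                sycl::accessor B{bufB, h, sycl::read_only};
                sycl::accessor C{bufC, h, sycl::write_only, sycl::no_init};
                h.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) {
                    C[i] = A[i] + B[i];  // one work-item per element
                });
            });
        }  // buffer destructors wait for the kernel and copy results back into c
    }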

Agenda

Times are approximate and are shown in Pacific (PDT), UK and Central European (CET) time.

  • Introduction – Aksel Alpay, Heidelberg University
    • PDT 9am – 9:30am, UK 5pm-5:30pm, CET 6pm – 6:30pm
  • Enqueuing a SYCL Kernel – Igor Vorobstov, Intel
    • PDT 9:30am – 10:15am, UK 5:30pm – 6:15pm, CET 6:30pm – 7:15pm
  • SYCL Data Management – Peter Zuzek, Codeplay
    • PDT 10:15am – 11:00am, UK 6:15pm – 7:00pm, CET 7:15pm – 8:00pm
  • ND Range Kernels – Ronan Keryell, AMD
    • PDT 11:00am – 12:00pm, UK 7:00pm – 8:00pm, CET 8:00pm – 9:00pm
  • Image Convolution Case Study – Peter Zuzek, Codeplay
    • PDT 12:00pm – 1:00pm, UK 8:00pm – 9:00pm, CET 9:00pm – 10:00pm

Tutorial Lead: Rod Burns

oneAPI Developer Summit

On day one of this Intel-hosted virtual summit you will hear from a range of speakers from industry and academia working on innovative xPU solutions developed with oneAPI. On day two, Ben Odom from the Intel Developer Evangelist team will present a ‘oneAPI SYCL Essentials’ tutorial.

  • Developer Summit: Monday May 9, 2022 – 9:00 CST. Duration: 3 hours 45 minutes
  • oneAPI SYCL Essentials: Tuesday May 10, 2022 – 9:00 CST. Duration: 2 hours
  • This sponsored tutorial is run by Intel in association with SYCLcon.
  • Registration and additional information: Intel’s oneAPI Developer Summit

Live Panel Discussions

OpenCL Panel Discussion

Panel Chair: Simon McIntosh-Smith, University of Bristol

  • Date: Wednesday May 11, 2022
  • Start time: 16:00 BST, 17:00 CEST, 11:00 EDT, 08:00 PDT
  • Duration: Approx. 60 mins
  • LIVE Zoom Webinar
  • Panel Members
    • Neil Trevett, Khronos and NVIDIA
    • Balaji Calidas, Qualcomm
    • Moritz Lehmann, University of Bayreuth
    • Nagy-Egri Máté Ferenc, Stream HPC
    • Ben Ashbaugh, Intel
    • Alastair Murray, Codeplay
    • Kevin Petit, Arm
  • View Presentation Recording
    • No registration required
OpenCL  |  Paper ID: P1

SYCL Panel Discussion

Panel Chair: Tom Deakin, University of Bristol

  • Date: Thursday May 12, 2022
  • Start time: 16:00 BST, 17:00 CEST, 11:00 EDT, 08:00 PDT
  • Duration: Approx. 60 mins
  • LIVE Zoom Webinar
  • Panel Members
    • Michael Wong, Codeplay
    • Aksel Alpay, Heidelberg University
    • Ronan Keryell, AMD
    • Roland Schulz, Intel
    • Verena Beckham, Codeplay
    • Nevin Liber, Argonne National Laboratory
  • View Presentation Recording
    • No registration required
SYCL |  Paper ID: P2

NEW Accelerating Machine Learning with OpenCL

This special Khronos Group live webinar is co-located with IWOCL as part of their new Machine Learning Forum. During the webinar, members of the OpenCL working group at Khronos will share the latest updates to the OpenCL language and ecosystem that can directly benefit machine learning workflow performance.

  • Date: Wednesday May 11, 2022
  • Start time: 10:00 PDT, 13:00 EDT, 19:00 CEST
  • Duration: Approx. 60 mins
  • LIVE Zoom Webinar
  • Talks / Presenters
    • Introduction and Invitation to Join the Machine Learning Forum
      – Neil Trevett, Khronos and NVIDIA
    • Qualcomm Extensions for Advancing Machine Learning Acceleration
      – Balaji Calidas, Director of Engineering, Qualcomm
    • A Case Study on OpenCL vs GPU Assembly for Machine Learning Performance
      – Roy Oursler, Intel
    • Ask the Experts Q&A and Panel Discussion
  • View presentation slides and video on the Khronos Events page.
OpenCL  |  Paper ID: P3

Keynote and Invited Talks

Keynote
10 Years of IWOCL and SYCLcon: A Decade of Breakthroughs, But is the Best Yet to Come?

Speaker: Simon McIntosh-Smith | University of Bristol

The idea to organise a workshop focused on OpenCL was conceived on May 8th 2012, almost exactly ten years ago. Later extended to include SYCL, the emerging C++ parallel programming model, IWOCL and SYCLcon have witnessed some incredible breakthroughs in performance and programmability, with heterogeneous systems of CPUs, GPUs and even FPGAs now capable of delivering hundreds of times more performance than was possible when the workshop was founded. In this keynote, we’ll reflect on some of the biggest, most exciting advances showcased at IWOCL and SYCLcon over the years, and hypothesise about what might happen over the coming decade.

Speaker Biography – Professor Simon McIntosh-Smith is a computer scientist focused on high performance computing (HPC), parallel programming languages and fault-tolerant computing. He is Head of the HPC research group at the University of Bristol and lead researcher at the Isambard GW4 Tier 2 HPC Centre.

SYCL & OpenCL |  Paper ID: K2  | View YouTube Playlist  | View Slides

Keynote
Pushing the boundaries of SYCL with hipSYCL

Speaker: Aksel Alpay | Heidelberg University

From its inception, a key goal of the hipSYCL project has always been to explore other interpretations of SYCL, or to leverage options provided by the specification that so far have not been widely employed by other implementations, in order to explore their impact. This talk will discuss how hipSYCL has repeatedly questioned established implementation approaches and experimented with alternatives – some of which are now well established in the SYCL world, such as non-OpenCL backends, while others are still hipSYCL-specific, such as the scoped parallelism programming model, buffer-USM interoperability, automatic memory management below buffer granularity, multi-device queues, asynchronous buffers, library-only device backends and compiler-accelerated CPU support without OpenCL.

Throughout the talk, we will highlight how these approaches can impact the ecosystem, performance or usability of SYCL code, and raise important questions: What is SYCL? What is a queue? What is really needed to implement SYCL?

Speaker Biography – In 2015 Aksel obtained a bachelor’s degree in physics, followed in 2018 by a master’s degree in physics with a specialization in computational astrophysics (both from Heidelberg University). During his studies, he designed and developed teralens, an OpenCL-based tree code for gravitational quasar microlensing, which, to this day, is the fastest known code in its field. Afterwards, he focused on developing tools for HPC, and most notably created hipSYCL. hipSYCL is one of the four major SYCL implementations in use today, and the very first SYCL implementation to pioneer an interpretation of SYCL without OpenCL. In late 2018, he joined Heidelberg University Computing Centre’s HPC team and EMCL.

Since 2019, he has been a member of the Khronos SYCL working group.

SYCL & OpenCL |  Paper ID: K1  | View YouTube Playlist  | View Slides

Invited Talk
OpenCL Working Group –  A State of the Union

Speaker: Neil Trevett, Khronos Group President and OpenCL Working Group Chair | NVIDIA

Neil will update the community on the OpenCL working group’s latest developments, including all the latest news and updates on extensions, the ecosystem, implementations and more. Always a not-to-be-missed presentation.

Speaker Biography – Neil Trevett is the President of The Khronos Group (and OpenCL working group chair) where he has helped initiate and evolve royalty-free open standards such as OpenCL, OpenVX, NNEF, OpenXR, Vulkan, OpenGL ES, WebGL and glTF, many of which are widely deployed in the graphics and embedded vision industries. At NVIDIA, Neil works to drive and develop the developer ecosystem that enables applications to take advantage of advanced GPU and silicon acceleration.

OpenCL |  Paper ID: K4  | View YouTube Playlist  | View Slides

Invited Talk
SYCL Working Group –  A State of the Union

Speaker: Michael Wong, Khronos Group, SYCL Working Group Chair | Codeplay Software

Michael will update the community on the SYCL working group’s latest developments, including SYCL 2020. Always a not-to-be-missed presentation.

Speaker Biography – Michael is a Distinguished Engineer at Codeplay, Chairman of the SG19 Standards Committee group for C++ Machine Learning, and Head of the Canadian Delegation to the C++ Standards Committee. Michael has rich experience in C++ parallel computing, high performance computing and machine learning. He has led the development of the C++ heterogeneous programming language standard SYCL for GPGPU application development, and has deep research insights into the underlying performance optimization of TensorFlow. His work covers parallel programming, neural networks, computer vision, autonomous driving and other fields. Michael was previously a senior IBM technical expert who led the development of the IBM XL C++ and XL C compilers.

SYCL |  Paper ID: K3  | View YouTube Playlist  | View Slides

Research Papers and Technical Talks

OpenCL Command-buffer Extension: Design & Implementation

OpenCL allows a programmer to offload a sequence of commands to a heterogeneous accelerator, such as a GPU. For embedded devices the overhead of building a command sequence can be expensive, and many applications require the same pipeline of commands to be repeatedly enqueued in a loop; for example, in computer vision, where the same command sequence is used to process different image inputs. In OpenCL, command recording is tied to submission: a clEnqueue API invocation will both create a command and schedule it for execution, meaning that for groups of commands enqueued in a loop the cost of building the command sequence is unnecessarily incurred over and over again. An alternative OpenCL API mechanism for defining the command list would remove this overhead from repeated command sequences, regardless of the target OpenCL device.

The cl_khr_command_buffer[1] extension, provisionally released in November 2021 as part of OpenCL 3.0.10, provides such a solution. This extension introduces the concept of a command-buffer which is recorded once with a graph of commands, finalized for submission, and then dispatched for execution many times. Separating command setup from dispatch means that for repetitive workloads the command recording overheads are only incurred once. Additionally, optimization opportunities are introduced at the point of finalization, after which no more commands can be recorded and the command-buffer is made ready for execution. After finalization, the command-buffer can be asynchronously dispatched with minimal runtime overhead. This separation of concerns enables pipelined workflows common in machine learning applications by eliminating the latency of having to wait on the host to construct commands again for a similar workload.
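
To make the record-once, dispatch-many pattern concrete, here is a minimal sketch using the extension’s entry points (error handling omitted; queue, kernel, global_size and num_frames are assumed to already exist, and in a real application the extension functions must be queried via clGetExtensionFunctionAddressForPlatform):

    #include <CL/cl_ext.h>

    // Record once: create a command-buffer on an existing queue and add a kernel.
    cl_int err;
    cl_command_buffer_khr cmdbuf = clCreateCommandBufferKHR(1, &queue, NULL, &err);
    clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, kernel, 1,
                              NULL, &global_size, NULL, 0, NULL, NULL, NULL);
    clFinalizeCommandBufferKHR(cmdbuf);  // recording ends; optimization may happen here

    // Dispatch many times: the command construction cost is not paid again.
    for (int i = 0; i < num_frames; ++i)
        clEnqueueCommandBufferKHR(0, NULL, cmdbuf, 0, NULL, NULL);
    clFinish(queue);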

In the first half of this technical presentation, we give an overview of the provisionally ratified command-buffer extension and dive into key points of its design. This includes a comparison with the Vulkan command-buffer abstraction[2], which shows that this approach has been successful in the real world. The design decision to introduce new entry points, rather than reuse existing command-queue entry points with begin/end markers, is also covered, as is the reason mechanisms for host-side synchronization were omitted from the new entry points. The intended layering of future extensions on top of cl_khr_command_buffer is another topic raised, along with why it was decided to split the functionality this way: cl_khr_command_buffer is designed as the base layer applicable to a wide range of vendors. Plans for the upcoming extensions layered on top will also be outlined in broad terms; these remove the restriction tying a command-buffer to a single command-queue and provide mutability of the command-buffer between submissions.

The second half of the presentation relays our experience implementing the command-buffer extension in ComputeAorta[3], Codeplay’s OpenCL implementation, and how this fed back into the extension specification – for example, implementing the simultaneous use capability that allows more than one submission of a command-buffer instance to be in flight at once. We provide a high-level overview of how command-buffers in ComputeAorta are implemented using the same machinery as regular command enqueues via Codeplay’s proprietary ComputeMux API, and provide details of some of the common pitfalls and gotchas a vendor may face when implementing command-buffers vs. regular OpenCL commands.

[1] https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer
[2] https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkCommandBuffer.html
[3] Alastair Murray and Ewan Crawford. 2020. Compute Aorta: A toolkit for implementing heterogeneous programming models. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 6, 1–2. DOI: https://doi.org/10.1145/3388333.3388652

Speaker: Ewan Crawford, Codeplay
Co-Authors: Jack Frankland, Codeplay

OpenCL |  Paper ID: 102  | View YouTube Playlist  | View Slides

Rapid Prototyping With Combined Scientific CFD Simulation and Real Time Raytracing Implementation in OpenCL

One of the main uses for OpenCL is (scientific) compute applications where graphical rendering is done externally, after the simulation has finished. However, separating simulation and rendering has many disadvantages, especially the extreme slowdown caused by copying simulation data from device to host, and the need to store raw data on the hard drive, taking up hundreds of gigabytes, just to visualize preliminary results.

A much faster approach is to implement both simulation and rendering in OpenCL. The rendering kernels have direct read-only access to the raw simulation data that resides in ultra-fast GPU memory. This eliminates all PCIe data transfer except camera parameters and finished frames, allowing for interactive visualization of simulation results in real time while the simulation is running. This is an invaluable tool for rapid prototyping.

Although OpenCL, being a general compute language, has no built-in functionality for graphical rendering, it allows for implementing an entire graphics engine, such that no data has to be moved to the CPU during rendering. On top of that, specific low-level optimizations make this OpenCL graphics engine outperform any existing rendering solution for this scenario, enabling drawing billions of lines per second and fluid raytracing in real time on even non-RTX GPUs.
This combination of simulation and rendering in OpenCL is demonstrated with the software FluidX3D [1] – a lattice Boltzmann method (LBM) fluid dynamics solver.

The first part will briefly introduce the numerical method for simulating fluid flow in a physically accurate manner. After introducing the LBM, the optimizations to make it run at peak efficiency are discussed: Being a memory-bound algorithm, coalesced memory access is key. This is achieved through a structure-of-arrays data layout as well as the one-step-pull scheme, a particular variant of the LBM streaming step. One-step-pull leverages the fact that the misaligned read penalty is much smaller than the misaligned write penalty on almost all GPUs. Roofline analysis shows that with these optimizations, the LBM runs at 100% efficiency on the fastest data-center and gaming GPUs [2].

To simulate free surface flows, the LBM is extended with the Volume-of-Fluid (VoF) model. An efficient algorithm has been designed to vastly accelerate the challenging surface tension computation [3]. This extremely efficient VoF-LBM GPU implementation allows covering new ground in science: FluidX3D has been used to simulate more than 1600 raindrop impacts to statistically evaluate how microplastics transition from the ocean surface into the atmosphere when spray droplets are generated during drop impact [4]. At the same power consumption, compute time with existing CPU-parallelized codes would have been several years, whilst with FluidX3D it was about a week.

The second part will focus on real time rendering with OpenCL, especially raytracing. Rasterization on the GPU is parallelized not over pixels but lines/triangles instead, making runtime mostly independent of screen resolution and lightning fast. Each line/triangle is transformed with the camera parameters from 3D to 2D screen coordinates and then rasterized onto the frame (integer array) with the Bresenham algorithm [5] and a z-buffer. The raytracing graphics are based on a combination of fast ray-grid traversal and marching-cubes, leveraging that the computational grid from the LBM already is an ideal acceleration structure for raytracing. The idea of raytracing is simple: Through each pixel on the screen, shoot a reverse light ray out of the camera and see where it intersects with a surface in the scene. Then (recursively) calculate reflected/refracted rays and mix the colors. If a ray doesn’t intersect with anything, its color is determined by the skybox image via UV mapping and bilinear pixel interpolation. With mesh surfaces consisting of many triangles, computation time quickly becomes a problem, as for each ray all triangles have to be tested for intersection. To overcome this, an acceleration structure is required. While computer games often use a bounding volume hierarchy, the LBM already provides an ideal alternative acceleration structure: the simulation grid. The corresponding algorithm is called ray-grid traversal: When a ray shoots through the 3D grid, intersections with the surface only have to be checked for at each traversed grid cell rather than the entire grid. In each traversed grid cell, the 0-5 surface triangles are generated on-the-fly with the marching-cubes algorithm and ray-triangle intersections are checked with the Möller-Trumbore algorithm. Only once an intersection has been found are the normals calculated on the 8 grid points spanning the cell and trilinearly interpolated to the intersection coordinates. The interpolated surface normal makes the raytraced surface appear perfectly smooth.
On the GPU, the ray(s) for each pixel on screen are computed in parallel, vastly speeding up rendering. It is of key importance how to align the OpenCL workgroups on the 2D array of screen pixels: best performance is achieved for 8×4 pixel tiles; this is about 50% faster than 32×1 tiles, because with small, square-ish tiles, all rays of the workgroup are more likely to traverse the same grid cells, greatly improving memory broadcasting. In ray-grid traversal, 8 isovalues spanning a cell have to be loaded from GPU memory for each traversed cell. Once the triangle intersection has been found, the gradient on each of the 8 cell isovalues is calculated with central differences. Instead of loading an additional 6 isovalues for each of the 8 grid points, their isovalues are reused such that only 24 additional isovalues are loaded. For marching-cubes, the algorithm by Paul Bourke [6] is implemented in OpenCL. With 16-/8-bit integers and bit-packing, the tables are reduced to 14% of their original size and stored in constant memory space. For computing the cube index, branching is eliminated by bit operations. The Möller-Trumbore algorithm [7] is implemented in an entirely branchless manner.
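
As an illustration of this branchless style, a Möller-Trumbore intersection test can defer every validity check into one final select, which GPU compilers lower to predication rather than divergent branches (a hand-written sketch with a stand-in float3 type, not the FluidX3D source):

    struct float3 { float x, y, z; };
    static float3 sub(float3 a, float3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
    static float3 cross(float3 a, float3 b) {
        return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
    }
    static float dot(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

    // Returns the ray parameter t of the hit, or -1.0f for a miss. Degenerate
    // cases (e.g. a ray parallel to the triangle) fall through the arithmetic
    // and are rejected by the same final select.
    float ray_triangle(float3 o, float3 d, float3 v0, float3 v1, float3 v2) {
        const float3 e1 = sub(v1, v0), e2 = sub(v2, v0);
        const float3 p = cross(d, e2);
        const float inv = 1.0f / dot(e1, p);
        const float3 s = sub(o, v0);
        const float3 q = cross(s, e1);
        const float u = dot(s, p) * inv;
        const float v = dot(d, q) * inv;
        const float t = dot(e2, q) * inv;
        const bool hit = (u >= 0.0f) && (v >= 0.0f) && (u + v <= 1.0f) && (t > 0.0f);
        return hit ? t : -1.0f;
    }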

This raytracing implementation is fast enough to run in real time for even the largest lattice dimensions that fit into the memory of a GPU. Finally, the combined VoF-LBM simulation and raytracing implementation is demonstrated on the most realistic simulation of an impacting raindrop ever done [8].

[1] https://doi.org/10.15495/EPub_UBT_00005400
[2] https://arxiv.org/abs/2112.08926
[3] https://doi.org/10.3390/computation10020021
[4] https://doi.org/10.1186/s43591-021-00018-8
[5] https://doi.org/10.1147/sj.41.0025
[6] http://paulbourke.net/geometry/polygonise/
[7] https://doi.org/10.1080/10867651.1997.10487468
[8] https://youtu.be/HrwWoIsZG1c

Speaker: Moritz Lehmann, University of Bayreuth

OpenCL  |  Paper ID: 106  | View YouTube Playlist  | View Slides

On the Compilation Performance of Current SYCL Implementations

The Khronos SYCL abstraction layer is designed to enable programming heterogeneous platforms, consisting of host and accelerator devices, with a single-source code base. In order to allow for a high level of abstraction while still providing competitive runtime performance, both SYCL implementations and the software ecosystems built around SYCL applications frequently make heavy use of C++ templates. A potential consequence of this design choice, as well as the need to generate code for both a host and at least one device architecture, are significant compilation times.

In this work we set out to study the relative compile-time performance and the impact of various SYCL features on compilation times across a selection of the most widely-used SYCL implementations. To this end, we introduce a code generator which creates SYCL kernels stressing various API features and instruction types, either in isolation or in combination, as well as an infrastructure to largely automate related experiments. We apply this infrastructure in a large-scale synthetic evaluation totaling 96000 compiler runs, which also includes a study of the compilation performance over time of the most widespread implementations. In addition to these synthetic experiments, we validate the applicability of our findings by measuring the compile times of two real-world industrial SYCL applications.

On the basis of these experiments, we point out particularly impactful – in terms of compile-time performance – changes during the development of some SYCL implementations, and formulate suggestions for SYCL implementation developers as well as users. We have made both the code generator and all the tools we developed to carry out the experiments in this paper available as open source.

Speaker: Peter Thoman (University of Innsbruck)
Co-Authors: Facundo Molina Heredia and Thomas Fahringer (University of Innsbruck)

SYCL |  Paper ID: 115  | View YouTube Playlist  | View Slides

Celerity: How (Well) Does the SYCL API Translate to Distributed Clusters?

As the SYCL ecosystem matures, adoption across both research and industry projects steadily increases. By offering a modern, vendor-agnostic way of programming a wide array of accelerator hardware, SYCL has the potential to become an important player in high-performance computing (HPC) as well. In fact, existing pre-exascale and upcoming exascale machines already officially support SYCL or even recommend it as one of their preferred programming models. As such, the question of how well the SYCL programming model translates to distributed computing becomes more prevalent. While traditional approaches such as combining SYCL with the Message Passing Interface (MPI) will undoubtedly remain relevant for years to come, a more forward-thinking approach may be to try and extend SYCL’s ease of use for single node systems to a distributed cluster. The first project to explore this in greater detail is Celerity, a distributed runtime system and API that heavily leans on SYCL in both its API design as well as its underlying execution engine. The validity of its design is currently being evaluated through the porting of two industry use cases for large scale distributed execution as part of the LIGATE project. While Celerity is neither a true subset nor superset of the SYCL API, experienced SYCL users will immediately recognize the familiar structure of its API.

In this talk, we will review the SYCL API from the perspective of Celerity and distributed memory programming in general. We will highlight challenges encountered and opportunities for future improvement of the SYCL API.

We will begin our presentation by giving an overview of the Celerity programming model, highlighting its similarities to SYCL and introducing core additions to the API. We will showcase how a typical Celerity program is structured, and how an existing SYCL application can be converted to Celerity. Additionally, we will give a brief overview of how Celerity itself uses SYCL internally to power its distributed execution semantics.
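
For readers unfamiliar with Celerity, the following hedged sketch (modeled on Celerity’s public examples; details may differ between versions) shows the familiar SYCL structure plus the one key addition, the range mapper, which tells the runtime which buffer region each node will access:

    #include <celerity.h>

    int main() {
        celerity::distr_queue q;  // one logical queue spanning the whole cluster
        celerity::buffer<float, 1> data{celerity::range<1>{1024}};
        q.submit([=](celerity::handler& cgh) {
            // one_to_one maps each work-item to the matching buffer element, so
            // the runtime can split work and schedule transfers automatically.
            celerity::accessor acc{data, cgh, celerity::access::one_to_one{},
                                   celerity::write_only, celerity::no_init};
            cgh.parallel_for<class fill>(celerity::range<1>{1024},
                [=](celerity::item<1> it) { acc[it] = it.get_linear_id(); });
        });
    }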

The main portion of this presentation will concern itself with investigating important features of SYCL and how well they translate to distributed clusters. We will begin by examining core features such as the high-level data-driven APIs of queues, buffers, command groups and accessors in a distributed context. Next, we will highlight newer additions to SYCL such as host tasks and reductions. Finally, we will take a look at APIs that may be considered problematic from a distributed memory perspective, such as unified shared memory (USM).

We will conclude our presentation with an outlook on what future versions of SYCL could bring to the table to further improve compatibility with distributed memory clusters. We will review HPC use cases that may not yet be fully covered by SYCL and present several potential improvements that would enhance the experience both for us as library developers and for users of the traditional MPI + SYCL approach.

Speaker: Philip Salzmann (University of Innsbruck)
Co-Authors: Fabian Knorr and Peter Thoman (University of Innsbruck), and Biagio Cosenza (University of Salerno)

SYCL |  Paper ID: 117  | View YouTube Playlist  | View Slides

Untangling Modern Parallel Programming Models

Modern hardware is increasingly rich in diversity, including CPUs, GPUs, FPGAs and more, with new and novel architectures constantly emerging. To provide differentiation between these devices, each is typically built around architectures optimized for some classes of application or some patterns of parallelism. Numerous computational cores, varying levels of hardware vectorization, and other degrees of architectural freedom exist across the many hardware options. The need to efficiently utilize diverse hardware has led to the emergence of a wide variety of programming models, execution models, and languages, and has simultaneously led to a complex landscape of confused and often conflicting terminology and abstractions. This reality makes it challenging for developers to comprehend and then choose a programming model that fits with their applications and mental model, particularly when more than one target architecture or vendor is of interest.

This talk strives to untangle the landscape of modern parallel programming models, to help developers understand how the models and options relate to each other, and to frame how to think about their specific algorithms when expressing them in code. Although experienced developers typically understand much of the terminology and the relationships between models, a holistic presentation of the material is of strong value, as evidenced by feedback from parallel programming experts that have seen previews of this presentation.

To begin, a brief overview will be presented to frame parallel programming and offload compute programming models, followed by characterization of the Single Program Multiple Data (SPMD) abstract model and the power it exhibits when mapping to multiple classes of architecture. We will discuss how fundamental design decisions within a compiler impact the mapping from source code to an underlying programming model, highlighting that the same code can be lowered to multiple models. This is particularly relevant in the presence of vector data types, which permit multiple interpretations and are a common cause of confusion. A core element of the presentation is a breakdown of how the programming model and a compiler’s design assumptions are ideally understood together by developers to streamline the creation and tuning of performant code.

SPMD and explicit Single Instruction Multiple Data (SIMD) programming models will be discussed relative to the Khronos OpenCL and SYCL standards, as well as to OpenMP and CUDA, with the aim of clarifying the concepts and models for developers working in specific languages.
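
As a hedged illustration of the contrast (hypothetical helper functions, assuming USM device pointers), the same AXPY operation can be written in SPMD form, where each work-item is one scalar program instance, or with explicit vector types, whose lowering is compiler-dependent:

    #include <sycl/sycl.hpp>

    // SPMD: one scalar program instance per element; the compiler decides how
    // instances map onto the SIMD lanes of the target.
    void axpy_spmd(sycl::queue& q, float a, const float* x, float* y, size_t n) {
        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            y[i] = a * x[i] + y[i];
        });
    }

    // Explicit vector types: each work-item owns a float4. Whether this becomes
    // one lane holding a vector or is scalarized across lanes depends on the
    // compiler's lowering, precisely the ambiguity the talk highlights.
    void axpy_vec(sycl::queue& q, float a,
                  const sycl::float4* x, sycl::float4* y, size_t n4) {
        q.parallel_for(sycl::range<1>{n4}, [=](sycl::id<1> i) {
            y[i] = a * x[i] + y[i];
        });
    }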

The talk will conclude with an overview of an experimental extension to SYCL that proposes a mechanism for mixing SPMD and explicit SIMD programming styles with clear semantics and boundaries in code. The talk will show that providing clear points of transition with clear semantics can enable expert tuning at the granularity of a single line of code, without breaking the SPMD programming abstraction used by the rest of a kernel.

Parallel programming models such as SPMD and SIMD are critical in the modern landscape of heterogeneous compute architectures. When coupled with decisions made during the implementation of specific compilers, developers are left with a complex task when working to understand how concepts and hardware mappings interact. This talk describes the most common programming models exposed through SYCL, OpenCL, OpenMP, and CUDA, with the intent of clarifying misconceptions and confusion about the mapping of software to hardware. Attendees will leave the presentation with a holistic understanding of how SPMD and SIMD-like programming models fit together, and how they relate to the code that many of us write from day to day.

Speaker: Michael Kinsner (Intel)
Co-Authors: Ben Ashbaugh, James Brodman, Greg Lueck, John Pennycook and Roland Schulz (Intel)

OpenCL & SYCL |  Technical Presentation ID: 118  | Register to View Presentation Recording  | View Slides

Exploring the Possibility of a hipSYCL-based Implementation of oneAPI

oneAPI is a software platform built around SYCL 2020 and accelerated libraries such as oneMKL, as well as low-level building blocks such as oneAPI Level Zero. All oneAPI implementations are currently based on the DPC++ SYCL implementation. However, being able to utilize multiple independent SYCL implementations with oneAPI code can be beneficial to both users and implementors when it comes to testing code, or e.g. noticing ambiguities in the specification. In this work, we explore the possibility of implementing oneAPI using hipSYCL as an independent SYCL implementation instead. We review hipSYCL’s design and demonstrate it running on oneAPI Level Zero with competitive performance. We also discuss hipSYCL’s support for SYCL 2020 using the examples of unified shared memory (USM), group algorithms and optional kernel lambda naming. To this end, we also contribute microbenchmarks for the SYCL 2020 group algorithms and demonstrate their performance. When testing hipSYCL with HeCBench, a large benchmark suite containing SYCL benchmarks initially developed for DPC++, we point out specification ambiguities and practices that negatively impact code portability when transitioning from DPC++ to hipSYCL. We find that we can compile 122 benchmarks with little effort with hipSYCL, and demonstrate performance for a selection of benchmarks within 20% of native models on NVIDIA and AMD GPUs. Lastly, we demonstrate oneMKL’s BLAS domain running with hipSYCL on AMD and NVIDIA GPUs, and find that it can match native cuBLAS and rocBLAS performance for BLAS level 1, level 2 and level 3 operations, while significantly outperforming oneMKL with DPC++ on NVIDIA GPUs for all but the largest problem sizes. Overall, we find that hipSYCL can support low-level building blocks like Level Zero, oneAPI libraries like oneMKL, and the SYCL 2020 programming model efficiently, and hence conclude that it is indeed possible to implement oneAPI independently of DPC++.

Speaker: Aksel Alpay (Heidelberg University)
Co-Authors: Bálint Soproni, Holger Wünsche and Vincent Heuveline (Heidelberg University)

SYCL |  Paper ID: 123  | Register to View Presentation Recording  | View Slides

Performance Analysis of Matrix-free Conjugate Gradient Kernels Using SYCL

We examine the performance of matrix-free SYCL implementations of the conjugate gradient method for solving sparse linear systems of equations. Performance is tested on an NVIDIA A100-80GB device and a dual-socket Intel Ice Lake CPU node using different SYCL implementations, and compared to CUDA BLAS (cuBLAS) implementations on the A100 GPU and MKL implementations on the CPU node. All considered kernels in the matrix-free implementation are memory bandwidth limited, and a simple performance model is applied to estimate the asymptotic memory bandwidth and the latency. Our experiments show that in most cases the considered SYCL implementations match the asymptotic performance of the reference implementations. However, for smaller but practically relevant problem sizes, latency is observed to have a significant impact on performance. For some cases the SYCL latency is reasonably close to the reference (cuBLAS/MKL) implementation latency, but in other cases it is more than one order of magnitude greater. In particular, SYCL reductions on the GPU and all operations for one of the SYCL implementations on the CPU exhibit high latency, and this latency limits performance at problem sizes that can in some cases be representative of full application simulations.
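
For reference, a common form of such a latency-bandwidth model (stated generically here; the paper’s exact formulation may differ) estimates the time for a kernel that moves $b(n)$ bytes as $t(n) \approx t_{\mathrm{lat}} + b(n)/B_\infty$, where $t_{\mathrm{lat}}$ is the fixed launch latency and $B_\infty$ is the asymptotic memory bandwidth; latency dominates for small problem sizes and bandwidth for large ones.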

Speaker: Igor Baratta (University of Cambridge)
Co-Authors: Chris Richardson and Garth Wells (University of Cambridge)

SYCL |  Paper ID: 128  | Register to View Presentation Recording  | View Slides

A Proof-of-Concept SYCL FFT – Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library

In this paper, we present an early version of a SYCL-based FFT library, capable of running on all major vendor hardware, including CPUs and GPUs from AMD, ARM, Intel and NVIDIA. Although preliminary, the aim of this work is to seed further development of a diverse and rich set of features for calculating FFTs. It has the advantage over existing portable FFT libraries in that it is single-source, and therefore removes the complexities that arise from the abundant use of preprocessor macros and auto-generated kernels to target different architectures. We exercise two SYCL-enabled compilers, Codeplay ComputeCpp and Intel’s open-source LLVM project, to evaluate the performance portability of our SYCL-based FFT on various heterogeneous architectures. The current limitation of our library is that it supports single-dimension FFTs up to $2^{11}$ in length and base-2 input sequences. We compare our results with highly optimized vendor-specific FFT libraries and provide a detailed analysis to demonstrate a fair level of performance, and highlight sources of performance bottlenecks in SYCL runtimes.

Speaker: Vincent R. Pascuzzi (Brookhaven National Laboratory)
Co-Authors: Mehdi Goli (Codeplay Software Ltd.)

SYCL |  Paper ID: 136  | Register to View Presentation Recording  | View Slides

Improved Address Space Inference for SYCL Programs

SYCL is a single-source C++-based programming model for heterogeneous programming. It enables the programmer to write or port code targeting heterogeneous accelerators using what appears to the programmer as standard C++. To achieve peak performance, however, it can be necessary to write the code in a form which allows the compiler to target specific hardware features. If the compiler can target these hardware features without requiring the programmer to consider them, then productivity and application performance can both be improved. One such example is accelerators with multiple address spaces. This technical talk will describe how a SYCL compiler can infer these address spaces without requiring the programmer to specify them in their application, and will describe some required specification evolution in order to better cope with the new SYCL 2020 features.

Hardware devices can have multiple memory regions with different levels of visibility and performance. Similar to OpenCL C, SYCL abstracts them into a global memory visible to all work-items, a local memory visible to a single work-group, and a private memory only visible to a single work-item. In OpenCL C, the programmer expresses address spaces using type qualifiers in order to statically encode the memory region addressed by pointers, thus ensuring that when a programmer does specify an address space the compiler can check whether the program is well-formed. But requiring programs to be written with explicit address spaces comes at the expense of usability, as these need to be integrated into the program design and are a barrier to integrating code not written with this in mind. Thus in OpenCL C 2.x/3.0 programmers can make use of the unnamed generic address space instead. On the other hand, SYCL does not extend the C++ language, so programmers cannot express address spaces using a type qualifier (as the C++ standard does not define them). Thus in SYCL, pointers and references can be lowered to this unnamed generic address space by the device compiler. This generic address space is a virtual address space that can represent several overlapping address spaces at the same time.

The memory being addressed is no longer statically known by the compiler frontend and the SYCL implementation relies on the hardware, or software emulation, to correctly dispatch the loads and stores to the correct memory. On some hardware targets this flexibility comes with a performance cost, but this can be avoided when the compiler can infer a single address space for a given memory access. Additionally, the low-level compute APIs that are often used as backends to a SYCL 2020 implementation do not guarantee support for a generic address space, e.g. they are an optional feature in OpenCL 3.0 and non-existent in Vulkan. This means that a SYCL compiler that can infer all address spaces for a large set of programs can achieve better performance and target a wider range of backend compute APIs. Moreover, recent efforts to bring safety critical development to SYCL means it will also need to run on top of Vulkan SC. This makes the ability to have a well-defined specification for inferring address spaces still relevant for SYCL.

The rules introduced by SYCL 1.2.1 impose significant restrictions on user code. One striking example is the “defaulting rule”: when a pointer declaration has no initializer, the pointer is assumed to address the private memory, even if it is initialized in the very next statement. As a consequence, you cannot declare a pointer in a structure without it defaulting to the private address space. In practice, however, these restrictions were not a significant barrier in the context of 1.2.1: large applications such as Eigen were ported to run with SYCL, and new ones such as SYCL-BLAS and SYCL-DNN were built.
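
A hypothetical device-code fragment illustrating the defaulting rule described above (not taken from the talk; g is assumed to point to global memory, and the 1.2.1 header and namespace spellings are used):

    #include <CL/sycl.hpp>

    void device_fn(cl::sycl::global_ptr<int> g) {
        int* p;             // no initializer: p defaults to the private address space
        p = g.get();        // ill-formed under the inference rules: global != private
        int* ok = g.get();  // initializer present: ok is correctly inferred as global
    }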

SYCL 2020 brought significant changes and added flexibility for users. Among them are the unnamed generic address space and unified shared memory (USM) pointers. The generic address space lifts the restrictions imposed by 1.2.1, making programs written against SYCL 2020 and the generic address space unlikely to compile under the old inference rules. USM encourages the use of raw pointers instead of the accessor containers, which quickly leads to passing these pointers via structures. As a USM pointer in fact addresses the global memory region, this creates a conflict with the inference rules.

This talk will describe an experimental compiler for ComputeCpp, Codeplay’s SYCL implementation. This compiler employs an improved address space inference method that can efficiently cope with SYCL 2020 features such as the generic address space and unified shared memory (USM) pointers. The talk will also cover the limitations of this approach.

Speaker: Victor Lomuller (Codeplay Software)
Co-Authors: Ross Brunton (Codeplay Software)

SYCL |  Technical Presentation ID: 140  | Register to View Presentation Recording  | View Slides

A Comparison of SYCL, OpenCL, CUDA, and OpenMP for Massively Parallel Support Vector Machine Classification on Multi-Vendor Hardware

In scientific computing and Artificial Intelligence (AI), which both rely on massively parallel tasks, frameworks like the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL) are widely used to harness the computational power of accelerator cards, in particular of Graphics Processing Units (GPUs). A few years ago, GPUs from NVIDIA were used almost exclusively for these tasks, but AMD and Intel are meanwhile increasing their shares of the GPU market. This introduces many new challenges for code development, as the prevailing CUDA code can only run on NVIDIA hardware and must be adapted or even completely rewritten to run on GPUs from AMD or Intel.

In this paper, we compare the different competing programming frameworks OpenMP, CUDA, OpenCL, and SYCL, paying special attention to the two SYCL implementations hipSYCL and DPC++. Thereby, we investigate the different frameworks with respect to their usability, performance, and performance portability on a variety of hardware platforms from different vendors, i.e., GPUs from NVIDIA, AMD, and Intel and Central Processing Units (CPUs) from AMD and Intel. Besides discussing the runtimes of these frameworks on the different hardware platforms, we also focus our comparison on the differences between the nd_range kernel formulation and the SYCL-specific hierarchical kernels, as sketched below.
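
To make the two formulations concrete, here is a hedged sketch (a hypothetical scaling kernel with a USM pointer; n is assumed to be a multiple of the work-group size of 128):

    #include <sycl/sycl.hpp>

    // nd_range formulation: global and work-group sizes are given explicitly.
    void scale_nd_range(sycl::queue& q, float* data, size_t n) {
        q.parallel_for(sycl::nd_range<1>{{n}, {128}}, [=](sycl::nd_item<1> it) {
            data[it.get_global_id(0)] *= 2.0f;
        });
    }

    // Hierarchical formulation: code at work-group scope with a nested
    // parallel_for_work_item over the work-items of each group.
    void scale_hierarchical(sycl::queue& q, float* data, size_t n) {
        q.submit([&](sycl::handler& h) {
            h.parallel_for_work_group(sycl::range<1>{n / 128}, sycl::range<1>{128},
                [=](sycl::group<1> g) {
                    g.parallel_for_work_item([&](sycl::h_item<1> it) {
                        data[it.get_global_id(0)] *= 2.0f;
                    });
                });
        });
    }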

Our Parallel Least Squares Support Vector Machine (PLSSVM) library implements backends for the four previously mentioned programming frameworks for a Least Squares Support Vector Machine (LS-SVM). Using it as an example, we show which of the frameworks is best suited for a standard workload that is frequently employed in scientific computing and AI, depending on the target hardware: the most computationally intensive part of our PLSSVM library is solving a system of linear equations using the Conjugate Gradient (CG) method. Specifically, we parallelize the implicit matrix-vector multiplication inside the CG method, a workload common in many scientific codes.

The PLSSVM code, utility scripts, and documentation are all available on GitHub: https://github.com/SC-SGS/PLSSVM.

Speaker: Marcel Breyer (University of Stuttgart)
Co-Authors: Alexander Van Craen and Dirk Pflüger (University of Stuttgart)

OpenCL & SYCL |  Paper: 105  | Register to View Presentation Recording  | View Slides

How Much SYCL Does a Compiler Need? Experiences from the Implementation of SYCL as a Library for nvc++

The SYCL 2020 specification explicitly allows library-only implementations. Such SYCL implementations are regular C++ libraries for third-party compilers, and are intended by the specification to run on the host. Indeed, multiple SYCL implementations, such as triSYCL and hipSYCL, support library-only host backends.

However, in principle, other devices apart from the host could also be targeted by library-only implementations, provided they operate as libraries written in heterogeneous programming models that can accept regular C++ kernel code, as required by SYCL.

We have implemented the first library-only device backend in a major SYCL implementation by adding support for hipSYCL’s CUDA backend to operate as a library for NVIDIA’s nvc++ compiler, which is part of the NVIDIA HPC SDK.

Being able to run SYCL code with hipSYCL as a library for nvc++, a vendor-supported compiler, guarantees immediate SYCL support for new NVIDIA hardware, compatibility with latest CUDA releases and easy deployment on existing NVIDIA-based HPC systems where the NVIDIA HPC SDK might already be preinstalled.

In this talk, we will describe hipSYCL’s nvc++-based compilation flow, how it is designed and how it operates compared to hipSYCL’s existing clang-based support for CUDA devices. We will also discuss limitations of the current nvc++ support compared to clang.

While it was shown repeatedly that SYCL implementations in general, and hipSYCL in particular, can often deliver competitive performance compared to CUDA, when performance discrepancies with nvcc-compiled code are found it is often difficult to pinpoint whether these are due to the additional C++ layers of the SYCL programming model itself, or due to the different code generation backends of the compilers.
We will show performance results comparing hipSYCL with nvc++ to CUDA compiled with nvc++, thereby eliminating the switch between different compilers as a variable.

Finally, we will discuss library-only implementations from the point of view of the SYCL 2020 specification, and highlight challenges for the implementation of library-only device backends.

Speaker: Aksel Alpay (Heidelberg University)
Co-Authors: Vincent Heuveline (Heidelberg University)

SYCL |  Technical Presentation: 124  | Register to View Presentation Recording  | View Slides

Experiences Porting NAMD to the Data Parallel C++ Programming Model

HPC applications have a growing need to leverage heterogeneous computing resources with a vendor-neutral programming paradigm. Data Parallel C++ is a programming language based on the open SYCL standard, providing a vendor-neutral solution. We describe our experiences porting the NAMD molecular dynamics application with its GPU-offload force kernels to DPC++. Results are shown that demonstrate the correctness of the porting effort.

Speaker: David Hardy (University of Illinois at Urbana-Champaign)
Co-Authors: Jaemin Choi and Emad Tajkhorshid (University of Illinois at Urbana-Champaign), and Wei Jiang (Argonne National Laboratory)

SYCL |  Paper: 129  | Register to View Presentation Recording  | View Slides

Exploring SYCL SC (Safety-Critical)

Codeplay and CoreAVI are leading a new Exploratory Forum within Khronos to evaluate industry interest in a new Khronos API based on SYCL and targeted at the safety-critical industry: SYCL SC (Safety-Critical).

Safety-critical industries like avionics, automotive, nuclear and rail require their software to be compliant with safety standards such as ISO 26262, DIS 21448/SOTIF, DO-178C and UL 4600, as well as adhering to guidelines defined by AUTOSAR and MISRA. Historically, software written for safety-critical applications has been written in C or entirely generated by modeling tools, but this is changing rapidly.

Software and hardware complexity is growing, and in particular the demand for Artificial Intelligence (AI) has skyrocketed, for which higher levels of abstraction are required. Possibly the most high-profile example is the use of AI to enable semi-autonomous and autonomous vehicles, which manufacturers are competing to be the first to develop safely.

Khronos already has two safety-critical variants of APIs: OpenGL SC for graphics and Vulkan SC for graphics and low-level compute. In addition, OpenVX 1.3 has defined a safety-critical feature set for AI/Vision applications. However, in the safety space there is currently no high-level compute API to develop the complex algorithms of tomorrow.

By implementing SYCL SC on top of Vulkan SC and building on safety certified drivers and hardware, the whole stack, all the way from hardware to application, can be safety certified.

SYCL SC will also unlock the ecosystem of libraries and tools that already exist for SYCL for the safety-critical domain. AI/ML frameworks can be built on top of SYCL SC in the future, to provide even higher levels of abstraction.

This presentation will talk about the aims of the new standard, which are aligned with Vulkan SC. It will also touch on some initial design ideas, with a focus on deterministic rather than dynamic behavior. We suggest removing some SYCL features that are unlikely to be used in a safety context, to facilitate safety certification of the runtime itself, and adding extensions that provide functionality useful in safety-critical contexts, such as control over the timing of kernels.

We will discuss the importance of the MISRA C++ guidelines, particularly the upcoming MISRA C++ 202x standard, for applications and hence APIs, whilst acknowledging the need to remain as compatible with standard SYCL as possible.

We set up the Exploratory Forum to collect feedback from potential runtime implementers, library developers, researchers and users on their requirements for a high-level compute API in a safety-critical context. The Exploratory Forum is open to non-Khronos members under the Khronos NDA, and we actively encourage non-members to participate.

Once a wide range of requirements has been collected, the next step is the creation of a new Khronos group, which would work towards a specification.

This presentation will describe the issues that organizations are facing that can be solved through the new standard and provoke discussion on how to develop an API that will meet the needs of the safety-critical community.

After the presentation we invite the audience to join the Exploratory Forum to talk about their own requirements and experiences as well as collaborate to develop a framework for the new standard to be defined.

Speaker: Verena Beckham (Codeplay Software)
Co-Authors: Ken Wenger (CoreAVI)

SYCL |  Technical Presentation: 134  | Register to View Presentation Recording  | View Slides

Optimize AI Pipelines with SYCL and OpenVINO

Sensor data processing pipelines that are a “mix” of feature-engineered and deep learning based processing have become prevalent today. For example, sensor fusion of point cloud data with RGB image streams is common in autonomous mobile robots and self-driving technology. The state of the art in computer vision for extracting semantic information from RGB data today uses deep learning, and great advancements have recently been made in LiDAR odometry based on deep learning [x]. At the same time, other processing components in “mixed” pipelines still use feature-engineered approaches that do not rely on deep neural nets.

Embedded compute platforms in robotics systems are inherently heterogeneous in nature, often with a variety of CPUs, (integrated) GPUs, VPUs, and so on. This means that there is a growing need to implement “mixed” pipelines on heterogeneous platforms that include a variety of xPUs. We want such pipeline implementations to benefit from the latest advancements in data- and thread-parallel computation, as well as the state of the art in optimized inference of AI DNN models. SYCL and OpenVINO are two open, industry-supported APIs that allow a developer to do so.

It is not only important to optimize the individual components of the processing pipeline – it is at least as important to also optimize the data flow and minimize data copies. This provides a way to benefit from the efficiencies in inference runtime and compute graph optimizations provided by OpenVINO, in combination with the extensibility that SYCL brings in implementing custom or non-DNN components. Similarly, the use of compatible synchronization primitives allows the different runtimes to schedule work more efficiently on the hardware and avoid execution hiccups.

In this talk, we will demonstrate the mechanisms and primitives provided by both SYCL and OpenVINO to optimize the dataflow between, and the efficient execution of, the workloads implemented in the respective APIs. We will provide an example and show the impact on the overall throughput and latency of the end-to-end processing pipeline. The audience will learn to recognize inefficiencies in their pipelines using profiling tools, and understand how to optimize those inefficiencies using an easy-to-follow optimization recipe. Finally, we will provide guidance to developers of inference engines other than OpenVINO on how to integrate similar interoperability features into their APIs, so that they too can offer optimized SYCL-enabled AI pipelines to their users.

Speaker: Nico Galoppo (Intel)

SYCL |  Technical Presentation: 131  | Register to View Presentation Recording  | View Slides

Using Interoperability Mode in SYCL 2020

SYCL is a programming standard targeting hardware platforms with a host connected to various heterogeneous accelerators. Both the host and accelerator parts of the computation are expressed in a single-source modern C++ program.

While previous versions of the SYCL standard were built only on top of the OpenCL standard to control the accelerators, starting with SYCL 2020 the standard is independent of OpenCL and can target different APIs, described with the concept of a backend. Some SYCL implementations can thus today target various lower-level APIs, like OpenCL, CUDA, Level Zero, HIP, XRT, Vulkan, etc., possibly with different backends used at the same time in the same application.
Even if the SYCL standard strives to abstract the generic principles used in heterogeneous programming with C++ classes and functions, real applications often require using specific details of a given architecture to benefit fully from an accelerator, or need to be integrated into a wider framework, including parts implemented in other languages and other APIs for heterogeneous computing. This is possible in SYCL with the less-known but powerful concept of interoperability, which is introduced at different levels.

On one hand, by accessing some native backend objects from SYCL objects, it is possible to use the native API in a SYCL program, for example by calling some existing optimized libraries for mathematics, machine learning, video codecs, etc., to simplify application development and reach maximum performance. In that case it is possible, for example, to obtain from a sycl::queue a native backend queue to be used to enqueue a library function.
On the other hand, it is possible to use a part of the application written in SYCL from another part of the application using another API, by using SYCL interoperability functions to construct SYCL objects like sycl::device or sycl::queue from the equivalent native objects of the lower-level backend API used in the main part of the program.

Another feature of SYCL 2020 interoperability is the ability to schedule backend API operations within the SYCL task DAG using host task interoperability. In SYCL, host tasks allow the user to enqueue an arbitrary C++ function within the SYCL DAG and host tasks have an optional interoperability handle which provides access to the native backend queue, device and memory objects at that point in the DAG. This feature is very powerful as it allows a SYCL application to interoperate with backend-specific libraries such as BLAS or DNN libraries.
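
As a brief sketch of this mechanism (assuming an OpenCL backend and following the SYCL 2020 interoperability API; the concrete native handle types depend on the backend):

    #include <sycl/sycl.hpp>

    void call_native_library(sycl::queue& q, sycl::buffer<float, 1>& buf) {
        q.submit([&](sycl::handler& cgh) {
            sycl::accessor acc{buf, cgh, sycl::read_write};
            cgh.host_task([=](sycl::interop_handle ih) {
                // Native handles, valid at exactly this point in the SYCL DAG.
                auto native_queue = ih.get_native_queue<sycl::backend::opencl>();
                auto native_mem = ih.get_native_mem<sycl::backend::opencl>(acc);
                // ... enqueue a backend-specific library call (e.g. a BLAS
                // routine) on native_queue using native_mem ...
            });
        });
    }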

Finally, SYCL interoperability allows calling backend-specific kernel functions written in the backend kernel language, such as OpenCL C or CUDA, via backend-specific functions when creating a kernel_bundle, which can then be invoked via a SYCL queue. Some implementations can also go beyond the standard and provide native functions directly callable from a plain SYCL kernel.
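
A sketch of this last mechanism with the OpenCL backend might look as follows (error handling omitted; the kernel source and names are illustrative):

    // Wrap a kernel written in OpenCL C into a sycl::kernel and launch it
    // from a SYCL queue (SYCL 2020 OpenCL backend; error checks omitted).
    #include <sycl/sycl.hpp>
    #include <CL/cl.h>

    void launch_native_kernel(sycl::queue& q, sycl::buffer<int, 1>& buf) {
      const char* src =
          "kernel void inc(global int* p) { p[get_global_id(0)]++; }";
      cl_context ctx = sycl::get_native<sycl::backend::opencl>(q.get_context());
      cl_device_id dev = sycl::get_native<sycl::backend::opencl>(q.get_device());
      cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
      clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
      cl_kernel k = clCreateKernel(prog, "inc", nullptr);

      sycl::kernel sk =
          sycl::make_kernel<sycl::backend::opencl>(k, q.get_context());
      q.submit([&](sycl::handler& cgh) {
        sycl::accessor acc{buf, cgh, sycl::read_write};
        cgh.set_args(acc);  // bind the OpenCL kernel argument
        cgh.parallel_for(sycl::range<1>{buf.size()}, sk);
      });
    }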

SYCL can also be used to simplify the direct use of a lower-level API, like a higher-level C++ wrapper, removing a lot of the boilerplate code otherwise needed to use the lower-level API. Since it is possible to use the interoperability mode with sycl::buffer and sycl::accessor, code using the native API can benefit from the implicit data-dependency task graph and the automatic overlap of computation and implicit communications provided by the SYCL programming model.
Having all these interoperability modes in SYCL allows leveraging other existing interoperability modes and building complex interoperability paths between several frameworks or standards in a single application. For example, in HPC, a SYCL application can interoperate with an OpenMP library through a common backend to use parallelism cooperatively, or could use the OpenCL backend to reach Vulkan through OpenCL-Vulkan interoperability for high-performance graphics rendering. A multimedia application could use a SYCL-OpenCL-OpenGL-DX12 path to do image processing on native images.

Speaker: Aksel Alpay (Heidelberg University)
Co-Authors: Thomas Applencourt (Argonne National Laboratory), Gordon Brown (Codeplay Software), Ronan Keryell (AMD) and Greg Lueck (Intel)

SYCL |  Technical Presentation: 139  | Register to View Presentation Recording  | View Slides

Interfacing SYCL and Python for XPU Programming

This paper introduces a new framework to help build and use SYCL-based Python native extensions. We present the core design and implementation details of the framework, including an overview of the API, a technique to support asynchronous SYCL kernel execution via Python, and a discussion of the use of Python extension generator tools to build SYCL-based extensions. Details of ongoing work are presented, and we demonstrate the development of a performance-portable Python native extension that relies on the SYCL-based oneMKL specification.

Speaker: Diptorup Deb (Intel)
Co-Authors: Oleksandr Pavlyk (Intel)

SYCL |  Paper: 122  | Register to View Presentation Recording  | View Slides

SYCLops: A SYCL Specific LLVM to MLIR Converter

There is a growing need for higher-level abstractions for device kernels in heterogeneous environments, and the multi-level nature of the MLIR infrastructure perfectly addresses this requirement. As SYCL begins to gain industry adoption for heterogeneous applications and MLIR continues to develop, we present SYCLops: a converter capable of translating SYCL-specific LLVM IR to MLIR. This will allow for both target- and application-specific optimizations within the same framework to exploit opportunities for improvement present at different levels.

Speaker: Alexandre Singer (Huawei Canada Research Centre)
Co-Authors: Frank Gao and Kai-Ting Amy Wang (Huawei Canada Research Centre)

SYCL |  Paper: 127  | Register to View Presentation Recording  | View Slides

A Source-to-Source Migration Tool: Intel DPC++ Compatibility Tool

oneAPI [1] is an industry initiative creating an open, standards-based, cross-architecture programming model to simplify development for a wide range of data-centric workloads across a variety of architectures including CPU, GPU, FPGA, and other accelerators. It includes the cross-architecture language Data Parallel C++ [2], based on ISO C++ and Khronos Group’s SYCL [3], advanced libraries, and community extensions. Intel has created a product implementation of oneAPI with the Intel oneAPI Toolkits, which help developers efficiently build, analyze, and optimize high-performance, cross-architecture applications for Intel CPUs, GPUs and FPGAs. DPC++ [2] is an open specification for a portable, architecture-neutral language for expressing parallelism, based on industry standards; it can be implemented by anybody for any platform. To take advantage of oneAPI and DPC++, applications written in another language, e.g., CUDA, need to be manually ported or rewritten in SYCL/DPC++. Once customers migrate their code to SYCL/DPC++, they are no longer tied to a single platform and can run the code on all platforms that support oneAPI.

The Intel® DPC++ Compatibility Tool, included in the Intel® oneAPI Toolkits, assists developers with source-to-source migration, e.g., migrating code written in CUDA to SYCL/DPC++ code [2][3], enabling it to run on multiple platforms. The tool generates human-readable and maintainable code whenever possible and provides inline comments to help developers complete their code. On average, about 80-90% of the CUDA code in an application can be migrated by the tool; completing and verifying the final code is expected to be a manual process done by the developers. The goal of the Intel® DPC++ Compatibility Tool is to make it as easy as possible for developers to migrate their existing CUDA codebase to SYCL/DPC++, facilitating more hardware choices and access to the advantages of oneAPI and DPC++.
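
To give a flavor of this kind of migration, the sketch below (an illustrative example, not actual tool output) shows a simple CUDA kernel launch and a SYCL/DPC++ counterpart:

    // Illustrative sketch of the kind of mapping the tool performs.
    // Original CUDA:
    //
    //   __global__ void add(float* a, const float* b, int n) {
    //     int i = blockIdx.x * blockDim.x + threadIdx.x;
    //     if (i < n) a[i] += b[i];
    //   }
    //   add<<<blocks, threads>>>(a, b, n);
    //
    // A SYCL/DPC++ equivalent (a and b are USM device pointers):
    #include <sycl/sycl.hpp>

    void add(sycl::queue& q, float* a, const float* b, int n,
             int blocks, int threads) {
      q.parallel_for(
          sycl::nd_range<1>{
              sycl::range<1>{static_cast<size_t>(blocks) * threads},
              sycl::range<1>{static_cast<size_t>(threads)}},
          [=](sycl::nd_item<1> item) {
            int i = item.get_global_id(0);
            if (i < n) a[i] += b[i];
          });
    }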

The Intel® DPC++ Compatibility Tool is based on LLVM/Clang [4] and mainly contains three functional components:

  • The intercept-build tool: it collects the compilation options of the user’s input project by intercepting its build process, gathering build options, macro definitions, include folders, and similar information. During source-to-source migration, these compilation options are used to identify the active code paths and the header files depended upon, and finally to build the correct abstract syntax tree for the user’s input project.
  • The DPCT binary tool: the main migration tool, which performs source-to-source migration based on compiler front-end technology. It implements a set of migration rules to migrate source-language elements such as types, APIs, and macros to target-language equivalents. Where C++ code is the same in the source and target languages, the tool keeps that code unchanged. The tool also lets users define their own migration rules in a migration-rule description file to guide a customized migration.
  • Helper header library: it provides helper functions and macros to assist the migration of the input source code. These header files are intended to become part of the migrated code generated by the Intel® DPC++ Compatibility Tool. Users can copy these header files if needed and include them with the generated code.

The Intel® DPC++ Compatibility Tool can assist developers in migrating code written in CUDA to SYCL/DPC++ code with proper performance while minimizing developer effort. The tool enriches the oneAPI ecosystem by helping developers migrate more applications to SYCL/DPC++ running on oneAPI.

References
[1] https://software.intel.com/content/www/us/en/develop/tools/oneapi.html
[2] https://software.intel.com/content/www/us/en/develop/tools/oneapi/data-parallel-c-plus-plus.html
[3] https://www.khronos.org/sycl/
[4] https://llvm.org/
[5] https://software.intel.com/content/www/us/en/develop/documentation/intel-dpcpp-compatibility-tool-user-guide/top.html

Speaker: Zhiming Wang (Intel)

SYCL |  Technical Presentation: 132  | Register to View Presentation Recording  | View Slides

FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL

We explore the design and development of structured-mesh-based solvers on current Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted: (1) stencil applications based on explicit numerical methods and (2) multi-dimensional tridiagonal solvers based on implicit methods. Both classes appear as core solvers in a wide range of real-world applications, from CFD to financial computing. A general, unified workflow is formulated for synthesizing them on Intel FPGAs, together with predictive analytic models to explore the design space and obtain near-optimal performance. The performance of designs synthesized using these techniques for two non-trivial applications is benchmarked on an Intel PAC D5005 FPGA card, and the results are compared to the performance of optimized parallel implementations of the same applications on an Nvidia V100 GPU. Observed runtimes indicate that the FPGA provides performance better than or matching the V100 GPU. More importantly, the FPGA solutions consume 59%-76% less energy for their largest configurations, making them highly attractive for running workloads based on these applications in production settings. The performance model predicts the runtime of designs with high accuracy, with less than 5% error for all cases tested, demonstrating its significant utility for design-space exploration. With these tools and techniques, we discuss what makes a given structured-mesh code amenable to FPGA implementation, providing insights into the feasibility and profitability of a design, how such codes can be implemented in SYCL, and the resulting performance.

Speaker: Kamalavasan Kamalakkannan (University of Warwick)
Co-Authors: Gihan Mudalige (University of Warwick), István Reguly (Pázmány Péter Catholic University) and Suhaib Fahmy (King Abdullah University of Science and Technology)

SYCL |  Paper: 135  | Register to View Presentation Recording  | View Slides

How to Optimize Compute Drivers? Let’s Start with Writing Good Benchmarks!

Writing an efficient driver stack is the goal of every driver developer, but to see whether your stack is performant you need tools that will confirm it. You may try running workloads and benchmarks to see how your driver performs, but this will only give you a summarized score, made up of many pieces. To optimize further, you need to take extensive steps in understanding the applications, figuring out where the bottleneck is, and optimizing it, which is quite a time-consuming process involving a lot of effort.
This created a need for the driver team to write a tool that would make performance work on the driver easier, so we created compute benchmarks. In this suite we test all aspects of the driver stack to see whether they have any bottlenecks. Each test checks only one thing and does so in isolation, so it is very easy to work on optimizing it, and it doesn’t require any extensive setup.

The benchmarks focus on subtle aspects of every driver, such as: the API overhead of every call, submission latencies, resource creation costs, transfer bandwidths, multi-threaded contention, multi-process execution, and many others.
The framework offers capabilities for multiple backends; currently we have OpenCL and Level Zero implementations in place, so it is very easy to compare how the same scenario is serviced by different drivers. It is also very easy to compare driver implementations between vendors, as tests written in OpenCL simply work across different GPU implementations.
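
As a minimal sketch of the kind of isolated measurement involved (not code from the suite itself), the submission latency of a single kernel can be estimated with standard OpenCL profiling events:

    #include <CL/cl.h>

    // Returns the time (ns) a command spent between being queued and starting
    // execution; the queue must be created with CL_QUEUE_PROFILING_ENABLE and
    // the kernel arguments must already be set.
    cl_ulong submission_latency_ns(cl_command_queue queue, cl_kernel kernel,
                                   size_t global_size) {
      cl_event ev;
      clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size, nullptr,
                             0, nullptr, &ev);
      clWaitForEvents(1, &ev);
      cl_ulong queued = 0, start = 0;
      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                              sizeof(queued), &queued, nullptr);
      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                              sizeof(start), &start, nullptr);
      clReleaseEvent(ev);
      return start - queued;  // approximate driver submission latency
    }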

We also use this code to present good and bad coding practices; this is very useful to showcase how simple things can drastically improve performance, and users can simply run those scenarios and see how performance changes on their own setups. It is also a great tool for prototyping new extensions and proposing them as part of the OpenCL standard.
We plan to open-source this project in Q2 2022; it is expected to already be available during IWOCL.

Speaker: Michal Mrozek (Intel)

OPENCL & SYCL |  Paper: 142  | Register to View Presentation Recording  | View Slides

TAU Performance System

The TAU Performance System [http://tau.uoregon.edu] is a versatile performance evaluation tool that supports OpenCL, DPC++/SYCL, OpenMP, and other GPU runtimes. It features a performance profiling and tracing module that is widely portable and can access hardware performance counter data at the GPU and CPU level. This talk will describe the usage and new features of TAU for performance evaluation of HPC and AI/ML workloads. TAU is integrated in the Extreme-Scale Scientific Software Stack (E4S) [https://e4s.io] and is available in containerized and cloud environments. The talk/tutorial will demonstrate the usage of TAU on uninstrumented applications.

Speaker: Sameer Shende (University of Oregon)

OPENCL & SYCL |  Technical Presentation: 125  | Register to View Presentation Recording  | View Slides

Towards Performance Portability of AI Models Using SYCL-DNN

The wide adoption of Deep Neural Networks (DNNs) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from edge devices to the cloud and supercomputers. This huge diversity soon becomes a burden due to the emerging dependencies between development stacks and deployment hardware. While ONNX, proposed as a de facto standard for AI model description, provides portability of AI models across various AI frameworks, supporting DNN models on various hardware architectures remains challenging. Several existing AI frameworks such as TensorFlow, PyTorch, and ONNX Runtime provide performance portability via dedicated backend implementations per hardware architecture. While such an approach provides wider support of hardware devices, maintainability and readability remain challenging.

SYCL provides a C++-based portable parallel programming model to target various devices like CPUs, GPUs, DSPs, FPGAs, etc. Thus, enabling a SYCL backend for an AI framework can lead to a hardware-agnostic model for heterogeneous systems and also allows reuse of existing optimized library implementations.

SYCL-DNN already supports an OpenCL backend, and in this paper we extend SYCL-DNN to support Nvidia and RISC-V architectures. The results provide a detailed analysis of the performance portability of SYCL-based AI frameworks on various architectures with respect to state-of-the-art optimized vendor-specific libraries.

Speaker: Muhammad Tanvir (Codeplay Software)
Co-Authors: Kumudha Narasimhan, Mehdi Goli, Ouadie El Farouki, Svetlozar Georgiev and Isaac Ault (Codeplay Software)

OPENCL & SYCL |  Technical Presentation: 141  | Register to View Presentation Recording  | View Slides

Towards a Portable Drug Discovery Pipeline with SYCL 2020

The outcome of the drug discovery process is a molecule that has a strong interaction with the target protein. Domain experts expect a beneficial effect from this interaction. Virtual screening is one of the early stages of the process, and it aims at finding promising molecules to forward to later stages. We perform this task in silico to evaluate a very large chemical library in a short time frame. This activity typically comprises two compute-intensive tasks: a docking function that predicts the displacement of atoms, and a scoring function that estimates the interaction strength [6]. Dompé Farmaceutici led the development of LiGen [1–3], a molecular docking platform targeting High-Performance Computing systems. LiGen has been used for the discovery of novel treatments in the fight against viral infections and multidrug-resistant bacteria [4]. The LiGen processing pipeline includes two main components, ligen-dock and ligen-score, originally developed in OpenACC and refactored to CUDA using non-portable target-specific optimizations [7].

In this talk, we discuss the challenges of making the LiGen docking pipeline portable among different accelerators and GPUs by porting the original codebase from CUDA to SYCL. The code has been refactored by replacing critical CUDA semantics with portable ones, and by exploiting several features from the SYCL 2020 standard [5], including sub-groups, group algorithms, and Unified Shared Memory. For comparison, we have developed two versions based on accessor-based and USM-based memory accesses, respectively. Particular effort has been spent on kernel tuning, in particular to optimize those kernels with high register pressure. The final SYCL code base, comprising more than 20 SYCL kernels, has been evaluated on several architectures including the NVIDIA V100, NVIDIA A100, AMD MI100, and Intel Xeon, using both the hipSYCL and Intel DPC++ compilers. In terms of performance portability, the SYCL implementation achieves performance similar to the native CUDA version on the NVIDIA V100 and AMD MI100, with minimal modifications needed.

Speaker: Luigi Crisci (University of Salerno)
Co-Authors: Majid Salimi Beni, Biagio Cosenza (University of Salerno), Nicolò Scipione, Davide Gadioli, Emanuele Vitali, Gianluca Palermo (Politecnico di Milano), and Andrea Beccari(Dompé Farmaceutici)

SYCL |  Technical Presentation: 111  | Register to View Presentation Recording  | View Slides

Posters

Reaching Even Richer C++ in OpenCL Kernels with use of libclcxx

Since 2019, OpenCL developers have been able to take advantage of C++ features in their kernels written in the C++ for OpenCL language, which is directly supported in upstream clang [1] or in some vendor drivers through the cl_ext_cxx_for_opencl extension [2]. However, one significant limitation has initially been the inability to use the standard C++ libraries. To overcome this limitation, work was initiated last year on reusing some of LLVM’s libcxx libraries for OpenCL, with the very positive outcome of making the full standard C++ type traits library available in OpenCL kernel code [3]. In this submission we present follow-up work in which we have created a new project, libclcxx [4], for hosting and providing C++ libraries for OpenCL to application developers who wish to benefit from more C++ features in their kernels. This public project is hosted in a GitHub repository and now contains the type traits from libcxx along with some new type-traits functionality specific to the OpenCL kernel language: for address spaces, vectors, and other OpenCL-specific types.

We would like to present a brief overview of this project and the new type traits, as well as to highlight how previous work on enabling the standard C++ type traits has been reused and successfully merged into the new project, and how it facilitated easy development of the new features. We would also like to demonstrate how developers can take advantage of the new libraries, along with the clang compiler, to use more C++ language features when developing OpenCL kernels.
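
For instance, a small sketch of the kind of kernel this enables (our illustrative example, assuming clang with C++ for OpenCL support and the libclcxx headers on the include path) could be:

    // A C++ for OpenCL kernel using a standard type trait from libclcxx;
    // compiled with something like: clang -cl-std=clc++ -I<libclcxx>/include
    #include <type_traits>

    template <typename T>
    T twice(T x) {
      static_assert(std::is_arithmetic<T>::value, "arithmetic types only");
      return x + x;
    }

    kernel void doubler(global float* data) {
      size_t i = get_global_id(0);
      data[i] = twice(data[i]);
    }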

Finally, we would like to invite the OpenCL community to evaluate the project and give feedback through the libclcxx GitHub repo channels, especially regarding the features they would like to see added next, or even the features they would be interested in contributing to this project.

[1] https://clang.llvm.org/docs/UsersManual.html#cxx-for-opencl
[2] https://www.khronos.org/registry/OpenCL/extensions/ext/cl_ext_cxx_for_opencl.html
[3] https://doi.org/10.1145/3456669.3456675
[4] https://github.com/KhronosGroup/libclcxx

Speaker: Anastasia Stulova (Arm)
Co-Authors: Ishfaq Wardag (Arm)

OPENCL |  Poster: 204  | Register to View Presentation Recording  | View Slides

OpenCLML integration with TVM

This poster introduces our efforts and presents preliminary results on enabling an OpenCLML backend for the Tensor Virtual Machine (TVM) on Qualcomm Adreno GPUs. TVM is a popular machine learning (ML) compiler stack that targets a wide range of computing devices running ML networks. TVM can auto-generate highly optimized CUDA and OpenCL kernels with little information provided by developers, and in many cases it can beat kernels that are hand-optimized by experts. TVM has been well tuned for many desktop and mobile devices. OpenCL ML (cl_qcom_ml_ops) is Qualcomm’s OpenCL vendor extension which gives applications access to key machine learning operations that are accelerated by the implementation. It introduces a new set of CL API calls, data structures, and tokens for specifying and enqueuing machine learning operations. The latest version of OpenCL ML provides support for training operators in addition to inference. The OpenCLML SDK is available to the Qualcomm developer community at https://developer.qualcomm.com/blog/accelerate-your-models-our-opencl-ml-sdk.

After having enabled mainline TVM for Adreno GPUs, which was presented at IWOCL 2020 and 2021, we made further progress by adding an OpenCLML SDK backend in TVM for accelerating machine learning operations on Adreno GPUs. This integration uses TVM’s high-level frontends and graph-level optimizations, followed by the Adreno GPU’s accelerated proprietary machine learning operators. OpenCLML integration into TVM is achieved through TVM’s BYOC (Bring-Your-Own-Codegen). BYOC is a framework that allows integration of an accelerator-specific kernel library into TVM’s compiler flow; please refer to TVM’s documentation at https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm for more details. BYOC offers a hybrid mode of kernel execution, with some subgraphs executing through the vendor-specific kernel library and the remaining (or unsupported) operators using any of TVM’s backends as a fallback option. This makes the platform capable of running all models without worrying about new operators.

To enable OpenCLML in TVM, we implemented an OpenCLML frontend that helps split the given DNN workload based on the OpenCLML operator inventory, and a new codegen that extends TVM’s JSON codegen. The compiler output here is a platform-independent JSON representation. For execution, we implemented an OpenCLML runtime on top of TVM’s JSONRuntime that initializes and invokes the OpenCLML machine learning CL API. We are in the process of upstreaming these efforts; you may refer to the RFC at https://github.com/apache/tvm-rfcs/blob/main/rfcs/0052-OpenCLML-integratio-as-BYOC.md.

By integrating OpenCLML into the TVM compiler framework, we have seen significant improvements for most well-known models on the Adreno platform in comparison to TVM’s default OpenCL backend.

In the poster we will discuss the approach and the various challenges we faced during these efforts, along with performance numbers and our thoughts on the future of OpenCLML with TVM.

Speaker: Siva Rama Krishna Reddy B (Qualcomm)
Co-Authors: Hongqiang Wang, Alex Bourd, Adarsh Golikeri and Balaji Calidas (Qualcomm)

OPENCL |  Poster: 216  | Register to View Presentation Recording  | View Slides

SYCL Concurrency on GPU Platforms: Empirical Measurement

“Execution of independent SYCL commands may overlap” is an optimization that SYCL application developers would like to rely on.

By executing commands concurrently, developers hope that their code will run faster. This poster uses that expectation as an empirical metric to assess whether a computing environment lives up to it: we run each individual command serially to generate a baseline, and then check whether the same commands run faster when scheduled in a way that allows concurrency.

The SYCL specification allows concurrent execution of independent commands when they are scheduled in an out-of-order queue or when they are scheduled to multiple, possibly in-order, queues.

We tested four different kinds of independent commands, in both the “multiple in-order queues” and “single out-of-order queue” modes (see the sketch after the list below):

  • Two compute kernels, each having low occupancy
  • One compute kernel and one memory copy from system-allocated memory to a device buffer (M2D)
  • One compute kernel and one memory copy from a device buffer to system-allocated memory (D2M)
  • One M2D and one D2M
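
As an illustration, a minimal sketch of the M2D case in the multiple in-order-queues mode (names and sizes are ours, not from the poster) could be:

    // One compute kernel plus one host-to-device (M2D) copy on two in-order
    // queues sharing a device; the commands are independent, so the driver
    // is allowed (but not required) to overlap them.
    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
      sycl::device dev{sycl::gpu_selector_v};
      sycl::queue q1{dev, sycl::property::queue::in_order{}};
      sycl::queue q2{dev, sycl::property::queue::in_order{}};

      constexpr size_t n = 1 << 24;
      float* a = sycl::malloc_device<float>(n, q1);  // kernel operand
      float* b = sycl::malloc_device<float>(n, q2);  // copy destination
      std::vector<float> host(n, 1.0f);

      // Contents of 'a' are irrelevant for a timing experiment.
      q1.parallel_for(sycl::range<1>{n},
                      [=](sycl::id<1> i) { a[i] = a[i] * 2.0f + 1.0f; });
      q2.memcpy(b, host.data(), n * sizeof(float));  // the M2D copy
      q1.wait();
      q2.wait();

      sycl::free(a, q1);
      sycl::free(b, q2);
    }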

The poster’s contribution is twofold:

Firstly, the source code used for these experiments has been made open source (https://github.com/argonne-lcf/HPC-Patterns/tree/main/concurency) so that others can evaluate these different approaches to concurrency. Our code uses USM for the memory transfers and relies on a “clpeak-like” kernel for the compute part (https://github.com/krrishnarraj/clpeak/blob/master/src/kernels/compute_dp_kernels.cl). Memory buffers are made as large as USM allocation allows (`sycl::info::device::max_mem_alloc_size`) to minimize runtime overhead with respect to execution time. The number of FMAs used in the compute kernel is chosen so that the execution times of the compute kernel and the data transfers are similar.

Secondly, we tested multiple SYCL compilers, targeting multiple backends, on multiple hardware platforms (at the time this abstract was written: DPCPP / OpenCL / Gen, DPCPP / L0 / Gen9, DPCPP / CUDA / A100, hipSYCL / HIP / MI100; we plan to measure more). Results are mixed, with some environments achieving concurrency in most tests and others in none. It is also interesting to note that enabling profiling in queues will serialize commands in some environments.

Speaker: Thomas Applencourt (Argonne National Laboratory)
Co-Authors: Abhishek Bagusetty (Argonne National Laboratory) and Aksel Alpay (Heidelberg University)

SYCL |  Poster: 221  | Register to View Presentation Recording  | View Slides

C++OpenCL4TVM: Support C++OpenCL Kernel for TVM NN Operators

Among the various artificial intelligence (AI) frameworks, the Tensor Virtual Machine (TVM) is a machine learning compiler framework that can provide OpenCL parallel optimization, focusing on the inference stage of deep learning. The operators of an NN model can be turned into OpenCL kernel functions automatically by its built-in OpenCL kernel code generator to optimize the inference process. However, the generated kernel code still follows the C version of the OpenCL specification, with the kernel source passed as a stream, which prevents the OpenCL kernel from obtaining further optimization opportunities through C++ syntax, such as compile-time computation and abstraction design.

In this paper, we propose a solution in which TVM generates the design patterns and abstractions of OpenCL C++ with a flavor of the layout and view concepts of Kokkos. This solution provides two C++-specific abstraction features, a sparse layout and an unseq execution policy, for TVM’s OpenCL code generator. The sparse layout can be described abstractly, Kokkos-style, effectively reorganizing the matrix structure of a convolution to achieve code reuse and memory optimization. The unseq execution policy serves as a vectorization request to, for example, for_each in a C++ for OpenCL kernel. There are two steps in the design process:

(1) Analyze the sparse structure of the convolution operator in TVM’s OpenCL kernel and further design the sparse layout for TVM’s OpenCL kernel code generator.

(2) Put compiler directives in the function overloads which take the vectorization request.
This is currently ongoing work; we will report more experimental results for TVM generating design patterns of the OpenCL C++ version at the workshop.

Speaker: Po-Yao Chang (National Tsing Hua University)
Co-Authors: Tai-Liang Chen, Yu-Tse Huang, Meng-Shiun and Jenq-Kuen Lee (National Tsing Hua University)

OPENCL |  Poster: 209  | Register to View Presentation Recording  | View Slides

Embedding a DSL in SYCL for Productive and Performant Tensor Computing on Heterogeneous Devices

Within the last few decades, tensor computing has grown to affect many domains, including scientific computation, engineering, machine learning, and many other sub-domains. Although these domains all utilize tensor computing, each individual application can vary in its overall implementation and requirements, creating a diverse and unique range of applications within tensor computing.

Along with the popularity of tensor computing arose the demand for more optimized tensor applications. In response to this growing demand came a proliferation of new hardware platforms for more efficient tensor execution: Google Brain’s Tensor Processing Units (TPUs), Nvidia’s general-purpose Graphics Processing Units (GPUs), Microsoft’s Neural Processing Units (NPUs), and Field-Programmable Gate Arrays (FPGAs), for both local and cloud platforms. Many of these architectures utilize matrix-vector units, systolic arrays, or a novel hardware architecture to exploit the parallelism found within tensor computations.

To support many of these new and developing hardware architectures, industry and academia alike have produced Domain-Specific Languages (DSLs) and programming models to help realize each unique hardware architecture’s potential. This requires a very close understanding of individual hardware characteristics and their pitfalls, and well-engineered software to navigate them effectively. Accompanying these unique hardware architectures and their unique supporting software comes the caveat of inflexibility. As software improves, or as experimental implementations of a tensor application are created for one unique hardware architecture, they may not translate easily to another architecture. Applications must be re-implemented for any novel architecture by researchers and developers alike, forcing difficult trade-offs between hardware, software, and engineering labor. This inflexible software ecosystem becomes a hurdle for the development of new algorithms that utilize tensor computation.

This project focuses on combining the data parallelism provided by Data Parallel C++ (DPC++) with Temporal To Spatial Programming (T2SP) to provide a hardware-agnostic programming model for constructing new tensor computations.

DPC++ is an open source compiler project that is based on SYCL, an industry-driven Khronos standard adding data parallelism to C++ for heterogeneous systems.

  • T2SP is both a novel programming framework and a compiler that helps enable tensor computation for both vector and spatial architectures, such as CPUs/GPUs and FPGAs respectively.
  • T2SP does this by decoupling the functional specification from the spatial mapping.
  • T2SP is based on several observations, notably that spatial architectures favor optimized dataflow and partitioning of the computation into many sub-computations distributed over the spatial architecture.
  • T2SP allows programmers to describe the computation separately from the spatial mapping, partitioning, and dataflow of a spatial architecture, letting them quickly develop various spatial optimizations without having to reconstruct an application’s core functional implementation between architectures such as CPUs, GPUs, and FPGAs.
  • Through the employment of both DPC++ and T2SP, users are able to create platform-agnostic implementations of novel algorithms without the restriction of architecture-dependent software.
  • This provides portability, efficient utilization of hardware resources, and ease of development for tensor applications.

Initial evaluations were performed using General Matrix Multiply (GEMM), 2-Dimensional Convolution (CONV), and Capsule Convolution (CAPSULE) for an Arria-10 FPGA on Intel’s FPGA DevCloud platform.

Results show that this project has been able to achieve an average of over 60% of the original T2SP’s performance. With minor adjustments, we are confident that this combination of DPC++ and T2SP can provide competitive performance for tensor applications across vector and spatial architectures without extra effort from the end user.

Speaker: Abenezer Wudenhe (University of California Riverside)
Co-Authors: Hongbo Rong (Intel)

SYCL |  Poster: 219  | Register to View Presentation Recording  | View Slides

C++ for OpenCL 2021

Since its release in 2019, the C++ for OpenCL language has offered higher programming productivity to application developers targeting OpenCL hardware accelerators. By keeping backwards compatibility with OpenCL C 2.0, it adheres to industry-wide programming practices. Moreover, support for high-level C++ language concepts and libraries [1, 2] has opened up possibilities for developing even more complex applications, putting OpenCL in a better position to compete with other compute technologies such as SYCL, CUDA, or Metal.

To continue evolving alongside the OpenCL standard, a new language version, C++ for OpenCL 2021, has been announced, which provides compatibility with OpenCL 3.0. The difference between the previous C++ for OpenCL version 1.0 and the new version 2021 arises from the differences between OpenCL 2.0 and 3.0, with which they are respectively compatible.

To support more devices, OpenCL 3.0 defines the features of OpenCL 2.0 as optional [3]. We would like to present an overview of the new language version along with its ongoing support in Clang. Currently, C++ for OpenCL 2021 is being actively developed side by side with OpenCL 3.0, and experimental support for it is going to appear in Clang 14. Most of the work conducted on C++ for OpenCL 2021 relates to the optionality of features. In its current state, C++ for OpenCL 2021 supports all the optional features of OpenCL 3.0.

By implementing the new language version inside the Clang compiler as part of the LLVM project, we could reuse substantial portions of code already written during the development of OpenCL C 3.0. This way we avoided unnecessary duplication of effort in places where C++ for OpenCL behaves similarly to OpenCL C. Additionally, the Clang compiler and community helped carry out the proof-of-concept stage for experimental features. More specifically, the address space removal utility __remove_address_space [4] was included in the official language documentation after being implemented in Clang and after feedback was gathered from the Clang community. The largest novelty of this new language version and its implementation lies in supporting the optionality of the generic address space, on which C++ for OpenCL 1.0 relied heavily in multiple C++-specific constructs, e.g., implicit object parameters or special member functions.
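
As a brief sketch of the utility mentioned above (based on the documented interface [4]; the kernel itself is our illustrative example), __remove_address_space lets template code strip a deduced address space:

    // __remove_address_space [4] strips the address space from a deduced
    // type so a plain (private) local copy can be declared in template code.
    template <typename T>
    auto load(T* p) {
      // T may be deduced with an address space, e.g. '__global int';
      // __remove_address_space yields the unqualified type.
      typename __remove_address_space<T>::type tmp = *p;
      return tmp;
    }

    kernel void first_elem(global const int* in, global int* out) {
      out[0] = load(in);
    }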

In this submission we plan to provide an overview of key language changes in C++ for OpenCL 2021 and present the latest status of its development in Clang. Although not yet in its final state, this new language version already has most of the intended functionality implemented inside the Clang compiler. The provisional language documentation [5] should be consulted for detailed information on all new language features. We thus encourage everyone to start experimenting and to provide us feedback on how to improve C++ for OpenCL 2021 even further. Please check the latest support for C++ for OpenCL 2021 directly from the browser using Compiler Explorer [6].

[1] https://www.iwocl.org/wp-content/uploads/06-iwocl-syclcon-2021-strohm-slides.pdf

[2] https://www.youtube.com/watch?v=DBknc1pRB9E

[3] https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#features

[4] https://www.khronos.org/opencl/assets/CXX_for_OpenCL.html#remove-addrspace

[5] https://www.khronos.org/opencl/assets/CXX_for_OpenCL.html

[6] https://godbolt.org/z/q8PeWY3hn

Speaker: Justas Janickas (Arm)
Co-Authors: Anastasia Stulova (Arm)

OPENCL |  Poster: 207  | Register to View Presentation Recording  | View Slides

An Overview of the OpenCL vendor extensions supported in Qualcomm Adreno GPUs

In this poster, we will provide a high-level overview of the OpenCL vendor extensions that are supported by the Adreno GPUs in Snapdragon SoCs. These vendor extensions provide features that are not currently supported by the OpenCL standard but could become part of the standard going forward. There are also some extensions that are highly Adreno-specific and will be available solely on Adreno GPUs for the foreseeable future. These features include the integer dot product with 8-component support, the bit-reverse operation, subgroup size selection, recordable queues, etc., and many of them have been used by our customers for optimal performance. In addition to the overview, we will provide examples to demonstrate best practices for using and optimizing some of the extensions.

Speaker: Hongqiang Wang (Qualcomm)
Co-Authors: Balaji Calidas (Qualcomm)

OPENCL |  Poster: 213  | Register to View Presentation Recording  | View Slides

Exploring Compiler-aided nd-range Parallel-for Implementations on CPU in hipSYCL

Parallel and heterogeneous programming have brought the most notable advances in computing over the last two decades.  One of the major challenges with heterogeneous programming is the performance portability of code on the various computing resources. Open standards like OpenCL or SYCL provide means to program various classes of processing units using the same kernel. While mainstream CPUs expose up to 64 physical cores at this point in time, GPUs are usually programmed with thousands of work-items.  This difference makes it challenging to efficiently use CPUs as fallback or as additional computing resources using the same kernels as with GPUs.

The hipSYCL implementation of the SYCL standard, presented in [Alpay and Heuveline 2020], currently provides unified access to CPUs and GPUs. However, typical GPU kernels using the nd-range parallel-for paradigm are far from performance-portable, as mentioned in [Alpay and Heuveline 2020, slide 20]. This is due to the forward-progress guarantees that have to be made so that barriers can be correctly implemented, making typical GPU kernels with a low compute-to-barrier ratio prohibitively slow. One major component is the overhead induced by using either dedicated threads or fibers for every work-item. This problem is common to most heterogeneous programming models that also target CPUs. A more favorable execution model for CPUs is the hierarchical parallel-for in SYCL, which can be implemented by parallelizing over the work-groups and sequentially iterating over the work-items. [Deakin et al. 2021] discuss the issues of SYCL’s current kernel submission mechanisms in depth.
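
To make the problem concrete, a minimal sketch of the problematic pattern (our illustrative example, not code from the poster) is an nd-range kernel with a work-group barrier; on a CPU every work-item must be able to wait at the barrier, which naively forces one thread or fiber per work-item:

    #include <sycl/sycl.hpp>

    // Reverse each 256-element tile of 'in' into 'out' via local memory;
    // the group barrier requires all work-items of a group to synchronize.
    void reverse_tiles(sycl::queue& q, const float* in, float* out, size_t n) {
      constexpr size_t wg = 256;
      q.submit([&](sycl::handler& cgh) {
        sycl::local_accessor<float, 1> tile{sycl::range<1>{wg}, cgh};
        cgh.parallel_for(sycl::nd_range<1>{n, wg}, [=](sycl::nd_item<1> it) {
          size_t l = it.get_local_id(0);
          tile[l] = in[it.get_global_id(0)];
          sycl::group_barrier(it.get_group());  // every work-item must reach this
          out[it.get_global_id(0)] = tile[wg - 1 - l];
        });
      });
    }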

Earlier work in the context of OpenCL CPU implementations had to solve the barrier issue for their kernels as well.

In [Kaeli et al. 2015] the authors describe the implementation of the AMD APP SDK’s OpenCL CPU runtime. They use custom lightweight threads that allow them to optimize stack location and alignment. Support for vectorization is only present when using OpenCL’s explicit floatN vector data types, which have overloaded mathematical operations that map directly to vector instructions. This is similar to the Boost.Fiber-based, library-only nd-range parallel-for implementation in hipSYCL 0.9.0 and later, and thus leaves performance to be gained.

In [Jääskeläinen et al. 2010] an OpenCL implementation is presented that aims to enable the use of OpenCL on application-specific processors. [Jääskeläinen et al. 2015] continues this project under the name Portable Computing Language (POCL) with a focus on high-performance execution on CPUs. The work-group barrier issue has been solved by implementing a custom set of passes on LLVM IR inside the kernel compiler. After inlining all functions in the kernel, it performs barrier-tail replication, loop-barrier insertion, … to form barrier-free regions, which are replicated, selected at runtime by the first work-item, and then iterated over by a work-item loop. With a few technical adaptations due to compilation-flow limitations in SYCL, the kernel compiler was adopted into the hipSYCL Clang plugin for evaluation.

A different approach to the work-group barrier problem was presented as continuation-based synchronization (CBS) in [Karrenberg and Hack 2012]. The kernel is divided into barrier-free sub-CFGs that are identified by their entry barrier and surrounded by a while loop containing a switch statement that selects the next sub-CFG to execute based on the last encountered barrier. An adaptation of CBS into hipSYCL is compared to the POCL approach in this work.
In hipSYCL, an improved variant of the uniformity analysis presented in [Rosemann et al. 2021] was used.
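
Conceptually (an illustrative sketch, not the actual LLVM IR transformation), CBS turns a kernel with one barrier into something like:

    #include <cstddef>

    enum class Region { Entry, AfterBarrier, Exit };

    // Stand-ins for the barrier-free sub-CFGs extracted from the kernel.
    inline void sub_cfg_entry(std::size_t wi) { (void)wi; /* code before the barrier */ }
    inline void sub_cfg_after(std::size_t wi) { (void)wi; /* code after the barrier */ }

    // One work-group: each sub-CFG is run for all work-items by a work-item
    // loop; the while/switch dispatch selects the next sub-CFG based on the
    // last barrier encountered.
    void run_work_group(std::size_t wg_size) {
      Region next = Region::Entry;
      while (next != Region::Exit) {
        switch (next) {
        case Region::Entry:
          for (std::size_t wi = 0; wi < wg_size; ++wi) sub_cfg_entry(wi);
          next = Region::AfterBarrier;  // all work-items reached the barrier
          break;
        case Region::AfterBarrier:
          for (std::size_t wi = 0; wi < wg_size; ++wi) sub_cfg_after(wi);
          next = Region::Exit;
          break;
        default:
          next = Region::Exit;
        }
      }
    }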

Most SYCL implementations rely on OpenCL drivers to provide access to CPUs, so the performance and semantics are mostly defined by the underlying runtime, which might impact portability. hipSYCL does not have this additional requirement for an OpenCL driver, but supports, in a library-only fashion, all CPUs for which an OpenMP compiler exists. To provide good CPU performance for the nd-range parallel-for paradigm as well, without adding a dependency on an OpenCL runtime, we implement the compiler-based POCL and CBS approaches to the work-group barrier problem in hipSYCL’s Clang plugin and compare the approaches’ suitability for the SYCL programming model.

The approach POCL has chosen allows for more hand-crafted compiler-based optimizations, whereas CBS is more general. One benefit of CBS is that it only requires that if a barrier is reached by any work-item, all work-items must reach it. This is how the barrier semantics are defined by the SYCL and newer OpenCL C standards. POCL additionally assumes that if a barrier is nested inside a loop, the loop has to execute exactly the same number of iterations for all work-items, even if the barrier would not be reached due to protecting conditionals. This can have a functional-correctness impact.

Both approaches outperform hipSYCL’s current fiber implementation in relevant benchmarks [Deakin et al. 2021, 2016; Jin 2021; Lal et al. 2020], in most cases by one or two orders of magnitude. See https://gitlab.com/fodinabor/ma-hipsycl-bench/-/tags/iwocl22 for the benchmark versions used, the reproduction steps, and the raw data. While the CBS pipeline currently results in faster kernels generally speaking, the POCL pipeline could likely close the gap by porting over improvements made to the other. Using the DGEMM kernel as an example, it was shown that on the tested x86 systems the SYCL implementation is able to outperform the POCL OpenCL implementation. By upstreaming the CBS variant, hipSYCL becomes the first SYCL implementation that does not rely on an OpenCL runtime while achieving competitive performance and consistent semantics for the nd-range parallel-for paradigm on any CPU that has an LLVM backend.

Speaker: Joachim Meyer (Saarland University)
Co-Authors: Aksel Alpay, Holger Fröning and Vincent Heuveline (Heidelberg University)

SYCL |  Poster: 210  | Register to View Presentation Recording  | View Slides