Welcome to the IWOCL and SYCLcon 2021 program of events

Virtual Event Logistics

  • Instructions for joining the live sessions can be found in your registration confirmation.
  • If you are missing registration emails please see: What to check when you don’t receive Eventbrite emails
  • A range of Slack channels are in use to facilitate discussions.  The link to join these channels can be found in your original registration confirmation.
  • All our authors have been asked to check their Slack channel from Wednesday 28th to answer any questions.
  • Registration is now closed.  The Slack channels will remain open to anyone who registered before the 29th April.

Proceedings

This year’s proceedings are available on the
ACM Digital Library

oneAPI Dev Summit

The First 2021 oneAPI Developer Summit

In this one-day virtual summit, you will hear speakers from industry and academia share their experience working on innovative cross-platform, multi-vendor architecture solutions developed on oneAPI.

Date: Monday 26 April
Start time: 9:00am BST, 10:00 CET | Duration: All day
Register: Please visit oneAPI Developer Summit to register and for additional information on the program and speakers.

Join the oneAPI community for the first 2021 oneAPI Developer Summit at IWOCL, focused on oneAPI and Data Parallel C++ for accelerated computing across xPU architectures (CPU, GPU, FPGA, and other accelerators). In this one-day virtual conference, you will hear speakers from industry and academia working on innovative cross-platform, multi-vendor architecture solutions developed on oneAPI. Learn from fellow developers and connect with other innovators.

Please join us: a self-sustaining, vibrant community where developers support each other in using oneAPI and Data Parallel C++.

  • INTRODUCTION
  • VENDOR UPDATE: SYCL 2021 Vendor Update
  • DEVCLOUD UPDATE: Developer tools to get you started on oneAPI
  • HANDS-ON SESSION: Application optimization with Cache-aware Roofline Model and Intel oneAPI tools
  • LUNCH
  • LIGHTNING TALK: Great Cross-Architecture Challenge Application Showcase
  • KEYNOTE: SYCL 2020 in hipSYCL: DPC++ features on AMD GPUs, NVIDIA GPUs and CPUs
  • TECH TALK:  AI > A Deep Dive into a Deep Learning Library for the A64FX Fugaku CPU – Meet the Developer
  • LIGHTNING TALK: Bringing SYCL to Super Computers with Celerity
  • LIGHTNING TALK: Great Cross-Architecture Challenge Application Showcase
  • TECH TALK: It’s acceleration… but faster! – a business perspective on FPGA technology
  • TECH TALK: Comparative Analysis of Intel HLS Design Tools on a Case Study in Neuromorphic
  • TECH TALK: TAU Performance System
  • CLOSING
  • HAPPY HOUR

To attend this session please register directly at: https://www.oneapi.com/events/devcon2021iwocl/

Tutorials

Registered delegates will be sent instructions for joining these live tutorials and the associated Slack channel.

A Hands-On Introduction To SYCL

Tutorial Lead: Rod Burns, Codeplay
Presenters: Igor Vorobtsov (Intel); Aksel Alpay (University of Heidelberg); Ronan Keryell (Xilinx); Peter Zuzek (Codeplay); Gordon Brown (Codeplay)

Date: Tuesday 27 April
Start time: 09:00 BST, 10:00 CET | Duration: Approx 7 Hours
discuss on slack: #discuss_tutorial_intro_to_sycl

Set-up Instructions

The tutorial is very focused on hands-on work and is a great way to grasp the essentials you will need for SYCL development. By the end of the day we expect you to be able to write your own SYCL kernels and understand the main concepts required to build more complex applications.

Because of this, we strongly encourage you to set up your machine so that you can follow along with the exercises.  There are a few options for this; some are more involved to set up than others.

The simplest way to do the exercises is to use the Intel DevCloud environment. This requires no machine setup: all you need to do is register for the DevCloud and then access it via your web browser or SSH. Register for the DevCloud at https://devcloud.intel.com/oneapi/

It’s also possible to configure your own machine to use one of the implementations: ComputeCpp, DPC++ or hipSYCL.

ComputeCpp supports Intel CPU and GPU on Linux and Windows
DPC++ supports Intel CPU and GPU on Linux and Windows
hipSYCL supports Nvidia and AMD processors on Linux

Individual instructions on how to install each of these are in this README https://github.com/codeplaysoftware/syclacademy/tree/iwocl21#install-sycl-implementations

Abstract

SYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to a platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++. In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language.

This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises. Hence, attendees will require their own laptop to perform the hands-on exercises.

Topics Covered Include:

– An Introduction to SYCL
– SYCL Kernel Definition
– Memory and Data Management
– Data Parallelism
– Asynchronous Execution
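
To give a flavour of what the hands-on exercises build towards, here is a minimal sketch of a SYCL vector addition. This is not taken from the tutorial materials; it is written against the SYCL 2020 API, and older implementations may require <CL/sycl.hpp> and slightly different syntax.

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);

  sycl::queue q;  // selects a default device (CPU, GPU, ...)

  {
    // Buffers manage data movement between host and device.
    sycl::buffer<float> bufA(a.data(), sycl::range<1>(N));
    sycl::buffer<float> bufB(b.data(), sycl::range<1>(N));
    sycl::buffer<float> bufC(c.data(), sycl::range<1>(N));

    q.submit([&](sycl::handler &h) {
      sycl::accessor A(bufA, h, sycl::read_only);
      sycl::accessor B(bufB, h, sycl::read_only);
      sycl::accessor C(bufC, h, sycl::write_only);
      // The kernel: one work-item per element.
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        C[i] = A[i] + B[i];
      });
    });
  }  // buffers go out of scope: results are copied back to the host

  std::cout << "c[0] = " << c[0] << "\n";  // prints 3
}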

SYCL  | Paper ID: 113

Layers for OpenCL

Brice Videau, Argonne National Lab

Date: Tuesday 27 April
Start time:  15:00 BST, 16:00 CET, 07:00 PST | Duration: 2 hours
slides  |  tutorial on github  |  video presentation  |  discuss on slack: #discuss_tutorial_layers_for_opencl
Interception mechanisms for injection or tracing are an integral part of the software analysis toolkit. These techniques are well understood and many examples of interception libraries exist. Some widely used tools in the OpenCL community leverage such techniques, the most famous example being the “Intercept Layer for OpenCL Applications” [1]. Nonetheless, these techniques present some pitfalls, such as requiring platform-specific functionality, or managing conflicts between different libraries intercepting the same API calls. A certain amount of engineering is also required to set up such a solution. An alternative is for the runtime library itself to offer such capabilities in the form of plugins that are often called layers. For instance, the Khronos Vulkan graphics API offers such capabilities.

The latest release of the OpenCL loader introduced a new experimental system of plugins implementing layers. These layers can be used to intercept, introspect, and potentially modify the behavior of one or several OpenCL API functions. The plugins take the shape of simple shared libraries that the loader is asked to load through an environment variable. The system aims at solving the portability issues of classical interception mechanisms, as well as minimizing the amount of code developers need to create a functional layer. As the feature is experimental, we are eagerly awaiting the community’s feedback so that we can improve the system to fit OpenCL users’ and developers’ requirements and use cases. One of this session’s objectives is to facilitate the gathering of such feedback.
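
As a rough illustration of the mechanism described above, here is a sketch of a minimal layer that logs every clCreateBuffer call. It is written against the experimental cl_layer.h interface shipped with the Khronos OpenCL-ICD-Loader; since the feature is experimental, names and details may differ from what the session presents.

// Minimal OpenCL layer sketch: logs every clCreateBuffer call.
// A complete layer also implements clGetLayerInfo (omitted here).
#include <CL/cl_layer.h>
#include <cstdio>

static const struct _cl_icd_dispatch *tdispatch;  // next layer / ICD
static struct _cl_icd_dispatch dispatch;          // our (mostly pass-through) table

static cl_mem CL_API_CALL clCreateBuffer_wrap(cl_context context,
    cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret) {
  std::fprintf(stderr, "[layer] clCreateBuffer(size=%zu)\n", size);
  return tdispatch->clCreateBuffer(context, flags, size, host_ptr, errcode_ret);
}

// Called by the loader to initialise the layer and splice in our table.
CL_API_ENTRY cl_int CL_API_CALL clInitLayer(cl_uint num_entries,
    const struct _cl_icd_dispatch *target_dispatch, cl_uint *num_entries_ret,
    const struct _cl_icd_dispatch **layer_dispatch_ret) {
  (void)num_entries;
  tdispatch = target_dispatch;
  dispatch = *target_dispatch;                    // start as pure pass-through
  dispatch.clCreateBuffer = clCreateBuffer_wrap;  // override one entry point
  *layer_dispatch_ret = &dispatch;
  *num_entries_ret = sizeof(dispatch) / sizeof(void *);
  return CL_SUCCESS;
}

The resulting shared library is then activated by pointing the loader at it, e.g. OPENCL_LAYERS=/path/to/libmylayer.so ./app (environment variable name as used by the loader’s experimental implementation at the time of writing).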

This session will consist of two parts. The first part will be a presentation of the layering system implemented in the OpenCL loader, followed by a Q&A session. The second part will be a demonstration of layer usage and development, followed by a Q&A session.

Presentation outline:
– Layers vs Interception
– Original OpenCL loader API call workflow
– OpenCL loader with layers API call workflow
– How the new OpenCL loader implements layers
– Presentation of the loader layers configuration options
– Presentation of the layers plugin API
– Discussion of the limitations of the current experimental layer implementation

Q&A Session

Demonstration Outline:
– Demonstration of the use of existing layers and the required loader configuration
– Deep dive into the anatomy of a simple layer
– Presentation of the resources available to developers

Q&A Session

[1] https://github.com/intel/opencl-intercept-layer

OpenCL  | Paper ID: 103

Live Panel Discussions

Registered delegates will be sent instructions for joining these live panel discussions and the associated Slack channel.

OpenCL Panel Discussion

Panel Chair: Simon McIntosh-Smith, University of Bristol

Date: Wednesday 28 April
Start time: 16:00 BST,  17:00 CET, 08:00 PST  |  Duration: 60 mins
on-demand video | discuss on slack: #discuss_panel_opencl
  • Neil Trevett, Khronos and NVIDIA
  • Jesse Natalie, Microsoft
  • Paul Miller, Boris FX
  • Ben Ashbaugh, Intel
  • Kevin Petit, Arm
  • Alastair Murray, Codeplay
OpenCL  |  Paper ID: P1  |

SYCL Panel Discussion

Panel Chair: Simon McIntosh-Smith, University of Bristol

Date: Thursday 29 April
Start time: 16:00 BST,  17:00 CET, 08:00 PST  |  Duration: 60 mins
on-demand video  | discuss on slack: #discuss_panel_sycl
  • Michael Wong, Codeplay
  • Ronan Keryell, Xilinx
  • Aksel Alpay, University of Heidelberg
  • James Reinders, Intel
  • Tom Deakin, University of Bristol
  • Hal Finkel, U.S. Department of Energy
  • Peter Thoman, University of Innsbruck
SYCL  |  Paper ID: P2  |

Khronos Advisory Panel Meetings – OpenCL and SYCL

These meetings are for members of the OpenCL and SYCL advisory panels.  Existing members will be sent call-in instructions directly by the Khronos Group.  See below for details on becoming an Advisor.

Advisory Panel Meeting – OpenCL

Date: Monday 26 April
Start time: 16:00 BST,  17:00 CET, 08:00 PST
Members only
Call-in instructions will be sent to members by the Khronos Group

SYCL Advisory Panel Meeting

Date: Tuesday 27 April
Start time: 16:00 BST,  17:00 CET, 08:00 PST
Members only 
Call-in instructions will be sent to members by the Khronos Group

An Invitation to Become an OpenCL or SYCL Advisor

Khronos understands that to best meet the needs of the industry, standards such as OpenCL and SYCL should incorporate the requirements and feedback of industry experts. Consequently, Khronos has established Advisory Panels to serve as a forum where these experts can engage in a bi-directional discourse with the corresponding Working Group (WG): communicating their requirements, providing feedback on draft specifications, consuming WG materials offline, helping prioritize work on features and, among other things, participating in shaping the ecosystem.

There is no cost to join an advisory panel.  If you are interested in finding out more and would like to apply, please visit https://www.khronos.org/advisors and then email memberservices@khronos.org with your contact details and a short summary of your involvement with OpenCL and/or SYCL.

Please note that each working group can only support a limited number of Advisors.  The meetings taking place during IWOCL/SYCLcon represent the first official meetings of the OpenCL and SYCL Advisory Panels and will be open to members only.  Members will be sent call-in instructions directly by the Khronos Group.

Research Papers and Technical Talks

Registered delegates will be able to view the on-demand video presentations of these sessions from Wednesday 28th April.   All our speakers will be asked to check their Slack channel regularly to answer any questions coming in from the community.

KEYNOTE
SYCL, DPC++, XPUs, oneAPI – a View from Intel

Speaker: James Reinders, Intel.

on-demand video | slides | Discuss on slack: #discuss_talks_sycl  | SYCL |  Paper ID: K01  |
James will share his passion for getting to a world of heterogeneous computing where software tooling (compilers, frameworks, libraries, etc.) has an “XPU view” of the world that spans vendors and devices. In this world, James advocates that we all be free to write our programs to use whatever XPUs we want, get full access to all XPU capabilities, and be comfortable trusting our ability to do this without extra risk to performance or stability. James will discuss how SYCL, DPC++, XPUs, and oneAPI are all important on our journey to make this vision a reality. James invites all conference attendees to join in and help guide Intel’s enthusiasm to help us all succeed together. Note: James co-authored the first (and, for now, only) book that teaches SYCL 2020 programming.

INVITED TALK
An Insight Into Kalray’s OpenCL Optimum Implementation

Speaker: Sebastien Le Duc, Software Engineering Director, Kalray

on-demand video | slides | discuss on slack: #discuss_talks_opencl  |  OpenCL  |  Paper ID: K02  |

Abstract

Kalray is proud to have officially reached OpenCL™ conformance at the close of 2020 for Coolidge™, Kalray’s 3rd generation of MPPA® (Massively Parallel Processor Array) intelligent processor.

Kalray MPPA intelligent processors are a new generation of processors specialized in intelligent data processing from cloud to edge. They are able to capture and analyze massive data flows on the fly, and interact in real time with the outside world. These processors are capable of running demanding AI algorithms and, simultaneously, a wide set of different processing and control tasks such as intensive mathematical algorithms, signal processing, and network or storage software stacks.

Let’s explore why it is important for our users that Coolidge now runs a conformant implementation of OpenCL, and how we offer open standard programming for high performance and flexible applications on manycore processors.

The evolution of high-performance systems is crying out for disruptive hardware architectures and innovative software programming models. The challenges encountered by embedded compute system users reside in the selection of the appropriate hardware technologies and of the programming models for computer vision, neural networks, machine learning and more. It usually requires:

– Re-use of legacy code
– Ease of finding highly qualified engineers
– Flexibility for porting from one hardware architecture to another
– Long term maintenance
– Rapid prototyping up to productization…

Some solution providers offer a proprietary framework and API, while others implement a defined standard API within a fully open framework. Such frameworks need to support high-level interfaces for several types of applications and help users initialize, use and “combine” these applications.

The Deliberate Choice of Open Standards

At Kalray, we are convinced Open Standards answer the requirements set out above. This is why Kalray’s Software Development Kit, AccessCore® SDK, relies massively on Open Standards and why we directly worked with Khronos (https://www.khronos.org) for selecting the most appropriate programming solution for parallel architectures and performance offloading.

We are seeing an explosion of the most demanding applications that require a tremendous range of advanced computing capabilities. The focus so far has been to execute these applications on a dedicated type of architecture, the GPU (initially conceived for graphical demands). As the industry expands its needs for neural network, algebra calculation and computer vision algorithms, more adapted architectures are being developed and used.

Here enters Kalray’s MPPA intelligent processor, which provides high performance for heterogeneous computation while keeping a homogeneous architecture. The challenges, as mentioned above, are for users to be able to re-use already developed applications, to port them, and to evaluate the benefits of our architecture (execution time, latency, power consumption…). In addressing these challenges, we must also minimize the learning ramp-up and the maintenance burden of a new language whilst reducing the need for extensive training.

By adopting an open standard, Kalray makes MPPA adoption easy for developers. They can use legacy code, they know the programming environment, they are not surprised by the configuration capabilities, and they are even familiar with the optimization methods.

With this proof of commitment to the OpenCL conformance of Coolidge, our 3rd-generation MPPA intelligent processor, Kalray ensures that our users can rely on our implementation as much as on those from other major actors in the industry. In addition, as a Khronos member, we are involved in Khronos Working Groups, contributing to the evolution and adoption of these open standards.

This session will describe in detail the implementation choices (mapping of platform, execution and memory models on MPPA) as well as the optimum programming methods and usage of extensions utilizing the full capabilities of the architecture.

Speaker Biography

Sebastien Le Duc is Software Engineering Director at Kalray.  He started his professional career in 1998 at STMicroelectronics, where he worked on compilers for a proprietary VLIW DSP targeted at multimedia applications.  He then joined ST-Ericsson in 2006, where he spent 8 years managing multimedia software development teams. He continued his career back at STMicroelectronics as Lead Software Architect on set-top box products.  With a technical background that ranges from low-level software development to middleware integration, together with his team management and product development experience, Sebastien brings outstanding software engineering leadership to Kalray.

INVITED TALK
OpenCL Working Group –  A State of the Union

Speaker: Neil Trevett, NVIDIA  |   Khronos Group, President and OpenCL Working Group Chair.

on-demand video | slides (updated) | discuss on slack: #discuss_talks_opencl  |  OpenCL  |  Paper ID: K03  |

Abstract

Neil will update the community on all the OpenCL working group’s latest developments with OpenCL, including OpenCL 3.0, plus all the latest news and updates on extensions, the ecosystem, implementations and more.  Always a not-to-be-missed presentation.

Speaker Biography

Neil Trevett is the President of The Khronos Group (and OpenCL working group chair), where he has helped initiate and evolve royalty-free open standards such as OpenCL, OpenVX, NNEF, OpenXR, Vulkan, OpenGL ES, WebGL and glTF, many of which are widely deployed in the graphics and embedded vision industries. At NVIDIA, Neil works to drive and develop the developer ecosystem that enables applications to take advantage of advanced GPU and silicon acceleration.

INVITED TALK
SYCL Working Group –  A State of the Union

Speaker: Michael Wong, Codeplay   |   Khronos Group, SYCL Working Group Chair.

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL  |  Paper ID: K04  |

Abstract

Michael will update the community on all the SYCL working group’s latest developments, including SYCL 2020.  Always a not-to-be-missed presentation.

Speaker Biography

Michael is a Distinguished Engineer at Codeplay, Chairman of the SG19 C++ Standards Committee study group for Machine Learning, and Head of the Canadian Delegation to the C++ Standards Committee.  Michael has rich experience in C++ parallel computing, high-performance computing and machine learning. He has led the development of the C++ heterogeneous programming language standard SYCL for GPGPU application development, and has deep insight into the underlying performance optimization of TensorFlow. His work covers parallel programming, neural networks, computer vision, autonomous driving and other fields. Michael was previously a senior IBM technical expert who led the development of the IBM XL C++ and XL C compilers.

BEST PAPER*
Sylkan: Towards a Vulkan Compute Target Platform for SYCL

Speaker: Peter Thoman, University of Innsbruck (UIBK)

Co-Authors: Daniel Gogl (UIBK); Thomas Fahringer (UIBK)
* The best paper award for 2021 was sponsored by Arm

on-demand video | slides | discuss on slack: #discuss_talks_sycl |  SYCL |  Paper ID: 14  |
SYCL is a modern high-level C++ programming interface which excels at expressing data parallelism for heterogeneous hardware platforms in a programmer-friendly way, and is standardized by the Khronos Group. The latest version of the standard, SYCL 2020, removes the previous dependence of the specification and its implementations on an underlying OpenCL target, opening the door for compliant alternative implementations.

In this paper, we discuss the opportunities and challenges of mapping SYCL to Vulkan, a low-level explicit programming model for GPUs. This includes an analysis of the potential semantic mismatch between each respective standard, as well as approaches to work around some of these issues. Additionally, we present a prototype research implementation of Sylkan, a SYCL compiler and runtime targeting Vulkan.

In order to evaluate our prototype qualitatively and quantitatively, we chose a variety of functional tests as well as three performance benchmarks. For the functional tests, we discuss and categorize the failures of the current prototype, noting which semantic mismatch or missing implementation causes them. For the performance benchmarks, we compare execution times against an OpenCL-based SYCL implementation and a native Vulkan version of each benchmark, on two hardware platforms.

Performance-Portable Distributed k-Nearest Neighbors using Locality-Sensitive Hashing and SYCL

Speaker: Marcel Breyer, University of Stuttgart, IPVS

Co-Authors: Gregor Daiß (University of Stuttgart, IPVS); Dirk Pflüger (University of Stuttgart, IPVS)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Paper ID: 24  |
In the age of data collection, machine learning algorithms have to be able to efficiently cope with vast data sets. This requires scalable algorithms and efficient implementations that can handle heterogeneous hardware. We propose a new, performance-portable implementation of a well-known, robust, and versatile multi-class classification method that supports multiple Graphics Processing Units (GPUs) from different vendors. It is based on a performance-portable implementation of the approximate k-nearest neighbors (k-NN) algorithm in SYCL. The k-NN assigns a class to a data point based on a majority vote of its neighborhood. The naive approach compares a data point x to all other data points in the training data to identify the k nearest ones. However, this has quadratic runtime and is infeasible for large data sets. Therefore, approximate variants have been developed. One such algorithm is Locality-Sensitive Hashing (LSH), which uses hash tables together with locality-sensitive hash functions to reduce the number of data points that have to be examined to compute the k-NN.
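
As a purely illustrative sketch of the LSH idea (not the sycl_lsh implementation), a random-projection hash assigns nearby points to the same bucket with high probability, so a k-NN query only needs to compare exactly against the points sharing its bucket rather than the whole data set:

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One locality-sensitive hash function h(x) = floor((w.x + b) / r):
// points that are close in Euclidean distance are likely to collide.
struct RandomProjectionHash {
  std::vector<float> w;  // random direction
  float b, r;            // random offset in [0, r), bucket width r

  RandomProjectionHash(std::size_t dim, float bucket_width, std::mt19937 &gen)
      : w(dim), r(bucket_width) {
    std::normal_distribution<float> normal(0.0f, 1.0f);
    for (auto &wi : w) wi = normal(gen);
    b = std::uniform_real_distribution<float>(0.0f, r)(gen);
  }

  long operator()(const std::vector<float> &x) const {
    float dot = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) dot += w[i] * x[i];
    return static_cast<long>(std::floor((dot + b) / r));
  }
};

int main() {
  std::mt19937 gen(42);
  RandomProjectionHash h(3, 4.0f, gen);
  std::vector<float> p{1.0f, 2.0f, 3.0f}, q{1.1f, 2.0f, 2.9f};
  // Nearby points usually land in the same bucket:
  return h(p) == h(q) ? 0 : 1;
}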

To the best of our knowledge, there is no distributed LSH version supporting multiple GPUs from different vendors available so far despite the fact that k-NNs are frequently employed. Therefore, we have developed the sycl_lsh library. It provides the first hardware-independent, yet efficient and distributed implementation of the LSH algorithm that is suited for modern supercomputers. The implementation uses C++17 together with SYCL, which is an abstraction layer for OpenCL that allows targeting different hardware with a single implementation. To support large data sets, we utilize multiple GPUs using the Message Passing Interface (MPI) to enable the usage of both shared and distributed memory systems.

We have tested different parameter combinations for two locality-sensitive hash function implementations, which we compare. Our results show that our library can easily scale on multiple GPUs using both hash function types, achieving a nearly optimal parallel speedup of up to 7.6 on 8 GPUs. Furthermore, we demonstrate that the sycl_lsh library supports different SYCL implementations—ComputeCpp, hipSYCL, and oneAPI—to target different hardware architectures without significant performance differences.

Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL

Speaker: Thales Sabino, Codeplay Software

Co-Authors: Mehdi Goli (Codeplay software Ltd)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Paper ID: 26  |
Presented in 1979, BLAS is, to this day, the de facto standard for low-level linear algebra routines. BLAS provides essential linear algebra routines used in various domains such as numerical and scientific computing, weather simulation, computational fluid dynamics and machine learning, and it has been adopted for a broad range of hardware, from HPC to embedded systems and specialized AI accelerators.

While BLAS routines were originally implemented for CPUs, with the emergence of GPGPUs they had to be rewritten to exploit the extensive computational power these devices provide. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. With a wide range of hardware, each with a different memory hierarchy, cache-line size, memory-access patterns, number of registers, and type of memory interconnect, performance portability of BLAS routines across various platforms, while avoiding rewrites of existing code, is a major challenge of the heterogeneous programming world.

Written in the SYCL programming language, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms.

This paper presents the implementation of a parametric tile-based TRSM routine for SYCL-BLAS by employing a highly optimized GEMM routine provided in SYCL-BLAS.
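
To make the idea concrete, here is the core of a tile-based TRSM, sketched in plain C++ for a lower-triangular system L*X = B. The SYCL-BLAS routine parametrises the tile sizes and dispatches the trailing update to its tuned GEMM; this sketch only shows the algorithmic structure, not the paper's implementation.

// Blocked forward substitution: solve L * X = B in place (X overwrites B).
// L is n x n lower triangular, B is n x m, both row-major, tile size t.
// The trailing update "B2 -= L21 * X1" is exactly a GEMM call, which is
// where a tuned GEMM kernel does the heavy lifting.
#include <algorithm>
#include <cstddef>
#include <vector>

void trsm_blocked(const std::vector<double> &L, std::vector<double> &B,
                  std::size_t n, std::size_t m, std::size_t t) {
  for (std::size_t k = 0; k < n; k += t) {
    const std::size_t kb = std::min(t, n - k);
    // 1. Solve the diagonal tile L[k:k+kb, k:k+kb] * X = B[k:k+kb, :]
    //    by simple forward substitution (a small, unblocked TRSM).
    for (std::size_t i = k; i < k + kb; ++i)
      for (std::size_t j = 0; j < m; ++j) {
        double s = B[i * m + j];
        for (std::size_t p = k; p < i; ++p) s -= L[i * n + p] * B[p * m + j];
        B[i * m + j] = s / L[i * n + i];
      }
    // 2. GEMM update of the trailing rows:
    //    B[k+kb:, :] -= L[k+kb:, k:k+kb] * B[k:k+kb, :]
    for (std::size_t i = k + kb; i < n; ++i)
      for (std::size_t j = 0; j < m; ++j) {
        double s = 0.0;
        for (std::size_t p = k; p < k + kb; ++p) s += L[i * n + p] * B[p * m + j];
        B[i * m + j] -= s;
      }
  }
}

Because step 2 dominates the runtime for large matrices, the tile size that maximises GEMM throughput on a given device largely determines TRSM performance, which is why per-device tile tuning pays off.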

Our results show that we can achieve up to 2.6x speedup on an Intel GPU, 7x on an AMD GPU, and up to 3.4x speedup on an Arm GPU compared with the highly optimized CLBlast and clBLAS libraries, by tuning the tile size per device without reimplementing the kernel.

On Measuring the Maturity of SYCL Implementations by Tracking Historical Performance Improvements

Speaker: Wei-Chen Lin, University of Bristol

Co-Authors: Tom Deakin (University of Bristol); Simon McIntosh-Smith (University of Bristol)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Paper ID: 33  |
SYCL is a platform agnostic, single-source, C++ based, parallel programming framework for developing platform independent software for heterogeneous systems.  As an emerging framework, SYCL has been under active development for several years, with multiple implementations available from hardware vendors and others.

A crucial metric for potential adopters is how mature these implementations are; are they still improving rapidly, indicating that the space is still quite immature, or has performance improvement plateaued, potentially indicating a mature market?

We present a historical study of the performance delivered by all major SYCL implementations on a range of supported platforms.
We use existing HPC-style mini-apps written in SYCL, and benchmark these on current and historical revisions of each SYCL implementation, revealing the rate of change of performance improvements over time.

The data indicates that most SYCL implementations are now quite mature, showing rapid performance improvements in the past, slowing to more modest performance improvements more recently.

We also compare the most recent SYCL performance to existing well established frameworks, such as OpenCL and OpenMP.

Experiences Supporting DPC++ in AMReX

Speaker: Sravani Konda, Intel Corporation

Co-Authors: Dunni Aribuki (Intel Corporation); Weiqun Zhang (Lawrence Berkeley National Laboratory); Kevin Gott (National Energy Research Scientific Computing Center); Christopher Lishka (Intel Corporation)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Tech Talk ID: 04  |
AMReX is a software framework for massively parallel, block-structured adaptive mesh refinement (AMR) applications. AMReX is developed as part of the United States Department of Energy’s Exascale Computing Project (ECP). Besides AMR capabilities, AMReX also provides a parallel programming framework for numerous applications, including six ECP projects, and it implements several backends for CPU-GPU heterogeneous computing.

In this talk, we present experiences in supporting DPC++, a language based on the SYCL specification, as a backend for AMReX. We will demonstrate how AMReX provides an abstraction layer for its users so that they can write performance-portable code for a variety of heterogeneous platforms. We will discuss key DPC++ features that allow AMReX to implement these abstractions, and our contributions to the oneAPI specification and Intel’s implementation. We will also highlight some features missing in SYCL/DPC++ that limit its efficiency, and our future plans.
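
For readers unfamiliar with the abstraction layer mentioned above: AMReX user code typically expresses a GPU-capable loop with amrex::ParallelFor, and the framework maps the lambda to whichever backend it was built with. A rough sketch (consult the AMReX documentation for the exact API and required headers):

#include <AMReX.H>
#include <AMReX_Gpu.H>
#include <AMReX_MultiFab.H>

// Add 1 to every cell of a MultiFab. The same lambda is compiled for
// CUDA, HIP, DPC++, or plain CPU, depending on the AMReX build.
void add_one(amrex::MultiFab &mf) {
  for (amrex::MFIter mfi(mf); mfi.isValid(); ++mfi) {
    const amrex::Box &bx = mfi.validbox();
    const amrex::Array4<amrex::Real> a = mf.array(mfi);
    amrex::ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k) {
      a(i, j, k) += 1.0;
    });
  }
}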

Developing a Medical Imaging Application Across GPU, FPGA, and CPU Using oneAPI

Speaker: Wang Yong, Intel

Co-Authors: Zhou Yongfa (Intel); Wang Scott (Intel); Yang Wang (Intel Corporation); Xu Qing (Intel); Wang Chen (Intel)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Tech Talk ID: 11  |
Diagnostic ultrasound is a rapidly developing imaging technology which is widely used in the clinic. A typical ultrasound imaging pipeline includes the following algorithms: beamforming, envelope detection, log compression, and scan conversion. Traditionally, ultrasound imaging has been implemented using application-specific integrated circuits (ASICs) and FPGAs due to its high-throughput and massive data-processing requirements. With the development of GPGPUs and their programming environments (e.g. CUDA), researchers have used software to implement ultrasound imaging algorithms.

For now, the two limiting factors in developing ultrasound imaging are these: First, using a hardware development approach to implement ultrasound imaging algorithms is complex, time-consuming, and lacks flexibility. Second, the existing CUDA-based ultrasound imaging implementations are limited to Nvidia hardware, which also restricts the adoption of other architectures.

oneAPI is a SYCL-based programming environment developed by Intel. It enables heterogeneous computing across multiple hardware architectures using Data Parallel C++ (DPC++). This new programming suite can be used to address the problems mentioned above: using a high-level language like DPC++ to program FPGAs can accelerate ultrasound imaging application development, and SYCL-based ultrasound imaging applications can be easily migrated to other vendors’ hardware.

To implement an ultrasound imaging application across multiple architectures (e.g., GPU, FPGA, and CPU) in a unified programming environment, we migrated SUPRA, a CUDA-based open-source ultrasound imaging project. The migration was performed using the oneAPI compatibility tool (dpct). After migration, the code was tuned to run on GPU, FPGA, and CPU.
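
As a rough illustration of the dpct step (the path and file name here are illustrative, not SUPRA’s actual build commands), a migration invocation looks like:

dpct --in-root=./src --out-root=./migrated ./src/beamforming.cu

which emits a DPC++ translation (beamforming.dp.cpp, in dpct’s default naming) that is then reviewed and tuned by hand, as the talk describes.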

In this talk, we will discuss our experiences with the complete process of migrating a CUDA code base to oneAPI. First, the whole process of migrating the CUDA code base using dpct will be presented, including usage, code modification, API comparison, and build instructions. Second, the computational characteristics of the ultrasound imaging algorithms will be analyzed, and we will show how to optimize the application on Intel GPUs, including ESIMD usage. Third, the early experiences of tuning the migrated code to target FPGAs will be highlighted; this will include rewriting device code for FPGA and programming techniques to improve FPGA performance. The device code for GPU and FPGA will also be compared. Last, we will compare the performance and computation results of the ultrasound imaging algorithms on different hardware, including Intel GPUs (integrated and discrete), the Intel Arria 10 FPGA, Intel CPUs, and Nvidia GTX 1080 and GTX 960M GPUs.

hipSYCL in 2021: Peculiarities, Unique Features and SYCL 2020

Speaker: Aksel Alpay, Heidelberg University

Co-Authors: Vincent Heuveline (Heidelberg University)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Tech Talk ID: 23  |
The current SYCL implementation ecosystem contains four major SYCL implementations: DPC++, mainly driven by Intel; Codeplay‘s ComputeCpp; triSYCL, which is led by Xilinx; and our implementation, hipSYCL. hipSYCL has made significant progress in the past year with an entirely rewritten runtime, performance improvements, and the introduction of key SYCL 2020 features such as unified shared memory (USM) and reductions, and it is therefore the subject of increasing attention.

In this talk we will discuss consequences of the hipSYCL design and implications for the development of software with hipSYCL. We will discuss and explain common performance pitfalls, how to circumvent them, and will discuss peculiarities of hipSYCL that open up new approaches. This includes recent hipSYCL extensions such as hipSYCL‘s unique scoped parallelism execution model for increased performance portability, the custom operations extension for efficiently submitting backend interoperability operations, and command group properties as well as the initial implementation of an interoperability infrastructure between buffers and SYCL 2020 USM pointers. Where applicable we will also provide performance measurements to demonstrate the impact of these aspects.
We will also provide an update on the SYCL 2020 implementation status, and discuss the performance and functionality of implemented key SYCL 2020 features, such as unified shared memory (USM), group algorithms and more.
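
As a reminder of what the USM and reduction features mentioned above look like in SYCL 2020 code (a generic sketch, not hipSYCL-specific; it requires a device with USM support):

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  constexpr size_t N = 1024;

  // Unified shared memory: a pointer usable on both host and device.
  int *data = sycl::malloc_shared<int>(N, q);
  int *sum  = sycl::malloc_shared<int>(1, q);
  for (size_t i = 0; i < N; ++i) data[i] = 1;
  *sum = 0;

  // SYCL 2020 reduction: each work-item contributes to a single scalar.
  q.parallel_for(sycl::range<1>(N),
                 sycl::reduction(sum, sycl::plus<int>()),
                 [=](sycl::id<1> i, auto &acc) { acc += data[i]; })
   .wait();

  std::cout << *sum << "\n";  // prints 1024
  sycl::free(data, q);
  sycl::free(sum, q);
}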

Can SYCL and OpenCL Meet the Challenges of Functional Safety?

Speaker: Illya Rudkin, Codeplay Software

Co-Authors: Rod Burns (Codeplay Software)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Paper ID: 20  |
Open standards are being looked at by the automotive domain as an attractive alternative to proprietary solutions for enabling sensor-fusion systems in affordable mass-market vehicles. The open standard specifications for SYCL, OpenCL and Vulkan were not originally designed with safety in mind, yet they could be at the centre of tomorrow’s highly critical systems in a vehicle.

Open standards are being used in many areas at the leading edge of HPC and vision technologies, but automotive functional-safety standards were designed with older technologies in mind. Modern programming constructs, system topologies and paradigm-shifting technologies such as AI were not considered when the safety standards were created. Today’s developers are using development processes that are causing a seismic shift in how the automotive industry does business.

In 2006 the software component of a car made up about 50% of the development effort; today it is the majority. The amount of software continues to grow exponentially, driven by cheaper and more powerful hardware and demand for innovative new functionality. The growth of software, and the importance of how it operates, brings numerous challenges, including how the car operates safely in any circumstance. Over the years there have been many news articles on cars that have malfunctioned due to human error. The automotive industry has responded with the publication of the ISO 26262 functional safety standard, now in its second edition.

The automotive industry needs to find a balance between the established safety standards it must meet and the need to use emerging commercially developed technologies to differentiate and sell more vehicles. Automotive tier suppliers are having to look at off-the-shelf solutions, the CoS model, and partner with technology companies with promising solutions. However, these are companies that do not have the same level of safety experience. This approach has worked for a few companies with deep pockets, putting technological wonders into high-end luxury cars, but it will not work for all automotive companies. This means some are looking for alternative ways to develop future ADAS systems, in particular those that will be mandated by law.
Open standards like SYCL and OpenCL are seen as a viable approach for OEMs to lower costs while still having access to a large technical knowledge base. Code can be more easily ported between architectures, developers can take advantage of a wide range of pre-existing libraries and frameworks, and crucially these standards are defined by the industry.

In cars we can find all the interesting problems and challenging issues of software and systems engineering. In this context, this presentation will give an overview of:

+ A brief history of autonomous vehicles
+ The functional safety standard ISO 26262
+ How functional safety affects applications using the SYCL stack
+ Example architecture of an automotive platform
+ Features of SYCL that support or challenge ISO 26262 requirements

Experiences with Adding SYCL Support to GROMACS

Speaker: Andrey Alekseenko, KTH Royal Institute of Technology

Co-Authors: Szilárd Páll (KTH Royal Institute of Technology); Erik Lindahl (KTH Royal Institute of Technology)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Tech Talk ID: 22  |
GROMACS is an open-source, high-performance molecular dynamics (MD) package primarily used for biomolecular simulations, accounting for ~5% of HPC utilization worldwide. Due to the extreme computing needs of MD, significant efforts are invested in improving the performance and scalability of simulations. Target hardware ranges from supercomputers to laptops of individual researchers and volunteers of distributed computing projects such as Folding@Home. The code has been designed both for portability and performance by explicitly adapting algorithms to SIMD and data-parallel processors. A SIMD intrinsic abstraction layer provides high CPU performance. Explicit GPU acceleration has long used CUDA to target NVIDIA devices and OpenCL for AMD/Intel devices.

In this talk, we discuss the experiences and challenges of adding support for the SYCL platform into the established GROMACS codebase and share experiences and considerations in porting and optimization. While OpenCL offers the benefits of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus is not getting the primary support or tuning efforts. SYCL alleviates many of these issues, employing a single-source model based on the modern C++ standard. In addition to being the primary platform for Intel GPUs, the possibility to target AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained.

Some design differences from OpenCL, such as directed acyclic task graphs (DAGs) instead of in-order queues, made it necessary to reconsider GROMACS’s task-scheduling approach and the architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents a challenge of balancing performance (low-level and hardware-specific code) and maintainability (more generalization and code reuse). We will discuss the limitations of the existing codebase and interoperability layers with regard to adding the new platform; compute performance and latency comparisons; code quality considerations; and the issues we encountered with the SYCL implementations tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of the GPU acceleration code in GROMACS.

Extending DPC++ with Support for Huawei Ascend AI Chipset

Speaker: Rasool Maghareh, Huawei Heterogeneous Compiler Lab

Co-Authors: Wilson Feng (Huawei Heterogeneous Compiler Lab); Kai-Ting Amy Wang (Huawei Heterogeneous Compiler Lab)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Tech Talk ID: 15  |
Heterogeneous computing has emerged as an important method for supporting more than one kind of processor or accelerator in a program. The Khronos SYCL standard defines an abstract programming model for heterogeneous computing. The oneAPI specification, and at its core the DPC++ programming language, are built on top of the SYCL standard. In this presentation, we will review the implementation steps taken to add support for the Huawei Ascend AI Chipset to DPC++.

The Ascend AI chipset is Huawei’s ASIC hardware dedicated to artificial intelligence workloads. Similar to Nvidia’s CUDA programming model, the Ascend AI chip exposes a host-device programming model to users. The host code can be written in generic C++. The device is programmed in a C-based variant with special Single-Instruction-Multiple-Data (SIMD) extensions: the CCE language. Our main contribution has been adding the CCE backend to DPC++ (presented in the supporting material). This backend enables DPC++ code to be compiled for and executed on Ascend. Since the runtime and compilation steps for Ascend are similar to CUDA’s, we implemented the CCE backend along the lines of the CUDA backend in DPC++.

The main parts of the CCE backend are the CCE Runtime Plugin Interface (CCE PI) and the CCE toolchain.

The CCE PI registers a HiIPU device. This device represents the Ascend AI Chipset, and a kernel can be passed to it using the HiIPU selector class. In DPC++, a set of high-level plugin APIs is defined which controls the runtime behaviour of a DPC++ program via a plugin interface. The CCE PI is similar to the CUDA PI in many aspects, except that the API calls are lowered to the Ascend runtime API.
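
In the SYCL 1.2.1-era API that DPC++ extended at the time, directing work to such a device is done through a device selector. The Huawei class itself is not public; the following is only a hypothetical sketch of what a HiIPU-style selector could look like:

#include <CL/sycl.hpp>
#include <string>

// Hypothetical selector favouring a device whose name mentions "HiIPU".
// Illustrative only: written against the SYCL 1.2.1 device_selector API.
class hiipu_selector : public cl::sycl::device_selector {
public:
  int operator()(const cl::sycl::device &dev) const override {
    const std::string name = dev.get_info<cl::sycl::info::device::name>();
    // Positive score selects the device; -1 rejects it.
    return name.find("HiIPU") != std::string::npos ? 100 : -1;
  }
};

// Usage: cl::sycl::queue q{hiipu_selector{}};  // kernels submitted to q
//        now run on the registered HiIPU device.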

The CCE toolchain refers to the set of derived C++ classes based on the clang driver framework, clang::driver::ToolChain. This extension contains the necessary metadata to orchestrate and pass options to other external tools. For example, this includes our proprietary LLVM-based compiler, CCEC, which compiles the device code down to a native binary. The compilation flow for the Ascend AI Chipset is presented in the supporting material.

Adapting a SIMT programming model to the Ascend architecture is a challenging task, as the underlying vector and cube units within an AICORE are SIMD by design. One important component of the toolchain is the converter pass, which addresses this issue. The converter pass receives the LLVM IR of the device code and generates a hybrid script that is consumed by AKG to produce scheduled, vectorized AICORE device code. We elaborate further on the converter pass in the supporting material.

Many of our colleagues have helped with the implementation of this project. The authors acknowledge the direct and indirect contributions and support of all Huawei colleagues involved.

Toward a Better Defined SYCL Memory Consistency Model

Speaker: Ben Ashbaugh, Intel Corporation

Co-Authors: James C Brodman (Intel Corporation); Michael Kinsner (Intel Corporation); Gregory Lueck (Intel Corporation); John Pennycook (Intel Corporation); Roland Schulz (Intel Corporation)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  SYCL |  Tech Talk ID: 28  |
A memory consistency model is a key component of a parallel programming model that describes guaranteed behavior for applications and valid optimizations for implementers. The SYCL 2020 specification took a step forward by adopting the atomic_ref syntax from the C++20 specification and concepts similar to memory scopes from the OpenCL 2.0 specification, but efforts to formalize the SYCL memory model are ongoing and have been deferred to a future specification.

In this technical presentation we will:

• Summarize the guarantees and several unexpected non-guarantees that are provided by the memory model in the SYCL 2020 specification, using accessible language and examples.
• Describe related memory models from other parallel programming models that could inform and influence the SYCL memory model, including the C++, OpenCL 2.0, and Vulkan memory models.
• Describe features unique to the SYCL specification that will need to be included in the SYCL memory model, such as unified shared memory, which introduce challenges that haven’t been solved in existing memory models.

We will close the technical presentation with a call for participation in helping to refine and formalize the SYCL memory model for future versions of the specification.
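
For readers unfamiliar with the syntax adopted from C++20, here is a small sketch of the SYCL 2020 atomic_ref with the memory order and scope concepts this talk examines (the counter is assumed to point to device-accessible USM):

#include <sycl/sycl.hpp>

// Each work-item increments a shared counter with a relaxed,
// device-scope atomic; order and scope are explicit template arguments.
void count(sycl::queue &q, int *counter, size_t n) {
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1>) {
    sycl::atomic_ref<int,
                     sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space>
        ref(*counter);
    ref.fetch_add(1);
  }).wait();
}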

Enabling OpenCL and SYCL for RISC-V Processors

Speaker: Colin Davidson, Codeplay Software

Co-Authors: Rod Burns (Codeplay Software); Aidan Dodds (Codeplay Software)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  OPENCL + SYCL |  Paper ID: 19  |
Today, system-on-chip manufacturers are building specialist accelerator processors based on the RISC-V architecture, taking advantage of the RISC-V Vector (RVV) extension to reach compute performance mostly seen on GPUs today. The availability of a familiar and well-defined programming model is an absolute requirement for successfully bringing these new processors to market.

This presentation will describe the components needed to integrate OpenCL and SYCL onto a RISC-V Vector solution using multiple simulators. While Codeplay has previously enabled OpenCL for a variety of processor architectures, there are a number of technical challenges involved in delivering a generic integration that can be rapidly used by multiple RVV-based systems, a solution that requires a change in development approach. By adding to the existing LLVM back-end for RISC-V, and creating an integration layer that plugs into OpenCL, we have built a common base architecture for a variety of RISC-V processors.

This presentation will explain how Codeplay’s current driver interface works, and how it has been adapted to integrate with multiple RISC-V targets, in particular the Spike RISC-V ISA simulator. We will also talk about some of the RISC-V Vector extensions that are available, and how these can help to expose features specific to the RISC-V architecture through OpenCL.

RISC-V is a non-profit, member-managed organization and is gaining momentum in the processor space, with more than 900 members. One of the goals of the organization is to build an open software platform, providing software developers an easy way to harness the familiar benefits already available on CPUs and GPUs.

Profiling Heterogeneous Computing Performance with VTune Profiler

Speaker: Vladimir Tsymbal, Intel

Co-Authors: Alexandr Kurylev (Intel)

on-demand video | slides | discuss on slack: #discuss_talks_sycl  |  OPENCL + SYCL |  Tech Talk ID: 09  |
Programming heterogeneous platforms requires a deep understanding of system architecture at all levels, which helps application designers choose the best data and work decomposition between the CPU and accelerating hardware such as GPUs. However, in many cases applications are converted from a conventional CPU programming language like C++, or from accelerator-friendly but still low-level languages like OpenCL, and the main problem is to determine which parts of the application would benefit from being offloaded to a GPU. Another problem is estimating how much performance one might gain by accelerating on a particular GPGPU device. Each platform has unique limitations that affect the performance of offloaded compute tasks, e.g. data-transfer overhead, task-initialization overhead, and memory latency and bandwidth limitations. To take these constraints into account, software developers need tooling that collects the right information and produces recommendations for making the best design and optimization decisions.

In this presentation we will introduce two new GPU performance analysis types in Intel VTune Profiler, and a methodology for heterogeneous application performance profiling supported by these analyses. VTune Profiler is a well-known tool for performance characterization on CPUs; it now includes GPU Offload Analysis and GPU Hotspots Analysis for applications written in most offloading models: OpenCL, SYCL/Data Parallel C++, and OpenMP Offload.
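
Assuming the command-line collection names that match these analyses (worth verifying against your VTune version), typical invocations look like:

vtune -collect gpu-offload -- ./my_app
vtune -collect gpu-hotspots -- ./my_app

where ./my_app is a placeholder for your own binary; the first analysis characterizes host/device work split and transfer costs, the second drills into kernel execution on the GPU.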

FAST: A Framework for High-Performance Medical Image Computing and Visualization

Speaker: Erik Smistad, Norwegian University of Science and Technology and SINTEF

on-demand video | slides | discuss on slack: #discuss_talks_opencl  |  OpenCL  |  Tech Talk ID: 17  |
Medical image processing and visualization is often computationally demanding. Ultrasound images are acquired in real time and need to be processed at a high framerate with low latency. Computed tomography (CT) and magnetic resonance imaging (MRI) create large three-dimensional volumes with common sizes of 512x512x800 voxels. In digital pathology, whole-slide microscopy images can have an extreme size of up to 200k x 100k pixels, which does not even fit into the memory of most computers. Thus, there is a need for smart data storage, processing and visualization methods to deal with medical image data.

The development of FAST started in 2014; the goal was to create an open-source framework which made GPU and parallel processing of medical images easy and portable. While there existed popular image processing libraries such as the Visualization Toolkit (VTK), the Insight Toolkit (ITK) and OpenCV, their GPU processing capabilities were still implemented ad hoc and often implied copying data back and forth between the GPU and CPU. Thus it was decided to use the then-new OpenCL API to create a cross-platform framework designed bottom-up with GPU processing at its very core. One of the design goals was to remove from the developer the burden of moving data back and forth between different processors and memory spaces. Instead, the developer requests access to the data on a given processor, and FAST will copy and update data as needed. Now, seven years later, FAST version 3.2 has been released; it still uses OpenCL 1.2 and OpenGL 3.3 at the core of almost all of its operations. FAST can stream images in real time from ultrasound scanners, web cameras and Intel’s RealSense depth camera, and can read many different formats from disk, including medical formats such as DICOM, MetaImage and huge microscopy images stored as tiled image pyramids.

FAST uses a processing pipeline concept, meaning that you first define a pipeline as multiple processing and visualization steps, then initiate the processing by executing the pipeline. FAST pipelines can be created with C++ or Python, or even without programming, using simple text files.

In the last five years or so, deep neural networks have become the standard for almost all image processing tasks. Many high-performance frameworks for deep neural network inference already exist, but they have very different APIs and use different formats for storing neural network models. FAST now provides a common API for neural networks with multiple backends, such as NVIDIA’s TensorRT, Intel’s OpenVINO and Google’s TensorFlow. This removes the burden on the user of learning the API of every inference library, and makes neural network inference as simple as loading a model stored on disk.

This presentation will cover the FAST framework: the idea and goals behind it, how it is designed, the challenges faced in its making, and the plans for the future. FAST is open-source, and we invite the community to contribute through GitHub at https://github.com/smistad/FAST

Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs

Speaker: Douglas Doerfler, Lawrence Berkeley National Laboratory

Co-Authors: Farzad Fatollahi-Fard (Lawrence Berkeley National Laboratory); Colin MacLean (Lawrence Berkeley National Laboratory); Tan Nguyen (Lawrence Berkeley National Laboratory); Samuel Williams (Lawrence Berkeley National Laboratory); Nicholas Wright (Lawrence Berkeley National Laboratory); Marco Siracusa (DEIB/Politecnico di Milano)
on-demand video | slides | discuss on slack: #discuss_talks_opencl  |  OpenCL  |  Paper ID: 02  |
In this study we investigate the implications of porting a common computational kernel used in high-performance computing, one which has been optimized for efficient execution on general-purpose graphics processing units (GPUs), to a field-programmable gate array (FPGA). In particular, we use a benchmark based on a matrix-matrix multiply kernel commonly used in lattice quantum chromodynamics applications. The microbenchmark is based on the OpenCL programming language. We evaluate the performance and portability aspects associated with two FPGAs, the Intel Arria 10 and the Xilinx Alveo U280. The purpose of the study is not to compare the two FPGAs, but to evaluate their respective OpenCL toolchains: the level of effort needed to port a GPU-optimized code to an FPGA, and the effectiveness of each toolchain. We found the toolchains relatively easy to use, and it was possible to achieve correctness with little effort, but significant effort was needed to get relatively good performance. We found that FPGAs perform best when using single work-item kernels, as opposed to the nominal multiple work-item NDRange kernels used for CPUs and GPUs. In addition, other source code changes were necessary; in particular, the lack of a local cache in FPGA architectures can require a significant rewrite of the code. The performance achieved with the Intel Arria 10 was 47.6% of its maximum sustained bandwidth, while the Xilinx Alveo U280 achieved 35.2%. GPU architectures have been shown to demonstrate 75% to 90% architectural efficiency.
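
To illustrate the single work-item point with a generic sketch (not the SU3_Bench code): instead of an NDRange kernel where each work-item handles one element, an FPGA-friendly kernel expresses the whole loop in a single work-item so the compiler can turn it into a deep pipeline. Both variants below are plain OpenCL C, shown as raw strings the way they would appear in C++ host code.

// NDRange style: one element per work-item (typical for CPUs/GPUs).
static const char *ndrange_kernel = R"CLC(
__kernel void scale(__global const float *in, __global float *out, float a) {
  size_t i = get_global_id(0);
  out[i] = a * in[i];
}
)CLC";

// Single work-item style: one work-item runs the whole loop, which FPGA
// toolchains can pipeline (launched with a 1x1 NDRange or clEnqueueTask).
static const char *swi_kernel = R"CLC(
__kernel void scale_swi(__global const float *in, __global float *out,
                        float a, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = a * in[i];
}
)CLC";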

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis

Speaker: Devon Callanan, University of Pittsburgh

Co-Authors: Luke Kljucaric (University of Pittsburgh); Alan George (NSF Center for High Performance Reconfigurable Computing)
on-demand video | slides | discuss on slack: #discuss_talks_opencl  |  OpenCL  |  Paper ID: 08  |
The importance of security infrastructures for high-throughput networks has rapidly grown as a result of expanding internet traffic and increasingly high-bandwidth connections. Intrusion-detection systems (IDSs) such as SNORT rely upon rule sets designed to alert system administrators of malicious packets. Such deep-packet inspection, which depends upon regular-expression searches, can be accelerated on programmable-logic (PL) architectures using non-deterministic finite automata (NFAs). Prior designs have relied upon register-transfer level (RTL) design descriptions and achieved efficient resource utilization through fine-grained optimizations. New advances made by field-programmable gate array (FPGA) vendors have led to more powerful compiler toolchains for OpenCL that allow for rapid development on PL architectures while generating competitive designs in terms of performance. The goal of this research is to evaluate performance differences between a custom, OpenCL-based, acceleration architecture for regular expressions and comparable RTL designs. The simplicity of the application, which requires only basic hardware building blocks, adds to the novelty of the comparison. In contrast to RTL-based solutions, which show frequency degradation with bandwidth scaling, our approach is able to maintain stable and high operating frequencies at the cost of resource usage. By scaling input bandwidth with multi-character transformations, throughput in excess of 17 Gbps can be achieved on Intel’s Arria 10 Programmable Acceleration Card, outperforming similar designs with RTL as reported in the literature.
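
As a minimal software analogue of the NFA construction described above (illustrative only, not the paper's hardware design): the set of active NFA states is a bit-vector, and each input character advances all states in parallel, which is what the programmable-logic implementation does in one clock cycle.

#include <cstdint>
#include <string>

// Tiny hard-coded NFA for scanning input for the pattern "ab*c".
// States: bit 0 = start, bit 1 = seen 'a' (b* loop), accept on 'c'.
bool contains_abstar_c(const std::string &input) {
  uint32_t active = 1u << 0;  // one-hot set of active states
  for (char ch : input) {
    uint32_t next = 1u << 0;  // a new match may start at any offset,
                              // as required when scanning packet payloads
    if ((active & (1u << 0)) && ch == 'a') next |= 1u << 1;
    if ((active & (1u << 1)) && ch == 'b') next |= 1u << 1;  // b* self-loop
    if ((active & (1u << 1)) && ch == 'c') return true;      // accept reached
    active = next;
  }
  return false;
}

The multi-character transformations mentioned in the abstract generalize this step so that several input characters are consumed per cycle, trading resource usage for throughput.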

Performance Evaluation and Improvements of the PoCL Open-Source OpenCL Implementation on Intel CPUs

Speaker: Tobias Baumann, Zuse Institute Berlin

Co-Authors: Thomas Steinke (Zuse Institute Berlin); Matthias Noack (Zuse Institute Berlin)
on-demand video | slides | discuss on slack: #discuss_talks_opencl  |  OpenCL  |  Paper ID: 30  |
The Portable Computing Language (PoCL) is a vendor independent open-source OpenCL implementation that aims to support a variety of compute devices in a single platform. Evaluating PoCL versus the Intel OpenCL implementation reveals significant performance drawbacks of PoCL on Intel CPUs – which run 92% of the TOP500 list. Using a selection of benchmarks, we identify and analyse performance issues in PoCL with a focus on scheduling and vectorisation. We propose a new CPU device-driver based on Intel Threading Building Blocks (TBB), and evaluate LLVM with respect to automatic compiler vectorisation across work-items in PoCL. Using the TBB driver, it is possible to narrow the gap to Intel OpenCL and even outperform it by a factor of up to 1.3× in our proxy application benchmark with a manual vectorisation strategy.

Towards Evaluating High-Level Synthesis Portability and Performance Between Intel and Xilinx FPGAs

Speaker: Anthony M Cabrera, Oak Ridge National Laboratory

Co-Authors: Aaron Young (Oak Ridge National Laboratory); Jacob Lambert (University of Oregon); Zhili Xiao (Washington University in St. Louis); Amy An (Washington University in St. Louis); Seyong Lee (ORNL); Zheming Jin (ORNL); Jungwon Kim (Oak Ridge National Laboratory); Jeremy Buhler (Washington University in St. Louis); Roger Chamberlain (Washington University in St. Louis); Jeffrey Vetter (Oak Ridge National Laboratory)
on-demand video | slides | discuss on slack: #discuss_talks_opencl  |  OpenCL  |  Paper ID: 31  |
Offloading computation from a CPU to a hardware accelerator is becoming a more common solution for improving performance as the traditional gains enabled by Moore's Law and Dennard scaling have slowed. GPUs are often used as hardware accelerators, but FPGAs are gaining traction. FPGAs are beneficial because they allow for the creation of hardware specific to a particular application. However, they are notoriously difficult to program. To this end, two of the major FPGA manufacturers, Intel and Xilinx, have created tools and frameworks that enable the use of higher-level languages to design FPGA hardware. Though Xilinx kernels can be designed using C/C++, both Intel and Xilinx support using OpenCL C to architect FPGA hardware. However, not much is known about the portability and performance between these two device families, other than the fact that it is theoretically possible to synthesize a kernel meant for Intel to Xilinx and vice versa.

In this work, we evaluate the portability and performance of Intel and Xilinx kernels. We use OpenCL C implementations of a subset of the Rodinia benchmarking suite that were designed for an Intel FPGA and make the necessary modifications to create synthesizable OpenCL C kernels for a Xilinx FPGA. We find that the difficulty of porting certain kernel optimizations varies depending on the construct. Once the minimum set of modifications is made to create synthesizable hardware for the Xilinx platform, though, more non-trivial work is necessary to improve performance. However, we find that constructs known to be performant for an FPGA should improve performance regardless of the platform; the difficulty comes in deciding how to invoke certain kernel optimizations while abiding by the constraints enforced by a given platform's hardware compiler.
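As a hedged example of such a construct-level difference (ours, not taken from the paper), the same loop-unrolling intent is typically expressed with a different annotation on each platform:

```cpp
// Intel FPGA OpenCL style: pragma-based loop unrolling.
__kernel void sum_intel(__global const float* x, __global float* out, int n) {
  float acc = 0.0f;
  #pragma unroll 8
  for (int i = 0; i < n; ++i)
    acc += x[i];
  *out = acc;
}

// Xilinx Vitis style: the standard OpenCL C unroll-hint attribute.
__kernel void sum_xilinx(__global const float* x, __global float* out, int n) {
  float acc = 0.0f;
  __attribute__((opencl_unroll_hint(8)))
  for (int i = 0; i < n; ++i)
    acc += x[i];
  *out = acc;
}
```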

Executing Graphs with OpenCL

Speaker: Erik Tomusk, Codeplay Software

Co-Authors: Rod Burns (Codeplay Software)
on-demand video | slides | discuss on slack: #discuss_talks_opencl  |  OpenCL  |  Tech Talk ID: 12  |
For several decades, graph and dataflow programming models have been niche topics limited to a small number of highly specialized domains. In recent years, however, the machine learning (ML) revolution and the proliferation of ML libraries have made graph programming accessible to even novice programmers. Before, a beginner programmer may have talked about writing a number-guessing game; today the programmer will describe training an off-the-shelf neural network (a type of graph) for handwriting recognition.

There is growing demand from industry and individual users to run programs that are based on ML graphs. This demand is being met by hardware vendors, who are designing increasingly heterogeneous accelerator devices that can efficiently execute graphs. Since its creation, OpenCL has been a key API for bridging the gap between user applications and accelerator hardware. The question, then, is whether OpenCL is an appropriate API for this new breed of graph software running on these new, highly heterogeneous accelerators. Does OpenCL have the expressive power required to describe graphs to graph accelerator hardware?

In this technical presentation, we will argue that the answer is yes, OpenCL is sufficiently expressive to allow an ML library to describe an execution graph, and it is sufficiently powerful to execute that graph on a graph accelerator. We will use graphs from real applications to demonstrate the possibility of data dependency tracking using OpenCL events and memory buffers. We will show how built-in kernels can be used to simplify scheduling to the device. Where appropriate, the presentation will be supported by lessons learned from Codeplay’s ComputeAorta OpenCL implementation.
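As a minimal host-side sketch of event-based dependency tracking (illustrative only; the kernels and queue are assumed to have been created already), a two-into-one graph node can be expressed as:

```cpp
#include <CL/cl.h>

// Nodes A and B are independent; node C consumes their results.
void run_graph(cl_command_queue q, cl_kernel kernel_a,
               cl_kernel kernel_b, cl_kernel kernel_c) {
  cl_event ev_a, ev_b, ev_c;
  size_t gws = 1024;

  // A and B: no wait list, so they may run concurrently.
  clEnqueueNDRangeKernel(q, kernel_a, 1, nullptr, &gws, nullptr, 0, nullptr, &ev_a);
  clEnqueueNDRangeKernel(q, kernel_b, 1, nullptr, &gws, nullptr, 0, nullptr, &ev_b);

  // C's event wait list encodes the graph edges A->C and B->C.
  cl_event deps[2] = {ev_a, ev_b};
  clEnqueueNDRangeKernel(q, kernel_c, 1, nullptr, &gws, nullptr, 2, deps, &ev_c);

  clWaitForEvents(1, &ev_c);  // the whole graph has now executed
  clReleaseEvent(ev_a); clReleaseEvent(ev_b); clReleaseEvent(ev_c);
}
```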

Posters

Registered delegates will be able to view the on-demand video presentations of these poster sessions from Tuesday 27th April 2021. All our speakers will be asked to check their Slack channel regularly to answer any questions coming in from the community.

Enabling the Use of C++20 Unseq Execution Policy for OpenCL

Speaker: Po-Yao Chang, National Tsing Hua University

Co-Authors: Tai-Liang Chen (National Tsing Hua University); Jenq-Kuen Lee (National Tsing Hua University)

on-demand video | poster | discuss on slack: #discuss_posters  |  OpenCL  |  Poster ID: 05  |
The C++ community has been prospering recently, with a major standard revision published every three years and standards-committee meetings drawing more attendees than ever. In the hope of bringing the worlds of C++ and OpenCL together, a new OpenCL kernel language, C++ for OpenCL, allows us to write kernel code in modern C++ (specifically C++17, with some limitations) without breaking backward compatibility with OpenCL C. However, the C++ standard libraries, which make up a substantial part of the standard, are not supported, and certainly not the parallel algorithms introduced in C++17 and C++20. This is quite discouraging for C++ practitioners and enthusiasts stepping into the realm of OpenCL.

This poster is an attempt to make some C++ standard-library functions work with execution::unseq, a global object of type execution::unsequenced_policy as defined in the C++20 standard [execpol.objects] and [execpol.unseq]. Functions taking execution::unseq as their first parameter are expected to be executed as if vectorized, whereas functions without an execution-policy parameter, or taking execution::seq, are executed as normal (sequentially). Inspired by OpenCL vectors, this work selects some functions from the C++ standard library and implements versions of them that take execution-policy objects as their first parameter. Inside those functions, Clang compiler directives are added to vectorize the loops where applicable; the resulting LLVM IR looks the same as if OpenCL vectors had been used. Users wanting to use those functions with execution::unseq can simply include the headers.

Compilation then proceeds as follows: (1) Clang, fed with C++ for OpenCL code, outputs vectorized LLVM IR; (2) llvm-spirv converts the LLVM IR to SPIR-V; (3) the host code uses the OpenCL host API to compile the SPIR-V to target assembly and execute it on an OpenCL platform. Finally, this work benchmarks applications using the implemented functions with and without execution::unseq. Our experiments are performed on an Intel Core i7-7700 CPU at 3.60 GHz with built-in HD Graphics 630, so the experiment covers both CPU and GPU. In the best case, this work achieves a speedup of more than 6.9× over the functions without unseq (non-vectorized).
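A minimal sketch of the idea (all names here are ours, for illustration): an algorithm overload that accepts an unsequenced-policy tag and asks Clang to vectorize its loop, mirroring the C++20 execution::unseq semantics:

```cpp
namespace demo {

struct unsequenced_policy {};                // stand-in for std::execution::unsequenced_policy
inline constexpr unsequenced_policy unseq{}; // stand-in for std::execution::unseq

// transform() overload selected when the caller passes demo::unseq.
template <typename InIt, typename OutIt, typename Op>
OutIt transform(unsequenced_policy, InIt first, InIt last, OutIt out, Op op) {
#pragma clang loop vectorize(enable)         // ask Clang to emit vectorized IR
  for (; first != last; ++first, ++out)
    *out = op(*first);
  return out;
}

} // namespace demo
```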

Experimenting with C++ Libraries in OpenCL Kernel Code

Speaker: Ole M Strohm, Arm

Co-Authors: Anastasia Stulova (Arm)

on-demand video | slides | discuss on slack: #discuss_posters  |  OpenCL  |  Poster ID: 06  |
The recently developed C++ for OpenCL kernel language [1] brings many language features of C++17 to GPUs and other OpenCL devices, but it currently does not bring the C++ libraries with it. This is a significant limitation because more and more functionality is being added through libraries in the latest C++ standards.

As C++ for OpenCL matures, the viability of porting existing C++ code, such as the standard libraries, comes closer to reality. This would let C++ for OpenCL use more of the current and upcoming features of C++, given that nearly 70% of the C++ specification pertains to the standard library [2].

In this endeavor we have experimented with the open-source C++ standard library libcxx [3], which is part of the LLVM project. By reusing existing C++ code for OpenCL, development time can be decreased, and more robustness can be provided to OpenCL kernel developers.

While we have conducted an evaluation of multiple libraries within libcxx, our focus was on supporting the type traits library. Type traits is a powerful metaprogramming utility library, and it is well suited to OpenCL as it mainly provides general compile-time features. Also, it is a header-only library, which makes it very easy to integrate and deploy into the OpenCL development environment. Attempting to use type traits in C++ for OpenCL did expose bugs in the compiler that were fixed upstream. We have also added two Clang compile time extensions that allowed us to support the full functionality of the type traits library from C++17 without the risk of exposing non-conformant behavior to the application developers.

Currently 88% of the modified libcxx tests for type traits pass with the upstream implementation of C++ for OpenCL, and the rest of the tests cannot currently be supported because they contain functionality from C++20 or other functionality outside of C++ for OpenCL’s scope. 58% of the passing tests had to be modified by removing features unsupported in OpenCL, such as virtual methods and function pointers.

As a result of this work, type traits are currently enabled for users of C++ for OpenCL to experiment with in offline compilation, starting from Clang 12 [4]. We anticipate that more functionality will be added in the future, and we aim to outline possible directions in the conclusion.
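For a flavour of what this enables (our illustrative kernel, not taken from the poster), type traits allow compile-time dispatch inside C++ for OpenCL kernel code compiled offline with Clang:

```cpp
#include <type_traits>

template <typename T>
T saturate(T x) {
  // Compile-time branch: clamp floating-point values to [0, 1],
  // pass integral values through unchanged.
  if constexpr (std::is_floating_point_v<T>)
    return x < T{0} ? T{0} : (x > T{1} ? T{1} : x);
  else
    return x;
}

__kernel void run(__global float* data) {
  size_t i = get_global_id(0);
  data[i] = saturate(data[i]);
}
```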

[1] https://www.iwocl.org/wp-content/uploads/iwocl-syclcon-2020-stulova-13-slides.pdf

[2] https://www.stroustrup.com/bs_faq.html#big

[3] https://libcxx.llvm.org

[4] https://clang.llvm.org/docs/OpenCLSupport.html#experimental-features

Trip Down the Compute Pipeline

Speaker: Lukasz Towarek, Intel Corporation

on-demand video | slides | discuss on slack: #discuss_posters  |  OpenCL  |  Poster ID: 07  |
A new generation of rendering and compute APIs, such as Vulkan, DirectX 12 and oneAPI Level Zero, underlines the need for thin abstraction layers over the hardware and for low-latency, low-overhead drivers.

How do OpenCL implementations compare with those supporting the new-generation APIs? What is the architecture of an OpenCL driver? What is happening under the driver's hood?

We answer these questions based on the open source OpenCL driver stack for Intel Processor Graphics. We present activities that are performed by each component of the driver stack when handling critical OpenCL API calls like clGetPlatformIDs, clBuildProgram, clEnqueueWriteBuffer, clEnqueueNDRangeKernel, clEnqueueReadBuffer and clFinish.
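For reference, those calls correspond to the canonical host-side lifecycle of an OpenCL program; a minimal sketch (error handling and resource cleanup omitted):

```cpp
#include <CL/cl.h>

int main() {
  cl_platform_id platform; cl_device_id device;
  clGetPlatformIDs(1, &platform, nullptr);                 // platform discovery
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, nullptr, nullptr);

  const char* src = "__kernel void twice(__global float* d)"
                    "{ size_t i = get_global_id(0); d[i] *= 2.0f; }";
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
  clBuildProgram(prog, 1, &device, "", nullptr, nullptr);  // JIT compilation
  cl_kernel k = clCreateKernel(prog, "twice", nullptr);

  float host[1024] = {};
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(host), nullptr, nullptr);
  clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(host), host, 0, nullptr, nullptr);
  clSetKernelArg(k, 0, sizeof(buf), &buf);

  size_t gws = 1024;
  clEnqueueNDRangeKernel(q, k, 1, nullptr, &gws, nullptr, 0, nullptr, nullptr);
  clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host), host, 0, nullptr, nullptr);
  clFinish(q);                                             // drain the queue
  return 0;
}
```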

Machine Learning Training with Tensor Virtual Machine (TVM) and Adreno GPUs

Speaker:  Siva Rama Krishna Reddy, Qualcomm

Co-Authors: Hongqiang Wang (Qualcomm); Adarsh Golikeri (Qualcomm); Alex Bourd (Qualcomm)

on-demand video | slides | discuss on slack: #discuss_posters  |  OpenCL  |  Poster ID: 34  |
This poster introduces our recent efforts and presents preliminary results on enabling the Tensor Virtual Machine (TVM) with Qualcomm Adreno GPUs for machine learning training. TVM is a popular machine learning (ML) compiler stack that targets a wide range of computing devices running ML networks. TVM can auto-generate highly optimized CUDA and OpenCL kernels with little information provided by developers, and in many cases it can beat kernels hand-optimized by experts. TVM has been well tuned for many desktop and mobile devices.

After enabling mainline TVM for Adreno GPUs for machine learning inference, which was presented at IWOCL 2020, we made further progress by enhancing TVM for training deep-learning networks with the OpenCL backend on Adreno GPUs. To the best of our knowledge, this effort is the first of its kind for TVM.

Machine learning training on mobile is emerging as an exciting topic due to rising concerns about privacy and demand for customized user experiences. Compared with inference on mobile, training on mobile poses significant challenges in terms of functionality, complexity and resources. Beyond the inference functionality, training requires loss functions, gradient computation, optimizers, and a workflow to drive the training. As of today, mainline TVM has not enabled training support, except for a few gradient operators.

To enable training on TVM, we implemented a graph pass that builds a backward graph from a given forward graph, and various gradient operators, including complex layers like depthwise convolutions and batch normalization. In addition, a cross-entropy loss function and optimizers like SGD and Adam were also implemented. On top of these, a workflow was implemented to maintain the various trainable parameters, optimizer state, etc.

With all these amendments to the TVM framework, we can now train LeNet-5 and MobileNet-v1 natively on target, starting from randomly initialized weights, and achieve the desired accuracy of greater than 90% within a few epochs. This effort showcases the capability of TVM with the OpenCL backend on Adreno GPU targets.

In the poster we will discuss our approach, the various challenges we faced during these efforts, and our thoughts on the future of training with the TVM framework.

SYCL for Vitis 2020.2: SYCL & C++20 on Xilinx FPGA

Speaker: Gauthier Harnisch, Xilinx

Co-Authors: Andrew Gozillon (University of the West of Scotland); Ronan Keryell (Xilinx); Lin-Ya Yu (Xilinx); Ralph Wittig (Xilinx); Luc Forget (Xilinx)

poster | discuss on slack: #discuss_posters  |  SYCL  |  Poster ID: 27  |
SYCL is a single-source C++ DSL targeting a large variety of accelerators in a unified way by using different backends.

We present a SYCL implementation targeting Xilinx Alveo FPGA cards by merging two different open-source implementations: Intel's oneAPI DPC++ with some LLVM passes from triSYCL.

The FPGA device configuration is generated by Xilinx Vitis 2020.2 fed with SPIR LLVM IR, and Xilinx XRT is used as the host OpenCL API to control the device.

This is an ongoing open-source project available on: https://github.com/triSYCL/sycl

Bringing SYCL to Ampere Architecture

Speaker: Steffen Larsen, Codeplay Software

Co-Authors: Rod Burns (Codeplay Software); Brandon Cook (Lawrence Berkeley National Laboratory); Douglas Doerfler (Lawrence Berkeley National Laboratory); Kevin Harms (Lawrence Berkeley National Laboratory); Thomas Applencourt (Lawrence Berkeley National Laboratory); Stuart Adams (Codeplay Software)

on-demand video | slides | discuss on slack: #discuss_posters  |  SYCL  |  Poster ID: 16  |
Codeplay has worked with the SYCL community, and Intel in particular, to deliver an open-source CUDA backend implementation for DPC++, the Clang/LLVM-based SYCL compiler. This backend provides support for NVIDIA GPUs through the CUDA Driver API, rather than using OpenCL. SYCL applications built with this backend are in effect native CUDA applications. This poster presents new features that Codeplay is contributing to DPC++ to improve SYCL 2020 conformance and implement support for NVIDIA A100 Tensor Core GPUs. This is an ongoing project from Codeplay that will see further progress over the year, including support for modern CUDA features and overall performance improvements.

SYCL 2020 is a significant step towards bringing C++ heterogeneous programming to all. It supports diverse applications, including HPC supercomputing centers, powerful machine learning frameworks, and creative and professional applications on embedded and desktop PCs. One of the key improvements of SYCL 2020 is the new backend model, which allows a SYCL implementation to target multiple heterogeneous APIs, including CUDA. This makes SYCL an attractive target for frameworks and libraries, allowing them to target a wide range of platforms without having to port and translate their code. Over the next year, Codeplay will help improve DPC++ by extending its CUDA backend and contributing new features from the SYCL 2020 provisional specification, including Unified Shared Memory (USM), reductions, subgroups, unnamed lambdas and in-order queue execution. Key among these new features is USM, a pointer-based alternative to the buffer programming model that provides the ability to create allocations visible to both the device and the host. Although there is already support for USM upstream, our project aims to provide further testing and a stable interface. Codeplay will implement CUDA support for these new features and ensure that they are performant on the NVIDIA A100 platform and recent CUDA toolkit versions.
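For readers unfamiliar with USM, a minimal sketch of the pointer-based model (illustrative only, not Codeplay's implementation):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;                                // e.g. a CUDA-backend device
  constexpr size_t n = 1024;

  // USM shared allocation: visible to both host and device.
  int* data = sycl::malloc_shared<int>(n, q);
  for (size_t i = 0; i < n; ++i) data[i] = int(i);

  // Plain pointers in the kernel; no buffers or accessors required.
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
     data[i] *= 2;
   }).wait();

  sycl::free(data, q);
  return 0;
}
```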

As part of the contributions to the SYCL community in general, and the DPC++ CUDA backend in particular, Codeplay will also provide new extensions to SYCL 2020 and DPC++ that allow developers to take advantage of CUDA-specific APIs and features.

These extensions will help developers deliver performance on the NVIDIA A100 platform. Planned extensions include new SYCL APIs that will expose the NVIDIA A100’s Tensor Cores and hardware-accelerated barriers. Codeplay will design and implement these new extensions, adding the necessary changes to DPC++’s CUDA backend and extending LLVM’s NVPTX backend to support the SM 80 architecture.

Improving DPC++’s NVIDIA multi-GPU support will be essential for the NVIDIA A100. Codeplay will contribute support for multiple CUDA devices with different SYCL contexts, device-to-device memory transfers and group collective operations.

Path Tracing on FPGA with SYCL and C++20

Speaker: Luc Forget, INRIA

Co-Authors: Krishna Kumar Ranipet Murugan (NCSU); Ronan Keryell (Xilinx); Gauthier Harnisch (Xilinx)

poster  |  discuss on slack: #discuss_posters  |  SYCL  |  Poster ID: 25  |
Path tracing is a global illumination method used in computer graphics to render photorealistic images using Monte Carlo integration. While it is very compute-intensive, it is nowadays commonly used in movie production because of its image quality.

We present a straightforward implementation in modern C++ using SYCL to offload the compute-intensive parts onto accelerators.

Ray-tracer and path-tracer implementations usually rely on dynamic polymorphism to handle objects with different shapes and materials, but this is currently unsupported in SYCL, since accelerators often cannot handle function pointers. Instead of polymorphism, we rely on C++17 std::variant and std::visit to dispatch operations with duck typing in a type-safe way.

std::visit can execute in O(1) on an FPGA because the dispatch is specialized spatially in the architecture.
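A minimal host-side sketch of the pattern (ours, not the project's code): overload resolution in a visitor replaces the virtual-function table, so no function pointers are required:

```cpp
#include <variant>

struct Sphere   { float radius; };
struct Triangle { float base, height; };

// One visitor per operation; the overload set plays the role of a vtable.
struct Area {
  float operator()(const Sphere& s)   const { return 4.0f * 3.14159265f * s.radius * s.radius; }
  float operator()(const Triangle& t) const { return 0.5f * t.base * t.height; }
};

using Shape = std::variant<Sphere, Triangle>;

float area(const Shape& shape) {
  return std::visit(Area{}, shape);   // type-safe dispatch, no virtual calls
}
```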

This is an ongoing open-source project available on https://github.com/triSYCL/path_tracer

SYCL for Xilinx Versal ACAP AIE CGRA

Speaker: Ronan Keryell, Xilinx

Co-Authors: Andrew Gozillon (University of the West of Scotland); Gauthier Harnisch (Xilinx); Hyun Kwon (Xilinx); Ravikumar Chakaravarthy (Xilinx); Ralph Wittig (Xilinx)

poster | discuss on slack: #discuss_posters  |  SYCL  |  Poster ID: 29  |
SYCL is a single-source C++ DSL targeting a large variety of accelerators in a unified way by using different backends.

Xilinx Versal ACAP is a new system-on-chip (SoC) device integrating heterogeneous computing resources, such as various CPUs, an FPGA and a coarse-grain reconfigurable array (CGRA), interconnected by a network-on-chip (NoC).

The AIE CGRA is an array of 400 VLIW DSPs operating on 512-bit vectors, each with its own neighborhood distributed memory.

We expose architectural details to the programmer through some SYCL extensions.

We present a SYCL implementation targeting the AIE CGRA by merging 2 different open-source implementations, Intel’s oneAPI DPC++ with some LLVM passes from triSYCL and a new SYCL runtime from triSYCL.

The SYCL device compiler generates LLVM IR for the Synopsys ASIP CHESS compiler generating the AIE instructions.

The host runtime runs on the ARM A72 CPU of the ACAP and controls the CGRA through the Xilinx libaiengine library.

This is an ongoing project we plan to open-source soon.

Approaching Coupled Cluster Theory with Perturbative Triples using SYCL

Speaker: Abhishek Bagusetty, Argonne National Laboratory

Co-Authors: Jinsung Kim (Pacific Northwest National Laboratory); Ajay Panyala (Pacific Northwest National Laboratory); Alvaro Vazquez-Mayagoitia (Argonne National Laboratory); Karol Kowalski (Pacific Northwest National Laboratory); Sriram Krishnamoorthy (Pacific Northwest National Laboratory)

No Video or Slides Available  |  SYCL  |  Poster ID: 32  |
The quantum-chemical formulation of coupled-cluster theory with perturbative triples, CCSD(T), is regarded as the gold standard for high-accuracy computational modeling of the correlated behavior of electrons in molecular systems. The perturbative triples (T) part of CCSD(T) scales as O(N⁷), where N is the problem size. The (T) calculation consists of 27 tensor contractions that generate two six-dimensional tensors. Because a six-dimensional tensor can quickly grow to exceed the total physical memory available on a parallel machine, treating the 27 tensor contractions as independent contractions is often infeasible. The fundamental constraint of the relatively small global memory capacity of GPUs, compared to the main memory capacity of host nodes, necessitates relatively small tile sizes for the high-dimensional tensor contractions that form the core of the (T) calculation. Moreover, even if such an algorithm is portable to other architectures, its performance there is uncertain, especially on AMD and Intel GPUs, because a fused kernel optimized for NVIDIA GPUs relies on features such as the shared memory size, the number of threads within a warp, the SIMT model and so on.

The oneAPI ecosystem, which provides Data Parallel C++ (DPC++), enables domain-science applications to target multiple hardware architectures, such as multi-core CPUs, GPUs and even FPGAs, with the advantage of single-source code portability. Our workflow involved transitioning the (T) calculation, written in CUDA and optimized for NVIDIA GPUs, to SYCL; Intel's DPC++ Compatibility Tool was used to facilitate porting our recent CUDA kernel to SYCL-based DPC++. Our recent algorithm [1] for the (T) calculation minimizes GPU memory-bandwidth usage: it fuses all the tensor contractions involved in the (T) calculation at the register level, so as to eliminate global-to-shared-memory data transfers for the largest six-dimensional intermediate. The SYCL implementation uses a hybrid memory model combining pointer-based and traditional buffer-based memory management: the majority of the data structures use pointer-based unified shared memory (USM) for explicit and robust control over data transfers between host and devices, while shared local memory is managed through the SYCL accessor model.
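An illustrative sketch of that hybrid model (ours, not the authors' kernel): USM device pointers for the tensor data, with a local accessor providing the work-group's shared local memory:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t n = 4096, wg = 256;

  double* t = sycl::malloc_device<double>(n, q);   // USM device allocation
  // ... populate t with q.memcpy(...) from host data ...

  q.submit([&](sycl::handler& h) {
    // Shared local memory, managed through the accessor model.
    sycl::local_accessor<double, 1> tile{sycl::range<1>{wg}, h};
    h.parallel_for(sycl::nd_range<1>{{n}, {wg}}, [=](sycl::nd_item<1> it) {
      size_t g = it.get_global_id(0), l = it.get_local_id(0);
      tile[l] = t[g];                              // stage through local memory
      sycl::group_barrier(it.get_group());
      t[g] = tile[l] * 2.0;                        // placeholder computation
    });
  }).wait();

  sycl::free(t, q);
  return 0;
}
```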

In this work, we demonstrate the performance of our SYCL implementation of the (T) calculation. Our primary analysis targets Intel's integrated and discrete GPU architectures, with a focus on roofline analysis.

[1] Jinsung Kim et al. Scalable heterogeneous execution of a coupled-cluster model with perturbative triples. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’20). IEEE Press, Article 79, 1–15.