Welcome to the IWOCL and SYCLcon 2021 program of events
Virtual Event Logistics
- Instructions for joining the live sessions can be found in your registration confirmation.
- If you are missing registration emails, please see: What to check when you don’t receive Eventbrite emails
- A range of Slack channels are in use to facilitate discussions. The link to join these channels can be found in your original registration confirmation.
- All our authors have been asked to check their Slack channel from Wednesday 28 April to answer any questions.
- Registration is now closed. The Slack channels will remain open to anyone who registered before 29 April.
Proceedings
This year’s proceedings are available on the ACM Digital Library.
oneAPI Dev Summit
The First 2021 oneAPI Developer Summit
In this one-day virtual summit, you will hear speakers from industry and academia share their experience working on innovative cross-platform, multi-vendor architecture solutions developed on oneAPI.
Date: Monday 26 April
Start time: 09:00 BST, 10:00 CET | Duration: All day
Register: Please visit oneAPI Developer Summit to register and for additional information on the program and speakers.
Join the oneAPI community for the first 2021 oneAPI Developer Summit at IWOCL, focused on oneAPI and Data Parallel C++ for accelerated computing across xPU architectures (CPU, GPU, FPGA, and other accelerators). In this one-day virtual conference, you will hear from industry and academia speakers working on innovative cross-platform, multi-vendor architecture solutions developed on oneAPI. Learn from fellow developers and connect with other innovators.
Please join us: a self-sustaining, vibrant community supporting each other in using oneAPI and Data Parallel C++.
- INTRODUCTION
- VENDOR UPDATE: SYCL 2021 Vendor Update
- DEVCLOUD UPDATE: Developer tools to get you started on oneAPI
- HANDS-ON SESSION: Application optimization with Cache-aware Roofline Model and Intel oneAPI tools
- LUNCH
- LIGHTNING TALK: Great Cross-Architecture Challenge Application Showcase
- KEYNOTE: SYCL 2020 in hipSYCL: DPC++ features on AMD GPUs, NVIDIA GPUs and CPUs
- TECH TALK: AI > A Deep Dive into a Deep Learning Library for the A64FX Fugaku CPU – Meet the Developer
- LIGHTNING TALK: Bringing SYCL to Super Computers with Celerity
- LIGHTNING TALK: Great Cross-Architecture Challenge Application Showcase
- TECH TALK: It’s acceleration ….. but faster! – a business perspective on FPGA technology
- TECH TALK: Comparative Analysis of Intel HLS Design Tools on a Case Study in Neuromorphic
- TECH TALK: TAU Performance System
- CLOSING
- HAPPY HOUR
To attend this session please register directly at: https://www.oneapi.com/events/devcon2021iwocl/
Tutorials
Registered delegates will be sent instructions for joining these live tutorials and the associated Slack channel.
A Hands-On Introduction To SYCL
Tutorial Lead: Rod Burns, Codeplay
Presenters: Igor Vorobtsov, Intel; Aksel Alpay, University of Heidelberg; Ronan Keryell, Xilinx; Peter Zuzek, Codeplay; and Gordon Brown, Codeplay
Date: Tuesday 27 April
Start time: 09:00 BST, 10:00 CET | Duration: Approx. 7 hours
discuss on slack: #discuss_tutorial_intro_to_sycl
Set-up Instructions
The tutorial is very focused on hands-on work and is a great way to grasp the essentials you will need for SYCL development. By the end of the day we expect you to be able to write your own SYCL kernels and understand the main concepts required to build more complex applications.
Because of this, we strongly encourage you to set up your machine so that you can follow along with the exercises. There are a few options for this; some are more involved in terms of setup than others.
The simplest way to do the exercises is to use the Intel DevCloud environment. This requires no machine setup: all you need to do is register for the DevCloud and then access it via your web browser or SSH. Use this link to register for the DevCloud: https://devcloud.intel.com/oneapi/
It’s also possible to configure your own machine to use one of the implementations: ComputeCpp, DPC++ or hipSYCL.
- ComputeCpp supports Intel CPUs and GPUs on Linux and Windows
- DPC++ supports Intel CPUs and GPUs on Linux and Windows
- hipSYCL supports Nvidia and AMD processors on Linux
Individual instructions on how to install each of these are in this README: https://github.com/codeplaysoftware/syclacademy/tree/iwocl21#install-sycl-implementations
Abstract
SYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to a platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++. In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language.
This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises. Hence, attendees will require their own laptop to perform the hands-on exercises.
Topics Covered Include:
– An Introduction to SYCL
– SYCL Kernel Definition
– Memory and Data Management
– Data Parallelism
– Asynchronous Execution
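To give a flavour of what the exercises build towards, here is a minimal sketch of a SYCL vector addition. This is illustrative code, not taken from the tutorial material; the header name and exact API style vary slightly between implementations.

```cpp
// Minimal SYCL vector addition (illustrative sketch, not tutorial code).
#include <sycl/sycl.hpp>  // some 2021-era implementations use <CL/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

  sycl::queue q;  // selects a default device (CPU or GPU)
  {
    sycl::buffer<float> bufA{a.data(), sycl::range<1>{N}};
    sycl::buffer<float> bufB{b.data(), sycl::range<1>{N}};
    sycl::buffer<float> bufC{c.data(), sycl::range<1>{N}};

    q.submit([&](sycl::handler& h) {
      sycl::accessor A{bufA, h, sycl::read_only};
      sycl::accessor B{bufB, h, sycl::read_only};
      sycl::accessor C{bufC, h, sycl::write_only};
      h.parallel_for(sycl::range<1>{N},
                     [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });
    });
  }  // buffers go out of scope here and write their results back to the host

  return c[0] == 3.0f ? 0 : 1;
}
```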
SYCL | Paper ID: 113
Layers for OpenCL
Brice Videau, Argonne National Lab
Date: Tuesday 27 April
Start time: 15:00 BST, 16:00 CET, 07:00 PST | Duration: 2 hours
slides | tutorial on github | video presentation | discuss on slack: #discuss_tutorial_layers_for_opencl
The latest release of the OpenCL loader introduced a new experimental system of plugins implementing layers. These layers can be used to intercept, introspect, and potentially modify the behavior of one or several OpenCL API functions. The plugins take the shape of simple shared libraries that the loader is asked to load through an environment variable. The system aims to solve the portability issues of classical interception mechanisms [1], as well as to minimize the amount of code developers need to create a functional layer. As the feature is experimental, we are eagerly awaiting the community’s feedback so that we can improve the system to fit OpenCL users’ and developers’ requirements and use cases. One of this session’s objectives is to facilitate the gathering of such feedback.
This session will consist of two parts. The first part will be a presentation of the layering system implemented in the OpenCL loader, followed by a Q&A session. The second part will be a demonstration of layer usage and development, followed by a Q&A session.
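As a rough sketch of what a layer looks like in practice, the following example intercepts clCreateBuffer and logs each allocation before forwarding the call. It is based on the experimental cl_layer.h interface in the loader at the time of writing; consult the loader sources for the exact signatures, which may have changed.

```cpp
// Rough sketch of a loader layer (illustrative; exact interface may differ).
#include <CL/cl_icd.h>
#include <CL/cl_layer.h>
#include <cstdio>

static cl_icd_dispatch dispatch;        // our (mostly pass-through) table
static const cl_icd_dispatch *target;   // next layer, or the real driver

static CL_API_ENTRY cl_mem CL_API_CALL
createBuffer_wrap(cl_context ctx, cl_mem_flags flags, size_t size,
                  void *host_ptr, cl_int *errcode_ret) {
  std::fprintf(stderr, "[layer] clCreateBuffer: %zu bytes\n", size);
  return target->clCreateBuffer(ctx, flags, size, host_ptr, errcode_ret);
}

CL_API_ENTRY cl_int CL_API_CALL
clInitLayer(cl_uint num_entries, const cl_icd_dispatch *target_dispatch,
            cl_uint *num_entries_ret, const cl_icd_dispatch **layer_dispatch) {
  target = target_dispatch;
  dispatch = *target_dispatch;                  // start from pass-through
  dispatch.clCreateBuffer = createBuffer_wrap;  // override one entry point
  *layer_dispatch = &dispatch;
  *num_entries_ret = num_entries;
  return CL_SUCCESS;
}
```

Built as a shared library, such a layer is activated by pointing the loader’s layer environment variable at the resulting library, with no changes to the application.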
Presentation outline:
– Layers vs Interception
– Original OpenCL loader API call workflow
– OpenCL loader with layers API call workflow
– How the new OpenCL loader implements layers
– Presentation of the loader layers configuration options
– Presentation of the layers plugin API
– Discussion of the limitations of the current experimental layer implementation
Q&A Session
Demonstration Outline:
– Demonstration of the use of existing layers and the required loader configuration
– Deep dive into the anatomy of a simple layer
– Presentation of the resources available to developers
Q&A Session
[1] https://github.com/intel/opencl-intercept-layer
OpenCL | Paper ID: 103
Live Panel Discussions
Registered delegates will be sent instructions for joining these live panel discussions and the associated Slack channel.
OpenCL Panel Discussion
Panel Chair: Simon McIntosh-Smith, University of Bristol
Date: Wednesday 28 April
Start time: 16:00 BST, 17:00 CET, 08:00 PST | Duration: 60 mins
on-demand video | discuss on slack: #discuss_panel_opencl
- Neil Trevett, Khronos and NVIDIA
- Jesse Natalie, Microsoft
- Paul Miller, Boris FX
- Ben Ashbaugh, Intel
- Kevin Petit, Arm
- Alastair Murray, Codeplay
OpenCL | Paper ID: P1 |
SYCL Panel Discussion
Panel Chair: Simon McIntosh-Smith, University of Bristol
Date: Thursday 29 April
Start time: 16:00 BST, 17:00 CET, 08:00 PST | Duration: 60 mins
on-demand video | discuss on slack: #discuss_panel_sycl
- Michael Wong, Codeplay
- Ronan Keryell, Xilinx
- Aksel Alpay, University of Heidelberg
- James Reinders, Intel
- Tom Deakin, University of Bristol
- Hal Finkel, U.S. Department of Energy
- Peter Thoman, University of Innsbruck
SYCL | Paper ID: P2 |
Khronos Advisory Panel Meetings – OpenCL and SYCL
These meetings are for members of the OpenCL and SYCL advisory panels. Existing members will be sent call-in instructions directly by the Khronos Group. See below for details on becoming an Advisor.
Advisory Panel Meeting – OpenCL
Date: Monday 26 April
Start time: 16:00 BST, 17:00 CET, 08:00 PST
Members only
Call-in instructions will be sent to members by the Khronos Group
SYCL Advisory Panel Meeting
Date: Tuesday 27 April
Start time: 16:00 BST, 17:00 CET, 08:00 PST
Members only
Call-in instructions will be sent to members by the Khronos Group
An Invitation to Become an OpenCL or SYCL Advisor
Khronos understands that to best meet the needs of the industry, standards such as OpenCL and SYCL should incorporate the requirements and feedback of industry experts. Consequently, Khronos has established Advisory Panels to serve as a forum where these experts can engage in a bi-directional discourse with the corresponding Working Group (WG) to communicate their requirements, provide feedback on draft specs, consume WG materials offline, help prioritize work on features and, among other things, participate in shaping the ecosystem.
There is no cost to join an advisory panel. If you are interested in finding out more and would like to apply, please visit: https://www.khronos.org/advisors and then email: memberservices@khronos.org with your contact details and a short summary of your involvement with OpenCL and/or SYCL.
Please note that each working group can only support a limited number of Advisors. The meetings taking place during IWOCL/SYCLcon represent the first official meetings of the OpenCL and SYCL Advisory Panels and will be open to members only. Members will be sent call-in instructions directly by the Khronos Group.
Research Papers and Technical Talks
Registered delegates will be able to view the on-demand video presentations of these sessions from Wednesday 28th April. All our speakers will be asked to check their Slack channel regularly to answer any questions coming in from the community.
KEYNOTE
SYCL, DPC++, XPUs, oneAPI – a View from Intel
Speaker: James Reinders, Intel.
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Paper ID: K01 |
INVITED TALK
An Insight Into Kalray’s OpenCL Optimum Implementation
Speaker: Sebastien Le Duc, Software Engineering Director, Kalray
on-demand video | slides | discuss on slack: #discuss_talks_opencl | OpenCL | Paper ID: K02 |
Abstract
Kalray is proud to have officially reached OpenCL™ conformance at the close of 2020 for Coolidge™, Kalray’s 3rd generation of MPPA® (Massively Parallel Processor Array) intelligent processor.
Kalray MPPA intelligent processors are a new generation of processors specialized in intelligent data processing from cloud to edge. They are able to capture and analyze massive data flows on the fly, and interact in real time with the outside world. These processors are capable of running demanding AI algorithms and, simultaneously, a wide set of different processing and control tasks such as intensive mathematical algorithms, signal processing, and network or storage software stacks.
Let’s explore why it is important for our users that Coolidge now runs a conformant implementation of OpenCL, and how we offer open standard programming for high performance and flexible applications on manycore processors.
The evolution of high-performance systems is crying out for disruptive hardware architectures and innovative software programming models. The challenges encountered by embedded compute system users reside in the selection of the appropriate hardware technologies and of the programming models for computer vision, neural networks, machine learning, and more. This usually requires:
– Re-use of legacy code
– Ease of finding highly qualified engineers
– Flexibility for porting from one hardware architecture to another
– Long term maintenance
– Rapid prototyping up to productization…
Some solution providers are proposing a proprietary framework and API, while others are implementing a defined standard API for a fully open framework.
Such frameworks need to support high-level interfaces for several types of applications and help users to initialize, use and “combine” these applications.
The Deliberate Choice of Open Standards
At Kalray, we are convinced Open Standards answer the requirements set out above. This is why Kalray’s Software Development Kit, AccessCore® SDK, relies massively on Open Standards, and why we worked directly with Khronos (https://www.khronos.org) to select the most appropriate programming solution for parallel architectures and performance offloading.
We are seeing an explosion of the most demanding applications that require a tremendous range of advanced computing capabilities. The focus so far has been to execute these applications on a dedicated type of architecture, the GPU (initially conceived for graphical demands). As the industry expands its needs for neural network, algebra calculation and computer vision algorithms, more adapted architectures are being developed and used.
Here enters Kalray’s MPPA intelligent processor, which provides high performance for heterogeneous computation while keeping a homogeneous architecture. The challenges, as mentioned above, are for users to be able to re-use already developed applications, to port them, and to evaluate the benefits of our architecture (execution time, latency, power consumption…). In addressing these challenges, we must also minimize the learning ramp-up and the maintenance burden of a new language whilst reducing the need for extensive training.
By adopting an open standard, Kalray makes MPPA adoption easy for developers. They can use legacy code, they know the programming environment, they are not surprised by its configuration capabilities, and they are even used to the optimization methods.
With the OpenCL conformance of Coolidge, our 3rd generation of MPPA intelligent processor, as proof of this commitment, Kalray ensures that our users can rely on our implementation as much as on those from other major actors in the industry. In addition, as a Khronos member, we are involved in Khronos Working Groups, contributing to the evolution and adoption of these Open Standards.
This session will describe in detail the implementation choices (mapping of platform, execution and memory models on MPPA) as well as the optimum programming methods and usage of extensions utilizing the full capabilities of the architecture.
Speaker Biography
Sebastien Le Duc is Software Engineering Director at Kalray. He started his professional career in 1998 at STMicroelectronics, where he worked on compilers for a proprietary VLIW DSP targeted at multimedia applications. He then joined ST-Ericsson in 2006, where he spent 8 years managing multimedia software development teams. He continued his career back at STMicroelectronics as Lead Software Architect on set-top box products. With a technical background that ranges from low-level software development to middleware integration, together with his team management and product development experience, Sebastien brings outstanding software engineering leadership to Kalray.
INVITED TALK
OpenCL Working Group – A State of the Union
Speaker: Neil Trevett, NVIDIA | Khronos Group, President and OpenCL Working Group Chair.
on-demand video | slides (updated) | discuss on slack: #discuss_talks_opencl | OpenCL | Paper ID: K03 |
Abstract
Neil will update the community on all the OpenCL working group’s latest developments, including OpenCL 3.0, plus all the latest news and updates on extensions, the ecosystem, implementations and more. Always a not-to-be-missed presentation.
Speaker Biography
Neil Trevett is the President of The Khronos Group (and OpenCL working group chair), where he has helped initiate and evolve royalty-free open standards such as OpenCL, OpenVX, NNEF, OpenXR, Vulkan, OpenGL ES, WebGL and glTF, many of which are widely deployed in the graphics and embedded vision industries. At NVIDIA, Neil works to drive and develop the developer ecosystem that enables applications to take advantage of advanced GPU and silicon acceleration.
INVITED TALK
SYCL Working Group – A State of the Union
Speaker: Michael Wong, Codeplay | Khronos Group, SYCL Working Group Chair.
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Paper ID: K04 |
Abstract
Michael will update the community on all the SYCL working group’s latest developments, including SYCL 2020. Always a not-to-be-missed presentation.
Speaker Biography
Michael is a Distinguished Engineer at Codeplay, Chair of SG19, the C++ Standards Committee study group for Machine Learning, and Head of the Canadian Delegation to the C++ Standards Committee. Michael has rich experience in C++ parallel computing, high-performance computing and machine learning. He has led the development of the C++ heterogeneous programming language standard SYCL for GPGPU application development, and has deep research experience and insights into the underlying performance optimization of TensorFlow. His work covers parallel programming, neural networks, computer vision, autonomous driving and other fields. Michael was previously a senior IBM technical expert who led the development of the IBM XL C++ and XL C compilers.
BEST PAPER*
Sylkan: Towards a Vulkan Compute Target Platform for SYCL
Speaker: Peter Thoman, University of Innsbruck (UIBK)
Co-Authors: Daniel Gogl (UIBK); Thomas Fahringer (UIBK)
* The best paper award for 2021 was sponsored by Arm
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Paper ID: 14 |
In this paper, we discuss the opportunities and challenges of mapping SYCL to Vulkan, a low-level explicit programming model for GPUs. This includes an analysis of the potential semantic mismatch between each respective standard, as well as approaches to work around some of these issues. Additionally, we present a prototype research implementation of Sylkan, a SYCL compiler and runtime targeting Vulkan.
In order to evaluate our prototype qualitatively and quantitatively, we chose a variety of functional tests as well as three performance benchmarks. For the functional tests, we discuss and categorize the failures of the current prototype, noting which semantic mismatch or missing implementation causes them. For the performance benchmarks, we compare execution times against an OpenCL-based SYCL implementation and a native Vulkan version of each benchmark, on two hardware platforms.
Performance-Portable Distributed k-Nearest Neighbors using Locality-Sensitive Hashing and SYCL
Speaker: Marcel Breyer, University of Stuttgart, IPVS
Co-Authors: Gregor Daiß (University of Stuttgart, IPVS); Dirk Pflüger (University of Stuttgart, IPVS)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Paper ID: 24 |
To the best of our knowledge, there is no distributed LSH version supporting multiple GPUs from different vendors available so far despite the fact that k-NNs are frequently employed. Therefore, we have developed the sycl_lsh library. It provides the first hardware-independent, yet efficient and distributed implementation of the LSH algorithm that is suited for modern supercomputers. The implementation uses C++17 together with SYCL, which is an abstraction layer for OpenCL that allows targeting different hardware with a single implementation. To support large data sets, we utilize multiple GPUs using the Message Passing Interface (MPI) to enable the usage of both shared and distributed memory systems.
We have tested different parameter combinations for two locality-sensitive hash function implementations, which we compare. Our results show that our library can easily scale on multiple GPUs using both hash function types, achieving a nearly optimal parallel speedup of up to 7.6 on 8 GPUs. Furthermore, we demonstrate that the sycl_lsh library supports different SYCL implementations—ComputeCpp, hipSYCL, and oneAPI—to target different hardware architectures without significant performance differences.
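The paper describes the library’s own design; as a generic illustration of the underlying pattern only (not sycl_lsh code), pairing MPI ranks with SYCL GPU devices can look like this:

```cpp
// Sketch: one SYCL GPU per MPI rank, round-robin over the visible devices
// (illustrative pattern only; assumes at least one GPU is visible per node).
#include <sycl/sycl.hpp>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::queue q{gpus[rank % gpus.size()]};  // this rank's device

  // ... build hash tables and run k-NN queries on this rank's device ...

  MPI_Finalize();
  return 0;
}
```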
Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL
Speaker: Thales Sabino, Codeplay Software
Co-Authors: Mehdi Goli (Codeplay software Ltd)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Paper ID: 26 |
While BLAS routines were originally implemented for CPUs, with the emergence of GPGPU they had to be re-written to exploit the extensive computational power provided. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. With a wide range of hardware having different memory hierarchies, cache line sizes, memory access patterns, numbers of registers, and types of memory connections, achieving performance portability of BLAS routines across various platforms while avoiding rewrites of existing code is a major challenge of the heterogeneous programming world.
Written in the SYCL programming language, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms.
This paper presents the implementation of a parametric tile-based TRSM routine for SYCL-BLAS by employing a highly optimized GEMM routine provided in SYCL-BLAS.
Our results show that we can achieve up to 2.6x speedup on an Intel GPU, 7x on an AMD GPU and up to 3.4x speedup on an Arm GPU compared with the highly optimized CLBlast and clBLAS libraries, by tuning the tile size per device without reimplementing the kernel.
On Measuring the Maturity of SYCL Implementations by Tracking Historical Performance Improvements
Speaker: Wei-Chen Lin, University of Bristol
Co-Authors: Tom Deakin (University of Bristol); Simon McIntosh-Smith (University of Bristol)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Paper ID: 33 |
A crucial metric for potential adopters is how mature the SYCL implementations are: are they still improving rapidly, indicating that the space is still quite immature, or has performance improvement plateaued, potentially indicating a mature market?
We present a historical study of the performance delivered by all major SYCL implementations on a range of supported platforms.
We use existing HPC-style mini-apps written in SYCL, and benchmark these on current and historical revisions of each SYCL implementation, revealing the rate of change of performance improvements over time.
The data indicates that most SYCL implementations are now quite mature: rapid performance improvements in the past have slowed to more modest gains more recently.
We also compare the most recent SYCL performance to existing, well-established frameworks such as OpenCL and OpenMP.
Experiences Supporting DPC++ in AMReX
Speaker: Sravani Konda, Intel Corporation
Co-Authors: Dunni Aribuki (Intel Corporation); Weiqun Zhang (Lawrence Berkeley National Laboratory); Kevin Gott (National Energy Research Scientific Computing Center); Christopher Lishka (Intel Corporation)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Tech Talk ID: 04 |
In this talk, we present experiences in supporting DPC++, a language based on the SYCL specification, as a backend for AMReX. We will demonstrate how AMReX provides an abstraction layer for its users so that they can write performance-portable code for a variety of heterogeneous platforms. We will discuss key DPC++ features that allow AMReX to implement the abstractions, and our contributions to the oneAPI specification and Intel’s implementation. We will also highlight some features missing in SYCL/DPC++ that limit its efficiency, and our future plans.
Developing Medical Imaging Application Across GPU, FPGA, and CPU Using oneAPI
Speaker: Wang Yong, Intel
Co-Authors: Zhou Yongfa (Intel); Wang Scott (Intel); Yang Wang (Intel Corporation); Xu Qing (Intel); Wang Chen (Intel)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Tech Talk ID: 11 |
At present, the two factors limiting the development of ultrasound imaging are these: First, using a hardware development approach to implement ultrasound imaging algorithms is complex, time-consuming, and lacks flexibility. Second, the existing CUDA-based ultrasound imaging implementations are limited to Nvidia hardware, which also restricts applying them to more architectures.
oneAPI is a SYCL-based programming environment developed by Intel. It enables heterogeneous computing across multiple hardware architectures using Data Parallel C++ (DPC++). This new programming suite can be used to address the problems mentioned above: using a high-level language like DPC++ to program FPGAs can accelerate ultrasound imaging application development, and SYCL-based ultrasound imaging applications can be easily migrated to other vendors’ hardware.
To implement an ultrasound imaging application across multiple architectures (e.g., GPU, FPGA, and CPU) in a unified programming environment, we migrated SUPRA, a CUDA-based open-source ultrasound imaging project. The migration was performed using the oneAPI compatibility tool dpct. After migration, the code was tuned to run on GPU, FPGA and CPU.
In this talk, we will discuss our experiences with the complete process of migrating a CUDA code base to oneAPI code. First, the whole process of migrating the CUDA code base using dpct will be presented, including usage, code modification, API comparison, and build instructions. Second, the computation characteristics of the ultrasound imaging algorithms will be analyzed, and we will show how to optimize the application on Intel GPUs, including ESIMD usage. Third, early experiences of tuning the migrated code to target FPGAs will be highlighted; this will include device code rewrites for FPGA and programming techniques to improve performance on FPGA. The device code comparison of GPU and FPGA will also be discussed. Last, we will compare the performance and computation results of the ultrasound imaging algorithms on different hardware, including Intel GPUs (integrated and discrete), an Intel Arria 10 FPGA, Intel CPUs, an Nvidia GTX 1080 GPU, and a GTX 960M GPU.
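As a generic illustration of the kind of mapping dpct automates (not SUPRA code; all names below are made up), a CUDA-style launch becomes an nd_range parallel_for on a SYCL queue:

```cpp
#include <sycl/sycl.hpp>

// CUDA:  scale<<<grid, block>>>(data, n, factor);
// SYCL equivalent of the same launch. `data` is assumed to be a
// device-accessible USM pointer; the function is illustrative only.
void scale(sycl::queue& q, float* data, size_t n, float factor) {
  constexpr size_t block = 256;
  const size_t grid = (n + block - 1) / block;  // same round-up as CUDA
  q.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{grid * block}, sycl::range<1>{block}},
      [=](sycl::nd_item<1> it) {
        const size_t i = it.get_global_id(0);  // blockIdx*blockDim+threadIdx
        if (i < n) data[i] *= factor;          // bounds check, as in CUDA
      });
  q.wait();
}
```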
hipSYCL in 2021: Peculiarities, Unique Features and SYCL 2020
Speaker: Aksel Alpay, Heidelberg University
Co-Authors: Vincent Heuveline (Heidelberg University)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Tech Talk ID: 23 |
In this talk we will discuss consequences of the hipSYCL design and implications for the development of software with hipSYCL. We will discuss and explain common performance pitfalls, how to circumvent them, and will discuss peculiarities of hipSYCL that open up new approaches. This includes recent hipSYCL extensions such as hipSYCL’s unique scoped parallelism execution model for increased performance portability, the custom operations extension for efficiently submitting backend interoperability operations, and command group properties, as well as the initial implementation of an interoperability infrastructure between buffers and SYCL 2020 USM pointers. Where applicable we will also provide performance measurements to demonstrate the impact of these aspects.
We will also provide an update to the SYCL 2020 implementation status and discuss performance and functionality of implemented key SYCL 2020 features, such as unified shared memory (USM), group algorithms and more.
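For readers unfamiliar with USM, here is a minimal generic SYCL 2020 sketch (not hipSYCL-specific code) of the pointer-based style it enables:

```cpp
// Minimal SYCL 2020 unified shared memory (USM) example (illustrative).
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t N = 1024;
  float* data = sycl::malloc_shared<float>(N, q);  // visible to host + device

  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
    data[i[0]] = float(i[0]);  // plain pointer use, no accessors needed
  }).wait();

  float first = data[0];  // read directly on the host after the wait
  sycl::free(data, q);
  return first == 0.0f ? 0 : 1;
}
```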
Can SYCL and OpenCL Meet the Challenges of Functional Safety?
Speaker: Illya Rudkin, Codeplay Software
Co-Authors: Rod Burns (Codeplay Software)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Paper ID: 20 |
Open standards are being used at the leading edge of HPC and vision technologies in many areas, but automotive functional safety standards were designed with older technologies in mind. Modern programming constructs, system topologies and paradigm-shifting technologies such as AI were not considered when the safety standards were created. Today’s developers are using development processes that are causing a seismic shift in how the automotive industry does business.
In 2006 the software component of a car made up about 50% of the development effort; today it is the majority. The amount of software continues to grow exponentially, driven by cheaper and more powerful hardware and demand for innovative new functionality. The growth of software, and the importance of how it operates, brings numerous challenges, including how the car operates safely in any circumstance. Over the years there have been many news articles on how a car has malfunctioned due to human error. The automotive industry has responded with the publication of the ISO 26262 functional safety standard, now in its second edition.
The automotive industry needs to find a balance between the established safety standards required and the need to use emerging commercially developed technologies to differentiate and sell more vehicles. Automotive tiers are having to look at off-the-shelf solutions (the CoS model) and partner with technology companies with promising solutions. However, these are companies that do not have the same level of safety experience. This approach has worked for a few companies with deep pockets, putting technological wonders into high-end luxury cars, but it will not work for all automotive companies. This means some are looking for alternative ways to develop future ADAS systems, in particular those that will be mandated by law.
Open standards like SYCL and OpenCL are seen as a viable approach for OEMs to lower costs while still having access to a large technical knowledge base. Code can be more easily ported between architectures, developers can take advantage of a wide range of pre-existing libraries and frameworks, and crucially these standards are defined by the industry.
In cars we can find all the interesting problems and challenging issues of software and systems engineering. In this context, this presentation will give an overview of:
+ A brief history of autonomous vehicles
+ The functional safety standard ISO 26262
+ How functional safety affects applications using the SYCL stack
+ Example architecture of an automotive platform
+ Features of SYCL that support or challenge ISO 26262 requirements
Experiences with Adding SYCL Support to GROMACS
Speaker: Andrey Alekseenko, KTH Royal Institute of Technology
Co-Authors: Szilárd Páll (KTH Royal Institute of Technology); Erik Lindahl (KTH Royal Institute of Technology)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Tech Talk ID: 22 |
In this talk, we discuss the experiences and challenges of adding support for the SYCL platform into the established GROMACS codebase and share experiences and considerations in porting and optimization. While OpenCL offers the benefits of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus is not getting the primary support or tuning efforts. SYCL alleviates many of these issues, employing a single-source model based on the modern C++ standard. In addition to being the primary platform for Intel GPUs, the possibility to target AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained.
Some design differences from OpenCL, such as dataflow directed acyclic graphs (DAGs) instead of in-order queues, made it necessary to reconsider GROMACS’s task scheduling approach and architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents the challenge of balancing performance (low-level and hardware-specific code) and maintainability (more generalization and code reuse). We will discuss the limitations of the existing codebase and interoperability layers with regard to adding the new platform; the compute performance and latency comparisons; code quality considerations; and the issues we encountered with the SYCL implementations tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of the GPU acceleration code in GROMACS.
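For context, SYCL queues are out-of-order task DAGs by default; an application that wants OpenCL/CUDA-stream-style ordering can opt in explicitly. A minimal generic sketch (not GROMACS code):

```cpp
// Sketch: requesting in-order semantics from a SYCL queue (illustrative).
#include <sycl/sycl.hpp>

int main() {
  // Default queues schedule an out-of-order DAG; this property forces
  // submissions to execute in submission order, without explicit events.
  sycl::queue q{sycl::property::queue::in_order{}};
  q.single_task([] { /* step 1 */ });
  q.single_task([] { /* step 2: guaranteed to run after step 1 */ });
  q.wait();
  return 0;
}
```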
Extending DPC++ with Support for Huawei Ascend AI Chipset
Speaker: Rasool Maghareh, Huawei Heterogeneous Compiler Lab
Co-Authors: Wilson Feng (Huawei Heterogeneous Compiler Lab); Kai-Ting Amy Wang (Huawei Heterogeneous Compiler Lab)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Tech Talk ID: 15 |
The Ascend AI chipset is Huawei’s ASIC hardware dedicated to artificial intelligence workloads. Similar to Nvidia’s CUDA programming model, the Ascend AI chip exposes a host-device programming model to users. The host can be written in generic C++ source code. The device is programmed in a C-based variant with special Single-Instruction-Multiple-Data (SIMD) extensions: the CCE language. Our main contribution has been adding the CCE backend to DPC++ (presented in the supporting material). This backend enables DPC++ code to be compiled for and executed on Ascend. Since the runtime and compilation steps for the Ascend are similar to CUDA’s, we have implemented the CCE backend along the lines of the CUDA backend in DPC++.
The main parts of the CCE backend are the CCE Runtime Plugin Interface (CCE PI) and the CCE Toolchain.
The CCE PI registers a HiIPU device. This device represents the Ascend AI Chipset, and a kernel can be passed to it using the HiIPU-selector class. In DPC++, a set of high-level plugin APIs is defined which controls the runtime behaviour of a DPC++ program via a plugin interface. The CCE PI is similar to the CUDA PI in many aspects, except that the API calls are lowered to the Ascend runtime API.
The CCE Toolchain refers to the set of derived C++ classes based on the clang driver framework, clang::driver::ToolChain. This extension contains the necessary metadata to orchestrate and pass options to other external tools. For example, this includes our proprietary LLVM-based compiler, CCEC, which compiles our device code down to a native binary. The compilation flow for the Ascend AI Chipset is presented in the supporting material.
Adapting a SIMT programming model to the Ascend architecture is a challenging task, as the underlying vector and cube units within an AICORE are SIMD by design. One important component of the toolchain is the converter pass, which addresses this issue. The converter pass receives the LLVM IR of the device code and generates a hybrid script that is consumed by AKG to produce scheduled, vectorized AICORE device code. We elaborate further on the converter pass in the supporting material.
Many of our colleagues have helped with the implementation of this project. The authors acknowledge the direct and indirect contributions and support of all Huawei colleagues who contributed to this project.
Toward a Better Defined SYCL Memory Consistency Model
Speaker: Ben Ashbaugh, Intel Corporation
Co-Authors: James C Brodman (Intel Corporation); Michael Kinsner (Intel Corporation); Gregory Lueck (Intel Corporation); John Pennycook (Intel Corporation); Roland Schulz (Intel Corporation)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | SYCL | Tech Talk ID: 28 |
In this technical presentation we will:
• Summarize the guarantees and several unexpected non-guarantees that are provided by the memory model in the SYCL 2020 specification, using accessible language and examples.
• Describe related memory models from other parallel programming models that could inform and influence the SYCL memory model, including the C++, OpenCL 2.0, and Vulkan memory models.
• Describe features unique to the SYCL specification that will need to be included in the SYCL memory model, such as unified shared memory, which introduce challenges that haven’t been solved in existing memory models.
We will close the technical presentation with a call for participation in helping to refine and formalize the SYCL memory model for future versions of the specification.
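As a small, generic illustration of the kind of construct such a memory model has to define (not code from the talk), SYCL 2020 expresses scoped atomics with sycl::atomic_ref:

```cpp
// Sketch: a device-scope, relaxed atomic counter in SYCL 2020 (illustrative).
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  int* counter = sycl::malloc_shared<int>(1, q);
  *counter = 0;

  q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1>) {
    // Memory order and scope are part of the type: the memory model
    // defines exactly what this guarantees (and what it does not).
    sycl::atomic_ref<int, sycl::memory_order::relaxed,
                     sycl::memory_scope::device>
        ref(*counter);
    ref.fetch_add(1);
  }).wait();

  const int result = *counter;  // 1024 once the kernel has completed
  sycl::free(counter, q);
  return result == 1024 ? 0 : 1;
}
```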
Enabling OpenCL and SYCL for RISC-V Processors
Speaker: Colin Davidson, Codeplay Software
Co-Authors: Rod Burns (Codeplay Software); Aidan Dodds (Codeplay Software)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | OPENCL + SYCL | Paper ID: 19 |
This presentation will describe the components needed to integrate OpenCL and SYCL onto a RISC-V Vector solution using multiple simulators.
While Codeplay has previously enabled OpenCL for a variety of processor architectures, there are a number of technical challenges involved in delivering a generic integration that can be rapidly used by multiple RVV-based systems: a solution that requires a change in development approach. By adding to the existing LLVM back-end for RISC-V, and creating an integration layer that plugs into OpenCL, we have built a common base architecture for a variety of RISC-V processors.
This presentation will explain how Codeplay’s current driver interface works, and how it has been adapted to integrate with multiple RISC-V targets, in particular the Spike RISC-V ISA simulator. We will also talk about some of the RISC-V Vector extensions that are available, and how these can help to expose features specific to the RISC-V architecture through OpenCL.
RISC-V is a non-profit, member-managed organization that is gaining momentum in the processor space, with more than 900 members. One of the goals of the organization is to build an open software platform, providing software developers an easy way to harness the familiar benefits already available on CPUs and GPUs.
Profiling Heterogeneous Computing Performance with VTune Profiler
Speaker: Vladimir Tsymbal, Intel
Co-Authors: Alexandr Kurylev (Intel)
on-demand video | slides | discuss on slack: #discuss_talks_sycl | OPENCL + SYCL | Tech Talk ID: 09 |
In this presentation we will introduce two new GPU performance analysis types in Intel VTune Profiler, and a methodology for profiling the performance of heterogeneous applications supported by these analyses. VTune Profiler is a well-known tool for performance characterization on CPUs; it now includes GPU Offload Analysis and GPU Hotspots Analysis for applications written in most offloading models, with OpenCL, SYCL/Data Parallel C++, and OpenMP Offload.
FAST: A Framework for High-Performance Medical Image Computing and Visualization
Speaker: Erik Smistad, Norwegian University of Science and Technology and SINTEF
on-demand video | slides | discuss on slack: #discuss_talks_opencl | OpenCL | Tech Talk ID: 17 |
The development of FAST started in 2014; the goal was to create an open-source framework which made GPU and parallel processing of medical images easy and portable. While there existed popular image processing libraries such as the Visualization Toolkit (VTK), the Insight Toolkit (ITK) and OpenCV, their GPU processing capabilities were still implemented ad hoc and often implied copying data back and forth between the GPU and CPU. Thus it was decided to use the new OpenCL API to create a cross-platform framework designed bottom-up with GPU processing at its very core. One of the design goals was to remove from the developer the burden of moving data back and forth between different processors and memory spaces. Instead, the developer requests access to the data on a given processor, and FAST will copy and update data as needed. Now, seven years later, FAST version 3.2 has been released; it still uses OpenCL 1.2 and OpenGL 3.3 at the core of almost all of its operations. FAST can stream images in real time from ultrasound scanners, webcams and Intel’s RealSense depth camera, and can read many different formats from disk, including medical formats such as DICOM, MetaImage and huge microscopy images stored as tiled image pyramids.
FAST uses a processing pipeline concept, meaning that you define a pipeline as multiple processing and visualization steps first, then initiate the processing by executing the pipeline. FAST pipelines can be created with C++, Python and even without programming using simple text files.
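A rough sketch of that pipeline style in C++: declare the steps, connect them, then run. Class names, method names and header paths below are approximations for illustration only; consult the FAST documentation for the exact API.

```cpp
// Illustrative FAST-style pipeline (names/headers approximate, not verified).
#include <FAST/Importers/ImageFileImporter.hpp>
#include <FAST/Visualization/ImageRenderer/ImageRenderer.hpp>
#include <FAST/Visualization/SimpleWindow.hpp>

int main() {
  using namespace fast;
  auto importer = ImageFileImporter::New();   // step 1: load an image
  importer->setFilename("ultrasound.mhd");

  auto renderer = ImageRenderer::New();       // step 2: visualize it
  renderer->addInputConnection(importer->getOutputPort());

  auto window = SimpleWindow::New();          // executing the window pulls
  window->addRenderer(renderer);              // data through the pipeline,
  window->start();                            // triggering the processing
  return 0;
}
```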
In the last five years or so, deep neural networks have become the standard for almost all image processing tasks. Many high-performance frameworks for deep neural network inference already exist, but they have very different APIs and use different formats for storing neural network models. FAST now provides a common API for neural networks with multiple backends such as NVIDIA’s TensorRT, Intel’s OpenVINO and Google’s TensorFlow. This removes the burden on the user of learning the API of every inference library, and makes neural network inference as simple as loading a model stored on disk.
This presentation will cover the FAST framework: the idea and goal behind it, how it is designed, the challenges faced in its making, and the plans for the future. FAST is open-source, and we invite the community to contribute through GitHub at https://github.com/smistad/FAST
Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs
Speaker: Douglas Doerfler, Lawrence Berkeley National Laboratory
on-demand video | slides | discuss on slack: #discuss_talks_opencl | OpenCL | Paper ID: 02 |
Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis
Speaker: Devon Callanan, University of Pittsburgh
on-demand video | slides | discuss on slack: #discuss_talks_opencl | OpenCL | Paper ID: 08 |
Performance Evaluation and Improvements of the PoCL Open-Source OpenCL Implementation on Intel CPUs
Speaker: Tobias Baumann, Zuse Institute Berlin
on-demand video | slides | discuss on slack: #discuss_talks_opencl | OpenCL | Paper ID: 30 |
Towards Evaluating High-Level Synthesis Portability and Performance Between Intel and Xilinx FPGAs
Speaker: Anthony M Cabrera, Oak Ridge National Laboratory
on-demand video | slides | discuss on slack: #discuss_talks_opencl | OpenCL | Paper ID: 31 |
In this work, we evaluate the portability and performance of Intel and Xilinx kernels. We use OpenCL C implementations of a subset of the Rodinia benchmarking suite that were designed for an Intel FPGA and make the necessary modifications to create synthesizable OpenCL C kernels for a Xilinx FPGA. We find that the difficulty of porting certain kernel optimizations varies depending on the construct. Once the minimum set of modifications is made to create synthesizable hardware for the Xilinx platform, though, more non-trivial work is necessary to improve performance. However, we find that constructs that are known to be performant for an FPGA should improve performance regardless of the platform; the difficulty comes in deciding how to invoke certain kernel optimizations while also abiding by the constraints enforced by a given platform’s hardware compiler.
Executing Graphs with OpenCL
Speaker: Erik Tomusk, Codeplay Software
on-demand video | slides | discuss on slack: #discuss_talks_opencl | OpenCL | Tech Talk ID: 12 |
There is growing demand from industry and individual users to run programs that are based on ML graphs. This demand is being met by hardware vendors, who are designing increasingly heterogeneous accelerator devices that can efficiently execute graphs. Since its creation, OpenCL has been a key API for bridging the gap between user applications and accelerator hardware. The question, then, is whether OpenCL is an appropriate API for this new breed of graph software running on these new, highly heterogeneous accelerators. Does OpenCL have the expressive power required to describe graphs to graph accelerator hardware?
In this technical presentation, we will argue that the answer is yes, OpenCL is sufficiently expressive to allow an ML library to describe an execution graph, and it is sufficiently powerful to execute that graph on a graph accelerator. We will use graphs from real applications to demonstrate the possibility of data dependency tracking using OpenCL events and memory buffers. We will show how built-in kernels can be used to simplify scheduling to the device. Where appropriate, the presentation will be supported by lessons learned from Codeplay’s ComputeAorta OpenCL implementation.
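As a minimal illustration of the event-based dependency tracking argued for here (generic OpenCL, not ComputeAorta code), a graph edge between two nodes is just an event wait list:

```cpp
// Sketch: a two-node graph expressed with core OpenCL events (illustrative).
#include <CL/cl.h>

void run_two_node_graph(cl_command_queue q, cl_kernel nodeA, cl_kernel nodeB,
                        size_t gws) {
  cl_event a_done;
  // Node A: no dependencies.
  clEnqueueNDRangeKernel(q, nodeA, 1, nullptr, &gws, nullptr,
                         0, nullptr, &a_done);
  // Node B: the event wait list encodes the graph edge A -> B.
  clEnqueueNDRangeKernel(q, nodeB, 1, nullptr, &gws, nullptr,
                         1, &a_done, nullptr);
  clReleaseEvent(a_done);
  clFinish(q);
}
```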
Posters
Registered delegates will be able to view the on-demand video presentations of these poster sessions from Tuesday 27th April 2021. All our speakers will be asked to check their Slack channel regularly to answer any questions coming in from the community.
Enabling the Use of C++20 Unseq Execution Policy for OpenCL
Speaker: Po-Yao Chang, National Tsing Hua University
Co-Authors: Tai-Liang Chen (National Tsing Hua University); Jenq-Kuen Lee (National Tsing Hua University)
on-demand video | poster | discuss on slack: #discuss_posters | OpenCL | Poster ID: 05 |
Experimenting with C++ Libraries in OpenCL Kernel Code
Speaker: Ole M Strohm, Arm
Co-Authors: Anastasia Stulova, Arm
on-demand video | slides | discuss on slack: #discuss_posters | OpenCL | Poster ID: 06 |
As C++ for OpenCL matures, the viability of porting existing C++ code such as the standard libraries comes closer to reality. This would let C++ for OpenCL use more current and new features of C++, given that nearly 70% of the C++ specification pertains to the standard library [2].
In this endeavor we have experimented with the available open-source C++ standard library libcxx [3], which is part of the LLVM project. By utilizing existing C++ code for OpenCL, development time can be decreased, and more robustness can be provided to OpenCL kernel developers.
While we have conducted an evaluation of multiple libraries within libcxx, our focus was on supporting the type traits library. Type traits is a powerful metaprogramming utility library, and it is well suited to OpenCL as it mainly provides general compile-time features. Also, it is a header-only library, which makes it very easy to integrate and deploy into the OpenCL development environment. Attempting to use type traits in C++ for OpenCL did expose bugs in the compiler that were fixed upstream. We have also added two Clang compile time extensions that allowed us to support the full functionality of the type traits library from C++17 without the risk of exposing non-conformant behavior to the application developers.
Currently 88% of the modified libcxx tests for type traits pass with the upstream implementation of C++ for OpenCL, and the rest of the tests cannot currently be supported because they contain functionality from C++20 or other functionality outside of C++ for OpenCL’s scope. 58% of the passing tests had to be modified by removing features unsupported in OpenCL, such as virtual methods and function pointers.
As a result of this work, type traits are currently enabled for users of C++ for OpenCL to experiment with in offline compilation, starting from Clang 12 [4]. We anticipate that more functionality will be added in the future, and we aim to outline possible directions in the conclusion.
[1] https://www.iwocl.org/wp-content/uploads/iwocl-syclcon-2020-stulova-13-slides.pdf
[2] https://www.stroustrup.com/bs_faq.html#big
[3] https://libcxx.llvm.org/
[4] https://clang.llvm.org/docs/OpenCLSupport.html#experimental-features
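To make the idea concrete, here is a hedged sketch of the kind of kernel code this work enables (illustrative only; it assumes the ported type_traits headers are on the include path and offline compilation with a recent clang):

```cpp
// C++ for OpenCL kernel using the ported type_traits (illustrative sketch).
#include <type_traits>

template <typename T>
T sum(T a, T b) {
  // Pure compile-time metaprogramming: nothing here reaches the device binary.
  static_assert(std::is_arithmetic<T>::value,
                "sum() requires an arithmetic type");
  return a + b;
}

kernel void vadd(global const float* a, global const float* b,
                 global float* out) {
  size_t i = get_global_id(0);
  out[i] = sum(a[i], b[i]);
}
```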
Trip Down the Compute Pipeline
Speaker: Lukasz Towarek, Intel Corporation
on-demand video | slides | discuss on slack: #discuss_posters | OpenCL | Poster ID: 07 |
How do OpenCL implementations compare with those supporting new-generation APIs? What is the architecture of an OpenCL driver? What is happening under the driver’s hood?
We answer these questions based on the open source OpenCL driver stack for Intel Processor Graphics. We present activities that are performed by each component of the driver stack when handling critical OpenCL API calls like clGetPlatformIDs, clBuildProgram, clEnqueueWriteBuffer, clEnqueueNDRangeKernel, clEnqueueReadBuffer and clFinish.
Machine Learning Training with Tensor Virtual Machine (TVM) and Adreno GPUs
Speaker: Siva Rama Krishna Reddy, Qualcomm
Co-Authors: Hongqiang Wang (Qualcomm); Adarsh Golikeri (Qualcomm); Alex Bourd (Qualcomm)
on-demand video | slides | discuss on slack: #discuss_posters | OpenCL | Poster ID: 34 |
After having enabled mainline TVM on Adreno GPUs for machine learning inference, which was presented at IWOCL 2020, we made further progress by enhancing TVM for training deep learning networks with the OpenCL backend on Adreno GPUs. To the best of our knowledge, this effort is the first of its kind on TVM.
Machine learning training on mobile is emerging as an exciting topic due to rising concerns about privacy and demand for customized user experiences. Compared with inference on mobile, training on mobile poses significant challenges in terms of functionality, complexity and resources. Besides the inference functionality, training requires loss functions, gradient computation, optimizers, and a workflow to progress the training. As of today, mainline TVM has not enabled training support, except for a few gradient operators.
To enable training on TVM, we implemented a graph pass to build a backward graph given a forward graph, and various gradient operators, including complex layers like depthwise convolutions and batchnorm. In addition, a cross-entropy loss function and optimizers like SGD and Adam are also implemented. On top of these, a workflow was implemented to maintain the various trainable parameters, optimizer state, etc.
With all these amendments to the TVM framework, we can now train LeNet-5 and MobileNet-v1 natively on target, starting with randomly initialized weights, and achieve the desired accuracy of greater than 90% in a few epochs. This effort showcases the capability of TVM with the OpenCL backend on Adreno GPU targets.
In the poster we will discuss the approach, the various challenges we faced during these efforts, and our thoughts on the future of training with the TVM framework.
SYCL for Vitis 2020.2: SYCL & C++20 on Xilinx FPGA
Speaker: Gauthier Harnisch, Xilinx
Co-Authors: Andrew Gozillon (University of the West of Scotland); Ronan Keryell (Xilinx); Lin-Ya Yu (Xilinx, Inc.); Ralph Wittig (Xilinx); Luc Forget (Xilinx)
poster | discuss on slack: #discuss_posters | SYCL | Poster ID: 27 |
We present a SYCL implementation targeting Xilinx Alveo FPGA cards by merging 2 different open-source implementations: Intel’s oneAPI DPC++ with some LLVM passes from triSYCL.
The FPGA device configuration is generated by Xilinx Vitis 2020.2 fed with SPIR LLVM IR, and Xilinx XRT is used as the host OpenCL API to control the device.
This is an ongoing open-source project available on: https://github.com/triSYCL/sycl
Bringing SYCL to Ampere Architecture
Speaker: Steffen Larsen, Codeplay Software
Co-Authors: Rod Burns (Codeplay Software); Brandon Cook (Lawrence Berkeley National Laboratory); Douglas Doerfler (Lawrence Berkeley National Laboratory); Kevin Harms (Lawrence Berkeley National Laboratory); Thomas Applencourt (Lawrence Berkeley National Laboratory); Stuart Adams (Codeplay Software)
on-demand video | slides | discuss on slack: #discuss_posters | SYCL | Poster ID: 16 |
SYCL 2020 is a significant step towards bringing C++ heterogeneous programming to all. It supports diverse applications, including HPC supercomputing centers, powerful machine learning frameworks, and creative and professional applications on embedded and desktop PCs. One of the key improvements of SYCL 2020 is the new backend model, which allows a SYCL implementation to target multiple heterogeneous APIs, including CUDA. This makes SYCL an attractive target for frameworks and libraries, allowing them to target a wide range of platforms without having to port and translate their code. Over the next year, Codeplay™ will help improve DPC++ by improving its CUDA backend and contributing new features from the SYCL 2020 provisional specification, including Unified Shared Memory (USM), reductions, subgroups, unnamed lambdas and in-order queue execution. Key among these new features is USM, a new pointer-based alternative to the buffer programming model that provides the ability to create allocations that are visible to both the device and the host. Although there is support for USM already upstream, our project aims to provide further testing and a stable interface. Codeplay will implement CUDA support for these new features and ensure that they are performant on the NVIDIA A100 platform and recent CUDA toolkit versions.
As part of the contributions to the SYCL community in general, and the DPC++ CUDA backend in particular, Codeplay will also provide new extensions to SYCL 2020 and DPC++ that allow developers to take advantage of CUDA-specific APIs and features.
These extensions will help developers deliver performance on the NVIDIA A100 platform. Planned extensions include new SYCL APIs that will expose the NVIDIA A100’s Tensor Cores and hardware-accelerated barriers. Codeplay will design and implement these new extensions, adding the necessary changes to DPC++’s CUDA backend and extending LLVM’s NVPTX backend to support the SM 80 architecture.
Improving DPC++’s NVIDIA multi-GPU support will be essential for the NVIDIA A100. Codeplay will contribute support for multiple CUDA devices with different SYCL contexts, device-to-device memory transfers and group collective operations.
Path Tracing on FPGA with SYCL and C++20
Speaker: Luc Forget (INRIA)
Co-Authors: Krishna Kumar Ranipet Murugan (NCSU); Ronan Keryell (Xilinx); Gauthier Harnisch (Xilinx)
poster | discuss on slack: #discuss_posters | SYCL | Poster ID: 25 |
We present a straightforward path tracer implementation in modern C++, using SYCL to offload the compute-intensive part to accelerators.
Usually, ray tracer and path tracer implementations rely on dynamic polymorphism to handle objects with different shapes and different materials, but this is currently unsupported in SYCL since accelerators often cannot handle function pointers. Instead of polymorphism, we rely on C++17 std::variant and std::visit to dispatch operations with duck typing in a type-safe way.
std::visit can be executed in O(1) on FPGA because the dispatch is spatialized on the architecture.
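A minimal sketch of this dispatch technique (illustrative types only, not the project’s actual code):

```cpp
// std::variant/std::visit dispatch instead of virtual functions, so the same
// code can run in device code where function pointers are unavailable.
#include <variant>

struct Ray { float origin[3]; float dir[3]; };

struct Sphere {
  float center[3]; float radius;
  float intersect(const Ray&) const { return 1.0f; /* real math elided */ }
};

struct Triangle {
  float v0[3], v1[3], v2[3];
  float intersect(const Ray&) const { return 2.0f; /* real math elided */ }
};

using Hittable = std::variant<Sphere, Triangle>;

// Duck-typed, type-safe dispatch: each alternative only needs intersect().
float intersect(const Hittable& h, const Ray& r) {
  return std::visit([&](const auto& shape) { return shape.intersect(r); }, h);
}
```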
This is an ongoing open-source project available on https://github.com/triSYCL/path_tracer
SYCL for Xilinx Versal ACAP AIE CGRA
Speaker: Ronan Keryell, Xilinx
Co-Authors: Andrew Gozillon (University of the West of Scotland); Gauthier Harnisch (Xilinx); Hyun Kwon (Xilinx); Ravikumar Chakaravarthy (Xilinx); Ralph Wittig (Xilinx)
poster | discuss on slack: #discuss_posters | SYCL | Poster ID: 29 |
Xilinx Versal ACAP is a new system-on-chip (SoC) device integrating diverse computing resources, such as various CPUs, an FPGA and a coarse-grain reconfigurable array (CGRA), interconnected by a network-on-chip (NoC).
The AIE CGRA is an array of 400 VLIW DSPs operating on 512-bit vectors, each with its own neighborhood distributed memory.
We expose architectural details to the programmer through some SYCL extensions.
We present a SYCL implementation targeting the AIE CGRA by merging 2 different open-source implementations: Intel’s oneAPI DPC++ with some LLVM passes from triSYCL, and a new SYCL runtime from triSYCL.
The SYCL device compiler generates LLVM IR for the Synopsys ASIP CHESS compiler generating the AIE instructions.
The host runtime runs on the ARM A72 CPU of the ACAP and controls the CGRA through the Xilinx libaiengine library.
This is an ongoing project we plan to open-source soon.
Approaching Coupled Cluster Theory with Perturbative Triples using SYCL
Speaker: Abhishek Bagusetty, Argonne National Laboratory
Co-Authors: Jinsung Kim (Pacific Northwest National Laboratory); Ajay Panyala (Pacific Northwest National Laboratory); Alvaro Vazquez-Mayagoitia (Argonne National Laboratory); Karol Kowalski (Pacific Northwest National Laboratory); Sriram Krishnamoorthy (Pacific Northwest National Laboratory)
No Video or Slides Available | SYCL | Poster ID: 32 |
The overarching goal of the oneAPI ecosystem, which provides Data Parallel C++ (DPC++), is to enable domain-science applications to target multiple hardware architectures like multi-core CPUs, GPUs, and even FPGAs, with the advantage of single-source code portability. In this process, our workflow involved transitioning the (T) calculation, written in CUDA and optimized for Nvidia GPUs, to SYCL. Intel’s DPC++ Compatibility Tool was used to facilitate this porting of our recent CUDA kernel to SYCL-based DPC++. Our recent novel algorithm [1] for the (T) calculation involves the minimization of GPU memory bandwidth usage. This approach fuses all the tensor contractions involved in the (T) calculation at the register level so as to eliminate global-to-shared memory data transfer for the largest six-dimensional intermediate. The SYCL implementation uses a hybrid memory model with both pointer-based and traditional buffer-based memory management. A majority of the data structures use pointer-based unified shared memory (USM) for explicit and robust control over data transfers between the host and devices. On the other hand, shared local memory is managed using the SYCL buffer model.
In this work, we demonstrate the performance aspects of our SYCL implementation for the (T) calculation. Our primary analysis is associated with Intel’s integrated and discrete GPU architectures with the focus on roofline analysis.
[1] Jinsung Kim et al. Scalable heterogeneous execution of a coupled-cluster model with perturbative triples. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’20). IEEE Press, Article 79, 1–15.