Monday 14 May – DHPCC++ Conference – ALL DAY
The Distributed & Heterogeneous Programming in C/C++ conference is hosted by IWOCL.
In response to the demand for heterogeneous programming models for C and C++, and to the interest in driving these models into ISO C++, Distributed & Heterogeneous Programming in C/C++ covers all of the programming models designed to support heterogeneous programming in C and C++. Many such models now exist, including SYCL, HPX, Kokkos, RAJA, C++AMP, HCC, Boost.Compute, and CUDA, to name a few.
This conference aims to address the needs of both the HPC and the consumer/embedded communities, where a number of C++ parallel programming frameworks have been developed to address the needs of multi-threaded and distributed applications. The C++11/14/17 International Standards have introduced new tools for parallel programming into the language, and the ongoing standardization effort is developing additional features that will bring support for heterogeneous and distributed parallelism into ISO C++20/23.
DHPCC++ is an ideal place to discuss research in this domain, consolidate usage experience, and share new directions to support new hardware and memory models with the aim of passing that experience to ISO C and C++.
Modern C++, Heterogeneous Programming Models, and Compiler Optimization
What do our parallel programming models in HPC look like now? What will they look like in the future? As we approach the exascale era and hardware diversifies, how can we continue to achieve high performance while also maintaining high productivity? In this talk, we’ll explore the changing trends in parallel programming models in HPC, and how those trends may be influenced by trends in the C++ ecosystem and by parallelism-aware compiler-optimization technology.
Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX
Optimizing the performance of complex simulation codes with high computational demands, such as Octo-Tiger, is an ongoing challenge. Octo-Tiger is an astrophysics code that simulates the evolution of star systems based on the fast multipole method on adaptive octrees. It was implemented using high-level C++ libraries, specifically HPX and Vc, which allows it to run on different hardware platforms. Recently, we have demonstrated excellent scalability in a distributed setting.
In this paper, we study Octo-Tiger’s node-level performance on an Intel Knights Landing platform. We focus on the fast multipole method, as it is Octo-Tiger’s most computationally demanding component. Using HPX and a futurization approach, we can efficiently traverse the adaptive octrees in parallel. On the core level, threads process sub-grids using multiple 743-element stencils.
In numerical experiments, simulating the time evolution of a rotating star on an Intel Xeon Phi 7250 Knights Landing processor, Octo-Tiger shows good parallel efficiency and achieves up to 408 GFLOPS. This results in a speedup of 2x compared to a 24-core Skylake-SP platform, using the same high-level abstractions.
Introducing Parallelism to the Ranges TS
The current interface provided by the C++17 parallel algorithms imposes some limitations with respect to parallel data access and heterogeneous systems, such as personal computers and server nodes with GPUs, smartphones, and embedded System-on-a-Chip (SoC) devices. In this work, we present a summary of why we believe the Ranges TS solves these problems and also improves both programmability and performance on heterogeneous platforms.
The complete paper has been submitted to WG21 for consideration, and we present here a summary of the changes proposed alongside new performance results.
To the best of our knowledge, this is the first paper presented to WG21 that unifies the Ranges TS with the parallel algorithms introduced in C++17. Although there are various points of intersection, we will focus on the composability of functions, and the benefit that this brings to accelerator devices via kernel fusion.
Towards Heterogeneous and Distributed Computing in C++
SYCL-Based Data Layout Abstractions for CPU+GPU Codes
The focus of the talk will be on data layouts that meet the requirements of both CPU and GPU. As both draw on a data-parallel (SIMD/SIMT) execution model, contiguous data loads and stores, as well as data alignment and padding, are essential to obtaining each platform’s performance. While AoS (Array of Structs) layouts on the GPU can help increase per-work-item instruction-level parallelism (ILP) by processing short vectors (e.g. SYCL’s vector data types), the same layout on the CPU results in expensive gather and scatter operations, or might prevent the compiler from generating SIMD instructions at all due to compiler-internal performance models. Transferring data between the host CPU system and the GPU is a natural point at which to transform between (hybrid) SoA (Struct of Arrays) and AoS data layouts on the fly.
We will present C++ data types that encapsulate these layout transformations, as well as SYCL specifics regarding buffer and accessor management. We will show early results from adapting an application from the field of electrical engineering to use our data types and SYCL for GPU computations. Benchmarks on a dual-socket 16-core Intel Xeon E5-2630v3 CPU node with AMD FirePro W8100 GPUs show portable performance, with about a 2x difference in overall program execution time between optimized GPU and CPU code (using OpenMP threading and SIMD constructs).
Lunch – College Dining Hall
Journey to the Centre of the OpenCL Memory Model
Since version 2.0, OpenCL has supported C/C++11-style atomic operations. These operations, known as “atomics”, allow barrier-free communication between work-items — even those from different work-groups. The semantics of atomics is set out by the OpenCL memory model, but things all get rather complicated because there are so many different kinds — acquire atomics, release atomics, seq_cst atomics, relaxed atomics, etc. — and each can be scoped so that it is visible only within its work-group, only within its device, or throughout the whole system.
In this talk, I will discuss my efforts to tackle this complexity through formalisation. I will
* present a mechanised version of the OpenCL memory model that accounts for all the various kinds of atomic,
* explain how this mechanised model can be used as a basis for verifying that the language has been implemented correctly on GPUs, and
* show how the OpenCL memory model can be implemented correctly and efficiently on reconfigurable hardware devices (FPGAs).
This talk is based on joint work with: Mark Batty, Bradford M. Beckmann, George A. Constantinides, Alastair F. Donaldson, Nadesh Ramanathan, and Tyler Sorensen.
Early Experiments Using SYCL Single-Source Modern C++ on Xilinx FPGA
Heterogeneous computing is required in systems ranging from low-end embedded devices up to high-end HPC machines to reach high performance while keeping power consumption low. Having more and more CPUs and accelerators such as FPGAs creates challenges for programmers, requiring ever more expertise from them. Fortunately, modern C++-based domain-specific languages, such as the SYCL open standard from the Khronos Group, simplify programming at the full-system level while keeping performance high.
SYCL is a single-source programming model providing a task graph of heterogeneous kernels that can be run on various accelerators, or even just on the CPU. Memory heterogeneity is abstracted through buffer objects, and memory usage is abstracted with accessor objects. From these accessors the task graph is implicitly constructed, and the synchronizations and data movements across the various physical memories are performed automatically, in contrast to lower-level APIs such as OpenCL, where they must be managed explicitly.
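A minimal sketch of this buffer/accessor pattern, assuming a conformant SYCL 1.2.1 implementation (the kernel name and sizes are illustrative; building it requires a SYCL toolchain such as triSYCL or ComputeCpp):

```cpp
#include <CL/sycl.hpp>
#include <vector>
using namespace cl::sycl;

int main() {
  std::vector<float> data(1024, 1.0f);
  queue q;  // selects a default device
  {
    // The buffer abstracts where the data physically lives.
    buffer<float, 1> buf(data.data(), range<1>(data.size()));
    q.submit([&](handler& cgh) {
      // The accessor both grants access and records the data dependency,
      // from which the runtime builds the task graph implicitly.
      auto acc = buf.get_access<access::mode::read_write>(cgh);
      cgh.parallel_for<class scale>(range<1>(data.size()),
                                    [=](id<1> i) { acc[i] *= 2.0f; });
    });
  }  // buffer destruction synchronizes: results are copied back to the host
}
```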
triSYCL is an ongoing open-source project used to experiment with the SYCL standard, based on C++17, OpenCL, OpenMP, and Clang/LLVM. We have extended this framework to target the Xilinx SDx tool, compiling some SYCL programs to run on a CPU host connected to FPGA PCIe cards by using the OpenCL and SPIR standards from Khronos.
While SYCL provides functional portability, we made a few FPGA-friendly extensions to express some optimizations for the SDx back-end in a pure C++ way.
We present some interesting preliminary results with simple benchmarks showing how to express pipelining, dataflow, and array partitioning, and we compare with implementations written in the other languages available for Xilinx FPGAs: HLS C++ and OpenCL.
Distributed & Heterogeneous Programming in C++ for HPC at SC17
In response to the HPC community’s requirement to achieve exascale performance through heterogeneous programming models for C++, and to the interest in driving these models into ISO C++, we held a BoF session at Supercomputing 2017 (SC17). This paper reports on the results of that BoF.
The BoF had panelists representing several important C++ frameworks that support heterogeneous and distributed computing. We specifically invited key members of SYCL, ISO C++, Kokkos, RAJA, HPX, HCC, and HiHat, as well as representatives from AMD, Intel, Nvidia, Codeplay, and Xilinx.
We had time to address the top three questions and to collate the resulting discussion, yielding some important conclusions about what is urgent and important to the HPC community for heterogeneous and distributed C++.
Heterogeneous C++ in 2020: is it real or just fantasy?