Monday 14 May – DHPCC++ Conference – ALL DAY
The Distributed & Heterogeneous Programming in C/C++ conference is hosted by IWOCL.
In response to the demand for heterogeneous programming models for C/C++, and the interest in driving these models in ISO C++, Distributed & Heterogeneous Programming in C/C++ includes all the programming models that have been designed to support heterogeneous programming in C and C++. Many models now exist including SYCL, HPX, KoKKos, Raja, C++AMP, HCC, Boost.Compute, and CUDA to name a few.
This conference aims to address the needs of both HPC and the consumer/embedded community where a number of C++ parallel programming frameworks have been developed to address the needs of multi-threaded and distributed applications. The C++11/14/17 International Standards have introduced new tools for parallel programming to the language, and the ongoing standardization effort is developing additional features which will enable support for heterogeneous and distributed parallelism into ISO C++ 20/23.
DHPCC++ is an ideal place to discuss research in this domain, consolidate usage experience, and share new directions to support new hardware and memory models with the aim of passing that experience to ISO C and C++.
Introducing Parallelism to the Ranges TS
The current interface provided by the C++17 parallel algorithms poses some limitations with respect to parallel data access and heterogeneous systems, such as personal computers and server nodes with GPUs, smartphones, and embedded System on a Chip chipsets. In this work, we present a summary of why we believe the Ranges TS solves these problems, and also improves both programmability and performance on heterogeneous platforms.
The complete paper has been submitted to WG21 for consideration, and we present here a summary of the changes proposed alongside new performance results.
To the best of our knowledge, this is the first paper presented to WG21 that unifies the Ranges TS with the parallel algorithms introduced in C++17. Although there are various points of intersection, we will focus on the composability of functions, and the benefit that this brings to accelerator devices via kernel fusion.
Early Experiments Using SYCL Single-Source Modern C++ on Xilinx FPGA
Heterogeneous computing is required in systems ranging from low-end embedded systems up to the high-end HPC systems to reach high-performance while keeping power consumption low. Having more and more CPU and accelerators such as FPGA creates challenges for the programmer, requiring even more expertise of them. Fortunately, new modern C++-based domain-specific languages, such as the SYCL open standard from Khronos Group, simplify the programming at the full system level while keeping high performance.
SYCL is a single-source programming model providing a task graph of heterogeneous kernels that can be run on various accelerators or even just the CPU. The memory heterogeneity is abstracted through buffer objects and the memory usage is abstracted with accessor objects. From these accessors, the task graph is implicitly constructed, the synchronizations and the data movements across the various physical memories are done automatically, by opposition to
triSYCL is an on-going open-source project used to experiment with the SYCL standard, based on C++17, OpenCL, OpenMP and Clang/LLVM. We have extended this framework to target Xilinx SDx tool to compile some SYCL programs to run on a CPU host connected to some FPGA PCIe cards, by using OpenCL and SPIR standards from Khronos.
While SYCL provides functional portability, we made a few FPGA-friendly extensions to express some optimization to the SDx back-end in a pure C++ way.
We present some interesting preliminary results with simple benchmarks showing how to express pipeline, dataflow and array-partitioning and we compare with the implementation written using other languages available for Xilinx FPGA: HLS C++ and OpenCL.
Towards Heterogeneous and Distributed Computing in C++
Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX
The optimization of performance of complex simulation codes with high computational demands, such as Octo-Tiger, is an ongoing challenge. Octo-Tiger is an astrophysics code simulating the evolution of star systems based on the fast multipole method on adaptive octrees. It was implemented using high-level C++ libraries, specifically HPX and Vc, which allows its use on different hardware platforms. Recently, we have demonstrated excellent scalability in a distributed setting.
In this paper, we study Octo-Tiger’s node-level performance on an Intel Knights Landing platform. We focus on the fast multipole method, as it is Octo-Tiger’s computationally most demanding component. By using HPX and a futurization approach, we can efficiently traverse the adaptive octrees in parallel. On the core-level, threads process sub-grids using multiple 743-element stencils.
In numerical experiments, simulating the time evolution of a rotating star on an Intel Xeon Phi 7250 Knights Landing processor, Octo-Tiger shows good parallel efficiency and achieves up to 408 GFLOPS. This results in a speedup of 2x compared to a 24-core Skylake-SP platform, using the same high-level abstractions.
Distributed & Heterogeneous Programming in C++ for HPC at SC17
In response to the HPC requirements to achieve exascale performance through heterogeneous programming models for C++, and the interest in driving these models in ISO C++, we held a BoF session at SuperComputing (SC) 17. This paper is a report on the result of that BoF.
The BoF had panelists that represented several important C++ frameworks that support heterogeneous and distributed computing. We specifically invited key members of SYCL, ISO C++, Kokkos, Raja, HPX, HCC, and HiHat as well as representatives from AMD, Intel, Nvidia, Codeplay, and Xilinx.
We had time to address the top three questions and collate the resulting discussion showing some important conclusions on what is urgent and important to the HPC community for Heterogeneous and Distributed C++.