DHPCC++ 2019 CONFERENCE PROGRAM

The Distributed & Heterogeneous Programming in C/C++ conference is hosted by IWOCL.

About DHPCC++

In response to the demand for heterogeneous programming models for C/C++, and the interest in driving these models in ISO C++, Distributed & Heterogeneous Programming in C/C++ includes all the programming models that have been designed to support heterogeneous programming in C and C++. Many models now exist, including SYCL, HPX, Kokkos, RAJA, C++ AMP, HCC, Boost.Compute, and CUDA, to name a few.

This conference aims to address the needs of both HPC and the consumer/embedded community, where a number of C++ parallel programming frameworks have been developed to address the needs of multi-threaded and distributed applications. The C++11/14/17 International Standards have introduced new tools for parallel programming to the language, and the ongoing standardization effort is developing additional features which will enable support for heterogeneous and distributed parallelism in ISO C++20/23.

DHPCC++ is an ideal place to discuss research in this domain, consolidate usage experience, and share new directions to support new hardware and memory models with the aim of passing that experience to ISO C and C++.

Monday 13 May – DHPCC++ 2019 Conference – ALL DAY

Keynote: Performance Transparency and Performance Portability

Geoff Lowney, Senior Fellow, Intel

Performance transparency is a property of a programming system that allows the programmer to predict accurately how his or her program will perform on a target platform. Performance portability is a property of a programming system that allows a program developed for one target platform to obtain acceptable performance on a different target. Both properties are desirable, but they may be in conflict. In this talk, I will discuss how these properties have been supported in the past and present some directions for the future.

Evaluating Data Parallelism in C++ Using the Parallel Research Kernels

Jeff Hammond and Tim Mattson | Intel

This presentation will describe the Parallel Research Kernels (PRK) and how this framework has been used to evaluate a wide range of programming models for shared and distributed memory in many languages. In particular, we have implemented the kernels most relevant to high-performance computing in many of the modern C++ frameworks for data parallelism: C++17 parallel STL, Khronos SYCL, Intel Threading Building Blocks, Kokkos and RAJA. These are evaluated against C++ implementations using OpenMP (traditional host, task-based and target variants) and OpenCL. Because the PRK are simple, it is possible to implement equivalent code in each of these to perform objective comparisons of both performance and productivity. We will describe some of our performance measurements on Intel Xeon Scalable and Intel Xeon Phi processors and describe how the PRKs can be used on other platforms, such as GPUs.

The PRK project is open-source (https://github.com/ParRes/Kernels) with a permissive license and is known to support a wide range of toolchains and platforms, hence should be easy for others to adopt in their own research activities.
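
As a rough illustration of what "equivalent code" in more than one model can look like, the sketch below writes one small data-parallel update twice: once as a plain loop and once with standard algorithms. The triad-style kernel and the names triad_loop and triad_algorithm are assumptions made for this example; the code is not taken from the PRK repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative triad-style kernel: a[i] += b[i] + scalar * c[i].
// The kernel choice and all names are assumptions made for this sketch.

// Version 1: plain indexed loop.
inline void triad_loop(std::vector<double>& a, const std::vector<double>& b,
                       const std::vector<double>& c, double scalar) {
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] += b[i] + scalar * c[i];
    }
}

// Version 2: the same update written with standard algorithms over an index range.
// With a toolchain that ships the C++17 parallel algorithms, an execution policy
// such as std::execution::par could be passed as the first argument to std::for_each.
inline void triad_algorithm(std::vector<double>& a, const std::vector<double>& b,
                            const std::vector<double>& c, double scalar) {
    assert(a.size() == b.size() && a.size() == c.size());
    std::vector<std::size_t> index(a.size());
    std::iota(index.begin(), index.end(), std::size_t{0});
    std::for_each(index.begin(), index.end(),
                  [&](std::size_t i) { a[i] += b[i] + scalar * c[i]; });
}
```

Because both functions compute the same result, they can be timed against each other and compared for code size and readability, which is the style of comparison the abstract describes.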

Heterogeneous Active Messages (HAM) — Implementing Lightweight Remote Procedure Calls in C++

Matthias Noack | Zuse Institute Berlin

We present HAM (Heterogeneous Active Messages), a C++-only active messaging solution for heterogeneous distributed systems. Combined with a communication protocol, HAM can be used as a generic Remote Procedure Call (RPC) mechanism. It has been used in HAM-Offload to implement a low-overhead offloading framework for inter- and intra-node offloading between different architectures, including accelerators like the Intel Xeon Phi x100 series and the NEC SX-Aurora Vector Engine.

HAM uses template meta-programming to implicitly generate active message types and their corresponding handler functions. Heterogeneity is enabled by providing an efficient address translation mechanism between the individual handler code addresses of processes running different binaries on different architectures, as well as hooks to inject serialisation and deserialisation code on a per-type basis. Implementing such a solution in modern C++ sheds some light on the shortcomings and grey areas of the C++ standard when it comes to distributed and heterogeneous environments.
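
The general idea of template-generated message types and handlers can be sketched in a few lines. This is a minimal illustration only, not HAM's API: every name here (active_msg, handler_table, wire_message, pack, dispatch) is hypothetical, a table index stands in for the address translation mechanism, and a memcpy stands in for per-type serialisation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// All names are hypothetical; this only sketches the idea described in the abstract.

// A handler receives the raw message bytes and executes the message.
using handler_t = int (*)(const unsigned char* buffer);

// Active message type generated from an argument type and a function pointer.
template <typename Arg, int (*Fn)(Arg)>
struct active_msg {
    static_assert(std::is_trivially_copyable<Arg>::value,
                  "this sketch copies raw bytes instead of serialising");
    Arg arg;

    // Handler generated per message type: rebuilds the message and calls Fn.
    static int handler(const unsigned char* buffer) {
        active_msg msg;
        std::memcpy(&msg, buffer, sizeof(msg));  // stand-in for deserialisation
        return Fn(msg.arg);
    }
};

inline int square_value(int x) { return x * x; }
inline int negate_value(int x) { return -x; }

using square_msg = active_msg<int, &square_value>;
using negate_msg = active_msg<int, &negate_value>;

// Assumption: every binary builds this table in the same order, so the index is
// a portable key even though handler code addresses differ between binaries.
constexpr handler_t handler_table[] = {&square_msg::handler, &negate_msg::handler};

// What travels between processes: a handler key plus the message bytes.
struct wire_message {
    std::size_t handler_key;
    std::vector<unsigned char> payload;
};

template <typename Msg>
wire_message pack(std::size_t key, const Msg& msg) {
    wire_message w{key, std::vector<unsigned char>(sizeof(Msg))};
    std::memcpy(w.payload.data(), &msg, sizeof(Msg));  // stand-in for serialisation
    return w;
}

// Receiver side: translate the key back to a local handler address and call it.
inline int dispatch(const wire_message& w) {
    assert(w.handler_key < sizeof(handler_table) / sizeof(handler_table[0]));
    return handler_table[w.handler_key](w.payload.data());
}
```

In this sketch, dispatch(pack(0, square_msg{7})) runs the handler generated for square_msg on the receiving side. A real system would also need the per-type serialisation hooks mentioned above for types that cannot be copied byte by byte.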

An Introduction to hpxMP -- A Modern OpenMP Implementation Leveraging HPX, An Asynchronous Many-Task System

Tianyi Zhang, Shahrzad Shirzad, Patrick Diehl, R. Tohid, Weile Wei and Hartmut Kaiser | Center for Computation and Technology, Louisiana State University

Asynchronous Many-Task (AMT) runtime systems are gaining increasing acceptance for HPC applications. At the same time, the C++ standardization efforts currently focus on creating higher-level interfaces that can replace OpenMP or OpenACC in modern C++ codes. Both trends call for a migration path for existing applications that directly or indirectly use OpenMP, one that allows parts of the code to be moved to the AMT paradigm. Additionally, an AMT implementation is not yet available for most existing highly optimized OpenMP libraries. For these reasons it is beneficial to combine both technologies, AMT+OpenMP, so that the distributed communication is handled by the AMT system and the intra-node parallelism is handled by OpenMP, or even to combine OpenMP and the parallel algorithms. Currently, these two scenarios are not possible, since the lightweight thread implementations present in AMTs interfere with the system threads utilized by the available OpenMP implementations. To overcome this issue, hpxMP, an implementation of the OpenMP standard that utilizes HPX’s lightweight threads, is presented. Four linear algebra benchmarks of the Blaze C++ library are used to compare hpxMP with clang’s OpenMP. In general, hpxMP is not yet able to reach the same performance. However, the results demonstrate its viability for providing a smooth migration path for applications, although it has to be extended to benefit from a more general task-based programming model.
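
For orientation, the kind of existing OpenMP code such a migration path targets can be as small as the loop below. This is a generic sketch of a linear algebra kernel using standard OpenMP directives; it is not taken from hpxMP or Blaze, and the function name dot is an assumption. Built without OpenMP support, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product with a standard OpenMP work-sharing loop and a sum reduction.
// Which runtime schedules the threads is decided by the OpenMP implementation
// the program is built against, not by this source code.
inline double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const long long n = static_cast<long long>(x.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        sum += x[k] * y[k];
    }
    return sum;
}
```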

A Comparative Analysis of Kokkos and SYCL as Heterogeneous, Parallel Programming Models for C++ Applications

Jeff Hammond, Michael Kinsner and James Brodman | Intel

While Kokkos and SYCL have rather different origin stories, they share the same goal of supporting high-performance parallelism in heterogeneous compute nodes. It is reassuring, then, that the two models have reached similar semantics in supporting hierarchical and nested loop parallelism. This talk will describe in detail the features shared by Kokkos and SYCL, point out the fine details where similar features differ in their definition, and describe semantics that are unique to one or the other. This analysis is beneficial both as an affirmation of SYCL, which has standardized that which has been co-designed with DOE HPC applications for many years, and as a Rosetta Stone for those interested in mapping one to the other for practical purposes (such as a SYCL back-end for Kokkos).

A SYCL Compiler and Runtime Architecture

Alexey Bader, James Brodman and Michael Kinsner | Intel

With the increasing diversity of offload accelerator architectures, a unified programming model greatly simplifies extraction of performance from modern heterogeneous systems. The SYCL standard describes a cross-platform abstraction layer that enables programming of heterogeneous systems using standard C++. The SYCL programming model combines host and device code for an application in the same source file in a type-safe way, and it provides a powerful task graph execution model that allows programmers to coordinate multiple offload accelerations across many devices.

This talk will present the architecture of Intel’s open-source SYCL implementation, which consists of two major components: the SYCL device compiler and the SYCL runtime library. Intel’s implementation is built on top of Clang/LLVM and has a design goal of becoming a fully supported part of the LLVM project. The device compiler outlines sections of application code that will execute on an accelerator device, and compiles them to either the SPIR-V intermediate format or a device native binary format. In addition, the compiler emits auxiliary information which is used by the SYCL runtime to launch device code on an accelerator using an OpenCL implementation. The SYCL runtime library is a C++ abstraction layer that orchestrates the execution of kernels and the movement of data on both host and supported accelerator devices.

ReSYCLator: Transforming CUDA C++ Source Code into SYCL

Tobias Stauber and Peter Sommerlad | University of Applied Sciences Rapperswil

CUDA, while very popular, is not as flexible with respect to target devices as OpenCL. While parallel algorithm research might address problems first with a CUDA C++ solution, those results are not easily portable to a target not directly supported by CUDA.

In contrast, a SYCL C++ solution can operate on the larger variety of platforms supported by OpenCL.

ReSYCLator is a plug-in for Cevelop (www.cevelop.com), an extension of Eclipse CDT, that bridges this gap by providing automatic transformation of CUDA C++ code to SYCL C++. A first attempt, which based the transformation on NVIDIA’s Nsight Eclipse CDT plug-in, showed that Nsight’s integration into CDT’s static analysis and refactoring infrastructure is too weak for this purpose. Therefore, a dedicated CUDA C++ parser for Eclipse CDT was developed, which provides a sound platform for transforming CUDA C++ programs to SYCL based on AST transformations.
