CONFERENCE PROGRAM 2018

Monday 14 May – Tutorial, Workshop and DHPCC++ Conference

Advanced Hands-On OpenCL

Prof. Simon McIntosh-Smith, University of Bristol

The “Advanced Hands-On OpenCL Tutorial” focuses on advanced OpenCL concepts and is an extension of the highly successful ‘Hands On OpenCL’ course, which has received over 6,500 downloads from GitHub. Simon McIntosh-Smith, Professor in High Performance Computing at the University of Bristol and one of the tutorial authors, will lead the sessions along with members of his research team.

“I’m delighted to offer developers the opportunity to extend their OpenCL knowledge by running this advanced version of the open source Hands-On OpenCL tutorial,” said Simon McIntosh-Smith. “The course is based on my extensive ‘HandsOnOpenCL’ online material, which has been incredibly popular on GitHub. Anyone looking to extend their OpenCL skills beyond the introductory level should benefit from this one-day tutorial.”

Course Outline

The tutorial format is a 50/50 split between lectures and exercises and uses a mix of OpenCL C and C++ host APIs. Attendees will require their own laptop to log onto a server running OpenCL 1.0 through OpenCL 2.0. Alternatively, students can run the exercises on their laptops using their preferred OpenCL SDK.

  • Aimed at all developers looking to use OpenCL on any platform
  • Attendees should have written at least one OpenCL program
  • There will be plenty of time for attendees to have specific OpenCL questions addressed
  • Course topics:
    • Shipping kernel code
    • Portable binaries with SPIR
    • OpenCL kernel compiler options
    • Kernel metaprogramming (see the sketch after this list)
    • Optimised host-device communications
    • Using multiple OpenCL devices
    • Performance portability
    • Coalesced memory accesses
    • Tuning work-group sizes
    • Vectorisation
    • OpenCL / OpenGL interoperability
    • The OpenCL ecosystem:
      • OpenCL 2.0 and future versions
      • OpenCL SPIR
      • OpenCL SYCL
      • OpenCL libraries
    • Other OpenCL resources
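
As a taste of the compiler options and kernel metaprogramming topics above, the sketch below shows one common pattern: the host specialises a kernel at build time by injecting a preprocessor define through the options string of clBuildProgram. This is a minimal illustrative example of ours, not official course material, and the kernel and TILE parameter are invented for illustration.

    // Minimal sketch: kernel metaprogramming via OpenCL compiler options.
    #include <CL/cl.h>
    #include <cstdio>
    #include <string>

    // OpenCL C kernel source held in a C++ raw string; TILE comes from the host.
    static const char* source = R"CLC(
    __kernel void scale(__global float* data) {
        data[get_global_id(0)] *= (float)TILE;  // TILE injected at build time
    }
    )CLC";

    cl_program build_specialised(cl_context ctx, cl_device_id dev, int tile) {
        cl_program prog = clCreateProgramWithSource(ctx, 1, &source, nullptr, nullptr);
        // One source, many specialisations: the define and any compiler flags
        // are chosen at runtime, just before the online build.
        std::string opts = "-DTILE=" + std::to_string(tile) + " -cl-fast-relaxed-math";
        if (clBuildProgram(prog, 1, &dev, opts.c_str(), nullptr, nullptr) != CL_SUCCESS)
            std::fprintf(stderr, "kernel build failed\n");
        return prog;
    }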

About the Presenter

Simon McIntosh-Smith is a leading OpenCL trainer, having taught the subject since 2009. He has run many OpenCL training courses at conferences such as Supercomputing and HiPEAC, and has provided OpenCL training for the UK’s national supercomputing service and for the Barcelona Supercomputing Center. With OpenCL training experience ranging from half-day on-site introductions within companies to three-day intensive hands-on workshops, Simon provides standard and customized OpenCL training courses. The tutorial will also be supported by members of Simon’s research team, all of whom are experienced OpenCL software developers. Follow Simon on Twitter: @simonmcs

Track 1

09:30 – 17:30

DHPCC++ Conference

Distributed & Heterogeneous Programming for C/C++

About DHPCC++

In response to the demand for heterogeneous programming models for C/C++, and the interest in driving these models into ISO C++, Distributed & Heterogeneous Programming in C/C++ (DHPCC++) covers all of the programming models that have been designed to support heterogeneous programming in C and C++.

Many models now exist, including SYCL, HPX, Kokkos, RAJA, C++AMP, HCC, Boost.Compute, and CUDA, to name a few.

This conference aims to address the needs of both the HPC and the consumer/embedded communities, where a number of C++ parallel programming frameworks have been developed to address the needs of multi-threaded and distributed applications. The C++11/14/17 International Standards have introduced new tools for parallel programming to the language, and the ongoing standardization effort is developing additional features which will enable support for heterogeneous and distributed parallelism in ISO C++ 20/23.

DHPCC++ is an ideal place to discuss research in this domain, consolidate usage experience, and share new directions to support new hardware and memory models with the aim of passing that experience to ISO C and C++.

Track 2

09:30 – 17:30

Writing OpenCL for FPGAs

Karl Qi, Intel

FPGAs are reconfigurable silicon that can be used to create custom circuits for accelerating algorithms. This hands-on workshop will cover how to use OpenCL to implement high-performance solutions on the FPGA using the Intel® FPGA SDK for OpenCL. We will examine how kernels are converted to custom dataflow circuits, and how executions of OpenCL kernels are mapped onto the FPGA. We will experiment with the various debug and analysis tools available in the SDK to help optimize OpenCL kernels with regard to both FPGA resource consumption and performance. We will examine how loops in kernels can be effectively optimized for deeply pipelined parallel execution. We will experiment with streaming data in and out of kernels, using pipes and channels, from the host, external interfaces, and other kernels for effective inline acceleration. We will guide the compiler to make performance and area trade-offs through the use of attributes, pragmas, and arbitrary-precision data types. Lastly, we will discuss how local memory systems can be generated on the FPGA for effective stall-free accesses from kernels.
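
As a flavour of the channel and pragma techniques covered, the sketch below (illustrative only, not workshop material; the kernels are invented) connects a producer and a consumer kernel with an Intel FPGA channel and uses a fully unrolled shift register, a common FPGA idiom for building deep pipelines:

    // Illustrative OpenCL C kernel source (held in a C++ raw string) using
    // Intel FPGA channels and loop unrolling; kernel names are hypothetical.
    static const char* fpga_kernels = R"CLC(
    #pragma OPENCL EXTENSION cl_intel_channels : enable
    channel float stage;  // on-chip FIFO connecting the two kernels

    __kernel void producer(__global const float* restrict in, int n) {
        for (int i = 0; i < n; ++i)
            write_channel_intel(stage, in[i]);
    }

    __kernel void consumer(__global float* restrict out, int n) {
        float taps[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int i = 0; i < n; ++i) {
            #pragma unroll
            for (int j = 3; j > 0; --j) taps[j] = taps[j - 1];  // shift register
            taps[0] = read_channel_intel(stage);
            out[i] = taps[0] + taps[1] + taps[2] + taps[3];     // 4-tap moving sum
        }
    }
    )CLC";
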
Track 3

09:30 – 17:30

Tuesday 15 May – Conference Sessions and Khronos Panel Discussion

Opening Address

Prof. Simon McIntosh-Smith, University of Bristol

Download Slides

KEYNOTE SPEAKER

OpenCL and Its Eco-System – State of the Nation Address

Neil Trevett, The Khronos Group’s President and OpenCL Working Group Chair, VP NVIDIA

Download Slides
View Video

DCompute: Compiling D to SPIR-V for Seamless Integration with OpenCL

Nicholas Wilson, D Language Foundation.

With the advent of SPIR-V comes the ability to have custom kernel languages to address the shortcomings of OpenCL C/C++ and SYCL. This enables a more language-integrated and intuitive solution for writing and dispatching SPIR-V kernels. To this end, I have developed DCompute: a framework for writing and calling kernels in D. This talk covers how DCompute, in conjunction with LDC (the LLVM D Compiler) built against a forward-ported fork of Khronos’ SPIR-V LLVM, is able to provide:

  • Simple, type-safe kernel and kernel lambda dispatch without having to specify kernel signatures multiple times.
  • A much more powerful kernel language (a subset of D) that enables metaprogramming & code sharing across both host & device and ensures data structures stay in sync.
  • Consistent and intuitive wrappers for the OpenCL device standard library and host runtime library.
  • D’s pillars of convenience, modelling power and efficiency, brought to the world of OpenCL and SPIR-V.

This talk also includes an overview of the modifications made to LDC to compile SPIR-V kernels and features of D that make the above possible.

10:00 – 10:30

What’s New in SYCL 1.2.1 and How to Explore the Features

Michael Wong, Codeplay

On the 17th of November 2017, Khronos ratified the latest SYCL 1.2.1 specification. Although only a minor version increase, the work on the new specification represents two and a half years of effort from the SYCL group. The group spent time receiving feedback on the public specification and working closely with C++ developers to devise the best way to approach the challenges of heterogeneous programming with real-world applications like TensorFlow.

SYCL 1.2.1 improves on the previous SYCL 1.2 specification by adding a number of “mini-features” in the form of extensions to the C++ API that simplify programming and expose more capabilities from the underlying OpenCL 1.2 interface, such as explicit copy functionality, alongside various improvements to the interface, including better support for standard C++ allocators and extension capabilities.

In this presentation, we introduce the new SYCL 1.2.1 specification, explain the updates and changes to the APIs, and illustrate how to take advantage of them with some examples. We will also present the current status of SYCL 1.2.1 support in ComputeCpp, an implementation of the standard, and how to use the new features and API changes. To conclude, we will provide some hints on the future direction of the SYCL and C++ standards by examining various proposals from Codeplay that are currently works in progress.
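
As a small illustration of one such mini-feature, the sketch below (our own minimal example, not code from the talk) uses the explicit copy operations that SYCL 1.2.1 added to the command group handler:

    // Minimal sketch of SYCL 1.2.1 explicit copy (illustrative example).
    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
        std::vector<float> host(1024, 1.0f), result(1024);
        cl::sycl::queue q;
        cl::sycl::buffer<float, 1> buf{cl::sycl::range<1>(1024)};

        q.submit([&](cl::sycl::handler& cgh) {
            auto acc = buf.get_access<cl::sycl::access::mode::write>(cgh);
            cgh.copy(host.data(), acc);    // explicit host-to-device copy
        });
        q.submit([&](cl::sycl::handler& cgh) {
            auto acc = buf.get_access<cl::sycl::access::mode::read>(cgh);
            cgh.copy(acc, result.data());  // explicit device-to-host copy
        });
        q.wait();
        return 0;
    }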

10:30 – 11:00

Morning Coffee Break & Technology Demonstrations

11:00 – 11:30

Performance-oriented Optimizations for OpenCL Streaming Kernels on the FPGA

Zheming Jin* and Hal Finkel, Argonne National Lab.

Field-programmable gate arrays (FPGAs) can implement streaming applications efficiently, and high-level synthesis (HLS) tools allow people who do not have deep hardware design knowledge to evaluate applications on FPGAs, so there is a need to understand where OpenCL and FPGAs can play in the streaming domain. To this end, we explore the implementation space and discuss performance optimization techniques for streaming kernels using the Intel OpenCL SDK for FPGA. On the target Nallatech 385A FPGA platform, which features an Arria 10 GX1150 FPGA, the experimental results show that FPGA resources, such as block RAMs and DSPs, can limit kernel performance before the constraint of memory bandwidth takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can improve kernel performance by a factor of 2 to 10, and the combination of the two techniques achieves the best performance. To improve the performance of compute unit duplication for the streaming kernels, the local work size needs to be tuned: the optimal value can increase performance by a factor of 3 to 70 compared to the default value.
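
For reference, the two techniques evaluated in the paper map onto Intel FPGA SDK for OpenCL kernel attributes roughly as in the sketch below; this is an illustrative kernel of ours, not code from the paper:

    // Kernel vectorization and compute unit duplication via Intel FPGA SDK
    // attributes (OpenCL C source held in a C++ raw string; illustrative only).
    static const char* streaming_kernel = R"CLC(
    __attribute__((num_simd_work_items(4)))          // vectorize: 4 lanes per clock
    __attribute__((num_compute_units(2)))            // duplicate the whole pipeline
    __attribute__((reqd_work_group_size(64, 1, 1)))  // the tuned local work size
    __kernel void stream_scale(__global const float* restrict in,
                               __global float* restrict out) {
        size_t i = get_global_id(0);
        out[i] = 2.0f * in[i];
    }
    )CLC";
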
11:30 – 12:00

KOCL: Kernel-level Power Estimation for Arbitrary FPGA-SoC-Accelerated OpenCL Applications

James Davis*, Joshua Levine, Edward Stott, Eddie Hung, Peter Cheung and George Constantinides, Imperial College London

This work presents KOCL, a fully automated tool flow and accompanying software, accessible through a minimalist API, allowing OpenCL developers targeting FPGA-SoC devices to obtain kernel-level power estimates for their applications via function calls in their host code. KOCL is open-source, available with example applications at https://github.com/PRiME-project/KOCL. Crucially, in order to maximise accessibility, KOCL necessitates no user exposure to hardware whatsoever.

Contrary to reliance upon worst-case operating assumptions, knowledge of fine-grained power consumption can facilitate the deployment of adaptive energy-saving strategies including dynamic voltage and frequency scaling, task migration and power gating. Since the provision of multiple power islands within SoCs is often impractical, particularly for reconfigurable devices, our approach provides this information by relating circuit switching activity to power.

When targeted at FPGA-SoCs, such as the Altera Cyclone V devices we used to evaluate KOCL, OpenCL kernel code is compiled into bespoke hardware accelerators, one per kernel, prior to application execution. KOCL adds additional steps to Altera’s OpenCL tool flow in order to augment kernel accelerators with instrumentation to measure the switching activity of their most power-indicative signals. Probabilistic circuit simulations are performed to rank signals from highest to lowest predicted activity. These so-called ‘vectorless’ simulations are fast and do not require user provision of test vectors. Area-efficient counters are added to the highest-ranked signals, with control logic also instantiated to allow for their read-back at runtime. All of these steps are automatic, with none contributing significantly to overall compilation times.

During execution, KOCL’s software runs alongside the OpenCL host code in order to build and continuously update an online power model of the hardware. Kernel-level activity and FPGA-wide power measurements are fed into the model, which apportions the total power between the kernel accelerators present. Users can query real-time power estimates for a particular kernel simply by passing its name to the KOCL_get() function within KOCL’s API.
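
A minimal sketch of this query pattern follows; note that the KOCL_get() prototype shown is our assumption based on the abstract, so the exact signature should be checked against the KOCL repository:

    // Hypothetical usage sketch of KOCL's query API; the prototype below is
    // assumed from the abstract and may differ from the real header.
    #include <cstdio>

    extern "C" double KOCL_get(const char* kernel_name);  // assumed prototype

    void report_power() {
        // Ask the online model for one kernel accelerator's share of the power;
        // "vector_add" is an example kernel name.
        double watts = KOCL_get("vector_add");
        std::printf("vector_add power estimate: %.3f W\n", watts);
    }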

Since KOCL’s power model is online—that is, trained using real data at runtime—it is able to adapt to workload and environmental changes dynamically. Unlike alternative techniques that make use of offline models, KOCL is able to compensate for sources of power behavioural change including input data, operating modes, noise, voltage, frequency, temperature, variation and degradation automatically.

Experimentation has confirmed that KOCL is both a low-overhead and high-accuracy power estimation technique. Typically, relatively few signals need to be monitored in order to achieve good results. With just eight counters per kernel, 10mW absolute accuracy—the difference between modelled and measured kernel-level power—has been found to be obtainable, with errors as low as 5mW achievable when additional counters are used. With low numbers of instruments, area and power overheads are only a few percent.

In this work, we presented KOCL: a tool facilitating the provision of real-time kernel-level power consumption estimates to OpenCL developers targeting FPGA-SoC devices. KOCL is open-source, its application is trivial, and the flow can be used with any design implemented on an Altera FPGA.

Download Slides

High Performance Asynchronous Host-Device Communication Through the Intel FPGA Host Pipe Extension to OpenCL

Michael Kinsner and Dirk Seynhaeve, Intel.

OpenCL defines a programming framework for offload accelerators, including a host API that manipulates device kernels. The global address space provides a mechanism for data to be passed between the host program and accelerator devices. Except for shared virtual memory (SVM) atomics in recent OpenCL versions, which are not available on many platforms, data is only guaranteed to be communicated between the host and devices at coarse-grained synchronization points. Such synchronization is not suited to communication with persistently running kernels, to asynchronous updates to or from executing kernels, or to some low-latency applications.

The Intel FPGA host pipe extension for OpenCL (to be publicly released before IWOCL’18) enables performant bidirectional FIFO communication between a device and the host. The extension leverages the OpenCL 2.x pipe API with some expansions to enable pipe data access by the host, and it enables both lightweight control communication and high-bandwidth data streaming. The OpenCL memory model is extended to enable host pipe data communication outside of OpenCL synchronization points.

A key design principle of the extension is the avoidance of memory polling while checking for the availability of data. Even with an expanded memory model, polling can consume precious memory bandwidth. Host pipes provide implicit control information (data presence) as a sideband signal to the data, which enables low-overhead querying for the existence of data.

In the context of the extension implemented on an FPGA, use cases will be detailed that showcase the need for low-frequency asynchronous communication without significant hardware or performance costs, including routing table entry updates on a device performing network routing, and asynchronous signaling of detection events in deep packet inspection. A high-throughput data flow programming use case that saturates PCIe bandwidth using host pipe communication will also be described. Benchmark results will be presented from an Intel Arria 10 FPGA accelerator.

We have chosen IWOCL as the venue for the first public presentation of this OpenCL extension.
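
Since the extension had not yet been publicly released, the sketch below only illustrates the programming model described in the abstract; the host entry point is a placeholder of ours, not the actual extension API:

    // Purely illustrative: a hypothetical host-side pipe read modelled on the
    // OpenCL 2.x pipe API. The function name below is a placeholder, not the
    // released Intel extension.
    #include <CL/cl.h>

    extern cl_int hypoReadHostPipe(cl_command_queue q, cl_mem host_pipe,
                                   size_t size, void* dst);  // placeholder

    void event_loop(cl_command_queue q, cl_mem events_pipe) {
        int event_id = 0;
        // Data presence travels as sideband information with the pipe, so the
        // host learns of new data without polling device memory for a flag.
        while (hypoReadHostPipe(q, events_pipe, sizeof(event_id), &event_id)
               == CL_SUCCESS) {
            // handle an asynchronous detection event from the running kernel
        }
    }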

Download Slides

Lunch Break

13:00 – 14:00

KEYNOTE SPEAKER

OpenCL and C++ – The Way Ahead

Andrew Richards, CEO, Codeplay Software Ltd.

14:00 – 14:30

Nuclear Reactor Simulation on an OpenCL FPGA: A Case Study of RSBench

Zheming Jin* and Hal Finkel, Argonne National Lab.

FPGAs are becoming a promising choice as a heterogeneous computing component for scientific computing now that floating-point optimized architectures are being added to current FPGAs. Emerging high-level synthesis (HLS) tools such as the Intel OpenCL SDK for FPGA offer a streamlined design flow that facilitates the use of FPGAs in scientific computing. In this paper, we evaluate the OpenCL-based FPGA design of a nuclear reactor simulation application, RSBench. We describe the OpenCL implementations and optimization methods on an Intel Arria 10 FPGA. Compared with the naïve OpenCL kernel, the optimizations increase performance by a factor of 295 on the FPGA. Compared with a dual-socket Intel Xeon E5-2687W host and an Nvidia K80 GPU, the performance per watt on the FPGA is 3.59X better than the CPU and 5.8X lower than the GPU.

14:30 – 15:00

Advantages and Pitfalls of OpenCL in Computational Physics

Katharina Ostaszewski, Philip Heinisch and Hendrik Ranocha, TU Braunschweig

OpenCL offers many advantages in computational physics in comparison to traditional MPI/OpenMP parallelization. We present an MPI/OpenCL-based plasma simulation code as an example of how computational physics can benefit from OpenCL. The code utilizes a hybrid modeling approach which combines elements from both fluid and particle-in-cell methods. The hybrid model includes many of the common problems encountered in computational physics, for example solving differential equations, solving systems of linear equations, and parallel reduction. It is therefore well suited to showing the common advantages and problems that arise with an OpenCL-based approach. While some problems, like solving differential equations, greatly benefit from deployment on GPUs, other well-parallelizable problems may suffer in performance due to OpenCL memory management.

An additional advantage of OpenCL is that it allows for modular structuring of the code by encapsulating numerical aspects into separate OpenCL kernels. This separates the MPI-based host code from the numerical problem and allows numerical modules, for example solver algorithms, to be exchanged easily. Furthermore, a combination of MPI and OpenCL makes deployment on modern heterogeneous systems, including office workstations, possible.
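
To make the parallel reduction example concrete, a standard work-group sum reduction in OpenCL C looks roughly like the sketch below; this is the generic textbook pattern, not the authors’ simulation code:

    // Generic work-group sum reduction (OpenCL C source in a C++ raw string).
    // The host allocates the __local scratch via clSetKernelArg(k, 2, bytes, NULL).
    static const char* reduce_src = R"CLC(
    __kernel void reduce_sum(__global const float* in,
                             __global float* partial,
                             __local float* scratch) {
        size_t lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s) scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0) partial[get_group_id(0)] = scratch[0];  // one value per group
    }
    )CLC";
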
Download Slides

Afternoon Coffee Break & Technology Demonstrations

15:30 – 16:00

Khronos Panel Discussion

Chair: Prof. Simon McIntosh-Smith, University of Bristol

This session is always a favourite amongst delegates, as it puts members of the Khronos OpenCL, SYCL and SPIR working groups on the stage alongside leading members of the OpenCL development community to debate the issues of the day and answer questions from the audience in the room and online.

16:00 – 17:30

Close

17:30

Conference Dinner

See Venue and Travel for additional information and directions

19:30 – 21:00

Wednesday 16 May – Conference Sessions

INVITED SPEAKER

Exposing OpenCL to Artists

Jeff Lait, SideFX (developers of the Houdini 3D animation software)

In the talk, Jeff will share how SideFX uses OpenCL to access both GPU and CPU acceleration, and how, despite the technical nature of OpenCL, many artists have taken the chance to write their own kernels to speed up their effects. Jeff is a senior mathematician at SideFX, where he has worked on Houdini since version 1.0 (it’s now on release 16.5!). He has contributed to geometry processing, rigid body solvers, and fluid simulations, has also had the “joy” of working with many architectures over the years: SGI, Alpha, x86, Itanium, PS2, and PS3; and is still waiting for the system that solves more problems than it causes.

Download Slides

OpenCL Optimization and Best Practices for Qualcomm Adreno GPUs

Hongqiang Wang, Jay Yun and Alex Bourd, Qualcomm

The Adreno GPUs in Qualcomm’s Snapdragon SOCs are world-leading mobile GPUs. These GPUs have supported OpenCL since the Adreno A3x series, through the A4x and A5x series, and into the forthcoming A6x series. Many computationally intense use cases, such as imaging, video and computer vision algorithms, have been accelerated with OpenCL on the Adreno GPU in commercial devices powered by Snapdragon SOCs. Qualcomm’s deep learning SDK, the Snapdragon Neural Processing Engine (SNPE), is also powered by Adreno’s OpenCL. The broad adoption of Adreno’s OpenCL capabilities is largely attributable to its superior performance and power advantage over the CPU.

This technical presentation provides guidelines and best practices on how to optimize applications using OpenCL on Adreno GPUs. Here is the outline of the presentation:

  1. OpenCL on Snapdragon. This section presents a high-level overview of the Qualcomm Adreno GPU architecture and pipeline as they concern OpenCL and compute, as well as the OpenCL support in the Qualcomm Adreno GPU families, including the profiles and key changes across different generations of Adreno GPUs.
  2. OpenCL optimization guide for Adreno GPUs. As the key part of the presentation, this section provides a comprehensive and in-depth OpenCL programming guide, covering the essential optimization principles, such as memory and ALU optimization, best use of math functions, Adreno OpenCL extensions, etc. (a generic illustration follows this list).
  3. Qualcomm Snapdragon Profiler (SDP) for OpenCL use cases. SDP is an all-in-one profiling tool provided by Qualcomm to help developers profile and analyze various modules in Snapdragon devices. This section focuses on how to use SDP to profile OpenCL use cases and identify bottlenecks.
  4. Two OpenCL use cases, the Epsilon filter and the Sobel filter, are presented in the final section. Major optimization steps, along with the performance boost from each step, are provided to demonstrate the effectiveness of the optimization practices.
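
The following is the generic illustration referred to in item 2: vectorized loads and stores let each work-item move 128 bits per access, a common memory optimization on mobile GPUs. It is our own example, not code from Qualcomm’s guide:

    // Illustrative memory optimization: float4 accesses yield wide, coalesced
    // transactions; the host launches n/4 work-items instead of n.
    static const char* vectorized_src = R"CLC(
    __kernel void scale4(__global const float4* restrict in,
                         __global float4* restrict out,
                         float k) {
        size_t i = get_global_id(0);
        out[i] = k * in[i];  // one 128-bit load and one 128-bit store
    }
    )CLC";
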
09:30 – 10:00

TensorFlow Acceleration on the ARM HiKey Board

Mehdi Goli, Luke Iwanski, John Lawson, Uwe Dolinsky and Andrew Richards, Codeplay

There is huge demand for targeting complex and large-scale machine learning applications, particularly those based on popular, actively maintained frameworks such as TensorFlow and Caffe, at a variety of platforms with accelerators ranging from high-end desktop GPUs to resource-constrained embedded or mobile GPUs, FPGAs, and DSPs. However, to deliver good performance, different platforms may require different algorithms or data structures, yet code should be easily portable and reusable across different devices. The open SYCL standard addresses this by providing parallel processing through a single-source programming model, enabling the same standard C++ code to be used on the CPU and the accelerator. This allows high-level C++ abstractions and templates to be used to quickly configure device and host code to cover specific features of the platform. By targeting OpenCL, SYCL enables C++ applications such as TensorFlow to run efficiently on OpenCL devices without having to write OpenCL code.
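
Single-source here means that the device code lives inline in standard C++, as in the minimal generic SYCL sketch below (our illustration, not TensorFlow code):

    // Generic single-source SYCL sketch: the lambda body is compiled for the
    // OpenCL device yet lives in the same C++ translation unit as the host code.
    #include <CL/sycl.hpp>
    #include <vector>

    void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
        cl::sycl::queue q;
        cl::sycl::buffer<float, 1> bx(x.data(), cl::sycl::range<1>(x.size()));
        cl::sycl::buffer<float, 1> by(y.data(), cl::sycl::range<1>(y.size()));
        q.submit([&](cl::sycl::handler& cgh) {
            auto ax = bx.get_access<cl::sycl::access::mode::read>(cgh);
            auto ay = by.get_access<cl::sycl::access::mode::read_write>(cgh);
            cgh.parallel_for<class axpy_kernel>(
                cl::sycl::range<1>(x.size()),
                [=](cl::sycl::id<1> i) { ay[i] += a * ax[i]; });
        });
    }  // ~buffer writes the results back to y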

In this presentation we propose an OpenCL-enabled back-end for TensorFlow via SYCL in order to enable developers to access a wider range of processor combinations.

SYCL is a royalty-free, cross-platform C++ abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL, while adding the ease-of-use and flexibility of modern C++14. This solution also benefits from ensuring the implementation is maintainable and compliant as the standards evolve. Dispatching device kernels from C++ applications is a widely used method for dealing with heterogeneous platforms in various programming models, such as CUDA, C++AMP, HCC, OpenACC, and OpenMP.

SYCL brings this capability to the wide range of accelerators supporting OpenCL. This lets developers create powerful, more performance-portable template libraries that can take advantage of a wide range of heterogeneous hardware and software platforms. Moreover, porting TensorFlow directly to OpenCL would mean handwriting the kernels in OpenCL C and having separate code bases, which would be complicated to maintain. By using SYCL, everything is single-source C++, and therefore it is possible to use a non-intrusive approach to add the SYCL back-end to TensorFlow.

There are three steps to implementing the SYCL back-end for TensorFlow.

In the first step, we introduce the SYCL device by specializing TensorFlow’s device abstraction layer. The implemented SYCL device supports any OpenCL-enabled device.

In the next step, we implement a SYCL back-end for all the linear operations in the Eigen Tensor module. Each TensorFlow operator maps to an Eigen expression. As Eigen has the same expression interface regardless of the selected device, in most cases there is no need to specialize TensorFlow’s operators for SYCL.

In the last step, we register the existing TensorFlow operators for the SYCL device.

The evaluation of the proposed approach was carried out on an ARM HiKey board with an ARM A53 CPU, an A73 CPU, and a Mali GPU.

Our results show significant improvements over runs on the ARM A53 CPU, the A73 CPU, and the ACL library for the Mali GPU, especially for large-scale applications.
We are maintaining the TensorFlow SYCL back-end and actively optimizing TensorFlow’s operators for different DNN models across different platforms.

10:00 – 10:30

CLBlast: A Tuned OpenCL BLAS Library

Cedric Nugteren, TomTom

This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM) to accelerate the core of many applications (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has five main advantages over other BLAS libraries: 1) it is optimized for and tested on a large variety of OpenCL devices, including less commonly used devices such as embedded and low-power GPUs; 2) it can be explicitly tuned for specific problem sizes on specific hardware platforms; 3) it can perform operations in half-precision floating point (FP16), saving bandwidth, time and energy; 4) it has an optional CUDA back-end; and 5) it can combine multiple operations in a single batched routine, accelerating smaller problems significantly. This paper describes the library and demonstrates the advantages of CLBlast experimentally for different use cases on a wide variety of OpenCL hardware.
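
A typical call through CLBlast’s C++ API looks roughly like the sketch below; the argument list follows the usual BLAS GEMM convention, but the exact signature should be verified against the clblast.h header:

    // Sketch of a CLBlast SGEMM call, C = alpha*A*B + beta*C (row-major).
    // Verify the exact template signature against clblast.h before use.
    #include <clblast.h>
    #include <CL/cl.h>

    void run_gemm(cl_command_queue queue, cl_mem A, cl_mem B, cl_mem C,
                  size_t m, size_t n, size_t k) {
        cl_event event = nullptr;
        clblast::Gemm(clblast::Layout::kRowMajor,
                      clblast::Transpose::kNo, clblast::Transpose::kNo,
                      m, n, k,
                      1.0f, A, 0, k,   // alpha, buffer, offset, leading dimension
                            B, 0, n,
                      0.0f, C, 0, n,   // beta, output buffer
                      &queue, &event);
        clWaitForEvents(1, &event);    // the routine runs asynchronously
    }
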
Download Slides

Morning Coffee Break & Technology Demonstrations

11:00 – 11:30

HIGHLIGHTED PRESENTATION – Based on Review Feedback

MatCL – A New Easy-to-Use OpenCL Toolbox for MathWorks Matlab

Philip Heinisch* and Katharina Ostaszewski, TU Braunschweig

We present “MatCL”, an OpenCL interface for MathWorks Matlab. This MEX-based toolbox aims to provide a simple and easy-to-use solution for launching OpenCL kernels and handling memory transfers from Matlab using a single command. In comparison to other Matlab OpenCL solutions, MatCL is not just an OpenCL API wrapper but encapsulates the low-level host API calls necessary to initialize devices, create OpenCL buffers from Matlab workspace variables, and build and launch kernels. MatCL is primarily intended to help in the development and testing of OpenCL kernels by allowing data to be passed transparently to and from Matlab. This is especially helpful for kernels intended to be deployed on FPGAs or in an HPC environment, where it becomes necessary to test or benchmark using suitable input data or to automatically verify results independent of the target architecture. Because MatCL handles the entire low-level process, the toolbox makes it possible to execute kernels without the in-depth knowledge of the host implementation that is otherwise necessary to run OpenCL code. MatCL is also optimized to allow efficient execution of OpenCL kernels within Matlab to accelerate computationally intensive tasks without having to rely on Nvidia CUDA. In addition to single-command kernel execution, MatCL also allows for an independent two-step kernel compilation and launch workflow to save kernel compile time and allow efficient repetitive kernel execution.
Download Slides

clARMOR: A Dynamic Buffer Overflow Detector for OpenCL Kernels

Joseph Greathouse and Christopher Erb, AMD

clARMOR is an open source buffer overflow detector for OpenCL kernels and memory APIs. This technical presentation discusses the tool, shows relevant performance analyses, and describes its novelty with respect to other memory analysis tools.

Download Slides

Improving Performance of OpenCL Workloads on Intel Processors with Profiling Tools

Michael Carroll, Intel.

We present techniques for improving OpenCL workload performance with Intel profiling tools on modern heterogeneous hardware. We discuss how the profiling tools interface with heterogeneous hardware and the hardware’s correspondence with architectural metrics derived from profiling. By leveraging metrics, we demonstrate improvements to OpenCL workload subscription of hardware with particular attention paid to CPU versus graphics coprocessor residency. With source code examples, we observe metrics to show the performance advantages of contemporary OpenCL extensions such as subgroups, as well as extensions enabling access to special function hardware features outside of the standard execution unit array.
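
As one concrete example of the subgroup extensions mentioned, the sketch below (ours, assuming a device exposing cl_khr_subgroups) reduces values across a subgroup in registers, avoiding local memory and work-group barriers:

    // Illustrative subgroup reduction (OpenCL C source in a C++ raw string);
    // assumes the cl_khr_subgroups extension is available.
    static const char* subgroup_src = R"CLC(
    #pragma OPENCL EXTENSION cl_khr_subgroups : enable
    __kernel void block_sum(__global const float* in, __global float* out) {
        float v = in[get_global_id(0)];
        // The reduction happens in registers across the subgroup: no __local
        // memory and no work-group barrier are needed.
        float s = sub_group_reduce_add(v);
        if (get_sub_group_local_id() == 0)
            out[get_group_id(0) * get_num_sub_groups() + get_sub_group_id()] = s;
    }
    )CLC";
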
Download Slides

Lunch Break

13:00 – 14:00

HIGHLIGHTED PRESENTATION – Based on Review Feedback

Building a Brain with SYCL and Modern C++

Toby St Clere Smithe*, Department of Experimental Psychology, University of Oxford and Ralph Potter, Codeplay Software Ltd.

State-of-the-art machine learning systems typically depend on energetically costly gradient-descent learning over a curated, task-specific data set. Despite their successes, these methods are not well suited to building fully autonomous systems of the kind that might employ energy-efficient accelerators targeted by OpenCL. By contrast, the brain uses low-energy local learning rules to discover the causal structure of an environment, forming semantically rich representations without supervision, and therefore exhibits the required combination of efficiency and flexibility. To investigate these properties, a paradigm shift to dynamic “spike-based” computation is required. Historically, investigating spiking neural models has been a task for specialists, with software that is tailored to specific scientific projects or that trades flexibility against performance. Here, we present neurosycl, a high-performance, portable spiking network simulator based on SYCL, with a modern and extensible C++ API. Our aim is to provide the necessary components for non-specialists to build a simulated brain, and to run the constructed models as close to real-time as possible.

This bipartite aim leads to two competing considerations – a simple interface, and portable performance – which are reconciled using SYCL’s single-source programming model. We describe two principal algorithmic challenges that illustrate the different hardware demands of spiking neural networks relative to deep learning networks, and how neurosycl solves them for GPU-like parallel processors via SYCL. Firstly, although the brain is akin to a parallel processor whose cores are neurons, the connections between neurons may have differing temporal delays, which results in a message-passing problem if the neurons are simulated asynchronously. Secondly, because these messages (‘spikes’) are generated chaotically, then transmitted to arbitrary target neurons with arbitrary transmission delays, a naive implementation even of a synchronous model quickly runs into a highly suboptimal memory access regime.

neurosycl’s design separates the specification of a model architecture from its simulation, so that once a model has been instantiated, its basic structure is fixed. This simplification enables us to infer the memory access pattern, and thus re-order the connection indices so that adjacent kernels access nearby memory locations. The simplification is also reflected in the API design: users can construct complex connection graphs between arbitrary neuron groups using a simple declarative interface, but runtime interactions with the model, for monitoring or I/O, are mediated by a set of simulated electrodes, combined with hooks into the simulation loop. This design mirrors that of neuroscientific experiments, and permits the user to embed the simulated brain into a virtual environment by integrating with other technologies, exposing implementation details only when necessary to allow this. We describe our API, illustrated by a number of “brain-building” examples, showing how the components compose and map via SYCL onto the hardware. We present performance comparisons across hardware platforms and alternative simulators, demonstrating portability for various network configurations and standard neuron models.

14:00 – 14:30

Debugging and Analyzing Programs Using the Intercept Layer for OpenCL Applications

Ben Ashbaugh, Intel

The Intercept Layer for OpenCL Applications is a recently released open source middleware layer that can be used to debug, analyze, and optimize OpenCL applications. It fills a key gap in the OpenCL development ecosystem, and provides similar functionality as the Microsoft DirectX Debug Runtime and Vulkan Validation Layers. It requires no driver or application modifications, and has been tested on OpenCL implementations from multiple vendors.

This Technical Presentation will introduce the Intercept Layer for OpenCL Applications, describe how it works, some of its capabilities, and some of its limitations. The talk will close with a discussion of features that could be added or moved to middleware layers like the Intercept Layer for OpenCL Applications, possible additions to the OpenCL APIs that would simplify development of new features or enable new functionality, and a call for contributions.

Download Slides

Enabling Profiling for SYCL Applications

Callum Fare, Codeplay Software Limited.

Since GPGPU devices have become mainstream, more and more software is being written to target many-core devices. Developers are now required to think in parallel in order to run applications with maximum performance; however, the ability to target a wide range of devices is vital. GPUs can range from very powerful discrete cards to extremely low-power embedded chips and, to be efficient, developers must be able to reuse their code in different scenarios. OpenCL addresses this issue by providing a C-like programming language that can target different architectures; however, it requires deep knowledge of the underlying hardware to be used efficiently. SYCL provides a C++ abstraction layer that simplifies parallel development, allowing developers to leverage the power of OpenCL while reducing the amount of code required.

Parallel programming offers a set of new challenges for developers, and being able to understand the behavior of their applications on target hardware is important in order to ensure they run with the best performance on all hardware platforms.

The LPGPU2 project has been working to add SYCL profiling capabilities to the now open-source tool suite CodeXL. Originally created to perform OpenCL profiling on AMD hardware, the project is adding the ability for CodeXL to perform SYCL profiling on devices including all of ComputeCpp’s supported desktop CPUs and GPUs, as well as mobile low-power devices such as ARM GPUs running under the Android operating system.

In this talk, we will explain how the LPGPU2 CodeXL codebase was modified and extended to allow developers to understand and identify bottlenecks in SYCL applications, and how we extended it to perform accurate power consumption measurements as well as to analyse and provide feedback on how an application can be improved in the context of Android development.

We’ll reveal the secrets of complex software stacks through the new profiling capabilities. The tool can be used at all levels and with all sorts of applications, from complex simulations to machine learning.

15:00 – 15:30

Afternoon Coffee Break & Technology Demonstrations

15:30 – 17:00

Closing Address

Prof. Simon McIntosh-Smith, University of Bristol

Download Slides

Free Discussion

An opportunity to network with colleagues and continue the discussions until the venue closes.

16:00 – 17:00

Close

17:00

Poster Sessions – Tuesday & Wednesday

Accelerating Typical Image Processing Operations Using Qualcomm Adreno OpenCL Extensions

Hongqiang Wang, Jay Yun, Qinghua Dai and Javier Girado, Qualcomm.

Qualcomm’s Adreno GPUs are industry-leading mobile GPUs that demonstrate superior performance and power advantages over other compute processors, such as the CPU, in mobile SOCs. Among the pioneering GPUs to support OpenCL, Adreno GPUs have supported OpenCL since the A3x series, continuing through the A4x and A5x series to the new A6x series.

One of the advantages of OpenCL is its openness and flexibility. For example, vendors are allowed to add and expose new hardware (or software) functionality and features through OpenCL’s “extension” mechanism. Many of these vendor extensions are innovative and useful, and may later become part of the core standard that other vendors also support. The Adreno GPU team has added many extensions to expose the Adreno GPU’s advanced features, and these have proved to be very useful.

This technical presentation introduces a few private OpenCL extensions that are supported in the upcoming Adreno A6x GPU family in Qualcomm’s Snapdragon SOCs. The extensions presented here are accelerated by a dedicated hardware module inside the texture engine, which does not require the full data to be loaded into the shader processor (SP). Essentially, the extensions target commonly used image and video processing primitives, such as high order filtering (HOF), sum of absolute differences (SAD), and sum of squared differences (SSD). Using these extensions not only reduces developers’ burden in developing and optimizing custom OpenCL C code, but also provides excellent performance.

The HOF extension applies 2D filtering operations, or convolutions, on image objects by passing the filter weights as a type of an image object. The filter can be either separable or non-separable, and separable filters are specified as an array of horizontal and vertical 1D filters. This extension also supports sub-pixel convolution, also known as multi-phase filtering, where the origin of the filter aligns between pixels, something that is useful during scaling and warping operations.

Sum of Absolute Differences is an operation generally applied between two image regions for block-matching types of applications. Given a kernel size (4×4, 9×7, 16×16, etc.) and two sets of coordinates from two image regions, the absolute difference between each corresponding pixel from the two regions is then calculated and added up. The two image regions (blocks) can come from two different image objects or from within the same image.

Similarly, the Sum of Square Differences sums up the square of the difference rather than the absolute value of the difference.

Both SAD and SSD are widely used in video processing, and by adding these two extensions, their usability and performance have been improved greatly. Again, the built-in hardware in the texture engine conducts the required calculations without using shader processor (SP) to do the expensive memory load and calculations.

The presentation describes how to use these private extensions, along with a few examples and some profiling data demonstrating the performance advantage. For example, it shows that for a 15×1 Gaussian blur, the new extensions can yield a performance boost of up to 380% over the naïve Gaussian blur implementation. For a 2D Gaussian blur, the private extension also shows a considerable performance boost compared with optimized kernel code that does not use the feature, with the advantage of much simpler coding and less optimization effort.

Tue-Wed Breaks

OpenCL-based Performance Enhancement of Model-Transformations

Tamás Fekete, Gergely Mezei, Budapest University of Technology and Economics

The software industry is currently facing challenges that involve processing larger and larger amounts of data, and having to manage this data faster. Computers contain several kinds of computing units; however, in the industrial world, using the same algorithm or application across different hardware and software platforms can be challenging.

OpenCL can be a great solution: building on top of OpenCL means the result can easily be used in a different environment. We have therefore built our solution, referred to as PaMMTE (Parallel Multiplatform Model-transformation Engine), on top of OpenCL. We soon realized that the advantages of using OpenCL are not automatic, so we have identified the challenges as well as solutions to them. In this poster, we extend PaMMTE with incremental searching, which further increases the performance of the computation. The poster can support anyone creating a similar OpenCL-based solution.

Tue-Wed Breaks

Analysis of OpenCL Support for Mobile GPUs on Android

Alejandro Acosta Diaz and Johannes Totz, Twitter

The capabilities of mobile devices, like smartphones and tablets, are increasing every year. As each system-on-a-chip (SoC) generation provides better performance while also being more energy efficient compared to its predecessors, running computationally intensive tasks on the device becomes feasible. This enables advanced image filtering, video processing and machine learning applications based on Deep Learning.

The dominant platform for mobile devices is Android, running on a hugely diverse set of devices, from low-end feature phones to high-end set-top boxes. OpenCL allows one to harness that compute power with its portable cross-platform API and language. However, while the source is portable, the achievable performance is not: the different characteristics of SoCs mean that the application developer needs to take advantage of SoC-specific capabilities to achieve maximum performance.

This poster presents an analysis of OpenCL adoption across Android devices that have the Twitter Android app installed. The analysis shows that most of the sampled Android devices support OpenCL, but that there are differences in OpenCL version support, architectures and memory models. This analysis enables software developers to make an informed decision about which OpenCL implementations to target in order to provide performance portability across different hardware manufacturers and to support the largest number of devices possible.
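
For illustration, the kind of runtime probe such an analysis relies on can be as simple as the sketch below (our own minimal example, not Twitter’s implementation; production Android code would typically dlopen() libOpenCL.so rather than link against it directly):

    // Illustrative runtime probe for OpenCL support and reported version.
    #include <CL/cl.h>
    #include <cstdio>

    void probe_opencl() {
        cl_platform_id platform;
        cl_uint count = 0;
        if (clGetPlatformIDs(1, &platform, &count) != CL_SUCCESS || count == 0) {
            std::puts("no OpenCL platform present");
            return;
        }
        cl_device_id dev;
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr)
                != CL_SUCCESS) {
            std::puts("no OpenCL GPU device");
            return;
        }
        char version[128] = {0};
        clGetDeviceInfo(dev, CL_DEVICE_VERSION, sizeof(version), version, nullptr);
        std::printf("device reports: %s\n", version);  // e.g. "OpenCL 2.0 ..."
    }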

Tue-Wed Breaks

ViennaCL++: Enable TensorFlow/Eigen via ViennaCL with OpenCL C++ Flow

Tai-Liang Chen, Shih-Huan Chien and Jenq-Kuen Lee, Department of Computer Science, National Tsing Hua University

This poster presents ViennaCL++, an OpenCL C++ kernel library for the Vienna Computing Library (ViennaCL) combined with the TensorFlow/Eigen library to enable acceleration and optimization of linear algebraic computing, achieving 8x and 49x speedups for BLAS2 and BLAS3 operations, respectively, compared to the Eigen library.

Previously, TensorFlow would invoke Eigen for solvers. To enable an OpenCL flow, one can invoke Eigen via ViennaCL, which uses a metaprogramming framework to generate programs for GPU compute. With the availability of OpenCL C++, this work explores that model in the software flow. As OpenCL C++ still separates the host program from the C++ kernel computation, to utilize C++ features with Eigen we focus on constructing an OpenCL flow for TensorFlow’s underlying environments and develop a modified version, ViennaCL++, that generates OpenCL C++ kernels instead of OpenCL C kernels, compiles the kernel code to SPIR-V format, and enqueues and executes the SPIR-V program based on the state-of-the-art OpenCL 2.X and C++ kernel language specifications. Furthermore, ViennaCL++ enables object-oriented programming while achieving performance improvements through C++ class optimization, with a partial C++ standard library implemented in the OpenCL C++ language.

The OpenCL 2.X flow also enables the use of OpenCL Shared Virtual Memory (SVM) to minimize the data transfer overhead between the host and devices, reducing the overall execution time. The design of experiments covers C++11 move semantics, the SPIR-V flow, and SVM, based on benchmarks of the modified ViennaCL++. The experiments were executed on x86_64 Intel hardware and show that our scheme integrates seamlessly with C++; in other words, the performance of ViennaCL++ runtime execution is similar to the traditional OpenCL C flow. In addition, on hardware that supports the SVM feature, the OpenCL C++ kernel code achieves a 3.4x speedup for the vector-add operation.
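
The coarse-grained SVM pattern referred to here is standard OpenCL 2.x; a minimal sketch of ours (not the ViennaCL++ implementation) looks like this:

    // Minimal coarse-grained SVM sketch: host and device share one allocation,
    // avoiding explicit buffer copies. Illustrative only.
    #include <CL/cl.h>

    void svm_example(cl_context ctx, cl_command_queue q, cl_kernel vec_kernel,
                     size_t n) {
        float* data = (float*)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                         n * sizeof(float), 0);
        clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float),
                        0, nullptr, nullptr);
        for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // host writes directly
        clEnqueueSVMUnmap(q, data, 0, nullptr, nullptr);

        clSetKernelArgSVMPointer(vec_kernel, 0, data);  // no clCreateBuffer needed
        clEnqueueNDRangeKernel(q, vec_kernel, 1, nullptr, &n, nullptr,
                               0, nullptr, nullptr);
        clFinish(q);
        clSVMFree(ctx, data);
    }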

Note that the Intel OpenCL 2.1 compiler is equipped with most of the Khronos OpenCL 2.2 (C++) language features, which allows us to run our experiments. This work enables the use of C++ template abstraction, the move operator, and SVM features compared to the previous flow with the OpenCL C version.

Tue-Wed Breaks

Shaping Open Source Compute Ecosystem with Neo and clDNN

Michal Mrozek, Intel

The Compute Library for Deep Neural Networks (clDNN) is an open source performance library for Deep Learning (DL) applications intended for acceleration of DL inference on Intel® Processor Graphics. It was released in 2017 and was recently coupled with Neo, which is the open source OpenCL driver for Intel Processor Graphics. Together they provide a complete open source software stack for Deep Learning applications.

In this poster, we will discuss our architecture and design choices for Neo and how it works efficiently with clDNN. We will highlight what we learned in making OpenCL efficient and demonstrate how to get great performance using the driver standalone and in the complete software stack. We will also explain what extensions were used to enhance OpenCL’s capabilities and what may be added to the standard to allow even better usage models.

One of the major features of the Neo architecture is its N:1 submission model, which allows multiple clients to be efficiently aggregated into one command stream. This enables concurrent execution of independent command streams, which gives a boost in all clDNN topologies that underutilize the GPU. We will share how we managed to get up to a 45% performance increase in various topologies with proper usage of out-of-order queues and the N:1 driver architecture. We will also share new usage models that allow multiple independent in-order command queues to execute concurrently.
We will also take a deep dive into the capabilities of the clDNN library, sharing the current state of the implementation and the supported topologies.
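
The out-of-order queue usage mentioned above is standard OpenCL 2.x; a minimal sketch (our example, not Neo driver code) is shown below:

    // Creating an out-of-order queue: independent kernels enqueued here may run
    // concurrently, letting an N:1 submission model keep the GPU utilized.
    #include <CL/cl.h>

    cl_command_queue make_ooo_queue(cl_context ctx, cl_device_id dev) {
        const cl_queue_properties props[] = {
            CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
        };
        cl_int err = CL_SUCCESS;
        // Ordering, where needed, is expressed with event dependencies instead
        // of the queue's implicit order.
        return clCreateCommandQueueWithProperties(ctx, dev, props, &err);
    }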

In addition, we will share our experiences of creating a development environment that allows us to effectively develop and deliver an open source driver to the community on a daily basis. We will also describe the process for external contributions and our strategy for future ecosystem development.

View Poster