Monday 14 May – Tutorial, Workshop and DHPCC++ Conference
Tuesday 15 May – Conference Sessions and Khronos Panel Discussion
OpenCL and Its Eco-System – State of the Nation Address
|09:00 – 10:00|
DCompute: Compiling D to SPIR-V for Seamless Integration with OpenCL
With the advent of SPIR-V comes the ability to have custom kernel languages to address the shortcomings of OpenCL C/C++ and SYCL. This enables a more language-integrated and intuitive solution for writing and dispatching SPIR-V kernels. To this end I have developed DCompute: a framework for writing and calling kernels in D. This talk covers how DCompute, in conjunction with LDC (the LLVM D Compiler) built against a forward-ported fork of Khronos’ SPIR-V LLVM, is able to provide:
This talk also includes an overview of the modifications made to LDC to compile SPIR-V kernels and features of D that make the above possible.
|10:00 – 10:30|
What’s New in SYCL 1.2.1 and How to Explore the Features
On the 17th of November 2017, Khronos ratified the latest SYCL 1.2.1 specification. Although it is only one minor version increase, the work on the new specification represents two and a half years of effort from the SYCL group. The group spent time receiving feedback on the public specifications and working closely with C++ developers to devise the best way to approach the challenges of heterogeneous programming with real-world applications like TensorFlow.
SYCL 1.2.1 improves on the previous SYCL 1.2 specification by adding a number of “mini-features” in the form of extensions to the C++ API that simplify programming and expose more capabilities from the underlying OpenCL 1.2 interface, such as explicit copy functionality. It also brings various improvements to the interface, including better support for standard C++ allocators and extension capabilities.
In this presentation, we introduce the new SYCL 1.2.1 specification, explain the different updates and changes to the APIs and illustrate how to take advantage of them by showing some examples. We will also present the current status of the implementation of SYCL 1.2.1 for ComputeCpp, an implementation of the standard, and how to use the new features and API changes. To conclude, we will provide some hints on the future direction of the SYCL and C++ standards by examining various proposals from Codeplay that are currently works in progress.
|10:30 – 11:00|
|11:00 – 11:30|
Performance-oriented Optimizations for OpenCL Streaming Kernels on the FPGA
Field-programmable gate arrays (FPGAs) can implement streaming applications efficiently, and high-level synthesis (HLS) tools allow people who do not have complex hardware design knowledge to evaluate an application on FPGAs. There is therefore a need to understand where OpenCL and FPGAs can play a role in the streaming domain. To this end, we explore the implementation space and discuss performance optimization techniques for streaming kernels using the Intel OpenCL SDK for FPGA. On the target Nallatech 385A FPGA platform, which features an Arria 10 GX1150 FPGA, the experimental results show that FPGA resources, such as block RAMs and DSPs, can limit kernel performance before the memory bandwidth constraint takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can improve kernel performance by a factor of 2 to 10, and combining the two techniques achieves the best performance. To improve the performance of compute unit duplication for streaming kernels, the local work size needs to be tuned; the optimal value can increase performance by a factor of 3 to 70 compared to the default value.
|11:30 – 12:00|
KOCL: Kernel-level Power Estimation for Arbitrary FPGA-SoC-Accelerated OpenCL Applications
This work presents KOCL, a fully automated tool flow and accompanying software, accessible through a minimalist API, allowing OpenCL developers targeting FPGA-SoC devices to obtain kernel-level power estimates for their applications via function calls in their host code. KOCL is open-source, available with example applications at https://github.com/PRiME-project/KOCL. Crucially, in order to maximise accessibility, KOCL necessitates no user exposure to hardware whatsoever.
Contrary to reliance upon worst-case operating assumptions, knowledge of fine-grained power consumption can facilitate the deployment of adaptive energy-saving strategies including dynamic voltage and frequency scaling, task migration and power gating. Since the provision of multiple power islands within SoCs is often impractical, particularly for reconfigurable devices, our approach provides this information by relating circuit switching activity to power.
When targeted to FPGA-SoCs, such as the Altera Cyclone V devices we used to evaluate KOCL, OpenCL kernel code is compiled into bespoke hardware accelerators, one per kernel, prior to application execution. KOCL adds steps to Altera’s OpenCL tool flow in order to augment kernel accelerators with instrumentation to measure the switching activity of their most power-indicative signals. Probabilistic circuit simulations are performed to rank signals from highest to lowest predicted activity. These so-called ‘vectorless’ simulations are fast and do not require user provision of test vectors. Area-efficient counters are added to the highest-ranked signals, with control logic also instantiated to allow for their read-back at runtime. All of these steps are automatic, with none contributing significantly to overall compilation times.
During execution, KOCL’s software runs alongside OpenCL host code in order to build and continuously update an online power model of the hardware. Kernel-level activity and FPGA-wide power measurements are fed into the model, which apportions the total power between the kernel accelerators present. Users can query realtime power estimates of a particular kernel by simply passing its name to the KOCL_get() function within KOCL’s API.
Since KOCL’s power model is online—that is, trained using real data at runtime—it is able to adapt to workload and environmental changes dynamically. Unlike alternative techniques that make use of offline models, KOCL is able to compensate for sources of power behavioural change including input data, operating modes, noise, voltage, frequency, temperature, variation and degradation automatically.
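The idea of an online model that apportions a measured total between kernels can be illustrated with a small sketch. The abstract does not state the form of KOCL's model, so everything here is an assumption made for illustration: a linear model with one weight per activity counter plus a static term, trained with a normalized least-mean-squares update. The class name `OnlinePowerModel`, the kernel names and all numbers are hypothetical and are not part of KOCL's API.

```python
import random


class OnlinePowerModel:
    """Hypothetical online linear power model (not KOCL's actual implementation)."""

    def __init__(self, counters_per_kernel, learning_rate=0.5):
        # counters_per_kernel maps a kernel name to its number of activity counters.
        self.weights = {k: [0.0] * n for k, n in counters_per_kernel.items()}
        self.static = 0.0  # power not attributed to any kernel
        self.learning_rate = learning_rate

    def kernel_power(self, name, activity):
        """Estimated power of one kernel from its counter readings."""
        return sum(w * a for w, a in zip(self.weights[name], activity[name]))

    def total_power(self, activity):
        """Static term plus the estimates of all kernels."""
        return self.static + sum(self.kernel_power(k, activity) for k in self.weights)

    def update(self, activity, measured_total):
        """One normalized least-mean-squares step towards the measured total."""
        error = measured_total - self.total_power(activity)
        # Squared norm of the input vector; the leading 1.0 is the static term's input.
        norm = 1.0 + sum(a * a for acts in activity.values() for a in acts)
        step = self.learning_rate * error / norm
        self.static += step
        for name, acts in activity.items():
            for i, a in enumerate(acts):
                self.weights[name][i] += step * a


# Synthetic demonstration with made-up numbers (not measurements from the paper).
random.seed(0)
true_static = 0.2                              # watts
true_weights = {"fft": [0.8], "sobel": [0.3]}  # watts per unit of activity
model = OnlinePowerModel({"fft": 1, "sobel": 1})
for _ in range(2000):
    activity = {name: [random.random()] for name in true_weights}
    measured = true_static + sum(
        w * a
        for name in true_weights
        for w, a in zip(true_weights[name], activity[name]))
    model.update(activity, measured)

# Query the estimate for one kernel at a given activity level.
sample = {"fft": [0.5], "sobel": [0.25]}
fft_estimate = model.kernel_power("fft", sample)
```

Because the model only ever sees the device-wide total, the split between kernels is recovered from how the total varies with each kernel's activity. Continuing to call `update` at runtime is what would let such a model follow changes in workload or environment.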
Experimentation has confirmed that KOCL is both a low-overhead and high-accuracy power estimation technique. Typically, relatively few signals need to be monitored in order to achieve good results. With just eight counters per kernel, 10 mW absolute accuracy (the difference between modelled and measured kernel-level power) has been found to be obtainable, with errors as low as 5 mW achievable when additional counters are used. With low numbers of instruments, area and power overheads are only a few percent.
In this work, we presented KOCL: a tool facilitating the provision of realtime kernel-level power consumption estimates to OpenCL developers targeting FPGA-SoC devices. KOCL is open-source, its application is trivial and the flow can be used with any design implemented on an Altera FPGA.
|12:00 – 12:30|
High Performance Asynchronous Host-Device Communication Through the Intel FPGA Host Pipe Extension to OpenCL
OpenCL(TM) defines a programming framework for offload accelerators, including a host API that manipulates device kernels. The global address space provides a mechanism for data to be passed between the host program and accelerator devices. Except for shared virtual memory (SVM) atomics in recent OpenCL(TM) versions, which are not available on many platforms, data is only guaranteed to be communicated between the host and devices at coarse grained synchronization points. Such synchronization is not suited for communication with persistently running kernels, for asynchronous updates to or from executing kernels, or for some low latency applications.
The Intel(R) FPGA host pipe extension for OpenCL(TM) (will be publicly released before IWOCL’18) enables performant bidirectional FIFO communication between a device and the host. The extension leverages the OpenCL(TM) 2.x pipe API with some expansions to enable pipe data access by the host, and enables both lightweight control communication as well as high bandwidth data streaming. The OpenCL(TM) memory model is extended to enable host pipe data communication outside of OpenCL(TM) synchronization points.
A key design principle of the extension is avoidance of memory polling while checking for the availability of data. Even with an expanded memory model, polling can consume precious memory bandwidth. Host pipes provide implicit control information (data presence) as a sideband signal to the data, which enables low overhead querying for existence of data.
In the context of the extension implemented on an FPGA, use cases will be detailed to showcase the need for low frequency asynchronous communication without significant hardware or performance costs, including routing table entry updates on a device performing network routing, and asynchronous signaling of detection events in deep packet inspection. A high throughput data flow programming use case that saturates PCIe bandwidth using host pipe communication will also be described. Benchmark results will be presented from an Intel(R) Arria(R) 10 FPGA accelerator.
We have chosen IWOCL as the first public presentation of this OpenCL(TM) extension.
|12:30 – 13:00|
|13:00 – 14:00|
OpenCL and C++ – The Way Ahead
|14:30 – 15:00|
Nuclear Reactor Simulation on OpenCL FPGA : a Case Study of RSBench
FPGAs are becoming a promising choice as a heterogeneous computing component for scientific computing as floating-point optimized architectures are added to current FPGAs. Emerging high-level synthesis (HLS) tools such as the Intel OpenCL SDK for FPGA offer a streamlined design flow to facilitate the use of FPGAs in scientific computing. In this paper, we evaluate the OpenCL-based FPGA design of a nuclear reactor simulation application, RSBench. We describe the OpenCL implementations and optimization methods on an Intel Arria 10 FPGA. Compared with the naïve OpenCL kernel, the optimizations increase the performance by a factor of 295 on the FPGA. Compared with a dual-socket Intel Xeon E5-2687W host and an Nvidia K80 GPU, the performance per watt on the FPGA is 3.59X better than the CPU and 5.8X lower than the GPU.
|15:00 – 15:30|
Advantages and Pitfalls of OpenCL in Computational Physics
OpenCL offers many advantages in computational physics in comparison to traditional MPI/OpenMP parallelization. We present an MPI/OpenCL based plasma simulation code as an example of how computational physics can benefit from OpenCL. The code utilizes a hybrid modeling approach which combines elements from both fluid and particle-in-cell methods. The hybrid model includes many of the common problems encountered in computational physics, for example solving differential equations and systems of linear equations, and parallel reduction. It is therefore well suited to show the common advantages and problems that arise with an OpenCL based approach. While some problems, like solving differential equations, greatly benefit from deployment on GPUs, other well parallelizable problems may suffer in performance due to OpenCL memory management.
An additional advantage of OpenCL is that it allows for modular structuring of the code by encapsulating numerical aspects into separate OpenCL kernels. This separates the MPI based host code from the numerical problem and makes it easy to change numerical modules, for example solving algorithms. Furthermore, a combination of MPI and OpenCL makes deployment on modern heterogeneous systems, including office workstations, possible.
|15:00 – 15:30|
|15:30 – 16:00|
Khronos Panel Discussion
This session is always a favourite amongst delegates as it puts members of the Khronos OpenCL, SYCL and SPIR working groups on the stage alongside leading members of the OpenCL development community to debate the issues of the day and answer questions from the audience in the room and online.
|16:00 – 17:30|
|19:00 – 21:00|
Wednesday 16 May – Conference Sessions
Exposing OpenCL to Artists
In the talk, Jeff will share how they use OpenCL to access both GPU and CPU acceleration and how, despite the technical nature of OpenCL, many artists have taken the chance to write their own kernels to speed up their effects. Jeff is a senior mathematician at SideFX, where he has worked on Houdini since version 1.0; it is now on release 16.5! He has contributed to geometry processing, rigid body solvers, and fluid simulations and has also had the “joy” of working with many architectures over the years: SGI, Alpha, x86, Itanium, PS2, and PS3; and is still waiting for the system that solves more problems than it causes.
|09:00 – 09:30|
OpenCL Optimization and Best Practices for Qualcomm Adreno GPUs
The Adreno GPUs in Qualcomm’s Snapdragon SOCs are the world-leading mobile GPUs. These GPUs have supported OpenCL since the Adreno A3x series, through the A4x and A5x series, and in the forthcoming A6x series. Many computationally intense use cases such as imaging, video and computer vision algorithms have been accelerated by the Adreno GPU using OpenCL in commercial devices powered by Snapdragon SOCs. Qualcomm’s deep learning SDK, the Snapdragon Neural Processing Engine (SNPE), is also powered by Adreno’s OpenCL. The broad adoption of Adreno’s OpenCL capabilities is largely attributed to its superior performance and power advantage over the CPU.
This technical presentation provides guidelines and best practices on how to optimize applications using OpenCL on Adreno GPUs. Here is the outline of the presentation:
|09:30 – 10:00|
TensorFlow Acceleration on ARM Hikey Board
There is huge demand for targeting complex and large-scale machine learning applications, particularly those based on popular, actively maintained frameworks such as TensorFlow and Caffe, to a variety of platforms with accelerators ranging from high-end desktop GPUs to resource-constrained embedded or mobile GPUs, FPGAs, and DSPs. However, to deliver good performance, different platforms may require different algorithms or data structures, yet code should be easily portable and reused as much as possible across different devices. The open SYCL standard addresses this by providing parallel processing through a single-source programming model, enabling the same standard C++ code to be used on the CPU and accelerator. This allows high-level C++ abstractions and templates to be used to quickly configure device and host code to cover specific features of the platform. By targeting OpenCL, SYCL enables C++ applications such as TensorFlow to run efficiently on OpenCL devices without having to write OpenCL code.
In this presentation we propose an OpenCL-enabled back-end for TensorFlow via SYCL in order to enable developers to access a wider range of processor combinations.
SYCL is a royalty-free, cross-platform C++ abstraction layer that builds on the underlying concepts,
SYCL brings this capability to a wide range of accelerators supporting OpenCL. This lets developers create powerful, more performance-portable template libraries that can take advantage of a wide range of heterogeneous hardware and software platforms.
There are three steps for implementing the SYCL back-end for TensorFlow:
In the next step we implement a SYCL back-end for all the linear operations in the Eigen Tensor module. Each TensorFlow operator maps to an Eigen expression. As Eigen has the same expression interface regardless of the selected device, in most cases there is no need to specialize TensorFlow’s operators for SYCL.
In the last step we register the existing TensorFlow operators for the SYCL device.
The evaluation of the proposed approach was carried out on an ARM Hikey board with an ARM A53 CPU, A73 CPU, and Mali GPU.
Our results show significant improvements over those run on the ARM A53 CPU, A73 CPU, and the ACL library for the Mali GPU, especially for large-scale applications.
|10:00 – 10:30|
CLBlast: A Tuned OpenCL BLAS Library
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM) to accelerate the core of many applications (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has five main advantages over other BLAS libraries: 1) it is optimized for and tested on a large variety of OpenCL devices, including less commonly used devices such as embedded and low-power GPUs, 2) it can be explicitly tuned for specific problem sizes on specific hardware platforms, 3) it can perform operations in half-precision floating-point (FP16), saving bandwidth, time and energy, 4) it has an optional CUDA back-end, and 5) it can combine multiple operations in a single batched routine, accelerating smaller problems significantly.
This paper describes the library and demonstrates the advantages of CLBlast experimentally for different use-cases on a wide variety of OpenCL hardware.
|10:30 – 11:00|
|11:00 – 11:30|
MatCL – A New Easy-to-Use OpenCL Toolbox for MathWorks Matlab
We present “MatCL”, an OpenCL interface for MathWorks Matlab. This MEX-based toolbox aims at providing a simple and easy-to-use solution to launch OpenCL kernels and handle memory transfers from Matlab using a single command. In comparison to other Matlab OpenCL solutions, MatCL is not just an OpenCL API wrapper but encapsulates the low-level host API calls necessary to initialize devices, create OpenCL buffers from Matlab workspace variables and build and launch kernels. MatCL is primarily intended to help in the development and testing of OpenCL kernels by allowing data to be passed transparently to and from Matlab. This is especially helpful for kernels intended to be deployed on FPGAs or in an HPC environment, where it becomes necessary to test or benchmark using suitable input data or automatically verify results independently of the target architecture. Because MatCL handles the entire low-level process, this toolbox makes it possible to execute kernels without in-depth knowledge of the host implementation necessary to support the execution of OpenCL code. MatCL is also optimized to allow efficient execution of OpenCL kernels within Matlab to accelerate computationally intensive tasks without having to rely on Nvidia CUDA. In addition to single command kernel execution, MatCL also allows for an independent two-step kernel compilation and launch workflow to save kernel compile time and allow efficient repetitive kernel execution.
|11:30 – 12:00|
clARMOR: A Dynamic Buffer Overflow Detector for OpenCL Kernels
clARMOR is an open source buffer overflow detector for OpenCL kernels and memory APIs. This technical presentation proposal discusses the tool, shows relevant performance analyses, and describes its novelty with respect to other memory analysis tools.
|12:00 – 12:30|
Improving Performance of OpenCL Workloads on Intel Processors with Profiling Tools
We present techniques for improving OpenCL workload performance with Intel profiling tools on modern heterogeneous hardware. We discuss how the profiling tools interface with heterogeneous hardware and the hardware’s correspondence with architectural metrics derived from profiling. By leveraging metrics, we demonstrate improvements to OpenCL workload subscription of hardware with particular attention paid to CPU versus graphics coprocessor residency. With source code examples, we observe metrics to show the performance advantages of contemporary OpenCL extensions such as subgroups, as well as extensions enabling access to special function hardware features outside of the standard execution unit array.
|12:30 – 13:00|
|13:00 – 14:00|
Building a Brain with SYCL and Modern C++
State-of-the-art machine learning systems typically depend on energetically costly gradient-descent learning over a curated task-specific data set. Despite their successes, these methods are not well suited to building fully autonomous systems such as may employ energy-efficient accelerators targeted by OpenCL. By contrast, the brain uses low-energy local learning rules to discover the causal structure of an environment, forming semantically rich representations without supervision, and therefore exhibiting the required combination of efficiency and flexibility. To investigate these properties, a paradigm shift to dynamic “spike-based” computation is required. Historically, investigating spiking neural models has been a task for specialists, with software that is tailored to specific scientific projects, or that trades flexibility against performance. Here, we present neurosycl, a high-performance, portable spiking network simulator based on SYCL, with a modern and extensible C++ API. Our aim is to provide the necessary components for non-specialists to build a simulated brain, and to run the constructed models as close to real-time as possible.
This bipartite aim leads to two competing considerations – a simple interface, and portable performance – which are reconciled using SYCL’s single-source programming model. We describe two principal algorithmic challenges that illustrate the different hardware demands of spiking neural networks relative to deep learning networks, and how neurosycl solves them for GPU-like parallel processors via SYCL. Firstly, although the brain is akin to a parallel processor whose cores are neurons, the connections between neurons may have differing temporal delays, which results in a message-passing problem if the neurons are simulated asynchronously. Secondly, because these messages (‘spikes’) are generated chaotically, then transmitted to arbitrary target neurons with arbitrary transmission delays, a naive implementation even of a synchronous model quickly runs into a highly suboptimal memory access regime.
neurosycl’s design separates the specification of a model architecture from its simulation, so that once a model has been instantiated, its basic structure is fixed. This simplification enables us to infer the memory access pattern, and thus re-order the connection indices so that adjacent kernels access nearby memory locations. The simplification is also reflected in the API design: users can construct complex connection graphs between arbitrary neuron groups using a simple declarative interface, but runtime interactions with the model, for monitoring or I/O, are mediated by a set of simulated electrodes, combined with hooks into the simulation loop. This design mirrors that of neuroscientific experiments, and permits the user to embed the simulated brain into a virtual environment by integrating with other technologies, exposing implementation details only when necessary to allow this. We describe our API, illustrated by a number of “brain-building” examples, showing how the components compose and map via SYCL onto the hardware. We present performance comparisons across hardware platforms and alternative simulators, demonstrating portability for various network configurations and standard neuron models.
|14:30 – 15:00|
Debugging and Analyzing Programs Using the Intercept Layer for OpenCL Applications
The Intercept Layer for OpenCL Applications is a recently released open source middleware layer that can be used to debug, analyze, and optimize OpenCL applications. It fills a key gap in the OpenCL development ecosystem, and provides functionality similar to the Microsoft DirectX Debug Runtime and Vulkan Validation Layers. It requires no driver or application modifications, and has been tested on OpenCL implementations from multiple vendors.
This Technical Presentation will introduce the Intercept Layer for OpenCL Applications, describe how it works, some of its capabilities, and some of its limitations. The talk will close with a discussion of features that could be added or moved to middleware layers like the Intercept Layer for OpenCL Applications, possible additions to the OpenCL APIs that would simplify development of new features or enable new functionality, and a call for contributions.
|15:00 – 15:30|
Enabling Profiling for SYCL Applications
Since GPGPU devices have become mainstream, more and more software is being written to target many-core devices. Developers are now required to think in parallel in order to run applications with maximum performance; however, the ability to target a wide range of devices is vital. GPUs can range from very powerful discrete cards to extremely low-power embedded chips and, to be efficient, developers must be able to reuse their code in different scenarios. OpenCL™ addresses this issue by providing a C-like programming language that can target different architectures; however, it requires a deep knowledge of the underlying hardware to be used efficiently. SYCL™ provides a C++ abstraction layer that simplifies parallel development, allowing developers to leverage the power of OpenCL™ while reducing the amount of code required.
Parallel programming offers a set of new challenges for developers, and being able to understand the behavior of their applications on target hardware is important in order to ensure they run with the best performance on all hardware platforms.
The LPGPU2 project has been working to add SYCL™ profiling capabilities to the now open-source tool suite CodeXL. Originally created to perform OpenCL™ profiling on AMD hardware, CodeXL is being extended by the project to perform SYCL™ profiling on devices including all of the desktop CPUs and GPUs supported by ComputeCpp, as well as mobile low-power devices such as ARM GPUs running under the Android operating system.
In this talk, we will explain how the LPGPU2 CodeXL codebase was modified and extended to allow developers to understand and identify bottlenecks in SYCL™ applications, and how we extended it to perform accurate power consumption measurements and to analyse the application and provide feedback on how it can be improved in the context of Android development.
We’ll reveal the secrets of complex software stacks through the new profiling capabilities. The tool can be used at all levels and with all sorts of applications, from complex simulations to machine learning.
|15:00 – 15:30|
|15:30 – 17:00|
|16:00 – 17:00|
Poster Sessions – Tuesday & Wednesday
Accelerating Typical Image Processing Operations Using Qualcomm Adreno OpenCL Extensions
Qualcomm’s Adreno GPUs are industry leading mobile GPUs that demonstrate superior performance and power advantage over other compute processors such as the CPU in mobile SOCs. Among the pioneer GPUs that support OpenCL, Adreno GPUs have been supporting OpenCL since the A3x series, continuing on to the A4x and A5x series, and now the A6x series.
One of the advantages of OpenCL is its openness and flexibility. For example, vendors are allowed to add and expose new hardware (or software) functionality and features through OpenCL’s “extension” mechanism. Many of these vendor extensions are innovative and useful, and may later become part of the core standard that other vendors also support. The Adreno GPU team has added many extensions to expose the Adreno GPU’s advanced features, which have proved to be very useful.
This technical presentation introduces a few private OpenCL extensions that are supported in the upcoming Adreno A6x GPU family in Qualcomm’s Snapdragon SOCs. The extensions presented here are accelerated by a dedicated hardware module inside the texture engine, which does not require full data loading to the shader processor (SP). Essentially, the extensions target commonly used image and video processing primitives, such as high order filtering (HOF), sum of absolute differences (SAD), and sum of square differences (SSD). Using these extensions not only reduces developers’ burden in developing and optimizing custom OpenCL C code, but also provides excellent performance.
The HOF extension applies 2D filtering operations, or convolutions, on image objects by passing the filter weights as a type of an image object. The filter can be either separable or non-separable, and separable filters are specified as an array of horizontal and vertical 1D filters. This extension also supports sub-pixel convolution, also known as multi-phase filtering, where the origin of the filter aligns between pixels, something that is useful during scaling and warping operations.
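The relationship between a separable filter and its non-separable equivalent can be checked with a short sketch. This is plain Python on nested lists and does not use the Adreno extension itself. The function names, the 1-2-1 filter taps and the test image are assumptions chosen for illustration.

```python
def filter_2d(image, kernel):
    """Apply a 2D filter over the 'valid' region (no padding, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[y + i][x + j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def filter_separable(image, horizontal, vertical):
    """Apply a horizontal 1D pass followed by a vertical 1D pass."""
    rows = filter_2d(image, [horizontal])            # 1 x N kernel
    return filter_2d(rows, [[v] for v in vertical])  # M x 1 kernel


horizontal = [1, 2, 1]
vertical = [1, 2, 1]
# The equivalent non-separable kernel is the outer product of the two 1D filters.
kernel = [[v * h for h in horizontal] for v in vertical]

# Small integer test image so that both results can be compared exactly.
image = [[(3 * y + 5 * x) % 7 for x in range(5)] for y in range(5)]
separable_result = filter_separable(image, horizontal, vertical)
full_result = filter_2d(image, kernel)
```

For an M×N filter, the two-pass form needs M + N multiplications per output pixel instead of M × N, which is why a separable filter is worth specifying as a pair of 1D filters when the kernel allows it.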
Sum of Absolute Differences is an operation generally applied between two image regions for block-matching types of applications. Given a kernel size (4×4, 9×7, 16×16, etc.) and two sets of coordinates from two image regions, the absolute difference between each corresponding pixel from the two regions is then calculated and added up. The two image regions (blocks) can come from two different image objects or from within the same image.
Similarly, the Sum of Square Differences sums up the square of the difference rather than the absolute value of the difference.
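The two block-matching measures described above can be written out as a minimal reference sketch in plain Python. It is not the Adreno extension API; the helper names `block`, `sad` and `ssd` and the sample image are assumptions for illustration.

```python
def block(image, top, left, height, width):
    """Return the height x width region of a row-major image at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def ssd(block_a, block_b):
    """Sum of squared differences between two equally sized blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
# Two 2x2 regions taken from the same image, as the text allows.
region_a = block(image, 0, 0, 2, 2)  # [[1, 2], [5, 6]]
region_b = block(image, 2, 2, 2, 2)  # [[11, 12], [15, 16]]

sad_value = sad(region_a, region_b)  # four differences of 10 each: 40
ssd_value = ssd(region_a, region_b)  # four squared differences of 100 each: 400
```

In a block-matching search, one of these values would be computed for each candidate position and the position with the smallest value kept as the best match.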
Both SAD and SSD are widely used in video processing, and by adding these two extensions, their usability and performance have been improved greatly. Again, the built-in hardware in the texture engine conducts the required calculations without using the shader processor (SP) to do the expensive memory loads and calculations.
The presentation describes how to use these private extensions, along with a few examples and some profiling data demonstrating the performance advantage. For example, it shows that for a 15×1 Gaussian blur, the new extensions could yield a performance boost of up to 380% over the naïve Gaussian blur implementation. For 2D Gaussian blur, the private extension also shows a considerable performance boost compared with optimized kernel code that does not use the feature, with the advantage of much simpler coding and less optimization effort.
OpenCL-based Performance Enhancement of Model-Transformations
The software industry is currently facing challenges that involve processing larger and larger amounts of data as well as having to manage this data faster. Computers contain several computing units; however, in the industrial world, using the same algorithm or application across different hardware and software platforms can be challenging.
OpenCL can be a great solution. Building on top of OpenCL lets the result be used easily in different environments. Therefore, we have built our solution, referred to as PaMMTE (Parallel Multiplatform Model-transformation Engine), on OpenCL. We soon realized that the advantages of using OpenCL are not obvious, so we have identified the challenges as well as the solutions. In this poster, we extend PaMMTE with incremental searching, which also increases the performance of the computation. This poster can help those creating a similar OpenCL-based solution.
Analysis of OpenCL Support for Mobile GPUs on Android
The capabilities of mobile devices, like smartphones and tablets, are increasing every year. As each system-on-a-chip (SoC) generation provides better performance while also being more energy efficient compared to its predecessors, running computationally intensive tasks on the device becomes feasible. This enables advanced image filtering, video processing and machine learning applications based on Deep Learning.
The dominant platform for mobile devices is Android, running on a hugely diverse set of devices, from low-end feature phone to high-end setop box. OpenCL allows one to harness that compute power with its portable cross-platform API and language. However, while being source portable, the achievable performance is not. The different characteristics of SoCs mean that the application developer needs to take advantage of SoC-specific capabilities to achieve maximum performance.
This poster presents an analysis of OpenCL adoption across Android devices that have the Twitter Android app installed. The analysis shows that most of the sampled Android devices support OpenCL, but that they differ in supported OpenCL version, architecture and memory model. This analysis enables software developers to make an informed decision about which OpenCL implementations to target in order to provide performance portability across different hardware manufacturers and to support the largest possible number of devices.
ViennaCL++: Enable TensorFlow/Eigen via ViennaCL with OpenCL C++ Flow
This poster presents ViennaCL++, an OpenCL C++ kernel library for the Vienna Computing Library (ViennaCL), combined with the TensorFlow/Eigen library to accelerate and optimize linear algebra computation. It achieves speedups of 8 times and 49 times for BLAS2 and BLAS3 operations, respectively, compared to the Eigen library.
Previously, TensorFlow would invoke Eigen for solvers. To enable an OpenCL flow, one can invoke Eigen via ViennaCL, which uses a metaprogramming framework to generate programs for GPU compute. With the availability of OpenCL C++, this work explores that model in the software flow. Because OpenCL C++ still splits the computation between a host program and C++ kernels, and in order to use C++ features with Eigen, we focus on constructing the OpenCL flow for the environment underlying TensorFlow. We develop a modified version of ViennaCL++ that generates OpenCL C++ kernels instead of OpenCL C kernels, compiles the kernel code to SPIR-V format, and enqueues and executes the SPIR-V program, based on the state-of-the-art OpenCL 2.X and C++ kernel language specifications. Furthermore, ViennaCL++ enables object-oriented programming while improving performance through C++ class optimization, with part of the C++ standard library implemented in the OpenCL C++ language.
The OpenCL 2.X flow also enables the use of OpenCL Shared Virtual Memory (SVM) to minimize the data transfer overhead between the host and devices, reducing the overall execution time. The experiments cover C++11 move semantics, the SPIR-V flow, and SVM, using benchmarks of the modified version of ViennaCL++. They were run on x86_64 Intel hardware and show that our scheme integrates seamlessly with C++: the runtime performance of ViennaCL++ is similar to that of the traditional OpenCL C flow. In addition, on hardware that supports SVM, the OpenCL C++ kernel code achieves a 3.4 times speedup for the vector add operation.
Note that the Intel OpenCL 2.1 compiler supports most of the Khronos OpenCL 2.2 (C++) language, which made our experiments possible. Compared with the previous flow based on OpenCL C, this work enables the use of C++ template abstraction, move semantics, and SVM features.
Shaping Open Source Compute Ecosystem with Neo and clDNN
The Compute Library for Deep Neural Networks (clDNN) is an open source performance library for Deep Learning (DL) applications intended for acceleration of DL inference on Intel® Processor Graphics. It was released in 2017 and was recently coupled with Neo, which is the open source OpenCL driver for Intel Processor Graphics. Together they provide a complete open source software stack for Deep Learning applications.
In this poster, we will discuss our architecture and design choices for Neo and how it works efficiently with clDNN. We will highlight what we learned in making OpenCL efficient and demonstrate how to get great performance using the driver standalone and in this complete software stack. We will also explain which extensions were used to enhance OpenCL capabilities and what may be added to the standard to allow even better usage models.
One of the major features of the Neo architecture is the N:1 submission model, which allows multiple clients to be efficiently aggregated into one command stream. This allows concurrent execution of independent command streams, which boosts performance in all clDNN topologies that underutilize the GPU. We will share how we achieved a performance increase of up to 45% in various topologies through proper use of out-of-order queues and the N:1 driver architecture. We will also share new usage models that allow multiple independent in-order command queues to execute concurrently.
In addition, we will share our experience of creating a development environment that allows us to effectively develop and deliver an open source driver to the community on a daily basis. We will also describe the process for external contributions and our strategy for future ecosystem development.