Accelerating SGEMM with Subgroups

Intel

The concept of a subgroup was introduced in the OpenCL 2.0 specification as an optional Khronos OpenCL extension. This poster will describe work done at Intel to accelerate the SGEMM matrix-multiplication algorithm on Intel GPUs using subgroups. With subgroups, we were able to achieve SGEMM performance comparable to our best hand-written assembly results.
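
For readers unfamiliar with the technique, the sketch below is our own illustration, not the poster's kernel: each work-item in a subgroup accumulates one element of a row of C, and values of A loaded once per subgroup are shared with sub_group_broadcast() instead of being re-read from global memory by every lane. The Intel cl_intel_subgroups extension adds further primitives (shuffles, block reads/writes) on top of this.

    // Minimal sketch only. Assumes the global NDRange is (N, M), the work-group
    // size along dimension 0 is a multiple of the subgroup size, and K and N are
    // multiples of the subgroup size. Requires the optional cl_khr_subgroups
    // extension introduced alongside OpenCL 2.0.
    #pragma OPENCL EXTENSION cl_khr_subgroups : enable

    __kernel void sgemm_subgroup_sketch(__global const float *A,   // M x K, row-major
                                        __global const float *B,   // K x N, row-major
                                        __global float       *C,   // M x N, row-major
                                        int M, int N, int K)
    {
        const int  row     = get_global_id(1);        // one C row per subgroup
        const int  col     = get_global_id(0);        // one C column per work-item
        const uint lane    = get_sub_group_local_id();
        const uint sg_size = get_sub_group_size();

        float acc = 0.0f;
        for (int k = 0; k < K; k += sg_size) {
            // Each lane loads one element of A's row, then shares it with the
            // whole subgroup via broadcast: K reads of A per row instead of
            // K * sg_size.
            float a_lane = A[row * K + k + lane];
            for (uint s = 0; s < sg_size; ++s) {
                float a = sub_group_broadcast(a_lane, s);
                acc += a * B[(k + s) * N + col];
            }
        }
        C[row * N + col] = acc;
    }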

Topics that will be covered include:

  • An overview of the most common OpenCL SGEMM algorithm.
  • Performance results and shortcomings of pre-subgroup implementations of this algorithm.
  • Performance results and benefits of the subgroup implementation of this algorithm.
  • Using Intel extensions to subgroups to get even better performance!

  • Tue 12 – Wed 13, May
  • All Day
  • Li Ka Shing Center, Stanford

Architecture-Aware Tuning and Optimization Using OpenCL

Aclectic Systems Inc.

This presentation will use profiling techniques in a case study to systematically illustrate a number of task-parallel, memory-parallel, and SIMD-parallel tuning and optimization techniques that improve the performance of a simple linear algebra algorithm targeting an Intel Xeon CPU, an Altera FPGA, and an Intel Xeon Phi coprocessor. Our goal is to demystify heterogeneous computing in a practical application context and to provide insight into how to most effectively apply OpenCL optimization techniques across a range of compute devices.
By discussing the intersection of the architectural details of several different compute devices, the code transformations applied within OpenCL, and profiling techniques, we hope to provide practical guidelines for dramatically improving application performance. This presentation will be of interest to developers who want to fully leverage a heterogeneous computing platform, as well as to the wider OpenCL developer community.
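
To give a flavour of the SIMD-parallel transformations discussed, the sketch below contrasts a scalar SAXPY-style kernel with a float4-vectorized variant; this is a hypothetical illustration of the kind of transformation covered, not the presenters' code, and its payoff differs sharply across a Xeon CPU, an Altera FPGA, and a Xeon Phi.

    // Scalar baseline: one element per work-item.
    __kernel void saxpy_scalar(__global const float *x,
                               __global float       *y,
                               float a)
    {
        int i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }

    // SIMD-vectorized variant: each work-item handles 4 elements via
    // vload4/vstore4, assuming the buffer length is a multiple of 4
    // (launch with a quarter of the global size).
    __kernel void saxpy_float4(__global const float *x,
                               __global float       *y,
                               float a)
    {
        int i = get_global_id(0);
        float4 xv = vload4(i, x);
        float4 yv = vload4(i, y);
        vstore4(a * xv + yv, i, y);
    }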

  • Tue 12 – Wed 13, May
  • All Day
  • Li Ka Shing Center, Stanford

OpenCL Accelerated Deep Learning for Visual Understanding

Intel

In order to enable a much wider range of usable devices for deep learning, and to investigate advanced techniques that leverage features not available on discrete GPUs, such as a unified memory architecture with shared virtual memory, we have been working to add OpenCL as an option in the Caffe framework. As part of this work we have had to examine many different components: evaluating the available OpenCL-based BLAS approaches, experimenting with different convolution techniques and libraries, and looking at methods that leverage specific OpenCL extensions to achieve the best performance on various architectures.
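
As an illustration of one common convolution technique, the sketch below shows the im2col lowering that Caffe uses to turn convolution into a GEMM call, which can then be handed to an OpenCL BLAS SGEMM. It is deliberately simplified (single input channel, stride 1, no padding) and is not the actual OpenCL Caffe kernel.

    // One work-item per output pixel: copy the ksize x ksize patch under that
    // position into one column of the patch matrix, so that convolution becomes
    // (filter matrix) x (patch matrix) -- a single SGEMM.
    __kernel void im2col_simple(__global const float *image,  // H x W input
                                __global float       *cols,   // (ksize*ksize) x (out_h*out_w)
                                int H, int W, int ksize)
    {
        const int out_w = W - ksize + 1;
        const int out_h = H - ksize + 1;
        const int idx   = get_global_id(0);
        if (idx >= out_h * out_w) return;

        const int oy = idx / out_w;
        const int ox = idx % out_w;

        for (int ky = 0; ky < ksize; ++ky)
            for (int kx = 0; kx < ksize; ++kx)
                cols[(ky * ksize + kx) * (out_h * out_w) + idx] =
                    image[(oy + ky) * W + (ox + kx)];
    }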

In this poster we would like to highlight the work we have done so far: a fully functional OpenCL-based Caffe solution, which we hope to contribute back to the open-source community; the initial results we are seeing compared to CUDA- and CPU-based approaches; and our plans to further enable and optimize deep learning for OpenCL, including thoughts on leveraging next-generation OpenCL 2.0 features.

  • Tue 12 – Wed 13, May
  • All Day
  • Li Ka Shing Center, Stanford

High Dynamic Range Imaging by Heterogeneous Computing in Mobile Devices

Samsung

In mobile environments, achieving high performance with low power consumption is a challenging task. GPUs have emerged as a helping hand to CPUs for general computation, since they can handle the same computational loads at lower clock frequencies. We therefore propose a solution that constructs CPU- and GPU-enabled high dynamic range (HDR) imaging using general-purpose GPU (GPGPU) computing in a heterogeneous environment to accomplish both goals. Depending on the complexity of each module, the workload is distributed over the CPUs and GPUs. The performance of the proposed solution is compared against a baseline CPU HDR implementation (multi-threaded and SIMD-optimized) in terms of elapsed time and power consumption. Experiments on 8 MP, 9.8 MP and 13 MP images show that our method achieves a 24% performance improvement and 18% power savings on a Qualcomm application processor. We expect this work to add to the few reported GPGPU solutions in the mobile environment and to help overcome the CPU's maximum-frequency limitations imposed by heat and power dissipation.
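
As an illustration of the kind of per-pixel module that such a pipeline offloads to the GPU, the sketch below fuses three exposures with a simple "hat" weight that favours well-exposed pixels. It is a generic stand-in for a per-pixel HDR stage, with names of our choosing, and is not Samsung's implementation.

    // Weight peaks at 0.5 and falls to 0 at 0 and 1, so mid-range pixels dominate.
    float hat_weight(float v)
    {
        return 1.0f - fabs(2.0f * v - 1.0f);
    }

    // Blend three normalized LDR exposures into one HDR-style output,
    // one work-item per pixel.
    __kernel void hdr_fuse3(__global const float *exp0,
                            __global const float *exp1,
                            __global const float *exp2,
                            __global float       *out,
                            int npixels)
    {
        int i = get_global_id(0);
        if (i >= npixels) return;

        float w0 = hat_weight(exp0[i]);
        float w1 = hat_weight(exp1[i]);
        float w2 = hat_weight(exp2[i]);
        float wsum = w0 + w1 + w2 + 1e-6f;   // avoid division by zero

        out[i] = (w0 * exp0[i] + w1 * exp1[i] + w2 * exp2[i]) / wsum;
    }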

  • Tue 12 – Wed 13, May
  • All Day
  • Li Ka Shing Center, Stanford

A Compute Model for Augmented Reality with Integrated-GPU Acceleration

Intel

This poster will use a Visual Shopping Assistant use case from the AR field to depict the building blocks of augmented reality, and will show how this use case can serve as a workload for fine-tuning the hardware. We discuss how the power of the GPU can be leveraged through OpenCL to process the depth maps captured by a 3D camera and to track the camera position, so that the virtual and the real world blend seamlessly. The compute-intensive operations of our use case were offloaded to the GPU using OpenCL and OpenGL. Our framework is portable across devices ranging from desktops to low-power form factors such as tablets, running both Windows and Android operating systems. We present the significant performance gain achieved by leveraging the GPU for compute. Improvements in the depth-processing algorithm, as well as additional tracking algorithms, could be implemented to obtain a better camera pose, and in the future we can also leverage OpenCL 2.0 features for better efficiency.
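
As an illustration of the depth-map processing typically offloaded to the GPU before camera-pose tracking, the sketch below back-projects a depth map into a 3D point cloud using pinhole intrinsics. The kernel and parameter names are our own illustrative assumptions, not the poster's code.

    // One work-item per pixel: unproject (u, v, depth) into camera-space
    // coordinates using the intrinsics (fx, fy, cx, cy).
    __kernel void depth_to_points(__global const float  *depth,   // H x W, metres
                                  __global float4       *points,  // H x W 3D points
                                  int W, int H,
                                  float fx, float fy, float cx, float cy)
    {
        int u = get_global_id(0);
        int v = get_global_id(1);
        if (u >= W || v >= H) return;

        float z = depth[v * W + u];
        if (z <= 0.0f) {
            // Invalid depth readings become points with w = 0.
            points[v * W + u] = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
            return;
        }

        float x = (u - cx) * z / fx;
        float y = (v - cy) * z / fy;
        points[v * W + u] = (float4)(x, y, z, 1.0f);
    }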

  • Tue 12 – Wed 13, May
  • All Day
  • Li Ka Shing Center, Stanford