Presented by: Tim Hartley, Staff Engineer, and Johan Gronqvist, Senior Graphics Software Engineer, ARM

Slides: /wp-content/uploads/iwocl-2014-workshop-Tim-Hartley.pdf

Desktop and HPC systems have enjoyed the benefits of GPU Compute for several years now. Developers have become accustomed to optimization techniques for GPUs designed for those markets. To fully exploit the compute capabilities of GPUs in mobile and embedded systems, developers need to learn different optimization techniques due to differences in hardware organization.

We will describe some optimization techniques for the ARM Mali-T600 GPU series. We will start with a naive implementation of an image processing filter and progressively transform it to improve hardware utilization on the ARM Mali-T604. We will also discuss using the RenderScript and OpenCL APIs to enable GPU Compute.
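To make the "naive implementation" starting point concrete, here is a minimal sketch of what such a baseline might look like. It is plain C running on the CPU, not the presenters' OpenCL kernel. The choice of a 4-neighbour Laplacian stencil is an assumption (the abstract does not name the filter; the agenda only lists Laplace among the case studies), and the function name `laplace_naive`, the 8-bit greyscale layout, and the border handling are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Naive 3x3 Laplacian (4-neighbour stencil) over a row-major,
 * 8-bit greyscale image. Each inner-loop iteration produces one
 * output pixel, which is how a one-pixel-per-work-item kernel
 * would typically be structured before any optimisation.
 * Assumptions for this sketch: border pixels are written as 0
 * and results are clamped to [0, 255]. */
void laplace_naive(const unsigned char *src, unsigned char *dst,
                   size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);

    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            size_t i = y * width + x;

            /* Skip the stencil on the image border. */
            if (x == 0 || y == 0 || x + 1 == width || y + 1 == height) {
                dst[i] = 0;
                continue;
            }

            /* Five scalar loads per output pixel: centre, left,
             * right, up, down. */
            int v = 4 * src[i]
                  - src[i - 1] - src[i + 1]
                  - src[i - width] - src[i + width];

            /* Clamp to the 8-bit output range. */
            if (v < 0)   v = 0;
            if (v > 255) v = 255;
            dst[i] = (unsigned char)v;
        }
    }
}
```

A version like this issues scalar loads and produces a single pixel per iteration, so it is the kind of code that the later agenda items (arithmetic optimisations, latency hiding, the load/store pipeline) would be expected to rework.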

Detailed Agenda

  • Introduction to Mali GPUs
    • Includes a brief history of ARM, our GPU roadmap, and developer resources for compute on Mali
  • Mali-T600 / T700 Compute Overview
    • OpenCL Execution Model
      • An overview of the underlying architecture of ARM’s compute-capable GPUs
      • Differences with desktop architectures
    • Mali OpenCL driver & built-in function library
    • RenderScript driver
  • Optimal OpenCL for Mali-T600 / T700
    • Programming Suggestions
      • Advice for developers moving from the desktop to the mobile world
      • How to get the best out of the Mali compute architecture
      • Arithmetic optimisations, latency hiding by parallelism, register operations, the load/store pipeline and more
    • Optimising with DS-5 Streamline and HW Counters
      • Using ARM’s DS-5 tool to determine whole-system performance
      • Hiding pipeline latency, pipeline utilisation, finding bottlenecks
      • Other available tools and techniques
    • Optimising: Two Examples
      • Some real-world examples of the above
    • General Advice
  • OpenCL Optimisation Case Studies
    • Real optimisation case studies with step-by-step analysis and performance results
      • Laplace and others (subject to time)