In this talk, we will discuss how support for a diverse set of OpenCL features affects the performance of graph applications executing on GPU platforms. Given that adoption of OpenCL features varies widely across vendors, these results can help quantify the performance benefits of these features and potentially motivate their timely adoption.
Our findings are drawn from the experience of developing an OpenCL backend for a state-of-the-art graph application DSL, originally developed with a CUDA backend. This DSL allows competitive algorithms for applications such as breadth-first search, PageRank, and single-source shortest path to be written at a high level. The compiler can then apply a series of optimisations and generate executable OpenCL code. These optional optimisations exercise various features of OpenCL. At one end of the spectrum, applications compiled without optimisations require only core OpenCL features provided in version 1.1 of the standard. At the other end, one optimisation requires inter-workgroup forward progress guarantees, which OpenCL does not yet officially support but which have been empirically validated. Other optimisations require OpenCL features such as fine-grained memory consistency guarantees (added in OpenCL 2.0) and subgroup primitives (added to core in OpenCL 2.1).
Our compiler can apply 6 independent optimisations. For each optimisation, we determine the minimum version of OpenCL required to support it. We find that the relevant OpenCL versions, and the number of optimisations they support, are: 1.1 (2 optimisations are supported), 2.0 (adds 1 optimisation), and 2.1 (adds 2 more optimisations). We additionally introduce a notional version, FP (forward progress), that adds support for unofficial forward progress guarantees, which are required for the final optimisation. As feature support increases, so does the number of supported optimisations. For each optimisation, we will discuss the OpenCL features it requires and the idioms in which those features are used. Use-case discussions of these features (i.e. memory consistency and subgroup primitives) are valuable because there appear to be very few open-source examples; e.g. a GitHub search shows only a small number of examples.
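The version counts above accumulate as support grows. A minimal Python sketch of that arithmetic follows; the names `NEWLY_SUPPORTED` and `cumulative_support` are hypothetical and chosen only for illustration, while the per-version counts are taken from the text.

```python
# Optimisations newly enabled at each support level, as stated in the text.
# "FP" is the notional level that assumes forward progress guarantees.
# The names below are illustrative and not part of the compiler.
NEWLY_SUPPORTED = [("1.1", 2), ("2.0", 1), ("2.1", 2), ("FP", 1)]


def cumulative_support(levels):
    """Return the running total of supported optimisations per level."""
    total = 0
    result = {}
    for version, added in levels:
        total += added
        result[version] = total
    return result


if __name__ == "__main__":
    for version, count in cumulative_support(NEWLY_SUPPORTED).items():
        print(f"Level {version}: {count} of 6 optimisations available")
```

The running totals are 2, 3, 5 and 6, so all six optimisations are available only at the FP level.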
The compiler infrastructure enables us to carry out a large and controlled study, in which the performance benefit of various levels of OpenCL support can be evaluated. We gather runtime data exhaustively on all combinations across: all optimisations, 17 applications, 3 graph inputs, and 6 different GPUs (spanning 4 vendors: Nvidia, AMD, Intel and ARM). Our results show that if feature support is limited to OpenCL 2.0 (and below), the available optimisations fail to achieve any speedup in over 70% of the cases. If support for OpenCL 2.1 is added, this number drops to 60%; however, in all of these cases, the observed application speedup is modest, rarely exceeding 2x. Finally, if officially unsupported forward progress guarantees can be assumed, then speedups can be observed in over half of the cases, including impressive speedups of over 14x for AMD and Intel GPUs. We believe this provides compelling evidence for considering forward progress properties for adoption in a future OpenCL version.
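To give a rough sense of the size of such an exhaustive study, the Python sketch below counts the runs under one loudly stated assumption: that "all combinations across all optimisations" means every on/off setting of the 6 independent optimisations. The text does not confirm this reading, and the function names are hypothetical.

```python
from itertools import product

# Figures taken from the text.
NUM_OPTIMISATIONS = 6
NUM_APPLICATIONS = 17
NUM_INPUTS = 3
NUM_GPUS = 6


def optimisation_settings(n):
    """Enumerate every on/off setting of n independent optimisations.

    Assumption: each optimisation can be toggled independently of the rest,
    giving 2**n settings. This is one reading of the text, not a stated fact.
    """
    return list(product((False, True), repeat=n))


def count_runs(n_opts, n_apps, n_inputs, n_gpus):
    """Count (setting, application, input, GPU) tuples in the full cross product."""
    return len(optimisation_settings(n_opts)) * n_apps * n_inputs * n_gpus


if __name__ == "__main__":
    total = count_runs(NUM_OPTIMISATIONS, NUM_APPLICATIONS, NUM_INPUTS, NUM_GPUS)
    print(f"{total} runs under the stated assumption")
```

Under that assumption there are 64 optimisation settings, so the cross product is 64 x 17 x 3 x 6 runs; repeated timing runs, if any, would multiply this further.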