

#### University of Stuttgart Germany



# Institute for Parallel and Distributed Systems

#### Scientific Computing





Marcel.Breyer@ipvs.uni-stuttgart.de

Marcel Breyer

A Comparison of SYCL, OpenCL, CUDA, and OpenMP for Massively Parallel Support Vector Machine Classification on Multi-Vendor Hardware

> 10th IWOCL & SYCLcon, May 10-12, 2022

#### **Motivation - Parallel Programming Languages**

Papers mentioning parallel programming languages. Data according to Google Scholar (April 27th 2020)



Source: https://www.iwocl.org/wp-content/uploads/iwocl-syclcon-2020-panel-slides.pdf (slide 2)


## Example Application




supervised machine learning: binary classification






- SVMs have to solve a convex quadratic programming problem
  - ➔ state-of-the-art: Sequential Minimal Optimization (SMO) (proposed by Platt in 1998)
  - ➔ inherently sequential algorithm
- many SVM implementations modify SMO to exploit some parallelism
  - ➔ still not well suited for modern, highly parallel hardware

## Least Squares Support Vector Machine (LS-SVM)

(proposed by Suykens and Vandewalle in 1999)

- reformulation of the standard SVM as solving a system of linear equations
- massively parallel algorithms known, e.g., Conjugate Gradient (CG)
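
As a sketch of the reformulation named above (the notation is assumed here, not taken from the slides): with training points $x_1, \dots, x_m$, labels $y_i \in \{-1, +1\}$, a kernel function $k$, and a regularization parameter $C > 0$, one common way to write the LS-SVM system of linear equations is

$$
\begin{bmatrix} K + \frac{1}{C} I & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix}
\begin{bmatrix} \alpha \\ b \end{bmatrix}
=
\begin{bmatrix} y \\ 0 \end{bmatrix},
\qquad K_{ij} = k(x_i, x_j)
$$

Here $\alpha$ holds one weight per training point and $b$ is the bias. The block $K + \frac{1}{C} I$ is symmetric positive definite for a valid kernel, so an iterative solver such as CG can be applied, for example after eliminating $b$ from the system.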

## Implementation


## Parallel Least Squares Support Vector Machine (PLSSVM)

- modern C++17
- single and double precision via template parameter
- backend and target platform selectable at runtime
- parallelizes matrix-vector multiplication in CG algorithm
- multi-GPU support for the linear kernel function
- drop-in replacement for LIBSVM's svm-train and svm-predict executables
- currently only binary classification and dense calculations
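
To illustrate why the matrix-vector multiplication is the part worth parallelizing, here is a minimal CG sketch. It assumes a symmetric positive definite system that is only accessible through its matrix-vector product; the names `matvec_fn`, `dot`, and `conjugate_gradient` are hypothetical and not taken from the PLSSVM sources. The precision is a template parameter, in the spirit of the list above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper names; none of them are taken from the PLSSVM sources.
template <typename real_type>
using matvec_fn = std::function<void(const std::vector<real_type> &, std::vector<real_type> &)>;

template <typename real_type>
real_type dot(const std::vector<real_type> &a, const std::vector<real_type> &b) {
    assert(a.size() == b.size());
    real_type sum{ 0 };
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Solves A x = b for a symmetric positive definite A that is only available
// through its matrix-vector product apply_A(in, out).
template <typename real_type>
std::vector<real_type> conjugate_gradient(const matvec_fn<real_type> &apply_A,
                                          const std::vector<real_type> &b,
                                          const real_type epsilon,
                                          const std::size_t max_iter) {
    const std::size_t n = b.size();
    std::vector<real_type> x(n, real_type{ 0 });  // initial guess x = 0
    std::vector<real_type> r = b;                 // residual r = b - A x = b
    std::vector<real_type> d = r;                 // first search direction
    std::vector<real_type> Ad(n);

    real_type delta = dot(r, r);
    // stop once the squared residual norm dropped by a factor of epsilon^2
    const real_type threshold = epsilon * epsilon * delta;

    for (std::size_t iter = 0; iter < max_iter && delta > threshold; ++iter) {
        apply_A(d, Ad);  // the only O(n^2) step per iteration
        const real_type alpha = delta / dot(d, Ad);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * d[i];
            r[i] -= alpha * Ad[i];
        }
        const real_type delta_new = dot(r, r);
        const real_type beta = delta_new / delta;
        for (std::size_t i = 0; i < n; ++i) {
            d[i] = r[i] + beta * d[i];
        }
        delta = delta_new;
    }
    return x;
}
```

All vector updates are linear in the problem size, so the call to `apply_A` dominates the runtime. Passing it as a callable also means the matrix never has to be stored explicitly.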













#### Backends - OpenMP

- CPU only (no target offloading for GPUs)
- only directive based constructs
- not yet optimized to the same level as the GPU backends
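
A directive-based CPU version of a matrix-vector product could look roughly like the following. This is a minimal sketch with an explicitly stored dense matrix and a hypothetical function name `matvec_openmp`; it is not the PLSSVM kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense matrix-vector product out = A * v for a row-major n x n matrix.
// The directive is the only parallel construct; if the compiler has no OpenMP
// support enabled, the pragma is ignored and the loop runs sequentially.
template <typename real_type>
std::vector<real_type> matvec_openmp(const std::vector<real_type> &A,
                                     const std::vector<real_type> &v) {
    const std::size_t n = v.size();
    assert(A.size() == n * n);
    std::vector<real_type> out(n);

    #pragma omp parallel for
    for (std::size_t row = 0; row < n; ++row) {
        real_type sum{ 0 };
        for (std::size_t col = 0; col < n; ++col) {
            sum += A[row * n + col] * v[col];
        }
        out[row] = sum;  // each thread writes distinct rows, so no atomics are needed
    }
    return out;
}
```

Each row is independent, which is why a single `parallel for` directive suffices here.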





#### Backends - CUDA

- optimizations: blocking, caching, padding
- block-level caching (global ↔ shared memory)
- thread-level caching (shared memory ↔ register)
- block sizes configurable at compile time
- Ahead-of-Time (AOT) instead of Just-in-Time (JIT) compilation
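
The idea behind blocking, caching, and padding can be sketched on the CPU. This is a plain C++ analogue under stated assumptions, not the CUDA kernel: it uses the linear kernel, a hypothetical compile-time tile size `BLOCK_SIZE`, and the hypothetical names `padded_size` and `implicit_matvec_blocked`. The kernel matrix is never stored; each tile of kernel values is computed on the fly and immediately multiplied with the vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile size; a macro so that it can be changed at compile time,
// e.g., with -DBLOCK_SIZE=8.
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 4
#endif
constexpr std::size_t block_size = BLOCK_SIZE;

// Padding: round n up to the next multiple of the tile size.
constexpr std::size_t padded_size(const std::size_t n) {
    return (n + block_size - 1) / block_size * block_size;
}

// Computes out_i = sum_j <x_i, x_j> * v_j (linear kernel) without storing the
// kernel matrix. `points` is row-major with padded_size(n) rows of `dim`
// features; the rows >= n must be zero, so no tile needs a boundary check.
std::vector<double> implicit_matvec_blocked(const std::vector<double> &points,
                                            const std::vector<double> &v,
                                            const std::size_t n, const std::size_t dim) {
    const std::size_t n_pad = padded_size(n);
    assert(points.size() == n_pad * dim && v.size() == n_pad);
    std::vector<double> out(n_pad, 0.0);

    for (std::size_t bi = 0; bi < n_pad; bi += block_size) {
        for (std::size_t bj = 0; bj < n_pad; bj += block_size) {
            // Blocking: one small tile of kernel values that fits into fast memory.
            std::array<std::array<double, block_size>, block_size> tile{};
            for (std::size_t f = 0; f < dim; ++f) {
                for (std::size_t i = 0; i < block_size; ++i) {
                    // Caching: the loaded feature is reused for every j of the tile.
                    const double xi = points[(bi + i) * dim + f];
                    for (std::size_t j = 0; j < block_size; ++j) {
                        tile[i][j] += xi * points[(bj + j) * dim + f];
                    }
                }
            }
            for (std::size_t i = 0; i < block_size; ++i) {
                for (std::size_t j = 0; j < block_size; ++j) {
                    out[bi + i] += tile[i][j] * v[bj + j];
                }
            }
        }
    }
    out.resize(n);  // drop the padded entries
    return out;
}
```

Zero-padded points yield zero kernel values for the linear kernel, so they contribute nothing to the result. On a GPU, the two levels of reuse would map to shared memory and registers instead of CPU caches.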





#### Backends - OpenCL

- same optimizations as in CUDA
- C++ (RAII) wrapper around OpenCL handles
- custom floating point atomic functions via atomic\_cmpxchg and atom\_cmpxchg
- no AOT compilation, JIT only
- custom OpenCL kernel binary caching implementation
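
The compare-exchange technique behind such custom floating point atomics can be sketched in standard C++17. This is an illustration of the general pattern with a hypothetical function name `atomic_add`, not the OpenCL code itself; OpenCL's compare-exchange functions operate on integer types, so there the floating point bits would additionally have to be reinterpreted as an integer.

```cpp
#include <atomic>
#include <cassert>

// Adds `value` to `target` atomically using only compare-and-exchange.
// Returns the value that was stored before the addition.
inline double atomic_add(std::atomic<double> &target, const double value) {
    double expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak overwrites `expected` with the current
    // value, so the sum is recomputed from fresh data on every retry.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}
```

The loop only repeats when another thread modified `target` between the read and the exchange, so under low contention it usually completes in one attempt.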






#### Backends - SYCL

- same optimizations as in CUDA
- DPC++ and hipSYCL supported
- other SYCL implementations (e.g., ComputeCpp, triSYCL, neoSYCL) not investigated (missing features)
- DPC++ with AOT compilation
- two implementations of the same kernel:
  - nd\_range: directly comparable to CUDA and OpenCL
  - hierarchical: acceptable performance using hipSYCL on CPUs





#### Setup - Hardware

| GPUs   |                | memory       | bandwidth  | FP64         |
|--------|----------------|--------------|------------|--------------|
| NVIDIA | A100           | 40 GB HBM2e  | 1555 GB/s  | 9.7 TFLOPS   |
|        | P100           | 16 GB HBM2   | 732 GB/s   | 4.7 TFLOPS   |
|        | RTX 3080       | 10 GB GDDR6X | 760.3 GB/s | 465.1 GFLOPS |
|        | GTX 1080 Ti    | 11 GB GDDR5X | 484.4 GB/s | 354.4 GFLOPS |
| AMD    | Radeon Pro VII | 16 GB HBM2   | 1024 GB/s  | 6.5 TFLOPS   |
| Intel  | UHD P630       | 53.8 GB DDR4 | 41.6 GB/s  | 96 GFLOPS    |
|        | Iris Xe MAX    | 4 GB LPDDR4x | 68 GB/s    | emulated     |

| CPUs  |                | base/boost freq. | bandwidth  | # cores/# HT               |
|-------|----------------|------------------|------------|----------------------------|
| AMD   | EPYC 7742      | 2.25/3.4 GHz     | 204.8 GB/s | $2 \cdot 64 / 2 \cdot 128$ |
|       | Ryzen TR 3960X | 3.8/4.5 GHz      | 102.4 GB/s | 24/48                      |
| Intel | Xeon Phi 7210  | 1.3/1.5 GHz      | 102 GB/s   | 64/256                     |
|       | Xeon E-2176G   | 3.7/4.7 GHz      | 41.6 GB/s  | 6/12                       |
|       | Core i9-10920X | 3.5/4.6 GHz      | 94 GB/s    | 12/24                      |


#### NVIDIA GPUs: A100 vs. RTX 3080 - 4096 features



#### AMD GPU: Radeon Pro VII



#### Intel CPU: Core i9-10920X



#### **Additional Observations**

- results for the P100 and GTX 1080 Ti nearly identical to those for the A100 and RTX 3080, respectively
- overall behavior the same on Intel GPUs
- OpenCL faster than DPC++ on the Iris Xe MAX GPU
- overall behavior on CPUs nearly identical (except OpenMP)
- on all hardware: DPC++ hierarchical slower than nd\_range

#### **OpenCL JIT Compilation Overhead**




## Conclusion


#### **Conclusion - Contribution**

- Open Source *Parallel Least Squares Support Vector Machine* (PLSSVM)
  - multiple backends: OpenMP, CUDA, OpenCL, SYCL
  - able to target GPUs from NVIDIA, AMD, and Intel as well as CPUs
- comparison of a standard problem (matrix-vector multiplication in the CG algorithm) using different programming frameworks on different hardware platforms
- based on our findings: recommendation of which framework to use when



https://github.com/SC-SGS/PLSSVM



see our paper for more results

(figure: recommendation of which framework to use, starting from the question "NVIDIA GPUs only?")
























#### **Future Work**

- further optimize the OpenMP backend
- add additional backends (e.g., Kokkos or OpenMP target offloading)
- investigate other SYCL implementations, e.g., ComputeCpp
- investigate performance on other hardware platforms, e.g., FPGAs

#### **Current Work in Progress**

- sparse (CG) implementation
- support for distributed systems and multi-node execution via MPI
- investigate mixed precision and the usage of special ML hardware (e.g., NVIDIA's tensor cores)





Marcel Breyer

Marcel.Breyer@ipvs.uni-stuttgart.de

Alexander Van Craen

Alexander.Van-Craen@ipvs.uni-stuttgart.de

Dirk Pflüger

Dirk.Pflueger@ipvs.uni-stuttgart.de