[fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”/wp-content/uploads/iwocl-karol-jerome-blurring-the-boundary-between-cpu-and-gpu-opencl.pdf” button=”View Presentation Slides” linktarget=”_blank” modal=”” button_size=”” button_type=”” buttoncolor=”default” title=”Blurring the Boundary between CPU and GPU” description=”SU5WSVRFRCBUQUxLIGJ5OiA8c3Ryb25nPkplcm9tZSBHbGlzc2U8L3N0cm9uZz4sIEthcm9sIEhlcmJzdCAgfMKgIFJlZGhhdA==” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
OpenCL 2.0 defines several levels of shared virtual memory (SVM), a feature that is still not widely adopted by end users. This talk aims to provide insight into how SVM can help OpenCL programmers in their applications. It also looks at some of today’s pitfalls and limitations, and at the work under way inside the Linux kernel to address them and improve the usability of this feature. The talk will reference the work undertaken to add OpenCL support to Nouveau through SPIR-V/NIR in order to be able to use HMM (Heterogeneous Memory Management).
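As background for the discussion, the following is a minimal sketch of coarse-grained buffer SVM as defined by OpenCL 2.0 (not code from the talk). The context, queue and kernel are assumed to already exist, and error checking is omitted; fine-grained system SVM removes the explicit map/unmap step entirely, which is the level of transparency that work such as HMM aims to enable.

```c
#include <CL/cl.h>
#include <string.h>

void run_with_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                  const float *input, size_t n)
{
    /* One allocation visible to host and device through the same pointer. */
    float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0);

    /* Coarse-grained buffer SVM: map before the host touches the memory... */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float), 0, NULL, NULL);
    memcpy(data, input, n * sizeof(float));
    clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

    /* ...then hand the raw pointer to the kernel; no cl_mem object is needed. */
    clSetKernelArgSVMPointer(kernel, 0, data);
    size_t gws = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(queue);

    clSVMFree(ctx, data);
}
```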
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”/wp-content/uploads/iwocl-2019-john-lawson-accelerated-neural-networks-on-opencl-devices-using-sycl-dnn.pdf” button=”View Presentation Slides” linktarget=”_blank” modal=”” button_size=”” button_type=”” buttoncolor=”default” title=”Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN” description=”Um9kIEJ1cm5zLCA8c3Ryb25nPkpvaG4gTGF3c29uPC9zdHJvbmc+LCBEdW5jYW4gTWNCYWluIGFuZCBEYW5pZWwgU291dGFyICB8wqDCoENvZGVwbGF5″ margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
Over the past few years machine learning has seen a renewed explosion of interest, following a number of studies showing the effectiveness of neural networks in a range of tasks previously considered incredibly hard. Neural networks’ effectiveness in the fields of image recognition and natural language processing stems primarily from the vast amounts of data available to companies and researchers, coupled with the huge amounts of compute available in modern accelerators such as GPUs, FPGAs and ASICs. Developers have a number of approaches available for utilizing GPGPU technologies, such as SYCL, OpenCL and CUDA; however, many applications require the same low-level mathematical routines. Libraries dedicated to accelerating these common routines allow developers to make full use of the available hardware without requiring low-level knowledge of the hardware themselves. However, such libraries are often provided by hardware manufacturers for specific hardware, such as cuDNN for Nvidia hardware or MIOpen for AMD hardware.
SYCL-DNN is a new open-source library dedicated to providing accelerated routines for neural network operations which are hardware and vendor agnostic. Built on top of the SYCL open standard and written entirely in standard C++, SYCL-DNN allows a user to easily accelerate neural network code for a wide range of hardware using a modern C++ interface. The library is tested on AMD’s OpenCL for GPU, Intel’s OpenCL for CPU and GPU, ARM’s OpenCL for Mali GPUs as well as ComputeAorta’s OpenCL for RCar CVEngine and host CPU. In this talk we will present performance figures for SYCL-DNN on this range of hardware, and discuss the requirements for achieving high performance on such a varied set of accelerators with such different hardware features.
For additional information visit: https://github.com/codeplaysoftware/SYCL-DNN
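To illustrate the kind of modern C++ interface the library builds on, here is a plain SYCL 1.2.1-style kernel (a pointwise ReLU). This is an illustration of the SYCL programming model only, not SYCL-DNN’s own API; names are illustrative.

```cpp
#include <CL/sycl.hpp>
#include <vector>

void relu(std::vector<float> &values) {
  cl::sycl::queue q;  // default device selection
  {
    cl::sycl::buffer<float, 1> buf(values.data(),
                                   cl::sycl::range<1>(values.size()));
    q.submit([&](cl::sycl::handler &cgh) {
      auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
      cgh.parallel_for<class relu_kernel>(
          cl::sycl::range<1>(values.size()),
          [=](cl::sycl::id<1> i) { acc[i] = acc[i] > 0.0f ? acc[i] : 0.0f; });
    });
  } // buffer destructor copies the results back to the host vector
}
```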
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”/wp-content/uploads/iwocl-2019-gordon-brown-how-to-deploy-ai-software-to-self-driving-cars.pdf” button=”View Presentation Slides” linktarget=”_blank” modal=”” button_size=”” button_type=”” buttoncolor=”default” title=”How to Deploy AI Software to Self Driving Cars” description=”Um9kIEJ1cm5zLCA8c3Ryb25nPkdvcmRvbiBCcm93bjwvc3Ryb25nPiwgTWVlbmFrc2hpIFJhdmluZHJhbiBhbmQgTmljb2xhcyBNaWxsZXIgIHzCoMKgQ29kZXBsYXk=” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
The automotive industry is embracing new challenges to deliver self-driving cars, and this in turn requires increasingly complex hardware and software. Software developers are leveraging artificial intelligence, and in particular machine learning, to deliver the capabilities required for an autonomous vehicle to operate. This has driven the integration of heterogeneous hardware into automotive systems offering multi-core processors capable of performing the intense algorithms required for artificial intelligence and machine learning. These multi-core processors can be used to vastly speed up common operations used in AI and machine learning algorithms.
This session will demonstrate how artificial intelligence software can be developed and accelerated using SYCL and OpenCL on Yocto Linux, then targeted at a range of hardware including the Renesas R-Car IMP range of automotive processors. The OpenCL model enables extensive use of heterogeneous hardware, including fully programmable IP, efficient data transfer using DMA and on-chip memory, and fixed-function IP blocks such as a CNN engine for high-throughput convolution operations, exposed via OpenCL built-in kernels. We will look at how memory mapping, software pipelining and parallelism are used to achieve efficiency. These hardware architectures include AI accelerator processors specifically designed to be used in the next generation of vehicles. In particular, the processors are designed to tackle complex algorithms whilst limiting the overall consumption of power. Benchmarks will be presented to show how portable code can also deliver performance for developers using this hardware.
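As an illustration of how fixed-function IP is reached through OpenCL built-in kernels, the sketch below enumerates and instantiates a built-in kernel. The kernel name "example_cnn_convolution" is hypothetical; real names are reported by the device via CL_DEVICE_BUILT_IN_KERNELS. Error checking is omitted.

```c
#include <CL/cl.h>
#include <stdio.h>

cl_kernel get_builtin(cl_context ctx, cl_device_id dev)
{
    char names[1024];
    clGetDeviceInfo(dev, CL_DEVICE_BUILT_IN_KERNELS, sizeof(names), names, NULL);
    printf("built-in kernels: %s\n", names);

    /* Built-in kernels are not compiled from source: the program object
     * simply wraps functionality baked into the device and its driver. */
    cl_program prog = clCreateProgramWithBuiltInKernels(
        ctx, 1, &dev, "example_cnn_convolution", NULL);
    return clCreateKernel(prog, "example_cnn_convolution", NULL);
}
```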
As well as enabling developers to choose OpenCL or SYCL, we will talk about how these standards enable additional high-level frameworks that can be used to target this hardware. These include libraries for deep neural networks and linear algebra operations.
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”/wp-content/uploads/iwocl-2019-michal-mrozek-intel-breaking-the-last-line-of-performance-border.pdf” button=”View Presentation Slides” linktarget=”_blank” modal=”” button_size=”” button_type=”” buttoncolor=”default” title=”Breaking the Last Line of Performance Border.” description=”PHN0cm9uZz5NaWNoYWwgTXJvemVrPC9zdHJvbmc+ICB8wqDCoEludGVs” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
In this talk we will present the various techniques we used to optimize our clDNN library and obtain top-notch performance.
We will provide details about the following techniques:
- offloading execution
- combining primitives into graphs to apply graph level optimizations
- leveraging proper layout of data
- primitive fusing
- using memory padding
- aggregating independent primitives to be executed concurrently
- dedicated kernel selection, with specialized kernels built to serve specific use cases efficiently
- utilizing proper memory pool
- auto tuning
For each technique we will explain how it works, how much performance it can gain, and how to apply it to OpenCL programs.
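As one concrete illustration, primitive fusing folds element-wise epilogue operations into the kernel that produces a tensor, so the intermediate result never round-trips through global memory. The following OpenCL C sketch (names are illustrative, not taken from clDNN) fuses a bias add and a ReLU into a 1x1 convolution:

```c
__kernel void conv_bias_relu_fused(__global const float *input,
                                   __global const float *weights,
                                   __global const float *bias,
                                   __global float *output,
                                   int in_channels, int spatial)
{
    int gid = get_global_id(0);        /* one work-item per output element  */
    int c   = gid / spatial;           /* output channel of this element    */
    int px  = gid % spatial;           /* spatial position within the plane */

    float acc = 0.0f;                  /* 1x1 convolution accumulator       */
    for (int k = 0; k < in_channels; ++k)
        acc += input[k * spatial + px] * weights[c * in_channels + k];

    acc += bias[c];                    /* fused bias add                    */
    output[gid] = fmax(acc, 0.0f);     /* fused ReLU: no extra kernel launch
                                          or pass over global memory        */
}
```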
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”/wp-content/uploads/iwocl-2019-sreepathi-pai-performance-evaluation-of-opencl-standard-support-and-beyond.pdf” button=”View Presentation Slides” linktarget=”_blank” modal=”” button_size=”” button_type=”” buttoncolor=”default” title=”Performance Evaluation of OpenCL Standard Support (and Beyond)” description=”VHlsZXIgU29yZW5zZW4gfCAgUHJpbmNldG9uIFVuaXZlcnNpdHk8YnIgLz48c3Ryb25nPlNyZWVwYXRoaSBQYWk8L3N0cm9uZz4gIHwgIFVuaXZlcnNpdHkgb2YgUm9jaGVzdGVyPGJyIC8+IEFsYXN0YWlyIEYuIERvbmFsZHNvbsKgIHzCoMKgSW1wZXJpYWwgQ29sbGVnZSBMb25kb24gYW5kIEdvb2dsZQ==” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
In this talk, we will discuss how support for a diverse set of OpenCL features affects performance in the domain of graph applications executing on GPU platforms. Given that adoption of OpenCL features varies widely across vendors, these results can help quantify the performance benefits of, and potentially motivate, the timely adoption of these OpenCL features.
Our findings are drawn from the experience of developing an OpenCL backend for a state-of-the-art graph application DSL, originally developed with a CUDA backend. This DSL allows competitive algorithms for applications such as breadth-first-search, page-rank, and single-source-shortest-path to be written at a high level. A series of optimisations can then be applied by the compiler and executable OpenCL code can be generated. These optional optimisations exercise various features of OpenCL: on one end of the spectrum, applications compiled without optimisations require only core OpenCL features provided in version 1.1 of the standard; on the other end, a certain optimisation requires inter-workgroup forward progress guarantees, which are yet to be officially supported by OpenCL, but have been empirically validated. Other optimisations require OpenCL features such as: fine-grained memory consistency guarantees (added in OpenCL 2.0) and subgroup primitives (added to core in OpenCL 2.1).
Our compiler can apply 6 independent optimisations. For each optimisation, we determine the minimum version of OpenCL required to support it. We find that the relevant OpenCL versions, and the number of optimisations they support, are: 1.1 (2 optimisations supported), 2.0 (adds 1 additional optimisation), and 2.1 (adds 2 more optimisations). We additionally create the notion of a version FP (forward-progress) that adds support for unofficial forward progress guarantees, which are required for the final optimisation. Clearly, as support increases, so does the number of supported optimisations. For each optimisation, we will discuss the OpenCL features required for support and the idioms in which the features are used. Use-case discussions of these features (i.e. memory consistency and subgroup primitives) are valuable as there appear to be very few open-source examples; a GitHub search, for example, shows only a small number.
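For readers unfamiliar with these idioms, the sketch below shows the two feature classes in a graph-style kernel: an OpenCL 2.0 explicit atomic load with a memory order and scope, and a cl_khr_subgroups reduction. It is an illustration only, not code generated by the DSL compiler; buffer names are assumptions.

```c
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

__kernel void count_frontier(__global atomic_int *node_level,
                             __global const int  *edge_src,
                             __global int        *partial,
                             int level)
{
    int gid = get_global_id(0);

    /* OpenCL 2.0 memory consistency: an acquire load with device scope makes
       writes released by another work-item visible to this one. */
    int src_level = atomic_load_explicit(&node_level[edge_src[gid]],
                                         memory_order_acquire,
                                         memory_scope_device);

    int contributes = (src_level == level) ? 1 : 0;

    /* Subgroup primitive: sum across the subgroup without local memory. */
    int subgroup_total = sub_group_reduce_add(contributes);

    if (get_sub_group_local_id() == 0) {
        int sg = get_group_id(0) * get_num_sub_groups() + get_sub_group_id();
        partial[sg] = subgroup_total;
    }
}
```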
The compiler infrastructure enables us to carry out a large and controlled study in which the performance benefit of various levels of OpenCL support can be evaluated. We gather runtime data exhaustively across all combinations of: all optimisations, 17 applications, 3 graph inputs and 6 different GPUs (spanning 4 vendors: Nvidia, AMD, Intel and ARM). Our results show that if feature support is limited to OpenCL 2.0 (and below), the available optimisations fail to achieve any speedup in over 70% of the cases. If support for OpenCL 2.1 is added, this number drops to 60%; however, in all of these cases the observed application speedup is modest, rarely exceeding 2x. Finally, if unsupported forward progress guarantees can be assumed, then speedups can be observed in over half of the cases, including impressive speedups of over 14x for AMD and Intel GPUs. We believe this provides compelling evidence for forward progress properties to be considered for adoption in a future OpenCL version.
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”/wp-content/uploads/iwocl-2019-harri-renney-opencl-vs-accelerated-finite-difference-digital-synthesis.pdf” button=”View Presentation Slides” linktarget=”_blank” modal=”” button_size=”” button_type=”” buttoncolor=”default” title=”OpenCL vs: Accelerated Finite-Difference Digital Synthesis” description=”PHN0cm9uZz5IYXJyaSBSZW5uZXk8L3N0cm9uZz4sIEJlbmVkaWN0IEdhc3RlciBhbmQgVG9tIE1pdGNoZWxswqAgfMKgwqBVbml2ZXJzaXR5IG9mIFdlc3Qgb2YgRW5nbGFuZA==” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
Digital audio synthesis has become an important component of modern music production with techniques that can produce realistic simulations of real instruments. Physical modelling sound synthesis is a category of audio synthesis that uses mathematical models to emulate the physical phenomena of acoustic musical instruments including drum membranes, air columns and strings. The synthesis of physical phenomena can be expressed as discrete variants of Newton’s laws of motion, using, for example, the Finite-Difference Time-Domain method or FDTD.
FDTD is notoriously computationally expensive, and the real-time demands of sound synthesis in a live setting have led implementers to consider offloading to GPUs. In this paper we present multiple OpenCL implementations of FDTD for real-time simulation of a drum membrane. Additionally, we compare against an AVX-optimized CPU implementation and an OpenGL version that utilizes a careful mapping to the GPU texture cache. Using a discrete, laptop-class AMD GPU, we find that for all but the smallest mesh sizes the OpenCL implementation outperforms the others, although to our surprise we found that optimizing for work-group local memory provided only a small performance benefit.
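For context, a single FDTD time step for the membrane reduces to a leapfrog update of the discrete 2-D wave equation. The following OpenCL C sketch shows the basic (non-local-memory) form; the grid layout, clamped boundary handling and the lambda2 coefficient (= (c·dt/dx)^2) are assumptions, not the paper’s exact kernel.

```c
__kernel void fdtd_step(__global const float *u_prev,   /* u at t-1 */
                        __global const float *u_curr,   /* u at t   */
                        __global float *u_next,         /* u at t+1 */
                        int width, int height, float lambda2)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int i = y * width + x;

    if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
        u_next[i] = 0.0f;                 /* clamped membrane edge */
        return;
    }

    /* 5-point Laplacian of the current displacement field. */
    float lap = u_curr[i - 1] + u_curr[i + 1] +
                u_curr[i - width] + u_curr[i + width] - 4.0f * u_curr[i];

    /* Leapfrog update: u(t+1) = 2u(t) - u(t-1) + lambda^2 * Laplacian. */
    u_next[i] = 2.0f * u_curr[i] - u_prev[i] + lambda2 * lap;
}
```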
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”/wp-content/uploads/iwocl-2019-chao-lin-lee-sparse-matrix-compression-primitives-with-opencl-framework-to-support-halide.pdf” button=”View Presentation Slides” linktarget=”_blank” modal=”” button_size=”” button_type=”” buttoncolor=”default” title=”Sparse-Matrix Compression Primitives with OpenCL Framework to Support Halide” description=”PHN0cm9uZz5DaGFvLUxpbiBMZWU8L3N0cm9uZz4gfCBOYXRpb25hbCBUc2luZyBIdWEgVW5pdmVyc2l0eQ==” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
Halide and OpenCL now play important roles in heterogeneous multi-core computing. OpenCL provides vendor-level support, while Halide provides domain-specific support, for example for vision processing and AI models (TVM’s Halide IR), along with flexible scheduling of applications on target machines; OpenCL plays a supporting role in Halide environments. In this work we investigate the research issues in supporting sparse computation with Halide and the corresponding OpenCL support, and present sparse-matrix compression primitives in Halide for sparse matrix-matrix (SpMM) multiplication with the OpenCL framework. Halide is a programming language designed for image and array processing, offering numerous algorithm and scheduling primitives, including SIMD and heterogeneous computation, to achieve state-of-the-art performance.

Given an m-by-k sparse matrix A and a k-by-n dense matrix B, SpMM computes the m-by-n dense matrix C = AB. We choose the two most common sparse matrix formats: the coordinate (COO) format and the compressed-sparse-row (CSR) format. COO stores the data as a list of (row, column, value) tuples: the first element is the row index, the second the column index, and the third the non-zero value stored at that position; only non-zero elements are stored. We parallelized the reading of the 2-D array indices with OpenCL support in Halide to speed up index traversal; whenever the traversal meets a non-zero element it is stored in COO format, so the SpMM multiplication can be sped up using the compressed COO matrix. The shortcoming of COO is that the row-index array contains many identical entries; CSR improves on this by replacing the array of row indices with a shorter array of row offsets. We also integrate recent related work on sparse matrix compression: hybrid CSR and DCSR. DCSR (Doubly Compressed Sparse Row) and CSR split the sparse matrix into clustered row segments and light row segments; heavily clustered row segments in DCSR format can be used as the basis for tiling, enabling higher data reuse than the COO format.

The experiments include Halide primitives for sparse-matrix compression and matrix computation, and were performed on an AMD Radeon R9 GPU with the OpenCL 2.0 framework, using the Trefethen20000, ACTIVSg10K, G67 and ACTIVSg2000 data sets from the SuiteSparse Matrix Collection. The results show that computing with the compressed matrix improves performance by more than 85% compared to the baseline without compression. Our work also gives detailed methods for using OpenCL to implement sparse-matrix compression for Halide.
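As a plain-C illustration of the formats discussed (not the Halide/OpenCL implementation itself), the sketch below shows the CSR layout, in which COO’s repeated row indices are replaced by row offsets, and a straightforward CSR-based SpMM loop; C is assumed to be zero-initialized.

```c
/* CSR layout: row_ptr replaces COO's per-element row indices with offsets. */
typedef struct {
    int    m, k, nnz;
    int   *row_ptr;   /* m + 1 offsets into col/val */
    int   *col;       /* nnz column indices         */
    float *val;       /* nnz non-zero values        */
} csr_t;

/* C = A * B, with A sparse (m x k, CSR) and B dense (k x n, row-major). */
void spmm_csr(const csr_t *A, const float *B, float *C, int n)
{
    for (int i = 0; i < A->m; ++i)                     /* each sparse row of A  */
        for (int p = A->row_ptr[i]; p < A->row_ptr[i + 1]; ++p) {
            int   j = A->col[p];                       /* column in A, row in B */
            float v = A->val[p];
            for (int c = 0; c < n; ++c)                /* dense columns of B    */
                C[i * n + c] += v * B[j * n + c];
        }
}
```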
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”#” button=”Slides Not Being Published” linktarget=”_self” modal=”” button_size=”” button_type=”” buttoncolor=”lightgray” title=”Exploring Integer Sum Reduction using Atomics on Intel CPU” description=”PHN0cm9uZz5aaGVtaW5nIEppbjwvc3Ryb25nPiBhbmQgSGFsIEZpbmtlbCAgfMKgwqBBcmdvbm5lIE5hdGlvbmFsIExhYg==” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
Atomic functions are useful for updating a shared variable from multiple threads, for barrier synchronization, for constructing complex data structures, and for building high-level frameworks. In this paper, we focus on the evaluation and analysis of integer sum reduction, a common data-parallel primitive. We convert the sequential reduction into parallel OpenCL implementations on the CPU. We also develop three micro-kernels, which allow us to understand the relationships between kernel performance and the operations involved in reduction. The results of the micro-kernels show that increasing the work-group size can improve kernel performance linearly. There is a sweet spot in the relationship between the work-group size and barrier synchronization overhead. The performance of atomics over local memory is not sensitive to the work-group size. The sum reduction kernel with vectorized memory accesses improves on the baseline kernel for a wide range of work-group sizes; however, the vectorization efficiency shrinks as the work-group size grows.
We also find that the vendor’s default OpenCL kernel optimization does not improve the kernel performance. On average, disabling the optimization can reduce the execution time of the kernel with vectorized memory accesses by 15%. We attribute the performance drop to the fact that the default kernel optimizations instantiate a large number of atomics over global memory when implicitly vectorizing the kernel computation.
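For reference, the pattern under study can be sketched in OpenCL C as follows: each work-item performs a vectorized int4 load, accumulates into a local-memory atomic, and a single work-item per work-group issues one atomic on global memory. Buffer names and the exact variant are illustrative, not the paper’s kernels.

```c
__kernel void sum_reduce_int4(__global const int4 *in,
                              __global int *total,
                              __local int *group_sum)
{
    int lid = get_local_id(0);
    if (lid == 0)
        *group_sum = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    int4 v = in[get_global_id(0)];           /* vectorized memory access    */
    atomic_add(group_sum, v.x + v.y + v.z + v.w);
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0)
        atomic_add(total, *group_sum);       /* one global atomic per group */
}
```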
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”/wp-content/uploads/iwocl-2019-yifan-sun-mgsim-a-flexible-high-performance-simulator-for-multi-gpu-systems.pdf” button=”View Presentation Slides” linktarget=”_blank” modal=”” button_size=”” button_type=”” buttoncolor=”default” title=”MGSim: a Flexible High-Performance Simulator for Multi-GPU Systems” description=”PHN0cm9uZz5ZaWZhbiBTdW48L3N0cm9uZz4sIFRyaW5heWFuIEJhcnVhaCwgU2hpIERvbmcsIGFuZCBEYXZpZCBLYWVsaSB8IE5vcnRoZWFzdGVybiBVbml2ZXJzaXR5″ margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
GPUs can provide both high performance and energy efficiency in processing data-parallel workloads. Today, GPUs are accelerating a wide range of applications, spanning large-scale physics simulations to deep neural network training. However, faced with the ever-increasing amounts of data in many of these applications, a single GPU can no longer satisfy their compute and memory demands. In response, industry has started to offer multi-GPU systems, designing high-performance platforms with an impressive amount of raw computing power. The introduction of new multi-GPU systems comes with a number of new design challenges, including innovations in scalable distributed shared-memory design, new cache-coherency policies and high-throughput GPU interconnects. Presently, there is no open-source simulation framework that supports detailed simulation and architectural exploration of multi-GPU tradeoffs.
Modifying existing simulators to support multi-GPU system modeling requires a complete redesign of the framework and results in poorly architected simulation infrastructure. Instead, we believe it is time for a new class of simulator, one that addresses many of the issues present in current architectural simulators. The time is right for a flexible simulator infrastructure that satisfies these design requirements and provides a rich framework to support multi-GPU simulation.
To enable multi-GPU architectural modeling, we introduce MGSim, a new open-source, cycle-level multi-GPU simulator. MGSim runs AMD GCN3 binaries compiled from OpenCL kernels using the official ROCm drivers. MGSim natively supports parallel simulation without compromising simulation accuracy, and features a flexible modular design that allows users to create a wide variety of system configurations. We developed MGSim in the Go programming language, primarily because of Go’s simplicity, tool support, and language-level multi-threading support.
MGSim represents the next generation in GPU simulation. In terms of accuracy, MGSim simulations differ by 5.5% on average compared to GPU hardware execution. Exploiting the multi-threaded capabilities of our simulator on a 4-core CPU, we achieve a 3.5X speedup for functional emulation and a 2.5X speedup for detailed timing simulation, compared to single-threaded simulation.
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”” button=”15:00 – 15:30″ linktarget=”_self” modal=”” button_size=”” button_type=”” buttoncolor=”lightgray” title=”Closing Keynote
Powering the Exascale Era – The Frontier Supercomputer and the Open Source Software that will Power the World’s Fastest Supercomputer at Oak Ridge National Laboratory” description=”RGF2aWQgQ293bmllICB8ICBBTUQ=” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
On May 7th 2019, AMD joined the U.S. Department of Energy (DOE), Oak Ridge National Laboratory (ORNL) and Cray Inc. in announcing what is expected to be the world’s fastest exascale-class supercomputer, scheduled to be delivered to ORNL in 2021. To deliver more than 1.5 exaflops of expected processing performance, the Frontier system is designed to use future-generation, HPC- and AI-optimized custom AMD EPYC CPUs and AMD Radeon Instinct GPUs. Researchers at ORNL will use the Frontier system’s unprecedented computing power and next-generation AI techniques to simulate, model and advance understanding of the interactions underlying the science of weather, sub-atomic structures, genomics, physics, and other important scientific fields.
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”” button=”15:30 – 15:35″ linktarget=”_self” modal=”” button_size=”” button_type=”” buttoncolor=”lightgray” title=”Presentation of Awards” description=”” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=””]
BEST PAPER Awarded to:
Performance Evaluation of OpenCL Standard Support (and Beyond)
Tyler Sorensen, Princeton University, Sreepathi Pai, University of Rochester, and Alastair F. Donaldson, Imperial College London
BEST PRESENTATION Awarded to:
Evaluating Portability and Performance of OpenCL FPGA Kernels on Intel HARPv2
Anthony M. Cabrera and Roger Chamberlain | Washington University in St. Louis
[/fusion_tagline_box][fusion_tagline_box backgroundcolor=”” shadow=”no” shadowopacity=”0.70″ border=”1″ bordercolor=”” highlightposition=”left” content_alignment=”left” link=”/wp-content/uploads/iwocl-2019-simon-mcintosh-smith-opencl-closing-remarks.pdf” button=”View Presentation Slides” linktarget=”_blank” modal=”” button_size=”” button_type=”” buttoncolor=”default” title=”Closing Remarks including IWOCL 2020 Announcement” description=”” margin_top=”” margin_bottom=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” id=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” button_border_radius=”” /]