Dynamic energy-efficiency optimization of GPGPU accelerated applications

Abstract

High energy consumption is a major obstacle in building exascale supercomputers, whose performance largely depends on accelerators: specialized hardware far more efficient than general-purpose processors (CPUs). The surge of artificial intelligence has significantly increased the energy demands of these supercomputers, as systems have scaled up their accelerator counts to meet the demand. With the electricity bill now rivaling the purchase price, cost has joined the carbon footprint, the stress on the power infrastructure, and the load on the cooling systems as the main motivation to improve energy efficiency. While runtime systems with dynamic voltage and frequency scaling have greatly improved energy efficiency in CPU-based supercomputers, applying similar techniques to accelerators is challenging due to architectural differences and unknown reactivity to frequency changes. This thesis presents a methodology for measuring accelerator reactivity to frequency changes and the length of the frequency transitions under an artificial workload, using the accelerators' built-in timers. The methodology was implemented in the LATEST tool for the streaming multiprocessor (SM) frequency of CUDA hardware and validated on three devices: the Quadro RTX 6000, A100 SXM4, and GH200. The gained insights proved particularly useful in the next part of the work: the development of a runtime system dedicated to CUDA hardware. The runtime system relies on SM frequency tuning over fixed time intervals. The main reason for this choice is that typical CUDA kernels are short compared to the switching latency, which makes per-kernel SM frequency settings impractical: the switching overhead would outweigh any potential savings. By periodically sampling performance counters through the CUPTI PM Sampling API, the arithmetic intensity is determined in real time. Based on the roofline model, the optimal frequency is identified. The frequency is adjusted by a dedicated daemon that performs the changes directly through the NVML API. The daemon removes the overhead present in nvidia-smi and also exposes the otherwise root-only SM frequency scaling to non-privileged users. The runtime system was evaluated both on an artificial benchmark with predefined arithmetic-intensity behavior in several configurations and on ESPRESO FEM, a highly optimized production application for mechanical simulations using the FETI method. All experiments were performed on the A100 SXM4 accelerator. The results confirm that dynamic tuning based on hardware utilization can achieve significant GPU energy savings. Savings on the artificial benchmark reached up to 23%, while the ESPRESO FEM benchmarks were more modest at 7.4%; in all cases, the energy savings came with a small performance penalty, under 8.7%.
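
As a rough, hypothetical illustration of the measurement idea (not the LATEST tool itself), the following CUDA sketch estimates the momentary SM clock by comparing the per-SM cycle counter (clock64) against the GPU's nanosecond global timer over a short busy-wait window; sampling this repeatedly around a requested frequency change would expose the transition latency. The window length and launch configuration are illustrative assumptions.

// Hypothetical sketch: estimate the current SM clock from built-in GPU timers.
// Not the LATEST tool; the window length and launch configuration are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned long long globaltimer_ns()
{
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));  // nanosecond wall clock
    return t;
}

__global__ void measure_sm_clock(double *mhz)
{
    unsigned long long c0 = clock64();           // SM cycle counter
    unsigned long long t0 = globaltimer_ns();
    while (globaltimer_ns() - t0 < 1000000ULL)   // busy-wait roughly 1 ms
        ;
    unsigned long long c1 = clock64();
    unsigned long long t1 = globaltimer_ns();
    *mhz = 1e3 * (double)(c1 - c0) / (double)(t1 - t0);  // cycles/ns -> MHz
}

int main()
{
    double *d_mhz = nullptr, h_mhz = 0.0;
    cudaMalloc(&d_mhz, sizeof(double));
    measure_sm_clock<<<1, 1>>>(d_mhz);
    cudaMemcpy(&h_mhz, d_mhz, sizeof(double), cudaMemcpyDeviceToHost);
    printf("estimated SM clock: %.0f MHz\n", h_mhz);
    cudaFree(d_mhz);
    return 0;
}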
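
The frequency decision itself can be reduced to a roofline-style rule: if the sampled interval is memory-bound (arithmetic intensity below the ridge point), a lower SM clock costs little performance; if it is compute-bound, the maximum clock is kept. A minimal sketch of such a rule follows; the ridge point and the two candidate clocks are hypothetical values rather than the thesis parameters, and the counter values would in practice come from the CUPTI PM Sampling API.

// Hypothetical sketch of a roofline-based SM clock decision for one sampling interval.
// The ridge point and candidate clocks are illustrative values, not the thesis parameters.
#include <cstdio>

constexpr double   kRidgeFlopPerByte = 10.0;   // memory-/compute-bound boundary
constexpr unsigned kSmClockLowMHz    = 945;    // clock for memory-bound intervals
constexpr unsigned kSmClockHighMHz   = 1410;   // clock for compute-bound intervals

// flops and dram_bytes would be derived from periodically sampled performance counters.
unsigned select_sm_clock(double flops, double dram_bytes)
{
    double intensity = (dram_bytes > 0.0) ? flops / dram_bytes : 0.0;
    return (intensity < kRidgeFlopPerByte) ? kSmClockLowMHz : kSmClockHighMHz;
}

int main()
{
    // Example interval: 2 GFLOP over 1 GB of DRAM traffic -> memory-bound, lower clock.
    printf("next SM clock: %u MHz\n", select_sm_clock(2.0e9, 1.0e9));
    return 0;
}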
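
The clock change itself goes through NVML rather than nvidia-smi. A minimal sketch of locking the SM clock via NVML is given below, assuming sufficient privileges and the nvmlDeviceSetGpuLockedClocks entry point; the thesis daemon wraps such a call behind an interface reachable by non-privileged users, which is not shown here.

// Minimal NVML sketch: lock the SM clock of GPU 0 to a requested value.
// Assumes sufficient (normally root) privileges; link with -lnvidia-ml.
// The target clock is a hypothetical example value.
#include <cstdio>
#include <nvml.h>

int main()
{
    const unsigned target_mhz = 1410;  // hypothetical target SM clock
    nvmlDevice_t dev;

    if (nvmlInit_v2() != NVML_SUCCESS) {
        fprintf(stderr, "NVML initialization failed\n");
        return 1;
    }
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) == NVML_SUCCESS) {
        // Pin both the minimum and maximum GPU (SM) clock to the target value.
        nvmlReturn_t rc = nvmlDeviceSetGpuLockedClocks(dev, target_mhz, target_mhz);
        if (rc != NVML_SUCCESS)
            fprintf(stderr, "clock change failed: %s\n", nvmlErrorString(rc));
    }
    nvmlShutdown();
    return 0;
}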

Subject(s)

Supercomputer, HPC, GPU, energy efficiency, CUDA, GPGPU