An in-depth study of GPU frequency-scaling latency and its optimization on modern architectures

Velička, Daniel

doi:10.1016/j.future.2025.108331

dc.contributor.author	Velička, Daniel
dc.contributor.author	Vysocký, Ondřej
dc.contributor.author	Yasal, Osman
dc.contributor.author	Říha, Lubomír
dc.date.accessioned	2026-04-23T12:00:04Z
dc.date.available	2026-04-23T12:00:04Z
dc.date.issued	2026
dc.description.abstract	The move towards exascale systems in High-Performance Computing and the demand for Artificial Intelligence have brought together thousands of CPUs and even more GPU accelerators. This massive hardware consolidation has made energy optimization a critical challenge. The immense energy consumption creates a cascade of secondary issues: it increases the carbon footprint, generates significant heat that demands advanced cooling, and causes dramatic power fluctuations that threaten the stability of the electrical grid. Although energy-saving techniques based on Dynamic Voltage and Frequency Scaling (DVFS) are well understood for CPUs, a critical knowledge gap exists for GPU accelerators, limiting the ability to apply similar optimizations. This paper presents a method for measuring how long it takes the CPU to adjust the operating frequency of the GPU (switching latency) and how long the frequency change itself takes to complete (transition latency). The approach employs a minimal iterative workload whose runtime differences between frequency pairs can be distinguished statistically. It first measures execution times for each frequency and then determines the switching and transition latency of a change from an initial to a target frequency by tracking runtime changes and repeating measurements to ensure statistical robustness. Finally, the methodology filters out outliers caused by external factors such as driver management or system interruptions. The methodology is implemented in the open-source LATEST [1] tool with support for NVIDIA GPU accelerators. It is evaluated on three GPUs from different architecture generations: GH200, A100-SXM4, and RTX Quadro 6000. The results show that the transition latency ranges from hundreds of microseconds up to hundreds of milliseconds, and that the vast majority of the time is spent in the GPU applying the frequency change.
Among the analysed GPUs, the GH200 exhibited the widest range, with switching latencies spanning from 5.6 ms to 477 ms and transition latencies from 0.2 ms to 471 ms. Additionally, the transition latency measurement can be used to identify manufacturing variability of accelerators, revealing differences in frequency-scaling reactivity. Our analysis identifies specific frequency pairs with high switching latencies. This poses a dilemma: the slow transitions discourage the use of these pairs, yet the target frequencies themselves may be highly energy efficient. To address this, we introduce an indirect switching method that leverages an intermediate frequency. This technique circumvents the overhead, allowing the system to reach these efficient frequency states without the high latency penalty of a direct transition. Indirect frequency switching reduced latency by between 250 and 431 ms for a single frequency change on the GH200.
dc.description.firstpage	art. no. 108331
dc.description.source	Web of Science
dc.description.volume	179
dc.identifier.citation	Future Generation Computer Systems. 2026, vol. 179, art. no. 108331.
dc.identifier.doi	10.1016/j.future.2025.108331
dc.identifier.issn	0167-739X
dc.identifier.issn	1872-7115
dc.identifier.uri	http://hdl.handle.net/10084/158460
dc.identifier.wos	001657954500001
dc.language.iso	en
dc.publisher	Elsevier
dc.relation.ispartofseries	Future Generation Computer Systems
dc.relation.uri	https://doi.org/10.1016/j.future.2025.108331
dc.rights	© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
dc.subject	GPU
dc.subject	energy efficient computing
dc.subject	DVFS
dc.subject	transition latency
dc.title	An in-depth study of GPU frequency-scaling latency and its optimization on modern architectures
dc.type	article
dc.type.status	Peer-reviewed
dc.type.version	publishedVersion
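
The indirect switching idea summarized in the abstract can be illustrated with a small sketch. This is not the LATEST tool's implementation: the function name `plan_switch`, the latency table layout, and the rule of picking the intermediate frequency that minimizes the summed latency are all assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
from typing import Dict, List, Tuple

# (from MHz, to MHz) -> measured switching latency in ms (hypothetical layout)
Latency = Dict[Tuple[int, int], float]


def plan_switch(lat: Latency, start: int, target: int) -> Tuple[List[int], float]:
    """Return the cheaper of a direct switch or a switch via one intermediate
    frequency, together with its total latency in ms."""
    best_path = [start, target]
    best_cost = lat[(start, target)]
    # Candidate intermediates: every frequency reachable from `start`
    # other than the target itself.
    mids = {b for (a, b) in lat if a == start and b != target}
    for mid in sorted(mids):
        if (mid, target) not in lat:
            continue  # no measurement for the second hop
        cost = lat[(start, mid)] + lat[(mid, target)]
        if cost < best_cost:
            best_path, best_cost = [start, mid, target], cost
    return best_path, best_cost


# Invented latencies in ms, NOT measurements from the paper.
example: Latency = {
    (1980, 1200): 450.0,  # slow direct transition
    (1980, 1500): 6.0,
    (1500, 1200): 8.0,
    (1200, 1980): 5.0,
}

path, cost = plan_switch(example, 1980, 1200)
print(path, cost)
```

With this invented table, the two-hop route through 1500 MHz is chosen because its summed latency is lower than that of the direct switch. A real deployment would fill the table from per-device measurements, since the abstract notes that latencies vary between individual accelerators.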

Collections

Publikační činnost VŠB-TUO ve Web of Science / Publications of VŠB-TUO in Web of Science
Publikační činnost IT4Innovations / Publications of IT4Innovations (9600)
Články z časopisů s impakt faktorem / Articles from Impact Factor Journals