Experiences Detecting Defective Hardware in Exascale Supercomputers

Exascale computing
DOI: 10.1145/3624062.3624134 Publication Date: 2023-11-10T18:53:39Z
ABSTRACT
In May 2022, the newest supercomputer to top the TOP500 list was Frontier at Oak Ridge National Laboratory, demonstrating the capability of computing more than 1.1 quintillion (10^18) floating-point calculations every second. Driving this ground-breaking rate are Frontier's 37,000 graphics processing units (GPUs) and 9,408 central processing units (CPUs). In total, Frontier contains 60 million parts. At this scale, the smallest margin of error may generate hundreds of hardware errors across the system. These errors are capable of directly hindering the world-class science performed on Frontier if not found. In this work, we describe and evaluate two strategies for finding hardware-level faults in Frontier's compute nodes. Two strategies were developed: the first uses the Slurm scheduler to scavenge available time to run a node screen, and the second builds upon the lessons learned in the first strategy and enforces a weekly screen of each node. Using June 2023 as a case study, we find that the weekly scheduling strategy consumed ten times the resources of the first strategy, but successfully detected five hardware defects in Frontier. We summarize the lessons learned while developing and running a node screen on the world's first exascale supercomputer.
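The weekly-screen idea in the abstract can be illustrated with a small sketch. It is not the authors' tooling: the function name, the node names, and the seven-day constant are assumptions made for illustration, and the sketch only selects which nodes are overdue; it does not talk to Slurm or run any screen.

```python
from datetime import datetime, timedelta

# Assumed policy for illustration: each node is screened at least once a week.
SCREEN_INTERVAL = timedelta(days=7)


def nodes_due_for_screen(last_screened, now, interval=SCREEN_INTERVAL):
    """Return the nodes whose last screen is at least `interval` old.

    `last_screened` maps a node name to the datetime of its last completed
    screen, or to None if the node has never been screened.
    The result is ordered with the most overdue nodes first.
    """
    def age(node):
        ts = last_screened[node]
        # A never-screened node is treated as maximally overdue.
        return timedelta.max if ts is None else now - ts

    due = [node for node in last_screened if age(node) >= interval]
    return sorted(due, key=age, reverse=True)


if __name__ == "__main__":
    now = datetime(2023, 6, 30)
    history = {
        "node0001": datetime(2023, 6, 29),  # screened yesterday
        "node0002": datetime(2023, 6, 20),  # ten days ago
        "node0003": None,                   # never screened
    }
    print(nodes_due_for_screen(history, now))
```

In a real deployment the selected nodes would presumably be handed to the scheduler for a screening job; how that is done on Frontier is not described in the abstract.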