Optimizing Application Performance with BlueField: Accelerating Large-Message Blocking and Nonblocking Collective Operations

DOI: 10.23919/isc.2024.10528935
Publication Date: 2024-05-10T17:22:23Z

AUTHORS (11)

Richard Graham

George Bosilca

Yong Qin

Bradley Settlemyer

Gilad Shainer

Craig Stunkel

Geoffroy Vallee

Brody Williams

Gerardo Cisneros-...

Sebastian Ohlmann

Markus Rampp

ABSTRACT

With the end of Dennard scaling, specializing and distributing compute engines throughout the system is a promising technique to improve application performance. For example, NVIDIA's BlueField Data Processing Unit (DPU) integrates programmable processing elements within the network and offers specialized capabilities. These capabilities enable communication via offloads onto DPUs and present new application opportunities for offloading nonblocking or complex communication patterns such as collective operations. This paper discusses lessons learned in enabling DPU-based acceleration of collective communication algorithms by describing the impact of offloaded collective operations on two applications: Octopus and P3DFFT++. We present nonblocking MPI_Ialltoallv and blocking MPI_Allgatherv collective operations that leverage DPU offloading, which are used by the above applications, and evaluate them. Our experiments show a performance improvement in the range of 14% to 49% for P3DFFT++ and of 17% for Octopus, even though those collectives, in well-balanced OSU latency benchmarks, show performance comparable to well-optimized host-based implementations of these collectives. This demonstrates that taking load imbalance into account can help improve application performance where such imbalance is common and large in magnitude.
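The abstract's closing claim, that an offloaded collective can match a host-based one in a balanced benchmark and still help an imbalanced application, can be illustrated with a toy timing model. The sketch below is not taken from the paper. The function names, the cost model, and all numbers are hypothetical. It assumes that the data exchange cannot finish before the last rank has arrived, that the transfer time is the same in both variants, and that in the offloaded nonblocking case each rank can run independent work while the exchange progresses without host involvement.

```python
# Toy timing model (hypothetical, not the paper's method or measurements).
# arrival[i]      : time at which rank i reaches the collective
# overlap_work[i] : independent work rank i could run while the collective progresses
# transfer        : time to move the data once every rank has contributed


def blocking_finish(arrival, overlap_work, transfer):
    """Finish time per rank when the host blocks inside the collective."""
    # Assumption: the exchange completes only after the last rank arrives,
    # and the independent work starts after the collective returns.
    done = max(arrival) + transfer
    return [done + w for w in overlap_work]


def offloaded_nonblocking_finish(arrival, overlap_work, transfer):
    """Finish time per rank when the collective is posted and then overlapped."""
    # Assumption: each rank posts on arrival, runs its independent work,
    # then waits; the exchange itself needs no host cycles.
    done = max(arrival) + transfer
    return [max(done, a + w) for a, w in zip(arrival, overlap_work)]


def makespan(finish_times):
    """Time at which the slowest rank is finished."""
    return max(finish_times)


# Balanced, latency-benchmark-like case: no imbalance and no work to overlap.
balanced = dict(arrival=[1.0] * 4, overlap_work=[0.0] * 4, transfer=2.0)

# Imbalanced case: one rank arrives late and every rank has work to overlap.
imbalanced = dict(arrival=[1.0, 1.0, 1.0, 4.0], overlap_work=[3.0] * 4, transfer=2.0)

if __name__ == "__main__":
    for name, case in (("balanced", balanced), ("imbalanced", imbalanced)):
        print(name,
              makespan(blocking_finish(**case)),
              makespan(offloaded_nonblocking_finish(**case)))
```

Under these assumptions the two variants finish at the same time in the balanced case, while in the imbalanced case the early ranks hide their waiting time behind useful work. The model deliberately ignores network contention, message size effects, and DPU processing cost.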

REFERENCES (50)

CITATIONS (1)
