Optimizing Application Performance with BlueField: Accelerating Large-Message Blocking and Nonblocking Collective Operations

DOI: 10.23919/isc.2024.10528935 Publication Date: 2024-05-10T17:22:23Z
ABSTRACT
With the end of Dennard scaling, specializing and distributing compute engines throughout the system is a promising technique to improve application performance. For example, NVIDIA's BlueField Data Processing Unit (DPU) integrates programmable processing elements within the network and offers specialized capabilities. These capabilities enable communication via offloads onto DPUs and present new application opportunities for offloading nonblocking or complex communication patterns such as collective operations. This paper discusses lessons learned in enabling DPU-based acceleration of collective algorithms by describing the impact of offloaded collective operations on two applications: Octopus and P3DFFT++. We present nonblocking MPI_Ialltoallv and blocking MPI_Allgatherv algorithms that leverage DPU offloading, which are used in the above applications, and evaluate them. Our experiments show performance improvements in the range of 14% to 49% for P3DFFT++ and 17% for Octopus, even though in well-balanced OSU latency benchmarks those collectives show performance comparable to well-optimized host-based implementations of these collectives. This demonstrates that taking load imbalance into account can help improve application performance where such imbalance is common and large in magnitude.
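The contrast between benchmark parity and application-level gains can be illustrated with a toy cost model. The sketch below is not taken from the paper. It assumes, purely for illustration, that a host-progressed nonblocking collective moves no data until every rank has finished its computation and entered the wait call, while an offloaded collective starts moving data at post time and runs concurrently with host computation. The function names and the timing values are hypothetical.

```python
def host_progressed_finish(compute_times, transfer_time):
    """Completion time when data only moves inside the wait call.

    Assumption: without asynchronous progress, the transfer starts only
    after the slowest rank has finished computing.
    """
    return max(compute_times) + transfer_time


def offloaded_finish(compute_times, transfer_time):
    """Completion time when an offload engine drives the transfer.

    Assumption: the transfer starts at post time (t = 0) and fully
    overlaps with host computation.
    """
    return max(max(compute_times), transfer_time)


def improvement(compute_times, transfer_time):
    """Fractional reduction in completion time from offloading."""
    base = host_progressed_finish(compute_times, transfer_time)
    off = offloaded_finish(compute_times, transfer_time)
    return (base - off) / base


# Latency-style benchmark: no computation between post and wait.
benchmark_ranks = [0.0, 0.0, 0.0, 0.0]
# Application-style run: uneven computation per rank (hypothetical values).
application_ranks = [2.0, 2.5, 3.0, 6.0]
transfer = 4.0

print(improvement(benchmark_ranks, transfer))    # 0.0
print(improvement(application_ranks, transfer))  # 0.4
```

Under these assumptions the benchmark case shows no difference between the two approaches, while the uneven application case hides the whole transfer behind the slowest rank's computation. Real DPU-offloaded collectives involve costs this model ignores, such as staging, DPU processing speed, and network contention.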