Vivek Kale

ORCID: 0000-0003-4687-1226
Research Areas
  • Distributed and Parallel Computing Systems
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Cloud Computing and Resource Management
  • Algorithms and Data Compression
  • Scientific Computing and Data Management
  • Embedded Systems Design Techniques
  • Energy Harvesting in Wireless Networks
  • Interconnection Networks and Systems
  • Distributed systems and fault tolerance
  • Biomedical and Engineering Education
  • Matrix Theory and Algorithms
  • Educational Games and Gamification
  • Medical Imaging Techniques and Applications
  • Simulation Techniques and Applications
  • Advanced Control Systems Optimization
  • Fuzzy Logic and Control Systems
  • IoT and Edge/Fog Computing
  • Lattice Boltzmann Simulation Studies
  • Mobile Learning in Education
  • Fluid Dynamics and Turbulent Flows
  • Software Testing and Debugging Techniques
  • Robotics and Automated Systems
  • Mobile Agent-Based Network Management
  • Molecular Communication and Nanonetworks

Sandia National Laboratories California
2023-2024

Brookhaven National Laboratory
2019-2022

Sandia National Laboratories
2022

University of Illinois Urbana-Champaign
2010-2015

Lawrence Livermore National Laboratory
2012

The lattice Boltzmann method is increasingly important in facilitating large-scale fluid dynamics simulations. To date, these simulations have been built on discretized velocity models of up to 27 neighbors. Recent work has shown that higher-order approximations of the continuum Boltzmann equation enable not only the recovery of Navier-Stokes hydrodynamics, but also simulation across a wider range of Knudsen numbers, which is especially important for micro- and nanoscale flows. These higher-order models have a significant impact on both communication...
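The abstract above concerns velocity models beyond the standard 27-neighbor sets. As a minimal illustration of what a discrete velocity model looks like, the sketch below computes the standard D2Q9 second-order equilibrium distribution; this is my own illustrative code, not code from the paper, which studies higher-order sets.

```python
# Standard D2Q9 lattice Boltzmann equilibrium distribution (illustrative).
# The paper itself discusses higher-order velocity sets beyond D3Q27.

# D2Q9 lattice velocities and weights (standard values).
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4/9] + [1/9] * 4 + [1/36] * 4
CS2 = 1/3  # lattice speed of sound squared

def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for density rho, velocity (ux, uy)."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1 + cu / CS2
                              + cu * cu / (2 * CS2 * CS2)
                              - usq / (2 * CS2)))
    return feq

# The zeroth and first moments recover density and momentum:
f = equilibrium(1.0, 0.1, 0.05)
density = sum(f)
momentum_x = sum(fi * cx for fi, (cx, _) in zip(f, C))
```

Summing the populations recovers the density, and the velocity-weighted sum recovers the momentum, which is the basic consistency property any discrete velocity set must satisfy.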

10.1109/ipdps.2013.109 article EN 2013-05-01

We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the use of this strategy in communication-avoiding dense factorization leads to significant performance gains. On a 48-core AMD Opteron NUMA machine, our experiments show that we can achieve up to a 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% over a version that uses fully static scheduling. On a 16-core Intel...
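The core idea of hybrid static/dynamic scheduling can be sketched in a few lines: a fraction of the iterations is assigned statically per thread (good locality, zero dequeue cost), and the remainder is dequeued dynamically from a shared counter (load balance). This is a minimal illustration of the concept, not the authors' CALU implementation; `static_fraction` is a hypothetical tuning parameter.

```python
# Hybrid static/dynamic loop scheduling (illustrative sketch).
import threading

def hybrid_schedule(n_iters, n_threads, static_fraction=0.7, work=lambda i: None):
    n_static = int(n_iters * static_fraction)
    queue_lock = threading.Lock()
    next_dynamic = [n_static]          # shared counter for the dynamic tail
    done = [[] for _ in range(n_threads)]

    def worker(tid):
        # Static part: a contiguous block per thread (locality, no dequeue cost).
        lo = tid * n_static // n_threads
        hi = (tid + 1) * n_static // n_threads
        for i in range(lo, hi):
            work(i); done[tid].append(i)
        # Dynamic part: grab remaining iterations one at a time (load balance).
        while True:
            with queue_lock:
                i = next_dynamic[0]
                if i >= n_iters:
                    return
                next_dynamic[0] += 1
            work(i); done[tid].append(i)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return done

assignments = hybrid_schedule(100, 4)
```

Raising `static_fraction` trades load balance for locality and lower synchronization cost; the paper's results come from tuning exactly this trade-off.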

10.1109/ipdps.2012.53 article EN 2012-05-01

Application performance can be degraded significantly due to node-local load imbalances during application execution. Prior work suggested using a mixed static/dynamic scheduling approach for handling this problem, specifically in the context of fine-grained, transient imbalances. Here, we consider a more general strategy for an alternate case, where fine-grained imbalance may be coupled with coarse-grained imbalance. Specifically, we implement a scheme in which we modify the data layout along with the scheduling, and add an additional tuned...
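One way to handle the coarse-grained component is to size each thread's static block in proportion to its measured speed, keeping a dynamic tail for the fine-grained, transient part. The sketch below illustrates only the weighted static partitioning; `speeds` stands in for timing measurements from earlier iterations and is a hypothetical input, not the paper's scheme.

```python
# Weighted static partitioning for coarse-grained imbalance (illustrative).
# Each thread's contiguous block is proportional to its measured speed.

def weighted_static_blocks(n_static, speeds):
    """Partition n_static iterations into contiguous blocks proportional to speeds."""
    total = sum(speeds)
    blocks, start, acc = [], 0, 0.0
    for s in speeds:
        acc += s
        end = round(n_static * acc / total)
        blocks.append((start, end))
        start = end
    return blocks

# Example: thread 0 is twice as fast as the others, so it gets a larger block.
blocks = weighted_static_blocks(90, speeds=[2.0, 1.0, 1.0])
```

The blocks stay contiguous, so the data-layout change the paper describes (keeping each thread's data local to its block) remains possible even as block sizes differ.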

10.1145/2642769.2642788 article EN 2014-08-29

A large number of parallel applications contain a computationally intensive phase in which list elements must be ordered based on some common attribute of the elements. How do we sort a sequence across multiple processing units so as to minimize the redistribution of keys while allowing independent sorting work?
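A classic answer to this question is splitter-based (sample) sorting: sampled splitters partition the keys into per-unit buckets in a single redistribution pass, after which each unit sorts its bucket independently. The sketch below is my own serial illustration of the idea, not the paper's algorithm; `oversample` is a hypothetical tuning parameter.

```python
# Splitter-based (sample) sort: one redistribution, then independent sorting.
import bisect
import random

def sample_sort(keys, n_units, oversample=8):
    # Choose n_units - 1 splitters from a random sample of the keys.
    sample = sorted(random.sample(keys, min(len(keys), n_units * oversample)))
    step = len(sample) // n_units
    splitters = [sample[(i + 1) * step] for i in range(n_units - 1)]
    # One redistribution pass: route each key to its splitter range.
    buckets = [[] for _ in range(n_units)]
    for k in keys:
        buckets[bisect.bisect_right(splitters, k)].append(k)
    # Independent sorting work per unit, then concatenate.
    return [k for b in buckets for k in sorted(b)]

data = [random.randrange(10**6) for _ in range(5000)]
out = sample_sort(data, n_units=8)
```

Because every key moves at most once (into its bucket), redistribution cost is bounded, and oversampling keeps the bucket sizes close to uniform.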

10.1145/1953611.1953621 article EN 2010-03-30

Computational bioinformatics and biomedical applications frequently contain heterogeneously sized units of work, or tasks, for instance due to variability in the sizes of biological sequences and molecules. Variable-sized workloads lead to load imbalances in parallel implementations, which detract from efficiency and performance. Many modern computing resources now have multiple graphics processing units (GPUs) per computer for acceleration. These GPUs need to be used efficiently through load balancing across the GPUs. OpenMP...
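A simple way to balance variable-sized tasks across devices is the greedy longest-processing-time (LPT) heuristic: assign each task, largest first, to the currently least-loaded GPU. This is an illustrative sketch of that heuristic, not the paper's OpenMP implementation; the `sizes` values are made-up stand-ins for sequence lengths.

```python
# Greedy LPT assignment of variable-sized tasks to GPUs (illustrative).
import heapq

def lpt_assign(task_sizes, n_gpus):
    """Return per-GPU task index lists; LPT keeps device loads near-balanced."""
    heap = [(0.0, g) for g in range(n_gpus)]   # (current load, gpu id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_gpus)]
    for i in sorted(range(len(task_sizes)), key=lambda i: -task_sizes[i]):
        load, g = heapq.heappop(heap)          # least-loaded GPU
        assignment[g].append(i)
        heapq.heappush(heap, (load + task_sizes[i], g))
    return assignment

# Variable-sized workloads, e.g. lengths of biological sequences:
sizes = [300, 120, 950, 70, 410, 640, 220, 500]
plan = lpt_assign(sizes, n_gpus=2)
loads = [sum(sizes[i] for i in gpu) for gpu in plan]
```

Placing the largest tasks first means the small tasks that arrive last can fill in whatever gap remains, so the final load difference is bounded by a single task's size.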

10.1109/bibm52615.2021.9669317 article EN 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2021-12-09

Due to the strict communication dependences in the global collectives of MPI applications, noise that delays one process can amplify across the processes of a large run. The amount of overhead that this amplification causes increases dramatically as we scale an application to very large numbers of processes (10,000 or more). For hybrid OpenMP/MPI (or MPI+X) applications, we can reduce amplification with on-node dynamic thread scheduling. However, the dequeue cost of such schemes can be steep. To mitigate this cost, we have introduced lightweight scheduling, which combines dynamic and static task...

10.1109/sc.companion.2012.209 article EN 2012-11-01

Recent studies have shown that operating system (OS) interference, popularly called OS noise, can be a significant problem as we scale to large numbers of processors. One solution for mitigating it is to turn off certain services on the machine. However, this is typically infeasible because the full set of services may be required by some applications. Furthermore, it is not a choice an end user can make. Thus, we need an application-level solution. Building upon previous work that demonstrated the utility of within-node light-weight load...

10.1109/hipc.2011.6152722 article EN 2011-12-01

Parallel patterns can be thought of as standard solutions used to evaluate parallelism in software. Multi-core benchmarks are codes for evaluating parallelism in hardware. In this document, we discuss the relationship and synergy between the ongoing development of parallel patterns and of multi-core benchmarks, specifically discussing how each can be beneficial to the other.

10.1145/1808954.1808969 article EN 2010-05-01

The NAS parallel benchmarks, originally developed by NASA for evaluating the performance of their high-performance computers, have been regarded as one of the most widely used benchmark suites for side-by-side comparisons of machines. However, even though the benchmarks have grown tremendously in the last two decades, their documentation is lagging behind because of rapid changes and additions to the collection of codes, primarily due to innovation in architectures. Consequently, the learning curve for beginning graduate students, researchers, or...

10.1145/1953611.1953623 article EN 2010-03-30

Numerical scientific computations, which are based on floating-point operations, have been sped up greatly via the GPUs or other accelerators of supercomputers. However, combinatorial integer computations do not use a node well. One reason for this is that offloading data and computation from the CPU (host) to the GPU (accelerator device) of a supercomputer is by default synchronous. Synchronous offloading is costly if the host can do meaningful work while independent tasks run on the accelerator. To counter these costs, a capability for asynchronous offloading in...
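The host/device overlap described above can be sketched with a worker pool standing in for the accelerator: independent tasks are submitted asynchronously, host work proceeds immediately instead of blocking, and results are collected at an explicit synchronization point. This is an illustrative analogue of asynchronous offloading (e.g. what OpenMP's `target ... nowait` provides), not the paper's implementation; `device_task` and `host_work` are hypothetical placeholders.

```python
# Asynchronous "offload" sketch: overlap host work with independent tasks.
from concurrent.futures import ThreadPoolExecutor

def device_task(x):
    # Placeholder for work offloaded to an accelerator.
    return x * x

def host_work(values):
    # Meaningful host-side work that overlaps with the offloaded tasks.
    return sum(values)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(device_task, x) for x in range(8)]  # async submit
    host_result = host_work(range(100))                        # overlap
    device_results = [f.result() for f in futures]             # sync point
```

The key structural point is that `host_work` runs between submission and the `f.result()` calls; with synchronous offloading it could only start after every device task finished.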

10.1109/hipar56574.2022.00006 article EN 2022-11-01