Vivek Kale

ORCID: 0000-0003-4687-1226
Research Areas
  • Distributed and Parallel Computing Systems
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Cloud Computing and Resource Management
  • Algorithms and Data Compression
  • Scientific Computing and Data Management
  • Embedded Systems Design Techniques
  • Energy Harvesting in Wireless Networks
  • Interconnection Networks and Systems
  • Distributed systems and fault tolerance
  • Biomedical and Engineering Education
  • Matrix Theory and Algorithms
  • Educational Games and Gamification
  • Medical Imaging Techniques and Applications
  • Simulation Techniques and Applications
  • Advanced Control Systems Optimization
  • Fuzzy Logic and Control Systems
  • IoT and Edge/Fog Computing
  • Lattice Boltzmann Simulation Studies
  • Mobile Learning in Education
  • Fluid Dynamics and Turbulent Flows
  • Software Testing and Debugging Techniques
  • Robotics and Automated Systems
  • Mobile Agent-Based Network Management
  • Molecular Communication and Nanonetworks

Sandia National Laboratories California
2023-2024

Brookhaven National Laboratory
2019-2022

Sandia National Laboratories
2022

University of Illinois Urbana-Champaign
2010-2015

Lawrence Livermore National Laboratory
2012

The lattice Boltzmann method is increasingly important in facilitating large-scale fluid dynamics simulations. To date, these simulations have been built on discretized velocity models of up to 27 neighbors. Recent work has shown that higher-order approximations of the continuum Boltzmann equation enable not only the recovery of Navier-Stokes hydrodynamics, but also simulation across a wider range of Knudsen numbers, which is especially important for micro- and nanoscale flows. These higher-order models have a significant impact on both communication...
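The abstract above concerns velocity models beyond the standard 27-neighbor sets. As a minimal illustration of what a discrete velocity model looks like, the sketch below computes the standard D2Q9 second-order equilibrium distribution; this is my own illustrative code, not code from the paper, which studies higher-order sets.

```python
# Standard D2Q9 lattice Boltzmann equilibrium distribution (illustrative).
# The paper itself discusses higher-order velocity sets beyond D3Q27.

# D2Q9 lattice velocities and weights (standard values).
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4/9] + [1/9] * 4 + [1/36] * 4
CS2 = 1/3  # lattice speed of sound squared

def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for density rho, velocity (ux, uy)."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1 + cu / CS2
                              + cu * cu / (2 * CS2 * CS2)
                              - usq / (2 * CS2)))
    return feq

# The zeroth and first moments recover density and momentum:
f = equilibrium(1.0, 0.1, 0.05)
density = sum(f)
momentum_x = sum(fi * cx for fi, (cx, _) in zip(f, C))
```

Summing the populations recovers the density, and the velocity-weighted sum recovers the momentum, which is the basic consistency property any discrete velocity set must satisfy.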

10.1109/ipdps.2013.109 article EN 2013-05-01

We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the use of this strategy in communication-avoiding dense factorization leads to significant performance gains. On a 48-core AMD Opteron NUMA machine, our experiments show that we can achieve up to a 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% over a version that uses fully static scheduling. On a 16-core Intel...
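The core idea of hybrid static/dynamic scheduling can be sketched in a few lines: a fraction of the iterations is assigned statically per thread (good locality, zero dequeue cost), and the remainder is dequeued dynamically from a shared counter (load balance). This is a minimal illustration of the concept, not the authors' CALU implementation; `static_fraction` is a hypothetical tuning parameter.

```python
# Hybrid static/dynamic loop scheduling (illustrative sketch).
import threading

def hybrid_schedule(n_iters, n_threads, static_fraction=0.7, work=lambda i: None):
    n_static = int(n_iters * static_fraction)
    queue_lock = threading.Lock()
    next_dynamic = [n_static]          # shared counter for the dynamic tail
    done = [[] for _ in range(n_threads)]

    def worker(tid):
        # Static part: a contiguous block per thread (locality, no dequeue cost).
        lo = tid * n_static // n_threads
        hi = (tid + 1) * n_static // n_threads
        for i in range(lo, hi):
            work(i); done[tid].append(i)
        # Dynamic part: grab remaining iterations one at a time (load balance).
        while True:
            with queue_lock:
                i = next_dynamic[0]
                if i >= n_iters:
                    return
                next_dynamic[0] += 1
            work(i); done[tid].append(i)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return done

assignments = hybrid_schedule(100, 4)
```

Raising `static_fraction` trades load balance for locality and lower synchronization cost; the paper's results come from tuning exactly this trade-off.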

10.1109/ipdps.2012.53 article EN 2012-05-01

Application performance can be degraded significantly due to node-local load imbalances during application execution. Prior work suggested using a mixed static/dynamic scheduling approach for handling this problem, specifically in the context of fine-grained, transient imbalances. Here, we consider a more general strategy for an alternate case, where fine-grained imbalance may be coupled with coarse-grained imbalance. Specifically, we implement a scheme in which we modify the data layout along with the scheduling, and add an additional tuned...
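One way to handle the coarse-grained component is to size each thread's static block in proportion to its measured speed, keeping a dynamic tail for the fine-grained, transient part. The sketch below illustrates only the weighted static partitioning; `speeds` stands in for timing measurements from earlier iterations and is a hypothetical input, not the paper's scheme.

```python
# Weighted static partitioning for coarse-grained imbalance (illustrative).
# Each thread's contiguous block is proportional to its measured speed.

def weighted_static_blocks(n_static, speeds):
    """Partition n_static iterations into contiguous blocks proportional to speeds."""
    total = sum(speeds)
    blocks, start, acc = [], 0, 0.0
    for s in speeds:
        acc += s
        end = round(n_static * acc / total)
        blocks.append((start, end))
        start = end
    return blocks

# Example: thread 0 is twice as fast as the others, so it gets a larger block.
blocks = weighted_static_blocks(90, speeds=[2.0, 1.0, 1.0])
```

The blocks stay contiguous, so the data-layout change the paper describes (keeping each thread's data local to its block) remains possible even as block sizes differ.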

10.1145/2642769.2642788 article EN 2014-08-29

A large number of parallel applications contain a computationally intensive phase in which list elements must be ordered based on some common attribute of the elements. How do we sort a sequence across multiple processing units so as to minimize the redistribution of keys while allowing independent sorting work?
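A classic answer to this question is splitter-based (sample) sorting: sampled splitters partition the keys into per-unit buckets in a single redistribution pass, after which each unit sorts its bucket independently. The sketch below is my own serial illustration of the idea, not the paper's algorithm; `oversample` is a hypothetical tuning parameter.

```python
# Splitter-based (sample) sort: one redistribution, then independent sorting.
import bisect
import random

def sample_sort(keys, n_units, oversample=8):
    # Choose n_units - 1 splitters from a random sample of the keys.
    sample = sorted(random.sample(keys, min(len(keys), n_units * oversample)))
    step = len(sample) // n_units
    splitters = [sample[(i + 1) * step] for i in range(n_units - 1)]
    # One redistribution pass: route each key to its splitter range.
    buckets = [[] for _ in range(n_units)]
    for k in keys:
        buckets[bisect.bisect_right(splitters, k)].append(k)
    # Independent sorting work per unit, then concatenate.
    return [k for b in buckets for k in sorted(b)]

data = [random.randrange(10**6) for _ in range(5000)]
out = sample_sort(data, n_units=8)
```

Because every key moves at most once (into its bucket), redistribution cost is bounded, and oversampling keeps the bucket sizes close to uniform.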

10.1145/1953611.1953621 article EN 2010-03-30

Computational bioinformatics and biomedical applications frequently contain heterogeneously sized units of work, or tasks, for instance due to variability in the sizes of biological sequences and molecules. Variable-sized workloads lead to load imbalances in parallel implementations, which detract from efficiency and performance. Many modern computing resources now have multiple graphics processing units (GPUs) per computer for acceleration. These GPUs need to be used efficiently through load balancing across the GPUs. OpenMP...
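A simple way to balance variable-sized tasks across devices is the greedy longest-processing-time (LPT) heuristic: assign each task, largest first, to the currently least-loaded GPU. This is an illustrative sketch of that heuristic, not the paper's OpenMP implementation; the `sizes` values are made-up stand-ins for sequence lengths.

```python
# Greedy LPT assignment of variable-sized tasks to GPUs (illustrative).
import heapq

def lpt_assign(task_sizes, n_gpus):
    """Return per-GPU task index lists; LPT keeps device loads near-balanced."""
    heap = [(0.0, g) for g in range(n_gpus)]   # (current load, gpu id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_gpus)]
    for i in sorted(range(len(task_sizes)), key=lambda i: -task_sizes[i]):
        load, g = heapq.heappop(heap)          # least-loaded GPU
        assignment[g].append(i)
        heapq.heappush(heap, (load + task_sizes[i], g))
    return assignment

# Variable-sized workloads, e.g. lengths of biological sequences:
sizes = [300, 120, 950, 70, 410, 640, 220, 500]
plan = lpt_assign(sizes, n_gpus=2)
loads = [sum(sizes[i] for i in gpu) for gpu in plan]
```

Placing the largest tasks first means the small tasks that arrive last can fill in whatever gap remains, so the final load difference is bounded by a single task's size.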

10.1109/bibm52615.2021.9669317 article EN 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2021-12-09

Due to the strict communication dependences in the global collectives of MPI applications, noise that delays one process can amplify across the processes of a large run. The amount of overhead that this amplification causes increases dramatically as we scale an application to very large numbers of processes (10,000 or more). For hybrid OpenMP/MPI (or MPI+X) applications, we can reduce amplification with on-node dynamic thread scheduling. However, the dequeue cost of such schemes can be steep. To mitigate this cost, we have introduced lightweight scheduling, which combines dynamic and static task...

10.1109/sc.companion.2012.209 article EN 2012-11-01

Recent studies have shown that operating system (OS) interference, popularly called OS noise, can be a significant problem as we scale to large numbers of processors. One solution for mitigating it is to turn off certain services on the machine. However, this is typically infeasible because the full set of services may be required by some applications. Furthermore, it is not a choice an end user can make. Thus, we need an application-level solution. Building upon previous work that demonstrated the utility of within-node light-weight load...

10.1109/hipc.2011.6152722 article EN 2011-12-01

Parallel patterns can be thought of as standard solutions used to evaluate parallelism in software. Multi-core benchmarks are codes for evaluating parallelism in hardware. In this document, we discuss the relationship and synergy between the ongoing development of parallel patterns and of multi-core benchmarks, specifically discussing how each can be beneficial to the other.

10.1145/1808954.1808969 article EN 2010-05-01

The NAS parallel benchmarks, originally developed by NASA for evaluating the performance of their high-performance computers, have been regarded as one of the most widely used benchmark suites for side-by-side comparisons of machines. However, even though the benchmarks have grown tremendously in the last two decades, their documentation is lagging behind because of rapid changes and additions to the collection of codes, primarily due to innovation in architectures. Consequently, the learning curve for beginning graduate students, researchers, or...

10.1145/1953611.1953623 article EN 2010-03-30

Numerical scientific computations, which are based on floating-point operations, have been sped up greatly via the GPUs or other accelerators of supercomputers. However, combinatorial integer computations do not use a node well. One reason for this is that offloading data and computation from the CPU (host) to the GPU (accelerator device) of a supercomputer is by default synchronous. Synchronous offloading is costly if the host can do meaningful work while independent tasks run on the accelerator. To counter these costs, a capability for asynchronous offloading in...
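The host/device overlap described above can be sketched with a worker pool standing in for the accelerator: independent tasks are submitted asynchronously, host work proceeds immediately instead of blocking, and results are collected at an explicit synchronization point. This is an illustrative analogue of asynchronous offloading (e.g. what OpenMP's `target ... nowait` provides), not the paper's implementation; `device_task` and `host_work` are hypothetical placeholders.

```python
# Asynchronous "offload" sketch: overlap host work with independent tasks.
from concurrent.futures import ThreadPoolExecutor

def device_task(x):
    # Placeholder for work offloaded to an accelerator.
    return x * x

def host_work(values):
    # Meaningful host-side work that overlaps with the offloaded tasks.
    return sum(values)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(device_task, x) for x in range(8)]  # async submit
    host_result = host_work(range(100))                        # overlap
    device_results = [f.result() for f in futures]             # sync point
```

The key structural point is that `host_work` runs between submission and the `f.result()` calls; with synchronous offloading it could only start after every device task finished.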

10.1109/hipar56574.2022.00006 article EN 2022-11-01