Nathan R. Tallent

ORCID: 0000-0003-4297-3057
Research Areas
  • Parallel Computing and Optimization Techniques
  • Cloud Computing and Resource Management
  • Distributed and Parallel Computing Systems
  • Advanced Data Storage Technologies
  • Scientific Computing and Data Management
  • Software System Performance and Reliability
  • Distributed systems and fault tolerance
  • Graph Theory and Algorithms
  • Advanced Graph Neural Networks
  • Interconnection Networks and Systems
  • Caching and Content Delivery
  • Advanced Memory and Neural Computing
  • Research Data Management Practices
  • Complex Network Analysis Techniques
  • Embedded Systems Design Techniques
  • Advanced Neural Network Applications
  • Photonic and Optical Devices
  • Machine Learning in Materials Science
  • Neural Networks and Reservoir Computing
  • IoT and Edge/Fog Computing
  • Data Management and Algorithms
  • Advanced Electron Microscopy Techniques and Applications
  • Ferroelectric and Negative Capacitance Devices
  • Software Testing and Debugging Techniques
  • Semantic Web and Ontologies

Pacific Northwest National Laboratory
2016-2025

Rice University
2002-2011

HPCToolkit is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCToolkit can pinpoint and quantify scalability bottlenecks in fully optimized programs with a measurement overhead of only a few percent. Recently, new capabilities were added for collecting call path profiles of fully optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, and exploring performance information together with source code...

10.1002/cpe.1553 article EN Concurrency and Computation Practice and Experience 2009-12-30
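The call-path profiling idea behind HPCToolkit can be illustrated with a small sketch. A real profiler triggers samples from timer interrupts and unwinds optimized binaries without compiler support; in this hypothetical Python stand-in, samples are taken at explicit points so the result is deterministic.

```python
import collections
import traceback

def take_sample(counts):
    """Record the current call path (outermost to innermost frame)."""
    frames = traceback.extract_stack()[:-1]  # drop take_sample itself
    counts[tuple(f.name for f in frames)] += 1

def inner(counts):
    take_sample(counts)   # pretend a profiling timer fired here

def outer(counts):
    for _ in range(3):
        inner(counts)

counts = collections.Counter()
outer(counts)
# every sample carries the full call path ending in ... -> outer -> inner
```

Aggregating many such paths is what lets a profiler attribute cost to full calling contexts rather than to isolated routines.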

High-performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand for computation capability in emerging domains such as deep learning, big data, and planet-scale simulations. However, the lack of understanding of how modern GPUs can be connected, and of the real impact of state-of-the-art interconnect technology on application performance, has become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation of five of the latest types of GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI...

10.1109/tpds.2019.2928289 article EN publisher-specific-oa IEEE Transactions on Parallel and Distributed Systems 2019-07-15
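The measurement style behind such an interconnect evaluation can be sketched host-side. This is a generic timing loop with a CPU memory copy standing in for a GPU-to-GPU transfer; a real evaluation would use CUDA events and actual PCIe/NVLink paths, and all names here are illustrative.

```python
import time

def measure_bandwidth(transfer, nbytes, repeats=20):
    """Time repeated transfers and return achieved bandwidth in GB/s."""
    transfer()  # warm-up, excluded from timing
    t0 = time.perf_counter()
    for _ in range(repeats):
        transfer()
    elapsed = time.perf_counter() - t0
    return nbytes * repeats / elapsed / 1e9

# Stand-in "transfer": copy an 8 MiB buffer within host memory.
src = bytearray(8 * 1024 * 1024)
dst = bytearray(len(src))

def copy_buffer():
    dst[:] = src

gbps = measure_bandwidth(copy_buffer, len(src))
```

Sweeping the buffer size in such a loop is what exposes the latency-dominated versus bandwidth-dominated regimes of an interconnect.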

The Open/ADF tool allows the evaluation of derivatives of functions defined by a Fortran program. The derivative evaluation is performed by a Fortran code resulting from the analysis and transformation of the original program that defines the function of interest. Open/ADF has been designed with particular emphasis on modularity, flexibility, and the use of open-source components. While it follows the basic principles of automatic differentiation, it implements new algorithmic approaches at various levels, for example, for basic-block preaccumulation and call graph reversal. Unlike...

10.1145/1377596.1377598 article EN ACM Transactions on Mathematical Software 2008-07-01
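Open/ADF works by source transformation of Fortran; the underlying chain-rule mechanics can be shown more compactly with operator-overloading forward-mode AD, a different flavor of the same principle. A minimal sketch:

```python
class Dual:
    """A value paired with its derivative (forward-mode AD)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def _coerce(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._coerce(other)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __mul__(self, other):
        o = self._coerce(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1

y = f(Dual(2.0, 1.0))   # seed dx/dx = 1
# y.val == 17.0 (f(2));  y.dot == 14.0 (f'(2) = 6*2 + 2)
```

Source transformation, as in Open/ADF, achieves the same derivatives but generates explicit derivative code, which permits optimizations such as preaccumulation that overloading cannot easily express.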

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as...

10.1145/1504176.1504210 article EN 2009-02-14
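The notion of attributing parallel idleness can be sketched as follows: each idle thread-sample is charged, split evenly, to the code contexts that were doing work at that instant, so that blame lands on the code that failed to provide enough parallelism. This is a simplification of the paper's sampling-based scheme, with made-up sample data.

```python
from collections import Counter

def attribute_idleness(samples):
    """samples: per-sample lists of thread states, ('work', ctx) or ('idle', None).
    Charge each idle observation evenly across the contexts working then."""
    idleness = Counter()
    for threads in samples:
        working = [ctx for state, ctx in threads if state == 'work']
        n_idle = sum(1 for state, _ in threads if state == 'idle')
        if working and n_idle:
            share = n_idle / len(working)
            for ctx in working:
                idleness[ctx] += share
    return idleness

# Two samples over three threads; 'A' and 'B' are work contexts.
samples = [
    [('work', 'A'), ('idle', None), ('idle', None)],
    [('work', 'A'), ('work', 'B'), ('idle', None)],
]
idleness = attribute_idleness(samples)
# idleness == Counter({'A': 2.5, 'B': 0.5})
```

A context with high attributed idleness is one whose serial sections leave the other cores stalled.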

Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify and attribute lock contention is important for understanding where a multithreaded program needs improvement.

10.1145/1693453.1693489 article EN 2010-01-09
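One simple way to quantify and attribute lock contention is to time each acquisition and charge the wait to a call-site label. This wrapper is a simplification: the paper's strategies attribute contention in unmodified, fully optimized binaries rather than through instrumented wrappers.

```python
import collections
import threading
import time

class InstrumentedLock:
    """Lock wrapper that attributes time spent waiting to a call-site label."""
    def __init__(self):
        self._lock = threading.Lock()
        self.wait_time = collections.Counter()

    def acquire(self, site):
        t0 = time.perf_counter()
        self._lock.acquire()
        self.wait_time[site] += time.perf_counter() - t0

    def release(self):
        self._lock.release()

lock = InstrumentedLock()
lock.acquire('main')          # uncontended: negligible wait

def worker():
    lock.acquire('worker')    # blocks until main releases
    lock.release()

t = threading.Thread(target=worker)
t.start()
time.sleep(0.05)              # hold the lock while the worker waits
lock.release()
t.join()
# wait_time['worker'] is roughly 0.05 s; wait_time['main'] is near zero
```

Sorting `wait_time` then points directly at the call sites where contention hurts most.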

10.1023/a:1015789220266 article EN The Journal of Supercomputing 2002-01-01

Applications must scale well to make efficient use of today's petascale computers, which contain hundreds of thousands of processor cores. Inefficiencies that do not even appear in modest-scale executions can become major bottlenecks in large-scale executions. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of scaling problems. Load imbalance is one of the most common scaling problems. To provide actionable insight into load imbalance, we present...

10.1109/sc.2010.47 article EN 2010-11-01
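A basic load-imbalance metric makes the problem concrete. The formula below is a commonly used folklore metric (percent by which the slowest process exceeds the average), not necessarily the paper's exact formulation.

```python
def load_imbalance(times):
    """Percent imbalance of per-process times: (max/mean - 1) * 100.
    Zero means perfectly balanced; 60 means the slowest process took
    60% longer than the average."""
    mean = sum(times) / len(times)
    return (max(times) / mean - 1.0) * 100.0

balanced = load_imbalance([4.0, 4.0, 4.0, 4.0])   # 0.0
skewed = load_imbalance([8.0, 4.0, 4.0, 4.0])     # ~60.0
```

The harder part, which the paper addresses, is attributing such imbalance back to the responsible code and data, not merely detecting it.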

Analytical (predictive) application performance models are critical for diagnosing performance-limiting resources, optimizing systems, and designing machines. Creating models, however, is difficult because they must be both accurate and concise. To ease the burden of performance modeling, we developed Palm (Performance and Architecture Lab Modeling tool), a modeling tool that combines top-down (human-provided) semantic insight with bottom-up static and dynamic analysis. First, Palm provides a source code annotation...

10.1145/2597652.2597683 article EN 2014-06-10
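The flavor of an analytical model composed from program structure can be sketched as a hierarchy of sub-models, echoing Palm's approach of assembling a whole-program model from annotations. The machine constants below are made up for illustration.

```python
import math

# Made-up machine constants, for illustration only.
T_FLOP = 1e-9      # seconds per unit of compute work
T_MSG = 1e-6       # per-level latency of a reduction tree

def model_compute(n, p):
    return n * T_FLOP / p              # perfectly divisible work

def model_reduce(p):
    return T_MSG * math.log2(p)        # log-depth reduction tree

def model_total(n, p):
    """Compose sub-models hierarchically into a whole-program model."""
    return model_compute(n, p) + model_reduce(p)

speedup_64 = model_total(1e9, 1) / model_total(1e9, 64)
# speedup at 64 processes falls just short of 64 due to the log(p) term
```

Even this toy model exposes the diagnostic value of the approach: the communication term bounds scalability, and the model says by how much.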

High-performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand for computation capability in emerging domains such as deep learning, big data, and planet-scale applications. However, the lack of understanding of how modern GPUs can be connected, and of the actual impact of state-of-the-art interconnect technology on multi-GPU application performance, has become a hurdle. Additionally, the absence of a practical multi-GPU benchmark suite poses further obstacles for conducting research in this era. In this paper, we fill the gap by proposing a benchmark suite named...

10.1109/iiswc.2018.8573483 article EN 2018-09-01

Neural Architecture Search (NAS) is a powerful approach for automating the design of efficient neural architectures. In contrast to traditional NAS methods, recently proposed one-shot NAS methods prove to be more efficient in performing NAS. One-shot NAS works by generating a singular weight-sharing supernetwork that acts as a search space (container) of subnetworks. Despite its achievements, designing the supernetwork remains a major challenge. In this work we propose a supernetwork design strategy for Vision Transformer (ViT)-based architectures. In particular, we convert the Segment...

10.48550/arxiv.2501.08504 preprint EN arXiv (Cornell University) 2025-01-14

Modern programs frequently employ sophisticated modular designs. As a result, performance problems cannot be identified from costs attributed to routines in isolation; understanding code performance requires information about a routine's calling context. Existing performance tools fall short in this respect. Prior strategies for attributing context-sensitive performance at the source level either compromise measurement accuracy, remain too close to the binary, or require custom compilers. To understand the performance of fully optimized code, we...

10.1145/1542476.1542526 article EN 2009-06-15
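The data structure at the heart of context-sensitive attribution is a calling context tree (CCT), where a routine's cost is kept separate for each distinct chain of callers. A minimal sketch, with hypothetical frame names:

```python
class CCTNode:
    """One frame in a calling context tree (CCT)."""
    def __init__(self, name):
        self.name = name
        self.count = 0        # samples attributed to this exact context
        self.children = {}

def record(root, path):
    """Insert a sampled call path (outermost to innermost) into the CCT."""
    node = root
    for frame in path:
        node = node.children.setdefault(frame, CCTNode(frame))
    node.count += 1

root = CCTNode('<root>')
record(root, ('main', 'solve', 'dot'))
record(root, ('main', 'solve', 'dot'))
record(root, ('main', 'io'))
# 'dot' costs stay attached to main->solve->dot, not merged per routine
```

Keeping costs per context is exactly what lets a tool distinguish an expensive call of a routine from a cheap one elsewhere in the program.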

Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance bottlenecks.

10.1145/1995896.1995908 article EN 2011-05-31

Power has become a major impediment in designing large-scale high-end systems. The Message Passing Interface (MPI) is the de facto communication interface used as the back-end for applications, programming models, and runtimes on these systems. Slack --- the time spent by an MPI process in a single MPI call --- provides a potential for energy and power savings, if an appropriate power-reduction technique such as core-idling or Dynamic Voltage and Frequency Scaling (DVFS) can be applied without affecting the application's performance. Existing techniques that...

10.1145/2807591.2807658 article EN 2015-10-27
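Slack at a synchronization point is easy to state concretely: a process that arrives early will wait for the last arriver, and that wait is time it could have spent running slower and cheaper. A minimal sketch with made-up arrival times:

```python
def barrier_slack(arrival_times):
    """Per-process slack at a synchronization point: how long each process
    waits for the last arriver. Large slack marks a candidate for
    core-idling or DVFS with little performance impact."""
    latest = max(arrival_times)
    return [latest - t for t in arrival_times]

slack = barrier_slack([1.0, 3.0, 2.0])   # -> [2.0, 0.0, 1.0]
```

The process with zero slack is on the critical path; slowing it would hurt performance, while the others have headroom for power reduction.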

Cutting-edge science and engineering applications require petascale computing. It is, however, a significant challenge to use petascale computing platforms effectively. Consequently, there is a critical need for performance tools that enable scientists to understand impediments to performance on emerging petascale systems. In this paper, we describe HPCToolkit---a suite of multi-platform tools that supports sampling-based analysis of application performance on emerging petascale platforms. HPCToolkit uses sampling to pinpoint and quantify both scaling and node performance bottlenecks. We study...

10.1145/1654059.1654111 article EN 2009-11-14

10.1145/1594835.1504210 article EN ACM SIGPLAN Notices 2009-02-14

As part of the U.S. Department of Energy's Scientific Discovery through Advanced Computing (SciDAC) program, science teams are tackling problems that require simulation and modeling on petascale computers. As part of activities associated with the SciDAC Center for Scalable Application Development Software (CScADS) and the Performance Engineering Research Institute (PERI), Rice University is building software tools for performance analysis of scientific applications on leadership-class platforms. In this poster abstract, we...

10.1088/1742-6596/125/1/012088 article EN Journal of Physics Conference Series 2008-07-01

Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify and attribute lock contention is important for understanding where a multithreaded program needs improvement. This paper proposes and evaluates three strategies for gaining insight into performance losses due to lock contention. First, we consider a straightforward strategy based on call...

10.1145/1837853.1693489 article EN ACM SIGPLAN Notices 2010-01-09

Faults are commonplace in large-scale systems. These systems experience a variety of faults such as transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: Given a multi-bit fault in main memory, will it result in an application error (in which case a recovery algorithm should be invoked), or can it be safely ignored? We propose a modeling methodology to answer this question. Using a fault signature (a set of attributes comprising the system state),...

10.1109/ipdps.2016.111 article EN 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016-05-01

Vertex reordering is a way to improve locality in graph computations. Given an input (or "natural") order, reordering aims to compute an alternate permutation of the vertices aimed at maximizing a locality-based objective. Despite decades of research on this topic, there are tens of reordering schemes, and also several linear arrangement "gap" measures that can serve as objectives. However, a comprehensive empirical analysis of the efficacy of the ordering schemes against the different gap measures and on real-world applications is currently lacking. In...

10.1109/iiswc50251.2020.00031 article EN 2020-10-01
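Two of the linear-arrangement "gap" measures such a study might compare are easy to compute: bandwidth (the maximum gap between endpoint positions over all edges) and average gap. A minimal sketch on a toy graph:

```python
def gap_measures(edges, order):
    """Bandwidth (max gap) and average gap of a vertex ordering.
    A gap is |pos(u) - pos(v)| for an edge (u, v)."""
    pos = {v: i for i, v in enumerate(order)}
    gaps = [abs(pos[u] - pos[v]) for u, v in edges]
    return max(gaps), sum(gaps) / len(gaps)

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
bw_natural, avg_natural = gap_measures(edges, [0, 1, 2, 3])   # 2, 1.25
bw_swapped, avg_swapped = gap_measures(edges, [0, 2, 1, 3])   # 2, 1.5
```

Here the two orderings tie on bandwidth but differ on average gap, which is precisely why comparing schemes against multiple gap measures matters.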

Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors - including NVIDIA, Intel, AMD, and IBM - have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new products as accelerating DL workloads. Unfortunately, it is difficult for scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: the NVIDIA DGX-1 (eight Pascal...

10.1109/ipdpsw.2017.36 article EN 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2017-05-01

Automating the theory-experiment cycle requires effective distributed workflows that utilize a computing continuum spanning lab instruments, edge sensors, computing resources at multiple facilities, data sets distributed across information sources, and potentially the cloud. Unfortunately, obvious methods for constructing continuum platforms, orchestrating workflow tasks, and curating datasets over time fail to achieve scientific requirements for performance, energy, security, and reliability. Furthermore, achieving the best use of continuum resources depends...

10.48550/arxiv.2410.16093 preprint EN arXiv (Cornell University) 2024-10-21