Nathan R. Tallent

ORCID: 0000-0003-4297-3057
Research Areas
  • Parallel Computing and Optimization Techniques
  • Cloud Computing and Resource Management
  • Distributed and Parallel Computing Systems
  • Advanced Data Storage Technologies
  • Scientific Computing and Data Management
  • Software System Performance and Reliability
  • Distributed systems and fault tolerance
  • Graph Theory and Algorithms
  • Advanced Graph Neural Networks
  • Interconnection Networks and Systems
  • Caching and Content Delivery
  • Advanced Memory and Neural Computing
  • Research Data Management Practices
  • Complex Network Analysis Techniques
  • Embedded Systems Design Techniques
  • Advanced Neural Network Applications
  • Photonic and Optical Devices
  • Machine Learning in Materials Science
  • Neural Networks and Reservoir Computing
  • IoT and Edge/Fog Computing
  • Data Management and Algorithms
  • Advanced Electron Microscopy Techniques and Applications
  • Ferroelectric and Negative Capacitance Devices
  • Software Testing and Debugging Techniques
  • Semantic Web and Ontologies

Pacific Northwest National Laboratory
2016-2025

Rice University
2002-2011

HPCToolkit is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCToolkit can pinpoint and quantify scalability bottlenecks in fully optimized programs with a measurement overhead of only a few percent. Recently, new capabilities were added for collecting call path profiles of fully optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, and exploring performance information together with source code...

10.1002/cpe.1553 article EN Concurrency and Computation Practice and Experience 2009-12-30
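The call-path profiling idea behind HPCToolkit can be illustrated with a small sketch. A real profiler triggers samples from timer interrupts and unwinds optimized binaries without compiler support; in this hypothetical Python stand-in, samples are taken at explicit points so the result is deterministic.

```python
import collections
import traceback

def take_sample(counts):
    """Record the current call path (outermost to innermost frame)."""
    frames = traceback.extract_stack()[:-1]  # drop take_sample itself
    counts[tuple(f.name for f in frames)] += 1

def inner(counts):
    take_sample(counts)   # pretend a profiling timer fired here

def outer(counts):
    for _ in range(3):
        inner(counts)

counts = collections.Counter()
outer(counts)
# every sample carries the full call path ending in ... -> outer -> inner
```

Aggregating many such paths is what lets a profiler attribute cost to full calling contexts rather than to isolated routines.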

High-performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand for computation capability in emerging domains such as deep learning, big data, and planet-scale simulations. However, the lack of understanding of how modern GPUs can be connected, and of the real impact of state-of-the-art interconnect technology on application performance, has become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation of five of the latest types of GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI...

10.1109/tpds.2019.2928289 article EN publisher-specific-oa IEEE Transactions on Parallel and Distributed Systems 2019-07-15
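The measurement style behind such an interconnect evaluation can be sketched host-side. This is a generic timing loop with a CPU memory copy standing in for a GPU-to-GPU transfer; a real evaluation would use CUDA events and actual PCIe/NVLink paths, and all names here are illustrative.

```python
import time

def measure_bandwidth(transfer, nbytes, repeats=20):
    """Time repeated transfers and return achieved bandwidth in GB/s."""
    transfer()  # warm-up, excluded from timing
    t0 = time.perf_counter()
    for _ in range(repeats):
        transfer()
    elapsed = time.perf_counter() - t0
    return nbytes * repeats / elapsed / 1e9

# Stand-in "transfer": copy an 8 MiB buffer within host memory.
src = bytearray(8 * 1024 * 1024)
dst = bytearray(len(src))

def copy_buffer():
    dst[:] = src

gbps = measure_bandwidth(copy_buffer, len(src))
```

Sweeping the buffer size in such a loop is what exposes the latency-dominated versus bandwidth-dominated regimes of an interconnect.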

The Open/ADF tool allows the evaluation of derivatives of functions defined by a Fortran program. The derivative evaluation is performed by a Fortran code resulting from the analysis and transformation of the original program that defines the function of interest. Open/ADF has been designed with particular emphasis on modularity, flexibility, and the use of open-source components. While it follows the basic principles of automatic differentiation, it implements new algorithmic approaches at various levels, for example, for basic-block preaccumulation and call graph reversal. Unlike...

10.1145/1377596.1377598 article EN ACM Transactions on Mathematical Software 2008-07-01
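Open/ADF works by source transformation of Fortran; the underlying chain-rule mechanics can be shown more compactly with operator-overloading forward-mode AD, a different flavor of the same principle. A minimal sketch:

```python
class Dual:
    """A value paired with its derivative (forward-mode AD)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def _coerce(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._coerce(other)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __mul__(self, other):
        o = self._coerce(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1

y = f(Dual(2.0, 1.0))   # seed dx/dx = 1
# y.val == 17.0 (f(2));  y.dot == 14.0 (f'(2) = 6*2 + 2)
```

Source transformation, as in Open/ADF, achieves the same derivatives but generates explicit derivative code, which permits optimizations such as preaccumulation that overloading cannot easily express.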

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as...

10.1145/1504176.1504210 article EN 2009-02-14
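The notion of attributing parallel idleness can be sketched as follows: each idle thread-sample is charged, split evenly, to the code contexts that were doing work at that instant, so that blame lands on the code that failed to provide enough parallelism. This is a simplification of the paper's sampling-based scheme, with made-up sample data.

```python
from collections import Counter

def attribute_idleness(samples):
    """samples: per-sample lists of thread states, ('work', ctx) or ('idle', None).
    Charge each idle observation evenly across the contexts working then."""
    idleness = Counter()
    for threads in samples:
        working = [ctx for state, ctx in threads if state == 'work']
        n_idle = sum(1 for state, _ in threads if state == 'idle')
        if working and n_idle:
            share = n_idle / len(working)
            for ctx in working:
                idleness[ctx] += share
    return idleness

# Two samples over three threads; 'A' and 'B' are work contexts.
samples = [
    [('work', 'A'), ('idle', None), ('idle', None)],
    [('work', 'A'), ('work', 'B'), ('idle', None)],
]
idleness = attribute_idleness(samples)
# idleness == Counter({'A': 2.5, 'B': 0.5})
```

A context with high attributed idleness is one whose serial sections leave the other cores stalled.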

Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify and attribute lock contention is important for understanding where a multithreaded program needs improvement.

10.1145/1693453.1693489 article EN 2010-01-09
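One simple way to quantify and attribute lock contention is to time each acquisition and charge the wait to a call-site label. This wrapper is a simplification: the paper's strategies attribute contention in unmodified, fully optimized binaries rather than through instrumented wrappers.

```python
import collections
import threading
import time

class InstrumentedLock:
    """Lock wrapper that attributes time spent waiting to a call-site label."""
    def __init__(self):
        self._lock = threading.Lock()
        self.wait_time = collections.Counter()

    def acquire(self, site):
        t0 = time.perf_counter()
        self._lock.acquire()
        self.wait_time[site] += time.perf_counter() - t0

    def release(self):
        self._lock.release()

lock = InstrumentedLock()
lock.acquire('main')          # uncontended: negligible wait

def worker():
    lock.acquire('worker')    # blocks until main releases
    lock.release()

t = threading.Thread(target=worker)
t.start()
time.sleep(0.05)              # hold the lock while the worker waits
lock.release()
t.join()
# wait_time['worker'] is roughly 0.05 s; wait_time['main'] is near zero
```

Sorting `wait_time` then points directly at the call sites where contention hurts most.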

10.1023/a:1015789220266 article EN The Journal of Supercomputing 2002-01-01

Applications must scale well to make efficient use of today's petascale computers, which contain hundreds of thousands of processor cores. Inefficiencies that do not even appear in modest-scale executions can become major bottlenecks in large-scale executions. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of scaling problems. Load imbalance is one of the most common scaling problems. To provide actionable insight into load imbalance, we present...

10.1109/sc.2010.47 article EN 2010-11-01
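A basic load-imbalance metric makes the problem concrete. The formula below is a commonly used folklore metric (percent by which the slowest process exceeds the average), not necessarily the paper's exact formulation.

```python
def load_imbalance(times):
    """Percent imbalance of per-process times: (max/mean - 1) * 100.
    Zero means perfectly balanced; 60 means the slowest process took
    60% longer than the average."""
    mean = sum(times) / len(times)
    return (max(times) / mean - 1.0) * 100.0

balanced = load_imbalance([4.0, 4.0, 4.0, 4.0])   # 0.0
skewed = load_imbalance([8.0, 4.0, 4.0, 4.0])     # ~60.0
```

The harder part, which the paper addresses, is attributing such imbalance back to the responsible code and data, not merely detecting it.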

Analytical (predictive) application performance models are critical for diagnosing performance-limiting resources, optimizing systems, and designing machines. Creating models, however, is difficult because they must be both accurate and concise. To ease the burden of performance modeling, we developed Palm (Performance and Architecture Lab Modeling tool), a modeling tool that combines top-down (human-provided) semantic insight with bottom-up static and dynamic analysis. First, Palm provides a source code annotation...

10.1145/2597652.2597683 article EN 2014-06-10
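The flavor of an analytical model composed from program structure can be sketched as a hierarchy of sub-models, echoing Palm's approach of assembling a whole-program model from annotations. The machine constants below are made up for illustration.

```python
import math

# Made-up machine constants, for illustration only.
T_FLOP = 1e-9      # seconds per unit of compute work
T_MSG = 1e-6       # per-level latency of a reduction tree

def model_compute(n, p):
    return n * T_FLOP / p              # perfectly divisible work

def model_reduce(p):
    return T_MSG * math.log2(p)        # log-depth reduction tree

def model_total(n, p):
    """Compose sub-models hierarchically into a whole-program model."""
    return model_compute(n, p) + model_reduce(p)

speedup_64 = model_total(1e9, 1) / model_total(1e9, 64)
# speedup at 64 processes falls just short of 64 due to the log(p) term
```

Even this toy model exposes the diagnostic value of the approach: the communication term bounds scalability, and the model says by how much.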

High-performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand for computation capability in emerging domains such as deep learning, big data, and planet-scale applications. However, the lack of understanding of how modern GPUs can be connected, and of the actual impact of state-of-the-art interconnect technology on multi-GPU application performance, has become a hurdle. Additionally, the absence of a practical multi-GPU benchmark suite poses further obstacles for conducting research in this era. In this paper, we fill the gap by proposing a benchmark suite named...

10.1109/iiswc.2018.8573483 article EN 2018-09-01

Neural Architecture Search (NAS) is a powerful approach for automating the design of efficient neural architectures. In contrast to traditional NAS methods, recently proposed one-shot NAS methods prove to be more efficient in performing NAS. One-shot NAS works by generating a singular weight-sharing supernetwork that acts as a search space (container) of subnetworks. Despite its achievements, designing the supernetwork remains a major challenge. In this work we propose a supernetwork design strategy for Vision Transformer (ViT)-based architectures. In particular, we convert the Segment...

10.48550/arxiv.2501.08504 preprint EN arXiv (Cornell University) 2025-01-14

Modern programs frequently employ sophisticated modular designs. As a result, performance problems cannot be identified from costs attributed to routines in isolation; understanding code performance requires information about a routine's calling context. Existing performance tools fall short in this respect. Prior strategies for attributing context-sensitive performance at the source level either compromise measurement accuracy, remain too close to the binary, or require custom compilers. To understand the performance of fully optimized code, we...

10.1145/1542476.1542526 article EN 2009-06-15
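The data structure at the heart of context-sensitive attribution is a calling context tree (CCT), where a routine's cost is kept separate for each distinct chain of callers. A minimal sketch, with hypothetical frame names:

```python
class CCTNode:
    """One frame in a calling context tree (CCT)."""
    def __init__(self, name):
        self.name = name
        self.count = 0        # samples attributed to this exact context
        self.children = {}

def record(root, path):
    """Insert a sampled call path (outermost to innermost) into the CCT."""
    node = root
    for frame in path:
        node = node.children.setdefault(frame, CCTNode(frame))
    node.count += 1

root = CCTNode('<root>')
record(root, ('main', 'solve', 'dot'))
record(root, ('main', 'solve', 'dot'))
record(root, ('main', 'io'))
# 'dot' costs stay attached to main->solve->dot, not merged per routine
```

Keeping costs per context is exactly what lets a tool distinguish an expensive call of a routine from a cheap one elsewhere in the program.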

Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance bottlenecks.

10.1145/1995896.1995908 article EN 2011-05-31

Power has become a major impediment in designing large-scale high-end systems. The Message Passing Interface (MPI) is the de facto communication interface used as the back-end for applications, programming models, and runtimes on these systems. Slack --- the time spent by an MPI process in a single MPI call --- provides a potential for energy and power savings, if an appropriate power-reduction technique such as core-idling or Dynamic Voltage and Frequency Scaling (DVFS) can be applied without affecting the application's performance. Existing techniques that...

10.1145/2807591.2807658 article EN 2015-10-27
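Slack at a synchronization point is easy to state concretely: a process that arrives early will wait for the last arriver, and that wait is time it could have spent running slower and cheaper. A minimal sketch with made-up arrival times:

```python
def barrier_slack(arrival_times):
    """Per-process slack at a synchronization point: how long each process
    waits for the last arriver. Large slack marks a candidate for
    core-idling or DVFS with little performance impact."""
    latest = max(arrival_times)
    return [latest - t for t in arrival_times]

slack = barrier_slack([1.0, 3.0, 2.0])   # -> [2.0, 0.0, 1.0]
```

The process with zero slack is on the critical path; slowing it would hurt performance, while the others have headroom for power reduction.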

Cutting-edge science and engineering applications require petascale computing. It is, however, a significant challenge to use petascale computing platforms effectively. Consequently, there is a critical need for performance tools that enable scientists to understand impediments to performance on emerging petascale systems. In this paper, we describe HPCToolkit---a suite of multi-platform tools that supports sampling-based analysis of application performance on emerging petascale platforms. HPCToolkit uses sampling to pinpoint and quantify both scaling and node performance bottlenecks. We study...

10.1145/1654059.1654111 article EN 2009-11-14

10.1145/1594835.1504210 article EN ACM SIGPLAN Notices 2009-02-14

As part of the U.S. Department of Energy's Scientific Discovery through Advanced Computing (SciDAC) program, science teams are tackling problems that require simulation and modeling on petascale computers. As part of activities associated with the SciDAC Center for Scalable Application Development Software (CScADS) and the Performance Engineering Research Institute (PERI), Rice University is building software tools for performance analysis of scientific applications on leadership-class platforms. In this poster abstract, we...

10.1088/1742-6596/125/1/012088 article EN Journal of Physics Conference Series 2008-07-01

Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify and attribute lock contention is important for understanding where a multithreaded program needs improvement. This paper proposes and evaluates three strategies for gaining insight into performance losses due to lock contention. First, we consider a straightforward strategy based on call...

10.1145/1837853.1693489 article EN ACM SIGPLAN Notices 2010-01-09

Faults are commonplace in large-scale systems. These systems experience a variety of faults such as transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: Given a multi-bit fault in main memory, will it result in an application error (in which case a recovery algorithm should be invoked), or can it be safely ignored? We propose a modeling methodology to answer this question. Using a fault signature (a set of attributes comprising the system state),...

10.1109/ipdps.2016.111 article EN 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016-05-01

Vertex reordering is a way to improve locality in graph computations. Given an input (or "natural") order, reordering aims to compute an alternate permutation of the vertices aimed at maximizing a locality-based objective. Despite decades of research on this topic, there are tens of reordering schemes, and also several linear arrangement "gap" measures that can serve as objectives. However, a comprehensive empirical analysis of the efficacy of the ordering schemes against the different gap measures and on real-world applications is currently lacking. In...

10.1109/iiswc50251.2020.00031 article EN 2020-10-01
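Two of the linear-arrangement "gap" measures such a study might compare are easy to compute: bandwidth (the maximum gap between endpoint positions over all edges) and average gap. A minimal sketch on a toy graph:

```python
def gap_measures(edges, order):
    """Bandwidth (max gap) and average gap of a vertex ordering.
    A gap is |pos(u) - pos(v)| for an edge (u, v)."""
    pos = {v: i for i, v in enumerate(order)}
    gaps = [abs(pos[u] - pos[v]) for u, v in edges]
    return max(gaps), sum(gaps) / len(gaps)

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
bw_natural, avg_natural = gap_measures(edges, [0, 1, 2, 3])   # 2, 1.25
bw_swapped, avg_swapped = gap_measures(edges, [0, 2, 1, 3])   # 2, 1.5
```

Here the two orderings tie on bandwidth but differ on average gap, which is precisely why comparing schemes against multiple gap measures matters.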

Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors - including NVIDIA, Intel, AMD, and IBM - have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new products as accelerating DL workloads. Unfortunately, it is difficult for scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: the NVIDIA DGX-1 (eight Pascal...

10.1109/ipdpsw.2017.36 article EN 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2017-05-01

Automating the theory-experiment cycle requires effective distributed workflows that utilize a computing continuum spanning lab instruments, edge sensors, computing resources at multiple facilities, data sets distributed across information sources, and potentially the cloud. Unfortunately, obvious methods for constructing continuum platforms, orchestrating workflow tasks, and curating datasets over time fail to achieve scientific requirements for performance, energy, security, and reliability. Furthermore, achieving the best use of continuum resources depends...

10.48550/arxiv.2410.16093 preprint EN arXiv (Cornell University) 2024-10-21