Craig Stunkel

ORCID: 0000-0002-8265-933X
Research Areas
  • Interconnection Networks and Systems
  • Parallel Computing and Optimization Techniques
  • Distributed systems and fault tolerance
  • Distributed and Parallel Computing Systems
  • Embedded Systems Design Techniques
  • Advanced Data Storage Technologies
  • Software-Defined Networks and 5G
  • Cloud Computing and Resource Management
  • Advanced Optical Network Technologies
  • VLSI and Analog Circuit Testing
  • Advanced Memory and Neural Computing
  • Network Time Synchronization Technologies
  • Optical Network Technologies
  • Supercapacitor Materials and Fabrication
  • Advancements in Battery Materials
  • Real-Time Systems Scheduling
  • Scientific Computing and Data Management
  • Cellular Automata and Applications
  • Advanced Queuing Theory Analysis
  • Wireless Communication Networks Research
  • Peer-to-Peer Network Technologies
  • VLSI and FPGA Design Techniques
  • Network Packet Processing and Optimization
  • Network Traffic and Congestion Control
  • Cybersecurity and Information Systems

Nvidia (United States)
2022-2024

Los Alamos National Laboratory
2023

University of Tennessee at Knoxville
2023

Konkuk University Medical Center
2022

Ericsson (Hungary)
2021

The Ohio State University
2002-2020

Lawrence Berkeley National Laboratory
2020

Polytechnic University of Turin
2020

University of Pittsburgh
2020

University of Castilla-La Mancha
2020

The heart of an IBM SP2™ system is the High-Performance Switch, which is a low-latency, high-bandwidth switching network that binds together RISC System/6000® processors. The switch incorporates a unique combination of topology and architectural features to scale aggregate bandwidth, enhance reliability, and simplify cabling. It is a bidirectional multistage interconnect subsystem driven by a common oscillator, and delivers both data and service packets over the same links. Switching elements contain a dynamically allocated...

10.1147/sj.342.0185 article EN IBM Systems Journal 1995-01-01

The interconnect plays a key role in both the cost and performance of large-scale HPC systems. The cost of future high-bandwidth electronic interconnects mushrooms due to the expensive optical transceivers needed between switches. We describe a potentially cheaper and more power-efficient approach to building high-performance interconnects. Through empirical analysis of applications, we find that the bulk of inter-processor communication (barring collectives) is bounded in degree and changes very slowly or never. Thus we propose...

10.1109/sc.2005.48 article EN 2005-12-22

This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems. UCX provides the ability to tailor its functionality to suit a wide variety of application domains and hardware. We envision these APIs to satisfy the networking needs of many programming models such as Message Passing Interface...

10.1109/hoti.2015.13 article EN 2015-08-01

The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube using a novel scheme of algorithm-based error detection. System-level mechanisms have been implemented for three applications on a 16-processor Intel iPSC multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. Extensive studies have been done on coverage...

10.1109/12.57055 article EN IEEE Transactions on Computers 1990-01-01
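
The abstract above names algorithm-based error detection for matrix multiplication but does not spell out the encoding. As a rough illustration only, the sketch below assumes the classic row/column checksum scheme, which may differ from what the authors ran on the iPSC hypercube. All function names here are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full, tol=1e-9):
    """Check a full-checksum product of shape (n+1) x (m+1).

    Returns None when every checksum agrees, or the (row, col) of a
    single corrupted data element. Raises ValueError when the mismatch
    pattern does not point at exactly one element (assumed out of scope).
    """
    n = len(c_full) - 1
    m = len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error pattern is not a single corrupted element")


# Encode the inputs, multiply, then verify the checksums of the result.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(with_column_checksum(A), with_row_checksum(B))
clean_result = locate_error(C_full)

# Simulate a faulty processor corrupting one element of the product.
C_full[1][0] += 5
faulty_result = locate_error(C_full)
```

In this sketch the checks cost one extra row and column of arithmetic, and a single wrong element shows up as exactly one failing row check and one failing column check, which is what lets the fault be located as well as detected.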

The IBM scalable POWERparallel systems 9076 SP1 connects RISC System/6000 processors via a communication network called the high-performance switch. This switch, based upon the Vulcan parallel processor, incorporates a number of unusual features to enhance reliability, diagnose faults, and simplify cabling. The paper examines the switch architecture and implementation and overviews the support software. The switch is a bidirectional MIN, and provides at least 4 usable redundant paths for most pairs of communicating nodes.

10.1109/shpcc.1994.296638 article EN 2002-12-17

IBM's recently announced Scalable POWERparallel family of systems is based upon the Vulcan architecture, and the currently available 9076 SP1 parallel system utilizes fundamental Vulcan technology. The experimental Vulcan processor is designed to scale to many thousands of microprocessor-based nodes. To support a machine of this size, the nodes and network incorporate a number of unusual features to scale aggregate bandwidth, enhance reliability, diagnose faults, and simplify cabling. The multistage network is a unified data and service network driven by a single oscillator....

10.1109/ipps.1994.288290 article EN 2002-12-17

Trace-driven simulation is an important aid in the performance analysis of computer systems. Capturing address traces for these simulations is a difficult problem for single processors and particularly for multicomputers. Even when existing trace methods can be used on multicomputers, the amount of collected data typically grows with the number of processors, so I/O and storage costs increase. A new technique is presented in this paper which modifies the executable code to dynamically collect the trace from the user code and analyzes it during execution...

10.1145/75108.75380 article EN 1989-04-01

Oak Ridge National Laboratory's Summit supercomputer and Lawrence Livermore National Laboratory's Sierra supercomputer utilize an InfiniBand interconnect in a Fat-tree network topology, interconnecting all compute nodes, storage nodes, administration, and management nodes into one linearly scalable network. These networks are based on Mellanox 100-Gb/s EDR InfiniBand ConnectX-5 adapters and Switch-IB2 switches, with compute-rack packaging and cooling contributions from IBM. These devices support in-network computing acceleration engines such as Scalable...

10.1147/jrd.2020.2967330 article EN IBM Journal of Research and Development 2020-05-01

Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to the collection of traces for both shared and distributed memory computers are highlighted. Five general categories of address-trace methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to multiprocessors are examined separately.

10.1109/2.67191 article EN Computer 1991-01-01

Multidestination message passing has been proposed as an attractive mechanism for efficiently implementing multicast and other collective operations on direct networks. However, applying this mechanism to switch-based parallel systems is non-trivial. In this paper we propose alternative switch architectures with differing buffer organizations to implement multidestination worms on switch-based parallel systems. First, we discuss issues related to such an implementation (deadlock-freedom, replication mechanisms, header encoding, and routing)....

10.1145/264107.264129 article EN 1997-05-01

This paper proposes a new approach for implementing fast multicast and broadcast in unidirectional and bidirectional multistage interconnection networks (MINs) with multiport encoded multidestination worms. For a MIN with n stages, such worms use n header flits each. One flit is used for each stage of the network and it indicates the output ports to which the message needs to be replicated. A worm with (d_1, d_2, ..., d_n, 1 ≤ d_i ≤ k) degrees of replication for the respective stages is capable of covering d_1...

10.1109/71.730529 article EN IEEE Transactions on Parallel and Distributed Systems 1998-01-01
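
The abstract above describes one header flit per stage, each naming the output ports at which the message is replicated. The sketch below is only an illustration of that idea. It assumes each flit is a k-bit port mask and that the network uses destination-tag routing, in which stage i selects its output port from digit i of the base-k destination address. Neither assumption is confirmed by the abstract, and `covered_destinations` is a hypothetical helper.

```python
from itertools import product


def covered_destinations(header_flits, k):
    """List the destinations reached by one multiport encoded worm.

    header_flits holds one bitmask per stage; bit p set means the switch
    at that stage replicates the message to output port p. Destination
    addresses are formed from one chosen port per stage, read as base-k
    digits (an assumed destination-tag routing model).
    """
    port_sets = []
    for mask in header_flits:
        if not 0 < mask < (1 << k):
            raise ValueError("each flit must select 1 to k of the k ports")
        port_sets.append([p for p in range(k) if (mask >> p) & 1])

    destinations = []
    for digits in product(*port_sets):
        address = 0
        for digit in digits:
            address = address * k + digit
        destinations.append(address)
    return sorted(destinations)


# Two stages of 4x4 switches: ports {0, 1} at stage 1, ports {0, 2} at stage 2.
reached = covered_destinations([0b0011, 0b0101], k=4)
```

Under these assumptions the worm reaches the Cartesian product of the per-stage port sets, so the number of destinations is the product of the per-stage replication degrees (here 2 × 2 = 4).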

A discussion is presented of a fault-tolerant hypercube multiprocessor architecture which uses a novel algorithm-based fault-detection approach for identifying faulty processors. The scheme involves the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube. The authors have implemented system-level mechanisms for various applications on a 16-processor Intel iPSC multiprocessor. They report results for two applications: matrix multiplication and fast Fourier transform. They performed...

10.1109/ftcs.1988.5344 article EN 1988-01-01

We propose Optical Circuit Switching for dynamically creating reconfigurable partitions in large-scale systems with Dragonfly networks. Up to 2x execution-time improvement is demonstrated for global traffic patterns in a >13,000-node system using a production-grade network simulator.

10.1364/ofc.2016.w3j.3 article EN Optical Fiber Communication Conference 2016-01-01

We describe the adaptive source routing (ASR) method, which is a first attempt to combine adaptive routing and source routing methods. In ASR, the adaptivity of each packet is determined at the source processor. Every packet can be routed in a fully adaptive, partially adaptive, or non-adaptive manner, all within the same network at the same time. We evaluate and compare the performance of the proposed networks and oblivious networks by simulations. We also describe a route generation algorithm that determines maximally adaptive routes in multistage networks.

10.1109/ipps.1996.508067 article EN Proceedings of the International Conference on Parallel Processing 2002-12-23

Switch-based interconnects are used in a number of application domains, including parallel system interconnects, local area networks, and wide area networks. However, very few switches have been designed that are suitable for more than one of these domains. Such a switch must offer both extremely low latency and high throughput for a variety of different message sizes. While some architectures with output queuing have been shown to perform well in terms of throughput, their performance can suffer when used in systems where a significant...

10.1109/71.993207 article EN IEEE Transactions on Parallel and Distributed Systems 2002-03-01

A new switch chip for IBM RS/6000 SP systems. Authors: Craig B. Stunkel (T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY), Jay Herring (Server Division, 522 South Road, Poughkeepsie), Bulent Abali, Rajeev Sivaram. SC '99: Proceedings of the 1999 ACM/IEEE conference on Supercomputing, January 1999, Pages 16–es.

10.1145/331532.331548 article EN 1999-01-01

Trace-driven simulation is an important aid in the performance analysis of computer systems. Capturing address traces for these simulations is a difficult problem for single processors and particularly for multicomputers. Even when existing trace methods can be used on multicomputers, the amount of collected data typically grows with the number of processors, so I/O and storage costs increase. A new technique is presented in this paper which modifies the executable code to dynamically collect the trace from the user code and analyzes it during execution...

10.1145/75372.75380 article EN ACM SIGMETRICS Performance Evaluation Review 1989-04-01

Large, sparse, linear systems of equations arise frequently when constructing mathematical models of natural phenomena. Most often, these systems are fully constrained and can be solved via direct or iterative techniques. However, one important problem class requires solutions to underconstrained systems that maximize some objective function. These optimization problems are formulations of many business plans and often contain hundreds of equations with thousands of variables. Historically, these problems have been solved via the simplex method. Despite...

10.1145/63047.63104 article EN 1988-01-01

With the end of Dennard scaling, specializing and distributing compute engines throughout the system is a promising technique to improve application performance. For example, NVIDIA's BlueField Data Processing Unit (DPU) integrates programmable processing elements within the network and offers specialized capabilities. These capabilities enable communication via offloads onto DPUs and present new application opportunities for offloading nonblocking or complex communication patterns such as collective operations. This...

10.23919/isc.2024.10528935 article EN 2024-05-01

Barrier synchronization is a crucial operation for parallel systems. Many schemes have been proposed in the literature to achieve fast barrier synchronization through software, hardware, or a combination of these mechanisms. However, few of these schemes emphasize fault-tolerant operations. In this paper, we describe inexpensive support that can be added to network switches for achieving reliable hardware-based barrier synchronization while recovering from lost or corrupted messages. Necessary modifications to the switch architecture and associated message-passing...

10.1109/ipps.1997.580908 article EN 2002-11-22

We survey network topologies, in particular networks with full all-to-all bandwidth scaling. For a more detailed study, we select several recently introduced, promising networks that are cheaper than a 3-level Fat-tree. Through a combination of analysis and simulation on selected supercomputer workloads, we compare these networks according to desirable properties such as robust performance, low cost, and partitionability. We conclude with observations for future systems.

10.5555/3019057.3019059 article EN 2016-11-13

Recent research has proposed methods for enhancing the performance of multicast in networks with irregular topologies. These methods fall into two broad categories: (a) network interface (NI) based schemes that make use of enhanced functionality of the software/firmware running at the NI processor; and (b) switch-based schemes that use enhancements to the switch architecture to support hardware multicast. However, it is not clear how these methods compare with each other and when it makes sense to use one over the other. In order to answer such questions, we perform a...

10.1109/icpp.1998.708517 article EN 2002-11-27

This paper proposes a new approach for implementing fast multicast and broadcast in multistage interconnection networks (MINs) with multiport encoded multidestination worms. For a MIN with k×k switches and n stages, such worms use n header flits each. One flit is used for each stage of the network and it indicates the output ports to which the message must be replicated. A single worm has the capability to cover a large number of destinations with a single communication startup. A switch architecture is proposed to support such worms without deadlock. Grouping...

10.1109/spdp.1996.570314 article EN 2002-12-23

We survey network topologies, in particular networks with full all-to-all bandwidth scaling. For a more detailed study, we select several recently introduced, promising networks that are cheaper than a 3-level Fat-tree. Through a combination of analysis and simulation on selected supercomputer workloads, we compare these networks according to desirable properties such as robust performance, low cost, and partitionability. We conclude with observations for future systems.

10.1109/pmbs.2016.007 article EN 2016-11-01

10.1016/0141-9331(92)90067-4 article EN Microprocessors and Microsystems 1992-01-01