Scott Mahlke

ORCID: 0000-0002-0438-0616
Research Areas
  • Parallel Computing and Optimization Techniques
  • Embedded Systems Design Techniques
  • Interconnection Networks and Systems
  • Advanced Data Storage Technologies
  • Distributed and Parallel Computing Systems
  • Radiation Effects in Electronics
  • Distributed systems and fault tolerance
  • Real-Time Systems Scheduling
  • Cloud Computing and Resource Management
  • Low-power high-performance VLSI design
  • Advanced Memory and Neural Computing
  • VLSI and Analog Circuit Testing
  • Software Testing and Debugging Techniques
  • Formal Methods in Verification
  • Advanced Neural Network Applications
  • Security and Verification in Computing
  • Petri Nets in System Modeling
  • Advanced Wireless Communication Techniques
  • Ferroelectric and Negative Capacitance Devices
  • Wireless Communication Networks Research
  • Software Reliability and Analysis Research
  • Software System Performance and Reliability
  • Advanced Malware Detection Techniques
  • IoT and Edge/Fog Computing
  • VLSI and FPGA Design Techniques

University of Michigan
2016-2025

Nvidia (United States)
2022-2023

Nvidia (United Kingdom)
2023

Institute of Electronics
2018

Ghent University Hospital
2012

Pennsylvania State University
2012

Institut national de recherche en informatique et en automatique
2012

Institut de Recherche en Informatique et Systèmes Aléatoires
2012

University of Illinois Urbana-Champaign
1991-2009

Ann Arbor Center for Independent Living
2008

article Effective compiler support for predicated execution using the hyperblock. Authors: Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, Roger Bringmann. ACM SIGMICRO Newsletter, Volume 23, Issue 1-2 (Dec. 1992), pp. 45-54. https://doi.org/10.1145/144965.144998. Online: 10 December 1992. 297 citations, 1,660 downloads.

10.1145/144965.144998 article EN ACM SIGMICRO Newsletter 1992-12-10

In this paper we introduce a runtime system that allows unmodified multi-threaded applications to use multiple machines. The system allows threads to migrate freely between machines depending on the workload. Our prototype, COMET (Code Offload by Migrating Execution Transparently), is a realization of this design built on top of the Dalvik Virtual Machine. It leverages the underlying memory model of our runtime to implement distributed shared memory (DSM) with as few interactions between machines as possible. Making use of a new VM-synchronization primitive, COMET imposes little...

10.5555/2387880.2387890 article EN Operating Systems Design and Implementation 2012-10-08
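
As a rough illustration of the kind of decision such a runtime must make, the sketch below weighs estimated remote execution time plus DSM synchronization cost against local execution. The cost model and all names are hypothetical, not COMET's actual policy.

```python
def should_migrate(local_secs: float, remote_speedup: float,
                   sync_bytes: int, net_bytes_per_sec: float) -> bool:
    """Return True if offloading this thread is estimated to pay off."""
    remote_secs = local_secs / remote_speedup
    sync_secs = sync_bytes / net_bytes_per_sec  # DSM synchronization traffic
    return remote_secs + sync_secs < local_secs

# Example: a 2.0 s region, 8x faster remotely, 5 MB of dirty heap to sync
# over a 10 MB/s link -> 0.25 s + 0.5 s = 0.75 s < 2.0 s, so migrate.
print(should_migrate(2.0, 8.0, 5_000_000, 10_000_000))
```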

As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and computation by removing redundant weights. However, when we implemented weight pruning for several popular networks on a variety of hardware platforms, we observed surprising results. For many networks, the network sparsity caused by pruning will actually hurt overall performance despite large reductions in the required multiply-accumulate operations....

10.1145/3079856.3080215 article EN 2017-06-24
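
The generic technique the paper evaluates is magnitude-based weight pruning: zero out weights whose magnitude falls below a threshold. A minimal sketch, with arbitrary threshold and sizes, assuming numpy is available:

```python
import numpy as np

def prune(weights: np.ndarray, threshold: float) -> np.ndarray:
    """Magnitude pruning: zero every weight with |w| below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
wp = prune(w, 1.0)
sparsity = 1.0 - np.count_nonzero(wp) / wp.size
print(f"sparsity: {sparsity:.1%}")
# MACs drop with sparsity, but as the abstract notes, irregular sparsity can
# still run slower than the dense layer on hardware tuned for dense GEMM.
```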

The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty...

10.1109/micro.2007.15 article EN 2007-01-01

Approximate computing, where computation accuracy is traded off for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing overabundance of information. For particular domains, such as multimedia and learning algorithms, approximation is commonly used today. We consider automation to be essential to provide transparent approximation, and we show that larger benefits can be achieved by constructing the approximation techniques to fit the underlying hardware. Our target platform is the GPU because...

10.1145/2540708.2540711 article EN 2013-12-07
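
One classic software approximation in this vein is loop perforation: skip a fraction of iterations and work on a sample. The toy sketch below only illustrates the accuracy-for-speed trade; the paper's GPU-specific techniques are more involved.

```python
def mean_exact(xs):
    return sum(xs) / len(xs)

def mean_perforated(xs, skip=2):
    """Approximate the mean from every skip-th element only."""
    sample = xs[::skip]
    return sum(sample) / len(sample)

data = [float(i % 97) for i in range(1_000_000)]
# Roughly 4x less work for a small, often unnoticeable, error in the result.
print(mean_exact(data), mean_perforated(data, skip=4))
```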

Approximate computing is an approach where reduced accuracy of results is traded off for increased speed, throughput, or both. Loss of accuracy is not permissible in all domains, but there are a growing number of data-intensive domains where the output of programs need not be perfectly correct to provide useful results or even noticeable differences to the end user. These soft domains include multimedia processing, machine learning, and data mining/analysis. An important challenge with approximate computing is transparency, to insulate both software and hardware...

10.1145/2541940.2541948 article EN 2014-02-24

article IMPACT: an architectural framework for multiple-instruction-issue processors. Authors: Pohua P. Chang (Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL), Scott A. Mahlke, William Y. Chen, Nancy J. Warter, Wen-mei W. Hwu. ISCA '91: Proceedings of the 18th Annual International Symposium on Computer Architecture, April 1991, pp. 266-275. https://doi.org/10.1145/115952.115979. Published: 1 April 1991.

10.1145/115952.115979 article EN 1991-01-01

Predicated execution is an effective technique for dealing with conditional branches in application programs. However, there are several problems associated with conventional compiler support for predicated execution. First, all paths of control are combined into a single path regardless of their frequency and size with existing if-conversion techniques. Second, speculative execution is difficult to combine with predicated execution. In this paper, we propose the use of a new structure, referred to as the hyperblock, to overcome these problems. The hyperblock is an efficient...

10.1109/micro.1992.696999 article EN 2005-08-24
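
If-conversion in miniature: both sides of a branch are computed and a predicate selects the result, removing the control dependence. The Python below only mimics the idea (a real predicated ISA uses predicate registers and select semantics); hyperblock formation applies it selectively, to frequent and profitable paths only.

```python
def branchy(a, b, p):
    if p:
        return a * 2
    else:
        return b + 1

def predicated(a, b, p):
    t = a * 2              # executed under predicate p
    f = b + 1              # executed under predicate !p
    return t if p else f   # a select, not a branch, in predicated hardware

assert branchy(3, 4, True) == predicated(3, 4, True)
assert branchy(3, 4, False) == predicated(3, 4, False)
```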

This paper describes the design and implementation of an optimizing compiler that automatically generates profile information to assist classic code optimizations. The compiler contains two new components, an execution profiler and a profile-based code optimizer, which are not commonly found in traditional optimizing compilers. The execution profiler inserts probes into the input program, executes the program for several inputs, accumulates profile information, and supplies it to the optimizer. The optimizer uses the profile information to expose optimization opportunities not visible to traditional global optimization methods....

10.1002/spe.4380211204 article EN Software Practice and Experience 1991-12-01
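
A toy version of probe-based profiling, assuming nothing about the paper's actual probe placement: counters record block execution frequencies over sample inputs, and the resulting hot counts are what a profile-based optimizer would consume.

```python
from collections import Counter

profile = Counter()

def probe(block_id: str) -> None:
    """The inserted probe: count one execution of a code region."""
    profile[block_id] += 1

def program(x: int) -> int:
    probe("entry")
    if x % 2:
        probe("odd_path")
        return 3 * x + 1
    probe("even_path")
    return x // 2

for x in range(1000):            # profiling runs over sample inputs
    program(x)
print(profile.most_common())     # hot blocks guide profile-based optimization
```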

While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing domains. Streaming applications execute as a set of independent actors that communicate data through channels. This paper presents a compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms. An integrated unfolding...

10.1145/1375581.1375596 article EN 2008-06-07
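
The streaming execution model described above can be sketched with actors and FIFO channels. This single-threaded simulation shows only the model, not the paper's multicore scheduling algorithm.

```python
from collections import deque

channel_ab, channel_bc = deque(), deque()

def producer(n: int) -> None:        # actor A: emits a stream of values
    for i in range(n):
        channel_ab.append(i)

def scale(factor: int) -> None:      # actor B: fires while input is available
    while channel_ab:
        channel_bc.append(channel_ab.popleft() * factor)

producer(5)
scale(10)
print(list(channel_bc))              # [0, 10, 20, 30, 40]
```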

The physical layer of most wireless protocols is traditionally implemented in custom hardware to satisfy the heavy computational requirements while keeping power consumption to a minimum. These implementations are time consuming to design and difficult to verify. A programmable platform capable of supporting a software implementation of the physical layer, or software defined radio, has a number of advantages. These include support for multiple protocols, faster time-to-market, higher chip volumes, and support for late implementation changes. The challenge is to achieve this...

10.1145/1150019.1136494 article EN ACM SIGARCH Computer Architecture News 2006-05-01

Aggressive technology scaling provides designers with an ever increasing budget of cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in individual device reliability as transistors become increasingly susceptible to soft errors. We are quickly approaching a new era where resilience to soft errors is no longer a luxury that can be reserved for just the processors in high-reliability, mission-critical domains. Even processors used in mainstream computing will soon require protection. However,...

10.1145/1735971.1736063 article EN ACM SIGPLAN Notices 2010-03-05

Application-specific instruction set extensions are an effective way of improving the performance of processors. Critical computation subgraphs can be accelerated by collapsing them into new instructions that are executed on specialized function units. Collapsing subgraphs simultaneously reduces the length of computation as well as the number of intermediate results stored in the register file. The main problem with this approach is that a new processor must be generated for each application domain. While the new instructions can be designed automatically, there is substantial...

10.1109/micro.2004.5 article EN 2005-12-13
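
Subgraph collapsing in miniature: a matched dataflow pattern, here a multiply feeding an add, is replaced by one fused operation, eliminating the intermediate result. Real systems match subgraphs in compiler IR; this toy matcher over nested tuples is purely illustrative.

```python
def collapse_madd(expr):
    """Rewrite ("add", ("mul", a, b), c) into a single fused op."""
    if isinstance(expr, tuple) and expr[0] == "add":
        lhs, rhs = expr[1], expr[2]
        if isinstance(lhs, tuple) and lhs[0] == "mul":
            return ("madd", lhs[1], lhs[2], rhs)  # one op, no temporary
    return expr

print(collapse_madd(("add", ("mul", "a", "b"), "c")))
# -> ('madd', 'a', 'b', 'c')
```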

Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs consist of an array of function units and register files, often organized as a two dimensional grid. The most difficult challenge in deploying CGRAs is the compiler scheduling technology that can efficiently map software implementations of compute intensive loops onto the array. Traditional schedulers focus on the placement...

10.1145/1454115.1454140 article EN 2008-10-25
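
Modulo scheduling, the standard framework such loop mappers build on, starts from the minimum initiation interval (MII): a resource bound of ceil(ops / function units) and a recurrence bound from loop-carried dependence cycles. A sketch using the textbook formulas; the example numbers are made up.

```python
import math

def res_mii(num_ops: int, num_fus: int) -> int:
    """Resource-constrained lower bound on the initiation interval."""
    return math.ceil(num_ops / num_fus)

def rec_mii(cycles) -> int:
    """Recurrence bound: cycles is a list of (latency, dependence distance)."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

# 14 ops on a 4x4 array of 16 FUs, one recurrence of latency 6 spanning
# 2 iterations -> II must be at least max(1, 3) = 3.
print(max(res_mii(14, 16), rec_mii([(6, 2)])))
```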

Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or co-processors) and the corresponding instructions, is added to a baseline processor to meet the critical computational demands of a target application. The central challenge with this approach is the large degree of human effort required to identify and create the custom hardware units, as well as to port the application to the extended processor. In this paper, we...

10.5555/956417.956538 article EN International Symposium on Microarchitecture 2003-12-03

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this paper, we examine the challenges of designing complex computing systems in the presence of permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail, a single...

10.1109/hpca.2006.1598108 article EN 2006-03-21

Technology scaling, characterized by decreasing feature size, thinning gate oxide, and non-ideal voltage scaling, will become a major hindrance to microprocessor reliability in future technology generations. Physical analysis of device failure mechanisms has shown that most wearout mechanisms projected to plague future technology generations are progressive, meaning the circuit-level effects of wearout develop and intensify with age over the lifetime of the chip. This work leverages the progression of wearout over time in order to present a low-cost hardware structure that identifies...

10.1109/micro.2007.35 article EN 2007-01-01
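
A software analogue of such an identifier, with made-up constants: compare a fast moving average of a module's observed latency against a slow long-term baseline and flag sustained upward drift, the signature of progressive wearout the abstract describes.

```python
def detect_wearout(samples, fast=0.2, slow=0.01, tolerance=1.05):
    """Flag when recent latency drifts above the long-term baseline."""
    fast_avg = slow_avg = samples[0]
    for s in samples[1:]:
        fast_avg += fast * (s - fast_avg)    # tracks recent behavior
        slow_avg += slow * (s - slow_avg)    # tracks lifetime baseline
        if fast_avg > tolerance * slow_avg:  # sustained drift beyond 5%
            return True
    return False

healthy = [1.00] * 500
aging = [1.00 + 0.002 * i for i in range(500)]  # latency creeps upward
print(detect_wearout(healthy), detect_wearout(aging))  # False True
```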

The demand for multitasking on graphics processing units (GPUs) is constantly increasing as they have become one of the default components of modern computer systems along with traditional processors (CPUs). Preemptive multitasking on CPUs has been primarily supported through context switching. However, the same preemption strategy incurs substantial overhead due to the large context in GPUs. The overhead comes in two dimensions: a preempting kernel suffers from long preemption latency, and system throughput is wasted during the switch. Without precise...

10.1145/2694344.2694346 article EN 2015-03-03

Heterogeneous multicore systems -- comprised of multiple cores with varying capabilities, performance, and energy characteristics -- have emerged as a promising approach to increasing energy efficiency. Such systems reduce energy consumption by identifying phase changes in an application and migrating execution to the most efficient core that meets its current performance requirements. However, due to the overhead of switching between cores, migration opportunities are limited to coarse-grained phases (hundreds of millions of instructions),...

10.1109/micro.2012.37 article EN 2012-12-01

Approximate computing can be employed for an emerging class of applications from various domains such as multimedia, machine learning, and computer vision. The approximated output of such applications, even though not 100% numerically correct, is often either useful or the difference is unnoticeable to the end user. This opens up a new design dimension to trade off application performance and energy consumption with output correctness. However, a largely unaddressed challenge is quality control: how to ensure the user experience...

10.1145/2749469.2750371 article EN 2015-05-26
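
One simple quality-control policy, not necessarily the paper's: approximate everything, exactly re-check a random sample, and fall back to exact execution when observed error exceeds a target. The error model and thresholds below are invented for illustration.

```python
import random

def exact(x):
    return x * x

def approx(x):
    # stand-in approximate kernel with a ~2% relative error model
    return x * x * (1 + random.uniform(-0.02, 0.02))

def run_with_quality_check(xs, sample_rate=0.05, max_rel_err=0.015):
    outputs = [approx(x) for x in xs]
    checked = random.sample(range(len(xs)), max(1, int(len(xs) * sample_rate)))
    errs = [abs(outputs[i] - exact(xs[i])) / abs(exact(xs[i]))
            for i in checked]
    if sum(errs) / len(errs) > max_rel_err:
        return [exact(x) for x in xs]    # quality miss: recompute exactly
    return outputs

print(len(run_with_quality_check([float(i) for i in range(1, 1001)])))
```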

Recent developments in Non-Volatile Memories (NVMs) have opened up a new horizon for in-memory computing. Despite the significant performance gain offered by computational NVMs, previous work has relied on manual mapping of specialized kernels to the memory arrays, making it infeasible to execute more general workloads. We combat this problem by proposing a programmable in-memory processor architecture and data-parallel programming framework. The efficiency of the proposed processor comes from two sources: massive parallelism...

10.1145/3173162.3173171 article EN 2018-03-19

Duality Cache is an in-cache computation architecture that enables general purpose data parallel applications to run on caches. This paper presents a holistic approach to building the system stack: techniques for performing floating point arithmetic and transcendental functions, a data-parallel execution model, a compiler that accepts existing CUDA programs, and flexibility in adapting to various workload characteristics.

10.1145/3307650.3322257 article EN 2019-06-14