Scott Mahlke

ORCID: 0000-0002-0438-0616
Research Areas
  • Parallel Computing and Optimization Techniques
  • Embedded Systems Design Techniques
  • Interconnection Networks and Systems
  • Advanced Data Storage Technologies
  • Distributed and Parallel Computing Systems
  • Radiation Effects in Electronics
  • Distributed systems and fault tolerance
  • Real-Time Systems Scheduling
  • Cloud Computing and Resource Management
  • Low-power high-performance VLSI design
  • Advanced Memory and Neural Computing
  • VLSI and Analog Circuit Testing
  • Software Testing and Debugging Techniques
  • Formal Methods in Verification
  • Advanced Neural Network Applications
  • Security and Verification in Computing
  • Petri Nets in System Modeling
  • Advanced Wireless Communication Techniques
  • Ferroelectric and Negative Capacitance Devices
  • Wireless Communication Networks Research
  • Software Reliability and Analysis Research
  • Software System Performance and Reliability
  • Advanced Malware Detection Techniques
  • IoT and Edge/Fog Computing
  • VLSI and FPGA Design Techniques

University of Michigan
2016-2025

Nvidia (United States)
2022-2023

Nvidia (United Kingdom)
2023

Institute of Electronics
2018

Ghent University Hospital
2012

Pennsylvania State University
2012

Institut national de recherche en informatique et en automatique
2012

Institut de Recherche en Informatique et Systèmes Aléatoires
2012

University of Illinois Urbana-Champaign
1991-2009

Ann Arbor Center for Independent Living
2008

article Effective compiler support for predicated execution using the hyperblock. Authors: Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, Roger Bringmann. ACM SIGMICRO Newsletter, Volume 23, Issue 1-2 (Dec. 1992), pp. 45-54. https://doi.org/10.1145/144965.144998. Online: 10 December 1992. 297 citations, 1,660 downloads.

10.1145/144965.144998 article EN ACM SIGMICRO Newsletter 1992-12-10

In this paper we introduce a runtime system that allows unmodified multi-threaded applications to use multiple machines. The system allows threads to migrate freely between machines depending on the workload. Our prototype, COMET (Code Offload by Migrating Execution Transparently), is a realization of this design built on top of the Dalvik Virtual Machine. It leverages the underlying memory model of our runtime to implement distributed shared memory (DSM) with as few interactions between machines as possible. Making use of a new VM-synchronization primitive, COMET imposes little...

10.5555/2387880.2387890 article EN Operating Systems Design and Implementation 2012-10-08
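
As a rough illustration of the kind of decision such a runtime must make, the sketch below weighs estimated remote execution time plus DSM synchronization cost against local execution. The cost model and all names are hypothetical, not COMET's actual policy.

```python
def should_migrate(local_secs: float, remote_speedup: float,
                   sync_bytes: int, net_bytes_per_sec: float) -> bool:
    """Return True if offloading this thread is estimated to pay off."""
    remote_secs = local_secs / remote_speedup
    sync_secs = sync_bytes / net_bytes_per_sec  # DSM synchronization traffic
    return remote_secs + sync_secs < local_secs

# Example: a 2.0 s region, 8x faster remotely, 5 MB of dirty heap to sync
# over a 10 MB/s link -> 0.25 s + 0.5 s = 0.75 s < 2.0 s, so migrate.
print(should_migrate(2.0, 8.0, 5_000_000, 10_000_000))
```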

As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and computation by removing redundant weights. However, when we implemented weight pruning for several popular networks on a variety of hardware platforms, we observed surprising results. For many networks, the network sparsity caused by pruning will actually hurt overall performance despite large reductions in the required multiply-accumulate operations....

10.1145/3079856.3080215 article EN 2017-06-24
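
The generic technique the paper evaluates is magnitude-based weight pruning: zero out weights whose magnitude falls below a threshold. A minimal sketch, with arbitrary threshold and sizes, assuming numpy is available:

```python
import numpy as np

def prune(weights: np.ndarray, threshold: float) -> np.ndarray:
    """Magnitude pruning: zero every weight with |w| below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
wp = prune(w, 1.0)
sparsity = 1.0 - np.count_nonzero(wp) / wp.size
print(f"sparsity: {sparsity:.1%}")
# MACs drop with sparsity, but as the abstract notes, irregular sparsity can
# still run slower than the dense layer on hardware tuned for dense GEMM.
```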

The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty...

10.1109/micro.2007.15 article EN 2007-01-01

Approximate computing, where computation accuracy is traded off for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing overabundance of information. For particular domains, such as multimedia and learning algorithms, approximation is commonly used today. We consider automation to be essential to provide transparent approximation, and we show that larger benefits can be achieved by constructing the approximation techniques to fit the underlying hardware. Our target platform is the GPU because...

10.1145/2540708.2540711 article EN 2013-12-07
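
One classic software approximation in this vein is loop perforation: skip a fraction of iterations and work on a sample. The toy sketch below only illustrates the accuracy-for-speed trade; the paper's GPU-specific techniques are more involved.

```python
def mean_exact(xs):
    return sum(xs) / len(xs)

def mean_perforated(xs, skip=2):
    """Approximate the mean from every skip-th element only."""
    sample = xs[::skip]
    return sum(sample) / len(sample)

data = [float(i % 97) for i in range(1_000_000)]
# Roughly 4x less work for a small, often unnoticeable, error in the result.
print(mean_exact(data), mean_perforated(data, skip=4))
```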

Approximate computing is an approach where reduced accuracy of results is traded off for increased speed, throughput, or both. Loss of accuracy is not permissible in all domains, but there are a growing number of data-intensive domains where the output of programs need not be perfectly correct to provide useful results or even noticeable differences to the end user. These soft domains include multimedia processing, machine learning, and data mining/analysis. An important challenge with approximate computing is transparency, to insulate both software and hardware...

10.1145/2541940.2541948 article EN 2014-02-24

article IMPACT: an architectural framework for multiple-instruction-issue processors. Authors: Pohua P. Chang (Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL), Scott A. Mahlke, William Y. Chen, Nancy J. Warter, Wen-mei W. Hwu. ISCA '91: Proceedings of the 18th Annual International Symposium on Computer Architecture, April 1991, pp. 266-275. https://doi.org/10.1145/115952.115979. Published: 1 April 1991.

10.1145/115952.115979 article EN 1991-01-01

Predicated execution is an effective technique for dealing with conditional branches in application programs. However, there are several problems associated with conventional compiler support for predicated execution. First, all paths of control are combined into a single path regardless of their frequency and size with existing if-conversion techniques. Second, speculative execution is difficult to combine with predicated execution. In this paper, we propose the use of a new structure, referred to as the hyperblock, to overcome these problems. The hyperblock is an efficient...

10.1109/micro.1992.696999 article EN 2005-08-24
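
If-conversion in miniature: both sides of a branch are computed and a predicate selects the result, removing the control dependence. The Python below only mimics the idea (a real predicated ISA uses predicate registers and select semantics); hyperblock formation applies it selectively, to frequent and profitable paths only.

```python
def branchy(a, b, p):
    if p:
        return a * 2
    else:
        return b + 1

def predicated(a, b, p):
    t = a * 2              # executed under predicate p
    f = b + 1              # executed under predicate !p
    return t if p else f   # a select, not a branch, in predicated hardware

assert branchy(3, 4, True) == predicated(3, 4, True)
assert branchy(3, 4, False) == predicated(3, 4, False)
```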

This paper describes the design and implementation of an optimizing compiler that automatically generates profile information to assist classic code optimizations. The compiler contains two new components, an execution profiler and a profile-based code optimizer, which are not commonly found in traditional optimizing compilers. The execution profiler inserts probes into the input program, executes the program for several inputs, accumulates profile information, and supplies it to the optimizer. The optimizer uses the profile information to expose optimization opportunities not visible to traditional global optimization methods....

10.1002/spe.4380211204 article EN Software Practice and Experience 1991-12-01
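
A toy version of probe-based profiling, assuming nothing about the paper's actual probe placement: counters record block execution frequencies over sample inputs, and the resulting hot counts are what a profile-based optimizer would consume.

```python
from collections import Counter

profile = Counter()

def probe(block_id: str) -> None:
    """The inserted probe: count one execution of a code region."""
    profile[block_id] += 1

def program(x: int) -> int:
    probe("entry")
    if x % 2:
        probe("odd_path")
        return 3 * x + 1
    probe("even_path")
    return x // 2

for x in range(1000):            # profiling runs over sample inputs
    program(x)
print(profile.most_common())     # hot blocks guide profile-based optimization
```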

While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing domains. Streaming applications execute as a set of independent actors that communicate data through channels. This paper presents a compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms. An integrated unfolding...

10.1145/1375581.1375596 article EN 2008-06-07
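
The streaming execution model described above can be sketched with actors and FIFO channels. This single-threaded simulation shows only the model, not the paper's multicore scheduling algorithm.

```python
from collections import deque

channel_ab, channel_bc = deque(), deque()

def producer(n: int) -> None:        # actor A: emits a stream of values
    for i in range(n):
        channel_ab.append(i)

def scale(factor: int) -> None:      # actor B: fires while input is available
    while channel_ab:
        channel_bc.append(channel_ab.popleft() * factor)

producer(5)
scale(10)
print(list(channel_bc))              # [0, 10, 20, 30, 40]
```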

The physical layer of most wireless protocols is traditionally implemented in custom hardware to satisfy the heavy computational requirements while keeping power consumption to a minimum. These implementations are time consuming to design and difficult to verify. A programmable platform capable of supporting a software implementation of the physical layer, or software defined radio, has a number of advantages. These include support for multiple protocols, faster time-to-market, higher chip volumes, and support for late implementation changes. The challenge is to achieve this...

10.1145/1150019.1136494 article EN ACM SIGARCH Computer Architecture News 2006-05-01

Aggressive technology scaling provides designers with an ever increasing budget of cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in individual device reliability as transistors become increasingly susceptible to soft errors. We are quickly approaching a new era where resilience to soft errors is no longer a luxury that can be reserved for just the processors in high-reliability, mission-critical domains. Even processors used in mainstream computing will soon require protection. However,...

10.1145/1735971.1736063 article EN ACM SIGPLAN Notices 2010-03-05

Application-specific instruction set extensions are an effective way of improving the performance of processors. Critical computation subgraphs can be accelerated by collapsing them into new instructions that are executed on specialized function units. Collapsing subgraphs simultaneously reduces the length of computation as well as the number of intermediate results stored in the register file. The main problem with this approach is that a new processor must be generated for each application domain. While the new instructions can be designed automatically, there is substantial...

10.1109/micro.2004.5 article EN 2005-12-13
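
Subgraph collapsing in miniature: a matched dataflow pattern, here a multiply feeding an add, is replaced by one fused operation, eliminating the intermediate result. Real systems match subgraphs in compiler IR; this toy matcher over nested tuples is purely illustrative.

```python
def collapse_madd(expr):
    """Rewrite ("add", ("mul", a, b), c) into a single fused op."""
    if isinstance(expr, tuple) and expr[0] == "add":
        lhs, rhs = expr[1], expr[2]
        if isinstance(lhs, tuple) and lhs[0] == "mul":
            return ("madd", lhs[1], lhs[2], rhs)  # one op, no temporary
    return expr

print(collapse_madd(("add", ("mul", "a", "b"), "c")))
# -> ('madd', 'a', 'b', 'c')
```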

Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs consist of an array of function units and register files, often organized as a two dimensional grid. The most difficult challenge in deploying CGRAs is the compiler scheduling technology that can efficiently map software implementations of compute intensive loops onto the array. Traditional schedulers focus on the placement...

10.1145/1454115.1454140 article EN 2008-10-25
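
Modulo scheduling, the standard framework such loop mappers build on, starts from the minimum initiation interval (MII): a resource bound of ceil(ops / function units) and a recurrence bound from loop-carried dependence cycles. A sketch using the textbook formulas; the example numbers are made up.

```python
import math

def res_mii(num_ops: int, num_fus: int) -> int:
    """Resource-constrained lower bound on the initiation interval."""
    return math.ceil(num_ops / num_fus)

def rec_mii(cycles) -> int:
    """Recurrence bound: cycles is a list of (latency, dependence distance)."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

# 14 ops on a 4x4 array of 16 FUs, one recurrence of latency 6 spanning
# 2 iterations -> II must be at least max(1, 3) = 3.
print(max(res_mii(14, 16), rec_mii([(6, 2)])))
```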

Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or co-processors) and the corresponding instructions, is added to a baseline processor to meet the critical computational demands of a target application. The central challenge with this approach is the large degree of human effort required to identify and create the custom hardware units, as well as to port the application to the extended processor. In this paper, we...

10.5555/956417.956538 article EN International Symposium on Microarchitecture 2003-12-03

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this paper, we examine the challenges of designing complex computing systems in the presence of permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail, a single...

10.1109/hpca.2006.1598108 article EN 2006-03-21

Technology scaling, characterized by decreasing feature size, thinning gate oxide, and non-ideal voltage scaling, will become a major hindrance to microprocessor reliability in future technology generations. Physical analysis of device failure mechanisms has shown that most wearout mechanisms projected to plague future technology generations are progressive, meaning the circuit-level effects of wearout develop and intensify with age over the lifetime of the chip. This work leverages the progression of wearout over time in order to present a low-cost hardware structure that identifies...

10.1109/micro.2007.35 article EN 2007-01-01
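
A software analogue of such an identifier, with made-up constants: compare a fast moving average of a module's observed latency against a slow long-term baseline and flag sustained upward drift, the signature of progressive wearout the abstract describes.

```python
def detect_wearout(samples, fast=0.2, slow=0.01, tolerance=1.05):
    """Flag when recent latency drifts above the long-term baseline."""
    fast_avg = slow_avg = samples[0]
    for s in samples[1:]:
        fast_avg += fast * (s - fast_avg)    # tracks recent behavior
        slow_avg += slow * (s - slow_avg)    # tracks lifetime baseline
        if fast_avg > tolerance * slow_avg:  # sustained drift beyond 5%
            return True
    return False

healthy = [1.00] * 500
aging = [1.00 + 0.002 * i for i in range(500)]  # latency creeps upward
print(detect_wearout(healthy), detect_wearout(aging))  # False True
```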

The demand for multitasking on graphics processing units (GPUs) is constantly increasing as they have become one of the default components of modern computer systems along with traditional processors (CPUs). Preemptive multitasking on CPUs has been primarily supported through context switching. However, the same preemption strategy incurs substantial overhead due to the large context in GPUs. The overhead comes in two dimensions: a preempting kernel suffers from long preemption latency, and system throughput is wasted during the switch. Without precise...

10.1145/2694344.2694346 article EN 2015-03-03

Heterogeneous multicore systems -- comprised of multiple cores with varying capabilities, performance, and energy characteristics -- have emerged as a promising approach to increasing energy efficiency. Such systems reduce energy consumption by identifying phase changes in an application and migrating execution to the most efficient core that meets its current performance requirements. However, due to the overhead of switching between cores, migration opportunities are limited to coarse-grained phases (hundreds of millions of instructions),...

10.1109/micro.2012.37 article EN 2012-12-01

Approximate computing can be employed for an emerging class of applications from various domains such as multimedia, machine learning, and computer vision. The approximated output of such applications, even though not 100% numerically correct, is often either useful or the difference is unnoticeable to the end user. This opens up a new design dimension to trade off application performance and energy consumption with output correctness. However, a largely unaddressed challenge is quality control: how to ensure the user experience...

10.1145/2749469.2750371 article EN 2015-05-26
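
One simple quality-control policy, not necessarily the paper's: approximate everything, exactly re-check a random sample, and fall back to exact execution when observed error exceeds a target. The error model and thresholds below are invented for illustration.

```python
import random

def exact(x):
    return x * x

def approx(x):
    # stand-in approximate kernel with a ~2% relative error model
    return x * x * (1 + random.uniform(-0.02, 0.02))

def run_with_quality_check(xs, sample_rate=0.05, max_rel_err=0.015):
    outputs = [approx(x) for x in xs]
    checked = random.sample(range(len(xs)), max(1, int(len(xs) * sample_rate)))
    errs = [abs(outputs[i] - exact(xs[i])) / abs(exact(xs[i]))
            for i in checked]
    if sum(errs) / len(errs) > max_rel_err:
        return [exact(x) for x in xs]    # quality miss: recompute exactly
    return outputs

print(len(run_with_quality_check([float(i) for i in range(1, 1001)])))
```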

Recent developments in Non-Volatile Memories (NVMs) have opened up a new horizon for in-memory computing. Despite the significant performance gain offered by computational NVMs, previous work has relied on manual mapping of specialized kernels to the memory arrays, making it infeasible to execute more general workloads. We combat this problem by proposing a programmable in-memory processor architecture and data-parallel programming framework. The efficiency of the proposed processor comes from two sources: massive parallelism...

10.1145/3173162.3173171 article EN 2018-03-19

Duality Cache is an in-cache computation architecture that enables general purpose data parallel applications to run on caches. This paper presents a holistic approach to building the system stack: techniques for performing floating point arithmetic and transcendental functions, a data-parallel execution model, a compiler that accepts existing CUDA programs, and flexibility in adapting to various workload characteristics.

10.1145/3307650.3322257 article EN 2019-06-14