- Parallel Computing and Optimization Techniques
- Numerical Methods and Algorithms
- Low-power high-performance VLSI design
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Digital Filter Design and Implementation
- Analog and Mixed-Signal Circuit Design
- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Radiation Effects in Electronics
- Advanced Wireless Communication Techniques
- Cryptography and Residue Arithmetic
- Particle Detector Development and Performance
- Particle physics theoretical and experimental studies
- Algorithms and Data Compression
- Computational Physics and Python Applications
- Real-Time Systems Scheduling
- Optical measurement and interference techniques
- Wireless Communication Networks Research
- Coding theory and cryptography
- VLSI and Analog Circuit Testing
- 3D IC and TSV technologies
- Advanced Memory and Neural Computing
- Polynomial and algebraic computation
- Surface Roughness and Optical Measurements
Advanced Micro Devices (Canada), 2002-2024
Advanced Micro Devices (United States), 2010-2023
University of Wisconsin–Madison, 2005-2019
IEEE Computer Society, 2013
Universidad de Málaga, 2007-2010
TU Dortmund University, 2010
Bremen Institute for Applied Beam Technology, 2007
Madison Group (United States), 2007
University of Wisconsin System, 2004-2006
Lehigh University, 1998-2005
The set-top and portable device market continues to grow, as does the demand for more performance under increasing cost, power, and thermal constraints. The integration of Graphics Processing Units (GPUs) into these devices and the emergence of general-purpose computations on graphics hardware enable a new set of highly parallel applications. In this paper, we propose and make the case for a GPU multitasking technique called spatial multitasking. Traditional multitasking techniques, such as cooperative and preemptive multitasking, partition time...
Decimal multiplication is important in many commercial applications including financial analysis, banking, tax calculation, currency conversion, insurance, and accounting. We present two novel designs for fixed-point decimal multiplication that utilize carry-save addition to reduce the critical path delay. First, a multiplier that stores a reduced number of multiplicand multiples and uses carry-save addition in the iterative portion of the design is presented. Then, a second multiplier is proposed with several notable improvements, including fast generation of multiplicand multiples that do not need to be stored,...
This article provides an overview of AMD's vision for exascale computing, and in particular, how heterogeneity will play a central role in realizing this vision. Exascale computing requires high levels of performance capabilities while staying within stringent power budgets. Using hardware optimized for specific functions is much more energy efficient than implementing those functions with general-purpose cores. However, there is a strong desire for supercomputer customers to not have to pay for custom components designed only...
State-of-the-art graphic processing units (GPUs) provide very high memory bandwidth, but the performance of many general-purpose GPU (GPGPU) workloads is still bounded by memory bandwidth. Although compression techniques have been adopted by commercial GPUs, they are only used for compressing texture and color data, not data for GPGPU workloads. Furthermore, the microarchitectural details of GPU compression are proprietary, and its benefits have not been previously published. In this paper, we first investigate the required changes to support lossless...
The challenges to push computing to exaflop levels are difficult given the desired targets for memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper presents a vision for an architecture that can be used to construct exascale systems. We describe a conceptual Exascale Node Architecture (ENA), which is the computational building block for an exascale supercomputer. The ENA consists of an Exascale Heterogeneous Processor (EHP) coupled with an advanced memory system. The EHP provides a high-performance accelerated processing unit...
Decimal multiplication is important in many commercial applications including financial analysis, banking, tax calculation, currency conversion, insurance, and accounting. This paper presents a novel design for fixed-point decimal multiplication that utilizes a simple recoding scheme to produce signed-magnitude representations of the operands, thereby greatly simplifying the process of generating partial products for each multiplier digit. The partial products are generated using a digit-by-digit multiplier on a word-by-digit basis, first in a signed-digit...
There is increasing interest in hardware support for decimal arithmetic as a result of recent growth in commercial, financial, and Internet-based applications. Consequently, new specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic. This paper introduces and analyzes three techniques for performing fast decimal addition on multiple binary coded decimal (BCD) operands. Two of the techniques speculate BCD correction values and correct intermediate results while adding the input operands. The first speculates over...
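To make the idea of a BCD correction value concrete, the following is a minimal Python sketch of the conventional, non-speculative approach: digits are added in binary and 6 is added whenever a digit sum exceeds 9. It is not one of the three techniques from the paper, and the function names (`bcd_encode`, `bcd_decode`, `bcd_add`, `bcd_sum`) are hypothetical.

```python
def bcd_encode(n, digits):
    """Pack a non-negative integer into BCD, 4 bits per decimal digit."""
    word = 0
    for i in range(digits):
        word |= (n % 10) << (4 * i)
        n //= 10
    return word


def bcd_decode(word, digits):
    """Unpack a BCD word back into a Python integer."""
    value = 0
    for i in reversed(range(digits)):
        value = value * 10 + ((word >> (4 * i)) & 0xF)
    return value


def bcd_add(a, b, digits):
    """Add two BCD words digit by digit; returns (sum_word, carry_out)."""
    result, carry = 0, 0
    for i in range(digits):
        s = ((a >> (4 * i)) & 0xF) + ((b >> (4 * i)) & 0xF) + carry
        if s > 9:
            s += 6  # correction: skip the six unused codes 1010..1111
        carry = s >> 4
        result |= (s & 0xF) << (4 * i)
    return result, carry


def bcd_sum(operands, digits):
    """Accumulate several BCD operands; any final carry-out is dropped."""
    acc = 0
    for op in operands:
        acc, _ = bcd_add(acc, op, digits)
    return acc
```

For example, `bcd_decode(bcd_sum([bcd_encode(x, 8) for x in (1234, 5678, 9999)], 8), 8)` recovers the ordinary decimal sum of the three inputs. A hardware design would apply the correction in parallel across digits rather than in a sequential loop.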
State-of-the-art graphic processing units (GPUs) can offer very high computational throughput for highly parallel applications using hundreds of integrated cores. In general, the peak throughput of a GPU is proportional to the product of the number of cores and their frequency. However, the product is often limited by a power constraint. Although the throughput can be increased with more cores for some applications, it cannot for others because the parallelism of applications and/or the bandwidth of on-chip interconnects/caches and off-chip memory are limited. In this paper, first, we demonstrate that...
Sudden variations in current (large di/dt) can lead to significant power supply voltage droops and timing errors in modern microprocessors. Several papers discuss the complexity involved with developing test programs, also known as stress marks, to stress the system. Authors of these papers produced tools and methodologies to generate stress marks automatically using techniques such as integer linear programming or genetic algorithms. However, nearly all of the previous work took place in the context of single-core systems, and results were collected...
High-throughput and low-latency sorting is a key requirement in many applications that deal with large amounts of data. This paper presents efficient techniques for designing high-throughput, low-latency sorting units. Our sorting architectures utilize modular design techniques that hierarchically construct large sorting units from smaller building blocks. The sorting units are optimized for situations in which only the M largest numbers from N inputs are needed, because this situation commonly occurs in many applications for scientific computing, data mining, network processing, digital signal...
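The hierarchical "M largest of N" idea can be sketched in software as follows. This is only an illustrative Python model under the assumption that each building block returns a sorted list of at most M values and that a merge block combines two such lists while discarding everything past the M largest; the names `top_m` and `merge_top` are hypothetical and do not come from the paper.

```python
def merge_top(a, b, m):
    """Merge two descending-sorted lists, keeping only the m largest values."""
    out, i, j = [], 0, 0
    while len(out) < m and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i] >= b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out


def top_m(values, m):
    """Return the m largest values in descending order, built hierarchically."""
    if m < 1:
        raise ValueError("m must be at least 1")
    if len(values) <= m:
        # Small enough for a single base sorting block.
        return sorted(values, reverse=True)
    mid = len(values) // 2
    # Each half produces its own top-m list; a merge block combines them.
    return merge_top(top_m(values[:mid], m), top_m(values[mid:], m), m)
```

Because every intermediate list is capped at M entries, the merge blocks stay the same size no matter how large N grows, which is the property that makes the modular construction attractive.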
The peak compute performance of GPUs has been increased by integrating more compute resources and operating them at higher frequency. However, such approaches significantly increase the power consumption of GPUs, limiting further performance increases due to the power constraint. Facing such a challenge, we propose three techniques to improve power efficiency in this paper. First, we observe that many GPGPU applications are integer-intensive. For such applications, we combine a pair of dependent integer instructions into a composite instruction that can be executed by an...
With technology scaling, manufacturers are integrating both CPU and GPU cores in a single chip to improve the throughput of emerging applications. To maximize the throughput of a single-chip heterogeneous processor (SCHP), the power budget shared between the CPU and GPU cores must be effectively utilized. At the same time, the CPU and GPU in an SCHP must each satisfy its own power constraint. Furthermore, the allocated power impacts performance. In this paper, using a detailed cycle-level simulator, we first demonstrate that joint optimization of the power budget and workload partitioning can provide 13%...
Reducing the power dissipation of parallel multipliers is important in the design of digital signal processing systems. In many of these systems, the products are rounded to avoid growth in word size. The power dissipation and area can be significantly reduced by a technique known as truncated multiplication. With this technique, the least significant columns of the multiplication matrix are not used. Instead, the carries generated by these columns are estimated. This estimate is added with the most significant columns to produce the rounded product. This paper presents the design and implementation of truncated multipliers. Simulations...
Column compression multipliers are frequently used in high-performance computer systems due to their short worst case delay. This paper examines the area, delay, and power characteristics of Dadda (1965) and Wallace (1964) column compression multipliers in deep submicron technology. Our analysis shows that one of the two multiplier types has slightly more area and approximately the same delay as the other. It also shows the importance of considering parasitic capacitances when evaluating multipliers, since parasitics can increase one of these multiplier characteristics by over 60%. As size...
Decimal floating-point multiplication is important in many commercial applications including banking, tax calculation, currency conversion, and other financial areas. This paper presents a fully parallel decimal floating-point multiplier compliant with the recent draft of the IEEE P754 Standard for Floating-point Arithmetic (IEEE P754). The novelty of the design is that it is the first decimal floating-point multiplier offering low latency and high throughput. The design is based on a previously published fixed-point multiplier, which uses alternate digit encodings to reduce area and delay....
The demand for improved SIMD floating-point performance on general-purpose x86-compatible microprocessors is rising. At the same time, there is a conflicting demand in the low-power computing market for a reduction in power consumption. Along with this, there is the absolute necessity of backward compatibility for x86-compatible microprocessors, which includes support for x87 scientific floating-point instructions. The combined effect is that there is a need for low-power, low-cost floating-point units that are still capable of delivering good SIMD performance while maintaining full x86 functionality. This paper presents...
Decimal multiplication is important in many commercial applications including financial analysis, banking, tax calculation, currency conversion, insurance, and accounting. This paper presents the design of two decimal floating-point multipliers: one whose partial product accumulation strategy employs decimal carry-save addition and one that employs binary carry-save addition. The multiplier based on decimal carry-save addition favors a nonpipelined iterative implementation. The multiplier utilizing binary carry-save addition allows for an efficient pipelined implementation when latency...
Per-core voltage domains can improve performance under a power constraint. Most commercial processors, however, only have a single voltage domain for all processor cores. This is because splitting the voltage domain into per-core voltage domains and powering them with multiple off-chip voltage regulators (VRs) incur a high cost for the platform and package designs. Although using on-chip switching VRs can be an alternative solution, integrating high-quality inductors with cores has been a technical challenge. In this paper, we propose a cost-effective power delivery...
Due to the rapid growth in financial, commercial, and Internet-based applications, there is an increasing desire to allow computers to operate on both binary and decimal floating-point numbers. Consequently, specifications for decimal floating-point support are being added to the IEEE-754 Standard for Floating-Point Arithmetic. In this paper, we present the design and implementation of a decimal floating-point adder that is compliant with the current draft revision of this standard. The adder supports operations on 64-bit (16-digit) operands. We provide synthesis results indicating...
Decimal arithmetic is often used in commercial, financial, and Internet-based applications. Due to the growing importance of decimal floating-point (DFP) arithmetic, the IEEE 754-2008 Standard for Floating-Point Arithmetic (IEEE 754-2008) includes specifications for DFP arithmetic. IBM recently announced adding DFP instructions to their POWER6, z9, and z10 microprocessor architectures. As processor support for DFP arithmetic emerges, it is important to investigate efficient algorithms and hardware designs for common DFP arithmetic operations. This paper...
Media processing applications typically involve large amounts of data-level parallelism and operate on low-precision operands. This paper presents multiplier architectures for multimedia processing and compares them to conventional general-purpose multiplier architectures in terms of area and delay. The proposed multiplier architectures support subword parallelism and additional features, which enhance their performance in multimedia applications, yet require only slightly more area and delay than multipliers for general-purpose processing.
Barrel shifters are often utilized by embedded digital signal processors and general-purpose processors to manipulate data. This paper examines design alternatives for barrel shifters that perform the following functions: shift right logical, shift right arithmetic, rotate right, shift left logical, shift left arithmetic, and rotate left. Four different barrel shifter designs are presented and compared in terms of area and delay for a variety of operand sizes. This paper also examines techniques for detecting results that overflow and results of zero in parallel with the shift or rotate operation. Several Java programs are developed to generate structural VHDL...
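As an illustration of how one datapath can serve several of the listed functions, here is a minimal Python model of a logarithmic rotate-right network whose output is masked to obtain the shifts. It assumes the word size `n` is a power of two and the shift amount lies in `0..n-1`. The structure and all function names are assumptions made for this sketch, not the four designs compared in the paper.

```python
def rotate_right(x, amount, n):
    """log2(n) stages; stage s rotates by 2**s when bit s of amount is set."""
    mask = (1 << n) - 1
    x &= mask
    stage = 0
    while (1 << stage) < n:
        if (amount >> stage) & 1:
            d = 1 << stage
            x = ((x >> d) | (x << (n - d))) & mask
        stage += 1
    return x


def rotate_left(x, amount, n):
    """A left rotate by amount equals a right rotate by n - amount."""
    return rotate_right(x, (n - amount) % n, n)


def shift_right_logical(x, amount, n):
    """Rotate right, then clear the bits that wrapped into the top."""
    mask = (1 << n) - 1
    return rotate_right(x, amount, n) & (mask >> amount)


def shift_right_arithmetic(x, amount, n):
    """Rotate right, then replace the wrapped bits with the sign bit."""
    mask = (1 << n) - 1
    keep = mask >> amount
    result = rotate_right(x, amount, n) & keep
    if (x >> (n - 1)) & 1:
        result |= mask & ~keep
    return result


def shift_left_logical(x, amount, n):
    """Rotate left, then clear the bits that wrapped into the bottom."""
    mask = (1 << n) - 1
    return rotate_left(x, amount, n) & (mask << amount) & mask
```

For instance, with `n = 8` and the input `0x96`, a rotate right by 3 gives `0xD2`, a logical shift right by 3 gives `0x12`, and an arithmetic shift right by 3 gives `0xF2`. Each stage of the `while` loop corresponds to one row of 2-to-1 multiplexers in a hardware implementation.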