- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Embedded Systems Design Techniques
- Network Traffic and Congestion Control
- Caching and Content Delivery
- Distributed and Parallel Computing Systems
- Software-Defined Networks and 5G
- Software Engineering Research
- Software Testing and Debugging Techniques
- Teaching and Learning Programming
- Real-Time Systems Scheduling
- Low-power high-performance VLSI design
- Age of Information Optimization
- Advanced Optical Network Technologies
- Advanced Vision and Imaging
- Computer Graphics and Visualization Techniques
- IPv6, Mobility, Handover, Networks, Security
- Algorithms and Data Compression
- Online Learning and Analytics
- Advanced Wireless Communication Techniques
- Green IT and Sustainability
- Video Coding and Compression Technologies
- Mobile Agent-Based Network Management
Rice University
2012-2023
Rensselaer Polytechnic Institute
2013
École Polytechnique Fédérale de Lausanne
2007
Purdue University West Lafayette
2007
Stanford University
1998-2003
Massachusetts Institute of Technology
2000-2002
The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the "3-D" structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40%...
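The first-ready reordering mentioned in the abstract above can be illustrated with a toy model. This is a minimal sketch, not the paper's scheduler: it assumes a single bank, assumes every reference is already queued, and uses made-up costs (ROW_HIT_CYCLES, ROW_MISS_CYCLES) and a hypothetical lookahead window.

```python
ROW_HIT_CYCLES = 1   # assumed cost of a column access to the already-open row
ROW_MISS_CYCLES = 6  # assumed cost of precharge + activate + column access


def in_order_cycles(rows):
    """Service references strictly in arrival order on a single bank."""
    open_row, cycles = None, 0
    for row in rows:
        cycles += ROW_HIT_CYCLES if row == open_row else ROW_MISS_CYCLES
        open_row = row
    return cycles


def first_ready_cycles(rows, window=4):
    """Service the oldest pending reference that hits the open row.

    If no reference in the lookahead window hits the open row, fall back
    to the oldest pending reference.
    """
    pending = list(rows)
    open_row, cycles = None, 0
    while pending:
        candidates = pending[:window]
        idx = next((i for i, r in enumerate(candidates) if r == open_row), 0)
        row = pending.pop(idx)
        cycles += ROW_HIT_CYCLES if row == open_row else ROW_MISS_CYCLES
        open_row = row
    return cycles


# A trace that alternates between two rows of the same bank.
trace = [0, 1, 0, 1, 0, 1, 0, 1]
print(in_order_cycles(trace), first_ready_cycles(trace))
```

On this alternating trace every in-order access is a row miss, while the reordered schedule groups accesses to the same row. The numbers depend entirely on the assumed costs and say nothing about the paper's measured results.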
This paper explores the relationship between domain scheduling in a virtual machine monitor (VMM) and I/O performance. Traditionally, VMM schedulers have focused on fairly sharing the processor resources among domains while leaving the scheduling of I/O resources as a secondary concern. However, this can result in poor and/or unpredictable application performance, making virtualization less desirable for applications that require efficient and consistent I/O behavior.
Hadoop is a popular open-source implementation of MapReduce for the analysis of large datasets. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem. This filesystem, HDFS, is written in Java and designed for portability across heterogeneous hardware and software platforms. This paper analyzes the performance of HDFS and uncovers several performance issues. First, architectural bottlenecks exist in the Hadoop implementation that result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Second, portability limitations prevent the Java implementation from exploiting features...
The power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors. Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 GFLOPS and sustain 18.3 GOPS on MPEG-2 encoding.
The demand for flexibility in media processing motivates the use of programmable processors. Stream processing bridges the gap between inflexible special-purpose solutions and current programmable architectures that cannot meet the computational demands of media-processing applications. The central idea behind stream processing is to organize an application into streams and kernels to expose the inherent locality and concurrency in media-processing applications. The performance of the Imagine stream processor on these applications is given.
Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, and image understanding, require arithmetic rates of up to 10^11 operations per second. As the number of arithmetic units in a processor increases to meet these demands, register storage and communication between the arithmetic units dominate the area, delay, and power of the arithmetic units. In this paper, we show that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance. We...
Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation to memory access ratio. While these characteristics are poorly matched to conventional microprocessor architectures, they are a good fit for modern VLSI technology with its high arithmetic capacity but limited global bandwidth. The stream programming model, in which an application is coded as streams of data records passing through computation kernels, exposes both parallelism and locality in media applications that can be...
The Imagine Stream Processor is a single-chip programmable media processor with 48 parallel ALUs. At 400 MHz, this translates to a peak arithmetic rate of 16 GFLOPS on single-precision data and 32 GOPS on 16-bit fixed-point data. The scalability of Imagine's programming model and architecture enable it to achieve such high arithmetic rates. Imagine executes applications that have been mapped to the stream programming model. The stream model decomposes applications into a set of computation kernels that operate on data streams. This mapping exposes the inherent locality and parallelism in the application,...
This paper explores the design space of MMU caches that accelerate virtual-to-physical address translation in processor architectures, such as x86-64, that use a radix tree page table. In particular, these caches accelerate the page table walk that occurs after a miss in the Translation Lookaside Buffer. This paper shows that the most effective MMU caches are translation caches, which store partial translations and allow the page walk hardware to skip one or more levels of the page table.
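The idea of caching partial translations so that a walk can skip levels can be sketched in a few lines. This is an illustrative model under stated assumptions, not the hardware design from the paper: the page table is a nested dict, the cache is an unbounded dict keyed by index prefixes, and the helper names (indices, map_page, walk) are hypothetical.

```python
LEVELS = 4           # x86-64 uses a four-level radix tree
BITS_PER_LEVEL = 9   # 512 entries per table
PAGE_SHIFT = 12      # 4 KiB pages


def indices(vaddr):
    """Split a virtual address into per-level table indices, top level first."""
    vpn = vaddr >> PAGE_SHIFT
    return tuple(
        (vpn >> (BITS_PER_LEVEL * (LEVELS - 1 - i))) & 0x1FF
        for i in range(LEVELS)
    )


def map_page(root, vaddr, frame):
    """Install a mapping, creating intermediate tables as needed."""
    idx = indices(vaddr)
    node = root
    for i in idx[:-1]:
        node = node.setdefault(i, {})
    node[idx[-1]] = frame


def walk(root, vaddr, cache):
    """Return (frame, page-table reads) for one walk after a TLB miss."""
    idx = indices(vaddr)
    node, start = root, 0
    # The longest cached prefix lets the walk skip that many levels.
    for depth in range(LEVELS - 1, 0, -1):
        if idx[:depth] in cache:
            node, start = cache[idx[:depth]], depth
            break
    reads = 0
    for level in range(start, LEVELS):
        node = node[idx[level]]
        reads += 1
        if level < LEVELS - 1:
            cache[idx[:level + 1]] = node  # remember the partial translation
    return node, reads


root, cache = {}, {}
map_page(root, 0x1000, 7)
map_page(root, 0x2000, 9)
print(walk(root, 0x1000, cache))  # cold walk reads all four levels
print(walk(root, 0x2000, cache))  # neighboring page reuses the cached prefix
```

A real MMU cache is small and needs a replacement policy. The unbounded dict here only shows how a cached prefix shortens the walk.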
This paper presents hardware and software mechanisms to enable concurrent direct network access (CDNA) by operating systems running within a virtual machine monitor. In a conventional virtual machine monitor, each operating system running within a virtual machine must access the network through a software-virtualized network interface. These virtual network interfaces are multiplexed in software onto a physical network interface, incurring significant performance overheads. The CDNA architecture improves networking efficiency and performance by dividing the tasks of traffic multiplexing, interrupt delivery, and memory protection between...
Data-intensive computing applications are using more and more memory and are placing an increasing load on the virtual memory system. While the use of large pages can help alleviate the overhead of address translation, they limit the control the operating system has over memory allocation and protection. We present a novel device, the SpecTLB, that exploits the predictable behavior of reservation-based physical memory allocators to interpolate address translations.
This paper analyzes memory access scheduling and virtual channels as mechanisms to reduce the latency of main memory accesses by the CPU and peripherals in web servers. Despite the address filtering effects of the CPU's cache hierarchy, there is significant locality and bank parallelism in the DRAM access stream of a web server, which includes traffic from the operating system, application, and peripherals. However, a sequential memory controller leaves much of this locality and parallelism unexploited, as serialization and bank conflicts affect the realizable latency. Aggressive scheduling within the memory controller to exploit...
This paper presents mechanisms and optimizations to reduce the overhead of network interface virtualization when using the driver domain I/O virtualization model. The driver domain model provides benefits such as support for legacy device drivers and fault isolation. However, the processing overheads incurred in the driver domain to achieve these benefits limit overall I/O performance. This paper demonstrates the effectiveness of two approaches to reduce driver domain overheads. First, Xen is modified to support multi-queue network interfaces to eliminate the software overheads of packet demultiplexing and copying. Second, a grant reuse mechanism...
Web search engines are optimized to reduce the high-percentile response time to consistently provide fast responses to almost all user queries. This is a challenging task because the query workload exhibits large variability, consisting of many short-running queries and a few long-running queries that significantly impact the high-percentile response time. With modern multicore servers, parallelizing the processing of an individual query is a promising solution to reduce query execution time, but it gives limited benefits compared to sequential execution since most queries see little or no...
Human/human interaction is a critical component of learning in many domains including introductory computer programming. For on-campus courses, lectures and problem sessions provide opportunities for students to interact with the instructor(s) and their peers. For online courses, human/human interactions are more limited and usually correspond to activities like forum postings and study groups. For online programming courses, the situation is potentially even worse since computational tools designed to facilitate learning to program, such as unit testing, emphasize...
In datacenter networks, link and switch failures are a common occurrence. Although most of these failures do not disconnect the underlying topology, they do cause routing failures, disrupting communications between some hosts. Unfortunately, current 1:1 redundancy groups are only partly effective at reducing the impact of these failures. In principle, local fast failover schemes, such as OpenFlow fast failover groups, could reduce the impact of these failures by preinstalling backup routes that protect against multiple simultaneous failures. However, providing sufficient...
Efficient conditional operations for data-parallel architectures. Authors: Ujval J. Kapasi (Computer Systems Laboratory, Stanford University, Stanford, CA), William Dally, Scott Rixner, Peter R. Mattson, John D. Owens, Brucek Khailany. MICRO 33: Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, December 2000, pages 159–170. DOI: https://doi.org/10.1145/360128.360145 (online 01 December 2000)
Media applications, such as image processing, signal processing, video, and graphics, require high computation rates and data bandwidths. The stream programming model is a natural and powerful way to describe these applications. Expressing media applications in this model allows hardware and software systems to take advantage of their concurrency and locality in order to meet their high computational demands. The Imagine stream programming system, a set of software tools and algorithms, is used to program media applications in the stream programming model. We achieve real-time performance on a variety of media processing applications with (4-15 billion...
A web search query made to Microsoft Bing is currently parallelized by distributing the query processing across many servers. Within each of these servers, the query is, however, processed sequentially. Although each server may be processing multiple queries concurrently, with modern multicore servers, parallelizing the processing of an individual query within the server may nonetheless improve the user's experience by reducing the response time. In this paper, we describe the issues that make the parallelization of an individual query within a server challenging, and present a parallelization approach that effectively addresses these challenges. Since...
This paper introduces Plinko, a network architecture that uses a novel forwarding model and routing algorithm to build networks with forwarding paths that, assuming arbitrarily large forwarding tables, are provably resilient against t link failures, ∀t ∈ N. However, in practice, there are clearly limits on the size of the forwarding tables. Nonetheless, when constrained to hardware comparable to modern top-of-rack (TOR) switches, Plinko scales with high resilience to networks with up to ten thousand hosts. Thus, as long as t or fewer links have failed, the only reason...