- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Advanced Data Storage Technologies
- Interconnection Networks and Systems
- Distributed Systems and Fault Tolerance
- Distributed and Parallel Computing Systems
- Cloud Computing and Resource Management
- Graph Theory and Algorithms
- Low-Power High-Performance VLSI Design
- Machine Learning and Data Classification
- Advanced Neural Network Applications
- Cognitive Functions and Memory
- Advanced Database Systems and Queries
- Model-Driven Software Engineering Techniques
- Scientific Computing and Data Management
- Formal Methods in Verification
- Logic, Programming, and Type Systems
- Real-Time Systems Scheduling
- Advanced Graph Neural Networks
- Advanced Multi-Objective Optimization Algorithms
- VLSI and Analog Circuit Testing
- VLSI and FPGA Design Techniques
- Data Management and Algorithms
- Software System Performance and Reliability
- Markov Chains and Monte Carlo Methods
Stanford University
2016-2025
Stanford Medicine
2010-2021
Palo Alto University
2009-2013
IBM (United States)
2009
Xi'an Jiaotong University
2009
Laboratoire d'Informatique de Paris-Nord
2006
Oracle (United States)
2005
University of Michigan–Ann Arbor
1987-2002
The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications. This is an entirely new implementation of the Sparc V9 architectural specification, which exploits large amounts of on-chip parallelism to provide high throughput. The hardware supports 32 threads, with a memory subsystem consisting of an on-board crossbar, level-2 cache, and memory controllers for a highly integrated design that exploits the thread-level parallelism inherent to server applications, while targeting low levels...
Transactional Memory (TM) is emerging as a promising technology to simplify parallel programming. While several TM systems have been proposed in the research literature, we are still missing the tools and workloads necessary to analyze and compare the proposals. Most TM systems have been evaluated using microbenchmarks, which may not be representative of any real-world behavior, or individual applications, which do not stress a wide range of execution scenarios. We introduce the Stanford Transactional Applications for Multi-Processing (STAMP), a comprehensive...
Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows that it is possible to implement a single-chip multiprocessor in the same area as a wide issue superscalar processor. We find that for applications with little parallelism the performance of the two microarchitectures is comparable. For applications with large amounts of parallelism at both the fine and coarse grained levels,...
In this paper, we propose a new shared memory model: Transactional memory Coherence and Consistency (TCC). TCC provides a model in which atomic transactions are always the basic unit of parallel work, communication, memory coherence, and memory reference consistency. TCC greatly simplifies parallel software by eliminating the need for synchronization using conventional locks and semaphores, along with their complexities. TCC hardware must combine all writes from each transaction region in a program into a single packet...
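TCC realizes transactions in hardware, but the execute-optimistically, detect-conflicts, retry discipline it builds on can be shown in miniature in software. The toy C++ sketch below (my own illustration, not TCC's mechanism) runs each update speculatively and commits it atomically; a conflicting concurrent commit forces re-execution, so no locks or lost updates occur:

```cpp
// Toy software analogue of optimistic transactional execution:
// run the update speculatively, then commit atomically; retry on conflict.
#include <atomic>
#include <thread>
#include <vector>
#include <iostream>

std::atomic<long> account{0};

// Apply f(old) -> new as an all-or-nothing update, retrying on conflict.
template <typename F>
void run_transaction(F f) {
    long expected = account.load();
    // compare_exchange fails if another "transaction" committed in between,
    // playing the role of hardware conflict detection and re-execution.
    while (!account.compare_exchange_weak(expected, f(expected))) {
        // expected now holds the latest committed value; retry.
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] {
            for (int i = 0; i < 100000; ++i)
                run_transaction([](long v) { return v + 1; });
        });
    for (auto& w : workers) w.join();
    std::cout << account << "\n";  // 400000: no locks, no lost updates
}
```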
Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip multiprocessor (CMP). The support consists of a number of software speculation control handlers and modifications to the shared secondary cache memory system of the CMP, and is evaluated using five representative integer applications. Our results show that the speculative support is only able to improve performance when there is a substantial amount of medium-grained...
Presents the case for billion-transistor processor architectures that will consist of chip multiprocessors (CMPs): multiple (four to 16) simple, fast processors on one chip. In their proposal, each processor is tightly coupled to a small, fast, level-one cache, and all processors share a larger level-two cache. The processors may collaborate on a parallel job or run independent tasks (as in the SMT proposal). The CMP architecture lends itself to simpler design, faster validation, cleaner functional partitioning, and higher theoretical peak...
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems, but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual...
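The load-balancing idea behind the virtual-warp approach can be sketched independently of the GPU: one work unit per vertex is badly imbalanced on skewed real-world degrees, so each adjacency list is split into fixed-size chunks and work is distributed over chunks. The C++ sketch below (my own CPU-side illustration; the paper's contribution is the mapping onto GPU warps, which is not modeled here) shows only that partitioning step:

```cpp
// Split each adjacency list into fixed-size chunks so high-degree vertices
// yield many small work units instead of one huge one.
#include <algorithm>
#include <vector>
#include <cstdio>

struct Chunk { int vertex, begin, end; };   // a slice of one adjacency list

std::vector<Chunk> make_chunks(const std::vector<int>& row, int W) {
    std::vector<Chunk> chunks;
    for (int u = 0; u + 1 < (int)row.size(); ++u)
        for (int b = row[u]; b < row[u + 1]; b += W)
            chunks.push_back({u, b, std::min(b + W, row[u + 1])});
    return chunks;
}

int main() {
    // CSR offsets: vertex 0 has 6 edges, vertices 1-3 have 1 each (skewed).
    std::vector<int> row = {0, 6, 7, 8, 9};
    auto chunks = make_chunks(row, 2);      // "virtual warp" width = 2
    // 6 roughly equal-sized units instead of 4 wildly unequal ones:
    for (auto& c : chunks)
        std::printf("vertex %d edges [%d,%d)\n", c.vertex, c.begin, c.end);
}
```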
Graphs are a fundamental data representation that has been used extensively in various domains. In graph-based applications, a systematic exploration of the graph, such as breadth-first search (BFS), often serves as a key component in processing their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over...
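For context, the baseline such methods improve on is the standard level-synchronous parallel BFS. Below is a generic C++/OpenMP sketch of that frontier-based scheme over a CSR graph (a textbook formulation, not the paper's bandwidth-optimized method); compile with -fopenmp:

```cpp
// Level-synchronous parallel BFS over a CSR graph.
#include <atomic>
#include <vector>
#include <cstdio>

std::vector<int> parallel_bfs(const std::vector<int>& row,  // CSR offsets
                              const std::vector<int>& col,  // CSR neighbors
                              int source) {
    int n = (int)row.size() - 1;
    std::vector<int> dist(n, -1);
    std::vector<std::atomic<char>> visited(n);   // zero-initialized flags
    std::vector<int> frontier{source};
    visited[source] = 1;
    dist[source] = 0;
    for (int level = 1; !frontier.empty(); ++level) {
        std::vector<int> next;
        #pragma omp parallel
        {
            std::vector<int> local;              // per-thread output buffer
            #pragma omp for nowait
            for (int i = 0; i < (int)frontier.size(); ++i) {
                int u = frontier[i];
                for (int e = row[u]; e < row[u + 1]; ++e) {
                    int v = col[e];
                    char unvisited = 0;          // claim v exactly once
                    if (visited[v].compare_exchange_strong(unvisited, 1)) {
                        dist[v] = level;
                        local.push_back(v);
                    }
                }
            }
            #pragma omp critical
            next.insert(next.end(), local.begin(), local.end());
        }
        frontier.swap(next);
    }
    return dist;
}

int main() {
    // 4-cycle: 0-1, 1-2, 2-3, 3-0
    std::vector<int> row = {0, 2, 4, 6, 8};
    std::vector<int> col = {1, 3, 0, 2, 1, 3, 2, 0};
    for (int d : parallel_bfs(row, col, 0)) std::printf("%d ", d); // 0 1 2 1
    std::printf("\n");
}
```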
The increasing importance of graph-data based applications is fueling the need for highly efficient and parallel implementations of graph analysis software. In this paper we describe Green-Marl, a domain-specific language (DSL) whose high level language constructs allow developers to describe their graph analysis algorithms intuitively, but also expose the data-level parallelism inherent in the algorithms. We also present our Green-Marl compiler, which translates a high-level algorithmic description written in Green-Marl into an efficient C++ implementation by...
Next-generation information technologies will process unprecedented amounts of loosely structured data that overwhelm existing computing systems. N3XT improves the energy efficiency of abundant-data applications 1,000-fold by using new logic and memory technologies, 3D integration with fine-grained connectivity, and new architectures for computation immersed in memory.
Reconfigurable architectures have gained popularity in recent years as they allow the design of energy-efficient accelerators. Fine-grain fabrics (e.g., FPGAs) have traditionally suffered from performance and power inefficiencies due to bit-level reconfigurable abstractions. Both fine-grain and coarse-grain architectures (CGRAs) require low level programming and suffer from long compilation times. We address both challenges with Plasticine, a new spatially reconfigurable architecture designed to efficiently execute applications composed of parallel patterns...
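"Parallel patterns" here means structured operators like map and reduce. As a rough software illustration (not Plasticine's hardware mapping), the same map-then-reduce dataflow can be expressed with C++17's standard parallel algorithms:

```cpp
// A map fused with a reduce, expressed as parallel patterns in plain C++.
// Plasticine executes such patterns spatially in hardware; here the same
// dataflow is just handed to the standard parallel algorithms.
#include <execution>
#include <numeric>
#include <vector>
#include <iostream>

int main() {
    std::vector<float> x(1 << 20, 0.5f);
    // map: square each element; reduce: sum the results.
    float sum = std::transform_reduce(std::execution::par_unseq,
                                      x.begin(), x.end(), 0.0f,
                                      std::plus<>{},                  // reduce
                                      [](float v) { return v * v; }); // map
    std::cout << sum << "\n";   // 262144
}
```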
There are two types of high-performance graph processing engines: low-level and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like language...
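A canonical query such engines run is triangle counting, i.e., the join R(a,b), R(b,c), R(a,c). The C++ sketch below shows the set-intersection formulation that query reduces to over sorted adjacency lists (an illustration of the query, not EmptyHeaded's generated code):

```cpp
// Triangle counting as a join over sorted adjacency sets:
// for each edge (a,b), intersect the neighbor lists of a and b.
#include <algorithm>
#include <iterator>
#include <vector>
#include <cstdio>

using Adj = std::vector<std::vector<int>>;  // sorted neighbor lists

long count_triangles(const Adj& g) {
    long total = 0;
    std::vector<int> common;
    for (int a = 0; a < (int)g.size(); ++a)
        for (int b : g[a]) {
            if (b <= a) continue;           // enforce a < b < c: count once
            common.clear();
            std::set_intersection(g[a].begin(), g[a].end(),
                                  g[b].begin(), g[b].end(),
                                  std::back_inserter(common));
            for (int c : common) if (c > b) ++total;
        }
    return total;
}

int main() {
    Adj g = {{1, 2}, {0, 2}, {0, 1}};       // a single triangle
    std::printf("%ld\n", count_triangles(g));  // prints 1
}
```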
We propose signature-accelerated transactional memory (SigTM), a hybrid TM system that reduces the overhead of software transactions. SigTM uses hardware signatures to track the read-set and write-set for pending transactions and to perform conflict detection between concurrent threads. All other transactional functionality, including data versioning, is implemented in software. Unlike previously proposed hybrid TM systems, SigTM requires no modifications to the hardware caches, which reduces cost and simplifies support for nested transactions and multithreaded processor...
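Hardware signatures of this kind are essentially Bloom filters over addresses: they never miss a real conflict, but may report false positives. A minimal software model of that structure (hash functions and sizes are my own choices, not SigTM's):

```cpp
// Bloom-filter-style address signature: insert addresses, then test
// conservatively for membership (no false negatives, some false positives).
#include <array>
#include <cstdint>
#include <cstdio>

struct Signature {
    std::array<uint64_t, 16> bits{};         // 1024-bit filter
    static uint64_t mix(uint64_t a, int i) { // cheap per-hash mixer
        a ^= a >> 33; a *= 0xff51afd7ed558ccdULL + 2 * i + 1; a ^= a >> 29;
        return a;
    }
    void insert(uint64_t addr) {
        for (int i = 0; i < 3; ++i) {        // 3 hash functions
            uint64_t h = mix(addr, i) & 1023;
            bits[h >> 6] |= 1ULL << (h & 63);
        }
    }
    bool may_contain(uint64_t addr) const {
        for (int i = 0; i < 3; ++i) {
            uint64_t h = mix(addr, i) & 1023;
            if (!(bits[h >> 6] & (1ULL << (h & 63)))) return false;
        }
        return true;                          // "maybe": false positives possible
    }
};

int main() {
    Signature write_set;
    write_set.insert(0x1000);                // committing tx wrote 0x1000
    // A concurrent reader checks its read addresses against the write set:
    std::printf("%d %d\n", (int)write_set.may_contain(0x1000),   // 1: conflict
                           (int)write_set.may_contain(0x2000));  // 0 (likely)
}
```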
We propose a concurrent relaxed balance AVL tree algorithm that is fast, scales well, and tolerates contention. It is based on optimistic techniques adapted from software transactional memory, but takes advantage of specific knowledge of the algorithm to reduce overheads and avoid unnecessary retries. We extend our algorithm with a fast linearizable clone operation, which can be used for consistent iteration of the tree. Experimental evidence shows that our algorithm outperforms a highly tuned skip list under many access patterns, with an average of 39% higher...
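The optimistic reads in such trees follow a version-validation discipline: sample a per-node version, read the fields, and retry if the version moved. A stripped-down, single-node C++ sketch of that loop (my own simplification, not the paper's full tree code):

```cpp
// Version-validated optimistic read: even version = stable, odd = writing.
#include <atomic>
#include <cstdint>
#include <cstdio>

struct Node {
    std::atomic<uint64_t> version{0};
    std::atomic<int> key{0};   // atomic so the speculative read is well-defined

    int optimistic_read() const {
        for (;;) {
            uint64_t v1 = version.load(std::memory_order_acquire);
            if (v1 & 1) continue;                        // writer active: spin
            int k = key.load(std::memory_order_relaxed); // speculative read
            if (version.load(std::memory_order_acquire) == v1)
                return k;      // version unchanged, so the read was consistent
        }                      // changed: retry instead of taking a lock
    }
    void locked_write(int k) { // caller is assumed to hold the node's lock
        version.fetch_add(1);  // odd: mark unstable
        key.store(k);
        version.fetch_add(1);  // even: stable again
    }
};

int main() {
    Node n;
    n.locked_write(42);
    std::printf("%d\n", n.optimistic_read());  // 42
}
```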
Computing systems are becoming increasingly parallel and heterogeneous, and therefore new applications must be capable of exploiting parallelism in order to continue achieving high performance. However, targeting these emerging devices often requires using multiple disparate programming models and making decisions that can limit forward scalability. In previous work we proposed the use of domain-specific languages (DSLs) to provide high-level abstractions that enable transformations to high performance code without...
Developing high-performance software is a difficult task that requires the use of low-level, architecture-specific programming models (e.g., OpenMP for CMPs, CUDA for GPUs, MPI for clusters). It is typically not possible to write a single application that can run efficiently in different environments, leading to multiple versions and increased complexity. Domain-Specific Languages (DSLs) are a promising avenue to enable programmers to use high-level abstractions and still achieve good performance on a variety of hardware. This...
As the amount of memory in database systems grows, entire tables, or even entire databases, are able to fit in the system's memory, making in-memory database operations more prevalent. This shift from disk-based storage has contributed to a move from row-wise to columnar data storage. Furthermore, common workloads have grown beyond online transaction processing (OLTP) to include analytical processing and data mining. These workloads analyze huge datasets that are often irregular and not indexed, making traditional operations like joins much more expensive.
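The row-wise versus columnar distinction is easy to see in code: an analytical scan over one attribute reads contiguous memory in the columnar form, which is what makes it attractive for in-memory analytics. A small illustrative C++ comparison (my own example, not drawn from the paper):

```cpp
// Row store vs. column store for a scan over a single attribute.
#include <numeric>
#include <vector>
#include <cstdio>

struct Row { int id; int price; int qty; };   // row store: fields interleaved

int main() {
    const int N = 1'000'000;
    std::vector<Row> rows(N, Row{0, 3, 2});

    // Column store: each attribute in its own contiguous array.
    std::vector<int> price(N, 3), qty(N, 2);

    long sum_rows = 0;
    for (const Row& r : rows) sum_rows += r.price;   // strided access

    long sum_cols = std::accumulate(price.begin(), price.end(), 0L); // dense

    std::printf("%ld %ld\n", sum_rows, sum_cols);    // same answer, but the
    // columnar scan reads ~1/3 of the bytes for this 3-field schema.
}
```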
Transactional memory (TM) provides mechanisms that promise to simplify parallel programming by eliminating the need for locks and their associated problems (deadlock, livelock, priority inversion, convoying). For TM to be adopted in the long term, not only does it need to deliver on these promises, but it needs to scale to a high number of processors. To date, proposals for scalable TM have relegated livelock issues to user-level contention managers. This paper presents the first implementation of TM for directory-based distributed shared...
Industry is increasingly turning to reconfigurable architectures like FPGAs and CGRAs for improved performance and energy efficiency. Unfortunately, adoption of these architectures has been limited by their programming models. HDLs lack abstractions for productivity and are difficult to target from higher-level languages. HLS tools are more productive, but offer an ad-hoc mix of software and hardware abstractions which makes optimizations difficult.
Exploiting heterogeneous parallel hardware currently requires mapping application code to multiple disparate programming models. Unfortunately, the general-purpose programming models available today can yield high performance but are too low-level to be accessible to the average programmer. We propose leveraging domain-specific languages (DSLs) to map high-level application code to heterogeneous devices. To demonstrate the potential of this approach we present OptiML, a DSL for machine learning. OptiML programs are implicitly parallel and can achieve high performance on heterogeneous hardware with no...
Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, a conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to classify...
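The two ingredients, asynchrony and low precision, can be shown together in a toy form. The C++ sketch below (my own flavor of the idea, not the paper's Buckwild! implementation) has several threads update a shared weight without coordination while each gradient is first rounded to an 8-bit grid:

```cpp
// Asynchronous, low-precision SGD on a 1-D least-squares problem (fit y = 2x).
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>
#include <cmath>

std::atomic<float> w{0.0f};                 // shared model parameter

float quantize(float g, float scale = 64.0f) {   // round to an int8 grid
    float q = std::nearbyint(g * scale);
    if (q > 127.f) q = 127.f;
    if (q < -128.f) q = -128.f;
    return q / scale;
}

void worker(int steps) {
    for (int i = 0; i < steps; ++i) {
        float x = 1.0f + (i % 5);           // toy data stream with y = 2x
        float y = 2.0f * x;
        float cur = w.load(std::memory_order_relaxed);   // possibly stale read
        float grad = 2.0f * (cur * x - y) * x;           // d/dw (wx - y)^2
        float step = -0.01f * quantize(grad);
        // Lock-free update; writers interleave freely (Hogwild-style).
        float old = w.load(std::memory_order_relaxed);
        while (!w.compare_exchange_weak(old, old + step)) {}
    }
}

int main() {
    std::vector<std::thread> ts;
    for (int t = 0; t < 4; ++t) ts.emplace_back(worker, 20000);
    for (auto& t : ts) t.join();
    std::printf("w ~ %f (target 2.0)\n", w.load());
}
```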
Researchers have proposed hardware, software, and algorithmic optimizations to improve the computational performance of deep learning. While some of these optimizations perform the same operations faster (e.g., increasing GPU clock speed), many others modify the semantics of the training procedure (e.g., reduced precision) and can impact the final model's accuracy on unseen data. Due to a lack of standard evaluation criteria that considers these trade-offs, it is difficult to directly compare these optimizations. To address this problem, we recently...
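A metric that does capture both speed and accuracy is time-to-accuracy: wall-clock time until a validation threshold is first reached, which is the kind of end-to-end measure DAWNBench reports. A C++ sketch of the measurement loop, with the training step stubbed out by a simulated accuracy curve (the curve and threshold are illustrative, not benchmark values):

```cpp
// Measure wall-clock time until validation accuracy first crosses a target.
#include <chrono>
#include <cstdio>
#include <cmath>

double train_one_epoch_and_eval(int epoch) {
    // Stand-in for real training + validation; saturates toward 0.95.
    return 0.95 * (1.0 - std::exp(-0.3 * (epoch + 1)));
}

int main() {
    const double target = 0.93;            // accuracy threshold
    auto start = std::chrono::steady_clock::now();
    for (int epoch = 0; epoch < 100; ++epoch) {
        double acc = train_one_epoch_and_eval(epoch);
        if (acc >= target) {
            double elapsed = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - start).count();
            std::printf("reached %.3f at epoch %d in %.3fs\n",
                        acc, epoch, elapsed);
            return 0;                       // time-to-accuracy = elapsed
        }
    }
    std::printf("never reached target\n");
}
```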