- Parallel Computing and Optimization Techniques
- Catalysts for Methane Reforming
- Advanced Data Storage Technologies
- Stochastic Gradient Optimization Techniques
- Membrane Separation and Gas Transport
- Catalytic Processes in Materials Science
- Advanced Neural Network Applications
- Distributed and Parallel Computing Systems
- Ammonia Synthesis and Nitrogen Reduction
- Recommender Systems and Techniques
- Interconnection Networks and Systems
- Embedded Systems Design Techniques
- Neural Networks and Applications
- Cloud Computing and Resource Management
- Tensor Decomposition and Applications
- Carbon Dioxide Capture Technologies
- Matrix Theory and Algorithms
- Generative Adversarial Networks and Image Synthesis
- Algorithms and Data Compression
- Advanced Data Compression Techniques
- Advanced Graph Neural Networks
- Hydrogen Storage and Materials
- Advanced SAR Imaging Techniques
- Advanced Image and Video Retrieval Techniques
- Aeroelasticity and Vibration Control
- Alpha Omega Alpha Medical Honor Society, 2023
- Menlo School, 2021-2023
- BC Platforms (Finland), 2022
- Meta (United States), 2017-2022
- Yonsei University, 2021
- Intel (Germany), 2018
- Intel (United States), 2012-2018
- Meta (Israel), 2018
- Korea Institute of Energy Research, 2005-2017
- Intel (United Kingdom), 2012-2016
With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features, and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both the PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding...
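A minimal sketch of the model family this abstract describes, assuming PyTorch; the class name, dimensions, and interaction layout are invented for illustration, and this is not the paper's open-source implementation:

```python
# Sketch of a DLRM-style model: dense features pass through a bottom MLP,
# each categorical feature through its own embedding table, and the
# feature vectors interact via pairwise dot products before a top layer.
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):              # hypothetical name
    def __init__(self, num_dense, table_sizes, dim=16):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(num_dense, dim), nn.ReLU())
        # One embedding table per categorical feature.
        self.tables = nn.ModuleList(
            nn.EmbeddingBag(n, dim, mode="sum") for n in table_sizes)
        n_feat = 1 + len(table_sizes)
        self.top = nn.Linear(dim + n_feat * (n_feat - 1) // 2, 1)

    def forward(self, dense, sparse):   # sparse: one (B, bag) LongTensor per table
        x = self.bottom(dense)
        feats = [x] + [t(s) for t, s in zip(self.tables, sparse)]
        z = torch.stack(feats, dim=1)             # (B, n_feat, dim)
        inter = torch.bmm(z, z.transpose(1, 2))   # all pairwise dot products
        iu = torch.triu_indices(z.size(1), z.size(1), offset=1)
        return torch.sigmoid(self.top(torch.cat([x, inter[:, iu[0], iu[1]]], 1)))

model = TinyDLRM(num_dense=4, table_sizes=[100, 50])
y = model(torch.randn(2, 4),
          [torch.randint(100, (2, 3)), torch.randint(50, (2, 3))])
```

The embedding tables dominate the memory footprint, which is why a scheme like the one in the abstract keeps them model-parallel while the MLPs run data-parallel.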
This paper presents the design of Glow, a machine learning compiler for heterogeneous hardware. It is a pragmatic approach to compilation that enables the generation of highly optimized code for multiple targets. Glow lowers the traditional neural network dataflow graph into a two-phase strongly-typed intermediate representation. The high-level representation allows the optimizer to perform domain-specific optimizations. The lower-level instruction-based address-only representation allows the compiler to perform memory-related optimizations, such as instruction...
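A toy illustration of the two-phase idea, assuming Python; the node names and passes are invented and this is not Glow's actual IR:

```python
# High-level phase: graph nodes enable domain-specific rewrites (here,
# fusing add+relu). Low-level phase: an address-based instruction stream
# enables memory planning (here, static buffer offsets).
graph = [("add", "t0", ("x", "y")), ("relu", "t1", ("t0",))]

def fuse_add_relu(g):
    out, i = [], 0
    while i < len(g):
        if (i + 1 < len(g) and g[i][0] == "add" and g[i + 1][0] == "relu"
                and g[i + 1][2] == (g[i][1],)):
            out.append(("add_relu", g[i + 1][1], g[i][2]))  # fused node
            i += 2
        else:
            out.append(g[i]); i += 1
    return out

def allocate(g, size=1024):
    offsets, next_off = {}, 0
    for _, dst, srcs in g:            # give every tensor a static offset
        for t in (dst, *srcs):
            if t not in offsets:
                offsets[t], next_off = next_off, next_off + size
    return [(op, offsets[dst], tuple(offsets[s] for s in srcs))
            for op, dst, srcs in g]

print(allocate(fuse_add_relu(graph)))
```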
Deep learning recommendation models (DLRMs) are used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators. In addition, Neo enables extremely...
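A sketch of the three embedding sharding schemes behind that strategy (the fourth dimension, data parallelism, splits the batch rather than the table); NumPy-based and illustrative only, not Neo's placement algorithm:

```python
# Each scheme trades off load balance, lookup locality, and communication:
# table-wise keeps lookups local, row-wise splits a huge vocabulary,
# column-wise splits a wide embedding dimension.
import numpy as np

def shard(table, scheme, n_dev):
    """Return per-device pieces of one embedding table (rows x dim)."""
    if scheme == "table":    # whole table lives on one device
        return {0: table}
    if scheme == "row":      # split the rows (ids) across devices
        return dict(enumerate(np.array_split(table, n_dev, axis=0)))
    if scheme == "column":   # split the embedding dimension across devices
        return dict(enumerate(np.array_split(table, n_dev, axis=1)))
    raise ValueError(scheme)

table = np.random.rand(10, 8)    # 10 ids, embedding dim 8
for scheme in ("table", "row", "column"):
    print(scheme, {d: p.shape for d, p in shard(table, scheme, 4).items()})
```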
Graph algorithms are becoming increasingly important for analyzing large datasets in many fields. Real-world graph data follows a pattern of sparsity that is not uniform but highly skewed towards a few items. Implementing graph traversal, statistics, and machine learning algorithms on such data in a scalable manner is quite challenging. As a result, several graph analytics frameworks (GraphLab, CombBLAS, Giraph, SociaLite, and Galois, among others) have been developed, each offering a solution with different programming models targeted at...
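A quick empirical check of the skew claim, assuming networkx is available; the Barabási-Albert generator stands in for real-world data:

```python
# In a scale-free graph a small fraction of vertices holds a large share
# of the edge endpoints, which is the skew the abstract describes.
import networkx as nx

g = nx.barabasi_albert_graph(10_000, 3, seed=0)
deg = sorted((d for _, d in g.degree()), reverse=True)
top1 = sum(deg[:len(deg) // 100])         # endpoints on the top 1% of vertices
print("top 1%% of vertices cover %.0f%% of edge endpoints"
      % (100 * top1 / sum(deg)))
```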
Hardwired ASICs, which are roughly 50X more efficient than programmable processors, sacrifice programmability to meet the efficiency requirements of demanding embedded systems. Programmable processors use energy mostly to supply instructions and data to the arithmetic units, and several techniques can reduce instruction- and data-supply costs. Using these techniques in the Stanford ELM processor closes the gap with ASICs to within 3X.
This paper presents the first comprehensive empirical study demonstrating the efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep Learning training across image classification, speech recognition, language modeling, generative networks, and industrial recommendation systems. BFLOAT16 is attractive for two reasons: the range of values it can represent is the same as that of IEEE 754 single-precision floating point (FP32), and conversion to/from FP32 is simple. Maintaining the same range as FP32 is important to ensure that no hyper-parameter tuning...
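A sketch of why the conversion is simple: BFLOAT16 is just the top 16 bits of an FP32 value (sign bit, the same 8 exponent bits, 7 mantissa bits), so it keeps FP32's range. The round-to-nearest-even step below is one common choice, and NaN handling is ignored for brevity:

```python
# Convert FP32 <-> BFLOAT16 by truncating / zero-padding the low 16 bits.
import struct

def fp32_to_bf16_bits(x: float) -> int:
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest even
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

x = 3.14159265
b = fp32_to_bf16_bits(x)
print(hex(b), bf16_bits_to_fp32(b))   # 0x4049 -> 3.140625
```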
Phenomenally successful in practical inference problems, convolutional neural networks (CNNs) are widely deployed in mobile devices, data centers, and even supercomputers. The number of parameters needed in CNNs, however, is often large and undesirable. Consequently, various methods have been developed to prune a CNN once it is trained. Nevertheless, the resulting CNNs offer limited benefits. While pruning the fully connected layers reduces a CNN's size considerably, it does not improve inference speed noticeably, as the compute...
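A minimal magnitude-pruning sketch (not the paper's guided pruning method), showing the kind of post-training pruning the abstract critiques:

```python
# Zero out the smallest-magnitude weights of a layer. On a fully
# connected layer this shrinks the model a lot, but, as the abstract
# notes, it barely helps latency because the compute-heavy work is in
# the convolutions.
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero the fraction `sparsity` of entries with smallest |w|."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

w = np.random.randn(1024, 1024).astype(np.float32)   # an FC layer
pruned = magnitude_prune(w, 0.9)
print("nonzero fraction:", np.count_nonzero(pruned) / w.size)
```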
Large-scale graph analysis is becoming important with the rise of world-wide social network services. Recently, in SociaLite, we proposed extensions to Datalog that efficiently and succinctly implement graph analysis programs on sequential machines. This paper describes novel optimizations of SociaLite for parallel and distributed execution to support large-scale graph analysis. With SociaLite, programmers simply annotate how data are to be distributed; the necessary communication is then automatically inferred to generate code for clusters of multi-core machines. It...
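For readers unfamiliar with Datalog-style graph analysis, here is the classic reachability program evaluated semi-naively in Python. SociaLite expresses this declaratively and adds the distribution annotations the abstract mentions; this sketch models only the evaluation idea:

```python
# Datalog rules:  reach(x) :- source(x).
#                 reach(y) :- reach(x), edge(x, y).
# Semi-naive evaluation re-derives only from the newest facts.
def reachable(edges, sources):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    reach, frontier = set(sources), set(sources)
    while frontier:   # iterate until no new facts are derived
        frontier = {v for u in frontier for v in adj.get(u, ())} - reach
        reach |= frontier
    return reach

print(reachable({(0, 1), (1, 2), (3, 4)}, {0}))   # {0, 1, 2}
```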
The development of cheap, simple, and green synthetic methods for hierarchically porous nitrogen-doped carbon, especially carbon derived from renewable biomass such as chitosan, remains a challenging topic. Here, we first synthesized a carbon (KIE-8) having a graphene-like structure via simple pyrolysis of a chitosan/urea/KOH mixture without any conventional sophisticated treatments such as freeze-drying, hydrothermal carbonization, or soft or hard templating. On the basis of various analyses of KIE-8, we demonstrated that...
The application of deep learning techniques has resulted in remarkable improvement of machine learning models. This paper provides detailed characterizations of the deep learning models used in many Facebook social network services. We present the computational characteristics of our models, describe high-performance optimizations targeting existing systems, point out their limitations, and make suggestions for future general-purpose/accelerated inference hardware. We also highlight the need for better co-design of algorithms, numerics...
As the effort to scale up existing quantum hardware proceeds, it becomes necessary to schedule quantum gates in a way that minimizes the number of operations. There are three constraints that have to be satisfied: the order or dependency of the gates in the specific algorithm, the fact that any qubit may be involved in at most one gate at a time, and the restriction that two-qubit gates are implementable only between connected qubits. The last aspect implies that the compilation depends not only on the algorithm but also on hardware properties like connectivity. Here we suggest a two-step approach in which logical...
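A toy as-soon-as-possible scheduler illustrating the first two constraints (per-qubit gate order and one gate per qubit per time step); it assumes the circuit has already been mapped so that every two-qubit gate touches connected qubits, and it is not the paper's algorithm:

```python
# gates: list of tuples of qubit ids, given in dependency order.
# Each gate is placed in the earliest time slot after the last slot
# used by any of its qubits, so independent gates run in parallel.
def schedule(gates):
    last = {}                       # qubit -> slot of its latest gate
    out = []
    for g in gates:
        t = max((last.get(q, -1) for q in g), default=-1) + 1
        for q in g:
            last[q] = t
        out.append((t, g))
    return out

# CNOT(0,1); H(2); CNOT(1,2): the first two share slot 0.
print(schedule([(0, 1), (2,), (1, 2)]))   # [(0, (0, 1)), (0, (2,)), (1, (1, 2))]
```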
Sparse methods and the use of Winograd convolutions are two orthogonal approaches, each of which significantly accelerates convolution computations in modern CNNs. Sparse Winograd merges these two, and thus has the potential to offer a combined performance benefit. Nevertheless, training convolution layers so that the resulting Winograd kernels are sparse has not hitherto been very successful. By introducing a Winograd layer in place of a standard convolution layer, we can learn and prune Winograd coefficients "natively" and obtain a sparsity level beyond 90% with only 0.1% accuracy loss with AlexNet on...
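A 1D Winograd F(2,3) example showing where "native" pruning acts: on the transformed kernel Gg rather than on the spatial filter g (the matrices below are the standard F(2,3) transforms; the full paper works with 2D tiles):

```python
# Winograd F(2,3): two outputs from four inputs with a 3-tap filter,
# computed as AT @ ((G @ g) * (BT @ d)). A sparse-Winograd layer learns
# and prunes G @ g directly, since pruning g does not stay sparse under G.
import numpy as np

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]])
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

g = np.array([1., 2., 3.])        # 3-tap filter
d = np.array([4., 5., 6., 7.])    # 4 inputs -> 2 outputs

winograd_kernel = G @ g           # the coefficients pruned "natively"
out = AT @ (winograd_kernel * (BT @ d))
print(out, np.convolve(d, g[::-1], mode="valid"))   # both: [32. 38.]
```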
A new sparse high performance conjugate gradient benchmark (HPCG) has recently been released to address challenges in the design of sparse linear solvers for next-generation extreme-scale computing systems. The key computation, data access, and communication patterns of HPCG represent building blocks commonly found in today's HPC applications. While it is a well-known challenge to efficiently parallelize the Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations...
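For context, a forward Gauss-Seidel sweep over a CSR matrix; the read of already-updated entries of x is the loop-carried dependence that makes the smoother hard to parallelize (a sketch only, assuming SciPy):

```python
# x[i] is updated using x[j] for j < i from *this* sweep, so iterations
# cannot simply be distributed across threads.
import numpy as np
from scipy.sparse import csr_matrix

def gauss_seidel_sweep(A: csr_matrix, b, x):
    for i in range(A.shape[0]):
        start, end = A.indptr[i], A.indptr[i + 1]
        cols, vals = A.indices[start:end], A.data[start:end]
        diag = vals[cols == i][0]
        x[i] += (b[i] - vals @ x[cols]) / diag
    return x

A = csr_matrix(np.array([[4., 1, 0], [1, 4, 1], [0, 1, 4]]))
print(gauss_seidel_sweep(A, np.ones(3), np.zeros(3)))
```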
This paper presents a compiler and runtime framework for parallelizing sparse matrix computations that have loop-carried dependences. Our approach automatically generates a runtime inspector to collect data dependence information and achieves wavefront parallelization of the computation, where iterations within a wavefront execute in parallel and synchronization is required across wavefronts. A key contribution of this paper involves dependence simplification, which reduces the time and space overhead of the inspector. This is implemented within a polyhedral framework,...
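A simplified inspector for one concrete case, a sparse lower-triangular solve: it assigns each row a wavefront level so that rows in the same level can run in parallel, with a barrier between levels. The paper's polyhedral machinery generalizes and optimizes this; the helper below is purely illustrative:

```python
# Level of row i = 1 + max level of the earlier rows it reads; rows that
# share a level have no dependences among themselves.
import numpy as np
from scipy.sparse import csr_matrix

def inspect_wavefronts(L: csr_matrix):
    level = np.zeros(L.shape[0], dtype=int)
    for i in range(L.shape[0]):
        cols = L.indices[L.indptr[i]:L.indptr[i + 1]]
        deps = cols[cols < i]                       # rows i depends on
        level[i] = level[deps].max() + 1 if deps.size else 0
    return [np.flatnonzero(level == l) for l in range(level.max() + 1)]

L = csr_matrix(np.array([[1., 0, 0, 0], [1, 1, 0, 0],
                         [0, 0, 1, 0], [0, 1, 1, 1]]))
print(inspect_wavefronts(L))   # [array([0, 2]), array([1]), array([3])]
```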
We present an efficient programmable architecture for compute-intensive embedded applications. The processor uses instruction registers to reduce the cost of delivering instructions, and a hierarchical, distributed data register organization to deliver data. Instruction registers capture instruction reuse and locality in inexpensive storage structures that are located near the functional units. The data register organization captures reuse and locality at different levels of the hierarchy. Exposed communication resources eliminate pipeline registers and control logic, and allow the compiler to schedule...
A plate-type catalytic membrane reactor (PCMR) was prepared for the water-gas shift (WGS) reaction. A disk-shaped nickel metal catalyst was placed on a disk-type membrane without a cage or mesh to hold it in the reactor. The WGS reaction in the PCMR was experimentally investigated using a simulated feed from coal gasification as a function of pressure (up to 1.1 MPa) and GHSV (up to 20,000 h−1). The stronger adsorption of CO on Pd seems to be responsible for the greater reduction of the hydrogen permeating flux; CO is a more powerful inhibitor than steam. When S/C =...
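For reference, the overall equilibrium the reactor drives is the standard water-gas shift reaction (the enthalpy given is the textbook value):

```latex
\mathrm{CO} + \mathrm{H_2O} \rightleftharpoons \mathrm{CO_2} + \mathrm{H_2},
\qquad \Delta H^{\circ}_{298} \approx -41\ \mathrm{kJ\,mol^{-1}}
```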