- Advanced Neural Network Applications
- Advanced Memory and Neural Computing
- Software-Defined Networks and 5G
- Caching and Content Delivery
- Ferroelectric and Negative Capacitance Devices
- Privacy-Preserving Technologies in Data
- Adversarial Robustness in Machine Learning
- Domain Adaptation and Few-Shot Learning
- CCD and CMOS Imaging Sensors
- Advanced Image and Video Retrieval Techniques
- Parallel Computing and Optimization Techniques
- Network Traffic and Congestion Control
- Advanced Data Storage Technologies
- Natural Language Processing Techniques
- IoT and Edge/Fog Computing
- Gaze Tracking and Assistive Technology
- Low-power high-performance VLSI design
- Neural Networks and Applications
- Topic Modeling
- Network Packet Processing and Optimization
- Virtual Reality Applications and Impacts
- Distributed Control Multi-Agent Systems
- VLSI and FPGA Design Techniques
- Advanced Malware Detection Techniques
- Numerical Methods and Algorithms
- New York University (2024-2025)
- Courant Institute of Mathematical Sciences (2025)
- Harvard University Press (2018-2024)
- META Health (2024)
- Meta (United States) (2024)
- Harvard University (2019-2020)
- University of Toronto (2015-2017)
- Chalmers University of Technology (2016)
This paper describes a novel approach of packing sparse convolutional neural networks into a denser format for efficient implementations using systolic arrays. By combining multiple sparse columns of a filter matrix into a single dense column stored in the array, the utilization efficiency of the array can be substantially increased (e.g., 8x) due to the increased density of nonzero weights in the resulting packed matrix. In combining columns, for each row, all weights but the one with the largest magnitude are pruned. The remaining weights are retrained to preserve high accuracy. We study...
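As a rough illustration of the column-combining rule described above, the following minimal NumPy sketch packs groups of sparse columns into single dense columns, keeping only the largest-magnitude weight per row; the function name, group size, and bookkeeping are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def combine_columns(filter_matrix, group_size=8):
    """Pack each group of `group_size` sparse columns into one dense column.

    Within a group, only the largest-magnitude entry per row is kept
    (the rest are pruned), and the source column index is recorded so the
    packed column can still be interpreted during inference.
    """
    rows, cols = filter_matrix.shape
    n_groups = int(np.ceil(cols / group_size))
    packed = np.zeros((rows, n_groups))
    origin = np.full((rows, n_groups), -1, dtype=int)  # original column of each kept weight

    for g in range(n_groups):
        block = filter_matrix[:, g * group_size:(g + 1) * group_size]
        best = np.argmax(np.abs(block), axis=1)          # largest-magnitude entry per row
        packed[:, g] = block[np.arange(rows), best]
        origin[:, g] = g * group_size + best
    return packed, origin

# toy example: a sparse 6x8 filter matrix packed into a single dense column
W = np.random.randn(6, 8) * (np.random.rand(6, 8) < 0.2)
packed, origin = combine_columns(W, group_size=8)
```

In the paper's setting the surviving weights would then be retrained; the sketch only shows the packing step.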
Federated learning (FL) is a training technique that enables client devices to jointly learn shared model by aggregating locally computed models without exposing their raw data. While most of the existing work focuses on improving FL accuracy, in this paper, we focus efficiency, which often hurdle for adopting real world applications. Specifically, design an efficient framework optimizes processing latency and communication all are primary considerations implementation FL. Inspired recent...
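The abstract is truncated before the framework's details, so as a generic illustration of "aggregating locally computed models without exposing raw data", here is a minimal federated-averaging sketch; the dummy quadratic loss, function names, and weighting scheme are assumptions, not the paper's method.

```python
import numpy as np

def local_update(weights, data, lr=0.01, steps=5):
    """Hypothetical client-side update: a few gradient steps on local data
    (a dummy quadratic loss stands in for a real model)."""
    w = weights.copy()
    for _ in range(steps):
        grad = w - data.mean(axis=0)   # gradient of 0.5 * ||w - mean(x)||^2
        w -= lr * grad
    return w

def federated_round(global_w, client_datasets):
    """One round of size-weighted averaging over locally computed models."""
    sizes = np.array([len(d) for d in client_datasets], dtype=float)
    local_models = [local_update(global_w, d) for d in client_datasets]
    weights = sizes / sizes.sum()
    return sum(w * lw for w, lw in zip(weights, local_models))

global_w = np.zeros(4)
clients = [np.random.randn(50, 4) + i for i in range(3)]  # raw data never leaves a client
for _ in range(10):
    global_w = federated_round(global_w, clients)
```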
Many multicast services such as live multimedia distribution and real-time event monitoring require mechanisms that involve network functions (e.g., firewall, video transcoding). Network function virtualization (NFV) is a concept that proposes using virtualization to implement network functions on infrastructure building blocks (such as high-volume servers and virtual machines), where software provides the functionality of existing purpose-built equipment. We present an approach for such a mechanism whereby flows are processed by NFV-based network functions before...
In cooperative multi-agent reinforcement learning (c-MARL), agents learn to cooperatively take actions as a team to maximize a total team reward. We analyze the robustness of c-MARL to adversaries capable of attacking one of the agents on the team. Through the ability to manipulate this agent's observations, the adversary seeks to decrease the total team reward. Attacking c-MARL is challenging for three reasons: first, it is difficult to estimate team rewards or how they are impacted by an agent mispredicting; second, the models are non-differentiable; and third, the feature space...
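Because the abstract notes that the victim models may be non-differentiable and team rewards are hard to estimate, a gradient-free search over bounded observation perturbations is one natural baseline; the sketch below is only an illustration of that idea, and the policy, surrogate value function, and bounds are all hypothetical, not the paper's attack.

```python
import numpy as np

def attack_observation(obs, policy, value_estimate, eps=0.1, trials=32, rng=None):
    """Zeroth-order sketch: sample bounded perturbations of one agent's
    observation and keep the one that minimizes a surrogate value estimate
    of the team's outcome. All names are illustrative."""
    rng = rng or np.random.default_rng(0)
    best_obs, best_val = obs, value_estimate(obs, policy(obs))
    for _ in range(trials):
        delta = rng.uniform(-eps, eps, size=obs.shape)
        cand = np.clip(obs + delta, 0.0, 1.0)
        val = value_estimate(cand, policy(cand))
        if val < best_val:
            best_obs, best_val = cand, val
    return best_obs

# dummy usage with placeholder policy and value estimate
obs = np.random.rand(8)
policy = lambda o: int(o.sum() * 3) % 4            # toy discrete policy
value_estimate = lambda o, a: o.mean() + 0.1 * a   # toy surrogate team value
adv_obs = attack_observation(obs, policy, value_estimate)
```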
Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training by providing a wide dynamic range via a shared exponent across a group of values. In this paper, we propose the Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP. FAST supports matrix multiplication with variable-precision BFP input operands, enabling incremental increases in DNN precision throughout training. By increasing both...
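To make the shared-exponent idea concrete, here is a minimal sketch of BFP-style quantization of a tensor: each group of values shares the exponent of its largest magnitude and mantissas are rounded to a few bits. The group size, mantissa width, and clipping choice are assumptions for illustration, not the FAST system's exact format.

```python
import numpy as np

def to_bfp(x, group_size=16, mantissa_bits=4):
    """Quantize a 1-D array to a block-floating-point-like format.

    Each group of `group_size` values shares the exponent of its maximum
    magnitude; mantissas are rounded (and clipped) to `mantissa_bits`
    signed bits, then scaled back to floating point for inspection.
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % group_size
    g = np.pad(x, (0, pad)).reshape(-1, group_size)

    # shared exponent: exponent of the largest |value| in each group
    max_abs = np.abs(g).max(axis=1, keepdims=True)
    exp = np.floor(np.log2(np.where(max_abs > 0, max_abs, 1.0)))
    scale = 2.0 ** (exp - (mantissa_bits - 1))

    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mant = np.clip(np.round(g / scale), lo, hi)      # the largest values may clip
    return (mant * scale).reshape(-1)[:len(x)]

vals = np.random.randn(100)
print(np.abs(vals - to_bfp(vals)).max())  # worst-case quantization error
```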
Leveraging real-time eye tracking, foveated rendering optimizes hardware efficiency and enhances visual quality in virtual reality (VR). This approach uses eye-tracking techniques to determine where the user is looking, allowing the system to render high-resolution graphics only in the foveal region (the small area of the retina where acuity is highest) while the peripheral view is rendered at lower resolution. However, modern deep learning-based gaze-tracking solutions often exhibit a long-tail distribution of tracking...
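The following toy sketch shows the basic foveation idea described above: shading resolution falls off with distance from the gaze point. The pixel radii and rate levels are illustrative placeholders, not values from the paper.

```python
import numpy as np

def shading_rate_map(width, height, gaze_xy, fovea_px=120, blend_px=240):
    """Toy foveated-rendering rate map: full resolution inside the foveal
    radius around the gaze point, progressively coarser shading outside."""
    ys, xs = np.mgrid[0:height, 0:width]
    dist = np.hypot(xs - gaze_xy[0], ys - gaze_xy[1])
    rate = np.where(dist < fovea_px, 1,          # 1x1 shading (full resolution)
           np.where(dist < blend_px, 2, 4))      # 2x2, then 4x4 coarse shading
    return rate

rates = shading_rate_map(1280, 720, gaze_xy=(640, 360))
```

A long-tailed gaze-tracking error effectively shifts `gaze_xy`, which is why tracking outliers matter for this scheme.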
Network function virtualization (NFV) has emerged as a promising paradigm in networking, where hardware-based middleboxes are replaced with software-based virtualized entities, typically running on the cloud, to provide specific functionalities. By deploying NFV, network services become more adaptive and cost-effective. Many multicast services such as real-time multimedia streaming and intrusion detection require appropriate service chaining; however, the placement of NFVs as well as the traffic routing strategy must guarantee that flows...
Multi-agent reinforcement learning (MARL) has recently received considerable attention due to its applicability to a wide range of real-world applications. However, achieving efficient communication among agents has always been an overarching problem in MARL. In this work, we propose Variance Based Control (VBC), a simple yet efficient technique to improve communication efficiency. By limiting the variance of the exchanged messages between agents during the training phase, the noisy component can be eliminated effectively, while the useful part...
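Loosely following the description above, the sketch below shows a variance-based view of message control: a training-time penalty that discourages high-variance (noisy) messages and an execution-time gate that only transmits messages whose variance suggests they carry information. The limit, threshold, and gating rule are assumptions, not VBC's exact formulation.

```python
import numpy as np

def variance_penalty(messages, limit=0.5):
    """Training-time regularizer sketch: penalize message variance above a
    limit, encouraging low-noise messages (hypothetical loss term)."""
    var = np.var(messages, axis=-1)
    return np.maximum(var - limit, 0.0).mean()

def should_transmit(message, threshold=0.5):
    """Execution-time gating sketch: only send messages whose variance
    exceeds a threshold, i.e., those likely to be informative."""
    return np.var(message) > threshold

batch_of_messages = np.random.randn(32, 16) * 0.3
print(variance_penalty(batch_of_messages), should_transmit(batch_of_messages[0]))
```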
We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with a field-programmable gate array (FPGA) implementation. By jointly optimizing CNN models, computing architectures, and hardware implementations, our approach achieves unprecedented performance in the trade-off space characterized by latency, energy efficiency, utilization, and accuracy. An FPGA implementation is used as the validation vehicle for our design, achieving a 2.28ms latency...
To deploy deep neural networks on resource-limited devices, quantization has been widely explored. In this work, we study extremely low-bit networks, which offer tremendous speed-up and memory savings with quantized activations and weights. We first bring up three omitted issues in such networks: the squashed range of quantized values; gradient vanishing during backpropagation; and unexploited hardware acceleration of ternary networks. By reparameterizing the weight vector with a full-precision scale and offset for a fixed ternary vector, we decouple...
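As a small illustration of the reparameterization mentioned above, the sketch below fixes a ternary vector and fits a full-precision scale and offset to it by least squares; the thresholding rule and function names are illustrative, not the paper's exact procedure.

```python
import numpy as np

def ternarize(w, threshold_ratio=0.05):
    """Project full-precision weights onto a ternary vector {-1, 0, +1}."""
    thresh = threshold_ratio * np.abs(w).max()
    return np.sign(w) * (np.abs(w) > thresh)

def reparameterize(w, t):
    """Least-squares fit of w ~ scale * t + offset, so the ternary vector
    stays fixed while scale and offset remain full precision (illustrative)."""
    A = np.stack([t, np.ones_like(t)], axis=1)
    (scale, offset), *_ = np.linalg.lstsq(A, w, rcond=None)
    return scale, offset

w = np.random.randn(64)
t = ternarize(w)
scale, offset = reparameterize(w, t)
w_hat = scale * t + offset   # quantized approximation of the original weights
```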
The emergence of the Internet of Things (IoT) has led to a remarkable increase in the volume of data generated at the network edge. In order to support real-time smart IoT applications, massive amounts of data from edge devices need to be processed using methods such as deep neural networks (DNNs) with low latency. To improve application performance and minimize resource cost, enterprises have begun to adopt Edge computing, a computation paradigm that advocates processing input data locally. However, edge nodes are often...
Recent studies have shown that introducing communication between agents can significantly improve overall performance in cooperative multi-agent reinforcement learning (MARL). However, existing schemes often require agents to exchange an excessive number of messages at run-time under a reliable channel, which hinders their practicality in many real-world situations. In this paper, we present Temporal Message Control (TMC), a simple yet effective approach for achieving succinct and robust communication in MARL...
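The abstract is truncated before TMC's mechanism, so the sketch below only illustrates one generic way to make messaging succinct over time: retransmit a message only when it differs sufficiently from the last one sent, with the receiver reusing a cached copy otherwise. The class, the delta threshold, and the fallback behavior are assumptions, not TMC itself.

```python
import numpy as np

class TemporalMessageGate:
    """Sketch of temporal message gating: an agent retransmits its message
    only when it changes enough; the receiver reuses the cached message
    otherwise. The threshold is a hypothetical parameter."""

    def __init__(self, delta=0.1):
        self.delta = delta
        self.last_sent = None

    def maybe_send(self, message):
        if self.last_sent is None or np.linalg.norm(message - self.last_sent) > self.delta:
            self.last_sent = message.copy()
            return message           # transmit a fresh message
        return None                  # receiver falls back to its cached copy

gate = TemporalMessageGate(delta=0.1)
for t in range(5):
    msg = np.array([np.sin(0.05 * t), np.cos(0.05 * t)])
    out = gate.maybe_send(msg)       # None on most steps, since msg changes slowly
```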
Many multicast services such as live multimedia distribution and real-time event monitoring require constructing a mechanism that involves network functions (e.g., firewall, video transcoding). Network Function Virtualization (NFV) is a concept that proposes using virtualization to implement network functions on infrastructure building blocks (such as high-volume servers and virtual machines), where software provides the functionality of existing purpose-built equipment. We present an approach for such a mechanism whereby flows are...
This paper describes a novel approach of packing sparse convolutional neural networks for their efficient systolic array implementations. By combining subsets of columns in the original filter matrix associated with a layer, we increase utilization efficiency substantially (e.g., ~4x) due to the increased density of nonzeros in the resulting packed matrix. In combining columns, for each row, all weights but the one with the largest magnitude are pruned. We retrain the remaining weights to preserve high accuracy. We demonstrate that mitigating data...
In recent years, numerous designs have used systolic arrays to accelerate convolutional neural network (CNN) inference. In this work, we demonstrate that we can further speed up CNN inference and lower its power consumption by mapping computations onto 3D circuit structures as opposed to conventional 2D structures. Specifically, by operating in 3D space, a wide systolic array consisting of a number of subarrays can efficiently implement the layers prevalent in state-of-the-art CNNs. Additionally, by accumulating intermediate results along the third...
We present the Maestro memory-on-logic 3D-IC architecture for coordinated parallel use of a plurality of systolic arrays (SAs) in performing deep neural network (DNN) inference. Maestro reduces the under-utilization common to a single large SA by allowing many smaller SAs to work on DNN weight matrices of varying shapes and sizes. In order to buffer intermediate results in memory blocks (MBs) and provide high-bandwidth communication between SAs and MBs when transferring weights, Maestro employs three innovations. (1) An SA on the logic die can access its...
We introduce adaptive tiling, a method of partitioning the layers in a sparse convolutional neural network (CNN) into blocks of filters and channels, called tiles, each implementable with a fixed-size systolic array. By allowing a tile to adapt its size so that it can cover a large sparse area, we minimize the total number of tiles, or equivalently, the number of systolic array calls required to perform the CNN inference. The proposed scheme resolves the challenge of applying systolic array architectures, traditionally designed for dense matrices, to sparse CNNs. To validate...
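To illustrate the idea of letting a tile grow until it no longer fits a fixed-size array, here is a greedy NumPy sketch that widens each tile over consecutive columns while the nonzero rows it touches still fit; sparser regions therefore get wider tiles and fewer array calls. The greedy rule is an illustration under assumed parameters, not the paper's exact algorithm.

```python
import numpy as np

def adaptive_tiles(filter_matrix, array_rows=32, array_cols=32):
    """Greedily partition a sparse filter matrix into variable-width tiles,
    each covering at most `array_rows` nonzero rows and `array_cols` columns
    so it can be mapped onto a fixed-size systolic array."""
    rows, cols = filter_matrix.shape
    tiles, col = [], 0
    while col < cols:
        covered_rows, width = set(), 0
        while col + width < cols and width < array_cols:
            new_rows = set(np.nonzero(filter_matrix[:, col + width])[0])
            if len(covered_rows | new_rows) > array_rows:
                break
            covered_rows |= new_rows
            width += 1
        width = max(width, 1)                  # always make progress
        tiles.append((col, width, sorted(covered_rows)))
        col += width
    return tiles                               # number of tiles ~= array calls

W = np.random.randn(64, 128) * (np.random.rand(64, 128) < 0.1)
print(len(adaptive_tiles(W)))                  # fewer tiles for sparser layers
```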
Distributed file systems such as the Google File System and Hadoop have been used to store large volumes of data in Cloud data centers. These systems divide data sets into blocks of fixed size and replicate them over multiple machines to achieve both reliability and efficiency. Recent studies have shown that data blocks tend to exhibit a wide disparity in popularity. In this context, the naive block replication schemes used by these systems often cause an uneven load distribution across machines, which reduces the overall I/O throughput of the system. While many algorithms have been proposed,...
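As a simple baseline for the popularity-skew problem above, the sketch below allocates a fixed replica budget across blocks in proportion to access popularity, with a minimum per block for reliability; the allocation rule and parameters are illustrative assumptions, not the paper's scheme.

```python
import numpy as np

def popularity_replication(block_popularity, total_replicas, min_replicas=1):
    """Popularity-aware replication sketch: every block gets `min_replicas`
    copies, and the remaining budget is split in proportion to popularity
    (flooring may leave a small unused remainder)."""
    pop = np.asarray(block_popularity, dtype=float)
    base = np.full(len(pop), min_replicas)
    extra_budget = total_replicas - base.sum()
    share = pop / pop.sum()
    extra = np.floor(share * extra_budget).astype(int)
    return base + extra

# hot blocks receive more replicas, spreading their read load across machines
counts = popularity_replication([100, 20, 5, 1], total_replicas=12)
print(counts)
```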
On-device learning allows AI models to adapt to user data, thereby enhancing service quality on edge platforms. However, training on resource-limited devices poses significant challenges due to the demanding computing workload and the substantial memory consumption and data access required by deep neural networks (DNNs). To address these issues, we propose utilizing embedded dynamic random-access memory (eDRAM) as the primary storage medium for transient training data. In comparison with static random-access memory (SRAM), eDRAM provides higher density...