- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Advanced Neural Network Applications
- Advanced Memory and Neural Computing
- Security and Verification in Computing
- Distributed and Parallel Computing Systems
- Advanced Graph Neural Networks
- Physical Unclonable Functions (PUFs) and Hardware Security
- Machine Learning and ELM
- Coding Theory and Cryptography
- Domain Adaptation and Few-Shot Learning
- Cryptographic Implementations and Security
- Cryptography and Residue Arithmetic
- Advanced Malware Detection Techniques
- Embedded Systems Design Techniques
- Advanced Data Compression Techniques
- Numerical Methods and Algorithms
- Adversarial Robustness in Machine Learning
- Machine Learning in Materials Science
- Data Quality and Management
- Distributed Systems and Fault Tolerance
- Multimodal Machine Learning Applications
- Ferroelectric and Negative Capacitance Devices
- Cloud Computing and Resource Management
- Cloud Data Security Solutions
University of Illinois Urbana-Champaign
2019-2025
Shanghai Jiao Tong University
2017-2019
The ever-growing demands for memory with larger capacity and higher bandwidth have driven recent innovations in memory expansion and disaggregation technologies based on Compute eXpress Link (CXL). In particular, CXL-based technology has recently gained notable attention for its ability not only to expand memory economically but also to decouple memory from a specific CPU interface. However, since CXL devices have not been widely available, they have been emulated using DDR memory in a remote NUMA node. In this paper, for the first time, we comprehensively...
Deep Neural Networks (DNNs) play a key role in prevailing machine learning applications. Resistive random-access memory (ReRAM) is capable of both computation and storage, contributing to the acceleration of DNNs by processing in memory. Besides, a significant amount of zero weights is observed in DNNs, providing room to further reduce cost by skipping the ineffectual calculations associated with them. However, the irregular distribution of zero weights makes it difficult for resistive accelerators to take advantage of the sparsity as expected...
Memory deduplication plays a critical role in reducing memory consumption and the total cost of ownership (TCO) for hyperscalers, particularly as the advent of large language models imposes unprecedented demands on memory resources. However, conventional CPU-based deduplication can interfere with co-running applications, significantly impacting the performance of time-sensitive workloads. Intel introduced the on-chip Data Streaming...
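The logical flow of page-level deduplication referred to above can be sketched in plain software. The `deduplicate` helper below is purely illustrative (it is not Intel's DSA interface, which offloads the byte-compare/copy primitives to an on-chip engine): pages with identical contents are detected by hashing and collapsed onto one shared copy.

```python
# Minimal software sketch of page-level memory deduplication: identical
# pages are detected by content hash and mapped to a single shared copy.
# All names and the 4 KiB page size are illustrative.

import hashlib

def deduplicate(pages):
    """Return the unique page contents and, for each input page,
    the index of its canonical shared copy."""
    store = {}    # content hash -> index into `unique`
    mapping = []  # canonical index for each input page
    unique = []
    for page in pages:
        h = hashlib.sha256(page).hexdigest()
        if h not in store:
            store[h] = len(unique)
            unique.append(page)
        mapping.append(store[h])
    return unique, mapping

pages = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
unique, mapping = deduplicate(pages)
# Three pages collapse to two unique copies; mapping == [0, 1, 0]
```

The per-page hash-and-compare loop is exactly the kind of memory-bound work that, per the abstract, interferes with co-running applications when done on the CPU.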
Graph Neural Networks (GNNs) are becoming popular because they are effective at extracting information from graphs. To execute GNNs, CPUs are good platforms because of their high availability and terabyte-level memory capacity, which enables full-batch computation on large graphs. However, GNNs are heavily memory-bound, which limits their performance.
General Matrix Multiplication (GEMM) is the key operation in Deep Neural Networks (DNNs). While dense GEMM uses SIMD CPUs efficiently, sparse GEMM is much less efficient, especially at the modest levels of unstructured sparsity common in DNN inference/training. Thus, most DNNs use dense GEMM. In this paper, we propose SAVE, a novel vector engine for CPUs that efficiently skips the ineffectual computation due to zeros in dense GEMM implementations. SAVE's hardware extensions to the vector pipeline are transparent to software. SAVE accelerates FP32 and...
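As a rough software analogy of what a sparsity-aware vector engine does (SAVE itself skips zero operands inside the hardware vector pipeline, invisibly to software), the sketch below counts how many multiply-accumulates a zero-skipping GEMM loop avoids. All names and sizes are illustrative.

```python
# Software analogy of skipping ineffectual (zero-operand) work in GEMM.
# A zero entry of A contributes nothing to C, so its whole inner loop
# of multiply-accumulates can be elided.

def gemm_skip_zeros(A, B):
    """C = A @ B, skipping multiply-accumulates where A[i][k] == 0."""
    n, k_dim, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    macs = 0
    for i in range(n):
        for k in range(k_dim):
            a = A[i][k]
            if a == 0.0:          # ineffectual: skip the whole row of B
                continue
            for j in range(m):
                C[i][j] += a * B[k][j]
                macs += 1
    return C, macs

A = [[1.0, 0.0], [0.0, 2.0]]
B = [[3.0, 4.0], [5.0, 6.0]]
C, macs = gemm_skip_zeros(A, B)
# Dense GEMM would perform n*k*m = 8 MACs; zero-skipping performs 4 here.
```

At the modest, unstructured sparsity levels the abstract mentions, doing this test in software costs more than it saves, which is why SAVE moves the skipping into the vector pipeline.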
Many hardware-based defense schemes against speculative execution attacks use special mechanisms to protect instructions while they are speculative, and lift the protection when the instructions turn non-speculative. In this paper, we observe that instructions can sometimes become Speculation Invariant before turning non-speculative. Speculation invariance means that (i) whether the instruction will execute and (ii) the instruction's operands are not a function of speculative state. Hence, we propose lifting the protection on these instructions early: as soon as they become speculation invariant, we issue them without protection. As a result,...
We present a fast cyclic redundancy check (CRC) algorithm that performs CRC computation for any length of message in parallel. For a given message of any length, we first divide the message into blocks, each of which has a fixed size equal to the degree of the generator polynomial. Then we perform CRC computation among the blocks in parallel using Galois field multiplication and accumulation (GFMAC). Theoretically, our algorithm can achieve unlimited speedup over bit-serial or byte-wise table-lookup approaches at the expense of adding enough GFMAC units...
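The block decomposition described above can be sketched as follows. Since CRC(M) = M(x)·x^r mod G(x) is linear over XOR, each block's contribution (block · x^shift mod G) can be computed by an independent GFMAC and the partial results folded together. The generator polynomial (CRC-8, 0x107) and block size below are illustrative, not the paper's parameters.

```python
# Block-parallel CRC sketch: split the message into fixed-size blocks,
# multiply each block by a precomputed constant x^shift mod G(x) in GF(2),
# and XOR-fold the reduced partial results.

def clmul(a, b):
    """Carry-less (GF(2)) polynomial multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_mod(p, g):
    """Reduce polynomial p modulo generator g (both as bit masks)."""
    dg = g.bit_length() - 1
    while p.bit_length() - 1 >= dg:
        p ^= g << (p.bit_length() - 1 - dg)
    return p

def crc_serial(msg, g):
    """Reference CRC: (msg(x) * x^deg(g)) mod g(x)."""
    return poly_mod(msg << (g.bit_length() - 1), g)

def crc_blockwise(blocks, block_bits, g):
    """CRC of the concatenated blocks; each term is one independent GFMAC."""
    r = g.bit_length() - 1
    acc = 0
    n = len(blocks)
    for i, blk in enumerate(blocks):
        shift = (n - 1 - i) * block_bits + r   # bit position of this block
        const = poly_mod(1 << shift, g)        # precomputable per lane
        acc ^= poly_mod(clmul(blk, const), g)  # GF multiply-accumulate
    return acc
```

Because every block's term depends only on its own contents and a precomputed constant, the loop body maps directly onto parallel GFMAC units, which is the source of the claimed speedup.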
Our community has greatly improved the efficiency of deep learning applications, including by exploiting sparsity in inputs. Most of that work, though, is for inference, where the weight sparsity is known statically, and/or it requires specialized hardware. We propose a scheme to leverage dynamic sparsity during training. In particular, we exploit the zeros introduced by the ReLU activation function in both feature maps and their gradients. This is challenging because the degree of sparsity is moderate and the locations of zeros change over time. We also rely purely on software...
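A minimal illustration (not the paper's vectorized implementation) of the dynamic sparsity in question: ReLU zeroes entries of the feature map in the forward pass and of its gradient in the backward pass, so the multiply-accumulates touching those entries can be skipped. All names are illustrative.

```python
# ReLU-induced dynamic sparsity: zeros appear in both the activations and
# their gradients, at positions that change every iteration.

def relu(x):
    return [v if v > 0 else 0.0 for v in x]

def relu_grad(x, upstream):
    # The gradient is zero wherever the pre-activation was non-positive.
    return [g if v > 0 else 0.0 for v, g in zip(x, upstream)]

def dot_skip_zeros(a, w):
    """Dot product that skips positions where the activation is zero."""
    total, macs = 0.0, 0
    for ai, wi in zip(a, w):
        if ai == 0.0:
            continue
        total += ai * wi
        macs += 1
    return total, macs

pre = [-1.0, 2.0, -3.0, 4.0]
act = relu(pre)                    # [0.0, 2.0, 0.0, 4.0]
w = [0.5, 0.5, 0.5, 0.5]
y, macs = dot_skip_zeros(act, w)   # only 2 of the 4 MACs are performed
```

Because the zero locations shift as training proceeds, any skipping scheme must detect them at run time rather than precompute a sparse format, which is exactly the difficulty the abstract highlights.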
In this paper we present an optimized processor for fast Reed-Solomon (RS) encoding and decoding, built on a configurable core with parallel Galois field multiplication and accumulation (GFMAC) units. With this processor, maximum performance gains can be achieved for RS encoding/decoding over traditional implementations. Our implementation only requires adding a small number of logic gates for the customized GFMAC instructions and maintains fewer registers. The processor is quite flexible and compact, supporting different coding standards. Compared...
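The GFMAC primitive at the heart of such a design reduces to a Galois-field multiply followed by an XOR-accumulate. A plain-software version over GF(2^8) is sketched below; the reduction polynomial 0x11B (x^8+x^4+x^3+x+1) is used purely for illustration, since a real codec uses whichever polynomial the coding standard fixes.

```python
# One GF(2^8) multiply-accumulate step of the kind an RS encoder/decoder
# inner loop performs on a GFMAC unit.

def gf256_mul(a, b, poly=0x11B):
    """Multiply a and b in GF(2^8) with the given reduction polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:       # degree-8 overflow: reduce modulo poly
            a ^= poly
    return r

def gfmac(acc, a, b):
    """acc += a*b in GF(2^8): addition in GF(2^n) is XOR."""
    return acc ^ gf256_mul(a, b)
```

Fusing the multiply and the XOR into one custom instruction is what lets the processor keep the RS inner loop to a handful of cycles per symbol while staying configurable across standards.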
Training Convolutional Neural Networks (CNNs) is both memory- and computation-intensive. Resistive random access memory (ReRAM) has shown its advantage in accelerating such tasks with high energy efficiency. However, the ReRAM-based pipeline architecture suffers from low utilization of computing resources, caused by imbalanced data throughput across stages due to the inherent down-sampling effect of CNNs and the inflexible usage of ReRAM cells. In this paper, we propose a novel bidirectional...
With the development of cloud computing, disk arrays tolerating triple failures (3DFTs) are receiving more attention nowadays because they can provide high data reliability at low monetary cost. However, a challenging issue in these arrays is how to efficiently reconstruct lost data, especially for partial stripe errors (e.g., sector and chunk errors), which is one of the most significant scenarios in practice. Existing cache strategies are not efficient for the reconstruction of 3DFTs, which involves complex relationships among...
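To make the reconstruction problem concrete, the sketch below shows the deliberately simplified single-parity case: a lost chunk is the XOR of the surviving chunks and the parity. A 3DFT keeps three parity chunks with more complex inter-chunk relationships so it can repair up to three failures; the chunk sizes and layout here are illustrative only.

```python
# Single-parity stripe reconstruction (a simplification of the 3DFT case).

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(chunks):
    """Parity chunk = XOR of all data chunks in the stripe."""
    p = bytes(len(chunks[0]))
    for c in chunks:
        p = xor_bytes(p, c)
    return p

def reconstruct(surviving, parity):
    """Recover the single missing chunk from the survivors and the parity."""
    lost = parity
    for c in surviving:
        lost = xor_bytes(lost, c)
    return lost

chunks = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
parity = make_parity(chunks)
# Suppose chunks[1] is lost to a chunk error:
recovered = reconstruct([chunks[0], chunks[2]], parity)
```

Even in this simple case, reconstruction must read every surviving chunk of the stripe, which is why caching the right chunks matters so much for partial stripe errors.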
In security frameworks for speculative execution, an instruction is said to reach its Visibility Point (VP) when it is no longer vulnerable to pipeline squashes. Before a potentially leaky instruction reaches its VP, it has to stall unless a defense scheme such as invisible speculation provides protection. Unfortunately, either stalling or protecting the execution of pre-VP instructions typically incurs a performance cost.
Knowledge graphs have been widely used in fact checking, owing to their capability to provide crucial background knowledge that helps verify claims. Traditional fact-checking works mainly focus on analyzing a single claim but have largely ignored analysis of the semantic consistency of pair-wise claims, despite its key importance in real-world applications, e.g., multimodal fake news detection. This paper proposes a neural network based model, INSPECTOR, for pair-wise fact checking. Given a pair of claims, INSPECTOR aims to detect the potential inconsistency of the input...