Zhenheng Tang

ORCID: 0000-0001-8769-9974
Research Areas
  • Stochastic Gradient Optimization Techniques
  • Advanced Neural Network Applications
  • Privacy-Preserving Technologies in Data
  • Advanced Memory and Neural Computing
  • Ferroelectric and Negative Capacitance Devices
  • Sparse and Compressive Sensing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Water Systems and Optimization
  • IoT and Edge/Fog Computing
  • Topic Modeling
  • Satellite Image Processing and Photogrammetry
  • Recommender Systems and Techniques
  • Neural Networks and Applications
  • Cutaneous Melanoma Detection and Management
  • Natural Language Processing Techniques
  • Machine Learning and ELM
  • Remote Sensing and LiDAR Applications
  • Adversarial Robustness in Machine Learning
  • Dermatological and COVID-19 studies
  • Skin Protection and Aging
  • Robotics and Automated Systems
  • Brain Tumor Detection and Classification
  • Age of Information Optimization
  • Remote Sensing in Agriculture
  • Smart Grid Security and Resilience

Hong Kong University of Science and Technology
2024-2025

University of Hong Kong
2024-2025

Hong Kong Baptist University
2019-2024

Pipe bursts in water distribution networks lead to considerable water loss and pose risks of bacterial or pollutant contamination. Burst localisation methods help service providers repair the pipes and restore supply in a timely and efficient manner. Although many studies have been reported on burst detection and localisation, there is a lack of studies on accurate localisation within a potential district using accessible meters. To address this, a novel Burst Location Identification Framework by Fully-linear DenseNet (BLIFF) is proposed. In this framework, additional...

10.1016/j.watres.2019.115058 article EN cc-by Water Research 2019-09-06

Distributed synchronous stochastic gradient descent (S-SGD) with data parallelism has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to exchange gradients iteratively. Recently, Top-k sparsification techniques have been proposed to reduce the volume of data to be exchanged among workers and thus alleviate the network pressure. Top-k sparsification can zero-out a significant portion of gradients without impacting the model...

10.1109/icdcs.2019.00220 article EN 2019-07-01
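
As a rough illustration of the Top-k sparsification idea described in the abstract above, here is a minimal PyTorch-style sketch; the function names are illustrative and this is not the paper's actual implementation.

```python
import torch

def topk_sparsify(grad: torch.Tensor, k: int):
    """Keep only the k largest-magnitude gradient entries (generic sketch)."""
    flat = grad.flatten()
    _, idx = torch.topk(flat.abs(), k)
    values = flat[idx]
    # Only (values, idx) need to be communicated; all other entries are treated as zero.
    return values, idx

def desparsify(values, idx, numel, shape):
    """Rebuild a dense gradient from the communicated sparse pairs."""
    dense = torch.zeros(numel)
    dense[idx] = values
    return dense.view(shape)

# Usage: communicate ~1% of a 10k-element gradient.
g = torch.randn(10_000)
vals, idx = topk_sparsify(g, k=100)
g_hat = desparsify(vals, idx, g.numel(), g.shape)
```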

Distributed deep learning (DL) has become prevalent in recent years to reduce training time by leveraging multiple computing devices (e.g., GPUs/TPUs), due to larger models and datasets. However, system scalability is limited by communication becoming the performance bottleneck. Addressing this communication issue has become a prominent research topic. In this paper, we provide a comprehensive survey of communication-efficient distributed training algorithms, focusing on both system-level and algorithmic-level optimizations. We first propose...

10.48550/arxiv.2003.06307 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Recently, federated learning (FL) techniques have enabled multiple users to train machine learning models collaboratively without data sharing. However, existing FL algorithms suffer from the communication bottleneck due to network bandwidth pressure and/or low bandwidth utilization of the participating clients, in both centralized and decentralized architectures. To deal with this problem while preserving convergence performance, we introduce a communication-efficient federated learning framework, GossipFL. In GossipFL, we 1) design a novel...

10.1109/tpds.2022.3230938 article EN IEEE Transactions on Parallel and Distributed Systems 2022-12-21
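
To make the gossip-plus-sparsification idea concrete, below is a simplified sketch in which each client exchanges a Top-k sparsified update with one randomly matched peer per round; the random pairing and halfway averaging are assumptions for illustration, not GossipFL's actual bandwidth-aware peer selection.

```python
import random
import torch

def gossip_round(models, k):
    """One illustrative gossip round with sparsified peer-to-peer exchange."""
    n = len(models)
    order = list(range(n))
    random.shuffle(order)
    pairs = [(order[i], order[i + 1]) for i in range(0, n - 1, 2)]
    for a, b in pairs:
        for pa, pb in zip(models[a].parameters(), models[b].parameters()):
            diff = pb.data - pa.data
            flat = diff.flatten()
            _, idx = torch.topk(flat.abs(), min(k, flat.numel()))
            sparse = torch.zeros_like(flat)
            sparse[idx] = flat[idx]
            sparse = sparse.view_as(diff)
            # Each peer moves halfway toward the other along the sparse direction.
            pa.data += 0.5 * sparse
            pb.data -= 0.5 * sparse
    return models
```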

Gradient sparsification is a promising technique to significantly reduce the communication overhead in decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradients selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP) to O(k logP), which boosts system scalability. However, it remains unclear whether gTop-k can converge in theory. In this paper, we...

10.24963/ijcai.2019/473 article EN 2019-07-28
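
The O(k logP) complexity comes from a tree-style reduction in which paired workers add their sparse vectors and re-select only the top-k of the sum at each of the logP rounds. A minimal dense simulation of that structure is sketched below (for clarity only; the actual gTop-k implementation works on sparse messages over MPI, and the final result is broadcast back to all workers).

```python
import torch

def topk_sparse(vec, k):
    """Return a dense copy keeping only the k largest-magnitude entries."""
    out = torch.zeros_like(vec)
    _, idx = torch.topk(vec.abs(), k)
    out[idx] = vec[idx]
    return out

def gtopk_reduce(local_grads, k):
    """Illustrative gTop-k-style reduction over log2(P) pairwise rounds."""
    vecs = [topk_sparse(g, k) for g in local_grads]
    p, step = len(vecs), 1
    while step < p:
        for i in range(0, p, 2 * step):
            if i + step < p:
                # Merge two sparse vectors and keep only the top-k of the sum,
                # so each message stays O(k) regardless of the round.
                vecs[i] = topk_sparse(vecs[i] + vecs[i + step], k)
        step *= 2
    return vecs[0]  # global top-k of the aggregated sparse gradients
```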

Over the past years, great progress has been made in improving the computing power of general-purpose graphics processing units (GPGPUs), which facilitates the prosperity of deep neural networks (DNNs) in multiple fields like computer vision and natural language processing. A typical DNN training process repeatedly updates tens of millions of parameters, which not only requires huge computing resources but also consumes significant energy. In order to train DNNs in a more energy-efficient way, we empirically investigate the impact...

10.1145/3307772.3328315 article EN 2019-06-13

Deep learning has become widely used in complex AI applications. Yet, training a deep neural network (DNN) model requires a considerable amount of calculation, long running time, and much energy. Nowadays, many-core accelerators (e.g., GPUs and TPUs) are designed to improve the performance of training. However, processors from different vendors perform dissimilarly in terms of performance and energy consumption. To investigate the differences among several popular off-the-shelf processors (i.e., Intel CPU, NVIDIA GPU, AMD GPU, and Google TPU)...

10.1109/ccgrid49817.2020.00-15 preprint EN 2020-05-01

Federated Learning (FL) is a distributed learning paradigm that can learn a global or personalized model from decentralized datasets on edge devices. However, in the computer vision domain, the model performance of FL is far behind that of centralized training due to the lack of exploration of diverse tasks within a unified framework. FL has rarely been demonstrated effectively on advanced tasks such as object detection and image segmentation. To bridge the gap and facilitate the development of FL for these tasks, in this work, we propose a federated learning library...

10.48550/arxiv.2111.11066 preprint EN cc-by arXiv (Cornell University) 2021-01-01

10.1109/cvprw63382.2024.00575 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2024-06-17

The increasing size of machine learning models, especially deep neural networks, can improve the model generalization capability. However, large models require more training data and computing resources (such as GPU clusters) to train. In distributed training, the communication overhead of exchanging gradients or models among workers becomes a potential system bottleneck that limits scalability. Recently, many research works aim to reduce the communication time in two types of architectures, centralized and decentralized.

10.1109/icdcs47774.2020.00153 article EN 2020-11-01

Nowadays, large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines, in which the extensive communications between workers pose serious scaling problems. In this article, we present a quantitative survey of communication optimization techniques for data-parallel distributed DL. We first identify the major communication challenges and classify the existing solutions into three levels, namely the algorithm, the system architecture, and the network infrastructure. The state-of-the-art...

10.1109/mnet.011.2000530 article EN IEEE Network 2020-12-02

Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, communication, expert computation, and parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) unified abstraction and online profiling of MoE modules for...

10.1145/3669940.3707272 preprint EN 2025-02-03
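
Of the four MoE modules named above, token routing is the simplest to illustrate. The sketch below shows generic top-k gating; the tensor shapes, gate parameterization, and function names are assumptions for illustration, not FSMoE's API.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, gate_weight, top_k=2):
    """Generic top-k MoE routing: each token picks its top_k experts and
    gets combine weights from a softmax over the gate logits."""
    logits = hidden @ gate_weight                       # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    weights, expert_idx = torch.topk(probs, top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
    return expert_idx, weights

# Example: 8 tokens, hidden size 16, 4 experts.
hidden = torch.randn(8, 16)
gate = torch.randn(16, 4)
expert_idx, combine_weights = route_tokens(hidden, gate)
```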

To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that previous KV cache compression methods measure token importance individually, neglecting the dependency between different tokens in real-world language characteristics. In light of this, we introduce ChunkKV, which groups the tokens in a chunk as a basic compression unit, retaining the most informative semantic chunks while discarding the less important ones....

10.48550/arxiv.2502.00299 preprint EN arXiv (Cornell University) 2025-01-31
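
A minimal sketch of the chunk-as-unit idea is shown below: tokens are grouped into fixed-size chunks, each chunk is scored by summing per-token attention scores, and only the highest-scoring chunks are kept. The scoring rule and keep ratio are assumptions for illustration, not ChunkKV's exact method.

```python
import torch

def chunk_kv_compress(keys, values, scores, chunk_size, keep_ratio=0.5):
    """Illustrative chunk-level KV cache compression (sketch)."""
    seq_len = keys.shape[0]
    num_chunks = (seq_len + chunk_size - 1) // chunk_size
    # Score each chunk by the sum of its per-token attention scores.
    chunk_scores = torch.stack([
        scores[i * chunk_size:(i + 1) * chunk_size].sum()
        for i in range(num_chunks)
    ])
    keep = max(1, int(num_chunks * keep_ratio))
    kept_chunks = torch.topk(chunk_scores, keep).indices.sort().values
    # Keep all tokens belonging to the retained chunks, in original order.
    token_idx = torch.cat([
        torch.arange(c * chunk_size, min((c + 1) * chunk_size, seq_len))
        for c in kept_chunks.tolist()
    ])
    return keys[token_idx], values[token_idx], token_idx
```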

This paper investigates an under-explored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. While existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive empirical study evaluating prominent methods across diverse tasks, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our...

10.48550/arxiv.2502.01941 preprint EN arXiv (Cornell University) 2025-02-03

Model merging aggregates Large Language Models (LLMs) finetuned on different tasks into a stronger one. However, parameter conflicts between models lead to performance degradation in averaging. While model routing addresses this issue by selecting individual models during inference, it imposes excessive storage and compute costs, and fails to leverage the common knowledge from different models. In this work, we observe that different layers exhibit varying levels of parameter conflicts. Building on this insight, we average the layers with minimal parameter conflicts and use a novel...

10.48550/arxiv.2502.04411 preprint EN arXiv (Cornell University) 2025-02-06
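
The layer-wise view can be sketched as follows: compute each model's delta from a shared base, measure per-layer conflict (here via cosine similarity between deltas, an assumed proxy), and average only the low-conflict layers. How conflicted layers are handled in the paper differs from the simple fallback used here; this is only a sketch.

```python
import torch
import torch.nn.functional as F

def layerwise_merge(base, model_a, model_b, threshold=0.0):
    """Illustrative layer-wise merging of two finetuned state dicts."""
    merged = {}
    for name in base:
        da = model_a[name] - base[name]          # task vector of model A
        db = model_b[name] - base[name]          # task vector of model B
        cos = F.cosine_similarity(da.flatten(), db.flatten(), dim=0)
        if cos >= threshold:
            # Low conflict: average the task vectors on top of the base.
            merged[name] = base[name] + 0.5 * (da + db)
        else:
            # High conflict: fall back to the base weights in this sketch.
            merged[name] = base[name].clone()
    return merged
```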

One-shot Federated Learning (OFL) is a distributed machine learning paradigm that constrains client-server communication to a single round, addressing the privacy and communication overhead issues associated with multiple rounds of data exchange in traditional Federated Learning (FL). OFL demonstrates practical potential for integration into future approaches that require collaborative training of models, such as large language models (LLMs). However, current OFL methods face two major challenges: data heterogeneity and model heterogeneity, which...

10.48550/arxiv.2502.09104 preprint EN arXiv (Cornell University) 2025-02-13

The growth of large language models (LLMs) increases the challenges of accelerating distributed training across multiple GPUs in different data centers. Moreover, concerns about data privacy and data exhaustion have heightened interest in geo-distributed training. Communication in distributed data parallel (DDP) training with synchronous stochastic gradient descent (S-SGD) is the main bottleneck in low-bandwidth environments. Local SGD mitigates the communication overhead by reducing the synchronization frequency, and recent studies have successfully applied it to...

10.48550/arxiv.2502.11058 preprint EN arXiv (Cornell University) 2025-02-16
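
For reference, the Local SGD pattern mentioned above can be sketched in a few lines: each worker runs several local steps, then workers average parameters once per round instead of synchronizing every mini-batch. This is a generic sketch, not the system described in the paper.

```python
import torch

def local_sgd_round(models, optimizers, data_iters, loss_fn, local_steps):
    """One Local SGD round: local updates followed by parameter averaging."""
    for model, opt, data in zip(models, optimizers, data_iters):
        for _ in range(local_steps):
            x, y = next(data)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    # Periodic synchronization: average parameters across all workers.
    with torch.no_grad():
        for params in zip(*[m.parameters() for m in models]):
            avg = torch.mean(torch.stack([p.data for p in params]), dim=0)
            for p in params:
                p.data.copy_(avg)
```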

Skin disease is one of the most common types of human diseases, which may happen to everyone regardless of age, gender or race. Due to the high visual diversity, diagnosis highly relies on personal experience, and there is a serious shortage of experienced dermatologists in many countries. To alleviate this problem, computer-aided diagnosis with state-of-the-art (SOTA) machine learning techniques would be a promising solution. In this paper, we aim at understanding the performance of convolutional neural network (CNN) based...

10.1109/bigdata47090.2019.9006528 article EN 2019 IEEE International Conference on Big Data (Big Data) 2019-12-01

In federated learning (FL), model performance typically suffers from client drift induced by data heterogeneity, and mainstream works focus on correcting client drift. We propose a different approach named virtual homogeneity learning (VHL) to directly "rectify" the data heterogeneity. In particular, VHL conducts FL with a virtual homogeneous dataset crafted to satisfy two conditions: containing no private information and being separable. The virtual dataset can be generated from pure noise shared across clients, aiming to calibrate the features from the heterogeneous...

10.48550/arxiv.2206.02465 preprint EN other-oa arXiv (Cornell University) 2022-01-01
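
A simplified sketch of the idea follows: generate a shared noise-based virtual dataset from a common seed, then add a calibration term on it during each client's local update. The generation procedure and the weighting term lam are assumptions for illustration, not VHL's exact formulation.

```python
import torch

def make_virtual_dataset(num_classes, per_class, dim, seed=0):
    """Shared 'virtual' dataset from pure noise; identical on every client
    because the seed is shared (illustrative generation, not VHL's exact one)."""
    g = torch.Generator().manual_seed(seed)
    anchors = torch.randn(num_classes, dim, generator=g)
    xs, ys = [], []
    for c in range(num_classes):
        xs.append(anchors[c] + 0.1 * torch.randn(per_class, dim, generator=g))
        ys.append(torch.full((per_class,), c, dtype=torch.long))
    return torch.cat(xs), torch.cat(ys)

def local_step(model, opt, loss_fn, real_batch, virtual_batch, lam=0.5):
    """Local update on real data plus a calibration term on the shared
    virtual data, pulling client features toward a common reference."""
    (xr, yr), (xv, yv) = real_batch, virtual_batch
    opt.zero_grad()
    loss = loss_fn(model(xr), yr) + lam * loss_fn(model(xv), yv)
    loss.backward()
    opt.step()
```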

One-shot neural architecture search (NAS) substantially improves search efficiency by training one supernet to estimate the performance of every possible child architecture (i.e., subnet). However, the inconsistency of characteristics among subnets incurs serious interference in the optimization, resulting in poor ranking correlation of subnets. Subsequent explorations decompose the supernet weights via a particular criterion, e.g., gradient matching, to reduce the interference; yet they suffer from huge computational cost and low space...

10.1609/aaai.v37i6.25949 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2023-06-26

The rapid growth of the memory and computation requirements of large language models (LLMs) has outpaced the development of hardware, hindering people who lack large-scale high-end GPUs from training or deploying LLMs. However, consumer-level GPUs, which constitute a larger market share, are typically overlooked in LLM training and deployment due to their weaker computing performance, smaller storage capacity, and lower communication bandwidth. Additionally, users may have privacy concerns when interacting with remote LLMs. In this...

10.48550/arxiv.2309.01172 preprint EN other-oa arXiv (Cornell University) 2023-01-01

To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, namely pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overheads. Yet, sparsification methods can only initiate the communication after the backpropagation, and hence miss the pipelining opportunity. In this paper, we propose a new...

10.48550/arxiv.1911.08727 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Federated Learning (FL) models often experience client drift caused by heterogeneous data, where the distribution of data differs across clients. To address this issue, advanced research primarily focuses on manipulating the existing gradients to achieve more consistent client models. In this paper, we present an alternative perspective on client drift and aim to mitigate it by generating improved local models. First, we analyze the generalization contribution of local training and conclude that it is bounded by the conditional Wasserstein distance between the data distributions of different...

10.48550/arxiv.2402.07011 preprint EN arXiv (Cornell University) 2024-02-10