- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Visual Attention and Saliency Detection
- COVID-19 diagnosis using AI
- Advanced Vision and Imaging
- Human Pose and Action Recognition
- Natural Language Processing Techniques
- Machine Learning and Data Classification
- Imbalanced Data Classification Techniques
- Image Processing and 3D Reconstruction
- Explainable Artificial Intelligence (XAI)
- Hydrology and Drought Analysis
- Time Series Analysis and Forecasting
- Cloud Data Security Solutions
- Cancer-related molecular mechanisms research
- Customer churn and segmentation
- Electromagnetic Simulation and Numerical Methods
- Cloud Computing and Resource Management
- Text Readability and Simplification
- CCD and CMOS Imaging Sensors
- Online Learning and Analytics
- Software System Performance and Reliability
- Neural Networks and Applications
North China University of Water Resources and Electric Power
2024-2025
Wuhan University
2020-2024
University of California, Merced
2024
Nanyang Technological University
2023-2024
Tianjin Chengjian University
2021
Xidian University
2008
Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups. This technique has numerous real-world applications, such as autonomous driving, image editing, robot sensing, and medical analysis. Over the past decade, deep learning-based methods have made remarkable strides in this area. Recently, transformers, a type of neural network based on self-attention originally designed for natural language processing, have considerably surpassed previous...
In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However, most approaches operate on the closed-set assumption, meaning that the model can only identify pre-defined categories that are present in the training set. Recently, open vocabulary settings were proposed due to the rapid progress of vision language pre-training. These new approaches seek to locate and recognize categories beyond the annotated label space. The open vocabulary approach is more...
Previous multi-task dense prediction studies developed complex pipelines such as multi-modal distillations in multiple stages or searching for task relational contexts for each task. The core insight behind these methods is to maximize the mutual effects of each task. Inspired by recent query-based Transformers, we propose a simple pipeline named Multi-Query Transformer (MQTransformer) that is equipped with multiple queries from different tasks to facilitate reasoning among multiple tasks and simplify the cross-task interaction pipeline....
Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works utilize separate approaches to handle things, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, Panoptic-PartFormer. Moreover, we find that the previous metric PartPQ is biased toward PQ. To handle both issues, we first design a meta-architecture that decouples part features and things/stuff features, respectively. We model...
The carbon cycle in terrestrial ecosystems is a crucial component of the global carbon cycle, and drought is increasingly recognized as a significant stressor impacting their carbon sink function. Net ecosystem productivity (NEP), which is a key indicator of carbon sink capacity, is closely related to vegetation Net Primary Productivity (NPP), derived using the Carnegie-Ames-Stanford Approach (CASA) model. However, there is limited research on desert grassland ecosystems, which offer unique insights due to their long-term data series. The relationship between...
Video segmentation aims to segment and track every pixel in diverse scenarios accurately. In this paper, we present Tube-Link, a versatile framework that addresses multiple core tasks of video segmentation with a unified architecture. Our framework is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks. To enhance the modeling of cross-tube relationships, we propose an effective way to perform tube-level linking via attention along the queries. In addition, we introduce temporal...
Few-shot class-incremental learning (FSCIL) has been a challenging problem as only a few training samples are accessible for each novel class in the new sessions. Finetuning the backbone or adjusting the classifier prototypes trained in the prior sessions would inevitably cause a misalignment between the feature and classifier of old classes, which explains the well-known catastrophic forgetting problem. In this paper, we deal with this misalignment dilemma in FSCIL, inspired by the recently discovered phenomenon named neural collapse, which reveals that...
Advanced by transformer architecture, vision foundation models (VFMs) achieve remarkable progress in performance and generalization ability. Segment Anything Model (SAM) is one remarkable model that can achieve generalized segmentation. However, most VFMs cannot run in real time, which makes it difficult to transfer them into several products. On the other hand, current real-time segmentation mainly has one purpose, such as semantic segmentation on the driving scene. We argue that diverse outputs are needed for real applications. Thus,...
In this work, for the first time, we demonstrate that Mamba-based point cloud methods can outperform point-based methods. Mamba exhibits strong global modeling capabilities and linear computational complexity, making it highly attractive for point cloud analysis. To enable more effective processing of 3-D point cloud data by Mamba, we propose a novel Consistent Traverse Serialization to convert point clouds into 1-D point sequences while ensuring that neighboring points in the sequence are also spatially adjacent. Consistent Traverse Serialization yields six variants...
This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates tokens that...
Proofs of retrievability and proofs of replication are two cryptographic tools that enable a remote server to prove that the users' data has been correctly stored. Nevertheless, the literature either requires the users themselves to perform expensive verification jobs, or relies on a “fully trustworthy” third party auditor (TPA) to execute the public verification. In addition, none of the existing solutions consider the underlying incentive issues behind a rational server who is motivated to collect data but tries to evade checking in order to save...
The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, while CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former...
In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg, a...
Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifically, we design a mixed backbone that contains convolution and RWKV operation, which achieves the best for both accuracy...
The higher order moment method has far fewer unknowns compared to the low order moment methods. However, the computation of the self-term matrix elements is extremely time-consuming. This paper presents an algorithm to extract the singularity by dividing the integrand into two parts on account of Taylor's formula. The first part, with a removable discontinuity, is easy to integrate. The rest consists of three principal singular functions. Their singularities are canceled by the Jacobian of a simple transformation. The singularity extraction leads to rapid and non-redundant...
Education is something that every country values, and education data is a very important resource for the country. With the increase in the proportion of education in the country, the size of the student body is getting bigger and bigger. Student performance is directly related to the core of the entire education. By analyzing a student's information and predicting future performance, this prediction does not only mean the improvement of grades, but can also summarize methods to effectively help avoid the above situation. In this study, some various algorithms of machine...
Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various...
Meteorological and agricultural droughts are inherently correlated, whereas the propagation mechanism between them remains unclear in Northwestern China. Investigating the linkages between these drought types and identifying the potential influencing factors is crucial for effective water resource management and drought mitigation. This study adopted the Standardized Precipitation Evapotranspiration Index (SPEI) and the Standardized Soil Moisture Index (SSMI) to characterize the meteorological and agricultural droughts from 1960 to 2018. The propagation time was detected using the Pearson correlation...
Estimating the 3D structure of the drivable surface and surrounding environment is a crucial task for assisted and autonomous driving. It is commonly solved either by using 3D sensors such as LiDAR or by directly predicting the depth of points via deep learning. However, the former is expensive, and the latter lacks the use of geometry information for the scene. In this paper, instead of following existing methodologies, we propose Road Planar Parallax Attention Network (RPANet), a new deep neural network for 3D sensing from monocular image sequences based on...