- Advanced Neural Network Applications
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Natural Language Processing Techniques
- Industrial Vision Systems and Defect Detection
- Topic Modeling
- Dental Radiography and Imaging
- Video Analysis and Summarization
- Dental Research and COVID-19
- Intelligent Tutoring Systems and Adaptive Learning
- COVID-19 diagnosis using AI
- AI in cancer detection
- Image and Object Detection Techniques
- Human Pose and Action Recognition
- Non-Destructive Testing Techniques
- Integrated Circuits and Semiconductor Failure Analysis
- Virtual Reality Applications and Impacts
- VLSI and Analog Circuit Testing
- Visual Attention and Saliency Detection
- Advanced Optical Imaging Technologies
- Anomaly Detection Techniques and Applications
- Image and Video Quality Assessment
- Engineering and Test Systems
- Advanced Data Compression Techniques
Jiangnan University
2024-2025
Huazhong University of Science and Technology
2020-2022
University of Science and Technology Beijing
2021
Object detection has recently experienced substantial progress. Yet, the widely adopted horizontal bounding box representation is not appropriate for ubiquitous oriented objects such as objects in aerial images and scene texts. In this paper, we propose a simple yet effective framework to detect multi-oriented objects. Instead of directly regressing the four vertices, we glide the vertex of the horizontal bounding box on each corresponding side to accurately describe a multi-oriented object. Specifically, we regress four length ratios characterizing the relative...
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR....
Multi-modal large language models (MLLMs), such as GPT-4, exhibit great comprehension capabilities on human instruction, as well as zero-shot ability on new downstream multi-modal tasks. To integrate the different modalities within a unified embedding space, previous MLLMs attempted to conduct visual instruction tuning with massive and high-quality image-text pair data, which requires substantial costs in data collection and training resources. In this article, we propose TOMGPT (Text-Only GPT),...
In defect detection on metal surfaces, there are many small defects with subtle features that are difficult to distinguish from the background environment using mainstream object detection methods. To alleviate this issue, this study proposes an improved CenterNet model for enhancing the detection of small defects, namely MSDD. In this work, we utilize an attention mechanism to reconstruct the basic feature extraction module in the network, aiming to enhance the focus on features related to defects. Additionally, we redesign an efficient deconvolution module to extract multi‐scale...
Non-maximum suppression (NMS) is widely used in object detection pipelines for removing duplicated bounding boxes. The inconsistency between the confidence for NMS and the real localization confidence seriously affects detection performance. Prior works propose to predict Intersection-over-Union (IoU) between bounding boxes and corresponding ground-truths to improve NMS, while accurately predicting IoU is still a challenging problem. We argue that the complex definition of IoU and feature misalignment make it difficult to predict IoU accurately. In this paper, we propose a novel...
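For readers unfamiliar with the two terms in the abstract above, the following is a minimal sketch of IoU and of standard greedy NMS on axis-aligned boxes. It is a generic textbook illustration, not the method proposed in the paper; the names `iou`, `nms`, `Box`, and `iou_threshold` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example.

```python
from typing import List, Sequence

# Assumed box layout: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already-kept box is at most iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    # The second box overlaps the first heavily and is suppressed.
    print(nms(boxes, scores))
```

Because the ranking uses `scores` alone, a box with a high classification score but poor localization can suppress a better-localized neighbour, which is the inconsistency the abstract refers to.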
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., key-value cache), potentially leading to missed...
Instructional documents are rich sources of knowledge for completing various tasks, yet their unique challenges in conversational question answering (CQA) have not been thoroughly explored. Existing benchmarks have primarily focused on basic factual question-answering from single narrative documents, making them inadequate for assessing a model's ability to comprehend complex real-world instructional documents and provide accurate step-by-step guidance in daily life. To bridge this gap, we present InsCoQA, a novel...
This article introduces a universal semiconductor Automatic Test Pattern Generation (ATPG) solution for the Automated Test Equipment (ATE) platform. With the increasing trend of Artificial Intelligence (AI) and Advanced Driving Assistance System (ADAS), the communication between devices requires advanced protocols such as Mobile Industry Processor Interface (MIPI) and Point-to-point (P2P) protocols. A designer-based ATPG is developed to provide a one-click software approach to create test vectors for common and customized protocols. As...
CLIP (Contrastive Language-Image Pretraining) is well-developed for open-vocabulary zero-shot image-level recognition, while its applications in pixel-level tasks are less investigated, where most efforts directly adopt CLIP features without deliberative adaptations. In this work, we first demonstrate the necessity of image-pixel CLIP feature adaption, then provide Multi-View Prompt learning (MVP-SEG) as an effective solution to achieve image-pixel adaptation and to solve open-vocabulary semantic segmentation. Concretely, MVP-SEG...