- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Video Surveillance and Tracking Methods
- Multimodal Machine Learning Applications
- Robotics and Sensor-Based Localization
- Domain Adaptation and Few-Shot Learning
- Autonomous Vehicle Technology and Safety
- Adversarial Robustness in Machine Learning
- Human Pose and Action Recognition
- Neural Networks and Applications
- Advanced Vision and Imaging
- Visual Attention and Saliency Detection
- Cell Image Analysis Techniques
- Face Recognition and Analysis
- Neural Networks Stability and Synchronization
- Anomaly Detection Techniques and Applications
- Integrated Circuits and Semiconductor Failure Analysis
- Translation Studies and Practices
- Higher Education and Teaching Methods
- Video Analysis and Summarization
- Advanced Memory and Neural Computing
- Advanced Graph Neural Networks
- Image Processing Techniques and Applications
- Text and Document Classification Technologies
- Medical Image Segmentation Techniques
Rochester Institute of Technology
2021-2025
Southwest University
2019-2023
University of Michigan
2022
Baxter (United States)
2022
Michigan Department of Transportation
2022
Purdue University West Lafayette
2019-2022
Shandong Institute for Product Quality Inspection
2021
Purdue University System
2020
Tianjin Tianhe Hospital
2017-2019
Tianjin Medical University
2017-2019
Video instance segmentation (VIS) is a new and critical task in computer vision. To date, top-performing VIS methods extend the two-stage Mask R-CNN by adding a tracking branch, leaving plenty of room for improvement. In contrast, we approach this task from a new perspective and propose a one-stage spatial granularity network (SG-Net). Compared to conventional two-stage methods, SG-Net demonstrates four advantages: 1) our method has a compact architecture in which each task head (detection, segmentation, tracking) is crafted interdependently...
Video object detection is a challenging task because isolated video frames may encounter appearance deterioration, which introduces great confusion for detection. One popular solution is to exploit temporal information and enhance the per-frame representation by aggregating features from neighboring frames. Despite achieving improvements in detection, existing methods focus on the selection of higher-level frames for aggregation rather than modeling lower-level temporal relations to increase feature...
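As a rough illustration of the neighbor-frame aggregation idea this abstract refers to, the sketch below weights features from a temporal window by their similarity to the center frame. Shapes, the cosine-similarity weighting, and all names are illustrative assumptions rather than the paper's actual design.

```python
# Minimal sketch of per-frame feature enhancement by aggregating neighboring
# frames; the similarity-based weighting is an assumption, not the paper's method.
import torch
import torch.nn.functional as F

def aggregate_neighbor_features(frame_feats: torch.Tensor, center_idx: int) -> torch.Tensor:
    """frame_feats: (T, C, H, W) backbone features of a temporal window."""
    T, C, H, W = frame_feats.shape
    center = frame_feats[center_idx]                                  # (C, H, W)
    flat = frame_feats.reshape(T, C, -1)                              # (T, C, H*W)
    center_flat = center.reshape(C, -1)                               # (C, H*W)
    # Per-location cosine similarity between the center frame and each neighbor.
    sim = F.cosine_similarity(flat, center_flat.unsqueeze(0), dim=1)  # (T, H*W)
    weights = sim.softmax(dim=0).unsqueeze(1)                         # (T, 1, H*W)
    enhanced = (flat * weights).sum(dim=0)                            # (C, H*W)
    return enhanced.reshape(C, H, W)

feats = torch.randn(5, 256, 32, 32)   # 5-frame window of features
enhanced_center = aggregate_neighbor_features(feats, center_idx=2)
```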
In this work, we introduce a Denser Feature Network (DenserNet) for visual localization. Our work provides three principal contributions. First, we develop a convolutional neural network (CNN) architecture which aggregates feature maps at different semantic levels for image representations. Using denser feature maps, our method can produce more keypoint features and increase image retrieval accuracy. Second, our model is trained end-to-end without pixel-level annotation other than positive and negative GPS-tagged image pairs....
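A minimal sketch of multi-scale feature-map aggregation in the spirit of DenserNet: upsample maps from several backbone stages to a common resolution, concatenate them channel-wise, and L2-normalize. The stages, channel counts, and fusion choice are assumptions for illustration, not the published architecture.

```python
# Illustrative multi-level feature aggregation; stage shapes are assumptions.
import torch
import torch.nn.functional as F

def aggregate_multiscale(feat_maps):
    """feat_maps: list of (B, C_i, H_i, W_i) tensors from different CNN stages."""
    target_size = feat_maps[0].shape[-2:]        # use the finest resolution
    upsampled = [F.interpolate(f, size=target_size, mode="bilinear",
                               align_corners=False) for f in feat_maps]
    dense = torch.cat(upsampled, dim=1)          # channel-wise concatenation
    return F.normalize(dense, p=2, dim=1)        # L2-normalized dense descriptors

f1 = torch.randn(1, 256, 64, 64)    # shallow, high-resolution stage
f2 = torch.randn(1, 512, 32, 32)    # mid-level stage
f3 = torch.randn(1, 1024, 16, 16)   # deep, semantic stage
dense_map = aggregate_multiscale([f1, f2, f3])   # (1, 1792, 64, 64)
```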
Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model the global-local vision representation for sentence generation, leaving plenty of room for improvement. In this work, we approach the video captioning task from a new perspective and propose a GLR framework, namely a Global-Local Representation granularity. Our approach demonstrates three advantages over prior efforts. First, we propose a simple solution, which exploits extensive...
We introduce a novel method using a new generative model that automatically learns effective representations of the target and background appearance to detect, segment and track each instance in a video sequence. Differently from current discriminative tracking-by-detection solutions, our proposed hierarchical structural embedding learning can predict more high-quality masks with accurate boundary details over the spatio-temporal space via normalizing flows. We formulate the inference procedure as an embedded...
Optical flow is an indispensable building block for various important computer vision tasks, including motion estimation, object tracking, and disparity measurement. In this work, we propose TransFlow, a pure transformer architecture for optical flow estimation. Compared to dominant CNN-based methods, TransFlow demonstrates three advantages. First, it provides more accurate correlation and trustworthy matching in flow estimation by utilizing spatial self-attention and cross-attention mechanisms between adjacent...
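The snippet below sketches self-attention within one frame followed by cross-attention against the next frame, the general mechanism the abstract credits for more trustworthy matching. Token layout, dimensions, and the module itself are illustrative assumptions, not TransFlow's implementation.

```python
# Illustrative self- plus cross-attention between adjacent frame tokens.
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Self-attention within frame 1, then cross-attention from frame 1 to frame 2."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tok1, tok2):
        tok1 = tok1 + self.self_attn(tok1, tok1, tok1)[0]   # within-frame context
        corr = tok1 + self.cross_attn(tok1, tok2, tok2)[0]  # match against frame 2
        return corr   # correlation-aware tokens, used downstream for flow decoding

tok1 = torch.randn(1, 32 * 32, 256)  # frame-1 tokens (flattened 32x32 grid)
tok2 = torch.randn(1, 32 * 32, 256)  # frame-2 tokens
out = CrossFrameAttention()(tok1, tok2)
```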
Learning pyramidal feature representations is important for many dense prediction tasks (e.g., object detection, semantic segmentation) that demand multi-scale visual understanding. The Feature Pyramid Network (FPN) is a well-known architecture for such learning; however, intrinsic weaknesses in feature extraction and fusion impede the production of informative features. This work addresses the weaknesses of FPN through a novel tripartite feature enhanced pyramid network (TFPN), with three distinct and effective designs. First, we develop...
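For context, here is a minimal top-down FPN fusion of the kind the abstract critiques: lateral 1x1 convolutions plus upsample-and-add. TFPN's actual enhancements are not reproduced here, and the channel sizes are assumptions.

```python
# Plain FPN-style top-down fusion over three backbone stages (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample the coarser level and add it to the finer one.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]

c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16)])
p3, p4, p5 = TinyFPN()([c3, c4, c5])
```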
This paper proposes to use the three vectors in a rotation matrix as the representation for head pose estimation and develops a new neural network based on the characteristics of such a representation. We address two potential issues that exist in current works: 1. Public datasets for head pose estimation use either Euler angles or quaternions to annotate data samples. However, both of these annotations have the issue of discontinuity and thus could result in some performance drop during training. 2. Most research works report Mean Absolute Error (MAE) of Euler angles as the measurement...
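A small worked example of the vector-based idea: build a rotation matrix from Euler angles and score a prediction by the mean angle between corresponding column vectors. The rotation-order convention and the error definition below are assumptions for illustration, not the paper's exact metric.

```python
# Rotation-matrix representation and a vector-angle error, as an illustration.
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Rotation matrix from Euler angles (radians); ZYX order assumed here."""
    Rz = np.array([[np.cos(yaw), -np.sin(yaw), 0],
                   [np.sin(yaw),  np.cos(yaw), 0],
                   [0, 0, 1]])
    Ry = np.array([[np.cos(pitch), 0, np.sin(pitch)],
                   [0, 1, 0],
                   [-np.sin(pitch), 0, np.cos(pitch)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(roll), -np.sin(roll)],
                   [0, np.sin(roll),  np.cos(roll)]])
    return Rz @ Ry @ Rx

def mean_vector_error(R_pred, R_gt):
    """Average angle (degrees) between corresponding column vectors."""
    cosines = np.clip(np.sum(R_pred * R_gt, axis=0), -1.0, 1.0)
    return np.degrees(np.arccos(cosines)).mean()

R_gt = euler_to_rotation(0.3, -0.1, 0.05)
R_pred = euler_to_rotation(0.28, -0.12, 0.06)
print(mean_vector_error(R_pred, R_gt))
```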
Particleboard surface defect detection technology is of great significance to the automation of particleboard inspection, but current methods have disadvantages such as low accuracy and poor real-time performance. Therefore, this paper proposes an improved lightweight detection method based on You Only Look Once v5 (YOLOv5), namely PB-YOLOv5 (Particle Board-YOLOv5). Firstly, the gamma-ray transform and the image difference method are combined to deal with the uneven illumination of the acquired images, so that the illumination is well corrected. Secondly, Ghost Bottleneck...
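To make the illumination-correction step concrete, the sketch below applies a gamma transform and a simple background difference. The exact transform, parameters, and background estimate used in PB-YOLOv5 are not specified in the abstract, so treat these as stand-ins.

```python
# Stand-in preprocessing: gamma correction plus a background-difference step.
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float = 0.6) -> np.ndarray:
    """Apply a gamma transform (via lookup table) to a uint8 grayscale image."""
    lut = (np.linspace(0, 1, 256) ** gamma * 255).astype(np.uint8)
    return lut[image]

def illumination_difference(image: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Subtract an estimated illumination background and re-center around mid-gray."""
    diff = image.astype(np.int16) - background.astype(np.int16) + 128
    return np.clip(diff, 0, 255).astype(np.uint8)

board = (np.random.rand(256, 256) * 255).astype(np.uint8)     # fake board image
corrected = gamma_correct(board)
flattened = illumination_difference(corrected, background=np.full_like(board, 100))
```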
Structure information extraction refers to the task of extracting structured text fields from web pages, such as a product offer on a shopping page including title, description, brand and price. It is an important research topic which has been widely studied in document understanding and web search. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. However, effectively serializing tokens from unstructured web pages is challenging in practice due to a variety...
We devise deep nearest centroids (DNC), a conceptually elegant yet surprisingly effective network for large-scale visual recognition, by revisiting Nearest Centroids, one of the most classic and simple classifiers. Current deep models learn the classifier in a fully parametric manner, ignoring the latent data structure and lacking simplicity and explainability. DNC instead conducts nonparametric, case-based reasoning; it utilizes sub-centroids of training samples to describe class distributions and clearly explains...
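A toy sketch of the case-based reasoning DNC describes: compute a few sub-centroids per class from training features (a short k-means pass) and label a query by its nearest sub-centroid. The clustering routine and feature dimensions are assumptions, not the paper's training procedure.

```python
# Nonparametric classification by nearest sub-centroid (illustrative only).
import torch

def compute_subcentroids(feats, labels, num_classes, k=4):
    """Cluster each class's features into k sub-centroids with a few Lloyd steps."""
    centroids = []
    for c in range(num_classes):
        x = feats[labels == c]
        centers = x[torch.randperm(len(x))[:k]].clone()
        for _ in range(10):
            assign = torch.cdist(x, centers).argmin(dim=1)
            for j in range(k):
                if (assign == j).any():
                    centers[j] = x[assign == j].mean(dim=0)
        centroids.append(centers)
    return torch.stack(centroids)                       # (num_classes, k, D)

def dnc_predict(query, centroids):
    """Predicted label = class of the nearest sub-centroid."""
    d = torch.cdist(query, centroids.flatten(0, 1))     # (N, num_classes * k)
    return d.argmin(dim=1) // centroids.shape[1]

feats, labels = torch.randn(200, 64), torch.randint(0, 5, (200,))
centroids = compute_subcentroids(feats, labels, num_classes=5)
pred = dnc_predict(torch.randn(8, 64), centroids)
```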
Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model the global-local representation across video frames for caption generation, leaving plenty of room for improvement. In this work, we approach the video captioning task from a new perspective and propose a GL-RG framework for video captioning, namely a Global-Local Representation Granularity. Our GL-RG demonstrates three advantages over prior efforts: 1)...
When analyzing data from in situ RNA detection technologies, cell segmentation is an essential step in identifying cell boundaries, assigning reads to cells, and studying the gene expression and morphological features of cells. We developed a deep-learning-based method, GeneSegNet, that integrates both gene expression and imaging information to perform cell segmentation. GeneSegNet also employs a recursive training strategy to deal with noisy training labels. We show that GeneSegNet significantly improves segmentation performance over existing methods that either ignore or...
We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. Our proposed DVP seamlessly embeds a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic steps, which span RoI identification, style transfer, and position manipulation, facilitating transparent and controllable image translation processes. Extensive experiments demonstrate DVP's remarkable...
Monocular Depth Estimation (MDE) plays a vital role in applications such as autonomous driving. However, various attacks target MDE models, with physical attacks posing significant threats to system security. Traditional adversarial training methods, which require ground-truth labels, are not directly applicable to MDE models that lack ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) overlook the domain knowledge of MDE, resulting in suboptimal performance. In this work, we...
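One way to read the label-free hardening idea is a self-supervised adversarial loop that perturbs the input to maximally change the model's own depth prediction, then trains for consistency. The sketch below is an assumption-laden stand-in (a plain PGD variant), not the paper's method.

```python
# Self-supervised PGD against a stand-in depth model; no ground-truth depth used.
import torch

def depth_pgd(model, image, steps=5, eps=0.03, alpha=0.01):
    """Perturb `image` to maximally change the model's own depth prediction."""
    with torch.no_grad():
        ref_depth = model(image)                       # model's clean prediction
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = (model(adv) - ref_depth).abs().mean()   # deviation from clean output
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv + alpha * grad.sign()
        adv = image + (adv - image).clamp(-eps, eps)   # project into the eps-ball
        adv = adv.clamp(0, 1).detach().requires_grad_(True)
    return adv.detach()

model = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1))  # toy "MDE" net
clean = torch.rand(1, 3, 64, 64)
adversarial = depth_pgd(model, clean)
# Hardening step (sketch): minimize (model(adversarial) - model(clean)).abs().mean()
```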
Semi-supervised learning based on consistency learning offers significant promise for enhancing medical image segmentation. Current approaches use copy-paste as an effective data perturbation technique to facilitate weak-to-strong consistency learning. However, these techniques often lead to a decrease in the accuracy of the synthetic labels corresponding to the synthetic data and introduce excessive perturbations to the distribution of the training data. Such over-perturbation causes the data to stray from its true distribution, thereby impairing the model's...
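The copy-paste perturbation mentioned above can be sketched as pasting a random region from one image (and its label or pseudo-label) onto another. The region shape, mixing direction, and pseudo-label source here are illustrative assumptions.

```python
# Illustrative copy-paste mixing between a labeled and an unlabeled sample.
import torch

def copy_paste(img_a, lbl_a, img_b, lbl_b, ratio=0.5):
    """Paste a random rectangle of (img_a, lbl_a) onto (img_b, lbl_b)."""
    _, H, W = img_a.shape
    h, w = int(H * ratio), int(W * ratio)
    y = torch.randint(0, H - h + 1, (1,)).item()
    x = torch.randint(0, W - w + 1, (1,)).item()
    mask = torch.zeros(1, H, W)
    mask[:, y:y + h, x:x + w] = 1.0
    mixed_img = img_a * mask + img_b * (1 - mask)
    mixed_lbl = lbl_a * mask.long() + lbl_b * (1 - mask).long()
    return mixed_img, mixed_lbl

labeled_img = torch.randn(3, 256, 256)
labeled_lbl = torch.randint(0, 2, (1, 256, 256))
unlabeled_img = torch.randn(3, 256, 256)
pseudo_lbl = torch.randint(0, 2, (1, 256, 256))   # a teacher prediction stands in here
mixed_img, mixed_lbl = copy_paste(labeled_img, labeled_lbl, unlabeled_img, pseudo_lbl)
```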
We present CLUSTSEG, a general, transformer-based framework that tackles different image segmentation tasks (i.e., superpixel, semantic, instance, and panoptic) through a unified neural clustering scheme. Regarding queries as cluster centers, CLUSTSEG is innovative in two aspects: 1) cluster centers are initialized in heterogeneous ways so as to pointedly address task-specific demands (e.g., instance- or category-level distinctiveness), yet without modifying the architecture; 2) pixel-cluster assignment,...
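To illustrate the queries-as-cluster-centers view, the sketch below alternates soft pixel-to-center assignment with center updates, EM-style. The temperature, iteration count, and update rule are assumptions for illustration, not CLUSTSEG's cross-attention design.

```python
# EM-style alternation between pixel-cluster assignment and center updates.
import torch

def cluster_assign(pixel_feats, centers, iters=3, tau=0.1):
    """pixel_feats: (N, D) pixel embeddings; centers: (K, D) query/cluster centers."""
    for _ in range(iters):
        logits = pixel_feats @ centers.t() / tau                       # (N, K)
        assign = logits.softmax(dim=1)                                 # soft assignment
        centers = (assign.t() @ pixel_feats) / (assign.sum(dim=0, keepdim=True).t() + 1e-6)
    return assign.argmax(dim=1), centers                               # hard masks, new centers

pixels = torch.randn(64 * 64, 256)    # flattened pixel embeddings
queries = torch.randn(8, 256)         # 8 learnable queries acting as cluster centers
masks, updated_centers = cluster_assign(pixels, queries)
```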
Fine-tuning large vision-language models is a challenging task. Prompt tuning approaches have been introduced to learn fixed textual or visual prompts while freezing the pre-trained model in downstream tasks. Despite the effectiveness of prompt tuning, what those learnable prompts learn remains unexplained. In this work, we explore whether fine-tuning can benefit from knowledge-aware prompts learned during pre-training, by designing two different sets of prompts for the pre-training and fine-tuning phases respectively. Specifically, we present a Video-Language Prompt tuning (VL-Prompt)...
As the size of transformer-based models continues to grow, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, there is still a significant performance gap compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E2VPT)...
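A minimal visual-prompt-tuning sketch, assuming a frozen transformer encoder: only a handful of prepended prompt tokens and a linear head are trainable. The token layout, prompt count, and readout are illustrative, not E2VPT's specific design.

```python
# Freeze the backbone; train only prompt tokens and a classification head.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, encoder, dim=768, num_prompts=10, num_classes=100):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                      # backbone stays frozen
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):
        B = patch_tokens.shape[0]
        tokens = torch.cat([self.prompts.expand(B, -1, -1), patch_tokens], dim=1)
        feats = self.encoder(tokens)                     # (B, P + N, dim)
        return self.head(feats[:, 0])                    # classify from the first token

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 12, batch_first=True), 2)
model = PromptedViT(encoder)
logits = model(torch.randn(4, 196, 768))                 # 14x14 patch tokens
```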
Self-supervised space-time correspondence learning utilizing unlabeled videos holds great potential in computer vision. Most existing methods rely on contrastive learning with mining negative samples or on adapting reconstruction from the image domain, which requires dense affinity across multiple frames or optical flow constraints. Moreover, video correspondence models need to uncover more inherent properties of the video, such as structural information. In this work, we propose HiGraph+, a sophisticated...
Qifan Wang, Jingang Wang, Xiaojun Quan, Fuli Feng, Zenglin Xu, Shaoliang Nie, Sinong Wang, Madian Khabsa, Hamed Firooz, Dongfang Liu. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.