Yujie Zhong

ORCID: 0009-0007-9127-3387
Research Areas
  • Human Pose and Action Recognition
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Video Surveillance and Tracking Methods
  • Anomaly Detection Techniques and Applications
  • Domain Adaptation and Few-Shot Learning
  • Video Analysis and Summarization
  • Advanced Neural Network Applications
  • Autonomous Vehicle Technology and Safety
  • Image Processing Techniques and Applications
  • Machine Learning and Data Classification
  • Natural Language Processing Techniques
  • Image Retrieval and Classification Techniques
  • Neural Networks and Applications
  • Topic Modeling
  • Advanced Text Analysis Techniques
  • Gait Recognition and Analysis
  • Sports Analytics and Performance
  • Image and Signal Denoising Methods
  • Transportation and Mobility Innovations
  • Face Recognition and Analysis
  • Magnetic Confinement Fusion Research
  • Laser-Plasma Interactions and Diagnostics
  • Image Enhancement Techniques
  • Advanced Image Processing Techniques

Xi'an Jiaotong University
2025

Huazhong University of Science and Technology
2024

Meizu (China)
2024

In this paper, we present a one-stage framework TriDet for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank loss problem of self-attention that takes place in the video features and aggregate...

10.1109/cvpr52729.2023.01808 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
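The core idea of the Trident-head, taking a boundary as the expectation over a predicted distribution of relative offsets rather than regressing a single value, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the bin count, and the use of NumPy are all mine, not the paper's implementation.

```python
import numpy as np

def expected_offset(logits):
    """Turn per-bin boundary logits into an expected relative offset.

    Instead of regressing one boundary value directly, the head predicts
    a probability distribution over candidate offsets (bins) and takes
    its expectation, which degrades gracefully when the boundary is
    ambiguous.
    """
    p = np.exp(logits - np.max(logits))   # numerically stable softmax
    p /= p.sum()
    return float(np.dot(p, np.arange(len(logits))))
```

A flat distribution over four candidate offsets yields the midpoint 1.5, while a sharply peaked distribution collapses to its argmax bin.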

Object re-identification (ReID) aims to find instances with the same identity as a given probe from a large gallery. Pairwise losses play an important role in training a strong ReID network. Existing pairwise losses densely exploit each instance as an anchor and sample its triplets in a mini-batch. This dense sampling mechanism inevitably introduces positive pairs that share few visual similarities, which can be harmful to training. To address this problem, we propose a novel loss paradigm termed Sparse Pairwise (SP)...

10.1109/cvpr52729.2023.01886 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
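As a rough sketch of the sparse-sampling idea (not the paper's actual SP loss; the margin value and the hardest-pair mining rule here are illustrative assumptions), each class contributes only its hardest positive pair and hardest negative pair, instead of dense triplets around every anchor:

```python
import numpy as np

def sparse_pairwise_loss(feats, labels, margin=0.3):
    """Sparse pairwise loss sketch: one hardest positive pair and one
    hardest negative pair per class, rather than dense per-anchor triplets."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                  # cosine similarity matrix
    losses = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        other = np.where(labels != c)[0]
        if len(idx) < 2 or len(other) == 0:
            continue
        # hardest positive: least-similar pair within the class
        pos = min(sim[i, j] for i in idx for j in idx if i < j)
        # hardest negative: most-similar cross-class pair
        neg = max(sim[i, j] for i in idx for j in other)
        losses.append(max(0.0, neg - pos + margin))
    return float(np.mean(losses)) if losses else 0.0
```

Well-separated identities incur zero loss, while overlapping identities are penalized by the margin violation of their single hardest pair.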

Document-level relation extraction (DocRE) aims at predicting the relations of all entity pairs in one document, which plays an important role in information extraction. DocRE is more challenging than previous sentence-level relation extraction, as it often requires coreference and logical reasoning across multiple sentences. Graph-based methods are the mainstream solution to this complex reasoning in DocRE. They generally construct heterogeneous graphs with entities, mentions, and sentences as nodes, and co-occurrence...

10.1109/tpami.2025.3528246 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2025-01-01

Recent LSS-based multi-view 3D object detection has made tremendous progress by processing the features in Bird's-Eye-View (BEV) via a convolutional detector. However, typical convolution ignores the radial symmetry of the BEV features and increases the difficulty of detector optimization. To preserve the inherent property of the BEV features and ease the optimization, we propose an azimuth-equivariant convolution (AeConv) and an azimuth-equivariant anchor. The sampling grid of AeConv is always in the radial direction, thus it can learn azimuth-invariant features. The proposed anchor enables the detection head to...

10.1109/cvpr52729.2023.02067 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
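The azimuth-equivariant sampling described above can be illustrated with a toy grid rotation: each BEV cell's sampling offsets are rotated by the cell's azimuth, so one grid axis always points radially. This is a sketch only; the real AeConv operates on feature maps inside a convolution, and the function name here is an assumption.

```python
import numpy as np

def radial_sampling_grid(x, y, base_offsets):
    """Rotate a convolution sampling grid by the azimuth of BEV cell (x, y),
    keeping the grid aligned with the radial direction everywhere.

    base_offsets: (k, 2) array of sampling offsets, one row per grid point.
    """
    theta = np.arctan2(y, x)                       # azimuth of this cell
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return base_offsets @ rot.T                    # rotate each offset row
```

At a cell straight ahead of the ego vehicle the grid is unchanged; at a cell 90 degrees to the side, the same offset pattern is rotated to point along that cell's radius, which is what makes the learned features azimuth-invariant.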

The SoccerNet 2022 challenges were the second annual video understanding challenges organized by the SoccerNet team. In 2022, the challenges were composed of 6 vision-based tasks: (1) action spotting, focusing on retrieving action timestamps in long untrimmed videos, (2) replay grounding, focusing on retrieving the live moment of an action shown in a replay, (3) pitch localization, focusing on detecting line and goal part elements, (4) camera calibration, dedicated to retrieving intrinsic and extrinsic camera parameters, (5) player re-identification, focusing on retrieving the same players across multiple views, and (6) multiple object tracking, focusing on tracking...

10.1145/3552437.3558545 preprint EN 2022-09-30

Pre-trained visual-language (ViL) models have demonstrated good zero-shot capability in video understanding tasks, where they are usually adapted through fine-tuning or temporal modeling. However, in the task of open-vocabulary temporal action localization (OV-TAL), such adaptation reduces the robustness of ViL models against different data distributions, leading to a misalignment between visual representations and text descriptions of unseen categories. As a result, existing methods often strike a trade-off between detection...

10.1109/tpami.2024.3395778 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2024-05-02

Recently, the open-vocabulary semantic segmentation problem has attracted increasing attention, and the best performing methods are based on two-stream networks: one stream for proposal mask generation and the other for segment classification using a pre-trained visual-language model. However, existing two-stream methods require passing a great number of (up to a hundred) image crops into the visual-language model, which is highly inefficient. To address the problem, we propose a network that only needs a single pass through the visual-language model for each input image...

10.1109/iccv51070.2023.00106 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Existing end-to-end Multi-Object Tracking (e2e-MOT) methods have not surpassed non-end-to-end tracking-by-detection methods. One potential reason is the label assignment strategy during training, which consistently binds the tracked objects with tracking queries and then assigns the few newborns to detection queries. With one-to-one bipartite matching, such an assignment will yield unbalanced training, i.e., scarce positive samples for detection queries, especially in an enclosed scene, as the majority of newborns come on stage at...

10.48550/arxiv.2305.12724 preprint EN cc-by arXiv (Cornell University) 2023-01-01

In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e., zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture termed Counting Transformer (CounTR), which explicitly captures the similarity between image patches and the given "exemplars" with an attention mechanism; (2) we adopt a two-stage...

10.48550/arxiv.2208.13721 preprint EN other-oa arXiv (Cornell University) 2022-01-01
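The patch-exemplar similarity mechanism behind counting can be sketched as follows: score each image patch against the exemplar features, squash the scores into a density-like map, and sum it to get a count. This is a toy illustration under my own assumptions (cosine similarity, a sigmoid gate, and the temperature value); CounTR itself uses learned attention inside a transformer.

```python
import numpy as np

def count_by_similarity(patch_feats, exemplar_feats, tau=10.0):
    """Predict a count as the sum of a per-patch density map, where each
    patch is scored by its best cosine similarity to any exemplar."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    e = exemplar_feats / np.linalg.norm(exemplar_feats, axis=1, keepdims=True)
    sim = p @ e.T                                   # (patches, exemplars)
    best = sim.max(axis=1)                          # best match per patch
    density = 1.0 / (1.0 + np.exp(-tau * (best - 0.5)))  # soft threshold
    return float(density.sum()), density
```

With three patches matching the exemplar and three orthogonal to it, the density map sums to roughly three, mirroring how a density-map counter turns similarity into an object count.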

10.1109/cvpr52733.2024.01339 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

In person re-identification (ReID) tasks, many works explore the learning of part features to improve performance over global image features. Existing methods explicitly extract part features by either using a hand-designed image division or keypoints obtained with external visual systems. In this work, we propose to learn Discriminative implicit Parts (DiPs) which are decoupled from explicit body parts. Therefore, DiPs can learn any discriminative features that benefit in distinguishing identities, which is beyond predefined parts (such...

10.48550/arxiv.2212.13906 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Despite recent progress in video and language representation learning, the weak or sparse correspondence between the two modalities remains a bottleneck in the area. Most video-language models are trained via a pair-level loss to predict whether a pair of video and text is aligned. However, even in paired video-text segments, only a subset of the frames are semantically relevant to the corresponding text, with the remainder representing noise; the ratio of noisy frames is higher for longer videos. We propose FineCo (Fine-grained Contrastive Loss for Frame...

10.48550/arxiv.2210.05039 preprint EN cc-by arXiv (Cornell University) 2022-01-01
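A frame-level contrastive objective in the spirit described above can be sketched like this: frames scoring highest against the text act as positives, and the remaining (likely noisy) frames serve as in-video negatives in an InfoNCE term. The function name, the top-k selection rule, and the temperature are illustrative assumptions, not FineCo's actual formulation.

```python
import numpy as np

def frame_level_infonce(frame_feats, text_feat, top_k=2, tau=0.07):
    """Contrast the most text-relevant frames against the rest of the clip.

    frame_feats: (num_frames, d) frame embeddings
    text_feat:   (d,) text embedding
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = f @ t / tau                      # temperature-scaled similarities
    pos = np.argsort(sims)[-top_k:]         # indices of positive frames
    log_z = np.log(np.exp(sims).sum())      # InfoNCE partition term
    return float(np.mean(log_z - sims[pos]))
```

When relevant frames clearly stand out from the noise, the loss is lower than when all frames look identical to the text, which is exactly the fine-grained signal a pair-level loss cannot provide.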

Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Despite focusing on different events, we observe that they have a significant connection. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we aim to investigate the potential synergy between TAD and MR. Firstly, we propose a unified architecture, termed UniMD, for both tasks. It...

10.48550/arxiv.2404.04933 preprint EN arXiv (Cornell University) 2024-04-07

Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless, there are two constraints that restrict the further application of these vLLMs: the incapability of handling multiple targets per query and the failure to identify the absence of query objects in the image. In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries. Consequently, we define a general sequence format for complex queries. Then...

10.48550/arxiv.2404.08506 preprint EN arXiv (Cornell University) 2024-04-12

Omnidirectional (360-degree) video is rapidly gaining popularity due to advancements in immersive technologies like virtual reality (VR) and extended reality (XR). However, real-time streaming of such videos, especially in live mobile scenarios such as unmanned aerial vehicles (UAVs), is challenged by limited bandwidth and strict latency constraints. Traditional methods, such as compression and adaptive resolution, help but often compromise video quality and introduce artifacts that degrade the viewer experience. Additionally, the unique...

10.48550/arxiv.2411.06738 preprint EN arXiv (Cornell University) 2024-11-11

Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall ability to generalize. To bridge these gaps, we propose DriveMM, a general multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad...

10.48550/arxiv.2412.07689 preprint EN arXiv (Cornell University) 2024-12-10

Existing methods enhance the training of detection transformers by incorporating an auxiliary one-to-many assignment. In this work, we treat the model as a multi-task framework, simultaneously performing one-to-one and one-to-many predictions. We investigate the roles of each component in the transformer decoder across these two training targets, including self-attention, cross-attention, and the feed-forward network. Our empirical results demonstrate that any independent component can effectively learn both targets simultaneously, even when...

10.48550/arxiv.2412.10028 preprint EN arXiv (Cornell University) 2024-12-13
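The contrast between the two assignment targets can be shown on a toy cost matrix. As a caveat, greedy matching stands in here for the Hungarian algorithm that detection transformers actually use, and the matrix values and k are arbitrary.

```python
import numpy as np

def one_to_one(cost):
    """Greedy one-to-one assignment: each ground truth gets exactly one
    query (a simplified stand-in for Hungarian bipartite matching)."""
    m = cost.astype(float).copy()
    pairs = []
    for _ in range(min(cost.shape)):
        q, g = np.unravel_index(np.argmin(m), m.shape)
        pairs.append((int(q), int(g)))
        m[q, :] = np.inf                  # query q is taken
        m[:, g] = np.inf                  # ground truth g is matched
    return sorted(pairs)

def one_to_many(cost, k=2):
    """Auxiliary one-to-many assignment: each ground truth is matched to
    its k lowest-cost queries, yielding more positive samples."""
    return sorted((int(q), g) for g in range(cost.shape[1])
                  for q in np.argsort(cost[:, g])[:k])
```

With three queries and two ground truths, one-to-one supervision produces two positive pairs while one-to-many (k=2) produces four, which is the extra positive signal the auxiliary target contributes during training.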

Generative models have recently exhibited exceptional capabilities in text-to-image generation, but still struggle to generate image sequences coherently. In this work, we focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling. We make the following three contributions: (i) to fulfill the task of visual storytelling, we propose a learning-based auto-regressive generation model, termed StoryGen, with a novel vision-language context module,...

10.48550/arxiv.2306.00973 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Significant progress has been achieved in multi-object tracking (MOT) through the evolution of detection and re-identification (ReID) techniques. Despite these advancements, accurately tracking objects in scenarios with homogeneous appearance and heterogeneous motion remains a challenge. This challenge arises from two main factors: the insufficient discriminability of ReID features and the predominant utilization of linear motion models in MOT. In this context, we introduce a novel motion-based tracker, MotionTrack, centered around...

10.48550/arxiv.2306.02585 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. The unclear boundaries of actions in videos often result in imprecise predictions by existing methods. To resolve this issue, we propose a one-stage framework named TriDet. First, a Trident-head is proposed to model the action boundary via an estimated relative probability distribution around the boundary. Then, we analyze the rank-loss problem (i.e., instant discriminability deterioration) in transformer-based methods and propose an efficient...

10.48550/arxiv.2309.05590 preprint EN other-oa arXiv (Cornell University) 2023-01-01

In this paper, we present a one-stage framework TriDet for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank loss problem of self-attention that takes place in the video features and aggregate...

10.48550/arxiv.2303.07347 preprint EN other-oa arXiv (Cornell University) 2023-01-01