Mengxue Qu

ORCID: 0000-0001-9432-0205
Research Areas
  • Multimodal Machine Learning Applications
  • Human Pose and Action Recognition
  • Topic Modeling
  • Advanced Image and Video Retrieval Techniques
  • Industrial Vision Systems and Defect Detection
  • Domain Adaptation and Few-Shot Learning
  • Advanced Neural Network Applications
  • Adversarial Robustness in Machine Learning
  • Medical Image Segmentation Techniques
  • Subtitles and Audiovisual Media
  • Oceanographic and Atmospheric Processes
  • Heavy metals in environment
  • Arctic and Antarctic ice dynamics
  • Plant Stress Responses and Tolerance
  • Aluminum toxicity and tolerance in plants and animals
  • Advanced Vision and Imaging
  • Epigenetics and DNA Methylation
  • Video Analysis and Summarization
  • Cancer-related gene regulation
  • Explainable Artificial Intelligence (XAI)
  • Artificial Intelligence in Healthcare and Education
  • Underwater Acoustics Research
  • Plant tissue culture and regeneration
  • Plant Virus Research Studies
  • Visual Attention and Saliency Detection

Beijing Jiaotong University
2022-2025

Wuhan University
2023

Shandong Agricultural University
2023

First Institute of Oceanography
2022

Ministry of Natural Resources
2022

Advertising is pervasive in everyday life. Some advertisements are not readily comprehensible; they convey a deeper message or purpose, which is referred to as "meaningful advertising". These ads often aim to create an emotional connection with the audience or promote a social cause. Developing methods for automatically understanding meaningful advertising would benefit the dissemination and creation of such ads. However, current ad understanding models primarily focus on superficial aspects of images. In this...

10.1145/3720546 article EN ACM Transactions on Multimedia Computing, Communications, and Applications 2025-02-27

Referring Expression Segmentation (RES) can facilitate pixel-level semantic alignment between vision and language. Most existing RES approaches require massive annotations, which are expensive and exhaustive to collect. In this paper, we propose a new partially supervised training paradigm for RES, i.e., using abundant referring bounding boxes and only a few (e.g., 1%) masks. To maximize transferability from the REC model, we construct our model based on a point-based sequence prediction model. ...

10.1109/cvpr52729.2023.00295 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
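The partially supervised setup above, where every sample has a referring box but only a small fraction also has a mask, can be sketched as a toy mixed loss. The L1 box term, the pixel-disagreement mask term, and the weight lam are illustrative assumptions, not the paper's actual objective:

```python
def mixed_loss(pred_box, gt_box, pred_mask=None, gt_mask=None, lam=1.0):
    """Box supervision for every sample; mask supervision only for the
    few samples that carry a mask annotation. The L1 box term, the
    pixel-disagreement mask term, and the weight lam are toy choices."""
    box_loss = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    if gt_mask is None:
        return box_loss          # box-only sample (the ~99% case)
    # mask term: fraction of disagreeing pixels over a flattened mask
    mask_loss = sum(p != g for p, g in zip(pred_mask, gt_mask)) / len(gt_mask)
    return box_loss + lam * mask_loss
```

A box-only sample falls back to pure box supervision, so the rare mask annotations only ever add signal rather than gate training.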

10.1109/cvprw63382.2024.00191 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2024-06-17

In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce intention grounding, a new object detection task employing RGB-D that is based on human intention, such as "I want something to support my back". Closely related, visual grounding focuses on understanding a reference: to achieve it, a human must observe the scene, reason out the target that aligns with the intention ("pillow" in this case), and finally provide the reference to the AI system, e.g., "A pillow on the couch". Instead,...

10.48550/arxiv.2405.18295 preprint EN arXiv (Cornell University) 2024-05-28

Spatio-Temporal Video Grounding (STVG) aims at localizing the spatio-temporal tube of a specific object in an untrimmed video given a free-form natural language query. As annotating tubes is labor-intensive, recent works are motivated to explore weakly supervised approaches, which usually result in significant performance degradation. To achieve a less expensive STVG method with acceptable accuracy, this work investigates the "single-frame supervision" paradigm that requires a single frame...

10.1109/tpami.2024.3415087 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2024-01-01

Intention-oriented object detection aims to detect desired objects based on specific intentions or requirements. For instance, when we desire to "lie down and rest", we instinctively seek out a suitable option such as a "bed" or "sofa" that can fulfill our needs. Previous work in this area is limited either by the number of intention descriptions or by the affordance vocabulary available for objects. These limitations make it challenging to handle open environments effectively. To facilitate this research, we construct...

10.48550/arxiv.2310.17290 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Semi-Supervised Visual Grounding (SSVG) is a new challenge due to its sparse labeled data combined with the need for multimodal understanding. A previous study, RefTeacher, makes the first attempt to tackle this task by adopting a teacher-student framework that provides pseudo confidence supervision and attention-based supervision. However, this approach is incompatible with current state-of-the-art visual grounding models, which follow a Transformer-based pipeline. These pipelines directly regress results without region proposals or...

10.48550/arxiv.2407.03251 preprint EN arXiv (Cornell University) 2024-07-03
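The pseudo confidence supervision in a teacher-student framework like the one described above boils down to filtering teacher predictions by confidence before they supervise the student. A minimal sketch, where the threshold value and the (box, confidence) tuple format are assumptions for illustration:

```python
def select_pseudo_labels(teacher_preds, tau=0.7):
    """Keep only teacher predictions whose confidence clears the
    threshold tau; the survivors serve as pseudo labels for the
    student. Both tau and the (box, confidence) format are
    illustrative, not taken from the paper."""
    return [(box, conf) for box, conf in teacher_preds if conf >= tau]
```

Raising tau trades pseudo-label quantity for quality, which is the central tuning knob in this style of semi-supervised training.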

Video Temporal Grounding (VTG) aims to ground specific segments within an untrimmed video corresponding to a given natural language query. Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases. To address these challenges, we present ChatVTG, a novel approach that utilizes dialogue with Large Language Models (LLMs) for zero-shot video temporal grounding. Our ChatVTG leverages LLMs to generate multi-granularity segment captions...

10.48550/arxiv.2410.12813 preprint EN arXiv (Cornell University) 2024-10-01
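Once segment captions exist, zero-shot grounding reduces to matching the query against them and returning the best segment's time span. A crude sketch of that matching step, using Jaccard word overlap as a hypothetical stand-in for whatever similarity the paper actually uses:

```python
def ground_query(segments, query):
    """Return the segment whose caption best matches the query by
    Jaccard word overlap -- a toy, hypothetical stand-in for the
    caption-query matching step in zero-shot temporal grounding."""
    q = set(query.lower().split())

    def score(seg):
        c = set(seg["caption"].lower().split())
        return len(q & c) / max(len(q | c), 1)  # Jaccard similarity

    return max(segments, key=score)


# Hypothetical usage: two captioned segments, one matching query
segs = [
    {"span": (0, 5), "caption": "a man opens the door"},
    {"span": (5, 12), "caption": "a dog runs across the yard"},
]
best = ground_query(segs, "the dog running in the yard")
```

The appeal of the zero-shot formulation is visible even in this toy: no temporal annotations are needed, only captions and a similarity function.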

Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, limiting free-form self-generation. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method...

10.48550/arxiv.2412.00153 preprint EN arXiv (Cornell University) 2024-11-29

In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers, and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. In particular, SiRi conveys a significant principle to the research of visual grounding, i.e., a better initialized encoder helps the model converge to a better local minimum, advancing performance accordingly. Specifically, we continually update the parameters of the encoder as training goes on, while periodically re-initializing the rest of the parameters to compel the model to be...

10.48550/arxiv.2207.13325 preprint EN other-oa arXiv (Cornell University) 2022-01-01
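The SiRi idea above, keeping the encoder's accumulated progress while periodically re-initializing the remaining parameters, can be sketched with scalar stand-ins for the two parameter groups. The 0.1 update step, the targets, and the schedule are all toy assumptions, not the paper's training recipe:

```python
def train_siri(num_epochs, period):
    """Toy Selective Retraining loop: a scalar 'encoder' weight keeps
    all of its training progress, while a scalar 'decoder' weight is
    re-initialized every `period` epochs, so each retraining round
    restarts from a better-trained encoder."""
    enc, dec = 0.0, 0.0
    reinit_epochs = []
    for epoch in range(num_epochs):
        if epoch > 0 and epoch % period == 0:
            dec = 0.0                  # re-initialize all but the encoder
            reinit_epochs.append(epoch)
        enc += 0.1 * (1.0 - enc)       # encoder: uninterrupted progress
        dec += 0.1 * (1.0 - dec)       # decoder: restarts each period
    return enc, dec, reinit_epochs
```

With `train_siri(10, 4)` the decoder restarts at epochs 4 and 8, so the encoder ends far closer to its target than the decoder, a toy analogue of the claim that the continually trained encoder carries the benefit across re-initializations.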