- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Topic Modeling
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Natural Language Processing Techniques
- Ultrasonics and Acoustic Wave Propagation
- Machine Learning in Healthcare
- Text and Document Classification Technologies
- Biomedical Text Mining and Ontologies
- Handwritten Text Recognition Techniques
- Magnetic Properties and Applications
- Multimedia Communication and Technology
- Human Pose and Action Recognition
- Law, AI, and Intellectual Property
- Business Law and Ethics
- Image Retrieval and Classification Techniques
- Video Coding and Compression Technologies
- Structural Health Monitoring Techniques
University of Wisconsin System
1993
Motivated by the superior performance of image diffusion models, more and more researchers strive to extend these models to the text-based video editing task. Nevertheless, current methods mainly suffer from the dilemma between high fine-tuning cost and limited generation capacity. Compared with images, we conjecture that videos necessitate additional constraints to preserve temporal consistency during editing. Towards this end, we propose EVE, a robust and Efficient zero-shot Video Editing method. Under the guidance of depth maps...
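As a rough illustration of depth-guided, text-driven frame editing (this is not the authors' EVE pipeline), the sketch below runs an off-the-shelf depth ControlNet over each frame with a fixed seed as one crude way to encourage temporal consistency. The checkpoint names and the frame list are illustrative assumptions.

```python
# A minimal sketch of depth-guided per-frame editing, assuming public
# ControlNet/Stable Diffusion checkpoints; not the EVE method itself.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from transformers import pipeline as hf_pipeline
from PIL import Image

depth_estimator = hf_pipeline("depth-estimation")  # monocular depth per frame
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def edit_video(frames: list[Image.Image], prompt: str, seed: int = 0):
    """Edit each frame under depth-map guidance; reusing one seed across
    frames is a simple (imperfect) temporal-consistency heuristic."""
    edited = []
    for frame in frames:
        depth = depth_estimator(frame)["depth"]
        generator = torch.Generator("cuda").manual_seed(seed)
        edited.append(pipe(prompt, image=depth, generator=generator).images[0])
    return edited
```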
Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities in various multi-modal tasks. Nevertheless, their performance on fine-grained image understanding tasks is still limited. To address this issue, this paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. Specifically, we present a method for constructing an instruction tuning dataset at low cost by leveraging annotations from existing datasets. A self-consistent bootstrapping method is also introduced to extend dense object annotations into...
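To make the low-cost construction idea concrete, here is a hedged sketch of turning existing region-level annotations (hypothetical COCO-style boxes) into instruction-response pairs; the templates and field names are assumptions, not the paper's actual pipeline.

```python
# Sketch: convert existing box annotations into instruction-tuning samples.
import json
import random

TEMPLATES = [
    "What object is located in the region {box}?",
    "Describe the object inside the bounding box {box}.",
]

def boxes_to_instructions(annotations):
    """annotations: iterable of dicts with 'image', 'bbox', 'category'."""
    samples = []
    for ann in annotations:
        box = [round(v, 1) for v in ann["bbox"]]
        samples.append({
            "image": ann["image"],
            "instruction": random.choice(TEMPLATES).format(box=box),
            "response": f"The region contains a {ann['category']}.",
        })
    return samples

anns = [{"image": "000001.jpg", "bbox": [34.0, 50.0, 120.0, 200.0],
         "category": "dog"}]
print(json.dumps(boxes_to_instructions(anns), indent=2))
```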
Multimodal alignment between language and vision is a fundamental topic in current vision-language model research. Contrastive Captioners (CoCa), as a representative method, integrates Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework, yielding impressive results. CLIP imposes bidirectional constraints on the global representations of entire images and sentences. Although IC conducts unidirectional image-to-text generation on local representations, it lacks any constraint...
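For concreteness, the bidirectional constraint referred to here is typically implemented as a symmetric InfoNCE loss over paired global embeddings. A minimal PyTorch sketch, with shapes and the temperature value as illustrative choices:

```python
# Symmetric (bidirectional) CLIP-style contrastive loss sketch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) global embeddings of paired images
    and sentences; matched pairs share the same row index."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Constrain both directions: image->text and text->image matching.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```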
Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, VLMs supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. Toward this end, we introduce a comprehensive bilingual (Chinese-English) dataset, BM-6B, with over 6 billion image-text pairs, aimed at enhancing multimodal foundation models so they understand images well in both languages. To handle a dataset of such scale, we propose a novel grouped...
Multi-modal Large Language Models (MLLMs) have advanced significantly, offering powerful vision-language understanding capabilities. However, these models often inherit severe social biases from their training datasets, leading to unfair predictions based on attributes like race and gender. This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive Counterfactual dataset with Multiple Social Concepts (CMSC), which provides a more diverse and extensive set compared to existing...
We present a framework for learning cross-modal video representations by directly pre-training on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and proxy tasks. First, based on the shortcomings of two mainstream pixel-level pre-training architectures (limited applications or less efficient), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine visual and textual features simultaneously, SNP is lightweight and could support various applications. Second,...
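A rough sketch of the shared-network idea as described, where a single BERT-type transformer encoder refines video and text token features with the same weights; the dimensions and the modality embedding are assumptions, not the paper's exact SNP architecture.

```python
# Sketch: one shared transformer encoder refining both modalities.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, dim=768, layers=4, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Learned embeddings telling the shared weights which modality
        # a given sequence comes from (a common design choice).
        self.modality = nn.Embedding(2, dim)

    def forward(self, feats, modality_id):
        feats = feats + self.modality.weight[modality_id]
        return self.encoder(feats)

shared = SharedEncoder()
video_feats = torch.randn(2, 16, 768)   # (batch, frame/patch tokens, dim)
text_feats = torch.randn(2, 32, 768)    # (batch, word tokens, dim)
v = shared(video_feats, modality_id=0)  # the same weights refine both
t = shared(text_feats, modality_id=1)
```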
We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP. Built upon popular image-text models like CLIP, most current adaptation-based pre-training methods are confronted by three major issues, i.e., a noisy data corpus, time-consuming pre-training, and limited performance gain. Towards this end, we conduct a comprehensive study covering four critical steps in pre-training. Specifically, we investigate 1)...
In the era of social media video platforms, popular "hot-comments" play a crucial role in attracting user impressions for short-form videos, making them vital for marketing and branding purposes. However, existing research predominantly focuses on generating descriptive comments or "danmaku" in English, offering immediate reactions to specific video moments. Addressing this gap, our study introduces HotVCom, the largest Chinese hot-comment dataset, comprising 94k diverse videos and 137 million...
Pre-trained vision-language models have notably accelerated the progress of open-world concept recognition. Their impressive zero-shot ability has recently been transferred to multi-label image classification via prompt tuning, enabling the discovery of novel labels in an open-vocabulary manner. However, this paradigm suffers from non-trivial training costs and becomes computationally prohibitive for a large number of candidate labels. To address this issue, we note that vision-language pre-training aligns images and texts...
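As background for the zero-shot transfer described here, a minimal sketch of scoring an image against candidate labels by embedding prompted label texts with CLIP and comparing cosine similarities; the prompt wording and thresholding strategy are illustrative choices, not the paper's method.

```python
# Zero-shot multi-label scoring with CLIP embeddings (sketch).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def multilabel_scores(image: Image.Image, labels: list[str]) -> torch.Tensor:
    prompts = [f"a photo of a {label}" for label in labels]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.t()).squeeze(0)  # one similarity per candidate label

# Labels whose similarity clears a tuned threshold are predicted present.
```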
The authors describe a modified Mermelstein articulatory model and present an analytical description of the configuration of the vocal tract as well as the relationship between articulators. Based on this model, parameters can be estimated directly from the speech signal by solving a constrained optimization problem. An adaptive technique is used to find the values of the model parameters that minimize the difference between model spectra and measured spectra.
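To illustrate the estimation step, here is a hedged sketch that fits parameters by minimizing a spectral distance under box constraints; the toy spectrum synthesizer and parameter bounds are placeholders, and the paper's actual articulatory model and adaptive technique are not reproduced.

```python
# Sketch: constrained fit of model parameters to a measured spectrum.
import numpy as np
from scipy.optimize import minimize

FREQS = np.linspace(0, 4000, 256)  # Hz grid for the spectra

def model_spectrum(params):
    """Toy stand-in for the articulatory-to-acoustic mapping: two
    formant-like peaks whose centre frequencies are the parameters."""
    f1, f2 = params
    return (np.exp(-((FREQS - f1) / 150.0) ** 2) +
            0.6 * np.exp(-((FREQS - f2) / 200.0) ** 2))

def fit_params(measured, x0=(400.0, 1500.0)):
    # Bounds play the role of physical constraints on the articulators.
    bounds = [(200, 900), (800, 2500)]
    return minimize(lambda p: np.sum((model_spectrum(p) - measured) ** 2),
                    x0, method="L-BFGS-B", bounds=bounds)

measured = model_spectrum(np.array([650.0, 1100.0]))  # synthetic target
print(fit_params(measured).x)
```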