- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Topic Modeling
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Natural Language Processing Techniques
- Ultrasonics and Acoustic Wave Propagation
- Machine Learning in Healthcare
- Text and Document Classification Technologies
- Biomedical Text Mining and Ontologies
- Handwritten Text Recognition Techniques
- Magnetic Properties and Applications
- Multimedia Communication and Technology
- Human Pose and Action Recognition
- Law, AI, and Intellectual Property
- Business Law and Ethics
- Image Retrieval and Classification Techniques
- Video Coding and Compression Technologies
- Structural Health Monitoring Techniques
University of Wisconsin System
1993
Motivated by the superior performance of image diffusion models, more and more researchers strive to extend these models to the text-based video editing task. Nevertheless, current methods mainly suffer from the dilemma between high fine-tuning cost and limited generation capacity. Compared with images, we conjecture that videos necessitate additional constraints to preserve temporal consistency during editing. Towards this end, we propose EVE, a robust and Efficient zero-shot Video Editing method. Under the guidance of depth maps...
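As a rough illustration of depth-guided, text-driven frame editing (this is not the authors' EVE pipeline), the sketch below runs an off-the-shelf depth ControlNet over each frame with a fixed seed as one crude way to encourage temporal consistency. The checkpoint names and the frame list are illustrative assumptions.

```python
# A minimal sketch of depth-guided per-frame editing, assuming public
# ControlNet/Stable Diffusion checkpoints; not the EVE method itself.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from transformers import pipeline as hf_pipeline
from PIL import Image

depth_estimator = hf_pipeline("depth-estimation")  # monocular depth per frame
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def edit_video(frames: list[Image.Image], prompt: str, seed: int = 0):
    """Edit each frame under depth-map guidance; reusing one seed across
    frames is a simple (imperfect) temporal-consistency heuristic."""
    edited = []
    for frame in frames:
        depth = depth_estimator(frame)["depth"]
        generator = torch.Generator("cuda").manual_seed(seed)
        edited.append(pipe(prompt, image=depth, generator=generator).images[0])
    return edited
```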
Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities in various multi-modal tasks. Nevertheless, their performance on fine-grained image understanding tasks is still limited. To address this issue, this paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. Specifically, we present a method for constructing an instruction tuning dataset at low cost by leveraging annotations from existing datasets. A self-consistent bootstrapping method is also introduced to extend dense object annotations into...
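To make the low-cost construction idea concrete, here is a hedged sketch of turning existing region-level annotations (hypothetical COCO-style boxes) into instruction-response pairs; the templates and field names are assumptions, not the paper's actual pipeline.

```python
# Sketch: convert existing box annotations into instruction-tuning samples.
import json
import random

TEMPLATES = [
    "What object is located in the region {box}?",
    "Describe the object inside the bounding box {box}.",
]

def boxes_to_instructions(annotations):
    """annotations: iterable of dicts with 'image', 'bbox', 'category'."""
    samples = []
    for ann in annotations:
        box = [round(v, 1) for v in ann["bbox"]]
        samples.append({
            "image": ann["image"],
            "instruction": random.choice(TEMPLATES).format(box=box),
            "response": f"The region contains a {ann['category']}.",
        })
    return samples

anns = [{"image": "000001.jpg", "bbox": [34.0, 50.0, 120.0, 200.0],
         "category": "dog"}]
print(json.dumps(boxes_to_instructions(anns), indent=2))
```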
Multimodal alignment between language and vision is a fundamental topic in current vision-language model research. Contrastive Captioners (CoCa), as a representative method, integrates Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework, yielding impressive results. CLIP imposes bidirectional constraints on the global representations of entire images and sentences. Although IC conducts unidirectional image-to-text generation on local representations, it lacks any constraint...
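For concreteness, the bidirectional constraint referred to here is typically implemented as a symmetric InfoNCE loss over paired global embeddings. A minimal PyTorch sketch, with shapes and the temperature value as illustrative choices:

```python
# Symmetric (bidirectional) CLIP-style contrastive loss sketch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) global embeddings of paired images
    and sentences; matched pairs share the same row index."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Constrain both directions: image->text and text->image matching.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```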
Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, VLMs supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. Toward this end, we introduce a comprehensive bilingual (Chinese-English) dataset, BM-6B, with over 6 billion image-text pairs, aimed at enhancing multimodal foundation models so they understand images well in both languages. To handle a dataset of such scale, we propose a novel grouped...
Multi-modal Large Language Models (MLLMs) have advanced significantly, offering powerful vision-language understanding capabilities. However, these models often inherit severe social biases from their training datasets, leading to unfair predictions based on attributes like race and gender. This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive Counterfactual dataset with Multiple Social Concepts (CMSC), which provides a more diverse and extensive set compared to existing...
We present a framework for learning cross-modal video representations by directly pre-training on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and proxy tasks. First, based on the shortcomings of two mainstream pixel-level pre-training architectures (limited applications or less efficient), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine visual and textual features simultaneously, SNP is lightweight and could support various applications. Second,...
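A rough sketch of the shared-network idea as described, where a single BERT-type transformer encoder refines video and text token features with the same weights; the dimensions and the modality embedding are assumptions, not the paper's exact SNP architecture.

```python
# Sketch: one shared transformer encoder refining both modalities.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, dim=768, layers=4, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Learned embeddings telling the shared weights which modality
        # a given sequence comes from (a common design choice).
        self.modality = nn.Embedding(2, dim)

    def forward(self, feats, modality_id):
        feats = feats + self.modality.weight[modality_id]
        return self.encoder(feats)

shared = SharedEncoder()
video_feats = torch.randn(2, 16, 768)   # (batch, frame/patch tokens, dim)
text_feats = torch.randn(2, 32, 768)    # (batch, word tokens, dim)
v = shared(video_feats, modality_id=0)  # the same weights refine both
t = shared(text_feats, modality_id=1)
```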
We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP. Built upon popular image-text models like CLIP, most current adaptation-based pre-training methods are confronted by three major issues, i.e., a noisy data corpus, time-consuming pre-training, and limited performance gain. Towards this end, we conduct a comprehensive study covering four critical steps in pre-training. Specifically, we investigate 1)...
In the era of social media video platforms, popular "hot-comments" play a crucial role in attracting user impressions for short-form videos, making them vital for marketing and branding purposes. However, existing research predominantly focuses on generating descriptive comments or "danmaku" in English, offering immediate reactions to specific video moments. Addressing this gap, our study introduces HotVCom, the largest Chinese hot-comment dataset, comprising 94k diverse videos and 137 million...
Pre-trained vision-language models have notably accelerated the progress of open-world concept recognition. Their impressive zero-shot ability has recently been transferred to multi-label image classification via prompt tuning, enabling the discovery of novel labels in an open-vocabulary manner. However, this paradigm suffers from non-trivial training costs and becomes computationally prohibitive for a large number of candidate labels. To address this issue, we note that vision-language pre-training aligns images and texts...
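As background for the zero-shot transfer described here, a minimal sketch of scoring an image against candidate labels by embedding prompted label texts with CLIP and comparing cosine similarities; the prompt wording and thresholding strategy are illustrative choices, not the paper's method.

```python
# Zero-shot multi-label scoring with CLIP embeddings (sketch).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def multilabel_scores(image: Image.Image, labels: list[str]) -> torch.Tensor:
    prompts = [f"a photo of a {label}" for label in labels]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.t()).squeeze(0)  # one similarity per candidate label

# Labels whose similarity clears a tuned threshold are predicted present.
```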
The authors describe a modified Mermelstein articulatory model and present an analytical description of the configuration of the vocal tract as well as the relationship between articulators. Based on this model, parameters can be estimated directly from the speech signal by solving a constrained optimization problem. An adaptive technique is used to find the values of the model parameters that minimize the difference between model spectra and measured spectra.
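To illustrate the estimation step, here is a hedged sketch that fits parameters by minimizing a spectral distance under box constraints; the toy spectrum synthesizer and parameter bounds are placeholders, and the paper's actual articulatory model and adaptive technique are not reproduced.

```python
# Sketch: constrained fit of model parameters to a measured spectrum.
import numpy as np
from scipy.optimize import minimize

FREQS = np.linspace(0, 4000, 256)  # Hz grid for the spectra

def model_spectrum(params):
    """Toy stand-in for the articulatory-to-acoustic mapping: two
    formant-like peaks whose centre frequencies are the parameters."""
    f1, f2 = params
    return (np.exp(-((FREQS - f1) / 150.0) ** 2) +
            0.6 * np.exp(-((FREQS - f2) / 200.0) ** 2))

def fit_params(measured, x0=(400.0, 1500.0)):
    # Bounds play the role of physical constraints on the articulators.
    bounds = [(200, 900), (800, 2500)]
    return minimize(lambda p: np.sum((model_spectrum(p) - measured) ** 2),
                    x0, method="L-BFGS-B", bounds=bounds)

measured = model_spectrum(np.array([650.0, 1100.0]))  # synthetic target
print(fit_params(measured).x)
```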