- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Advanced Image and Video Retrieval Techniques
- Advanced Computational Techniques and Applications
- Sentiment Analysis and Opinion Mining
- Music and Audio Processing
- Image Retrieval and Classification Techniques
- Anomaly Detection Techniques and Applications
- Data Mining Algorithms and Applications
- Time Series Analysis and Forecasting
- Scientific Computing and Data Management
- Data Analysis with R
- Technology and Security Systems
- Embedded Systems and FPGA Design
- Layered Double Hydroxides Synthesis and Applications
- Research Data Management Practices
- Embedded Systems Design Techniques
- Software Engineering Techniques and Practices
- Advanced Text Analysis Techniques
- Quantum Computing Algorithms and Architecture
- Diabetes Treatment and Management
- Advanced Materials and Mechanics
- Dynamics and Control of Mechanical Systems
- Advanced Software Engineering Methodologies
University of Electronic Science and Technology of China (2012-2025)
Collaborative Innovation Center of Advanced Microstructures (2024)
Nanjing University (2012-2024)
Liaoning University (2024)
NARI Group (China) (2023)
Amgen (United States) (2021-2022)
Yale University (2021)
Zhejiang Ocean University (2019)
Chongqing University of Posts and Telecommunications (2017)
University of Connecticut (2013)
The materials discovery process can be significantly expedited and simplified if we can learn effectively from available knowledge and data. In the present contribution, we show that efficient and accurate prediction of a diverse set of properties of material systems is possible by employing machine (or statistical) learning methods trained on quantum mechanical computations in combination with the notions of chemical similarity. Using a family of one-dimensional chain systems, we present a general formalism that allows us to discover...
Video event grounding aims at retrieving the most relevant moments from an untrimmed video in terms of a given natural language query. Most previous works focus on Video Sentence Grounding (VSG), which localizes the moment with a sentence query. Recently, researchers extended this task to Video Paragraph Grounding (VPG) by retrieving multiple events with a paragraph. However, we find that existing VPG methods may not perform well on context modeling and rely heavily on video-paragraph annotations. To tackle this problem, we propose a novel VPG method termed Semi-supervised...
The Weakly-Supervised Audio-Visual Video Parsing (AVVP) task aims to parse a video into temporal segments and predict their event categories in terms of modalities, labeling them as either audible, visible, or both. Since the temporal boundaries and modality annotations are not provided, and only video-level event labels are available, this task is more challenging than conventional video understanding tasks. Most previous works attempt to analyze videos by jointly modeling the audio and video data and then learning information from the segment-level...
The Composed Query-Based Image Retrieval (CQBIR) task aims to precisely obtain the preserved and modified parts, based on multi-grained semantics learned from the composed query. Since the composed query includes a reference image and a modification text, not just a single modality, this task is more challenging than general image retrieval tasks. Most previous methods attempt to learn the preserved and modified parts via different attention modules and fuse...
Artificial intelligence, particularly language models (LMs), is reshaping research paradigms across scientific domains. In the fields of chemistry and pharmacy, chemical language models (CLMs) have achieved remarkable success in two-dimensional (2D) molecular modeling tasks by leveraging one-dimensional (1D) representations of molecules, such as SMILES and SELFIES. However, extending these successes to three-dimensional (3D) molecular modeling remains a significant challenge, largely due to the absence of effective 1D representations for capturing 3D...
Video Moment Retrieval (VMR) aims at retrieving the most relevant events from an untrimmed video with natural language queries. Existing VMR methods suffer from two defects: (1) massive, expensive temporal annotations are required to obtain satisfying performance; (2) complicated cross-modal interaction modules are deployed, which lead to high computational cost and low efficiency for the retrieval process. To address these issues, we propose a novel method termed Cheaper and Faster Moment Retrieval (CFMR), which balances accuracy,...
Text-based video retrieval (TVR) is a well-studied task aimed at retrieving relevant videos from a large collection in response to a given text query. Most existing TVR works assume that videos are already trimmed and fully relevant to the query, thus ignoring that in most real-world scenarios videos are untrimmed and contain massive irrelevant content. Moreover, as users' queries are only relevant to events rather than complete videos, it is also more practical to provide specific events rather than an untrimmed video list. In this paper, we introduce a challenging but realistic task called...
Temporal language grounding (TLG) is one of the most challenging cross-modal video understanding tasks, which aims at retrieving the most relevant video segment from an untrimmed video according to a natural language sentence. The existing methods can be separated into two dominant types: 1) proposal-based and 2) proposal-free methods, where the former conduct contextual interactions and the latter localize timestamps flexibly. However, constant-scale candidates in proposal-based methods limit the localization precision and bring extra computational costs. In...
In this article, we study the challenging cross-modal image retrieval task, Composed Query-Based Image Retrieval (CQBIR), in which the query is not a single text query but a composed query, i.e., a reference image and a modification text. Compared with the conventional image-text retrieval task, CQBIR is more challenging as it requires properly preserving and modifying the specific image region according to the multi-level semantic information learned from the multi-modal query. Most recent works focus on extracting the preserved and modified information and compositing it into a unified...
Multimodal Sentiment Analysis (MSA) aims at teaching computers or robots to understand human sentiment with diverse multimodal signals, including audio, vision, and text. Current MSA approaches primarily concentrate on devising fusion strategies for multimodal signals and trying to learn better joint multimodal representations. However, employing multimodal signals directly is not appropriate since the psychological states are fuzzy and cannot be categorized easily, which undermines the effectiveness of existing methods. In this paper, we regard...
Reproducible document standards, like R Markdown, facilitate the programmatic creation of documents whose content is itself programmatically generated. While programmatic content alone may not be sufficient for a rendered document, since it does not include prose (content generated by an author to provide context, narrative, etc.), programmatic generation can provide substantial efficiencies for structuring and constructing documents. This paper explores the programmatic generation of reproducible documents by distinguishing components that can be created by computational means from those requiring...
Video Paragraph Grounding (VPG) aims at retrieving multiple relevant moments from an untrimmed video with a given natural language paragraph query. However, the complex paragraph query brings more challenges to multimodal fusion and context modeling, which have limited the performance of existing VPG methods. To this end, we propose a novel framework for VPG in this paper, termed Graph-based Transformer with Language Reconstruction (GTLR). It consists of three components: (1) a Multimodal Graph Encoder for conducting graph reasoning...
Compositional temporal grounding (CTG) aims to localize the most relevant segment from an untrimmed video based on a given natural language sentence, and the test samples for this task contain novel components not seen in training. However, existing CTG methods suffer from two shortcomings: (1) Most adopt transformers to model global information only, thus failing to balance long-range perception and regional representation of sequences; (2) Due to the lack of aligning videos and sentences at a fine-grained level, the model's...