- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Topic Modeling
- Domain Adaptation and Few-Shot Learning
- Advanced Graph Neural Networks
- Music and Audio Processing
- Bioinformatics and Genomic Networks
Tsinghua University
2021-2023
Tsinghua–Berkeley Shenzhen Institute
2022
Beijing Normal University
2018
Multilingual knowledge graphs (KGs) such as DBpedia and YAGO contain structured knowledge of entities in several distinct languages, and they are useful resources for cross-lingual AI and NLP applications. Cross-lingual KG alignment is the task of matching entities with their counterparts in different languages, which is an important way to enrich the cross-lingual links in multilingual KGs. In this paper, we propose a novel approach for cross-lingual KG alignment via graph convolutional networks (GCNs). Given a set of pre-aligned entities, our approach trains GCNs to embed entities of each language into a unified...
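The two building blocks the abstract names, GCN propagation over each KG and a distance between embeddings of pre-aligned seed entities, can be sketched as follows. This is a minimal illustration in NumPy, not the paper's implementation; the function names, the symmetric normalization, and the L1 alignment distance are assumptions for the sketch.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).

    adj: (n, n) adjacency matrix of one KG; feats: (n, d) entity features;
    weight: (d, d') layer weights (shared across languages in the paper's setup).
    """
    a_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))      # degree normalization
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)

def alignment_distance(emb_src, emb_tgt, seed_pairs):
    """L1 distance between embeddings of pre-aligned entity pairs.

    Training would push these distances down (e.g. with a margin-based loss)
    so that both languages land in one unified embedding space.
    """
    return np.array([np.abs(emb_src[i] - emb_tgt[j]).sum()
                     for i, j in seed_pairs])
```

A training loop would stack two or three such layers per KG and rank candidate counterparts by this distance at test time.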
Temporal Sentence Grounding in Videos (TSGV), i.e., grounding a natural language sentence which indicates complex human activities in a long and untrimmed video sequence, has received unprecedented attention over the last few years. Although each newly proposed method plausibly achieves better performance than previous ones, current TSGV models still tend to capture moment annotation biases and fail to take full advantage of multi-modal inputs. Even more incredibly, several extremely simple...
Temporal sentence grounding in videos (TSGV), which aims at localizing one target segment from an untrimmed video with respect to a given query, has drawn increasing attention from the research community over the past few years. Different from the task of temporal action localization, TSGV is more flexible since it can locate complicated activities via natural language, without the restriction of predefined categories. Meanwhile, it is more challenging since it requires both textual and visual understanding for semantic alignment...
Video Grounding (VG) aims to locate the desired segment from a video given a sentence query. Recent studies have found that current VG models are prone to over-rely on the ground-truth moment annotation distribution biases in the training set. To discourage the standard model's behavior of exploiting such temporal biases and to improve model generalization ability, we propose multiple negative augmentations in a hierarchical way, including cross-video augmentations at clip-/video-level and self-shuffled videos with masks. These augmentations can effectively diversify...
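The two kinds of negatives the abstract mentions, self-shuffled videos and masked segments, can be sketched on a frame sequence as below. This is a hypothetical illustration of the general idea, not the paper's pipeline; clip boundaries, the mask token, and function names are assumptions.

```python
import random

def shuffle_clips(frames, num_clips=4, seed=None):
    """Self-shuffled negative: split the frame sequence into clips and permute
    them, destroying temporal order while keeping the visual content."""
    rng = random.Random(seed)
    n = len(frames)
    bounds = [round(i * n / num_clips) for i in range(num_clips + 1)]
    clips = [frames[bounds[i]:bounds[i + 1]] for i in range(num_clips)]
    rng.shuffle(clips)
    return [f for clip in clips for f in clip]

def mask_moment(frames, start, end, mask_token="<mask>"):
    """Masked negative: blank out the annotated moment so the model cannot
    score that location highly from the annotation bias alone."""
    return frames[:start] + [mask_token] * (end - start) + frames[end:]
```

Training would then penalize the model for still predicting the original ground-truth moment on these augmented inputs.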
Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence that indicates complex human activities in an untrimmed video, has drawn widespread attention over the past few years. However, recent studies have found that current benchmark datasets may have obvious moment annotation biases, enabling several simple baselines even without training to achieve state-of-the-art (SOTA) performance. In this paper, we take a closer look at existing evaluation protocols for TSGV, and...
Temporal Sentence Grounding aims to retrieve a video moment given a natural language query. Most existing literature merely focuses on visual information in videos without considering the naturally accompanying audio, which may contain rich semantics. The few works that use audio simply regard it as an additional modality, overlooking that: i) it is non-trivial to explore the consistency and complementarity between audio and visual; ii) such exploration requires handling the different levels of information densities and noises in the two modalities. To...
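One simple way to handle two modalities with different noise levels, rather than treating audio as a fixed extra input, is a per-frame learned gate that decides how much of each modality to keep. The sketch below is an assumed illustration of that idea, not the paper's architecture; the gating formulation and names are hypothetical.

```python
import numpy as np

def gated_fusion(visual, audio, gate_w):
    """Fuse per-frame visual and audio features with a sigmoid gate.

    visual, audio: (T, D) frame-aligned features; gate_w: (2D, D) learned
    weights. The gate lets the model down-weight the noisier modality
    frame by frame instead of mixing them with a fixed ratio.
    """
    joint = np.concatenate([visual, audio], axis=-1)       # (T, 2D)
    gate = 1.0 / (1.0 + np.exp(-(joint @ gate_w)))         # (T, D) in (0, 1)
    return gate * visual + (1.0 - gate) * audio
```

Because the gate lies in (0, 1), each fused value stays a convex combination of the two modalities, so a noisy audio frame can be suppressed without discarding audio globally.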
Video Grounding (VG) has drawn widespread attention over the past few years, and numerous studies have been devoted to improving performance on various VG benchmarks. Nevertheless, the label annotation procedures produce imbalanced query-moment-label distributions in datasets, which severely deteriorate the learning model's capability of truly understanding video contents. Existing debiasing works either focus on adjusting the model or conducting video-level augmentation, failing to handle the temporal bias issue...
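The imbalanced query-moment-label distribution mentioned above can be made concrete with a simple diagnostic: histogram the normalized start positions of annotated moments across a dataset. This is an assumed illustration, not a procedure from the paper.

```python
from collections import Counter

def moment_location_histogram(annotations, num_bins=10):
    """Histogram of normalized moment start positions.

    annotations: iterable of (start_sec, end_sec, video_duration_sec).
    A heavily skewed histogram means a location prior that a model can
    exploit without truly understanding the video content.
    """
    counts = Counter()
    for start, end, duration in annotations:
        b = min(int(start / duration * num_bins), num_bins - 1)
        counts[b] += 1
    return [counts.get(b, 0) for b in range(num_bins)]
```

On biased benchmarks such a histogram concentrates mass near the start of videos, which is exactly what lets trivial baselines score well.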
Video grounding aims to ground a sentence query in a video by determining the start and end timestamps of the semantically matched segment. It is a fundamental and essential vision-and-language problem widely investigated by the research community, and it also has potential value when applied in industrial domains. This tutorial will give a detailed introduction to the development and evolution of this task, point out the limitations of existing benchmarks, and extend such a text-based task to more general scenarios, especially how it guides...
Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years. However, recent studies have found that current benchmark datasets may have obvious moment annotation biases, enabling several simple baselines even without training to achieve SOTA performance. In this paper, we take a closer look at existing evaluation protocols, and find that both the prevailing dataset and metrics are devils that lead...