NFDI4DS | UHH-SEMS - Publication Details

CTVIS: Consistent Training for Online Video Instance Segmentation

OPENALEX - Publications

Kaining Ying, Qing Zhong, Weian Mao, Zhenhua Wang, Hao Chen, and 5 more

The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To...

10.1109/iccv51070.2023.00089 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models

Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, and 6 more

This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on a distinct capability, mirroring the progression from basic perception to logical reasoning...

10.48550/arxiv.2403.20194 preprint EN arXiv (Cornell University) 2024-03-29

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Lin Han, and 17 more

Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing evaluation benchmarks cover a limited number of tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate recognition, localization, reasoning, and planning....

10.48550/arxiv.2404.16006 preprint EN arXiv (Cornell University) 2024-04-24

ISDA: Position-Aware Instance Segmentation with Deformable Attention

Kaining Ying, Zhenhua Wang, Cong Bai, Pengfei Zhou

Most instance segmentation models are not end-to-end trainable due to either the incorporation of proposal estimation (RPN) as a pre-processing step or non-maximum suppression (NMS) as a post-processing step. Here we propose a novel method termed ISDA. It reshapes the task into predicting a set of object masks, which are generated via a traditional convolution operation with learned position-aware kernels and features of objects. Such kernels and features are learned by leveraging a deformable attention network with multi-scale representation. Thanks to the introduced...

10.1109/icassp43922.2022.9747246 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

Self-Supervised Enhancement for Named Entity Disambiguation via Multimodal Graph Convolution

Pengfei Zhou, Kaining Ying, Zhenhua Wang, Dongyan Guo, Cong Bai

Named entity disambiguation (NED) finds the specific meaning of an entity mention in a particular context and links it to a target entity. With the emergence of multimedia, the modalities of content on the Internet have become more diverse, which poses difficulties for traditional NED, and the vast amounts of information make it impossible to manually label every kind of ambiguous data to train a practical NED model. In response to this situation, we present MMGraph, which uses multimodal graph convolution to aggregate visual and contextual language information for accurate...

10.1109/tnnls.2022.3173179 article EN IEEE Transactions on Neural Networks and Learning Systems 2022-05-13

Human Interaction Understanding With Consistency-Aware Learning

Jiajun Meng, Zhenhua Wang, Kaining Ying, Jianhua Zhang, Dongyan Guo, and 3 more

Compared with the progress made on human activity classification, much less success has been achieved on human interaction understanding (HIU). Apart from the fact that the latter task is more challenging, the main cause is that recent approaches learn interactive relations via shallow graphical representations, which are inadequate to model complicated interactive relations. This paper proposes a deep consistency-aware framework aiming at tackling grouping and labelling inconsistencies in HIU. The framework consists of three...

10.1109/tpami.2023.3280906 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2023-05-29

CTVIS: Consistent Training for Online Video Instance Segmentation

Kaining Ying, Qing Zhong, Weian Mao, Zhenhua Wang, Hao Chen, and 5 more

The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To...

10.48550/arxiv.2307.12616 preprint EN cc-by arXiv (Cornell University) 2023-01-01

RWMF: A Real-World Multimodal Foodlog Database

Pengfei Zhou, Cong Bai, Kaining Ying, Jie Xia, Lixin Huang

With increasing health concerns about diet, it is worthwhile to develop an intelligent assistant that can help users eat healthier. Such an assistant can automatically give personal advice on the user's diet and generate a report about eating on a regular basis. To boost research on such an assistant, we establish a real-world foodlog database using various methods such as filtering, clustering, and graph convolutional networks. This database is built based on lifelog and medical data and is named Real-World Multimodal Foodlog (RWMF). It contains 7500...

10.1109/icpr48806.2021.9412433 article EN 2020 25th International Conference on Pattern Recognition (ICPR) 2021-01-10

Human-to-Human Interaction Detection

Zhenhua Wang, Kaining Ying, Jiajun Meng, Jifeng Ning, Cong Bai

A comprehensive understanding of human-to-human interactions of interest in video streams, such as queuing, handshaking, fighting and chasing, is of immense importance to the surveillance of public security in regions like campuses, squares and parks. Different from conventional human interaction recognition, which uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages, we introduce a new task named human-to-human interaction detection (HID). HID is devoted to detecting subjects,...

10.48550/arxiv.2307.00464 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

ISDA: Position-Aware Instance Segmentation with Deformable Attention

Kaining Ying, Zhenhua Wang, Cong Bai, Pengfei Zhou

Most instance segmentation models are not end-to-end trainable due to either the incorporation of proposal estimation (RPN) as a pre-processing step or non-maximum suppression (NMS) as a post-processing step. Here we propose a novel method termed ISDA. It reshapes the task into predicting a set of object masks, which are generated via a traditional convolution operation with learned position-aware kernels and features of objects. Such kernels and features are learned by leveraging a deformable attention network with multi-scale representation. Thanks to the introduced...

10.48550/arxiv.2202.12251 preprint EN cc-by-nc-nd arXiv (Cornell University) 2022-01-01