Kaining Ying

ORCID: 0000-0003-2596-1847
Research Areas
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Human Pose and Action Recognition
  • Anomaly Detection Techniques and Applications
  • Advanced Image and Video Retrieval Techniques
  • Advanced Neural Network Applications
  • Natural Language Processing Techniques
  • Subtitles and Audiovisual Media
  • Speech and dialogue systems
  • Mobile Health and mHealth Applications
  • AI in Service Interactions
  • Image Retrieval and Classification Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Topic Modeling
  • Text and Document Classification Technologies
  • Nutritional Studies and Diet

Zhejiang University of Technology
2021-2023

Zhejiang University
2023

The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative instance embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To...

10.1109/iccv51070.2023.00089 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
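
As a rough illustration of the contrastive supervision described above, the sketch below scores one contrastive item: an anchor embedding is pulled toward its positive embeddings and pushed away from its negatives with an InfoNCE-style loss. This is a generic formulation under assumed shapes, cosine similarity, and a temperature `tau`; it is not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_item_loss(anchor, positives, negatives, tau=0.07):
    """InfoNCE-style loss for one contrastive item (CI).

    anchor:    (D,)   embedding of the tracked instance in the current frame
    positives: (P, D) embeddings of the same instance in reference frames
    negatives: (N, D) embeddings of other instances
    All embeddings are L2-normalised so dot products are cosine similarities.
    """
    anchor = F.normalize(anchor, dim=0)
    positives = F.normalize(positives, dim=1)
    negatives = F.normalize(negatives, dim=1)

    pos_sim = positives @ anchor / tau  # (P,)
    neg_sim = negatives @ anchor / tau  # (N,)

    # Each positive is contrasted against every negative.
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim.expand(len(pos_sim), -1)], dim=1)
    targets = torch.zeros(len(pos_sim), dtype=torch.long)  # column 0 holds the positive
    return F.cross_entropy(logits, targets)
```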

This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on a distinct capability, mirroring the progression from basic perception to logical reasoning...

10.48550/arxiv.2403.20194 preprint EN arXiv (Cornell University) 2024-03-29
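
To make the three-level hierarchy concrete, the sketch below shows one hypothetical way to represent a multi-turn sample and attribute a failure to the lowest capability level that broke down. The field names and scoring rule are illustrative assumptions, not ConvBench's actual schema or protocol.

```python
from dataclasses import dataclass

@dataclass
class ConvSample:
    """One multi-turn evaluation sample (hypothetical schema, for illustration)."""
    image_path: str
    perception_q: str   # turn 1: what is shown in the image?
    reasoning_q: str    # turn 2: reason over what was perceived
    creativity_q: str   # turn 3: open-ended creation grounded in turns 1-2

def hierarchical_verdict(perception_ok: bool, reasoning_ok: bool, creativity_ok: bool) -> dict:
    """Blame the first (lowest) level in the hierarchy that failed."""
    if not perception_ok:
        blame = "perception"
    elif not reasoning_ok:
        blame = "reasoning"
    elif not creativity_ok:
        blame = "creativity"
    else:
        blame = None
    return {"pass": blame is None, "first_failure": blame}
```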

Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing evaluation benchmarks cover a limited number of tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate recognition, localization, reasoning, and planning....

10.48550/arxiv.2404.16006 preprint EN arXiv (Cornell University) 2024-04-24

Most instance segmentation models are not end-to-end trainable due to either the incorporation of proposal estimation (RPN) as a pre-processing step or non-maximum suppression (NMS) as a post-processing step. Here we propose a novel method termed ISDA. It reshapes the task into predicting a set of object masks, which are generated via a traditional convolution operation with learned position-aware kernels and features of objects. Such kernels and features are learned by leveraging a deformable attention network with multi-scale representation. Thanks to the introduced...

10.1109/icassp43922.2022.9747246 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
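
The mask-generation idea in the abstract (per-object kernels convolved with a shared feature map) can be sketched as dynamic convolution. The snippet below is a simplified stand-in under assumed shapes and 1x1 kernels, not the paper's exact mask head.

```python
import torch
import torch.nn.functional as F

def masks_from_dynamic_kernels(features, kernels):
    """Generate one mask per object via per-instance dynamic convolution.

    features: (C, H, W)  shared position-aware feature map
    kernels:  (K, C)     one learned kernel per predicted object
    returns:  (K, H, W)  mask logits, one channel per object
    """
    C, H, W = features.shape
    weight = kernels.view(-1, C, 1, 1)                # (K, C, 1, 1)
    logits = F.conv2d(features.unsqueeze(0), weight)  # (1, K, H, W)
    return logits.squeeze(0)
```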

Named entity disambiguation (NED) finds the specific meaning of an entity mention in a particular context and links it to a target entity. With the emergence of multimedia, the modalities of content on the Internet have become more diverse, which poses difficulties for traditional NED, and the vast amounts of information make it impossible to manually label every kind of ambiguous data to train a practical NED model. In response to this situation, we present MMGraph, which uses multimodal graph convolution to aggregate visual and contextual language information for accurate...

10.1109/tnnls.2022.3173179 article EN IEEE Transactions on Neural Networks and Learning Systems 2022-05-13
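
The "multimodal graph convolution" aggregation mentioned above follows the standard GCN pattern: neighbour features (here, a mix of textual and visual node embeddings) are averaged over a normalised adjacency and projected. The layer below is a generic sketch with assumed shapes, not MMGraph's specific architecture.

```python
import torch

def gcn_layer(A, X, W):
    """One graph-convolution step over a multimodal node graph.

    A: (N, N)     adjacency over nodes (e.g. mention, context words, image regions)
    X: (N, D)     node features mixing textual and visual embeddings
    W: (D, D_out) learnable projection
    """
    A_hat = A + torch.eye(A.size(0))          # add self-loops
    deg = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalisation
    return torch.relu(A_norm @ X @ W)
```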

Compared with the progress made on human activity classification, much less success has been achieved on human interaction understanding (HIU). Apart from the fact that the latter task is more challenging, a main cause is that recent approaches learn interactive relations via shallow graphical representations, which are inadequate to model complicated interactive relations. This paper proposes a deep consistency-aware framework aimed at tackling the grouping and labelling inconsistencies in HIU. The framework consists of three...

10.1109/tpami.2023.3280906 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2023-05-29
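
To clarify what a grouping/labelling inconsistency means here, the check below encodes one plausible reading: two people predicted to interact should be assigned to the same interactive group. It is a hypothetical illustration of the constraint, not the paper's deep consistency-aware framework.

```python
def grouping_label_consistent(groups, pair_labels):
    """Return True if pairwise interaction labels agree with the grouping.

    groups:      dict person_id -> group_id
    pair_labels: dict (person_a, person_b) -> True if predicted to interact
    """
    for (a, b), interacting in pair_labels.items():
        if interacting and groups.get(a) != groups.get(b):
            return False  # interacting pair split across groups: inconsistent
    return True
```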

The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative instance embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To...

10.48550/arxiv.2307.12616 preprint EN cc-by arXiv (Cornell University) 2023-01-01

With increasing health concerns about diet, it is worthwhile to develop an intelligent assistant that can help users eat healthier. Such an assistant could automatically give personal advice on a user's diet and generate reports about eating habits on a regular basis. To boost research on such an assistant, we establish a real-world foodlog database using various methods such as filtering, clustering, and a graph convolutional network. The database is built on lifelog and medical data and is named Real-World Multimodal Foodlog (RWMF). It contains 7500...

10.1109/icpr48806.2021.9412433 article EN 2020 25th International Conference on Pattern Recognition (ICPR) 2021-01-10

A comprehensive understanding of human-to-human interactions of interest in video streams, such as queuing, handshaking, fighting and chasing, is of immense importance to the surveillance of public security in regions like campuses, squares and parks. Different from conventional human interaction recognition, which uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages, we introduce a new task named human interaction detection (HID). HID is devoted to detecting subjects,...

10.48550/arxiv.2307.00464 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Most instance segmentation models are not end-to-end trainable due to either the incorporation of proposal estimation (RPN) as a pre-processing step or non-maximum suppression (NMS) as a post-processing step. Here we propose a novel method termed ISDA. It reshapes the task into predicting a set of object masks, which are generated via a traditional convolution operation with learned position-aware kernels and features of objects. Such kernels and features are learned by leveraging a deformable attention network with multi-scale representation. Thanks to the introduced...

10.48550/arxiv.2202.12251 preprint EN cc-by-nc-nd arXiv (Cornell University) 2022-01-01