Licheng Yu

ORCID: 0000-0002-4943-6732
Research Areas
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Topic Modeling
  • Human Pose and Action Recognition
  • Advanced Image and Video Retrieval Techniques
  • Parallel Computing and Optimization Techniques
  • Advanced Vision and Imaging
  • Video Analysis and Summarization
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Data Storage Technologies
  • Interconnection Networks and Systems
  • Natural Language Processing Techniques
  • Advanced Nanomaterials in Catalysis
  • Advanced Biosensing and Bioanalysis Techniques
  • Metal-Organic Frameworks: Synthesis and Applications
  • Computer Graphics and Visualization Techniques
  • Advanced Image Processing Techniques
  • Electrocatalysts for Energy Conversion
  • Nanoplatforms for cancer theranostics
  • Nanocluster Synthesis and Applications
  • Distributed and Parallel Computing Systems
  • Image Processing Techniques and Applications
  • Nanoparticle-Based Drug Delivery
  • Video Coding and Compression Technologies
  • Handwritten Text Recognition Techniques

Nankai University
2019-2024

Nanjing University of Science and Technology
2021-2024

ShanghaiTech University
2023-2024

Chinese People's Armed Police Force
2024

University of California, Berkeley
2021-2024

JiangSu Armed Police General Hospital
2024

Fujian Institute of Research on the Structure of Matter
2023

Chinese Academy of Sciences
2023

Tan Kah Kee Innovation Laboratory
2023

Alpha Omega Alpha Medical Honor Society
2022

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are...

10.1109/cvpr.2018.00142 preprint EN 2018-06-01
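
The modular decomposition described above can be pictured as a language-driven weighted combination of per-module matching scores. The PyTorch sketch below is a simplified assumption: the bilinear scorers and the `ModularScorer` name are illustrative, and MAttNet's actual modules use phrase-guided attention rather than plain bilinear maps.

```python
# Minimal sketch of MAttNet-style modular scoring (not the authors' code):
# the expression embedding produces weights over three modules, and each
# module scores a candidate region independently.
import torch
import torch.nn as nn

class ModularScorer(nn.Module):
    def __init__(self, lang_dim=512, vis_dim=512):
        super().__init__()
        self.subj = nn.Bilinear(lang_dim, vis_dim, 1)   # subject appearance
        self.loc = nn.Bilinear(lang_dim, vis_dim, 1)    # location
        self.rel = nn.Bilinear(lang_dim, vis_dim, 1)    # relationship to others
        self.module_weights = nn.Linear(lang_dim, 3)    # (w_subj, w_loc, w_rel)

    def forward(self, lang, subj_feat, loc_feat, rel_feat):
        w = torch.softmax(self.module_weights(lang), dim=-1)   # (B, 3)
        scores = torch.cat([
            self.subj(lang, subj_feat),
            self.loc(lang, loc_feat),
            self.rel(lang, rel_feat),
        ], dim=-1)                                             # (B, 3)
        return (w * scores).sum(dim=-1)  # overall match score per region
```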

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based...

10.18653/v1/d18-1167 article EN cc-by Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 2018-01-01
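
As a concrete illustration of what one example in such a dataset carries, here is a hypothetical record layout; every field name is an assumption for exposition, not the dataset's released schema.

```python
# Hypothetical layout of a single TVQA-style example (field names assumed).
from dataclasses import dataclass

@dataclass
class VideoQAExample:
    clip_id: str          # which of the 21,793 clips the question refers to
    question: str         # compositional natural language question
    answers: list[str]    # candidate answers (TVQA uses five per question)
    correct_idx: int      # index of the correct candidate
    subtitles: list[str]  # time-aligned subtitle lines for the clip
```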

We present HERO, a novel framework for large-scale video+language omni-representation learning. HERO encodes multimodal inputs in a hierarchical structure, where the local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and the global video context by a Temporal Transformer. In addition to the standard Masked Language Modeling (MLM) and Masked Frame Modeling (MFM) objectives, we design two new pre-training tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global and local temporal alignment; (ii) Frame Order Modeling (FOM), where the model predicts the right...

10.18653/v1/2020.emnlp-main.161 article EN cc-by 2020-01-01
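
The hierarchical design reads as two stacked encoders: a local cross-modal encoder fusing each frame with its subtitle tokens, and a global temporal encoder over the fused frames. The following PyTorch sketch is an assumed simplification under made-up dimensions, not HERO's released implementation.

```python
# Minimal sketch of HERO-style hierarchical encoding (assumed, simplified).
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.cross_modal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)

    def forward(self, frames, subs):
        # frames: (B, T, D); subs: (B, T, L, D) subtitle tokens per frame
        B, T, L, D = subs.shape
        local_in = torch.cat([frames.unsqueeze(2), subs], dim=2)  # (B, T, 1+L, D)
        local_out = self.cross_modal(local_in.view(B * T, 1 + L, D))
        frame_ctx = local_out[:, 0].view(B, T, D)  # fused frame embeddings
        return self.temporal(frame_ctx)            # clip-level context
```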

Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a unified framework for the tasks of referring expression comprehension and generation. Our model is composed of three modules: speaker, listener, and reinforcer. The speaker generates referring expressions, the listener comprehends them, and the reinforcer introduces a reward function to guide the sampling of more discriminative expressions. The listener-speaker modules are trained jointly in an end-to-end...

10.1109/cvpr.2017.375 preprint EN 2017-07-01
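
Joint listener-speaker training can be pictured as the sum of a generation loss and a comprehension loss; the reinforcer's sampled-expression reward sits on top and is omitted here. A minimal sketch under assumed tensor shapes, not the paper's exact objective:

```python
# Sketch of a joint speaker-listener loss (illustrative; the reinforcer's
# reward-based term from the paper is omitted).
import torch
import torch.nn.functional as F

def joint_loss(speaker_logits, target_tokens, listener_scores, target_region,
               lambda_listener=1.0):
    # speaker: token-level cross-entropy for generating the expression
    gen_loss = F.cross_entropy(speaker_logits.flatten(0, 1),   # (B*L, V)
                               target_tokens.flatten())        # (B*L,)
    # listener: rank the referred region above the other candidates
    comp_loss = F.cross_entropy(listener_scores, target_region)
    return gen_loss + lambda_listener * comp_loss
```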

Hao Tan, Licheng Yu, Mohit Bansal. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

10.18653/v1/n19-1268 preprint EN 2019-01-01

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos. We first augment the TVQA dataset with 310.8K bounding boxes, linking depicted objects to visual concepts in questions and answers. We name this augmented version TVQA+. We then propose Spatio-Temporal Answerer with Grounded Evidence (STAGE), a unified framework that grounds evidence in both spatial and temporal...

10.18653/v1/2020.acl-main.730 preprint EN 2020-01-01
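
Temporal grounding of the kind STAGE performs is often reduced to predicting start and end positions over per-frame features; the head below is an assumed generic version of that idea, not STAGE's actual architecture.

```python
# Generic temporal-span head for moment localization (assumed, not STAGE's).
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, frame_feats):                  # (B, T, D)
        s = self.start(frame_feats).squeeze(-1)      # (B, T) start logits
        e = self.end(frame_feats).squeeze(-1)        # (B, T) end logits
        return s, e        # argmax (start, end) pair gives the predicted span
```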

In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context. We provide several analyses and demonstrate its applicability to two description generation tasks: focused description generation...

10.1109/iccv.2015.283 article EN 2015-12-01
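
To make the collection mechanism concrete, here is a toy illustration of fill-in-the-blank prompting; the template strings are invented for exposition and are not the dataset's actual templates.

```python
# Toy fill-in-the-blank prompts in the spirit of Visual Madlibs (invented).
TEMPLATES = [
    "The person is {blank}.",                   # activity
    "The {object} looks {blank}.",              # appearance
    "One could infer this scene is {blank}.",   # scene-level inference
]

def make_prompt(template, **slots):
    return template.format(**slots, blank="____")

print(make_prompt(TEMPLATES[1], object="dog"))  # -> "The dog looks ____."
```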

Traditional sparse image models treat a color pixel as a scalar, which represents the color channels separately or concatenates them as a monochrome image. In this paper, we propose a vector sparse representation model for color images using quaternion matrix analysis. As a new tool for color image representation, its potential applications in several image-processing tasks are presented, including reconstruction, denoising, inpainting, and super-resolution. The proposed model represents the color image as a quaternion matrix, where a quaternion-based dictionary learning algorithm is...

10.1109/tip.2015.2397314 article EN IEEE Transactions on Image Processing 2015-01-27
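
The core representational move is to encode an RGB pixel as a pure quaternion r·i + g·j + b·k, so the three channels are handled as one algebraic object. The sketch below shows that encoding and the Hamilton product it relies on; it is a self-contained illustration, not the paper's dictionary-learning algorithm.

```python
# Pure-quaternion encoding of a color image, with the Hamilton product
# (illustration only; the paper's dictionary learning is not shown).
import numpy as np

def rgb_to_quaternion(img):
    """img: (H, W, 3) float array -> (H, W, 4) array of (w, x, y, z)."""
    h, w, _ = img.shape
    q = np.zeros((h, w, 4), dtype=img.dtype)
    q[..., 1:] = img          # real part stays 0; (x, y, z) <- (r, g, b)
    return q

def quat_mul(p, q):
    """Hamilton product along the last axis, components (w, x, y, z)."""
    pw, px, py, pz = np.moveaxis(p, -1, 0)
    qw, qx, qy, qz = np.moveaxis(q, -1, 0)
    return np.stack([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ], axis=-1)
```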

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling...

10.48550/arxiv.1909.11740 preprint EN other-oa arXiv (Cornell University) 2019-01-01
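
Masked Language Modeling over a joint image-text sequence reduces to masking token positions and scoring only those positions in the loss. Below is a minimal masking utility assuming standard PyTorch conventions (the -100 ignore index of cross-entropy); UNITER's conditional-masking details are not reproduced.

```python
# Minimal MLM-style masking utility (assumed conventions, not UNITER's code).
import torch

def mask_tokens(tokens, mask_id, prob=0.15):
    """Randomly mask token ids; return model inputs and MLM labels."""
    mask = torch.rand(tokens.shape) < prob
    # unmasked positions get label -100, which cross-entropy ignores
    labels = torch.where(mask, tokens, torch.full_like(tokens, -100))
    inputs = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    return inputs, labels
```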

High-entropy materials are composed of five or more metal elements in equimolar or near-equimolar concentrations within one crystal structure, which offers remarkable structural properties for many applications. Despite previously reported entropy-driven stabilization mechanisms, high-entropy materials still tend to decompose and produce a variety of derivatives under operating conditions. In this study, we use transition-metal (Ni, Co, Ni, Zn, V)-based high-entropy metal–organic frameworks (HE-MOFs) as the...

10.1002/cey2.263 article EN Carbon Energy 2022-10-17

In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context. We provide several analyses and demonstrate its applicability to two description generation tasks: focused description generation...

10.48550/arxiv.1506.00278 preprint EN other-oa arXiv (Cornell University) 2015-01-01

Most recent garment capturing techniques rely on acquiring multiple views of clothing, which may not always be readily available, especially in the case of pre-existing photographs from the web. As an alternative, we propose a method that is able to compute a 3D model of a human body and its outfit from a single photograph with little human interaction. Our algorithm can not only capture the global shape and overall geometry of the clothing, it can also extract physical properties (i.e., material parameters needed for simulation) of the cloth. Unlike...

10.1145/3026479 article EN ACM Transactions on Graphics 2018-10-31

We address the problem of end-to-end visual storytelling. Given a photo album, our model first selects the most representative (summary) photos, and then composes a natural language story for the album. For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative photos, and compose the story. Automatic and human evaluations show our model achieves better performance on selection, generation, and retrieval than baselines.

10.18653/v1/d17-1101 article EN cc-by Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing 2017-01-01
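
The three-RNN pipeline maps naturally onto encode / select / compose stages. The outline below is an assumed simplification (plain GRUs, top-k selection) of that pipeline, not the paper's hierarchically-attentive implementation.

```python
# Outline of an encode-select-compose album storyteller (assumed, simplified).
import torch
import torch.nn as nn

class AlbumStoryteller(nn.Module):
    def __init__(self, feat_dim=512, hid=512, vocab=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)
        self.selector = nn.Linear(hid, 1)   # per-photo selection score
        self.decoder = nn.GRU(hid, hid, batch_first=True)
        self.vocab_out = nn.Linear(hid, vocab)

    def forward(self, photo_feats, k=5):     # photo_feats: (B, N, feat_dim)
        enc, _ = self.encoder(photo_feats)   # (B, N, H) album context
        scores = self.selector(enc).squeeze(-1)           # (B, N)
        top = scores.topk(k, dim=1).indices               # k summary photos
        idx = top.unsqueeze(-1).expand(-1, -1, enc.size(-1))
        summary = enc.gather(1, idx)                      # (B, k, H)
        dec, _ = self.decoder(summary)                    # story states
        return self.vocab_out(dec), scores   # token logits, selection scores
```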

Inconvenient dual-laser irradiation and the hypoxic tumor environment, as well as limited judgment of the treated region, have impeded the development of combined photothermal and photodynamic therapies (PTT and PDT). Herein, Bi₂Se₃@AIPH nanoparticles (NPs) are facilely developed to overcome these problems. Through a one-step method, a free radical generator (AIPH) and a phase-transition material (lauric acid, LA, 44–46 °C) are encapsulated in hollow bismuth selenide nanoparticles (Bi₂Se₃ NPs). Under a single 808-nm laser at the treated area,...

10.1007/s40820-019-0298-5 article EN cc-by Nano-Micro Letters 2019-08-19

Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA as introduced in [8] makes the fundamental assumption that every question, e.g., "what color is the car?", has exactly one target ("car") being inquired about. This puts a direct limitation on the abilities of the agent. We present a generalization -- Multi-Target EQA (MT-EQA). Specifically, we study questions that have multiple targets in them, such as "Is the dresser in the bedroom bigger...

10.1109/cvpr.2019.00647 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given clip. A large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, consisting of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These clips contain rich content...

10.1109/cvpr42600.2020.01091 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01
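
At its simplest, video-language inference is a binary decision over a fused premise-hypothesis representation. The head below sketches that decision under assumed pooled features; the paper's models fuse video, subtitles, and text in richer ways.

```python
# Minimal entailment head for video-language inference (assumed, simplified).
import torch
import torch.nn as nn

class EntailmentHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, premise, hypothesis):   # both (B, D) pooled features
        logit = self.classifier(torch.cat([premise, hypothesis], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)  # P(hypothesis entailed)
```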

Rigid 2D In-MOF nanosheets are investigated as electrocatalysts for nitrogen electroreduction over the entire pH range for the first time. The reaction follows an enzymatic mechanism, with the potential-determining step being *H₂NNH₂ → *NH₂ + NH₃.

10.1039/d1ta02684d article EN Journal of Materials Chemistry A 2021-01-01
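
For readers unfamiliar with the enzymatic (alternating) pathway named above, the hydrogenation sequence is commonly written as below; the intermediate list is the textbook alternating pathway, not reproduced from the paper, with only the potential-determining step taken from the abstract.

```latex
\begin{align*}
  \mathrm{N_2} &\rightarrow {}^{*}\mathrm{N_2} \rightarrow {}^{*}\mathrm{NNH}
    \rightarrow {}^{*}\mathrm{HNNH} \rightarrow {}^{*}\mathrm{HNNH_2}
    \rightarrow {}^{*}\mathrm{H_2NNH_2} \\
  {}^{*}\mathrm{H_2NNH_2} &\rightarrow {}^{*}\mathrm{NH_2} + \mathrm{NH_3}
    \qquad \text{(potential-determining step)} \\
  {}^{*}\mathrm{NH_2} &\rightarrow {}^{*}\mathrm{NH_3} \rightarrow \mathrm{NH_3}
\end{align*}
```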