Peng Gao

ORCID: 0000-0003-4398-5471
Research Areas
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • Mycobacterium research and diagnosis
  • Topic Modeling
  • Human Pose and Action Recognition
  • Natural Language Processing Techniques
  • CCD and CMOS Imaging Sensors
  • Analog and Mixed-Signal Circuit Design
  • 3D Shape Modeling and Analysis
  • Cancer-related molecular mechanisms research
  • COVID-19 diagnosis using AI
  • Radio Frequency Integrated Circuit Design
  • Soil Carbon and Nitrogen Dynamics
  • VLSI and FPGA Design Techniques
  • 3D Surveying and Cultural Heritage
  • Tuberculosis Research and Epidemiology
  • Manufacturing Process and Optimization
  • Speech Recognition and Synthesis
  • Advanced Sensor and Control Systems
  • Visual Attention and Saliency Detection
  • Advanced Power Amplifier Design
  • RNA modifications and cancer
  • Speech and Audio Processing

Beijing Normal University - Hong Kong Baptist University United International College
2025

Shandong Agricultural University
2023-2024

State Forestry and Grassland Administration
2023-2024

Shanghai Artificial Intelligence Laboratory
2023-2024

Beijing Academy of Artificial Intelligence
2023-2024

Shanghai Tenth People's Hospital
2021-2024

Tongji University
2021-2024

China United Network Communications Group (China)
2024

Gansu Coalfield Geology Bureau
2023-2024

BGI Group (China)
2024

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively...

10.48550/arxiv.2303.16199 preprint EN other-oa arXiv (Cornell University) 2023-01-01
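
A minimal sketch of the zero-gating idea described above, assuming a simplified single-layer view (the class and shapes here are illustrative, not the authors' implementation): a learnable gate initialized to zero scales the contribution of the adaption prompts, so training starts exactly from the frozen model's behavior.

import torch
import torch.nn as nn

class ZeroGatedPromptAttention(nn.Module):
    """Hypothetical simplification: prompts reach word tokens through a zero-initialized gate."""
    def __init__(self, dim: int, num_prompts: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))  # zero gating factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) word tokens at a higher transformer layer
        p = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        scores = x @ p.transpose(1, 2) / x.size(-1) ** 0.5   # token-to-prompt attention
        injected = torch.softmax(scores, dim=-1) @ p
        # tanh(0) == 0 at initialization: the prompt branch adds nothing at first,
        # then adaptively injects instructional cues as the gate is learned.
        return x + torch.tanh(self.gate) * injected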

How to efficiently transform large language models (LLMs) into instruction followers has recently become a popular research direction, while training LLMs for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g.,...

10.48550/arxiv.2304.15010 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance benefited from contrastive language-image pre-training. We then question whether more diverse pre-training knowledge can be cascaded to further assist representation learning. In this paper, we propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge from various pre-training paradigms for...

10.1109/cvpr52729.2023.01460 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, it still remains an open question how to exploit masked autoencoding for learning 3D representations of irregular point clouds. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both...

10.48550/arxiv.2205.14401 preprint EN other-oa arXiv (Cornell University) 2022-01-01

It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to the large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively...

10.48550/arxiv.2201.04676 preprint EN other-oa arXiv (Cornell University) 2022-01-01
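
A rough sketch of the trade-off discussed above (module names and shapes are illustrative assumptions, not the paper's code): shallow blocks aggregate local context cheaply with depthwise 3D convolution, while deep blocks model long-range dependency with self-attention over all space-time tokens.

import torch
import torch.nn as nn

class LocalBlock(nn.Module):
    """Local aggregation over a small 3x3x3 spatiotemporal neighborhood."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.BatchNorm3d(dim)

    def forward(self, x):  # x: (batch, dim, t, h, w)
        return x + self.norm(self.dwconv(x))

class GlobalBlock(nn.Module):
    """Global dependency via self-attention; costly but sees the whole clip."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, dim, t, h, w)
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (batch, t*h*w, dim)
        y = self.norm(tokens)
        tokens = tokens + self.attn(y, y, y, need_weights=False)[0]
        return tokens.transpose(1, 2).reshape(b, c, t, h, w)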

Large-scale pre-trained models have shown promising open-world performance for both vision and language tasks. However, their transferred capacity on 3D point clouds is still limited and only constrained to the classification task. In this paper, we first collaborate CLIP and GPT to be a unified 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with the pre-trained language knowledge, PointCLIP V2 contains two key designs. For the visual end, we prompt CLIP via...

10.1109/iccv51070.2023.00249 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Pre-training by numerous image data has become de facto for robust 2D representations. In contrast, due to the expensive data processing, a paucity of 3D datasets severely hinders the learning of high-quality 3D features. In this paper, we propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. By self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with...

10.1109/cvpr52729.2023.02085 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Monocular 3D object detection has long been a challenging task in autonomous driving. Most existing methods follow conventional 2D detectors to first localize object centers, and then predict 3D attributes by neighboring features. However, only using local visual features is insufficient to understand the scene-level 3D spatial structures and ignores the long-range inter-object depth relations. In this paper, we introduce the first DETR framework for monocular DEtection with a depth-guided TRansformer, named MonoDETR. We modify the vanilla...

10.1109/iccv51070.2023.00840 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Driven by large-data pre-training, the Segment Anything Model (SAM) has been demonstrated as a powerful and promptable framework, revolutionizing segmentation models. Despite its generality, customizing SAM for specific visual concepts without man-powered prompting is under-explored, e.g., automatically segmenting your pet dog in different images. In this paper, we propose a training-free Personalization approach for SAM, termed PerSAM. Given only a single image with a reference mask, PerSAM first...

10.48550/arxiv.2305.03048 preprint EN other-oa arXiv (Cornell University) 2023-01-01
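
A minimal sketch of one plausible reading of the one-shot setup above (hypothetical, not the released PerSAM code): embed the reference concept from its mask, then locate it in a new image by cosine similarity over the new image's feature map, yielding a point prompt for the segmenter.

import torch
import torch.nn.functional as F

def locate_concept(ref_feats, ref_mask, tgt_feats):
    """ref_feats/tgt_feats: (c, h, w) image features; ref_mask: (h, w) in {0, 1}."""
    # Average the features inside the reference mask -> one concept embedding.
    emb = (ref_feats * ref_mask).flatten(1).sum(1) / ref_mask.sum().clamp(min=1)
    emb = F.normalize(emb, dim=0)
    tgt = F.normalize(tgt_feats.flatten(1), dim=0)       # (c, h*w), unit columns
    sim = emb @ tgt                                      # (h*w,) confidence map
    h, w = tgt_feats.shape[1:]
    return divmod(sim.argmax().item(), w), sim.reshape(h, w)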

We present a Non-parametric Network for 3D point cloud analysis, Point-NN, which consists of purely non-learnable components: farthest point sampling (FPS), k-nearest neighbors (k-NN), and pooling operations, with trigonometric functions. Surprisingly, it performs well on various 3D tasks, requiring no parameters or training, and even surpasses existing fully trained models. Starting from this basic non-parametric model, we propose two extensions. First, Point-NN can serve as a base architectural framework...

10.1109/cvpr52729.2023.00517 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
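
A compact sketch of the non-parametric pipeline described above, under assumed shapes (a simplification, not the authors' exact code): FPS picks anchor points, k-NN groups their neighbors, trigonometric functions embed raw coordinates, and pooling aggregates, with no learnable weights anywhere.

import torch

def fps(xyz, m):
    """Greedy farthest point sampling: xyz (n, 3) -> indices of m anchors."""
    idx = torch.zeros(m, dtype=torch.long)               # start from point 0
    dist = torch.full((xyz.size(0),), float("inf"))
    for i in range(1, m):
        dist = torch.minimum(dist, (xyz - xyz[idx[i - 1]]).pow(2).sum(-1))
        idx[i] = dist.argmax()                           # farthest from chosen set
    return idx

def trig_embed(xyz, dim=36):
    """Non-learnable positional embedding via sin/cos at multiple frequencies."""
    freqs = 2.0 ** torch.arange(dim // 6, dtype=torch.float)
    angles = xyz.unsqueeze(-1) * freqs                   # (n, 3, dim // 6)
    return torch.cat([angles.sin(), angles.cos()], -1).flatten(1)

def point_nn_encode(xyz, m=128, k=16):
    anchors = xyz[fps(xyz, m)]                           # (m, 3)
    knn = torch.cdist(anchors, xyz).topk(k, largest=False).indices
    feats = trig_embed(xyz)[knn]                         # (m, k, f) grouped features
    local = torch.cat([feats.max(1).values, feats.mean(1)], -1)   # pool per group
    return torch.cat([local.max(0).values, local.mean(0)], -1)    # global feature

feat = point_nn_encode(torch.randn(1024, 3))             # toy usage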

Abstract Objective. Decoding neural activity has been limited by the lack of tools available to record from large numbers of neurons across multiple cortical regions simultaneously with high temporal fidelity. To this end, we developed the Argo system to record cortical neural activity at high data rates. Approach. Here we demonstrate a massively parallel neural recording system based on platinum-iridium microwire electrode arrays bonded to a CMOS voltage amplifier array. The Argo system is the highest channel count in vivo neural recording system, supporting simultaneous recording from 65,536 channels,...

10.1088/1741-2552/abd0ce article EN Journal of Neural Engineering 2020-12-05

Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pre-training and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performances on image classification, detection and semantic segmentation. In this paper, our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme. However, directly using the original masking strategy leads to heavy...

10.48550/arxiv.2205.03892 preprint EN cc-by arXiv (Cornell University) 2022-01-01
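
A generic sketch of the mask auto-encoding setup referenced above (standard MAE-style random masking, not ConvMAE's multi-scale variant): keep a random subset of patch tokens, encode only those, and task a decoder with reconstructing the rest.

import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (batch, num_patches, dim) -> (kept tokens, kept idx, masked idx)."""
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    order = torch.rand(b, n).argsort(dim=1)      # one random permutation per sample
    keep, masked = order[:, :n_keep], order[:, n_keep:]
    kept = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, d))
    return kept, keep, masked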

We present a Non-parametric Network for 3D point cloud analysis, Point-NN, which consists of purely non-learnable components: farthest point sampling (FPS), k-nearest neighbors (k-NN), and pooling operations, with trigonometric functions. Surprisingly, it performs well on various 3D tasks, requiring no parameters or training, and even surpasses existing fully trained models. Starting from this basic non-parametric model, we propose two extensions. First, Point-NN can serve as a base architectural framework...

10.48550/arxiv.2303.08134 preprint EN other-oa arXiv (Cornell University) 2023-01-01

10.1109/cvpr52733.2024.02510 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has evoked increasing attention in both industry and academia. It is challenging to explore the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities, and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer...

10.1609/aaai.v38i6.28465 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24

Challenging illumination conditions (low-light, under-exposure and over-exposure) in the real world not only cast an unpleasant visual appearance but also taint computer vision tasks. After a camera captures raw-RGB data, it renders standard sRGB images with an image signal processor (ISP). By decomposing the ISP pipeline into local and global image components, we propose a lightweight and fast Illumination Adaptive Transformer (IAT) to restore the normal lit sRGB image from either low-light or under/over-exposure conditions....

10.48550/arxiv.2205.14871 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Foundation models have made significant strides in various applications, including text-to-image generation, panoptic segmentation, and natural language processing. This paper presents Instruct2Act, a framework that utilizes Large Language Models to map multi-modal instructions to sequential actions for robotic manipulation tasks. Specifically, Instruct2Act employs the LLM model to generate Python programs that constitute a comprehensive perception, planning, and action loop. In the perception section,...

10.48550/arxiv.2305.11176 preprint EN cc-by arXiv (Cornell University) 2023-01-01
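
An illustrative skeleton of the loop described above (the tool names and the canned program are hypothetical stand-ins, not the Instruct2Act API): the LLM emits a small Python program over a fixed tool interface, and executing that program realizes the perception-planning-action loop.

def run_task(instruction: str, llm_generate, tools: dict) -> None:
    """llm_generate(instruction, tool_names) -> Python source using those tools."""
    program = llm_generate(instruction, sorted(tools))
    exec(program, {"__builtins__": {}}, dict(tools))     # run against the tool API

# Toy usage with stub tools and a canned "generated" program:
tools = {
    "detect": lambda name: (0.4, 0.2),                   # perception: locate object
    "pick":   lambda pos: print("pick at", pos),         # action primitives
    "place":  lambda pos: print("place at", pos),
}
canned = "pos = detect('red block')\npick(pos)\nplace((0.8, 0.5))"
run_task("put the red block on the right", lambda i, t: canned, tools)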

The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its application to diverse downstream vision tasks. To improve its capacity on downstream tasks, few-shot learning has become a widely-adopted technique. However, existing methods either exhibit limited performance or suffer from excessive learnable parameters. In this paper, we propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency. Via...

10.1109/iccv51070.2023.00246 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

The large pre-trained vision transformers (ViTs) have demonstrated remarkable performance on various visual tasks, but suffer from expensive computational and memory costs when deployed on resource-constrained devices. Among the powerful compression approaches, quantization extremely reduces the computation and memory consumption by low-bit parameters and bit-wise operations. However, low-bit ViTs remain largely unexplored and usually suffer a significant performance drop compared with their real-valued counterparts. In this work, through...

10.48550/arxiv.2210.06707 preprint EN cc-by arXiv (Cornell University) 2022-01-01
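
A minimal sketch of the low-bit idea mentioned above (generic symmetric uniform quantization, not the paper's specific scheme): weights are snapped to a small signed-integer grid, trading accuracy for cheap storage and bit-wise arithmetic.

import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 4):
    """Return (integer codes, scale) such that codes * scale approximates w."""
    qmax = 2 ** (bits - 1) - 1                           # e.g. 7 for signed 4-bit
    scale = w.abs().max() / qmax
    codes = (w / scale).round().clamp(-qmax - 1, qmax).to(torch.int8)
    return codes, scale

w = torch.randn(256, 256)
codes, scale = quantize_symmetric(w)
err = (w - codes.float() * scale).abs().mean().item()
print(f"mean abs quantization error: {err:.4f}")         # drop grows as bits shrink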