Xinlong Wang

ORCID: 0000-0002-8137-1692
Research Areas
  • Advanced Neural Network Applications
  • Inertial Sensor and Navigation
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Nonlinear Photonic Systems
  • Acoustic Wave Phenomena Research
  • Nonlinear Dynamics and Pattern Formation
  • Target Tracking and Data Fusion in Sensor Networks
  • GNSS positioning and interference
  • Advanced Fiber Laser Technologies
  • Video Surveillance and Tracking Methods
  • Underwater Acoustics Research
  • Robotics and Sensor-Based Localization
  • Topic Modeling
  • Advanced Vision and Imaging
  • Astronomical Observations and Instrumentation
  • Anomaly Detection Techniques and Applications
  • Machine Fault Diagnosis Techniques
  • Advanced Computational Techniques and Applications
  • Human Pose and Action Recognition
  • Ocean Waves and Remote Sensing
  • Blind Source Separation Techniques
  • Nonlinear Waves and Solitons
  • Aerodynamics and Acoustics in Jet Flows

Affiliations

Institute of Acoustics
2008-2025

Nanjing University
2010-2025

Beihang University
2014-2025

Shandong Normal University
2025

Beijing Academy of Artificial Intelligence
2023-2024

Dalian Polytechnic University
2024

Central China Normal University
2024

Northeast Normal University
2024

Dalian University of Technology
2022-2024

Anhui Jianzhu University
2023

Publications

Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video. Recent methods typically develop sophisticated pipelines to tackle this task. Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR outputs the sequence of masks for each instance in order directly. At the core...

10.1109/cvpr46437.2021.00863 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

Detecting individual pedestrians in a crowd remains a challenging problem, since pedestrians often gather together and occlude each other in real-world scenarios. In this paper, we first explore how a state-of-the-art pedestrian detector is harmed by crowd occlusion via experimentation, providing insights into the occlusion problem. Then, we propose a novel bounding box regression loss specifically designed for crowd scenes, termed repulsion loss. This loss is driven by two motivations: the attraction by the target, and the repulsion by other surrounding objects. The repulsion term prevents...

10.1109/cvpr.2018.00811 article EN 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2018-06-01
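
The two-term design described above reduces to a compact loss. Below is a minimal PyTorch sketch of the attraction term plus repulsion from surrounding ground-truth boxes (the paper's RepGT term; the box-to-box RepBox term is omitted). Function names and the pairing of each prediction with one surrounding box are illustrative assumptions, not the released implementation.

```python
import math
import torch
import torch.nn.functional as F

def iog(pred, gt):
    """Intersection over ground-truth area: how far a predicted box
    intrudes into a non-target ground-truth box. Boxes are (x1, y1, x2, y2)."""
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area = (gt[:, 2:] - gt[:, :2]).clamp(min=0).prod(dim=1)
    return inter / area.clamp(min=1e-6)

def smooth_ln(x, sigma=0.5):
    """Smoothed ln penalty: gentle for small overlaps, increasingly
    steep (linear) once the intrusion exceeds sigma."""
    return torch.where(
        x <= sigma,
        -torch.log((1 - x).clamp(min=1e-6)),
        (x - sigma) / (1 - sigma) - math.log(1 - sigma),
    )

def repulsion_loss(pred, target_gt, other_gt, alpha=0.5):
    """Attraction toward the assigned ground truth plus repulsion away
    from a surrounding non-target ground truth. All inputs are (N, 4)."""
    attr = F.smooth_l1_loss(pred, target_gt)        # pull toward the target
    rep = smooth_ln(iog(pred, other_gt)).mean()     # push away from neighbors
    return attr + alpha * rep
```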

To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise (dis)similarity loss...

10.1109/cvpr46437.2021.00304 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
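
The pairwise pixel-level objective above can be sketched compactly. DenseCL itself computes the correspondence from backbone features and draws negatives from a memory queue across images; the toy version below collapses that to a single pair of views, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def dense_contrastive_loss(f_q, f_k, temperature=0.2):
    """Toy single-pair version of a dense (pixel-level) contrastive loss.

    f_q, f_k: (C, H, W) dense projections of two augmented views.
    Each query vector's positive is its most similar key vector (a simple
    correspondence rule); the remaining key vectors act as negatives.
    """
    q = F.normalize(f_q.flatten(1).t(), dim=1)  # (HW, C) unit vectors
    k = F.normalize(f_k.flatten(1).t(), dim=1)  # (HW, C)
    sim = q @ k.t()                             # (HW, HW) cosine similarities
    pos = sim.argmax(dim=1)                     # cross-view correspondence
    return F.cross_entropy(sim / temperature, pos)
```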

In this work, we aim at building a simple, direct, and fast instance segmentation framework with strong performance. We follow the principle of the SOLO method of Wang et al., "SOLO: segmenting objects by locations". Importantly, we take one step further by dynamically learning the mask head of the object segmenter, such that the mask head is conditioned on the location. Specifically, the mask branch is decoupled into a mask kernel branch and a mask feature branch, which are responsible for learning the convolution kernel and the convolved features respectively. Moreover, we propose Matrix NMS (non...

10.48550/arxiv.2003.10152 preprint EN cc-by-nc-sa arXiv (Cornell University) 2020-01-01
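
Matrix NMS, mentioned above, replaces sequential suppression with a single matrix pass that decays scores. The sketch below follows the Gaussian-kernel variant described in the SOLOv2 paper; it assumes masks arrive sorted by descending score, and thresholding the decayed scores replaces hard removal.

```python
import torch

def matrix_nms(masks, scores, sigma=2.0):
    """Gaussian-kernel Matrix NMS: decay all scores in parallel.

    masks:  (N, H, W) binary masks, pre-sorted by descending score.
    scores: (N,) confidence scores.
    """
    n = scores.numel()
    flat = masks.flatten(1).float()                    # (N, HW)
    inter = flat @ flat.t()                            # pairwise intersections
    areas = flat.sum(dim=1).expand(n, n)
    iou = (inter / (areas + areas.t() - inter)).triu(diagonal=1)
    # For each mask, its largest overlap with any higher-scored mask
    # ("compensation" for how suppressed the suppressor itself is).
    comp = iou.max(dim=0).values.expand(n, n).t()
    # Per-pair decay, take the strongest decay acting on each mask.
    decay = torch.exp(-sigma * (iou ** 2 - comp ** 2)).min(dim=0).values
    return scores * decay
```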

We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings, which are pre-defined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can easily generalize to input sequences that are longer than what the model has ever seen during training. Besides, CPE can keep the desired translation-invariance in the image classification task, resulting in improved performance. CPE can be implemented with a simple...

10.48550/arxiv.2102.10882 preprint EN other-oa arXiv (Cornell University) 2021-01-01
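
In practice the conditional encoding can be produced by a depthwise convolution over the tokens reshaped into their 2-D layout (the paper's Positional Encoding Generator). A minimal sketch, assuming a sequence without a class token:

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Positional Encoding Generator: conditional positional encoding
    produced by a depthwise conv over the 2-D token map."""

    def __init__(self, dim, k=3):
        super().__init__()
        # Depthwise conv; the zero padding at the borders is what leaks
        # absolute position cues into an otherwise translation-invariant op.
        self.proj = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)

    def forward(self, tokens, h, w):
        # tokens: (B, N, C) with N == h * w
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return tokens + self.proj(feat).flatten(2).transpose(1, 2)
```

Because the encoding is recomputed from whatever token map comes in, the same module works unchanged at test-time resolutions the model never saw during training.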

A 3D point cloud describes the real scene precisely and intuitively. To date, however, how to segment diversified elements in such an informative 3D scene is rarely discussed. In this paper, we first introduce a simple and flexible framework to segment instances and semantics in point clouds simultaneously. Then, we propose two approaches which make the two tasks take advantage of each other, leading to a win-win situation. Specifically, we make instance segmentation benefit from semantic segmentation through learning a semantic-aware point-level embedding. Meanwhile,...

10.1109/cvpr.2019.00422 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scAle using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and it sets new records on a broad range of representative downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic...

10.1109/cvpr52729.2023.01855 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
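
The pretext task above, regressing masked-out CLIP vision features, admits a short sketch. The negative-cosine form, shapes and names here are my assumptions about a reasonable instantiation, not the released training code.

```python
import torch
import torch.nn.functional as F

def masked_feature_loss(pred, clip_target, mask):
    """Masked feature reconstruction: predict CLIP vision features at the
    masked patch positions, conditioned (inside the ViT) on visible patches.

    pred:        (B, N, C) ViT outputs after a small regression head.
    clip_target: (B, N, C) frozen CLIP vision features of the same image.
    mask:        (B, N) bool, True where a patch was masked out.
    """
    p = F.normalize(pred[mask], dim=-1)
    t = F.normalize(clip_target[mask], dim=-1)
    return -(p * t).sum(dim=-1).mean()   # negative cosine similarity
```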

We present a high-performance method that can achieve mask-level instance segmentation with only bounding-box annotations for training. While this setting has been studied in the literature, here we show significantly stronger performance with a simple design (e.g., dramatically improving the previous best reported mask AP of 21.1% [13] to 31.6% on the COCO dataset). Our core idea is to redesign the loss for learning masks in instance segmentation, with no modification to the segmentation network itself. The new loss functions can supervise the mask training without...

10.1109/cvpr46437.2021.00540 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
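
One way to supervise masks from boxes alone, in line with the loss-redesign idea above, is a projection term: the predicted mask and the box mask must agree after max-projection onto each image axis, which confines the mask to its box without pixel labels. The sketch shows only that term (the paper also uses a pairwise color-affinity term); names are mine.

```python
import torch

def dice_loss(a, b, eps=1e-6):
    # Soft dice between two 1-D profiles per batch element.
    num = 2 * (a * b).sum(dim=-1) + eps
    den = (a * a).sum(dim=-1) + (b * b).sum(dim=-1) + eps
    return 1 - num / den

def box_projection_loss(pred_mask, box_mask):
    """pred_mask: (B, H, W) mask probabilities in [0, 1].
    box_mask:  (B, H, W) binary, filled inside the ground-truth box."""
    # Max over rows -> profile along x; max over columns -> profile along y.
    lx = dice_loss(pred_mask.max(dim=1).values, box_mask.max(dim=1).values)
    ly = dice_loss(pred_mask.max(dim=2).values, box_mask.max(dim=2).values)
    return (lx + ly).mean()
```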

Recently, state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long-sequence modeling. Meanwhile, building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary, and we propose a new generic vision backbone...

10.48550/arxiv.2401.09417 preprint EN other-oa arXiv (Cornell University) 2024-01-01
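
The backbone idea, processing patch tokens with forward and backward state-space scans instead of self-attention, can be caricatured in a few lines. A real Mamba block uses input-dependent (selective) parameters, gating and a hardware-aware parallel scan; this fixed-parameter toy keeps only the bidirectional-recurrence shape, and all names are mine.

```python
import torch
import torch.nn as nn

class BidirectionalSSM(nn.Module):
    """Toy bidirectional state-space layer: every token sees context from
    both directions via two diagonal linear recurrences, no attention."""

    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.full((dim,), 2.0))  # per-channel "A"
        self.out = nn.Linear(2 * dim, dim)

    def scan(self, x):
        # x: (B, N, C); simple recurrence h_t = a * h_{t-1} + x_t.
        a = self.decay.sigmoid()
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.shape[1]):
            h = a * h + x[:, t]
            ys.append(h)
        return torch.stack(ys, dim=1)

    def forward(self, x):
        fwd = self.scan(x)                      # left-to-right pass
        bwd = self.scan(x.flip(1)).flip(1)      # right-to-left pass
        return x + self.out(torch.cat([fwd, bwd], dim=-1))
```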

We present SegGPT, a generalist model for segmenting everything in context. We unify various segmentation tasks into an in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images. The training of SegGPT is formulated as an in-context coloring problem with a random color mapping for each data sample. The objective is to accomplish diverse tasks according to the context, rather than relying on specific colors. After training, SegGPT can perform arbitrary segmentation tasks in images or videos via in-context inference, such...

10.1109/iccv51070.2023.00110 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
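
The random color mapping that defines the in-context coloring objective is simple to sketch: every training sample re-colors its segment IDs with a fresh palette, so the model must infer the coloring rule from the paired context example rather than memorize class colors. Names below are illustrative.

```python
import torch

def random_color_target(seg, num_ids):
    """Build an image-like regression target from a segmentation map.

    seg:     (H, W) integer (long) segment/class IDs in [0, num_ids).
    Returns: (3, H, W) float target whose colors are random per sample.
    """
    palette = torch.rand(num_ids, 3)   # fresh random colors each call
    target = palette[seg]              # (H, W, 3) look-up by segment ID
    return target.permute(2, 0, 1)
```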

In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. But in computer vision, the difficulties for in-context learning lie in that tasks vary significantly in their output representations, thus it is unclear how to define general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images, and to specify task prompts as also...

10.1109/cvpr52729.2023.00660 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Instance segmentation is a fundamental vision task that aims to recognize and segment each object in an image. However, it requires costly annotations such as bounding boxes and segmentation masks for learning. In this work, we propose a fully unsupervised learning method that learns class-agnostic instance segmentation without any annotations. We present FreeSOLO, a self-supervised instance segmentation framework built on top of the simple instance segmentation method SOLO. Our method also presents a novel localization-aware pre-training framework, where objects can be discovered from...

10.1109/cvpr52688.2022.01378 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but smaller training costs. Notably, our largest...

10.48550/arxiv.2303.15389 preprint EN other-oa arXiv (Cornell University) 2023-01-01

10.48550/arxiv.2304.03284 preprint EN cc-by arXiv (Cornell University) 2023-01-01

We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open and accessible giant CLIP encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably,...

10.2139/ssrn.4813567 preprint EN 2024-01-01

Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the "detect-then-segment" strategy (e.g., Mask R-CNN), or predict embedding vectors first and then cluster pixels into individual instances. In this paper, we view the task from a completely new perspective by introducing the notion of "instance categories", which assigns...

10.1109/tpami.2021.3111116 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2021-01-01
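
The "instance categories" notion maps naturally to a label-assignment sketch: divide the image into an S x S grid and make the cell containing an object's center responsible for that object's class and mask. A minimal version with a single grid level (the method uses several scales); names are illustrative.

```python
import torch

def solo_category_targets(centers, classes, grid=12):
    """Assign each object to the grid cell that contains its center.

    centers: (N, 2) normalized (x, y) object centers in [0, 1).
    classes: (N,) class indices.
    Returns: (S, S) long tensor of class targets, -1 for background cells.
    """
    target = torch.full((grid, grid), -1, dtype=torch.long)
    cols = (centers[:, 0] * grid).long().clamp(0, grid - 1)
    rows = (centers[:, 1] * grid).long().clamp(0, grid - 1)
    target[rows, cols] = classes.long()
    return target
```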

We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, which together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with the unified objective of classifying the next text token or regressing the next visual embedding in the sequence. This versatile...

10.48550/arxiv.2307.05222 preprint EN cc-by arXiv (Cornell University) 2023-01-01
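
The unified objective described above, classify the next text token or regress the next visual embedding, can be sketched as a two-branch loss. The shapes, the mean-squared regression form and the names are my assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def unified_autoregressive_loss(logits, pred_embed, tokens, vis_embeds, is_visual):
    """Two-branch next-step loss over an interleaved multimodal sequence.

    logits:     (T, V) next-token logits from the language-model head.
    pred_embed: (T, C) predicted next visual embeddings (regression head).
    tokens:     (T,) ground-truth next tokens (used at text positions).
    vis_embeds: (T, C) ground-truth next visual embeddings (visual positions).
    is_visual:  (T,) bool, True where the next position is a visual embedding.
    Assumes the sequence contains both modalities.
    """
    text_loss = F.cross_entropy(logits[~is_visual], tokens[~is_visual])
    vis_loss = F.mse_loss(pred_embed[is_visual], vis_embeds[is_visual])
    return text_loss + vis_loss
```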

10.1109/cvpr52733.2024.01365 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

High-quality broadband ultrasound transducers yield superior imaging performance in biomedical ultrasonography. However, a proper design to perfectly bridge the energy between the active piezoelectric material and the target medium over the operating spectrum is still lacking. Here, we demonstrate a new anisotropic cone-structured acoustic metamaterial matching layer that acts as an inhomogeneous layer with gradient acoustic impedance along the propagation direction. When sandwiched between the piezoelectric unit and the target medium, the matching layer provides a window...

10.1038/srep42863 article EN cc-by Scientific Reports 2017-02-17
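
For context on why a gradient-impedance layer widens the band: a conventional single matching layer must satisfy the quarter-wave conditions below, and both are tied to one design frequency. This is textbook transmission-line matching, included as background rather than taken from the paper.

```latex
% Single matching layer between a piezoelectric element (impedance Z_p)
% and the target medium (Z_w): reflection vanishes only when the layer
% impedance and thickness satisfy the two conditions below, i.e., at
% (odd multiples of) one frequency -- the narrowband limitation that a
% layer with impedance graded smoothly from Z_p to Z_w is built to remove.
\[
  Z_m \;=\; \sqrt{Z_p\,Z_w}, \qquad d \;=\; \frac{\lambda}{4}
\]
```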