Jianwei Yang

ORCID: 0000-0002-2022-6002
Research Areas
  • Advanced Image and Video Retrieval Techniques
  • Image Retrieval and Classification Techniques
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Image and Signal Denoising Methods
  • Medical Image Segmentation Techniques
  • Image and Object Detection Techniques
  • Advanced Neural Network Applications
  • Topic Modeling
  • Advanced Data Compression Techniques
  • Advanced Vision and Imaging
  • Railway Engineering and Dynamics
  • Digital Media Forensic Detection
  • Natural Language Processing Techniques
  • Digital Filter Design and Implementation
  • Advanced Steganography and Watermarking Techniques
  • Advanced Image Fusion Techniques
  • Optical measurement and interference techniques
  • Image Processing and 3D Reconstruction
  • COVID-19 diagnosis using AI
  • Generative Adversarial Networks and Image Synthesis
  • Face and Expression Recognition
  • Radiomics and Machine Learning in Medical Imaging
  • Chaos-based Image/Signal Encryption
  • Remote Sensing and Land Use

Xiamen University of Technology
2025

Xi'an University of Technology
2024

Hebei Eye Hospital
2024

Microsoft (United States)
2024

Nanjing University of Information Science and Technology
2010-2023

Nanyang Normal University
2009-2023

Dhurakij Pundit University
2023

Southwest Jiaotong University
2010-2023

Microsoft Research (United Kingdom)
2023

Shenzhen University
2022

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated...

10.1109/cvpr52729.2023.02156 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
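
The gated-injection idea lends itself to a compact illustration. Below is a minimal PyTorch sketch, not the released GLIGEN code: a new trainable attention layer sees visual plus grounding tokens, and a zero-initialized tanh gate blends its output back into the frozen stream. Module names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class GatedSelfAttentionSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate starts at zero, so the frozen pretrained model is reproduced
        # exactly at initialization; grounding influence is learned gradually.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, D) frozen-backbone tokens; grounding: (B, M, D) e.g. box+text embeddings
        x = self.norm(torch.cat([visual, grounding], dim=1))
        out, _ = self.attn(x, x, x)
        n = visual.shape[1]
        return visual + torch.tanh(self.gamma) * out[:, :n]  # only visual tokens are updated

# Inject between frozen transformer blocks; train only these layers.
layer = GatedSelfAttentionSketch(dim=64)
v, g = torch.randn(2, 16, 64), torch.randn(2, 4, 64)
print(layer(v, g).shape)  # torch.Size([2, 16, 64])
```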

In this paper, we present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer,...

10.48550/arxiv.2303.05499 preprint EN other-oa arXiv (Cornell University) 2023-01-01
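
As a rough illustration of the tight vision-language fusion described above, here is a minimal sketch of one feature-enhancer-style layer with bidirectional cross-attention. This is an assumed simplification, not the official Grounding DINO implementation.

```python
import torch
import torch.nn as nn

class CrossModalityFusionSketch(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (B, N, D) image tokens; txt: (B, L, D) text tokens (category names / phrases)
        v = vis + self.img2txt(self.norm_v(vis), txt, txt)[0]  # image attends to language
        t = txt + self.txt2img(self.norm_t(txt), vis, vis)[0]  # language attends to image
        return v, t

fusion = CrossModalityFusionSketch(dim=64)
v, t = fusion(torch.randn(2, 100, 64), torch.randn(2, 12, 64))
print(v.shape, t.shape)
```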

In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig. 1. We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries including points, boxes, scribbles and masks, which...

10.48550/arxiv.2304.06718 preprint EN other-oa arXiv (Cornell University) 2023-01-01
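
The unification of spatial prompts can be sketched very simply: if points, boxes, and scribbles are each rasterized to a region mask, one pooling operation over the image feature map turns any of them into a prompt embedding. The function below is a hypothetical toy, not SEEM's decoder.

```python
import torch

def prompt_embedding(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """feat: (D, H, W) image features; mask: (H, W) binary region from a point/box/scribble."""
    w = mask.float().flatten()              # (H*W,)
    w = w / w.sum().clamp(min=1.0)          # average pooling over the prompted region
    return feat.flatten(1) @ w              # (D,) one embedding, whatever the prompt type

feat = torch.randn(64, 32, 32)
point = torch.zeros(32, 32); point[10, 12] = 1     # a click
box = torch.zeros(32, 32);   box[4:12, 5:20] = 1   # a box
print(prompt_embedding(feat, point).shape, prompt_embedding(feat, box).shape)
```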

We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context...

10.48550/arxiv.2203.11926 preprint EN other-oa arXiv (Cornell University) 2022-01-01
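
The three components map almost directly to code. The following condensed PyTorch sketch (an approximation, not the reference implementation; layer names and the gating layout are assumptions) shows hierarchical depth-wise contexts, gated aggregation, and element-wise modulation of the query.

```python
import torch
import torch.nn as nn

class FocalModulationSketch(nn.Module):
    def __init__(self, dim: int, levels: int = 3, kernel: int = 3):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        self.gates = nn.Conv2d(dim, levels + 1, 1)
        self.ctx = nn.ModuleList([
            nn.Sequential(nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim), nn.GELU())
            for _ in range(levels)
        ])
        self.h = nn.Conv2d(dim, dim, 1)        # context-to-modulator projection

    def forward(self, x):                      # x: (B, C, H, W)
        q, v, g = self.q(x), self.v(x), self.gates(x)
        agg = 0
        for lvl, conv in enumerate(self.ctx):  # (i) hierarchical contextualization
            v = conv(v)                        # receptive field grows with each level
            agg = agg + v * g[:, lvl:lvl + 1]  # (ii) gated aggregation
        glob = v.mean((2, 3), keepdim=True)    # global context as the last level
        agg = agg + glob * g[:, -1:]
        return q * self.h(agg)                 # (iii) element-wise modulation of the query

m = FocalModulationSketch(64)
print(m(torch.randn(2, 64, 16, 16)).shape)
```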

Abstract Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles 1–3. Prior models have often resorted to subsampling a small portion of tiles for each slide, thus missing the important slide-level context 4. Here we present Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides from Providence, a large US health network comprising 28 cancer centres. The slides originated from more than 30,000...

10.1038/s41586-024-07441-w article EN cc-by Nature 2024-05-22
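
The preprocessing the abstract implies, exhaustive tiling rather than subsampling, is easy to illustrate. The sketch below is generic NumPy, not the paper's pipeline; a real whole-slide image would be streamed with a library such as OpenSlide rather than held in memory.

```python
import numpy as np

def tile_slide(slide: np.ndarray, size: int = 256):
    h, w = slide.shape[:2]
    tiles, coords = [], []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            tiles.append(slide[y:y + size, x:x + size])
            coords.append((y, x))  # positions let a slide-level model keep spatial context
    return np.stack(tiles), coords

slide = np.random.randint(0, 255, (1024, 2048, 3), dtype=np.uint8)  # stand-in for a gigapixel WSI
tiles, coords = tile_slide(slide)
print(tiles.shape)  # (32, 256, 256, 3): every tile is kept, none subsampled
```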

We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap of vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all visual concepts in the two tasks and learn a common semantic space for them. This gives us reasonably good results compared with counterparts trained on each task only. To further reconcile the two tasks, we identify two discrepancies: i) task discrepancy – segmentation requires...

10.1109/iccv51070.2023.00100 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
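
The common-semantic-space idea can be shown in a few lines: labels from both tasks are embedded by one text encoder, and every region or mask embedding is classified by similarity in that shared space. The embeddings below are random stand-ins; the interface is an assumption, not OpenSeeD's API.

```python
import torch
import torch.nn.functional as F

det_labels = ["person", "car", "dog"]
seg_labels = ["person", "sky", "grass"]
vocab = sorted(set(det_labels) | set(seg_labels))   # unified concept list

torch.manual_seed(0)
text_emb = F.normalize(torch.randn(len(vocab), 64), dim=-1)  # stand-in for text-encoder outputs
region_emb = F.normalize(torch.randn(5, 64), dim=-1)         # stand-in for mask/box queries

logits = region_emb @ text_emb.T    # one semantic space supervises both tasks
print([vocab[i] for i in logits.argmax(-1).tolist()])
```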

In this paper, we introduce Semantic-SAM, a universal image segmentation model to enable segmenting and recognizing anything at any desired granularity. Our model offers two key advantages: semantic-awareness and granularity-abundance. To achieve semantic-awareness, we consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts. This allows our model to capture rich semantic information. For the multi-granularity capability, we propose a multi-choice learning scheme during training,...

10.48550/arxiv.2307.04767 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferability of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for...

10.48550/arxiv.2204.08790 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Abstract Background Verticillium wilt, caused by the fungus Verticillium dahliae, leads to significant losses in cotton yield worldwide. Biocontrol management is a promising means of suppressing verticillium wilt. The purpose of this study was to obtain and analyze endophytic bacteria with verticillium wilt-resistant activities from the roots of Gossypium barbadense ‘Xinhai15’ and to explore the interactions between the soil and plants. Results An endophytic bacterium, Bacillus sp. T6, was obtained from G. barbadense ‘Xinhai15’, which showed antagonistic abilities against V. dahliae in a bioassay...

10.1186/s12866-022-02749-x article EN cc-by BMC Microbiology 2023-01-10

The new generation of state-of-the-art computer vision systems is trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, due to the broad concept coverage achieved via a large-scale data collection process. Alternatively, we argue that learning with external knowledge is a promising way which leverages a much more structured source of supervision and offers sample efficiency. We...

10.48550/arxiv.2204.09222 preprint EN cc-by arXiv (Cornell University) 2022-01-01
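
One concrete way to realize the external-knowledge idea is to append a dictionary gloss to each class name before text encoding. The snippet below approximates this with WordNet via NLTK; the prompt template and the choice of the first synset are illustrative assumptions, not the paper's exact recipe.

```python
from nltk.corpus import wordnet as wn  # pip install nltk; then nltk.download("wordnet")

def knowledge_prompt(class_name: str) -> str:
    synsets = wn.synsets(class_name)
    gloss = synsets[0].definition() if synsets else ""
    # Fall back to the bare name when no knowledge entry exists.
    return f"a photo of a {class_name}, which is {gloss}" if gloss else f"a photo of a {class_name}"

for name in ["airliner", "kuvasz"]:
    print(knowledge_prompt(name))
# Rare concepts gain structured context that the raw category name alone does not carry.
```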

In computer vision, great transfer learning performance has been achieved by adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task. We formulate efficient model adaptation as a subspace training problem and perform a comprehensive benchmarking over different efficient adaptation methods. We conduct an...

10.1609/aaai.v37i1.25160 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2023-06-26
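
A single point in that subspace-training view is easy to exhibit: freeze the pretrained weights and train only a low-rank update (LoRA-style). This sketch is one illustrative variant, not the specific set of methods benchmarked in the paper.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad = False        # pretrained weights stay frozen
        self.down = nn.Linear(linear.in_features, rank, bias=False)
        self.up = nn.Linear(rank, linear.out_features, bias=False)
        nn.init.zeros_(self.up.weight)     # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

layer = LowRankAdapter(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "trainable params vs", 768 * 768 + 768, "for full fine-tuning")
```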

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study...

10.48550/arxiv.2310.11441 preprint EN other-oa arXiv (Cornell University) 2023-01-01
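
The marking step itself is simple to sketch: place an index at each region's centroid so the model can answer with "region 2" instead of pixel coordinates. The toy below uses Pillow and hand-made masks; the real pipeline draws richer overlays on SEEM/SAM outputs.

```python
import numpy as np
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image, masks: list) -> Image.Image:
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        cx, cy = int(xs.mean()), int(ys.mean())   # centroid of the region
        draw.rectangle([cx - 8, cy - 8, cx + 8, cy + 8], fill="white")
        draw.text((cx - 4, cy - 6), str(idx), fill="black")
    return out

img = Image.new("RGB", (64, 64), "gray")
m1 = np.zeros((64, 64), bool); m1[8:24, 8:24] = True
m2 = np.zeros((64, 64), bool); m2[36:60, 30:58] = True
overlay_marks(img, [m1, m2]).save("som_demo.png")
```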

The development and maturation of maize kernel involves meticulous fine gene regulation at transcriptional post-transcriptional levels, miRNAs play important roles during this process. Although a number have been identified in seed, the ones involved early grains different lines not well studied. Here, we profiled four small RNA libraries, each constructed from groups immature Zea mays inbred line Chang 7–2 collected 4–6, 7–9, 12–14, 18–23 days after pollination (DAP). A total 40 known...

10.1371/journal.pone.0153168 article EN cc-by PLoS ONE 2016-04-15

Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models is achieved via a web-scale data collection process to ensure broad concept coverage, followed by expensive pre-training to feed all the knowledge into model weights. Alternatively, we propose REACT, REtrieval-Augmented CusTomization, a framework to acquire the relevant web knowledge to build customized models for target domains. We retrieve the most relevant image-text pairs...

10.1109/cvpr52729.2023.01454 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
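
The retrieval step can be approximated with plain cosine similarity over precomputed CLIP-style embeddings: rank the web pool against a target-domain query and keep the top-k pairs for customization. The data below is random; this is a sketch of the idea, not REACT's code.

```python
import numpy as np

rng = np.random.default_rng(0)
pool = rng.normal(size=(10_000, 512)).astype(np.float32)  # web image-text pair embeddings
query = rng.normal(size=(512,)).astype(np.float32)        # a target-domain concept embedding

pool /= np.linalg.norm(pool, axis=1, keepdims=True)
query /= np.linalg.norm(query)

k = 32
scores = pool @ query
topk = np.argsort(-scores)[:k]   # most relevant pairs form the customization training set
print(topk[:5], scores[topk[0]])
```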

The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the current field lacks a unified and comprehensive benchmark specifically tailored for DETR-based models. To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports the majority of mainstream DETR-based instance recognition algorithms, covering various fundamental tasks,...

10.48550/arxiv.2306.07265 preprint EN other-oa arXiv (Cornell University) 2023-01-01

10.1109/icassp49660.2025.10888778 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space, yielding tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference complexity? In this work, we seek to answer these questions through...

10.48550/arxiv.2201.05729 preprint EN cc-by-nc-sa arXiv (Cornell University) 2022-01-01

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train...

10.48550/arxiv.2112.03857 preprint EN other-oa arXiv (Cornell University) 2021-01-01
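
The reformulation that unifies the two tasks is essentially a change of classification head: region features score against word features, so a box can be supervised by a label name or a caption phrase alike. A minimal sketch with random stand-in features:

```python
import torch
import torch.nn.functional as F

regions = F.normalize(torch.randn(100, 256), dim=-1)  # detector's region/box features
words = F.normalize(torch.randn(7, 256), dim=-1)      # encoded prompt, e.g. "person. dog. umbrella."

alignment = regions @ words.T                         # (100, 7) word-region alignment scores
print(alignment.shape, alignment.max().item())
# A region "detects" whichever word it aligns with; new classes are just new words.
```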

With the increasing penetration of electric vehicles (EV) and distributed generations (DG) in distribution networks, the operation and control of distribution networks (DN) is faced with many new challenges. Considering the remarkable characteristics of a network layered by voltage level and the spatial and temporal distribution of EV charging load, a hierarchical optimization method for DN is proposed. Firstly, a prediction model of EV charging load is established, which is composed of three parts: a resident travel probability model, a vehicle mobility model, and a traffic model. Secondly,...

10.1016/j.egyr.2023.04.086 article EN cc-by-nc-nd Energy Reports 2023-04-20
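
The three-part load model invites a Monte Carlo illustration: sample arrival times and daily mileage, convert mileage to required energy, and accumulate hourly charging power. All distributions and parameters below are invented for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ev, power_kw, kwh_per_km = 1000, 7.0, 0.15
arrival_h = rng.normal(18.0, 1.5, n_ev) % 24   # resident travel: evening home arrivals
mileage_km = rng.lognormal(3.2, 0.6, n_ev)     # vehicle mobility: daily driven distance
charge_h = mileage_km * kwh_per_km / power_kw  # hours needed at rated charger power

load = np.zeros(24)
for a, d in zip(arrival_h, charge_h):
    for h in range(int(np.ceil(d))):
        load[int(a + h) % 24] += power_kw      # aggregate hourly charging load (coarse)
print("peak hour:", load.argmax(), "h, peak load:", round(load.max(), 1), "kW")
```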

The Fourier-Mellin transform (FMT) has been widely used for the extraction of rotation- and scale-invariant features. However, the affine transform is a more reasonable approximation model of real viewpoint change. Due to shearing, the integral along the angular direction in the calculation of the FMT cannot be used to extract inherent features of an image undergoing an affine transform. To eliminate the effect of shearing, whitening should be conducted on the radial direction. The FMT can hardly be modified by conventional whitening-based methods with low computational cost due...

10.1109/tip.2020.2967578 article EN IEEE Transactions on Image Processing 2020-01-01
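
For background, the baseline rotation- and scale-invariance of the FMT is straightforward to demonstrate: resample the image to log-polar coordinates, where rotation and scaling become translations, then take the FFT magnitude. The sketch below shows only this baseline, not the paper's whitening modification.

```python
import numpy as np

def fmt_features(img: np.ndarray, n_r: int = 64, n_theta: int = 64) -> np.ndarray:
    h, w = img.shape
    cy, cx = h / 2, w / 2
    r_max = min(cy, cx)
    rho = np.exp(np.linspace(0, np.log(r_max), n_r))           # log-spaced radii
    theta = np.linspace(0, 2 * np.pi, n_theta, endpoint=False)
    ys = (cy + rho[:, None] * np.sin(theta)).astype(int).clip(0, h - 1)
    xs = (cx + rho[:, None] * np.cos(theta)).astype(int).clip(0, w - 1)
    logpolar = img[ys, xs]                   # rotation/scale now act as shifts here
    return np.abs(np.fft.fft2(logpolar))     # shift-invariant magnitude spectrum

img = np.zeros((128, 128)); img[40:88, 40:88] = 1.0
print(fmt_features(img).shape)
```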