- Advanced Image and Video Retrieval Techniques
- Image Retrieval and Classification Techniques
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Image and Signal Denoising Methods
- Medical Image Segmentation Techniques
- Image and Object Detection Techniques
- Advanced Neural Network Applications
- Topic Modeling
- Advanced Data Compression Techniques
- Advanced Vision and Imaging
- Railway Engineering and Dynamics
- Digital Media Forensic Detection
- Natural Language Processing Techniques
- Digital Filter Design and Implementation
- Advanced Steganography and Watermarking Techniques
- Advanced Image Fusion Techniques
- Optical Measurement and Interference Techniques
- Image Processing and 3D Reconstruction
- COVID-19 diagnosis using AI
- Generative Adversarial Networks and Image Synthesis
- Face and Expression Recognition
- Radiomics and Machine Learning in Medical Imaging
- Chaos-based Image/Signal Encryption
- Remote Sensing and Land Use
Xiamen University of Technology
2025
Xi'an University of Technology
2024
Hebei Eye Hospital
2024
Microsoft (United States)
2024
Nanjing University of Information Science and Technology
2010-2023
Nanyang Normal University
2009-2023
Dhurakij Pundit University
2023
Southwest Jiaotong University
2010-2023
Microsoft Research (United Kingdom)
2023
Shenzhen University
2022
Large-scale text-to-image diffusion models have made amazing advances. However, the current status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated...
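The gated-injection idea in the abstract above — freeze the pretrained weights and blend in grounding through a gate that starts closed — can be illustrated with a toy sketch. This is plain NumPy, not GLIGEN's actual layers; `gated_inject`, the single weight matrix, and the scalar `gamma` gate are illustrative assumptions.

```python
import numpy as np

def gated_inject(x, grounding_feat, weight, gamma):
    """Add grounding information through a tanh-gated trainable branch.

    x:              (n, d) activations of the frozen backbone
    grounding_feat: (n, d) features derived from the grounding input
    weight:         (d, d) weights of the new trainable layer (hypothetical)
    gamma:          scalar gate; tanh(0) == 0, so at initialisation the
                    frozen model's behaviour is completely untouched.
    """
    return x + np.tanh(gamma) * (grounding_feat @ weight)

x = np.ones((2, 4))            # stand-in frozen activations
g = np.full((2, 4), 0.5)       # stand-in grounding features
w = np.eye(4)

out0 = gated_inject(x, g, w, gamma=0.0)  # gate closed: identical to frozen model
out1 = gated_inject(x, g, w, gamma=1.0)  # gate open: grounding signal blended in
```

The zero-initialised gate is what lets the new layers be trained without disturbing the pretrained model's concept knowledge at the start.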
In this paper, we present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution for open-set detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer,...
In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig. 1. In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries including points, boxes, scribbles and masks, which...
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context...
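The three components can be mimicked in a 1-D toy. This is an illustrative NumPy stand-in, not the FocalNets implementation: plain moving averages of growing reach stand in for the depth-wise convolutions, and the per-level gates are supplied rather than learned.

```python
import numpy as np

def focal_modulation_1d(x, gates, levels=3):
    """Toy 1-D focal modulation (hypothetical simplification).

    x:     (n, d) token features
    gates: (n, levels) per-token aggregation logits (softmaxed inside)
    """
    # (i) hierarchical contextualization: repeated 3-tap moving averages,
    # so each level sees a progressively wider context window.
    contexts, ctx = [], x
    for _ in range(levels):
        padded = np.pad(ctx, ((1, 1), (0, 0)), mode="edge")
        ctx = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
        contexts.append(ctx)
    # (ii) gated aggregation: a per-token convex combination over levels.
    g = np.exp(gates) / np.exp(gates).sum(axis=1, keepdims=True)
    modulator = sum(g[:, i:i + 1] * contexts[i] for i in range(levels))
    # (iii) element-wise modulation of the query tokens.
    return x * modulator

x = np.full((5, 4), 2.0)                     # constant tokens for a sanity check
out = focal_modulation_1d(x, np.zeros((5, 3)))
```

With constant input every context level equals the input, so the modulator reproduces it and the output is simply the element-wise square — a quick way to see that no attention matrix is ever formed.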
Abstract: Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles [1–3]. Prior models have often resorted to subsampling a small portion of tiles for each slide, thus missing the important slide-level context [4]. Here we present Prov-GigaPath, a whole-slide foundation model pretrained on 1.3 billion 256 × 256 image tiles in 171,189 whole slides from Providence, a large US health network comprising 28 cancer centres. The slides originated from more than 30,000...
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap of vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them. This gives us reasonably good results compared with counterparts trained on a single task only. To further reconcile the two tasks, we identify their discrepancies: i) task discrepancy – segmentation requires...
In this paper, we introduce Semantic-SAM, a universal image segmentation model to enable segmenting and recognizing anything at any desired granularity. Our model offers two key advantages: semantic-awareness and granularity-abundance. To achieve semantic-awareness, we consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts. This allows our model to capture rich semantic information. For the multi-granularity capability, we propose a multi-choice learning scheme during training,...
Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate their transferability due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for...
Abstract: Background: Verticillium wilt, caused by the fungus Verticillium dahliae, leads to significant losses in cotton yield worldwide. Biocontrol management is a promising means of suppressing verticillium wilt. The purpose of this study was to obtain and analyze endophytic bacteria with wilt-resistant activities from the roots of Gossypium barbadense 'Xinhai15' and to explore the interactions between soil and plants. Results: An endophytic bacterium, Bacillus sp. T6, was obtained from G. barbadense 'Xinhai15', which showed antagonistic abilities in a bioassay...
The new generation of state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, due to the broad concept coverage achieved via a large-scale data collection process. Alternatively, we argue that learning with external knowledge is a promising way, one which leverages a much more structured source of supervision and offers sample efficiency. We...
In computer vision, great transfer learning performance has been achieved by adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task. We formulate efficient model adaptation as a subspace training problem and perform a comprehensive benchmarking over different methods. We conduct an...
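One concrete instance of such a subspace strategy is a low-rank reparameterization of a frozen weight, sketched below in plain NumPy. This is a hypothetical illustration of the parameter-count trade-off, not the paper's benchmarked code; all names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 6, 2

W_frozen = rng.normal(size=(d_in, d_out))   # pretrained weight, kept fixed
A = rng.normal(size=(d_in, rank)) * 0.01    # trainable low-rank factor
B = np.zeros((rank, d_out))                 # zero-init: adapter starts as a no-op

def adapted_forward(x):
    # Full fine-tuning would update all d_in * d_out entries of W_frozen;
    # here only A and B, i.e. rank * (d_in + d_out) numbers, are trainable.
    return x @ W_frozen + x @ A @ B

x = rng.normal(size=(3, d_in))
y0 = adapted_forward(x)           # equals the frozen model at initialisation
n_trainable = A.size + B.size     # 28 trainable vs 48 frozen parameters here
```

Because `B` is zero-initialised, the adapted model starts exactly at the pretrained solution and training only explores the rank-2 subspace.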
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study...
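The marking step can be approximated in a few lines. This is an illustrative NumPy sketch, not the SoM implementation; `place_marks` and the centre-of-mass placement rule are assumptions made here for clarity.

```python
import numpy as np

def place_marks(masks):
    """Assign an alphanumeric mark to each region and pick a location for it.

    masks: list of boolean (H, W) arrays, one per region (e.g. produced by an
           off-the-shelf segmenter such as SEEM or SAM).
    Returns [(label, (row, col)), ...], where the location is the region's
    centre of mass — a simple stand-in for a real mark-placement rule.
    """
    marks = []
    for i, m in enumerate(masks):
        rows, cols = np.nonzero(m)
        marks.append((str(i + 1), (int(rows.mean()), int(cols.mean()))))
    return marks

# Two non-overlapping toy regions on an 8x8 grid.
m1 = np.zeros((8, 8), dtype=bool); m1[0:4, 0:4] = True
m2 = np.zeros((8, 8), dtype=bool); m2[4:8, 4:8] = True
marks = place_marks([m1, m2])
```

The labels drawn at these locations are what let the LMM refer to a region by its mark ("the object at 2") instead of having to output coordinates.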
The development and maturation of the maize kernel involves meticulous, fine gene regulation at the transcriptional and post-transcriptional levels, and miRNAs play important roles during this process. Although a number of miRNAs have been identified in maize seed, those involved in early developing grains of different inbred lines have not been well studied. Here, we profiled four small RNA libraries, each constructed from groups of immature Zea mays inbred line Chang 7–2 kernels collected at 4–6, 7–9, 12–14, and 18–23 days after pollination (DAP). A total of 40 known...
Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models is achieved via a web-scale data collection process to ensure broad concept coverage, followed by expensive pre-training to feed all the knowledge into model weights. Alternatively, we propose REACT, REtrieval-Augmented CusTomization, a framework to acquire relevant web knowledge to build customized models for target domains. We retrieve the most relevant image-text pairs...
The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the current field lacks a unified and comprehensive benchmark specifically tailored for DETR-based models. To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks,...
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space, yielding tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference complexity? In this work, we seek to answer these questions through...
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train...
With the increasing penetration of electric vehicles (EVs) and distributed generations (DGs) in distribution networks, the operation and control of distribution networks (DNs) is faced with many new challenges. Considering the remarkable characteristics of a network layered by voltage level and the spatial and temporal distribution of EV charging load, a hierarchical optimization method for DNs is proposed. Firstly, a prediction model of EV charging load is established, which is composed of three parts: a resident travel probability model, a vehicle mobility model, and a traffic model. Secondly,...
The Fourier-Mellin transform (FMT) has been widely used for the extraction of rotation- and scale-invariant features. However, the affine transform is a more reasonable approximation model for real viewpoint change. Due to shearing, the integral along the angular direction in the calculation of the FMT cannot be used to extract the inherent features of an image undergoing an affine transform. To eliminate this effect, whitening should be conducted along the radial direction, which can hardly be achieved by conventional whitening-based methods with low computational cost due...
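Why the FMT yields scale invariance at all can be seen from a 1-D Mellin analogue: resampling a signal on a logarithmic grid turns scaling into a shift, and the FFT magnitude discards shifts. This is an illustrative NumPy sketch under that assumption; the full 2-D FMT additionally integrates over the angular direction, which is exactly the step the abstract notes is broken by shearing.

```python
import numpy as np

def mellin_magnitude(f, n=256):
    """Approximate |Mellin transform| of f by sampling on a log grid + FFT.

    Scaling f(x) -> f(a*x) becomes a translation in t = log(x), which changes
    only the phase of the FFT, never its magnitude.
    """
    t = np.linspace(0.0, 2 * np.pi, n, endpoint=False)  # t = log(x)
    return np.abs(np.fft.fft(f(np.exp(t))))

# A signal periodic in log(x), so the scale-induced shift is exactly circular.
sig = lambda x: np.cos(3 * np.log(x))
scaled = lambda x: sig(2.0 * x)      # scale change by a factor of 2

m1 = mellin_magnitude(sig)
m2 = mellin_magnitude(scaled)        # identical magnitude spectrum
```

A shear, unlike a pure scale change, couples the radial and angular coordinates, so no single log-resampling along the radius can absorb it — which motivates the radial whitening discussed in the abstract.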