- Handwritten Text Recognition Techniques
- Advanced Image and Video Retrieval Techniques
- Multimodal Machine Learning Applications
- Natural Language Processing Techniques
- Image Retrieval and Classification Techniques
- Image Processing and 3D Reconstruction
- Human Pose and Action Recognition
- Advanced Chemical Sensor Technologies
- Web Data Mining and Analysis
- Domain Adaptation and Few-Shot Learning
- Vehicle License Plate Recognition
- Model-Driven Software Engineering Techniques
- Fault Detection and Control Systems
- Sensor Technology and Measurement Systems
- Anomaly Detection Techniques and Applications
- Infrared Target Detection Methodologies
- Time Series Analysis and Forecasting
- Video Analysis and Summarization
- Data Quality and Management
Baidu (China)
2021-2024
Vision Technology (United States)
2022
Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Due to the complexity of content and layout in VRDs, structured text understanding has been a challenging task. Most existing studies decouple this problem into two sub-tasks, entity labeling and entity linking, which require an understanding of the entire context of documents at both token and segment levels. However, little work has been concerned with solutions that efficiently extract structured data from different levels. This paper proposes a unified framework named...
Visual appearance is considered to be the most important cue for understanding images in cross-modal retrieval, while scene text appearing in images can sometimes provide valuable information for understanding the visual semantics. Most existing cross-modal retrieval approaches ignore scene text information, and directly adding this information may lead to performance degradation in scene-text-free scenarios. To address this issue, we propose a full transformer architecture to unify these cross-modal retrieval scenarios in a single Vision and Scene Text Aggregation framework (ViSTA). Specifically, ViSTA utilizes transformer blocks...
In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework and mainly develop a complex decoder, e.g., regarding pose estimation as keypoint box detection and combining it with human detection in ED-Pose [38], or hierarchically predicting with a pose decoder and a joint (keypoint) decoder in PETR [27]. We present a simple yet effective transformer approach, named Group Pose. We simply regard K-keypoint pose estimation as predicting a set of N × K keypoint positions, each from a keypoint query, as well as representing each pose with an instance...
Due to their flexible representation of arbitrary-shaped scene text and their simple pipeline, bottom-up segmentation-based methods are becoming mainstream in real-time scene text detection. Despite great progress, these methods show deficiencies in robustness and still suffer from false positives and instance adhesion. Different from existing methods, which integrate multiple-granularity features or multiple outputs, we resort to the perspective of representation learning, in which auxiliary tasks are utilized to enable the encoder to jointly learn robust features with the main task of per-pixel...
Typical text spotters follow a two-stage spotting strategy: detect the precise boundary of a text instance first, and then perform text recognition within the located text region. While such a strategy has achieved substantial progress, there are two underlying limitations. 1) The performance of text recognition depends heavily on the precision of text detection, resulting in potential error propagation from detection to recognition. 2) The RoI cropping that bridges detection and recognition brings in noise from the background and leads to information loss when pooling or interpolating...
All tables can be represented as grids. Based on this observation, we propose GridFormer, a novel approach for interpreting unconstrained table structures by predicting the vertex and edge of a grid. First, we propose a flexible table representation in the form of an M × N grid. In this representation, the vertexes and edges of the grid store the localization and adjacency information of the table. Then, we introduce a DETR-style table structure recognizer to efficiently predict this multi-objective information of the grid in a single shot. Specifically, given a set of learned row and column queries, the recognizer directly...
In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques (e.g., improved loss and pretraining) and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, and the intermediate and final feature maps in the ViT encoder, forming richer...
Text-rich images have significant and extensive value and are deeply integrated into various aspects of human life. Notably, both visual cues and linguistic symbols in text-rich images play crucial roles in information transmission but are accompanied by diverse challenges. Therefore, the efficient and effective understanding of text-rich images is a litmus test for the capability of Vision-Language Models. We crafted an efficient vision-language model, StrucTexTv3, tailored to tackle intelligent tasks for text-rich images. The design of StrucTexTv3 is presented in the following...
Open-vocabulary object detection focuses on detecting novel categories guided by natural language. In this report, we propose the Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency. Building upon OVLW-DETR, we provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector with simple alignment. We align the detector with the text encoder from the VLM by replacing the fixed classification layer...
Diffusion models have exhibited remarkable prowess in visual generalization. Building on this success, we introduce an instruction-based object addition pipeline, named Add-SD, which automatically inserts objects into realistic scenes with rational sizes and positions. Different from layout-conditioned methods, Add-SD is conditioned solely on simple text prompts rather than on other human-costly references such as bounding boxes. Our work contributes in three aspects: proposing a dataset containing...