- Advanced Image and Video Retrieval Techniques
- Visual Attention and Saliency Detection
- Multimodal Machine Learning Applications
- Image Enhancement Techniques
- Advanced Neural Network Applications
- Generative Adversarial Networks and Image Synthesis
- Video Analysis and Summarization
- Reinforcement Learning in Robotics
- Domain Adaptation and Few-Shot Learning
- Speech and Audio Processing
- Music and Audio Processing
- Multi-Agent Systems and Negotiation
- Topic Modeling
- Natural Language Processing Techniques
- Video Surveillance and Tracking Methods
- Handwritten Text Recognition Techniques
- Smart Grid Energy Management
- Remote Sensing and Land Use
- Cognitive Science and Education Research
- Human Pose and Action Recognition
- Language and Cultural Evolution
- Scientific Computing and Data Management
- Semantic Web and Ontologies
- Remote-Sensing Image Classification
- Simulation Techniques and Applications
King Abdullah University of Science and Technology
2023-2025
Alibaba Group (China)
2021-2023
Southern University of Science and Technology
2023
Inception Institute of Artificial Intelligence
2022
China University of Geosciences (Beijing)
2021-2022
Shandong University
2022
Alibaba Group (United States)
2021
China University of Geosciences
2020
Although current salient object detection (SOD) works have achieved significant progress, they are limited when it comes to the integrity of the predicted salient regions. We define the concept of integrity at both a micro and macro level. Specifically, at the micro level, the model should highlight all parts that belong to a certain salient object. Meanwhile, at the macro level, the model needs to discover all salient objects in a given image. To facilitate integrity learning for SOD, we design a novel Integrity Cognition Network (ICON), which explores three important components for learning strong integrity features. 1) Unlike...
We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to the random masking strategy of recent VL models, we design alignment guided masking to jointly focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for self-supervised VL pre-training at patches of different scale. Kaleido-BERT is conceptually simple and easy to extend...
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture for replacing the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we designed masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models...
Large Language Model (LLM)-based agents have demonstrated remarkable effectiveness. However, their performance can be compromised in data science scenarios that require real-time data adjustment, expertise in optimization due to complex dependencies among various tasks, and the ability to identify logical errors for precise reasoning. In this study, we introduce the Data Interpreter, a solution designed to solve problems with code, which emphasizes three pivotal techniques to augment problem-solving in data science: 1) dynamic planning...
Long-form writing agents require flexible integration and interaction across information retrieval, reasoning, and composition. Current approaches rely on predetermined workflows and rigid thinking patterns to generate outlines before writing, resulting in constrained adaptability during writing. In this paper we propose a general agent framework that achieves human-like adaptive writing through recursive task decomposition and dynamic integration of three fundamental task types, i.e. retrieval, reasoning, and composition. Our methodology features: 1) planning...
The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This survey provides a comprehensive overview, framing intelligent agents within a modular, brain-inspired...
Recently, deep learning-based methods have made great progress in hyperspectral image (HSI) classification (HSIC). Different from ordinary images, the intrinsic complexity of HSI data still limits the performance of many common convolutional neural network (CNN) models. Thus, the network architecture becomes more and more complex to extract discriminative spectral-spatial features. For instance, a 3-D CNN usually has a large number of trainable parameters, thus increasing the computational complexity of HSIC. In this letter, we designed...
Camouflaged Object Detection (COD) aims to detect objects with similar patterns (e.g., texture, intensity, colour, etc.) to their surroundings, and recently has attracted growing research interest. As camouflaged objects often present very ambiguous boundaries, how to determine object locations as well as their weak boundaries is challenging and also the key to this task. Inspired by the biological visual perception process when a human observer discovers camouflaged objects, this paper proposes a novel edge-based reversible re-calibration...
Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents -- all communicating through the same universal symbolic language -- are easily...
Camouflaged object detection (COD), which aims to identify the objects that conceal themselves into the surroundings, has recently drawn increasing research efforts in the field of computer vision. In practice, the success of deep learning based COD is mainly determined by two key factors, including (i) a significantly large receptive field, which provides rich context information, and (ii) an effective fusion strategy, which aggregates multi-level features for accurate COD. Motivated by these observations, in this paper,...
Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated...
Figure skating scoring is challenging because it requires judging players’ technical moves as well as their coordination with the background music. Most learning-based methods struggle for two reasons: 1) each move in figure skating changes quickly, hence simply applying traditional frame sampling will lose a lot of valuable information, especially in videos lasting 3 to 5 minutes; 2) prior methods rarely considered the critical audio-visual relationship in their models. Due to these reasons, we introduce a novel architecture,...
Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions [45], [46], [30]. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first learn by itself to identify and recognize relevant states/actions/rewards. Without relying on ground-truth annotations, our new method called Deep...
We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to the random masking strategy of recent VL models, we design alignment guided masking to jointly focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for self-supervised VL pre-training at patches of different scale. Kaleido-BERT is conceptually simple...
Temporal video segmentation is the get-to-go automatic video analysis, which decomposes a long-form video into smaller components for the following-up understanding tasks. Recent works have studied several levels of granularity to segment a video, such as shot, event, and scene. Those segmentations can help compare the semantics in the corresponding scales, but lack a wider view of larger temporal spans, especially when the video is complex and structured. Therefore, we present two abstractive levels of temporal segmentation and study their hierarchy to the existing fine-grained...
Various human-designed prompt engineering techniques have been proposed to improve problem solvers based on Large Language Models (LLMs), yielding many disparate code bases. We unify these approaches by describing LLM-based agents as computational graphs. The nodes implement functions to process multimodal data or query LLMs, and the edges describe the information flow between operations. Graphs can be recursively combined into larger composite graphs representing hierarchies of inter-agent...
Most current LLM-based models for video understanding can process videos within minutes. However, they struggle with lengthy videos due to challenges such as "noise and redundancy", as well as "memory and computation" constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, specifically designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content. Goldfish approaches these challenges with an efficient retrieval...
Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems -- or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing...
Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We...