- Radio Frequency Integrated Circuit Design
- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Advanced Vision and Imaging
- Microwave Engineering and Waveguides
- Video Surveillance and Tracking Methods
- Millimeter-Wave Propagation and Modeling
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Advancements in PLL and VCO Technologies
- Analog and Mixed-Signal Circuit Design
- Advanced Power Amplifier Design
- Robotics and Sensor-Based Localization
- Advanced Neural Network Applications
- Antenna Design and Optimization
- Image Processing Techniques and Applications
- Generative Adversarial Networks and Image Synthesis
- Integrated Circuits and Semiconductor Failure Analysis
- Natural Language Processing Techniques
- Semiconductor materials and devices
- Antenna Design and Analysis
- Energy Harvesting in Wireless Networks
- 3D IC and TSV technologies
- Topic Modeling
- Anomaly Detection Techniques and Applications
Washington State University
2019-2025
Microsoft Research (United Kingdom)
2021-2024
Film Independent
2024
IBM Research - Thomas J. Watson Research Center
2013-2021
IBM (United States)
2014-2020
University of California, Los Angeles
2020
Bioengineering Center
2018
Southern Methodist University
2017-2018
University of Illinois Urbana-Champaign
2016
Yuan Ze University
2014
The goal of image stitching is to create natural-looking mosaics free of artifacts that may occur due to relative camera motion, illumination changes, and optical aberrations. In this paper, we propose a novel stitching method that uses a smooth stitching field over the entire target image, while accounting for all the local transformation variations. Computing the warp is fully automated and uses a combination of local homography and global similarity transformations, both of which are estimated with respect to the target. We mitigate the perspective distortion in...
The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaptation to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs, and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our...
Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate...
Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework LAVENDER, where Masked Language Modeling [13] (MLM) is used as the common interface for all pre-training and downstream tasks. Such...
We propose a modified variational autoencoder (VAE) architecture built on top of Mask R-CNN for instance-level video segmentation and tracking. The method builds a shared encoder and three parallel decoders, yielding three disjoint branches for predictions of future frames, object detection boxes, and instance segmentation masks. To effectively solve multiple learning tasks, we introduce a Gaussian Process model to enhance the statistical representation of the VAE by relaxing the prior strong independent and identically distributed (iid)...
The best beam steering directions are estimated through beam training, which is one of the most important and challenging tasks in millimeter-wave and sub-terahertz communications. Novel array architectures and signal processing techniques are required to avoid the prohibitive beam training overhead associated with large antenna arrays and narrow beams. In this work, we leverage recent developments in true-time-delay (TTD) arrays with large delay-bandwidth products to accelerate beam training using frequency-dependent probing beams. We propose and study two TTD...
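The idea of frequency-dependent probing beams can be illustrated with a small numerical sketch. It is not the method from the abstract above: it assumes a uniform linear array with half-wavelength spacing at the carrier, a delay that grows by a fixed `delay_step` per element, and an ideal array response. The function name `ttd_beam_directions` and all parameters are hypothetical. For each subcarrier it scans a grid of angles and reports where the array gain peaks.

```python
import cmath
import math


def ttd_beam_directions(num_antennas, fc, bandwidth, num_subcarriers,
                        delay_step, num_angles=721):
    """Return (freqs, peak_angles) for an idealized TTD linear array.

    Assumptions (illustrative only): element spacing is half a wavelength
    at the carrier fc, element n applies a delay of n * delay_step, and
    num_subcarriers >= 2.
    """
    c = 3.0e8                      # speed of light in m/s
    d = c / (2.0 * fc)             # half-wavelength spacing at the carrier
    step = bandwidth / (num_subcarriers - 1)
    freqs = [fc - bandwidth / 2.0 + k * step for k in range(num_subcarriers)]
    angles = [-math.pi / 2 + i * math.pi / (num_angles - 1)
              for i in range(num_angles)]

    peaks = []
    for f in freqs:
        best_angle, best_gain = angles[0], -1.0
        for theta in angles:
            # Phase advance per element, in cycles: propagation minus delay.
            cycles = f * (d * math.sin(theta) / c - delay_step)
            response = sum(cmath.exp(2j * math.pi * n * cycles)
                           for n in range(num_antennas))
            gain = abs(response)
            if gain > best_gain:
                best_angle, best_gain = theta, gain
        peaks.append(best_angle)
    return freqs, peaks
```

With `delay_step = 0` every subcarrier points at broadside. With `delay_step = 1 / bandwidth` the product of frequency and delay changes by one full cycle across the band, so in this idealized model different subcarriers peak at different angles and a single wideband symbol probes many directions at once.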
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, whereby it encourages the learned visual embedding to be discriminative and more semantically consistent. In zero-shot inference, we devise a simple semantic transfer...
This paper presents a prior-less method for tracking and clustering an unknown number of human faces and maintaining their individual identities in unconstrained videos. The key challenge is to accurately track faces with partial occlusion and drastic appearance changes in multiple shots resulting from significant variations of makeup, facial expression, head pose and illumination. To address this challenge, we propose a new multi-face tracking and re-identification algorithm, which provides high accuracy in face association in the...
Initial access in millimeter-wave (mmW) wireless is critical toward successful realization of the fifth-generation (5G) wireless networks and beyond. Limited bandwidth in existing standards and the use of phase-shifters in analog/hybrid phased-antenna arrays (PAAs) are not suited for these emerging standards demanding low-latency direction finding. This work proposes a reconfigurable true-time-delay (TTD)-based spatial signal processor (SSP) with frequency-division beam training methodology and wideband beam-squint less data...
Spatial signal processors (SSP) for emerging millimeter-wave wireless networks are critically dependent on link discovery. To avoid loss in communication, mobile devices need to locate narrow directional beams with millisecond latency. In this work, we demonstrate a true-time-delay (TTD) array with digitally reconfigurable delay elements enabling both fast beam-training at the receiver and wideband data communications. In beam-training mode, large delay-bandwidth products are implemented to accelerate beam training using...
This article presents a process- and temperature-invariant, high-resolution, highly linear, low-power phase interpolator (PI) as an enabler for discrete-time spatial signal processors (SSPs) and various mixed-mode RF transceiver architectures. Using current integration techniques, the proposed PI generates an adaptable constant slope-and-swing ramp to achieve significantly lower power suited for multiple antenna elements. Switched-capacitor-based bias generation enables tracking of the ramp generator over process,...
We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses a video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script. The generated script details character...
This study explores the concept of equivariance in vision-language foundation models (VLMs), focusing specifically on the multimodal similarity function that is not only the major training objective but also the core delivery to support downstream tasks. Unlike the existing image-text similarity objective which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes. This allows VLMs to generalize better to nuanced and unseen multimodal compositions. However, modeling equivariance is challenging as the ground truth...
The decadal research in integrated true-time-delay arrays has seen organic growth, enabling the realization of wideband beamformers for large arrays with wide aperture widths. This article introduces highly reconfigurable delay elements implementable at analog or digital baseband that enable multiple Spatial Signal Processing (SSP) functions including beamforming, interference cancellation, and fast beam training. Details of the beam-training algorithm, system design considerations, architecture, and circuits...
This paper proposes a new ego-motion estimation and background/foreground classification method to effectively segment moving objects from videos captured by a camera on a moving platform. Existing methods for moving-camera detection impose serious constraints. In our approach, an ellipsoid scene shape is applied in the motion model and a complicated ego-motion formula is derived. A genetic algorithm is introduced to accurately solve the ego-motion parameters. After ego-motion recovery, the noisy result is refined by motion vector correlation and the foreground is classified by pixel...
Most methods for Bundle Adjustment (BA) in computer vision are either centralized or operate incrementally. This leads to poor scaling and affects the quality of the solution as the number of images grows in large-scale structure from motion (SfM). Furthermore, they cannot be used in scenarios where image acquisition and processing must be distributed. We address this problem with a new distributed BA algorithm. Our distributed formulation uses the alternating direction method of multipliers (ADMM), and, since each processor sees...
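The ADMM consensus pattern mentioned in this abstract can be sketched on a toy problem. This is not the paper's bundle adjustment formulation: it assumes each worker i holds a scalar quadratic cost 0.5 * a_i * (x - b_i)^2, and all workers must agree on one shared value. The function name `consensus_admm` and its parameters are hypothetical.

```python
def consensus_admm(weights, targets, rho=1.0, num_iters=200):
    """Global-consensus ADMM on a toy scalar problem (illustrative only).

    Worker i privately holds the cost 0.5 * weights[i] * (x - targets[i])**2.
    Returns the consensus value z and the list of local estimates x.
    """
    n = len(weights)
    x = [0.0] * n      # local estimates, one per worker
    u = [0.0] * n      # scaled dual variables, one per worker
    z = 0.0            # shared consensus variable

    for _ in range(num_iters):
        # Local step: each worker minimizes its own cost plus the penalty
        # 0.5 * rho * (x - z + u)**2, which has a closed form here.
        x = [(a * b + rho * (z - ui)) / (a + rho)
             for a, b, ui in zip(weights, targets, u)]
        # Consensus step: average the local estimates plus their duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each worker's disagreement with z.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z, x
```

Each worker only touches its own `weights[i]` and `targets[i]` in the local step, and only the averaging step needs communication. For this toy cost the consensus value approaches the weighted mean of the targets.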
In this paper, we propose a hierarchical computational system architecture to support the target domain of realtime mobile computing in the context of unmanned aerial vehicles (UAVs). The overall architectural vision includes support for resilience in the presence of uncertainties in the operational environment of surveillance UAVs. We report measurement-based results that are obtained from a UAV proxy demonstration apparatus. The apparatus consists of a Raspberry Pi (RPi) board that serves as an on-board computer, working with a laptop...