- Domain Adaptation and Few-Shot Learning
- Generative Adversarial Networks and Image Synthesis
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Video Analysis and Summarization
- Computer Graphics and Visualization Techniques
- Multimedia Communication and Technology
- Digital Media Forensic Detection
- Subtitles and Audiovisual Media
- Machine Learning and ELM
- Machine Learning and Data Classification
- Image Processing and 3D Reconstruction
- Artificial Intelligence in Healthcare
- Imbalanced Data Classification Techniques
- ECG Monitoring and Analysis
- Private Equity and Venture Capital
- Digital Holography and Microscopy
- Model Reduction and Neural Networks
- Advanced Optical Imaging Technologies
- Advanced Vision and Imaging
- Photorefractive and Nonlinear Optics
- AI in Cancer Detection
- Face Recognition and Analysis
- Video Surveillance and Tracking Methods
Adobe Systems (United States)
2023-2024
National Institute of Technology Andhra Pradesh
2023
Rajiv Gandhi University of Knowledge Technologies
2023
University of California, Davis
2016-2020
University of California System
2016
Indian Institute of Technology Madras
1992
Large-scale text-to-image generative models have shown a remarkable ability to synthesize diverse, high-quality images. However, directly applying these models to real image editing remains challenging for two reasons. First, it is hard for users to craft a perfect text prompt depicting every visual detail in the input image. Second, while existing methods can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an...
We propose FineGAN, a novel unsupervised GAN framework that disentangles the background, object shape, and object appearance to hierarchically generate images of fine-grained categories. To disentangle the factors without supervision, our key idea is to use information theory to associate each factor with a latent code, and to condition the relationships between the codes in a specific way to induce the desired hierarchy. Through extensive experiments, we show that FineGAN achieves the desired disentanglement to generate realistic and diverse images belonging to fine-grained classes...
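The information-theoretic association above is typically realized, InfoGAN-style, with an auxiliary classifier that tries to recover the sampled latent code from the generated image; minimizing its cross-entropy maximizes a variational lower bound on the mutual information between code and image. A minimal numpy sketch under that assumption (the function name and formulation are illustrative, not FineGAN's actual implementation):

```python
import numpy as np

def mutual_info_lower_bound_loss(q_logits, code_onehot):
    """Cross-entropy of an auxiliary classifier Q that predicts the
    sampled discrete code c from a generated image G(z, c).
    Minimizing this maximizes a lower bound on I(c; G(z, c))."""
    shifted = q_logits - q_logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(code_onehot * np.log(probs + 1e-12), axis=1))
```

When Q recovers the code perfectly the loss approaches 0; chance-level predictions give log(K) for K code values.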
Existing models often leverage co-occurrences between objects and their context to improve recognition accuracy. However, strongly relying on such co-occurrences risks a model's generalizability, especially when typical co-occurrence patterns are absent. This work focuses on addressing such contextual biases to improve the robustness of the learnt feature representations. Our goal is to accurately recognize a category in the absence of its context, without compromising its performance when it co-occurs with context. Our key idea is to decorrelate...
We present MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture from real images with minimal supervision, for mix-and-match image generation. We build upon FineGAN, an unconditional model, to learn the desired disentanglement and image generator, and leverage adversarial joint image-code distribution matching to learn the latent factor encoders. MixNMatch requires bounding boxes during training but no other supervision. Through extensive experiments, we...
We propose a novel way of using videos to obtain high-precision object proposals for weakly-supervised object detection. Existing detection approaches use off-the-shelf proposal methods like edge boxes or selective search to obtain candidate boxes. These provide high recall but at the expense of thousands of noisy proposals. Thus, the entire burden of finding the few relevant regions is left to the ensuing mining step. To mitigate this issue, we focus instead on improving the initial proposals. Since we cannot rely on localization annotations, we turn to video...
We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art models. Our method uses pretrained networks but does not require any updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen...
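The modulation idea can be pictured as a FiLM/SPADE-style scale-and-shift applied to the frozen network's intermediate predictions, with the scale and shift produced from the 2D condition; only the small module's weights are trained. A sketch under that assumption (`gamma_w`/`beta_w` are hypothetical names, not MCM's actual parameterization):

```python
import numpy as np

def modulate_features(features, cond, gamma_w, beta_w):
    """Scale-and-shift modulation of frozen diffusion features.
    features: (n, d) intermediate activations (network weights stay fixed)
    cond:     (n, c) encoding of a 2D condition (e.g., a segmentation map)
    gamma_w, beta_w: (c, d) weights of the small trainable module."""
    gamma = cond @ gamma_w   # per-position scale offset
    beta = cond @ beta_w     # per-position shift
    return features * (1.0 + gamma) + beta
```

With zero-initialized module weights the modulation is the identity, so sampling starts from the pretrained model's behavior and the module only learns the deviation the condition requires.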
In recent years, the use of CLIP (Contrastive Language-Image Pre-Training) has become increasingly popular in a wide range of downstream applications, including zero-shot image classification and text-to-image synthesis. Despite being trained on a vast dataset, the model has been found to exhibit biases against certain protected attributes, such as gender and race. While previous research has focused on the impact of these biases on classification, there has been little investigation into their effects on CLIP-based generative tasks. In this paper, we...
We investigate how to generate multimodal image outputs, such as RGB, depth, and surface normals, with a single generative model. The challenge is to produce outputs that are realistic and also consistent with each other. Our solution builds on the StyleGAN3 architecture, with a shared backbone and modality-specific branches in the last layers of the synthesis network, and we propose per-modality fidelity discriminators and a cross-modality consistency discriminator. In experiments on the Stanford2D3D dataset, we demonstrate realistic...
We propose a novel unsupervised generative model that learns to disentangle object identity from other low-level aspects in class-imbalanced data. We first investigate the issues surrounding the assumptions about uniformity made by InfoGAN, and demonstrate its ineffectiveness on properly imbalanced data. Our key idea is to make the discovery of the discrete latent factor of variation invariant to identity-preserving transformations of real images, and to use that as a signal to learn the appropriate latent distribution representing object identity. Experiments...
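The invariance signal described above can be sketched as a consistency loss between the discrete-code predictions for an image and for an identity-preserving transformation of it (e.g., a crop or color jitter); a symmetric-KL form is one plausible choice, assumed here rather than taken from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def invariance_loss(logits_orig, logits_aug):
    """Symmetric KL between the predicted discrete-code distributions
    for an image and its identity-preserving augmentation.
    Zero iff the two predictions agree, so minimizing it makes the
    discovered code invariant to the transformation."""
    p, q = softmax(logits_orig), softmax(logits_aug)
    kl = lambda a, b: np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12)), axis=1)
    return np.mean(kl(p, q) + kl(q, p))
```

Because the transformations preserve identity but scramble low-level appearance, a code that survives them is pushed toward encoding identity rather than nuisance factors.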
Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complying with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual efforts. Our model leverages the power of large-scale video diffusion models, and is specifically tailored for this task. ActAnywhere takes a sequence...
Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for multi-object generation. In this work, we first show the fundamental reasons for such misalignment by identifying issues related to low attention activation and mask overlaps. We then propose a fine-tuning framework with two novel objectives, the Separate loss and the Enhance loss, that reduce object...
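Given the two failure modes named above, plausible forms for the objectives are: a Separate loss that penalizes spatial overlap between the cross-attention maps of different object tokens, and an Enhance loss that pushes each object's peak attention activation up. These are assumed forms for illustration, since the abstract is truncated before the losses are defined:

```python
import numpy as np

def separate_enhance_losses(attn_a, attn_b):
    """attn_a, attn_b: cross-attention maps (values in [0, 1]) for two
    object tokens over the same spatial grid.
    separate: total overlap between the two maps (assumed min-based form).
    enhance:  shortfall of each map's peak activation from 1."""
    separate = np.sum(np.minimum(attn_a, attn_b))
    enhance = (1.0 - attn_a.max()) + (1.0 - attn_b.max())
    return separate, enhance
```

Driving `separate` to zero gives each object its own spatial region, while driving `enhance` to zero guards against the low-activation failure where an object token never claims any region strongly.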
Group portrait editing is highly desirable since users constantly want to add a person, delete a person, or manipulate existing persons. It is also challenging due to the intricate dynamics of human interactions and diverse gestures. In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there is no labeled data for group photo editing, we create a data engine to generate paired data for training. The training data covers the diverse needs of group portrait editing. 2) Appearance Preservation: To keep...
In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the model does not depend solely on textual information but also considers the associated images in predicting...
We introduce a high-fidelity portrait shadow removal model that can effectively enhance a portrait image by predicting its appearance under disturbing shadows and highlights. Portrait shadow removal is a highly ill-posed problem where multiple plausible solutions can be found based on a single image. For example, disentangling complex environmental lighting from the original skin color is a non-trivial problem. While existing works have approached this by predicting residuals that propagate local color distributions, such methods are often incomplete and lead...