- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Topic Modeling
- Generative Adversarial Networks and Image Synthesis
- Adversarial Robustness in Machine Learning
- Natural Language Processing Techniques
- Cancer-related molecular mechanisms research
- Reinforcement Learning in Robotics
- Video Analysis and Summarization
- Explainable Artificial Intelligence (XAI)
- Anomaly Detection Techniques and Applications
- Video Surveillance and Tracking Methods
- Advanced Vision and Imaging
- COVID-19 diagnosis using AI
- Image Retrieval and Classification Techniques
- Speech and Audio Processing
- Robotics and Sensor-Based Localization
- Robot Manipulation and Learning
- Remote-Sensing Image Classification
- Music and Audio Processing
- Speech Recognition and Synthesis
- Speech and dialogue systems
- IBM (United States), 2019-2024
- Boston University, 2015-2023
- Massachusetts Institute of Technology, 2004-2020
- Adobe Systems (United States), 2020
- Max Planck Society, 2011-2020
- Stanford University, 2020
- University of Illinois Urbana-Champaign, 2019
- University of Massachusetts Lowell, 2013-2017
- The University of Texas at Austin, 2016
- Laboratoire d'Informatique de Paris-Nord, 2016
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning that is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed...
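The core idea of the recurrent convolutional architecture above, per-frame visual features fed through a temporal recurrence, can be sketched in a few lines. This is a minimal NumPy toy with random untrained weights and a plain RNN cell standing in for the paper's CNN+LSTM stack; all dimensions and names here are illustrative assumptions, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hid_dim, n_classes = 64, 32, 10

# stand-in for a per-frame CNN: one fixed linear projection of flattened pixels
W_cnn = rng.standard_normal((28 * 28, feat_dim)) * 0.05
# vanilla recurrent cell plus a classifier on the final hidden state
W_xh = rng.standard_normal((feat_dim, hid_dim)) * 0.1
W_hh = rng.standard_normal((hid_dim, hid_dim)) * 0.1
W_out = rng.standard_normal((hid_dim, n_classes)) * 0.1

def classify_video(frames):
    """frames: array of shape (T, 28, 28) -> class scores of shape (n_classes,)."""
    h = np.zeros(hid_dim)
    for frame in frames:
        x = np.tanh(frame.ravel() @ W_cnn)   # per-frame visual feature
        h = np.tanh(x @ W_xh + h @ W_hh)     # recurrent temporal update
    return h @ W_out

video = rng.standard_normal((16, 28, 28))    # 16 toy frames
scores = classify_video(video)
print(scores.shape)                          # (10,)
```

Because every step is differentiable, the whole pipeline (feature extractor, recurrence, classifier) can in principle be trained end-to-end, which is the property the abstract emphasizes.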
Adversarial learning methods are a promising approach to training robust deep networks, and can generate complex samples across diverse domains. They can also improve recognition despite the presence of domain shift or dataset bias: recent adversarial approaches to unsupervised domain adaptation reduce the difference between the training and test distributions and thus improve generalization performance. However, while generative adversarial networks (GANs) show compelling visualizations, they are not optimal on discriminative tasks and can be limited to smaller...
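The adversarial adaptation game described above pairs a domain discriminator (source vs. target) against a feature encoder that tries to fool it. A minimal NumPy sketch of the two opposing objectives, assuming a linear discriminator over pre-extracted features (the function names and shapes here are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def _bce(feats, w, label):
    """Binary cross-entropy of a linear logistic domain classifier."""
    p = 1.0 / (1.0 + np.exp(-(feats @ w)))
    return -np.mean(label * np.log(p + 1e-8) + (1 - label) * np.log(1 - p + 1e-8))

def discriminator_loss(src_feats, tgt_feats, w):
    """Discriminator tries to label source as 1 and target as 0."""
    return 0.5 * (_bce(src_feats, w, 1.0) + _bce(tgt_feats, w, 0.0))

def confusion_loss(tgt_feats, w):
    """Encoder's adversarial objective: make target features look like source
    (inverted label, as in GAN generator training)."""
    return _bce(tgt_feats, w, 1.0)

src = rng.standard_normal((32, 8))
tgt = rng.standard_normal((32, 8)) + 0.5   # shifted target domain
w = rng.standard_normal(8)
d_loss, c_loss = discriminator_loss(src, tgt, w), confusion_loss(tgt, w)
```

Training alternates: minimize `discriminator_loss` over `w`, then minimize `confusion_loss` over the encoder producing `tgt_feats`, until the discriminator can no longer separate the domains.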
Recent reports suggest that a generic supervised deep CNN model trained on a large-scale dataset reduces, but does not remove, dataset bias on a standard benchmark. Fine-tuning deep models in a new domain can require a significant amount of labeled data, which for many applications is simply not available. We propose a new CNN architecture which introduces an adaptation layer and an additional domain confusion loss, to learn a representation that is both semantically meaningful and domain invariant. We additionally show that a domain confusion metric can be used for model selection to determine the dimension of an adaptation layer and the best...
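A common way to instantiate the domain confusion term above is a maximum mean discrepancy (MMD) penalty between source and target activations of the adaptation layer: when the penalty is small, the two domains are indistinguishable in that representation. A linear-kernel sketch, under the assumption that features have already been extracted (toy shapes only):

```python
import numpy as np

def mmd_linear(src, tgt):
    """Squared linear-kernel MMD between two batches of adaptation-layer
    activations: the squared distance between their mean embeddings."""
    delta = src.mean(axis=0) - tgt.mean(axis=0)
    return float(delta @ delta)

rng = np.random.default_rng(0)
src = rng.standard_normal((256, 32))
tgt = rng.standard_normal((256, 32)) + 1.0   # target domain shifted by 1 per dim
same = mmd_linear(src, src)                  # identical batches -> exactly 0
shifted = mmd_linear(src, tgt)               # large for shifted domains
```

During training this penalty is added to the classification loss, so the network is pushed toward features that are discriminative yet carry no domain signal.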
Domain adaptation is critical for success in new, unseen environments. Adversarial adaptation models applied in feature spaces discover domain invariant representations, but are difficult to visualize and sometimes fail to capture pixel-level and low-level domain shifts. Recent work has shown that generative adversarial networks combined with cycle-consistency constraints are surprisingly effective at mapping images between domains, even without the use of aligned image pairs. We propose a novel discriminatively-trained...
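The cycle-consistency constraint mentioned above requires that mapping an image to the other domain and back reproduces the original, which is what makes unpaired training possible. A toy NumPy sketch with invertible linear maps standing in for the two generators (the names `G_st`/`G_ts` are illustrative):

```python
import numpy as np

def cycle_consistency_loss(x, G_st, G_ts):
    """L1 reconstruction error after mapping source -> target -> source."""
    return float(np.abs(G_ts(G_st(x)) - x).mean())

# toy "generators": a linear map between pixel spaces and its exact inverse
A = np.array([[1.0, 0.2], [0.0, 1.0]])
A_inv = np.linalg.inv(A)
G_st = lambda x: x @ A
G_ts = lambda x: x @ A_inv

x = np.random.default_rng(0).standard_normal((5, 2))
loss = cycle_consistency_loss(x, G_st, G_ts)   # ~0 for exact inverses
```

In the real setting the generators are deep networks and the loss is minimized jointly with adversarial terms, so reconstructions are only approximately exact; here the inverse is analytic, so the loss is numerically zero.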
Unlike human learning, machine learning often fails to handle changes between training (source) and test (target) input distributions. Such domain shifts, common in practical scenarios, severely damage the performance of conventional machine learning methods. Supervised domain adaptation methods have been proposed for the case when the target data have labels, including some that perform very well despite being "frustratingly easy" to implement. However, in practice, the target domain is often unlabeled, requiring unsupervised adaptation. We propose a...
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential...
Real-world videos often have complex dynamics; methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on...
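The variable-length-in, variable-length-out structure above is the classic encode-then-decode loop: read all frames into a hidden state, then greedily emit words until an end token. A toy NumPy sketch with a plain RNN cell in place of the LSTM and random untrained weights (the tiny vocabulary and all dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<bos>", "<eos>", "a", "man", "plays", "guitar"]
hid = 16

W_enc = rng.standard_normal((32, hid)) * 0.1               # frame-feature -> hidden
W_dec = rng.standard_normal((hid + len(vocab), hid)) * 0.1 # [hidden; word] -> hidden
W_out = rng.standard_normal((hid, len(vocab))) * 0.1       # hidden -> word scores

def one_hot(i):
    v = np.zeros(len(vocab)); v[i] = 1.0; return v

def describe(frames, max_len=8):
    """frames: (T, 32) per-frame features -> list of emitted words."""
    h = np.zeros(hid)
    for f in frames:                                   # encoding stage: read frames
        h = np.tanh(f @ W_enc + h)
    words, tok = [], one_hot(vocab.index("<bos>"))
    for _ in range(max_len):                           # decoding stage: emit words
        h = np.tanh(np.concatenate([h, tok]) @ W_dec)
        i = int(np.argmax(h @ W_out))                  # greedy choice
        if vocab[i] == "<eos>":
            break
        words.append(vocab[i])
        tok = one_hot(i)
    return words

caption = describe(rng.standard_normal((10, 32)))
```

With untrained weights the output is arbitrary tokens; the point is only the control flow: nothing constrains the number of input frames to match the number of output words.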
Conventional unsupervised domain adaptation (UDA) assumes that training data are sampled from a single domain. This neglects the more practical scenario where training data are collected from multiple sources, requiring multi-source domain adaptation. We make three major contributions towards addressing this problem. First, we collect and annotate by far the largest UDA dataset, called DomainNet, which contains six domains and about 0.6 million images distributed among 345 categories, addressing the gap in data availability for multi-source adaptation research. Second,...
Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015.
We address the problem of activity detection in continuous, untrimmed video streams. This is a difficult task that requires extracting meaningful spatio-temporal features to capture activities, and accurately localizing the start and end times of each activity. We introduce a new model, Region Convolutional 3D Network (R-C3D), which encodes the video streams using a three-dimensional fully convolutional network, then generates candidate temporal regions containing activities, and finally classifies selected regions into specific activities....
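Proposal-based detectors like the one above score candidate temporal regions against ground-truth activities using temporal intersection-over-union (IoU), the 1-D analogue of the box IoU used in object detection. A small self-contained sketch of that matching criterion (thresholds and segment values are illustrative):

```python
def temporal_iou(seg_a, seg_b):
    """IoU between two [start, end] segments (seconds or frame indices)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

overlap = temporal_iou([2.0, 6.0], [4.0, 8.0])   # inter=2, union=6 -> 1/3
disjoint = temporal_iou([0.0, 1.0], [2.0, 3.0])  # no overlap -> 0.0
```

Candidate regions whose IoU with a ground-truth segment exceeds a threshold (commonly 0.5) are treated as positives for the classification stage; the rest serve as background.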
In real-world applications, "what you saw" during training is often not "what you get" during deployment: the distribution and even the type and dimensionality of features can change from one dataset to the next. In this paper, we address the problem of visual domain adaptation for transferring object models from one dataset or visual domain to another. We introduce ARC-t, a flexible model for supervised learning of non-linear transformations between domains. Our method is based on a novel theoretical result demonstrating that such transformations can be learned in kernel space. Unlike existing...
We propose an approach for unsupervised adaptation of object detectors from label-rich to label-poor domains which can significantly reduce the annotation costs associated with detection. Recently, approaches that align distributions of source and target images using an adversarial loss have been proven effective for adapting object classifiers. However, for detection, fully matching the entire distributions of source and target images to each other at the global image level may fail, as domains could have distinct scene layouts and different combinations of objects. On the other hand, strong...
We present the 2017 Visual Domain Adaptation (VisDA) dataset and challenge, a large-scale testbed for unsupervised domain adaptation across visual domains. Unsupervised domain adaptation aims to solve the real-world problem of domain shift, where machine learning models trained on one domain must be transferred and adapted to a novel visual domain without additional supervision. The VisDA2017 challenge is focused on the simulation-to-reality shift and has two associated tasks: image classification and image segmentation. The goal in both tracks is to first train a model on simulated,...
In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object. Natural language object retrieval differs from text-based image retrieval as it involves spatial information about objects within the scene and global scene context. To address this issue, we propose a novel Spatial Context Recurrent ConvNet (SCRC) model as a scoring function on candidate boxes for object retrieval, integrating spatial configurations and scene-level contextual information into the network. Our model processes query text, local image descriptors, and global context features through...
Contemporary domain adaptation methods are very effective at aligning feature distributions of source and target domains without any target supervision. However, we show that these techniques perform poorly when even a few labeled examples are available in the target domain. To address this semi-supervised domain adaptation (SSDA) setting, we propose a novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model. Our base model consists of a feature encoding network, followed by a classification layer that computes the features'...
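The quantity being minimaxed in the approach above is the entropy of the classifier's predictions on unlabeled target data: the classifier maximizes it (pulling prototypes toward target features) while the encoder minimizes it (clustering target features around prototypes). A NumPy sketch of that entropy term over a cosine-similarity classifier, with toy shapes and a temperature value chosen for illustration:

```python
import numpy as np

def prediction_entropy(features, prototypes, T=0.05):
    """Mean entropy of the softmax over cosine similarities between unlabeled
    target features and per-class weight prototypes."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ w.T / T
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
H = prediction_entropy(rng.standard_normal((8, 16)),     # 8 target features
                       rng.standard_normal((4, 16)))     # 4 class prototypes
```

In training, the sign of this loss is flipped between the two players (e.g. via a gradient-reversal layer), which is what makes the optimization adversarial.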
Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If...
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them both, and compare the results. The recently proposed Neural Module Network (NMN) architecture [3, 2] implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules...
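The balls-and-boxes example above can be made concrete with plain functions standing in for the neural modules; the real modules are small learned networks operating on attention maps, but the composition pattern is the same. This is a toy sketch, with the symbolic grid scene and module names as illustrative assumptions:

```python
import numpy as np

# toy "modules" over a symbolic grid scene; real NMN modules are small
# neural networks that produce and consume attention maps
def find(scene, concept):
    """Attend to the grid cells containing the concept."""
    return (scene == concept).astype(float)

def count(attn):
    """Reduce an attention map to a number."""
    return int(attn.sum())

def compare(n1, n2):
    return "yes" if n1 == n2 else "no"

# "is there an equal number of balls and boxes?"
# -> compare(count(find(balls)), count(find(boxes))), assembled from the parse
scene = np.array([["ball", "box", ".", "ball"],
                  [".", "box", "ball", "."],
                  ["box", ".", ".", "."]])
answer = compare(count(find(scene, "ball")), count(find(scene, "box")))
```

The key point is that the network layout (`compare(count(find(.)), count(find(.)))`) is derived per question from its linguistic structure, rather than being fixed in advance.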
Recently, a number of grasp detection methods have been proposed that can be used to localize robotic grasp configurations directly from sensor data without estimating object pose. The underlying idea is to treat grasp perception analogously to object detection in computer vision. These methods take as input a noisy and partially occluded RGBD image or point cloud and produce as output pose estimates of viable grasps, without assuming a known CAD model of the object. Although these methods generalize grasp knowledge to new objects well, they have not yet demonstrated reliable...
Deep neural networks are being used increasingly to automate data analysis and decision making, yet their decision-making process is largely unclear and difficult to explain to the end users. In this paper, we address the problem of Explainable AI for deep networks that take images as input and output a class probability. We propose an approach called RISE that generates an importance map indicating how salient each pixel is for the model's prediction. In contrast to white-box approaches that estimate importance using gradients or other internal network...
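The black-box estimator above needs only forward passes: probe the model with randomly masked copies of the image and average the masks weighted by the model's output score. A minimal NumPy sketch of that idea, using a toy scoring function in place of a real network (mask count, resolution, and the region-of-interest "model" are illustrative assumptions; the paper also smooths and upsamples low-resolution masks, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def rise_saliency(image, score_fn, n_masks=500, p=0.5):
    """Importance map: average of random binary masks, each weighted by the
    model's score on the correspondingly masked image (black-box access only)."""
    H, W = image.shape
    sal = np.zeros((H, W))
    for _ in range(n_masks):
        mask = (rng.random((H, W)) < p).astype(float)
        sal += score_fn(image * mask) * mask
    return sal / n_masks

# toy "model": the score is the mean intensity inside a fixed region of interest,
# so only pixels in that region should look important
score_fn = lambda img: img[2:4, 2:4].mean()
image = np.ones((8, 8))
sal = rise_saliency(image, score_fn)
```

Pixels whose presence in a mask correlates with high scores accumulate more weight, so `sal[2:4, 2:4]` ends up larger than the map elsewhere, exactly the importance signal RISE extracts without any access to gradients.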