Kate Saenko

ORCID: 0000-0002-7564-7218
Research Areas
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Human Pose and Action Recognition
  • Advanced Image and Video Retrieval Techniques
  • Advanced Neural Network Applications
  • Topic Modeling
  • Generative Adversarial Networks and Image Synthesis
  • Adversarial Robustness in Machine Learning
  • Natural Language Processing Techniques
  • Cancer-related molecular mechanisms research
  • Reinforcement Learning in Robotics
  • Video Analysis and Summarization
  • Explainable Artificial Intelligence (XAI)
  • Anomaly Detection Techniques and Applications
  • Video Surveillance and Tracking Methods
  • Advanced Vision and Imaging
  • COVID-19 diagnosis using AI
  • Image Retrieval and Classification Techniques
  • Speech and Audio Processing
  • Robotics and Sensor-Based Localization
  • Robot Manipulation and Learning
  • Remote-Sensing Image Classification
  • Music and Audio Processing
  • Speech Recognition and Synthesis
  • Speech and dialogue systems

IBM (United States)
2019-2024

Boston University
2015-2023

Massachusetts Institute of Technology
2004-2020

Adobe Systems (United States)
2020

Max Planck Society
2011-2020

Stanford University
2020

University of Illinois Urbana-Champaign
2019

University of Massachusetts Lowell
2013-2017

The University of Texas at Austin
2016

Laboratoire d'Informatique de Paris-Nord
2016

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning that is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed...

10.1109/cvpr.2015.7298878 article EN 2015-06-01
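
The recipe sketched in this abstract, per-frame CNN features integrated by an LSTM, can be compressed into a few lines. Below is a minimal, hypothetical PyTorch rendering; the tiny convolutional stack, layer sizes, and class count are illustrative stand-ins, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RecurrentConvNet(nn.Module):
    """Hypothetical mini 'temporally deep' video classifier: CNN per frame, LSTM over time."""
    def __init__(self, num_classes: int, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        # Per-frame convolutional feature extractor (stands in for a deep CNN).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # LSTM integrates frame features over time.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        # Average per-step predictions across time for clip-level recognition.
        return self.head(out).mean(dim=1)

model = RecurrentConvNet(num_classes=10)
logits = model(torch.randn(2, 16, 3, 64, 64))  # two 16-frame clips
print(logits.shape)  # torch.Size([2, 10])
```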

Adversarial learning methods are a promising approach to training robust deep networks, and can generate complex samples across diverse domains. They can also improve recognition despite the presence of domain shift or dataset bias: recent adversarial approaches to unsupervised domain adaptation reduce the difference between the training and test domain distributions and thus improve generalization performance. However, while generative adversarial networks (GANs) show compelling visualizations, they are not optimal on discriminative tasks and can be limited to smaller...

10.1109/cvpr.2017.316 article EN 2017-07-01
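
A minimal sketch of the adversarial alignment idea the abstract describes: a domain discriminator learns to separate source from target features while the target encoder learns to fool it. All architectures and hyperparameters below are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

feat_dim = 64
source_encoder = nn.Sequential(nn.Linear(100, feat_dim), nn.ReLU())  # pretrained, kept frozen
target_encoder = nn.Sequential(nn.Linear(100, feat_dim), nn.ReLU())
discriminator = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
opt_t = torch.optim.Adam(target_encoder.parameters(), lr=1e-4)

for step in range(100):
    xs, xt = torch.randn(32, 100), torch.randn(32, 100)  # source / target batches
    with torch.no_grad():
        fs = source_encoder(xs)                          # fixed source features
    ft = target_encoder(xt)

    # 1) Discriminator: label source features 1, target features 0.
    d_loss = bce(discriminator(fs), torch.ones(32, 1)) + \
             bce(discriminator(ft.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Target encoder: fool the discriminator (inverted labels).
    g_loss = bce(discriminator(ft), torch.ones(32, 1))
    opt_t.zero_grad(); g_loss.backward(); opt_t.step()
```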

Recent reports suggest that a generic supervised deep CNN model trained on a large-scale dataset reduces, but does not remove, dataset bias on a standard benchmark. Fine-tuning deep models in a new domain can require a significant amount of data, which for many applications is simply not available. We propose a new CNN architecture which introduces an adaptation layer and an additional domain confusion loss, to learn a representation that is both semantically meaningful and domain invariant. We additionally show that a domain confusion metric can be used for model selection to determine the dimension of an adaptation layer and the best...

10.48550/arxiv.1412.3474 preprint EN other-oa arXiv (Cornell University) 2014-01-01
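
The abstract's two ingredients, an adaptation layer and a domain confusion loss, can be sketched compactly. The version below uses a linear MMD penalty (distance between batch feature means) as the confusion term; the networks, loss weight, and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU())
adapt = nn.Linear(64, 32)          # adaptation layer; its width is what model selection picks
classifier = nn.Linear(32, 10)
opt = torch.optim.Adam(list(encoder.parameters()) + list(adapt.parameters())
                       + list(classifier.parameters()), lr=1e-3)

def mmd_linear(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Squared distance between the mean embeddings of two batches."""
    return (a.mean(0) - b.mean(0)).pow(2).sum()

xs, ys = torch.randn(32, 100), torch.randint(0, 10, (32,))   # labeled source batch
xt = torch.randn(32, 100)                                    # unlabeled target batch

zs, zt = adapt(encoder(xs)), adapt(encoder(xt))
# Classification loss on source + domain confusion penalty between domains.
loss = nn.functional.cross_entropy(classifier(zs), ys) + 0.25 * mmd_linear(zs, zt)
opt.zero_grad(); loss.backward(); opt.step()
```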

Domain adaptation is critical for success in new, unseen environments. Adversarial adaptation models applied in feature spaces discover domain invariant representations, but are difficult to visualize and sometimes fail to capture pixel-level and low-level domain shifts. Recent work has shown that generative adversarial networks combined with cycle-consistency constraints are surprisingly effective at mapping images between domains, even without the use of aligned image pairs. We propose a novel discriminatively-trained...

10.48550/arxiv.1711.03213 preprint EN other-oa arXiv (Cornell University) 2017-01-01
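
A toy illustration of the cycle-consistency constraint mentioned above: mapping an image to the other domain and back should reconstruct it. The single-layer "generators" are placeholders; the full method combines this term with adversarial and semantic losses not shown here.

```python
import torch
import torch.nn as nn

G_st = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # toy source -> target generator
G_ts = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # toy target -> source generator

def cycle_loss(xs: torch.Tensor, xt: torch.Tensor) -> torch.Tensor:
    # || G_ts(G_st(xs)) - xs ||_1 + || G_st(G_ts(xt)) - xt ||_1
    return (G_ts(G_st(xs)) - xs).abs().mean() + (G_st(G_ts(xt)) - xt).abs().mean()

xs, xt = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
print(cycle_loss(xs, xt))  # scalar, minimized jointly with GAN and task losses
```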

Unlike human learning, machine learning often fails to handle changes between training (source) and test (target) input distributions. Such domain shifts, common in practical scenarios, severely damage the performance of conventional machine learning methods. Supervised domain adaptation methods have been proposed for the case when the target data have labels, including some that perform very well despite being "frustratingly easy" to implement. However, in practice, the target domain is often unlabeled, requiring unsupervised adaptation. We propose a...

10.1609/aaai.v30i1.10306 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2016-03-02
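
The method this paper proposes, CORAL, lives up to "frustratingly easy": whiten the source features, then re-color them with the target covariance. Below is a NumPy/SciPy sketch on illustrative data; the identity regularizer follows the paper, everything else is a stand-in.

```python
import numpy as np
from scipy import linalg

def coral(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Return source features whose second-order statistics match the target's."""
    cs = np.cov(source, rowvar=False) + np.eye(source.shape[1])  # regularized covariances
    ct = np.cov(target, rowvar=False) + np.eye(target.shape[1])
    whiten = linalg.fractional_matrix_power(cs, -0.5)   # decorrelate source features
    recolor = linalg.fractional_matrix_power(ct, 0.5)   # apply target correlations
    return source @ whiten @ recolor

rng = np.random.default_rng(0)
xs = rng.normal(size=(200, 16))                               # source features
xt = rng.normal(size=(200, 16)) @ rng.normal(size=(16, 16))   # covariance-shifted target
xs_adapted = coral(xs, xt)   # then train any off-the-shelf classifier on xs_adapted
```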

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential...

10.1109/tpami.2016.2599174 article EN publisher-specific-oa IEEE Transactions on Pattern Analysis and Machine Intelligence 2016-09-01

Real-world videos often have complex dynamics; methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on...

10.1109/iccv.2015.515 article EN 2015-12-01
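
A greatly simplified encoder-decoder captioner in the spirit of the abstract: one LSTM reads the frame-feature sequence, and a second LSTM, initialized with its final state, emits a word sequence. The paper's actual model shares a single two-layer LSTM stack for both phases; the dimensions and vocabulary below are illustrative.

```python
import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    def __init__(self, vocab: int = 1000, feat: int = 128, hidden: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(feat, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frames: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_frames, feat); words: (batch, n_words) token ids
        _, state = self.encoder(frames)                      # encode the frame sequence
        hidden, _ = self.decoder(self.embed(words), state)   # decode conditioned on it
        return self.out(hidden)                              # (batch, n_words, vocab) logits

model = Seq2SeqCaptioner()
logits = model(torch.randn(2, 20, 128), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```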

Conventional unsupervised domain adaptation (UDA) assumes that training data are sampled from a single domain. This neglects the more practical scenario where training data are collected from multiple sources, requiring multi-source domain adaptation. We make three major contributions towards addressing this problem. First, we collect and annotate by far the largest UDA dataset, called DomainNet, which contains six domains and about 0.6 million images distributed among 345 categories, addressing the gap in data availability for multi-source UDA research. Second,...

10.1109/iccv.2019.00149 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential...

10.21236/ada623249 preprint EN 2014-11-17

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015.

10.3115/v1/n15-1173 article EN Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2015-01-01

We address the problem of activity detection in continuous, untrimmed video streams. This is a difficult task that requires extracting meaningful spatio-temporal features to capture activities and accurately localizing the start and end times of each activity. We introduce a new model, Region Convolutional 3D Network (R-C3D), which encodes the video streams using a three-dimensional fully convolutional network, then generates candidate temporal regions containing activities, and finally classifies selected regions into specific activities....

10.1109/iccv.2017.617 article EN 2017-10-01
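
A miniature of the pipeline the abstract outlines: encode the whole clip once with 3D convolutions, then score candidate temporal segments at each position. Real R-C3D adds segment boundary regression and a second-stage activity classifier; every size below is an illustrative assumption.

```python
import torch
import torch.nn as nn

class TemporalProposals(nn.Module):
    def __init__(self, n_anchors: int = 4):
        super().__init__()
        self.c3d = nn.Sequential(                       # shared 3D conv encoder
            nn.Conv3d(3, 16, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=(2, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),         # collapse space, keep time
        )
        self.score = nn.Conv1d(32, n_anchors, 1)        # "activity-ness" per anchor, per step

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, time, height, width)
        f = self.c3d(video).squeeze(-1).squeeze(-1)     # (batch, 32, time/2)
        return self.score(f)                            # (batch, anchors, time/2)

scores = TemporalProposals()(torch.randn(1, 3, 32, 64, 64))
print(scores.shape)  # torch.Size([1, 4, 16]) -- one score per anchor per time step
```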

In real-world applications, "what you saw" during training is often not "what you get" at deployment: the distribution and even the type and dimensionality of features can change from one dataset to the next. In this paper, we address the problem of visual domain adaptation for transferring object models from one dataset or visual domain to another. We introduce ARC-t, a flexible model for supervised learning of non-linear transformations between domains. Our method is based on a novel theoretical result demonstrating that such transformations can be learned in kernel space. Unlike existing...

10.1109/cvpr.2011.5995702 article EN 2011-06-01

We propose an approach for unsupervised adaptation of object detectors from label-rich to label-poor domains which can significantly reduce annotation costs associated with detection. Recently, approaches that align distributions of source and target images using an adversarial loss have been proven effective for adapting classifiers. However, for detection, fully matching the entire distributions of source and target images to each other at the global image level may fail, as domains could have distinct scene layouts and different combinations of objects. On the other hand, strong...

10.1109/cvpr.2019.00712 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

We present the 2017 Visual Domain Adaptation (VisDA) dataset and challenge, a large-scale testbed for unsupervised domain adaptation across visual domains. Unsupervised domain adaptation aims to solve the real-world problem of domain shift, where machine learning models trained on one domain must be transferred and adapted to a novel visual domain without additional supervision. The VisDA2017 challenge is focused on the simulation-to-reality shift and has two associated tasks: image classification and image segmentation. The goal in both tracks is to first train a model on simulated,...

10.48550/arxiv.1710.06924 preprint EN other-oa arXiv (Cornell University) 2017-01-01

In this paper, we address the task of natural language object retrieval: localizing a target object within a given image based on a natural language query of the object. Natural language object retrieval differs from text-based image retrieval as it involves spatial information about objects within the scene and global scene context. To address this issue, we propose a novel Spatial Context Recurrent ConvNet (SCRC) model as a scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information into the network. Our model processes query text, local image descriptors, and global context features through...

10.1109/cvpr.2016.493 preprint EN 2016-06-01

Contemporary domain adaptation methods are very effective at aligning feature distributions of source and target domains without any target supervision. However, we show that these techniques perform poorly when even a few labeled examples are available in the target domain. To address this semi-supervised domain adaptation (SSDA) setting, we propose a novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model. Our base model consists of a feature encoding network, followed by a classification layer that computes the features'...

10.1109/iccv.2019.00814 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01
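
A sketch of the minimax entropy step on unlabeled target data, under assumed sizes: a cosine-similarity classifier plays against the feature encoder through a gradient-reversal layer, so a single update raises target entropy with respect to the class prototypes while lowering it with respect to the encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the way back."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -g

encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU())
prototypes = nn.Linear(64, 10, bias=False)   # cosine classifier: one weight vector per class
opt = torch.optim.SGD(list(encoder.parameters()) + list(prototypes.parameters()), lr=0.01)

xt = torch.randn(32, 100)                    # unlabeled target batch
feat = F.normalize(encoder(xt), dim=1)       # unit-norm features
logits = prototypes(GradReverse.apply(feat)) / 0.05        # temperature-scaled similarity
p = F.softmax(logits, dim=1)
neg_entropy = (p * p.clamp_min(1e-8).log()).sum(1).mean()  # = -H(p)

# Minimizing -H pushes the prototypes to raise target entropy, while the reversed
# gradient pushes the encoder to lower it: the minimax game.
opt.zero_grad(); neg_entropy.backward(); opt.step()
```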

Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If...

10.1109/iccv.2013.337 article EN 2013-12-01

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer “is there an equal number of balls and boxes?” we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture [3, 2] implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules...

10.1109/iccv.2017.93 article EN 2017-10-01
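
The abstract's running example can be made concrete with a toy, purely symbolic version of module composition. Real NMNs assemble differentiable attention modules according to a layout predicted from the question; the "scene" and modules here are hypothetical stand-ins.

```python
scene = ["ball", "ball", "box", "ball", "box"]   # toy stand-in for an image

def find(category):            # module: locate instances of a category
    return [obj for obj in scene if obj == category]

def count(instances):          # module: how many were found
    return len(instances)

def compare_equal(a, b):       # module: compare two counts
    return a == b

# Layout for "is there an equal number of balls and boxes?":
# compare_equal(count(find(ball)), count(find(box)))
answer = compare_equal(count(find("ball")), count(find("box")))
print(answer)  # False -- three balls, two boxes
```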

Recently, a number of grasp detection methods have been proposed that can be used to localize robotic grasp configurations directly from sensor data without estimating object pose. The underlying idea is to treat grasp perception analogously to object detection in computer vision. These methods take as input a noisy and partially occluded RGBD image or point cloud and produce as output pose estimates of viable grasps, without assuming a known CAD model of the object. Although these methods generalize grasp knowledge to new objects well, they have not yet been demonstrated to be reliable...

10.1177/0278364917735594 article EN The International Journal of Robotics Research 2017-10-30

Deep neural networks are being used increasingly to automate data analysis and decision making, yet their decision-making process is largely unclear and difficult to explain to the end users. In this paper, we address the problem of Explainable AI for deep neural networks that take images as input and output a class probability. We propose an approach called RISE that generates an importance map indicating how salient each pixel is for the model's prediction. In contrast to white-box approaches that estimate pixel importance using gradients or other internal network...

10.48550/arxiv.1806.07421 preprint EN other-oa arXiv (Cornell University) 2018-01-01
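
Because RISE is black-box, its core loop is short: probe the model with randomly masked copies of the input and average the masks weighted by the model's score for the class of interest. Below is a compact sketch under an assumed mask count and granularity (the paper also randomly shifts the upsampled masks, omitted here).

```python
import torch

def rise_saliency(model, image, n_masks=500, keep_prob=0.5, cell=7):
    """image: (3, H, W); model maps a batch of images to class probabilities."""
    _, h, w = image.shape
    # Low-resolution Bernoulli grids, upsampled into smooth soft masks.
    grid = (torch.rand(n_masks, 1, cell, cell) < keep_prob).float()
    masks = torch.nn.functional.interpolate(
        grid, size=(h, w), mode="bilinear", align_corners=False)   # (N, 1, H, W)
    with torch.no_grad():
        probs = model(image.unsqueeze(0) * masks)                  # (N, num_classes)
    cls = probs[:, probs.mean(0).argmax()]                         # follow the top class
    # Pixels whose masking preserves the score come out salient.
    saliency = (cls.view(-1, 1, 1) * masks.squeeze(1)).mean(0)
    return saliency / keep_prob                                    # (H, W) importance map

# Usage with any classifier that outputs probabilities:
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10),
                            torch.nn.Softmax(dim=1))
heatmap = rise_saliency(model, torch.rand(3, 32, 32))
print(heatmap.shape)  # torch.Size([32, 32])
```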