- Multimodal Machine Learning Applications
- Natural Language Processing Techniques
- Human Pose and Action Recognition
- Topic Modeling
- Video Analysis and Summarization
- Advanced Image and Video Retrieval Techniques
- AI-based Problem Solving and Planning
- Reinforcement Learning in Robotics
- Anomaly Detection Techniques and Applications
- Neural Networks and Applications
- Machine Learning and Algorithms
- Speech and Dialogue Systems
- Multi-Agent Systems and Negotiation
- Neural Dynamics and Brain Function
- Robotic Path Planning Algorithms
- Action Observation and Synchronization
- Neurobiology of Language and Bilingualism
- Domain Adaptation and Few-Shot Learning
- Music and Audio Processing
- Human Motion and Animation
- Hand Gesture Recognition Systems
- Subtitles and Audiovisual Media
- Text Readability and Simplification
- Visual Attention and Saliency Detection
- Advanced Memory and Neural Computing
Companhia Brasileira de Metalurgia e Mineração (Brazil)
2024
Massachusetts Institute of Technology
2013-2023
IIT@MIT
2014-2022
Technion – Israel Institute of Technology
2022
Cornell University
2022
Vassar College
2019
Purdue University West Lafayette
2010-2018
Policijska akademija
2014
Alexandru Ioan Cuza University
2014
Police Academy
2014
Recognizing human activities in partially observed videos is a challenging problem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from an unfinished activity stream, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time, yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities in partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered segments,...
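One way to realize the segment-based scoring this abstract describes is a Viterbi-style dynamic program in which unobserved frames advance time without contributing evidence. This is a minimal sketch; the function names and the uniform treatment of gap frames are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def score_activity(frame_loglik, observed):
    """DP assigning frames to K ordered segments; frames inside the
    temporal gap contribute no evidence but still advance time.
    frame_loglik: (T, K) array of log p(frame_t | segment_k)
    observed:     (T,) boolean mask, False inside the gap (assumption)"""
    T, K = frame_loglik.shape
    ll = np.where(observed[:, None], frame_loglik, 0.0)  # gap frames are uninformative
    dp = np.full((T, K), -np.inf)
    dp[0, 0] = ll[0, 0]
    for t in range(1, T):
        for k in range(K):
            prev = dp[t - 1, k] if k == 0 else max(dp[t - 1, k], dp[t - 1, k - 1])
            dp[t, k] = ll[t, k] + prev   # stay in segment k or advance from k-1
    return dp[-1, -1]

# Recognition: evaluate score_activity under each activity class's segment
# models and pick the argmax over classes.
```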
We generalize the notion of measuring social biases in word embeddings to visually grounded embeddings. Biases are present embeddings, and indeed seem be equally or more significant than for ungrounded This is despite fact that vision language can suffer from different biases, which one might hope could attenuate both. Multiple ways exist metrics bias this new setting. introduce space generalizations (Grounded-WEAT Grounded-SEAT) demonstrate three answer yet important questions about how...
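For reference, the ungrounded WEAT effect size that these grounded metrics generalize can be computed as below; the grounded variants swap visually grounded embeddings in for the word vectors. A sketch with assumed variable names (X, Y are target word sets, A, B attribute word sets):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean similarity to attribute set A minus to attribute set B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # Standard WEAT effect size (Cohen's d over association scores)
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)
```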
We present an approach to simultaneously reasoning about a video clip and an entire natural-language sentence. The compositional nature of language is exploited to construct models which represent the meanings of sentences composed out of the meanings of the words in those sentences, mediated by a grammar that encodes the predicate-argument relations. We demonstrate that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial...
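A toy rendering of this compositional scoring: each word contributes a predicate over the object tracks filling its grammatical roles, and the sentence score is their conjunction (a sum of log-scores). The stub detectors and the distance-based `approached` predicate are invented for illustration:

```python
import numpy as np

def sentence_score(tracks, role_fill, word_predicates):
    # Conjunction of per-word predicates applied to the tracks that
    # fill each word's argument roles.
    return sum(pred(*[tracks[role_fill[r]] for r in roles])
               for pred, roles in word_predicates)

# Toy example for "the person approached the chair".
tracks = {"t0": np.array([[0., 0.], [1., 0.], [2., 0.]]),   # moving track
          "t1": np.array([[3., 0.], [3., 0.], [3., 0.]])}   # static track
person = (lambda a: 0.0, ("agent",))                         # stub detector score
chair = (lambda p: 0.0, ("patient",))
approached = (lambda a, p: float(np.linalg.norm(a[0] - p[0])
                                 - np.linalg.norm(a[-1] - p[-1])),
              ("agent", "patient"))
score = sentence_score(tracks, {"agent": "t0", "patient": "t1"},
                       [person, chair, approached])
print(score)   # higher when the agent ends closer to the patient
```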
A robot’s ability to understand or ground natural language instructions is fundamentally tied to its knowledge about the surrounding world. We present an approach to grounding natural language utterances in the context of factual information gathered through natural-language interactions and past visual observations. A probabilistic model estimates, from a natural language utterance, the objects, relations, and actions that the utterance refers to and the objectives for future robotic actions it implies, and generates a plan to execute those actions while updating its state...
We present a study on two key characteristics of human syntactic annotations: anchoring and agreement. Anchoring is a well known cognitive bias in human decision making, where judgments are drawn towards pre-existing values. We study the influence of anchoring on the standard approach to the creation of syntactic resources, where annotations are obtained via human editing of tagger and parser output. Our experiments demonstrate a clear anchoring effect and reveal unwanted consequences, including overestimation of parsing performance and lower annotation quality in comparison with human-based annotations...
We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations (prepositions), in the form of whole-sentence descriptions mediated...
Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations for each sentence. We address this task by...
We demonstrate how a sampling-based robotic planner can be augmented to learn to understand a sequence of natural language commands in a continuous configuration space to move and manipulate objects. Our approach combines a deep network structured according to the parse of a complex command that includes objects, verbs, spatial relations, and attributes, with a sampling-based planner, the RRT. A recurrent hierarchical deep network controls how the planner explores the environment, determines when a planned path is likely to achieve a goal, and estimates the confidence of each move to trade off...
The ability to perceive and reason about social interactions in the context of physical environments is core to human social intelligence and human-machine cooperation. However, no prior dataset or benchmark has systematically evaluated physically grounded perception of complex social interactions that go beyond short actions, such as high-fiving, or simple group activities, such as gathering. In this work, we create a dataset of physically-grounded abstract social events, PHASE, that resemble a wide range of real-life social interactions by including concepts such as helping another agent...
We create a reusable Transformer, BrainBERT, for intracranial recordings, bringing modern representation learning approaches to neuroscience. Much like in NLP and speech recognition, this Transformer enables classifying complex concepts, i.e., decoding neural data, with higher accuracy and much less data by being pretrained in an unsupervised manner on a large corpus of unannotated neural recordings. Our approach generalizes to new subjects with electrodes in new positions and to unrelated tasks, showing that the representations...
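A minimal sketch of this style of pretraining, assuming masked-spectrogram reconstruction with a Transformer encoder; the sizes, masking rate, and loss here are illustrative guesses rather than BrainBERT's exact recipe:

```python
import torch
import torch.nn as nn

class MaskedSpectrogramModel(nn.Module):
    """Mask random timesteps of an electrode's spectrogram and train a
    Transformer encoder to reconstruct them (BERT-style objective)."""
    def __init__(self, n_freq=40, d=256, layers=6):
        super().__init__()
        self.embed = nn.Linear(n_freq, d)
        enc = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.head = nn.Linear(d, n_freq)

    def forward(self, spec):                          # spec: (B, T, n_freq)
        masked = spec.clone()
        mask = torch.rand(spec.shape[:2]) < 0.15      # hide 15% of timesteps
        masked[mask] = 0.0
        rec = self.head(self.encoder(self.embed(masked)))
        return ((rec - spec) ** 2)[mask].mean()       # loss only on masked bins

# After pretraining on unannotated recordings, the frozen encoder's features
# feed a small linear probe for decoding tasks on new subjects.
```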
We present an integrated vision and robotic system that plays, and learns to play, simple physically-instantiated board games that are variants of TIC TAC TOE and HEXA-PAWN. We employ novel custom hardware designed specifically for this learning task. The game rules can be parametrically specified. Two independent computational agents alternate playing the two opponents with the shared hardware, using pre-specified rule sets. A third agent, sharing the same hardware, learns the game rules solely by observing the physical play, without access to the rule set, using inductive...
We demonstrate how a sequence model and a sampling-based planner can influence each other to produce efficient plans and how such a model can automatically learn to take advantage of observations of the environment. Sampling-based planners such as RRT generally know nothing about their environments even if they have traversed similar spaces many times. A sequence model, such as an HMM or an LSTM, guides the search for good paths. The resulting model, called DeRRT*, observes the state of the planner and the local environment to bias the next move and next state. The neural-network-based models avoid manual...
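The core loop might look like the following sketch: with some mixing probability the next sample comes from a learned proposal instead of the uniform sampler, which keeps the classic RRT behavior as a fallback. Here `propose` stands in for the paper's sequence model; all names and constants are assumptions:

```python
import random
import numpy as np

def biased_rrt(start, goal, collision_free, propose,
               n_iters=2000, step=0.1, mix=0.5):
    """RRT whose sampling is biased by a learned proposal distribution."""
    tree = {tuple(start): None}                       # child -> parent
    for _ in range(n_iters):
        if random.random() < mix:
            sample = propose(tree)                    # model-biased next move
        else:
            sample = np.random.uniform(0.0, 1.0, size=len(start))
        nearest = min(tree, key=lambda v: np.linalg.norm(np.array(v) - sample))
        direction = sample - np.array(nearest)
        new = np.array(nearest) + step * direction / (np.linalg.norm(direction) + 1e-9)
        if collision_free(np.array(nearest), new):
            tree[tuple(new)] = nearest
            if np.linalg.norm(new - goal) < step:     # close enough to the goal
                return tree, tuple(new)
    return tree, None
```

Keeping a nonzero share of uniform samples is what preserves the completeness guarantees of the underlying planner while the model steers exploration.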
We develop a semantic parser that is trained in a grounded setting using pairs of videos captioned with sentences. This setting is both data-efficient, requiring little annotation, and similar to the experience of children, who observe their environment and listen to speakers. The semantic parser recovers the meaning of English sentences despite not having access to any annotated sentences. It does so despite the ambiguity inherent in vision, where a sentence may refer to any combination of objects, object properties, relations or actions taken by any agent in a video. For this task, we...
Language allows humans to build mental models that interpret what is happening around them, resulting in more accurate long-term predictions. We present a novel trajectory prediction model that uses linguistic intermediate representations to forecast trajectories, and is trained using trajectory samples with partially-annotated captions. The model learns the meaning of each of the words without direct per-word supervision. At inference time, it generates a linguistic description of trajectories which captures maneuvers and interactions over an...
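One plausible shape for such a model is a soft linguistic bottleneck: the encoder emits a distribution over a small maneuver vocabulary and the decoder conditions on it, so word meanings emerge end-to-end without per-word labels. A hypothetical PyTorch sketch, not the paper's architecture; the vocabulary and sizes are invented:

```python
import torch
import torch.nn as nn

class LingTraj(nn.Module):
    """Trajectory forecasting through a soft, interpretable word layer."""
    VOCAB = ["left", "right", "straight", "stop", "yield"]   # assumed vocabulary

    def __init__(self, hid=64, horizon=12):
        super().__init__()
        self.enc = nn.GRU(2, hid, batch_first=True)
        self.to_words = nn.Linear(hid, len(self.VOCAB))      # linguistic bottleneck
        self.dec = nn.Linear(len(self.VOCAB) + hid, horizon * 2)
        self.horizon = horizon

    def forward(self, past_xy):                              # (B, T, 2) past positions
        h, _ = self.enc(past_xy)
        last = h[:, -1]
        words = torch.softmax(self.to_words(last), dim=-1)   # soft description
        out = self.dec(torch.cat([words, last], dim=-1))
        return out.view(-1, self.horizon, 2), words          # future xy + words
```

At inference, reading off the argmax of `words` yields a rough linguistic description of the predicted maneuver.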
We present an approach to searching large video corpora for clips which depict a natural-language query in the form of a sentence. Compositional semantics is used to encode subtle meaning differences lost in other approaches, such as the difference between two sentences which have identical words but entirely different meanings: The person rode the horse versus The horse rode the person. Given a sentential query and a natural-language parser, we produce a score indicating how well each clip in the corpus depicts that sentence and return a ranked list of clips. Two fundamental...
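The word-order example suggests why role assignment matters: a compositional scorer maximizes the conjunction of word predicates over assignments of tracks to argument roles, so swapping agent and patient changes the score even though the words are identical. A hedged sketch with invented helper names:

```python
from itertools import permutations

def clip_score(tracks, word_predicates, roles=("agent", "patient")):
    # Maximize the summed predicate scores over all ways of assigning this
    # clip's object tracks to the sentence's argument roles.
    best = float("-inf")
    for chosen in permutations(tracks, len(roles)):
        fill = dict(zip(roles, chosen))
        best = max(best, sum(pred(fill) for pred in word_predicates))
    return best

def retrieve(corpus, word_predicates, k=10):
    # corpus: {clip_name: list_of_tracks}; returns the top-k clip names.
    ranked = sorted(corpus, key=lambda c: clip_score(corpus[c], word_predicates),
                    reverse=True)
    return ranked[:k]
```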
We demonstrate a reinforcement learning agent which uses a compositional recurrent neural network that takes as input an LTL formula and determines satisfying actions. The input LTL formulas have never been seen before, yet the network performs zero-shot generalization to satisfy them. This is a novel form of multi-task learning for RL agents, where the agents learn from one diverse set of tasks and generalize to a new set of diverse tasks. The compositional formulation of the network enables this capacity to generalize. We demonstrate this ability in two domains. In a symbolic domain, the agent finds a sequence of letters...
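A sketch of how a formula's parse tree can be mirrored by a network built from reusable per-operator modules, which is what makes assembling a network for a never-seen formula possible at test time. The module design and sizes are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

OPS = ("and", "or", "not", "next", "until", "eventually", "always")

class FormulaNet(nn.Module):
    """Encode an LTL formula by recursing over its parse tree with one
    shared module per operator; leaves are proposition embeddings."""
    def __init__(self, tokens, d=32):
        super().__init__()
        self.tok = {t: i for i, t in enumerate(tokens)}
        self.leaf = nn.Embedding(len(tokens), d)
        self.op = nn.ModuleDict({o: nn.Linear(2 * d, d) for o in OPS})
        self.d = d

    def encode(self, tree):          # tree: token string or (op, *children)
        if isinstance(tree, str):
            return self.leaf(torch.tensor(self.tok[tree]))
        op, *kids = tree
        h = [self.encode(k) for k in kids]
        if len(h) == 1:              # unary operators pad the second slot
            h.append(torch.zeros(self.d))
        return torch.tanh(self.op[op](torch.cat(h)))

# e.g. FormulaNet(["a", "b"]).encode(("until", ("not", "b"), "a")) yields an
# embedding a policy network can condition on to choose satisfying actions.
```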
Humans are remarkably flexible when understanding new sentences that include combinations of concepts they have never encountered before. Recent work has shown that while deep networks can mimic some human language abilities when presented with novel sentences, systematic variation uncovers the limitations in the language-understanding abilities of networks. We demonstrate that these limitations can be overcome by addressing the generalization challenges in the gSCAN dataset, which explicitly measures how well an agent is able to interpret...
We present the Brain Treebank, a large-scale dataset of electrophysiological neural responses, recorded from intracranial probes while 10 subjects watched one or more Hollywood movies. Subjects watched on average 2.6 movies, for an average viewing time of 4.3 hours and a total of 43 hours. The audio track of each movie was transcribed with manual corrections. Word onsets were manually annotated on spectrograms of each movie. Each transcript was automatically parsed and then manually corrected into the universal dependencies (UD) formalism, assigning...