David Harwath

ORCID: 0000-0003-0206-0253
Research Areas
  • Multimodal Machine Learning Applications
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • Speech and Audio Processing
  • Natural Language Processing Techniques
  • Video Analysis and Summarization
  • Speech and dialogue systems
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Human Pose and Action Recognition
  • Topic Modeling
  • Subtitles and Audiovisual Media
  • Phonetics and Phonology Research
  • Music Technology and Sound Studies
  • Digital Communication and Language
  • Image and Signal Denoising Methods
  • Hearing Loss and Rehabilitation
  • Language Development and Disorders
  • Time Series Analysis and Forecasting
  • Generative Adversarial Networks and Image Synthesis
  • Stuttering Research and Treatment
  • Robotics and Sensor-Based Localization
  • Advanced Data Compression Techniques
  • Functional Brain Connectivity Studies
  • Video Surveillance and Tracking Methods

The University of Texas at Austin
2021-2025

Massachusetts Institute of Technology
2012-2021

University of Illinois Urbana-Champaign
2010

Multi-modal learning from video data has seen increased attention recently as it allows training of semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and action localization. In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a fused representation in a joint multi-modal embedding space. We propose to train the...

10.1109/cvpr52688.2022.01939 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
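The abstract above describes a single transformer that fuses whatever modalities are available into one joint embedding. Below is a minimal, hypothetical sketch of that pattern in PyTorch: per-modality projections into a shared token space, learned modality embeddings, and one encoder that attends across all concatenated tokens. Dimensions, layer counts, and mean-pooling are illustrative assumptions, not the published architecture.

# Hypothetical sketch of a modality-agnostic fusion transformer (not the authors' code).
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        # One linear projection per modality maps raw features into a shared token space.
        self.proj = nn.ModuleDict({
            "video": nn.Linear(1024, dim),
            "audio": nn.Linear(128, dim),
            "text":  nn.Linear(300, dim),
        })
        # Learned modality embeddings tell the transformer which tokens came from where.
        self.mod_emb = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(1, 1, dim)) for m in ["video", "audio", "text"]
        })
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, inputs):
        # inputs: dict mapping any subset of {"video", "audio", "text"} to (B, T, feat) tensors.
        tokens = [self.proj[m](x) + self.mod_emb[m] for m, x in inputs.items()]
        fused = self.encoder(torch.cat(tokens, dim=1))  # attention exchanges information across modalities
        return fused.mean(dim=1)                        # pooled embedding in the joint space

model = FusionTransformer()
out = model({"video": torch.randn(2, 8, 1024), "audio": torch.randn(2, 16, 128)})
print(out.shape)  # torch.Size([2, 512])

Because every modality is reduced to generic tokens, the same encoder handles any subset of inputs, which is what makes the fusion modality agnostic in this sketch.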

We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations....

10.1109/icassp.2013.6639245 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2013-05-01

In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie them together with an embedding and alignment model that learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, augmented with 40,000 spoken captions collected via Amazon Mechanical Turk.

10.1109/asru.2015.7404800 article EN 2015-12-01
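The tie between the two networks in the paper above is an embedding space trained with a ranking objective over matched and mismatched image/caption pairs. The sketch below illustrates that training signal with toy stand-in encoders; the architectures, feature types, and margin are assumptions for illustration only.

# Illustrative only: two small encoders tied by a margin ranking loss over
# matched vs. mismatched image/speech pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, dim)

    def forward(self, img):  # img: (B, 3, H, W)
        return F.normalize(self.fc(self.conv(img).flatten(1)), dim=-1)

class SpeechEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(40, 64, 5, stride=2, padding=2), nn.ReLU(),
                                  nn.AdaptiveAvgPool1d(1))
        self.fc = nn.Linear(64, dim)

    def forward(self, spec):  # spec: (B, 40, T) filterbank frames
        return F.normalize(self.fc(self.conv(spec).flatten(1)), dim=-1)

def ranking_loss(img_emb, aud_emb, margin=1.0):
    sims = img_emb @ aud_emb.t()      # (B, B) similarities; matched pairs on the diagonal
    pos = sims.diag().unsqueeze(1)
    # Hinge loss in both retrieval directions (image->speech and speech->image).
    cost = (margin + sims - pos).clamp(min=0) + (margin + sims - pos.t()).clamp(min=0)
    cost.fill_diagonal_(0)
    return cost.mean()

imgs, specs = torch.randn(4, 3, 224, 224), torch.randn(4, 40, 1024)
loss = ranking_loss(ImageEncoder()(imgs), SpeechEncoder()(specs))
print(loss.item())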

Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions. For example, our model is able to detect spoken instances of the word ‘lighthouse’ within an utterance and associate them with image regions containing lighthouses. We do not use any form of conventional automatic speech recognition, nor any text transcriptions or linguistic annotations. Our model effectively implements a form of language acquisition,...

10.18653/v1/p17-1047 article EN cc-by Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2017-01-01
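One way to picture the grounding step described above is as a "matchmap": a similarity score between every audio frame embedding and every spatial position of an image feature map, from which the regions responding to a hypothesized word-like segment can be read off. The sketch below shows that computation with assumed shapes; it is an illustration of the idea, not the paper's implementation.

# Illustrative matchmap: frame-by-region similarities used to ground acoustic
# segments to image regions. Shapes are toy values.
import torch

def matchmap(audio_frames, image_grid):
    # audio_frames: (T, D) per-frame speech embeddings
    # image_grid:   (H, W, D) spatial image feature map
    H, W, D = image_grid.shape
    sims = audio_frames @ image_grid.reshape(H * W, D).t()  # (T, H*W)
    return sims.reshape(-1, H, W)                           # (T, H, W)

def ground_segment(mm, start, end):
    # Average the matchmap over an audio span (e.g. a hypothesized word-like unit)
    # and return the most responsive image cell.
    region_scores = mm[start:end].mean(dim=0)               # (H, W)
    idx = torch.argmax(region_scores).item()
    return divmod(idx, region_scores.shape[1])              # (row, col)

mm = matchmap(torch.randn(100, 512), torch.randn(14, 14, 512))
print(ground_segment(mm, start=20, end=35))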

Current methods for learning visually grounded language from videos often rely on text annotation, such as human-generated captions or machine-generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the need for text annotation, we learn audio-visual representations from randomly segmented video clips and their raw audio waveforms. We train AVLnet on HowTo100M, a large...

10.21437/interspeech.2021-1312 article EN Interspeech 2021 2021-08-27

In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic...

10.48550/arxiv.1911.09602 preprint EN other-oa arXiv (Cornell University) 2019-01-01
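The quantization layer mentioned above can be pictured as a codebook lookup with a straight-through gradient. The sketch below is a generic VQ layer of that kind; the codebook size, commitment weight, and where the layer sits in the speech model are assumptions, not the paper's configuration.

# Generic VQ layer sketch (straight-through estimator); hyperparameters are guesses.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):  # z: (B, T, dim) continuous frame representations
        flat = z.reshape(-1, z.shape[-1])
        # Nearest codebook entry for every frame.
        codes = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(codes).reshape_as(z)
        # Codebook + commitment losses; straight-through copy of gradients to z.
        loss = ((q - z.detach()) ** 2).mean() + self.beta * ((q.detach() - z) ** 2).mean()
        q = z + (q - z).detach()
        return q, codes.reshape(z.shape[:-1]), loss

vq = VectorQuantizer()
quantized, codes, vq_loss = vq(torch.randn(2, 50, 256))
print(quantized.shape, codes.shape, vq_loss.item())

In the visually grounded setting the abstract describes, the quantized output would feed a multimodal grounding objective rather than a reconstruction decoder.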

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders Are Scalable Vision Learners (MAE) into SSAST, where a deep encoder...

10.21437/interspeech.2022-10961 article EN Interspeech 2022 2022-09-16
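The efficiency argument in the abstract comes from where the mask tokens are handled: with a 75% masking ratio, the encoder can run on only the kept 25% of patch tokens, and a lighter decoder re-inserts learnable mask tokens before reconstruction. The sketch below illustrates that split with toy shapes; it is not the released model code.

# Illustrative MAE-style split: encoder sees only unmasked tokens, decoder sees all.
import torch
import torch.nn as nn

def random_masking(x, mask_ratio=0.75):
    # x: (B, N, D) patch tokens; keep a random (1 - mask_ratio) fraction.
    B, N, D = x.shape
    num_keep = int(N * (1 - mask_ratio))
    keep_idx = torch.rand(B, N).argsort(dim=1)[:, :num_keep]          # (B, num_keep)
    kept = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx

B, N, D = 2, 512, 256
tokens = torch.randn(B, N, D)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=4)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=1)
mask_token = nn.Parameter(torch.zeros(1, 1, D))

kept, keep_idx = random_masking(tokens)   # encoder compute spent only on 25% of tokens
encoded = encoder(kept)

# Re-insert the encoded tokens into a full-length sequence of mask tokens for the decoder.
full = mask_token.expand(B, N, D).clone()
full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, D), encoded)
decoded = decoder(full)
print(encoded.shape, decoded.shape)  # torch.Size([2, 128, 256]) torch.Size([2, 512, 256])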

Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a framework that, starting from a pre-trained backbone, learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with...

10.1109/iccv48922.2021.00791 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
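The two signals combined above can be illustrated compactly: an instance-level contrastive loss applied across modality pairs, plus a grouping term that pulls embeddings toward shared semantic centroids. The sketch below is a rough illustration under assumed temperatures, dimensions, and a fixed set of centroids; the paper's actual clustering procedure is not reproduced here.

# Illustrative only: cross-modal instance contrastive loss plus a centroid-based grouping term.
import torch
import torch.nn.functional as F

def cross_modal_nce(a, b, temperature=0.07):
    # a, b: (B, D) L2-normalized embeddings of the same instances in two modalities.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def grouping_loss(x, centroids):
    # Pull each embedding toward its nearest semantic centroid (a soft-clustering proxy).
    assign = torch.cdist(x, centroids).argmin(dim=-1)
    return ((x - centroids[assign]) ** 2).sum(dim=-1).mean()

B, D, K = 8, 256, 16
video = F.normalize(torch.randn(B, D), dim=-1)
audio = F.normalize(torch.randn(B, D), dim=-1)
text  = F.normalize(torch.randn(B, D), dim=-1)
centroids = F.normalize(torch.randn(K, D), dim=-1)

loss = (cross_modal_nce(video, audio) + cross_modal_nce(video, text)
        + cross_modal_nce(audio, text) + grouping_loss(video, centroids))
print(loss.item())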

When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who, and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g....

10.1109/cvpr46437.2021.01463 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

Audio-visual representation learning aims to develop systems with human-like perception by utilizing the correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and the generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark, which enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate recent...

10.1109/icassp48485.2024.10445941 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18

In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images. These embeddings are learned directly from the waveforms, without the use of linguistic transcriptions or conventional speech recognition technology. While prior work has investigated this setting in the monolingual case using English data, this work represents the first effort to apply these techniques to languages beyond English. Using spoken captions collected in Hindi, we show that the same model architecture can be...

10.1109/icassp.2018.8462396 article EN 2018-04-01

Wei-Ning Hsu, David Harwath, Tyler Miller, Christopher Song, James Glass. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

10.18653/v1/2021.acl-long.411 article EN cc-by 2021-01-01

[Figure 1 caption] HuBERT: sum of attention weights each frame receives from other frames. Ours (VG-HuBERT3): sum of attention weights each frame receives from the [CLS A] token. Attention from different heads is coded with different colors.

10.21437/interspeech.2022-10652 article EN Interspeech 2022 2022-09-16
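The figure caption above points at the readout used for word discovery: frames that receive high attention from a CLS-style token tend to form contiguous word segments. Below is a hedged sketch of that readout, assuming the per-head attention weights from one transformer layer are already available; the layer choice and threshold are illustrative.

# Illustrative readout: sum CLS-to-frame attention over heads, threshold it, and
# report contiguous above-threshold spans as candidate word segments.
import torch

def cls_attention_segments(attn, cls_index=0, threshold=0.7):
    # attn: (num_heads, seq_len, seq_len) attention weights from one layer,
    # where position `cls_index` holds the CLS-style token.
    cls_to_frames = attn[:, cls_index, :].sum(dim=0)       # (seq_len,)
    cls_to_frames[cls_index] = 0.0                         # ignore self-attention
    mask = cls_to_frames > threshold * cls_to_frames.max()
    segments, start = [], None
    for i, above in enumerate(mask.tolist()):
        if above and start is None:
            start = i
        elif not above and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments

attn = torch.rand(12, 50, 50)   # stand-in attention weights
print(cls_attention_segments(attn))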

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior work on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, it can...

10.1109/slt54892.2023.10022954 article EN 2022 IEEE Spoken Language Technology Workshop (SLT) 2023-01-09
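The "minimal fine-tuning" recipe the abstract describes can be pictured as frozen pre-trained encoders with small trainable projection heads aligned by a symmetric contrastive objective. The sketch below uses random tensors as stand-ins for frozen HuBERT frame features and CLIP image embeddings; the dimensions, pooling, and temperature are assumptions.

# Illustrative alignment of frozen speech/image features via small trainable heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_proj = nn.Linear(768, 512)   # trainable head over frozen HuBERT-like frame features
image_proj  = nn.Linear(512, 512)   # trainable head over frozen CLIP-like image embeddings
optimizer = torch.optim.Adam(list(speech_proj.parameters()) + list(image_proj.parameters()), lr=1e-4)

def symmetric_contrastive(s, v, temperature=0.07):
    s, v = F.normalize(s, dim=-1), F.normalize(v, dim=-1)
    logits = s @ v.t() / temperature
    targets = torch.arange(s.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# One toy step on a batch of precomputed, frozen features for paired images and spoken captions.
speech_feats = torch.randn(8, 250, 768).mean(dim=1)  # mean-pooled frames per caption
image_feats  = torch.randn(8, 512)                   # one embedding per paired image
loss = symmetric_contrastive(speech_proj(speech_feats), image_proj(image_feats))
loss.backward()
optimizer.step()
print(loss.item())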

Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects. We introduce Visually-Informed Dereverberation of Audio (VIDA), an...

10.1109/icassp49357.2023.10095818 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition. We then present a series of experiments investigating the information encoded by these events.

10.1109/icassp.2019.8682666 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
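A toy version of the kind of analysis the abstract mentions: measure how quickly an intermediate layer's activations change from frame to frame and treat local peaks in that change signal as candidate sub-word boundary events. The layer choice and the peak rule below are assumptions for illustration, not the paper's analysis code.

# Illustrative boundary detector over intermediate-layer activations.
import torch

def activation_boundaries(acts, min_gap=3):
    # acts: (T, D) intermediate-layer activations for one utterance.
    change = (acts[1:] - acts[:-1]).norm(dim=-1)  # (T-1,) per-frame change magnitude
    peaks = []
    for t in range(1, change.shape[0] - 1):
        is_peak = (change[t] > change[t - 1]).item() and (change[t] > change[t + 1]).item()
        if is_peak and (not peaks or t - peaks[-1] >= min_gap):
            peaks.append(t)
    return peaks

acts = torch.randn(200, 1024).cumsum(dim=0)  # smooth-ish stand-in activations
print(activation_boundaries(acts)[:10])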

10.1109/wacv61041.2025.00490 article EN 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025-02-26

We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) in speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody....

10.48550/arxiv.2504.02386 preprint EN arXiv (Cornell University) 2025-04-03
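As a very rough illustration of the conditioning pattern described above, the sketch below runs a decoder-only transformer over discrete codec tokens with projected per-frame facial features prepended to the context. The vocabulary size, dimensions, and conditioning scheme are all assumptions; this is not the VoiceCraft-Dub architecture.

# Illustrative codec-token language model conditioned on video features.
import torch
import torch.nn as nn

class VideoConditionedCodecLM(nn.Module):
    def __init__(self, codec_vocab=1024, dim=512, video_dim=512):
        super().__init__()
        self.token_emb = nn.Embedding(codec_vocab, dim)
        self.video_proj = nn.Linear(video_dim, dim)   # per-frame facial features -> prefix tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, codec_vocab)

    def forward(self, codec_tokens, video_feats):
        # codec_tokens: (B, T_audio) discrete codec indices; video_feats: (B, T_video, video_dim)
        ctx = torch.cat([self.video_proj(video_feats), self.token_emb(codec_tokens)], dim=1)
        T = ctx.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(ctx, mask=causal)
        # Predict codec-token logits only over the audio positions.
        return self.head(h[:, video_feats.shape[1]:, :])

lm = VideoConditionedCodecLM()
logits = lm(torch.randint(0, 1024, (2, 100)), torch.randn(2, 25, 512))
print(logits.shape)  # torch.Size([2, 100, 1024])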