NFDI4DS | UHH-SEMS - Publication Details

Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

DOI: 10.48550/arxiv.2402.05819

Publication Date: 2024-02-08


AUTHORS (8)

Hung-Chieh Fang

Nai-Xuan Ye

Yi-Jen Shih

Puyuan Peng

Hsuan-Fu Wang

Layne Berry

Hung-yi Lee

David Harwath

ABSTRACT

Recent advances in self-supervised speech models have shown significant improvement on many downstream tasks. However, these models are predominantly centered on frame-level training objectives, which can fall short on spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in the real-world setting. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process, where the targets are derived from a visually-grounded speech model, notably eliminating the need for speech-text paired data. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
