Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model
DOI:
10.48550/arxiv.2402.05819
Publication Date:
2024-02-08
AUTHORS (8)
ABSTRACT
Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks. However, these models have predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in the real-world setting. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process, where the targets are derived from a visually-grounded speech model, notably eliminating the need for speech-text paired data. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.