EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

DOI: 10.48550/arxiv.2401.17690 Publication Date: 2024-01-31
ABSTRACT
We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP . An online demo is available at https://huggingface.co/spaces/enclap-team/enclap
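
The abstract only names the components, so here is a minimal, self-contained PyTorch sketch of how such a pipeline might fit together: discrete EnCodec codes supply a time-step-level representation, a CLAP embedding supplies a sequence-level representation, and both feed a BART-style encoder-decoder that generates the caption. This is not the authors' implementation from the linked repository. The class name EnCLAPSketch, all dimensions, the toy Transformer standing in for pretrained BART, the use of a single codebook (EnCodec actually produces several parallel codebooks via residual vector quantization), and the fusion strategy of prepending the projected CLAP embedding to the token sequence are all illustrative assumptions; the masked codec modeling objective mentioned in the abstract is likewise omitted.

import torch
import torch.nn as nn

class EnCLAPSketch(nn.Module):
    """Hypothetical sketch of an EnCLAP-style captioning model."""

    def __init__(self, codec_vocab=1024, clap_dim=512, d_model=768):
        super().__init__()
        # Embed discrete EnCodec codes (single codebook for simplicity).
        self.codec_embed = nn.Embedding(codec_vocab, d_model)
        # Project the sequence-level CLAP audio embedding into the
        # encoder's hidden space.
        self.clap_proj = nn.Linear(clap_dim, d_model)
        # Toy encoder-decoder standing in for pretrained BART.
        self.seq2seq = nn.Transformer(
            d_model=d_model, batch_first=True,
            num_encoder_layers=2, num_decoder_layers=2,
        )
        self.lm_head = nn.Linear(d_model, 50265)  # BART-sized vocabulary

    def forward(self, codec_ids, clap_emb, caption_emb):
        # codec_ids: (B, T) discrete codes; clap_emb: (B, clap_dim);
        # caption_emb: (B, L, d_model) embedded caption tokens.
        frames = self.codec_embed(codec_ids)                # (B, T, d)
        global_tok = self.clap_proj(clap_emb).unsqueeze(1)  # (B, 1, d)
        # Fuse by prepending the CLAP vector as an extra "token".
        enc_in = torch.cat([global_tok, frames], dim=1)     # (B, T+1, d)
        dec_out = self.seq2seq(enc_in, caption_emb)         # (B, L, d)
        return self.lm_head(dec_out)                        # token logits

model = EnCLAPSketch()
logits = model(
    torch.randint(0, 1024, (2, 100)),  # dummy EnCodec codes
    torch.randn(2, 512),               # dummy CLAP embedding
    torch.randn(2, 20, 768),           # dummy caption embeddings
)
print(logits.shape)  # torch.Size([2, 20, 50265])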