<p><strong>Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation</strong></p>

<p>Yusong Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick and S. Dubnov, "Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation," ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10095969.</p>

<p><a href="https://ieeexplore.ieee.org/document/10095969">Full publication</a></p>

<p><strong>Abstract</strong>: Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in the text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and obtains performance comparable to other models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.</p>
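The core idea the abstract describes, pulling matched audio and text embeddings together while pushing mismatched pairs apart, is the symmetric contrastive (InfoNCE-style) objective common to CLIP/CLAP-style models. The sketch below is an illustration of that objective only, not the authors' released code; the function name, NumPy implementation, and temperature value are assumptions for the example.

```python
# Illustrative sketch (not the authors' implementation) of a symmetric
# contrastive loss over a batch of paired audio/text embeddings.
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy loss; row i of each array is a matched pair.

    audio_emb, text_emb: (batch, dim) float arrays.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(a))              # diagonal entries are the matches

    def xent(lg):
        # Numerically stable log-softmax cross-entropy toward the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the diagonal dominates and the loss approaches zero; with randomly paired embeddings it sits near log(batch size), which is what training drives down.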