CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos


Hao-Wen Dong†, Naoya Takahashi*, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
In International Conference on Learning Representations (ICLR) 2023
(† Work done during an internship at Sony Group Corporation, * corresponding author)
| paper | code |

Summary of the compared models

| Model | Unlabelled data | Post-processing free | Query type (training) | Query type (test) |
| --- | :-: | :-: | :-: | :-: |
| CLIPSep | ✓ | ✓ | Image | Text |
| CLIPSep-NIT | ✓ | ✓ | Image | Text |
| LabelSep | ✗ | ✓ | Label | Label |
| PIT [1] | ✓ | ✗ | - | - |
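
The modality transfer in the first two rows (train with image queries, test with text queries) relies on CLIP embedding images and text into a shared space. Below is a minimal sketch of the two query paths, assuming the openai/CLIP package; the file name and the query string are placeholders:

```python
# Minimal sketch: image queries at training time and text queries at test
# time can share one query space via CLIP. Assumes the openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git); "frame.jpg" and the
# query string below are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Training-time query: a video frame embedded with the CLIP image encoder.
    frame = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)
    image_query = model.encode_image(frame)

    # Test-time query: free-form text embedded with the CLIP text encoder.
    tokens = clip.tokenize(["a man is playing acoustic guitar"]).to(device)
    text_query = model.encode_text(tokens)

# Both embeddings live in the same space, so a separator conditioned on
# image_query during training can be driven by text_query at test time.
image_query = image_query / image_query.norm(dim=-1, keepdim=True)
text_query = text_query / text_query.norm(dim=-1, keepdim=True)
```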


Example results on “MUSIC + VGGSound”

Settings: We take an audio sample from the MUSIC dataset as the target source and mix it with an interference audio sample from the VGGSound dataset to create an artificial mixture.
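
The mixing itself is a simple waveform sum. Here is a minimal sketch, assuming librosa and soundfile; the file names and the 16 kHz sample rate are placeholders rather than the paper's exact preprocessing:

```python
# Minimal sketch of the mixture creation: load a target clip and an
# interference clip, trim them to a common length, and sum them.
import librosa
import soundfile as sf

SR = 16000  # assumed sample rate (placeholder)

target, _ = librosa.load("music_accordion.wav", sr=SR, mono=True)
interference, _ = librosa.load("vggsound_engine.wav", sr=SR, mono=True)

# Trim both clips to a common length and sum them into an artificial mixture.
length = min(len(target), len(interference))
mixture = target[:length] + interference[:length]

sf.write("mixture.wav", mixture, SR)
```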

Example 1 – “accordion” + “engine accelerating”

| Mixture | Ground truth | Ground truth (Interference) | Prediction (CLIPSep) |
| --- | --- | --- | --- |
| mix.png | gtamp.png | intamp.png | predmap.png |

| Prediction (CLIPSep-NIT) | Prediction (PIT) | Noise head 1 (CLIPSep-NIT) * | Noise head 2 (CLIPSep-NIT) * |
| --- | --- | --- | --- |
| predamp.png | predamp.png | pitmag1.png | pitmag2.png |

* The noise heads are expected to capture query-irrelevant noise.

Example 2 – “acoustic guitar” + “cheetah chirrup”

| Mixture | Ground truth | Ground truth (Interference) | Prediction (CLIPSep) |
| --- | --- | --- | --- |
| mix.png | gtamp.png | intamp.png | predmap.png |

| Prediction (CLIPSep-NIT) | Prediction (PIT) * | Noise head 1 (CLIPSep-NIT) | Noise head 2 (CLIPSep-NIT) |
| --- | --- | --- | --- |
| predamp.png | predamp.png | pitmag1.png | pitmag2.png |

* The PIT model requires a post-selection step to obtain the correct source. Without post-selection, the PIT model returns the correct source only 50% of the time.
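
Since PIT produces two unordered outputs, post-selection must pick the one corresponding to the query. A minimal sketch of one common selection rule, assuming the reference source is available at evaluation time; the plain (non-scale-invariant) SDR below is illustrative and may differ from the paper's exact metric:

```python
# Minimal sketch of post-selection for the two unordered PIT outputs:
# keep the estimate with the higher SDR against the reference source.
import numpy as np

def sdr(ref, est, eps=1e-8):
    """Signal-to-distortion ratio in dB (plain, non-scale-invariant form)."""
    return 10 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps))

def post_select(reference, estimates):
    """Return the PIT output that best matches the reference source."""
    return max(estimates, key=lambda est: sdr(reference, est))
```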

Example 3 – “violin” + “people sobbing”

| Mixture | Ground truth | Ground truth (Interference) | Prediction (CLIPSep) |
| --- | --- | --- | --- |
| mix.png | gtamp.png | intamp.png | predmap.png |

| Prediction (CLIPSep-NIT) | Prediction (PIT) | Noise head 1 (CLIPSep-NIT) | Noise head 2 (CLIPSep-NIT) |
| --- | --- | --- | --- |
| predamp.png | predamp.png | pitmag1.png | pitmag2.png |

Example results on “VGGSound-Clean + VGGSound”

Settings: We take an audio sample from the VGGSound-Clean dataset as the target source and mix it with an interference audio sample from the VGGSound dataset to create an artificial mixture. Note that the LabelSep model does not work on the MUSIC dataset because the MUSIC and VGGSound datasets use different label taxonomies, so it appears only in this section.

Example 1 – “cat growling” + “railroad car”

| Mixture | Ground truth | Ground truth (Interference) |
| --- | --- | --- |
| mix.png | gtamp.png | intamp.png |

| Prediction (CLIPSep) | Prediction (CLIPSep-NIT) | Prediction (PIT) |
| --- | --- | --- |
| predmap.png | predamp.png | predamp.png |

| Prediction (LabelSep) | Noise head 1 (CLIPSep-NIT) * | Noise head 2 (CLIPSep-NIT) * |
| --- | --- | --- |
| predamp.png | pitmag1.png | pitmag2.png |

* The noise heads are expected to capture query-irrelevant noise.

Example 2 – “electric grinder” + “car horn”

| Mixture | Ground truth | Ground truth (Interference) |
| --- | --- | --- |
| mix.png | gtamp.png | intamp.png |

| Prediction (CLIPSep) | Prediction (CLIPSep-NIT) | Prediction (PIT) * |
| --- | --- | --- |
| predmap.png | predamp.png | predamp.png |

| Prediction (LabelSep) | Noise head 1 (CLIPSep-NIT) | Noise head 2 (CLIPSep-NIT) |
| --- | --- | --- |
| predamp.png | pitmag1.png | pitmag2.png |

* The PIT model requires a post-selection step to obtain the correct source. Without post-selection, the PIT model returns the correct source only 50% of the time.

Example 3 – “playing harpsichord” + “people coughing”

| Mixture | Ground truth | Ground truth (Interference) |
| --- | --- | --- |
| mix.png | gtamp.png | intamp.png |

| Prediction (CLIPSep) | Prediction (CLIPSep-NIT) | Prediction (PIT) * |
| --- | --- | --- |
| predmap.png | predamp.png | predamp.png |

| Prediction (LabelSep) | Noise head 1 (CLIPSep-NIT) | Noise head 2 (CLIPSep-NIT) |
| --- | --- | --- |
| predamp.png | pitmag1.png | pitmag2.png |

* The PIT model requires a post-selection step to obtain the correct source. Without post-selection, the PIT model returns the correct source only 50% of the time.


Example results on “VGGSound + None”

Settings: We take a “noisy” audio sample from the VGGSound dataset and treat it as the input mixture, aiming to examine whether the model can separate the target sound from query-irrelevant noise. Note that there is no ground truth in this setting.
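
For intuition, here is a minimal sketch of how the prediction and the two noise heads can be rendered from one clip: each head predicts a mask over the mixture spectrogram, which is applied and inverted back to a waveform. The `model(...)` interface and the STFT settings below are assumptions, not the released implementation:

```python
# Minimal sketch: apply a predicted spectrogram mask to the mixture STFT
# and invert the result to a waveform. STFT settings are placeholders.
import torch

N_FFT, HOP = 1024, 256
window = torch.hann_window(N_FFT)

def apply_mask(mixture: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mask the mixture STFT and invert the result to a waveform."""
    spec = torch.stft(mixture, N_FFT, HOP, window=window, return_complex=True)
    return torch.istft(spec * mask, N_FFT, HOP, window=window)

# Hypothetical interface: one query-conditioned head plus two noise heads.
# query_mask, noise_masks = model(mixture, text_query)
# prediction = apply_mask(mixture, query_mask)
# noise_head_1 = apply_mask(mixture, noise_masks[0])
# noise_head_2 = apply_mask(mixture, noise_masks[1])
```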

Example 1 – “playing bagpipes”

Source video
| Mixture | Prediction | Noise head 1 | Noise head 2 |
| --- | --- | --- | --- |
| mix.png | predamp.png | pitmag1.png | pitmag2.png |

Example 2 – “subway, metro, underground”

Source video
| Mixture | Prediction | Noise head 1 | Noise head 2 |
| --- | --- | --- | --- |
| mix.png | predamp.png | pitmag1.png | pitmag2.png |

Example 3 – “playing theremin”

Source video
| Mixture | Prediction | Noise head 1 | Noise head 2 |
| --- | --- | --- | --- |
| mix.png | predamp.png | pitmag1.png | pitmag2.png |

Robustness to different queries

Settings: We feed the model the same input mixture with different text queries to examine its robustness to how the query is phrased. We use the CLIPSep-NIT model in this demo.
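
A minimal sketch of this probe, assuming the openai/CLIP package; the `separator` call is a hypothetical stand-in for CLIPSep-NIT, not an actual API:

```python
# Minimal sketch of the robustness probe: embed several paraphrases of the
# query with CLIP and condition the separator on each for the same mixture.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

queries = [
    "acoustic guitar",
    "guitar",
    "a man is playing acoustic guitar",
    "a man is playing acoustic guitar in a room",
    "car engine",  # adversarial query matching the interference instead
]

with torch.no_grad():
    embeddings = model.encode_text(clip.tokenize(queries).to(device))

# for query, embedding in zip(queries, embeddings):
#     prediction = separator(mixture, embedding)  # hypothetical separator call
```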

“acoustic guitar” + “cheetah chirrup”

| Mixture | Ground truth | Ground truth (Interference) |
| --- | --- | --- |
| mix.png | gtamp.png | intamp.png |

| Prediction (Query: “acoustic guitar”) | Prediction (Query: “guitar”) | Prediction (Query: “a man is playing acoustic guitar”) |
| --- | --- | --- |
| predmap.png | predmap.png | predmap.png |

| Prediction (Query: “a man is playing acoustic guitar in a room”) | Prediction (Query: “car engine”) |
| --- | --- |
| predmap.png | predmap.png |

References

1. Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. ICASSP, 2017.