CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos


Hao-Wen Dong†, Naoya Takahashi*, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
In International Conference on Learning Representations (ICLR) 2023
(† Work done during an internship at Sony Group Corporation, * corresponding author)
| paper | code |

Summary of the compared models

| Model | Unlabelled data | Post-processing free | Query type (training) | Query type (test) |
| --- | :-: | :-: | :-: | :-: |
| CLIPSep | ✓ | ✓ | Image | Text |
| CLIPSep-NIT | ✓ | ✓ | Image | Text |
| LabelSep | ✗ | ✓ | Label | Label |
| PIT [1] | ✓ | ✗ | - | - |
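
The modality transfer in the first two rows (train with image queries, test with text queries) relies on CLIP embedding images and text into a shared space. Below is a minimal sketch of the two query paths, assuming the openai/CLIP package; the file name and the query string are placeholders:

```python
# Minimal sketch: image queries at training time and text queries at test
# time can share one query space via CLIP. Assumes the openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git); "frame.jpg" and the
# query string below are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Training-time query: a video frame embedded with the CLIP image encoder.
    frame = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)
    image_query = model.encode_image(frame)

    # Test-time query: free-form text embedded with the CLIP text encoder.
    tokens = clip.tokenize(["a man is playing acoustic guitar"]).to(device)
    text_query = model.encode_text(tokens)

# Both embeddings live in the same space, so a separator conditioned on
# image_query during training can be driven by text_query at test time.
image_query = image_query / image_query.norm(dim=-1, keepdim=True)
text_query = text_query / text_query.norm(dim=-1, keepdim=True)
```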


Example results on “MUSIC + VGGSound”

Settings: We take an audio sample from the MUSIC dataset as the target source and mix it with an interference audio sample from the VGGSound dataset to create an artificial mixture.
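
The mixing itself is a simple waveform sum. Here is a minimal sketch, assuming librosa and soundfile; the file names and the 16 kHz sample rate are placeholders rather than the paper's exact preprocessing:

```python
# Minimal sketch of the mixture creation: load a target clip and an
# interference clip, trim them to a common length, and sum them.
import librosa
import soundfile as sf

SR = 16000  # assumed sample rate (placeholder)

target, _ = librosa.load("music_accordion.wav", sr=SR, mono=True)
interference, _ = librosa.load("vggsound_engine.wav", sr=SR, mono=True)

# Trim both clips to a common length and sum them into an artificial mixture.
length = min(len(target), len(interference))
mixture = target[:length] + interference[:length]

sf.write("mixture.wav", mixture, SR)
```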

Example 1 – “accordion” + “engine accelerating”

| Mixture | Ground truth | Ground truth (Interference) | Prediction (CLIPSep) |
| --- | --- | --- | --- |
| mix.png | gtamp.png | intamp.png | predmap.png |

| Prediction (CLIPSep-NIT) | Prediction (PIT) | Noise head 1 (CLIPSep-NIT) * | Noise head 2 (CLIPSep-NIT) * |
| --- | --- | --- | --- |
| predamp.png | predamp.png | pitmag1.png | pitmag2.png |

* The noise heads are expected to capture query-irrelevant noise.

Example 2 – “acoustic guitar” + “cheetah chirrup”

| Mixture | Ground truth | Ground truth (Interference) | Prediction (CLIPSep) |
| --- | --- | --- | --- |
| mix.png | gtamp.png | intamp.png | predmap.png |

| Prediction (CLIPSep-NIT) | Prediction (PIT) * | Noise head 1 (CLIPSep-NIT) | Noise head 2 (CLIPSep-NIT) |
| --- | --- | --- | --- |
| predamp.png | predamp.png | pitmag1.png | pitmag2.png |

* The PIT model requires a post-selection step to obtain the correct source. Without post-selection, the PIT model returns the correct source only 50% of the time.
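
Since PIT produces two unordered outputs, post-selection must pick the one corresponding to the query. A minimal sketch of one common selection rule, assuming the reference source is available at evaluation time; the plain (non-scale-invariant) SDR below is illustrative and may differ from the paper's exact metric:

```python
# Minimal sketch of post-selection for the two unordered PIT outputs:
# keep the estimate with the higher SDR against the reference source.
import numpy as np

def sdr(ref, est, eps=1e-8):
    """Signal-to-distortion ratio in dB (plain, non-scale-invariant form)."""
    return 10 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps))

def post_select(reference, estimates):
    """Return the PIT output that best matches the reference source."""
    return max(estimates, key=lambda est: sdr(reference, est))
```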

Example 3 – “violin” + “people sobbing”

| Mixture | Ground truth | Ground truth (Interference) | Prediction (CLIPSep) |
| --- | --- | --- | --- |
| mix.png | gtamp.png | intamp.png | predmap.png |

| Prediction (CLIPSep-NIT) | Prediction (PIT) | Noise head 1 (CLIPSep-NIT) | Noise head 2 (CLIPSep-NIT) |
| --- | --- | --- | --- |
| predamp.png | predamp.png | pitmag1.png | pitmag2.png |

Example results on “VGGSound-Clean + VGGSound”

Settings: We take an audio sample from the VGGSound-Clean dataset as the target source and mix it with an interference audio sample from the VGGSound dataset to create an artificial mixture. Note that the LabelSep model does not work on the MUSIC dataset because the MUSIC and VGGSound datasets use different label taxonomies, so it appears only in this section.

Example 1 – “cat growling” + “railroad car”

| Mixture | Ground truth | Ground truth (Interference) |
| --- | --- | --- |
| mix.png | gtamp.png | intamp.png |

| Prediction (CLIPSep) | Prediction (CLIPSep-NIT) | Prediction (PIT) |
| --- | --- | --- |
| predmap.png | predamp.png | predamp.png |

| Prediction (LabelSep) | Noise head 1 (CLIPSep-NIT) * | Noise head 2 (CLIPSep-NIT) * |
| --- | --- | --- |
| predamp.png | pitmag1.png | pitmag2.png |

* The noise heads are expected to capture query-irrelevant noise.

Example 2 – “electric grinder” + “car horn”

| Mixture | Ground truth | Ground truth (Interference) |
| --- | --- | --- |
| mix.png | gtamp.png | intamp.png |

| Prediction (CLIPSep) | Prediction (CLIPSep-NIT) | Prediction (PIT) * |
| --- | --- | --- |
| predmap.png | predamp.png | predamp.png |

| Prediction (LabelSep) | Noise head 1 (CLIPSep-NIT) | Noise head 2 (CLIPSep-NIT) |
| --- | --- | --- |
| predamp.png | pitmag1.png | pitmag2.png |

* The PIT model requires a post-selection step to obtain the correct source. Without post-selection, the PIT model returns the correct source only 50% of the time.

Example 3 – “playing harpsichord” + “people coughing”

| Mixture | Ground truth | Ground truth (Interference) |
| --- | --- | --- |
| mix.png | gtamp.png | intamp.png |

| Prediction (CLIPSep) | Prediction (CLIPSep-NIT) | Prediction (PIT) * |
| --- | --- | --- |
| predmap.png | predamp.png | predamp.png |

| Prediction (LabelSep) | Noise head 1 (CLIPSep-NIT) | Noise head 2 (CLIPSep-NIT) |
| --- | --- | --- |
| predamp.png | pitmag1.png | pitmag2.png |

* The PIT model requires a post-selection step to obtain the correct source. Without post-selection, the PIT model returns the correct source only 50% of the time.


Example results on “VGGSound + None”

Settings: We take a “noisy” audio sample from the VGGSound dataset and treat it as the input mixture, aiming to examine whether the model can separate the target sound from query-irrelevant noise. Note that there is no ground truth in this setting.
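
For intuition, here is a minimal sketch of how the prediction and the two noise heads can be rendered from one clip: each head predicts a mask over the mixture spectrogram, which is applied and inverted back to a waveform. The `model(...)` interface and the STFT settings below are assumptions, not the released implementation:

```python
# Minimal sketch: apply a predicted spectrogram mask to the mixture STFT
# and invert the result to a waveform. STFT settings are placeholders.
import torch

N_FFT, HOP = 1024, 256
window = torch.hann_window(N_FFT)

def apply_mask(mixture: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mask the mixture STFT and invert the result to a waveform."""
    spec = torch.stft(mixture, N_FFT, HOP, window=window, return_complex=True)
    return torch.istft(spec * mask, N_FFT, HOP, window=window)

# Hypothetical interface: one query-conditioned head plus two noise heads.
# query_mask, noise_masks = model(mixture, text_query)
# prediction = apply_mask(mixture, query_mask)
# noise_head_1 = apply_mask(mixture, noise_masks[0])
# noise_head_2 = apply_mask(mixture, noise_masks[1])
```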

Example 1 – “playing bagpipes”

Source video
| Mixture | Prediction | Noise head 1 | Noise head 2 |
| --- | --- | --- | --- |
| mix.png | predamp.png | pitmag1.png | pitmag2.png |

Example 2 – “subway, metro, underground”

Source video
| Mixture | Prediction | Noise head 1 | Noise head 2 |
| --- | --- | --- | --- |
| mix.png | predamp.png | pitmag1.png | pitmag2.png |

Example 3 – “playing theremin”

Source video
| Mixture | Prediction | Noise head 1 | Noise head 2 |
| --- | --- | --- | --- |
| mix.png | predamp.png | pitmag1.png | pitmag2.png |

Robustness to different queries

Settings: We feed the model the same input mixture with different text queries to examine its robustness to how the query is phrased. We use the CLIPSep-NIT model in this demo.
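
A minimal sketch of this probe, assuming the openai/CLIP package; the `separator` call is a hypothetical stand-in for CLIPSep-NIT, not an actual API:

```python
# Minimal sketch of the robustness probe: embed several paraphrases of the
# query with CLIP and condition the separator on each for the same mixture.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

queries = [
    "acoustic guitar",
    "guitar",
    "a man is playing acoustic guitar",
    "a man is playing acoustic guitar in a room",
    "car engine",  # adversarial query matching the interference instead
]

with torch.no_grad():
    embeddings = model.encode_text(clip.tokenize(queries).to(device))

# for query, embedding in zip(queries, embeddings):
#     prediction = separator(mixture, embedding)  # hypothetical separator call
```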

“acoustic guitar” + “cheetah chirrup”

| Mixture | Ground truth | Ground truth (Interference) |
| --- | --- | --- |
| mix.png | gtamp.png | intamp.png |

| Prediction (Query: “acoustic guitar”) | Prediction (Query: “guitar”) | Prediction (Query: “a man is playing acoustic guitar”) |
| --- | --- | --- |
| predmap.png | predmap.png | predmap.png |

| Prediction (Query: “a man is playing acoustic guitar in a room”) | Prediction (Query: “car engine”) |
| --- | --- |
| predmap.png | predmap.png |

References

1. Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. ICASSP, 2017.