Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer Demo¶

Table of Contents¶

1. Timbre Transfer Results
1.1. Normal Instruments Created with Our Method
1.2. Pitch-Shifted Flute
1.3. Chunk-Based Minibatch
2. Impact of Different Sigma Max and Sigma N
3. Shared Space
4. Cycle Consistency
5. Appendix

Timbre Transfer Results¶

Normal Instruments Created with Our Method¶

Source	Target
flute	violin DPD: 0.07, JD: 0.0
flute	trumpet DPD: 0.05, JD: 0.0
violin	flute DPD: 0.1, JD: 0.2
violin	trumpet DPD: 0.13, JD: 0.1
trumpet	flute DPD: 0.02, JD: 0.0
trumpet	violin DPD: 0.02, JD: 0.0
bassoon	cello DPD: 0.12, JD: 0.0
cello	bassoon DPD: 0.07, JD: 0.0

Pitch-Shifted Flute¶

Source	Target
flute shifted 0 semitones	bassoon DPD: 0.75, JD: 0.08
flute shifted -20 semitones	bassoon DPD: 0.6, JD: 0.25
flute shifted -25 semitones	bassoon DPD: 0.12, JD: 0.0
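
For reference, the semitone shifts in the table above correspond to frequency ratios under twelve-tone equal temperament, where a shift of n semitones scales frequency by 2^(n/12). The sketch below only illustrates that arithmetic; the function name is hypothetical, and the page does not say which tool was used to shift the flute recordings.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


# Shifts used in the table above (0, -20 and -25 semitones).
for shift in (0, -20, -25):
    print(f"{shift:+d} semitones -> frequency x {semitone_ratio(shift):.3f}")
```

A shift of -25 semitones is two octaves plus one semitone down, which moves the flute much closer to the bassoon's register.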

Chunk-Based Minibatch¶

Source	Target
flute model trained with time chunk size 4 and channel chunk size 0	violin DPD: 0.12, JD: 0.0
flute model trained with time chunk size 4 and channel chunk size 32	violin DPD: 0.2, JD: 0.0
violin model trained with time chunk size 4 and channel chunk size 0	flute DPD: 0.09, JD: 0.0
violin model trained with time chunk size 4 and channel chunk size 32	flute DPD: 0.13, JD: 0.1

Impact of Different Sigma Max and Sigma N¶

Source	Noise	Target
violin model with sigma_max=100 and sigma_N=100	Noisy violin	flute DPD: 2.39, JD: 0.64
violin model with sigma_max=100 and sigma_N=50	Noisy violin	flute DPD: 2.61, JD: 0.82
violin model with sigma_max=100 and sigma_N=20	Noisy violin	flute DPD: 0.33, JD: 0.1
violin model with sigma_max=100 and sigma_N=5	Noisy violin	flute DPD: 0.12, JD: 0.1

This graph illustrates the JD and DPD values for a specific violin-to-flute timbre transfer example while varying sigma_N. The classification of the generated audio as either violin or flute is also indicated.
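
The role of sigma_N can be sketched under an assumption that is not stated on this page: that the "Noisy violin" column is the source latent perturbed by additive Gaussian noise of standard deviation sigma_N (variance-exploding style), before the target model denoises it. The function name, the latent shape and the all-zero placeholder latent are hypothetical.

```python
import numpy as np


def noise_to_sigma(latent: np.ndarray, sigma_n: float, seed: int = 0) -> np.ndarray:
    """Add Gaussian noise with standard deviation sigma_n to a latent.

    Assumption: variance-exploding noising, x_noisy = x + sigma_n * eps.
    """
    rng = np.random.default_rng(seed)
    return latent + sigma_n * rng.standard_normal(latent.shape)


# Placeholder latent; a real one would come from the audio encoder.
latent = np.zeros((128, 256))

# sigma_N values from the table above.
for sigma_n in (100, 50, 20, 5):
    noisy = noise_to_sigma(latent, sigma_n)
    print(f"sigma_N={sigma_n}: noisy latent std = {noisy.std():.2f}")
```

Under this reading, a larger sigma_N buries more of the source signal in noise, which is consistent with the much higher DPD and JD values at sigma_N=100 and 50 than at 20 and 5 in this example.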

Shared Space¶

The following audio samples were generated with the flute and violin models, both with sigma_max=100 and sigma_N=100, by sampling the source latent directly from N(0, sigma_max^2 I), i.e. standard Gaussian noise scaled by sigma_max. Below, we provide examples of audio pairs that were considered melodically similar and those that were not.

Source Latent	Flute	Violin
Source Latent Standard Gaussian Noise * 100 with seed=0	Flute Similar Melodies (DPD < 0.7)	Violin DPD: 0.52, JD: 0.18
Source Latent Standard Gaussian Noise * 100 with seed=1	Flute Different Melodies (DPD >= 0.7)	Violin DPD: 1.77, JD: 0.25
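
The seeded source latents in the table above can be sketched as follows. This is a minimal illustration in NumPy, not the authors' code: the latent shape and the function name are hypothetical, and a NumPy generator will not reproduce the exact noise drawn by whatever framework produced the audio samples.

```python
import numpy as np

SIGMA_MAX = 100.0  # value quoted on this page


def sample_shared_latent(shape, seed, sigma_max=SIGMA_MAX):
    """Draw standard Gaussian noise and scale it by sigma_max."""
    rng = np.random.default_rng(seed)
    return sigma_max * rng.standard_normal(shape)


latent_shape = (128, 256)  # hypothetical (channels, frames)

z0 = sample_shared_latent(latent_shape, seed=0)
z0_again = sample_shared_latent(latent_shape, seed=0)
z1 = sample_shared_latent(latent_shape, seed=1)

# The same seed yields the same latent, so the flute and violin models
# can both be started from an identical point in the shared space.
print(np.array_equal(z0, z0_again), np.array_equal(z0, z1))
```

The point of fixing the seed is that both instrument models denoise the same starting latent, so any melodic similarity between the two outputs can be attributed to the shared space rather than to matching inputs.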

Cycle Consistency¶

The following results were obtained by calculating the normalized L2 distance between the Encodec embeddings of the input flute audio and the Encodec embeddings obtained after converting the flute to violin and back to flute.

[Figure: cycle-consistency results for flute → violin → flute.]
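
The cycle-consistency measure described above can be sketched as below. The page does not specify the normalization, so this assumes the difference norm is divided by the norm of the input embeddings; the function name and the toy arrays are hypothetical stand-ins for real Encodec embeddings.

```python
import numpy as np


def normalized_l2(original, reconstructed, eps=1e-12):
    """Relative L2 error ||reconstructed - original|| / ||original||.

    Assumption: normalization by the norm of the input embeddings.
    """
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    if original.shape != reconstructed.shape:
        raise ValueError("embeddings must have the same shape")
    diff_norm = np.linalg.norm(reconstructed - original)
    return float(diff_norm / (np.linalg.norm(original) + eps))


# Toy stand-ins: input flute embeddings and a slightly perturbed
# flute -> violin -> flute reconstruction.
rng = np.random.default_rng(0)
flute_in = rng.standard_normal((128, 256))
flute_cycled = flute_in + 0.1 * rng.standard_normal((128, 256))

print(f"cycle-consistency error: {normalized_l2(flute_in, flute_cycled):.3f}")
```

A value of 0 means the round trip reproduced the input embeddings exactly, and larger values indicate a greater loss of information over the cycle.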