Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer Demo

Source code

Supplementary Material

Table of Contents

  1. Timbre Transfer Results
    1.1. Normal Instruments Created with Our Method
    1.2. Pitch-Shifted Flute
    1.3. Chunk-Based Minibatch

  2. Impact of Different Sigma Max and Sigma N

  3. Shared Space

  4. Cycle Consistency

  5. Appendix

Timbre Transfer Results

Normal Instruments Created with Our Method

Source    Target    DPD    JD
flute     violin    0.07   0.0
flute     trumpet   0.05   0.0
violin    flute     0.1    0.2
violin    trumpet   0.13   0.1
trumpet   flute     0.02   0.0
trumpet   violin    0.02   0.0
bassoon   cello     0.12   0.0
cello     bassoon   0.07   0.0

Each source and target is shown with its pitch contour.

Pitch-Shifted Flute

Source                        Target    DPD    JD
flute shifted 0 semitones     bassoon   0.75   0.08
flute shifted -20 semitones   bassoon   0.6    0.25
flute shifted -25 semitones   bassoon   0.12   0.0

Each source and target is shown with its pitch contour.
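For reference, the semitone shifts listed above correspond to frequency ratios under 12-tone equal temperament. The sketch below only shows that arithmetic. It does not reproduce how the flute recordings on this page were shifted, and the function names are illustrative.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shift_frequency(freq_hz: float, semitones: float) -> float:
    """Frequency obtained by shifting `freq_hz` by `semitones`."""
    return freq_hz * semitone_ratio(semitones)


if __name__ == "__main__":
    # The three shifts used in the pitch-shifted flute examples.
    for n in (0, -20, -25):
        print(f"{n:+d} semitones -> frequency ratio {semitone_ratio(n):.3f}")
```

A shift of -20 semitones scales every frequency by roughly 0.31 and a shift of -25 semitones by roughly 0.24, so both move the flute down by more than an octave.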

Chunk-Based Minibatch

Source    Time chunk size    Channel chunk size    Target    DPD    JD
flute     4                  0                     violin    0.12   0.0
flute     4                  32                    violin    0.2    0.0
violin    4                  0                     flute     0.09   0.0
violin    4                  32                    flute     0.13   0.1

The chunk sizes are those used to train the model. Each source and target is shown with its pitch contour.

Impact of Different Sigma Max and Sigma N

Source    sigma_max    sigma_N    Noise           Target    DPD    JD
violin    100          100        noisy violin    flute     2.39   0.64
violin    100          50         noisy violin    flute     2.61   0.82
violin    100          20         noisy violin    flute     0.33   0.1
violin    100          5          noisy violin    flute     0.12   0.1

Each source and target is shown with its spectrogram and pitch contour; each noisy violin is shown with its spectrogram.
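One way to read the Noise column is as the source latent with Gaussian noise of standard deviation sigma_N added to it. The sketch below illustrates only that reading. The additive form, the flattened all-zero latent, and the name `add_gaussian_noise` are assumptions for illustration and are not taken from the released code.

```python
import random
import statistics


def add_gaussian_noise(latent, sigma, seed=0):
    """Return latent + sigma * eps, with eps drawn from a standard Gaussian."""
    rng = random.Random(seed)
    return [x + sigma * rng.gauss(0.0, 1.0) for x in latent]


# Hypothetical stand-in for a flattened source latent (all zeros for clarity).
latent = [0.0] * 4096

for sigma_n in (100.0, 50.0, 20.0, 5.0):
    noisy = add_gaussian_noise(latent, sigma_n)
    # The empirical spread of the noisy latent should be close to sigma_N.
    print(f"sigma_N={sigma_n:g}: std of noisy latent = {statistics.pstdev(noisy):.1f}")
```

Under this reading, a smaller sigma_N leaves more of the source latent intact, which is consistent with the listed DPD falling from above 2 at sigma_N of 100 and 50 to 0.12 at sigma_N of 5.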

The graphs below show the JD and DPD values for one violin-to-flute timbre transfer example as sigma_N varies. They also indicate whether the generated audio is classified as violin or flute.

Jaccard vs Sigma_max
DTW vs Sigma_max
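This page does not define JD and DPD. Going by the plot titles, the sketch below assumes that JD is a Jaccard distance between the sets of pitches in two clips and that DPD is a distance between pitch contours after dynamic time warping (DTW). The semitone units, the absolute-difference cost, and the normalization by total contour length are illustrative choices and may differ from the metrics used for the numbers above.

```python
import math


def jaccard_distance(pitches_a, pitches_b):
    """1 - |A intersect B| / |A union B| over the sets of pitches in each clip."""
    a, b = set(pitches_a), set(pitches_b)
    union = a | b
    if not union:
        return 0.0  # two empty pitch sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def dtw_pitch_distance(contour_a, contour_b):
    """Accumulated |pitch difference| along the best DTW path, per frame.

    Contours are sequences of pitches (e.g. in semitones), one per frame.
    Dividing by len(a) + len(b) is an assumed normalization.
    """
    n, m = len(contour_a), len(contour_b)
    if n == 0 or m == 0:
        raise ValueError("contours must be non-empty")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(contour_a[i - 1] - contour_b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


if __name__ == "__main__":
    source = [60, 62, 64]
    stretched = [60, 60, 62, 62, 64, 64]  # same melody, played more slowly
    print(jaccard_distance(source, stretched))
    print(dtw_pitch_distance(source, stretched))
```

Because DTW aligns the two contours before comparing them, a melody that is merely stretched in time scores as identical under this sketch, while a transposed or altered melody does not.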

Shared Space

The following audio samples were generated using flute and violin models, both with sigma_max=100 and sigma_N=100, by sampling directly from N(0, sigma_max). Below, we provide examples of audio pairs that were considered melodically similar and those that were not.

Source latent                                Melodies                  DPD    JD
Standard Gaussian noise * 100 with seed=0    Similar (DPD < 0.7)       0.52   0.18
Standard Gaussian noise * 100 with seed=1    Different (DPD >= 0.7)    1.77   0.25

For each source latent, the flute and violin outputs are each shown with a pitch contour.
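The source latents listed above are described as standard Gaussian noise multiplied by 100 with a fixed seed, and pairs are labeled similar when DPD is below 0.7. A minimal sketch of those two steps follows. The latent size is hypothetical, and Python's `random` module will not reproduce the latents used for the samples on this page, which depend on the original random number generator.

```python
import random
import statistics

SIGMA_MAX = 100.0      # scale of the source latent, as stated above
DPD_THRESHOLD = 0.7    # threshold used above to call two melodies similar


def sample_source_latent(num_values, seed, sigma_max=SIGMA_MAX):
    """Standard Gaussian noise scaled by sigma_max, reproducible per seed."""
    rng = random.Random(seed)
    return [sigma_max * rng.gauss(0.0, 1.0) for _ in range(num_values)]


def melody_label(dpd, threshold=DPD_THRESHOLD):
    """Label a flute/violin pair by its DPD value."""
    return "similar" if dpd < threshold else "different"


if __name__ == "__main__":
    latent = sample_source_latent(num_values=4096, seed=0)  # size is hypothetical
    print(f"empirical std: {statistics.pstdev(latent):.1f}")
    print(melody_label(0.52), melody_label(1.77))
```

Feeding the same latent to both the flute model and the violin model is what allows the two outputs to be compared as a pair.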

Cycle Consistency

The following results were obtained by calculating the normalized L2 norm between the input Encodec embeddings derived from flute audio and the generated Encodec embeddings after converting the flute to violin and back to flute.

[Two figures showing the cycle-consistency results; no description is available for either image.]
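The cycle-consistency measure described above can be sketched as follows, assuming that "normalized L2 norm" means the L2 norm of the difference between the two embeddings divided by the L2 norm of the input embedding. The exact normalization is not specified on this page, and the embeddings are represented here as flat lists of floats for illustration.

```python
import math


def normalized_l2(reference, reconstruction):
    """||reference - reconstruction||_2 / ||reference||_2 for flat embeddings."""
    if len(reference) != len(reconstruction):
        raise ValueError("embeddings must have the same length")
    diff_norm = math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, reconstruction)))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    if ref_norm == 0.0:
        raise ValueError("reference embedding has zero norm")
    return diff_norm / ref_norm


if __name__ == "__main__":
    flute_in = [3.0, 4.0]    # hypothetical input flute embedding
    flute_back = [3.0, 4.5]  # hypothetical flute -> violin -> flute embedding
    print(f"normalized L2 error: {normalized_l2(flute_in, flute_back):.3f}")
```

A value of 0 means the round trip returned the input embedding exactly. Under this normalization, a value of 1 means the error is as large as the input embedding itself.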