Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer Demo
Timbre Transfer Results
Normal Instruments Created with Our Method
Source | Target | Target |
---|---|---|
flute | violin (DPD: 0.07, JD: 0.0) | trumpet (DPD: 0.05, JD: 0.0) |
violin | flute (DPD: 0.1, JD: 0.2) | trumpet (DPD: 0.13, JD: 0.1) |
trumpet | flute (DPD: 0.02, JD: 0.0) | violin (DPD: 0.02, JD: 0.0) |
bassoon | cello (DPD: 0.12, JD: 0.0) | |
cello | bassoon (DPD: 0.07, JD: 0.0) | |
Pitch-Shifted
Source | Target |
---|---|
flute shifted 0 semitones | bassoon (DPD: 0.75, JD: 0.08) |
flute shifted -20 semitones | bassoon (DPD: 0.6, JD: 0.25) |
flute shifted -25 semitones | bassoon (DPD: 0.12, JD: 0.0) |
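To put the semitone shifts in the table in perspective, the sketch below converts a shift in semitones into a frequency ratio. It assumes twelve-tone equal temperament; the page does not say how the flute recordings were actually pitch-shifted, and `semitones_to_ratio` is a hypothetical helper name used only for illustration.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift, assuming 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


# Shifts used in the table above (0, -20 and -25 semitones).
for shift in (0, -20, -25):
    print(f"{shift:+d} semitones -> frequency ratio {semitones_to_ratio(shift):.3f}")
```

Under this assumption, a shift of -20 or -25 semitones moves the flute down by roughly one and a half to two octaves, toward the register of the bassoon.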
Chunk-Based Minibatch
Source | Target |
---|---|
flute (model trained with time chunk size 4 and channel chunk size 0) | violin (DPD: 0.12, JD: 0.0) |
flute (model trained with time chunk size 4 and channel chunk size 32) | violin (DPD: 0.2, JD: 0.0) |
violin (model trained with time chunk size 4 and channel chunk size 0) | flute (DPD: 0.09, JD: 0.0) |
violin (model trained with time chunk size 4 and channel chunk size 32) | flute (DPD: 0.13, JD: 0.1) |
Impact of Different Sigma Max and Sigma N
Source | Noise | Target |
---|---|---|
violin (model with sigma_max=100 and sigma_N=100) | Noisy violin | flute (DPD: 2.39, JD: 0.64) |
violin (model with sigma_max=100 and sigma_N=50) | Noisy violin | flute (DPD: 2.61, JD: 0.82) |
violin (model with sigma_max=100 and sigma_N=20) | Noisy violin | flute (DPD: 0.33, JD: 0.1) |
violin (model with sigma_max=100 and sigma_N=5) | Noisy violin | flute (DPD: 0.12, JD: 0.1) |
This graph illustrates the JD and DPD values for a specific violin-to-flute timbre transfer example while varying sigma_N. The classification of the generated audio as either violin or flute is also indicated.
Shared Space
The following audio samples were generated using the flute and violin models, both with sigma_max=100 and sigma_N=100, by sampling the source latent directly from N(0, sigma_max), that is, standard Gaussian noise scaled by 100. Below, we provide examples of flute/violin pairs that were considered melodically similar (DPD < 0.7) and pairs that were not (DPD >= 0.7).
Source Latent | Flute | Violin |
---|---|---|
Standard Gaussian Noise * 100 with seed=0 | Similar Melodies (DPD < 0.7) | DPD: 0.52, JD: 0.18 |
Standard Gaussian Noise * 100 with seed=1 | Different Melodies (DPD >= 0.7) | DPD: 1.77, JD: 0.25 |
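The setup in this section can be sketched as follows: draw a seeded standard Gaussian latent, scale it by sigma_max, and label a flute/violin pair as melodically similar when its DPD falls below 0.7. This is a minimal illustration, not the authors' code. The latent shape, the helper names `sample_source_latent` and `melodies_similar`, and the use of NumPy's random generator are assumptions, so the seeds here will not reproduce the exact noise behind the samples above.

```python
from typing import Tuple

import numpy as np

SIGMA_MAX = 100.0      # sigma_max reported for both models
DPD_THRESHOLD = 0.7    # similarity threshold used in the table above


def sample_source_latent(seed: int, shape: Tuple[int, ...],
                         sigma_max: float = SIGMA_MAX) -> np.ndarray:
    """Standard Gaussian noise scaled by sigma_max (shape is an assumption)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape) * sigma_max


def melodies_similar(dpd: float, threshold: float = DPD_THRESHOLD) -> bool:
    """A pair counts as melodically similar when DPD is below the threshold."""
    return dpd < threshold


# Hypothetical latent shape (channels, frames); the page does not state it.
latent = sample_source_latent(seed=0, shape=(128, 750))
print("empirical std of latent:", latent.std())

# DPD values taken from the two rows of the table above.
for seed, dpd in ((0, 0.52), (1, 1.77)):
    label = "similar" if melodies_similar(dpd) else "different"
    print(f"seed={seed}: DPD={dpd} -> {label} melodies")
```

The same latent would be passed to both the flute and the violin model; how each model maps it to audio is outside the scope of this sketch.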
Cycle Consistency
The following results were obtained by calculating the normalized L2 norm between the input Encodec embeddings derived from flute audio and the generated Encodec embeddings after converting the flute to violin and back to flute.
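One plausible reading of this metric is a relative error: the L2 norm of the difference between the original and the cycled embeddings, divided by the L2 norm of the original. The sketch below implements that reading. The normalization choice, the embedding shape, and the function name `normalized_l2` are assumptions, and random arrays stand in for real Encodec embeddings.

```python
import numpy as np


def normalized_l2(reference: np.ndarray, reconstructed: np.ndarray,
                  eps: float = 1e-12) -> float:
    """Relative L2 error: ||reference - reconstructed||_2 / ||reference||_2.

    Assumption: "normalized" means dividing by the norm of the reference.
    """
    if reference.shape != reconstructed.shape:
        raise ValueError("embeddings must have the same shape")
    diff = np.linalg.norm((reference - reconstructed).ravel())
    scale = np.linalg.norm(reference.ravel())
    return float(diff / (scale + eps))


# Hypothetical stand-ins: the input flute embeddings and the embeddings
# obtained after a flute -> violin -> flute round trip.
rng = np.random.default_rng(0)
flute_in = rng.standard_normal((128, 750))
flute_cycled = flute_in + 0.1 * rng.standard_normal((128, 750))

print("cycle-consistency error:", normalized_l2(flute_in, flute_cycled))
```

A value of 0 means the round trip reproduces the input embeddings exactly; larger values indicate a larger deviation relative to the magnitude of the input.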