DiffRoll Demo¶

This demo page accompanies the paper DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability.

Source code: https://github.com/sony/DiffRoll

Table of contents¶

  1. Transcription (Conditional Generation)
  2. Inpainting 1: Gap Filling
  3. Inpainting 2: Music Continuation
  4. Unconditional Generation
  5. L2 Loss

Transcription (Conditional Generation)¶

In this section, we use the spectrogram $x_\text{spec}$ as the condition for our model. The model therefore transcribes the spectrogram, which can also be viewed as conditional generation.

For transcription, we convert the output posteriorgrams into piano rolls and then export the piano rolls as MIDI files.

FL Studio is used to synthesize audio from the MIDI files.
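The posteriorgram-to-MIDI conversion described above can be sketched as follows. This is an illustrative sketch, not the repository's actual export code; the 0.5 threshold, the frame rate, and the pitch offset for an 88-key piano are assumptions.

```python
import numpy as np

def posteriorgram_to_notes(post, threshold=0.5, frame_rate=31.25, pitch_offset=21):
    """Convert a (frames, 88) posteriorgram into (pitch, onset_s, offset_s) notes.

    A frame is active when its probability exceeds `threshold`; consecutive
    active frames of the same pitch are merged into one note.
    """
    roll = post > threshold            # binarize into a piano roll
    notes = []
    for key in range(roll.shape[1]):
        active = np.flatnonzero(roll[:, key])
        if active.size == 0:
            continue
        # split runs of consecutive frames into individual notes
        runs = np.split(active, np.where(np.diff(active) > 1)[0] + 1)
        for run in runs:
            onset = run[0] / frame_rate
            offset = (run[-1] + 1) / frame_rate
            notes.append((key + pitch_offset, onset, offset))
    return notes

# toy posteriorgram: one note on MIDI pitch 60 (key index 39), frames 2..5
post = np.zeros((10, 88))
post[2:6, 39] = 0.9
print(posteriorgram_to_notes(post))
```

The resulting note list can then be written out with any MIDI library.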

[Animation: sampling trajectory for generation]
| File Name | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| --- | --- | --- | --- | --- |
| Input Audio (Condition) | [audio] | [audio] | [audio] | [audio] |
| Transcribed Posteriorgram | [image] | [image] | [image] | [image] |
| Transcribed MIDI (Synthesized) | [audio] | [audio] | [audio] | [audio] |
| Output MIDI | [Download] | [Download] | [Download] | [Download] |
| Ground Truth Piano Roll | [image] | [image] | [image] | [image] |
| Ground Truth MIDI | [audio] | [audio] | [audio] | [audio] |

Back to Top

Inpainting 1: Gap Filling¶

In this section, we mask part of the spectrogram $x_\text{spec}$ with $-1$. The model then generates music for the masked region and transcribes the rest of the spectrogram. There is room for improvement, but we leave this for future work.
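The masking step can be sketched as below; this is a minimal illustration, assuming $x_\text{spec}$ is a (frames, bins) array and the gap is given as a frame range (the function name and shapes are hypothetical, not from the repository).

```python
import numpy as np

def mask_gap(x_spec, start, end, mask_value=-1.0):
    """Replace frames [start, end) of the spectrogram with the mask value.

    The model then generates content for the masked region while
    transcribing the unmasked remainder.
    """
    masked = x_spec.copy()
    masked[start:end, :] = mask_value
    return masked

x_spec = np.random.rand(640, 229)            # dummy log-mel spectrogram
masked = mask_gap(x_spec, 200, 400)
print((masked[200:400] == -1).all())         # gap is filled with -1
print((masked[:200] == x_spec[:200]).all())  # the rest is untouched
```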

| File Name | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| --- | --- | --- | --- | --- |
| Input Audio (Condition) | [audio] | [audio] | [audio] | [audio] |
| Inpainted Posteriorgram | [image] | [image] | [image] | [image] |
| Inpainted MIDI (Synthesized) | [audio] | [audio] | [audio] | [audio] |
| Output MIDI | [Download] | [Download] | [Download] | [Download] |
| Original Piano Roll | [image] | [image] | [image] | [image] |
| Original MIDI (Synthesized) | [audio] | [audio] | [audio] | [audio] |

Back to Top

Inpainting 2: Music Continuation¶

| File Name | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| --- | --- | --- | --- | --- |
| Input Audio (Condition) | [audio] | [audio] | [audio] | [audio] |
| Inpainted Posteriorgram | [image] | [image] | [image] | [image] |
| Inpainted MIDI (Synthesized) | [audio] | [audio] | [audio] | [audio] |
| Output MIDI | [Download] | [Download] | [Download] | [Download] |
| Original Piano Roll | [image] | [image] | [image] | [image] |
| Original MIDI (Synthesized) | [audio] | [audio] | [audio] | [audio] |

Back to Top

Unconditional Generation¶

In this section, we use a tensor of $-1$ with the same shape as $x_\text{spec}$ as the condition. Our model generates new piano rolls, which can be considered a form of music generation. There is room for improvement, but we leave this for future work.
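A minimal sketch of this setup: the condition is an all-$-1$ tensor, and a posteriorgram is drawn with a standard DDPM ancestral sampling loop. The noise schedule is illustrative (not DiffRoll's actual hyperparameters), and `dummy_model` is a stand-in for the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative linear beta schedule (not the paper's actual hyperparameters)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def dummy_model(x_t, t, condition):
    """Stand-in for the trained network: predicts the noise in x_t."""
    return np.zeros_like(x_t)  # a real model would return its noise estimate

frames, keys, bins = 640, 88, 229
condition = -np.ones((frames, bins))   # all -1: unconditional generation

# standard DDPM ancestral sampling loop
x_t = rng.standard_normal((frames, keys))
for t in reversed(range(T)):
    eps_hat = dummy_model(x_t, t, condition)
    mean = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    x_t = mean + np.sqrt(betas[t]) * noise

print(x_t.shape)  # generated posteriorgram, same shape as a piano roll
```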

[Animation: sampling trajectory for generation]
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| --- | --- | --- | --- | --- |
| Posteriorgrams | [image] | [image] | [image] | [image] |
| Audio | [audio] | [audio] | [audio] | [audio] |
| MIDI | [Download] | [Download] | [Download] | [Download] |

Back to Top

L2 Loss¶

We want to show that minimizing the L2 loss between the model-predicted posteriorgram $\hat{x}_0$ and the ground-truth piano roll $x_\text{roll}$ is equivalent to the objective of the original DDPM paper, i.e. the loss between the model-predicted noise $\hat{\epsilon}$ and the noise $\epsilon$ sampled from a Gaussian distribution.

We start with the loss between $\epsilon$ and $\hat{\epsilon}$ that is commonly used in diffusion models.

$$L =\|\epsilon-\hat{\epsilon}\|^2\label{A}\tag{A}$$

From Eq.(1) of our paper, we have the following two relationships. \begin{align} \epsilon&=\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}x_\text{roll}\right)\label{B1}\tag{B1} \\ \hat{\epsilon}&=\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}\hat{x}_0\right)\label{B2}\tag{B2} \end{align}

Plugging (\ref{B1}) and (\ref{B2}) into (\ref{A}) yields the following (note that the scalar factor comes out of the squared norm squared):

\begin{align} L &=\left\|\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}x_\text{roll}\right)-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}\hat{x}_0\right)\right\|^2 \\ &=\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}\left\| -x_\text{roll} + \hat{x}_0 \right\|^2 \\ &=\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}\left\| x_\text{roll} - \hat{x}_0 \right\|^2 \end{align}

Therefore, minimizing the L2 loss between $\hat{x}_0$ and $x_\text{roll}$ is equivalent to minimizing the L2 loss between $\hat{\epsilon}$ and $\epsilon$, up to the constant factor $\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}$.
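The equivalence can also be checked numerically. The sketch below draws random tensors for $x_\text{roll}$, $\hat{x}_0$, and $\epsilon$, builds $x_t$ via the forward process, recovers the noises via (B1) and (B2), and verifies that the two losses differ by exactly $\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}$ (the shapes and the value of $\bar{\alpha}_t$ are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar = 0.7                        # any value in (0, 1)

x_roll = rng.random((88, 100))         # ground-truth piano roll
x0_hat = rng.random((88, 100))         # model-predicted posteriorgram
eps = rng.standard_normal((88, 100))   # sampled Gaussian noise

# forward process, Eq.(1): x_t from x_roll and eps
x_t = np.sqrt(alpha_bar) * x_roll + np.sqrt(1 - alpha_bar) * eps

# invert Eq.(1) for the true and predicted noise, as in (B1) and (B2)
eps_true = (x_t - np.sqrt(alpha_bar) * x_roll) / np.sqrt(1 - alpha_bar)
eps_hat = (x_t - np.sqrt(alpha_bar) * x0_hat) / np.sqrt(1 - alpha_bar)

lhs = np.sum((eps_true - eps_hat) ** 2)
rhs = alpha_bar / (1 - alpha_bar) * np.sum((x_roll - x0_hat) ** 2)
print(np.allclose(lhs, rhs))  # True: the losses differ by a constant factor
```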