This demo page is for the paper **DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability**.

**Source code:** https://github.com/sony/DiffRoll

In this section, we use the spectrograms $x_\text{spec}$ as the condition of our model. In this setting the model transcribes the spectrogram, which can also be viewed as conditional generation.

For the transcription, we convert the output posteriorgrams into piano rolls and then export the piano rolls as MIDI files.
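The conversion step can be sketched as follows. This is a minimal illustration rather than the code used for this page: the threshold of 0.5, the frame duration, the `(frames, pitches)` layout, and the lowest pitch of 21 (A0 on an 88-key piano) are all assumptions, and the function name `posteriorgram_to_notes` is hypothetical. It binarizes a posteriorgram into a piano roll and lists the note events that a MIDI writer would need.

```python
import numpy as np

def posteriorgram_to_notes(posteriorgram, threshold=0.5,
                           frame_seconds=0.032, lowest_pitch=21):
    """Binarize a (frames, pitches) posteriorgram and list its notes.

    All default values are illustrative assumptions, not values taken
    from the DiffRoll code base.
    """
    roll = posteriorgram >= threshold  # boolean piano roll
    notes = []
    for pitch_idx in range(roll.shape[1]):
        # Pad with zeros so notes touching either edge still get an on/off pair.
        active = np.concatenate(([0], roll[:, pitch_idx].astype(int), [0]))
        changes = np.diff(active)
        onsets = np.flatnonzero(changes == 1)    # frame where the note starts
        offsets = np.flatnonzero(changes == -1)  # first frame after the note
        for on, off in zip(onsets, offsets):
            notes.append((lowest_pitch + pitch_idx,
                          float(on * frame_seconds),
                          float(off * frame_seconds)))
    return roll, notes

# Toy example: two notes plus one weak activation that is thresholded away.
demo = np.zeros((6, 3))
demo[1:4, 0] = 0.9
demo[4:6, 2] = 0.7
demo[0, 1] = 0.2
demo_roll, demo_notes = posteriorgram_to_notes(demo, frame_seconds=0.5)
```

Each tuple in `demo_notes` holds a MIDI pitch number, an onset time, and an offset time in seconds, which can then be handed to any MIDI library for export.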

FL Studio is used to synthesize audio from the MIDI files.

| File Name | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| --- | --- | --- | --- | --- |
| Sampling Trajectory | | | | |
| Input Audio (Condition) | | | | |
| Transcribed Posteriorgram | | | | |
| Transcribed MIDI (Synthesized) | | | | |
| Output MIDI | Download | Download | Download | Download |
| Ground Truth Piano Roll | | | | |
| Ground Truth MIDI | | | | |

In this section, we mask part of the spectrograms $x_\text{spec}$ with $-1$. The model then generates music for the masked part and transcribes the rest of the spectrogram. There is room for improvement, which we leave as future work.
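As a rough illustration of the masking step, the sketch below overwrites a range of frames with $-1$. It assumes a `(frames, bins)` spectrogram layout and masking along the time axis; the function name `mask_spectrogram` and the toy shapes are hypothetical and not taken from the DiffRoll code.

```python
import numpy as np

def mask_spectrogram(x_spec, start_frame, end_frame, mask_value=-1.0):
    """Return a copy of x_spec with frames [start_frame, end_frame) masked.

    Assumes x_spec has shape (frames, bins); this layout is an assumption.
    """
    masked = x_spec.copy()  # leave the original spectrogram untouched
    masked[start_frame:end_frame, :] = mask_value
    return masked

# Toy example: mask the middle half of a random 8-frame "spectrogram".
rng = np.random.default_rng(0)
x_spec = rng.random((8, 4))
x_masked = mask_spectrogram(x_spec, start_frame=2, end_frame=6)
```

The unmasked frames still carry the audio content to be transcribed, while the frames filled with $-1$ mark the region the model is asked to inpaint.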

| File Name | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| --- | --- | --- | --- | --- |
| Input Audio (Condition) | | | | |
| Inpainted Posteriorgram | | | | |
| Inpainted MIDI (Synthesized) | | | | |
| Output MIDI | Download | Download | Download | Download |
| Original Piano Roll | | | | |
| Original MIDI (Synthesized) | | | | |

In this section, we use a tensor of $-1$ with the same shape as $x_\text{spec}$ as the condition. The model then generates new piano rolls, which can be considered a form of music generation. There is room for improvement, which we leave as future work.
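A minimal sketch of building such a condition is shown below. The spectrogram shape is a made-up example and `make_generation_condition` is a hypothetical helper, not part of the DiffRoll code.

```python
import numpy as np

def make_generation_condition(x_spec):
    """Build a condition filled with -1 that matches x_spec in shape and dtype."""
    return np.full_like(x_spec, -1.0)

# Hypothetical spectrogram shape, used only to show the shapes involved.
x_spec = np.zeros((640, 229), dtype=np.float32)
condition = make_generation_condition(x_spec)
```

Because the condition carries no audio information, whatever the model outputs in this setting comes from what it has learned about piano rolls rather than from an input recording.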

| | | | | |
| --- | --- | --- | --- | --- |
| Sampling Trajectory | | | | |
| Posteriorgrams | | | | |
| Audio | | | | |
| MIDI | Download | Download | Download | Download |

We want to show that minimizing the L2 loss between the model-predicted posteriorgram $\hat{x}_0$ and the ground truth piano roll $x_\text{roll}$ is equivalent to the objective of the original DDPM paper, which minimizes the L2 loss between the model-predicted noise $\hat{\epsilon}$ and the noise $\epsilon$ sampled from a Gaussian distribution.

We start with the loss between $\epsilon$ and $\hat{\epsilon}$ that is commonly used in diffusion models.

$$L =\|\epsilon-\hat{\epsilon}\|^2\label{A}\tag{A}$$

From Eq.(1) of our paper, we have the following two relationships.

\begin{align}
\epsilon&=\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}x_\text{roll}\right)\label{B1}\tag{B1} \\
\hat{\epsilon}&=\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}\hat{x}_0\right)\label{B2}\tag{B2}
\end{align}

Plugging (\ref{B1}) and (\ref{B2}) into (\ref{A}) yields the following. The $x_t$ terms cancel, and the remaining scalar factor is squared when it is pulled out of the squared norm:

\begin{align}
L &=\left\|\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}x_\text{roll}\right)-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}\hat{x}_0\right)\right\|^2 \\
&=\left\|\sqrt{\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}}\left( -x_\text{roll} + \hat{x}_0 \right)\right\|^2 \\
&=\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}\left\| x_\text{roll} - \hat{x}_0 \right\|^2
\end{align}

Therefore, for a given timestep $t$, the L2 loss between $\hat{\epsilon}$ and $\epsilon$ equals the L2 loss between $\hat{x}_0$ and $x_\text{roll}$ scaled by the positive factor $\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}$. This factor does not depend on the model parameters, so minimizing one loss minimizes the other.
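The relationship between the two losses can be checked numerically. The sketch below is only a sanity check under assumed toy values: the array shapes, the value of $\bar{\alpha}_t$, and the random stand-in for the model prediction are all made up. It builds $x_t$ from $x_\text{roll}$ and $\epsilon$, derives $\hat{\epsilon}$ from a predicted $\hat{x}_0$, and compares the two squared-error losses, whose ratio comes out as $\bar{\alpha}_t/(1-\bar{\alpha}_t)$.

```python
import numpy as np

def loss_pair(x_roll, x0_hat, alpha_bar_t, rng):
    """Return (noise-space loss, roll-space loss) for one noisy sample."""
    eps = rng.standard_normal(x_roll.shape)
    # Forward diffusion: mix the clean piano roll with Gaussian noise.
    x_t = np.sqrt(alpha_bar_t) * x_roll + np.sqrt(1 - alpha_bar_t) * eps
    # Noise implied by the predicted x0, following the relation for eps-hat.
    eps_hat = (x_t - np.sqrt(alpha_bar_t) * x0_hat) / np.sqrt(1 - alpha_bar_t)
    noise_loss = np.sum((eps - eps_hat) ** 2)
    roll_loss = np.sum((x_roll - x0_hat) ** 2)
    return noise_loss, roll_loss

rng = np.random.default_rng(0)
x_roll = rng.integers(0, 2, size=(8, 88)).astype(float)  # toy binary piano roll
x0_hat = rng.random((8, 88))                             # stand-in for a model output
alpha_bar_t = 0.3                                        # arbitrary example value

noise_loss, roll_loss = loss_pair(x_roll, x0_hat, alpha_bar_t, rng)
ratio = noise_loss / roll_loss
expected = alpha_bar_t / (1 - alpha_bar_t)
print(ratio, expected)
```

The two printed numbers agree up to floating-point error for any choice of $\bar{\alpha}_t$ strictly between 0 and 1.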