This demo page is for the paper DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability.
Source code: https://github.com/sony/DiffRoll
In this section, we use the spectrograms $x_\text{spec}$ as the condition of our model. The model thus transcribes the spectrogram, but this can also be viewed as conditional generation.
For transcription, we convert the output posteriorgrams into piano rolls and then export the piano rolls as MIDI files.
FL Studio is used to synthesize audio from the MIDI files.
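The posteriorgram-to-piano-roll step above can be sketched as simple thresholding. This is a minimal illustration, not the exact procedure from the paper: the 0.5 threshold and the toy array shapes are assumptions.

```python
import numpy as np

def posteriorgram_to_roll(posteriorgram: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize a (frames, pitches) posteriorgram into a piano roll.

    Frames where the predicted note probability exceeds the threshold
    are marked active (1); all others are inactive (0).
    """
    return (posteriorgram > threshold).astype(np.int8)

# Toy posteriorgram: 4 frames x 3 pitches (real piano rolls use 88 pitches).
post = np.array([[0.9, 0.1, 0.4],
                 [0.8, 0.2, 0.6],
                 [0.3, 0.7, 0.5],
                 [0.1, 0.9, 0.2]])
roll = posteriorgram_to_roll(post)
```

The resulting binary roll can then be converted to MIDI with any piano-roll-to-MIDI utility.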
| File Name | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| --- | --- | --- | --- | --- |
| Sampling Trajectory | *(animation)* | *(animation)* | *(animation)* | *(animation)* |
| Input Audio (Condition) | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| Transcribed Posteriorgram | *(image)* | *(image)* | *(image)* | *(image)* |
| Transcribed MIDI (Synthesized) | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| Output MIDI | Download | Download | Download | Download |
| Ground Truth Piano Roll | *(image)* | *(image)* | *(image)* | *(image)* |
| Ground Truth MIDI | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
In this section, we mask part of the spectrogram $x_\text{spec}$ with $-1$. Our model then generates music for the masked region and transcribes the rest of the spectrogram. There is room for improvement, but we leave this as future work.
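The masking described above can be sketched as follows. This is a hypothetical illustration: the toy spectrogram shape and the masked frame range are arbitrary assumptions, and the real model operates on full-size spectrograms.

```python
import numpy as np

# Toy spectrogram: 6 time frames x 4 frequency bins.
x_spec = np.arange(24, dtype=np.float32).reshape(6, 4)

# Mask the middle frames with -1. The model is asked to generate
# (inpaint) notes for this region and transcribe the unmasked frames.
x_masked = x_spec.copy()
x_masked[2:4, :] = -1.0
```

Replacing the whole tensor with $-1$ in the same way recovers the unconditional-generation setting described in the next section.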
| File Name | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| --- | --- | --- | --- | --- |
| Input Audio (Condition) | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| Inpainted Posteriorgram | *(image)* | *(image)* | *(image)* | *(image)* |
| Inpainted MIDI (Synthesized) | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| Output MIDI | Download | Download | Download | Download |
| Original Piano Roll | *(image)* | *(image)* | *(image)* | *(image)* |
| Original MIDI (Synthesized) | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
In this section, we use a tensor of $-1$ with the same shape as $x_\text{spec}$ as the condition. Our model then generates new piano rolls, which can be considered a form of unconditional music generation. There is room for improvement, but we leave this as future work.
|  | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| --- | --- | --- | --- | --- |
| Sampling Trajectory | *(animation)* | *(animation)* | *(animation)* | *(animation)* |
| Posteriorgrams | *(image)* | *(image)* | *(image)* | *(image)* |
| Audio | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| MIDI | Download | Download | Download | Download |
We want to show that minimizing the L2 loss between the model-predicted posteriorgram $\hat{x}_0$ and the ground-truth piano roll $x_\text{roll}$ is equivalent to the objective in the original DDPM paper, namely the loss between the model-predicted noise $\hat{\epsilon}$ and the noise $\epsilon$ sampled from a Gaussian distribution.
We start with the loss between $\epsilon$ and $\hat{\epsilon}$ that is commonly used in diffusion models.
$$L =\|\epsilon-\hat{\epsilon}\|^2\label{A}\tag{A}$$From Eq.(1) of our paper, we have the following two relationships. \begin{align} \epsilon&=\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}x_\text{roll}\right)\label{B1}\tag{B1} \\ \hat{\epsilon}&=\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}\hat{x}_0\right)\label{B2}\tag{B2} \end{align}
Plugging (\ref{B1}) and (\ref{B2}) into (\ref{A}) yields the following:
\begin{align} L &=\left\|\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}x_\text{roll}\right)-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}\hat{x}_0\right)\right\|^2 \\ &=\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}\left\| -x_\text{roll} + \hat{x}_0 \right\|^2 \\ &=\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}\left\| x_\text{roll} - \hat{x}_0 \right\|^2 \end{align}Therefore, minimizing the L2 loss between $\hat{x}_0$ and $x_\text{roll}$ is equivalent to minimizing the L2 loss between $\hat{\epsilon}$ and $\epsilon$, up to the constant factor $\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}$.
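The derivation above can be checked numerically. The sketch below uses random tensors with arbitrary shapes and an arbitrary noise-schedule value $\bar{\alpha}_t$; it simply verifies that the ratio of the two losses equals the predicted constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary noise-schedule value and toy piano-roll-shaped tensors.
alpha_bar_t = 0.7
x_roll = rng.random((8, 88))            # ground-truth piano roll
x_hat = rng.random((8, 88))             # model-predicted posteriorgram
eps = rng.standard_normal((8, 88))      # sampled Gaussian noise

# Forward process (Eq. 1): noised sample x_t from x_roll and eps.
x_t = np.sqrt(alpha_bar_t) * x_roll + np.sqrt(1 - alpha_bar_t) * eps

# Implied noise prediction from x_hat via Eq. (B2).
eps_hat = (x_t - np.sqrt(alpha_bar_t) * x_hat) / np.sqrt(1 - alpha_bar_t)

loss_eps = np.sum((eps - eps_hat) ** 2)   # DDPM noise-prediction loss
loss_x0 = np.sum((x_roll - x_hat) ** 2)   # posteriorgram/piano-roll loss
ratio = loss_eps / loss_x0                # should be alpha_bar_t / (1 - alpha_bar_t)
```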