
Deep Generative Modeling


PaGoDA

[arXiv]

A 64x64 pre-trained diffusion model is all you need for 1-step high-resolution SOTA generation

NeurIPS24

CTM

[arXiv] [demo]

A unified framework enabling diverse samplers and SOTA 1-step generation

ICLR24

SAN

[arXiv] [code] [demo]

Enhancing GANs with metrizable discriminators

ICLR24

Applications:
[Vocoder]

MPGD

[arXiv] [demo]

A fast, efficient, training-free, and controllable diffusion-based generation method

ICLR24

HQ-VAE

[OpenReview] [arXiv]

Generalizing hierarchical VQ-VAEs with a Bayesian framework

TMLR

FP-Diffusion

[PMLR] [code]

Improving the density estimation of diffusion models

ICML23

GibbsDDRM

[PMLR] [code]

Achieving blind inversion using DDPMs

ICML23

Applications:
[DeReverb] [SpeechEnhance]

Consistency-type Models

[arXiv]

A theoretically unified framework for "consistency" in diffusion models

ICML23 SPIGM Workshop

SQ-VAE

[PMLR] [arXiv] [code]

Improving codebook utilization and training stability

ICML22

AR-ELBO

[Elsevier] [arXiv]

Mitigating oversmoothing in VAEs

Neurocomputing

Multimodal NLP


DiffuCOMET

[ACL] [arXiv] [code]

DiffuCOMET: Contextual Commonsense Knowledge Diffusion

ACL24

CyCLIPs/CyCLAPs

[ACL] [arXiv]

On the Language Encoder of Contrastive Cross-modal Models

ACL24

DIIR

[ACL] [arXiv] [code]

Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning

ACL24

PeaCok

[ACL] [arXiv] [code]

PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives
(Outstanding Paper Award)

ACL23

ComFact

[EMNLP] [arXiv] [code]

ComFact: A Benchmark for Linking Contextual Commonsense Knowledge

EMNLP22 Findings

Music Technologies


Mixing Graph Estimation

[arXiv] [code] [demo]

Searching For Music Mixing Graphs: A Pruning Approach

DAFx24

Guitar Amp. Modeling

[arXiv]

Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data

DAFx24

Text-to-Music Editing

[arXiv] [code] [demo]

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

IJCAI24

Instr.-Agnostic Trans.

[IEEE] [arXiv]

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

ICASSP24

Vocal Restoration

[IEEE] [arXiv] [demo]

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

ICASSP24

CLIPSep

[OpenReview] [arXiv] [code] [demo]

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

ICLR23

hFT-Transformer

[arXiv] [code]

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

ISMIR23

Automatic Music Tagging

[arXiv]

An Attention-based Approach To Hierarchical Multi-label Music Instrument Classification

ICASSP23

Vocal Dereverberation

[arXiv] [demo]

Unsupervised Vocal Dereverberation with Diffusion-based Generative Models

ICASSP23

Mixing Style Transfer

[arXiv] [code] [demo]

Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

ICASSP23

Music Transcription

[arXiv] [code] [demo]

DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

ICASSP23

Singing Voice Vocoder

[arXiv] [demo]

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

ICASSP23

Distortion Effect Removal

[poster] [arXiv] [demo]

Distortion Audio Effects: Learning How to Recover the Clean Signal

ISMIR22

Automatic Music Mixing

[poster] [arXiv] [code] [demo]

Automatic Music Mixing with Deep Learning and Out-of-Domain Data

ISMIR22

Sound Separation

[IEEE]

Music Source Separation with Deep Equilibrium Models

ICASSP22

Automatic DJ Transition

[arXiv] [code] [demo]

Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks

ICASSP22

Singing Voice Conversion

[arXiv] [demo]

Robust One-Shot Singing Voice Conversion

Sound Separation

[video] [site]

Glenn Gould and Kanji Ishimaru 2021: A collaboration with AI Sound Separation after 60 years

Cinematic Technologies


GenWarp

[arXiv] [demo]

GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

NeurIPS24

Acoustic Inv. Rendering

[CVF] [arXiv] [dataset] [code] [demo]

Hearing Anything Anywhere

CVPR24

STARSS23

[arXiv] [dataset]

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

NeurIPS23

BigVSAN Vocoder

[arXiv] [code] [demo]

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

ICASSP24

Zero-/Few-shot SELD

[IEEE] [arXiv]

Zero- and Few-shot Sound Event Localization and Detection

ICASSP24

Audio Restoration: ViT-AE

[IEEE] [arXiv] [demo]

Extending Audio Masked Autoencoders Toward Audio Restoration

WASPAA23

Diffiner

[ISCA] [arXiv] [code]

Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

INTERSPEECH23

Sound Event Localization and Detection

[IEEE] [arXiv]

Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training

ICASSP22

Hosted Challenges


SVG Challenge 2024

[SVG Challenge 2024]

Sounding Video Generation Challenge 2024

DCASE Challenge Task 3

[DCASE Challenge2023]

Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes

CPD Challenge 2023

[CPD Challenge 2023]

Commonsense Persona-grounded Dialogue Challenge

SDX Challenge 2023

[site] [paper (music)] [paper (cinematic)]

Sound Demixing Challenge 2023

MDX Challenge 2021

[site] [frontiers]

Music Demixing Challenge 2021

Contact

Yuki Mitsufuji (yuhki.mitsufuji@sony.com)