
Deep Generative Modeling


PaGoDA

[arXiv]

A 64x64 pre-trained diffusion model is all you need for 1-step high-resolution SOTA generation

NeurIPS24

CTM

[arXiv] [demo]

A unified framework enabling diverse samplers and SOTA 1-step generation

ICLR24

SAN

[arXiv] [code] [demo]

Enhancing GANs with metrizable discriminators

ICLR24

Applications:
[Vocoder]

MPGD

[arXiv] [demo]

A fast, efficient, training-free, and controllable diffusion-based generation method

ICLR24

HQ-VAE

[OpenReview] [arXiv]

Generalizing hierarchical VQ-VAEs with a Bayesian framework

TMLR

FP-Diffusion

[PMLR] [code]

Improving the density estimation of diffusion models

ICML23

GibbsDDRM

[PMLR] [code]

Achieving blind inversion using DDPMs

ICML23

Applications:
[DeReverb] [SpeechEnhance]

Consistency-type Models

[arXiv]

A theoretically unified framework for "consistency" in diffusion models

ICML23 SPIGM Workshop

SQ-VAE

[PMLR] [arXiv] [code]

Improving codebook utilization and training stability

ICML22

AR-ELBO

[Elsevier] [arXiv]

Mitigating oversmoothing in VAEs

Neurocomputing

Multimodal NLP


DiffuCOMET

[ACL] [arXiv] [code]

DiffuCOMET: Contextual Commonsense Knowledge Diffusion

ACL24

CyCLIPs/CyCLAPs

[ACL] [arXiv]

On the Language Encoder of Contrastive Cross-modal Models

ACL24

DIIR

[ACL] [arXiv] [code]

Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning

ACL24

PeaCok

[ACL] [arXiv] [code]

PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives
(Outstanding Paper Award)

ACL23

ComFact

[EMNLP] [arXiv] [code]

ComFact: A Benchmark for Linking Contextual Commonsense Knowledge

EMNLP22 Findings

Music Technologies


Mixing Graph Estimation

[arXiv] [code] [demo]

Searching For Music Mixing Graphs: A Pruning Approach

DAFx24

Guitar Amp. Modeling

[arXiv]

Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data

DAFx24

Text-to-Music Editing

[arXiv] [code] [demo]

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

IJCAI24

Instr.-Agnostic Trans.

[IEEE] [arXiv]

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

ICASSP24

Vocal Restoration

[IEEE] [arXiv] [demo]

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

ICASSP24

CLIPSep

[OpenReview] [arXiv] [code] [demo]

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

ICLR23

hFT-Transformer

[arXiv] [code]

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

ISMIR23

Automatic Music Tagging

[arXiv]

An Attention-based Approach To Hierarchical Multi-label Music Instrument Classification

ICASSP23

Vocal Dereverberation

[arXiv] [demo]

Unsupervised Vocal Dereverberation with Diffusion-based Generative Models

ICASSP23

Mixing Style Transfer

[arXiv] [code] [demo]

Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

ICASSP23

Music Transcription

[arXiv] [code] [demo]

DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

ICASSP23

Singing Voice Vocoder

[arXiv] [demo]

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

ICASSP23

Distortion Effect Removal

[poster] [arXiv] [demo]

Distortion Audio Effects: Learning How to Recover the Clean Signal

ISMIR22

Automatic Music Mixing

[poster] [arXiv] [code] [demo]

Automatic Music Mixing with Deep Learning and Out-of-Domain Data

ISMIR22

Sound Separation

[IEEE]

Music Source Separation with Deep Equilibrium Models

ICASSP22

Automatic DJ Transition

[arXiv] [code] [demo]

Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks

ICASSP22

Singing Voice Conversion

[arXiv] [demo]

Robust One-Shot Singing Voice Conversion

Sound Separation

[video] [site]

Glenn Gould and Kanji Ishimaru 2021: A collaboration with AI Sound Separation after 60 years

Cinematic Technologies


GenWarp

[arXiv] [demo]

GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

NeurIPS24

Acoustic Inv. Rendering

[CVF] [arXiv] [dataset] [code] [demo]

Hearing Anything Anywhere

CVPR24

STARSS23

[arXiv] [dataset]

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

NeurIPS23

BigVSAN Vocoder

[arXiv] [code] [demo]

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

ICASSP24

Zero-/Few-shot SELD

[IEEE] [arXiv]

Zero- and Few-shot Sound Event Localization and Detection

ICASSP24

Audio Restoration: ViT-AE

[IEEE] [arXiv] [demo]

Extending Audio Masked Autoencoders Toward Audio Restoration

WASPAA23

Diffiner

[ISCA] [arXiv] [code]

Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

INTERSPEECH23

Sound Event Localization and Detection

[IEEE] [arXiv]

Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training

ICASSP22

Hosted Challenges


SVG Challenge 2024

[SVG Challenge 2024]

Sounding Video Generation Challenge 2024

DCASE Challenge Task 3

[DCASE Challenge2023]

Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes

CPD Challenge 2023

[CPD Challenge 2023]

Commonsense Persona-grounded Dialogue Challenge

SDX Challenge 2023

[site] [paper (music)] [paper (cinematic)]

Sound Demixing Challenge 2023

MDX Challenge 2021

[site] [frontiers]

Music Demixing Challenge 2021

Contact

Yuki Mitsufuji (yuhki.mitsufuji@sony.com)