DGMO: Training-Free Audio Source Separation
through Diffusion-Guided Mask Optimization

Interspeech 2025
Korea University
* Equal contribution

Abstract

Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing naive adaptations, we identify key limitations arising from modality-specific challenges. To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the application of diffusion models beyond generation, establishing a new paradigm for zero-shot audio separation.

Method

Figure: Overview of the DGMO framework.

We present Diffusion-Guided Mask Optimization (DGMO), a training-free framework for language-queried audio source separation (LASS). As illustrated above, DGMO operates in two stages: reference generation and mask optimization. First, given a natural language query and an input mixture, we use a pretrained text-to-audio diffusion model with DDIM inversion to synthesize reference signals that reflect the semantic content of the query while preserving the structure of the mixture. These references then supervise the optimization of a spectrogram mask, applied directly in the magnitude domain so that the separated output stays aligned with the input signal. The loss is defined in mel-spectrogram space to match the domain in which the diffusion model operates, allowing gradient-based optimization. This hybrid approach unifies generative priors with explicit masking, enabling zero-shot, input-consistent LASS without any task-specific training.
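
To make the second stage concrete, below is a minimal PyTorch sketch of the mask-optimization loop. It is illustrative rather than the authors' implementation: it assumes stage 1 has already produced a reference mel spectrogram (reference_mel) via the diffusion model, and that to_mel (e.g., torchaudio's MelScale) projects a magnitude spectrogram into the same mel domain; the function name, tensor shapes, and hyperparameters are all hypothetical.

import torch
import torchaudio

def optimize_mask(mixture_mag, reference_mel, to_mel, steps=200, lr=1e-2):
    # Learnable mask over the mixture's magnitude spectrogram.
    # Logits start at 0, i.e. the mask starts at 0.5 everywhere.
    mask_logits = torch.zeros_like(mixture_mag, requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=lr)

    for _ in range(steps):
        mask = torch.sigmoid(mask_logits)      # constrain mask to [0, 1]
        est_mag = mask * mixture_mag           # masked magnitude estimate
        est_mel = to_mel(est_mag)              # project into the mel domain
        # Match the masked estimate to the diffusion-generated reference.
        loss = torch.nn.functional.mse_loss(est_mel, reference_mel)
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_logits).detach()

# Example with placeholder tensors (shapes are illustrative):
# mixture_mag is a (freq, time) magnitude STFT; n_stft = n_fft // 2 + 1.
mel = torchaudio.transforms.MelScale(n_mels=64, sample_rate=16000, n_stft=513)
mixture_mag = torch.rand(513, 400)
reference_mel = mel(torch.rand(513, 400))
mask = optimize_mask(mixture_mag, reference_mel, mel)
separated_mag = mask * mixture_mag  # recombine with the mixture phase for iSTFT

Because the mask acts on the mixture's own magnitudes and the mixture phase is reused at synthesis time, the output cannot drift into free-form generation, which is what keeps the separation input-consistent.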

Separation Results

Each example pairs a text query with the input mixture, the separated output, and the ground truth:

The dog barking
Acoustic guitar
Ice cracking
Playing tennis

BibTeX