DGMO: Training-Free Audio Source Separation
through Diffusion-Guided Mask Optimization

Interspeech 2025
Korea University
* Equal contribution

Abstract

Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing naive adaptations, we identify key limitations arising from modality-specific challenges. To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the application of diffusion models beyond generation, establishing a new paradigm for zero-shot audio separation.

Method

Figure: Overview of the DGMO framework.

We present Diffusion-Guided Mask Optimization (DGMO), a training-free framework for language-queried audio source separation (LASS). As illustrated above, DGMO operates in two stages: reference generation and mask optimization. First, given a natural language query and an input mixture, we use a pretrained text-to-audio diffusion model with DDIM inversion to synthesize reference signals that reflect the semantic content of the query while preserving the structure of the mixture. These references then supervise the optimization of a spectrogram mask, applied directly in the magnitude domain so that the separated output stays aligned with the input signal. The loss is defined in mel-spectrogram space to match the domain in which the diffusion model operates, allowing gradient-based optimization. This hybrid approach unifies generative priors with explicit masking, enabling zero-shot, input-consistent LASS without any task-specific training.
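
To make the second stage concrete, below is a minimal PyTorch sketch of the mask-optimization loop. It is illustrative rather than the authors' implementation: it assumes stage 1 has already produced a reference mel spectrogram (reference_mel) via the diffusion model, and that to_mel (e.g., torchaudio's MelScale) projects a magnitude spectrogram into the same mel domain; the function name, tensor shapes, and hyperparameters are all hypothetical.

import torch
import torchaudio

def optimize_mask(mixture_mag, reference_mel, to_mel, steps=200, lr=1e-2):
    # Learnable mask over the mixture's magnitude spectrogram.
    # Logits start at 0, i.e. the mask starts at 0.5 everywhere.
    mask_logits = torch.zeros_like(mixture_mag, requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=lr)

    for _ in range(steps):
        mask = torch.sigmoid(mask_logits)      # constrain mask to [0, 1]
        est_mag = mask * mixture_mag           # masked magnitude estimate
        est_mel = to_mel(est_mag)              # project into the mel domain
        # Match the masked estimate to the diffusion-generated reference.
        loss = torch.nn.functional.mse_loss(est_mel, reference_mel)
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_logits).detach()

# Example with placeholder tensors (shapes are illustrative):
# mixture_mag is a (freq, time) magnitude STFT; n_stft = n_fft // 2 + 1.
mel = torchaudio.transforms.MelScale(n_mels=64, sample_rate=16000, n_stft=513)
mixture_mag = torch.rand(513, 400)
reference_mel = mel(torch.rand(513, 400))
mask = optimize_mask(mixture_mag, reference_mel, mel)
separated_mag = mask * mixture_mag  # recombine with the mixture phase for iSTFT

Because the mask acts on the mixture's own magnitudes and the mixture phase is reused at synthesis time, the output cannot drift into free-form generation, which is what keeps the separation input-consistent.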

Separation Results

Each example pairs a text query with the input mixture, the separated output, and the ground truth:

The dog barking
Acoustic guitar
Ice cracking
Playing tennis

BibTeX