Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing naive adaptations, we identify key limitations arising from modality-specific challenges. To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the application of diffusion models beyond generation, establishing a new paradigm for zero-shot audio separation.
We present Diffusion-Guided Mask Optimization (DGMO), a novel training-free framework for language-queried audio source separation (LASS). As illustrated above, DGMO operates in two stages: reference generation and mask optimization. First, given a natural language query and an input mixture, we use a pretrained text-to-audio diffusion model with DDIM inversion to synthesize reference signals that reflect the semantic content of the query while preserving the structure of the mixture. These references are then used to supervise the optimization of a spectrogram mask, applied directly to the magnitude domain to ensure alignment with the input signal. The loss is defined in the mel-spectrogram space to match the diffusion model's conditioning domain, allowing gradient-based optimization. This hybrid approach unifies generative priors and explicit masking to enable zero-shot, input-consistent LASS without any task-specific training.
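The mask optimization stage described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The reference mel-spectrograms are synthetic stand-ins for the diffusion outputs (no diffusion model or DDIM inversion is run). The filterbank is a hypothetical set of linearly spaced triangles rather than a true mel filterbank. The sigmoid mask parameterization, the squared-error loss, plain gradient descent, and all names (`triangular_filterbank`, `optimize_mask`, `lr`, `steps`) are illustrative choices.

```python
import numpy as np

def triangular_filterbank(n_bands, n_freq):
    """Hypothetical stand-in for a mel filterbank: linearly spaced triangles."""
    centers = np.linspace(0, n_freq - 1, n_bands + 2)
    bins = np.arange(n_freq)
    fb = np.zeros((n_bands, n_freq))
    for i in range(n_bands):
        left, mid, right = centers[i], centers[i + 1], centers[i + 2]
        rising = (bins - left) / (mid - left)
        falling = (right - bins) / (right - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def optimize_mask(mix_mag, ref_mels, fb, steps=300, lr=100.0):
    """Fit a [0, 1] mask on the mixture magnitude so that the filterbank
    projection of the masked mixture matches the references on average."""
    theta = np.zeros_like(mix_mag)            # mask logits; sigmoid(0) = 0.5
    losses = []
    for _ in range(steps):
        mask = 1.0 / (1.0 + np.exp(-theta))
        est_mel = fb @ (mask * mix_mag)       # masked mixture in the band domain
        resid = est_mel[None] - ref_mels      # shape (K, n_bands, T)
        losses.append(float(np.mean(resid ** 2)))
        # Analytic gradient of the mean squared error w.r.t. the logits
        grad_mel = 2.0 * resid.mean(axis=0) / resid[0].size
        grad_masked = fb.T @ grad_mel
        theta -= lr * grad_masked * mix_mag * mask * (1.0 - mask)
    return 1.0 / (1.0 + np.exp(-theta)), losses

# Toy data: target energy in low bins, interferer energy in high bins.
# Magnitudes are simply added here, which ignores phase interactions.
rng = np.random.default_rng(0)
n_freq, n_frames, n_bands = 64, 32, 16
bins = np.arange(n_freq)[:, None]
target_mag = rng.random((n_freq, n_frames)) * (bins < 24)
interferer_mag = rng.random((n_freq, n_frames)) * (bins >= 40)
mix_mag = target_mag + interferer_mag

fb = triangular_filterbank(n_bands, n_freq)
# Three noisy copies of the target's band energies play the role of references
clean_ref = fb @ target_mag
ref_mels = np.stack([
    np.clip(clean_ref * (1.0 + 0.1 * rng.standard_normal(clean_ref.shape)), 0.0, None)
    for _ in range(3)
])

mask, losses = optimize_mask(mix_mag, ref_mels, fb)
separated_mag = mask * mix_mag                # estimate stays tied to the input
```

Because the estimate is always `mask * mix_mag` with mask values in [0, 1], it cannot contain energy that is absent from the mixture, which is the sense in which masking keeps the separation input-aligned. In a full pipeline the separated magnitude would be combined with the mixture phase and inverted back to a waveform; that step is omitted here.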
Text Query | Mixture | Separated | Ground Truth
---|---|---|---
The dog barking | ![]() | ![]() | ![]()
acoustic guitar | ![]() | ![]() | ![]()
ice cracking | ![]() | ![]() | ![]()
playing tennis | ![]() | ![]() | ![]()