DAGM German Conference on Pattern Recognition (GCPR) 2025

Ilpo Viertola

Tampere University

Vladimir Iashin

University of Oxford

Esa Rahtu

Tampere University

https://github.com/ilpoviertola/SAGANet

Abstract

Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models attend to the entire video and offer no precise way to prioritize a specific object within a scene, often generating unnecessary background sounds or focusing on the wrong objects. To address this gap, we introduce the novel task of Video Object Segmentation-aware Audio Generation, which explicitly conditions sound synthesis on object-level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine-grained, visually localized control over audio generation. To support this task and further research on segmentation-aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements and sets a new standard for controllable, high-fidelity Foley synthesis.

Segmentation Aware Generative Audio Network (SAGANet)

Figure 1: Overview of SAGANet.

Given a video and its corresponding segmentation masks, SAGANet combines a global and a local information stream (the Focal Prompt). Gated Cross-Attention [1, 2] layers, with weights shared across both branches, fuse the global and local features extracted by Synchformer [3]. Only the layers highlighted in orange are updated during training. The final audio is generated following the same procedure as in the base MMAudio model; for additional details on MMAudio, refer to [4].
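For intuition, the sketch below illustrates a Flamingo-style gated cross-attention fusion in the spirit of [1, 2], where a zero-initialized tanh gate lets the pretrained stream pass through unchanged at the start of training. The layer sizes, module names, and single-layer structure are illustrative assumptions, not the exact SAGANet configuration.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention block (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Gates start at zero, so tanh(gate) = 0 and the block is initially an identity.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (B, N, D) tokens of one stream (e.g. global Synchformer features)
        # context: (B, M, D) tokens of the other stream (e.g. mask-focused local features)
        attn_out, _ = self.attn(self.norm(x), context, context, need_weights=False)
        x = x + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x


# Hypothetical usage: the same module (shared weights) fuses both branches.
fuse = GatedCrossAttention(dim=768)
global_feats = torch.randn(2, 64, 768)   # global video tokens
local_feats = torch.randn(2, 64, 768)    # tokens from the mask-focused crop
fused_global = fuse(global_feats, local_feats)
fused_local = fuse(local_feats, global_feats)
```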

Results

Figure 2: SAGANet generates accurate audio for the target object.

In Fig. 2, the ground-truth audio is a separately recorded track for the segmented (red mask) object. Given the visual stream, the segmentation mask, and the label of the segmented instrument, SAGANet concentrates on the segmented object. MMAudio [4], conditioned only on the textual label and the visual stream, fails to focus on the target object. $\dagger$ denotes LoRA-finetuned ViT blocks associated with the segmentation-aware visual features.
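The segmentation mask determines what the local stream looks at. One plausible way to build such a mask-focused crop is sketched below; the cropping strategy, padding ratio, and output size are assumptions for illustration, not necessarily the exact Focal Prompt construction used by SAGANet.

```python
import torch
import torch.nn.functional as F


def focal_crop(frames: torch.Tensor, masks: torch.Tensor,
               out_size: int = 224, pad_ratio: float = 0.1) -> torch.Tensor:
    """frames: (T, 3, H, W) video; masks: (T, H, W) binary masks of the target object."""
    crops = []
    for frame, mask in zip(frames, masks):
        ys, xs = torch.nonzero(mask, as_tuple=True)
        if ys.numel() == 0:
            crop = frame                      # object not visible: fall back to the full frame
        else:
            h, w = mask.shape
            y0, y1 = int(ys.min()), int(ys.max()) + 1
            x0, x1 = int(xs.min()), int(xs.max()) + 1
            pad_y, pad_x = int(pad_ratio * (y1 - y0)), int(pad_ratio * (x1 - x0))
            y0, y1 = max(y0 - pad_y, 0), min(y1 + pad_y, h)
            x0, x1 = max(x0 - pad_x, 0), min(x1 + pad_x, w)
            crop = frame[:, y0:y1, x0:x1]     # crop around the mask's padded bounding box
        crops.append(F.interpolate(crop[None], size=(out_size, out_size),
                                   mode="bilinear", align_corners=False)[0])
    return torch.stack(crops)                 # (T, 3, out_size, out_size)
```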

Model | FD$_{PaSST}\downarrow$ | FD$_{PANNs}\downarrow$ | FD$_{VGG}\downarrow$ | KL$_{PANNs}\downarrow$ | KL$_{PaSST}\downarrow$
MMAudio [4] | 530.60 | 23.83 | 13.26 | 1.17 | 1.00
SAGANet | 364.07 | 21.31 | 17.19 | 0.79 | 0.59
SAGANet$^\dagger$ | 330.01 | 17.88 | 19.36 | 0.69 | 0.56

Model | IS$\uparrow$ | IB-score$\uparrow$ | DeSync$\downarrow$
MMAudio [4] | 2.24 | 35.94 | 0.96
SAGANet | 3.34 | 39.62 | 0.44
SAGANet$^\dagger$ | 3.27 | 41.50 | 0.30

Table 1: Main results.
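For reference, the FD columns report a Fréchet distance between the embedding distributions of generated and reference audio under the named embedding network (PaSST, PANNs, VGGish). A minimal sketch of the standard computation, assuming the embeddings have already been extracted:

```python
import numpy as np
from scipy import linalg


def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """real_emb, gen_emb: (N, D) embedding matrices from the same embedding model."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                # drop tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```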

Main results are presented in Tab. 1. We use the MMAudio-S-44.1kHz model [4] in our experiments. Adding segmentation-aware visual features guides audio generation to focus on the correct target instrument, which is not achievable with a target label alone. Finetuning the ViT layers associated with the segmentation-aware features improves performance further by helping the generative model adapt to these new features.
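As a rough illustration of the $\dagger$ variant, the sketch below wraps the attention projections of ViT blocks with LoRA adapters so that only the low-rank matrices are trained while the pretrained weights stay frozen. The rank, scaling, and choice of wrapped layers are assumptions, not the exact SAGANet recipe.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # zero-init B so the update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_lora(vit: nn.Module, rank: int = 8) -> None:
    """Wrap the attention projections of a timm-style ViT (hypothetical layout)."""
    for block in vit.blocks:
        block.attn.qkv = LoRALinear(block.attn.qkv, rank)
        block.attn.proj = LoRALinear(block.attn.proj, rank)
```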
