DAGM German Conference on Pattern Recognition (GCPR) 2025

Ilpo Viertola

Tampere University

Vladimir Iashin

University of Oxford

Esa Rahtu

Tampere University

https://github.com/ilpoviertola/SAGANet

Abstract

Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models attend to the entire video and offer no precise way to prioritize a specific object within a scene, often generating unnecessary background sounds or focusing on the wrong objects. To address this gap, we introduce the novel task of Video Object Segmentation-aware Audio Generation, which explicitly conditions sound synthesis on object-level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine-grained, visually localized control over audio generation. To support this task and further research on segmentation-aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements and sets a new standard for controllable, high-fidelity Foley synthesis.

Segmentation Aware Generative Audio Network (SAGANet)

Figure 1: Overview of SAGANet.

Given a video and its corresponding segmentation masks, SAGANet combines a global and a local information stream (the Focal Prompt). Gated Cross-Attention [1, 2] layers, with weights shared across both branches, fuse the global and local features extracted by Synchformer [3]. Only the layers highlighted in orange are updated during training. The final audio is generated following the same procedure as in the base MMAudio model; for additional details on MMAudio, refer to [4].
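For intuition, the sketch below illustrates a Flamingo-style gated cross-attention fusion in the spirit of [1, 2], where a zero-initialized tanh gate lets the pretrained stream pass through unchanged at the start of training. The layer sizes, module names, and single-layer structure are illustrative assumptions, not the exact SAGANet configuration.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention block (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Gates start at zero, so tanh(gate) = 0 and the block is initially an identity.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (B, N, D) tokens of one stream (e.g. global Synchformer features)
        # context: (B, M, D) tokens of the other stream (e.g. mask-focused local features)
        attn_out, _ = self.attn(self.norm(x), context, context, need_weights=False)
        x = x + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x


# Hypothetical usage: the same module (shared weights) fuses both branches.
fuse = GatedCrossAttention(dim=768)
global_feats = torch.randn(2, 64, 768)   # global video tokens
local_feats = torch.randn(2, 64, 768)    # tokens from the mask-focused crop
fused_global = fuse(global_feats, local_feats)
fused_local = fuse(local_feats, global_feats)
```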

Results

Figure 2: SAGANet generates accurate audio for the target object.

In Fig. 2, the ground-truth audio is a separately recorded track for the segmented (red mask) object. Given the visual stream, the segmentation mask, and the label of the segmented instrument, SAGANet concentrates on the segmented object. MMAudio [4], conditioned only on the textual label and the visual stream, fails to focus on the target object. $\dagger$ denotes LoRA-finetuned ViT blocks associated with the segmentation-aware visual features.
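The segmentation mask determines what the local stream looks at. One plausible way to build such a mask-focused crop is sketched below; the cropping strategy, padding ratio, and output size are assumptions for illustration, not necessarily the exact Focal Prompt construction used by SAGANet.

```python
import torch
import torch.nn.functional as F


def focal_crop(frames: torch.Tensor, masks: torch.Tensor,
               out_size: int = 224, pad_ratio: float = 0.1) -> torch.Tensor:
    """frames: (T, 3, H, W) video; masks: (T, H, W) binary masks of the target object."""
    crops = []
    for frame, mask in zip(frames, masks):
        ys, xs = torch.nonzero(mask, as_tuple=True)
        if ys.numel() == 0:
            crop = frame                      # object not visible: fall back to the full frame
        else:
            h, w = mask.shape
            y0, y1 = int(ys.min()), int(ys.max()) + 1
            x0, x1 = int(xs.min()), int(xs.max()) + 1
            pad_y, pad_x = int(pad_ratio * (y1 - y0)), int(pad_ratio * (x1 - x0))
            y0, y1 = max(y0 - pad_y, 0), min(y1 + pad_y, h)
            x0, x1 = max(x0 - pad_x, 0), min(x1 + pad_x, w)
            crop = frame[:, y0:y1, x0:x1]     # crop around the mask's padded bounding box
        crops.append(F.interpolate(crop[None], size=(out_size, out_size),
                                   mode="bilinear", align_corners=False)[0])
    return torch.stack(crops)                 # (T, 3, out_size, out_size)
```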

Model | FD$_{PaSST}\downarrow$ | FD$_{PANNs}\downarrow$ | FD$_{VGG}\downarrow$ | KL$_{PANNs}\downarrow$ | KL$_{PaSST}\downarrow$
MMAudio [4] | 530.60 | 23.83 | 13.26 | 1.17 | 1.00
SAGANet | 364.07 | 21.31 | 17.19 | 0.79 | 0.59
SAGANet$^\dagger$ | 330.01 | 17.88 | 19.36 | 0.69 | 0.56

Model | IS$\uparrow$ | IB-score$\uparrow$ | DeSync$\downarrow$
MMAudio [4] | 2.24 | 35.94 | 0.96
SAGANet | 3.34 | 39.62 | 0.44
SAGANet$^\dagger$ | 3.27 | 41.50 | 0.30

Table 1: Main results.
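For reference, the FD columns report a Fréchet distance between the embedding distributions of generated and reference audio under the named embedding network (PaSST, PANNs, VGGish). A minimal sketch of the standard computation, assuming the embeddings have already been extracted:

```python
import numpy as np
from scipy import linalg


def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """real_emb, gen_emb: (N, D) embedding matrices from the same embedding model."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                # drop tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```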

Main results are presented in Tab. 1. We use the MMAudio-S-44.1kHz model [4] in our experiments. Adding segmentation-aware visual features guides audio generation to focus on the correct target instrument, which is not achievable with a target label alone. Finetuning the ViT layers associated with the segmentation-aware features improves performance further by helping the generative model adapt to these new features.
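As a rough illustration of the $\dagger$ variant, the sketch below wraps the attention projections of ViT blocks with LoRA adapters so that only the low-rank matrices are trained while the pretrained weights stay frozen. The rank, scaling, and choice of wrapped layers are assumptions, not the exact SAGANet recipe.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # zero-init B so the update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_lora(vit: nn.Module, rank: int = 8) -> None:
    """Wrap the attention projections of a timm-style ViT (hypothetical layout)."""
    for block in vit.blocks:
        block.attn.qkv = LoRALinear(block.attn.qkv, rank)
        block.attn.proj = LoRALinear(block.attn.proj, rank)
```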
