We use the University of Rochester Multi-Modal Music Performance (URMP) dataset for evaluation, since it contains separately recorded audio for every instrument in each musical piece. This means that whenever we generate audio for a segmented instrument, we also have ground-truth audio for it. For MMAudio, we use the MMAudio-S-44.1kHz version.
SAGANet$^\dagger$ refers to the LoRA fine-tuned version of our model.
GT mixed audio
Vid_35_Rondeau_vn_vn_va_db_51278_56278.mp4
GT separated audio
Vid_35_Rondeau_vn_vn_va_db_51278_56278_01.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_02.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_03.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_04.mp4
MMAudio
Vid_35_Rondeau_vn_vn_va_db_51278_56278_01.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_02.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_03.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_04.mp4
SAGANet$^\dagger$
Vid_35_Rondeau_vn_vn_va_db_51278_56278_01.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_02.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_03.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_04.mp4
SAGANet
Vid_35_Rondeau_vn_vn_va_db_51278_56278_01.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_02.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_03.mp4
Vid_35_Rondeau_vn_vn_va_db_51278_56278_04.mp4
GT mixed audio
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471.mp4
GT separated audio
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_01.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_02.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_03.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_04.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_05.mp4
MMAudio
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_01.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_02.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_03.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_04.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_05.mp4
SAGANet$^\dagger$
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_01.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_02.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_03.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_04.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_05.mp4
SAGANet
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_01.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_02.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_03.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_04.mp4
Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_05.mp4
GT mixed audio
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022.mp4
GT separated audio
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_01.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_02.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_03.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_04.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_05.mp4
MMAudio
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_01.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_02.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_03.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_04.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_05.mp4
SAGANet$^\dagger$
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_01.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_02.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_03.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_04.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_05.mp4
SAGANet
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_01.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_02.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_03.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_04.mp4
Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_05.mp4
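The clip names above appear to follow a regular pattern: a piece number, a title, a list of instrument codes, two numeric offsets, and an optional two-digit track suffix on the separated clips. The following is a minimal Python sketch of how such names could be parsed to pair a generated clip with its ground-truth counterpart. The pattern is inferred from the names listed here rather than from a documented specification, the unit of the two offsets is not stated on this page, and the mapping from track suffix to instrument order is an assumption. `ClipInfo` and `parse_clip_name` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Assumed pattern, inferred from the clip names on this page:
#   Vid_<piece>_<title>_<instrument codes>_<start>_<end>[_<track>].mp4
CLIP_RE = re.compile(
    r"^Vid_(?P<piece>\d+)_(?P<title>[A-Za-z]+)"
    r"_(?P<instruments>[a-z]+(?:_[a-z]+)*)"
    r"_(?P<start>\d+)_(?P<end>\d+)"
    r"(?:_(?P<track>\d{2}))?\.mp4$"
)


class ClipInfo(NamedTuple):
    piece: int                    # piece number, e.g. 35
    title: str                    # piece title, e.g. "Rondeau"
    instruments: Tuple[str, ...]  # instrument codes in file-name order
    start: int                    # first offset (unit not stated on the page)
    end: int                      # second offset (unit not stated on the page)
    track: Optional[int]          # 1-based track suffix, None for the mixed clip

    @property
    def instrument(self) -> Optional[str]:
        # Assumption: track k corresponds to the k-th instrument code.
        if self.track is None:
            return None
        return self.instruments[self.track - 1]


def parse_clip_name(name: str) -> ClipInfo:
    """Split a clip file name into its assumed components."""
    match = CLIP_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized clip name: {name}")
    track = match.group("track")
    return ClipInfo(
        piece=int(match.group("piece")),
        title=match.group("title"),
        instruments=tuple(match.group("instruments").split("_")),
        start=int(match.group("start")),
        end=int(match.group("end")),
        track=int(track) if track is not None else None,
    )


if __name__ == "__main__":
    info = parse_clip_name("Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_04.mp4")
    print(info.title, info.track, info.instrument)
```

Under these assumptions, two clips from different methods refer to the same instrument track when their parsed fields are equal, which is how a generated clip would be matched to its ground-truth separated audio.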