We use the University of Rochester Multi-Modal Music Performance (URMP) dataset for evaluation because it contains separately recorded audio for every instrument in each musical piece. This means that whenever we generate audio for a segmented instrument, ground-truth audio for that instrument is also available. For MMAudio, we use the MMAudio-S-44.1kHz version.

SAGANet$^\dagger$ refers to the LoRA fine-tuned version of our model.
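The clip names listed on this page appear to follow the pattern `Vid_<piece id>_<start>_<end>[_<track index>].mp4`, where the piece ID ends with one instrument abbreviation per player (for example `vn_vn_va_db`). The following is a minimal sketch, under that assumption, of how such names could be parsed and how the per-instrument clip names for a mixed clip could be enumerated. The helper names `parse_clip_name` and `separated_clip_names` are hypothetical and are not part of our released code; the pattern is inferred only from the names shown here, and the meaning of the two numeric fields is not specified.

```python
import re

# Assumed naming pattern, inferred from the clip names on this page:
#   Vid_<number>_<Title>_<instr>_..._<instr>_<start>_<end>[_<track>].mp4
CLIP_RE = re.compile(
    r"^Vid_(?P<piece>\d+_[A-Za-z]+(?:_[a-z]+)+)"
    r"_(?P<start>\d+)_(?P<end>\d+)"
    r"(?:_(?P<track>\d{2}))?\.mp4$"
)


def parse_clip_name(name):
    """Split a clip file name into its (assumed) components."""
    m = CLIP_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected clip name: {name!r}")
    piece = m.group("piece")
    # The piece ID is <number>_<title>_<instrument abbreviations...>.
    _number, _title, *instruments = piece.split("_")
    track = m.group("track")
    return {
        "piece": piece,
        "instruments": instruments,
        "start": int(m.group("start")),
        "end": int(m.group("end")),
        # None for a mixed clip, 1-based index for a separated clip.
        "track": int(track) if track is not None else None,
    }


def separated_clip_names(mixed_name):
    """List one per-instrument clip name for each instrument in the piece."""
    info = parse_clip_name(mixed_name)
    if info["track"] is not None:
        raise ValueError("expected a mixed clip name without a track index")
    stem = mixed_name[: -len(".mp4")]
    return [f"{stem}_{i:02d}.mp4" for i in range(1, len(info["instruments"]) + 1)]


if __name__ == "__main__":
    for clip in separated_clip_names("Vid_35_Rondeau_vn_vn_va_db_51278_56278.mp4"):
        print(clip)
```

Counting instruments from the piece ID gives four separated clips for `35_Rondeau_vn_vn_va_db` and five for the quintets, which matches the lists below. Whether track index `01` corresponds to the first abbreviation in the piece ID is not stated on this page, so the sketch does not map indices to instruments.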

35_Rondeau_vn_vn_va_db

GT mixed audio

Vid_35_Rondeau_vn_vn_va_db_51278_56278.mp4

GT separated audio

Vid_35_Rondeau_vn_vn_va_db_51278_56278_01.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_02.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_03.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_04.mp4

MMAudio

Vid_35_Rondeau_vn_vn_va_db_51278_56278_01.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_02.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_03.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_04.mp4

SAGANet$^\dagger$

Vid_35_Rondeau_vn_vn_va_db_51278_56278_01.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_02.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_03.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_04.mp4

SAGANet

Vid_35_Rondeau_vn_vn_va_db_51278_56278_01.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_02.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_03.mp4

Vid_35_Rondeau_vn_vn_va_db_51278_56278_04.mp4

39_Jerusalem_vn_vn_va_sax_db

GT mixed audio

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471.mp4

GT separated audio

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_01.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_02.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_03.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_04.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_05.mp4

MMAudio

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_01.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_02.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_03.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_04.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_05.mp4

SAGANet$^\dagger$

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_01.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_02.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_03.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_04.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_05.mp4

SAGANet

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_01.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_02.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_03.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_04.mp4

Vid_39_Jerusalem_vn_vn_va_sax_db_71471_76471_05.mp4

42_Arioso_tpt_tpt_hn_tbn_tba

GT mixed audio

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022.mp4

GT separated audio

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_01.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_02.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_03.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_04.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_05.mp4

MMAudio

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_01.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_02.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_03.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_04.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_05.mp4

SAGANet$^\dagger$

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_01.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_02.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_03.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_04.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_05.mp4

SAGANet

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_01.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_02.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_03.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_04.mp4

Vid_42_Arioso_tpt_tpt_hn_tbn_tba_102022_107022_05.mp4

43_Chorale_tpt_tpt_hn_tbn_tba