Abstract: In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture, separating the sources), we also introduce and experiment on the partial inference task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
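To make the abstract's setup concrete, here is one way to write it down in our own notation (not the paper's): the stems share a context, a single network approximates the score of their joint density, and each task corresponds to sampling from a different distribution derived from that joint. The assumption that the mixture is the sum of the stems is ours here, though it is the standard convention for Slakh2100.

```latex
% One possible notation for the setup described above (ours, not the paper's).
% Assumption: the mixture y is the sum of the N stems x_1, ..., x_N.
\begin{align*}
  \mathbf{y} &= \sum_{n=1}^{N} \mathbf{x}_n
      && \text{mixture (e.g.\ piano $+$ drums)} \\
  s_\theta(\mathbf{x}_1,\dots,\mathbf{x}_N, t)
      &\approx \nabla \log p_t(\mathbf{x}_1,\dots,\mathbf{x}_N)
      && \text{one joint score network} \\
  \text{total generation:} &\quad (\mathbf{x}_1,\dots,\mathbf{x}_N) \sim p(\mathbf{x}_1,\dots,\mathbf{x}_N)
      && \text{sample everything} \\
  \text{source imputation:} &\quad \mathbf{x}_{\mathcal{I}} \sim p(\mathbf{x}_{\mathcal{I}} \mid \mathbf{x}_{\setminus \mathcal{I}})
      && \text{sample a subset given the rest} \\
  \text{source separation:} &\quad (\mathbf{x}_1,\dots,\mathbf{x}_N) \sim p(\mathbf{x}_1,\dots,\mathbf{x}_N \mid \mathbf{y})
      && \text{sample all stems given the mixture}
\end{align*}
```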
Below are some examples of the tasks we can perform with our approach.
Generation
Here we ask the neural model to randomly generate some new music with just piano and drums:
Sample #1
Sample #2
Sample #3
Sample #4
Sample #5
Sample #6
Source Imputation (a.k.a. partial generation)
Given a drum track as input, the neural model generates the accompanying piano from scratch:
Input Drums Track 1
Sampled Piano #1
Sampled Piano #2
Input Drums Track 2
Sampled Piano #1
Sampled Piano #2
Similarly, given a piano track as input, the neural model generates the accompanying drums (a code sketch of this partial sampling procedure follows the examples below):
Input Piano Track 1
Sampled Drums #1
Sampled Drums #2
Input Piano Track 2
Sampled Drums #1
Sampled Drums #2
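Generating the piano (or drums) to match a given track follows naturally from having a joint score model: during reverse diffusion, the given stems are clamped to noised versions of the input audio while the missing stems are denoised as usual. The sketch below shows a generic inpainting-style sampler for a variance-exploding diffusion over stacked stems; the `score_model` call, the noise schedule, and the exact update rule are placeholders and may differ from the procedure used in the paper.

```python
import torch

def imputation_sample(score_model, known, known_idx, shape, sigmas):
    """Generate the missing stems while keeping the stems in `known_idx` fixed.

    score_model(x, sigma) -> estimate of the score grad_x log p_sigma(x)
    known:     clean given stems, same shape as `shape` (missing rows are ignored)
    known_idx: boolean mask over the stem axis (True = stem is given)
    shape:     (num_stems, num_samples) of the waveform tensor to generate
    sigmas:    1-D tensor of decreasing noise levels (sigma_max ... sigma_min)
    """
    x = sigmas[0] * torch.randn(shape)                # start from pure noise
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # Keep the given stems consistent with the input audio at this noise level.
        x[known_idx] = known[known_idx] + sigma * torch.randn_like(known[known_idx])
        # One reverse-diffusion step (variance-exploding ancestral update).
        step = sigma ** 2 - sigma_next ** 2
        x = x + step * score_model(x, sigma) + step.sqrt() * torch.randn_like(x)
    x[known_idx] = known[known_idx]                   # paste the clean stems back in
    return x
```

Running such a sampler twice with different random seeds yields two different accompaniments for the same input track, which is exactly what the paired examples above illustrate.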
Source Separation
Finally, our model can be used to extract the individual sources from an input mixture:
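The abstract mentions a novel inference method for separation, whose details are not reproduced on this page. For intuition only, the sketch below shows a generic alternative, reconstruction-guided sampling: the stems are sampled jointly from the learned prior while a likelihood term pushes their sum toward the observed mixture. The guidance weight and its scaling by the noise level are heuristics, and this is not the paper's method.

```python
import torch

def separate(score_model, mixture, num_stems, sigmas, guidance=1.0):
    """Split a 1-D `mixture` waveform into `num_stems` estimated sources."""
    x = sigmas[0] * torch.randn(num_stems, mixture.shape[-1])   # noisy initial stems
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        prior_score = score_model(x, sigma)                     # joint prior over stems
        # Likelihood term: nudge the stems so that they sum to the observed mixture.
        residual = mixture - x.sum(dim=0)                       # y - sum_n x_n
        likelihood_score = guidance * residual / sigma ** 2     # heuristic weighting
        step = sigma ** 2 - sigma_next ** 2
        x = x + step * (prior_score + likelihood_score) + step.sqrt() * torch.randn_like(x)
    return x
```

Because the stems are modeled jointly, the same trained network serves generation, imputation, and separation; only the sampling procedure changes between tasks.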