|Thesis advisor:|Prof. Dr.-Ing. Thomas Sikora,|Technical University of Berlin|
|Thesis reader:|Prof. Dr. Gaël Richard,|TELECOM ParisTech|
|Chairman:|Prof. Dr.-Ing. Reinhold Orglmeister,|Technical University of Berlin|
The goal of source separation is to detect and extract the individual signals present in a mixture. Its application to sound signals and, in particular, to music signals, is of interest for content analysis and retrieval applications arising in the context of online music services. Other applications include unmixing and remixing for post-production, restoration of old recordings, object-based audio compression and upmixing to multichannel setups.
This work addresses the task of source separation from monaural and stereophonic linear musical mixtures. In both cases, the problem is underdetermined, meaning that there are more sources to separate than channels in the observed mixture. A feasible solution therefore requires making strong statistical assumptions and/or learning a priori information about the sources. On the other hand, constraining the analysis to instrumental music signals allows exploiting specific cues such as spectral and temporal smoothness, note-based segmentation and timbre similarity for the detection and extraction of sound events.
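The underdetermined setting can be illustrated with a minimal numerical sketch. It assumes an instantaneous linear mixture of three synthetic sources into two channels; the mixing matrix, the Laplacian source signals and the function names are illustrative assumptions, not taken from this work. The pseudo-inverse estimate reproduces the mixture exactly but does not recover the sources, which is why additional source models are needed.

```python
import numpy as np

def mix(sources, mixing_matrix):
    """Instantaneous linear mixing x = A s (each row of `sources` is one signal)."""
    return mixing_matrix @ sources

def is_underdetermined(mixing_matrix):
    """True when there are more sources (columns) than channels (rows)."""
    n_channels, n_sources = mixing_matrix.shape
    return n_sources > n_channels

# Hypothetical example: 3 synthetic sources mixed into a stereo (2-channel) signal.
rng = np.random.default_rng(0)
sources = rng.laplace(size=(3, 1000))
mixing_matrix = np.array([[1.0, 0.7, 0.2],
                          [0.2, 0.7, 1.0]])
mixture = mix(sources, mixing_matrix)

# The minimum-norm (pseudo-inverse) estimate is consistent with the mixture,
# but it is only one of infinitely many solutions and differs from the sources.
estimate = np.linalg.pinv(mixing_matrix) @ mixture
mixture_error = np.max(np.abs(mixing_matrix @ estimate - mixture))
source_error = np.max(np.abs(estimate - sources))
```

Here `mixture_error` is at the level of numerical round-off, while `source_error` remains large: the observed channels alone do not determine the sources.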
The statistical assumptions and, if present, the a priori information, are both captured by a given source model that can greatly vary in complexity and extent of application. The approach used here is to consider source models of increasing levels of complexity, and to study their implications on the separation algorithm.
The starting point is sparsity-based separation, which makes the general assumption that the sources can be represented in a transformed domain with few high-energy coefficients. It will be shown that sparsity, and consequently separation, can both be improved by using nonuniform-resolution time-frequency representations. To that end, several types of frequency-warped filter banks will be used as signal front-ends in conjunction with an unsupervised stereo separation approach.
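The sparsity assumption can be made concrete with a small sketch that counts how many coefficients are needed to capture a given fraction of the signal energy. A plain windowed DFT is used here only as a stand-in for a transformed domain; it is not one of the frequency-warped filter banks studied in this work, and the test signal, sampling rate and 99% threshold are illustrative assumptions.

```python
import numpy as np

def coeffs_for_energy(coeffs, fraction=0.99):
    """Number of largest-magnitude coefficients needed to reach `fraction` of the energy."""
    energy = np.sort(np.abs(coeffs) ** 2)[::-1]
    cumulative = np.cumsum(energy)
    return int(np.searchsorted(cumulative, fraction * cumulative[-1]) + 1)

# Hypothetical test signal: two stationary partials sampled at 8 kHz.
fs = 8000
t = np.arange(1024) / fs
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1320 * t)
windowed = signal * np.hanning(len(signal))

# Compare the time-domain samples with the one-sided DFT of the same frame.
spectrum = np.fft.rfft(windowed)
n_time = coeffs_for_energy(windowed)
n_freq = coeffs_for_energy(spectrum)
```

For this signal, `n_freq` is far smaller than `n_time`: a handful of frequency bins carry almost all of the energy, whereas the time-domain samples spread it over a large part of the frame.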
As a next step, more sophisticated models based on sinusoidal modeling and statistical training will be considered in order to improve separation and to allow the consideration of the maximally underdetermined problem: separation from single-channel signals. Particular emphasis is placed on a detailed but compact approach to training models of the timbre of musical instruments. An important characteristic of the approach is that it aims at a close description of the temporal evolution of the spectral envelope. The proposed method uses a formant-preserving, dimension-reduced representation of the spectral envelope based on spectral interpolation and Principal Component Analysis. It then describes the timbre of a given instrument as a Gaussian Process that can be interpreted either as a prototype curve in a timbral space or as a time-frequency template in the spectral domain.
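The dimension-reduction step can be sketched as follows, under strong simplifying assumptions. The spectral envelopes are synthetic (two hypothetical formants on a 40-point frequency grid with time-varying gains), the formant-preserving interpolation and the Gaussian Process stage are omitted, and `pca_fit` is an illustrative helper rather than the procedure used in this work. The sketch only shows how Principal Component Analysis maps a sequence of envelopes to a low-dimensional trajectory.

```python
import numpy as np

def pca_fit(envelopes, n_components):
    """PCA via SVD; returns the mean, the principal axes and the explained variance ratio."""
    mean = envelopes.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(envelopes - mean, full_matrices=False)
    variances = singular_values ** 2
    explained = variances[:n_components].sum() / variances.sum()
    return mean, vt[:n_components], explained

# Synthetic envelopes (frames x frequency points): an assumption for illustration.
rng = np.random.default_rng(0)
n_frames, n_bins = 50, 40
time = np.linspace(0.0, 3.0, n_frames)[:, None]
freq = np.arange(n_bins)[None, :]
formant_low = np.exp(-0.5 * ((freq - 10) / 4.0) ** 2)
formant_high = np.exp(-0.5 * ((freq - 28) / 4.0) ** 2)
envelopes = (np.exp(-time) * formant_low
             + (1.0 - np.exp(-2.0 * time)) * formant_high
             + 0.001 * rng.standard_normal((n_frames, n_bins)))

mean, basis, explained = pca_fit(envelopes, n_components=2)

# Each frame becomes a point in a 2-D space; consecutive frames trace a curve.
trajectory = (envelopes - mean) @ basis.T
reconstruction = trajectory @ basis + mean
max_error = np.max(np.abs(reconstruction - envelopes))
```

The rows of `trajectory` form a curve in the reduced space, and `reconstruction` maps that curve back to a time-frequency surface, which loosely mirrors the two interpretations of the timbre model mentioned above.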
A monaural separation method based on sinusoidal modeling and on the timbre modeling approach described above will be presented. It exploits common-fate and good-continuation cues to extract groups of sinusoidal tracks corresponding to the individual notes. Each group is compared to each of the timbre templates in the database using a specially designed measure of timbre similarity, followed by a Maximum Likelihood decision. Subsequently, overlapping and missing parts of the sinusoidal tracks are retrieved by interpolating the selected timbre template. The method is later extended to stereo mixtures by using a preliminary spatial-based blind separation stage, followed by a set of refinements that are performed by the above sinusoidal modeling and timbre matching methods and that aim at reducing interference from the undesired sources.
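The template selection step can be sketched as a Maximum Likelihood decision over a small set of templates. The specially designed similarity measure is not specified in this abstract, so the sketch substitutes a plain diagonal Gaussian log-likelihood; the template names, shapes and variances are hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(observation, mean, variance):
    """Log-likelihood of `observation` under independent Gaussians per time-frequency point."""
    residual = observation - mean
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * variance) + residual ** 2 / variance))

def select_template(observation, templates):
    """Return the name of the most likely template and all log-likelihood scores."""
    scores = {name: gaussian_log_likelihood(observation, mean, variance)
              for name, (mean, variance) in templates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical templates: mean partial amplitudes over time plus a variance surface.
n_frames, n_partials = 20, 8
time = np.linspace(0.0, 1.0, n_frames)[:, None]
partial = np.arange(1, n_partials + 1)[None, :]
templates = {
    "instrument_a": (np.exp(-3.0 * time) / partial,
                     np.full((n_frames, n_partials), 0.01)),
    "instrument_b": (np.ones_like(time) / np.sqrt(partial),
                     np.full((n_frames, n_partials), 0.01)),
}

# Observed group of tracks: a noisy version of the first template.
rng = np.random.default_rng(0)
observation = templates["instrument_a"][0] + 0.05 * rng.standard_normal((n_frames, n_partials))
best, scores = select_template(observation, templates)
```

In this toy setting the mean of the selected template could also serve to fill in missing or overlapping portions of the observed tracks, which is the role the interpolation step plays in the method described above.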
A notable characteristic of the proposed separation methods is that they do not assume harmonicity, and are thus not based on a previous multipitch estimation stage, nor on the input of detailed pitch-related information. Instead, grouping and separation rely solely on the dynamic behavior of the amplitudes of the partials. This also allows separating highly inharmonic sounds and extracting chords played by a single instrument as individual entities.
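Grouping by amplitude dynamics alone can be sketched as follows. The sketch assumes that the amplitude envelopes of the sinusoidal tracks are already available and groups tracks whose envelopes are strongly correlated; the correlation criterion, the threshold and the synthetic envelopes are illustrative assumptions, not the grouping rule used in this work. No frequency or pitch information enters the decision.

```python
import numpy as np

def group_by_common_fate(envelopes, threshold=0.9):
    """Label tracks so that tracks linked by envelope correlation >= threshold share a label."""
    correlation = np.corrcoef(envelopes)
    n_tracks = len(envelopes)
    labels = [-1] * n_tracks
    next_label = 0
    for start in range(n_tracks):
        if labels[start] != -1:
            continue
        # Grow a new group from this unlabeled track.
        labels[start] = next_label
        pending = [start]
        while pending:
            current = pending.pop()
            for other in range(n_tracks):
                if labels[other] == -1 and correlation[current, other] >= threshold:
                    labels[other] = next_label
                    pending.append(other)
        next_label += 1
    return labels

# Synthetic envelopes: three partials of a decaying note, two of a swelling note.
rng = np.random.default_rng(0)
time = np.linspace(0.0, 1.0, 200)
decaying = np.exp(-5.0 * time)
swelling = 1.0 - np.exp(-5.0 * time)
envelopes = np.array([1.0 * decaying, 0.5 * decaying, 0.3 * decaying,
                      0.8 * swelling, 0.4 * swelling])
envelopes = envelopes + 0.005 * rng.standard_normal(envelopes.shape)

labels = group_by_common_fate(envelopes)
```

Because only the envelopes are compared, the partials of a group need not be harmonically related, so inharmonic sounds or the notes of a chord sharing the same dynamics would fall into a single group.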
The fact that the presented approaches are supervised and based on classification and similarity allows using them (or parts thereof) for other content analysis applications. In particular, the timbre models and the timbre matching stages of the separation systems will be evaluated in the tasks of musical instrument classification and detection of instruments in polyphonic mixtures.