Voice Morphing

Voice morphing, also called voice transformation or voice conversion, is the process of transforming a source speaker's speech so that it sounds like that of a target speaker. Using linear transformations estimated from time-aligned parallel training data, it maps the spectral envelope of the source speaker onto that of the target speaker. The idea is analogous to image morphing, in which a source face smoothly changes its shape and texture into a target face: speech morphing should likewise change the source voice smoothly into the target voice while keeping the characteristics shared by the starting and ending signals. Pitch and spectral-envelope information are intertwined in a speech signal and need to be separated before either can be modified; cepstral analysis is usually employed for this.
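As a rough illustration of how cepstral analysis separates the two, the sketch below computes the real cepstrum of one speech frame, keeps the low-quefrency part as a smoothed log spectral envelope, and reads a pitch estimate from the strongest peak in the higher-quefrency range. The function names, the lifter cutoff of 30 samples, and the 60 to 400 Hz search range are illustrative assumptions, not values taken from the system described here.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    log_magnitude = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    return np.fft.ifft(log_magnitude).real

def split_envelope_and_pitch(frame, sample_rate, cutoff=30,
                             f0_min=60.0, f0_max=400.0):
    """Return (log spectral envelope, pitch estimate in Hz) for one frame.

    cutoff, f0_min and f0_max are assumed values chosen for illustration.
    cutoff must stay below sample_rate / f0_max so that the envelope
    lifter does not swallow the pitch peak.
    """
    windowed = frame * np.hanning(len(frame))
    cepstrum = real_cepstrum(windowed)
    n = len(cepstrum)

    # Low-quefrency coefficients describe the slowly varying envelope.
    # The lifter is symmetric because the real cepstrum is symmetric.
    lifter = np.zeros(n)
    lifter[:cutoff] = 1.0
    lifter[n - cutoff + 1:] = 1.0
    log_envelope = np.fft.fft(cepstrum * lifter).real

    # The pitch period appears as a peak at higher quefrencies.
    q_min = int(sample_rate / f0_max)
    q_max = int(sample_rate / f0_min)
    period = q_min + int(np.argmax(cepstrum[q_min:q_max]))
    return log_envelope, sample_rate / period
```

This works best on voiced frames that span several pitch periods; for unvoiced frames the peak search still returns a value, so a practical system would add a voicing decision before trusting the pitch estimate.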

Voice morphing, which is also referred to as voice transformation and voice conversion, is a technique for modifying a source speaker's utterance so that it sounds as if it were spoken by a target speaker. Many applications could benefit from this technology. For example, a text-to-speech (TTS) system with integrated voice morphing can produce many different voices. Where speaker identity plays a key role, such as in dubbing movies and TV shows, high-quality voice morphing would be very valuable, allowing the appropriate voice to be generated (possibly in a different language) without the original actors being present.

Three interdependent issues must be solved before a voice morphing system can be built. First, a mathematical model of the speech signal is needed so that synthetic speech can be regenerated and prosody can be manipulated without artifacts. Second, the acoustic cues that enable humans to identify speakers must be identified and extracted. Third, the type of conversion function, and the method of training and applying it, must be decided.
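To make the third issue concrete, here is a minimal sketch of one possible conversion function: a single global linear transform with a bias term, fitted by least squares to time-aligned pairs of source and target feature vectors. The function names and the choice of one global transform are assumptions made for illustration; a real system may use several transforms, selected or weighted per frame.

```python
import numpy as np

def fit_linear_transform(source_feats, target_feats):
    """Least-squares fit of target ~= source @ weights + bias.

    Both inputs have shape (frames, dims) and must already be
    time-aligned, so that row i of each describes the same sound.
    """
    ones = np.ones((source_feats.shape[0], 1))
    augmented = np.hstack([source_feats, ones])  # extra column learns the bias
    solution, *_ = np.linalg.lstsq(augmented, target_feats, rcond=None)
    return solution[:-1], solution[-1]

def convert(source_feats, weights, bias):
    """Apply the trained transform to new source frames."""
    return source_feats @ weights + bias
```

Training calls `fit_linear_transform` once on the parallel data; conversion then applies `convert` to the feature vectors of each new source utterance before resynthesis.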

The aim of this research is to develop flexible, high-quality algorithms that can morph speech from one speaker to another. A system has been developed based on a pitch-synchronous sinusoidal model which uses line spectral frequency (LSF) feature encoding and linear transforms. To ensure high quality, a number of novel techniques have been developed to minimise the artifacts which typically result from loss of glottal source information, formant bandwidth broadening, phase incoherence and spectral colouring of unvoiced sounds. Full details are given in references [1] and [2], and some demonstration files are given below.
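For readers unfamiliar with LSF encoding, the sketch below shows the standard textbook route from linear prediction (LPC) coefficients to LSFs: form the symmetric and antisymmetric polynomials P(z) and Q(z) from the LPC polynomial and take the angles of their roots on the upper half of the unit circle. It is an assumption that the features are derived this way in the system described; the function name and the root-finding approach are illustrative only.

```python
import numpy as np

def lpc_to_lsf(lpc):
    """Convert LPC coefficients [1, a1, ..., ap] to LSFs in radians.

    Assumes the LPC polynomial is minimum phase (a stable synthesis
    filter), in which case the result has p strictly increasing
    values between 0 and pi.
    """
    a = np.asarray(lpc, dtype=float)
    extended = np.concatenate([a, [0.0]])
    p_poly = extended + extended[::-1]  # symmetric polynomial P(z)
    q_poly = extended - extended[::-1]  # antisymmetric polynomial Q(z)

    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    angles = np.angle(roots)

    # Keep one angle per conjugate pair and drop the trivial roots at z = +1 and z = -1.
    tolerance = 1e-6
    keep = (angles > tolerance) & (angles < np.pi - tolerance)
    return np.sort(angles[keep])
```

LSFs are a popular encoding for conversion because small changes to them produce localised changes in the spectral envelope, and an ordered set of LSFs always corresponds to a stable filter.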

Current work is focussed on extending the techniques to allow the conversion of an unknown speaker's voice to sound like that of a known target speaker.