“Motion Capture” for Text-to-Speech?

I had a random thought over the weekend, and while I suspect it’s not original, I couldn’t find anyone working on it.

One big reason why text-to-speech (TTS) synthesis sucks so badly is that the result sounds flat. Yes, the synthesizer can try to infer cadence and tone from things like commas, paragraph breaks, exclamation points, and question marks, but the result still falls far short of what a human reader sounds like. In the end, the problem seems to be Turing-hard, since you need to understand the meaning of a piece of text in order to read it properly.

So would it be possible to record a human reading a piece of text, and extract just the intonation, cadence, and pacing of the reading? Hollywood already uses motion capture, in which cameras record the movements of a human being and drive a CGI creature to move the same way (e.g., Gollum in The Lord of the Rings, or the Na’vi in Avatar). In fact, you can combine multiple people’s movements into one synthesized creature, say by using one person’s stride, another’s hand movements, and a third person’s facial expressions.

So why not apply the same principle to synthesized speech? For instance, you could have someone read a paragraph of text. We already have speech-recognition software, so it should be possible to align that recording against the text, matching up individual words and phonemes (speech folks call this forced alignment). That gives you timing: how long the reader pauses at a comma, how quickly each phrase goes by. The recording can then be analyzed for whether a given word was spoken more loudly, or at a higher pitch, than the surrounding words, and by how much. All of this can be converted to speech markup, as in the sketch below.
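To make that concrete, here’s a minimal sketch of the analysis step in Python. It assumes the per-word timestamps already exist (a forced aligner such as the Montreal Forced Aligner can produce them; the aligned_words list is a stand-in for its output), uses librosa for pitch and loudness, and emits standard SSML. The 50 ms pause threshold and the words_to_ssml helper are just illustrative choices.

```python
# Sketch: extract prosody from a human reading and emit SSML markup.
# Hypothetical input: a WAV of the reading, plus per-word timestamps from
# a forced aligner, e.g. [("Call", 0.00, 0.31), ("me", 0.31, 0.45), ...].

import numpy as np
import librosa

def words_to_ssml(wav_path, aligned_words):
    y, sr = librosa.load(wav_path, sr=None)

    # Frame-level pitch (F0) and loudness (RMS) over the whole recording.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    f0_times = librosa.times_like(f0, sr=sr)
    rms_times = librosa.times_like(rms, sr=sr)

    # Baseline: the reader's median pitch and loudness across the recording.
    base_f0 = np.nanmedian(f0)
    base_rms = np.median(rms)

    chunks, prev_end = [], None
    for word, start, end in aligned_words:
        # Silence between words (say, at a comma) becomes an explicit pause.
        if prev_end is not None and start - prev_end > 0.05:
            chunks.append(f'<break time="{(start - prev_end) * 1000:.0f}ms"/>')
        prev_end = end

        # How much louder / higher-pitched was this word than the baseline?
        w_f0 = f0[(f0_times >= start) & (f0_times < end)]
        w_rms = rms[(rms_times >= start) & (rms_times < end)]
        pitch_pct = (np.nanmedian(w_f0) / base_f0 - 1) * 100 if np.any(~np.isnan(w_f0)) else 0.0
        vol_db = 20 * np.log10(np.median(w_rms) / base_rms) if len(w_rms) else 0.0

        chunks.append(
            f'<prosody pitch="{pitch_pct:+.0f}%" volume="{vol_db:+.1f}dB">{word}</prosody>'
        )
    return "<speak>" + " ".join(chunks) + "</speak>"
```

Feed the resulting SSML to any TTS engine that accepts it, pick whatever voice you like, and you get the reader’s pauses and emphasis without the reader’s voice.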

This means that you could take a recording of Stephen Fry reading a book and synthesize the same reading in Patrick Stewart’s voice.

Perhaps more to the point, if you poke around Project Gutenberg, you’ll see that there are two kinds of audiobooks: ones generated via TTS, and ones read by people. The human-read ones are, of course, better, but they require an actual person to sit down and read the whole book from start to finish, which is time-consuming.

If it were possible to apply a human’s reading style to the synthesis of a known piece of text, then multiple people could share the job of recording an audiobook: let volunteers read one or two pages each, then synthesize those pages with each volunteer’s intonation and cadence but a single standard voice, along the lines of the sketch below.
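In code, the pipeline might look something like this; synthesize is a placeholder for whatever SSML-capable TTS engine you plug in, and words_to_ssml is the sketch from earlier:

```python
import numpy as np

def synthesize(ssml, voice):
    """Placeholder for a real SSML-capable TTS engine; returns audio samples.
    Here it just returns a second of silence so the sketch runs."""
    return np.zeros(22050)

def build_audiobook(volunteer_readings, voice="standard-narrator"):
    """volunteer_readings: (wav_path, aligned_words) pairs, one per page,
    possibly contributed by many different volunteers."""
    pages = []
    for wav_path, aligned_words in volunteer_readings:
        ssml = words_to_ssml(wav_path, aligned_words)  # from the sketch above
        pages.append(synthesize(ssml, voice=voice))
    # Every page comes out in the same standard voice, so the seams between
    # volunteers are differences in reading style, not in timbre.
    return np.concatenate(pages)
```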

I imagine that there would still be lots of problems with this (for instance, it might feel somewhat jarring when the book switches from one person’s reading style to another’s), and probably lots of others that I can’t even imagine.

But hey, it would still be an improvement over what we have now. Is anyone out there working on this?