A few weeks ago, I detailed how I combined voice synthesis and generative AI to build the basis of an automated phone answering machine. That article was a practical guide to implementing the system. But there is also some interesting theory and history behind it!
What makes a voice human is what makes it hard to synthesize
Text-To-Speech technology has seen impressive development over the last few decades. It is now a relatively mature technology that is present in our daily lives - you can find it under the hood of services like Siri, Waze, Google Maps or the Google Assistant. And it is a real technological feat that is often underestimated by the public in an era ruled by LLMs and RAG. By the way, those are amazing tools too - make sure to check out how to create your own semantic version of Google Images!
Think about it for a moment - what characterizes a human voice? At the most basic level, a human voice is a superposition of sound waves whose frequencies vary over time. But so many different "features" contribute to the uniqueness and complexity of a human voice (the small sketch after this list illustrates the raw ingredients):
- pitch, which is determined by the frequency of the sound waves
- loudness, influenced by the amplitude of the sound waves
- various articulatory features, such as the shape of the vocal tract and the movement of the tongue, lips, and other parts of the speech apparatus
- intonation, which can be used to add more (or less!) dynamism, or signify a question
- rhythm
- accent
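To make the "superposition of sound waves" idea concrete, here is a purely illustrative numpy sketch of a tone whose pitch (frequency) and loudness (amplitude) both drift over time. The numbers are arbitrary assumptions and the result is nowhere near real speech - it only shows the raw ingredients listed above.

```python
# A toy "voice-like" signal: time-varying pitch and loudness, plus a few harmonics.
import numpy as np

SR = 16_000                          # sample rate in Hz
t = np.linspace(0, 2.0, 2 * SR)      # two seconds of time axis

pitch = 220 + 40 * np.sin(2 * np.pi * 0.5 * t)       # pitch glides between 180 and 260 Hz
loudness = 0.5 + 0.4 * np.sin(2 * np.pi * 1.5 * t)   # amplitude pulses ~1.5 times per second

# Integrate the instantaneous frequency to get the phase, then stack a few harmonics.
phase = 2 * np.pi * np.cumsum(pitch) / SR
signal = loudness * (np.sin(phase) + 0.5 * np.sin(2 * phase) + 0.25 * np.sin(3 * phase))
print(signal.shape)  # (32000,) samples you could write to a .wav file
```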
And all those elements can vary to some extent, even within a single sentence! To make things even more challenging, for a conversational application like a phone answering machine, voice synthesis has to happen in near real-time - the overall latency of the whole system should stay under about a second to feel natural.
1937 - The VODER, or how to play a voice just like playing piano
One of the first experiments in this field was achieved in the 1930s under the sweet, delicate name of VODER.
The VODER (Voice Operating Demonstrator) was one of the first significant breakthroughs in the field of voice synthesis. Developed in 1937 by Homer Dudley, the VODER represented a pioneering attempt to electronically produce human speech. Dudley's creation was based on his earlier work with the vocoder (voice encoder).
The VODER was an intricate device that required a highly skilled operator to "play" it much like a musical instrument. The operator used a keyboard and a foot pedal to manipulate the sound, producing speech sounds in real-time. To build his machine, Dudley applied the concept of a carrier signal: a buzzer-like source for voiced sounds and a hiss-like source for unvoiced sounds.
The VODER had a bank of ten predefined band-pass filters. Those filters covered most of the spectrum of human speech. They received input from either a noise source or a relaxation oscillator, the buzz source, selected by the operator using a wrist bar. A foot pedal allowed the operator to control the pitch of the input. The keyboard acted as a controller on the filters, reducing or increasing the contribution of any one of them, and a quiet key allowed for pauses and silences in the generated speech.
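For the curious, here is a rough simulation of that signal path, assuming numpy and scipy are available. It is only a sketch: a buzz (pulse train) or hiss (noise) source is fed through a bank of ten band-pass filters whose gains stand in for the operator's key presses. The band edges and gain values are illustrative, not the VODER's actual specifications.

```python
import numpy as np
from scipy.signal import butter, lfilter

SR = 16_000
t = np.arange(SR) / SR  # one second

# Two excitation sources, chosen with the wrist bar on the real machine.
buzz = ((t * 100) % 1.0 < 0.05).astype(float)   # ~100 Hz pulse train (voiced)
hiss = 0.3 * np.random.randn(len(t))            # white noise (unvoiced)
source = buzz                                   # pick the voiced source here

# Ten band-pass filters covering roughly the speech band (illustrative edges).
edges = [200 * 1.42 ** i for i in range(11)]
bands = list(zip(edges[:-1], edges[1:]))

# "Key" positions: how strongly each band contributes (arbitrary shape).
gains = [1.0, 0.9, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

output = np.zeros_like(t)
for (low, high), gain in zip(bands, gains):
    b, a = butter(2, [low, high], btype="band", fs=SR)
    output += gain * lfilter(b, a, source)

output /= np.max(np.abs(output))
print(output.shape)  # (16000,): one second of buzzy, VODER-like sound
```

A human operator changed those gains many times per second with both hands, which gives a sense of why the machine was so hard to play.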
To produce intelligible speech, the operator had to master the coordination of both hands and feet, much like a pianist. By pressing different combinations of keys, the operator could generate continuous speech with a natural flow and varying tones. The VODER was a groundbreaking achievement, despite its complexity and the high level of skill required to operate it. It demonstrated that it was possible to synthesize human speech through electronic means. It really paved the way for future advancements in text-to-speech technology.
Homer Dudley showcased the VODER at the 1939 New York World's Fair. It captivated audiences and provided a glimpse into the potential of electronic voice synthesis.
1940-2010 - A rule-based vs. a data-based approach to voice synthesis
While the VODER laid the foundation for voice synthesis, subsequent technological advancements sought to refine and simplify the process of generating human-like speech. One of the most well-known applications of early voice synthesis technology was Stephen Hawking's computerized voice. Diagnosed with amyotrophic lateral sclerosis (ALS), Hawking lost his ability to speak and relied on speech-generating devices to communicate. His voice synthesizer, developed in the 1980s, utilized formant synthesis, a significant leap from the manual operation of the VODER.
Formant synthesis, developed in the mid-20th century, generates speech by creating signals based on language-specific rules combined with the general spectral properties of human speech. This method uses additive synthesis guided by an acoustic model that describes the fundamental frequency, intonation, and prosody - the elements of speech that define individual articulation, including tone of voice and accent. This approach allowed Hawking to communicate effectively, though the speech produced was often robotic and lacked emotional nuance.
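Here is a minimal additive-synthesis sketch of a formant-style vowel, assuming a fixed pitch and approximate values for the first three formants of the vowel /a/. Real formant synthesizers add rules for timing, formant transitions and prosody; this only shows the core idea of shaping harmonics with a formant envelope.

```python
import numpy as np

SR = 16_000            # sample rate (Hz)
F0 = 120               # fundamental frequency (pitch), Hz
FORMANTS = [(730, 90), (1090, 110), (2440, 120)]  # (center freq, bandwidth) approximations for /a/
t = np.linspace(0, 1.0, SR, endpoint=False)

def formant_gain(freq):
    """Spectral envelope: sum of resonance bumps centered on each formant."""
    return sum(np.exp(-0.5 * ((freq - fc) / bw) ** 2) for fc, bw in FORMANTS)

# Additive synthesis: sum the harmonics of F0, each weighted by the formant envelope.
vowel = sum(formant_gain(k * F0) * np.sin(2 * np.pi * k * F0 * t)
            for k in range(1, SR // (2 * F0)))
vowel /= np.max(np.abs(vowel))  # normalize
print(vowel.shape)  # (16000,): one second of a rough, robotic /a/
```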
In contrast, concatenative synthesis is a data-driven approach that relies on high-fidelity audio recordings. Segments of recorded speech are selected and combined to form new speech utterances. Typically, a voice actor records several hours of speech, which are then processed into a large database. This database contains linguistic units such as phonemes, phrases, and sentences. When speech synthesis is initiated, the system searches this database for units that match the input text. Then, it concatenates them to produce an audible output.
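A toy illustration of that lookup-and-join process is sketched below, assuming a pre-recorded "database" of phoneme-level units stored as numpy arrays (random noise stands in for actual recordings here). Real systems use much richer units, selection costs and smoothing, but the principle is the same.

```python
import numpy as np

SR = 16_000  # sample rate in Hz

# Pretend database: one short recorded clip per phoneme (noise as a placeholder).
unit_db = {ph: 0.1 * np.random.randn(SR // 10) for ph in ["h", "e", "l", "o", "w", "r", "d"]}

def synthesize(phonemes, crossfade_ms=10):
    """Look up each phoneme's recording and join them with a short crossfade."""
    fade = int(SR * crossfade_ms / 1000)
    out = unit_db[phonemes[0]].copy()
    for ph in phonemes[1:]:
        unit = unit_db[ph]
        ramp = np.linspace(0.0, 1.0, fade)
        out[-fade:] = out[-fade:] * (1 - ramp) + unit[:fade] * ramp  # smooth the joint
        out = np.concatenate([out, unit[fade:]])
    return out

audio = synthesize(["h", "e", "l", "o"])
print(audio.shape)  # (5920,) samples of concatenated "speech"
```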
Concatenative synthesis can produce high-quality audio if a large and varied dataset is available. It captures the natural variability and richness of human speech better than formant synthesis. However, it has its limitations. The approach makes it difficult to modify the voice without recording a new database of phrases. Changing the speaker or adjusting the emphasis or emotion of the speech requires extensive additional recordings, which can be impractical.
Despite their advancements, both methods had limitations in terms of flexibility and emotional expressiveness. These techniques remained dominant until the 2010s when the advent of deep neural networks revolutionized the field, providing more natural, flexible, and emotionally expressive voice synthesis capabilities.
2010s - Reality-like voice synthesis by leveraging Deep Neural Networks
The 2010s marked a turning point in the evolution of voice synthesis technology with the introduction of deep neural networks (DNNs). These networks offered a revolutionary approach, moving beyond the rigid structures of formant and concatenative synthesis. Instead of relying on rule-based algorithms or databases of pre-recorded speech segments, DNNs could learn and generate speech patterns directly from large amounts of data. This capability allowed for more natural, flexible, and emotionally expressive voice synthesis.
DNNs work by training on vast datasets of human speech. The network learns the complex relationships between text input and corresponding audio output. This process involves multiple layers of artificial neurons that enable the system to understand and replicate the intricate features of human speech. The result is a more dynamic and adaptable voice synthesis model that can produce highly realistic and nuanced speech.
There are typically two neural networks involved in modern voice synthesis systems. The first network converts text input into spectrograms, which represent the frequencies of sound waves over time. Then, a second network, often called a vocoder, converts these spectrograms into an audio waveform.
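To make this two-stage picture concrete, here is a structural sketch in PyTorch. It is not a real or trained architecture: the module names, layer sizes and the crude character-level tokenizer are illustrative assumptions, and real systems also predict durations or use attention to map text length to the number of spectrogram frames.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Stage 1: map a sequence of character IDs to mel-spectrogram frames."""
    def __init__(self, vocab_size=256, hidden=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, char_ids):                 # (batch, text_len)
        x, _ = self.encoder(self.embed(char_ids))
        return self.to_mel(x)                    # (batch, text_len, n_mels): one frame per character, for simplicity

class Vocoder(nn.Module):
    """Stage 2: map mel-spectrogram frames to raw audio samples."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.upsample = nn.Linear(n_mels, hop)   # crude frame -> 256 samples mapping

    def forward(self, mel):                      # (batch, frames, n_mels)
        return self.upsample(mel).flatten(1)     # (batch, frames * hop) waveform samples

text = "hello world"
char_ids = torch.tensor([[ord(c) for c in text]])
mel = AcousticModel()(char_ids)
audio = Vocoder()(mel)
print(mel.shape, audio.shape)  # torch.Size([1, 11, 80]) torch.Size([1, 2816])
```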
In 2016, Google's DeepMind lab introduced a new state-of-the-art model called WaveNet, which was able to generate human-like speech with remarkable accuracy and naturalness. Rather than assembling pre-recorded units or relying on a separate spectrogram-to-audio stage, WaveNet modeled the raw audio waveform directly, generating it sample by sample with a single neural network conditioned on features derived from the text. This approach significantly improved the quality and naturalness of voice synthesis, leading to more realistic and expressive speech output. WaveNet was integrated into Google Assistant in 2017, enhancing the user experience with more natural and human-like interactions.
What about voice synthesis with MY own voice?
We now have an overview of how different voice synthesis technologies work. But how does voice cloning relate to the techniques we have just covered?
Actually, voice cloning is a specific application of voice synthesis that aims to generate speech as close as possible to a particular target voice. The simplest way to do this is to follow the process described in the previous section, but with a dataset of audio recordings of the target voice. The idea is to record a lot of audio of the target voice (a few hours at least), and then train a deep neural network on this dataset. However, training a model from scratch on a few hours of audio is painful - it takes time to record the audio and to train the model, and it requires significant computational power.
For this reason, a few-shot learning approach can be used instead: fine-tuning a pre-trained model on a much smaller dataset of the target voice, typically just a few minutes of audio. This is far less computationally intensive, although the quality of the output is usually not as good as with a full dataset.
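The sketch below shows what such fine-tuning can look like in PyTorch, assuming you already have a pre-trained acoustic model and a handful of (text, mel-spectrogram) pairs from the target speaker. The model, the random "dataset" and the hyperparameters are placeholders: the point is only that most of the network stays frozen and a small part is adapted to the new voice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder "pre-trained" acoustic model (stands in for a real checkpoint).
embed = nn.Embedding(256, 128)
encoder = nn.GRU(128, 128, batch_first=True)
to_mel = nn.Linear(128, 80)

# Freeze the bulk of the network; only adapt the final projection to the new voice.
for p in list(embed.parameters()) + list(encoder.parameters()):
    p.requires_grad = False

# Tiny stand-in dataset: (character IDs, target mel frames) pairs that would
# come from a few minutes of the target speaker's recordings.
def encode(text):
    return torch.tensor([[ord(c) % 256 for c in text]])

dataset = [(encode("hello"), torch.randn(1, 5, 80)),
           (encode("world"), torch.randn(1, 5, 80))]

optimizer = torch.optim.Adam(to_mel.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

for epoch in range(3):
    for char_ids, target_mel in dataset:
        hidden, _ = encoder(embed(char_ids))
        pred_mel = to_mel(hidden)            # predicted spectrogram frames
        loss = loss_fn(pred_mel, target_mel) # compare against the target voice
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```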
Voice synthesis has come a long way over the last few decades! And today, there is no need to get too technical: several off-the-shelf solutions let you clone your voice in just a few minutes. I implemented a small solution with Eleven Labs in the first part of this article - check it out!
Feel free to reach out to me on LinkedIn - I’m a software engineer at Theodo. I would be happy to discuss this project or any other topic related to AI, voice synthesis, or technology in general.