July 9, 2024 • 5 min read

How I built an AI-powered phone answering machine with the power of Voice Synthesis and GenAI - Part I

Written by Hugo Ruy


A few weeks ago, I was talking with a friend who is an entrepreneur. She told me how hard it was for her to find focus time during the day: she was constantly interrupted by phone calls. The next day, I was blown away by the demo of GPT-4o and its human-like vocal interactions. I wondered: what if we could combine GenAI and voice synthesis to create a virtual clone of ourselves that would answer the phone on our behalf?

My first intuition was that such a solution could not sound realistic enough to fool the interlocutor. Still, I was deeply curious to see how close it could get. Following this idea, I started experimenting with voice synthesis and AI. The goal was to build an AI that would answer my phone, pretend to be me, and have a natural conversation with my interlocutor, then offer to call them back as soon as I was available.

You can directly jump to the repository here or to the results of my experiments with voice synthesis.

Exploring the potential and limits of voice synthesis and Text-To-Speech (TTS) solutions combined with OpenAI services

The architecture schema of the voice synthesis project

I built this project with a React/TypeScript frontend and a FastAPI & Pydantic backend. I tried several APIs to generate a human-like conversation. Since GPT-4o's voice features were not yet released, the first step was to transcribe the vocal input of my interlocutor. I used OpenAI's Whisper model through the Speech-To-Text endpoint. Then, I prompted OpenAI's GPT-4o API to generate a written answer - MY written answer. Finally, for the voice synthesis, I tried several TTS solutions to iterate on the realism of the vocal output.

I did not implement the phone routing solution, but it is possible to do it with Twilio. If you wish to do so, you can find more information here.

I built a simple API with three routes, corresponding to the three steps of the process:

Three endpoints: transcribe, get the answer, and synthesize the voice

How to transcribe my interlocutor’s questions with OpenAI’s Whisper API

The first part of the process is to transcribe the vocal input of my interlocutor. To do so, I created an AudioRecorder component that lets my "interlocutor" record themselves as if they were talking with me on the phone. To keep the code simple, the user has to click a Start/Stop button when they talk.

Once the audio is recorded, it is sent to the backend with the /transcribe endpoint. The transcription is then sent back to the frontend and added to a messages array stored in the frontend.

The transcribe endpoint is pretty straightforward. It uses the OpenAI Client to call the Whisper API and transcribe the audio.

The Whisper speech-to-text model is pretty astonishing. The latency is very good, which is key for a real-time voice synthesis solution. The rare mistakes it makes do not affect the clarity of the message.

Great! We now have a (multi-lingual) functional transcription system.

A basic example of a discussion

Prompting OpenAI's GPT-4o API to generate a written answer

The next step is to generate a written answer to the transcription with an LLM. I used OpenAI's GPT-4o API to do so. It took a bit of iteration on the prompt to get a realistic answer. Since my use case was in French, I wrote the prompt in French as well. Its final version is:

The endpoint /getAnswer concatenates this system prompt with the whole discussion transcribed to get a natural language written answer.

The answer is then displayed in the frontend, and the user can answer back. This process is repeated until the user decides to end the conversation.

So far so good! We’ve got a written answer - MY answer - to a vocal input. Now let’s get to the fun part: how to actually trick your friend into believing they are talking to another human being!

Using ElevenLabs’ solution to clone my voice and synthesize the answer

To clone my voice, I tried out a few solutions: the native Web Speech API, OpenAI’s TTS model, ResembleAI, and ElevenLabs. I finally decided to go with ElevenLabs’ solution - stay tuned for the second part of this article to learn more about this, alongside a fascinating overview of the history of voice synthesis!

First things first: to use my voice to synthesize an answer, I must clone it. Luckily for us, this can be done very easily on ElevenLabs’ web interface. It only requires a few minutes of audio, and the cloned voice is ready to use after a few seconds (!).

I must admit, I was really amazed by the quality of the voice generated by ElevenLabs. The voice was very close to mine, with a very low latency.

Base voice - EN

Cloned Voice with Eleven Labs - EN

The intonation or pitch was sometimes a bit off, but overall, the results were promising. I decided to try it out to synthesize my answer generated by GPT4o.

I defined an endpoint /synthesize that uses my voice to synthesize a text.

A few parameters are available with this voice synthesis API:

  • A streaming latency optimization parameter, which trades some quality for a faster response,
  • A stability parameter, which controls how much the voice varies. Below a threshold around 0.65, the audio output starts to be incoherent, with some words being cut or repeated. Above 0.7, the voice is stable, and the higher the value, the more monotone the voice becomes. Hence, 0.7 is a good compromise between stability and expressiveness,
  • A similarity boost parameter, which controls how close the generated voice is to the original one. The higher it is, the more the output will sound like the original voice,
  • A speaker boost boolean parameter, which adds more emotion to the voice at a small cost in latency and stability

To have a functional prototype, we only have to call the backend to synthesize the voice of my AI assistant and directly play the fetched audio in the frontend.

The latency remains low, even without optimizing the synthesis for streaming. I asked some friends to talk with my fellow AI-self, and they were all surprised by the quality of the voice. I recorded an example of a (multi-lingual!) discussion one of them had with my AI-self.


However, to be perfectly honest, I do not expect this project to be a success in the real world. The voice generated by ElevenLabs is close to mine, but is too monotone - even the most hungover version of myself would not talk like this on the phone. Playing with parameters such as stability or similarity allows a more "genuine", expressive answer, but brings instability into the model that can introduce bugs - some words get cut or repeated in the audio output. I'm really curious about the level of quality that can be achieved with ElevenLabs' professional voice cloning, though!

Towards a more realistic solution

This project is pretty basic in its implementation of the conversation. For a real-life system, I should implement proper automatic end-of-speech detection. For instance, this could be done by checking whether the audio level stays below a certain threshold for a certain amount of time. I could run this check directly in the browser, with the WebAudio API.
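
The idea can be sketched as follows - shown here in Python over raw PCM samples, though in the browser the same logic would run on WebAudio API analyser frames (thresholds and frame size are illustrative guesses):

```python
# Threshold-based end-of-speech detection: the caller is considered done
# speaking once the signal stays quiet for `silence_seconds`.
def is_end_of_speech(samples, sample_rate, level_threshold=0.02,
                     silence_seconds=1.5, frame_size=1024):
    """Return True if the trailing `silence_seconds` of audio are quiet."""
    needed_frames = int(silence_seconds * sample_rate / frame_size)
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    if len(frames) < needed_frames:
        return False  # not enough audio yet to decide
    for frame in frames[-needed_frames:]:
        # Root-mean-square level of the frame
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms > level_threshold:
            return False  # still speaking
    return True
```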

Currently, the answer we get is written on a fun, informal tone. To make it more realistic, we could add some more complex logic to the system, such as a categorization system to adapt the tone of the answer depending on the category of the contact - you definitely do not want to talk the same way to your boss and to your best friend!

Concerning the generation of the written answer, we could go further and ask GPT4o to fetch my availabilities from my Google Calendar. This could be done by creating an assistant on the OpenAI Platform and creating a function that would call the Google Calendar API to get my availabilities. A good practice in this case would be to create an intermediate layer in charge of keeping the Google Calendar API Key safe, so that the assistant cannot do anything malicious with it - learn more about LLM security.
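
Such a function could be declared to the assistant with a tool definition along these lines (get_availabilities and its backend are hypothetical; the intermediate layer holding the Google Calendar credentials would implement it):

```python
# Hypothetical tool definition exposing calendar availabilities to the
# assistant, following OpenAI's function-calling schema.
AVAILABILITY_TOOL = {
    "type": "function",
    "function": {
        "name": "get_availabilities",
        "description": "Return Hugo's free time slots for a given day",
        "parameters": {
            "type": "object",
            "properties": {
                "date": {
                    "type": "string",
                    "description": "ISO date, e.g. 2024-07-09",
                },
            },
            "required": ["date"],
        },
    },
}
```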

Finally, there is currently no action when the conversation is over. In a real-life system, the app should store the conversation in a database and send a notification to let the user know they have a new missed call.

Wrapping up my experiments with GenAI and voice synthesis

This project was a lot of fun to work on, and I learned a lot about voice synthesis and AI. I am amazed by the performance of the voice cloning services of ElevenLabs.

This technology, coupled with AI, opens up a world of possibilities but also blurs the frontier between what is real and what is not. It raises important ethical questions about the use of synthetic voices and the potential for misuse, such as deepfakes or impersonation. Actually, some cases of fraud have already been reported, where criminals call employees of a company, pretending to be the CEO and talking with the CEO's actual voice, and ask them to transfer money to a fraudulent account. Hence, it is crucial to be aware of these risks, even more considering the fast pace at which this technology is evolving.

I hope you enjoyed reading this article as much as I enjoyed writing it! If you have any questions or comments regarding the code or this topic in general, feel free to reach out to me on LinkedIn. I would be happy to discuss this project or any other topic related to AI, voice synthesis, or technology in general.

This article was written by

Hugo Ruy