Building an AI Voice Assistant with Groq and Deepgram (Part 1)
A local Python voice assistant app
June 12, 2024
Have you ever thought about creating your own AI voice assistant, like a personal Siri or Alexa? With the latest progress in AI and some powerful new tools, it's now a lot easier and more affordable than ever before.
In this 3-part series we'll go through the process, step by step, of building a fast and interactive voice assistant using Python, React, Deepgram, Llama 3 and Groq:
- Part 1: A local Python voice assistant app. Here, in the first part, we'll start with a basic Python local app that shows off the core features.
- Part 2: A Python backend with FastAPI & WebSockets. In the second part, we'll create the Python backend for the complete full-stack web app.
- Part 3: An interactive React user interface. In the final part, we'll put together the React frontend with a friendly UI.
Along the way, we will also learn about asynchronous programming in Python, WebSockets, FastAPI and Web APIs for recording and playing audio in the browser.
Want to jump ahead and see what we'll be making? Check out the finished voice assistant app.
Let's get started!
#1. System Overview
Let's take a moment to understand the core elements of our AI voice assistant and how they work together. Essentially, the assistant is made up of three main components:
- Speech-to-Text (STT): This component listens to your voice and transcribes it into text. We'll be using Deepgram's Nova 2, currently the fastest and most accurate speech-to-text API available.
- Large Language Model (LLM): Once your speech is transcribed, the text is sent to a language model along with any previous messages in the conversation. The LLM is the brain of the assistant, capable of understanding the context and generating human-like responses. We'll be using Meta's recently released open-source Llama 3, running on Groq, the world's fastest AI inference technology. And they keep getting faster!
- Text-to-Speech (TTS): After the LLM generates a response, the text is sent to a TTS model that converts it into speech. For this, we'll use Deepgram Aura.
These three components run in a continuous loop, and with the speed of both Groq and Deepgram, we can create a voice assistant that can understand and respond to you in real-time. If you have never tried Groq or Deepgram before, I highly recommend them!
Moreover, using an open-source LLM like Llama 3 offers many advantages: it's more affordable, it gives you more control, customization and data privacy, and it reduces your dependency on LLM providers. With some additional work, it would be feasible to turn this voice assistant into a fully open-source solution that can be self-hosted, giving you complete control over your data.
Now that we have a conceptual understanding of how the voice assistant works, let's dive into the code and see how it all comes together!
#2. Technical Deep Dive
As mentioned before, in this first part of the series we'll build a straightforward Python app that showcases the fundamental building blocks of the assistant. We'll save the complete full-stack web application for Parts 2 and 3.
You can find all the code in this GitHub repository, with an MIT license that lets you use it and build on it however you like. The repository includes the complete application, but this first part of the series focuses only on the local Python assistant found in this file.
#2.1. Main Loop
The heart of our application is the `run()` function, which orchestrates the interaction between the user and the assistant. This main loop continues to run until the user chooses to end the conversation.
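Here is a minimal sketch of that loop. The structure follows the breakdown below, but the exact system message, the `memory_size` value and the output formatting are placeholders:

```python
async def run():
    system_message = {
        "role": "system",
        "content": "You are a helpful and enthusiastic assistant. ...",  # see section 2.4
    }
    memory_size = 10  # placeholder: number of previous messages sent to the LLM
    messages = []     # conversation history (user and assistant messages)

    while True:
        # Listen to the user and transcribe the speech into text
        user_message = await transcribe_audio()
        messages.append({"role": "user", "content": user_message})

        # End the conversation if the user said "bye" or "goodbye"
        if should_end_conversation(user_message):
            break

        # Generate the assistant's response from the recent conversation history
        assistant_message = await assistant_chat(
            [system_message] + messages[-memory_size:]
        )
        messages.append({"role": "assistant", "content": assistant_message})
        print(assistant_message)

        # Convert the response to speech and play it
        text_to_speech(assistant_message)
```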
Here's a breakdown of what's happening:
- We start with a few definitions: `system_message` sets the initial context for the conversation and helps guide the LLM's responses to align with our desired assistant's personality and behavior. `memory_size` limits the number of previous messages the LLM considers when generating a response. And `messages` keeps track of the conversation history, storing both user and assistant messages.
- We then enter the infinite loop, which will keep the conversation going until we break out from it. Inside the loop, we first call the `transcribe_audio` function to listen to the user's speech and convert it into text. This function returns the transcribed text as `user_message`, and we append it to the `messages` list.
- The `should_end_conversation` function checks if the user's message ends with the words "bye" or "goodbye" to end the conversation.
- If the conversation continues, we call the `assistant_chat` function to generate the assistant's text response. This function passes to the LLM the `system_message`, the last `memory_size` messages of the conversation history, and the current user message. The LLM generates a response that we store as `assistant_message`, append to the `messages` list and print to the screen. [1]
- Finally, we call the `text_to_speech` function to convert the `assistant_message` into speech and play the audio, allowing the user to hear the assistant's response.
If you are curious about the `should_end_conversation` function, it simply removes punctuation, converts the text to lowercase and uses a regular expression to check for the words "bye" or "goodbye" at the end:
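A minimal sketch consistent with that description:

```python
import re
import string


def should_end_conversation(text: str) -> bool:
    # Remove punctuation and normalize to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.strip().lower()
    # End the conversation if the last word is "bye" or "goodbye"
    return re.search(r"\b(bye|goodbye)$", text) is not None
```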
#2.2. Asynchronous Programming in Python
You might have noticed that the Python code in this project uses the asyncio library and features `async`/`await` syntax.
Asynchronous programming allows us to write code that can perform multiple tasks concurrently without blocking the execution of other parts of the program. This is particularly useful for I/O-bound operations like making API requests or waiting for user input, as it allows the program to handle other tasks while waiting for these operations to complete.
This is not critical for our basic local assistant. But it will be much more important for the final web application, especially if you intend to serve multiple users.
If you're new to asynchronous programming in Python, don't worry! The basic idea is that functions defined with `async def` are coroutines, which can be paused and resumed during execution. The `await` keyword is used to wait for the result of a coroutine without blocking the rest of the program.
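As a quick illustration (a toy example, unrelated to the assistant code), two coroutines can run concurrently like this:

```python
import asyncio


async def say_after(delay: float, message: str) -> None:
    # While this coroutine is sleeping, the event loop can run other coroutines
    await asyncio.sleep(delay)
    print(message)


async def main() -> None:
    # Both coroutines run concurrently, so this takes ~2 seconds instead of ~3
    await asyncio.gather(say_after(1, "hello"), say_after(2, "world"))


asyncio.run(main())
```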
For a very intuitive explanation of async programming in Python, check out the FastAPI docs by @tiangolo.
#2.3. Speech-to-Text using Deepgram's Nova 2
To convert the user's speech into text, we'll rely on Deepgram's Nova 2 model and its Live Streaming API, using Deepgram's Python SDK. The code for this section is based on the Live Streaming Getting Started guide and examples from their Python SDK repository. To create a new Deepgram account, follow this guide.
In order to run the `transcribe_audio()` function, we first need to create the Deepgram client and configure the options for live transcription, loading the API key as an environment variable: [2]
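Here is a sketch of that setup, based on the Deepgram Python SDK (v3) getting-started examples. The specific option values are illustrative, and the API key is read directly from the environment here for brevity (the actual project uses Pydantic Settings, see note [2]):

```python
import os

from deepgram import DeepgramClient, DeepgramClientOptions, LiveOptions

# Keepalive maintains an uninterrupted connection during pauses in the audio
config = DeepgramClientOptions(options={"keepalive": "true"})
deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"], config)

dg_options = LiveOptions(
    model="nova-2",
    language="en",
    smart_format=True,
    encoding="linear16",
    channels=1,
    sample_rate=16000,
    interim_results=True,     # receive provisional transcripts as early as possible
    endpointing=300,          # ms of silence before a speech_final transcript
    utterance_end_ms="1000",  # gap in words that triggers an UtteranceEnd event
)
```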
You can check Deepgram's docs to learn more about the configuration. Some interesting concepts:
- The keepalive option is used to maintain an uninterrupted connection.
- Interim Results provide preliminary transcripts (which we can display to the user as early as possible) followed by final transcripts once maximum accuracy has been reached.
- The Endpointing feature returns transcripts when pauses in speech are detected and allows us to detect when a speaker finishes speaking.
- The Utterance End feature addresses the limitations of endpointing and helps us identify the end of speech when a sufficiently long gap in words has occurred.
This is the code for the `transcribe_audio()` function with comments explaining the different parts:
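Below is a skeleton of the function, based on Deepgram's live-streaming examples. It reuses the `deepgram` client and `dg_options` from the previous snippet, and the two event handlers (shown in the next snippets) are assumed to be nested at the marked point so they can access the function's local variables:

```python
import asyncio

from deepgram import LiveTranscriptionEvents, Microphone


async def transcribe_audio() -> str:
    transcript_parts: list[str] = []
    full_transcript = ""
    # asyncio Event used to signal that the end of speech has been detected
    transcription_complete = asyncio.Event()

    # ... on_message and on_utterance_end are defined here (see the next snippets) ...

    try:
        # Open a websocket connection to Deepgram's live transcription endpoint
        dg_connection = deepgram.listen.asynclive.v("1")

        # Register the event handlers for transcripts and utterance-end events
        dg_connection.on(LiveTranscriptionEvents.Transcript, on_message)
        dg_connection.on(LiveTranscriptionEvents.UtteranceEnd, on_utterance_end)

        # Start the connection with the live transcription options
        if await dg_connection.start(dg_options) is False:
            raise Exception("Failed to connect to Deepgram")

        # Stream the microphone audio to Deepgram until the end of speech is detected
        microphone = Microphone(dg_connection.send)
        microphone.start()
        await transcription_complete.wait()

        # Stop the microphone, close the connection and return the transcript
        microphone.finish()
        await dg_connection.finish()
        return full_transcript

    except Exception as e:
        print(f"Error during transcription: {e}")
        return ""
```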
One interesting detail is that we are using an asyncio Event to signal the end of the transcription. This means that we can asynchronously await the end of the transcription by doing `await transcription_complete.wait()`. When we finally detect the end of speech, we will set this event with `transcription_complete.set()`, and the code will proceed to close the connection and return the `full_transcript`.
Now let's examine the `on_message` event handler that listens for any transcripts to be received:
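A sketch of the handler, assuming it is nested inside `transcribe_audio()` so it can access `transcript_parts`, `full_transcript` and `transcription_complete` (the field names follow Deepgram's live transcription responses):

```python
async def on_message(self, result, **kwargs):
    nonlocal full_transcript
    sentence = result.channel.alternatives[0].transcript
    if len(sentence) == 0:
        return

    if result.is_final:
        # Final transcript for this audio segment: collect it
        transcript_parts.append(sentence)
        if result.speech_final:
            # End of speech detected: join the parts and signal completion
            full_transcript = " ".join(transcript_parts)
            transcription_complete.set()
    else:
        # Interim result: a provisional transcript we can show to the user right away
        print(f"{sentence}...", end="\r")
```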
There are three types of transcripts that we can receive:
- Interim results (`is_final` is False), which are provisional transcripts that we can display early to the user.
- Final transcripts (`is_final` is True), which are received when maximum accuracy has been reached and we collect in the `transcript_parts` list.
- End of speech transcripts (`speech_final` is True), which identify that the speaker has finished talking. We can then join the `transcript_parts` into the `full_transcript` and set the `transcription_complete` event to signal the end of the transcription.
As mentioned before, the Utterance End feature allows us to detect the end of speech when the Endpointing feature fails. The `on_utterance_end` event handler takes care of that:
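A sketch of this handler, again assumed to be nested inside `transcribe_audio()`:

```python
async def on_utterance_end(self, utterance_end, **kwargs):
    nonlocal full_transcript
    if len(transcript_parts) > 0:
        # A long enough gap in words occurred without a speech_final transcript,
        # so we treat it as the end of speech
        full_transcript = " ".join(transcript_parts)
        transcription_complete.set()
```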
#2.4. Large Language Model: Llama 3 on Groq
With the user's speech transcribed into text, the next step is to generate the assistant's response. For this, we'll be using Meta's Llama 3 language model running on Groq's inference platform. To get started with Groq, follow this guide.
Groq's Chat Completions API allows us to have a conversation with the language model by sending a series of messages. Each message is a dictionary that includes a `role` and `content`, and there are three possible roles:
- `'system'`: This message sets the overall behavior and personality of the assistant. It's typically the first message sent and acts as an instruction for how the model should behave throughout the conversation.
- `'user'`: These messages represent the user's input, in our case, the transcribed speech.
- `'assistant'`: These messages contain the model's responses to the user's messages.
We want to make sure that the answers are brief to ensure a natural back-and-forth flow suitable for voice interaction. We can instruct the model to do this using the system message:
“You are a helpful and enthusiastic assistant. Speak in a human, conversational tone.
Keep your answers as short and concise as possible, like in a conversation, ideally no more than 120 characters.”
The system message we're using here is quite simple, but you can easily customize it any way you like. Consider adding more personality details (e.g. humorous, ironic) or giving it a specific background like a life coach.
To generate a response we use the following function:
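A minimal sketch using Groq's Python SDK; the client setup and model name follow Groq's documentation, and the exact signature in the repository may differ:

```python
import os

from groq import AsyncGroq

groq_client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])


async def assistant_chat(messages: list[dict], model: str = "llama3-8b-8192") -> str:
    # messages = the system message plus the last memory_size messages of the conversation
    res = await groq_client.chat.completions.create(messages=messages, model=model)
    return res.choices[0].message.content
```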
We are using the smallest Llama 3 model (Llama 3 8B) as it will be the fastest and therefore ideal for a real-time voice assistant. But you can also try the more powerful Llama 3 70B, as well as the other models provided by Groq.
#2.5. Text-to-Speech using Deepgram's Aura
The final step in our voice assistant pipeline is converting the generated text response back into speech. For this, we'll use Deepgram's Aura Text-to-Speech API.
To have the lowest latency, we'll use audio output streaming and start streaming the audio as soon as the first byte arrives.
This is the URL for the TTS API, including as parameters the Aura model, the audio encoding and the sample rate:
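For example (the Aura model, encoding and sample rate values shown here are illustrative):

```python
DEEPGRAM_TTS_URL = (
    "https://api.deepgram.com/v1/speak"
    "?model=aura-asteria-en&encoding=linear16&sample_rate=24000"
)
```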
And this is the simple function that converts the text to speech: [3]
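A sketch consistent with the breakdown below; the header names are assumptions, and the `DEEPGRAM_TTS_URL` constant comes from the previous snippet:

```python
import os
import wave

import pyaudio
import requests


def text_to_speech(text: str) -> None:
    headers = {
        "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
        "Content-Type": "application/json",
    }
    # stream=True lets us start reading the audio as soon as the first bytes arrive
    res = requests.post(DEEPGRAM_TTS_URL, headers=headers, json={"text": text}, stream=True)

    # Read the streamed audio data as a WAV file
    with wave.open(res.raw, "rb") as wf:
        p = pyaudio.PyAudio()
        stream = p.open(
            format=p.get_format_from_width(wf.getsampwidth()),
            channels=wf.getnchannels(),
            rate=wf.getframerate(),
            output=True,
        )
        # Play the audio in chunks of 1024 frames
        while len(data := wf.readframes(1024)):
            stream.write(data)
        stream.close()
        p.terminate()
```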
A quick breakdown of what's happening:
- Using Python's Requests library, we send a POST request to the TTS API URL, including the text in the request body and setting `stream=True` to receive the response as a stream of audio data.
- The wave module allows us to read the streamed audio data as a WAV file.
- We use the PyAudio library to play the audio stream from the WAV file, reading the data in chunks of 1024 frames.
#2.6. Running the Code
To run the code for the local Python assistant, follow these steps in your terminal or command-line interface:
- Make sure you have Python (3.11 or higher) installed on your system.
- Install Poetry, the package manager used in this project. You can find the installation instructions here.
- Clone the GitHub repository containing the code, navigate to the backend folder and install the project dependencies using Poetry.
- Create a `.env` file in the backend folder by copying the `.env.example` file provided, and set the required environment variables:
  - `GROQ_API_KEY`: Your Groq API key.
  - `DEEPGRAM_API_KEY`: Your Deepgram API key.
- Run the local Python assistant script using the provided Poetry script.
And there you have it! You now have a working local Python assistant that showcases the core functionality.
In Part 2, we'll take things to the next level by building the Python backend with FastAPI for the complete full-stack web app.
#Notes
[1] We are printing colored text to the screen thanks to the Console API of the Rich library, which allows us to easily format and style our output.
[2] We are using Pydantic Settings to manage the application settings and loading the environment variables. We will see this in more detail in Part 2.
[3] Note that I implemented the `text_to_speech` function synchronously to simplify the code and the integration with PyAudio. But in the final web app all functions will be asynchronous for optimal performance.