
Building an AI Voice Assistant with Groq and Deepgram (Part 1)

A local Python voice assistant app

June 12, 2024

Have you ever thought about creating your own AI voice assistant, like a personal Siri or Alexa? With the latest progress in AI and some powerful new tools, it's now a lot easier and more affordable than ever before.

In this 3-part series we'll go through the process, step by step, of building a fast and interactive voice assistant using Python, React, Deepgram, Llama 3 and Groq:

  1. Part 1 (this post): A local Python voice assistant that showcases the core building blocks.
  2. Part 2: The Python backend with FastAPI for the complete full-stack web app.
  3. Part 3: The React frontend that completes the application.

Along the way, we will also learn about asynchronous programming in Python, WebSockets, FastAPI and Web APIs for recording and playing audio in the browser.

Want to jump ahead and see what we'll be making? Check out the finished voice assistant app by clicking below:

Voice assistant application

Let's get started!

#1. System Overview

Let's take a moment to understand the core elements of our AI voice assistant and how they work together. Essentially, the assistant is made up of three main components:

  1. Speech-to-Text (STT): This component listens to your voice and transcribes it into text. We'll be using Deepgram's Nova 2, currently the fastest and most accurate speech-to-text API available.
  2. Large Language Model (LLM): Once your speech is transcribed, the text is sent to a language model along with any previous messages in the conversation. The LLM is the brain of the assistant, capable of understanding the context and generating human-like responses. We'll be using Meta's recently released open-source Llama 3, running on Groq, the world's fastest AI inference technology. And they keep getting faster!
  3. Text-to-Speech (TTS): After the LLM generates a response, the text is sent to a TTS model that converts it into speech. For this, we'll use Deepgram Aura.

These three components run in a continuous loop, and with the speed of both Groq and Deepgram, we can create a voice assistant that can understand and respond to you in real-time. If you have never tried Groq or Deepgram before, I highly recommend them!
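To make the idea concrete, here is a toy, text-only version of that loop with simple stand-ins for each component (purely illustrative; the real implementation starts in section 2.1):

# Toy, text-only version of the assistant loop (illustrative stand-ins, no real APIs)
def listen():
    return input('You: ')              # stand-in for speech-to-text (Deepgram Nova 2)

def respond(text):
    return f'You said: {text}'         # stand-in for the LLM (Llama 3 on Groq)

def speak(text):
    print(f'Assistant: {text}')        # stand-in for text-to-speech (Deepgram Aura)

while True:
    user_text = listen()
    if user_text.lower().strip() in ('bye', 'goodbye'):
        break
    speak(respond(user_text))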

Moreover, using an open-source LLM like Llama 3 offers many advantages: it's more affordable, it gives you more control, customization and data privacy, and it reduces your dependency on LLM providers. With some additional work, it would be feasible to turn this voice assistant into a fully open-source solution that can be self-hosted, giving you complete control over your data.

Now that we have a conceptual understanding of how the voice assistant works, let's dive into the code and see how it all comes together!

#2. Technical Deep Dive

As mentioned before, in this first part of the series we'll build a straightforward Python app that showcases the fundamental building blocks of the assistant. We'll save the complete full-stack web application for Parts 2 and 3.

You can find all the code in this GitHub repository, with an MIT license that lets you use it and build on it however you like. The repository includes the complete application, but this first part of the series focuses only on the local Python assistant found in this file.

#2.1. Main Loop

The heart of our application is the run() function, which orchestrates the interaction between the user and the assistant. This main loop continues to run until the user chooses to end the conversation.

async def run():
    system_message = {'role': 'system', 'content': SYSTEM_PROMPT}
    memory_size = 10
    messages = []
    while True:
        user_message = await transcribe_audio()
        messages.append({'role': 'user', 'content': user_message})

        if should_end_conversation(user_message):
            break

        assistant_message = await assistant_chat([system_message] + messages[-memory_size+1:])
        messages.append({'role': 'assistant', 'content': assistant_message})
        console.print(assistant_message, style='dark_orange')
        text_to_speech(assistant_message)

Here's a breakdown of what's happening:

  1. We start with a few definitions:
    • system_message sets the initial context for the conversation. This message helps guide the LLM's responses to align with our desired assistant's personality and behavior.
    • memory_size limits the number of previous messages the LLM considers when generating a response.
    • messages keeps track of the conversation history, storing both user and assistant messages.
  2. We then enter the infinite loop, which will keep the conversation going until we break out from it. Inside the loop, we first call the transcribe_audio function to listen to the user's speech and convert it into text. This function returns the transcribed text as user_message, and we append it to the messages list.
  3. The function should_end_conversation checks whether the user's message ends with the word "bye" or "goodbye"; if it does, we break out of the loop and end the conversation.
  4. If the conversation continues, we call the assistant_chat function to generate the assistant's text response. This function sends the LLM the system_message along with the most recent messages of the conversation history (bounded by memory_size), which include the current user message. The LLM generates a response that we store as assistant_message, append to the messages list and print to the screen. [1]
  5. Finally, we call the text_to_speech function to convert the assistant_message into speech and play the audio, allowing the user to hear the assistant's response.

If you are curious about the should_end_conversation function, it simply removes punctuation, converts the text to lowercase and uses a regular expression to check for the words “bye” or “goodbye” at the end:

def should_end_conversation(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.strip().lower()
    return re.search(r'\b(goodbye|bye)\b$', text) is not None
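For example (illustrative inputs):

should_end_conversation('Thanks for your help. Goodbye!')   # True
should_end_conversation('Did you say goodbye to her?')      # False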

#2.2. Asynchronous Programming in Python

You might have noticed that the Python code in this project uses the asyncio library and features async/await syntax.

Asynchronous programming allows us to write code that can perform multiple tasks concurrently without blocking the execution of other parts of the program. This is particularly useful for I/O-bound operations like making API requests or waiting for user input, as it allows the program to handle other tasks while waiting for these operations to complete.

This is not critical for our basic local assistant. But it will be much more important for the final web application, especially if you intend to serve multiple users.

If you're new to asynchronous programming in Python, don't worry! The basic idea is that functions defined with async def are coroutines, which can be paused and resumed during execution. The await keyword is used to wait for the result of a coroutine without blocking the rest of the program.

For a very intuitive explanation of async programming in Python, check out the FastAPI docs by @tiangolo.
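If you want a tiny, self-contained example (unrelated to the assistant's code), the following script runs two simulated I/O operations concurrently and finishes in about two seconds instead of three:

import asyncio

async def fake_request(name, delay):
    # Simulates an I/O-bound operation, like an API call
    await asyncio.sleep(delay)
    return f'{name} done'

async def main():
    # Both coroutines run concurrently thanks to asyncio.gather
    results = await asyncio.gather(fake_request('task 1', 1), fake_request('task 2', 2))
    print(results)

asyncio.run(main())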

#2.3. Speech-to-Text using Deepgram's Nova 2

To convert the user's speech into text, we'll rely on Deepgram's Nova 2 model and its Live Streaming API, using Deepgram's Python SDK. The code for this section is based on the Live Streaming Getting Started guide and examples from their Python SDK repository. To create a new Deepgram account, follow this guide.

In order to run the transcribe_audio() function, we first need to create the Deepgram client and configure the options for live transcription, loading the API key as an environment variable: [2]

deepgram_config = DeepgramClientOptions(options={'keepalive': 'true'})
deepgram = DeepgramClient(settings.DEEPGRAM_API_KEY, config=deepgram_config)
dg_connection_options = LiveOptions(
    model='nova-2',
    language='en',
    # Apply smart formatting to the output
    smart_format=True,
    # Raw audio format details
    encoding='linear16',
    channels=1,
    sample_rate=16000,
    # To get UtteranceEnd, the following must be set:
    interim_results=True,
    utterance_end_ms='1500',
    vad_events=True,
    # Time in milliseconds of silence to wait for before finalizing speech
    endpointing=500,
)

You can check Deepgram's docs to learn more about the configuration. Some interesting concepts:

  • Endpointing: after a configurable period of silence (the endpointing parameter, 500 ms here), Deepgram finalizes the current segment of speech and marks it with speech_final=True.
  • Interim Results: provisional transcripts sent while you are still speaking, later superseded by the final transcript of that segment (is_final=True).
  • Utterance End: a separate UtteranceEnd event that detects the end of speech from the gaps between transcribed words (utterance_end_ms); it requires interim_results and acts as a fallback when Endpointing doesn't trigger.
  • VAD events: voice activity detection events (such as SpeechStarted) that signal when speech begins.

This is the code for the transcribe_audio() function with comments explaining the different parts:

async def transcribe_audio():
    transcript_parts = []
    full_transcript = ''
    # Event to signal transcription is complete
    transcription_complete = asyncio.Event()

    try:
        # Create a websocket connection to Deepgram
        dg_connection = deepgram.listen.asynclive.v('1')

        # Register the event handlers
        dg_connection.on(LiveTranscriptionEvents.Transcript, on_message)
        dg_connection.on(LiveTranscriptionEvents.UtteranceEnd, on_utterance_end)
        dg_connection.on(LiveTranscriptionEvents.Error, on_error)

        # Start the connection
        if await dg_connection.start(dg_connection_options) is False:
            console.print('Failed to connect to Deepgram')
            return

        # Open a microphone stream on the default input device
        microphone = Microphone(dg_connection.send)
        microphone.start()
        console.print('\nListening...\n')

        # Wait for the transcription to complete
        await transcription_complete.wait()

        # Close the microphone and the Deepgram connection
        microphone.finish()
        await dg_connection.finish()
        return full_transcript

    except Exception as e:
        console.print(f'Could not open socket: {e}')
        return

One interesting detail is that we are using an asyncio Event to signal the end of the transcription. This lets us asynchronously wait for the transcription to finish with await transcription_complete.wait(). When we detect the end of speech, we set the event with transcription_complete.set(), and the code proceeds to close the connection and return the full_transcript.
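In isolation, this wait/set pattern looks like the following (a minimal illustration, unrelated to the Deepgram code):

import asyncio

async def main():
    done = asyncio.Event()

    async def worker():
        await asyncio.sleep(1)   # Stand-in for receiving and processing transcripts
        done.set()               # Signal that we are finished

    task = asyncio.create_task(worker())
    await done.wait()            # Suspends here without blocking the event loop
    await task
    print('Event was set, continuing')

asyncio.run(main())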

Now let's examine the on_message event handler that listens for any transcripts to be received:

async def on_message(self, result, **kwargs):
    nonlocal transcript_parts, full_transcript
    sentence = result.channel.alternatives[0].transcript
    if len(sentence) == 0:
        return
    if result.is_final:
        # Collect these and concatenate them together when speech_final is True
        transcript_parts.append(sentence)
        console.print(sentence, style='cyan')

        # Sufficient silence detected to consider this end of speech
        if result.speech_final:
            full_transcript = ' '.join(transcript_parts)
            transcription_complete.set()
    else:
        # Interim results
        console.print(sentence, style='cyan', end='\r')

There are three types of transcripts that we can receive:

  • Interim results (is_final=False): provisional transcripts that we print and overwrite in place while the user is still speaking.
  • Final results (is_final=True): the confirmed transcript for a segment of speech, which we collect in transcript_parts.
  • Speech final (speech_final=True): a final result that also marks the end of speech after sufficient silence, at which point we join the collected parts and set the transcription_complete event.

As mentioned before, the Utterance End feature allows us to detect the end of speech when the Endpointing feature fails. The on_utterance_end event handler takes care of that:

async def on_utterance_end(self, error, **kwargs):
    nonlocal transcript_parts, full_transcript
    if len(transcript_parts) > 0:
        full_transcript = ' '.join(transcript_parts)
        transcription_complete.set()
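The on_error handler registered earlier isn't shown in this walkthrough; a minimal sketch of what it could look like (my assumption here is that we also set the event so transcribe_audio doesn't wait forever):

async def on_error(self, error, **kwargs):
    # Report the error and unblock transcribe_audio (assumption, not necessarily the repository's exact code)
    console.print(f'Deepgram connection error: {error}')
    transcription_complete.set()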

#2.4. Large Language Model: Llama 3 on Groq

With the user's speech transcribed into text, the next step is to generate the assistant's response. For this, we'll be using Meta's Llama 3 language model running on Groq's inference platform. To get started with Groq, follow this guide.

Groq's Chat Completions API allows us to have a conversation with the language model by sending a series of messages. Each message is a dictionary that includes a role and content, and there are three possible roles:

  • system: sets the behavior and personality of the assistant.
  • user: the messages coming from the user.
  • assistant: the responses previously generated by the model.
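For example, a short exchange could be represented like this (illustrative content):

messages = [
    {'role': 'system', 'content': 'You are a helpful and enthusiastic assistant.'},
    {'role': 'user', 'content': 'What is the tallest mountain on Earth?'},
    {'role': 'assistant', 'content': 'Mount Everest, at about 8,849 meters!'},
]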

We want to make sure that the answers are brief to ensure a natural back-and-forth flow suitable for voice interaction. We can instruct the model to do this using the system message:

“You are a helpful and enthusiastic assistant. Speak in a human, conversational tone.
Keep your answers as short and concise as possible, like in a conversation, ideally no more than 120 characters.”

The system message we're using here is quite simple, but you can easily customize it any way you like. Consider adding more personality details (e.g. humorous, ironic) or giving it a specific background like a life coach.
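For instance, a hypothetical life-coach variant could look like this:

# Hypothetical alternative system prompt (not the one used in the repository)
SYSTEM_PROMPT = (
    'You are a warm, encouraging life coach. Speak in a casual, human tone, '
    'ask short follow-up questions, and keep your answers under 120 characters.'
)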

To generate a response we use the following function:

async def assistant_chat(messages, model='llama3-8b-8192'):
    res = await groq.chat.completions.create(messages=messages, model=model)
    return res.choices[0].message.content

We are using the smallest Llama 3 model (Llama 3 8B) as it will be the fastest and therefore ideal for a real-time voice assistant. But you can also try the more powerful Llama 3 70B, as well as the other models provided by Groq.
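The groq object used in assistant_chat is an asynchronous Groq client. Its creation isn't shown above; here's a minimal sketch, assuming the official groq Python SDK and the same settings object mentioned in note [2]:

from groq import AsyncGroq

# Asynchronous Groq client, with the API key loaded from the environment via settings
groq = AsyncGroq(api_key=settings.GROQ_API_KEY)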

#2.5. Text-to-Speech using Deepgram's Aura

The final step in our voice assistant pipeline is converting the generated text response back into speech. For this, we'll use Deepgram's Aura Text-to-Speech API.

To have the lowest latency, we'll use audio output streaming and start streaming the audio as soon as the first byte arrives.

This is the URL for the TTS API, including as parameters the Aura model, the audio encoding and the sample rate:

DEEPGRAM_TTS_URL = 'https://api.deepgram.com/v1/speak?model=aura-luna-en&encoding=linear16&sample_rate=24000'

And this is the simple function that converts the text to speech: [3]

def text_to_speech(text):
    headers = {
        'Authorization': f'Token {settings.DEEPGRAM_API_KEY}',
        'Content-Type': 'application/json'
    }
    res = requests.post(DEEPGRAM_TTS_URL, headers=headers, json={'text': text}, stream=True)
    with wave.open(res.raw, 'rb') as wf:
        p = pyaudio.PyAudio()
        stream = p.open(
            format=p.get_format_from_width(wf.getsampwidth()),
            channels=wf.getnchannels(),
            rate=wf.getframerate(),
            frames_per_buffer=1024,
            output=True
        )
        while len(data := wf.readframes(1024)):
            stream.write(data)

        stream.close()
        p.terminate()

A quick breakdown of what's happening:

  • We send a POST request to the Deepgram TTS endpoint with the text to convert, using stream=True so we can start reading the audio as soon as the first bytes arrive.
  • We open the streamed response with the wave module and read the audio format details (sample width, channels, sample rate).
  • We create a PyAudio output stream with those parameters and write the audio frames to it in chunks of 1024 frames, playing the speech while it is still being downloaded.
  • Finally, we close the stream and terminate PyAudio.

#2.6. Running the Code

To run the code for the local Python assistant, follow these steps in your terminal or command-line interface:

  1. Make sure you have Python (3.11 or higher) installed on your system.

  2. Install Poetry, the package manager used in this project. You can find the installation instructions here.

  3. Clone the GitHub repository containing the code, navigate to the backend folder and install the project dependencies using Poetry.

    git clone https://github.com/ruizguille/voice-assistant.git
    cd voice-assistant/backend
    poetry install
  4. Create a .env file in the backend folder by copying the provided .env.example file and set the required environment variables (see the example after this list):

    • GROQ_API_KEY: Your Groq API key.
    • DEEPGRAM_API_KEY: Your Deepgram API key.
  5. Run the local Python assistant script using the provided Poetry script:

    poetry run local-assistant
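For reference, the .env file from step 4 could look something like this (placeholder values):

    GROQ_API_KEY=your_groq_api_key
    DEEPGRAM_API_KEY=your_deepgram_api_key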

And there you have it! You now have a working local Python assistant that showcases the core functionality.

In Part 2, we'll take things to the next level by building the Python backend with FastAPI for the complete full-stack web app.



#Notes

[1] We are printing colored text to the screen thanks to the Console API of the Rich library, which allows us to easily format and style our output.

[2] We are using Pydantic Settings to manage the application settings and loading the environment variables. We will see this in more detail in Part 2.

[3] Note that I implemented the text_to_speech function synchronously to simplify the code and the integration with PyAudio. But in the final web app all functions will be asynchronous for optimal performance.