Creating a Voice Assistant for your Knowledge Base

Introduction

In this lesson we will build a voice assistant for your knowledge base using state-of-the-art artificial intelligence tools. The assistant uses OpenAI's Whisper, a sophisticated automatic speech recognition (ASR) system, to transcribe our voice input into text. Once the voice input has been transcribed, we turn to generating voice output. For this we use Eleven Labs, which enables the assistant to respond to users in an engaging and natural-sounding voice.

The core of the project is a robust question-answering mechanism. The process begins with loading the vector database, a repository housing documents relevant to our potential queries. When a question is posed, the system retrieves the most relevant documents from this database and feeds them, along with the question, to the LLM. The LLM then generates a response based on the retrieved documents.

We aim to create a voice assistant that can efficiently navigate a knowledge base, providing precise and timely responses to a user's queries. For this experiment we’re using the ‘JarvisBase’ repository on GitHub.

Disclaimer

Streamlit may not work properly if you run the code on Google Colab, so we recommend running it on a local machine.

Setup: Library Installation

We start by installing the requirements, i.e. the libraries we’ll be using. While we strongly recommend installing the latest versions of these packages, please note that the code has been tested with the versions pinned below.

langchain==0.0.208
deeplake==3.6.5
openai==0.27.8
tiktoken==0.4.0
elevenlabs==0.2.18
streamlit==1.23.1
beautifulsoup4==4.11.2
audio-recorder-streamlit==0.0.8
streamlit-chat==0.0.2.2

Tokens and APIs

For this experiment, you need to obtain several API keys and tokens. Set them as environment variables as shown below.

import os

os.environ['OPENAI_API_KEY']='<your-openai-api-key>'
os.environ['ELEVEN_API_KEY']='<your-eleven-api-key>'
os.environ['ACTIVELOOP_TOKEN']='<your-activeloop-token>'
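If you prefer not to hard-code the keys, a common alternative is to keep them in a .env file and load them at startup. The sketch below is optional and assumes the python-dotenv package is installed and that a .env file containing the three keys sits in the project root.

# Optional sketch: load the keys from a .env file instead of hard-coding them
# (assumes `pip install python-dotenv` and a .env file with the three keys)
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

for key in ("OPENAI_API_KEY", "ELEVEN_API_KEY", "ACTIVELOOP_TOKEN"):
    if not os.environ.get(key):
        raise RuntimeError(f"Missing environment variable: {key}")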

To access OpenAI's services, you must first obtain credentials by signing up on their website, completing the registration process, and creating an API key from your dashboard. This enables you to leverage OpenAI's powerful capabilities in your projects.

  1. If you don't have an account yet, create one by going to https://platform.openai.com/. If you already have an account, skip to step 5.
  2. Fill out the registration form with your name, email address, and desired password.
  3. OpenAI will send you a confirmation email with a link. Click on the link to confirm your account.
  4. Please note that you'll need to verify your email account and provide a phone number for verification.
  5. Log in to https://platform.openai.com/.
  6. Navigate to the API key section at https://platform.openai.com/account/api-keys.
  7. Click "Create new secret key" and give the key a recognizable name or ID.

To get the ELEVEN_API_KEY, follow these steps:

  1. Go to https://elevenlabs.io/ and click on "Sign Up" to create an account.
  2. Once you have created an account, log in and navigate to the "API" section.
  3. Click the "Create API key" button and follow the prompts to generate a new API key.
  4. Copy the API key and paste it into your code where it says "your-eleven-api-key" in the ELEVEN_API_KEY variable.

For ACTIVELOOP TOKEN, follow these easy steps:

  1. Go to https://www.activeloop.ai/ and click on “Sign Up” to create an account.
  2. Once you have an Activeloop account, you can create tokens in the Deep Lake App (Organization Details -> API Tokens).
  3. Click the "Create API token" button and generate a new API token.
  4. Copy the token and paste it as your environment variable: ACTIVELOOP_TOKEN='<your-activeloop-token>'

1. Sourcing Content from Hugging Face Hub

Now that everything is set up, let’s begin by aggregating Python library articles from the Hugging Face Hub, an open platform for sharing and collaborating on machine learning. These articles will serve as the knowledge base for our voice assistant. We'll do some web scraping to collect the knowledge documents.

Let’s observe and run the scrape.py file (i.e. run python scrape.py). This script contains all the code we use in this lesson under the “Sourcing Content from Hugging Face Hub” and “Embedding and storing in Deep Lake” sections. You can fork or download the mentioned repository and run the files.

We start by importing the necessary modules, loading environment variables, and setting up the path for Deep Lake, a vector database. We also set up an OpenAIEmbeddings instance, which will be used later to embed the scraped articles:

import os
import requests
from bs4 import BeautifulSoup
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
import re

# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "langchain_course_jarvis_assistant"
dataset_path= f'hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}'

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

We first create a list of relative URLs leading to knowledge documents hosted on the Hugging Face Hub. To do this, we define a function called get_documentation_urls(). Using another function, construct_full_url(), we then append these relative URLs to the base URL of the Hugging Face Hub, effectively creating full URLs that we can access directly.

def get_documentation_urls():
    # List of relative URLs for Hugging Face documentation pages
    # (only a subset is listed here, since scraping all of them would take too long)
    return [
        '/docs/huggingface_hub/guides/overview',
        '/docs/huggingface_hub/guides/download',
        '/docs/huggingface_hub/guides/upload',
        '/docs/huggingface_hub/guides/hf_file_system',
        '/docs/huggingface_hub/guides/repository',
        '/docs/huggingface_hub/guides/search',
        # You may add additional URLs here or replace all of them
    ]

def construct_full_url(base_url, relative_url):
    # Construct the full URL by appending the relative URL to the base URL
    return base_url + relative_url
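As a quick sanity check (not part of scrape.py), you can print the full URLs the two helpers produce:

# Quick check: print the full URLs that will be scraped
base_url = 'https://huggingface.co'
for relative_url in get_documentation_urls():
    print(construct_full_url(base_url, relative_url))
# e.g. https://huggingface.co/docs/huggingface_hub/guides/overview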

The script then aggregates all the scraped content from the URLs. This is achieved with the scrape_all_content() function, which iteratively calls scrape_page_content() for each URL and extracts its text. This collected text is then saved to a file.


def scrape_page_content(url):
    # Send a GET request to the URL and parse the HTML response using BeautifulSoup
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the desired content from the page (in this case, the body text)
    text=soup.body.text.strip()
    # Remove non-ASCII characters
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', '', text)
    # Remove extra whitespace and newlines
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def scrape_all_content(base_url, relative_urls, filename):
    # Loop through the list of URLs, scrape content and add it to the content list
    content = []
    for relative_url in relative_urls:
        full_url = construct_full_url(base_url, relative_url)
        scraped_content = scrape_page_content(full_url)
        content.append(scraped_content.rstrip('\n'))

    # Write the scraped content to a file
    with open(filename, 'w', encoding='utf-8') as file:
        for item in content:
            file.write("%s\n" % item)
    
    return content
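To see what the cleaning step produces, you can run scrape_page_content() on a single page. The snippet below is only an optional check, assuming the page is reachable:

# Optional check: preview the cleaned body text of a single documentation page
url = construct_full_url('https://huggingface.co', '/docs/huggingface_hub/guides/search')
preview = scrape_page_content(url)
print(preview[:300])  # first 300 characters of the cleaned text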

Loading and splitting texts

To prepare the collected text for embedding into our vector database, we load the content from the file and split it into separate documents using the load_docs() function. To further refine the content, we split it into individual chunks with split_docs(). Here we see the TextLoader and the text splitter in action.

The instruction text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) creates an instance of a text splitter that splits text into chunks based on characters. Each document in docs is split into chunks of approximately 1,000 characters, with no overlap between consecutive chunks.

# Define a function to load documents from a file
def load_docs(root_dir,filename):
    # Create an empty list to hold the documents
    docs = []
    try:
        # Load the file using the TextLoader class and UTF-8 encoding
        loader = TextLoader(os.path.join(
            root_dir, filename), encoding='utf-8')
        # Split the loaded file into separate documents and add them to the list of documents
        docs.extend(loader.load_and_split())
    except Exception as e:
        # If an error occurs during loading, ignore it and return an empty list of documents
        pass
    # Return the list of documents
    return docs
  
def split_docs(docs):
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    return text_splitter.split_documents(docs)
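To verify that loading and splitting behave as expected, you can run both helpers on the scraped file and inspect the chunks. This optional snippet assumes content.txt has already been produced by the scraping step:

# Optional check: load the scraped file and inspect the resulting chunks
docs = load_docs('./', 'content.txt')
texts = split_docs(docs)
print(f"{len(docs)} documents loaded, {len(texts)} chunks after splitting")
print(texts[0].page_content[:200])  # preview the first chunk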

2. Embedding and storing in Deep Lake

Once we've collected the necessary articles, the next step is to embed them using Deep Lake. Deep Lake is a powerful tool for creating searchable vector databases. In this context, it will allow us to efficiently index and retrieve the information contained in our Python library articles.

Finally, we're ready to populate our vector database.

The Deep Lake integration initializes a database instance with the given dataset path and the predefined OpenAIEmbeddings function. OpenAIEmbeddings converts the text chunks into embedding vectors, a format suitable for the vector database. The .add_documents method then processes the texts and stores them in the database.


# Define the main function
def main():
    base_url = 'https://huggingface.co'
    # Set the name of the file to which the scraped content will be saved
    filename='content.txt'
    # Set the root directory where the content file will be saved
    root_dir ='./'
    relative_urls = get_documentation_urls()
    # Scrape all the content from the relative URLs and save it to the content file
    content = scrape_all_content(base_url, relative_urls,filename)
    # Load the content from the file
    docs = load_docs(root_dir,filename)
    # Split the content into individual documents
    texts = split_docs(docs)
    # Create a DeepLake database with the given dataset path and embedding function
    db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
    # Add the individual documents to the database
    db.add_documents(texts)
    # Clean up by deleting the content file
    os.remove(filename)

# Call the main function if this script is being run as the main program
if __name__ == '__main__':
    main()

All these steps are neatly wrapped into our main function. This sets the necessary parameters, invokes the functions we've defined, and oversees the overall process from scraping the content from the web to loading it into the Deep Lake database. As a final step, it deletes the content file to clean up.
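Once main() has finished, you can confirm that the documents were stored by reopening the dataset in read-only mode and running a similarity search. This is an optional check, not part of scrape.py:

# Optional check: query the populated Deep Lake dataset
db = DeepLake(dataset_path=dataset_path, read_only=True, embedding_function=embeddings)
results = db.similarity_search("How do I search for models on the Hugging Face Hub?", k=2)
for doc in results:
    print(doc.page_content[:150])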

3. Voice Assistant

Having successfully stored all the necessary data in the vector database, in this instance using Deep Lake by Activeloop, we're ready to utilize this data in our chatbot.

Without further ado, let's transition to the coding part of our chatbot. The following code can be found in the chat.py file of the directory. To give it a try, run streamlit run chat.py.

These libraries will help us in building web applications with Streamlit, handling audio input, generating text responses, and effectively retrieving information stored in the Deep Lake:

import os
import openai
import streamlit as st
from audio_recorder_streamlit import audio_recorder
from elevenlabs import generate
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from streamlit_chat import message

# Constants
TEMP_AUDIO_PATH = "temp_audio.wav"
AUDIO_FORMAT = "audio/wav"

# Path of the Deep Lake dataset created by scrape.py
# TODO: use your organization id here (by default, org id is your username)
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "langchain_course_jarvis_assistant"
dataset_path = f'hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}'

# Read the API keys from environment variables
openai.api_key = os.environ.get('OPENAI_API_KEY')
eleven_api_key = os.environ.get('ELEVEN_API_KEY')

We then create an instance that points to our Deep Lake vector database.

def load_embeddings_and_database(active_loop_data_set_path):
    embeddings = OpenAIEmbeddings()
    db = DeepLake(
        dataset_path=active_loop_data_set_path,
        read_only=True,
        embedding_function=embeddings
    )
    return db

Next, we prepare the code for transcribing audio.

# Transcribe audio using OpenAI Whisper API
def transcribe_audio(audio_file_path, openai_key):
    openai.api_key = openai_key
    try:
        with open(audio_file_path, "rb") as audio_file:
            response = openai.Audio.transcribe("whisper-1", audio_file)
        return response["text"]
    except Exception as e:
        print(f"Error calling Whisper API: {str(e)}")
        return None

This transcribes an audio file into text using the OpenAI Whisper API, requiring the path of the audio file and the OpenAI key as input parameters.
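As a standalone check, you can call the function directly on a saved recording; the file name below is just a placeholder:

# Optional check: transcribe a local WAV file (the path is a placeholder)
text = transcribe_audio("sample_question.wav", os.environ.get("OPENAI_API_KEY"))
print(text)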

# Record audio using audio_recorder and transcribe using transcribe_audio
def record_and_transcribe_audio():
    audio_bytes = audio_recorder()
    transcription = None
    if audio_bytes:
        st.audio(audio_bytes, format=AUDIO_FORMAT)

        with open(TEMP_AUDIO_PATH, "wb") as f:
            f.write(audio_bytes)

        if st.button("Transcribe"):
            transcription = transcribe_audio(TEMP_AUDIO_PATH, openai.api_key)
            os.remove(TEMP_AUDIO_PATH)
            display_transcription(transcription)

    return transcription

# Display the transcription of the audio on the app
def display_transcription(transcription):
    if transcription:
        st.write(f"Transcription: {transcription}")
        with open("audio_transcription.txt", "w+") as f:
            f.write(transcription)
    else:
        st.write("Error transcribing audio.")

# Get user input from Streamlit text input field
def get_user_input(transcription):
    return st.text_input("", value=transcription if transcription else "", key="input")

This part of the code allows users to record audio directly within the application. The recorded audio is then transcribed into text using the Whisper API, and the transcribed text is displayed on the application. If any issues occur during the transcription process, an error message will be shown to the user.

# Search the database for a response based on the user's query
def search_db(user_input, db):
    print(user_input)
    retriever = db.as_retriever()
    retriever.search_kwargs['distance_metric'] = 'cos'
    retriever.search_kwargs['fetch_k'] = 100
    retriever.search_kwargs['k'] = 4
    model = ChatOpenAI(model_name='gpt-4o-mini')
    qa = RetrievalQA.from_llm(model, retriever=retriever, return_source_documents=True)
    return qa({'query': user_input})

This segment of the code searches the vector database for the responses most relevant to the user's query. It first converts the database into a retriever, a tool that searches for the nearest embeddings in the vector space. It then sets several parameters for the search: the distance metric used in the embedding space ('cos' for cosine), the number of documents to fetch initially (fetch_k), and how many results to return (k). The retrieved documents are then passed, together with the question, to the chat model instantiated above, which generates the final response to the user's query.
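The dictionary returned by the chain contains both the generated answer and the source documents it was grounded on. A minimal usage sketch outside Streamlit, reusing the dataset path and helpers defined above, could look like this:

# Minimal usage sketch (not part of the Streamlit flow): query the database directly
db = load_embeddings_and_database(dataset_path)
output = search_db("How do I upload a model to the Hugging Face Hub?", db)
print(output["result"])                 # the generated answer
print(len(output["source_documents"]))  # number of retrieved supporting chunks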

Streamlit

Streamlit is a Python framework used for building data visualization web applications. It provides an intuitive way to create interactive web apps for machine learning and data science projects.
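For readers new to Streamlit, the minimal example below (an illustration only, unrelated to the assistant) shows the basic pattern chat.py builds on: the script reruns top to bottom on every interaction, and each widget returns its current value.

# Minimal Streamlit example, saved as hello.py and run with: streamlit run hello.py
import streamlit as st

st.write("# Hello, Streamlit!")
name = st.text_input("What's your name?")
if name:
    st.write(f"Nice to meet you, {name}!")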

Next comes the part that displays the conversation history between the user and the chatbot using Streamlit's messaging functionality. It goes through the previous messages in the conversation and displays each user message followed by the corresponding chatbot response. It employs the Eleven Labs API to convert the chatbot's text response into speech, giving the chatbot a voice. This voice output, in MP3 format, is then played on the Streamlit interface, adding an auditory dimension to the conversation:

# Display conversation history using Streamlit messages
def display_conversation(history):
    for i in range(len(history["generated"])):
        message(history["past"][i], is_user=True, key=str(i) + "_user")
        message(history["generated"][i], key=str(i))
        # Generate the voice for the response using the Eleven Labs API
        voice = "Bella"
        text = history["generated"][i]
        audio = generate(text=text, voice=voice, api_key=eleven_api_key)
        st.audio(audio, format='audio/mp3')

User Interaction

After the knowledge base is set up, the next stage is user interaction. The voice assistant is designed to accept queries either in the form of voice recordings or typed text.

# Main function to run the app
def main():
    # Initialize Streamlit app with a title
    st.write("# JarvisBase 🧙")
   
    # Load embeddings and the DeepLake database
    db = load_embeddings_and_database(dataset_path)

    # Record and transcribe audio
    transcription = record_and_transcribe_audio()

    # Get user input from text input or audio transcription
    user_input = get_user_input(transcription)

    # Initialize session state for generated responses and past messages
    if "generated" not in st.session_state:
        st.session_state["generated"] = ["I am ready to help you"]
    if "past" not in st.session_state:
        st.session_state["past"] = ["Hey there!"]
        
    # Search the database for a response based on user input and update the session state
    if user_input:
        output = search_db(user_input, db)
        print(output['source_documents'])
        st.session_state.past.append(user_input)
        response = str(output["result"])
        st.session_state.generated.append(response)

    #Display conversation history using Streamlit messages
    if st.session_state["generated"]:
        display_conversation(st.session_state)

# Run the main function when the script is executed
if __name__ == "__main__":
    main()

This is the main driver of the entire application. First, it sets up the Streamlit application and loads the Deep Lake vector database along with its embeddings. It then offers two methods for user input: through text or through an audio recording which is then transcribed.

The application keeps a record of past user inputs and generated responses in a session state. When new user input is received, the application searches the database for the most suitable response. This response is then added to the session state.

Finally, the application displays the entire conversation history, including both user inputs and chatbot responses. If the input was made via voice, the chatbot's responses are also generated in an audio format using the Eleven Labs API.

You should now run the following command in your terminal:

streamlit run chat.py

When you run your application with the Streamlit command, it starts a local web server and prints the URLs where the application can be accessed via a web browser, typically a Network URL and an External URL.

Your application will keep running as long as the command is running in your terminal; it stops once you interrupt the command (Ctrl+C) or close the terminal.

Trying Out the UI

We have now explained the main code parts and are ready to test the Streamlit app!

This is how it presents itself.

[Image: the JarvisBase app home screen with the microphone icon and chat input]

By clicking on the microphone icon, your microphone will be active for some seconds and you’ll be able to ask a question. Let’s try “How do I search for models in the Hugging Face Hub?”.

After a few seconds, the app will show an audio player that you can use to listen to your recorded audio. You may then click on the “Transcribe” button.

[Image: the recorded audio player and the “Transcribe” button]

This button will invoke a call to the Whisper API and transcribe your audio. The transcribed text will then appear in the chat text entry underneath.

[Image: the transcription pasted into the chat text entry]

Here we see that the Whisper API didn’t do a perfect job at transcribing “Hugging Face” correctly and instead wrote “Huggy Face”. This is unwanted, but let’s see if ChatGPT is still able to understand the query and give it an appropriate answer by leveraging the knowledge documents stored in Deep Lake.

After a few more seconds, the underlying chat will be populated with your audio transcription, along with the chatbot's textual response and its audio version, generated by calling the ElevenLabs API. As we can see, ChatGPT was smart enough to understand that “Huggy Face” was a misspelling of “Hugging Face” and was still able to give an appropriate answer.

[Image: the chat history showing the transcribed question, the chatbot's answer, and its audio player]

Conclusion

In this lesson we integrated several popular generative AI tools and models: OpenAI Whisper for speech recognition, ElevenLabs for text-to-speech, and a Deep Lake vector database queried through an LLM for question answering.

In the next lesson we’ll see how LLMs can be used to aid in understanding new codebases, such as the Twitter Algorithm public repository.

Github Repo: