Introduction
This chapter demonstrates how to implement a hybrid retrieval system on a practical restaurant reviews project. The main focus is on combining traditional BM25 with dense vector similarity search for enhanced search capabilities. The initial sections lay the groundwork by explaining and implementing these foundational techniques. The primary objective, however, is to show how these methods integrate into a hybrid approach that combines lexical and semantic retrieval to improve the relevance of retrieved documents.
Resources:
- Deep Lake docs: RAG -
- Jupyter: Google Colab
Load the Data from Deep Lake
The following code opens the dataset in read-only mode from Deep Lake at the path al://activeloop/restaurant_reviews_complete. The scraped_data object now contains the complete restaurant dataset, featuring 160 restaurants and over 24,000 images, ready for data extraction and processing.
#!pip install deeplake
import deeplake
scraped_data = deeplake.open_read_only(f"al://activeloop/restaurant_reviews_complete")
print(f"Scraped {len(scraped_data)} reviews")
Output:
Scraped 18625 reviews
Create the Dataset and Use an Inverted Index for Filtering
In the first stage of this course, we’ll cover Lexical Search, a traditional and foundational approach to information retrieval.
An inverted index is a data structure commonly used in search engines and databases to facilitate fast full-text searches. Unlike a row-wise search, which scans each row of a document or dataset for a search term, an inverted index maps each unique word or term to the locations (such as document IDs or row numbers) where it appears. This setup allows for very efficient retrieval of information, especially in large datasets.
For small datasets with up to 1,000 documents, row-wise search can provide efficient performance without needing an inverted index. For medium-sized datasets (10,000+ documents), inverted indexes become useful, particularly if search queries are frequent. For large datasets of 100,000+ documents, using an inverted index is essential to ensure efficient query processing and meet performance expectations.
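To make the contrast between the two lookup strategies concrete, here is a minimal pure-Python sketch. It is not how Deep Lake implements its inverted index; the function names, the whitespace tokenizer, and the toy reviews are assumptions made for illustration only.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation (a simplifying assumption)."""
    return [t.strip(".,!?") for t in text.lower().split()]

def build_inverted_index(docs):
    """Map each token to the set of row numbers in which it appears."""
    index = defaultdict(set)
    for row_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(row_id)
    return index

def row_wise_search(docs, term):
    """Baseline: scan every row and check whether it contains the term."""
    return {row_id for row_id, text in enumerate(docs) if term in tokenize(text)}

# Toy data, invented for this example.
reviews = ["Great burritos!", "Nice tapas and wine.", "The burritos are huge."]
index = build_inverted_index(reviews)

# The index answers the lookup with one dictionary access,
# while the row-wise search touches every row.
print(index["burritos"])
print(row_wise_search(reviews, "burritos"))
```

Both calls return the same rows. The difference is cost: the index is built once and each lookup is a single dictionary access, whereas the row-wise scan grows with the number of documents on every query.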
import deeplake
from deeplake import types
# Create a dataset
inverted_index_dataset = "local_inverted_index"
ds = deeplake.create(f"file://{inverted_index_dataset}")
We now create three columns in the dataset: restaurant_name, restaurant_review and owner_answer. All three columns are text-based and use an inverted index to improve search efficiency.
ds.add_column("restaurant_name", types.Text(index_type=types.Inverted))
ds.add_column("restaurant_review", types.Text(index_type=types.Inverted))
ds.add_column("owner_answer", types.Text(index_type=types.Inverted))
Extract the data
This code extracts restaurant details from scraped_data into separate lists:
- Initialize Lists: restaurant_name, restaurant_review and owner_answer are initialized to store the respective data for each restaurant.
- Populate Lists: For each entry (el) in scraped_data, the code appends:
  - el['restaurant_name'] to restaurant_name
  - el['restaurant_review'] to restaurant_review
  - el['owner_answer'] to owner_answer
After running, each list holds a specific field from all restaurants, ready for further processing.
restaurant_name = []
restaurant_review = []
owner_answer = []
images = []
for el in scraped_data:
    restaurant_name.append(el['restaurant_name'])
    restaurant_review.append(el['restaurant_review'])
    owner_answer.append(el['owner_answer'])
Add the data to the dataset
We add the collected restaurant names, reviews and owner answers to the dataset ds. Using ds.append(), we fill three columns, "restaurant_name", "restaurant_review" and "owner_answer", with the values from our lists restaurant_name, restaurant_review and owner_answer. After appending the data, ds.commit() saves the changes permanently to the dataset, ensuring all new entries are stored and ready for further processing.
ds.append({
"restaurant_name": restaurant_name,
"restaurant_review": restaurant_review,
"owner_answer": owner_answer
})
ds.commit()
ds
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=18625)
Search for the restaurant using a specific word
We define a search query to find entries in the dataset ds where the word "burritos" appears in the restaurant_review column. The command ds.query() runs a TQL query with SELECT *, which retrieves entries that match the condition CONTAINS(restaurant_review, '{word}'). This search filters the dataset to show only records containing the specified word (burritos) in their reviews, and LIMIT 4 caps the output at four rows. The results are saved in the variable view.
Deep Lake offers a high-performance SQL-based query engine for data analysis called TQL (Tensor Query Language). You can find the official documentation here.
word = 'burritos'
view = ds.query(f"""
SELECT *
WHERE CONTAINS(restaurant_review, '{word}')
LIMIT 4
""")
view
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=4)
Show the results
for row in view:
    print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']}")
Output:
Restaurant name: Los Amigos
Review: Best Burritos i have ever tried!!!!! Wolderful!!!
Restaurant name: Los Amigos
Review: Really good breakfast burrito, and just burritos in general
Restaurant name: Los Amigos
Review: Ordered two of their veggie burritos, nothing crazy just added extra cheese and sour cream. They even repeated the order back to me and everything was fine, then when I picked the burritos up and got home they put zucchini and squash in it.. like what??
Restaurant name: Los Amigos
Review: Don't make my mistake and over order. The portions are monstrous. The wet burritos are as big as a football.
AI data retrieval systems today face three challenges: limited modalities, lack of accuracy, and high costs at scale. Deep Lake 4.0 fixes this by enabling true multi-modality, enhancing accuracy, and reducing query costs by 2x with index-on-the-lake technology.
Consider a scenario where we store all our data locally on a computer. Initially, this may be adequate, but as the volume of data grows, managing it becomes increasingly challenging. The computer’s storage becomes limited, data access slows, and sharing information with others is less efficient.
To address these challenges, we can transition our data storage to the cloud using Deep Lake. Designed specifically for handling large-scale datasets and AI workloads, Deep Lake enables up to 10 times faster data access. With cloud storage, hardware limitations are no longer a concern: Deep Lake offers ample storage capacity, secure access from any location, and streamlined data sharing.
This approach provides a robust and scalable infrastructure that can grow alongside our projects, minimizing the need for frequent hardware upgrades and ensuring efficient data management.
Use BM25 to Retrieve the Data
Our advanced "Index-On-The-Lake" technology enables sub-second query performance directly from object storage, such as S3, using minimal compute power and memory resources. Achieve up to 10x greater cost efficiency compared to in-memory databases and 2x faster performance than other object storage solutions, all without requiring additional disk-based caching.
With Deep Lake, you benefit from rapid streaming columnar access to train deep learning models directly, while also executing sub-second indexed queries for retrieval-augmented generation.
In this stage, the system uses BM25 for a straightforward lexical search. This approach is efficient for retrieving documents based on exact or partial keyword matches.
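For intuition about what a BM25 score measures, the following is a minimal sketch of the standard Okapi BM25 formula in plain Python. It is not Deep Lake's implementation: the whitespace tokenizer, the parameter values k1=1.5 and b=0.75, the smoothed IDF variant, and the toy reviews are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25 (smoothed IDF)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy data, invented for this example.
reviews = [
    "Great burritos",
    "Awesome tacos and salsa",
    "The burritos here are great but the line is long",
]
scores = bm25_scores("burritos", reviews)
print(scores)
```

The second review scores zero because it never mentions the query term. The first review outranks the third even though both contain "burritos" once, because BM25 penalizes longer documents through the length normalization term.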
We start by importing deeplake and setting up an organization ID org_id and dataset name dataset_name_bm25. Next, we create a new dataset with the specified name and location in Deep Lake storage.
We then add three columns to the dataset: restaurant_name, restaurant_review and owner_answer. All three columns use a BM25 index, which optimizes them for relevance-based searches, enhancing the ability to rank results based on how well they match search terms.
Finally, we use ds_bm25.commit() to save these changes to the dataset and ds_bm25.summary() to display an overview of the dataset's structure and contents.
If you don't have a token yet, you can sign up and then log in on the official Activeloop website, then click the Create API token button to obtain a new API token. Here, under Select organization, you can also find your organization ID(s).
import os, getpass
os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("Activeloop API token: ")
org_id = ""
dataset_name_bm25 = "bm25_test"
ds_bm25 = deeplake.create(f"al://{org_id}/{dataset_name_bm25}")
# Add columns to the dataset
ds_bm25.add_column("restaurant_name", types.Text(index_type=types.BM25))
ds_bm25.add_column("restaurant_review", types.Text(index_type=types.BM25))
ds_bm25.add_column("owner_answer", types.Text(index_type=types.BM25))
ds_bm25.commit()
ds_bm25.summary()
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=0)
+-----------------+-----------------+
| column | type |
+-----------------+-----------------+
| restaurant_name |text (bm25 Index)|
+-----------------+-----------------+
|restaurant_review|text (bm25 Index)|
+-----------------+-----------------+
| owner_answer |text (bm25 Index)|
+-----------------+-----------------+
Add data to the dataset
We add data to the ds_bm25 dataset by appending the three columns, filled with values from the lists we previously created.
After appending, ds_bm25.commit() saves the changes, ensuring the new data is permanently stored in the dataset. Finally, ds_bm25.summary() provides a summary of the dataset's updated structure and contents, allowing us to verify that the data was added successfully.
ds_bm25.append({
"restaurant_name": restaurant_name,
"restaurant_review": restaurant_review,
"owner_answer": owner_answer
})
ds_bm25.commit()
ds_bm25.summary()
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=18625)
+-----------------+-----------------+
| column | type |
+-----------------+-----------------+
| restaurant_name |text (bm25 Index)|
+-----------------+-----------------+
|restaurant_review|text (bm25 Index)|
+-----------------+-----------------+
| owner_answer |text (bm25 Index)|
+-----------------+-----------------+
Search for the restaurant using a specific sentence
We define a query, "I want burritos", to find relevant restaurant reviews in the dataset. Using ds_bm25.query(), we search and rank entries in restaurant_review based on BM25 similarity to the query. The code orders results by how well they match the query (BM25_SIMILARITY), from highest to lowest relevance, and limits the output to the top 6 results. The final list of results is stored in view_bm25.
query = "I want burritos"
view_bm25 = ds_bm25.query(f"""
SELECT *
ORDER BY BM25_SIMILARITY(restaurant_review, '{query}') DESC
LIMIT 6
""")
view_bm25
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=6)
Show the results
for row in view_bm25:
    print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']}")
Output:
Restaurant name: Los Amigos
Review: Best Burritos i have ever tried!!!!! Wolderful!!!
Restaurant name: Los Amigos
Review: Fantastic burritos!
Restaurant name: Cheztakos!!!
Review: Great burritos
Restaurant name: La Costeña
Review: Awesome burritos!
Restaurant name: La Costeña
Review: Awesome burritos
Restaurant name: La Costeña
Review: Bomb burritos
Vector similarity search
If you want to generate text embeddings for similarity search, you can choose a proprietary model like text-embedding-3-large from OpenAI, or you can opt for an open-source model. The MTEB leaderboard on Hugging Face provides a selection of open-source models that have been tested for their effectiveness at converting text into embeddings, which are numerical representations that capture the meaning and nuances of words and sentences. Using these embeddings, you can perform similarity search, grouping similar pieces of text (like sentences or documents) based on their meaning.
Selecting a model from the MTEB leaderboard offers several benefits: these models are ranked based on performance across a variety of tasks and languages, ensuring that you’re choosing a model that’s both accurate and versatile. If you prefer not to use a proprietary model, a high-performing model from this list is an excellent alternative.
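As a quick illustration of what "similarity" means for embeddings, here is a small sketch of cosine similarity on hand-written 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions and come from a model; the vectors and labels below are invented for this example, and the helper assumes non-zero vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", invented for illustration.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "burritos": [0.9, 0.1, 0.8],
    "parking": [0.0, 1.0, 0.1],
}

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

Vectors pointing in nearly the same direction score close to 1, and unrelated (orthogonal) vectors score close to 0, which is why ordering by this score surfaces semantically related text first.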
We start by installing and importing the openai library to access OpenAI's API for generating embeddings. Next, we define the function embedding_function, which takes texts as input (either a single string or a list of strings) and a model name, defaulting to "text-embedding-3-large". Then, for each text, we replace newline characters with spaces to maintain clean, uniform text. Finally, we use openai.embeddings.create() to generate embeddings for each text and return a list of these embeddings, which can be used for cosine similarity comparisons.
#!pip install openai
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")
import openai
def embedding_function(texts, model="text-embedding-3-large"):
    # Accept a single string as well as a list of strings
    if isinstance(texts, str):
        texts = [texts]
    # Replace newlines with spaces to keep the input text uniform
    texts = [t.replace("\n", " ") for t in texts]
    return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]
Create the dataset and add the columns
Next, we create the vector_search dataset and add four columns to it:
- embedding: Stores vector embeddings with a dimension size of 3072, which will enable vector-based similarity searches.
- restaurant_name: A text column with a BM25 index, optimizing it for relevance-based text search.
- restaurant_review: Another text column with a BM25 index, also optimized for efficient and ranked search results.
- owner_answer: A text column with an inverted index, allowing fast and efficient filtering based on a specific owner answer.
Finally, we use vector_search.commit() to save these new columns, ensuring the dataset structure is ready for further data additions and queries.
dataset_name_vs = "vector_indexes"
vector_search = deeplake.create(f"al://{org_id}/{dataset_name_vs}")
# Add columns to the dataset
vector_search.add_column(name="embedding", dtype=types.Embedding(3072))
vector_search.add_column(name="restaurant_name", dtype=types.Text(index_type=types.BM25))
vector_search.add_column(name="restaurant_review", dtype=types.Text(index_type=types.BM25))
vector_search.add_column(name="owner_answer", dtype=types.Text(index_type=types.Inverted))
vector_search.commit()
Create embeddings
This code processes the reviews in restaurant_review in batches of 500 and converts each one into a numerical embedding. These embeddings, stored in embeddings_restaurant_review, represent each review as a vector, enabling us to perform cosine similarity searches and comparisons within the dataset.
Deep Lake will handle the search computations, providing us with the final results.
# Create embeddings
batch_size = 500
embeddings_restaurant_review = []
for i in range(0, len(restaurant_review), batch_size):
    embeddings_restaurant_review += embedding_function(restaurant_review[i : i + batch_size])
# Add data to the dataset
vector_search.append({"restaurant_name": restaurant_name, "restaurant_review": restaurant_review, "embedding": embeddings_restaurant_review, "owner_answer": owner_answer})
vector_search.commit()
vector_search.summary()
Output:
Dataset(columns=(embedding,restaurant_name,restaurant_review,owner_answer), length=18625)
+-----------------+---------------------+
| column | type |
+-----------------+---------------------+
| embedding | embedding(3072) |
+-----------------+---------------------+
| restaurant_name | text (bm25 Index) |
+-----------------+---------------------+
|restaurant_review| text (bm25 Index) |
+-----------------+---------------------+
| owner_answer |text (Inverted Index)|
+-----------------+---------------------+
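The batching loop used to create the embeddings can be exercised offline, without an API key, by swapping in a stand-in embedder. The sketch below does that; fake_embedding_function and embed_in_batches are hypothetical helpers written for this example and are not part of Deep Lake or the OpenAI library.

```python
def fake_embedding_function(texts, dim=4):
    """Hypothetical stand-in for the API call: one deterministic vector per text."""
    if isinstance(texts, str):
        texts = [texts]
    return [[float(len(t) + i) for i in range(dim)] for t in texts]

def embed_in_batches(texts, embed, batch_size=500):
    """Embed texts slice by slice and keep the results in input order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        embeddings += embed(texts[start : start + batch_size])
    return embeddings

# 1,203 toy reviews are processed as batches of 500, 500 and 203.
texts = [f"review {i}" for i in range(1203)]
vectors = embed_in_batches(texts, fake_embedding_function, batch_size=500)
print(len(vectors), len(vectors[0]))
```

The property worth checking is that the number and order of embeddings match the input lists, since the columns appended to the dataset must all have the same length and line up row by row.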
Search for the restaurant using a specific sentence
We start by defining a search query, "A restaurant that serves good burritos."
- Generate Embedding for Query:
  - We call embedding_function(query) to generate an embedding for this query. Since embedding_function returns a list, we access the first (and only) item with [0], storing the result in embed_query.
- Convert Embedding to String:
  - We convert embed_query (a list of numbers) into a single comma-separated string using ",".join(str(c) for c in embed_query). This step stores the embedding as a formatted string in str_query, preparing it for further processing or use in queries.
query = "A restaurant that serves good burritos."
embed_query = embedding_function(query)[0]
str_query = ",".join(str(c) for c in embed_query)
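To see what the string conversion produces, the sketch below applies the same join to a tiny made-up vector and splices it into a query template shaped like the one used in this tutorial. build_vector_query is a hypothetical helper introduced here for illustration; it only builds the query text and does not contact Deep Lake.

```python
def build_vector_query(embedding, limit=3):
    """Return a TQL string shaped like the tutorial's vector query (text only, not executed)."""
    str_query = ",".join(str(c) for c in embedding)
    return f"""
    SELECT *
    FROM (
        SELECT *, cosine_similarity(embedding, ARRAY[{str_query}]) AS score
        FROM (
            SELECT *, ROW_NUMBER() AS row_id
        )
    )
    ORDER BY score DESC
    LIMIT {int(limit)}
    """

# A 3-dimensional toy vector; the real embedding has 3072 values.
toy_query = build_vector_query([0.1, -0.2, 0.3], limit=3)
print(toy_query)
```

Wrapping the template in a function keeps the embedding formatting and the result limit in one place, which is convenient when the same query shape is reused with different embeddings.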
- Define Query with Cosine Similarity:
  - We construct a TQL query (query_vs) to search within the vector_search dataset.
  - The query calculates the cosine similarity between the embedding column and str_query, which is the embedding of our query, "A restaurant that serves good burritos." This similarity score (score) measures how closely each entry matches our query.
- Order by Score and Limit Results:
  - The query orders results by score in descending order, showing the most relevant matches first. We limit the results to the top 3 matches to focus on the best results.
- Execute Query:
  - vector_search.query(query_vs) runs the query on the dataset, storing the output in view_vs, which contains the top 3 most similar entries based on cosine similarity. This approach helps us retrieve the most relevant records matching our query in vector_search.
query_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{str_query}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 3
"""
view_vs = vector_search.query(query_vs)
view_vs
for row in view_vs:
    print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']}")
Output:
Restaurant name: Cheztakos!!!
Review: Great burritos
Restaurant name: Los Amigos
Review: Nice place real good burritos.
Restaurant name: La Costeña
Review: Awesome burritos
If we want to filter for a specific owner answer, such as Thank you, we set word = "Thank you" to define the desired owner answer. Here, we use the inverted index on the owner_answer column to efficiently filter results based on this owner answer.
word = "Thank you"
query_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{str_query}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
WHERE CONTAINS(owner_answer, '{word}')
ORDER BY score DESC
LIMIT 3
"""
view_vs = vector_search.query(query_vs)
view_vs
for row in view_vs:
    print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']} \nOwner Answer: {row['owner_answer']}")
Output:
Restaurant name: Taqueria La Espuela
Review: My favorite place for super burrito and horchata
Owner Answer: Thank you for your continued support!
Restaurant name: Chaat Bhavan Mountain View
Review: Great place with good food
Owner Answer: Thank you for your positive feedback! We're thrilled to hear that you had a great experience at our restaurant and enjoyed our delicious food. Your satisfaction is our priority, and we can't wait to welcome you back for another wonderful dining experience.
Thanks,
Team Chaat Bhavan
Restaurant name: Chaat Bhavan Mountain View
Review: Good food.
Owner Answer: Thank you for your 4-star rating! We're glad to hear that you had a positive experience at our restaurant. Your feedback is valuable to us, and we appreciate your support. If there's anything specific we can improve upon to earn that extra star next time, please let us know. We look forward to serving you again soon.
Thanks,
Team Chaat Bhavan
Hybrid search
In this stage, the system enhances its search capabilities by combining BM25 with Approximate Nearest Neighbors (ANN) for a hybrid search. This approach blends lexical search with semantic search, improving relevance by considering both keywords and semantic meaning. The introduction of a Large Language Model (LLM) allows the system to generate text-based answers, delivering direct responses instead of simply listing relevant documents.
We open the vector_search dataset to perform a hybrid search. First, we define a query "I feel like a drink" and generate its embedding using embedding_function(query)[0]. We then convert this embedding into a comma-separated string embedding_string, preparing it for use in combined text and vector-based searches.
vector_search = deeplake.open(f"al://{org_id}/{dataset_name_vs}")
Search for the correct restaurant using a specific sentence
query = "I feel like a drink"
embed_query = embedding_function(query)[0]
embedding_string = ",".join(str(c) for c in embed_query)
We create two queries:
- Vector Search (tql_vs): Calculates cosine similarity with embedding_string and returns the top 5 matches by score.
- BM25 Search (tql_bm25): Ranks restaurant_review by BM25 similarity to query, also limited to the top 5.
We then execute both queries, storing vector results in vs_results and BM25 results in bm25_results. This allows us to compare results from both search methods.
tql_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{embedding_string}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 5
"""
tql_bm25 = f"""
SELECT *, BM25_SIMILARITY(restaurant_review, '{query}') AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
ORDER BY BM25_SIMILARITY(restaurant_review, '{query}') DESC
LIMIT 5
"""
vs_results = vector_search.query(tql_vs)
bm25_results = vector_search.query(tql_bm25)
Show the scores:
for el_vs in vs_results:
    print(f"vector search score: {el_vs['score']}")
for el_bm25 in bm25_results:
    print(f"bm25 score: {el_bm25['score']}")
Output:
vector search score: 0.5322654247283936
vector search score: 0.46281781792640686
vector search score: 0.4580579102039337
vector search score: 0.45585304498672485
vector search score: 0.4528498649597168
bm25 score: 13.076177597045898
bm25 score: 11.206666946411133
bm25 score: 11.023599624633789
bm25 score: 10.277934074401855
bm25 score: 10.238584518432617
- Setup and Classes:
  - We import the necessary libraries and define a Document class using pydantic.BaseModel. Each Document has an id, a data dictionary, and an optional score for ranking.
- Softmax Function:
  - The softmax function normalizes a list of scores (retrieved_score) using the softmax formula. Each score is capped at max_weight before being exponentiated, which prevents overflow, and the exponentials are then divided by their sum so that the results add up to 1. The function returns new_weights, a list of normalized scores.
#!pip install numpy pydantic
import math
import numpy as np
from typing import Any, Dict, List, Optional
from pydantic import BaseModel
class Document(BaseModel):
    id: str
    data: Dict[str, Any]
    score: Optional[float] = None

def softmax(retrieved_score: List[float], max_weight: int = 700) -> List[float]:
    # Compute the exponentials, capping each score at max_weight to avoid overflow
    exp_scores = [math.exp(min(score, max_weight)) for score in retrieved_score]
    # Compute the sum of the exponentials
    sum_exp_scores = sum(exp_scores)
    # Normalize each exponential so that the weights sum to 1
    new_weights = []
    for score in exp_scores:
        new_weights.append(score / sum_exp_scores)
    return new_weights
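As an aside, a common alternative to capping scores is to subtract the largest score before exponentiating. This is a different technique from the clipping used in the function above, shown here only as a sketch; stable_softmax is a name introduced for this example. It gives the same weights for ordinary inputs, and unlike clipping it still distinguishes scores that lie above the cap.

```python
import math

def stable_softmax(scores):
    """Softmax that subtracts the maximum score before exponentiating."""
    scores = list(scores)
    top = max(scores)
    # Every exponent is <= 0, so math.exp cannot overflow.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The five vector search scores printed earlier in this section.
vs_scores = [
    0.5322654247283936,
    0.46281781792640686,
    0.4580579102039337,
    0.45585304498672485,
    0.4528498649597168,
]
weights = stable_softmax(vs_scores)
print(weights)

# Very large scores stay distinguishable instead of collapsing to equal weights.
large = stable_softmax([1000.0, 999.0])
print(large)
```

Softmax is invariant to adding a constant to every score, so subtracting the maximum changes nothing mathematically while keeping every exponent at or below zero.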
Normalize the score
- Apply Softmax to Scores:
  - We extract score values from vs_results and bm25_results and apply softmax to them, storing the results in vss and bm25s. This step scales both sets of scores for easy comparison.
- Create Document Dictionaries:
  - We create dictionaries docs_vs and docs_bm25 to store documents from vs_results and bm25_results, respectively. For each result, we add the restaurant_name and restaurant_review along with the normalized score. Each document is identified by row_id.
This code standardizes scores and organizes results, allowing comparison across both vector and BM25 search methods.
vs_score = vs_results["score"]
bm_score = bm25_results["score"]
vss = softmax(vs_score)
bm25s = softmax(bm_score)
print(vss)
print(bm25s)
Output:
[0.21224761685297047, 0.19800771415362647, 0.1970674552539808, 0.19663342673946818, 0.19604378699995426]
[0.7132230191866898, 0.10997834807700335, 0.09158030054295993, 0.04344738382536802, 0.04177094836797888]
docs_vs = {}
docs_bm25 = {}
for el, score in zip(vs_results, vss):
    docs_vs[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)
for el, score in zip(bm25_results, bm25s):
    docs_bm25[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)
We define weights for our hybrid search: VECTOR_WEIGHT and LEXICAL_WEIGHT are both set to 0.5, giving equal importance to vector-based and BM25 scores.
- Initialize Results Dictionary:
  - We create an empty dictionary, results, to store documents with their combined scores from both search methods.
- Combine Scores:
  - We iterate over the unique document IDs from docs_vs and docs_bm25.
  - For each document:
    - We add it to results, defaulting to the version available (vector or BM25).
    - We calculate a weighted score: vs_score from vector results (if present in docs_vs) and bm_score from BM25 results (if present in docs_bm25).
    - The final score stored in results[k] is the sum of vs_score and bm_score.
This produces a fused score for each document in results, ready to rank in the hybrid search.
def fusion(docs_vs: Dict[str, Document], docs_bm25: Dict[str, Document]) -> Dict[str, Document]:
    VECTOR_WEIGHT = 0.5
    LEXICAL_WEIGHT = 0.5
    results: Dict[str, Document] = {}
    for k in set(docs_vs) | set(docs_bm25):
        # Take the document from whichever result set contains it
        doc = docs_vs.get(k) or docs_bm25.get(k)
        vs_score = VECTOR_WEIGHT * docs_vs[k].score if k in docs_vs else 0
        bm_score = LEXICAL_WEIGHT * docs_bm25[k].score if k in docs_bm25 else 0
        # Build a new Document so the inputs keep their normalized scores
        # and fusion can be re-run without compounding the weights
        results[k] = Document(id=doc.id, data=doc.data, score=vs_score + bm_score)
    return results
results = fusion(docs_vs, docs_bm25)
results
Output:
{'2637': Document(id='2637', data={'restaurant_name': 'Mifen101 花溪米粉王', 'restaurant_review': 'Feel like I’m back in China.'}, score=0.013747293509625419),
'5136': Document(id='5136', data={'restaurant_name': 'Scratch', 'restaurant_review': 'Just had drinks. They were good!'}, score=0.024505473374994282),
'17426': Document(id='17426', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks an easy going bartenders'}, score=0.024579178342433523),
'17444': Document(id='17444', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks, good food'}, score=0.02475096426920331),
'2496': Document(id='2496', data={'restaurant_name': 'Seasons Noodles & Dumplings Garden', 'restaurant_review': 'Comfort food, excellent service! Feel like back to home.'}, score=0.005430922978171003),
'4022': Document(id='4022', data={'restaurant_name': 'Eureka! Mountain View', 'restaurant_review': 'Good drinks and burgers'}, score=0.0246334319067476),
'3518': Document(id='3518', data={'restaurant_name': 'Olympus Caffe & Bakery', 'restaurant_review': 'I like the garden to sit down with friends and have a drink.'}, score=0.08915287739833623),
'17502': Document(id='17502', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Nice place for a drink'}, score=0.02653095210662131),
'11383': Document(id='11383', data={'restaurant_name': 'Ludwigs Biergarten Mountain View', 'restaurant_review': 'Beer is fresh tables are big feel like a proper beer garden'}, score=0.011447537567869991),
'10788': Document(id='10788', data={'restaurant_name': 'Casa Lupe', 'restaurant_review': 'Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos'}, score=0.00522136854599736)}
We sort the results dictionary by each document's combined score in descending order, ensuring that the highest-ranking documents appear first.
sorted_documents = dict(sorted(results.items(), key=lambda item: item[1].score, reverse=True))
sorted_documents
Output:
{'3518': Document(id='3518', data={'restaurant_name': 'Olympus Caffe & Bakery', 'restaurant_review': 'I like the garden to sit down with friends and have a drink.'}, score=0.3566115095933449),
'17502': Document(id='17502', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Nice place for a drink'}, score=0.10612380842648524),
'17444': Document(id='17444', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks, good food'}, score=0.09900385707681324),
'4022': Document(id='4022', data={'restaurant_name': 'Eureka! Mountain View', 'restaurant_review': 'Good drinks and burgers'}, score=0.0985337276269904),
'17426': Document(id='17426', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks an easy going bartenders'}, score=0.09831671336973409),
'5136': Document(id='5136', data={'restaurant_name': 'Scratch', 'restaurant_review': 'Just had drinks. They were good!'}, score=0.09802189349997713),
'2637': Document(id='2637', data={'restaurant_name': 'Mifen101 花溪米粉王', 'restaurant_review': 'Feel like I’m back in China.'}, score=0.054989174038501676),
'11383': Document(id='11383', data={'restaurant_name': 'Ludwigs Biergarten Mountain View', 'restaurant_review': 'Beer is fresh tables are big feel like a proper beer garden'}, score=0.045790150271479965),
'2496': Document(id='2496', data={'restaurant_name': 'Seasons Noodles & Dumplings Garden', 'restaurant_review': 'Comfort food, excellent service! Feel like back to home.'}, score=0.02172369191268401),
'10788': Document(id='10788', data={'restaurant_name': 'Casa Lupe', 'restaurant_review': 'Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos'}, score=0.02088547418398944)}
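The weighting and sorting steps can be checked by hand on a toy example. The sketch below uses plain dictionaries of normalized scores instead of Document objects; the document IDs and score values are invented, and fuse is a simplified stand-in written for this illustration.

```python
def fuse(vs_scores, bm25_scores, vector_weight=0.5, lexical_weight=0.5):
    """Weighted sum of two score dictionaries, sorted from best to worst."""
    fused = {}
    for doc_id in set(vs_scores) | set(bm25_scores):
        # A document missing from one result set contributes 0 for that method.
        fused[doc_id] = (
            vector_weight * vs_scores.get(doc_id, 0.0)
            + lexical_weight * bm25_scores.get(doc_id, 0.0)
        )
    return dict(sorted(fused.items(), key=lambda item: item[1], reverse=True))

# Toy normalized scores, invented for this example.
vs = {"a": 0.6, "b": 0.4}    # vector search found a and b
bm = {"b": 0.7, "c": 0.3}    # BM25 found b and c

ranked = fuse(vs, bm)
# a: 0.5 * 0.6 = 0.30
# b: 0.5 * 0.4 + 0.5 * 0.7 = 0.55
# c: 0.5 * 0.3 = 0.15
print(ranked)
```

Document b ranks first because it is the only one returned by both methods, which is the behavior a hybrid search is meant to reward. Changing the two weights shifts the balance toward semantic or lexical matches.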
Show the results
We will output a list of restaurants in order of relevance, showing each name and review based on the hybrid search results.
for v in sorted_documents.values():
    print(f"Restaurant name: {v.data['restaurant_name']} \nReview: {v.data['restaurant_review']}")
Output:
Restaurant name: Olympus Caffe & Bakery
Review: I like the garden to sit down with friends and have a drink.
Restaurant name: St. Stephen's Green
Review: Nice place for a drink
Restaurant name: St. Stephen's Green
Review: Good drinks, good food
Restaurant name: Eureka! Mountain View
Review: Good drinks and burgers
Restaurant name: St. Stephen's Green
Review: Good drinks an easy going bartenders
Restaurant name: Scratch
Review: Just had drinks. They were good!
Restaurant name: Mifen101 花溪米粉王
Review: Feel like I’m back in China.
Restaurant name: Ludwigs Biergarten Mountain View
Review: Beer is fresh tables are big feel like a proper beer garden
Restaurant name: Seasons Noodles & Dumplings Garden
Review: Comfort food, excellent service! Feel like back to home.
Restaurant name: Casa Lupe
Review: Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos
Generating LLM answer
This code completes the RAG (Retrieval-Augmented Generation) approach by generating an LLM-based answer to a user’s question, using results retrieved in the previous step. Here’s how it works:
- Setup and Initialization:
  - We import json for handling JSON responses and initialize the OpenAI client to interact with the language model.
- Define the generate_question Function:
  - This function accepts:
    - question: The user’s question.
    - information: A list of relevant chunks retrieved previously, providing context.
- System and User Prompts:
  - The system_prompt instructs the model to act as a restaurant assistant, using the provided chunks to answer clearly and without repetition.
  - The model is directed to format its response in JSON.
  - The user_prompt combines the user’s question and the information chunks.
- Generate and Parse the Response:
  - Using client.chat.completions.create(), the system and user prompts are sent to the LLM (specified as gpt-4o-mini).
  - The response is parsed as JSON, extracting the answer field. If parsing fails, False is returned.
import json
from openai import OpenAI

client = OpenAI()

def generate_question(question: str, information: list):
    # The doubled braces {{ }} are literal braces inside the f-string
    system_prompt = f"""You are a helpful assistant specialized in providing answers to questions about restaurants. Below is a question from a user, along with the top four relevant information chunks about restaurants from a Deep Lake database. Using these chunks, construct a clear and informative answer that addresses the question, incorporating key details without repeating information.
    The output must be in JSON format with the following structure:
    {{
        "answer": "The answer to the question."
    }}
    """
    user_prompt = f"Here is a question from a user: {question}\n\nHere are the top relevant information about restaurants {information}"

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_object"},
    )

    try:
        # Parse the JSON string returned by the model and extract the answer
        content = response.choices[0].message.content
        answer = json.loads(content)["answer"]
        return answer
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, a missing "answer" field, or an empty message
        return False
This function takes a restaurant-related question and retrieves the best response based on the given context. It completes the RAG process by combining relevant information and LLM-generated content into a concise answer.
information = [f'Review: {el["restaurant_review"]}, Restaurant name: {el["restaurant_name"]}' for el in view_vs]
result = generate_question(query, information)
result
Output:
"If you're feeling like a drink, consider visiting Taqueria La Espuela
which is known for its refreshing horchata. Alternatively, you might enjoy
Chaat Bhavan Mountain View, a great place with good food and a lively atmosphere."
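The parsing step at the end of generate_question can be tried without calling the model. The following is a minimal sketch, assuming the model's reply arrives as a plain string; the helper name parse_answer is hypothetical and is not part of the tutorial or of the OpenAI client. It returns None rather than False on failure, which is a design choice of this sketch only.

```python
import json
from typing import Optional


def parse_answer(raw: str) -> Optional[str]:
    """Return the "answer" field of a JSON string, or None if it is missing or malformed."""
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON, or not a string at all
        return None
    if not isinstance(payload, dict):
        # Valid JSON, but not an object (for example a list)
        return None
    answer = payload.get("answer")
    return answer if isinstance(answer, str) else None


print(parse_answer('{"answer": "Try St. Stephen\'s Green."}'))
print(parse_answer("not json"))
```

Catching only the specific exceptions keeps unrelated errors, such as a network failure, from being silently reported as a parsing problem.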
Search on multiple datasets
In this approach, we perform the hybrid search across two separate datasets: vector_search for vector-based search results and ds_bm25 for BM25-based text search results. This allows us to independently query and retrieve scores from each dataset, then combine them using the same fusion method as before.
# Open the dataset used for BM25 search
ds_bm25 = deeplake.open(f"al://{org_id}/{dataset_name_bm25}")

# Query each dataset independently
vs_results = vector_search.query(tql_vs)
bm25_results = ds_bm25.query(tql_bm25)

# Normalize the raw scores of each result set
vs_score = vs_results["score"]
bm_score = bm25_results["score"]
vss = softmax(vs_score)
bm25s = softmax(bm_score)

# Index the results of each search by row_id
docs_vs = {}
docs_bm25 = {}
for el, score in zip(vs_results, vss):
    docs_vs[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)
for el, score in zip(bm25_results, bm25s):
    docs_bm25[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)

# Fuse the two result sets, then rank the documents by fused score
results = fusion(docs_vs, docs_bm25)
sorted_documents = dict(sorted(results.items(), key=lambda item: item[1].score, reverse=True))
for v in sorted_documents.values():
    print(f"Restaurant name: {v.data['restaurant_name']} \nReview: {v.data['restaurant_review']}")
Output:
Restaurant name: Olympus Caffe & Bakery
Review: I like the garden to sit down with friends and have a drink.
Restaurant name: St. Stephen's Green
Review: Nice place for a drink
Restaurant name: St. Stephen's Green
Review: Good drinks, good food
Restaurant name: Eureka! Mountain View
Review: Good drinks and burgers
Restaurant name: St. Stephen's Green
Review: Good drinks an easy going bartenders
Restaurant name: Scratch
Review: Just had drinks. They were good!
Restaurant name: Mifen101 花溪米粉王
Review: Feel like I’m back in China.
Restaurant name: Ludwigs Biergarten Mountain View
Review: Beer is fresh tables are big feel like a proper beer garden
Restaurant name: Seasons Noodles & Dumplings Garden
Review: Comfort food, excellent service! Feel like back to home.
Restaurant name: Casa Lupe
Review: Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos
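The normalize-then-fuse step can also be tried on toy data, independently of Deep Lake. The following is a minimal sketch, not the tutorial's own softmax, Document, or fusion definitions, which may differ in detail. The names ScoredDoc and weighted_fusion, the equal 0.5 weights, and the sample scores are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredDoc:
    id: str
    text: str
    score: float


def softmax(scores: List[float]) -> List[float]:
    """Map raw scores to positive values that sum to 1."""
    if not scores:
        return []
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(
    vs: Dict[str, ScoredDoc],
    bm25: Dict[str, ScoredDoc],
    vector_weight: float = 0.5,   # assumed weight
    lexical_weight: float = 0.5,  # assumed weight
) -> Dict[str, ScoredDoc]:
    """Sum the weighted scores of documents that appear in either result set."""
    fused: Dict[str, ScoredDoc] = {}
    for doc_id, doc in vs.items():
        fused[doc_id] = ScoredDoc(doc_id, doc.text, doc.score * vector_weight)
    for doc_id, doc in bm25.items():
        if doc_id in fused:
            fused[doc_id].score += doc.score * lexical_weight
        else:
            fused[doc_id] = ScoredDoc(doc_id, doc.text, doc.score * lexical_weight)
    return fused


# Toy results: cosine-like scores for vector search, unbounded scores for BM25
vs_ids, vs_raw = ["1", "2"], [0.9, 0.8]
bm_ids, bm_raw = ["2", "3"], [12.0, 3.0]

docs_vs = {i: ScoredDoc(i, f"doc {i}", s) for i, s in zip(vs_ids, softmax(vs_raw))}
docs_bm25 = {i: ScoredDoc(i, f"doc {i}", s) for i, s in zip(bm_ids, softmax(bm_raw))}

fused = weighted_fusion(docs_vs, docs_bm25)
ranked = sorted(fused.values(), key=lambda d: d.score, reverse=True)
ranked_ids = [d.id for d in ranked]
print(ranked_ids)
```

Document "2" is returned by both searches, so it receives a contribution from each side and ranks first. Normalizing before fusing matters because BM25 scores are unbounded while similarity scores are not, so the raw values are not directly comparable.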
Comparison of Sync vs Async Query Performance
This code performs an asynchronous query on a Deep Lake dataset. It begins by opening the dataset asynchronously using await deeplake.open_async(), specifying org_id and dataset_name_vs. It then starts the query with query_async() and calls .result() to wait for the results.
ds_async = await deeplake.open_async(f"al://{org_id}/{dataset_name_vs}")
ds_async_results = ds_async.query_async(tql_vs).result()
The following code compares the execution times of synchronous and asynchronous queries on Deep Lake datasets. It runs the same two queries, tql_vs on vector_search and tql_bm25 on ds_bm25, in both ways:
- Synchronous: it records start_sync, executes vector_search.query(tql_vs) and ds_bm25.query(tql_bm25) one after the other, records end_sync, and prints the elapsed time end_sync - start_sync.
- Asynchronous: it records start_async, calls run_async_queries(), which passes both query_async() calls to asyncio.gather() so that they run concurrently, records end_async, and prints the elapsed time end_async - start_async.
Finally, the code calculates the speed factor by dividing the synchronous time by the asynchronous time. A value above 1 means the asynchronous version finished faster. The call to nest_asyncio.apply() allows asyncio.run() to be used inside a notebook, where an event loop is already running.
import time
import asyncio
import nest_asyncio

nest_asyncio.apply()

async def run_async_queries():
    # Use asyncio.gather to run queries concurrently
    ds_async_results, ds_bm25_async_results = await asyncio.gather(
        vector_search.query_async(tql_vs),
        ds_bm25.query_async(tql_bm25)
    )
    return ds_async_results, ds_bm25_async_results

# Measure synchronous execution time
start_sync = time.time()
ds_sync_results = vector_search.query(tql_vs)
ds_bm25_sync_results = ds_bm25.query(tql_bm25)
end_sync = time.time()
print(f"Sync query time: {end_sync - start_sync}")

# Measure asynchronous execution time
start_async = time.time()
# Run the async queries concurrently using asyncio.gather
ds_async_results, ds_bm25_async_results = asyncio.run(run_async_queries())
end_async = time.time()
print(f"Async query time: {end_async - start_async}")

sync_time = end_sync - start_sync
async_time = end_async - start_async

# Calculate speed factor
speed_factor = sync_time / async_time

# Print the result
print(f"The async query is {speed_factor:.2f} times faster than the sync query.")
Output:
Sync query time: 0.09148645401000977
Async query time: 0.0657045841217041
The async query is 1.39 times faster than the sync query.
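The effect of asyncio.gather() can be reproduced without a dataset. The following is a minimal sketch in which asyncio.sleep() stands in for query latency; fake_query, run_sequential, run_concurrent, timed, and the 0.2-second delay are hypothetical and are not Deep Lake functions. It is written as a standalone script; inside a notebook, asyncio.run() would typically need nest_asyncio, as in the code above.

```python
import asyncio
import time


async def fake_query(name: str, delay: float) -> str:
    """Simulate a query that spends `delay` seconds waiting on I/O."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_sequential():
    # Each query starts only after the previous one has finished
    return [await fake_query("vector", 0.2), await fake_query("bm25", 0.2)]


async def run_concurrent():
    # Both queries wait at the same time
    return await asyncio.gather(fake_query("vector", 0.2), fake_query("bm25", 0.2))


def timed(coro_fn):
    """Run a coroutine function and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn())
    return results, time.perf_counter() - start


seq_results, seq_time = timed(run_sequential)
con_results, con_time = timed(run_concurrent)

print(f"Sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
print(f"Speed factor: {seq_time / con_time:.2f}")
```

With two equal delays the concurrent run takes roughly as long as one query instead of two. Real queries seldom overlap this cleanly, which is consistent with the more modest factor measured above.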
We can execute asynchronous queries even after loading the dataset synchronously. In the following example, we perform a BM25 query asynchronously on a dataset ds_bm25 that was loaded synchronously.
result_async_with_bm25 = ds_bm25.query_async(tql_bm25).result()
result_async_with_bm25
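Calling .result() immediately after starting a query waits for it to finish, so the call above behaves much like a synchronous one. The asynchronous form pays off when other work happens between starting the query and collecting its result. The following is a sketch of that pattern using the standard library's concurrent.futures as a stand-in; slow_query is hypothetical, and none of this is Deep Lake API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_query(label: str) -> str:
    """Simulate a query that takes a while to return."""
    time.sleep(0.2)
    return f"{label} results"


with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the query; submit() returns a future immediately
    future = pool.submit(slow_query, "bm25")

    # Do unrelated work while the query is running
    other_work = sum(range(1000))

    # Block only at the point where the results are actually needed
    results = future.result()

print(results, other_work)
```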
Conclusion
This chapter provides a step-by-step guide to building a hybrid search system, starting with BM25 and dense vector retrieval methods and culminating in their integration into a hybrid approach. By combining lexical and semantic retrieval, the hybrid system demonstrates how these methods complement each other to deliver more accurate and flexible results. This progression illustrates the practical value of hybrid search for achieving advanced functionality in modern information retrieval systems.
In the next chapter, we will explore advanced chunking methods. Naive chunking based on token counts and overlap can disrupt the logical flow and context of a text, and several techniques aim to fix this.