Introduction
In this practical project, we dive into the details of Multimodal RAG in Deep Lake, leveraging its key advantages like true multi-modality and cost-efficient scalability. This chapter builds on the Hybrid RAG project, using the same context of restaurants and dataset. However, if you missed the previous project, no worries—this works perfectly as a standalone exploration. Let’s jump into comparing burgers using image embeddings and discover how Deep Lake's advanced capabilities make it an ideal tool for seamless integration of visual and textual data!
To set up for image embedding generation, we start by importing the necessary libraries.
- Set Device: We define device to use the GPU if available, otherwise defaulting to the CPU, ensuring compatibility across hardware.
- Load CLIP Model: We load the CLIP model (ViT-B/32) with its associated preprocessing steps using clip.load(). This model is optimized for multi-modal tasks and is set to run on the specified device.
This setup allows us to efficiently process images for embedding, supporting multi-modal applications like image-text similarity.
The following image illustrates the structure of the CLIP (Contrastive Language-Image Pretraining) model, which aligns text and images in a shared embedding space, enabling cross-modal understanding.
import torch
import clip
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
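To make the idea of a shared embedding space concrete, here is a toy sketch in plain Python. The vectors are made-up stand-ins for CLIP outputs (real ViT-B/32 embeddings have 512 dimensions), and the helper names are hypothetical. It scores one image embedding against two caption embeddings using cosine similarity, then applies a softmax over the similarities scaled by 100, mirroring the zero-shot recipe shown in the CLIP README.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 3-dimensional embeddings (assumption: stand-ins for model outputs).
image_embedding = normalize([0.9, 0.1, 0.2])
caption_embeddings = {
    "a photo of a burger": normalize([0.8, 0.2, 0.1]),
    "a photo of a salad": normalize([0.1, 0.9, 0.3]),
}

# Cosine similarity of the image against each caption.
similarities = {
    caption: sum(a * b for a, b in zip(image_embedding, emb))
    for caption, emb in caption_embeddings.items()
}

# Scale by 100 before the softmax, as the CLIP README does for zero-shot scoring.
scaled = [100.0 * s for s in similarities.values()]
probabilities = dict(zip(similarities, softmax(scaled)))
best_caption = max(probabilities, key=probabilities.get)
```

With real embeddings, the same comparison works between any mix of images and text, which is what makes image-to-image and text-to-image search possible with a single index.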
Create the embedding function for images
To prepare images for embedding generation, we define a transformation pipeline and a function to process images in batches.
- Define Transformations (tform): The transformation pipeline includes:
  - Resize: Scales images to 224x224 pixels.
  - ToTensor: Converts images to tensor format.
  - Lambda: Replicates grayscale images across three channels to match the RGB format.
  - Normalize: Standardizes pixel values using the ImageNet per-channel means and standard deviations.
- Define embedding_function_images: This function generates embeddings for a list of images.
  - If images is a single string, it is wrapped in a list. Note that the transform itself expects PIL images, so file paths would need to be opened first.
  - Batch Processing: Images are processed in batches (default size 4), with the transformations applied to each image. The batch is then moved to the device.
  - Embedding Creation: The model encodes each batch into embeddings, which are accumulated in the embeddings list and returned as a single flat list.
This function supports efficient, batched embedding generation, useful for multi-modal tasks like image-based search.
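The Lambda and Normalize steps boil down to simple arithmetic. The sketch below reproduces them on nested Python lists rather than tensors, assuming pixel values are already scaled to [0, 1] as ToTensor produces. The function names are illustrative and not part of torchvision.

```python
MEAN = [0.485, 0.456, 0.406]  # per-channel means used in the pipeline (ImageNet statistics)
STD = [0.229, 0.224, 0.225]   # per-channel standard deviations

def to_three_channels(image):
    """Repeat a single-channel image three times; leave other images unchanged.

    `image` is a list of channels, each channel a list of rows of floats.
    """
    return image * 3 if len(image) == 1 else image

def normalize_channels(image, mean=MEAN, std=STD):
    """Apply (pixel - mean) / std independently to each channel."""
    return [
        [[(pixel - m) / s for pixel in row] for row in channel]
        for channel, m, s in zip(image, mean, std)
    ]

# A 1x2 grayscale "image" with values in [0, 1].
gray = [[[0.0, 1.0]]]
rgb = to_three_channels(gray)
normalized = normalize_channels(rgb)
```

After normalization a black pixel and a white pixel map to different values in each channel, because every channel has its own mean and standard deviation.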
from torchvision import transforms

tform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # Replicate single-channel (grayscale) tensors across three channels
    transforms.Lambda(lambda x: torch.cat([x, x, x], dim=0) if x.shape[0] == 1 else x),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def embedding_function_images(images, model=model, transform=tform, batch_size=4):
    """Creates a list of embeddings based on a list of images. Images are processed in batches."""
    if isinstance(images, str):
        images = [images]

    # Process the embeddings in batches, but return everything as a single list
    embeddings = []
    for i in range(0, len(images), batch_size):
        batch = torch.stack([transform(item) for item in images[i:i + batch_size]])
        batch = batch.to(device)
        with torch.no_grad():
            embeddings += model.encode_image(batch).cpu().numpy().tolist()

    return embeddings
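The batching pattern inside embedding_function_images can be seen in isolation with a stand-in encoder. In this sketch, fake_encode is a hypothetical placeholder for the transform plus model.encode_image, returning one short vector per item so the control flow can be checked without a GPU or model weights.

```python
def fake_encode(batch):
    """Hypothetical stand-in for the model: one 2-number 'embedding' per item."""
    return [[float(len(item)), 0.0] for item in batch]

def embed_in_batches(items, encode=fake_encode, batch_size=4):
    """Encode items batch by batch and return one flat list of embeddings."""
    if isinstance(items, str):  # allow a single item, as the original function does
        items = [items]
    embeddings = []
    batch_sizes = []  # recorded only to make the batching visible
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        batch_sizes.append(len(batch))
        embeddings += encode(batch)
    return embeddings, batch_sizes

names = [f"img_{i}.jpg" for i in range(10)]
embeddings, batch_sizes = embed_in_batches(names)
```

Ten items with a batch size of 4 yield two full batches and one partial batch, while the caller still receives a single flat list with one embedding per input.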
Create a new dataset to save the images
We set up a dataset for restaurant images and embeddings. The dataset includes an embedding column for 512-dimensional image embeddings, a restaurant_name column for names, and an image column for storing images in UInt8 format. After defining the structure, vector_search_images.commit() saves it, making the dataset ready for storing data for multi-modal search tasks with images and metadata.
import deeplake
scraped_data = deeplake.open_read_only("al://activeloop/restaurant_dataset_complete")
This code extracts restaurant details from scraped_data into separate lists:
- Initialize Lists: restaurant_name and images are initialized to store the respective data for each restaurant.
- Populate Lists: For each entry (el) in scraped_data, the code appends:
  - el['restaurant_name'] to restaurant_name
  - el['images']['urls'] to images
After running, each list holds a specific field from all restaurants, ready for further processing.
restaurant_name = []
images = []
for el in scraped_data:
    restaurant_name.append(el['restaurant_name'])
    images.append(el['images']['urls'])
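For readers without access to the dataset, the same loop can be tried on a couple of hand-written rows. The rows below are invented and only mimic the two fields the loop reads (restaurant_name and images['urls']); the real dataset may carry additional fields.

```python
# Invented rows mimicking the fields accessed in the loop (assumption).
sample_rows = [
    {
        "restaurant_name": "Burger Place",
        "images": {"urls": ["https://example.com/a.jpg", "https://example.com/b.jpg"]},
    },
    {"restaurant_name": "Salad Bar", "images": {"urls": []}},
]

restaurant_name = []
images = []
for el in sample_rows:
    restaurant_name.append(el["restaurant_name"])
    images.append(el["images"]["urls"])

# The two lists stay aligned: images[i] holds the URLs for restaurant_name[i].
pairs = list(zip(restaurant_name, images))
```

Note that images is a list of lists: each restaurant contributes its own list of URLs, which may be empty.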
from deeplake import types

# org_id must hold your Activeloop organization ID (it is not defined in this section)
image_dataset_name = "restaurant_dataset_with_images"
vector_search_images = deeplake.create(f"al://{org_id}/{image_dataset_name}")

vector_search_images.add_column(name="embedding", dtype=types.Embedding(512))
vector_search_images.add_column(name="restaurant_name", dtype=types.Text())
vector_search_images.add_column(name="image", dtype=types.Image(dtype=types.UInt8()))

vector_search_images.commit()
Convert the URLs into images
We retrieve the images for each restaurant from the URLs collected from scraped_data and store them in restaurants_images. For each restaurant, we request every image URL and keep only successful responses (status code 200). Each response is opened as a PIL image, and only images in RGB mode are kept. If a restaurant ends up with no usable images, a blank white 224x224 placeholder is added so that every restaurant has at least one entry. The result, restaurants_images, is a list of lists, with each sublist containing the images for one restaurant.
#!pip install requests
import requests
from PIL import Image
from io import BytesIO

restaurants_images = []
for urls in images:
    pil_images = []
    for url in urls:
        response = requests.get(url)
        if response.status_code == 200:
            image = Image.open(BytesIO(response.content))
            if image.mode == "RGB":
                pil_images.append(image)
    if len(pil_images) == 0:
        pil_images.append(Image.new("RGB", (224, 224), (255, 255, 255)))
    restaurants_images.append(pil_images)
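The filtering and placeholder logic can be separated from the network calls. The sketch below takes the fetch step as a parameter so it can run offline; collect_images, fake_fetch, and the PLACEHOLDER marker are hypothetical names, and the RGB-mode check is left out for brevity. In a real run, one might also pass a timeout to requests.get and catch requests.RequestException so that a single unreachable URL does not stop the whole loop.

```python
PLACEHOLDER = "blank-224x224"  # stands in for the white image used in the original code

def collect_images(urls, fetch):
    """Keep payloads from successful fetches; fall back to one placeholder if none succeed.

    `fetch` is any callable returning (status_code, payload) for a URL.
    """
    kept = []
    for url in urls:
        try:
            status, payload = fetch(url)
        except OSError:
            continue  # treat connection problems like a failed download
        if status == 200:
            kept.append(payload)
    return kept or [PLACEHOLDER]

def fake_fetch(url):
    """Offline stand-in for an HTTP request (assumption for this sketch)."""
    if "missing" in url:
        return 404, None
    if "down" in url:
        raise OSError("connection refused")
    return 200, f"bytes of {url}"

good = collect_images(
    ["https://example.com/a.jpg", "https://example.com/missing.jpg"], fake_fetch
)
none = collect_images(["https://example.com/down.jpg"], fake_fetch)
```

Guaranteeing at least one entry per restaurant keeps restaurants_images aligned with the rows of scraped_data, which matters when the two are zipped together later.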
We populate vector_search_images with restaurant image data and embeddings. For each restaurant in scraped_data, we take its name and its downloaded images, create an embedding for each image, and convert the images to UInt8 arrays. The name is repeated once per image so that all three columns have the same length. We then append the names, image arrays, and embeddings to the dataset and save everything with vector_search_images.commit().
import numpy as np

for sd, rest_images in zip(scraped_data, restaurants_images):
    restaurant_name = [sd["restaurant_name"]] * len(rest_images)
    embeddings = embedding_function_images(rest_images, model=model, transform=tform, batch_size=4)
    vector_search_images.append({
        "restaurant_name": restaurant_name,
        "image": [np.array(fn).astype(np.uint8) for fn in rest_images],
        "embedding": embeddings,
    })

vector_search_images.commit()
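Because each append passes three parallel lists, a quick consistency check before writing can catch mismatches early. This is an optional sketch and not part of the Deep Lake API: check_batch is a hypothetical helper that only inspects plain Python lists.

```python
EMBEDDING_DIM = 512  # matches the Embedding(512) column defined earlier

def check_batch(batch, dim=EMBEDDING_DIM):
    """Raise ValueError if the columns differ in length or an embedding has the wrong size."""
    lengths = {name: len(values) for name, values in batch.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    for i, emb in enumerate(batch["embedding"]):
        if len(emb) != dim:
            raise ValueError(f"embedding {i} has {len(emb)} values, expected {dim}")
    return True

ok_batch = {
    "restaurant_name": ["Burger Place"] * 2,
    "image": ["img0", "img1"],  # placeholders for the uint8 arrays
    "embedding": [[0.0] * 512, [0.0] * 512],
}
batch_is_valid = check_batch(ok_batch)
```

Calling such a check right before each append would surface a mismatch for a specific restaurant instead of a less specific error from the storage layer.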
Search similar images
If you want direct access to the images and the embeddings, you can copy the Activeloop dataset.
deeplake.copy("al://activeloop/restaurant_dataset_images_v4", f"al://{org_id}/{image_dataset_name}")
vector_search_images = deeplake.open(f"al://{org_id}/{image_dataset_name}")
Alternatively, you can load the dataset you just created.
vector_search_images = deeplake.open(f"al://{org_id}/{image_dataset_name}")
vector_search_images
query = "https://www.moltofood.it/wp-content/uploads/2024/09/Hamburger.jpg"
image_query = requests.get(query)
image_query_pil = Image.open(BytesIO(image_query.content))
Performing a similar image search based on a specific image
image_query_pil
Output:
We generate an embedding for the query image, image_query_pil, by calling embedding_function_images([image_query_pil])[0]. This embedding is then converted into a comma-separated string, query_embedding_string, so it can be inserted into the query text.
The query, tql, retrieves entries from the dataset by calculating the cosine similarity between the embedding column and query_embedding_string. It ranks results by similarity score in descending order, limiting the output to the top 6 most similar images.
query_embedding = embedding_function_images([image_query_pil])[0]
query_embedding_string = ",".join([str(item) for item in query_embedding])

tql = f"""
    SELECT *
    FROM (
        SELECT *, cosine_similarity(embedding, ARRAY[{query_embedding_string}]) AS score
        FROM (
            SELECT *, ROW_NUMBER() AS row_id
        )
    )
    ORDER BY score DESC
    LIMIT 6
"""

similar_images_result = vector_search_images.query(tql)
similar_images_result
Output:
Dataset(columns=(embedding,restaurant_name,image,row_id,score), length=6)
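For intuition, the ranking this query performs can be reproduced in plain Python on toy vectors. This sketch is not how Deep Lake executes the query; top_k_by_cosine and the three-dimensional vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query, rows, k=6):
    """Score every row against the query, sort by score descending, keep the first k."""
    scored = [
        {"row_id": row_id, "name": name, "score": cosine_similarity(query, emb)}
        for row_id, (name, emb) in enumerate(rows)
    ]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:k]

rows = [
    ("Salad Bar", [0.0, 1.0, 0.0]),
    ("Burger Place", [1.0, 0.1, 0.0]),
    ("Burger Shack", [1.0, 0.0, 0.0]),
]
results = top_k_by_cosine([1.0, 0.0, 0.0], rows, k=2)
```

Cosine similarity depends only on the angle between vectors, so embeddings do not need to be normalized beforehand for the ranking to be meaningful.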
Show similar images and their respective restaurants
The show_images function displays a grid of similar images, along with restaurant names and similarity scores. It defines a grid with 3 columns and calculates the required number of rows based on the number of images. A figure with subplots is created, where each image is displayed in a cell with its restaurant name and similarity score shown as the title, and axes turned off for a cleaner look. Any extra cells, if present, are hidden to avoid empty spaces. Finally, plt.tight_layout() arranges the grid, and plt.show() displays the images in a well-organized layout, highlighting the most similar images along with their metadata.
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

def show_images(similar_images: list[dict]):
    # Define the number of rows and columns for the grid
    num_columns = 3
    num_rows = (len(similar_images) + num_columns - 1) // num_columns  # Calculate the required number of rows

    # Create the grid
    fig, axes = plt.subplots(num_rows, num_columns, figsize=(15, 5 * num_rows))
    axes = axes.flatten()  # Flatten for easier access to cells

    for idx, el in enumerate(similar_images):
        img = Image.fromarray(el["image"])
        axes[idx].imshow(img)
        axes[idx].set_title(f"Restaurant: {el['restaurant_name']}, Similarity: {el['score']:.4f}")
        axes[idx].axis('off')  # Turn off axes for a cleaner look

    # Remove empty axes if the number of images doesn't fill the grid
    for ax in axes[len(similar_images):]:
        ax.axis('off')

    plt.tight_layout()
    plt.show()

show_images(similar_images_result)
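The row count in show_images relies on integer ceiling division. A tiny sketch of that arithmetic, with grid_shape as a hypothetical helper name:

```python
def grid_shape(num_images, num_columns=3):
    """Return (rows, columns, unused_cells) for a grid that fits num_images."""
    num_rows = (num_images + num_columns - 1) // num_columns  # ceiling division
    unused = num_rows * num_columns - num_images
    return num_rows, num_columns, unused

shape_for_six = grid_shape(6)
shape_for_seven = grid_shape(7)
```

Six results fill a 2x3 grid exactly, while seven would need a third row with two cells left over, which is why the function hides any unused axes.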
Conclusion
Through the delicious lens of burger images, this project showcased the power of multimodal RAG systems. By leveraging Deep Lake's capabilities, we explored embedding generation, dataset creation, and similarity-based search, demonstrating how to seamlessly integrate visual and textual data for versatile and practical applications.
In the next chapter, we are going to explore a new and exciting technique for working with image data - ColPali!