April 18, 2024 • 6 min read

Create Your Own Multimodal Search Engine Using Google’s Vertex AI

Written by Hugo Borsoni

We all know at least one search engine. Almost everybody uses Google Search daily, and one thing is for sure: it works really well! But with the fast-paced development of AI in recent months, many people are turning to AI-enhanced alternatives for their searches, such as asking ChatGPT. In this context, I asked myself: could I come up with a different search engine using AI? Could I recreate a version of Google Images that understands language, on my own image dataset? I had heard something about this Vertex AI service in GCP...

You can also jump directly to how I did it, or go straight to my GitHub repo.

How does a search engine work?

General search engines

To create a search engine, one must first understand how these things work. A search engine's job can be divided into 2 steps: indexing and querying its data.

  • Indexing means storing the data in a practical way, format, and data structure.
  • Querying means retrieving the data, with the aim of getting the most pertinent results as fast as possible.

The indexing must be optimized so that adding data to the search engine is not too slow, but above all so that query results are pertinent and fast to obtain.

My search engine


For my search engine, I used embeddings for indexing. Simply put, an embedding turns complex data into a numerical vector that captures its meaning. The technique is fairly recent in the AI ecosystem, and it is one of the underlying technologies of ChatGPT. If you want to know more about embeddings, I recommend this amazing article that helped me better understand what makes our beloved assistant work. The famous math YouTube channel 3Blue1Brown also made a super cool video series about this and GPTs in general. With embeddings, you can get mathematical relationships between words that carry meaning, like in this schema:

An image of 3 graphs showing semantic relationships between vector embeddings
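To make the idea of "relationships that carry meaning" concrete, here is a toy sketch in Python. The 3-dimensional vectors are hand-made for illustration only (the axis meanings and values are assumptions, not the output of any real model), but they show how vector arithmetic can encode an analogy.

```python
import math

# Hand-made toy "embeddings"; real models use hundreds of dimensions.
# Axes, by construction: royalty, masculinity, femininity.
TOY_EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.1, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [
        x - y + z
        for x, y, z in zip(TOY_EMBEDDINGS[a], TOY_EMBEDDINGS[b], TOY_EMBEDDINGS[c])
    ]
    candidates = [word for word in TOY_EMBEDDINGS if word not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, TOY_EMBEDDINGS[w]))


print(analogy("king", "man", "woman"))  # prints "queen" with these toy vectors
```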

To generate embeddings from content, you need a trained AI model. Some of the most used models nowadays are BERT, text-ada … However, these models only work for text, and I said I wanted to recreate Google Images!

Multimodal embeddings

The model I will be using is Google’s Vertex AI multimodalembeddings. It can generate embeddings from text AND from images! They even added video since I released my repo. It is the only embeddings model I know that is capable of doing that.

What’s even better, the embeddings made from images and the ones made from text live in the same vector space, so they can be compared directly. This means that an embedding generated from a picture of a dog and an embedding generated from a perfect text description of that picture could theoretically be exactly the same. In practice, the two embeddings will be very close, and we will need a way to measure that. I’ll tell you all about it in the next sections of this article.

Now that we know how a search engine works and how to make ours, let’s dive into the implementation!

Indexing your content with Vertex AI

Vertex AI data indexing Architecture

 A schema of the architecture of the indexing in my app

The indexing in my app is done in 3 steps:

  • Storing the image on the server
  • Computing an embedding from the image using GCP Vertex AI multimodalembeddings
  • Storing the embedding and image path in a specialized DB called a vector store. I decided to use qdrant because it is open source and is among the best-performing vector stores. See this article about qdrant’s performance.

To finish talking about architecture, here is the complete stack you will find in my repo:

  • I used a simple React app in TypeScript for the frontend
  • I used FastAPI (a Python framework) for the backend

Vertex AI indexing code implementation

Main script

This is the script I use to index images for my search engine. You need to provide a folder containing the images you want your search engine to show.

The script:

  • creates an instance of VectorDB to access the database.
  • creates a folder in which to store the images.
  • gets the embeddings for the images.
  • inserts the embeddings and image paths in the vector db.
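The four steps above can be sketched as follows. This is a minimal, standard-library-only illustration, not the script from the repo: `fake_embed_image` is a hypothetical placeholder for the Vertex AI call, and a plain list of records stands in for the vector DB.

```python
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def fake_embed_image(image_bytes: bytes) -> list[float]:
    """Hypothetical stand-in for the Vertex AI request: returns a tiny made-up vector."""
    return [float(len(image_bytes)), float(sum(image_bytes) % 256)]


def index_folder(source_dir: Path, storage_dir: Path, embed=fake_embed_image) -> list[dict]:
    """Copy images into storage_dir and build one (vector, payload) record per image."""
    storage_dir.mkdir(parents=True, exist_ok=True)  # folder in which to store the images
    records = []
    for source in sorted(source_dir.iterdir()):
        if source.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not an image
        stored = storage_dir / source.name
        shutil.copy(source, stored)
        records.append(
            {
                "vector": embed(stored.read_bytes()),  # embedding for the image
                "payload": {"path": str(stored)},      # what the search will return
            }
        )
    return records  # in the real app, these records are inserted in the vector DB
```

Keeping the embedding function as a parameter makes it easy to swap the placeholder for a real client later.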

Creating a Vector DB instance

This script uses the very convenient qdrant Python library to create a collection if it does not exist. Otherwise, it just connects to the existing collection.

Embeddings distance

If you are reading the code carefully, you will notice a distance=Distance.COSINE part when we create the collection.

This parameter sets the metric used to compute the distance between embeddings. With qdrant, we can use 3 different metrics. When indexing your vector database, make sure to use the same distance the embedding model was trained with.

We are using the cosine distance because multimodalembedding@001 was very probably trained using this metric, although I was not able to confirm it. The only hint I have comes from a Jupyter notebook from the GCP team.

When you do not know which distance your model was trained with, cosine is a safe bet as it is by far the most commonly used.

Here are 4 distances commonly used in vector search. Keep in mind that the vectors we are talking about often have hundreds of dimensions! Notice how the cosine similarity and the dot product are the same for normalized vectors.
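As a small worked example, here are these measures in plain Python on 2-dimensional vectors, assuming the four in question are the dot product, cosine similarity, Euclidean distance and Manhattan distance. The last lines check the claim that cosine similarity and dot product coincide once the vectors are normalized.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def normalize(a):
    """Scale a vector to length 1."""
    length = norm(a)
    return [x / length for x in a]


a, b = [3.0, 4.0], [4.0, 3.0]
print("dot:", dot(a, b))
print("cosine:", cosine_similarity(a, b))
print("euclidean:", euclidean(a, b))
print("manhattan:", manhattan(a, b))

# For unit-length vectors, the dot product IS the cosine similarity.
unit_a, unit_b = normalize(a), normalize(b)
print(math.isclose(dot(unit_a, unit_b), cosine_similarity(a, b)))
```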


Vertex AI multimodalembeddings client

When I first tried to use Vertex’s multimodalembeddings model, I wanted to use the famous langchain library. Although some experimental work was done in the JavaScript version of langchain to query the model, it is not available in the main, Python version of langchain. I saw an issue asking for the feature, but it was closed because it was too old, so I opened a discussion. Do not hesitate to react to it if you are interested!

Since the langchain implementation does not exist, I had to create my own EmbeddingsClient class, which I use to send embeddings requests to Vertex. I bundle the image in a protobuf Struct to follow the request format expected by Vertex, and recover the embeddings from the response. You need to authenticate to gcloud on the server from which you send the requests.
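To give an idea of what goes into that Struct, here is a sketch of the image instance and of the response parsing as plain Python dictionaries. The function names are hypothetical, and the field names (`image.bytesBase64Encoded` in the request, `imageEmbedding` in the prediction) are my understanding of the Vertex AI multimodal embeddings format, so check them against the current API reference. The conversion to a protobuf Struct and the network call itself are deliberately left out.

```python
import base64


def build_image_instance(image_bytes: bytes) -> dict:
    """Build the 'instance' part of the request for one image."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {"image": {"bytesBase64Encoded": encoded}}


def extract_image_embedding(prediction: dict) -> list[float]:
    """Pull the embedding vector out of one prediction of the response."""
    return list(prediction["imageEmbedding"])


instance = build_image_instance(b"abc")  # stand-in for real image bytes
print(instance)

# Made-up prediction, shaped like what the service is assumed to return.
fake_prediction = {"imageEmbedding": [0.1, 0.2, 0.3]}
print(extract_image_embedding(fake_prediction))
```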

A potential improvement could be to pass the API keys from the env directly to the EmbeddingsClient, instead of having to log in separately.

Sending batch requests to Vertex

I was using Vertex AI free credits (you can get $300 of free credits if you register for the first time in Google Cloud!!). When you work with Vertex AI, and especially when working with free credits, you have rate-limiting quotas that you can check in the Google Cloud console. The base quota for the multimodal embeddings is 120 requests per minute. You can submit a request to raise this quota if you need to.

This means I had to implement some rate limiting when creating embeddings:

In the script load_image_embeddings, I start by creating threads, to be able to send requests in parallel from a worker while another is waiting for its response. If you are not familiar with threading in Python, I recommend this nice article.

  • I use tqdm to show the progress of the embeddings generation, because the indexing of the search engine is meant to be a script that you run on the machine. Feel free to propose better implementations!
  • For every image, I load the bytes of the image, use my brand new embeddings client to send an embedding request for the image bytes, and wait for the response in my thread
  • Once the processing of all embeddings is complete, I put all the embeddings together in a big array, and return it.
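Here is one way the rate-limited, threaded loop described above could be sketched with the standard library only. It is an assumption-laden illustration rather than the repo's `load_image_embeddings`: the `RateLimiter` class and `embed_all` function are hypothetical names, the progress bar is omitted, and the limiter spaces requests evenly, which is stricter than the quota actually requires.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most `max_per_minute` calls per minute by spacing them evenly."""

    def __init__(self, max_per_minute: int):
        self._interval = 60.0 / max_per_minute
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        time.sleep(max(0.0, slot - now))


def embed_all(image_paths, embed_one, max_per_minute=120, workers=4):
    """Embed every path in parallel threads without exceeding the quota."""
    limiter = RateLimiter(max_per_minute)

    def task(path):
        limiter.wait()          # blocks until this request is allowed to start
        return embed_one(path)  # the (slow) network call would happen here

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as image_paths.
        return list(pool.map(task, image_paths))


# Usage with a fake embedder, so nothing is sent over the network.
fake_embed = lambda path: [float(len(path))]
print(embed_all(["a.jpg", "bb.jpg"], fake_embed, max_per_minute=600))
```

Threads suit this job because each worker spends most of its time waiting for the network, so several requests can be in flight while the limiter keeps the overall pace under the quota.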

Searching in a vector DB

Now that the potential results of our search engine are indexed, we need a way to search through them! Let’s dive straight into the architecture.

Searching Vertex AI embeddings in a qdrant DB - Architecture

A graph representing the architecture of the searching part of my app

The searching in the app is done in 5 steps:

  • The user sends a text or an image query.
  • The server formats a request using my Vertex client and Vertex computes an embedding for the query.
  • We perform a search with qdrant to find the 10 indexed embeddings that are the closest to the embedding we just got from Vertex.
  • We retrieve the payload for these embeddings: the image path on the server.
  • The server returns the links for the images to the frontend.

Searching Vertex AI embeddings in a qdrant DB - Code implementation

Follow me for another code implementation before seeing my demo!

Sending an embeddings request to Vertex

This is done exactly like it was for the indexing part. It is worth noting that I also created a function to get embeddings for a sentence.

The only difference is the keys in the struct we send to Vertex AI.
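As an illustration of that difference, a text request could be sketched like this. The function names are hypothetical, and the keys (`text` in the instance, `textEmbedding` in the prediction) reflect my understanding of the Vertex AI format; an image request would use an `image` key holding base64-encoded bytes and read `imageEmbedding` instead.

```python
def build_text_instance(sentence: str) -> dict:
    """Build the 'instance' part of the request for a text query."""
    return {"text": sentence}


def extract_text_embedding(prediction: dict) -> list[float]:
    """Pull the text embedding out of one prediction of the response."""
    return list(prediction["textEmbedding"])


instance = build_text_instance("a plate of vegetarian tacos")
print(instance)

# Made-up prediction, shaped like what the service is assumed to return.
fake_prediction = {"textEmbedding": [0.12, -0.03, 0.4]}
print(extract_text_embedding(fake_prediction))
```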

Querying the DB

Querying the database is again very simple thanks to the amazing qdrant client.

It is so simple that I will take the opportunity to briefly explain the underlying algorithm used for the search. You can jump straight to the last section if you just want to see the demo.

How does the search work?

The algorithm used for search is called Hierarchical Navigable Small World, or HNSW. There is a good article here if you want to dive a bit deeper. The idea is to expand, layer by layer, the number of vectors accessible to the search.

You start at the top layer, and you try to get as close as possible to the query vector, using the distance you chose during the creation of the collection. Once you cannot get closer, you move down one layer and try to get as close as possible using the newly available vectors.

The closest vector on the last layer has a very good chance of being the true closest vector to your query, but there are edge cases where this is not true. However, the time complexity gains are so big that this is the most widely used solution in the industry for vector search.

An image representing how the HNSW algorithm works

This is very efficient because the cost of a search now depends mainly on 2 things:

  • the number of layers

  • the number of neighbors visited during the search. This is controlled by the value hnsw_ef (the number of links each node keeps is a separate, index-time setting called m). From the qdrant doc:

    hnsw_ef - controls the number of neighbors to visit during search. The higher the value, the more accurate and slower the search will be. Recommended range is 32-512.
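The layer-by-layer descent described in this section can be sketched in a few lines of Python. This is a deliberately simplified illustration, not qdrant's implementation: the graph is hand-built on 9 points along a line, and the search keeps a single current candidate, whereas a real HNSW search keeps a list of candidates whose size is what hnsw_ef controls.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_layered_search(vectors, layers, entry_point, query, distance=euclidean):
    """Walk from the top layer down, moving to the closest neighbor until none is closer."""
    current = entry_point
    for graph in layers:  # layers[0] is the sparsest (top) layer
        while True:
            neighbors = graph.get(current, ())
            best = min(neighbors, key=lambda n: distance(vectors[n], query), default=current)
            if distance(vectors[best], query) >= distance(vectors[current], query):
                break  # local minimum on this layer: go down one layer
            current = best
    return current


# Toy data: 9 points on a line, with denser and denser layers.
vectors = {i: [float(i)] for i in range(9)}
top = {0: [4], 4: [0, 8], 8: [4]}
middle = {0: [2], 2: [0, 4], 4: [2, 6], 6: [4, 8], 8: [6]}
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)}

nearest = greedy_layered_search(vectors, [top, middle, bottom], entry_point=0, query=[6.8])
print(nearest)  # 7: the search goes 0 -> 4 -> 8 on top, 8 -> 6 in the middle, 6 -> 7 at the bottom
```

With a single candidate, the search can get stuck in a local minimum on less regular graphs, which is why visiting more neighbors (a higher hnsw_ef) trades speed for accuracy.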

A short demo!

You made it this far, congrats!

Here is a demo of the search engine indexed on a food database for you to enjoy. In this case, I gave it an image of a vegetarian taco, and these are the results it gave me. Thank you for reading!

My food search engine!

Do not hesitate to reach out to me if you want to try this on a project and need some help. Here is the GitHub repo with my implementation.

Also, since I made this repo, GCP added video to multimodalembeddings! I think it is a must-try and would be curious to see the amazing use cases one can come up with.

The future of AI is multimodal!!
