Build Multimodal Search With Vertex AI + Weaviate
Multimodal search — the ability to query across text, images, and video simultaneously — is rapidly becoming a must-have feature for modern applications. By combining Google Vertex AI's multimodal embedding models with the Weaviate vector database, developers can now build production-ready search systems that understand content across modalities with remarkable accuracy.
This guide walks through the architecture, setup, and implementation of a multimodal search application that lets users find results using text queries to surface images, or image queries to surface related text — all powered by a shared embedding space.
Key Takeaways
- Google Vertex AI's multimodal embeddings map text, images, and video into a single 1408-dimensional vector space, enabling cross-modal similarity search
- Weaviate natively supports multimodal vectorization through its multi2vec-palm module, simplifying integration with Vertex AI
- Combining these tools reduces development time from months to days compared to building custom embedding pipelines
- The architecture supports real-time search across millions of objects with sub-100ms latency at scale
- Developers can deploy locally for prototyping using Docker and scale to production on Google Cloud or Weaviate Cloud Services
- This approach outperforms traditional keyword-based search by understanding semantic meaning rather than exact matches
Why Multimodal Search Changes Everything
Traditional search systems treat text and images as entirely separate domains. A user searching for 'sunset over mountains' would only match documents containing those exact words — never an unlabeled photograph of that exact scene. Multimodal search eliminates this limitation entirely.
Google's multimodal embedding model (available through Vertex AI) processes text, images, and even short video clips into vectors that share a common mathematical space. When a sunset photo and the phrase 'sunset over mountains' produce vectors that are close together, search becomes truly cross-modal.
This capability unlocks use cases that were previously impossible or required expensive manual tagging. E-commerce platforms can let users search product catalogs with photos. Media companies can surface relevant archival footage using natural language queries. Healthcare organizations can match medical images with clinical descriptions.
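The idea of vectors being "close together" can be made concrete with cosine similarity. The sketch below uses hypothetical 4-dimensional vectors as stand-ins for real 1408-dimensional embeddings. The values and the helper names are invented for illustration and are not output from Vertex AI.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Sort (label, vector) pairs from most to least similar to the query."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings (real ones would have 1408 dimensions).
sunset_photo = [0.9, 0.1, 0.3, 0.0]
sunset_text = [0.8, 0.2, 0.4, 0.1]
invoice_text = [0.0, 0.9, 0.1, 0.7]

# A text query ranks the photo of the same scene above unrelated text.
ranking = rank_by_similarity(
    sunset_text,
    [("invoice_text", invoice_text), ("sunset_photo", sunset_photo)],
)
```

Because both modalities live in one space, the same function ranks images against a text query and text against an image query without any modality-specific logic.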
Understanding the Architecture Stack
The architecture for a multimodal search application with Vertex AI and Weaviate consists of four layers:
- Embedding Layer (Google Vertex AI): The multimodalembedding model from Google generates 1408-dimensional vectors for text inputs (up to 32 tokens), images (PNG, JPEG, BMP, GIF), and video clips (up to 2 minutes)
- Storage and Indexing Layer (Weaviate): Weaviate stores both the raw data and its vector representations, using HNSW (Hierarchical Navigable Small World) indexing for fast approximate nearest neighbor search
- Query Layer (Application): Your application sends queries — either text or image — to Weaviate, which vectorizes the query via Vertex AI and returns semantically similar results across all modalities
- Orchestration Layer: Docker Compose or Kubernetes manages the services, with Weaviate running as a containerized service alongside its Vertex AI integration module
Unlike solutions that require separate vector stores for each modality, Weaviate's native multimodal support means a single collection can hold text, image, and video objects in the same vector space. This dramatically simplifies both the codebase and operational overhead.
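To make the embedding layer more concrete, the sketch below builds the JSON body and URL for a direct Vertex AI prediction request. When the Weaviate module is enabled, Weaviate issues these calls itself, so this is only useful for understanding or debugging. The field names (instances, text, image.bytesBase64Encoded, parameters.dimension) and the URL pattern reflect the Vertex AI REST format as best understood here and should be checked against the current documentation. The function names and the project ID are hypothetical.

```python
import base64

# Output sizes the article lists for multimodalembedding@001.
ALLOWED_DIMENSIONS = (128, 256, 512, 1408)


def build_embedding_request(text=None, image_bytes=None, dimension=1408):
    """Build a request body asking for text and/or image embeddings."""
    if text is None and image_bytes is None:
        raise ValueError("provide text, image_bytes, or both")
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimension must be one of {ALLOWED_DIMENSIONS}")
    instance = {}
    if text is not None:
        instance["text"] = text
    if image_bytes is not None:
        # Images travel as base64 text inside the JSON body.
        encoded = base64.b64encode(image_bytes).decode("ascii")
        instance["image"] = {"bytesBase64Encoded": encoded}
    return {"instances": [instance], "parameters": {"dimension": dimension}}


def predict_url(project_id, location="us-central1",
                model="multimodalembedding@001"):
    """Assumed URL pattern for the model's predict endpoint."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/publishers/google/models/{model}:predict"
    )


# Hypothetical usage: a text query embedded at a reduced dimension.
body = build_embedding_request(text="sunset over mountains", dimension=512)
url = predict_url("my-gcp-project")
```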
Setting Up the Development Environment
Prerequisites and Authentication
Before writing any code, developers need to configure Google Cloud credentials and spin up the Weaviate instance. The process starts with enabling the Vertex AI API in your Google Cloud project, which currently costs $0 for the API itself — you pay only for embedding generation at roughly $0.0001 per call.
You will need:
- A Google Cloud project with Vertex AI API enabled
- A service account key (JSON) with the aiplatform.user role
- Docker and Docker Compose installed locally
- Python 3.9+ with the weaviate-client package (v4.x recommended)
- At least 4GB of RAM allocated to Docker for smooth operation
Launching Weaviate With the Vertex AI Module
Weaviate's Docker configuration makes module activation straightforward. The key is enabling the multi2vec-palm module, which handles communication with Google's embedding API. Your docker-compose.yml should specify the module and pass your Google Cloud credentials as environment variables.
The critical configuration parameters include setting ENABLE_MODULES to multi2vec-palm, providing your GOOGLE_APIKEY or mounting your service account credentials, and specifying the Google Cloud location (typically us-central1 for lowest latency from North America). Once the container is running, Weaviate exposes its REST and GraphQL APIs on port 8080 by default.
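As a rough illustration of the configuration described above, the following sketch renders a minimal docker-compose.yml as text. ENABLE_MODULES and GOOGLE_APIKEY come from the article. The remaining environment variables, the image registry path, and the gRPC port are assumptions based on typical single-node Weaviate setups, and the version tag is deliberately left as a placeholder. The function name render_compose is hypothetical.

```python
def render_compose(weaviate_version):
    """Return docker-compose.yml text for one Weaviate node with multi2vec-palm."""
    environment = {
        "ENABLE_MODULES": "multi2vec-palm",
        "DEFAULT_VECTORIZER_MODULE": "multi2vec-palm",
        # Read the key from the shell instead of hard-coding it in the file.
        "GOOGLE_APIKEY": "${GOOGLE_APIKEY}",
        # Anonymous access is assumed here for local prototyping only.
        "AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED": "'true'",
        "PERSISTENCE_DATA_PATH": "/var/lib/weaviate",
        "QUERY_DEFAULTS_LIMIT": "25",
        "CLUSTER_HOSTNAME": "node1",
    }
    lines = [
        "services:",
        "  weaviate:",
        f"    image: cr.weaviate.io/semitechnologies/weaviate:{weaviate_version}",
        "    ports:",
        '      - "8080:8080"    # REST and GraphQL',
        '      - "50051:50051"  # gRPC, assumed for the v4 Python client',
        "    environment:",
    ]
    lines += [f"      {key}: {value}" for key, value in environment.items()]
    return "\n".join(lines) + "\n"


# "X.Y.Z" is a placeholder; substitute a release that ships the module.
compose_text = render_compose("X.Y.Z")
```

Writing compose_text to docker-compose.yml and running docker compose up -d would start the container, provided GOOGLE_APIKEY is exported in the shell.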
Defining the Multimodal Schema
Weaviate requires a collection schema that tells it which properties contain text, which contain images, and how to vectorize them. This is where the multimodal magic happens — by declaring multiple data types under a single collection, Weaviate knows to embed each property using the appropriate modality encoder within Vertex AI's model.
A typical schema for a product search application might include a name property (text), a description property (text), and an image property (blob). The multi2vec-palm vectorizer configuration specifies weights for each field — for instance, giving the image 60% weight and the text fields 40% combined. This weighting determines how much each modality contributes to the final vector representation.
Developers should experiment with these weights based on their use case. Image-heavy applications like fashion search benefit from higher image weights, while document-centric applications should emphasize text. The weights are normalized automatically, so they just need to reflect relative importance.
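The product schema and 60/40 weighting described above might look like the following class definition, written as a plain dictionary of the kind Weaviate's schema endpoint accepts as JSON. The moduleConfig keys (projectId, location, textFields, imageFields, weights) are an assumption about the multi2vec-palm module's configuration format and should be verified against the Weaviate documentation. The collection name Product and the helper names are hypothetical. Weights are normalized in code only to make the proportions explicit.

```python
def normalize(weights):
    """Scale relative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {field: value / total for field, value in weights.items()}


def build_product_class(project_id, location="us-central1", raw_weights=None):
    """Return a class definition with one image field and two text fields."""
    # Relative importance: image 60%, the two text fields 20% each.
    raw_weights = raw_weights or {"name": 2, "description": 2, "image": 6}
    weights = normalize(raw_weights)
    text_fields = ["name", "description"]
    image_fields = ["image"]
    return {
        "class": "Product",
        "vectorizer": "multi2vec-palm",
        "moduleConfig": {
            "multi2vec-palm": {
                "projectId": project_id,
                "location": location,
                "textFields": text_fields,
                "imageFields": image_fields,
                "weights": {
                    "textFields": [weights[f] for f in text_fields],
                    "imageFields": [weights[f] for f in image_fields],
                },
            }
        },
        "properties": [
            {"name": "name", "dataType": ["text"]},
            {"name": "description", "dataType": ["text"]},
            {"name": "image", "dataType": ["blob"]},
        ],
    }


product_class = build_product_class("my-gcp-project")
```

A fashion catalog might pass raw_weights={"name": 1, "description": 1, "image": 8}, while a document-centric collection would shift weight toward the text fields.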
Ingesting Multimodal Data
Data ingestion follows Weaviate's standard batch import pattern, but with multimodal considerations. Images must be base64-encoded before insertion, and the Weaviate client handles sending them to Vertex AI for vectorization automatically.
A practical ingestion pipeline typically involves:
- Loading images from a local directory or cloud storage bucket (Google Cloud Storage integrates seamlessly)
- Converting each image to base64 string format
- Pairing images with their text metadata (titles, descriptions, tags)
- Using Weaviate's batch API to insert objects in groups of 100-200 for optimal throughput
- Monitoring the vectorization queue — Vertex AI has a default rate limit of 600 requests per minute for embedding generation
- Implementing retry logic for rate-limited requests using exponential backoff
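The encoding, pairing, and batching steps above can be sketched with the standard library alone. The step that actually sends each batch to Weaviate is left out, because it depends on the client version in use. The function names and the fake image bytes are hypothetical.

```python
import base64
from itertools import islice


def encode_image(raw_bytes):
    """Convert raw image bytes to a base64 string for a blob property."""
    return base64.b64encode(raw_bytes).decode("ascii")


def build_objects(records):
    """Pair each image with its metadata; records yields (bytes, dict) tuples."""
    for image_bytes, metadata in records:
        obj = dict(metadata)  # copy so the caller's dict is not mutated
        obj["image"] = encode_image(image_bytes)
        yield obj


def chunked(iterable, size=100):
    """Yield lists of at most `size` items until the input is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical input: 250 fake images with a title each.
records = [(b"fake-image-%d" % i, {"name": f"item {i}"}) for i in range(250)]
batches = list(chunked(build_objects(records), size=100))
```

Using generators keeps only one batch of base64 strings in memory at a time, which matters once the dataset is larger than available RAM.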
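For large datasets exceeding 100,000 objects, developers should consider using Weaviate's async batch import functionality and requesting a rate limit increase from Google Cloud. Assuming one embedding request per object, the default 600 requests per minute would need roughly 28 hours for 1 million product images (1,000,000 / 600 ≈ 1,667 minutes), so ingesting and vectorizing a dataset of that size in 4-6 hours requires a quota roughly five to seven times the default.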
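Retry logic with exponential backoff, as recommended for rate-limited requests, could be sketched as follows. RateLimitError is a hypothetical stand-in for whatever exception the client raises on a rate-limit response, and the delay parameters are illustrative defaults rather than recommended values.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a rate-limit (HTTP 429 style) failure."""


def with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0,
                 sleep=time.sleep):
    """Call operation(), retrying on RateLimitError with growing delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Delay doubles each attempt: 1s, 2s, 4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter keeps parallel workers from syncing up.
            sleep(delay + random.uniform(0, delay * 0.1))
```

A caller would wrap each batch submission, for example with_backoff(lambda: send_batch(batch)), where send_batch is whatever function performs the insert.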
Executing Cross-Modal Queries
Once data is indexed, the real power emerges. Weaviate supports 3 query patterns for multimodal search:
Text-to-image search allows users to type natural language queries like 'red leather handbag with gold buckle' and receive ranked image results. The query text is vectorized by Vertex AI and compared against all stored vectors using cosine similarity.
Image-to-image search takes an uploaded image as input and finds visually similar items. This is particularly valuable for 'shop the look' features in retail or duplicate detection in media libraries.
Image-to-text search uses an image query to find relevant text documents — useful for automatically generating descriptions or finding related articles for a given photograph.
Weaviate's GraphQL API handles all 3 patterns through its nearText and nearImage operators. The nearText operator accepts a string query, while nearImage accepts a base64-encoded image. Both return results ranked by vector distance, with optional filters on metadata properties like category, price range, or date.
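A minimal sketch of the two operators as GraphQL query strings follows. It assumes a collection named Product with name and description properties, which is hypothetical, and builds the payload that would be POSTed to the /v1/graphql endpoint on port 8080. String values are escaped with json.dumps, which is adequate for typical input but is not a full GraphQL serializer.

```python
import json


def near_text_query(collection, concepts, fields, limit=5):
    """Build a Get query that ranks objects by similarity to text concepts."""
    return (
        "{ Get { %s(nearText: {concepts: %s}, limit: %d) "
        "{ %s _additional { distance } } } }"
        % (collection, json.dumps(list(concepts)), limit, " ".join(fields))
    )


def near_image_query(collection, image_b64, fields, limit=5):
    """Build a Get query that ranks objects by similarity to a base64 image."""
    return (
        "{ Get { %s(nearImage: {image: %s}, limit: %d) "
        "{ %s _additional { distance } } } }"
        % (collection, json.dumps(image_b64), limit, " ".join(fields))
    )


def graphql_payload(query):
    """Wrap a query string in the JSON envelope a GraphQL endpoint expects."""
    return json.dumps({"query": query})


text_query = near_text_query(
    "Product",
    ["red leather handbag with gold buckle"],
    ["name", "description"],
)
image_query = near_image_query("Product", "aGVsbG8=", ["name"], limit=3)
```

The only difference between text-to-image and image-to-text search is which operator carries the query. The stored objects and the returned fields stay the same.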
Hybrid Search for Better Results
For production applications, combining vector search with BM25 keyword search through Weaviate's hybrid search feature often yields superior results. An alpha parameter between 0 and 1 controls the balance: 0 is pure keyword search and 1 is pure vector search. A value of 0.7 typically works well, favoring semantic understanding while still boosting exact keyword matches. This approach consistently outperforms pure vector search by 10-15% on relevance benchmarks.
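The effect of alpha can be illustrated with a simplified score fusion: each result list is min-max normalized, then blended with alpha as the vector weight. This is a sketch of the general idea, not Weaviate's exact fusion algorithm, and the scores below are invented.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, alpha=0.7):
    """Blend normalized scores; alpha=1 is vector only, alpha=0 keyword only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    vec = min_max(vector_scores) if vector_scores else {}
    kw = min_max(keyword_scores) if keyword_scores else {}
    fused = {
        # An object missing from one list contributes 0 for that component.
        key: alpha * vec.get(key, 0.0) + (1 - alpha) * kw.get(key, 0.0)
        for key in set(vec) | set(kw)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores: similarity for vector search, BM25 for keyword search.
vector_scores = {"bag_a": 0.92, "bag_b": 0.88, "belt": 0.40}
keyword_scores = {"bag_b": 7.5, "belt": 3.0}

ranking = hybrid_rank(vector_scores, keyword_scores, alpha=0.7)
```

With these toy numbers, bag_b overtakes bag_a at alpha 0.7 because its exact keyword match adds to an already strong vector score, while alpha 1.0 would leave bag_a on top.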
Performance Optimization and Scaling
Latency optimization is critical for user-facing search applications. Several strategies help achieve sub-100ms response times:
- Pre-compute and cache vectors for frequently searched queries
- Use Weaviate's HNSW index with efConstruction set to 128 and ef set to 64 for a strong balance between recall and speed
- Enable Weaviate's product quantization (PQ) to compress vectors from 1408 x 4 bytes to roughly 1408 bytes, reducing memory usage by 75%
- Deploy Weaviate on machines with at least 32GB RAM for datasets exceeding 500,000 objects
- Consider Weaviate Cloud Services (WCS) for managed infrastructure starting at approximately $25/month for development tiers
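The memory figures in the list above follow from simple arithmetic, sketched below together with an index configuration that uses the suggested HNSW values. The calculation assumes float32 storage and one byte per PQ segment with one segment per dimension, and it covers vectors only, not the HNSW graph or the stored objects. The vectorIndexConfig key names are an assumption about Weaviate's schema format.

```python
DIMENSIONS = 1408
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(dimensions=DIMENSIONS):
    """Size of one uncompressed float32 vector."""
    return dimensions * BYTES_PER_FLOAT32


def pq_vector_bytes(segments=DIMENSIONS):
    """Size of one PQ-compressed vector, assuming one byte per segment."""
    return segments


def memory_gib(object_count, bytes_per_vector):
    """Total vector memory in GiB for a collection of the given size."""
    return object_count * bytes_per_vector / 2**30


reduction = 1 - pq_vector_bytes() / raw_vector_bytes()
raw_gib = memory_gib(500_000, raw_vector_bytes())
pq_gib = memory_gib(500_000, pq_vector_bytes())

# Assumed shape of the index settings inside a collection definition.
vector_index_config = {
    "efConstruction": 128,
    "ef": 64,
    "pq": {"enabled": True},
}
```

Under these assumptions, 500,000 vectors take about 2.6 GiB uncompressed and about 0.66 GiB with PQ, which leaves most of a 32GB machine for the graph, object storage, and query overhead.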
Compared to building a similar system with OpenAI's CLIP model and a standalone vector database like Pinecone or Milvus, the Vertex AI + Weaviate combination offers tighter integration, lower operational complexity, and competitive pricing. However, OpenAI's embedding models currently support more languages out of the box, which may matter for global applications.
Industry Context and Growing Demand
The multimodal search market is experiencing explosive growth. Grand View Research estimates the global visual search market alone will reach $32 billion by 2028. Major players are investing heavily — Google, Amazon, and Microsoft have all launched or expanded multimodal AI services in 2024.
Weaviate, which raised $50 million in Series B funding in 2023, has positioned itself as the leading open-source vector database for multimodal use cases. Its native module system — supporting not just Vertex AI but also OpenAI, Cohere, Hugging Face, and AWS Bedrock — gives developers flexibility to switch providers without rewriting application logic.
Google's Vertex AI platform continues to expand its multimodal capabilities. The latest multimodalembedding@001 model supports 128, 256, 512, and 1408 dimensional outputs, letting developers trade accuracy for speed depending on their requirements.
What This Means for Developers and Businesses
For developers, this stack eliminates the most painful parts of building search: managing embedding pipelines, synchronizing multiple databases, and handling modality-specific preprocessing. A working prototype can be built in a single afternoon.
For businesses, multimodal search directly impacts revenue. E-commerce companies report 20-30% increases in search-to-purchase conversion rates after implementing visual search. Media companies reduce content discovery time by up to 60%. Customer support teams resolve image-based tickets 40% faster.
The barrier to entry has never been lower. Between Weaviate's open-source deployment option and Vertex AI's pay-per-use pricing, a startup can launch a production multimodal search feature for under $100/month at moderate scale.
Looking Ahead: The Multimodal Future
Google's Gemini models are expected to further enhance Vertex AI's multimodal capabilities throughout 2025, potentially adding audio embeddings and longer video support. Weaviate's roadmap includes native support for real-time streaming ingestion and improved multi-tenancy for SaaS applications.
The convergence of powerful multimodal models and purpose-built vector databases signals a fundamental shift in how applications handle search and retrieval. Developers who invest in understanding this architecture today will be well-positioned as multimodal AI becomes the default expectation for users — not a premium feature.
For teams ready to start building, Google's official Vertex AI documentation and Weaviate's recipe notebooks on GitHub provide working code samples that can be adapted for most use cases in under a day.
📌 Source: GogoAI News (www.gogoai.xin)
🔗 Original: https://www.gogoai.xin/article/build-multimodal-search-with-vertex-ai-weaviate
⚠️ Please credit GogoAI when republishing.