
Google makes Gemini Embedding 2 generally available as its first natively multimodal embedding model

The model maps text, images, video, audio and documents into a single vector space, replacing fragmented multi-model pipelines

by Defused News Writer

Google has made Gemini Embedding 2 generally available via the Gemini API and Vertex AI, moving its first natively multimodal embedding model out of the public preview it entered in March.

Embedding models convert raw content into numerical vectors that capture semantic meaning, enabling search, classification, clustering and retrieval-augmented generation (RAG) systems.

Where previous Google embedding models handled text exclusively, Gemini Embedding 2 maps text, images, video, audio and PDF documents into a single shared vector space, allowing a text query to retrieve a relevant video clip, image or spoken audio segment through one model and one API call.
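Conceptually, retrieval in a shared vector space reduces to nearest-neighbour search over embeddings, regardless of the source modality. The sketch below is illustrative only: the vectors are toy placeholders standing in for real model outputs, and no actual Gemini API calls are made.

```python
import numpy as np

# Toy placeholder embeddings standing in for model outputs; in practice
# each vector would come from embedding a text query, an image, a video
# clip or an audio segment with the same model.
query_vec = np.array([0.9, 0.1, 0.0])                  # text query
catalog = {
    "product_photo.jpg": np.array([0.8, 0.2, 0.1]),    # image embedding
    "unboxing_clip.mp4": np.array([0.7, 0.3, 0.2]),    # video embedding
    "podcast_mention.mp3": np.array([0.1, 0.2, 0.9]),  # audio embedding
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank all items, whatever their modality, against the text query.
ranked = sorted(catalog, key=lambda k: cosine(query_vec, catalog[k]),
                reverse=True)
print(ranked[0])  # the image embedding is the closest match here
```

Because every modality lands in the same space, a single similarity function ranks images, video and audio against a text query; with separate per-modality models, scores would not be directly comparable.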

The model is built on the Gemini architecture and generates 3,072-dimensional vectors by default, with support for flexible output dimensions down to 768 using a technique called Matryoshka Representation Learning, which preserves retrieval quality at smaller vector sizes.

Google's benchmarks show the 768-dimension output scores 67.99 on the MTEB text benchmark compared to 68.16 for the full 3,072, a negligible drop that translates to a 75% reduction in storage and compute costs at scale.
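Under Matryoshka Representation Learning, a smaller embedding is typically obtained by keeping the leading dimensions of the full vector and re-normalising, and the storage saving follows directly from the dimension ratio. A minimal sketch, using a random placeholder vector rather than a real model output:

```python
import numpy as np

FULL_DIM, SMALL_DIM = 3072, 768

rng = np.random.default_rng(0)
full = rng.standard_normal(FULL_DIM)
full /= np.linalg.norm(full)      # full 3,072-d embedding, unit length

# MRL-style truncation: keep the first 768 dimensions, then re-normalise
# so cosine similarities remain on a comparable scale.
small = full[:SMALL_DIM]
small = small / np.linalg.norm(small)

# The storage saving is just the dimension ratio: 768/3072 = 25% of the
# original size per vector, i.e. a 75% reduction.
saving = 1 - SMALL_DIM / FULL_DIM
print(f"{saving:.0%} smaller")
```

At, say, 100 million stored vectors of 4-byte floats, that is the difference between roughly 1.2 TB at 3,072 dimensions and 0.3 TB at 768.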

On the MTEB English leaderboard, the model scored 68.32, a margin of nearly six points over competitors, while also leading the multilingual and code retrieval benchmarks.

During preview, users built prototypes including advanced e-commerce discovery engines and video analysis tools, and early adopters reported latency reductions of up to 70% by eliminating the need for separate embedding models for each data type.

Legal technology company Everlaw has been using the model for litigation discovery across millions of records, while personal wellness app MindLid reported a 20% improvement in top-1 recall when embedding conversational memories alongside audio and visual data.

The model supports over 100 languages, processes up to 8,192 input tokens for text, handles video clips of up to 128 seconds, audio of up to 80 seconds in MP3 and WAV formats, and PDF documents of up to six pages per request.

It integrates with popular frameworks including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB and Google's Vector Search.

"As a core technology powering many Google products, we are excited to share these research breakthroughs with the developer community," the company said.

The general availability release aims to provide the stability and optimisations needed to move multimodal systems into production, replacing the fragmented pipelines that previously required separate models and preprocessing steps for each data type.

The recap

  • Gemini Embedding 2 becomes generally available for developers.
  • Available via the Gemini API and Vertex AI platforms.
  • Company says it will share research breakthroughs with developers.
