Google DeepMind has released Gemini Embedding 2, its first AI model capable of understanding and comparing text, images, video, audio and documents within a single unified system.
To understand why that matters, it helps to know what an embedding model does: it converts different types of content into numerical representations that a computer can compare, allowing a search system to find images that match a written description, or retrieve a document based on a spoken question.
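The core operation is simpler than it sounds: each piece of content becomes a list of numbers, and "similar meaning" becomes "vectors pointing in a similar direction." The toy sketch below illustrates the comparison step with cosine similarity; the four-dimensional vectors are invented placeholders (real embedding models emit hundreds or thousands of dimensions), not output from any Google model.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means "pointing the same way"
    # (similar meaning); close to 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings with made-up values, for illustration only.
text_query   = [0.9, 0.1, 0.0, 0.4]  # e.g. the phrase "a dog on a beach"
photo_of_dog = [0.8, 0.2, 0.1, 0.5]  # hypothetical image embedding
spreadsheet  = [0.1, 0.9, 0.8, 0.0]  # hypothetical document embedding

print(cosine_similarity(text_query, photo_of_dog))  # high: likely a match
print(cosine_similarity(text_query, spreadsheet))   # low: likely unrelated
```

A search system built on embeddings simply ranks stored items by this score against the query's vector.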
Previous embedding models typically handled only one type of content at a time, meaning developers needed separate systems for text, images and audio, and had no straightforward way to search across them together.
Gemini Embedding 2 places all of these formats into the same numerical space, so a query in one format can surface results in another, across more than 100 languages.
In practical terms, a developer could build a search tool that accepts a spoken question and returns relevant video clips, images and documents in a single query, without first converting the audio into text.
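That kind of cross-format retrieval reduces to one mechanism: every item, whatever its original format, is stored as a vector in the same space, and a query vector is ranked against all of them at once. The sketch below shows that ranking step over a toy mixed-media index; the labels and vector values are hypothetical stand-ins, not real model output or the actual Gemini API.

```python
import math

def cosine(a, b):
    # Same similarity measure used throughout: angle between vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# A toy mixed-media index: videos, images and PDFs live in one space.
# All vectors are invented placeholders for illustration.
index = [
    ("video: cooking demo",    [0.8, 0.1, 0.2]),
    ("image: mountain photo",  [0.1, 0.9, 0.1]),
    ("pdf: recipe booklet",    [0.7, 0.2, 0.3]),
]

def search(query_vec, items, top_k=2):
    # Rank every stored item against the query, best matches first.
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [label for label, _ in ranked[:top_k]]

# Hypothetical embedding of a spoken question about cooking.
audio_query = [0.9, 0.15, 0.25]
print(search(audio_query, index))
```

Because the audio query and the stored items share one space, the cooking-related video and PDF outrank the unrelated image with no transcription step in between.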
The model accepts text passages of up to roughly 8,000 words, up to six images per request, up to two minutes of video, audio files, which are processed directly without transcription, and PDF documents of up to six pages.
Google said the model uses a technique called Matryoshka Representation Learning, named after the Russian nesting dolls, which allows the numerical outputs to be scaled down in size to save storage without significantly degrading quality.
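In practice, Matryoshka-style training orders the output so that the first values carry the most meaning, letting developers keep just a prefix of each vector. The sketch below shows that truncate-and-renormalize step on a short hypothetical vector; the dimension counts and values are illustrative assumptions, not Gemini Embedding 2's actual output sizes.

```python
import math

def truncate_embedding(vec, dims):
    # With Matryoshka-style ordering, the first `dims` values already
    # carry most of the signal, so the tail can be discarded.
    # Re-normalizing keeps cosine comparisons well behaved afterward.
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

# Hypothetical 8-dimensional embedding standing in for a real one
# (production models typically emit hundreds to thousands of values).
full = [0.12, -0.40, 0.33, 0.05, -0.02, 0.01, 0.00, 0.01]

small = truncate_embedding(full, 4)  # half the storage cost
print(len(small))  # 4
```

The payoff is a direct storage and speed trade-off: halving the dimensions halves the index size while, per Google's claim, losing little retrieval quality.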
The company said Gemini Embedding 2 outperforms leading competing models on text, image and video tasks and is available now in public preview through its Gemini API and its Vertex AI developer platform.
The recap
- Gemini Embedding 2 enters public preview for developers.
- Maps text, image, video, audio and documents into one space.
- Available via Gemini API and Vertex AI, with integrations supported.