Google DeepMind releases AI model that understands text, images, video and audio together

Gemini Embedding 2 is the company's first model to treat different types of media as part of a single, searchable system

by Defused News Writer
Photo by Sascha Bosshard / Unsplash

Google DeepMind has released Gemini Embedding 2, its first AI model capable of understanding and comparing text, images, video, audio and documents within a single unified system.

To understand why that matters, it helps to know what an embedding model does. It converts content of different types into lists of numbers, known as vectors, that a computer can compare, with items of similar meaning receiving similar vectors. That allows a search system to find images that match a written description, or retrieve a document based on a spoken question.
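For readers who want to see the idea in code, the comparison step can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation: the three-number vectors below are made up by hand rather than produced by Gemini Embedding 2, the file names are hypothetical, and cosine similarity is assumed here as the comparison measure because it is a common choice for embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors standing in for embeddings of different media types.
# Real embeddings have hundreds or thousands of dimensions.
index = {
    "photo_of_beach.jpg": [0.9, 0.1, 0.0],
    "quarterly_report.pdf": [0.0, 0.2, 0.9],
    "surfing_clip.mp4": [0.8, 0.3, 0.1],
}


def search(query_vector, index, top_k=2):
    """Rank every indexed item by similarity to the query, best first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Stand-in for the embedding of a written query such as "waves at the seaside".
query = [1.0, 0.2, 0.0]
results = search(query, index)

for name, score in results:
    print(f"{name}: {score:.3f}")
```

With these toy numbers the photo and the video clip rank above the PDF, because their vectors point in nearly the same direction as the query. The file type plays no part in the ranking, only the vectors do.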

Previous embedding models typically handled only one type of content at a time, meaning developers needed separate systems for text, images and audio, and had no straightforward way to search across them together.

Gemini Embedding 2 places all of these formats into the same numerical space, so a query in one format can surface results in another, across more than 100 languages.

In practical terms, a developer could build a search tool that accepts a spoken question and returns relevant video clips, images and documents in a single query, without first converting the audio into text.

The model accepts text passages of up to roughly 8,000 words, up to six images per request, up to two minutes of video, audio files processed directly without transcription, and PDF documents of up to six pages.

Google said the model uses a technique called Matryoshka Representation Learning, named after Russian nesting dolls, which allows the numerical outputs to be shortened to save storage without significantly degrading quality.
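In general, embeddings trained this way are shortened by keeping only the leading values of each vector and rescaling what remains to unit length. The sketch below shows that step under stated assumptions: the eight-number vector is a hand-made toy rather than real model output, and `truncate_embedding` is a hypothetical helper written for this illustration, not part of the Gemini API.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale them to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and the vector length")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalise an all-zero vector")
    return [x / norm for x in head]


# Toy 8-dimensional vector; the leading values carry most of the magnitude,
# which is the property Matryoshka-style training aims for.
full = [0.5, 0.5, 0.5, 0.5, 0.1, -0.1, 0.05, 0.0]

small = truncate_embedding(full, 4)
print(len(full), "->", len(small))
```

Halving the vector length halves the storage needed per item. The trade-off only holds up for models trained with this technique. Cutting the tail off an ordinary embedding can discard information the model spread evenly across all positions.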

The company said Gemini Embedding 2 outperforms leading competing models on text, image and video tasks and is available now in public preview through its Gemini API and its Vertex AI developer platform.

The recap

  • Gemini Embedding 2 enters public preview for developers.
  • Maps text, image, video, audio and documents into one space.
  • Available via Gemini API and Vertex AI, with integrations supported.