Google Gemini Omni: A new model that converts images, audio, and text into video

04:56 / 20.05.2026 · Technology

Three years ago, when Google launched the Gemini project, its main goal was to create a single multimodal neural network trained on text, image, audio, and video data. Today, at the Google I/O conference, CEO Sundar Pichai presented an important step toward that goal: the Gemini Omni model. According to him, the new model can generate any content from any input data. This was reported by Techcrunch.com.

Gemini Omni allows users to combine images, audio, video, and text in a single request. Rather than simply aggregating these inputs, Omni analyzes all of them to produce high-quality videos grounded in the laws of physics, culture, history, and scientific concepts. Users will also be able to edit images with simple text prompts, without complex software.

Google DeepMind representative Nicole Brichtova called this innovation the next stage in combining Gemini's intelligence with the visualization capabilities of media models. For example, when asked to prepare a video tutorial on protein folding, the model not only creates the animation but also adds a voiceover explaining the process.

With the new model, users can also create their own digital avatars. To reduce the risk of deepfakes, Google has introduced a special security check: the user must record a video of themselves saying specific numbers aloud. Only then is the avatar saved and authorized for future use.

All videos created with Gemini Omni carry Google's SynthID digital watermark, which allows users to verify that a video was created by artificial intelligence. As Sundar Pichai noted, artificial intelligence is moving from simply predicting text to simulating reality.