Google Unveils New Gemma 4 12B Multimodal AI Model

00:53 / 05.06.2026 · Technology

Google has unveiled Gemma 4 12B, a new model designed to run locally on laptops and other devices with limited computing resources, Ixbt.com reports. The multimodal artificial intelligence system sits between the compact E4B and the large 26-billion-parameter MoE (mixture-of-experts) model. Its key feature is that it is the first mid-sized system in its class with native audio support.

According to the developers, Gemma 4 12B processes images and audio without the separate encoders that multimodal models traditionally rely on. Instead, multimodal inputs are integrated directly into the core language model. For images, a lightweight module based on matrix transformations replaces the separate vision encoder, which significantly reduces computational cost.
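
To make the idea of a projection module "based on matrix transformations" concrete, here is a minimal sketch of how image patches can be mapped into a language model's embedding space with a single matrix multiplication. It is an illustration only, not Google's implementation: the patch size, embedding width, random weights, and the function names `split_into_patches` and `project` are all hypothetical.

```python
import random

def split_into_patches(image, patch):
    """Cut a 2-D grayscale image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(flat)
    return patches

def project(vectors, weights):
    """Multiply each input vector by an (in_dim x out_dim) weight matrix."""
    out_dim = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]

random.seed(0)
PATCH, EMBED_DIM = 4, 8  # hypothetical sizes, chosen only to keep the demo small

# Toy 8x8 "image" and a randomly initialised projection matrix (assumption:
# a real model would learn these weights during training).
image = [[random.random() for _ in range(8)] for _ in range(8)]
weights = [[random.gauss(0, 0.02) for _ in range(EMBED_DIM)]
           for _ in range(PATCH * PATCH)]

patches = split_into_patches(image, PATCH)
image_tokens = project(patches, weights)  # one embedding vector per patch
print(len(image_tokens), "image tokens of width", len(image_tokens[0]))
```

Each patch becomes one vector of the same width as a text token embedding, so the language model can consume it alongside ordinary text tokens without a separate vision network in front.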

Raw audio is projected directly into the text token space, with no encoder in between. Despite the simplified architecture, Gemma 4 12B performs close to large 26-billion-parameter models on standard benchmarks. It is also less demanding on memory and runs smoothly on devices with 16 GB of VRAM.
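
The 16 GB figure can be put in context with some back-of-the-envelope arithmetic. The sketch below assumes the model has exactly 12 billion parameters (inferred from the "12B" in its name) and counts only the weights, ignoring activations and the attention cache. The source does not say at which numeric precision the model fits into 16 GB, so the bit widths shown are illustrative.

```python
PARAMS = 12e9  # assumption: "12B" taken to mean 12 billion parameters

def weight_memory_gb(params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(PARAMS, bits):.0f} GB")
# 16-bit weights: 24 GB
# 8-bit weights: 12 GB
# 4-bit weights: 6 GB
```

Under these assumptions, 16-bit weights alone would exceed 16 GB, while 8-bit or 4-bit weights would leave headroom, which suggests the stated requirement may refer to a reduced-precision version of the model.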

The model supports Multi-Token Prediction (MTP), a mechanism that reduces text generation latency, and is designed for complex agentic scenarios. According to Google, models in the Gemma family have been downloaded more than 150 million times to date. The new model is distributed under the Apache 2.0 license, and users can run it directly on their own devices without relying on cloud services.