
Google just changed the rules for what a small AI model can do. Here’s everything developers need to know about Gemma 4.
There’s a quiet revolution happening in AI right now, and it doesn’t live in a data centre. It runs on your laptop, your phone, even a Raspberry Pi. Google DeepMind’s Gemma 4 is the company’s most intelligent open model family to date — and with an Apache 2.0 license, it’s yours to run, fine-tune, and ship into production without legal headaches or cloud dependency. If you’ve been waiting for open-source AI to genuinely compete with proprietary frontier models, that moment may have arrived.
This post breaks down what Gemma 4 actually is, how it works technically, how real developers are already using it, and — crucially — whether it deserves a place in your AI stack.
What Is Gemma 4? The Quick Overview
Google DeepMind released four new vision-capable, Apache 2.0 licensed reasoning LLMs sized at 2B, 4B, and 31B, plus a 26B-A4B Mixture-of-Experts.
The family is called Gemma 4, and it represents the fourth generation of Google’s open model series. Since the launch of the first generation, developers have downloaded Gemma over 400 million times, building a vibrant ecosystem of more than 100,000 model variants. That community momentum is what makes this launch significant — it’s not a research preview, it’s a production-grade release backed by a massive, active developer base.
The headline claim from Google is striking: an “unprecedented level of intelligence-per-parameter.” This isn’t marketing fluff. The 31B model currently ranks as the #3 open model in the world on the industry-standard Arena AI text leaderboard, while the 26B model holds the #6 spot — outcompeting models 20x its size.
The Four Gemma 4 Models: Sizes, Names, and What They’re For
One of the first things to understand about Gemma 4 is that the naming is a bit unusual. The two smaller models are labelled E2B and E4B — where “E” stands for “Effective.”
The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.
In plain English: the models are heavier on disk than their parameter count suggests, but they’re actually faster and cheaper to run because they activate far fewer parameters during inference. It’s a smart architectural trade-off designed specifically for edge hardware.
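To make that trade-off concrete, here’s a toy calculation — using entirely hypothetical dimensions, not Gemma 4’s published architecture — showing how per-layer embedding tables can inflate on-disk size while contributing almost nothing to per-token compute:

```python
# Toy illustration of Per-Layer Embeddings (PLE). All dimensions are
# hypothetical stand-ins, NOT Gemma 4's real architecture.

VOCAB = 32_000   # hypothetical vocabulary size
D_MODEL = 2048   # hypothetical hidden width
PLE_DIM = 256    # hypothetical per-layer embedding width
N_LAYERS = 24    # hypothetical decoder depth

# Core transformer weights (rough stand-in: ~12 * d_model^2 per layer).
core_params = N_LAYERS * 12 * D_MODEL ** 2

# One small embedding table per decoder layer: heavy on disk...
ple_table_params = N_LAYERS * VOCAB * PLE_DIM

# ...but inference only does a row lookup per layer, touching
# PLE_DIM values per token rather than the whole table.
total = core_params + ple_table_params
active_per_token = core_params + N_LAYERS * PLE_DIM

print(f"on-disk parameters: {total / 1e9:.2f}B")
print(f"active per token:   {active_per_token / 1e9:.2f}B")
```

Even in this toy setup, the lookup tables add hundreds of millions of parameters on disk while the per-token working set barely moves — which is the intuition behind the “E for Effective” naming.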
Here’s a breakdown of all four models:
| Model | Type | Best For | Context Window |
|---|---|---|---|
| E2B | Dense, edge-optimized | Mobile, IoT, on-device | 128K tokens |
| E4B | Dense, edge-optimized | Laptops, mobile | 128K tokens |
| 26B-A4B | Mixture-of-Experts (MoE) | Developer workstations | 256K tokens |
| 31B | Dense | Fine-tuning, research | 256K tokens |
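One caveat before leaning on those context windows locally: the KV cache grows linearly with sequence length, so 128K–256K tokens carry a real memory cost. A back-of-the-envelope sketch, using made-up dimensions rather than Gemma 4’s actual configuration:

```python
# Back-of-the-envelope KV-cache memory at long context. All dimensions
# here are hypothetical stand-ins, not Gemma 4's published architecture.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):  # 2 bytes per value for fp16/bf16
    # Keys and values are both cached: hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

gib = 1024 ** 3
for ctx in (128_000, 256_000):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / gib:.1f} GiB")
```

The exact numbers depend on the real head counts and any cache-compression tricks the models use, but the linear scaling is the point: a full 256K-token context can demand more memory than the weights themselves.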
The 26B Mixture of Experts model activates only 3.8 billion of its total parameters during inference to deliver exceptionally fast tokens-per-second, while the 31B Dense model maximizes raw quality and provides a powerful foundation for fine-tuning.
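The mechanism behind “activates only 3.8 billion parameters” is standard top-k expert routing: a small router scores every expert per token, and only the highest-scoring few actually run. A minimal, generic sketch of the technique (not Gemma 4’s actual router):

```python
# Minimal sketch of Mixture-of-Experts top-k routing -- the generic
# technique, not Gemma 4's real router implementation.
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, k=2):
    """Return (index, weight) pairs for the top-k experts, with the
    selected weights renormalized to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weight_sum = sum(probs[i] for i in top)
    return [(i, probs[i] / weight_sum) for i in top]

# 8 hypothetical experts; only 2 run for this token -- the rest of the
# expert parameters stay untouched, which is where the speed comes from.
chosen = route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(chosen)
```

The model’s output is then the weighted sum of just the chosen experts’ outputs, so per-token compute scales with k, not with the total expert count.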
Multimodal by Default: Vision, Audio, and More
This is where Gemma 4 makes a real leap. Previous open models typically handled text and images. Gemma 4 goes further.
All models natively process video and images, support variable resolutions, and excel at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.
That means the smallest models in the family — the ones designed to run on your phone — can listen, see, and reason. That’s a significant capability jump for edge AI.
Simon Willison, one of the most respected independent AI researchers, noted in his hands-on testing that the audio input capability isn’t yet supported by mainstream local inference tools: he couldn’t find a way to run audio input locally, as the feature isn’t in LM Studio or Ollama yet. So while the capability exists in the model weights, tooling support is still catching up. Keep an eye on the Ollama and LM Studio update logs if local audio inference is important to your workflow.
What Multimodal Means for Developers
For Indian developers building products in sectors like edtech, healthtech, or vernacular content, this matters enormously. A model that can understand images, audio, and text natively — running entirely on-device — opens up use cases that were previously only viable with expensive cloud API calls:
- Document digitization from scanned images (OCR) without cloud latency
- Voice assistants that work offline
- Chart and data extraction from visual reports
- Multilingual audio transcription on-device
Real-World Performance: What Developers Are Seeing
The best test of any model isn’t benchmarks — it’s what it actually produces. Willison ran his go-to visual benchmark across the model sizes using GGUF files in LM Studio:
- The 2B model (4.41 GB on disk) produced rough, barely recognizable outputs — expected for its size.
- The 4B model (6.33 GB) showed clear improvement, though still limited visually.
- The 26B-A4B produced what was probably the best visual output Willison had seen from a model that runs on a laptop — a genuinely impressive result for a locally run model.
- The 31B model (19.89 GB) was broken in LM Studio at release, producing only repetitive output loops for every prompt.
This last point is worth flagging: early releases sometimes ship with rough edges, and the 31B was clearly one of them at launch on local inference. However, Google provides API access to the 31B via AI Studio, and it performed well there.
Running Gemma 4 via API
For the 31B model specifically, Willison accessed it through Google’s AI Studio and the llm-gemini command-line tool. The prompt was as simple as:
llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle'
The output was solid. This is important: even if your local hardware can’t run the 31B, you can still access it via API during development and only deploy locally with the smaller models.
Why the “Intelligence-Per-Parameter” Story Matters for India
The progression of capability from 2B to 4B to 26B-A4B provides more evidence that creating small, useful models is one of the hottest areas of research right now.
For the Indian developer and startup ecosystem, this trend is particularly significant. Most AI adoption in India faces a dual constraint: cloud API costs in dollars, and infrastructure reliability. A model family that runs well on consumer hardware — with a permissive license — changes the economics of AI product development dramatically.
Consider what’s now possible with Gemma 4:
- A vernacular language chatbot running on a mid-range Android device, completely offline
- A document processing tool for legal or medical use cases that never sends sensitive data to the cloud
- An AI coding assistant running locally in VS Code, with no subscription cost
- Fine-tuned domain models for agriculture, finance, or healthcare — built on top of Gemma 4 weights and deployed on-premise
INSAIT used the Gemma ecosystem to create a pioneering Bulgarian-first language model (BgGPT), and Google worked with Yale University on cancer therapy discovery — the precedent for high-impact fine-tuned applications is already there.
The Apache 2.0 License: Why It’s a Big Deal
A lot of “open” AI models aren’t actually open for commercial use. Meta’s Llama, for example, has usage restrictions tied to user count thresholds. Gemma 4 ships under Apache 2.0 — one of the most permissive open-source licenses in existence.
What Apache 2.0 means in practice:
- Commercial use is unrestricted
- Modification and redistribution are allowed
- No royalties or fees to Google
- No forced disclosure of your own source code
- Patent protection for users of the licensed code
This license provides a foundation for complete developer flexibility and digital sovereignty, granting you complete control over your data, infrastructure, and models.
For Indian startups building on AI — especially those handling sensitive user data under DPDP (Digital Personal Data Protection) Act constraints — this matters. You can run Gemma 4 on your own servers, in your own data centre, with no data leaving your infrastructure. That’s a compliance-friendly baseline that proprietary API-based models simply cannot match.
How to Get Started With Gemma 4 Right Now
The model family has excellent day-one tooling support. Here’s the fastest path to running Gemma 4 depending on your context:
For Local Inference
LM Studio, Ollama, llama.cpp, and MLX all have day-one support for Gemma 4. The simplest route for most developers:
- Install Ollama
- Run `ollama pull gemma4` (or specify the size variant)
- Start querying via `ollama run gemma4`
For LM Studio users, the GGUFs are available directly in the model browser — just search “gemma-4.”
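If you’d rather call the model from code than from the terminal, Ollama exposes a local HTTP API. A minimal sketch, assuming Ollama is running on its default port and you’ve pulled a model under the `gemma4` tag as above:

```python
# Sketch: querying a local Ollama server from Python via its HTTP API.
# Assumes Ollama is running on the default port (11434) with a "gemma4"
# model tag pulled -- adjust both for your setup.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object back instead of a token stream
    }).encode("utf-8")

def generate(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama server):
#   print(generate("gemma4", "Summarise the Apache 2.0 license in one line."))
```

Setting `"stream": False` returns the whole completion in a single JSON response, which keeps scripts simple; drop it if you want token-by-token streaming.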
For API Access (No Local Hardware Required)
The 31B and 26B MoE models are accessible instantly via Google AI Studio. This is the best option if you want to evaluate the top-tier models without committing to local hardware.
For Fine-Tuning
You can train and adapt Gemma 4 using Google Colab, Vertex AI, or even a gaming GPU. The Unsloth library has Gemma 4 support, which significantly reduces VRAM requirements for fine-tuning.
For Android / Mobile Deployment
Android developers can prototype agentic flows using the AICore Developer Preview today, with forward-compatibility with Gemini Nano 4 planned.
Gemma 4 vs. The Competition: Where It Sits
| Model | License | Local? | Multimodal | Context |
|---|---|---|---|---|
| Gemma 4 E4B | Apache 2.0 | ✅ Yes | ✅ Vision + Audio | 128K |
| Gemma 4 31B | Apache 2.0 | ✅ Yes | ✅ Vision | 256K |
| Llama 3.3 70B | Llama License | ✅ Yes | ❌ Text only | 128K |
| Mistral Small | Apache 2.0 | ✅ Yes | ❌ Limited | 32K |
| GPT-4o Mini | Proprietary | ❌ No | ✅ Vision | 128K |
Gemma 4’s real differentiator is the combination of Apache 2.0 licensing, native multimodal support (including audio at the edge), and genuinely competitive benchmark performance — all in a package that runs locally. No other open model family currently offers all three.
What Gemma 4 Signals About the State of AI in 2026
The broader narrative here is worth stepping back to appreciate. We’re in a moment where the capability gap between closed frontier models and open models is closing fast — and the economics of who can build AI products is shifting as a result.
Gemma 4 isn’t just a new model release. It’s evidence of a structural trend: intelligence-per-parameter improvements are compounding. Each generation of small models does what only large models could do before. The 26B MoE model in Gemma 4 outperforms models 20x its size. In 2–3 years, a 4B model might do what the 26B does today.
For developers, this means a few things worth internalising:
- Invest in fine-tuning skills now. The ability to take a strong base model and adapt it to your domain is becoming the key differentiator. Gemma 4 is an excellent base for this.
- Think on-device first. Cloud APIs will remain useful, but on-device inference is becoming viable for a wider class of problems. Design your AI features with local fallback in mind.
- Treat Apache 2.0 as a feature. When evaluating AI models for product use, license type should be on your evaluation checklist alongside benchmark scores. Gemma 4’s open license is a genuine product advantage.
- Multimodal is the new baseline. Any model that can’t process images is already behind. Audio comprehension at the edge is the next capability to watch.
Limitations and Things to Watch
No model launch is perfect. A few honest caveats:
- The 31B model had issues in LM Studio at launch — producing looping outputs. If you’re testing locally, stick to the 26B MoE or smaller until this is patched.
- Audio input support in local tools is not yet available. Ollama and LM Studio don’t yet expose the audio input capability of the E2B/E4B models. This is a tooling gap, not a model gap — but it limits what you can actually do locally today.
- Fine-tuning at scale still requires decent hardware. While Unsloth makes it much easier, training even the 4B model meaningfully needs at least 16GB VRAM.
Final Verdict: Should You Use Gemma 4?
Yes — with appropriate caveats based on your use case.
If you’re building production AI features and care about cost control, data privacy, or offline capability, Gemma 4 belongs in your evaluation stack immediately. The 26B MoE model is the standout: it runs on a powerful consumer GPU, delivers results comparable to much larger models, and is available under a license that gives you full commercial freedom.
If you’re just experimenting or building demos, start with the 4B via Ollama — it’s fast, free, and surprisingly capable. If you need the best possible quality and can tolerate API latency, use the 31B via Google AI Studio while local support matures.
The race to build useful small models is officially the hottest front in AI research right now, and Gemma 4 is currently leading it.
Frequently Asked Questions (FAQ)
What exactly is Gemma 4, and how does it differ from Gemini?
Gemma 4 is a family of “open-weights” models. While developed by Google DeepMind using the same technology and infrastructure as the proprietary Gemini models, there is a fundamental difference in access. Gemini is a closed system available primarily via API or Google’s own interfaces. Gemma 4, however, allows you to download the model weights and run them on your own infrastructure—be it a local laptop, a private server, or a mobile device. This gives you total control over your data and eliminates per-token API costs.
Which Gemma 4 model size should I choose for my project?
Choosing the right model depends on your hardware and performance requirements:
- E2B & E4B: These are optimized for “the edge.” If you are building a mobile app or an IoT integration where the model must live on the device, these are your best bets. They utilize Per-Layer Embeddings (PLE) to remain highly efficient despite their small footprint.
- 26B-A4B (MoE): This is the “sweet spot” for developers with high-end consumer GPUs (like an RTX 4090 or Mac M2/M3 Max). Because it is a Mixture-of-Experts model, it provides the intelligence of a much larger model while only “activating” 3.8 billion parameters during inference, ensuring high speed.
- 31B: This is the flagship dense model. It is designed for those who need the highest possible reasoning capabilities and plan to perform intensive fine-tuning for specialized domains like legal, medical, or scientific research.
Is Gemma 4 truly free for commercial use?
Yes. Gemma 4 is released under the Apache 2.0 license. Unlike other “open” models that have restrictive clauses (such as Meta’s Llama license which has user-count thresholds), Apache 2.0 is one of the most permissive licenses in software. You can use it in commercial products, modify the code, and redistribute it without paying royalties or disclosing your proprietary source code to Google.
Can Gemma 4 really process audio natively?
Yes, but with a caveat. The E2B and E4B models feature native audio comprehension. This means they don’t just “read” a transcript of what was said; they understand the nuances of the audio signal itself. However, as of early 2026, many local inference tools like Ollama or LM Studio are still updating their backends to support this specific multimodal input. For now, you may need to use Google AI Studio to test these audio features until the local tooling ecosystem catches up.
How do I run Gemma 4 locally on my laptop?
The easiest way to get started is using Ollama:
- Download and install Ollama for your operating system.
- Open your terminal and type `ollama pull gemma4:4b` (or your preferred size).
- Once downloaded, run `ollama run gemma4:4b` to start chatting immediately.

For a GUI-based experience, LM Studio allows you to search for Gemma 4 GGUF files and load them with a single click, provided your machine has enough RAM/VRAM.
What is “Intelligence-per-Parameter” and why should I care?
“Intelligence-per-parameter” refers to how much “smartness” a model packs into its size. In the past, to get better reasoning, you simply needed a bigger model. Gemma 4 breaks this trend. For example, the 26B MoE model rivals the performance of models with 100B+ parameters. This is crucial because smaller, smarter models are faster to run, cheaper to host, and can fit on devices that previously couldn’t handle advanced AI.
How does Gemma 4 handle privacy and data security?
Because Gemma 4 can run entirely offline, it is inherently more private than cloud-based AI. When you run Gemma 4 on your local machine, your prompts, documents, and data never leave your hardware. This makes it an ideal solution for industries like healthcare, finance, or law, where uploading sensitive data to a third-party cloud is often a compliance violation under acts like the DPDP in India or GDPR in Europe.
Can I fine-tune Gemma 4 on my own data?
Absolutely. Gemma 4 is highly “tunable.” Using libraries like Unsloth, you can perform Parameter-Efficient Fine-Tuning (PEFT) or QLoRA on a single consumer-grade GPU. This allows you to teach the model specific jargon, brand voice, or specialized knowledge unique to your business.
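To see why a single consumer GPU can be enough, it helps to count what LoRA actually trains. The arithmetic below uses illustrative shapes (not Gemma 4’s real layer dimensions), but the conclusion generalizes: the adapter is a tiny fraction of the frozen base model.

```python
# Why LoRA/QLoRA fits on one consumer GPU: the trainable adapter is tiny
# compared with the frozen base model. All shapes here are illustrative.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # A frozen weight W (d_out x d_in) gains two small trainable
    # matrices: A (rank x d_in) and B (d_out x rank).
    return rank * d_in + d_out * rank

# Hypothetical 4B-class model: 36 layers, hidden width 3072,
# adapters on 4 attention projections per layer, rank 16.
per_layer = 4 * lora_params(3072, 3072, 16)
total_trainable = 36 * per_layer

print(f"trainable adapter parameters: {total_trainable / 1e6:.1f}M")
```

A few tens of millions of trainable parameters (versus billions frozen) is why optimizer state and gradients fit comfortably alongside a 4-bit quantized base model in QLoRA setups.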
What are the known limitations of the 31B model?
At launch, some developers noted that the 31B Dense model exhibited “looping” behavior (repeating the same text) when run through certain local inference engines like llama.cpp. This is typically a formatting or “stop token” issue in the early software implementations. If you encounter this, ensure your tools are updated to the latest version or try accessing the model via the Google AI Studio API to verify its baseline performance.