Gemma 4 audio with MLX

Simon Willison shares a practical recipe for running Gemma 4 E2B (10.28 GB) on macOS to transcribe audio files using MLX and mlx-vlm, with a ready-to-use uv command demonstrating local inference.
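The recipe reduces to a single uv invocation. A minimal sketch of that pattern, with the caveat that the checkpoint id and the mlx-vlm flag names here are assumptions (the original post has the exact command; mlx-vlm's CLI flags have changed between releases, so verify against its README):

```shell
# uv provisions mlx-vlm in a throwaway environment, then the
# mlx_vlm.generate module runs inference and prints the transcription.
# The model id and flag spellings below are illustrative assumptions --
# check the mlx-vlm README for your installed version.
uv run --with mlx-vlm \
  python -m mlx_vlm.generate \
  --model mlx-community/gemma-4-e2b-it \
  --max-tokens 2048 \
  --prompt "Transcribe this audio clip." \
  --audio recording.mp3
```

The `--with mlx-vlm` flag is what keeps this a one-liner: uv resolves and installs the dependency into an ephemeral environment, so there is no project setup or virtualenv to manage before the first run.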
Modelwire context
Explainer
The more significant detail here is that Gemma 4 E2B is a multimodal model handling audio transcription locally, not just text, which means the 10 GB weight file is doing work that would typically require a cloud API call. The uv-based one-liner also signals how much local inference tooling has matured: setup friction that once took an afternoon now fits in a terminal command.
Google has been pushing multimodal capability across its model families in several directions at once. Around the same time this recipe appeared, Google DeepMind shipped Gemini Robotics-ER 1.6 (covered here April 13) for spatial reasoning, and separately released Gemini 3.1 Flash TTS for expressive speech synthesis. Those are cloud-served, tightly controlled releases. Willison's write-up sits at the opposite end of the deployment spectrum: the same underlying capability direction, but running entirely on a consumer laptop via MLX. That contrast is worth holding onto as a frame for how Google's multimodal investments actually reach developers.
Watch whether mlx-vlm adds support for larger Gemma 4 variants in the next few weeks. If the 27B or higher checkpoints become runnable on M-series hardware with acceptable speed, local audio and vision inference stops being a hobbyist curiosity and starts competing with API pricing for batch workloads.
Coverage we drew on
- Gemini 3.1 Flash TTS: the next generation of expressive AI speech · Google DeepMind
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.
Mentions: Google · Gemma 4 E2B · MLX · mlx-vlm · Simon Willison · Rahim Nathwani
Modelwire summarizes — we don’t republish. The full article lives on simonwillison.net. If you’re a publisher and want a different summarization policy for your work, see our takedown page.