Gemma 4 audio with MLX

Simon Willison shares a practical recipe for running Gemma 4 E2B (10.28 GB) on macOS to transcribe audio files using MLX and mlx-vlm, with a ready-to-use uv command demonstrating local inference.
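The recipe reduces to a single uv invocation. A minimal sketch of that pattern, with the caveat that the checkpoint id and the mlx-vlm flag names here are assumptions (the original post has the exact command; mlx-vlm's CLI flags have changed between releases, so verify against its README):

```shell
# uv provisions mlx-vlm in a throwaway environment, then the
# mlx_vlm.generate module runs inference and prints the transcription.
# The model id and flag spellings below are illustrative assumptions --
# check the mlx-vlm README for your installed version.
uv run --with mlx-vlm \
  python -m mlx_vlm.generate \
  --model mlx-community/gemma-4-e2b-it \
  --max-tokens 2048 \
  --prompt "Transcribe this audio clip." \
  --audio recording.mp3
```

The `--with mlx-vlm` flag is what keeps this a one-liner: uv resolves and installs the dependency into an ephemeral environment, so there is no project setup or virtualenv to manage before the first run.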
Modelwire context
Explainer
The more significant detail here is that Gemma 4 E2B is a multimodal model handling audio transcription locally, not just text, which means the 10 GB weight file is doing work that would typically require a cloud API call. The uv-based one-liner also signals how much local inference tooling has matured: setup friction that once took an afternoon now fits in a terminal command.
Google has been pushing multimodal capability across its model families in several directions at once. Around the same time this recipe appeared, Google DeepMind shipped Gemini Robotics-ER 1.6 (covered here April 13) for spatial reasoning, and separately released Gemini 3.1 Flash TTS for expressive speech synthesis. Those are cloud-served, tightly controlled releases. Willison's write-up sits at the opposite end of the deployment spectrum: the same underlying capability direction, but running entirely on a consumer laptop via MLX. That contrast is worth holding onto as a frame for how Google's multimodal investments actually reach developers.
Watch whether mlx-vlm adds support for larger Gemma 4 variants in the next few weeks. If the 27B or higher checkpoints become runnable on M-series hardware with acceptable speed, local audio and vision inference stops being a hobbyist curiosity and starts competing with API pricing for batch workloads.
Coverage we drew on
- Gemini 3.1 Flash TTS: the next generation of expressive AI speech · Google DeepMind
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.
Mentions: Google · Gemma 4 E2B · MLX · mlx-vlm · Simon Willison · Rahim Nathwani
Modelwire summarizes — we don’t republish. The full article lives on simonwillison.net. If you’re a publisher and want a different summarization policy for your work, see our takedown page.