Tools & Code · Opinion & Analysis · Simon Willison · May 20

How fast is 10 tokens per second really?

Mike Veerman's interactive token-speed simulator addresses a persistent friction point in LLM evaluation: the gap between advertised throughput metrics and user experience. By rendering real-time token generation across a 5-800 tokens/second range, the tool lets practitioners calibrate expectations against actual latency perception, surfacing why a model's raw speed claim often diverges from perceived responsiveness. This matters as inference speed becomes a primary competitive lever in the model market, and buyers increasingly need intuition for what throughput numbers mean in practice.
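To make the idea concrete, here is a minimal sketch of how a token-speed simulator could pace its output. It is not Veerman's implementation: the function name `paced`, the injectable `sleep` parameter, and the use of whitespace-separated words as stand-ins for real tokens are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, Iterator


def paced(
    tokens: Iterable[str],
    tokens_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield tokens one at a time at a roughly constant rate.

    Hypothetical helper: `sleep` is injectable so the pacing can be
    checked without actually waiting.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    interval = 1.0 / tokens_per_second  # seconds between consecutive tokens
    for token in tokens:
        sleep(interval)
        yield token


# Example use (assumption: words approximate tokens, which real
# tokenizers do not guarantee):
#
#   import sys
#   for word in paced("a short sample sentence".split(), 10):
#       sys.stdout.write(word + " ")
#       sys.stdout.flush()
```

Running the commented example at a few rates within the 5-800 range gives a rough feel for the same effect the simulator demonstrates, though a real model's output is burstier than this constant-interval sketch.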

Modelwire context

Explainer

The deeper issue the summary sidesteps is that tokens-per-second is a throughput measure, not a latency measure, and conflating the two is where most practitioner confusion originates. A model streaming at 50 tokens/second feels very different depending on whether you are reading prose, waiting for a code block to complete, or piping output into another system.
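The distinction can be sketched with a little arithmetic. The helpers below are illustrative only: the function names, the time-to-first-token parameter `ttft_seconds`, and the default reading speed and tokens-per-word ratio are assumptions, not figures from the source.

```python
def time_to_complete(
    n_tokens: int, tokens_per_second: float, ttft_seconds: float = 0.0
) -> float:
    """Seconds until the last token arrives.

    This is the number that matters when waiting for a whole code block
    or piping output into another system. `ttft_seconds` (time to first
    token) is the latency component that throughput alone does not capture.
    """
    return ttft_seconds + n_tokens / tokens_per_second


def reading_rate_tokens_per_second(
    words_per_minute: float = 250.0,  # assumed reading speed
    tokens_per_word: float = 1.3,     # assumed rule-of-thumb ratio
) -> float:
    """Approximate rate at which a person consumes streamed prose."""
    return words_per_minute / 60.0 * tokens_per_word


def keeps_up_with_reader(tokens_per_second: float, **reader_kwargs: float) -> bool:
    """True if the stream is at least as fast as the assumed reader."""
    return tokens_per_second >= reading_rate_tokens_per_second(**reader_kwargs)
```

Under these assumptions, a 50 tokens/second stream comfortably outpaces a reader of prose, yet a 500-token code block still takes `time_to_complete(500, 50, ttft_seconds=0.5)` seconds to finish, which is the wait a downstream system or an impatient developer actually experiences.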

This week's related Modelwire coverage of AI coding capabilities driving robotics deployment (the OpenClaw agent piece) covers a different subject, but it underscores a shared dynamic: as AI moves into production infrastructure, the gap between benchmark numbers and real operational behavior becomes a practical problem, not just an academic one. Developers building physical systems with LLM-generated code care about inference latency in a very concrete way. Veerman's tool is a small but useful contribution to closing that gap at the evaluation stage, before deployment decisions are made.

Watch whether inference providers such as Groq or Cerebras begin citing perceptual benchmarks alongside raw throughput figures in their marketing within the next two quarters. If they do, it signals that perceived responsiveness is becoming a commercial differentiator, not just a developer curiosity.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Mike Veerman · Simon Willison · Hacker News

Read the full story at Simon Willison (simonwillison.net)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.


Modelwire summarizes, we don’t republish. The full content lives on simonwillison.net. If you’re a publisher and want a different summarization policy for your work, see our takedown page.