Beyond Benchmarks
Direct from terabytes of coding agent logs: how the models actually perform.
About the data
This data is aggregated from real, anonymized AI coding sessions. Session transcripts are uploaded and analyzed by the Cadence app. Only models with sufficient recent activity are shown, and the site updates daily.
Output tokens per second
How fast each model generates output during active session time.
How is this measured?
This metric divides top-level output tokens by active session time for each session, then averages those session-level values by day. It focuses on model output during active work time and excludes sessions with no active-time measurement. For v1, data is aggregated at the model level, so the chart reads as model-level speed rather than tool-level speed. Differences in task mix, coding language, and session length can still influence the result, so this should be read as a field signal rather than a lab benchmark.
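As a rough sketch of the arithmetic, the per-session rate and daily average could be computed along these lines (output_tokens, active_seconds, and day are hypothetical field names, not the actual schema):

from collections import defaultdict
from statistics import mean

def tokens_per_second_by_day(sessions):
    per_day = defaultdict(list)
    for s in sessions:
        active = s.get("active_seconds")
        if not active:
            # sessions with no active-time measurement are excluded
            continue
        per_day[s["day"]].append(s["output_tokens"] / active)
    # average the session-level rates for each day
    return {day: mean(rates) for day, rates in per_day.items()}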
Frustration rate
Share of user messages where the developer expressed frustration.
How is this measured?
This metric counts user messages flagged as frustrated and divides that count by all user messages in the same 28-day model window. It captures explicit signs of irritation in developer messages and does not attempt to infer hidden sentiment. The signal can be affected by team communication style and task difficulty, so it is most useful as a comparative friction indicator across sufficiently active models.
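In sketch form, assuming each user message carries a hypothetical is_frustrated flag from the classifier, the rate is a simple share of the window:

def frustration_rate(user_messages):
    # user_messages: all user messages in the 28-day model window
    if not user_messages:
        return None
    frustrated = sum(1 for m in user_messages if m["is_frustrated"])
    return frustrated / len(user_messages)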
Interruptions per session
How often developers interrupt, reject, or roll back model work during a session.
How is this measured?
This metric counts recorded user interventions, including interrupted turns, rejected tool use, cancellations, and rollbacks, then divides by the number of sessions in the same 28-day model window. It is intended to capture how often developers had to stop or redirect the agent explicitly. Some tools expose interruption events more clearly than others, so compare sustained patterns rather than isolated differences.
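A minimal sketch, assuming each session lists its recorded intervention events (the event type names here are hypothetical):

INTERVENTION_TYPES = {"interrupted_turn", "rejected_tool_use", "cancellation", "rollback"}

def interruptions_per_session(sessions):
    # sessions: all sessions for a model in the 28-day window
    if not sessions:
        return None
    interventions = sum(
        1
        for s in sessions
        for e in s.get("events", [])
        if e["type"] in INTERVENTION_TYPES
    )
    return interventions / len(sessions)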
Context-hunting vs implementation ratio
Time spent finding context relative to time spent implementing.
How is this measured?
This metric uses LLM-classified phase percentages and session duration to estimate total context-hunting time and total implementation time in the same 28-day model window, then divides context time by implementation time. A value of 1.0 means equal estimated time spent finding context and implementing. Sessions without duration or implementation time are excluded from this calculation, so read small day-to-day differences cautiously and focus on persistent separation between series.
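A minimal sketch of the ratio, assuming hypothetical per-session fields duration_seconds, context_pct, and implementation_pct from the LLM phase classifier:

def context_to_implementation_ratio(sessions):
    context_time = 0.0
    impl_time = 0.0
    for s in sessions:
        duration = s.get("duration_seconds")
        if not duration or not s.get("implementation_pct"):
            # sessions without duration or implementation time are excluded
            continue
        context_time += duration * s["context_pct"]
        impl_time += duration * s["implementation_pct"]
    # 1.0 means equal estimated time finding context and implementing
    return context_time / impl_time if impl_time else None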
Meaningful outcome rate
Share of sessions that produced a meaningful outcome.
How is this measured?
This metric counts sessions classified as producing a meaningful outcome and divides by all classified sessions for the same model and day. It is intended to separate sessions that actually moved work forward from sessions that stalled, wandered, or ended without a useful result. It does not rank the size or business value of the outcome, and task difficulty can vary across teams and tools.
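As a sketch, assuming hypothetical session fields meaningful_outcome and day set by the classifier:

from collections import defaultdict

def outcome_rate_by_day(classified_sessions):
    totals = defaultdict(lambda: [0, 0])  # day -> [meaningful, classified]
    for s in classified_sessions:
        counts = totals[s["day"]]
        counts[1] += 1
        if s["meaningful_outcome"]:
            counts[0] += 1
    return {day: meaningful / total for day, (meaningful, total) in totals.items()}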
Failed tool call rate
Share of tool calls that failed.
How is this measured?
This metric counts failed tool calls and divides by all tool calls in the same 28-day model window. It includes tool failures visible in the recorded session stream and excludes sessions with no tool calls. A lower rate generally indicates smoother execution, but task mix and tool usage can vary by model, so compare sustained patterns rather than isolated spikes.
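A minimal sketch, assuming each session record lists its tool calls with a hypothetical failed flag:

def failed_tool_call_rate(sessions):
    # flatten tool calls across the 28-day model window
    calls = [c for s in sessions for c in s.get("tool_calls", [])]
    if not calls:
        # sessions with no tool calls contribute nothing
        return None
    return sum(1 for c in calls if c["failed"]) / len(calls)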
About Cadence
Cadence helps engineering organizations measure the friction in their AI-assisted development. We aggregate session telemetry from our customer base to surface where AI tools succeed and where they break down in real engineering work.