About the data

This data is aggregated from anonymized, real-world AI coding sessions. Session transcripts are uploaded and analyzed by the Cadence app. Only models with sufficient recent activity are shown; the site updates daily.

Output tokens per second

How fast each model generates output during active session time.

Grouped by model. Higher = better.
Model | Latest
516.0 tokens/s
236.1 tokens/s
49.2 tokens/s
32.6 tokens/s
28.5 tokens/s
24.7 tokens/s
21.3 tokens/s
21.1 tokens/s
18.8 tokens/s
15.5 tokens/s
15.0 tokens/s
5.7 tokens/s
0.4 tokens/s
How is this measured?

This metric divides top-level output tokens by active session time for each session, then averages those session-level values by day. It focuses on model output during work time and excludes sessions with no active-time measurement. For v1, data is aggregated at the model level, so the chart reads as model-level speed rather than tool-level speed. Differences in task mix, coding language, and session length can still influence the result, so this should be read as a field signal rather than a lab benchmark.
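The per-session-then-per-day averaging described above can be sketched as follows. This is a minimal illustration, not Cadence's implementation; the record fields (`day`, `output_tokens`, `active_seconds`) are assumed names.

```python
from collections import defaultdict

# Hypothetical session records; field names are assumptions, not Cadence's schema.
sessions = [
    {"day": "2024-06-01", "output_tokens": 1200, "active_seconds": 60},
    {"day": "2024-06-01", "output_tokens": 900,  "active_seconds": 30},
    {"day": "2024-06-02", "output_tokens": 500,  "active_seconds": 25},
    {"day": "2024-06-02", "output_tokens": 800,  "active_seconds": None},  # no active-time measurement
]

def tokens_per_second_by_day(sessions):
    """Compute tokens/sec per session, then average those values by day.

    Sessions without an active-time measurement are excluded, as the
    metric description specifies."""
    by_day = defaultdict(list)
    for s in sessions:
        if s["active_seconds"]:  # skips None and zero active time
            by_day[s["day"]].append(s["output_tokens"] / s["active_seconds"])
    return {day: sum(rates) / len(rates) for day, rates in by_day.items()}
```

Note that averaging session-level rates (rather than dividing total tokens by total time) gives every session equal weight, so one long session cannot dominate a day.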

Frustration rate

Share of user messages where the developer expressed frustration.

Grouped by model. Lower = better.
Model | Latest
0.0%
0.0%
0.0%
0.0%
0.0%
0.0%
0.0%
0.0%
0.0%
0.1%
0.1%
0.1%
0.6%
How is this measured?

This metric counts user messages flagged as frustrated and divides that by all user messages in the same 28-day model window. It captures explicit signs of irritation in developer messages and does not attempt to infer hidden sentiment. The signal can be affected by team communication style and task difficulty, so it is most useful as a comparative friction indicator across sufficiently active models.
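As a sketch of the ratio described above, assuming each message record carries a `role` and an upstream-assigned `frustrated` flag (both field names are assumptions):

```python
def frustration_rate(messages):
    """Share of user messages flagged as frustrated within one window.

    Only counts explicit flags on user messages; makes no attempt to
    infer hidden sentiment, per the metric description."""
    user_msgs = [m for m in messages if m["role"] == "user"]
    if not user_msgs:
        return 0.0
    flagged = sum(1 for m in user_msgs if m.get("frustrated"))
    return flagged / len(user_msgs)
```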

Interruptions per session

How often developers interrupt, reject, or roll back model work during a session.

Grouped by model. Lower = better.
Model | Latest
0.00 / session
0.00 / session
0.00 / session
0.03 / session
0.40 / session
0.45 / session
0.45 / session
0.53 / session
0.60 / session
1.40 / session
1.45 / session
2.36 / session
3.18 / session
How is this measured?

This metric counts recorded user interventions, including interrupted turns, rejected tool use, cancellations, and rollbacks, then divides by the number of sessions in the same 28-day model window. It is intended to capture how often developers had to stop or redirect the agent explicitly. Some tools expose interruption events more clearly than others, so compare sustained patterns rather than isolated differences.
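The event categories listed above can be tallied with a simple filter. The event-type strings here are illustrative assumptions, not Cadence's actual event names:

```python
# Assumed intervention event types, mirroring the categories named in the text.
INTERVENTION_EVENTS = {"interrupted_turn", "rejected_tool_use", "cancellation", "rollback"}

def interruptions_per_session(events, session_count):
    """Count recorded user interventions and divide by sessions in the window."""
    if session_count == 0:
        return 0.0
    interventions = sum(1 for e in events if e["type"] in INTERVENTION_EVENTS)
    return interventions / session_count
```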

Context-hunting vs implementation ratio

Time spent finding context relative to time spent implementing.

Grouped by model. Lower = better.
Model | Latest
0.31x
0.72x
0.92x
0.94x
0.97x
0.99x
1.02x
1.13x
1.17x
1.18x
1.19x
1.24x
1.40x
How is this measured?

This metric uses LLM-classified phase percentages and session duration to estimate total context-hunting time and total implementation time in the same 28-day model window, then divides context time by implementation time. A value of 1.0 means equal estimated time spent finding context and implementing. Sessions without duration or implementation time are excluded from this calculation, so read small day-to-day differences cautiously and focus on persistent separation between series.
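The estimate described above can be sketched as below. The session fields (`duration_seconds`, `context_hunting_pct`, `implementation_pct`) are assumed names for the LLM-classified phase percentages and duration the text refers to:

```python
def context_vs_implementation_ratio(sessions):
    """Estimated total context-hunting time divided by total implementation time.

    Sessions lacking a duration or implementation time are excluded,
    per the metric description. A result of 1.0 means equal estimated
    time in each phase."""
    context_time = 0.0
    impl_time = 0.0
    for s in sessions:
        duration = s.get("duration_seconds")
        impl_pct = s.get("implementation_pct")
        if not duration or not impl_pct:
            continue  # excluded: no duration or no implementation time
        context_time += duration * s.get("context_hunting_pct", 0) / 100
        impl_time += duration * impl_pct / 100
    return context_time / impl_time if impl_time else None
```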

Meaningful outcome rate

Share of sessions that produced a meaningful outcome.

Grouped by model. Higher = better.
Model | Latest
100%
100%
96.3%
93.2%
90.9%
86.5%
82.5%
77.4%
76.9%
75.8%
67.1%
38.1%
29.7%
How is this measured?

This metric counts sessions classified as producing a meaningful outcome and divides by all classified sessions for the same model and day. It is intended to separate sessions that actually moved work forward from sessions that stalled, wandered, or ended without a useful result. It does not rank the size or business value of the outcome, and task difficulty can vary across teams and tools.
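A minimal sketch of this rate, assuming each session carries an `outcome` label from the classifier (the field name and label values are assumptions):

```python
def meaningful_outcome_rate(sessions):
    """Share of classified sessions whose outcome was meaningful.

    Unclassified sessions (outcome is None) are left out of the
    denominator, since the metric divides by classified sessions only."""
    classified = [s for s in sessions if s.get("outcome") is not None]
    if not classified:
        return None
    meaningful = sum(1 for s in classified if s["outcome"] == "meaningful")
    return meaningful / len(classified)
```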

Failed tool call rate

Share of tool calls that failed.

Grouped by model. Lower = better.
Model | Latest
3.0%
3.1%
3.3%
4.2%
4.2%
4.8%
4.9%
5.7%
6.4%
9.5%
10.0%
12.0%
How is this measured?

This metric counts failed tool calls and divides by all tool calls in the same 28-day model window. It includes tool failures visible in the recorded session stream and excludes sessions with no tool calls. A lower rate generally indicates smoother execution, but task mix and tool usage can vary by model, so compare sustained patterns rather than isolated spikes.
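As with the other window ratios, this one reduces to a count over a count. The `status` field and its values are assumed names for whatever failure marker appears in the recorded session stream:

```python
def failed_tool_call_rate(tool_calls):
    """Share of tool calls that failed within one window.

    Returns None when there are no tool calls at all, matching the
    exclusion of sessions without tool calls."""
    if not tool_calls:
        return None
    failed = sum(1 for c in tool_calls if c["status"] == "failed")
    return failed / len(tool_calls)
```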

Cadence

About Cadence

Cadence helps engineering organizations measure the friction in their AI-assisted development. We aggregate session telemetry from our customer base to surface where AI tools succeed and where they break down in real engineering work.

Learn more