
The highest-scoring AI agents now take more than 20 minutes to complete a single knowledge work task, and the gap between the fastest and the most capable models is wider than you’d expect. If you’re evaluating AI for enterprise deployment, understanding what an agentic knowledge work benchmark measures about task time is now as important as understanding raw accuracy scores.

The release of AA-Briefcase by Artificial Analysis in June 2026 gives us the most rigorous public look yet at how long frontier AI models take to complete real, professional-grade work — and the results reshape how we should think about AI agent selection.

Why Time Per Task Is the New AI Performance Metric

For years, AI benchmarks focused almost exclusively on accuracy: did the model get the right answer? But as organizations deploy AI agents on multi-step workflows — drafting financial models, building board presentations, synthesizing thousands of documents — a new question has become equally critical: how long does it actually take?

An AI agent that scores brilliantly on accuracy but requires 30 minutes per task may be impractical for real-world business workflows. Conversely, a fast model that cuts corners on quality produces outputs that require costly human review. The right question isn’t “how smart is this model?” — it’s “where does it sit on the speed-quality tradeoff curve?”

This is exactly what the AA-Briefcase agentic knowledge work benchmark was built to answer.

What Is AA-Briefcase?

AA-Briefcase is a proprietary agentic knowledge work benchmark developed by Artificial Analysis that evaluates frontier AI models on long-horizon, professional-grade tasks built by industry experts. Unlike conventional benchmarks that test isolated questions or short completions, AA-Briefcase requires models to act as autonomous agents working through realistic multi-week projects — producing deliverables like financial models, board presentations, and design mock-ups across complex scenarios.

This makes AA-Briefcase fundamentally different from most AI evaluations. It is not a multiple-choice test. It is a simulation of real work.

How the Benchmark Is Structured

AA-Briefcase is organized around four private knowledge work scenarios — Data Science, Product Management, Banking Operations, and Heavy Industry Strategy — plus one public scenario (Due Diligence) released for research purposes. Each scenario is built as a multi-week professional project, with tasks that build on each other over time.

In total, the benchmark comprises 91 tasks across the four scored scenarios, supported by nearly 2,000 source files — including more than 3,500 emails and 25,000 Slack messages — that models must navigate to complete each task. The input data is intentionally messy and fragmented, mirroring the contradictions and ambiguity of real organizational environments.

Each task is graded across three dimensions:

Rubric pass rate — Binary checks verifying whether the model followed instructions, retrieved required evidence, and reached correct conclusions
Analytical Quality Elo — Pairwise comparison judging the rigor and depth of each model’s analysis
Presentation Elo — Pairwise comparison judging the professionalism and clarity of the output

The combined AA-Briefcase Elo score aggregates all three dimensions into a single performance ranking.
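
For readers unfamiliar with Elo, the sketch below shows the standard Elo model that pairwise rankings of this kind are typically built on. It is illustrative only: the source does not describe how Artificial Analysis fits or scales its Elo scores, so the 400-point scale, the K-factor, and the function names here are all assumptions.

```python
# Standard Elo formulas, shown for intuition only. The 400-point scale and
# the K-factor are conventional defaults, not confirmed AA-Briefcase settings.


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after a single pairwise comparison."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models: the winner gains what the loser drops.
    print(update(1200.0, 1200.0, a_won=True))  # (1216.0, 1184.0)
```

Under this conventional scale, a gap of about 90 points would mean the higher-rated model is preferred in a bit over 60% of head-to-head comparisons, which is a useful way to read the Elo gaps quoted in this article if the benchmark uses a similar scale.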

What “Time Per Task” Actually Measures

As a true agentic knowledge work benchmark, AA-Briefcase also captures wall-clock time per task, calculated by combining three factors:

The number of output tokens the model generates, divided by its canonical inference speed
The time needed to generate reasoning tokens (for models that use extended thinking)
The actual tool execution time recorded during evaluation — the time spent on file reads, code runs, image inspections, and similar agent actions

Importantly, models are allowed up to 500 turns per task and can submit their work or abandon a task at any point. This makes the time measurement highly realistic: some models grind through many iterations; others converge and submit faster.
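
The three components above can be sketched as a simple estimate. This is a minimal sketch, not Artificial Analysis's actual implementation: the function name and the example numbers are hypothetical, and it assumes reasoning tokens are generated at the same canonical output speed as answer tokens, which the source does not confirm.

```python
def estimate_task_minutes(
    answer_tokens: int,
    reasoning_tokens: int,
    tokens_per_second: float,
    tool_seconds: float,
) -> float:
    """Estimate wall-clock minutes for one task.

    Assumption: answer and reasoning tokens are both generated at the
    model's canonical output speed (tokens_per_second).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    generation_seconds = (answer_tokens + reasoning_tokens) / tokens_per_second
    return (generation_seconds + tool_seconds) / 60.0


if __name__ == "__main__":
    # Hypothetical inputs, not benchmark data: 60k answer tokens,
    # 30k reasoning tokens, 100 tokens/s, 2 minutes of tool execution.
    print(estimate_task_minutes(60_000, 30_000, 100.0, 120.0))  # 17.0
```

The sketch makes the structure of the metric visible: token volume and output speed enter as a ratio, while tool time is simply added on top.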

AA-Briefcase Results: Model-by-Model Time and Performance Breakdown

The table below summarizes key results from the AA-Briefcase agentic knowledge work benchmark, combining time per task with overall Elo performance and approximate cost per task where data is available.

Model	AA-Briefcase Elo	Time Per Task (mins)	Approx. Cost Per Task	Notes
Claude Fable 5	Highest overall	~28.5 (est.)	~$31	Top rubric + analytical Elo; not publicly available
Claude Opus 4.8 (max)	2nd overall	~23	High	Co-leads on Presentation Elo
GLM-5.2 (max)	3rd overall (~1261 Elo)	~16.3	Moderate	Top open-weight model
GPT-5.5 (xhigh)	Top 5	~11	Moderate	Best speed-performance tradeoff at top tier
GPT-5.5 (high/medium)	Top tier variants	~10–12	Lower	Pareto-efficient frontier
MiniMax-M3	~1113 Elo	~26	Low-moderate	Slower than Opus despite lower Elo
DeepSeek V4 Pro (max)	Competitive	Lower	Very low	Strong price-performance ratio
DeepSeek V4 Flash (max)	Lower tier	Fast	~$0.04	Lowest cost; well below frontier capability
Gemini 3.5 Flash	Well below Elo leaders	Competitive	Low	Highest turn count (~88/task); underperforms vs. general intelligence ranking
Gemini 3.1 Pro Preview	Moderate	Fast	Low	Rarely inspects outputs visually before submitting

Data from Artificial Analysis AA-Briefcase evaluation, June 2026. Cost estimates based on token usage and model pricing.

A few things stand out immediately. First, time does not reliably predict quality. MiniMax-M3, for example, averages more time per task than Claude Opus 4.8 yet lands roughly 240 Elo points behind it. Second, cost varies by roughly 800x across tested models, from about $0.04 to over $31 per task. Third, no low-cost model currently reaches frontier AA-Briefcase performance.

The Pareto Frontier: Where Speed Meets Intelligence

The most useful way to read this agentic knowledge work benchmark is not by ranking models purely on Elo or purely on speed, but by identifying which models sit on the Pareto frontier — the set of models where you cannot improve one dimension without sacrificing the other.
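
To make the definition concrete, here is a minimal sketch of how a Pareto frontier over time and Elo could be computed. It uses only the three models for which this article gives both a time and a numeric Elo. The Claude Opus 4.8 Elo is inferred from the statement that GLM-5.2 sits approximately 90 points below it, so treat it as approximate; the `Result` type and function names are hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    minutes: float  # time per task, lower is better
    elo: float      # AA-Briefcase Elo, higher is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.minutes <= b.minutes and a.elo >= b.elo
    strictly_better = a.minutes < b.minutes or a.elo > b.elo
    return no_worse and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep only results that no other result dominates, fastest first."""
    frontier = [
        r for r in results
        if not any(dominates(other, r) for other in results)
    ]
    return sorted(frontier, key=lambda r: r.minutes)


RESULTS = [
    Result("GLM-5.2 (max)", 16.3, 1261),
    Result("Claude Opus 4.8 (max)", 23.0, 1351),  # Elo inferred: ~1261 + ~90
    Result("MiniMax-M3", 26.0, 1113),
]

if __name__ == "__main__":
    for r in pareto_frontier(RESULTS):
        print(r.name)
```

On this three-model subset, MiniMax-M3 drops out because GLM-5.2 is both faster and higher-rated. The full benchmark chart includes many more models, so the real frontier will differ from this toy subset.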

GPT-5.5 Variants Hold the Efficiency Edge

Multiple GPT-5.5 reasoning variants — medium, high, and xhigh — cluster along the Pareto frontier of the AA-Briefcase Elo versus time-per-task chart. GPT-5.5 (xhigh) is particularly notable: it ranks in the top five on overall AA-Briefcase Elo while completing tasks in roughly 11 minutes on average, approximately half the time of Claude Opus 4.8. For organizations that prioritize throughput alongside quality, GPT-5.5 (xhigh) represents the strongest efficiency case among currently available top-tier models.

GLM-5.2 Leads Open-Weight Models

GLM-5.2 (max) achieves an AA-Briefcase Elo of approximately 1261 — placing it third overall in the benchmark and well ahead of any other open-weight model — while averaging about 16.3 minutes per task. This positions it as the dominant open-weight option in the agentic knowledge work benchmark landscape and an attractive option for organizations that want frontier-adjacent capability without vendor lock-in or the premium pricing of top closed-source models. MiniMax-M3 is the next best open-weight model at roughly 1113 Elo.

GLM-5.2 scores approximately 90 Elo points below Claude Opus 4.8 while costing less than 25% as much per task — a tradeoff many enterprise teams may find compelling.

What Drives Task Time? Six Factors That Determine AI Agent Speed

Understanding what makes some models faster or slower on this agentic knowledge work benchmark helps practitioners make smarter deployment decisions. The data reveals a clear hierarchy of causes:

Output verbosity is the dominant factor. The number of answer tokens a model generates per task is the single biggest driver of wall-clock time. Models that write lengthy, detailed outputs simply take longer, regardless of inference speed. Claude Fable 5 averages roughly 112,000 output tokens per task; Gemini 3.5 Flash uses approximately 141,000 — 25% more — yet scores far lower on Elo, suggesting that verbosity alone does not produce quality.
Inference speed multiplies or dampens verbosity’s effect. A highly verbose model running on fast infrastructure may still complete a task faster than a less verbose model on slower hardware. For estimated task time, Artificial Analysis uses each model’s canonical output speed, so both factors interact.
Turn count varies widely but correlates weakly with performance. Models are allowed up to 500 turns per task. Gemini 3.5 Flash averages roughly 88 turns — among the highest in the benchmark — yet lands well below the Elo leaders, suggesting that more iterations do not compensate for weaker underlying capability.
Reasoning token usage adds significant time for extended-thinking models. Models that generate internal reasoning chains before producing answers consume additional tokens not visible in final outputs, adding meaningful latency.
Tool execution time is a surprisingly small contributor. Time spent on file reads, code execution, and other tool calls accounts for only approximately 12% of total task time. The remaining ~88% is explained by output generation and inference speed. This is a counterintuitive finding: agent speed is primarily a language model problem, not a tooling or infrastructure problem.
Visual inspection behavior correlates strongly with presentation quality. The top-performing models on Presentation Elo — Claude Fable 5 and Claude Opus 4.8 — make an average of 21 and 12 visual inspections per task respectively, reviewing their rendered outputs before submitting. Lower-ranked models inspect far less; Gemini 3.1 Pro Preview averages fewer than 0.1 visual checks per task. This behavior costs time but pays off in quality.

What AA-Briefcase Results Mean for Real-World AI Deployment

Performance Ceilings Are Lower Than Expected

Perhaps the most important finding from this agentic knowledge work benchmark is that raw performance remains limited even at the frontier. Claude Fable 5, the highest-scoring model tested, passes all rubric criteria on only 3% of tasks. On nearly a third of all 91 tasks, no tested model achieves better than a 50% rubric pass rate. This is not a criticism of any single model. It reflects the genuine difficulty of coordinating across thousands of fragmented source files, fulfilling hidden requirements, and producing professional-grade deliverables simultaneously.

For practitioners, this means AI agents should be treated as powerful assistants that dramatically accelerate work — not autonomous replacements for domain experts on complex knowledge tasks.

Failure Modes Differ by Capability Tier

One of the more nuanced insights from the agentic knowledge work benchmark is that how models fail changes as capability increases. Weaker models most often fail on basic execution: missing relevant files, producing unusable deliverables, or generating no deliverable at all. More capable models, measured by rubric pass rate, most often fail to fulfill all task requirements — including requirements buried in source documents or implicit in the project context. Incorrect analysis and formatting errors appear across all capability tiers.

The Cost-Performance Tradeoff Is Now Quantifiable

The 800x variation in cost per task across this agentic knowledge work benchmark gives enterprise buyers an unprecedented tool for cost modeling. A team running 1,000 knowledge work tasks per month faces costs ranging from roughly $40 (DeepSeek V4 Flash) to over $31,000 (Claude Fable 5 pricing) — a difference that directly shapes which models are viable for which use cases. The strongest open-weight options, particularly GLM-5.2, appear to close a meaningful portion of the quality gap at a fraction of the price.
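
The monthly figures above follow from simple multiplication, sketched here so the arithmetic can be rerun with your own volumes. The two per-task costs are the rounded values quoted in this article; the function names are hypothetical, and real spend will depend on actual token usage and current pricing.

```python
# Rounded per-task costs quoted in this article (USD).
COST_PER_TASK_USD = {
    "DeepSeek V4 Flash (max)": 0.04,
    "Claude Fable 5": 31.0,
}


def monthly_cost(cost_per_task: float, tasks_per_month: int) -> float:
    """Projected monthly spend for a fixed per-task cost."""
    return cost_per_task * tasks_per_month


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per task."""
    return expensive / cheap


if __name__ == "__main__":
    for model, cost in COST_PER_TASK_USD.items():
        print(f"{model}: ${monthly_cost(cost, 1_000):,.0f} per 1,000 tasks")
    ratio = cost_ratio(
        COST_PER_TASK_USD["Claude Fable 5"],
        COST_PER_TASK_USD["DeepSeek V4 Flash (max)"],
    )
    print(f"Cost ratio: about {ratio:.0f}x")
```

With these rounded inputs the ratio comes out near 775x, in line with the roughly 800x spread cited, since the top-end cost is quoted as "over $31".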

AI Agent Speed Is Primarily a Model Problem, Not an Infrastructure Problem

Since tool execution accounts for only ~12% of total task time, even eliminating it entirely would cut wall-clock task time by at most about 12%. Organizations investing heavily in faster infrastructure or tool orchestration will therefore see limited returns on task time. The more productive investment is model selection: choosing a model that balances verbosity, inference speed, and accuracy for the specific task type.
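
The ceiling on tooling gains can be worked out with an Amdahl's-law style calculation. The sketch below is an illustration under a stated assumption, namely that tool time and generation time do not overlap; the function name is hypothetical, and the ~12% share is the figure reported above.

```python
def max_speedup_from_tooling(tool_share: float, tool_speedup: float = float("inf")) -> float:
    """Overall task speedup when only the tool-execution share gets faster.

    tool_share:   fraction of task time spent in tool execution (0 <= x < 1)
    tool_speedup: how many times faster tools become (inf = tool time removed)
    """
    if not 0.0 <= tool_share < 1.0:
        raise ValueError("tool_share must be in [0, 1)")
    if tool_speedup <= 0:
        raise ValueError("tool_speedup must be positive")
    # Time left after the speedup, as a fraction of the original task time.
    remaining = (1.0 - tool_share) + tool_share / tool_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Best case: tool time disappears entirely.
    print(round(max_speedup_from_tooling(0.12), 3))       # 1.136
    # More realistic: tools become twice as fast.
    print(round(max_speedup_from_tooling(0.12, 2.0), 3))  # 1.064
```

Even in the best case, a task that takes about 23 minutes would still take a little over 20 minutes, which is why the choice of model matters more than tool latency here.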

Frequently Asked Questions

What is AA-Briefcase? AA-Briefcase is a proprietary agentic knowledge work benchmark released by Artificial Analysis in June 2026. It evaluates frontier AI models on 91 realistic professional tasks spanning data science, product management, banking operations, and industrial strategy, using thousands of source files and composite rubric-plus-pairwise grading.

How is time per task calculated in AA-Briefcase? Time per task is estimated from three components: answer token generation time (tokens ÷ canonical model output speed), reasoning token generation time (for extended-thinking models), and mean tool execution time per task measured during evaluation.

Which AI model is fastest on the AA-Briefcase benchmark? Among top-performing models, GPT-5.5 (xhigh) is one of the fastest, averaging roughly 11 minutes per task while placing in the top five on overall Elo. Among the very highest-scoring models, Claude Opus 4.8 averages approximately 23 minutes per task.

What makes AA-Briefcase different from other AI benchmarks? Most benchmarks test isolated, single-turn questions. AA-Briefcase simulates multi-week professional projects with thousands of fragmented input files, requiring models to act as autonomous agents, build complex deliverables, and meet professional standards of analytical rigor and presentation — mirroring real knowledge work more closely than any prior public evaluation.

Does spending more time on a task improve quality? Not reliably. The benchmark shows that turn count and output verbosity do not consistently correlate with higher Elo scores. Models like Gemini 3.5 Flash use among the most turns and tokens yet rank well below the Elo leaders. Quality appears to depend more on the model’s underlying reasoning capability and its use of visual review than on raw time invested.

Is the AA-Briefcase data publicly available? The four core scored scenarios remain private to reduce contamination risk. A fifth “Due Diligence” scenario (AA-Briefcase Lite) is available on Hugging Face as a public demonstration of benchmark structure, submission format, and grading — though it does not count toward official scores.

What does an agentic knowledge work benchmark tell us that standard benchmarks can’t? Standard benchmarks measure what a model knows. An agentic knowledge work benchmark measures what a model can do — across multiple steps, with imperfect information, producing structured outputs under realistic professional conditions. That gap between knowing and doing is precisely where most AI deployments succeed or fail.

Conclusion: The Time-Quality Frontier Is Now Measurable

The AA-Briefcase agentic knowledge work benchmark represents a significant step forward in how the AI industry evaluates frontier model capability. By measuring time per task alongside quality, cost, and failure modes, it gives practitioners a far richer framework for model selection than accuracy scores alone.

The headline finding, that frontier AI agents now take roughly 10 to nearly 30 minutes per task on realistic knowledge work, reframes how organizations should plan AI workflows. Speed and quality sit in genuine tension. The Pareto frontier, dominated by GPT-5.5 variants at the top and GLM-5.2 among open-weight options, offers the most efficient paths. But even the best models currently pass all rubric criteria on a tiny fraction of tasks, underscoring that agentic AI in knowledge work is still in its early innings.

What’s clear is that time per task is no longer a secondary metric. As agentic deployments scale, it will define throughput, cost, and user experience as much as any accuracy number. The agentic knowledge work benchmark has given the field a rigorous way to measure it — and that matters enormously for what comes next.

kalinga.ai

How Long Do AI Agents Really Take? Inside the AA-Briefcase Agentic Knowledge Work Benchmark