Strands OfficeBench Agent

An agent that evaluates LLMs on the OfficeBench benchmark — 300 office-automation tasks spanning calendar, email, Excel, Word, PDF, and OCR operations. It deploys to Bedrock AgentCore Runtime for parallel evaluation at scale.

Built on 20 Strands @tool functions (in tools.py) that wrap OfficeBench’s original app scripts via subprocess. All 20 tools are available simultaneously — no explicit app switching, no single-action-per-turn JSON dispatch.
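As a rough illustration of the wrapping pattern described above, the sketch below shows how one tool might shell out to an app script. It is not the code in tools.py: the helper `run_app_script`, the tool name `excel_read_file`, and the script path are hypothetical, and the fallback decorator exists only so the snippet runs without the Strands SDK installed.

```python
import subprocess
import sys

try:
    from strands import tool  # Strands Agents SDK decorator
except ImportError:
    def tool(func):
        """Stand-in so the sketch runs without the SDK installed."""
        return func


def run_app_script(script_args: list[str], timeout: float = 60.0) -> str:
    """Run a script in a child Python process and return its output as text.

    Failures come back as strings rather than exceptions so that the model
    can read the error and decide what to try next.
    """
    try:
        proc = subprocess.run(
            [sys.executable, *script_args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"ERROR: timed out after {timeout} seconds"
    if proc.returncode != 0:
        return f"ERROR (exit {proc.returncode}): {proc.stderr.strip()}"
    return proc.stdout.strip()


@tool
def excel_read_file(file_path: str) -> str:
    """Read an Excel workbook and return its contents as text.

    Args:
        file_path: Path to the workbook inside the testbed.
    """
    # Hypothetical script location; the real layout may differ.
    return run_app_script(["apps/excel_app/read_file.py", "--file_path", file_path])
```

Returning error text instead of raising is a design assumption here: it keeps a failed script call visible to the model as an ordinary tool result.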

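Scores are averaged over 5 runs and reported as mean ± std. The number in parentheses in each column header is the task count for that category.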

| Model | Single App (93) | Two Apps (95) | Three Apps (112) | Overall (300) |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.5 (non-thinking) | 51.61 ± 1.08 | 64.63 ± 1.41 | 50.18 ± 0.74 | 55.20 ± 0.38 |
| Claude Sonnet 4.5 (thinking, budget=10000) | 48.82 ± 2.10 | 65.90 ± 1.91 | 47.86 ± 1.85 | 53.87 ± 1.07 |
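The Overall column is consistent with a task-weighted average of the three category means, assuming every task counts equally. The short check below uses only the numbers reported above; the standard deviations do not combine this way and are left out.

```python
# Task counts per category, taken from the column headers.
task_counts = {"single": 93, "two": 95, "three": 112}

# Mean scores per category, taken from the results above.
mean_scores = {
    "non-thinking": {"single": 51.61, "two": 64.63, "three": 50.18},
    "thinking": {"single": 48.82, "two": 65.90, "three": 47.86},
}


def weighted_overall(category_scores: dict[str, float], counts: dict[str, int]) -> float:
    """Average the category scores, weighting each by its number of tasks."""
    total_tasks = sum(counts.values())
    return sum(category_scores[name] * counts[name] for name in counts) / total_tasks


for config, scores in mean_scores.items():
    print(f"{config}: {weighted_overall(scores, task_counts):.2f}")
# non-thinking: 55.20
# thinking: 53.87
```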

Results are not directly comparable with the original OfficeBench leaderboard — this agent uses Strands native tool-use instead of JSON action dispatch, has all tools available simultaneously, and has no iteration cap. See the full README for the complete list of divergences.

cd examples/strands_officebench_agent
uv venv --python 3.13 && source .venv/bin/activate
uv pip install -e .
# Local testing (no ACR)
python test_local.py

Deploy + run the full benchmark:

python deploy.py # build, push, deploy to ACR
python benchmark.py --limit 300 # full 300-task run
python benchmark.py --limit 1 # smoke test

  • rl_app.py: the rollout entrypoint; stages files in /testbed/, runs the agent with all 20 tools, and evaluates file state at the end.
  • reward.py: OfficeBenchReward, a file-state comparison on the testbed directory after the agent finishes.
  • tools.py: 20 Strands @tool functions wrapping OfficeBench app scripts (calendar, email, Excel, Word, PDF, OCR).
  • benchmark.py: parallel evaluation over the 300-task benchmark via RolloutClient.
  • run_local_eval.py / test_local.py: local (no-ACR) evaluation and smoke tests.
  • Dockerfile + deploy.py + config.toml: container build and programmatic ACR deploy.
  • preprocess.py, models.py, utils.py: shared helpers (S3 task fetch, testbed setup, file readers).
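
To make the idea of a file-state reward concrete, here is a minimal sketch that scores a run by comparing the files left in a testbed directory against expected content hashes. It is an assumption-laden illustration, not the OfficeBenchReward implementation: the function names `snapshot` and `file_state_reward` are hypothetical, and the real reward may use content-level checks rather than exact byte equality.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(directory: Path) -> dict[str, str]:
    """Map each file's path (relative to directory) to the SHA-256 of its bytes."""
    return {
        path.relative_to(directory).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def file_state_reward(testbed: Path, expected: dict[str, str]) -> float:
    """Return 1.0 if every expected file exists with the expected hash, else 0.0."""
    actual = snapshot(testbed)
    matches = all(actual.get(name) == digest for name, digest in expected.items())
    return 1.0 if matches else 0.0


# Usage: build a throwaway testbed, record its state, then change one file.
with tempfile.TemporaryDirectory() as tmp:
    testbed = Path(tmp)
    (testbed / "data").mkdir()
    report = testbed / "data" / "report.txt"
    report.write_text("total: 42\n")
    expected_state = snapshot(testbed)
    reward_before = file_state_reward(testbed, expected_state)
    report.write_text("total: 41\n")
    reward_after = file_state_reward(testbed, expected_state)

print(reward_before, reward_after)  # 1.0 0.0
```

Scoring the final directory state rather than the sequence of tool calls means any path the agent takes is acceptable as long as the resulting files are correct.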

For the full list of divergences from the original OfficeBench, Docker + ECR setup, benchmark config, and local-testing workflow, see the full README on GitHub.