Skip to content

Overview

RL on agents is different from RL on text completion. Rollouts are long-horizon, multi-turn, and tool-using — episodes run for minutes / hours hitting shells, APIs, or other tools. The hard parts are operational: running hundreds in isolation in parallel, capturing the signals a trainer needs (token IDs, logprobs, rewards), and doing it without rewiring the agent for RL.

The AgentCore RL Toolkit’s goal: the agent you train is the agent you deploy. Keep your Bedrock AgentCore agent, swap one decorator, and any training backend (slime, rllm, or verl) can drive it.

AgentCore Runtime is AWS’s serverless hosting layer for agents (see the developer guide for broader AgentCore features). For RL, four properties matter, all available in AgentCore Runtime without bespoke infrastructure:

  • Session isolation — each session runs in a dedicated microVM with isolated CPU, memory, and filesystem, giving complete separation between user sessions and preventing cross-session data contamination. Filesystem writes in one rollout are invisible to any other. After session completion the microVM is terminated and memory sanitized.
  • On-demand scaling — the runtime is serverless; new sessions spin up on demand, enabling massive parallel rollouts without contending for local CPU or reserving worker capacity.
  • Extended execution time — real-time interactions and long-running workloads up to 8 hours — wide enough for multi-turn tool-using rollouts.
  • Framework agnostic — any agent framework (LangGraph, Strands, or custom) and any foundation model. The agent’s runtime environment is independent of the training library’s, so the two evolve separately.

AgentCore Runtime handles hosting; the toolkit wires the training-data pipeline that makes an agent trainable.

The diagram has three parts, left to right:

  • AgentCore Runtime (left) — where your agent runs. Each rollout is a session with an AgentEnvironment tool-use loop that ends in an Evaluate step. Your code lives here, unchanged from deployment.
  • Shared data plane (center) — model-gateway is what lets the agent on AgentCore Runtime keep using the standard OpenAI chat interface with no awareness of training. The agent makes plain v1/chat calls; the gateway transparently proxies them to the training backend’s inference servers and, on the side, captures the training-only signals (token IDs, logprobs, expert masks) that the agent never sees or handles.
  • Training backend (right) — Inference Servers serve the current policy; the Training Engine consumes rollout data + rewards and syncs updated weights back via Weight Sync.

Start with Prepare agent for RL, then pick a backend: slime · rllm · verl.