Overview

A task app is the service that exposes the environment and evaluation endpoints (e.g., /rollout, /health). Provide the task app's base URL when you create an RL job.

Health and readiness

  • Expose /health and /readyz. These endpoints support rollout token minting and diagnostics.
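
As a rough illustration of the two probe endpoints, here is a minimal sketch using only the Python standard library. The response bodies, the 503 status for a not-ready app, and the `ready` flag are assumptions made for this example; this page only specifies that /health and /readyz should exist.

```python
import json
from http.server import BaseHTTPRequestHandler


def probe_response(path: str, ready: bool) -> tuple:
    """Return (status_code, json_body) for a probe path.

    The body shapes and the 503 for "not ready" are illustrative assumptions.
    """
    if path == "/health":
        # Liveness: the process is up and able to answer requests.
        return 200, {"status": "ok"}
    if path == "/readyz":
        # Readiness: the app has finished loading whatever it needs.
        if ready:
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"error": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    # Hypothetical flag; a real app would flip it once startup completes.
    ready = True

    def do_GET(self):
        status, body = probe_response(self.path, self.ready)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, the handler could be served with `http.server.HTTPServer(("127.0.0.1", 8000), ProbeHandler).serve_forever()`; the host and port are arbitrary.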

Security

  • Rollouts use organization-scoped tokens minted by the backend. Keep the task app private; the client provides only its base URL.

Rollout endpoint

  • Path: POST /rollout
  • Auth: header X-API-Key: <rollout token>
    • The backend issues tokens per job via POST /api/rl/jobs/{id}/tokens/rollout; each token is scoped to rollout.
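
To make the path and auth header concrete, the sketch below builds (but does not send) a rollout request with the standard library. The helper name `build_rollout_request`, the example base URL, and the Content-Type header are assumptions for illustration; only POST /rollout and the X-API-Key header come from this page.

```python
import json
import urllib.request


def build_rollout_request(base_url: str, rollout_token: str, body: dict) -> urllib.request.Request:
    """Build a POST /rollout request carrying the rollout token.

    Hypothetical helper: it only constructs the request object.
    """
    url = base_url.rstrip("/") + "/rollout"
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            # The rollout token minted by the backend goes in X-API-Key.
            "X-API-Key": rollout_token,
            # Assumed: the body is sent as JSON.
            "Content-Type": "application/json",
        },
    )


# Example with a placeholder URL and token; nothing is sent over the network.
request = build_rollout_request("http://localhost:8000", "example-token", {"max_steps": 7})
```

Note that urllib stores header names in capitalized form internally (for example "X-api-key"); HTTP header names are case-insensitive, so this does not change what the server sees.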

Request body (JSON)

{
  "env": {
    "env_name": "Crafter",
    "config": {},
    "seed": 0
  },
  "policy": {
    "policy_name": "crafter-react",
    "config": {
      "inference_url": "http://127.0.0.1:8001",
      "model": "Qwen/Qwen3-0.6B",
      "temperature": 0.3,
      "top_p": 0.95,
      "max_tokens": 1024,
      "thinking_mode": "think",
      "thinking_budget": 512
    }
  },
  "ops": ["agent", "env", "agent", "env"],
  "max_steps": 7,
  "record": {"trajectories": true, "logprobs": false, "value": false},
  "on_done": "reset",
  "safety": {"max_ops": 100000, "max_time_s": 300.0}
}

Notes:
  • ops specifies how agent and environment steps are interleaved, up to max_steps per episode.
  • The task app should use policy.config.inference_url to call the LLM policy, and step the environment accordingly.
  • Determinism: respect the provided seed when applicable.
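
The notes above can be sketched as a small loop over ops. This is an illustration under stated assumptions, not the reference behavior: it assumes max_steps counts environment steps, and `agent_step` and `env_step` are hypothetical callables standing in for the call to policy.config.inference_url and for the real environment.

```python
from typing import Callable, Dict, List, Tuple


def run_ops(
    ops: List[str],
    max_steps: int,
    initial_obs: Dict,
    agent_step: Callable[[Dict], List[Dict]],
    env_step: Callable[[List[Dict]], Tuple[Dict, float]],
) -> List[Dict]:
    """Walk the ops list, alternating policy calls and environment steps.

    Assumption: max_steps limits the number of "env" ops that are executed.
    """
    steps: List[Dict] = []
    obs = initial_obs
    tool_calls: List[Dict] = []
    env_steps = 0
    for op in ops:
        if op == "agent":
            # Ask the policy for tool calls given the current observation.
            tool_calls = agent_step(obs)
        elif op == "env":
            if env_steps >= max_steps:
                break
            # Apply the pending tool calls and record the step.
            next_obs, reward = env_step(tool_calls)
            steps.append({"obs": obs, "tool_calls": tool_calls, "reward": reward})
            obs = next_obs
            tool_calls = []
            env_steps += 1
        else:
            raise ValueError(f"unknown op: {op!r}")
    return steps
```

A seeded environment would typically be constructed from env.seed before this loop runs, so that repeated rollouts with the same seed produce the same observations.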

Response body (200 OK)

{
  "trajectories": [
    {
      "env_id": "episode_0",
      "steps": [
        {
          "obs": {"text": "...", "prompt": "..."},
          "tool_calls": [{"name": "act", "args": {"action": "..."}}],
          "reward": 0.0
        }
      ],
      "final": {"observation": {"achievements": {"goal_met": true}}}
    }
  ]
}

Requirements:
  • Include steps with per-step obs, optional tool_calls, and reward when defined by the task.
  • Include a final.observation block; environment-specific handlers compute episode_return from this.
  • For long-running rollouts, you may return 303 See Other with a Location header to poll; the trainer will follow it until completion.
  • If the horizon is insufficient, you may return 422 Unprocessable Entity; the trainer will retry once with a larger max_steps.
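
Before returning a 200 response, a task app might run a quick self-check against the first two requirements above. The validator below is a hypothetical sketch: the function name and error messages are invented for this example, and treating an empty trajectories list as an error is an assumption.

```python
from typing import Dict, List


def validate_rollout_response(payload: Dict) -> List[str]:
    """Return a list of problems found in a rollout response (empty if none)."""
    trajectories = payload.get("trajectories")
    # Assumption: at least one trajectory is expected.
    if not isinstance(trajectories, list) or not trajectories:
        return ["'trajectories' must be a non-empty list"]

    errors: List[str] = []
    for i, traj in enumerate(trajectories):
        steps = traj.get("steps")
        if not isinstance(steps, list):
            errors.append(f"trajectories[{i}].steps must be a list")
        else:
            for j, step in enumerate(steps):
                # obs is required per step; tool_calls and reward are optional.
                if "obs" not in step:
                    errors.append(f"trajectories[{i}].steps[{j}] is missing 'obs'")
        final = traj.get("final")
        # final.observation is what episode_return is computed from.
        if not isinstance(final, dict) or "observation" not in final:
            errors.append(f"trajectories[{i}].final.observation is missing")
    return errors
```

Calling it on the example response shown earlier returns an empty list; dropping the final block from a trajectory yields one error for that trajectory.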