Synth’s RL system is built around three cooperative services: your hosted task app, Synth’s rollout + trainer infrastructure, and the dashboard that surfaces results. This page explains how an RL job actually runs so you know which levers to pull when designing environments, rewards, and configs.

1. Episode flow

  1. Job submission – the CLI sends your TOML config to Synth. The backend snapshots it (data.config) so the run is reproducible.
  2. Rollout workers spin up – Synth provisions GPU workers based on [backend.compute_pool] / [backend.topology_id]. Each worker contacts your task app with an organization-scoped ENVIRONMENT_API_KEY.
  3. Alternating ops – workers follow the ops sequence generated from [training.rollouts] (typically agent, env, agent, …). The task app:
    • Calls the policy (vLLM, provider proxy, or custom logic).
    • Steps the environment.
    • Emits rewards and final state metadata.
  4. Trainer updates – trainers compute gradients, update policy checkpoints, and stream metrics (reward curves, losses) back to the backend.
  5. Dashboards & traces – every trajectory is logged; checkpoints and status updates appear immediately in the dashboard.
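
In pseudocode, the agent/env alternation looks roughly like the sketch below. It is purely illustrative: the function and method names (run_episode, call_policy, step_env, final_reward) are hypothetical stand-ins, not Synth's rollout worker API.

```python
# Illustrative sketch of one episode as a rollout worker might drive it.
# The task_app argument is any object exposing these (hypothetical) methods.

def run_episode(task_app, ops, seed):
    """Alternate policy and environment steps, collecting reward signals."""
    state = task_app.reset(seed=seed)          # deterministic start (see section 2)
    action = None
    event_rewards, outcome_reward = [], None

    for op in ops:                             # e.g. ["agent", "env", "agent", ...]
        if op == "agent":
            action = task_app.call_policy(state)   # vLLM, provider proxy, or custom logic
        else:                                      # "env"
            state, reward, done = task_app.step_env(action)
            event_rewards.append(reward)           # incremental EventReward
            if done:
                outcome_reward = task_app.final_reward(state)  # final OutcomeReward
                break

    return event_rewards, outcome_reward       # what the trainer consumes
```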

2. Role of the task app

Your task app owns:
  • Determinism & seeding – respect the seed supplied in each rollout request so episodes can be replayed or audited later.
  • Reward surfaces – emit both incremental (EventReward) and final (OutcomeReward) signals; Synth aggregates these into metrics and JSONL exports.
  • Metadata – task_info describes datasets, rubrics, and capabilities so the CLI can help users choose the right seeds/topologies.
  • Tracing toggles – honour tracing flags so every run produces artifacts for finetuning.
If rewards look wrong or metrics are missing, check your rollout executor first—the backend simply forwards what the task app emits.
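
As a rough sketch of those responsibilities, a rollout executor might look like the following. EventReward and OutcomeReward are the two signal types described above; everything else (the dataclass fields, env, policy, task_info) is a hypothetical stand-in for your own framework code.

```python
from dataclasses import dataclass

# Hypothetical shapes -- substitute whatever your task app framework actually uses.
# Only the EventReward / OutcomeReward split comes from this page.

@dataclass
class EventReward:          # incremental signal, emitted as achievements fire
    value: float
    event: str

@dataclass
class OutcomeReward:        # final signal, emitted once per episode
    value: float

def execute_rollout(env, policy, seed: int) -> dict:
    """Run one episode; env and policy are duck-typed stand-ins for your own code."""
    env.reset(seed=seed)                         # honour the supplied seed (replayability)
    events = []
    for step in policy.run(env):                 # your policy loop
        if step.achievement is not None:
            events.append(EventReward(value=1.0, event=step.achievement))

    return {
        "events": events,                              # EventReward stream
        "outcome": OutcomeReward(env.final_score()),   # OutcomeReward
        "metadata": env.task_info(),                   # datasets, rubrics, capabilities
    }
```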

3. How configs influence training

  • [model] determines the starting checkpoint. Use base for cold starts or source (e.g., ft:…) to resume from SFT.
  • [training.rollouts] shapes sample efficiency: batch size, concurrency, seeds, and whether you collect reference rollouts.
  • [backend] maps to GPU pools/topologies. Under-provisioning here slows convergence more than anything else.
  • [tags] and [data] are pure metadata but invaluable for filtering runs in the dashboard.
Keep configs in version control—Synth embeds them in job payloads so you can trace every run back to the exact settings used.
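
For orientation, here is a minimal, illustrative config parsed with Python's standard tomllib. The table names ([model], [training.rollouts], [backend], [tags]) come from this page; the individual keys and values inside them are assumptions, not the authoritative schema.

```python
import tomllib

# Illustrative config -- the table names match this page, but the keys and values
# inside them are assumptions, not Synth's authoritative schema.
CONFIG = """
[model]
base = "Qwen/Qwen2.5-7B-Instruct"   # or source = "ft:..." to resume from SFT

[training.rollouts]
batch_size = 32
concurrency = 8
seeds = [0, 1, 2, 3]

[backend]
compute_pool = "a10g-small"         # start small, then graduate to larger pools

[tags]
experiment = "reward-shaping-v2"
"""

cfg = tomllib.loads(CONFIG)
print(cfg["model"], cfg["training"]["rollouts"]["batch_size"])
```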

4. Metrics you’ll watch

  • Average reward / return – aggregated from OutcomeReward values; track this for overall progress.
  • Event frequencies – how often key achievements fire (EventReward). Useful when tuning curricula or rejection filters.
  • Policy health – token usage, invalid action rate, latency (all logged automatically when the task app emits structured data).
  • Checkpoint quality – every job writes checkpoints you can evaluate or export. Use evaluation configs to compare them against baselines.
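
The dashboard computes all of this for you, but it can be useful to sanity-check a run locally against the JSONL exports mentioned earlier. The field names in this sketch (outcome_reward, events) are assumptions about the export format; adjust them to match your actual exports.

```python
import json
from collections import Counter
from pathlib import Path

def summarize(export_path: str) -> dict:
    """Rough local summary of a rollout export; field names are assumptions."""
    returns, event_counts = [], Counter()
    with Path(export_path).open() as f:
        for line in f:
            record = json.loads(line)
            returns.append(record.get("outcome_reward", 0.0))   # OutcomeReward value
            for event in record.get("events", []):              # EventReward entries
                event_counts[event.get("event", "unknown")] += 1

    avg = sum(returns) / max(len(returns), 1)
    return {"episodes": len(returns), "avg_return": avg, "events": dict(event_counts)}
```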

5. Best practices

  • Start with short runs on smaller pools (A10G) to validate reward signals, then graduate to H100 for longer training.
  • Enable tracing from day one so you can bootstrap rejection finetuning without rerunning episodes.
  • Pair RL configs with evaluation configs; run evals after major checkpoints to catch regressions quickly.
  • Use idempotency keys when automating job submissions to avoid launching duplicates (see the sketch after this list).
  • Keep your task app slim—mount large assets from object storage rather than bundling them into the container.
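
For the idempotency-key point above, one common pattern (an assumption here, not Synth's documented API) is to derive the key deterministically from the config contents and run name, so re-running the same automation cannot launch a second copy of the same job:

```python
import hashlib
from pathlib import Path

def idempotency_key(config_path: str, run_name: str) -> str:
    """Stable key: the same config bytes and run name always yield the same key."""
    digest = hashlib.sha256(Path(config_path).read_bytes()).hexdigest()[:16]
    return f"{run_name}-{digest}"

# Attach the key to your submission call; the exact field or header name depends
# on the API (an "Idempotency-Key" HTTP header is one common convention).
# key = idempotency_key("configs/my_rl_run.toml", "my-rl-run-v1")   # hypothetical names
```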
Once you internalize the flow above, the rest of the docs (task apps, datasets, CLI commands) slot into place: they give you the knobs to shape each piece of the pipeline while Synth handles scheduling, scaling, and observability.