1. Episode flow
- Job submission – the CLI sends your TOML config to Synth. The backend snapshots it (`data.config`) so the run is reproducible.
- Rollout workers spin up – Synth provisions GPU workers based on `[backend.compute_pool]`/`[backend.topology_id]`. Each worker contacts your task app with an organization-scoped `ENVIRONMENT_API_KEY`.
- Alternating ops – workers follow the `ops` sequence generated from `[training.rollouts]` (typically `agent`, `env`, `agent`, …). The task app (see the sketch after this list):
  - Calls the policy (vLLM, provider proxy, or custom logic).
  - Steps the environment.
  - Emits rewards and final state metadata.
- Trainer updates – trainers compute gradients, update policy checkpoints, and stream metrics (reward curves, losses) back to the backend.
- Dashboards & traces – every trajectory is logged; checkpoints and status updates appear immediately in the dashboard.
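The loop below sketches what a rollout worker drives against the task app during the alternating ops. It is illustrative only: `RolloutRequest`, `EpisodeResult`, and the `policy`/`env` call signatures are assumptions made for the sketch, not the Synth SDK or task-app API.

```python
# Illustrative only: the names and payload shapes here are assumptions, not the Synth SDK.
from dataclasses import dataclass, field


@dataclass
class RolloutRequest:
    seed: int         # supplied by the backend so the episode can be replayed
    ops: list[str]    # e.g. ["agent", "env", "agent", "env", ...]


@dataclass
class EpisodeResult:
    event_rewards: list = field(default_factory=list)
    outcome_reward: float = 0.0
    metadata: dict = field(default_factory=dict)


def run_episode(request: RolloutRequest, policy, env) -> EpisodeResult:
    """Follow the ops sequence: call the policy on 'agent' ops, step the environment on 'env' ops."""
    result = EpisodeResult()
    observation = env.reset(seed=request.seed)      # honour the supplied seed
    action = None
    for op in request.ops:
        if op == "agent":
            action = policy(observation)            # vLLM, provider proxy, or custom logic
        elif op == "env":
            observation, reward, done = env.step(action)
            result.event_rewards.append(reward)     # incremental reward signal
            if done:
                break
    result.outcome_reward = sum(result.event_rewards)   # simple final score for the sketch
    result.metadata = {"seed": request.seed, "steps": len(result.event_rewards)}
    return result
```

The worker only returns rewards and metadata; gradient computation and checkpoint updates happen on the trainer side, as described above.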
2. Role of the task app
Your task app owns:
- Determinism & seeding – respect the seed supplied in each rollout request so episodes can be replayed or audited later.
- Reward surfaces – emit both incremental (`EventReward`) and final (`OutcomeReward`) signals; Synth aggregates these into metrics and JSONL exports (a sketch follows this list).
- Metadata – `/task_info` describes datasets, rubrics, and capabilities so the CLI can help users choose the right seeds/topologies.
- Tracing toggles – honour tracing flags so every run produces artifacts for finetuning.
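A minimal sketch of the two reward surfaces, using simple stand-in dataclasses; the real `EventReward`/`OutcomeReward` types and the rollout entry point come from the Synth SDK and may differ.

```python
# Illustrative containers only; the SDK's EventReward/OutcomeReward types may differ.
import random
from dataclasses import dataclass


@dataclass
class EventReward:
    step: int
    value: float
    label: str            # e.g. which achievement fired


@dataclass
class OutcomeReward:
    value: float
    metadata: dict


def rollout(seed: int, num_steps: int = 10) -> tuple[list[EventReward], OutcomeReward]:
    """Deterministic toy episode: the same seed yields the same rewards, so runs can be replayed."""
    rng = random.Random(seed)                      # respect the supplied seed
    events = []
    for step in range(num_steps):
        if rng.random() < 0.3:                     # pretend an achievement fired
            events.append(EventReward(step=step, value=1.0, label="achievement"))
    outcome = OutcomeReward(
        value=sum(e.value for e in events),        # final, aggregated signal
        metadata={"seed": seed, "steps": num_steps},
    )
    return events, outcome
```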
3. How configs influence training
- `[model]` determines the starting checkpoint (see the example config after this list). Use `base` for cold starts or `source` (e.g., `ft:…`) to resume from SFT.
- `[training.rollouts]` shapes sample efficiency: batch size, concurrency, seeds, and whether you collect reference rollouts.
- `[backend]` maps to GPU pools/topologies. Under-provisioning here slows convergence more than anything else.
- `[tags]` and `[data]` are pure metadata but invaluable for filtering runs in the dashboard.
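To make the mapping concrete, here is a sketch that parses an example config. The section names come from this page, but the individual keys and values are placeholders rather than the authoritative schema.

```python
# Sketch only: section names follow this page; individual keys and values are placeholders.
import tomllib  # Python 3.11+

EXAMPLE_CONFIG = """
[model]
base = "Qwen/Qwen2.5-7B-Instruct"   # cold start; or set source = "ft:..." to resume from SFT

[training.rollouts]
batch_size = 32                     # sample-efficiency knobs
concurrency = 8
seeds = [0, 1, 2, 3]

[backend]
compute_pool = "a10g-small"         # GPU pool; under-provisioning slows convergence
topology_id = "single-node"

[tags]
experiment = "rl-demo-v1"           # pure metadata, used for dashboard filtering
"""

config = tomllib.loads(EXAMPLE_CONFIG)
print(config["model"]["base"])            # starting checkpoint
print(config["training"]["rollouts"])     # batch size, concurrency, seeds
print(config["backend"]["compute_pool"])  # pool selection
```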
4. Metrics you’ll watch
- Average reward / return – aggregated from `OutcomeReward` values; track this for overall progress (see the sketch after this list).
- Event frequencies – how often key achievements fire (`EventReward`). Useful when tuning curricula or rejection filters.
- Policy health – token usage, invalid action rate, latency (all logged automatically when the task app emits structured data).
- Checkpoint quality – every job writes checkpoints you can evaluate or export. Use evaluation configs to compare them against baselines.
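The first two metrics are easy to recompute yourself from a JSONL export. The sketch below assumes per-trajectory records with `outcome_reward` and `events` fields, which is an assumption about the export format rather than its documented schema.

```python
# Illustrative post-processing of a trajectory export; field names are assumptions.
import json
from collections import Counter
from pathlib import Path


def summarize(jsonl_path: str) -> dict:
    """Average outcome reward and event-frequency counts across exported trajectories."""
    rewards: list[float] = []
    event_counts: Counter = Counter()
    for line in Path(jsonl_path).read_text().splitlines():
        if not line.strip():
            continue
        trajectory = json.loads(line)
        rewards.append(trajectory.get("outcome_reward", 0.0))
        for event in trajectory.get("events", []):
            event_counts[event.get("label", "unknown")] += 1
    return {
        "avg_reward": sum(rewards) / len(rewards) if rewards else 0.0,
        "event_frequencies": dict(event_counts),
    }
```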
5. Best practices
- Start with short runs on smaller pools (A10G) to validate reward signals, then graduate to H100 for longer training.
- Enable tracing from day one so you can bootstrap rejection finetuning without rerunning episodes.
- Pair RL configs with evaluation configs; run evals after major checkpoints to catch regressions quickly.
- Use idempotency keys when automating job submissions to avoid launching duplicates (a sketch follows this list).
- Keep your task app slim—mount large assets from object storage rather than bundling them into the container.
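For the idempotency-key tip, one workable pattern is to derive the key from the exact config bytes, so an automated retry of the same submission reuses the same key. The endpoint, header name, and payload below are placeholders, not the documented Synth API.

```python
# Illustrative: derive a stable idempotency key from the config contents.
# The endpoint shape and header name are placeholders, not the documented Synth API.
import hashlib
from pathlib import Path

import requests


def submit_job(config_path: str, api_url: str, api_key: str) -> requests.Response:
    config_bytes = Path(config_path).read_bytes()
    # Same config bytes -> same key, so an automated retry cannot launch a duplicate job.
    idempotency_key = hashlib.sha256(config_bytes).hexdigest()
    return requests.post(
        api_url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Idempotency-Key": idempotency_key,   # placeholder header name
        },
        files={"config": ("config.toml", config_bytes)},
    )
```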