Lifecycle at a glance
- Ship a rollout-ready task app – define actions, rewards, and judges so the trainer can call `/rollout`. → See Task app architecture.
- Deploy and verify the task app – run locally for smoke tests, then deploy to Modal for scalable runs. → See Deploying task apps.
- Author an RL config – specify the algorithm (`gspo`), model, trainer topology, references, and evaluation cadence; a minimal sketch follows this list. → See Training jobs.
- Launch and monitor the job – `uvx synth-ai train --type rl` uploads configs, creates the job, and polls status.
- Evaluate & iterate – run rollouts (`uvx synth-ai eval`) to benchmark checkpoints or create more SFT data. → See Rollouts & evaluation.
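The full TOML schema lives in the Training jobs reference; as a rough sketch of the shape such a config can take (apart from `gspo`, the model name, and `training.async_semaphore_max`, every key below is an illustrative assumption, not documented schema):

```toml
# Illustrative configs/rl.toml sketch. Only `gspo`, the model name, and
# training.async_semaphore_max appear on this page; treat all other keys
# as assumptions and check the Training jobs reference for the real schema.
[algorithm]
type = "gspo"

[model]
base = "Qwen/Qwen3-4B-Thinking-2507"

[training]
async_semaphore_max = 8      # rollout concurrency; see the model-selection notes below

[evaluation]
every_n_iters = 10           # evaluation cadence (hypothetical key)
```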
Key commands
| Action | Command |
|---|---|
| Serve task app locally | `uvx synth-ai serve your-task-id --port 8001` |
| Deploy to Modal | `uvx synth-ai deploy your-task-id --name my-task-app` |
| Launch RL training | `uvx synth-ai train --type rl --config configs/rl.toml` |
| Run rollouts / eval | `uvx synth-ai eval --app-id your-task-id --model ft:...` |
| Retrieve job status | `curl -H "Authorization: Bearer $SYNTH_API_KEY" https://agent-learning.onrender.com/api/learning/jobs/<job_id>` |
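For unattended runs you can wrap the status call above in a small polling loop. A minimal sketch, assuming the endpoint returns JSON with a top-level `status` field (the terminal state names here are assumptions, not documented values):

```bash
# Poll job status every 30s until the job reaches a terminal state.
# Assumes the endpoint returns JSON with a top-level "status" field.
JOB_ID="<job_id>"   # substitute your real job id
while true; do
  STATUS=$(curl -s -H "Authorization: Bearer $SYNTH_API_KEY" \
    "https://agent-learning.onrender.com/api/learning/jobs/$JOB_ID" | jq -r '.status')
  echo "$(date -u +%H:%M:%S) status=$STATUS"
  case "$STATUS" in
    succeeded|failed|cancelled) break ;;   # assumed terminal states
  esac
  sleep 30
done
```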
A `.env` file is written by `uvx synth-ai setup`. Use `--env-file` to override it; the flag can be passed multiple times when you need to layer secrets, as shown below.
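```bash
# Layer secrets on top of the generated .env. Later --env-file flags are
# assumed to take precedence; confirm the ordering with your CLI version.
uvx synth-ai train --type rl --config configs/rl.toml \
  --env-file .env \
  --env-file .env.secrets
```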
Model selection: instruct vs reasoning SKUs
- Reasoning/Thinking models (e.g. `Qwen/Qwen3-4B-Thinking-2507`) tend to converge more reliably because their prompts include explicit deliberation windows. They tolerate higher rollout concurrency (`training.async_semaphore_max`) without starving the trainer.
- Instruct variants can work, but empirically they need a smaller semaphore (often 2–4 concurrent rollouts) and shorter batches to stay stable. With a large semaphore they blow through episodes so quickly that the trainer backlog explodes, which accelerates collapse into short, deterministic answers. Lower `training.async_semaphore_max` in your TOML (as sketched below) and watch judge scores closely when running instruct models.
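For an instruct run, the only change is the concurrency knob. A minimal sketch, assuming `async_semaphore_max` sits under a `[training]` table in your RL TOML:

```toml
[training]
# Instruct models: keep rollout concurrency low (2-4) so the trainer
# backlog stays bounded; reasoning models tolerate higher values.
async_semaphore_max = 3
```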