Reinforcement learning jobs train your policy online against a live task app. Synth provisions the trainer cluster, coordinates rollouts, and streams metrics/events back to the CLI. For CLI deep dives, refer to serve, modal-serve, deploy, train, and eval.

Lifecycle at a glance

  1. Ship a rollout-ready task app – define actions, rewards, and judges so the trainer can call /rollout.
    → See Task app architecture.
  2. Deploy and verify the task app – run locally for smoke tests, then deploy to Modal for scalable runs.
    → See Deploying task apps.
  3. Author an RL config – specify the algorithm (gspo), model, trainer topology, references, and evaluation cadence (a sketch follows this list).
    → See Training jobs.
  4. Launch and monitor the job – uvx synth-ai train --type rl uploads configs, creates the job, and polls status.
  5. Evaluate & iterate – run rollouts (uvx synth-ai eval) to benchmark checkpoints or create more SFT data.
    → See Rollouts & evaluation.
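
To accompany step 3, here is a minimal sketch of what an RL config can look like. Only the gspo algorithm name, the Qwen/Qwen3-4B-Thinking-2507 model, and training.async_semaphore_max appear elsewhere on this page; every other section and key name is an illustrative assumption, so treat the Training jobs reference as the source of truth for the exact schema.

```toml
# configs/rl.toml (illustrative sketch; section/key names other than
# training.async_semaphore_max are assumptions, not the authoritative schema)

[algorithm]
type = "gspo"                            # algorithm named in step 3

[model]
base = "Qwen/Qwen3-4B-Thinking-2507"     # reasoning SKU from the model-selection notes

[training]
async_semaphore_max = 8                  # rollout concurrency; see model-selection notes below

[evaluation]
every_n_iters = 10                       # hypothetical key for evaluation cadence
```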

Key commands

Action | Command
Serve task app locally | uvx synth-ai serve your-task-id --port 8001
Deploy to Modal | uvx synth-ai deploy your-task-id --name my-task-app
Launch RL training | uvx synth-ai train --type rl --config configs/rl.toml
Run rollouts / eval | uvx synth-ai eval --app-id your-task-id --model ft:...
Retrieve job status | curl -H "Authorization: Bearer $SYNTH_API_KEY" https://agent-learning.onrender.com/api/learning/jobs/<job_id>
RL builds on the same primitives as SFT (task apps, rollouts, and CLI configs) but adds trainer-specific settings such as GPU topologies, reference models, and judge pipelines.

Tip: the CLI auto-loads any .env written by uvx synth-ai setup. Use --env-file for overrides; pass the flag multiple times when you need to layer secrets.
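The trainer-specific settings named above (GPU topologies, reference models, judge pipelines) might look roughly like the fragment below. Every section and key here ([topology], [reference], [judge], and their fields) is a hypothetical placeholder meant to show the shape of the config, not the real schema; the Training jobs reference documents the actual keys.

```toml
# Illustrative placeholders only; consult the Training jobs reference for real key names.
[topology]
gpus = 4                                 # trainer GPU topology

[reference]
model = "Qwen/Qwen3-4B-Thinking-2507"    # reference model used during the policy update

[judge]
pipeline = "default"                     # judge pipeline that scores rollouts
```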

Model selection: instruct vs reasoning SKUs

  • Reasoning/Thinking models (e.g. Qwen/Qwen3-4B-Thinking-2507) tend to converge more reliably because their prompts include explicit deliberation windows. They tolerate higher rollout concurrency (training.async_semaphore_max) without starving the trainer.
  • Instruct variants can work, but empirically need a smaller semaphore (often 2–4 concurrent rollouts) and shorter batches to stay stable. With a large semaphore they blow through episodes so quickly that the trainer backlog explodes, which accelerates collapse into short, deterministic answers. Drop training.async_semaphore_max in your TOML and watch judge scores closely when running instruct models.
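
A minimal sketch of that adjustment, assuming a [training] table like the one in the earlier config sketch; training.async_semaphore_max is the key cited above, and the value sits in the 2–4 range this section recommends for instruct variants.

```toml
# Keep rollout concurrency low for instruct models to avoid trainer backlog and collapse.
[training]
async_semaphore_max = 3   # within the 2-4 concurrent rollouts suggested above
```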