Reinforcement learning jobs train your policy online against a live task app. Synth provisions the trainer cluster, coordinates rollouts, and streams metrics/events back to the CLI. For CLI deep dives, refer to serve, modal-serve, deploy, train, and eval.

Lifecycle at a glance

  1. Ship a rollout-ready task app – define actions, rewards, and judges so the trainer can call /rollout.
    → See Task app architecture.
  2. Deploy and verify the task app – run locally for smoke tests, then deploy to Modal for scalable runs.
    → See Deploying task apps.
  3. Author an RL config – specify the algorithm (gspo), model, trainer topology, references, and evaluation cadence (a sketch follows this list).
    → See Training jobs.
  4. Launch and monitor the job – uvx synth-ai train --type rl uploads configs, creates the job, and polls status.
  5. Evaluate & iterate – run rollouts (uvx synth-ai eval) to benchmark checkpoints or create more SFT data.
    → See Rollouts & evaluation.
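
To accompany step 3, here is a minimal sketch of what an RL config can look like. Only the gspo algorithm name, the Qwen/Qwen3-4B-Thinking-2507 model, and training.async_semaphore_max appear elsewhere on this page; every other section and key name is an illustrative assumption, so treat the Training jobs reference as the source of truth for the exact schema.

```toml
# configs/rl.toml (illustrative sketch; section/key names other than
# training.async_semaphore_max are assumptions, not the authoritative schema)

[algorithm]
type = "gspo"                            # algorithm named in step 3

[model]
base = "Qwen/Qwen3-4B-Thinking-2507"     # reasoning SKU from the model-selection notes

[training]
async_semaphore_max = 8                  # rollout concurrency; see model-selection notes below

[evaluation]
every_n_iters = 10                       # hypothetical key for evaluation cadence
```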

Key commands

Action | Command
Serve task app locally | uvx synth-ai serve your-task-id --port 8001
Deploy to Modal | uvx synth-ai deploy your-task-id --name my-task-app
Launch RL training | uvx synth-ai train --type rl --config configs/rl.toml
Run rollouts / eval | uvx synth-ai eval --app-id your-task-id --model ft:...
Retrieve job status | curl -H "Authorization: Bearer $SYNTH_API_KEY" https://agent-learning.onrender.com/api/learning/jobs/<job_id>
RL builds on the same primitives as SFT (task apps, rollouts, and CLI configs) but adds trainer-specific settings such as GPU topologies, reference models, and judge pipelines.

Tip: the CLI auto-loads any .env written by uvx synth-ai setup. Use --env-file for overrides; pass the flag multiple times when you need to layer secrets.
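The trainer-specific settings named above (GPU topologies, reference models, judge pipelines) might look roughly like the fragment below. Every section and key here ([topology], [reference], [judge], and their fields) is a hypothetical placeholder meant to show the shape of the config, not the real schema; the Training jobs reference documents the actual keys.

```toml
# Illustrative placeholders only; consult the Training jobs reference for real key names.
[topology]
gpus = 4                                 # trainer GPU topology

[reference]
model = "Qwen/Qwen3-4B-Thinking-2507"    # reference model used during the policy update

[judge]
pipeline = "default"                     # judge pipeline that scores rollouts
```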

Model selection: instruct vs reasoning SKUs

  • Reasoning/Thinking models (e.g. Qwen/Qwen3-4B-Thinking-2507) tend to converge more reliably because their prompts include explicit deliberation windows. They tolerate higher rollout concurrency (training.async_semaphore_max) without starving the trainer.
  • Instruct variants can work, but empirically need a smaller semaphore (often 2–4 concurrent rollouts) and shorter batches to stay stable. With a large semaphore they blow through episodes so quickly that the trainer backlog explodes, which accelerates collapse into short, deterministic answers. Drop training.async_semaphore_max in your TOML and watch judge scores closely when running instruct models.
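
A minimal sketch of that adjustment, assuming a [training] table like the one in the earlier config sketch; training.async_semaphore_max is the key cited above, and the value sits in the 2–4 range this section recommends for instruct variants.

```toml
# Keep rollout concurrency low for instruct models to avoid trainer backlog and collapse.
[training]
async_semaphore_max = 3   # within the 2-4 concurrent rollouts suggested above
```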