RL jobs consume TOML configs that describe the algorithm, model, trainer topology, rollout batch sizes, and evaluation cadence. The CLI converts this into a job payload and submits it to Synth. For command-line options and examples, see Launch Training Jobs.

1. Start from an example config

The repository ships several RL examples (e.g. examples/warming_up_to_rl/configs/rl_from_base_qwen4b.toml). Key sections:
[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[services]
task_url = "https://my-task-app.modal.run"

[model]
base = "Qwen/Qwen3-4B"           # or source = "ft:..."
label = "crafter-rl-from-base"   # used for trainer logging

[compute]
gpu_type = "H100"
gpu_count = 8

[topology]
type = "single_node_split"
gpus_for_vllm = 4
gpus_for_training = 3
gpus_for_ref = 1

[training]
num_epochs = 1
iterations_per_epoch = 10
batch_size = 16
group_size = 4
learning_rate = 5e-5
step_rewards_enabled = true
step_rewards_mode = "decision_stepwise"

[evaluation]
instances = 10
every_n_iters = 10
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Make sure services.task_url points to a running task app and that the environment exposes /health and /task_info.
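Before launching, a quick smoke test confirms both endpoints respond. A sketch, assuming the key is sent as an X-API-Key header — adjust to whatever auth scheme your task app actually expects for ENVIRONMENT_API_KEY:
# ASSUMPTION: the task app reads the key from an X-API-Key header.
curl -fsS -H "X-API-Key: $ENVIRONMENT_API_KEY" https://my-task-app.modal.run/health
curl -fsS -H "X-API-Key: $ENVIRONMENT_API_KEY" https://my-task-app.modal.run/task_info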

Vision-language (Qwen3-VL)

Qwen3-VL checkpoints run the same GSPO loop, but the config and your task app must both let images flow through. Start from examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml and note the following:
  • Model stanza – point name (or base) at a Qwen/Qwen3-VL-* SKU. The helper will inject supports_vision = true, max_images_per_message = 1, and the correct vision_processor_id. If you write the block by hand, set those fields explicitly so the trainer loads the vision processor and keeps LoRA targets (e.g. mm_projector); see the sketch after this list.
  • Rollout stanza – set use_vision = true so policy prompts consume image segments. max_images_per_step defaults to the model limit (one image per turn) and extra images are dropped. Enable image_only_mode = true when your environment produces images without accompanying text.
  • Thinking variants – if you select a -Thinking SKU, populate thinking_mode = "think" (or another supported mode) plus an optional thinking_budget so the collector forwards the hint to inference.
  • Task app contract – every policy prompt must include images using the OpenAI multimodal schema. A typical turn looks like:
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is happening in the arena?"},
        {"type": "image_url", "image_url": {"url": "https://assets.example.com/frame_042.png"}}
      ]
    }
    
    URLs must be reachable from both the rollout actors and the trainer (HTTPS, signed URLs, or data:image/...;base64, blobs). The server enforces the same max_images_per_message cap as SFT.
  • Evaluation / deployment – reuse the same [model] and [rollout] blocks when spawning eval jobs or Modal deployments so the vision processor is materialised in each runtime.
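Putting the bullets together, a hand-written vision config might look like the sketch below. The field names come from the list above; the SKU string and commented values are illustrative placeholders — copy the real ones from examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml.
[model]
base = "Qwen/Qwen3-VL-4B"          # illustrative; any Qwen/Qwen3-VL-* SKU
supports_vision = true
max_images_per_message = 1
# vision_processor_id = "..."      # copy the value for your SKU from the example config

[rollout]
use_vision = true                  # policy prompts consume image segments
max_images_per_step = 1            # defaults to the model limit; extras are dropped
# image_only_mode = true           # enable when the environment emits images without text
# thinking_mode = "think"          # -Thinking SKUs only
# thinking_budget = 256            # optional; forwarded to inference as a hint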

2. Launch the job

uvx synth-ai train \
  --type rl \
  --config configs/rl.toml \
  --task-url https://my-task-app.modal.run
Steps performed by the CLI:
  1. Validate the TOML (RLConfig) and ensure a model is selected (source or base).
  2. Verify the task app via /rl/verify_task_app (trying each available API key).
  3. Run a health check against /health and /task_info using ENVIRONMENT_API_KEY.
  4. Create and start the job; by default the CLI polls status until it reaches a terminal state.
Useful overrides:
  • --model – swap in a different base or fine-tuned model without touching the file.
  • --backend – point at a dev backend.
  • --idempotency – provide an Idempotency-Key header (safe retries).
  • --allow-experimental – enable experimental base models.
  • --no-poll / --poll-timeout / --poll-interval – control how the CLI waits for completion.
Reminder: the CLI auto-loads the .env from uvx synth-ai setup. Use --env-file when you need to override secrets (you can supply the flag multiple times to merge values).
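For example, a launch combining several of these flags (the model name, idempotency key, and file paths are placeholders):
uvx synth-ai train \
  --type rl \
  --config configs/rl.toml \
  --model Qwen/Qwen3-4B \
  --no-poll \
  --idempotency "rl-crafter-run-001" \
  --env-file .env --env-file .env.local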

3. Monitor progress

During polling you will see status updates plus streaming metrics (mean return, judge scores). If you disable polling, query the job directly:
curl -H "Authorization: Bearer $SYNTH_API_KEY" \
  https://api.usesynth.ai/api/learning/jobs/<job_id>
The response includes:
  • status – running, succeeded, failed, etc.
  • linked_job_id – points at the orchestration job streaming events.
  • fine_tuned_model – populated when a checkpoint is ready for evaluation.
For richer logs you can call:
curl -H "Authorization: Bearer $SYNTH_API_KEY" \
  "https://api.usesynth.ai/api/learning/jobs/<job_id>/events?since_seq=0&limit=200"

4. Automate the loop

  1. uvx synth-ai train --type rl --config ... --no-poll
  2. Poll job status via the API until status is terminal.
  3. Grab the fine_tuned_model.
  4. Run uvx synth-ai eval --model <fine_tuned_model> to benchmark.
  5. Feed the resulting traces into SFT or the next RL iteration.
With a reproducible task app and TOML, agents can iterate continuously across collect → train → evaluate cycles.
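A minimal sketch of that loop in shell, assuming jq is installed, JOB_ID was captured from the --no-poll launch, and the terminal statuses match those listed in section 3 (the cancelled string is an assumption):
#!/usr/bin/env bash
set -euo pipefail

API="https://api.usesynth.ai/api/learning/jobs/$JOB_ID"

# 2. Poll until the job reaches a terminal state.
while true; do
  STATUS=$(curl -s -H "Authorization: Bearer $SYNTH_API_KEY" "$API" | jq -r '.status')
  case "$STATUS" in
    succeeded|failed|cancelled) break ;;
    *) sleep 30 ;;
  esac
done
[ "$STATUS" = "succeeded" ] || { echo "job ended with status: $STATUS" >&2; exit 1; }

# 3. Grab the checkpoint name.
MODEL=$(curl -s -H "Authorization: Bearer $SYNTH_API_KEY" "$API" | jq -r '.fine_tuned_model')

# 4. Benchmark it; the resulting traces feed the next SFT or RL iteration.
uvx synth-ai eval --model "$MODEL"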