1. Start from an example config
The repository ships several RL examples (e.g. `examples/warming_up_to_rl/configs/rl_from_base_qwen4b.toml`). Key sections:

- `services.task_url` points at a running task app; make sure that environment exposes `/health` and `/task_info`.
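A minimal sketch of that stanza (the URL is a placeholder for your deployed task app; the other sections of the example file stay as shipped):

```toml
# Point the trainer at your task app; replace the URL with your deployment.
[services]
task_url = "https://my-task-app.example.com"
```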
Vision-language (Qwen3-VL)
Qwen3-VL checkpoints run the same GSPO loop, but you must allow images to flow through the config and your task app. Start from `examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml` and note the following:
- Model stanza – point `name` (or `base`) at a `Qwen/Qwen3-VL-*` SKU. The helper will inject `supports_vision = true`, `max_images_per_message = 1`, and the correct `vision_processor_id`. If you write the block by hand, set those fields explicitly so the trainer loads the vision processor and keeps LoRA targets (e.g. `mm_projector`). See the config sketch after this list.
- Rollout stanza – set `use_vision = true` so policy prompts consume image segments. `max_images_per_step` defaults to the model limit (one image per turn) and extra images are dropped. Enable `image_only_mode = true` when your environment produces images without accompanying text.
- Thinking variants – if you select a `-Thinking` SKU, populate `thinking_mode = "think"` (or another supported mode) plus an optional `thinking_budget` so the collector forwards the hint to inference.
- Task app contract – every policy prompt must include images using the OpenAI multimodal schema. A typical turn looks like:
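  A minimal sketch, assuming the standard OpenAI content-parts layout; the prompt text and image URL are placeholders:

  ```json
  {
    "role": "user",
    "content": [
      {"type": "text", "text": "Here is the current observation. Choose the next action."},
      {"type": "image_url", "image_url": {"url": "https://example.com/frames/step_0001.png"}}
    ]
  }
  ```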
  URLs must be reachable from both the rollout actors and the trainer (HTTPS, signed URLs, or `data:image/...;base64,` blobs). The server enforces the same `max_images_per_message` cap as SFT.
- Evaluation / deployment – reuse the same `[model]` and `[rollout]` blocks when spawning eval jobs or Modal deployments so the vision processor is materialised in each runtime.
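A rough sketch of the vision fields described above, assuming they live in the `[model]` and `[rollout]` blocks named in that config; the SKU, limits, and placement of the thinking keys are illustrative, not authoritative:

```toml
# Vision-enabled model block; mirrors the fields the helper would inject.
[model]
name = "Qwen/Qwen3-VL-4B-Instruct"     # any Qwen/Qwen3-VL-* SKU
supports_vision = true
max_images_per_message = 1
# vision_processor_id = "..."          # injected by the helper; set it explicitly if writing this block by hand

[rollout]
use_vision = true
max_images_per_step = 1                # defaults to the model limit; extra images are dropped
image_only_mode = false                # set true when observations carry no accompanying text
# For -Thinking SKUs (key placement here is an assumption):
# thinking_mode = "think"
# thinking_budget = 256
```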
2. Launch the job
- Validate the TOML (`RLConfig`) and ensure a model is selected (`source` or `base`).
- Verify the task app via `/rl/verify_task_app` (multiple API key attempts).
- Run a health check against `/health` and `/task_info` using `ENVIRONMENT_API_KEY` (see the sketch below).
- Create and start the job; by default the CLI polls status until it reaches a terminal state.
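If you want to run that health check by hand, a rough sketch (the `TASK_APP_URL` variable and the `X-API-Key` header are assumptions; check your task app for its actual auth scheme):

```bash
# Illustrative only: the header used to pass ENVIRONMENT_API_KEY may differ in your deployment.
curl -sf -H "X-API-Key: $ENVIRONMENT_API_KEY" "$TASK_APP_URL/health"
curl -sf -H "X-API-Key: $ENVIRONMENT_API_KEY" "$TASK_APP_URL/task_info"
```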
- `--model` – swap in a different base or fine-tuned model without touching the file.
- `--backend` – point at a dev backend.
- `--idempotency` – provide an `Idempotency-Key` header (safe retries).
- `--allow-experimental` – enable experimental base models.
- `--no-poll` / `--poll-timeout` / `--poll-interval` – control how the CLI waits for completion.
By default the CLI reads the `.env` produced by `uvx synth-ai setup`. Use `--env-file` when you need to override secrets (you can supply the flag multiple times to merge values).
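A typical invocation might look like this (the poll values are illustrative):

```bash
# Launch the RL job against the example config; secrets come from the supplied .env file.
uvx synth-ai train --type rl \
  --config examples/warming_up_to_rl/configs/rl_from_base_qwen4b.toml \
  --env-file .env \
  --poll-timeout 7200 --poll-interval 30
```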
3. Monitor progress
During polling you will see status updates plus streaming metrics (mean return, judge scores). If you disable polling, watch these fields on the job:

- `status` – `running`, `succeeded`, `failed`, etc.
- `linked_job_id` – points at the orchestration job streaming events.
- `fine_tuned_model` – populated when a checkpoint is ready for evaluation.
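Purely as a hypothetical illustration of those fields (the real response shape depends on the backend API):

```json
{
  "status": "succeeded",
  "linked_job_id": "job_abc123",
  "fine_tuned_model": "ft:qwen3-4b:example-run"
}
```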
4. Automate the loop
- Launch with `uvx synth-ai train --type rl --config ... --no-poll` (scripted end to end in the sketch below).
- Poll job status via the API until `status` is terminal.
- Grab the `fine_tuned_model`.
- Run `uvx synth-ai eval --model <fine_tuned_model>` to benchmark.
- Feed the resulting traces into SFT or the next RL iteration.
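A minimal sketch of that loop; the polling step is left abstract because the job API endpoint and response shape depend on your backend:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Launch without blocking on the CLI poller.
uvx synth-ai train --type rl \
  --config examples/warming_up_to_rl/configs/rl_from_base_qwen4b.toml \
  --no-poll

# 2. Poll your backend's job API until `status` is terminal, then read
#    `fine_tuned_model` from the job record (endpoint intentionally omitted here).
FINE_TUNED_MODEL="ft:placeholder-model-id"

# 3. Benchmark the checkpoint, then feed the traces into SFT or the next RL iteration.
uvx synth-ai eval --model "$FINE_TUNED_MODEL"
```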