1. Create the Config TOML for Your Task App
Create a TOML file that follows the schema documented in the RL config reference.

Vision-language (Qwen3-VL)

Qwen3-VL checkpoints run the same GSPO loop, but you must allow images to flow through the config and your task app. Start from `examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml` and note the following (a minimal config sketch follows the list):
- Model stanza – point `name` (or `base`) at a `Qwen/Qwen3-VL-*` SKU. The helper will inject `supports_vision = true`, `max_images_per_message = 1`, and the correct `vision_processor_id`. If you write the block by hand, set those fields explicitly so the trainer loads the vision processor and keeps LoRA targets (e.g. `mm_projector`).
- Rollout stanza – set `use_vision = true` so policy prompts consume image segments. `max_images_per_step` defaults to the model limit (one image per turn) and extra images are dropped. Enable `image_only_mode = true` when your environment produces images without accompanying text.
- Thinking variants – if you select a `-Thinking` SKU, populate `thinking_mode = "think"` (or another supported mode) plus an optional `thinking_budget` so the collector forwards the hint to inference.
- Task app contract – every policy prompt must include images using the OpenAI multimodal schema. A typical turn looks like:
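  For example, a single user turn might be shaped like this (the prompt text and image URL are placeholders):

  ```json
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "Observation for step 42. Choose the next action." },
      {
        "type": "image_url",
        "image_url": { "url": "https://example.com/frames/step_42.png" }
      }
    ]
  }
  ```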
  URLs must be reachable from both the rollout actors and the trainer (HTTPS, signed URLs, or `data:image/...;base64,` blobs). The server enforces the same `max_images_per_message` cap as SFT.
- Evaluation / deployment – reuse the same `[model]` and `[rollout]` blocks when spawning eval jobs or Modal deployments so the vision processor is materialised in each runtime.
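Putting the notes above together, a hand-written vision config might look roughly like this sketch. The SKU, processor id, and numeric values are illustrative assumptions, and the placement of the thinking fields is also an assumption; treat the example config referenced above as the source of truth.

```toml
[model]
base = "Qwen/Qwen3-VL-4B-Instruct"   # any Qwen/Qwen3-VL-* SKU; name = "..." also works
supports_vision = true               # injected by the helper; set explicitly if hand-written
max_images_per_message = 1
vision_processor_id = "Qwen/Qwen3-VL-4B-Instruct"  # assumption: processor id matches the base model

[rollout]
use_vision = true            # policy prompts consume image segments
max_images_per_step = 1      # defaults to the model limit; extra images are dropped
image_only_mode = false      # set true when observations carry no accompanying text

# For -Thinking SKUs (field placement is an assumption):
# thinking_mode = "think"
# thinking_budget = 512
```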
2. Launch the Job
`--task-url` must point at the live Modal deployment you created with `uvx synth-ai deploy --runtime modal`; the CLI verifies that endpoint before submitting the job.
Steps performed by the CLI:
- Validate the TOML (`RLConfig`) and ensure a model is selected (`source` or `base`).
- Verify the Modal task app via `/rl/verify_task_app` (multiple API key attempts).
- Run a health check against `/health` and `/task_info` using `ENVIRONMENT_API_KEY`.
- Create and start the job; by default the CLI polls status until it reaches a terminal state.
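To reproduce the verification steps by hand before submitting, a rough curl sketch follows; the deployment URL is a placeholder and the `X-API-Key` header name is an assumption, so mirror however your task app actually reads `ENVIRONMENT_API_KEY`:

```bash
# Placeholder Modal deployment URL.
TASK_URL="https://your-app--task.modal.run"

# Endpoints the CLI checks before creating the job.
curl -fsS -H "X-API-Key: $ENVIRONMENT_API_KEY" "$TASK_URL/health"
curl -fsS -H "X-API-Key: $ENVIRONMENT_API_KEY" "$TASK_URL/task_info"
```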
Useful flags:

- `--model` – swap in a different base or fine-tuned model without touching the file.
- `--idempotency` – provide an Idempotency-Key header (safe retries).
- `--allow-experimental` – enable experimental base models.
- `--no-poll` / `--poll-timeout` / `--poll-interval` – control how the CLI waits for completion.
By default, secrets are loaded from the `.env` created by `uvx synth-ai setup`. Use `--env-file` when you need to override secrets (you can supply the flag multiple times to merge values).
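A full invocation might look roughly like the sketch below. The `train` subcommand and `--config` flag are assumptions (check `uvx synth-ai --help` for the exact launch command); the remaining flags are the ones documented above, and the URL and paths are placeholders:

```bash
# Sketch only: subcommand and --config are assumptions, other flags are documented above.
uvx synth-ai train \
  --config examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml \
  --task-url https://your-app--task.modal.run \
  --env-file .env.overrides
```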