RL jobs consume TOML configs that describe the algorithm, model, trainer topology, rollout batch sizes, and evaluation cadence. The CLI converts the config into a job payload and submits it to Synth. Under the hood, Synth implements PipelineRL (Piché et al. (2025). “PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation.” arXiv:2509.19128), which runs asynchronous data generation concurrently with model training and applies in-flight weight updates, achieving approximately 2x faster learning while keeping the training data highly on-policy. For command-line options and examples, see Launch Training Jobs.

1. Create the Config TOML for Your Task App

Create a TOML file that follows the schema documented in the RL config reference.
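For orientation, a minimal config might be shaped like the sketch below. This is a hedged illustration only: the stanza and field names ([algorithm], [model], [trainer], [rollout], [evaluation]) are assumptions inferred from the description above, and the RL config reference remains the authoritative schema.

# Hypothetical RL config sketch -- stanza and field names are assumed;
# check the RL config reference for the real schema.

[algorithm]
type = "gspo"              # assumed field; this page mentions a GSPO loop

[model]
base = "Qwen/Qwen3-4B"     # a base model, or use `source` for a fine-tuned checkpoint

[trainer]
# trainer topology (GPU layout, parallelism) is configured here

[rollout]
batch_size = 64            # rollout batch size (illustrative value)

[evaluation]
interval = 100             # evaluation cadence in training steps (illustrative)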

Vision-language (Qwen3-VL)

Qwen3-VL checkpoints run the same GSPO loop, but you must enable image support in both the config and your task app. Start from examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml and note the following (a combined sketch follows this list):
  • Model stanza – point name (or base) at a Qwen/Qwen3-VL-* SKU. The helper will inject supports_vision = true, max_images_per_message = 1, and the correct vision_processor_id. If you write the block by hand, set those fields explicitly so the trainer loads the vision processor and keeps LoRA targets (e.g. mm_projector).
  • Rollout stanza – set use_vision = true so policy prompts consume image segments. max_images_per_step defaults to the model limit (one image per turn) and extra images are dropped. Enable image_only_mode = true when your environment produces images without accompanying text.
  • Thinking variants – if you select a -Thinking SKU, populate thinking_mode = "think" (or another supported mode) plus an optional thinking_budget so the collector forwards the hint to inference.
  • Task app contract – every policy prompt must include images using the OpenAI multimodal schema. A typical turn looks like:
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is happening in the arena?"},
        {"type": "image_url", "image_url": {"url": "https://assets.example.com/frame_042.png"}}
      ]
    }
    
    URLs must be reachable from both the rollout actors and the trainer (HTTPS, signed URLs, or data:image/...;base64, blobs). The server enforces the same max_images_per_message cap as SFT.
  • Evaluation / deployment – reuse the same [model] and [rollout] blocks when spawning eval jobs or Modal deployments so the vision processor is materialised in each runtime.
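Putting these pieces together, the vision-related parts of the config might look like the sketch below. The field names (supports_vision, max_images_per_message, vision_processor_id, use_vision, max_images_per_step, image_only_mode, thinking_mode, thinking_budget) are taken from the notes above, but the exact stanza placement is an assumption; compare against examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml before relying on it.

# Sketch of the vision-related fields described above; stanza layout is assumed.

[model]
name = "Qwen/Qwen3-VL-4B-Thinking"   # hypothetical SKU; any Qwen/Qwen3-VL-* works
supports_vision = true                # injected by the helper, or set by hand
max_images_per_message = 1
# vision_processor_id = "..."         # set explicitly when writing the block by hand

[rollout]
use_vision = true                     # policy prompts consume image segments
max_images_per_step = 1               # defaults to the model limit; extra images are dropped
image_only_mode = false               # set true when the env emits images without text
thinking_mode = "think"               # -Thinking SKUs only; stanza placement assumed
thinking_budget = 512                 # optional hint forwarded to inference (illustrative)

As noted above, reusing the same [model] and [rollout] blocks for eval jobs and Modal deployments keeps the vision processor available in every runtime.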

2. Launch the Job

uvx synth-ai train \
  --type rl \
  --config configs/rl.toml \
  --task-url https://my-task-app.modal.run
--task-url must point at the live Modal deployment you created with uvx synth-ai deploy --runtime modal; the CLI verifies that endpoint before submitting the job.
The CLI then performs the following steps:
  1. Validate the TOML (RLConfig) and ensure a model is selected (source or base).
  2. Verify the Modal task app via /rl/verify_task_app (attempting multiple API keys if needed).
  3. Run a health check against /health and /task_info using ENVIRONMENT_API_KEY.
  4. Create and start the job; by default the CLI polls status until it reaches a terminal state.
Useful overrides:
  • --model – swap in a different base or fine-tuned model without touching the file.
  • --idempotency – provide an Idempotency-Key header so retried submissions are safe.
  • --allow-experimental – enable experimental base models.
  • --no-poll / --poll-timeout / --poll-interval – control how the CLI waits for completion.
Reminder: the CLI auto-loads the .env file written by uvx synth-ai setup. Use --env-file when you need to override secrets; the flag can be supplied multiple times to merge values.

3. Monitor Progress

During polling you will see status updates plus streaming metrics (mean return, judge scores). If you disable polling, query the job status endpoint directly:
curl -H "Authorization: Bearer $SYNTH_API_KEY" \
  https://agent-learning.onrender.com/api/rl/jobs/<job_id>
See RL Training for complete training documentation.