Reinforcement learning with Qwen3-VL combines the standard GSPO trainer with multimodal prompts. This page covers the extra wiring needed to push image observations through rollouts, inference, and training. For CLI usage in this workflow, see Launch Training Jobs, Deploy Task Apps, and Run Evaluations.

Supported models

Any Qwen3-VL checkpoint (2B–235B) works with the RL stack. The registry descriptor (backend/app/routes/simple_training/model_families/qwen3_vl.py) adds:
  • supports_vision = true
  • max_images_per_message = 1
  • LoRA projector targets (mm_projector, attention/MLP layers)
These fields are injected into the RL config automatically by the helper (backend/app/routes/clustered_training/core/algorithms/gspo/app_helpers.py) when you pick a VL model.
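For illustration, a minimal sketch of that injection step. The constant and helper names below are hypothetical (the real logic lives in app_helpers.py); only the field names mirror the bullets above.

# Hypothetical sketch of the auto-injection described above; the actual helper
# may differ. Field names match the registry descriptor.
QWEN3_VL_FIELDS = {
    "supports_vision": True,
    "max_images_per_message": 1,
}

def inject_vision_fields(config: dict, model_name: str) -> dict:
    # Only VL checkpoints get the vision fields copied into the [model] block.
    if "qwen3-vl" in model_name.lower():
        config.setdefault("model", {}).update(QWEN3_VL_FIELDS)
    return config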

Task app requirements

Use the Crafter policy (examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/policy.py) as a template:
  • Detect VL models via model_name and set use_vision = True.
  • Include the observation image as a data URL (or HTTPS URL) inside the user message:
user_content = [
    {"type": "text", "text": state_summary},
    {"type": "image_url", "image_url": {"url": observation["observation_image_data_url"]}},
]
  • Support image_only_mode to send image segments without accompanying text when desired; the detection and message-building steps are sketched below.
The inference server validates every image_url segment (backend/app/routes/clustered_training/core/algorithms/gspo/inference/server.py), so ensure the URL is present and fetchable.
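The bullets above compress to a few helpers. A minimal sketch, assuming PNG frames and the observation field shown earlier; the function names are illustrative, not the Crafter policy's actual API.

import base64

def is_vision_model(model_name: str) -> bool:
    # Heuristic detection; match whatever naming your model registry uses.
    return "vl" in model_name.lower()

def frame_to_data_url(png_bytes: bytes) -> str:
    # Inline the frame as a data URL so the inference server's image_url
    # validation passes without an externally fetchable HTTPS URL.
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")

def build_user_content(state_summary: str, image_url: str, image_only: bool = False) -> list:
    # image_only_mode drops the text segment and sends just the image.
    segments = [{"type": "image_url", "image_url": {"url": image_url}}]
    if not image_only:
        segments.insert(0, {"type": "text", "text": state_summary})
    return segments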

Config checklist

[model]
base = "Qwen/Qwen3-VL-4B-Instruct"
supports_vision = true          # auto-set when using helpers
max_images_per_message = 1
vision_processor_id = "Qwen/Qwen3-VL-4B-Instruct"
trainer_mode = "lora"

[rollout]
use_vision = true
image_only_mode = true          # optional
max_images_per_step = 1
policy_name = "crafter-react"

[training]
batch_size = 2                  # keep small; vision inputs are heavy
supports_vision = true
gradient_accumulation_steps = 2

[vllm]
limit_mm_per_prompt = { "image": 1 }
max_model_len = 4096
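Because the vision knobs span four TOML tables, they can drift apart. A quick consistency check over the fields above (illustrative; tomllib requires Python 3.11+):

import tomllib

def check_vision_config(path: str) -> None:
    with open(path, "rb") as fh:
        cfg = tomllib.load(fh)
    model, rollout, vllm = cfg["model"], cfg["rollout"], cfg["vllm"]
    assert model["supports_vision"] and rollout["use_vision"], "vision must be enabled end to end"
    # The rollout cannot send more images than the model or vLLM will accept.
    assert rollout["max_images_per_step"] <= model["max_images_per_message"]
    assert vllm["limit_mm_per_prompt"]["image"] >= rollout["max_images_per_step"]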

Thinking variants

If you choose a -Thinking SKU, populate the rollout policy_config with the intended thinking mode:
[rollout.policy_config]
use_vision = true
thinking_mode = "think"
thinking_budget = 512
Both the trainer and evaluator propagate these fields to inference (backend/app/routes/clustered_training/core/algorithms/gspo/evaluation/evaluator.py).
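A small guard can catch misconfigured thinking fields before a job launches. This sketch is hypothetical; only "think" appears on this page, so extend the allowed set to whatever modes your SKU supports.

def validate_thinking_config(policy_config: dict) -> None:
    # Hypothetical guard: thinking fields only make sense on -Thinking SKUs.
    mode = policy_config.get("thinking_mode")
    if mode is None:
        return  # non-thinking SKU; nothing to check
    assert mode == "think", f"unexpected thinking_mode: {mode!r}"
    budget = policy_config.get("thinking_budget", 0)
    assert isinstance(budget, int) and budget > 0, "thinking_budget must be a positive int"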

Example workflow

  1. Deploy the Crafter task app (modal deploy examples/task_apps/crafter/task_app/main.py) with vision enabled.
  2. Update examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml with your task URL and API key secrets.
  3. Launch RL:
    uvx synth-ai train \
      --type rl \
      --config examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml \
      --poll
    
  4. Monitor rollouts – the trainer logs dropped images if you exceed max_images_per_message, and vLLM reports multimodal prompt usage.
  5. Evaluate / deploy – reuse the same [model] + [rollout] blocks in your eval configs and Modal deployment manifests so the processor files ship with the model.
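Step 5 is easy to enforce mechanically. A sketch that diffs the shared blocks between the training and eval configs (illustrative; Python 3.11+ for tomllib):

import tomllib

def assert_shared_blocks(train_path: str, eval_path: str) -> None:
    with open(train_path, "rb") as fh:
        train_cfg = tomllib.load(fh)
    with open(eval_path, "rb") as fh:
        eval_cfg = tomllib.load(fh)
    # The eval config must reuse [model] and [rollout] verbatim so the vision
    # processor files ship with the model.
    for block in ("model", "rollout"):
        assert train_cfg.get(block) == eval_cfg.get(block), f"[{block}] differs between configs"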

Tips

  • Concurrency: Vision prompts are larger. Start with max_concurrent_rollouts = 4 and scale cautiously.
  • Topology: Use single_node_split and dedicate at least one GPU to vLLM and one to training; sharded models (235B) require additional GPUs.
  • Data capture: Enable tracing (TASKAPP_TRACING_ENABLED=1) to keep image payloads in your evaluation logs.
  • LoRA projector weights: When using LoRA, ensure target_modules includes the projector (the sample config uses "all-linear" to cover every linear module).
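
A sketch of the last tip with Hugging Face PEFT; the rank and alpha values are illustrative, not the sample config's. target_modules="all-linear" (PEFT >= 0.7) adapts every linear layer, which picks up the multimodal projector without naming it explicitly.

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # illustrative rank
    lora_alpha=32,                # illustrative scaling
    target_modules="all-linear",  # covers attention, MLP, and projector linears
    task_type="CAUSAL_LM",
)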