Synth’s Qwen3-VL checkpoints let you fine-tune multimodal (text + image) models with the same SFT pipeline used for text-only runs. This guide highlights the extra pieces you need to configure when working with image data.

Supported models

  • Qwen/Qwen3-VL-2B-Instruct
  • Qwen/Qwen3-VL-4B-Instruct
  • Qwen/Qwen3-VL-8B-Instruct
  • Qwen/Qwen3-VL-30B-A3B-Instruct
  • Qwen/Qwen3-VL-32B-Instruct
  • Qwen/Qwen3-VL-235B-A22B-Instruct (sharded across >1 GPU)
The model family descriptor (backend/app/routes/simple_training/model_families/qwen3_vl.py) marks these SKUs with supports_vision = True and max_images_per_message = 1, so the trainer automatically switches into the multimodal code path.
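
Conceptually, the descriptor is just per-SKU metadata that the trainer consults before building its data pipeline. The sketch below shows the idea in plain Python; the class and field names here are illustrative, not the actual contents of qwen3_vl.py.

# Illustrative sketch only: the real descriptor lives in
# backend/app/routes/simple_training/model_families/qwen3_vl.py and its
# names may differ.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelFamilySpec:  # hypothetical name
    model_id: str
    supports_vision: bool = False
    max_images_per_message: int = 0

QWEN3_VL_SPECS = [
    ModelFamilySpec("Qwen/Qwen3-VL-4B-Instruct", supports_vision=True, max_images_per_message=1),
    # ...one entry per SKU in the list above
]

def is_multimodal(model_id: str) -> bool:
    """True when the trainer should take the multimodal code path."""
    return any(s.model_id == model_id and s.supports_vision for s in QWEN3_VL_SPECS)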

Dataset requirements

  • OpenAI multimodal schema – every JSONL row must contain a messages[] array. Messages can mix text segments and image segments:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is happening in this frame?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}}
      ]
    },
    {
      "role": "assistant",
      "content": "The agent is holding a red cube."
    }
  ],
  "metadata": {"example_id": "frame_042"}
}
  • Image references – the loader accepts HTTPS URLs, signed URLs, or inline data URLs (data:image/png;base64,…). Local file paths only work if the file is packaged with the dataset; otherwise inline the bytes as a data URL (see the sketch after this list).
  • Validation – VisionSFTDataProcessor (backend/app/routes/simple_training/training/sft/data.py) enforces the image limit and normalises legacy images fields. Invalid examples are dropped with a warning.
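
If your frames live on disk, the simplest way to satisfy the schema is to inline each image as a data URL while writing the JSONL. A minimal sketch using only the standard library (the helper names and the frames/ directory are our own, not synth-ai APIs):

# Sketch: write one JSONL row per frame, inlining the image as a data URL.
import base64
import json
from pathlib import Path

def encode_image(path: Path) -> str:
    """Inline a PNG as a data URL so the row has no external dependency."""
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:image/png;base64,{b64}"

def build_row(image_path: Path, question: str, answer: str, example_id: str) -> dict:
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
                ],
            },
            {"role": "assistant", "content": answer},
        ],
        "metadata": {"example_id": example_id},
    }

out = Path("datasets/crafter_vl/train.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w") as f:
    for i, frame in enumerate(sorted(Path("frames").glob("*.png"))):
        row = build_row(frame, "What is happening in this frame?", "<ground-truth answer>", f"frame_{i:03d}")
        f.write(json.dumps(row) + "\n")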

Training config tweaks

When job.model points at a Qwen3-VL SKU, the trainer applies these defaults automatically (backend/app/routes/simple_training/training/sft/trainer.py):
  • Clamp per_device_train_batch_size / per_device_eval_batch_size to 1.
  • Bump gradient_accumulation_steps to ≥ 8 (raise it further to hit your target effective batch size; see the worked example below).
  • Force bf16 precision and disable fp16.
  • Disable group_by_length and pin-memory workers (vision collators stream PIL images).
You can override the image cap by setting model.max_images_per_message, but higher values increase VRAM usage.
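
With the example config below, the effective batch size works out to per_device_batch × gradient_accumulation_steps × gpu_count = 1 × 32 × 2 = 64. Taken together, the defaults above amount to roughly the following adjustment; this is a sketch with illustrative names, not the trainer's actual code.

# Sketch of the vision defaults described above; the real logic lives in
# backend/app/routes/simple_training/training/sft/trainer.py.
def apply_vision_defaults(args: dict) -> dict:
    args["per_device_train_batch_size"] = 1
    args["per_device_eval_batch_size"] = 1
    args["gradient_accumulation_steps"] = max(8, args.get("gradient_accumulation_steps", 1))
    args["bf16"] = True
    args["fp16"] = False
    args["group_by_length"] = False
    args["dataloader_pin_memory"] = False  # vision collators stream PIL images
    return args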

Example TOML (LoRA SFT)

[algorithm]
type = "offline"
method = "sft"
variety = "lora"

[job]
model = "Qwen/Qwen3-VL-4B-Instruct"
data = "datasets/crafter_vl/train.jsonl"

[compute]
gpu_type = "H200"
gpu_count = 2

[training]
mode = "lora"
use_qlora = true

[hyperparameters]
n_epochs = 2
per_device_batch = 1
gradient_accumulation_steps = 32
learning_rate = 5e-6
sequence_length = 2048
warmup_ratio = 0.03

[lora]
r = 16
alpha = 32
dropout = 0.05
target_modules = ["all-linear"]

Step-by-step

  1. Collect traces from a vision-enabled task app (for example examples/task_apps/crafter/task_app/main.py with use_vision = true).
  2. Filter to JSONL using uvx synth-ai filter; the SFT exporter keeps the multimodal message format.
  3. Train with uvx synth-ai train --type sft --config ft_vl.toml --poll.
  4. Inspect outputs – the fine_tuned_model retains the multimodal metadata, so Modal deployments automatically pull the correct processor.
Need flag references? See Run Evaluations, Filter Traces, Launch Training Jobs, and Deploy Task Apps.

Tips

  • Images are the main memory driver. If you see CUDA OOMs, increase gradient accumulation, reduce sequence_length, or move to a larger GPU profile.
  • The collator (backend/app/routes/simple_training/training/sft/vision_collator.py) truncates additional image segments, so design your prompts around the configured limit.
  • For pure image captioning tasks, set image_only_mode = true in your task app policy so user messages contain only image segments (no text).
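
For reference, an image-only user turn simply drops the text segment while keeping the same schema as above (shown here as the Python dict a task app might serialise):

# Sketch: an image-only user turn for caption-style tasks (no text segment).
image_only_turn = {
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}}
    ],
}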