Synth’s Qwen3-VL checkpoints let you fine-tune multimodal (text + image) models with the same SFT pipeline used for text-only runs. This guide highlights the extra pieces you need to configure when working with image data.

Supported models

  • Qwen/Qwen3-VL-2B-Instruct
  • Qwen/Qwen3-VL-4B-Instruct
  • Qwen/Qwen3-VL-8B-Instruct
  • Qwen/Qwen3-VL-30B-A3B-Instruct
  • Qwen/Qwen3-VL-32B-Instruct
  • Qwen/Qwen3-VL-235B-A22B-Instruct (sharded across >1 GPU)
The model family descriptor (backend/app/routes/simple_training/model_families/qwen3_vl.py) marks these SKUs with supports_vision = True and max_images_per_message = 1, so the trainer automatically switches into the multimodal code path.
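
Conceptually, the descriptor is just per-SKU metadata that the trainer consults before building its data pipeline. The sketch below shows the idea in plain Python; the class and field names here are illustrative, not the actual contents of qwen3_vl.py.

# Illustrative sketch only: the real descriptor lives in
# backend/app/routes/simple_training/model_families/qwen3_vl.py and its
# names may differ.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelFamilySpec:  # hypothetical name
    model_id: str
    supports_vision: bool = False
    max_images_per_message: int = 0

QWEN3_VL_SPECS = [
    ModelFamilySpec("Qwen/Qwen3-VL-4B-Instruct", supports_vision=True, max_images_per_message=1),
    # ...one entry per SKU in the list above
]

def is_multimodal(model_id: str) -> bool:
    """True when the trainer should take the multimodal code path."""
    return any(s.model_id == model_id and s.supports_vision for s in QWEN3_VL_SPECS)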

Dataset requirements

  • OpenAI multimodal schema – every JSONL row must contain a messages[] array. Messages can mix text segments and image segments:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is happening in this frame?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}}
      ]
    },
    {
      "role": "assistant",
      "content": "The agent is holding a red cube."
    }
  ],
  "metadata": {"example_id": "frame_042"}
}
  • Image references – the loader accepts HTTPS URLs, signed URLs, or inline data URLs (data:image/png;base64,…). Local file paths only work if the file is packaged with the dataset; otherwise inline the bytes as a data URL (see the sketch after this list).
  • Validation – VisionSFTDataProcessor (backend/app/routes/simple_training/training/sft/data.py) enforces the image limit and normalises legacy images fields. Invalid examples are dropped with a warning.
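
If your frames live on disk, the simplest way to satisfy the schema is to inline each image as a data URL while writing the JSONL. A minimal sketch using only the standard library (the helper names and the frames/ directory are our own, not synth-ai APIs):

# Sketch: write one JSONL row per frame, inlining the image as a data URL.
import base64
import json
from pathlib import Path

def encode_image(path: Path) -> str:
    """Inline a PNG as a data URL so the row has no external dependency."""
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:image/png;base64,{b64}"

def build_row(image_path: Path, question: str, answer: str, example_id: str) -> dict:
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
                ],
            },
            {"role": "assistant", "content": answer},
        ],
        "metadata": {"example_id": example_id},
    }

out = Path("datasets/crafter_vl/train.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w") as f:
    for i, frame in enumerate(sorted(Path("frames").glob("*.png"))):
        row = build_row(frame, "What is happening in this frame?", "<ground-truth answer>", f"frame_{i:03d}")
        f.write(json.dumps(row) + "\n")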

Training config tweaks

When job.model points at a Qwen3-VL SKU, the trainer applies these defaults automatically (backend/app/routes/simple_training/training/sft/trainer.py):
  • Clamp per_device_train_batch_size / per_device_eval_batch_size to 1.
  • Bump gradient_accumulation_steps to ≥ 8 (raise it further to hit your target effective batch size; see the worked example below).
  • Force bf16 precision and disable fp16.
  • Disable group_by_length and pin-memory workers (vision collators stream PIL images).
You can override the image cap by setting model.max_images_per_message, but higher values increase VRAM usage.
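
With the example config below, the effective batch size works out to per_device_batch × gradient_accumulation_steps × gpu_count = 1 × 32 × 2 = 64. Taken together, the defaults above amount to roughly the following adjustment; this is a sketch with illustrative names, not the trainer's actual code.

# Sketch of the vision defaults described above; the real logic lives in
# backend/app/routes/simple_training/training/sft/trainer.py.
def apply_vision_defaults(args: dict) -> dict:
    args["per_device_train_batch_size"] = 1
    args["per_device_eval_batch_size"] = 1
    args["gradient_accumulation_steps"] = max(8, args.get("gradient_accumulation_steps", 1))
    args["bf16"] = True
    args["fp16"] = False
    args["group_by_length"] = False
    args["dataloader_pin_memory"] = False  # vision collators stream PIL images
    return args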

Example TOML (LoRA SFT)

[algorithm]
type = "offline"
method = "sft"
variety = "lora"

[job]
model = "Qwen/Qwen3-VL-4B-Instruct"
data = "datasets/crafter_vl/train.jsonl"

[compute]
gpu_type = "H200"
gpu_count = 2

[training]
mode = "lora"
use_qlora = true

[hyperparameters]
n_epochs = 2
per_device_batch = 1
gradient_accumulation_steps = 32
learning_rate = 5e-6
sequence_length = 2048
warmup_ratio = 0.03

[lora]
r = 16
alpha = 32
dropout = 0.05
target_modules = ["all-linear"]

Step-by-step

  1. Collect traces from a vision-enabled task app (for example examples/task_apps/crafter/task_app/main.py with use_vision = true).
  2. Filter to JSONL using uvx synth-ai filter; the SFT exporter keeps the multimodal message format.
  3. Train with uvx synth-ai train --type sft --config ft_vl.toml --poll.
  4. Inspect outputs – the fine_tuned_model retains the multimodal metadata, so Modal deployments automatically pull the correct processor.
Need flag references? See Run Evaluations, Filter Traces, Launch Training Jobs, and Deploy Task Apps.

Tips

  • Images are the main memory driver. If you see CUDA OOMs, increase gradient accumulation, reduce sequence_length, or move to a larger GPU profile.
  • The collator (backend/app/routes/simple_training/training/sft/vision_collator.py) truncates additional image segments, so design your prompts around the configured limit.
  • For pure image captioning tasks, set image_only_mode = true in your task app policy so user messages contain only image segments (no text).
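
For reference, an image-only user turn simply drops the text segment while keeping the same schema as above (shown here as the Python dict a task app might serialise):

# Sketch: an image-only user turn for caption-style tasks (no text segment).
image_only_turn = {
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}}
    ],
}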