## Supported models
- Qwen/Qwen3-VL-2B-Instruct
- Qwen/Qwen3-VL-4B-Instruct
- Qwen/Qwen3-VL-8B-Instruct
- Qwen/Qwen3-VL-30B-A3B-Instruct
- Qwen/Qwen3-VL-32B-Instruct
- Qwen/Qwen3-VL-235B-A22B-Instruct (sharded across >1 GPU)
The model family registry (`backend/app/routes/simple_training/model_families/qwen3_vl.py`) marks these SKUs with `supports_vision = True` and `max_images_per_message = 1`, so the trainer automatically switches into the multimodal code path.
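A sketch of what those registry entries might look like (the `ModelSpec` dataclass and list name are illustrative assumptions, not the actual module contents; only the `supports_vision` and `max_images_per_message` attributes are named above):

```python
# Hypothetical shape of the qwen3_vl.py entries; the dataclass and list
# names are assumptions, not the actual module contents.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    name: str
    supports_vision: bool = False
    max_images_per_message: int = 0

QWEN3_VL_MODELS = [
    ModelSpec(name, supports_vision=True, max_images_per_message=1)
    for name in (
        "Qwen/Qwen3-VL-2B-Instruct",
        "Qwen/Qwen3-VL-4B-Instruct",
        "Qwen/Qwen3-VL-8B-Instruct",
        "Qwen/Qwen3-VL-30B-A3B-Instruct",
        "Qwen/Qwen3-VL-32B-Instruct",
        "Qwen/Qwen3-VL-235B-A22B-Instruct",
    )
]
```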
## Dataset requirements
- OpenAI multimodal schema – every JSONL row must contain a `messages[]` array. Messages can mix text segments and image segments; see the sketch after this list.
- Image references – the loader accepts HTTPS URLs, signed URLs, or inline data URLs (`data:image/png;base64,…`). Local file paths only work if the file is packaged with the dataset.
- Validation – `VisionSFTDataProcessor` (`backend/app/routes/simple_training/training/sft/data.py`) enforces the image limit and normalises legacy `images` fields. Invalid examples are dropped with a warning.
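A minimal sketch of a single dataset row, shown wrapped here for readability (in the JSONL file each row is one line). The segment keys follow the OpenAI chat format; the URL and text are placeholders:

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what the agent should do next."},
        {"type": "image_url", "image_url": {"url": "https://example.com/frame_0001.png"}}
      ]
    },
    {"role": "assistant", "content": "Chop the tree directly ahead, then craft a wooden pickaxe."}
  ]
}
```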
## Training config tweaks
When `job.model` points at a Qwen3-VL SKU, the trainer applies these defaults automatically (`backend/app/routes/simple_training/training/sft/trainer.py`); a hedged sketch of the logic follows the list.
- Clamp `per_device_train_batch_size` / `per_device_eval_batch_size` to 1.
- Bump `gradient_accumulation_steps` to ≥ 8 (you can raise it further to reach your target effective batch size).
- Force `bf16` precision and disable `fp16`.
- Disable `group_by_length` and pin-memory workers (vision collators stream PIL images).

You can raise `model.max_images_per_message`, but higher values increase VRAM usage.
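A sketch of how those defaults might be applied (the function and argument names here are illustrative assumptions, not the actual `trainer.py` API):

```python
# Hypothetical sketch of the vision defaults described above; the real
# trainer.py almost certainly differs in naming and structure.
from types import SimpleNamespace

def apply_qwen3_vl_defaults(args: SimpleNamespace, supports_vision: bool) -> SimpleNamespace:
    if not supports_vision:
        return args
    # Vision runs are memory-bound: force a micro-batch of 1 and reach the
    # effective batch size through gradient accumulation instead.
    args.per_device_train_batch_size = 1
    args.per_device_eval_batch_size = 1
    args.gradient_accumulation_steps = max(args.gradient_accumulation_steps, 8)
    # Force bf16 and disable fp16, per the defaults above.
    args.bf16, args.fp16 = True, False
    # The vision collator streams PIL images, so length grouping and
    # pinned-memory dataloader workers are disabled.
    args.group_by_length = False
    args.dataloader_pin_memory = False
    return args

# Example: start from generic settings and let the vision defaults override.
args = apply_qwen3_vl_defaults(
    SimpleNamespace(per_device_train_batch_size=4, per_device_eval_batch_size=4,
                    gradient_accumulation_steps=2, bf16=False, fp16=True,
                    group_by_length=True, dataloader_pin_memory=True),
    supports_vision=True,
)
```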
## Example TOML (LoRA SFT)
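A minimal sketch of `ft_vl.toml`. The table names and keys below are assumptions based on the defaults described above, not the exact synth-ai config schema; adjust them to match your installed version:

```toml
# ft_vl.toml - hypothetical LoRA SFT config for a Qwen3-VL SKU.
# Key names are assumptions, not the exact synth-ai schema.
model = "Qwen/Qwen3-VL-8B-Instruct"
training_type = "sft"
dataset = "traces_filtered.jsonl"

[hyperparameters]
per_device_train_batch_size = 1   # clamped to 1 for vision SKUs regardless
gradient_accumulation_steps = 16  # raise to reach your target effective batch
sequence_length = 4096
bf16 = true

[lora]
r = 16
alpha = 32
dropout = 0.05
```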
## Step-by-step
- Collect traces from a vision-enabled task app (for example `examples/task_apps/crafter/task_app/main.py` with `use_vision = true`).
- Filter to JSONL using `uvx synth-ai filter`; the SFT exporter keeps the multimodal message format.
- Train with `uvx synth-ai train --type sft --config ft_vl.toml --poll`.
- Inspect outputs – the `fine_tuned_model` retains the multimodal metadata, so Modal deployments automatically pull the correct processor.
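Putting the CLI steps together (only the commands shown above; any extra flags your version requires are not shown):

```bash
# Export filtered traces to multimodal JSONL, then launch LoRA SFT.
uvx synth-ai filter
uvx synth-ai train --type sft --config ft_vl.toml --poll
```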
## Tips
- Images are the main memory driver. If you see CUDA OOMs, increase gradient accumulation, reduce `sequence_length`, or move to a larger GPU profile.
- The collator (`backend/app/routes/simple_training/training/sft/vision_collator.py`) truncates additional image segments, so design your prompts around the configured limit.
- For pure image-captioning tasks, set `image_only_mode = true` in your task app policy so user messages contain only image segments (no text); see the sketch below.
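A sketch of an image-only user message under that policy (same OpenAI segment format as above; the URL and caption are placeholders):

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}}
      ]
    },
    {"role": "assistant", "content": "A red fox crossing a snowy field at dusk."}
  ]
}
```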