- Works via the standard SFT CLI: `uvx synth-ai train --type sft --config <path>`
- You do not need QLoRA to use LoRA. Use either:
  - LoRA with standard-precision memory optimizations: set `training.mode = "lora"` and omit `use_qlora`
  - QLoRA (4-bit quantization adapters): set `training.mode = "lora"` and `training.use_qlora = true`
- Uses the same hyperparameter keys as FFT; the backend interprets LoRA-appropriate settings
Quickstart
Minimal TOML (LoRA enabled)
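A minimal sketch of such a config, assembled from the keys documented in the sections below (the model identifier, dataset path, and hyperparameter values are illustrative placeholders, not recommendations):

```toml
# Minimal LoRA SFT config (illustrative values only)
[job]
model = "Qwen/Qwen2.5-7B-Instruct"   # placeholder base model identifier
data = "data/train.jsonl"            # resolved relative to the CWD

[compute]
gpu_type = "H100"                    # required by the backend
gpu_count = 1

[training]
mode = "lora"                        # plain LoRA; add use_qlora = true for 4-bit QLoRA

[hyperparameters]
n_epochs = 1
learning_rate = 2e-4
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
```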
What the client validates and sends
- Validates dataset path existence (from `[job].data` or `--dataset`) and JSONL shape
- Uploads training (and optional validation) files to `/api/learning/files`
- Builds payload with:
  - `model` from `[job].model`
  - `training_type = "sft_offline"`
  - `hyperparameters` from `[hyperparameters]` (plus `[training.validation]` knobs)
  - `metadata.effective_config.compute` from `[compute]`
  - `metadata.effective_config.data.topology` from `[data.topology]`
  - `metadata.effective_config.training.{mode,use_qlora}` from `[training]`
Multi‑GPU guidance
- Set `[compute].gpu_type`, `gpu_count`, and optionally `nodes`
- Use `[hyperparameters.parallelism]` for DeepSpeed/FSDP/precision/TP/PP knobs; forwarded verbatim (see the sketch below)
- Optionally add `[data.topology]` (e.g., `container_count`) for visibility; the backend validates resource consistency
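A hedged sketch of what a multi-GPU block could look like (values are illustrative; the `[hyperparameters.parallelism]` keys shown are among those listed in the reference section below and are passed through verbatim):

```toml
[compute]
gpu_type = "H100"
gpu_count = 4
nodes = 1

[hyperparameters.parallelism]
use_deepspeed = true
deepspeed_stage = 2
bf16 = true
tensor_parallel_size = 1
pipeline_parallel_size = 1

# Optional, for visibility; the backend checks resource consistency
[data.topology]
container_count = 1
```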
LoRA ranks and targets
- Default rank is typically 16 in production. To sweep ranks, set `lora_rank` under `[hyperparameters]` (see the sketch below)
- Some examples (e.g., Qwen Coder) include a `[lora]` block for target selection (also shown below)
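A sketch combining both, assuming the `lora_*` hyperparameter keys listed in the reference section below; the `target_modules` key inside `[lora]` follows common PEFT naming and is an assumption here, so check the `[lora]` block of the example you are copying:

```toml
[hyperparameters]
lora_rank = 16        # default is typically 16; sweep by varying this per run
lora_alpha = 32
lora_dropout = 0.05

# Target selection, as seen in some examples (e.g., Qwen Coder).
# Key name below is assumed from common PEFT usage, not confirmed for every example.
[lora]
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
```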
Common issues
- HTTP 400 `missing_gpu_type`: set `[compute].gpu_type` (and typically `gpu_count`) so it appears under `metadata.effective_config.compute`
- Dataset not found: provide an absolute path or use `--dataset`; the client resolves relative paths from the current working directory
Helpful CLI flags
- `--dataset` to override `[job].data`
- `--examples N` to subset the first N rows for quick smoke tests
- `--dry-run` to preview the payload without submitting
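For a quick smoke test these compose with the training command shown above, e.g. `uvx synth-ai train --type sft --config path/to/lora.toml --examples 50 --dry-run` (the config path is a placeholder, and combining the flags in one invocation is assumed to work as described).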
All sections and parameters (LoRA SFT)
The client recognizes and/or forwards the following sections:

- `[job]` (client reads)
  - `model` (string, required): base model identifier
  - `data` or `data_path` (string): local path to training JSONL. Required unless overridden by `--dataset`
  - Notes:
    - Paths are resolved relative to the current working directory (CWD), not the TOML location
    - Legacy scripts may also mention `poll_seconds` (legacy-only; the new CLI uses `--poll-*` flags)
- `[compute]` (forwarded into `metadata.effective_config.compute`)
  - `gpu_type` (string): required by the backend (e.g., "H100", "A10G"). Missing this often causes HTTP 400
  - `gpu_count` (int): number of GPUs
  - `nodes` (int, optional)
- `[data]` (partially read)
  - `topology` (table/dict): forwarded as-is to `metadata.effective_config.data.topology`
  - `validation_path` (string, optional): if present and the file exists, the client uploads it and wires up validation
  - Path resolution: relative to the current working directory (CWD)
- `[training]` (partially read)
  - `mode` (string, optional): copied into `metadata.effective_config.training.mode` (documentation hint)
  - `use_qlora` (bool): set to `true` to enable QLoRA (LoRA works without this)
  - `[training.validation]` (optional; some keys are promoted into hyperparameters; see the validation sketch after this list)
    - `enabled` (bool, default true): surfaced in `metadata.effective_config.training.validation.enabled`
    - `evaluation_strategy` (string, default "steps"): forwarded into hyperparameters
    - `eval_steps` (int, default 0): forwarded
    - `save_best_model_at_end` (bool, default true): forwarded
    - `metric_for_best_model` (string, default "val.loss"): forwarded
    - `greater_is_better` (bool, default false): forwarded
- `[hyperparameters]` (client reads selective keys)
  - Required/defaulted: `n_epochs` (int, default 1)
  - Optional (forwarded if present): `batch_size`, `global_batch`, `per_device_batch`, `gradient_accumulation_steps`, `sequence_length`, `learning_rate`, `warmup_ratio`, `train_kind`
  - LoRA-specific (forwarded if present): `lora_rank`, `lora_alpha`, `lora_dropout`
  - Note: some legacy examples include `world_size`. The client does not forward `world_size`; prefer specifying `per_device_batch` and `gradient_accumulation_steps` explicitly.
  - `[hyperparameters.parallelism]` (dict): forwarded verbatim (e.g., `use_deepspeed`, `deepspeed_stage`, `fsdp`, `bf16`, `fp16`, `tensor_parallel_size`, `pipeline_parallel_size`)
- `[algorithm]` (ignored by client): present in some examples for documentation; no effect on payload
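To make the validation-related keys above concrete, a sketch of how validation might be wired in (the path and step count are placeholders; key names and defaults are the ones listed above):

```toml
[data]
validation_path = "data/val.jsonl"   # uploaded only if the file exists; resolved relative to the CWD

[training.validation]
enabled = true
evaluation_strategy = "steps"
eval_steps = 100                     # default is 0 (illustrative value here)
save_best_model_at_end = true
metric_for_best_model = "val.loss"
greater_is_better = false
```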
Failure modes

- Missing dataset path -> prompt or error; the dataset must exist and be valid JSONL
- Missing `gpu_type` (backend rule) -> HTTP 400 at job creation
- Validation path missing -> warning; continues without validation