GSPO: Group Sequence Policy Optimization
GSPO is a policy gradient method for fine-tuning language models via reinforcement learning. It uses pipelined rollouts for efficient on-policy training.

When to Use
- Training model weights (not just prompts)
- Multi-turn agent tasks
- When you have a reward signal from environment interaction
- Scaling to larger models with GPU training
Config Reference
Algorithm Config [algorithm]
| Parameter | Type | Default | Description |
|---|---|---|---|
| type | string | - | Must be "online" for RL |
| method | string | - | "policy_gradient" or "ppo" |
| variety | string | - | "gspo" for Group Sequence Policy Optimization |
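A minimal `[algorithm]` block for GSPO (the values follow the options listed above):

```toml
[algorithm]
type = "online"             # RL training
method = "policy_gradient"
variety = "gspo"            # Group Sequence Policy Optimization
```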
Services Config [services]
| Parameter | Type | Default | Description |
|---|---|---|---|
| task_url | string | - | URL of your task app (tunnel URL) |
| judge_url | string | null | Optional judge service URL |
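For example, with a placeholder tunnel URL (substitute your own task app URL):

```toml
[services]
task_url = "https://your-task-app.example.com"   # placeholder tunnel URL
# judge_url = "https://judge.example.com"        # optional judge service
```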
Compute Config [compute]
| Parameter | Type | Default | Description |
|---|---|---|---|
| gpu_type | string | - | GPU SKU: "H100", "H200", "A100" |
| gpu_count | int | - | Number of GPUs |
| nodes | int | 1 | Number of nodes |
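Illustrative sizing for a single 4-GPU node:

```toml
[compute]
gpu_type = "H100"   # one of the listed SKUs
gpu_count = 4
nodes = 1
```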
Topology Config [topology]
| Parameter | Type | Default | Description |
|---|---|---|---|
| type | string | - | "single_node_split" |
| gpus_for_vllm | int | - | GPUs for inference server |
| gpus_for_training | int | - | GPUs for training |
| gpus_for_ref | int | - | GPUs for reference model |
| tensor_parallel | int | - | Tensor parallelism degree |
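An illustrative split of the 4 GPUs from the compute example above between inference, training, and the reference model:

```toml
[topology]
type = "single_node_split"
gpus_for_vllm = 2       # inference server
gpus_for_training = 1   # trainer
gpus_for_ref = 1        # reference model
tensor_parallel = 2
```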
vLLM Config [vllm]
| Parameter | Type | Default | Description |
|---|---|---|---|
| tensor_parallel_size | int | - | TP size for inference |
| max_model_len | int | 8192 | Max sequence length |
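Example matching the topology sketch above (TP size equal to the GPUs given to vLLM):

```toml
[vllm]
tensor_parallel_size = 2
max_model_len = 8192
```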
Reference Config [reference]
| Parameter | Type | Default | Description |
|---|---|---|---|
| placement | string | - | "none" or "dedicated" |
| port | int | 8002 | Reference server port |
| tp | int | 1 | Tensor parallelism |
| health_max_wait_s | int | 180 | Health check timeout (seconds) |
| health_interval_ms | int | 300 | Health check interval (ms) |
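Example using a dedicated reference server with the documented defaults:

```toml
[reference]
placement = "dedicated"
port = 8002
tp = 1
health_max_wait_s = 180
health_interval_ms = 300
```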
Model Config [model]
| Parameter | Type | Default | Description |
|---|---|---|---|
| base | string | - | Base model (e.g., "Qwen/Qwen3-4B") |
| source | string | - | Checkpoint ID (e.g., "ft:abc123"); alternative to base |
| trainer_mode | string | - | "lora", "full", or "qlora" |
| label | string | - | Model identifier/name |

Specify either base or source.
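Example starting from a base model (the label is an illustrative name; use `source` instead of `base` to start from a checkpoint):

```toml
[model]
base = "Qwen/Qwen3-4B"        # or: source = "ft:abc123"
trainer_mode = "lora"
label = "gspo-qwen3-4b"       # illustrative identifier
```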
Rollout Config [rollout]
| Parameter | Type | Default | Description |
|---|---|---|---|
| env_name | string | - | Environment/task name |
| policy_name | string | - | Policy identifier |
| max_turns | int | - | Max steps per episode |
| episodes_per_batch | int | - | Episodes per training batch |
| max_concurrent_rollouts | int | - | Max concurrent rollouts |
| batches_per_step | int | null | Batches per training step |
| ops | list[str] | null | Operations: ["agent", "env"] |
| env_config | dict | null | Environment-specific config |
| policy_config | dict | null | Policy-specific config |
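A sketch with illustrative names and sizes (your env and policy names come from your task app):

```toml
[rollout]
env_name = "my-task"          # illustrative
policy_name = "my-policy"     # illustrative
max_turns = 10
episodes_per_batch = 64
max_concurrent_rollouts = 8
ops = ["agent", "env"]
```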
Evaluation Config [evaluation]
| Parameter | Type | Default | Description |
|---|---|---|---|
| instances | int | - | Evaluation instances |
| every_n_iters | int | - | Evaluate every N iterations |
| seeds | list[int] | - | Evaluation seeds |
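Example (values illustrative):

```toml
[evaluation]
instances = 16
every_n_iters = 5
seeds = [0, 1, 2, 3]
```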
Training Config [training]
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_epochs | int | - | Number of training epochs |
| iterations_per_epoch | int | - | Iterations per epoch |
| max_turns | int | - | Max turns during training |
| batch_size | int | - | Training batch size |
| group_size | int | - | GSPO group size |
| learning_rate | float | - | Optimizer learning rate |
| gradient_accumulation_steps | int | null | Gradient accumulation steps |
| max_accumulated_minibatch | int | null | Max accumulated minibatch |
| log_interval | int | null | Log every N steps |
| weight_sync_interval | int | null | Sync weights every N steps |
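A sketch of the core training keys; the hyperparameter values are illustrative, not recommendations:

```toml
[training]
num_epochs = 1
iterations_per_epoch = 10
max_turns = 10
batch_size = 4
group_size = 8              # GSPO group size
learning_rate = 5e-6
log_interval = 1
weight_sync_interval = 1
```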
Reward Config (in [training])
| Parameter | Type | Default | Description |
|---|---|---|---|
| step_rewards_enabled | bool | null | Enable step-level rewards |
| step_rewards_mode | string | null | "off", "decision_stepwise", "env_sparse" |
| step_rewards_beta | float | null | Step reward coefficient |
| step_rewards_indicator_lambda | float | null | Indicator lambda |
| step_rewards_strategy | string | null | Reward strategy |
| event_rewards_kind | string | null | "unique" or "absolute" |
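These keys live inside `[training]`; an illustrative sketch enabling step-level rewards (coefficients are placeholders):

```toml
# Inside [training]
step_rewards_enabled = true
step_rewards_mode = "decision_stepwise"
step_rewards_beta = 0.1                 # illustrative coefficient
step_rewards_indicator_lambda = 0.5     # illustrative
event_rewards_kind = "unique"
```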
Weight Sync Config [training.weight_sync]
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable | bool | null | Enable weight sync |
| targets | list[str] | null | Sync targets: ["policy"] |
| mode | string | null | Sync mode |
| direct | bool | null | Direct sync |
| verify_every_k | int | null | Verify every K syncs |
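An illustrative sketch showing only the keys whose values are documented above (`mode` and `direct` are omitted here):

```toml
[training.weight_sync]
enable = true
targets = ["policy"]
verify_every_k = 10   # illustrative
```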
LoRA Config [training.lora]
| Parameter | Type | Default | Description |
|---|---|---|---|
| r | int | 16 | LoRA rank |
| alpha | int | 32 | LoRA alpha |
| dropout | float | 0.1 | LoRA dropout |
| target_modules | list[str] | - | Modules to apply LoRA to |
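Example using the defaults; the `target_modules` list is illustrative and model-dependent:

```toml
[training.lora]
r = 16
alpha = 32
dropout = 0.1
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]   # illustrative
```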
Judge Config [judge]
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable judge scoring |
| type | string | null | Judge type: "synth" |
| timeout_s | int | null | Judge timeout (seconds) |
Judge Reward Blend [judge.reward_blend]
| Parameter | Type | Default | Description |
|---|---|---|---|
| env | float | 1.0 | Weight for environment reward |
| event | float | 0.0 | Weight for event reward |
| outcome | float | 0.0 | Weight for outcome reward |
Judge Options [judge.options]
| Parameter | Type | Default | Description |
|---|---|---|---|
| event | bool | null | Enable event judging |
| outcome | bool | null | Enable outcome judging |
| provider | string | null | Judge provider |
| model | string | null | Judge model |
| rubric_id | string | null | Rubric identifier |
| rubric_overrides | dict | null | Rubric overrides |
| tracks | list[str] | null | Tracks to judge |
| weights | dict | null | Track weights |
| max_concurrency | int | null | Max concurrent judge calls |
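A combined illustrative sketch of the judge sections; the provider, model, rubric, and blend weights are placeholders, not recommended values:

```toml
[judge]
enabled = true
type = "synth"
timeout_s = 45                # illustrative

[judge.reward_blend]
env = 0.7                     # illustrative weights
event = 0.0
outcome = 0.3

[judge.options]
outcome = true
provider = "openai"           # placeholder provider
model = "gpt-4o-mini"         # placeholder model
rubric_id = "my-rubric"       # placeholder rubric
max_concurrency = 4
```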
Returns
Results Structure
Using Your Model
Dev Inference (testing)
List Your Models
Related
- SFT — Supervised fine-tuning
- Production API — Call your trained models
- Artifacts CLI — Export models to HuggingFace
- RL Events — Job events reference