GSPO: Group Sequence Policy Optimization

GSPO is a policy gradient method for fine-tuning language models via reinforcement learning. It uses pipelined rollouts for efficient on-policy training.
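At its core, GSPO is a GRPO-style objective that clips a length-normalized, sequence-level importance ratio rather than per-token ratios. The sketch below is plain PyTorch, not the trainer's actual code; it shows the loss for one group of G sampled completions, and the clipping constant and tensor shapes are illustrative assumptions.

import torch

def gspo_loss(logp_new, logp_old, rewards, seq_lens, clip_eps=0.2):
    """Clipped sequence-level surrogate for one group of G completions.

    logp_new / logp_old: summed token log-probs of each completion under the
    current policy and the rollout (behaviour) policy, shape (G,).
    rewards: scalar reward per completion, shape (G,).
    seq_lens: completion lengths in tokens, shape (G,).
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Length-normalized sequence-level importance ratio.
    ratio = torch.exp((logp_new - logp_old) / seq_lens)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO-style clipped objective; return the negative mean to minimize.
    return -torch.min(ratio * adv, clipped * adv).mean()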

When to Use

  • Training model weights (not just prompts)
  • Multi-turn agent tasks
  • When you have a reward signal from environment interaction
  • Scaling to larger models with GPU training

Config Reference

[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[services]
task_url = "https://your-tunnel.trycloudflare.com"
judge_url = "https://synth-backend.onrender.com/api"  # Optional

[compute]
gpu_type = "H100"
gpu_count = 2
nodes = 1

[topology]
type = "single_node_split"
gpus_for_vllm = 1
gpus_for_training = 1
gpus_for_ref = 0
tensor_parallel = 1

[vllm]
tensor_parallel_size = 1
max_model_len = 8192

[reference]
placement = "none"  # or "dedicated"
port = 8002
tp = 1
health_max_wait_s = 180
health_interval_ms = 300

[model]
base = "Qwen/Qwen3-4B"  # OR source = "ft:checkpoint_id"
trainer_mode = "lora"  # or "full"
label = "my-rl-model"

[rollout]
env_name = "my-task"
policy_name = "my-policy"
max_turns = 10
episodes_per_batch = 32
max_concurrent_rollouts = 8
batches_per_step = 2
ops = ["agent", "env"]

[evaluation]
instances = 50
every_n_iters = 10
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

[training]
num_epochs = 1
iterations_per_epoch = 20
max_turns = 10
batch_size = 16
group_size = 4
gradient_accumulation_steps = 1
learning_rate = 5e-5
log_interval = 1
weight_sync_interval = 1

# Reward configuration
step_rewards_enabled = true
step_rewards_mode = "decision_stepwise"  # or "off", "env_sparse"
step_rewards_beta = 0.0
step_rewards_indicator_lambda = 1.0
event_rewards_kind = "unique"  # or "absolute"

[training.weight_sync]
enable = true
targets = ["policy"]

[training.lora]
r = 16
alpha = 32
dropout = 0.1
target_modules = ["q_proj", "v_proj"]

[judge]
enabled = false
type = "synth"
timeout_s = 30

[judge.reward_blend]
env = 1.0
event = 0.0
outcome = 0.0

[judge.options]
event = false
outcome = false
provider = "synth"
model = "synth-judge-v1"

Algorithm Config [algorithm]

| Parameter | Type | Default | Description |
|---|---|---|---|
| type | string | - | Must be "online" for RL |
| method | string | - | "policy_gradient" or "ppo" |
| variety | string | - | "gspo" for Group Sequence Policy Optimization |

Services Config [services]

| Parameter | Type | Default | Description |
|---|---|---|---|
| task_url | string | - | URL of your task app (tunnel URL) |
| judge_url | string | null | Optional judge service URL |

Compute Config [compute]

| Parameter | Type | Default | Description |
|---|---|---|---|
| gpu_type | string | - | GPU SKU: "H100", "H200", "A100" |
| gpu_count | int | - | Number of GPUs |
| nodes | int | 1 | Number of nodes |

Topology Config [topology]

| Parameter | Type | Default | Description |
|---|---|---|---|
| type | string | - | "single_node_split" |
| gpus_for_vllm | int | - | GPUs for inference server |
| gpus_for_training | int | - | GPUs for training |
| gpus_for_ref | int | - | GPUs for reference model |
| tensor_parallel | int | - | Tensor parallelism degree |
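One consistency check worth doing: with a "single_node_split" topology, the GPU split should account for every GPU requested in [compute]. A minimal sketch using the literal values from the example config above (assuming the three gpus_for_* fields are meant to partition compute.gpu_count):

# Values from the example config above.
gpus_for_vllm, gpus_for_training, gpus_for_ref = 1, 1, 0
gpu_count = 2

used = gpus_for_vllm + gpus_for_training + gpus_for_ref
assert used == gpu_count, f"GPU split uses {used} of {gpu_count} GPUs"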

vLLM Config [vllm]

| Parameter | Type | Default | Description |
|---|---|---|---|
| tensor_parallel_size | int | - | TP size for inference |
| max_model_len | int | 8192 | Max sequence length |

Reference Config [reference]

| Parameter | Type | Default | Description |
|---|---|---|---|
| placement | string | - | "none" or "dedicated" |
| port | int | 8002 | Reference server port |
| tp | int | 1 | Tensor parallelism |
| health_max_wait_s | int | 180 | Health check timeout (seconds) |
| health_interval_ms | int | 300 | Health check interval (milliseconds) |
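The two health_* fields describe a poll-until-ready loop: retry every health_interval_ms until health_max_wait_s elapses. The sketch below assumes a hypothetical GET /health endpoint on the reference server and only illustrates how the two values interact; the trainer performs this check internally.

import time
import urllib.request

def wait_for_reference(port=8002, health_max_wait_s=180, health_interval_ms=300):
    # Hypothetical health endpoint; the real path is an implementation detail.
    url = f"http://localhost:{port}/health"
    deadline = time.time() + health_max_wait_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(health_interval_ms / 1000)
    return False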

Model Config [model]

| Parameter | Type | Default | Description |
|---|---|---|---|
| base | string | - | Base model (e.g., "Qwen/Qwen3-4B") |
| source | string | - | Checkpoint ID (e.g., "ft:abc123") |
| trainer_mode | string | - | "lora", "full", or "qlora" |
| label | string | - | Model identifier/name |

Note: Set exactly one of base or source.
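The base/source constraint is easy to validate up front. A minimal, self-contained check (the file name rl.toml is the same assumption as in the earlier tomllib snippet):

import tomllib

with open("rl.toml", "rb") as f:
    model = tomllib.load(f)["model"]

has_base, has_source = "base" in model, "source" in model
assert has_base != has_source, "Set exactly one of model.base or model.source"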

Rollout Config [rollout]

| Parameter | Type | Default | Description |
|---|---|---|---|
| env_name | string | - | Environment/task name |
| policy_name | string | - | Policy identifier |
| max_turns | int | - | Max steps per episode |
| episodes_per_batch | int | - | Episodes per training batch |
| max_concurrent_rollouts | int | - | Max concurrent rollouts |
| batches_per_step | int | null | Batches per training step |
| ops | list[str] | null | Operations: ["agent", "env"] |
| env_config | dict | null | Environment-specific config |
| policy_config | dict | null | Policy-specific config |
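How these knobs combine is easiest to see with the numbers from the example config. The arithmetic below assumes that episodes_per_batch * batches_per_step episodes feed each training step and that max_concurrent_rollouts only bounds parallelism; treat it as an approximation rather than a guarantee about the trainer's scheduling.

# Values from the example config above.
episodes_per_batch = 32
batches_per_step = 2
max_concurrent_rollouts = 8

episodes_per_step = episodes_per_batch * batches_per_step        # 64 episodes per training step
rollout_waves = -(-episodes_per_step // max_concurrent_rollouts)  # 8 waves of concurrent rollouts
print(episodes_per_step, rollout_waves)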

Evaluation Config [evaluation]

| Parameter | Type | Default | Description |
|---|---|---|---|
| instances | int | - | Number of evaluation instances |
| every_n_iters | int | - | Evaluate every N iterations |
| seeds | list[int] | - | Evaluation seeds |

Training Config [training]

| Parameter | Type | Default | Description |
|---|---|---|---|
| num_epochs | int | - | Number of training epochs |
| iterations_per_epoch | int | - | Iterations per epoch |
| max_turns | int | - | Max turns during training |
| batch_size | int | - | Training batch size |
| group_size | int | - | GSPO group size |
| learning_rate | float | - | Optimizer learning rate |
| gradient_accumulation_steps | int | null | Gradient accumulation steps |
| max_accumulated_minibatch | int | null | Max accumulated minibatch |
| log_interval | int | null | Log every N steps |
| weight_sync_interval | int | null | Sync weights every N steps |
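group_size is the number of completions sampled per prompt for GSPO's group-relative advantage, so a batch of batch_size sequences covers batch_size / group_size prompts. A small worked example with the values from the config above; the exact batching semantics belong to the trainer, so this is only an approximation.

# Values from the example config above.
batch_size = 16
group_size = 4
gradient_accumulation_steps = 1

prompts_per_minibatch = batch_size // group_size                     # 4 prompts, 4 completions each
sequences_per_optim_step = batch_size * gradient_accumulation_steps  # 16 sequences per optimizer step
print(prompts_per_minibatch, sequences_per_optim_step)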

Reward Config (in [training])

| Parameter | Type | Default | Description |
|---|---|---|---|
| step_rewards_enabled | bool | null | Enable step-level rewards |
| step_rewards_mode | string | null | "off", "decision_stepwise", or "env_sparse" |
| step_rewards_beta | float | null | Step reward coefficient |
| step_rewards_indicator_lambda | float | null | Indicator lambda |
| step_rewards_strategy | string | null | Reward strategy |
| event_rewards_kind | string | null | "unique" or "absolute" |

Weight Sync Config [training.weight_sync]

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable | bool | null | Enable weight sync |
| targets | list[str] | null | Sync targets: ["policy"] |
| mode | string | null | Sync mode |
| direct | bool | null | Direct sync |
| verify_every_k | int | null | Verify every K syncs |

LoRA Config [training.lora]

| Parameter | Type | Default | Description |
|---|---|---|---|
| r | int | 16 | LoRA rank |
| alpha | int | 32 | LoRA alpha |
| dropout | float | 0.1 | LoRA dropout |
| target_modules | list[str] | - | Modules to apply LoRA to |
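For reference, these four fields mirror Hugging Face peft's LoraConfig. Assuming the trainer maps them one-to-one (not confirmed here), the example config corresponds roughly to:

from peft import LoraConfig

# Roughly equivalent peft config; task_type is an added assumption.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)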

Judge Config [judge]

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable judge scoring |
| type | string | null | Judge type: "synth" |
| timeout_s | int | null | Judge timeout (seconds) |

Judge Reward Blend [judge.reward_blend]

| Parameter | Type | Default | Description |
|---|---|---|---|
| env | float | 1.0 | Weight for environment reward |
| event | float | 0.0 | Weight for event reward |
| outcome | float | 0.0 | Weight for outcome reward |
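Assuming these weights form a simple linear blend (an inference from the field names, not confirmed here), the scalar training reward would look like:

def blend_reward(env_reward, event_reward, outcome_reward,
                 w_env=1.0, w_event=0.0, w_outcome=0.0):
    # With the defaults above, only the environment reward contributes.
    return w_env * env_reward + w_event * event_reward + w_outcome * outcome_reward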

Judge Options [judge.options]

| Parameter | Type | Default | Description |
|---|---|---|---|
| event | bool | null | Enable event judging |
| outcome | bool | null | Enable outcome judging |
| provider | string | null | Judge provider |
| model | string | null | Judge model |
| rubric_id | string | null | Rubric identifier |
| rubric_overrides | dict | null | Rubric overrides |
| tracks | list[str] | null | Tracks to judge |
| weights | dict | null | Track weights |
| max_concurrency | int | null | Max concurrent judge calls |

Returns

from synth_ai.sdk.api.train.rl import RLJob

job = RLJob.from_config("rl.toml")
job.submit()
result = job.poll_until_complete()

# Get results
print(f"Status: {result['status']}")
print(f"Final Reward: {result.get('final_reward', 'N/A')}")

# Get model ID
model_id = result.get("model_id")
# e.g., "ft:Qwen/Qwen3-0.6B:job_658ba4f3a93845aa"

Results Structure

{
    "status": "succeeded",
    "final_reward": 0.85,
    "model_id": "ft:Qwen/Qwen3-0.6B:job_abc123",
    "checkpoints": [
        {"step": 100, "path": "..."},
        {"step": 200, "path": "..."},
    ],
}
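Once the job has finished, the result is plain data. Continuing from the RLJob snippet above, you can, for example, grab the trained model ID and the latest checkpoint:

model_id = result["model_id"]
latest = max(result.get("checkpoints", []), key=lambda c: c["step"], default=None)
if latest is not None:
    print(f"Latest checkpoint at step {latest['step']}: {latest['path']}")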

Using Your Model

Dev Inference (testing):
import os

from synth_ai.sdk import InferenceClient

client = InferenceClient(
    base_url="https://agent-learning.onrender.com",
    api_key=os.environ["SYNTH_API_KEY"],
)

# Call this from inside an async function (e.g., run it via asyncio.run).
response = await client.create_chat_completion(
    model="ft:Qwen/Qwen3-0.6B:job_abc123",
    messages=[{"role": "user", "content": "Hello!"}],
)

Export to HuggingFace:
uvx synth-ai artifacts export ft:Qwen/Qwen3-0.6B:job_abc123 \
  --repo-id myorg/my-rl-model \
  --private
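After the export completes, the repo can be pulled like any Hugging Face model. The snippet below assumes the export produces a standard Transformers checkpoint with merged weights; if it ships a LoRA adapter instead, load it with peft on top of the base model.

from transformers import AutoModelForCausalLM, AutoTokenizer

# "myorg/my-rl-model" is the --repo-id used in the export command above.
tokenizer = AutoTokenizer.from_pretrained("myorg/my-rl-model")
model = AutoModelForCausalLM.from_pretrained("myorg/my-rl-model")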

List Your Models

uvx synth-ai status models --type rl