GSPO (Group-based Sequence Policy Optimization) trains your model through interaction with a task environment. Unlike SFT, which learns from fixed examples, RL learns from rewards: your model generates responses, receives feedback, and improves iteratively.

When to Use RL

Best for:
  • Complex multi-step tasks (agents, tool use, reasoning)
  • Tasks where exploration helps discover better strategies
  • When you have a reward signal but not ideal demonstrations
  • Improving beyond what SFT can achieve
Consider SFT instead if:
  • You have high-quality demonstration data
  • The task is straightforward (classification, simple QA)
  • You want faster iteration cycles

Prerequisites

# Required environment variables in .env
SYNTH_API_KEY=sk_...           # For authentication
TASK_APP_URL=https://...       # Your deployed task app
ENVIRONMENT_API_KEY=sk_env_... # Task app authentication
Install the CLI:
pip install synth-ai
# or
uvx synth-ai --help
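
A quick sanity check that these variables are present before launching anything (a standalone sketch; load .env into the environment first, for example with python-dotenv or your shell):
import os

required = ["SYNTH_API_KEY", "TASK_APP_URL", "ENVIRONMENT_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")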

How GSPO Works

  1. Collect rollouts: Your model interacts with the task environment, generating trajectories
  2. Compute rewards: Each trajectory receives a score (from env, judge, or both)
  3. Calculate advantages: Compare each trajectory to others in its group (see the sketch after this list)
  4. Update policy: Increase probability of high-advantage responses, decrease low ones
  5. Repeat: Sync updated weights to inference, collect new rollouts
The key insight: by grouping trajectories from the same prompt/seed, GSPO reduces variance and learns more efficiently.
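
A minimal sketch of that grouping idea (illustrative only; the trainer's exact advantage estimator may differ):
from statistics import mean, pstdev

def group_advantages(returns: list[float], eps: float = 1e-8) -> list[float]:
    """Score each trajectory relative to the others in its group.

    The group mean acts as a baseline, so a rollout only earns a positive
    advantage by doing better than its peers from the same prompt/seed.
    """
    baseline = mean(returns)
    spread = pstdev(returns)
    return [(r - baseline) / (spread + eps) for r in returns]

# One group of 4 rollouts from the same seed (group_size = 4)
print(group_advantages([0.2, 0.8, 0.5, 0.5]))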

Step 1: Build Your Task App

Your task app defines the environment your model interacts with. It must expose a /rollout endpoint that:
  1. Receives rollout requests from the trainer
  2. Runs episodes (model generates actions, environment responds)
  3. Returns trajectories with rewards
See Task App Guide for implementation details. Example task app structure:
from fastapi import FastAPI
from pydantic import BaseModel

# Minimal request schema covering the fields used below; the real
# RolloutRequest may carry additional fields (see the Task App Guide).
class EnvSpec(BaseModel):
    seed: int

class RolloutRequest(BaseModel):
    batch_size: int
    max_turns: int
    inference_url: str
    env: EnvSpec

app = FastAPI()

@app.post("/rollout")
async def rollout(request: RolloutRequest):
    trajectories = []

    for _ in range(request.batch_size):
        # Run one episode (`env` is your task environment instance)
        obs = env.reset(seed=request.env.seed)
        steps = []
        total_reward = 0.0

        for turn in range(request.max_turns):
            # Model generates an action via the trainer-provided inference_url
            action = await call_model(obs, request.inference_url)

            # Environment responds
            obs, reward, done, info = env.step(action)
            total_reward += reward
            steps.append({"obs": obs, "action": action, "reward": reward})

            if done:
                break

        trajectories.append({
            "steps": steps,
            "episode_return": total_reward
        })

    return {"trajectories": trajectories, "metrics": {...}}

Step 2: Deploy Your Task App

The trainer needs to reach your task app over the internet:
synth-ai deploy tunnel path/to/task_app.py \
  --env .env \
  --tunnel-mode managed \
  --tunnel-subdomain my-rl-task
This writes TASK_APP_URL to your .env file. Verify it’s accessible:
curl https://my-rl-task.usesynth.ai/health
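
The same check from Python, if you prefer (reads the URL that the deploy command wrote to .env; load it into the environment first):
import os
import urllib.request

url = os.environ["TASK_APP_URL"].rstrip("/") + "/health"
with urllib.request.urlopen(url, timeout=10) as resp:
    print(resp.status, resp.read().decode())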

Step 3: Create the Configuration

Create a TOML file defining your training parameters:
[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[services]
task_url = "https://my-rl-task.usesynth.ai"  # From Step 2

[compute]
gpu_type = "H100"
gpu_count = 8

[topology]
type = "single_node_split"
gpus_for_vllm = 4      # Inference
gpus_for_training = 3  # Policy updates
gpus_for_ref = 1       # Reference model (for KL penalty)
tensor_parallel = 4

[model]
base = "Qwen/Qwen3-4B"
label = "my-rl-agent"

[rollout]
env_name = "my_environment"
policy_name = "my_policy"
max_turns = 10
episodes_per_batch = 64
max_concurrent_rollouts = 32

[training]
num_epochs = 1
iterations_per_epoch = 100
batch_size = 64
group_size = 4
learning_rate = 1e-5
log_interval = 1
weight_sync_interval = 1

[evaluation]
instances = 10
every_n_iters = 10
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
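
Optionally, run a quick local sanity check that the topology split adds up to gpu_count (a standalone sketch using the standard library's tomllib, Python 3.11+; not a synth-ai command):
import tomllib

with open("my_rl_config.toml", "rb") as f:
    cfg = tomllib.load(f)

topo = cfg["topology"]
used = topo["gpus_for_vllm"] + topo["gpus_for_training"] + topo["gpus_for_ref"]
total = cfg["compute"]["gpu_count"]
assert used == total, f"topology assigns {used} GPUs but gpu_count is {total}"
print(f"GPU split OK: {used}/{total}")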

Configuration Reference

Algorithm

Field    Value              Description
type     "online"           On-policy learning (RL)
method   "policy_gradient"  Policy gradient algorithm
variety  "gspo"             Group-based sequence optimization

Compute & Topology

Parameter          Description
gpu_type           GPU type (H100, A100)
gpu_count          Total GPUs for the job
gpus_for_vllm      GPUs for fast inference during rollouts
gpus_for_training  GPUs for policy gradient updates
gpus_for_ref       GPUs for reference model (KL penalty)
tensor_parallel    Tensor parallelism for vLLM

Rollout

Parameter                Default  Description
max_turns                -        Maximum steps per episode
episodes_per_batch       -        Trajectories collected per training step
max_concurrent_rollouts  8        Parallel rollout requests
group_size               4        Trajectories per group (for advantage baseline)

Training

Parameter             Default  Description
num_epochs            1        Training epochs
iterations_per_epoch  100      Training steps per epoch
batch_size            64       Trajectories per training batch
group_size            4        Trajectories sharing the same prompt (for variance reduction)
learning_rate         1e-5     Policy learning rate
weight_sync_interval  1        Steps between syncing weights to inference

Step 4: Launch the Training Job

Using the CLI

synth-ai train --type rl --config my_rl_config.toml --poll
You’ll see progress updates:
[10:15:00]    0.0s  Status: queued
[10:15:30]   30.0s  Status: running | Iter: 1 | Mean Return: 0.45
[10:16:00]   60.0s  Status: running | Iter: 5 | Mean Return: 0.62
[10:16:30]   90.0s  Status: running | Iter: 10 | Mean Return: 0.78
...
[10:45:00] 1800.0s  Status: succeeded | Final Return: 1.24

Using Python

from synth_ai.sdk.api.train.rl import RLJob
import os

job = RLJob.from_config(
    config_path="my_rl_config.toml",
    task_app_url=os.environ["TASK_APP_URL"],
    api_key=os.environ["SYNTH_API_KEY"]
)

job_id = job.submit()
print(f"Job started: {job_id}")

result = job.poll_until_complete(timeout=7200.0)
print(f"Final return: {result.get('final_return')}")

Resume a Job

job = RLJob.from_job_id(
    job_id="rl_abc123",
    api_key=os.environ["SYNTH_API_KEY"]
)

status = job.get_status()
print(f"Status: {status['status']}")

Step 5: Use Your Trained Model

After training completes, you’ll receive a model ID:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.usesynth.ai/v1",
    api_key=os.environ["SYNTH_API_KEY"]
)

response = client.chat.completions.create(
    model="rl:qwen3-4b:my-org:abc123",  # Your RL-trained model
    messages=[
        {"role": "user", "content": "Your task prompt here"}
    ]
)

Adding Judge Scoring

For richer reward signals, enable the judge to score trajectories on multiple criteria:
[judge]
enabled = true
reward_blend = { env = 0.5, event = 0.3, outcome = 0.2 }

[judge.options]
event = true           # Score individual decisions
outcome = true         # Score final outcome
rubric_id = "my-rubric-id"
max_concurrency = 8

Reward Blending

The final reward combines multiple sources:
Source   Description                     Weight Example
env      Task environment rewards        0.5
event    Judge scores for each decision  0.3
outcome  Judge score for final result    0.2
final_reward = 0.5 * env_reward + 0.3 * event_score + 0.2 * outcome_score
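
A minimal sketch of that blend in code, with the weights taken from the reward_blend table above (assumes the three scores are already on comparable scales):
def blend_reward(env_reward: float, event_score: float, outcome_score: float,
                 weights: dict[str, float]) -> float:
    """Weighted combination of environment reward and judge scores."""
    return (weights["env"] * env_reward
            + weights["event"] * event_score
            + weights["outcome"] * outcome_score)

# reward_blend = { env = 0.5, event = 0.3, outcome = 0.2 }
print(blend_reward(1.0, 0.6, 0.8, {"env": 0.5, "event": 0.3, "outcome": 0.2}))  # 0.84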

Step Rewards (Process Supervision)

For fine-grained feedback on intermediate steps:
[training]
step_rewards_enabled = true
step_rewards_mode = "decision_stepwise"  # Reward each decision
step_rewards_beta = 0.0                  # Discount factor
event_rewards_kind = "unique"            # "unique" or "absolute"
Mode               Description
off                Only terminal reward
decision_stepwise  Reward each model decision
env_sparse         Sparse environment rewards
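
Purely as an illustration of what decision_stepwise credit with a discount factor could look like (the trainer's actual computation is not specified here and may differ):
def discounted_decision_returns(step_rewards: list[float], beta: float) -> list[float]:
    """Return-to-go per decision: r_t + beta * r_{t+1} + beta^2 * r_{t+2} + ...

    With beta = 0.0, as in the config above, each decision is credited
    only with its own immediate reward.
    """
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + beta * running
        returns[t] = running
    return returns

print(discounted_decision_returns([0.0, 0.5, 1.0], beta=0.0))  # [0.0, 0.5, 1.0]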

Monitoring Training

Key Metrics

Metric          What It Means                 Good Sign
mean_return     Average episode reward        Increasing over time
policy_loss     PPO clipped loss              Stable, not exploding
kl_divergence   Distance from reference       Low (< 0.1)
clip_fraction   How often clipping activates  0.1-0.3
advantage_mean  Average advantage             Near 0 (normalized)
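
To watch these from Python, one option is to poll the job status on an interval (RLJob and get_status are shown earlier on this page; the metric keys and terminal states in the returned payload are assumptions, so inspect it to see what your job actually reports):
import os
import time

from synth_ai.sdk.api.train.rl import RLJob

job = RLJob.from_job_id(job_id="rl_abc123", api_key=os.environ["SYNTH_API_KEY"])

while True:
    status = job.get_status()
    # Metric keys below are illustrative placeholders
    print(status.get("status"), status.get("mean_return"), status.get("kl_divergence"))
    if status.get("status") in ("succeeded", "failed", "cancelled"):  # assumed terminal states
        break
    time.sleep(30)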

Supported Models

See Supported Models for the full list of trainable models.