When to Use RL
Best for:

- Complex multi-step tasks (agents, tool use, reasoning)
- Tasks where exploration helps discover better strategies
- When you have a reward signal but not ideal demonstrations
- Improving beyond what SFT can achieve
Consider SFT instead when:

- You have high-quality demonstration data
- The task is straightforward (classification, simple QA)
- You want faster iteration cycles
Prerequisites
How GSPO Works
1. Collect rollouts: Your model interacts with the task environment, generating trajectories
2. Compute rewards: Each trajectory receives a score (from the environment, the judge, or both)
3. Calculate advantages: Compare each trajectory to the others in its group (sketched below)
4. Update policy: Increase the probability of high-advantage responses, decrease it for low-advantage ones
5. Repeat: Sync updated weights to inference, collect new rollouts
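The advantage step is the heart of the group-based approach. Here is a minimal sketch, assuming rewards have already been grouped by shared prompt; it is an illustration, not the trainer's actual implementation:

```python
import statistics

def group_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: each trajectory's reward is compared
    against the mean (and spread) of the group it was sampled in."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in group_rewards]

# Four trajectories sampled from the same prompt (group_size = 4)
print(group_advantages([0.2, 0.9, 0.4, 0.9]))
```

Trajectories that beat their group's average get positive advantages and are pushed up in the policy update; those below it are pushed down.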
Step 1: Build Your Task App
Your task app defines the environment your model interacts with. It must expose a /rollout endpoint that:
- Receives rollout requests from the trainer
- Runs episodes (model generates actions, environment responds)
- Returns trajectories with rewards
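A minimal sketch of such an endpoint, assuming a FastAPI app; the request and response shapes here are hypothetical placeholders, and the real contract is defined by the trainer:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutRequest(BaseModel):
    # Hypothetical request shape for illustration only.
    run_id: str
    max_turns: int = 10

def step_episode(turn: int):
    """Placeholder environment step: returns (action, reward, done).
    Replace with real policy calls and environment logic."""
    return "noop", 0.0, turn >= 2

@app.post("/rollout")
async def rollout(req: RolloutRequest):
    trajectory, total_reward = [], 0.0
    for turn in range(req.max_turns):
        action, reward, done = step_episode(turn)
        trajectory.append({"turn": turn, "action": action, "reward": reward})
        total_reward += reward
        if done:
            break
    # Hypothetical response shape: the trajectory plus its reward.
    return {"run_id": req.run_id, "trajectory": trajectory, "reward": total_reward}
```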
Step 2: Deploy Your Task App
The trainer needs to reach your task app over the internet. Once the app is deployed, add TASK_APP_URL to your .env file and verify it’s accessible.
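One way to check reachability from your machine, assuming the task app serves a health route (the /health path is an assumption; use whatever route your app actually exposes):

```python
import os

import requests

task_app_url = os.environ["TASK_APP_URL"].rstrip("/")
resp = requests.get(f"{task_app_url}/health", timeout=10)  # path is an assumption
print(resp.status_code, resp.text)
```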
Step 3: Create the Configuration
Create a TOML file defining your training parameters. The reference below describes each section; a sketch of a complete file follows the tables.

Configuration Reference
Algorithm
| Field | Value | Description |
|---|---|---|
type | "online" | On-policy learning (RL) |
method | "policy_gradient" | Policy gradient algorithm |
variety | "gspo" | Group-based sequence optimization |
Compute & Topology
| Parameter | Description |
|---|---|
| gpu_type | GPU type (H100, A100) |
| gpu_count | Total GPUs for the job |
| gpus_for_vllm | GPUs for fast inference during rollouts |
| gpus_for_training | GPUs for policy gradient updates |
| gpus_for_ref | GPUs for reference model (KL penalty) |
| tensor_parallel | Tensor parallelism for vLLM |
Rollout
| Parameter | Default | Description |
|---|---|---|
| max_turns | - | Maximum steps per episode |
| episodes_per_batch | - | Trajectories collected per training step |
| max_concurrent_rollouts | 8 | Parallel rollout requests |
| group_size | 4 | Trajectories per group (for advantage baseline) |
Training
| Parameter | Default | Description |
|---|---|---|
| num_epochs | 1 | Training epochs |
| iterations_per_epoch | 100 | Training steps per epoch |
| batch_size | 64 | Trajectories per training batch |
| group_size | 4 | Trajectories sharing the same prompt (for variance reduction) |
| learning_rate | 1e-5 | Policy learning rate |
| weight_sync_interval | 1 | Steps between syncing weights to inference |
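Putting the reference together, a single file might look roughly like the sketch below. The section names and nesting are assumptions for illustration; the parameter names and defaults come from the tables above.

```toml
# Illustrative layout only; consult your trainer's schema for the exact section names.
[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[compute]
gpu_type = "H100"
gpu_count = 4
gpus_for_vllm = 2
gpus_for_training = 1
gpus_for_ref = 1
tensor_parallel = 2

[rollout]
max_turns = 10
episodes_per_batch = 64
max_concurrent_rollouts = 8
group_size = 4

[training]
num_epochs = 1
iterations_per_epoch = 100
batch_size = 64
group_size = 4
learning_rate = 1e-5
weight_sync_interval = 1
```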
Step 4: Launch the Training Job
Using the CLI
Using Python
Resume a Job
Step 5: Use Your Trained Model
After training completes, you’ll receive a model ID.
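How you call the model depends on where it is served. Purely as an illustration, if it sits behind an OpenAI-compatible chat endpoint (an assumption here; the base URL, key, and model ID are placeholders):

```python
from openai import OpenAI  # assumes an OpenAI-compatible serving endpoint

client = OpenAI(
    base_url="https://your-inference-endpoint/v1",  # placeholder
    api_key="YOUR_API_KEY",                         # placeholder
)
response = client.chat.completions.create(
    model="your-trained-model-id",  # the model ID returned after training
    messages=[{"role": "user", "content": "Plan the next three steps for this task."}],
)
print(response.choices[0].message.content)
```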
Adding Judge Scoring
For richer reward signals, enable the judge to score trajectories on multiple criteria; a configuration sketch follows the reward blending table below.
Reward Blending
The final reward combines multiple sources:

| Source | Description | Weight Example |
|---|---|---|
| env | Task environment rewards | 0.5 |
| event | Judge scores for each decision | 0.3 |
| outcome | Judge score for final result | 0.2 |
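A sketch of what this could look like in the TOML file. The section and key names are assumptions; only the three sources and the example weights come from the table above.

```toml
# Illustrative only; check your trainer's schema for the real judge/blending keys.
[judge]
enabled = true

[rewards]
env = 0.5      # task environment rewards
event = 0.3    # judge scores for each decision
outcome = 0.2  # judge score for the final result
```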
Step Rewards (Process Supervision)
For fine-grained feedback on intermediate steps:

| Mode | Description |
|---|---|
| off | Only terminal reward |
| decision_stepwise | Reward each model decision |
| env_sparse | Sparse environment rewards |
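Selecting a mode presumably comes down to a single config field. The section and key names below are assumptions; only the mode values come from the table.

```toml
[step_rewards]              # section name is an assumption
mode = "decision_stepwise"  # alternatives: "off", "env_sparse"
```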
Monitoring Training
Key Metrics
| Metric | What It Means | Good Sign |
|---|---|---|
| mean_return | Average episode reward | Increasing over time |
| policy_loss | PPO clipped loss | Stable, not exploding |
| kl_divergence | Distance from reference | Low (< 0.1) |
| clip_fraction | How often clipping activates | 0.1-0.3 |
| advantage_mean | Average advantage | Near 0 (normalized) |
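A small sketch of the kind of sanity check these ranges suggest, assuming the metrics can be read into a plain dict (how you fetch them depends on your monitoring setup):

```python
def check_training_health(metrics: dict) -> list[str]:
    """Flag metrics that fall outside the rough ranges in the table above."""
    warnings = []
    if metrics.get("kl_divergence", 0.0) > 0.1:
        warnings.append("kl_divergence is high: the policy is drifting far from the reference")
    clip = metrics.get("clip_fraction")
    if clip is not None and not (0.1 <= clip <= 0.3):
        warnings.append("clip_fraction outside 0.1-0.3: check learning rate and advantages")
    return warnings

print(check_training_health({"kl_divergence": 0.25, "clip_fraction": 0.05}))
```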