GSPO (Group Sequence Policy Optimization) trains model weights via policy gradient RL with pipelined rollouts.
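Conceptually, GSPO is a PPO-style policy-gradient update in which the importance ratio is computed per sequence (length-normalized over tokens) and advantages are normalized within each sampled group. The following is a minimal sketch of that loss, assuming summed per-sequence log-probs and an illustrative clip range; it is not the trainer's actual implementation.

# Hedged sketch of the GSPO update rule (illustrative, not the library's internal code).
# For each prompt, a group of G sampled completions is scored; advantages are the
# group-normalized rewards, and the importance ratio is computed at the sequence
# level (geometric mean over tokens) before PPO-style clipping.
import torch

def gspo_loss(logp_new, logp_old, rewards, seq_lens, clip_eps=0.2):
    # logp_new / logp_old: summed log-probs of each sampled sequence, shape [G]
    # rewards: scalar reward per sequence, shape [G]; seq_lens: token count per sequence
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group-relative advantage
    ratio = torch.exp((logp_new - logp_old) / seq_lens)         # sequence-level (length-normalized) ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()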

1. Install Demo

uvx synth-ai demo rl
cd demo_rl
Creates task_app.py (the math task app), train_cfg.toml (training config), and main.py (the runner).

2. Setup Credentials

uvx synth-ai setup
Opens a browser to fetch SYNTH_API_KEY and ENVIRONMENT_API_KEY, then saves them to .env.
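After setup completes, .env should contain both keys; the values below are placeholders:

SYNTH_API_KEY=<your-synth-api-key>
ENVIRONMENT_API_KEY=<your-environment-api-key>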

3. Deploy Task App

uvx synth-ai deploy tunnel task_app.py --env .env
Starts a local server plus a Cloudflare tunnel. Note the tunnel URL; you will pass it to the train command.

4. Train

uvx synth-ai train --config train_cfg.toml --task-url <TUNNEL_URL> --poll
Streams training progress. Runtime: 1-4 hours depending on model size and iterations.

Minimal Config

[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[services]
task_url = "https://your-tunnel.trycloudflare.com"

[compute]
gpu_type = "H100"
gpu_count = 2

[topology]
type = "single_node_split"
gpus_for_vllm = 1
gpus_for_training = 1
gpus_for_ref = 0

[model]
base = "Qwen/Qwen3-4B"
trainer_mode = "lora"
label = "my-rl-model"

[rollout]
env_name = "math"
policy_name = "math-policy"
max_turns = 1
episodes_per_batch = 32
max_concurrent_rollouts = 8

[training]
num_epochs = 1
iterations_per_epoch = 20
batch_size = 16
group_size = 4
learning_rate = 5e-5

[training.lora]
r = 16
alpha = 32
dropout = 0.1
target_modules = ["q_proj", "v_proj"]

[evaluation]
instances = 20
every_n_iters = 10
seeds = [0, 1, 2, 3, 4]
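One easy mistake is a topology split that does not add up to compute.gpu_count (here 1 vLLM + 1 training + 0 reference = 2). A small local sanity check you could run before submitting, assuming only that the config is valid TOML with the sections above:

# Sanity-check the GPU split in train_cfg.toml before submitting a job (Python 3.11+).
import tomllib

with open("train_cfg.toml", "rb") as f:
    cfg = tomllib.load(f)

topo = cfg["topology"]
split = topo["gpus_for_vllm"] + topo["gpus_for_training"] + topo["gpus_for_ref"]
assert split == cfg["compute"]["gpu_count"], (
    f"topology allocates {split} GPUs but compute.gpu_count is {cfg['compute']['gpu_count']}"
)
print("GPU split OK:", topo)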

Key Parameters

Parameter             Purpose
compute.gpu_count     GPU allocation (minimum 2 for RL)
model.trainer_mode    "lora", "qlora", or "full"
rollout.max_turns     Steps per episode
training.batch_size   Training batch size
training.group_size   GSPO group size

Reward Structure

The task app returns rewards via RolloutResponse.metrics:
return RolloutResponse(
    run_id=request.run_id,
    metrics=RolloutMetrics(
        mean_return=1.0 if correct else 0.0,
        episode_returns=[1.0 if correct else 0.0],
        num_steps=1,
        num_episodes=1,
    ),
    trajectories=[...],
)
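The demo's generated task_app.py decides correctness its own way; purely as an illustration of where the `correct` flag above could come from, here is a hypothetical grading helper for a math task (names and logic are assumptions, not the demo's code):

# Hypothetical grading helper: compare the last number in the completion to the answer.
import re

def grade(completion: str, expected: str) -> bool:
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return bool(nums) and float(nums[-1]) == float(expected)

correct = grade("The answer is 42.", "42")  # True, so mean_return = 1.0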

Get Results

from synth_ai.sdk.api.train.rl import RLJob

job = RLJob.from_config("train_cfg.toml")
job.submit()
result = job.poll_until_complete()

model_id = result.get("model_id")
# e.g., "ft:Qwen/Qwen3-4B:job_abc123"
Use trained model:
# Dev inference
curl -X POST https://agent-learning.onrender.com/api/inference/chat \
  -H "Authorization: Bearer $SYNTH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "ft:Qwen/Qwen3-4B:job_abc123", "messages": [...]}'

# Export to HuggingFace
uvx synth-ai artifacts export ft:Qwen/Qwen3-4B:job_abc123 --repo-id myorg/model
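The dev inference endpoint can also be called from Python. This sketch mirrors the curl command above and simply prints the raw JSON response rather than assuming a particular response shape:

# Call the dev inference endpoint with the fine-tuned model (payload mirrors the curl example).
import os
import requests

resp = requests.post(
    "https://agent-learning.onrender.com/api/inference/chat",
    headers={"Authorization": f"Bearer {os.environ['SYNTH_API_KEY']}"},
    json={
        "model": "ft:Qwen/Qwen3-4B:job_abc123",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    },
)
resp.raise_for_status()
print(resp.json())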