When to Use RL
Best for:

- Complex multi-step tasks (agents, tool use, reasoning)
- Tasks where exploration helps discover better strategies
- When you have a reward signal but not ideal demonstrations
- Improving beyond what SFT can achieve
Consider SFT instead when:

- You have high-quality demonstration data
- The task is straightforward (classification, simple QA)
- You want faster iteration cycles
Prerequisites
How GSPO Works
1. Collect rollouts: Your model interacts with the task environment, generating trajectories
2. Compute rewards: Each trajectory receives a score (from the environment, the judge, or both)
3. Calculate advantages: Compare each trajectory to the others in its group (sketched below)
4. Update policy: Increase the probability of high-advantage responses, decrease it for low-advantage ones
5. Repeat: Sync updated weights to inference, collect new rollouts
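The advantage step is the heart of the group-based approach. Here is a minimal sketch, assuming rewards have already been grouped by shared prompt; it is an illustration, not the trainer's actual implementation:

```python
import statistics

def group_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: each trajectory's reward is compared
    against the mean (and spread) of the group it was sampled in."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in group_rewards]

# Four trajectories sampled from the same prompt (group_size = 4)
print(group_advantages([0.2, 0.9, 0.4, 0.9]))
```

Trajectories that beat their group's average get positive advantages and are pushed up in the policy update; those below it are pushed down.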
Step 1: Build Your Task App
Your task app defines the environment your model interacts with. It must expose a /rollout endpoint that:
- Receives rollout requests from the trainer
- Runs episodes (model generates actions, environment responds)
- Returns trajectories with rewards
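A minimal sketch of such an endpoint, assuming a FastAPI app; the request and response shapes here are hypothetical placeholders, and the real contract is defined by the trainer:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutRequest(BaseModel):
    # Hypothetical request shape for illustration only.
    run_id: str
    max_turns: int = 10

def step_episode(turn: int):
    """Placeholder environment step: returns (action, reward, done).
    Replace with real policy calls and environment logic."""
    return "noop", 0.0, turn >= 2

@app.post("/rollout")
async def rollout(req: RolloutRequest):
    trajectory, total_reward = [], 0.0
    for turn in range(req.max_turns):
        action, reward, done = step_episode(turn)
        trajectory.append({"turn": turn, "action": action, "reward": reward})
        total_reward += reward
        if done:
            break
    # Hypothetical response shape: the trajectory plus its reward.
    return {"run_id": req.run_id, "trajectory": trajectory, "reward": total_reward}
```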
Step 2: Deploy Your Task App
The trainer needs to reach your task app over the internet. Once the app is deployed, add TASK_APP_URL to your .env file and verify it’s accessible.
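One way to check reachability from your machine, assuming the task app serves a health route (the /health path is an assumption; use whatever route your app actually exposes):

```python
import os

import requests

task_app_url = os.environ["TASK_APP_URL"].rstrip("/")
resp = requests.get(f"{task_app_url}/health", timeout=10)  # path is an assumption
print(resp.status_code, resp.text)
```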
Step 3: Create the Configuration
Create a TOML file defining your training parameters. The reference below describes each section; a sketch of a complete file follows the tables.

Configuration Reference
Algorithm
| Field | Value | Description |
|---|---|---|
type | "online" | On-policy learning (RL) |
method | "policy_gradient" | Policy gradient algorithm |
variety | "gspo" | Group-based sequence optimization |
Compute & Topology
| Parameter | Description |
|---|---|
| gpu_type | GPU type (H100, A100) |
| gpu_count | Total GPUs for the job |
| gpus_for_vllm | GPUs for fast inference during rollouts |
| gpus_for_training | GPUs for policy gradient updates |
| gpus_for_ref | GPUs for reference model (KL penalty) |
| tensor_parallel | Tensor parallelism for vLLM |
Rollout
| Parameter | Default | Description |
|---|---|---|
| max_turns | - | Maximum steps per episode |
| episodes_per_batch | - | Trajectories collected per training step |
| max_concurrent_rollouts | 8 | Parallel rollout requests |
| group_size | 4 | Trajectories per group (for advantage baseline) |
Training
| Parameter | Default | Description |
|---|---|---|
| num_epochs | 1 | Training epochs |
| iterations_per_epoch | 100 | Training steps per epoch |
| batch_size | 64 | Trajectories per training batch |
| group_size | 4 | Trajectories sharing the same prompt (for variance reduction) |
| learning_rate | 1e-5 | Policy learning rate |
| weight_sync_interval | 1 | Steps between syncing weights to inference |
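Putting the reference together, a single file might look roughly like the sketch below. The section names and nesting are assumptions for illustration; the parameter names and defaults come from the tables above.

```toml
# Illustrative layout only; consult your trainer's schema for the exact section names.
[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[compute]
gpu_type = "H100"
gpu_count = 4
gpus_for_vllm = 2
gpus_for_training = 1
gpus_for_ref = 1
tensor_parallel = 2

[rollout]
max_turns = 10
episodes_per_batch = 64
max_concurrent_rollouts = 8
group_size = 4

[training]
num_epochs = 1
iterations_per_epoch = 100
batch_size = 64
group_size = 4
learning_rate = 1e-5
weight_sync_interval = 1
```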
Step 4: Launch the Training Job
Using the CLI
Using Python
Resume a Job
Step 5: Use Your Trained Model
After training completes, you’ll receive a model ID.
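How you call the model depends on where it is served. Purely as an illustration, if it sits behind an OpenAI-compatible chat endpoint (an assumption here; the base URL, key, and model ID are placeholders):

```python
from openai import OpenAI  # assumes an OpenAI-compatible serving endpoint

client = OpenAI(
    base_url="https://your-inference-endpoint/v1",  # placeholder
    api_key="YOUR_API_KEY",                         # placeholder
)
response = client.chat.completions.create(
    model="your-trained-model-id",  # the model ID returned after training
    messages=[{"role": "user", "content": "Plan the next three steps for this task."}],
)
print(response.choices[0].message.content)
```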
Adding Judge Scoring
For richer reward signals, enable the judge to score trajectories on multiple criteria; a configuration sketch follows the reward blending table below.
Reward Blending
The final reward combines multiple sources:

| Source | Description | Weight Example |
|---|---|---|
| env | Task environment rewards | 0.5 |
| event | Judge scores for each decision | 0.3 |
| outcome | Judge score for final result | 0.2 |
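A sketch of what this could look like in the TOML file. The section and key names are assumptions; only the three sources and the example weights come from the table above.

```toml
# Illustrative only; check your trainer's schema for the real judge/blending keys.
[judge]
enabled = true

[rewards]
env = 0.5      # task environment rewards
event = 0.3    # judge scores for each decision
outcome = 0.2  # judge score for the final result
```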
Step Rewards (Process Supervision)
For fine-grained feedback on intermediate steps:

| Mode | Description |
|---|---|
| off | Only terminal reward |
| decision_stepwise | Reward each model decision |
| env_sparse | Sparse environment rewards |
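Selecting a mode presumably comes down to a single config field. The section and key names below are assumptions; only the mode values come from the table.

```toml
[step_rewards]              # section name is an assumption
mode = "decision_stepwise"  # alternatives: "off", "env_sparse"
```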
Monitoring Training
Key Metrics
| Metric | What It Means | Good Sign |
|---|---|---|
| mean_return | Average episode reward | Increasing over time |
| policy_loss | PPO clipped loss | Stable, not exploding |
| kl_divergence | Distance from reference | Low (< 0.1) |
| clip_fraction | How often clipping activates | 0.1-0.3 |
| advantage_mean | Average advantage | Near 0 (normalized) |
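A small sketch of the kind of sanity check these ranges suggest, assuming the metrics can be read into a plain dict (how you fetch them depends on your monitoring setup):

```python
def check_training_health(metrics: dict) -> list[str]:
    """Flag metrics that fall outside the rough ranges in the table above."""
    warnings = []
    if metrics.get("kl_divergence", 0.0) > 0.1:
        warnings.append("kl_divergence is high: the policy is drifting far from the reference")
    clip = metrics.get("clip_fraction")
    if clip is not None and not (0.1 <= clip <= 0.3):
        warnings.append("clip_fraction outside 0.1-0.3: check learning rate and advantages")
    return warnings

print(check_training_health({"kl_divergence": 0.25, "clip_fraction": 0.05}))
```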