In reinforcement learning, judges are LLM-based evaluators that score trajectories in real time during training. Unlike environment rewards (which measure task completion), judge rewards measure quality, style, and alignment with desired behaviors. Once your config includes judges, run it via synth-ai train to see the combined rewards in action.

Overview

Judges enable:
  • Reward shaping – Combine environment rewards with LLM-based quality scores
  • Preference alignment – Train agents to exhibit preferred behaviors (helpfulness, safety, etc.)
  • Multi-objective optimization – Balance multiple reward signals (task completion + reasoning quality)
  • Automated feedback – No manual reward engineering; judges provide rich signals

1. How Judges Work in RL

Training Loop with Judges

1. Policy generates trajectory
   ├─> Actions: [move_north, craft_pickaxe, mine_stone]
   └─> Observations: [state_0, state_1, state_2, state_3]

2. Environment returns base rewards
   └─> env_reward = +5 (for completing subtask)

3. Judge evaluates trajectory quality
   ├─> Sends trace to judge service
   ├─> Judge LLM scores based on rubric
   └─> judge_reward = 0.85 (quality score 0-1)

4. Combine rewards for training
   └─> total_reward = env_reward + (judge_weight * judge_reward)
                    = 5 + (10 * 0.85) = 13.5

5. Update policy with combined reward
   └─> Policy learns: "This trajectory was good!"
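
As a rough sketch, the combination in step 4 boils down to one line of arithmetic. The helper below is illustrative only, not the trainer's actual code:

# Illustrative only: the reward combination from step 4 above.
def combine_rewards(env_reward: float, judge_score: float, judge_weight: float = 10.0) -> float:
    """Scale the 0-1 judge score and add it to the environment reward."""
    return env_reward + judge_weight * judge_score

print(combine_rewards(env_reward=5.0, judge_score=0.85))  # 13.5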

Judge vs Environment Rewards

| Reward Type | Source | Examples | When to Use |
|---|---|---|---|
| Environment | Task logic | Points, health, achievements | Clear success metrics |
| Judge | LLM evaluation | Quality, reasoning, safety | Subjective criteria |
| Combined | Both | Weighted sum | Most RL scenarios |

2. Configuring Judges for RL

Basic Judge Configuration

Add a [judge] section to your RL config:
# configs/rl_with_judges.toml
[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[services]
task_url = "https://my-task-app.modal.run"

[model]
base = "Qwen/Qwen3-4B"
label = "crafter-rl-with-judges"

[compute]
gpu_type = "H100"
gpu_count = 8

[topology]
type = "single_node_split"
gpus_for_vllm = 4
gpus_for_training = 3
gpus_for_ref = 1

# ✅ Judge Configuration
[judge]
enabled = true                     # Enable judge scoring
type = "openai"                    # Provider: "openai", "groq", or "anthropic"
model = "gpt-4o-mini"              # Judge model
api_key_env = "OPENAI_API_KEY"     # Environment variable for API key
rubric_id = "crafter/bundle@v1"    # Rubric to use for scoring

# Judge reward weights
event_weight = 0.0                 # Per-step judge rewards (0=disabled)
outcome_weight = 1.0               # Episode-level judge rewards (1.0=enabled)

# Combine with environment rewards
combine_mode = "weighted_sum"      # "weighted_sum", "multiply", or "replace"
judge_scale = 10.0                 # Scale judge scores (0-1) to match env rewards

[training]
num_epochs = 1
iterations_per_epoch = 20
batch_size = 16
group_size = 4
learning_rate = 5e-5

Judge Providers

OpenAI

[judge]
enabled = true
type = "openai"
model = "gpt-4o-mini"              # Fast and cost-effective
api_key_env = "OPENAI_API_KEY"
rubric_id = "your-task/rubric@v1"
outcome_weight = 1.0
judge_scale = 10.0
Best for: High-quality scoring, structured output, function calling

Groq (Fast & Cheap)

[judge]
enabled = true
type = "groq"
model = "llama-3.3-70b-versatile"  # Fast Llama 3.3
api_key_env = "GROQ_API_KEY"
rubric_id = "your-task/rubric@v1"
outcome_weight = 1.0
judge_scale = 10.0
Best for: Rapid iteration, cost-sensitive training, large batches

Anthropic (Detailed Reasoning)

[judge]
enabled = true
type = "anthropic"
model = "claude-3-5-sonnet-20241022"
api_key_env = "ANTHROPIC_API_KEY"
rubric_id = "your-task/rubric@v1"
outcome_weight = 1.0
judge_scale = 10.0
Best for: Complex reasoning evaluation, safety-critical tasks

3. Defining Rubrics

Rubrics tell judges what to evaluate. Define them in your task app:

Example Rubric (Crafter Agent)

# In your task app
from synth_ai.task.server import RubricBundle, GlobalRubric

rubrics = RubricBundle(
    global_rubric=GlobalRubric(
        id="crafter/bundle@v1",
        description="Evaluate Crafter agent performance",
        criteria=[
            "Achievement progression (collect resources, craft tools)",
            "Survival capability (manage health, avoid death)",
            "Exploration behavior (discover new areas efficiently)",
            "Resource efficiency (minimize wasted actions)",
            "Strategic planning (coherent goal-directed behavior)"
        ],
        scoring_guide={
            "0.9-1.0": "Excellent: Multiple achievements, efficient, strategic",
            "0.7-0.9": "Good: Some achievements, mostly efficient",
            "0.5-0.7": "Adequate: Basic progress, some inefficiency",
            "0.3-0.5": "Poor: Minimal progress, inefficient",
            "0.0-0.3": "Very poor: No progress, random actions"
        }
    )
)

# Add to TaskAppConfig
config = TaskAppConfig(
    app_id="grpo-crafter",
    name="Crafter Agent",
    rubrics=rubrics,
    # ... other config
)

Rubric Best Practices

  1. Clear criteria – Specific, measurable aspects of performance
  2. Scoring scale – Consistent 0-1 scale with clear thresholds
  3. Examples – Provide examples of good/bad trajectories (see the sketch after this list)
  4. Alignment – Rubric should align with training objectives
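
To make practice 3 concrete, here is one way to spell out good/bad trajectory examples that a judge prompt can reference. This uses a plain dictionary rather than any specific synth-ai type, so treat it as an illustration to adapt:

# Illustrative example text for a rubric; not tied to a specific synth-ai API.
trajectory_examples = {
    "good": (
        "Collected wood early, crafted a pickaxe, mined stone, and kept health "
        "above half for the whole episode."
    ),
    "bad": (
        "Wandered without gathering resources, repeated the same action for many "
        "steps, and died at night without ever crafting a tool."
    ),
}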

4. Reward Combination Strategies

Weighted Sum (Default)

[judge]
combine_mode = "weighted_sum"
outcome_weight = 1.0
judge_scale = 10.0
# total_reward = env_reward + (judge_scale * judge_score * outcome_weight)
# Example: 5 + (10 * 0.85 * 1.0) = 13.5
Use when: You want judges to augment environment rewards

Multiplicative

[judge]
combine_mode = "multiply"
outcome_weight = 1.0
judge_scale = 1.0
# total_reward = env_reward * (judge_scale * judge_score)
# Example: 5 * (1.0 * 0.85) = 4.25
Use when: Judge scores should act as quality multipliers

Replace (Judge Only)

[judge]
combine_mode = "replace"
outcome_weight = 1.0
judge_scale = 10.0
# total_reward = judge_scale * judge_score
# Example: 10 * 0.85 = 8.5 (ignores env_reward)
Use when: Environment rewards are unreliable or unavailable
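
Putting the three modes side by side, a minimal sketch of the combination logic looks like this (names are illustrative and outcome_weight defaults to 1.0; the trainer's internal implementation may differ):

# Illustrative sketch of the three combine modes described above.
def total_reward(env_reward: float, judge_score: float, combine_mode: str,
                 judge_scale: float, outcome_weight: float = 1.0) -> float:
    judge_term = judge_scale * judge_score * outcome_weight
    if combine_mode == "weighted_sum":
        return env_reward + judge_term
    if combine_mode == "multiply":
        return env_reward * judge_term
    if combine_mode == "replace":
        return judge_term
    raise ValueError(f"unknown combine_mode: {combine_mode}")

print(total_reward(5.0, 0.85, "weighted_sum", judge_scale=10.0))  # 13.5
print(total_reward(5.0, 0.85, "multiply", judge_scale=1.0))       # 4.25
print(total_reward(5.0, 0.85, "replace", judge_scale=10.0))       # 8.5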

5. Event vs Outcome Judges

Event-Level Judges (Per-Step)

Grade each action/decision in the trajectory:
[judge]
event_weight = 1.0        # Enable per-step judging
outcome_weight = 0.0      # Disable episode-level judging
judge_scale = 1.0         # Small scale for frequent rewards
Pros: Dense reward signal, helps with credit assignment
Cons: Expensive (many LLM calls), can be noisy
Best for: Long episodes, sparse environment rewards

Outcome-Level Judges (Episode)

Grade the entire trajectory once at the end:
[judge]
event_weight = 0.0        # Disable per-step judging
outcome_weight = 1.0      # Enable episode-level judging
judge_scale = 10.0        # Larger scale for sparse rewards
Pros: Cost-effective, holistic evaluation, less noise
Cons: Sparse signal, harder credit assignment
Best for: Short episodes, clear success/failure states

Hybrid (Both)

[judge]
event_weight = 0.5        # Moderate per-step rewards
outcome_weight = 1.0      # Full episode-level rewards
judge_scale = 5.0
Balance dense feedback with holistic evaluation.
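
As a sketch of how the two weights interact, per-step (event) judge scores and the episode-level (outcome) score could be folded into a trajectory's rewards as follows; the function and its example inputs are illustrative, not the trainer's internals:

# Illustrative: fold event-level and outcome-level judge scores into per-step rewards.
def shaped_rewards(step_env_rewards: list[float], event_scores: list[float],
                   outcome_score: float, event_weight: float,
                   outcome_weight: float, judge_scale: float) -> list[float]:
    rewards = [
        r + judge_scale * event_weight * s
        for r, s in zip(step_env_rewards, event_scores)
    ]
    rewards[-1] += judge_scale * outcome_weight * outcome_score  # episode-level bonus on the final step
    return rewards

print(shaped_rewards([0.0, 1.0, 0.0], [0.4, 0.8, 0.4], outcome_score=0.8,
                     event_weight=0.5, outcome_weight=1.0, judge_scale=5.0))
# [1.0, 3.0, 5.0]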

6. Monitoring Judge Performance

Training Logs

The trainer emits judge metrics:
[ITER 5] mean_return=12.3, env_return=5.2, judge_return=7.1
[RUBRIC_CALL] policy=crafter-react, provider=openai, model=gpt-4o-mini
[RUBRIC_SCORE] event=0.75, outcome=0.85, combined=8.5

Dashboard Metrics

Track these metrics during training:
  • Mean judge score – Average quality across trajectories
  • Judge/env ratio – Balance between judge and environment rewards
  • Judge variance – Consistency of judge scoring
  • Judge call latency – Time spent on judge evaluations
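
If you log per-episode environment returns and judge scores yourself, the first three metrics can be computed in a few lines. A small sketch with made-up example values:

# Sketch: dashboard-style judge metrics from per-episode values you log yourself.
from statistics import mean, pvariance

judge_scale = 10.0
env_returns = [5.0, 3.0, 6.0, 4.0]     # example per-episode environment returns
judge_scores = [0.7, 0.9, 0.8, 0.6]    # example per-episode judge scores (0-1)

judge_returns = [judge_scale * s for s in judge_scores]
print("mean judge score:", mean(judge_scores))
print("judge/env ratio: ", mean(judge_returns) / mean(env_returns))
print("judge variance:  ", pvariance(judge_scores))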

Validating Judges

Periodically check that judges align with your goals:
# Run evaluation with and without judges
uvx synth-ai eval \
  --model ft:your-model:checkpoint-100 \
  --seeds 1-20 \
  --config configs/eval_with_judges.toml

uvx synth-ai eval \
  --model ft:your-model:checkpoint-100 \
  --seeds 1-20 \
  --config configs/eval_no_judges.toml

# Compare metrics
# Do judge-trained models perform better on your target metrics?

7. Advanced: Async Judge Pipeline

For high-throughput RL, judges run asynchronously:
[training]
pipeline_async_enabled = true     # Enable async judging
async_semaphore_max = 16          # Max concurrent judge calls
judge_timeout_s = 30.0            # Timeout per judge call
How it works:
  1. Trainer collects trajectory batch
  2. Submits all trajectories to judge service in parallel
  3. Continues generating next batch while judges score previous batch
  4. Consumes judge results asynchronously for training
Benefits: 10-20% faster training, better GPU utilization
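
Conceptually, the concurrency cap behaves like an asyncio semaphore wrapped around each judge call. The sketch below is not the trainer's actual code, and judge_trajectory is a hypothetical stand-in for your judge client:

# Conceptual sketch of bounded-concurrency judging; judge_trajectory is a stand-in.
import asyncio

async def judge_trajectory(trajectory: dict) -> float:
    await asyncio.sleep(0.1)  # placeholder for an LLM judge API call
    return 0.85

async def judge_batch(trajectories: list[dict], max_concurrent: int = 16,
                      timeout_s: float = 30.0) -> list[float]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def judged(traj: dict) -> float:
        async with semaphore:  # at most max_concurrent judge calls in flight
            return await asyncio.wait_for(judge_trajectory(traj), timeout=timeout_s)

    return await asyncio.gather(*(judged(t) for t in trajectories))

scores = asyncio.run(judge_batch([{"id": i} for i in range(32)]))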

8. Cost Optimization

Reduce Judge Costs

  1. Use cheaper providers
    [judge]
    type = "groq"
    model = "llama-3.3-70b-versatile"  # ~10x cheaper than GPT-4o
    
  2. Outcome-only judging
    [judge]
    event_weight = 0.0      # Disable per-step (expensive)
    outcome_weight = 1.0    # Only judge full episodes
    
  3. Sample trajectories
    [judge]
    sample_rate = 0.5       # Judge only 50% of trajectories
    
  4. Shorter episodes
    [rollout]
    max_turns = 20          # Limit episode length
    

Cost Comparison

| Setup | Judge Calls/Episode | Cost/1000 Episodes |
|---|---|---|
| Event + Outcome (GPT-4o-mini) | ~20 | $2.00 |
| Outcome only (GPT-4o-mini) | 1 | $0.10 |
| Outcome only (Groq Llama) | 1 | $0.01 |
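
These figures follow from calls per episode times cost per call. A back-of-the-envelope estimator, where the per-call cost is an assumption to replace with your provider's actual pricing:

# Rough cost estimator; cost_per_call is an assumed figure, not provider pricing.
def judge_cost_per_1000_episodes(calls_per_episode: float, cost_per_call: float) -> float:
    return 1000 * calls_per_episode * cost_per_call

# e.g. ~20 calls/episode at an assumed $0.0001 per call -> $2.00 per 1000 episodes
print(judge_cost_per_1000_episodes(20, 0.0001))  # 2.0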

9. Complete Example

Here’s a full RL config with judges for Crafter:
[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[services]
task_url = "https://my-crafter-app.modal.run"

[model]
base = "Qwen/Qwen3-4B"
label = "crafter-rl-judged-v1"

[compute]
gpu_type = "H100"
gpu_count = 8

[topology]
type = "single_node_split"
gpus_for_vllm = 4
gpus_for_training = 3
gpus_for_ref = 1

[judge]
enabled = true
type = "openai"
model = "gpt-4o-mini"
api_key_env = "OPENAI_API_KEY"
rubric_id = "crafter/bundle@v1"
outcome_weight = 1.0
event_weight = 0.0
judge_scale = 10.0
combine_mode = "weighted_sum"

[rollout]
env_name = "crafter"
max_turns = 50
episodes_per_batch = 8
policy_name = "crafter-react"
max_concurrent_rollouts = 16

[training]
num_epochs = 2
iterations_per_epoch = 50
batch_size = 16
group_size = 4
learning_rate = 5e-5
gradient_accumulation_steps = 2

# Enable async judging for better throughput
pipeline_async_enabled = true
async_semaphore_max = 16

# Step rewards from environment
step_rewards_enabled = true
step_rewards_mode = "decision_stepwise"

[evaluation]
instances = 20
every_n_iters = 10
seeds = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
         110, 111, 112, 113, 114, 115, 116, 117, 118, 119]
Run training:
uvx synth-ai train \
  --type rl \
  --config configs/rl_with_judges.toml \
  --env-file .env \
  --poll

10. Troubleshooting

Judges Too Slow

Symptom: Training stalls waiting for judge responses.
Solutions:
  • Switch to Groq provider (faster)
  • Disable event-level judging (event_weight = 0.0)
  • Increase async_semaphore_max for more parallel calls
  • Reduce episode length (max_turns)

Judge Scores Too Harsh/Lenient

Symptom: All trajectories score 0.2 or all score 0.95.
Solutions:
  • Refine rubric criteria to be more discriminative
  • Adjust judge_scale to match environment reward magnitude
  • Switch judge model (GPT-4o > GPT-4o-mini > Llama)
  • Add explicit scoring examples in rubric

High Judge Costs

Symptom: Training bills are dominated by judge API calls.
Solutions:
  • Use Groq/Llama instead of OpenAI
  • Set outcome_weight = 1.0, event_weight = 0.0
  • Sample trajectories: sample_rate = 0.5
  • Shorter episodes: max_turns = 20

Judge Rewards Dominating

Symptom: Policy ignores environment rewards and optimizes only for the judge.
Solutions:
  • Reduce judge_scale (try 5.0 → 2.0)
  • Lower outcome_weight (try 1.0 → 0.5)
  • Use combine_mode = "multiply" instead of "weighted_sum"
  • Ensure environment rewards are meaningful

Summary

Judge-powered RL workflow:
  1. ✅ Define rubrics in your task app
  2. ✅ Enable judges in RL config with [judge] section
  3. ✅ Choose provider and model (openai/gpt-4o-mini recommended)
  4. ✅ Configure reward combination (weighted_sum + judge_scale)
  5. ✅ Run training with uvx synth-ai train --type rl
  6. ✅ Monitor judge metrics in training logs and dashboard
Judges enable preference alignment and quality-aware RL, letting you train agents that balance task completion with desired behaviors and reasoning quality.