In reinforcement learning, judges are LLM-based evaluators that score trajectories in real time during training. Unlike environment rewards (which measure task completion), judge rewards measure quality, style, and alignment with desired behaviors. Once your config includes judges, run it via synth-ai train to see the combined rewards in action.

Overview

Judges enable:
  • Reward shaping – Combine environment rewards with LLM-based quality scores
  • Preference alignment – Train agents to exhibit preferred behaviors (helpfulness, safety, etc.)
  • Multi-objective optimization – Balance multiple reward signals (task completion + reasoning quality)
  • Automated feedback – No manual reward engineering; judges provide rich signals

1. How Judges Work in RL

Training Loop with Judges

1. Policy generates trajectory
   ├─> Actions: [move_north, craft_pickaxe, mine_stone]
   └─> Observations: [state_0, state_1, state_2, state_3]

2. Environment returns base rewards
   └─> env_reward = +5 (for completing subtask)

3. Judge evaluates trajectory quality
   ├─> Sends trace to judge service
   ├─> Judge LLM scores based on rubric
   └─> judge_reward = 0.85 (quality score 0-1)

4. Combine rewards for training
   └─> total_reward = env_reward + (judge_weight * judge_reward)
                    = 5 + (10 * 0.85) = 13.5

5. Update policy with combined reward
   └─> Policy learns: "This trajectory was good!"
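
As a rough sketch, the combination in step 4 boils down to one line of arithmetic. The helper below is illustrative only, not the trainer's actual code:

# Illustrative only: the reward combination from step 4 above.
def combine_rewards(env_reward: float, judge_score: float, judge_weight: float = 10.0) -> float:
    """Scale the 0-1 judge score and add it to the environment reward."""
    return env_reward + judge_weight * judge_score

print(combine_rewards(env_reward=5.0, judge_score=0.85))  # 13.5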

Judge vs Environment Rewards

| Reward Type | Source | Examples | When to Use |
|---|---|---|---|
| Environment | Task logic | Points, health, achievements | Clear success metrics |
| Judge | LLM evaluation | Quality, reasoning, safety | Subjective criteria |
| Combined | Both | Weighted sum | Most RL scenarios |

2. Configuring Judges for RL

Basic Judge Configuration

Add a [judge] section to your RL config:
# configs/rl_with_judges.toml
[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[services]
task_url = "https://my-task-app.modal.run"

[model]
base = "Qwen/Qwen3-4B"
label = "crafter-rl-with-judges"

[compute]
gpu_type = "H100"
gpu_count = 8

[topology]
type = "single_node_split"
gpus_for_vllm = 4
gpus_for_training = 3
gpus_for_ref = 1

# ✅ Judge Configuration
[judge]
enabled = true                     # Enable judge scoring
type = "openai"                    # Provider: "openai", "groq", or "anthropic"
model = "gpt-4o-mini"              # Judge model
api_key_env = "OPENAI_API_KEY"     # Environment variable for API key
rubric_id = "crafter/bundle@v1"    # Rubric to use for scoring

# Judge reward weights
event_weight = 0.0                 # Per-step judge rewards (0=disabled)
outcome_weight = 1.0               # Episode-level judge rewards (1.0=enabled)

# Combine with environment rewards
combine_mode = "weighted_sum"      # "weighted_sum", "multiply", or "replace"
judge_scale = 10.0                 # Scale judge scores (0-1) to match env rewards

[training]
num_epochs = 1
iterations_per_epoch = 20
batch_size = 16
group_size = 4
learning_rate = 5e-5

Judge Providers

OpenAI

[judge]
enabled = true
type = "openai"
model = "gpt-4o-mini"              # Fast and cost-effective
api_key_env = "OPENAI_API_KEY"
rubric_id = "your-task/rubric@v1"
outcome_weight = 1.0
judge_scale = 10.0
Best for: High-quality scoring, structured output, function calling

Groq (Fast & Cheap)

[judge]
enabled = true
type = "groq"
model = "llama-3.3-70b-versatile"  # Fast Llama 3.3
api_key_env = "GROQ_API_KEY"
rubric_id = "your-task/rubric@v1"
outcome_weight = 1.0
judge_scale = 10.0
Best for: Rapid iteration, cost-sensitive training, large batches

Anthropic (Detailed Reasoning)

[judge]
enabled = true
type = "anthropic"
model = "claude-3-5-sonnet-20241022"
api_key_env = "ANTHROPIC_API_KEY"
rubric_id = "your-task/rubric@v1"
outcome_weight = 1.0
judge_scale = 10.0
Best for: Complex reasoning evaluation, safety-critical tasks

3. Defining Rubrics

Rubrics tell judges what to evaluate. Define them in your task app:

Example Rubric (Crafter Agent)

# In your task app
from synth_ai.task.server import RubricBundle, GlobalRubric

rubrics = RubricBundle(
    global_rubric=GlobalRubric(
        id="crafter/bundle@v1",
        description="Evaluate Crafter agent performance",
        criteria=[
            "Achievement progression (collect resources, craft tools)",
            "Survival capability (manage health, avoid death)",
            "Exploration behavior (discover new areas efficiently)",
            "Resource efficiency (minimize wasted actions)",
            "Strategic planning (coherent goal-directed behavior)"
        ],
        scoring_guide={
            "0.9-1.0": "Excellent: Multiple achievements, efficient, strategic",
            "0.7-0.9": "Good: Some achievements, mostly efficient",
            "0.5-0.7": "Adequate: Basic progress, some inefficiency",
            "0.3-0.5": "Poor: Minimal progress, inefficient",
            "0.0-0.3": "Very poor: No progress, random actions"
        }
    )
)

# Add to TaskAppConfig
config = TaskAppConfig(
    app_id="grpo-crafter",
    name="Crafter Agent",
    rubrics=rubrics,
    # ... other config
)

Rubric Best Practices

  1. Clear criteria – Specific, measurable aspects of performance
  2. Scoring scale – Consistent 0-1 scale with clear thresholds
  3. Examples – Provide examples of good/bad trajectories (see the sketch after this list)
  4. Alignment – Rubric should align with training objectives
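
To make practice 3 concrete, here is one way to spell out good/bad trajectory examples that a judge prompt can reference. This uses a plain dictionary rather than any specific synth-ai type, so treat it as an illustration to adapt:

# Illustrative example text for a rubric; not tied to a specific synth-ai API.
trajectory_examples = {
    "good": (
        "Collected wood early, crafted a pickaxe, mined stone, and kept health "
        "above half for the whole episode."
    ),
    "bad": (
        "Wandered without gathering resources, repeated the same action for many "
        "steps, and died at night without ever crafting a tool."
    ),
}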

4. Reward Combination Strategies

Weighted Sum (Default)

[judge]
combine_mode = "weighted_sum"
outcome_weight = 1.0
judge_scale = 10.0
# total_reward = env_reward + (judge_scale * judge_score * outcome_weight)
# Example: 5 + (10 * 0.85 * 1.0) = 13.5
Use when: You want judges to augment environment rewards

Multiplicative

[judge]
combine_mode = "multiply"
outcome_weight = 1.0
judge_scale = 1.0
# total_reward = env_reward * (judge_scale * judge_score)
# Example: 5 * (1.0 * 0.85) = 4.25
Use when: Judge scores should act as quality multipliers

Replace (Judge Only)

[judge]
combine_mode = "replace"
outcome_weight = 1.0
judge_scale = 10.0
# total_reward = judge_scale * judge_score
# Example: 10 * 0.85 = 8.5 (ignores env_reward)
Use when: Environment rewards are unreliable or unavailable
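
Putting the three modes side by side, a minimal sketch of the combination logic looks like this (names are illustrative and outcome_weight defaults to 1.0; the trainer's internal implementation may differ):

# Illustrative sketch of the three combine modes described above.
def total_reward(env_reward: float, judge_score: float, combine_mode: str,
                 judge_scale: float, outcome_weight: float = 1.0) -> float:
    judge_term = judge_scale * judge_score * outcome_weight
    if combine_mode == "weighted_sum":
        return env_reward + judge_term
    if combine_mode == "multiply":
        return env_reward * judge_term
    if combine_mode == "replace":
        return judge_term
    raise ValueError(f"unknown combine_mode: {combine_mode}")

print(total_reward(5.0, 0.85, "weighted_sum", judge_scale=10.0))  # 13.5
print(total_reward(5.0, 0.85, "multiply", judge_scale=1.0))       # 4.25
print(total_reward(5.0, 0.85, "replace", judge_scale=10.0))       # 8.5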

5. Event vs Outcome Judges

Event-Level Judges (Per-Step)

Grade each action/decision in the trajectory:
[judge]
event_weight = 1.0        # Enable per-step judging
outcome_weight = 0.0      # Disable episode-level judging
judge_scale = 1.0         # Small scale for frequent rewards
Pros: Dense reward signal, helps with credit assignment
Cons: Expensive (many LLM calls), can be noisy
Best for: Long episodes, sparse environment rewards

Outcome-Level Judges (Episode)

Grade the entire trajectory once at the end:
[judge]
event_weight = 0.0        # Disable per-step judging
outcome_weight = 1.0      # Enable episode-level judging
judge_scale = 10.0        # Larger scale for sparse rewards
Pros: Cost-effective, holistic evaluation, less noise
Cons: Sparse signal, harder credit assignment
Best for: Short episodes, clear success/failure states

Hybrid (Both)

[judge]
event_weight = 0.5        # Moderate per-step rewards
outcome_weight = 1.0      # Full episode-level rewards
judge_scale = 5.0
Balance dense feedback with holistic evaluation.
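
As a sketch of how the two weights interact, per-step (event) judge scores and the episode-level (outcome) score could be folded into a trajectory's rewards as follows; the function and its example inputs are illustrative, not the trainer's internals:

# Illustrative: fold event-level and outcome-level judge scores into per-step rewards.
def shaped_rewards(step_env_rewards: list[float], event_scores: list[float],
                   outcome_score: float, event_weight: float,
                   outcome_weight: float, judge_scale: float) -> list[float]:
    rewards = [
        r + judge_scale * event_weight * s
        for r, s in zip(step_env_rewards, event_scores)
    ]
    rewards[-1] += judge_scale * outcome_weight * outcome_score  # episode-level bonus on the final step
    return rewards

print(shaped_rewards([0.0, 1.0, 0.0], [0.4, 0.8, 0.4], outcome_score=0.8,
                     event_weight=0.5, outcome_weight=1.0, judge_scale=5.0))
# [1.0, 3.0, 5.0]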

6. Monitoring Judge Performance

Training Logs

The trainer emits judge metrics:
[ITER 5] mean_return=12.3, env_return=5.2, judge_return=7.1
[RUBRIC_CALL] policy=crafter-react, provider=openai, model=gpt-4o-mini
[RUBRIC_SCORE] event=0.75, outcome=0.85, combined=8.5

Dashboard Metrics

Track these metrics during training:
  • Mean judge score – Average quality across trajectories
  • Judge/env ratio – Balance between judge and environment rewards
  • Judge variance – Consistency of judge scoring
  • Judge call latency – Time spent on judge evaluations
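
If you log per-episode environment returns and judge scores yourself, the first three metrics can be computed in a few lines. A small sketch with made-up example values:

# Sketch: dashboard-style judge metrics from per-episode values you log yourself.
from statistics import mean, pvariance

judge_scale = 10.0
env_returns = [5.0, 3.0, 6.0, 4.0]     # example per-episode environment returns
judge_scores = [0.7, 0.9, 0.8, 0.6]    # example per-episode judge scores (0-1)

judge_returns = [judge_scale * s for s in judge_scores]
print("mean judge score:", mean(judge_scores))
print("judge/env ratio: ", mean(judge_returns) / mean(env_returns))
print("judge variance:  ", pvariance(judge_scores))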

Validating Judges

Periodically check that judges align with your goals:
# Run evaluation with and without judges
uvx synth-ai eval \
  --model ft:your-model:checkpoint-100 \
  --seeds 1-20 \
  --config configs/eval_with_judges.toml

uvx synth-ai eval \
  --model ft:your-model:checkpoint-100 \
  --seeds 1-20 \
  --config configs/eval_no_judges.toml

# Compare metrics
# Do judge-trained models perform better on your target metrics?

7. Advanced: Async Judge Pipeline

For high-throughput RL, judges run asynchronously:
[training]
pipeline_async_enabled = true     # Enable async judging
async_semaphore_max = 16          # Max concurrent judge calls
judge_timeout_s = 30.0            # Timeout per judge call
How it works:
  1. Trainer collects trajectory batch
  2. Submits all trajectories to judge service in parallel
  3. Continues generating next batch while judges score previous batch
  4. Consumes judge results asynchronously for training
Benefits: 10-20% faster training, better GPU utilization
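
Conceptually, the concurrency cap behaves like an asyncio semaphore wrapped around each judge call. The sketch below is not the trainer's actual code, and judge_trajectory is a hypothetical stand-in for your judge client:

# Conceptual sketch of bounded-concurrency judging; judge_trajectory is a stand-in.
import asyncio

async def judge_trajectory(trajectory: dict) -> float:
    await asyncio.sleep(0.1)  # placeholder for an LLM judge API call
    return 0.85

async def judge_batch(trajectories: list[dict], max_concurrent: int = 16,
                      timeout_s: float = 30.0) -> list[float]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def judged(traj: dict) -> float:
        async with semaphore:  # at most max_concurrent judge calls in flight
            return await asyncio.wait_for(judge_trajectory(traj), timeout=timeout_s)

    return await asyncio.gather(*(judged(t) for t in trajectories))

scores = asyncio.run(judge_batch([{"id": i} for i in range(32)]))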

8. Cost Optimization

Reduce Judge Costs

  1. Use cheaper providers
    [judge]
    type = "groq"
    model = "llama-3.3-70b-versatile"  # ~10x cheaper than GPT-4o
    
  2. Outcome-only judging
    [judge]
    event_weight = 0.0      # Disable per-step (expensive)
    outcome_weight = 1.0    # Only judge full episodes
    
  3. Sample trajectories
    [judge]
    sample_rate = 0.5       # Judge only 50% of trajectories
    
  4. Shorter episodes
    [rollout]
    max_turns = 20          # Limit episode length
    

Cost Comparison

| Setup | Judge Calls/Episode | Cost/1000 Episodes |
|---|---|---|
| Event + Outcome (GPT-4o-mini) | ~20 | $2.00 |
| Outcome only (GPT-4o-mini) | 1 | $0.10 |
| Outcome only (Groq Llama) | 1 | $0.01 |
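
These figures follow from calls per episode times cost per call. A back-of-the-envelope estimator, where the per-call cost is an assumption to replace with your provider's actual pricing:

# Rough cost estimator; cost_per_call is an assumed figure, not provider pricing.
def judge_cost_per_1000_episodes(calls_per_episode: float, cost_per_call: float) -> float:
    return 1000 * calls_per_episode * cost_per_call

# e.g. ~20 calls/episode at an assumed $0.0001 per call -> $2.00 per 1000 episodes
print(judge_cost_per_1000_episodes(20, 0.0001))  # 2.0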

9. Complete Example

Here’s a full RL config with judges for Crafter:
[algorithm]
type = "online"
method = "policy_gradient"
variety = "gspo"

[services]
task_url = "https://my-crafter-app.modal.run"

[model]
base = "Qwen/Qwen3-4B"
label = "crafter-rl-judged-v1"

[compute]
gpu_type = "H100"
gpu_count = 8

[topology]
type = "single_node_split"
gpus_for_vllm = 4
gpus_for_training = 3
gpus_for_ref = 1

[judge]
enabled = true
type = "openai"
model = "gpt-4o-mini"
api_key_env = "OPENAI_API_KEY"
rubric_id = "crafter/bundle@v1"
outcome_weight = 1.0
event_weight = 0.0
judge_scale = 10.0
combine_mode = "weighted_sum"

[rollout]
env_name = "crafter"
max_turns = 50
episodes_per_batch = 8
policy_name = "crafter-react"
max_concurrent_rollouts = 16

[training]
num_epochs = 2
iterations_per_epoch = 50
batch_size = 16
group_size = 4
learning_rate = 5e-5
gradient_accumulation_steps = 2

# Enable async judging for better throughput
pipeline_async_enabled = true
async_semaphore_max = 16

# Step rewards from environment
step_rewards_enabled = true
step_rewards_mode = "decision_stepwise"

[evaluation]
instances = 20
every_n_iters = 10
seeds = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
         110, 111, 112, 113, 114, 115, 116, 117, 118, 119]
Run training:
uvx synth-ai train \
  --type rl \
  --config configs/rl_with_judges.toml \
  --env-file .env \
  --poll

10. Troubleshooting

Judges Too Slow

Symptom: Training stalls waiting for judge responses.
Solutions:
  • Switch to Groq provider (faster)
  • Disable event-level judging (event_weight = 0.0)
  • Increase async_semaphore_max for more parallel calls
  • Reduce episode length (max_turns)

Judge Scores Too Harsh/Lenient

Symptom: All trajectories score 0.2 or all score 0.95.
Solutions:
  • Refine rubric criteria to be more discriminative
  • Adjust judge_scale to match environment reward magnitude
  • Switch judge model (GPT-4o > GPT-4o-mini > Llama)
  • Add explicit scoring examples in rubric

High Judge Costs

Symptom: Training bills are dominated by judge API calls.
Solutions:
  • Use Groq/Llama instead of OpenAI
  • Set outcome_weight = 1.0, event_weight = 0.0
  • Sample trajectories: sample_rate = 0.5
  • Shorter episodes: max_turns = 20

Judge Rewards Dominating

Symptom: Policy ignores environment rewards and optimizes only for the judge.
Solutions:
  • Reduce judge_scale (try 5.0 → 2.0)
  • Lower outcome_weight (try 1.0 → 0.5)
  • Use combine_mode = "multiply" instead of "weighted_sum"
  • Ensure environment rewards are meaningful

Summary

Judge-powered RL workflow:
  1. ✅ Define rubrics in your task app
  2. ✅ Enable judges in RL config with [judge] section
  3. ✅ Choose provider and model (openai/gpt-4o-mini recommended)
  4. ✅ Configure reward combination (weighted_sum + judge_scale)
  5. ✅ Run training with uvx synth-ai train --type rl
  6. ✅ Monitor judge metrics in training logs and dashboard
Judges enable preference alignment and quality-aware RL, letting you train agents that balance task completion with desired behaviors and reasoning quality.