Run `uvx synth-ai train` to see the combined rewards in action.
Overview
Judges enable:
- ✅ Reward shaping – Combine environment rewards with LLM-based quality scores
- ✅ Preference alignment – Train agents to exhibit preferred behaviors (helpfulness, safety, etc.)
- ✅ Multi-objective optimization – Balance multiple reward signals (task completion + reasoning quality)
- ✅ Automated feedback – No manual reward engineering; judges provide rich signals
1. How Judges Work in RL
Training Loop with Judges
Judge vs Environment Rewards
| Reward Type | Source | Examples | When to Use |
|---|---|---|---|
| Environment | Task logic | Points, health, achievements | Clear success metrics |
| Judge | LLM evaluation | Quality, reasoning, safety | Subjective criteria |
| Combined | Both | Weighted sum | Most RL scenarios |
2. Configuring Judges for RL
Basic Judge Configuration
Add a `[judge]` section to your RL config:
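A minimal sketch of what this section might look like is shown below. The `combine_mode`, `judge_scale`, `outcome_weight`, and `event_weight` keys are the ones discussed later in this guide; `enabled`, `provider`, and `model` are assumptions, so check the config reference for your synth-ai version.

```toml
# Hedged sketch of a [judge] section; keys marked "assumed" may differ
# from the actual synth-ai schema.
[judge]
enabled = true                 # assumed toggle for judge-based rewards
provider = "openai"            # assumed key: which judge backend to use
model = "gpt-4o-mini"          # assumed key: judge model
combine_mode = "weighted_sum"  # how judge and environment rewards are merged
judge_scale = 2.0              # rescales judge scores toward env reward magnitude
outcome_weight = 1.0           # weight of the episode-level judge score
event_weight = 0.0             # weight of per-step judge scores (0 disables them)
```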
Judge Providers
OpenAI (Recommended)
Groq (Fast & Cheap)
Anthropic (Detailed Reasoning)
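The snippets below sketch how each provider might be selected. The `provider`/`model` key names and the specific model identifiers are assumptions rather than verified synth-ai syntax; two variants are commented out so the snippet stays valid TOML.

```toml
# OpenAI (recommended): good balance of judge quality and cost
[judge]
provider = "openai"
model = "gpt-4o-mini"

# Groq (fast & cheap): lower latency and cost, scores may be noisier
# provider = "groq"
# model = "llama-3.1-70b-versatile"

# Anthropic (detailed reasoning): strongest rationales, higher cost
# provider = "anthropic"
# model = "claude-3-5-sonnet-latest"
```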
3. Defining Rubrics
Rubrics tell judges what to evaluate. Define them in your task app:
Example Rubric (Crafter Agent)
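As an illustration only: the rubric below sketches what a Crafter outcome rubric could cover, expressed as config-style data. The exact schema your task app expects (field names, nesting, whether it is code or config) is an assumption here.

```toml
# Illustrative rubric sketch; adapt to the schema your task app actually uses.
[[rubric.criteria]]
id = "achievement_progress"
description = "Did the agent unlock meaningful achievements (collect wood, craft table, make tools)?"
weight = 0.4

[[rubric.criteria]]
id = "survival"
description = "Did the agent manage health, food, and drink instead of dying early?"
weight = 0.3

[[rubric.criteria]]
id = "reasoning_quality"
description = "Were the agent's stated plans coherent and consistent with its actions?"
weight = 0.3

# Scores use a 0-1 scale: ~0.0 = no progress, ~0.5 = partial, ~1.0 = clearly achieved.
```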
Rubric Best Practices
- Clear criteria – Specific, measurable aspects of performance
- Scoring scale – Consistent 0-1 scale with clear thresholds
- Examples – Provide examples of good/bad trajectories
- Alignment – Rubric should align with training objectives
4. Reward Combination Strategies
Weighted Sum (Default)
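One plausible reading of the weighted-sum mode, assuming `judge_scale` rescales a weighted mix of outcome- and event-level judge scores before adding it to the environment return (the trainer's exact formula may differ):

$$
r_{\text{total}} = r_{\text{env}} + \text{judge\_scale}\,\bigl(\text{outcome\_weight}\cdot s_{\text{outcome}} + \text{event\_weight}\cdot \bar{s}_{\text{event}}\bigr)
$$

where $s_{\text{outcome}}$ is the episode-level judge score and $\bar{s}_{\text{event}}$ is the mean per-step judge score.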
Multiplicative
Replace (Judge Only)
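A sketch of how the three strategies might be chosen in config. Only the `weighted_sum` and `multiply` values appear elsewhere in this guide; the `replace` value and the behavior described in the comments are assumptions.

```toml
[judge]
# Weighted sum (default): environment reward plus scaled judge scores
combine_mode = "weighted_sum"

# Multiplicative: the judge score modulates the environment reward,
# so the judge cannot manufacture reward where the env gives none
# combine_mode = "multiply"

# Replace: ignore environment rewards and train on judge scores alone
# combine_mode = "replace"
```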
5. Event vs Outcome Judges
Event-Level Judges (Per-Step)
Grade each action/decision in the trajectory.
Cons: Expensive (many LLM calls), can be noisy
Best for: Long episodes, sparse environment rewards
Outcome-Level Judges (Episode)
Grade the entire trajectory once at the end.
Cons: Sparse signal, harder credit assignment
Best for: Short episodes, clear success/failure states
Hybrid (Both)
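Assuming the `event_weight` and `outcome_weight` keys from the troubleshooting section control this choice, the three setups might look like:

```toml
[judge]
# Outcome-only: one judge call per episode (cheapest, sparsest signal)
outcome_weight = 1.0
event_weight = 0.0

# Event-only: grade every step (dense signal, many judge calls)
# outcome_weight = 0.0
# event_weight = 1.0

# Hybrid: episode-level grade plus a smaller per-step signal
# outcome_weight = 0.7
# event_weight = 0.3
```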
6. Monitoring Judge Performance
Training Logs
The trainer emits judge metrics during training.
Dashboard Metrics
Track these metrics during training:
- Mean judge score – Average quality across trajectories
- Judge/env ratio – Balance between judge and environment rewards
- Judge variance – Consistency of judge scoring
- Judge call latency – Time spent on judge evaluations
Validating Judges
Periodically check that judges align with your goals.
7. Advanced: Async Judge Pipeline
For high-throughput RL, judges run asynchronously:
- Trainer collects trajectory batch
- Submits all trajectories to judge service in parallel
- Continues generating next batch while judges score previous batch
- Consumes judge results asynchronously for training
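Parallelism is bounded by `async_semaphore_max` (mentioned under troubleshooting); a hedged example, assuming the key lives in the `[judge]` section:

```toml
[judge]
# Maximum number of judge requests in flight at once; raise this if training
# stalls waiting on judge responses (key placement here is an assumption).
async_semaphore_max = 16
```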
8. Cost Optimization
Reduce Judge Costs
- Use cheaper providers
- Outcome-only judging
- Sample trajectories
- Shorter episodes
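Putting those levers together, a cost-reduced setup might look like the sketch below. The weight, sampling, and episode-length values come from the troubleshooting section; the provider/model keys and the Groq model name are assumptions.

```toml
[judge]
provider = "groq"                   # cheaper, faster judge backend (assumed keys)
model = "llama-3.1-70b-versatile"
outcome_weight = 1.0                # outcome-only judging: one call per episode
event_weight = 0.0
sample_rate = 0.5                   # judge only half of the trajectories

# Episode length is usually set outside [judge]; shorter episodes
# (e.g. max_turns = 20) also reduce judge token costs.
```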
Cost Comparison
| Setup | Judge Calls/Episode | Cost/1000 Episodes |
|---|---|---|
| Event + Outcome (GPT-4o-mini) | ~20 | $2.00 |
| Outcome only (GPT-4o-mini) | 1 | $0.10 |
| Outcome only (Groq Llama) | 1 | $0.01 |
9. Complete Example
Here’s a full RL config with judges for Crafter:
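The sketch below assembles the pieces from the previous sections. Treat it as a hedged outline rather than a drop-in file: the `[env]` section and the `enabled`/`provider`/`model` keys are placeholders, and only the weight, scale, sampling, and concurrency knobs are taken from this guide.

```toml
# Hedged end-to-end example; adapt names to your synth-ai version's schema.
[env]                          # placeholder section for the Crafter task
name = "crafter"
max_turns = 20                 # shorter episodes keep judge costs down

[judge]
enabled = true
provider = "openai"
model = "gpt-4o-mini"
combine_mode = "weighted_sum"
judge_scale = 2.0              # keep judge rewards from dominating env rewards
outcome_weight = 1.0           # episode-level grading
event_weight = 0.0             # per-step grading disabled to control cost
sample_rate = 1.0              # judge every trajectory
async_semaphore_max = 16       # parallel judge calls
```

Run it with `uvx synth-ai train --type rl` and watch the judge metrics alongside the environment returns.
10. Troubleshooting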
Judges Too Slow
Symptom: Training stalls waiting for judge responses
Solutions:
- Switch to Groq provider (faster)
- Disable event-level judging (`event_weight = 0.0`)
- Increase `async_semaphore_max` for more parallel calls
- Reduce episode length (`max_turns`)
Judge Scores Too Harsh/Lenient
Symptom: All trajectories score 0.2 or all score 0.95
Solutions:
- Refine rubric criteria to be more discriminative
- Adjust `judge_scale` to match environment reward magnitude
- Switch judge model (GPT-4o > GPT-4o-mini > Llama)
- Add explicit scoring examples in rubric
High Judge Costs
Symptom: Training bills are dominated by judge API calls
Solutions:
- Use Groq/Llama instead of OpenAI
- Set `outcome_weight = 1.0`, `event_weight = 0.0`
- Sample trajectories: `sample_rate = 0.5`
- Shorter episodes: `max_turns = 20`
Judge Rewards Dominating
Symptom: Policy ignores environment rewards, optimizes only for judge
Solutions:
- Reduce `judge_scale` (try 5.0 → 2.0)
- Lower `outcome_weight` (try 1.0 → 0.5)
- Use `combine_mode = "multiply"` instead of `"weighted_sum"`
- Ensure environment rewards are meaningful
Summary
Judge-powered RL workflow:
- ✅ Define rubrics in your task app
- ✅ Enable judges in RL config with a `[judge]` section
- ✅ Choose provider and model (`openai/gpt-4o-mini` recommended)
- ✅ Configure reward combination (`weighted_sum` + `judge_scale`)
- ✅ Run training with `uvx synth-ai train --type rl`
- ✅ Monitor judge metrics in training logs and dashboard