Overview
Judges enable:
- ✅ Quality-aware optimization – Optimize prompts for both accuracy and quality/style
- ✅ Preference alignment – Train prompts to exhibit preferred behaviors (helpfulness, safety, reasoning quality)
- ✅ Multi-objective search – Balance multiple objectives in Pareto optimization (accuracy + quality + efficiency)
- ✅ Automated feedback – Get rich evaluation signals without manual reward engineering
How Judges Work in Prompt Optimization
Evaluation Flow with Judges
Judge vs Environment Rewards
| Reward Type | Source | Examples | When to Use |
|---|---|---|---|
| Environment | Task app logic | Classification accuracy, QA correctness | Clear success metrics |
| Judge | LLM evaluation | Reasoning quality, helpfulness, safety | Subjective criteria |
| Combined | Both | Weighted sum | Most optimization scenarios |
Configuring Judges for Prompt Optimization
Basic Judge Configuration
Add a `[judge]` section to your prompt optimization config:
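A minimal sketch of what this section might look like, shown with the full `[prompt_learning.judge]` path used later in this guide. The exact field names (`enabled`, `provider`, `model`, `backend_rubric_id`) are illustrative assumptions, not the authoritative schema:

```toml
# Hypothetical judge section; field names are illustrative
[prompt_learning.judge]
enabled = true
provider = "groq"                          # see Judge Providers below
model = "gpt-oss-120b"                     # judge model
backend_rubric_id = "banking77-intent-v1"  # rubric the judge scores against (hypothetical ID)
```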
Judge Providers
Groq (Recommended for Prompt Optimization)
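A hedged sketch of a Groq-backed judge, using the same illustrative field names as above; `gpt-oss-120b` is the model named in the cost comparison below:

```toml
# Groq judge (illustrative): fast, roughly $0.10 per 1,000 evaluations per the table below
[prompt_learning.judge]
provider = "groq"
model = "gpt-oss-120b"
```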
OpenAI
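An equivalent sketch for an OpenAI judge (field names still illustrative); per the cost table below, `gpt-4o-mini` is roughly 10x cheaper than `gpt-4o`:

```toml
# OpenAI judge (illustrative): higher quality ceiling, higher cost
[prompt_learning.judge]
provider = "openai"
model = "gpt-4o-mini"   # or "gpt-4o" for a stronger, pricier judge
```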
Reward Source Strategies
Task App Only (Default)
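With the default strategy, only the environment reward computed by your task app (e.g., classification accuracy) drives optimization. If your optimizer exposes an explicit switch for this, it might look like the sketch below; the `reward_source` field name is an assumption:

```toml
# Environment (task app) rewards only, the default (field name is an assumption)
[prompt_learning.judge]
reward_source = "task_app"
```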
Judge Only
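When environment rewards are unreliable (see High Judge Costs below), the judge score can drive optimization on its own; sketched with the same illustrative field:

```toml
# Judge scores only (field name is an assumption)
[prompt_learning.judge]
reward_source = "judge"
```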
Fused (Recommended)
- Environment reward: `env_reward = accuracy` (0-1)
- Judge reward: `judge_reward = judge_score` (0-1)
- Fused reward: `fused = env_reward + (judge_weight * judge_reward)`
- Default `judge_weight` is 0.25, giving environment rewards 4x the weight
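Under the fused strategy, a config might combine both signals as sketched below; field names are illustrative, and `judge_weight` defaults to 0.25 as noted above:

```toml
# Fused rewards: fused = env_reward + judge_weight * judge_reward (field names illustrative)
[prompt_learning.judge]
reward_source = "fused"
judge_weight = 0.25   # environment reward keeps 4x the influence of the judge
```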
Defining Rubrics
Rubrics tell judges what to evaluate. Define them in your task app or use Synth-hosted rubrics.
Example Rubric (Banking77 Intent Classification)
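A hedged sketch of what a rubric for Banking77 intent classification could look like, written as TOML for illustration. The structure (weighted criteria with descriptions) follows the best practices below, but the exact schema your task app or Synth expects may differ:

```toml
# Illustrative rubric; schema and field names are assumptions, not the authoritative format
id = "banking77-intent-v1"
description = "Judge the quality of intent-classification responses for Banking77."

[[criteria]]
name = "correct_intent"
weight = 0.6
description = "Response selects the single best-matching intent label from the Banking77 taxonomy."

[[criteria]]
name = "reasoning_quality"
weight = 0.3
description = "Brief, relevant justification that references the customer's wording."

[[criteria]]
name = "format"
weight = 0.1
description = "Output follows the required label-only format with no extra text."

# Scores normalize to a 0-1 scale, matching the judge_score used in fused rewards.
```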
Rubric Best Practices
- Clear criteria – Specific, measurable aspects of quality
- Scoring scale – Consistent 0-1 scale with clear thresholds
- Task alignment – Rubric should align with optimization objectives
- Examples – Provide examples of good/bad responses when possible
Judge Score Tracking
Archive Storage
Judge scores are automatically stored in the optimization archive alongside accuracy metrics.
Score Analysis
Judge scores enable:
- Quality filtering: Filter prompts by judge score threshold
- Multi-objective analysis: Compare accuracy vs quality trade-offs
- Pareto optimization: GEPA maintains Pareto front including judge scores
- Ablation studies: Compare optimization with/without judges
Cost Optimization
Reduce Judge Costs
- Use cheaper providers
- Outcome-only judging (default)
  - Judges evaluate full trajectories, not per-step
  - One judge call per prompt evaluation
  - Cost scales with number of prompt variants evaluated
- Sample evaluations
  - GEPA: Evaluates all variants (unavoidable for Pareto optimization)
  - MIPRO: Mini-batch evaluation naturally reduces judge calls
Cost Comparison
| Setup | Judge Calls/Generation | Cost/1000 Evaluations |
|---|---|---|
| Groq (gpt-oss-120b) | 1 per variant | ~$0.10 |
| OpenAI (gpt-4o-mini) | 1 per variant | ~$1.00 |
| OpenAI (gpt-4o) | 1 per variant | ~$10.00 |
Complete Example
Here’s a full GEPA config with judges:
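The following is a hedged sketch combining the illustrative fields used above; section paths and field names (`[prompt_learning]`, `algorithm`, `task_app_url`, the GEPA population settings) are assumptions, not the authoritative schema:

```toml
# Illustrative end-to-end GEPA + judge config (field names are assumptions)
[prompt_learning]
algorithm = "gepa"
task_app_url = "http://localhost:8001"   # hypothetical task app endpoint

[prompt_learning.gepa]
population_size = 16                     # prompt variants per generation (hypothetical)
generations = 10

[prompt_learning.judge]
enabled = true
provider = "groq"
model = "gpt-oss-120b"
backend_rubric_id = "banking77-intent-v1"
reward_source = "fused"
judge_weight = 0.25
```

Launch the run with `uvx synth-ai train --type prompt_learning`.

Troubleshooting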
Judges Too Slow
Symptom: Optimization stalls waiting for judge responses
Solutions:
- Switch to Groq provider (faster)
- Use smaller judge models (`gpt-oss-120b` vs `gpt-4o`)
- Reduce evaluation batch sizes
Judge Scores Not Appearing
Symptom: `judge_score` is `None` in archive entries
Solutions:
- Verify the `[judge]` section is configured correctly
- Check that the `SYNTH_API_KEY` environment variable is set
- Ensure `backend_rubric_id` matches a valid rubric in your task app
- Check logs for judge API errors
High Judge Costs
Symptom: Optimization bills dominated by judge API calls
Solutions:
- Use Groq instead of OpenAI
- Reduce population size (GEPA) or iterations (MIPRO)
- Consider judge-only optimization if environment rewards are unreliable
Summary
Judge-powered prompt optimization workflow:
- ✅ Define rubrics in your task app or use Synth-hosted rubrics
- ✅ Enable judges in config with a `[prompt_learning.judge]` section
- ✅ Choose provider and model (`groq`/`gpt-oss-120b` recommended)
- ✅ Configure reward source (`fused` recommended for balanced optimization)
- ✅ Run optimization with `uvx synth-ai train --type prompt_learning`
- ✅ Analyze judge scores in the optimization archive