In prompt optimization, judges are LLM-based evaluators that score prompt performance based on quality, style, and alignment with desired behaviors. Unlike environment rewards (which measure task accuracy), judge rewards measure subjective quality aspects that are hard to quantify programmatically. Both GEPA and MIPRO optimizers support hosted judge evaluation, allowing you to optimize prompts for both accuracy and quality simultaneously.

Overview

Judges enable:
  • Quality-aware optimization – Optimize prompts for both accuracy and quality/style
  • Preference alignment – Train prompts to exhibit preferred behaviors (helpfulness, safety, reasoning quality)
  • Multi-objective search – Balance multiple objectives in Pareto optimization (accuracy + quality + efficiency)
  • Automated feedback – Get rich evaluation signals without manual reward engineering

How Judges Work in Prompt Optimization

Evaluation Flow with Judges

1. Optimizer evaluates prompt variant
   ├─> Sends prompt to task app
   ├─> Task app returns environment reward (accuracy)
   └─> Trajectory trace captured

2. Judge evaluates trajectory quality
   ├─> Sends trace to hosted judge service
   ├─> Judge LLM scores based on rubric
   └─> judge_score = 0.85 (quality score 0-1)

3. Combine rewards for optimization
   └─> fused_score = env_reward + (judge_weight * judge_score)
                    = 0.75 + (0.25 * 0.85) = 0.9625

4. Store scores in archive
   └─> Archive entry includes: accuracy, judge_score, fused_score
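
Expressed as code, one variant evaluation looks roughly like the sketch below (run_task_app and call_judge are hypothetical placeholders for the real task-app and judge-service calls, not actual synth_ai functions):
# Hypothetical placeholders for the real task-app and judge-service calls
def run_task_app(prompt_variant):
    return 0.75, {"messages": []}             # (env_reward, trajectory trace)

def call_judge(trace, rubric_id):
    return 0.85                               # judge quality score in [0, 1]

JUDGE_WEIGHT = 0.25  # default weight applied to judge scores

def evaluate_variant(prompt_variant):
    env_reward, trace = run_task_app(prompt_variant)                   # step 1: accuracy + trace
    judge_score = call_judge(trace, rubric_id="banking77/quality@v1")  # step 2: rubric-based quality
    fused_score = env_reward + JUDGE_WEIGHT * judge_score              # step 3: 0.75 + 0.25 * 0.85 = 0.9625
    return {"accuracy": env_reward, "judge_score": judge_score, "fused_score": fused_score}  # step 4

print(evaluate_variant("You are a banking assistant..."))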

Judge vs Environment Rewards

Reward Type | Source         | Examples                                 | When to Use
Environment | Task app logic | Classification accuracy, QA correctness  | Clear success metrics
Judge       | LLM evaluation | Reasoning quality, helpfulness, safety   | Subjective criteria
Combined    | Both           | Weighted sum of both                     | Most optimization scenarios

Configuring Judges for Prompt Optimization

Basic Judge Configuration

Add a [prompt_learning.judge] section to your prompt optimization config:
# configs/banking77_with_judges.toml
[prompt_learning]
algorithm = "gepa"  # or "mipro"
task_app_url = "https://my-task-app.modal.run"

[prompt_learning.gepa]
initial_population_size = 20
num_generations = 15

# ✅ Judge Configuration
[prompt_learning.judge]
enabled = true                     # Enable judge scoring
reward_source = "fused"            # "task_app", "judge", or "fused"
backend_base = "https://judge.synth.ai/api"
backend_api_key_env = "SYNTH_API_KEY"
backend_provider = "groq"
backend_model = "openai/gpt-oss-120b"
backend_rubric_id = "banking77/quality@v1"

# Reward fusion (when reward_source = "fused")
# Judge scores are weighted and combined with environment rewards
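
Before launching a run, you can sanity-check the judge settings by loading the TOML directly (a minimal sketch using Python's standard tomllib, available in 3.11+; the keys mirror the config above):
import tomllib

# Load the optimization config and confirm the judge section is wired up
with open("configs/banking77_with_judges.toml", "rb") as f:
    config = tomllib.load(f)

judge = config["prompt_learning"]["judge"]
assert judge["enabled"] is True
assert judge["reward_source"] in {"task_app", "judge", "fused"}
print(judge["backend_provider"], judge["backend_model"], judge["backend_rubric_id"])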

Judge Providers

Groq

[prompt_learning.judge]
enabled = true
reward_source = "fused"
backend_base = "https://judge.synth.ai/api"
backend_api_key_env = "SYNTH_API_KEY"
backend_provider = "groq"
backend_model = "openai/gpt-oss-120b"  # Fast and cost-effective
backend_rubric_id = "your-task/quality@v1"
Best for: Cost-effective optimization, fast iteration, large evaluation budgets

OpenAI

[prompt_learning.judge]
enabled = true
reward_source = "fused"
backend_base = "https://judge.synth.ai/api"
backend_api_key_env = "SYNTH_API_KEY"
backend_provider = "openai"
backend_model = "gpt-4o-mini"  # High-quality scoring
backend_rubric_id = "your-task/quality@v1"
Best for: High-quality scoring, structured evaluation, complex reasoning tasks

Reward Source Strategies

Task App Only (Default)

[prompt_learning.judge]
enabled = false  # or omit the [prompt_learning.judge] section entirely
Uses only environment rewards (accuracy) from task app. No judge evaluation.

Judge Only

[prompt_learning.judge]
enabled = true
reward_source = "judge"  # Use only judge scores
Optimizes purely based on judge quality scores. Useful when environment rewards are unreliable or when you want to optimize for quality over accuracy.

Fused Rewards (Recommended)

[prompt_learning.judge]
enabled = true
reward_source = "fused"  # Combine both rewards
Combines environment rewards with judge scores. The optimizer balances both accuracy and quality in its search. How fusion works:
  • Environment reward: env_reward = accuracy (0-1)
  • Judge reward: judge_reward = judge_score (0-1)
  • Fused reward: fused = env_reward + (judge_weight * judge_reward)
  • Default judge_weight is 0.25, giving environment rewards 4x weight
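
For intuition, here is a short sketch (illustrative numbers only, not from a real run) showing how the default weight trades accuracy against quality when ranking two candidate prompts:
JUDGE_WEIGHT = 0.25  # default: environment reward carries 4x the weight of the judge score

def fused(env_reward: float, judge_score: float, judge_weight: float = JUDGE_WEIGHT) -> float:
    # fused = env_reward + judge_weight * judge_score
    return env_reward + judge_weight * judge_score

# Variant A: higher accuracy, weaker quality
print(fused(0.80, 0.60))  # 0.95
# Variant B: slightly lower accuracy, much better quality
print(fused(0.75, 0.90))  # 0.975 -> B ranks higher on the fused score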

Defining Rubrics

Rubrics tell judges what to evaluate. Define them in your task app or use Synth-hosted rubrics:

Example Rubric (Banking77 Intent Classification)

# In your task app
from synth_ai.task.server import RubricBundle, GlobalRubric

rubrics = RubricBundle(
    global_rubric=GlobalRubric(
        id="banking77/quality@v1",
        description="Evaluate banking intent classification quality",
        criteria=[
            "Response clarity and professionalism",
            "Appropriate tone for banking context",
            "Completeness of response",
            "Reasoning quality (if applicable)"
        ],
        scoring_guide={
            "0.9-1.0": "Excellent: Clear, professional, complete",
            "0.7-0.9": "Good: Mostly clear and appropriate",
            "0.5-0.7": "Adequate: Basic quality, some issues",
            "0.3-0.5": "Poor: Unclear or inappropriate",
            "0.0-0.3": "Very poor: Confusing or unprofessional"
        }
    )
)

Rubric Best Practices

  1. Clear criteria – Specific, measurable aspects of quality
  2. Scoring scale – Consistent 0-1 scale with clear thresholds
  3. Task alignment – Rubric should align with optimization objectives
  4. Examples – Provide examples of good/bad responses when possible

Judge Score Tracking

Archive Storage

Judge scores are automatically stored in the optimization archive alongside accuracy metrics:
from synth_ai.learning import get_prompt_text

# Retrieve best prompt with judge score
best_entry = get_prompt_text(job_id="pl_abc123", rank=1)
print(f"Accuracy: {best_entry['accuracy']}")
print(f"Judge Score: {best_entry['judge_score']}")  # 0.0-1.0
print(f"Fused Score: {best_entry['fused_score']}")  # Combined score

Score Analysis

Judge scores enable:
  • Quality filtering: Filter prompts by judge score threshold
  • Multi-objective analysis: Compare accuracy vs quality trade-offs
  • Pareto optimization: GEPA maintains Pareto front including judge scores
  • Ablation studies: Compare optimization with/without judges
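
For example, a simple quality filter over the top-ranked archive entries could look like the sketch below; it reuses get_prompt_text from above, assumes the score keys listed earlier (accuracy, judge_score, fused_score), and uses an arbitrary threshold:
from synth_ai.learning import get_prompt_text

JUDGE_THRESHOLD = 0.8  # arbitrary quality cut-off for this example

# Walk the top 5 archive entries and keep only prompts that clear the quality bar
high_quality = []
for rank in range(1, 6):
    entry = get_prompt_text(job_id="pl_abc123", rank=rank)
    if entry["judge_score"] is not None and entry["judge_score"] >= JUDGE_THRESHOLD:
        high_quality.append(entry)
        print(rank, entry["accuracy"], entry["judge_score"], entry["fused_score"])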

Cost Optimization

Reduce Judge Costs

  1. Use cheaper providers
    [prompt_learning.judge]
    backend_provider = "groq"
    backend_model = "openai/gpt-oss-120b"  # ~10x cheaper than GPT-4o
    
  2. Outcome-only judging (default)
    • Judges evaluate full trajectories, not per-step
    • One judge call per prompt evaluation
    • Cost scales with number of prompt variants evaluated
  3. Sample evaluations
    • GEPA: Evaluates all variants (unavoidable for Pareto optimization)
    • MIPRO: Mini-batch evaluation naturally reduces judge calls

Cost Comparison

Setup                | Judge Calls / Generation | Cost / 1,000 Evaluations
Groq (gpt-oss-120b)  | 1 per variant            | ~$0.10
OpenAI (gpt-4o-mini) | 1 per variant            | ~$1.00
OpenAI (gpt-4o)      | 1 per variant            | ~$10.00
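
As a rough back-of-the-envelope estimate, total judge cost scales with the number of variants evaluated. The sketch below multiplies the example GEPA settings from this page by the per-1,000-evaluation figures above (all numbers are approximate, and it assumes one judge call per variant per generation):
# Rough judge-cost estimate (illustrative arithmetic only)
population_size = 20       # GEPA initial_population_size
num_generations = 15
judge_calls = population_size * num_generations   # ~300 judge calls for the run

cost_per_1000_evals = {
    "Groq (gpt-oss-120b)": 0.10,
    "OpenAI (gpt-4o-mini)": 1.00,
    "OpenAI (gpt-4o)": 10.00,
}
for setup, cost in cost_per_1000_evals.items():
    print(f"{setup}: ~${judge_calls / 1000 * cost:.2f}")
# Groq ~$0.03, gpt-4o-mini ~$0.30, gpt-4o ~$3.00 for the whole run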

Complete Example

Here’s a full GEPA config with judges:
[prompt_learning]
algorithm = "gepa"
task_app_url = "https://banking77-task.modal.run"
task_app_api_key = "your-api-key"
env_name = "banking77"

[prompt_learning.initial_prompt]
messages = [
    { role = "system", content = "You are a banking assistant..." },
    { role = "user", pattern = "Customer Query: {query}" }
]

[prompt_learning.gepa]
initial_population_size = 20
num_generations = 15
mutation_model = "openai/gpt-oss-120b"
mutation_provider = "groq"

[prompt_learning.judge]
enabled = true
reward_source = "fused"
backend_base = "https://judge.synth.ai/api"
backend_api_key_env = "SYNTH_API_KEY"
backend_provider = "groq"
backend_model = "openai/gpt-oss-120b"
backend_rubric_id = "banking77/quality@v1"

[prompt_learning.policy]
model = "meta/llama-3.1-8b-instant"
provider = "groq"
Run optimization:
uvx synth-ai train \
  --type prompt_learning \
  --config configs/banking77_with_judges.toml \
  --poll

Troubleshooting

Judges Too Slow

Symptom: Optimization stalls waiting for judge responses.
Solutions:
  • Switch to Groq provider (faster)
  • Use smaller judge models (gpt-oss-120b vs gpt-4o)
  • Reduce evaluation batch sizes

Judge Scores Not Appearing

Symptom: judge_score is None in archive entries.
Solutions:
  • Verify the [prompt_learning.judge] section is configured correctly
  • Check SYNTH_API_KEY environment variable is set
  • Ensure backend_rubric_id matches a valid rubric in your task app
  • Check logs for judge API errors

High Judge Costs

Symptom: Judge API calls dominate the optimization bill.
Solutions:
  • Use Groq instead of OpenAI
  • Reduce population size (GEPA) or iterations (MIPRO)
  • Consider judge-only optimization if environment rewards are unreliable

Summary

Judge-powered prompt optimization workflow:
  1. ✅ Define rubrics in your task app or use Synth-hosted rubrics
  2. ✅ Enable judges in config with [prompt_learning.judge] section
  3. ✅ Choose provider and model (groq/gpt-oss-120b recommended)
  4. ✅ Configure reward source (fused recommended for balanced optimization)
  5. ✅ Run optimization with uvx synth-ai train --type prompt_learning
  6. ✅ Analyze judge scores in optimization archive
Judges enable quality-aware prompt optimization, letting you optimize prompts that balance accuracy with desired behaviors, reasoning quality, and style preferences.