In prompt optimization, judges are LLM-based evaluators that score prompt performance based on quality, style, and alignment with desired behaviors. Unlike environment rewards (which measure task accuracy), judge rewards measure subjective quality aspects that are hard to quantify programmatically. Both GEPA and MIPRO optimizers support hosted judge evaluation, allowing you to optimize prompts for both accuracy and quality simultaneously.

Overview

Judges enable:
  • Quality-aware optimization – Optimize prompts for both accuracy and quality/style
  • Preference alignment – Train prompts to exhibit preferred behaviors (helpfulness, safety, reasoning quality)
  • Multi-objective search – Balance multiple objectives in Pareto optimization (accuracy + quality + efficiency)
  • Automated feedback – Get rich evaluation signals without manual reward engineering

How Judges Work in Prompt Optimization

Evaluation Flow with Judges

1. Optimizer evaluates prompt variant
   ├─> Sends prompt to task app
   ├─> Task app returns environment reward (accuracy)
   └─> Trajectory trace captured

2. Judge evaluates trajectory quality
   ├─> Sends trace to hosted judge service
   ├─> Judge LLM scores based on rubric
   └─> judge_score = 0.85 (quality score 0-1)

3. Combine rewards for optimization
   └─> fused_score = env_reward + (judge_weight * judge_score)
                    = 0.75 + (0.25 * 0.85) = 0.9625

4. Store scores in archive
   └─> Archive entry includes: accuracy, judge_score, fused_score
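
Expressed as code, one variant evaluation looks roughly like the sketch below (run_task_app and call_judge are hypothetical placeholders for the real task-app and judge-service calls, not actual synth_ai functions):
# Hypothetical placeholders for the real task-app and judge-service calls
def run_task_app(prompt_variant):
    return 0.75, {"messages": []}             # (env_reward, trajectory trace)

def call_judge(trace, rubric_id):
    return 0.85                               # judge quality score in [0, 1]

JUDGE_WEIGHT = 0.25  # default weight applied to judge scores

def evaluate_variant(prompt_variant):
    env_reward, trace = run_task_app(prompt_variant)                   # step 1: accuracy + trace
    judge_score = call_judge(trace, rubric_id="banking77/quality@v1")  # step 2: rubric-based quality
    fused_score = env_reward + JUDGE_WEIGHT * judge_score              # step 3: 0.75 + 0.25 * 0.85 = 0.9625
    return {"accuracy": env_reward, "judge_score": judge_score, "fused_score": fused_score}  # step 4

print(evaluate_variant("You are a banking assistant..."))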

Judge vs Environment Rewards

Reward Type | Source         | Examples                                 | When to Use
Environment | Task app logic | Classification accuracy, QA correctness  | Clear success metrics
Judge       | LLM evaluation | Reasoning quality, helpfulness, safety   | Subjective criteria
Combined    | Both           | Weighted sum of both                     | Most optimization scenarios

Configuring Judges for Prompt Optimization

Basic Judge Configuration

Add a [prompt_learning.judge] section to your prompt optimization config:
# configs/banking77_with_judges.toml
[prompt_learning]
algorithm = "gepa"  # or "mipro"
task_app_url = "https://my-task-app.modal.run"

[prompt_learning.gepa]
initial_population_size = 20
num_generations = 15

# ✅ Judge Configuration
[prompt_learning.judge]
enabled = true                     # Enable judge scoring
reward_source = "fused"            # "task_app", "judge", or "fused"
backend_base = "https://judge.synth.ai/api"
backend_api_key_env = "SYNTH_API_KEY"
backend_provider = "groq"
backend_model = "openai/gpt-oss-120b"
backend_rubric_id = "banking77/quality@v1"

# Reward fusion (when reward_source = "fused")
# Judge scores are weighted and combined with environment rewards
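
Before launching a run, you can sanity-check the judge settings by loading the TOML directly (a minimal sketch using Python's standard tomllib, available in 3.11+; the keys mirror the config above):
import tomllib

# Load the optimization config and confirm the judge section is wired up
with open("configs/banking77_with_judges.toml", "rb") as f:
    config = tomllib.load(f)

judge = config["prompt_learning"]["judge"]
assert judge["enabled"] is True
assert judge["reward_source"] in {"task_app", "judge", "fused"}
print(judge["backend_provider"], judge["backend_model"], judge["backend_rubric_id"])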

Judge Providers

Groq

[prompt_learning.judge]
enabled = true
reward_source = "fused"
backend_base = "https://judge.synth.ai/api"
backend_api_key_env = "SYNTH_API_KEY"
backend_provider = "groq"
backend_model = "openai/gpt-oss-120b"  # Fast and cost-effective
backend_rubric_id = "your-task/quality@v1"
Best for: Cost-effective optimization, fast iteration, large evaluation budgets

OpenAI

[prompt_learning.judge]
enabled = true
reward_source = "fused"
backend_base = "https://judge.synth.ai/api"
backend_api_key_env = "SYNTH_API_KEY"
backend_provider = "openai"
backend_model = "gpt-4o-mini"  # High-quality scoring
backend_rubric_id = "your-task/quality@v1"
Best for: High-quality scoring, structured evaluation, complex reasoning tasks

Reward Source Strategies

Task App Only (Default)

[prompt_learning.judge]
enabled = false  # or omit the [prompt_learning.judge] section entirely
Uses only environment rewards (accuracy) from task app. No judge evaluation.

Judge Only

[prompt_learning.judge]
enabled = true
reward_source = "judge"  # Use only judge scores
Optimizes purely based on judge quality scores. Useful when environment rewards are unreliable or when you want to optimize for quality over accuracy.

Fused Rewards (Recommended)

[prompt_learning.judge]
enabled = true
reward_source = "fused"  # Combine both rewards
Combines environment rewards with judge scores. The optimizer balances both accuracy and quality in its search. How fusion works:
  • Environment reward: env_reward = accuracy (0-1)
  • Judge reward: judge_reward = judge_score (0-1)
  • Fused reward: fused = env_reward + (judge_weight * judge_reward)
  • Default judge_weight is 0.25, giving environment rewards 4x weight
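
For intuition, here is a short sketch (illustrative numbers only, not from a real run) showing how the default weight trades accuracy against quality when ranking two candidate prompts:
JUDGE_WEIGHT = 0.25  # default: environment reward carries 4x the weight of the judge score

def fused(env_reward: float, judge_score: float, judge_weight: float = JUDGE_WEIGHT) -> float:
    # fused = env_reward + judge_weight * judge_score
    return env_reward + judge_weight * judge_score

# Variant A: higher accuracy, weaker quality
print(fused(0.80, 0.60))  # 0.95
# Variant B: slightly lower accuracy, much better quality
print(fused(0.75, 0.90))  # 0.975 -> B ranks higher on the fused score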

Defining Rubrics

Rubrics tell judges what to evaluate. Define them in your task app or use Synth-hosted rubrics:

Example Rubric (Banking77 Intent Classification)

# In your task app
from synth_ai.task.server import RubricBundle, GlobalRubric

rubrics = RubricBundle(
    global_rubric=GlobalRubric(
        id="banking77/quality@v1",
        description="Evaluate banking intent classification quality",
        criteria=[
            "Response clarity and professionalism",
            "Appropriate tone for banking context",
            "Completeness of response",
            "Reasoning quality (if applicable)"
        ],
        scoring_guide={
            "0.9-1.0": "Excellent: Clear, professional, complete",
            "0.7-0.9": "Good: Mostly clear and appropriate",
            "0.5-0.7": "Adequate: Basic quality, some issues",
            "0.3-0.5": "Poor: Unclear or inappropriate",
            "0.0-0.3": "Very poor: Confusing or unprofessional"
        }
    )
)

Rubric Best Practices

  1. Clear criteria – Specific, measurable aspects of quality
  2. Scoring scale – Consistent 0-1 scale with clear thresholds
  3. Task alignment – Rubric should align with optimization objectives
  4. Examples – Provide examples of good/bad responses when possible

Judge Score Tracking

Archive Storage

Judge scores are automatically stored in the optimization archive alongside accuracy metrics:
from synth_ai.learning import get_prompt_text

# Retrieve best prompt with judge score
best_entry = get_prompt_text(job_id="pl_abc123", rank=1)
print(f"Accuracy: {best_entry['accuracy']}")
print(f"Judge Score: {best_entry['judge_score']}")  # 0.0-1.0
print(f"Fused Score: {best_entry['fused_score']}")  # Combined score

Score Analysis

Judge scores enable:
  • Quality filtering: Filter prompts by judge score threshold
  • Multi-objective analysis: Compare accuracy vs quality trade-offs
  • Pareto optimization: GEPA maintains Pareto front including judge scores
  • Ablation studies: Compare optimization with/without judges
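
For example, a simple quality filter over the top-ranked archive entries could look like the sketch below; it reuses get_prompt_text from above, assumes the score keys listed earlier (accuracy, judge_score, fused_score), and uses an arbitrary threshold:
from synth_ai.learning import get_prompt_text

JUDGE_THRESHOLD = 0.8  # arbitrary quality cut-off for this example

# Walk the top 5 archive entries and keep only prompts that clear the quality bar
high_quality = []
for rank in range(1, 6):
    entry = get_prompt_text(job_id="pl_abc123", rank=rank)
    if entry["judge_score"] is not None and entry["judge_score"] >= JUDGE_THRESHOLD:
        high_quality.append(entry)
        print(rank, entry["accuracy"], entry["judge_score"], entry["fused_score"])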

Cost Optimization

Reduce Judge Costs

  1. Use cheaper providers
    [prompt_learning.judge]
    backend_provider = "groq"
    backend_model = "openai/gpt-oss-120b"  # ~10x cheaper than GPT-4o
    
  2. Outcome-only judging (default)
    • Judges evaluate full trajectories, not per-step
    • One judge call per prompt evaluation
    • Cost scales with number of prompt variants evaluated
  3. Sample evaluations
    • GEPA: Evaluates all variants (unavoidable for Pareto optimization)
    • MIPRO: Mini-batch evaluation naturally reduces judge calls

Cost Comparison

Setup                | Judge Calls / Generation | Cost / 1,000 Evaluations
Groq (gpt-oss-120b)  | 1 per variant            | ~$0.10
OpenAI (gpt-4o-mini) | 1 per variant            | ~$1.00
OpenAI (gpt-4o)      | 1 per variant            | ~$10.00
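
As a rough back-of-the-envelope estimate, total judge cost scales with the number of variants evaluated. The sketch below multiplies the example GEPA settings from this page by the per-1,000-evaluation figures above (all numbers are approximate, and it assumes one judge call per variant per generation):
# Rough judge-cost estimate (illustrative arithmetic only)
population_size = 20       # GEPA initial_population_size
num_generations = 15
judge_calls = population_size * num_generations   # ~300 judge calls for the run

cost_per_1000_evals = {
    "Groq (gpt-oss-120b)": 0.10,
    "OpenAI (gpt-4o-mini)": 1.00,
    "OpenAI (gpt-4o)": 10.00,
}
for setup, cost in cost_per_1000_evals.items():
    print(f"{setup}: ~${judge_calls / 1000 * cost:.2f}")
# Groq ~$0.03, gpt-4o-mini ~$0.30, gpt-4o ~$3.00 for the whole run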

Complete Example

Here’s a full GEPA config with judges:
[prompt_learning]
algorithm = "gepa"
task_app_url = "https://banking77-task.modal.run"
task_app_api_key = "your-api-key"
env_name = "banking77"

[prompt_learning.initial_prompt]
messages = [
    { role = "system", content = "You are a banking assistant..." },
    { role = "user", pattern = "Customer Query: {query}" }
]

[prompt_learning.gepa]
initial_population_size = 20
num_generations = 15
mutation_model = "openai/gpt-oss-120b"
mutation_provider = "groq"

[prompt_learning.judge]
enabled = true
reward_source = "fused"
backend_base = "https://judge.synth.ai/api"
backend_api_key_env = "SYNTH_API_KEY"
backend_provider = "groq"
backend_model = "openai/gpt-oss-120b"
backend_rubric_id = "banking77/quality@v1"

[prompt_learning.policy]
model = "meta/llama-3.1-8b-instant"
provider = "groq"
Run optimization:
uvx synth-ai train \
  --type prompt_learning \
  --config configs/banking77_with_judges.toml \
  --poll

Troubleshooting

Judges Too Slow

Symptom: Optimization stalls waiting for judge responses.
Solutions:
  • Switch to Groq provider (faster)
  • Use smaller judge models (gpt-oss-120b vs gpt-4o)
  • Reduce evaluation batch sizes

Judge Scores Not Appearing

Symptom: judge_score is None in archive entries.
Solutions:
  • Verify the [prompt_learning.judge] section is configured correctly
  • Check SYNTH_API_KEY environment variable is set
  • Ensure backend_rubric_id matches a valid rubric in your task app
  • Check logs for judge API errors

High Judge Costs

Symptom: Judge API calls dominate the optimization bill.
Solutions:
  • Use Groq instead of OpenAI
  • Reduce population size (GEPA) or iterations (MIPRO)
  • Consider judge-only optimization if environment rewards are unreliable

Summary

Judge-powered prompt optimization workflow:
  1. ✅ Define rubrics in your task app or use Synth-hosted rubrics
  2. ✅ Enable judges in config with [prompt_learning.judge] section
  3. ✅ Choose provider and model (groq/gpt-oss-120b recommended)
  4. ✅ Configure reward source (fused recommended for balanced optimization)
  5. ✅ Run optimization with uvx synth-ai train --type prompt_learning
  6. ✅ Analyze judge scores in optimization archive
Judges enable quality-aware prompt optimization, letting you optimize prompts that balance accuracy with desired behaviors, reasoning quality, and style preferences.