
Overview

Synth supports two state-of-the-art prompt optimization algorithms:
  1. GEPA (Genetic Evolution of Prompt Architectures) - Population-based evolutionary search
  2. MIPRO (Meta-Instruction PROposer) - Meta-learning with Bayesian optimization
Both algorithms use an interceptor pattern that ensures optimized prompts never reach task apps. All prompt modifications happen in the backend via an inference interceptor.

References

  • GEPA: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457
  • MIPRO: Opsahl-Ong et al. (2024). “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” arXiv:2406.11695

GEPA (Genetic Evolution of Prompt Architectures)

Reference: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457.
In the paper’s experiments, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. It also outperforms MIPROv2 by over 10% across two LLMs.

How It Works

GEPA uses evolutionary principles inspired by genetic algorithms (a minimal sketch of the loop follows this list):
  1. Population Initialization
    • Starts with baseline prompt + random mutations
    • Creates initial population of 20-30 prompt variants
  2. Evaluation
    • Evaluates each prompt variant on training seeds
    • Tracks multiple objectives: accuracy, token count, tool call rate
  3. Selection (Pareto Front)
    • Maintains non-dominated solutions
    • Balances performance vs. prompt length
    • Keeps top-K solutions in Pareto archive
  4. Variation
    • Mutation: LLM-guided or regex-based prompt modifications
    • Crossover: Combines two parent prompts to create offspring
  5. Evolution Loop
    • Repeats for 10-15 generations
    • Population evolves toward better solutions
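The skeleton of that loop as a minimal runnable sketch. Here evaluate, mutate, and crossover are hypothetical placeholders for the backend's rollout scoring and LLM-guided operators (not Synth's actual API), and scoring is random so the example stays self-contained:

import random

# Hypothetical stand-ins for the backend's operators; a real run scores each
# variant on training seeds via task-app rollouts.
def evaluate(prompt: str) -> float:
    return random.random()                       # placeholder for rollout accuracy

def mutate(prompt: str) -> str:
    # Placeholder for an LLM-guided (or regex-based) mutation
    return prompt + " " + random.choice(
        ["Think step by step.", "Answer with a single label.", "Be concise."])

def crossover(a: str, b: str) -> str:
    # Combine the first half of one parent with the second half of the other
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])

baseline = "Classify the customer message into one of the 77 banking intents."
population = [baseline] + [mutate(baseline) for _ in range(19)]      # initial_population_size = 20

for generation in range(15):                                         # num_generations
    ranked = sorted(population, key=evaluate, reverse=True)
    parents = ranked[:10]    # the real backend keeps a Pareto archive instead (see Key Features)
    children = []
    while len(children) < len(population):
        if random.random() < 0.5:                                    # crossover_rate
            children.append(crossover(*random.sample(parents, 2)))
        else:
            children.append(mutate(random.choice(parents)))
    population = children

best = max(population, key=evaluate)
print(best)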

Key Features

  • Pareto Optimization: Maintains diverse solutions balancing multiple objectives (dominance check sketched after this list)
  • LLM-Guided Mutations: Uses mutation models (e.g., gpt-oss-120b) for intelligent modifications
  • Pattern Mode: Supports transformation-based mutations for systematic changes
  • Multi-Stage Support: Module-aware evolution for pipeline optimization
  • Reflective Feedback: Analyzes execution traces to guide mutations
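A minimal sketch of the Pareto bookkeeping over two of the tracked objectives (accuracy up, prompt length down). Candidate and the helper names are illustrative, not the backend's types:

from typing import NamedTuple

class Candidate(NamedTuple):
    prompt: str
    accuracy: float      # higher is better
    tokens: int          # lower is better (prompt length)

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is at least as good on every objective and strictly better on one."""
    return (a.accuracy >= b.accuracy and a.tokens <= b.tokens
            and (a.accuracy > b.accuracy or a.tokens < b.tokens))

def pareto_front(candidates: list[Candidate], max_size: int = 20) -> list[Candidate]:
    """Keep non-dominated candidates, truncated to the archive size (pareto_set_size)."""
    front = [c for c in candidates
             if not any(dominates(other, c) for other in candidates if other is not c)]
    # Tie-break by accuracy when the archive overflows
    return sorted(front, key=lambda c: c.accuracy, reverse=True)[:max_size]

# Example: a short prompt with slightly lower accuracy survives alongside a longer, more accurate one.
front = pareto_front([
    Candidate("Classify the intent.", 0.78, 4),
    Candidate("Classify the intent. Think step by step and answer with one label.", 0.84, 13),
    Candidate("Classify the intent carefully.", 0.77, 5),   # dominated by the first candidate
])
print([c.prompt for c in front])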

Configuration Example

[prompt_learning]
algorithm = "gepa"
task_app_url = "http://127.0.0.1:8102"

[prompt_learning.gepa]
initial_population_size = 20        # prompt variants in generation 0
num_generations = 15                # evolution loop iterations
mutation_rate = 0.3
crossover_rate = 0.5
rollout_budget = 1000               # total rollouts for the run
max_concurrent_rollouts = 20        # parallel rollouts against the task app
pareto_set_size = 20                # size of the Pareto archive

[prompt_learning.gepa.mutation]
llm_model = "openai/gpt-oss-120b"   # mutation model for LLM-guided mutations
llm_provider = "groq"

Typical Results

  • Baseline: 60-75% accuracy
  • After 5 generations: 75-80% accuracy
  • After 10 generations: 80-85% accuracy
  • After 15 generations: 85-90%+ accuracy

Best For

  • Classification tasks (Banking77, intent classification)
  • Multi-hop QA (HotpotQA)
  • Tasks requiring diverse prompt variants
  • Large evaluation budgets (1000+ rollouts)

MIPRO (Meta-Instruction PROposer)

Reference: Opsahl-Ong et al. (2024). “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” arXiv:2406.11695.
In the paper’s experiments, MIPRO outperforms baseline optimizers on five of seven diverse multi-stage LM programs using Llama-3-8B, by up to 13% accuracy.

How It Works

MIPRO uses meta-learning to propose better instructions (a condensed sketch follows this list):
  1. Bootstrap Phase
    • Evaluates baseline prompt on bootstrap seeds
    • Collects high-scoring examples (score >= threshold)
    • Generates few-shot demonstrations
    • Initializes meta-model with task-specific context
  2. Instruction Generation
    • Meta-LLM (e.g., GPT-4o-mini) proposes instruction variants
    • Uses few-shot examples, a reference corpus (up to 50k tokens), and system specs
    • Generates additive guidance (not rewrites)
  3. TPE-Guided Search
    • Tree-structured Parzen Estimator suggests candidates
    • Evaluates proposals on mini-batch of seeds
    • Updates TPE distribution based on results
  4. Optimization Loop
    • Repeats for 10-20 iterations
    • Each iteration evaluates 4-6 prompt variants
    • TPE guides search toward promising regions
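A condensed sketch of these four steps, using the optuna library as a stand-in TPE implementation (the backend's TPE is its own); run_seed and propose_instructions are hypothetical placeholders for task-app rollouts and the meta-LLM call:

import random
import optuna   # stand-in TPE implementation, for illustration only

def run_seed(instruction: str, seed: int) -> float:
    # Placeholder for one task-app rollout scored by the verifier
    return random.random()

def propose_instructions(demos: list[tuple[int, float]], n: int = 8) -> list[str]:
    # Placeholder for the meta-LLM call that turns demos + reference corpus into additive guidance
    return [f"Variant {i}: cite the relevant passage before answering." for i in range(n)]

bootstrap_seeds = [0, 1, 2, 3, 4]
online_pool = [5, 6, 7, 8, 9]
baseline = "Answer the multi-hop question using the provided passages."

# 1. Bootstrap: keep high-scoring baseline runs as few-shot demonstrations
demos = [(s, run_seed(baseline, s)) for s in bootstrap_seeds]
demos = [d for d in demos if d[1] >= 0.85]       # few_shot_score_threshold

# 2. Instruction generation by the meta-LLM (stubbed above)
candidates = [baseline] + propose_instructions(demos)

# 3-4. TPE-guided search over candidates, scored on a mini-batch of online seeds
def objective(trial: optuna.Trial) -> float:
    instruction = trial.suggest_categorical("instruction", candidates)
    batch = random.sample(online_pool, 3)
    return sum(run_seed(instruction, s) for s in batch) / len(batch)

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=16)           # num_iterations
print(study.best_params["instruction"], study.best_value)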

Key Features

  • Bootstrap Phase: Starts with task-specific examples (not cold-start)
  • Meta-LLM Proposals: Uses GPT-4o-mini or similar for instruction generation
  • Reference Corpus: Injects up to 50k tokens of dataset examples
  • System Spec Integration: Uses JSON specifications for constraint-aware optimization
  • Multi-Stage Support: Per-stage instruction proposals with LCS detection
  • Token Budget Tracking: Monitors and enforces token limits

Configuration Example

[prompt_learning]
algorithm = "mipro"
task_app_url = "https://my-task-app.modal.run"

[prompt_learning.mipro]
num_iterations = 16                   # optimization loop iterations
num_evaluations_per_iteration = 6     # prompt variants evaluated per iteration
batch_size = 6                        # seeds per mini-batch evaluation
max_concurrent = 20                   # parallel rollouts against the task app

# Seed pools
bootstrap_train_seeds = [0, 1, 2, 3, 4]
online_pool = [5, 6, 7, 8, 9]
test_pool = [20, 21, 22, 23, 24]
reference_pool = [50, 51, 52, ..., 149]  # Optional: for meta-prompt context

# Meta-model
meta_model = "gpt-4o-mini"
meta_model_provider = "openai"
few_shot_score_threshold = 0.85       # minimum score for bootstrap demonstrations

Typical Results

  • Bootstrap Phase: Collects 3-5 high-scoring examples
  • After 8 iterations: ~80-85% accuracy
  • After 16 iterations: ~85-90% accuracy (similar to GEPA)
  • Advantage: Achieves similar results with ~96 rollouts (16 iterations × 6 evaluations per iteration) vs. ~1000 for GEPA

Best For

  • Tasks with clear structure (can bootstrap with examples)
  • Efficient optimization (fewer evaluations needed)
  • Token budget constraints
  • Task-specific improvements

Detailed Comparison

| Aspect | GEPA | MIPRO |
| --- | --- | --- |
| Search Method | Genetic evolution (mutation + crossover) | Meta-LLM proposals + TPE |
| Initialization | Random population (20-30 variants) | Bootstrap phase (few-shot examples) |
| Exploration | Broad, diverse variants | Focused, efficient search |
| Guidance | Pareto optimization | Bayesian optimization (TPE) |
| Mutations | LLM-guided or regex-based | Meta-model proposals |
| Evaluation | Full evaluation on 30 seeds | Mini-batch on 5 seeds per iteration |
| Computational Cost | Lower (fewer LLM calls) | Higher (meta-model calls) |
| Convergence | 10-15 generations | 10-20 iterations |
| Total Evaluations | ~1000 rollouts | ~96 rollouts |
| Best For | Broad exploration | Task-specific optimization |
| Pareto Front | ✅ Yes (diverse solutions) | ❌ No (single best solution) |
| Multi-Stage | ✅ Yes (module-aware) | ✅ Yes (per-stage proposals) |

Architecture: Inference Interception

Both algorithms use the same interceptor pattern (a minimal sketch of the substitution follows the Key Benefits list):
✅ CORRECT FLOW:
1. Backend registers optimized prompt with interceptor
2. Task app calls LLM with clean policy config (no prompts)
3. Interceptor substitutes optimized prompt before forwarding to LLM
4. LLM receives optimized prompt, returns result
5. Task app evaluates result, returns score

❌ WRONG FLOW (NEVER DO THIS):
Backend → prompt_template in payload → Task App

Key Benefits:
  • Task apps remain unchanged during optimization
  • Prompt optimization logic stays in backend
  • Secure, correct prompt substitution
  • No prompt leakage to task apps
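A minimal sketch of the substitution step, with a hypothetical in-memory registry and request shape (register_prompt and intercept are illustrative names, not the backend's API):

from typing import Any

# Hypothetical registry: the backend stores the current optimized prompt per job
# and rewrites requests in flight.
OPTIMIZED_PROMPTS: dict[str, str] = {}

def register_prompt(job_id: str, prompt: str) -> None:
    """Step 1: the optimizer registers the candidate prompt with the interceptor."""
    OPTIMIZED_PROMPTS[job_id] = prompt

def intercept(job_id: str, request: dict[str, Any]) -> dict[str, Any]:
    """Steps 2-3: the task app sends a clean request; the interceptor injects the
    optimized prompt as the system message before forwarding to the LLM provider."""
    optimized = OPTIMIZED_PROMPTS.get(job_id)
    if optimized is None:
        return request                              # nothing registered: pass through
    messages = [m for m in request.get("messages", []) if m.get("role") != "system"]
    return {**request, "messages": [{"role": "system", "content": optimized}, *messages]}

# Example: the task app never sees the candidate prompt
register_prompt("job-123", "You are a banking assistant. Answer with exactly one intent label.")
clean_request = {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "I lost my card."}]}
forwarded = intercept("job-123", clean_request)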

Model Requirements

Policy Models (Both Algorithms)

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-5, gpt-5-mini, gpt-5-nano
  • Groq: gpt-oss-20b, gpt-oss-120b, llama-3.3-70b-versatile, qwen-32b, qwen3-32b
  • Google: gemini-2.5-pro, gemini-2.5-pro-gt200k, gemini-2.5-flash, gemini-2.5-flash-lite

Mutation Models (GEPA Only)

  • Common: openai/gpt-oss-120b, llama-3.3-70b-versatile
  • Nano models are rejected (too small for generation)

Meta Models (MIPRO Only)

  • Common: gpt-4o-mini (the usual default), gpt-4.1-mini
  • Nano models are rejected (too small for generation)
Note: gpt-5-pro is explicitly rejected for all model types (too expensive)

Multi-Stage Pipeline Support

Both algorithms support optimizing prompts for multi-stage pipelines:

GEPA Multi-Stage

  • Module-aware evolution: Each pipeline module gets its own gene
  • Module selection: Mutations target specific modules
  • Uniform crossover: Combines parent genes per module (sketched after this list)
  • Aggregated scoring: Sum of module lengths for Pareto optimization
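A minimal sketch of module-aware crossover and the aggregated length objective, assuming a hypothetical two-stage retrieve-then-answer pipeline:

import random

# A multi-stage candidate is one gene (prompt) per pipeline module.
parent_a = {"retrieve": "Write a focused search query for the question.",
            "answer":   "Answer using only the retrieved passages."}
parent_b = {"retrieve": "List the entities you must look up, then query for each.",
            "answer":   "Answer concisely and cite the passage you used."}

def uniform_crossover(a: dict[str, str], b: dict[str, str]) -> dict[str, str]:
    """Each module's gene is inherited from either parent with equal probability."""
    return {module: random.choice([a[module], b[module]]) for module in a}

def aggregate_length(candidate: dict[str, str]) -> int:
    """Length objective for Pareto optimization: sum of module prompt lengths."""
    return sum(len(prompt.split()) for prompt in candidate.values())

child = uniform_crossover(parent_a, parent_b)
print(child, aggregate_length(child))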

MIPRO Multi-Stage

  • Per-stage proposals: Meta-LLM generates instructions for each stage
  • LCS detection: Automatically identifies which stage is being called (sketched after this list)
  • Stage-specific meta-prompts: Includes pipeline overview, stage role, baseline
  • Unified evaluation: Tracks end-to-end performance
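A minimal sketch of LCS-based stage detection using Python's difflib; the stage names and baseline instructions are hypothetical:

from difflib import SequenceMatcher

# Hypothetical baseline instructions for each pipeline stage
STAGE_BASELINES = {
    "retrieve": "Write a focused search query for the question.",
    "answer": "Answer the question using only the retrieved passages.",
}

def detect_stage(incoming_system_prompt: str) -> str:
    """Match an intercepted request to the stage whose baseline shares the
    longest common substring with the request's system prompt."""
    def lcs_len(a: str, b: str) -> int:
        m = SequenceMatcher(None, a, b)
        return m.find_longest_match(0, len(a), 0, len(b)).size
    return max(STAGE_BASELINES, key=lambda s: lcs_len(incoming_system_prompt, STAGE_BASELINES[s]))

# Example: the request carries the answer-stage baseline plus task-specific context
stage = detect_stage("Answer the question using only the retrieved passages. Context: ...")
print(stage)   # -> "answer"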

Choosing the Right Algorithm

Use GEPA if:
  • ✅ You want diverse prompt variants (Pareto front)
  • ✅ You have a large evaluation budget (1000+ rollouts)
  • ✅ You need broad exploration of the prompt space
  • ✅ You’re optimizing classification or multi-hop QA tasks
Use MIPRO if:
  • ✅ You want faster convergence with fewer evaluations
  • ✅ You have clear task structure (can bootstrap with examples)
  • ✅ You need efficient optimization (mini-batch evaluation)
  • ✅ You have token budget constraints
  • ✅ You want task-specific improvements

Next Steps