Overview

Synth supports two state-of-the-art prompt optimization algorithms:
  1. GEPA (Genetic-Pareto) - Population-based evolutionary search
  2. MIPRO (Multiprompt Instruction PRoposal Optimizer) - Meta-learning with Bayesian optimization

References

  • GEPA: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457
  • MIPRO: Opsahl-Ong et al. (2024). “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” arXiv:2406.11695

GEPA (Genetic-Pareto)

In the paper's reported benchmarks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. It also outperforms MIPROv2 by over 10% across two LLMs.

How It Works

GEPA uses evolutionary principles inspired by genetic algorithms; a code sketch of the loop follows this list:
  1. Population Initialization
    • Starts with baseline prompt + random mutations
    • Creates initial population of 20-30 prompt variants
  2. Evaluation
    • Evaluates each prompt variant on training seeds
    • Tracks multiple objectives: accuracy, token count, tool call rate
  3. Selection (Pareto Front)
    • Maintains non-dominated solutions
    • Balances performance vs. prompt length
    • Keeps top-K solutions in Pareto archive
  4. Variation
    • Mutation: LLM-guided or regex-based prompt modifications
    • Crossover: Combines two parent prompts to create offspring
  5. Evolution Loop
    • Repeats for 10-15 generations
    • Population evolves toward better solutions
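
The loop below is a minimal sketch of this process, not Synth's implementation. `evaluate`, `mutate`, and `crossover` are hypothetical stand-ins for seed evaluation, LLM-guided mutation, and parent recombination, and `evaluate(prompt)` is assumed to return a dict with `accuracy` and `tokens`.

```python
import random

def pareto_front(candidates):
    """Keep variants no other variant dominates (higher accuracy, fewer tokens)."""
    front = []
    for c in candidates:
        dominated = any(
            o["accuracy"] >= c["accuracy"] and o["tokens"] <= c["tokens"]
            and (o["accuracy"] > c["accuracy"] or o["tokens"] < c["tokens"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

def run_gepa(baseline, evaluate, mutate, crossover, pop_size=24, generations=12):
    # 1. Population initialization: baseline plus random mutations.
    population = [baseline] + [mutate(baseline) for _ in range(pop_size - 1)]
    archive = []
    for _ in range(generations):
        # 2. Evaluation: score every variant on the training seeds.
        scored = [{"prompt": p, **evaluate(p)} for p in population]
        # 3. Selection: keep only non-dominated variants in the Pareto archive.
        archive = pareto_front(archive + scored)
        parents = [c["prompt"] for c in archive]
        # 4. Variation: cross two parents, then apply an LLM-guided mutation.
        population = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size)
        ]
    return archive  # the final Pareto front of prompt variants
```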

Key Features

  • Pareto Optimization: Maintains diverse solutions balancing multiple objectives
  • LLM-Guided Mutations: Uses mutation models (e.g., gpt-oss-120b) for intelligent modifications
  • Pattern Mode: Supports transformation-based mutations for systematic changes
  • Multi-Stage Support: Module-aware evolution for pipeline optimization
  • Reflective Feedback: Analyzes execution traces to guide mutations (sketched after this list)
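
As a concrete example of reflective feedback, the sketch below builds a mutation request from failing execution traces. `call_mutation_model` is a hypothetical wrapper around a mutation model such as openai/gpt-oss-120b, and the trace fields (`input`, `output`, `expected`) are assumptions, not Synth's trace schema.

```python
def reflective_mutation(prompt: str, failures: list[dict], call_mutation_model) -> str:
    """Ask the mutation model to revise `prompt` in light of failing traces."""
    trace_text = "\n\n".join(
        f"Input: {f['input']}\nOutput: {f['output']}\nExpected: {f['expected']}"
        for f in failures[:5]  # a few traces give enough context
    )
    request = (
        "You are improving a task prompt. Current prompt:\n"
        f"{prompt}\n\n"
        "These rollouts failed:\n"
        f"{trace_text}\n\n"
        "Rewrite the prompt to address the failures. Return only the new prompt."
    )
    return call_mutation_model(request)
```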

Typical Results

  • Baseline: 60-75% accuracy
  • After 5 generations: 75-80% accuracy
  • After 10 generations: 80-85% accuracy
  • After 15 generations: 85-90%+ accuracy

Best For

  • Classification tasks (Banking77, intent classification)
  • Multi-hop QA (HotpotQA)
  • Tasks requiring diverse prompt variants
  • Large evaluation budgets (1000+ rollouts)

MIPRO (Multiprompt Instruction PRoposal Optimizer)

In its paper's evaluation, MIPRO outperforms baseline optimizers on five of seven diverse multi-stage LM programs using Llama-3-8B, by up to 13% accuracy.

How It Works

MIPRO uses meta-learning to propose better instructions; a code sketch of the loop follows this list:
  1. Bootstrap Phase
    • Evaluates baseline prompt on bootstrap seeds
    • Collects high-scoring examples (score >= threshold)
    • Generates few-shot demonstrations
    • Initializes meta-model with task-specific context
  2. Instruction Generation
    • Meta-LLM (e.g., GPT-4o-mini) proposes instruction variants
    • Uses few-shot examples, a reference corpus (up to 50k tokens), and system specs
    • Generates additive guidance (not rewrites)
  3. TPE-Guided Search
    • Tree-structured Parzen Estimator suggests candidates
    • Evaluates proposals on mini-batch of seeds
    • Updates TPE distribution based on results
  4. Optimization Loop
    • Repeats for 10-20 iterations
    • Each iteration evaluates 4-6 prompt variants
    • TPE guides search toward promising regions
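
A minimal sketch of this loop under stated assumptions: `run_rollout(prompt, seed)` returning a dict with a `score`, and `propose(baseline, demos)` wrapping the meta-LLM, are hypothetical; `_suggest` is a deliberately simplified top-quartile rule standing in for a real Tree-structured Parzen Estimator.

```python
import random

def run_mipro(baseline, run_rollout, propose, bootstrap_seeds, search_seeds,
              iterations=16, batch_size=5, threshold=0.8, n_candidates=8):
    # 1. Bootstrap phase: run the baseline prompt and keep high-scoring
    #    rollouts as few-shot demonstrations.
    traces = [run_rollout(baseline, seed) for seed in bootstrap_seeds]
    demos = [t for t in traces if t["score"] >= threshold]

    # 2. Instruction generation: the meta-LLM proposes additive instruction
    #    candidates conditioned on the demonstrations.
    candidates = [propose(baseline, demos) for _ in range(n_candidates)]

    # 3-4. Search loop: score candidates on mini-batches of seeds.
    history = []  # (candidate index, mini-batch score) observations
    for _ in range(iterations):
        idx = _suggest(history, n_candidates)
        batch = random.sample(search_seeds, batch_size)
        score = sum(run_rollout(candidates[idx], s)["score"] for s in batch) / batch_size
        history.append((idx, score))

    best_idx, _ = max(history, key=lambda h: h[1])
    return candidates[best_idx]

def _suggest(history, n):
    """Prefer candidates that landed in the top quartile of past trials."""
    if len(history) < 4:
        return random.randrange(n)
    cut = sorted(s for _, s in history)[int(0.75 * len(history))]
    good = [i for i, s in history if s >= cut]
    weights = [1 + good.count(i) for i in range(n)]
    return random.choices(range(n), weights=weights)[0]
```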

Key Features

  • Bootstrap Phase: Starts with task-specific examples (not cold-start)
  • Meta-LLM Proposals: Uses GPT-4o-mini or similar for instruction generation
  • Reference Corpus: Injects up to 50k tokens of dataset examples
  • System Spec Integration: Uses JSON specifications for constraint-aware optimization
  • Multi-Stage Support: Per-stage instruction proposals with LCS detection
  • Token Budget Tracking: Monitors and enforces token limits (see the corpus-budget sketch after this list)
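
A hedged sketch of how the reference-corpus budget could be enforced: pack dataset examples into the meta-prompt until the approximate 50k-token limit is reached. The chars-per-token heuristic is an assumption; a real implementation would count with the meta-model's tokenizer.

```python
def build_reference_corpus(examples: list[str], budget: int = 50_000) -> str:
    """Pack examples into the meta-prompt until the token budget is spent."""
    def count_tokens(text: str) -> int:
        return len(text) // 4  # rough chars-per-token heuristic (assumption)

    kept, used = [], 0
    for example in examples:
        cost = count_tokens(example)
        if used + cost > budget:
            break  # enforce the budget: stop before overflowing
        kept.append(example)
        used += cost
    return "\n\n".join(kept)
```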

Typical Results

  • Bootstrap Phase: Collects 3-5 high-scoring examples
  • After 8 iterations: ~80-85% accuracy
  • After 16 iterations: ~85-90% accuracy (similar to GEPA)
  • Advantage: Achieves similar results with ~96 rollouts vs. ~1000 for GEPA

Best For

  • Tasks with clear structure (can bootstrap with examples)
  • Efficient optimization (fewer evaluations needed)
  • Token budget constraints
  • Task-specific improvements

Detailed Comparison

| Aspect | GEPA | MIPRO |
| --- | --- | --- |
| Search Method | Genetic evolution (mutation + crossover) | Meta-LLM proposals + TPE |
| Initialization | Random population (20-30 variants) | Bootstrap phase (few-shot examples) |
| Exploration | Broad, diverse variants | Focused, efficient search |
| Guidance | Pareto optimization | Bayesian optimization (TPE) |
| Mutations | LLM-guided or regex-based | Meta-model proposals |
| Evaluation | Full evaluation on 30 seeds | Mini-batch on 5 seeds per iteration |
| Computational Cost | Lower (fewer LLM calls) | Higher (meta-model calls) |
| Convergence | 10-15 generations | 10-20 iterations |
| Total Evaluations | ~1000 rollouts | ~96 rollouts |
| Best For | Broad exploration | Task-specific optimization |
| Pareto Front | ✅ Yes (diverse solutions) | ❌ No (single best solution) |
| Multi-Stage | ✅ Yes (module-aware) | ✅ Yes (per-stage proposals) |

Architecture: Inference Interception

Both algorithms use the same interceptor pattern: the optimizer sits between the task app and the inference provider and substitutes candidate prompts into requests on the backend (a sketch follows the list below).

Key Benefits:
  • Task apps remain unchanged during optimization
  • Prompt optimization logic stays in backend
  • Secure, correct prompt substitution
  • No prompt leakage to task apps
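
A minimal sketch of the interceptor pattern, not Synth's actual API: the optimizer wraps the inference client, so the task app keeps sending its original messages while the backend swaps in the candidate system prompt. The `chat` method and message shape are assumptions modeled on common chat-completion clients.

```python
class PromptInterceptor:
    """Wraps an inference client; substitutes the candidate system prompt."""

    def __init__(self, inference_client, candidate_system_prompt: str):
        self._client = inference_client
        self._candidate = candidate_system_prompt

    def chat(self, messages: list[dict], **kwargs):
        # Swap the system message; everything else passes through unchanged,
        # so the task app never sees (or stores) the candidate text.
        patched = [
            {**m, "content": self._candidate} if m["role"] == "system" else m
            for m in messages
        ]
        return self._client.chat(patched, **kwargs)
```

Because the substitution happens inside the wrapper, the candidate text never reaches the task app, which is what prevents prompt leakage.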

Model Requirements

Policy Models (Both Algorithms)

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-5, gpt-5-mini, gpt-5-nano
  • Groq: gpt-oss-20b, gpt-oss-120b, llama-3.3-70b-versatile, qwen-32b, qwen3-32b
  • Google: gemini-2.5-pro, gemini-2.5-pro-gt200k, gemini-2.5-flash, gemini-2.5-flash-lite

Mutation Models (GEPA Only)

  • Common: openai/gpt-oss-120b, llama-3.3-70b-versatile
  • Nano models rejected (too small for generation)

Meta Models (MIPRO Only)

  • Common: gpt-4o-mini (default), gpt-4.1-mini
  • Nano models rejected (too small for generation)
Note: gpt-5-pro is explicitly rejected for all model types (too expensive)
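
These constraints could be enforced with a validator along these lines; the model sets below are illustrative, not a copy of Synth's registry.

```python
REJECTED_EVERYWHERE = {"gpt-5-pro"}           # too expensive for any role
NANO_MODELS = {"gpt-4.1-nano", "gpt-5-nano"}  # too small for generation

def validate_model(model: str, role: str) -> None:
    """Raise if `model` is not allowed for `role` (policy/mutation/meta)."""
    if model in REJECTED_EVERYWHERE:
        raise ValueError(f"{model} is rejected for all model types")
    if role in ("mutation", "meta") and model in NANO_MODELS:
        raise ValueError(f"{model} is too small for {role} generation")
```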

Multi-Stage Pipeline Support

Both algorithms support optimizing prompts for multi-stage pipelines:

GEPA Multi-Stage

  • Module-aware evolution: Each pipeline module gets its own gene (see the sketch after this list)
  • Module selection: Mutations target specific modules
  • Uniform crossover: Combines parent genes per module
  • Aggregated scoring: Sum of module lengths for Pareto optimization
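
A minimal sketch of module-aware evolution, assuming a genome is simply a mapping from module name to prompt text; `mutate_prompt` is a hypothetical LLM-guided operator.

```python
import random

Genome = dict[str, str]  # module name -> that module's prompt (its "gene")

def uniform_crossover(a: Genome, b: Genome) -> Genome:
    """Pick each module's gene from either parent with equal probability."""
    return {module: random.choice((a[module], b[module])) for module in a}

def mutate_one_module(genome: Genome, mutate_prompt) -> Genome:
    """Module selection: target a single module's prompt per mutation."""
    target = random.choice(list(genome))
    return {**genome, target: mutate_prompt(genome[target])}

def total_length(genome: Genome) -> int:
    """Aggregated length objective for Pareto optimization."""
    return sum(len(prompt) for prompt in genome.values())
```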

MIPRO Multi-Stage

  • Per-stage proposals: Meta-LLM generates instructions for each stage
  • LCS detection: Automatically identifies which stage is being called (sketched after this list)
  • Stage-specific meta-prompts: Includes pipeline overview, stage role, baseline
  • Unified evaluation: Tracks end-to-end performance
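
A hedged sketch of stage detection via longest common substring, using Python's difflib: the intercepted system prompt is matched against each stage's baseline prompt, and the stage with the longest match wins. Stage names and baselines are supplied by the caller.

```python
from difflib import SequenceMatcher

def detect_stage(intercepted: str, baselines: dict[str, str]) -> str:
    """Return the stage whose baseline prompt shares the longest substring."""
    def lcs_len(a: str, b: str) -> int:
        match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return match.size

    return max(baselines, key=lambda stage: lcs_len(intercepted, baselines[stage]))
```

For example, `detect_stage(prompt, {"retrieve": retrieve_baseline, "answer": answer_baseline})` returns "retrieve" when the intercepted prompt overlaps most with the retrieval stage's baseline.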

Choosing the Right Algorithm

Use GEPA if:
  • ✅ You want diverse prompt variants (Pareto front)
  • ✅ You have a large evaluation budget (1000+ rollouts)
  • ✅ You need broad exploration of the prompt space
  • ✅ You’re optimizing classification or multi-hop QA tasks
Use MIPRO if:
  • ✅ You want faster convergence with fewer evaluations
  • ✅ You have clear task structure (can bootstrap with examples)
  • ✅ You need efficient optimization (mini-batch evaluation)
  • ✅ You have token budget constraints
  • ✅ You want task-specific improvements

Next Steps