Overview
Synth supports two state-of-the-art prompt optimization algorithms:

- GEPA (Genetic Evolution of Prompt Architectures) - Population-based evolutionary search
- MIPRO (Meta-Instruction PROposer) - Meta-learning with Bayesian optimization
References
- GEPA: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457
- MIPRO: Opsahl-Ong et al. (2024). “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” arXiv:2406.11695
GEPA (Genetic Evolution of Prompt Architectures)
Reference: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457

GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. It also outperforms MIPROv2 by over 10% across two LLMs.

How It Works
GEPA uses evolutionary principles inspired by genetic algorithms (a minimal code sketch follows the steps below):

1. Population Initialization
   - Starts with the baseline prompt plus random mutations
   - Creates an initial population of 20-30 prompt variants
2. Evaluation
   - Evaluates each prompt variant on training seeds
   - Tracks multiple objectives: accuracy, token count, tool call rate
3. Selection (Pareto Front)
   - Maintains non-dominated solutions
   - Balances performance vs. prompt length
   - Keeps the top-K solutions in a Pareto archive
4. Variation
   - Mutation: LLM-guided or regex-based prompt modifications
   - Crossover: Combines two parent prompts to create offspring
5. Evolution Loop
   - Repeats for 10-15 generations
   - Population evolves toward better solutions
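The loop above can be summarized in a few lines. This is an illustrative sketch only, assuming hypothetical helpers `evaluate` (accuracy on the training seeds), `mutate_prompt` (LLM-guided or regex-based mutation), and `crossover`; it is not the Synth API.

```python
import random

def dominates(a, b):
    """a dominates b: no worse on every objective, strictly better on at least one.
    Objectives here: maximize accuracy, minimize prompt length."""
    return (a["accuracy"] >= b["accuracy"] and a["length"] <= b["length"]
            and (a["accuracy"] > b["accuracy"] or a["length"] < b["length"]))

def pareto_front(candidates):
    """Keep only non-dominated candidates (the Pareto archive)."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

def evolve(baseline, evaluate, mutate_prompt, crossover,
           population_size=24, generations=10):
    # 1. Population initialization: baseline plus random mutations.
    population = [baseline] + [mutate_prompt(baseline) for _ in range(population_size - 1)]
    archive = []
    for _ in range(generations):
        # 2. Evaluation: score each variant on the training seeds.
        scored = [{"prompt": p, "accuracy": evaluate(p), "length": len(p)}
                  for p in population]
        # 3. Selection: keep the non-dominated accuracy-vs-length trade-offs.
        archive = pareto_front(archive + scored)
        parents = [c["prompt"] for c in archive]
        # 4. Variation: mutation and crossover produce the next generation.
        population = []
        while len(population) < population_size:
            if len(parents) >= 2 and random.random() < 0.5:
                a, b = random.sample(parents, 2)
                population.append(crossover(a, b))
            else:
                population.append(mutate_prompt(random.choice(parents)))
    return archive
```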
Key Features
- Pareto Optimization: Maintains diverse solutions balancing multiple objectives
- LLM-Guided Mutations: Uses mutation models (e.g., `gpt-oss-120b`) for intelligent modifications
- Pattern Mode: Supports transformation-based mutations for systematic changes
- Multi-Stage Support: Module-aware evolution for pipeline optimization
- Reflective Feedback: Analyzes execution traces to guide mutations
Configuration Example
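A hypothetical GEPA configuration sketch. Field names and values are illustrative, not Synth's actual schema; see the Configuration Reference for the real parameters.

```python
# Illustrative GEPA settings (hypothetical field names, not Synth's schema).
gepa_config = {
    "algorithm": "gepa",
    "policy_model": "gpt-4o-mini",            # model whose prompt is optimized
    "mutation_model": "openai/gpt-oss-120b",  # LLM that proposes mutations
    "population_size": 24,                    # 20-30 variants is typical
    "generations": 10,                        # 10-15 is typical
    "training_seeds": 30,                     # seeds used per full evaluation
    "objectives": ["accuracy", "token_count", "tool_call_rate"],
    "pareto_archive_size": 8,                 # top-K non-dominated solutions
    "mutation_mode": "llm",                   # "llm" or "pattern" (regex-based)
}
```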
Typical Results
- Baseline: 60-75% accuracy
- After 5 generations: 75-80% accuracy
- After 10 generations: 80-85% accuracy
- After 15 generations: 85-90%+ accuracy
Best For
- Classification tasks (Banking77, intent classification)
- Multi-hop QA (HotpotQA)
- Tasks requiring diverse prompt variants
- Large evaluation budgets (1000+ rollouts)
MIPRO (Meta-Instruction PROposer)
Reference: Opsahl-Ong et al. (2024). “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” arXiv:2406.11695

MIPRO outperforms baseline optimizers on five of seven diverse multi-stage LM programs using Llama-3-8B, by up to 13% accuracy.

How It Works
MIPRO uses meta-learning to propose better instructions (a minimal code sketch follows the steps below):

1. Bootstrap Phase
   - Evaluates the baseline prompt on bootstrap seeds
   - Collects high-scoring examples (score >= threshold)
   - Generates few-shot demonstrations
   - Initializes the meta-model with task-specific context
2. Instruction Generation
   - Meta-LLM (e.g., GPT-4o-mini) proposes instruction variants
   - Uses few-shot examples, a reference corpus (up to 50k tokens), and system specs
   - Generates additive guidance (not rewrites)
3. TPE-Guided Search
   - A Tree-structured Parzen Estimator suggests candidates
   - Evaluates proposals on a mini-batch of seeds
   - Updates the TPE distribution based on results
4. Optimization Loop
   - Repeats for 10-20 iterations
   - Each iteration evaluates 4-6 prompt variants
   - TPE guides the search toward promising regions
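The sketch below illustrates the same loop, using Optuna's TPE sampler purely to stand in for the Tree-structured Parzen Estimator. `propose_instructions` and `evaluate_on_minibatch` are hypothetical helpers, not the Synth API.

```python
# Illustrative MIPRO-style loop: a meta-LLM proposes instruction variants and
# TPE decides which proposal to evaluate next on a mini-batch of seeds.
import optuna

def optimize(baseline_prompt, bootstrap_examples, propose_instructions,
             evaluate_on_minibatch, iterations=16, proposals_per_iteration=5):
    # Bootstrap phase: high-scoring examples seed the meta-LLM's context,
    # which then proposes candidate instructions (additive guidance).
    candidates = propose_instructions(baseline_prompt, bootstrap_examples,
                                      n=iterations * proposals_per_iteration)

    def objective(trial):
        # TPE suggests which candidate instruction to try next ...
        instruction = trial.suggest_categorical("instruction", candidates)
        # ... and the proposal is scored on a small mini-batch of seeds.
        return evaluate_on_minibatch(baseline_prompt + "\n" + instruction)

    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler())
    study.optimize(objective, n_trials=iterations * proposals_per_iteration)
    return baseline_prompt + "\n" + study.best_params["instruction"]
```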
Key Features
- Bootstrap Phase: Starts with task-specific examples (not cold-start)
- Meta-LLM Proposals: Uses GPT-4o-mini or similar for instruction generation
- Reference Corpus: Injects up to 50k tokens of dataset examples
- System Spec Integration: Uses JSON specifications for constraint-aware optimization
- Multi-Stage Support: Per-stage instruction proposals with LCS detection
- Token Budget Tracking: Monitors and enforces token limits
Configuration Example
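A hypothetical MIPRO configuration sketch. Field names and values are illustrative, not Synth's actual schema; see the Configuration Reference for the real parameters.

```python
# Illustrative MIPRO settings (hypothetical field names, not Synth's schema).
mipro_config = {
    "algorithm": "mipro",
    "policy_model": "gpt-4o-mini",          # model whose prompt is optimized
    "meta_model": "gpt-4o-mini",            # meta-LLM that proposes instructions
    "bootstrap_score_threshold": 0.8,       # keep examples scoring >= threshold
    "iterations": 16,                       # 10-20 is typical
    "proposals_per_iteration": 5,           # 4-6 variants per iteration
    "minibatch_size": 5,                    # seeds per mini-batch evaluation
    "reference_corpus_max_tokens": 50_000,  # dataset examples injected as context
}
```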
Typical Results
- Bootstrap Phase: Collects 3-5 high-scoring examples
- After 8 iterations: ~80-85% accuracy
- After 16 iterations: ~85-90% accuracy (similar to GEPA)
- Advantage: Achieves similar results with ~96 rollouts vs. ~1000 for GEPA
Best For
- Tasks with clear structure (can bootstrap with examples)
- Efficient optimization (fewer evaluations needed)
- Token budget constraints
- Task-specific improvements
Detailed Comparison
| Aspect | GEPA | MIPRO |
|---|---|---|
| Search Method | Genetic evolution (mutation + crossover) | Meta-LLM proposals + TPE |
| Initialization | Random population (20-30 variants) | Bootstrap phase (few-shot examples) |
| Exploration | Broad, diverse variants | Focused, efficient search |
| Guidance | Pareto optimization | Bayesian optimization (TPE) |
| Mutations | LLM-guided or regex-based | Meta-model proposals |
| Evaluation | Full evaluation on 30 seeds | Mini-batch on 5 seeds per iteration |
| Cost per Candidate | Lower (mutation calls only) | Higher (meta-model proposal calls) |
| Convergence | 10-15 generations | 10-20 iterations |
| Total Evaluations | ~1000 rollouts | ~96 rollouts |
| Best For | Broad exploration | Task-specific optimization |
| Pareto Front | ✅ Yes (diverse solutions) | ❌ No (single best solution) |
| Multi-Stage | ✅ Yes (module-aware) | ✅ Yes (per-stage proposals) |
Architecture: Inference Interception
Both algorithms use the same interceptor pattern (a minimal sketch follows this list):

- Task apps remain unchanged during optimization
- Prompt optimization logic stays in backend
- Secure, correct prompt substitution
- No prompt leakage to task apps
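A minimal sketch of the interception idea, assuming a generic chat-style inference interface. `InferenceClient` and `PromptInterceptor` are hypothetical names, not the Synth API.

```python
from typing import Protocol

class InferenceClient(Protocol):
    def chat(self, messages: list[dict]) -> str: ...

class PromptInterceptor:
    """Wraps the backend inference client; the task app is unaware of it."""

    def __init__(self, client: InferenceClient, candidate_system_prompt: str):
        self._client = client
        self._candidate = candidate_system_prompt

    def chat(self, messages: list[dict]) -> str:
        # Substitute the system message with the candidate under evaluation;
        # the task app never sees or stores the optimized prompt.
        patched = [
            {"role": "system", "content": self._candidate}
            if m["role"] == "system" else m
            for m in messages
        ]
        return self._client.chat(patched)
```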
Model Requirements
Policy Models (Both Algorithms)
- OpenAI: `gpt-4o`, `gpt-4o-mini`, `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`, `gpt-5`, `gpt-5-mini`, `gpt-5-nano`
- Groq: `gpt-oss-20b`, `gpt-oss-120b`, `llama-3.3-70b-versatile`, `qwen-32b`, `qwen3-32b`
- Google: `gemini-2.5-pro`, `gemini-2.5-pro-gt200k`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`
Mutation Models (GEPA Only)
- Common: `openai/gpt-oss-120b`, `llama-3.3-70b-versatile`
- Nano models are rejected (too small for generation)
Meta Models (MIPRO Only)
- Common: `gpt-4o-mini` (most common default), `gpt-4.1-mini`
- Nano models are rejected (too small for generation)

`gpt-5-pro` is explicitly rejected for all model types (too expensive).
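The rules above amount to a small check; the sketch below is illustrative only (the actual validation lives in the Synth backend).

```python
def validate_model(model: str, role: str) -> None:
    """role is one of 'policy', 'mutation', or 'meta'."""
    if model == "gpt-5-pro":
        raise ValueError("gpt-5-pro is rejected for all model types (too expensive)")
    if role in ("mutation", "meta") and "nano" in model:
        raise ValueError(f"{model} is rejected: nano models are too small for {role} generation")
```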
Multi-Stage Pipeline Support
Both algorithms support optimizing prompts for multi-stage pipelines; a short sketch follows each list below.

GEPA Multi-Stage
- Module-aware evolution: Each pipeline module gets its own gene
- Module selection: Mutations target specific modules
- Uniform crossover: Combines parent genes per module
- Aggregated scoring: Sum of module lengths for Pareto optimization
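A minimal sketch of module-aware uniform crossover, assuming each genome is a mapping from module name to that module's prompt text. Names are illustrative, not the Synth API.

```python
import random

def uniform_crossover(parent_a: dict[str, str], parent_b: dict[str, str]) -> dict[str, str]:
    """Each child gene (per-module prompt) is taken from one parent at random."""
    return {module: random.choice((parent_a[module], parent_b[module]))
            for module in parent_a}

def aggregate_length(genome: dict[str, str]) -> int:
    """Sum of per-module prompt lengths, used as the length objective in Pareto selection."""
    return sum(len(prompt) for prompt in genome.values())
```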
MIPRO Multi-Stage
- Per-stage proposals: Meta-LLM generates instructions for each stage
- LCS detection: Automatically identifies which stage is being called
- Stage-specific meta-prompts: Includes pipeline overview, stage role, baseline
- Unified evaluation: Tracks end-to-end performance
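A minimal sketch of the stage-detection idea: the intercepted prompt is compared against each stage's baseline and attributed to the closest match. `difflib.SequenceMatcher` is used here purely as an illustration of subsequence-style matching; the exact matching Synth uses may differ.

```python
from difflib import SequenceMatcher

def detect_stage(observed_prompt: str, stage_baselines: dict[str, str]) -> str:
    """Return the name of the stage whose baseline prompt best matches the observed prompt."""
    def similarity(baseline: str) -> float:
        return SequenceMatcher(None, observed_prompt, baseline).ratio()
    return max(stage_baselines, key=lambda stage: similarity(stage_baselines[stage]))
```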
Choosing the Right Algorithm
Use GEPA if:

- ✅ You want diverse prompt variants (Pareto front)
- ✅ You have a large evaluation budget (1000+ rollouts)
- ✅ You need broad exploration of the prompt space
- ✅ You’re optimizing classification or multi-hop QA tasks

Use MIPRO if:

- ✅ You want faster convergence with fewer evaluations
- ✅ You have clear task structure (can bootstrap with examples)
- ✅ You need efficient optimization (mini-batch evaluation)
- ✅ You have token budget constraints
- ✅ You want task-specific improvements
Next Steps
- Configuration Reference – Complete parameter documentation
- Training Guide – Step-by-step instructions
- Banking77 Example – Complete walkthrough