- GEPA: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457
- MIPRO: Opsahl-Ong et al. (2024). “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” arXiv:2406.11695
1. Build a prompt evaluation task app
Use the TaskAppConfig interface (same as SFT) to describe dataset splits, rubrics, and rollout handlers.
→ Create a prompt evaluation task app
2. Add a Modal entrypoint
Expose modal.App(...) and mount your task app code so Synth's CLI can package it.
→ Modal wrapper requirements
3. Author the prompt optimization config
Capture the algorithm choice (GEPA or MIPRO), the initial prompt template, training/validation seeds, and optimization parameters in TOML (a hedged sketch follows this list).
→ Read: Prompt optimization configs
4. Query and evaluate results
Use the Python API or REST endpoints to retrieve optimized prompts and evaluate them on held-out validation sets.
→ Read: Querying results
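For orientation, here is a minimal sketch of what such a TOML config could look like. The section and key names below are illustrative assumptions, not the authoritative schema; consult the prompt optimization config reference for the real keys.

```toml
# Hypothetical sketch only -- section and key names are assumptions, not Synth's actual schema.
[prompt_learning]
algorithm = "gepa"                                      # or "mipro"
task_app_url = "https://your-org--task-app.modal.run"   # placeholder Modal URL

[prompt_learning.initial_prompt]
template = "Classify the banking intent of the user message: {query}"

[prompt_learning.policy]
model = "gpt-4o-mini"                                   # policy model that executes the task

[prompt_learning.evaluation]
train_seeds = [0, 1, 2, 3, 4, 5, 6, 7]                  # seeds rolled out during optimization
validation_seeds = [100, 101, 102, 103]                 # held-out seeds for final scoring

[prompt_learning.gepa]
population_size = 8                                     # prompt variants per generation
generations = 10                                        # evolutionary iterations
mutation_model = "openai/gpt-oss-120b"                  # model that proposes mutations
```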
Algorithm Overview
GEPA (Genetic Evolution of Prompt Architectures)
Best for: Broad exploration, diverse prompt variants, classification tasks
Reference: Agrawal et al. (2025)
GEPA uses evolutionary principles to explore the prompt space:
- Population-based search with multiple prompt variants
- LLM-guided mutations for intelligent prompt modifications
- Pareto optimization balancing performance and prompt length
- Multi-stage support for pipeline optimization
- Maintains a Pareto front of non-dominated solutions
- Supports both template mode and pattern-based transformations
- Module-aware evolution for multi-stage pipelines
- Reflective feedback from execution traces
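The loop below is a deliberately simplified Python sketch of this idea: keep a population of prompt variants, retain a Pareto front over score and prompt length, and ask an LLM to mutate the survivors. It is illustrative only; the helper names (`evaluate`, `llm_mutate`) and the selection details are assumptions, not Synth's backend implementation.

```python
import random

def pareto_front(candidates):
    """Keep variants not strictly dominated on (higher score, shorter prompt)."""
    front = []
    for c in candidates:
        dominated = any(
            o["score"] >= c["score"] and o["length"] <= c["length"]
            and (o["score"] > c["score"] or o["length"] < c["length"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

def gepa_step(population, evaluate, llm_mutate):
    """One generation: score every variant, keep the Pareto front, and fill the
    rest of the next population with LLM-proposed mutations of survivors."""
    scored = [{"prompt": p, "score": evaluate(p), "length": len(p)} for p in population]
    parents = [c["prompt"] for c in pareto_front(scored)]
    n_children = max(len(population) - len(parents), 1)
    children = [llm_mutate(random.choice(parents)) for _ in range(n_children)]
    return parents + children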
MIPRO (Meta-Instruction PROposer)
Best for: Efficient optimization, task-specific improvements, faster convergence
Reference: Opsahl-Ong et al. (2024)
MIPRO uses meta-learning to propose better instructions:
- Meta-LLM (e.g., GPT-4o-mini) generates instruction variants
- TPE (Tree-structured Parzen Estimator) guides Bayesian search
- Bootstrap phase collects few-shot examples from high-scoring seeds
- Reference corpus (up to 50k tokens) enriches meta-prompts
- System spec integration for constraint-aware optimization
- Bootstrap phase initializes with task-specific examples
- Program-aware instruction proposals
- Multi-stage pipeline support with LCS-based stage detection
- Token budget tracking and cost optimization
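As a rough mental model (not Synth's implementation), the search looks like the sketch below: bootstrap demonstration sets from high-scoring seeds, have a meta-model propose candidate instructions, and let a Bayesian optimizer choose which instruction/demo combinations to evaluate on mini-batches. The helper names (`propose_instructions`, `evaluate_minibatch`) and the use of Optuna's TPE sampler are illustrative assumptions.

```python
import optuna  # assumed dependency: Optuna provides a TPE (Tree-structured Parzen Estimator) sampler

def mipro_optimize(bootstrap_demos, propose_instructions, evaluate_minibatch, n_trials=20):
    """Sketch of a MIPRO-style search: a meta-model proposes instructions,
    then TPE picks which (instruction, demo set) pairs to try next."""
    instructions = propose_instructions()   # candidate instructions from the meta-LLM
    demo_sets = bootstrap_demos             # few-shot sets collected from high-scoring seeds

    def objective(trial):
        inst = trial.suggest_categorical("instruction", list(range(len(instructions))))
        demos = trial.suggest_categorical("demo_set", list(range(len(demo_sets))))
        # Score the assembled prompt on a small batch of seeds to conserve budget.
        return evaluate_minibatch(instructions[inst], demo_sets[demos])

    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler())
    study.optimize(objective, n_trials=n_trials)
    best = study.best_params
    return instructions[best["instruction"]], demo_sets[best["demo_set"]]
```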
Architecture: Inference Interception
🚨 Critical: Both algorithms use an interceptor pattern that ensures optimized prompts never reach task apps. All prompt modifications happen in the backend via an inference interceptor that substitutes prompts before they reach the LLM.
- Task apps remain unchanged during optimization
- Prompt optimization logic stays in the backend
- Secure, correct prompt substitution
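To make the pattern concrete, here is a hedged Python sketch of what such an interceptor can look like: the task app issues its usual chat request, and the backend swaps in the candidate system prompt before forwarding the call to the provider. Function and field names are illustrative assumptions, not the actual Synth backend API.

```python
from copy import deepcopy

def intercept_inference(request: dict, optimized_system_prompt: str) -> dict:
    """Backend-side substitution: replace the system message in a chat request
    with the current candidate prompt before forwarding it to the LLM provider.
    The task app that produced `request` never sees the substitution."""
    patched = deepcopy(request)
    messages = patched.get("messages", [])
    for msg in messages:
        if msg.get("role") == "system":
            msg["content"] = optimized_system_prompt
            break
    else:
        # No system message present: prepend the candidate prompt instead.
        messages.insert(0, {"role": "system", "content": optimized_system_prompt})
    patched["messages"] = messages
    return patched
```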
Supported Models
Policy Models (Task Execution)
Both GEPA and MIPRO support policy models from:
- OpenAI: gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-5, gpt-5-mini, gpt-5-nano
- Groq: gpt-oss-20b, gpt-oss-120b, llama-3.3-70b-versatile, qwen-32b, qwen3-32b
- Google: gemini-2.5-pro, gemini-2.5-pro-gt200k, gemini-2.5-flash, gemini-2.5-flash-lite
Mutation Models (GEPA Only)
Used to generate prompt mutations:
- Common choices: openai/gpt-oss-120b, llama-3.3-70b-versatile
- Nano models are rejected (too small for generation tasks)
Meta Models (MIPRO Only)
Used to generate instruction proposals:
- Common choices: gpt-4o-mini, gpt-4.1-mini (most common default)
- Nano models are rejected (too small for generation tasks)
- gpt-5-pro is explicitly rejected for all model types (too expensive: $120 per 1M tokens)
See Supported Models for complete details.
When to Use Each Algorithm
| Aspect | GEPA | MIPRO |
|---|---|---|
| Search Method | Genetic evolution | Meta-LLM + TPE |
| Exploration | Broad, diverse variants | Focused, efficient |
| Computational Cost | Lower (fewer LLM calls) | Higher (meta-model calls) |
| Convergence | 10-15 generations | 10-20 iterations |
| Best For | Classification, multi-hop QA | Task-specific optimization |
| Evaluation Budget | ~1000 rollouts | ~96 rollouts |
Choose GEPA when:
- You want diverse prompt variants (Pareto front)
- You have a large evaluation budget (1000+ rollouts)
- You need broad exploration of the prompt space
Choose MIPRO when:
- You want faster convergence with fewer evaluations
- You have clear task structure (can bootstrap with examples)
- You need efficient optimization (mini-batch evaluation)
Multi-Stage Pipeline Support
Both algorithms support optimizing prompts for multi-stage pipelines (e.g., Banking77 classifier → calibrator):
- LCS-based stage detection automatically identifies which stage is being called (a rough sketch follows this list)
- Per-stage optimization evolves separate instructions for each pipeline module
- Unified evaluation tracks end-to-end performance across all stages
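As an illustration of LCS-based stage detection (not the backend's actual code), the sketch below matches an intercepted prompt against each stage's known template by longest-common-subsequence length over tokens and returns the best-matching stage. The stage names and the normalization choice are assumptions for this example.

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence over token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if tok_a == tok_b else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def detect_stage(intercepted_prompt: str, stage_templates: dict) -> str:
    """Return the pipeline stage whose baseline template best overlaps the observed prompt.
    `stage_templates` maps stage names (e.g. "classifier", "calibrator") to template text."""
    prompt_tokens = intercepted_prompt.split()

    def overlap(item):
        _, template = item
        template_tokens = template.split()
        # Normalize by template length so longer templates don't win by default.
        return lcs_length(prompt_tokens, template_tokens) / max(len(template_tokens), 1)

    return max(stage_templates.items(), key=overlap)[0]
```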
Next Steps
- Algorithm Comparison – Detailed comparison of GEPA vs MIPRO
- System Specifications – How specs guide optimization
- Configuration Reference – Complete parameter documentation
- Training Guide – Step-by-step training instructions
- Banking77 Example – Complete walkthrough