Why GEPA?
GEPA outperforms GRPO by 10% on average (up to 20%) while using up to 35x fewer rollouts. It also outperforms MIPROv2 by over 10% across multiple LLMs. It is best for:
- Classification tasks (Banking77, intent classification)
- Multi-hop QA (HotpotQA)
- Instruction-following tasks
- When you want diverse prompt variants (Pareto front)
Prerequisites
Before starting, ensure you have the prerequisites in place.
How GEPA Works
GEPA uses evolutionary principles to explore the prompt space. Understanding the algorithm helps you configure it effectively.
The Optimization Flow
1. Initialize
   - Split seeds into pareto_seeds and feedback_seeds
   - Evaluate the baseline transformation
   - Generate the initial population via the proposer
   - Evaluate and add to the Pareto archive
2. Evolve (for each generation, for each child)
   - Select a parent (instance-wise Pareto sampling)
   - Generate feedback from the parent's trace
   - Mutate via the proposer (LLM-guided)
   - Minibatch gating (quick eval)
   - Full Pareto evaluation (if gating passed)
   - Update the archive if non-dominated
3. Terminate when any of the following holds:
   - Budget exhausted
   - Generation limit reached
   - No improvement for N generations
4. Return the best transformation by accuracy
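The flow above can be condensed into a runnable sketch. Everything here is a stand-in: `evaluate` returns deterministic pseudo-random scores instead of real rollouts, the "mutation" is a string edit instead of an LLM proposal, and parent selection is simplified to best-aggregate rather than instance-wise Pareto sampling:

```python
import random

# Stand-in scorer: real GEPA obtains per-seed scores by running rollouts
# through your task app; here we return stable pseudo-random values.
def evaluate(prompt, seeds):
    rng = random.Random(len(prompt))  # deterministic per prompt
    return [rng.random() for _ in seeds]

def gepa_loop(baseline, seeds, generations=3, children=2, rollout_budget=100):
    # 1. Initialize: split seeds, score the baseline, seed the archive.
    half = len(seeds) // 2
    pareto_seeds, feedback_seeds = seeds[:half], seeds[half:]  # feedback_seeds would feed the proposer
    archive = {baseline: evaluate(baseline, pareto_seeds)}  # prompt -> per-seed scores
    rollouts = len(pareto_seeds)
    # 2. Evolve: mutate, evaluate, keep non-dominated children.
    for gen in range(generations):
        for c in range(children):
            if rollouts >= rollout_budget:  # 3. Terminate: budget exhausted
                break
            parent = max(archive, key=lambda p: sum(archive[p]))  # simplified selection
            child = f"{parent} [mutation {gen}.{c}]"  # stand-in for the LLM proposer
            child_scores = evaluate(child, pareto_seeds)
            rollouts += len(pareto_seeds)
            # Archive the child unless an existing prompt beats it on every seed.
            dominated = any(
                all(a >= b for a, b in zip(scores, child_scores)) and scores != child_scores
                for scores in archive.values()
            )
            if not dominated:
                archive[child] = child_scores
    # 4. Return the best transformation by aggregate accuracy.
    return max(archive, key=lambda p: sum(archive[p]))
```

The real algorithm also applies minibatch gating before the full evaluation; that step is covered separately below and is omitted here for brevity.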
Key Components
1. Pattern-Based Transformations
GEPA represents prompt changes as transformations that can be applied to your baseline.
2. Pareto Archive
GEPA maintains a Pareto front of non-dominated solutions, balancing multiple objectives:
- Accuracy (primary) – Task performance
- Tool call rate – Function calling frequency (for agentic tasks)
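The archive-update rule can be sketched as follows; the tuple layout (accuracy first, tool call rate second) and the eviction policy are assumptions for illustration:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_archive(archive, candidate, max_size=64):
    """archive: list of (prompt, objectives) pairs, e.g. objectives = (accuracy, tool_call_rate)."""
    if any(dominates(obj, candidate[1]) for _, obj in archive):
        return archive  # candidate is dominated: archive unchanged
    # Drop archived entries the candidate now dominates.
    kept = [(p, obj) for p, obj in archive if not dominates(candidate[1], obj)]
    kept.append(candidate)
    # Respect the archive size limit by evicting the lowest-accuracy entries.
    if len(kept) > max_size:
        kept.sort(key=lambda entry: entry[1][0], reverse=True)
        kept = kept[:max_size]
    return kept
```

Only mutually non-dominated prompts survive, which is what yields the diverse Pareto front returned at the end of a run.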
3. Instance-Wise Parent Selection
Unlike traditional selection that uses aggregate scores, GEPA counts how many individual seeds each prompt “wins” on.
4. LLM-Guided Mutations
The proposer (meta-model) generates new prompts by analyzing:
- Current instruction (baseline)
- Rollout examples (input/output/feedback for each seed)
- Trace feedback (e.g., “model under-utilizes tools”)
- Dataset and program context
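The instance-wise parent selection (component 3 above) can be sketched as follows; the tie handling and the weighting scheme are assumptions:

```python
import random

def select_parent(archive, rng=None):
    """archive maps prompt -> per-seed scores. A prompt 'wins' a seed when it
    has the (possibly tied) best score on that seed; sampling weight = win count."""
    rng = rng or random.Random(0)
    prompts = list(archive)
    n_seeds = len(next(iter(archive.values())))
    wins = {p: 0 for p in prompts}
    for i in range(n_seeds):
        best = max(archive[p][i] for p in prompts)
        for p in prompts:
            if archive[p][i] == best:
                wins[p] += 1
    # Sample a parent in proportion to per-seed wins, not aggregate score.
    return rng.choices(prompts, weights=[wins[p] for p in prompts], k=1)[0]
```

A prompt that is mediocre on average but wins a few seeds outright still gets selected occasionally, which preserves diversity that aggregate-score selection would discard.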
Step 1: Create a Task App
Your task app evaluates prompts by running rollouts and returning scores. See the Task App Guide for details and an example Banking77 task app structure.
Step 2: Deploy Your Task App
The Synth backend needs to reach your task app over the internet to send rollout requests. Deploy using a managed Cloudflare tunnel, which:
- Starts your task app locally on port 8000
- Creates a Cloudflare tunnel with a stable URL
- Writes `TASK_APP_URL` to your `.env` file
Verify the Deployment
Check that your task app is accessible before launching a job.
Keep the Tunnel Running
By default, the tunnel runs in the background. It can also be kept in the foreground, which is useful for debugging.
Step 3: Create the Configuration
Create a TOML file defining your optimization parameters. The `task_app_url` should match the URL from Step 2 (stored in your `.env` as `TASK_APP_URL`):
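An illustrative sketch using the parameter names documented in the table below; the `[gepa]` section name and the URL are assumptions, so consult the Synth reference for the exact schema:

```toml
# Hypothetical layout; the real schema may nest these differently.
[gepa]
task_app_url = "https://your-tunnel.example.com"  # from TASK_APP_URL in .env
num_generations = 10
children_per_generation = 5
pareto_set_size = 20
minibatch_size = 3
rollout_budget = 1000
archive_size = 64
feedback_fraction = 0.3
proposer_mode = "synth"
```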
Configuration Parameters
| Parameter | Description | Default | Recommended Range |
|---|---|---|---|
| `num_generations` | Evolutionary cycles | 10 | 5-20 |
| `children_per_generation` | Mutations per generation | 5 | 3-10 |
| `pareto_set_size` | Seeds for Pareto evaluation | 20 | 15-30 |
| `minibatch_size` | Seeds for gating evaluation | 3 | 2-5 |
| `rollout_budget` | Total rollouts allowed | 1000 | 200-2000 |
| `archive_size` | Max Pareto archive size | 64 | 32-128 |
| `feedback_fraction` | Fraction of seeds for feedback | 0.3 | 0.2-0.5 |
| `proposer_mode` | Proposer type (`synth`, `gepa-ai`, `dspy`) | `synth` | - |
Step 4: Launch the Optimization Job
The launch command will:
- Validate your TOML configuration
- Verify the task app is reachable
- Submit the job to Synth
- Poll for completion (with `--poll`)
Understanding the Output
During optimization, you’ll see progress updates.
Step 5: Understanding the Optimization Process
Generation-by-Generation Progress
| Generation | What Happens | Expected Accuracy |
|---|---|---|
| 0 (baseline) | Evaluate initial prompt | 60-75% |
| 1-3 | Explore diverse mutations | 70-80% |
| 5-10 | Convergence begins | 80-85% |
| 10-15 | Fine-tuning best solutions | 85-90%+ |
How Mutations Are Generated
The proposer receives:
- Baseline instruction: Your current system prompt
- Rollout examples: Input/output pairs with feedback (correct/incorrect, error messages)
- Trace statistics: Tool call rate, trajectory length, etc.
- Feedback hints: Rule-based suggestions like “model under-utilizes tools”
Minibatch Gating
Before full evaluation, GEPA performs a quick check:
- Evaluate the child on a small minibatch (3 seeds)
- Compare to parent’s score on the same seeds
- If child is worse → skip full evaluation (saves budget)
- If child is promising → proceed to full Pareto evaluation
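The gating check can be sketched as follows; the signature and the optional tolerance margin are assumptions:

```python
def passes_gate(child_prompt, parent_scores, evaluate, minibatch_seeds, margin=0.0):
    """Quick check on a few seeds before spending a full Pareto evaluation.
    parent_scores: {seed: score} already measured for the parent."""
    child_scores = [evaluate(child_prompt, seed) for seed in minibatch_seeds]
    parent_on_batch = [parent_scores[seed] for seed in minibatch_seeds]
    # Proceed to full evaluation only if the child at least matches the parent
    # on the same seeds (within the margin); otherwise save the budget.
    return sum(child_scores) >= sum(parent_on_batch) - margin
```

Because the comparison uses the same seeds for parent and child, noisy per-seed difficulty cancels out, which is what makes such a small minibatch a usable filter.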
Step 6: Retrieve Optimized Prompts
After completion, fetch your results using the Python SDK.
Understanding the Pareto Front
GEPA returns multiple prompts representing different trade-offs:
| Rank | Accuracy | Token Count | Trade-off |
|---|---|---|---|
| 1 | 92% | 450 | Highest accuracy |
| 2 | 90% | 280 | Good accuracy, shorter |
| 3 | 88% | 150 | Efficient, still performant |
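A small helper for choosing from the front under a token budget. The field names (`prompt`, `accuracy`, `token_count`) and the sample numbers are illustrative, taken from the table above; the actual SDK response shape may differ:

```python
def pick_prompt(front, max_tokens=None):
    """front: list of dicts with 'prompt', 'accuracy', 'token_count' fields.
    Returns the highest-accuracy prompt that fits the token budget."""
    candidates = [e for e in front if max_tokens is None or e["token_count"] <= max_tokens]
    if not candidates:
        raise ValueError("no prompt on the front fits the token budget")
    return max(candidates, key=lambda e: e["accuracy"])["prompt"]
```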
Step 7: Use the Optimized Prompt
Replace your baseline prompt with the optimized version.
In-Process Optimization
For development and testing, run everything from a single Python script.
Termination Conditions
GEPA stops when any condition is met:
| Condition | Description | Configuration |
|---|---|---|
| `rollout_budget` | Total rollouts exhausted | `rollout_budget = 1000` |
| `max_spend_usd` | USD budget limit | `max_spend_usd = 5.0` |
| `num_generations` | Generation limit reached | `num_generations = 15` |
| `patience_generations` | No improvement for N generations | `patience_generations = 5` |
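The conditions combine with OR; a sketch using plain dicts, where the limit keys mirror the configuration table and the state field names are assumptions:

```python
def should_terminate(state, limits):
    """Return True as soon as any stopping condition from the table is met."""
    return (
        state["rollouts"] >= limits["rollout_budget"]
        or state["spend_usd"] >= limits["max_spend_usd"]
        or state["generation"] >= limits["num_generations"]
        or state["generations_without_improvement"] >= limits["patience_generations"]
    )
```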