GEPA (Genetic-Pareto) is a reflective evolutionary optimizer that automatically improves your prompts through LLM-guided mutations and multi-objective selection. This walkthrough covers everything from setup to retrieving your optimized prompts. Reference: Agrawal et al. (2025), "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning," arXiv:2507.19457.

Why GEPA?

GEPA outperforms GRPO by 10% on average (up to 20%) while using up to 35x fewer rollouts. It also outperforms MIPROv2 by over 10% across multiple LLMs. Best for:
  • Classification tasks (Banking77, intent classification)
  • Multi-hop QA (HotpotQA)
  • Instruction-following tasks
  • When you want diverse prompt variants (Pareto front)
Typical results: 60-75% baseline accuracy → 85-90%+ after 15 generations

Prerequisites

Before starting, ensure you have:
# Required environment variables in .env
GROQ_API_KEY=gsk_...          # For policy model inference
SYNTH_API_KEY=sk_...          # For backend authentication
ENVIRONMENT_API_KEY=sk_env_... # Optional - auto-generated if not set
Install the Synth AI CLI:
pip install synth-ai
# or
uvx synth-ai --help

How GEPA Works

GEPA uses evolutionary principles to explore the prompt space. Understanding the algorithm helps you configure it effectively.

The Optimization Flow

  1. Initialize
    • Split seeds into pareto_seeds and feedback_seeds
    • Evaluate baseline transformation
    • Generate initial population via proposer
    • Evaluate & add to Pareto archive
  2. Evolve (for each generation)
    • For each child:
      • Select parent (instance-wise Pareto sampling)
      • Generate feedback from parent trace
      • Mutate via proposer (LLM-guided)
      • Minibatch gating (quick eval)
      • Full Pareto evaluation (if gating passed)
      • Update archive if non-dominated
  3. Terminate
    • Budget exhausted OR
    • Generation limit OR
    • No improvement for N generations
  4. Return the best transformation by accuracy (see the sketch below)
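
The loop above can be summarized in pseudocode. Everything below is an illustrative sketch: the class and function names (ParetoArchive, evaluate, propose_mutation, gate_on_minibatch, and so on) are hypothetical placeholders, not the Synth backend API.
# Illustrative sketch of the GEPA loop described above; all helpers are
# hypothetical placeholders, not the actual Synth backend implementation.
def gepa_optimize(baseline, pareto_seeds, feedback_seeds, config):
    archive = ParetoArchive(max_size=config.archive_size)

    # 1. Initialize: evaluate the baseline and an initial population
    archive.add(baseline, evaluate(baseline, pareto_seeds))
    for candidate in propose_initial_population(baseline, config):
        archive.add(candidate, evaluate(candidate, pareto_seeds))

    rollouts_used = 0
    for generation in range(config.num_generations):
        for _ in range(config.children_per_generation):
            # 2. Evolve: pick a parent by per-seed wins, mutate via the proposer
            parent = archive.sample_parent()            # instance-wise Pareto sampling
            feedback = summarize_trace(parent, feedback_seeds)
            child = propose_mutation(parent, feedback)  # LLM-guided mutation

            # Minibatch gating: skip the full evaluation if the child looks worse
            if not gate_on_minibatch(child, parent, config.minibatch_size):
                continue

            scores = evaluate(child, pareto_seeds)      # full Pareto evaluation
            archive.add_if_non_dominated(child, scores)
            rollouts_used += len(pareto_seeds)

        # 3. Terminate on budget, generation limit, or stagnation
        if rollouts_used >= config.rollout_budget or archive.stagnant():
            break

    # 4. Return the best transformation by accuracy
    return archive.best_by_accuracy()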

Key Components

1. Pattern-Based Transformations

GEPA represents prompt changes as transformations that can be applied to your baseline:
# A transformation replaces text in your prompt
TextTransformation(
    old_text="You are a helpful assistant.",      # Original text
    new_text="You are a banking classification expert...",  # Optimized text
    apply_to_role="system"  # Only apply to system messages
)
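To make the semantics concrete, here is a minimal, runnable sketch of what applying such a transformation to a message list amounts to. The apply_transformation helper below is illustrative, not part of the SDK:
# Illustrative only: shows the effect of a text transformation on a prompt.
# `apply_transformation` is a hypothetical helper, not a synth_ai API.
def apply_transformation(messages, old_text, new_text, apply_to_role):
    return [
        {**m, "content": m["content"].replace(old_text, new_text)}
        if m["role"] == apply_to_role else m
        for m in messages
    ]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Customer Query: {query}"},
]
optimized = apply_transformation(
    messages,
    old_text="You are a helpful assistant.",
    new_text="You are a banking classification expert...",
    apply_to_role="system",
)
# optimized[0]["content"] now starts with "You are a banking classification expert..."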

2. Pareto Archive

GEPA maintains a Pareto front of non-dominated solutions, balancing multiple objectives:
  • Accuracy (primary) – Task performance
  • Tool call rate – Function calling frequency (for agentic tasks)
Solutions are kept if they’re not dominated by any other solution across all objectives.
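As a rough illustration of the dominance rule (assuming higher is better for every objective), a non-domination check looks like this:
# Minimal sketch of the non-domination check, assuming higher is better
# for every objective (e.g. accuracy, tool call rate).
def dominates(a: dict, b: dict) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    at_least_as_good = all(a[k] >= b[k] for k in b)
    strictly_better = any(a[k] > b[k] for k in b)
    return at_least_as_good and strictly_better

def pareto_front(candidates: list[dict]) -> list[dict]:
    """Keep only candidates not dominated by any other candidate."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]

front = pareto_front([
    {"accuracy": 0.92, "tool_call_rate": 0.10},
    {"accuracy": 0.90, "tool_call_rate": 0.40},
    {"accuracy": 0.85, "tool_call_rate": 0.05},  # dominated by the first entry
])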

3. Instance-Wise Parent Selection

Unlike traditional selection that uses aggregate scores, GEPA counts how many individual seeds each prompt “wins” on:
# Parent selection weights prompts by per-seed wins
wins = count_seeds_where_prompt_is_best(prompt)
selection_weight = (wins + ε) ** selection_pressure
This favors prompts that excel on specific example types, encouraging specialization.
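A small, self-contained sketch of this weighting follows; the per-seed score table and the selection_pressure and epsilon values are made up for illustration:
import random

# Per-seed accuracy for each candidate prompt (illustrative numbers).
scores = {
    "prompt_a": {0: 1.0, 1: 0.0, 2: 1.0},
    "prompt_b": {0: 1.0, 1: 1.0, 2: 0.0},
    "prompt_c": {0: 0.0, 1: 1.0, 2: 0.0},
}

def per_seed_wins(scores):
    """Count, for each prompt, the seeds on which it achieves the best score."""
    wins = {name: 0 for name in scores}
    seeds = next(iter(scores.values())).keys()
    for seed in seeds:
        best = max(s[seed] for s in scores.values())
        for name, s in scores.items():
            if s[seed] == best:
                wins[name] += 1
    return wins

def sample_parent(scores, selection_pressure=2.0, epsilon=0.5):
    wins = per_seed_wins(scores)
    weights = [(wins[name] + epsilon) ** selection_pressure for name in scores]
    return random.choices(list(scores), weights=weights, k=1)[0]

parent = sample_parent(scores)  # prompts with more per-seed wins are sampled more often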

4. LLM-Guided Mutations

The proposer (meta-model) generates new prompts by analyzing:
  • Current instruction (baseline)
  • Rollout examples (input/output/feedback for each seed)
  • Trace feedback (e.g., “model under-utilizes tools”)
  • Dataset and program context
The proposer uses instruction typology to structure outputs with: input descriptions, core task, premises, heuristics, constraints, rules, and output descriptions.
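As a rough illustration of the information the proposer sees, a meta-prompt might be assembled along these lines. The exact template used by the backend is not shown here; the function and field names below are purely illustrative:
# Illustrative only: a rough shape of the proposer's input, not the actual template.
def build_proposer_prompt(baseline, rollout_examples, trace_feedback):
    examples = "\n".join(
        f"- input: {e['input']}\n  output: {e['output']}\n  feedback: {e['feedback']}"
        for e in rollout_examples
    )
    return (
        "Current instruction:\n"
        f"{baseline}\n\n"
        "Rollout examples:\n"
        f"{examples}\n\n"
        "Trace feedback:\n"
        f"{trace_feedback}\n\n"
        "Write an improved instruction with: input description, core task, "
        "premises, heuristics, constraints, rules, and output description."
    )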

Step 1: Create a Task App

Your task app evaluates prompts by running rollouts and returning scores. See Task App Guide for details. Example Banking77 task app structure:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutRequest(BaseModel):
    seed: int
    run_id: str
    # ... other fields

@app.post("/rollout")
async def rollout(request: RolloutRequest):
    # 1. Load example for this seed
    example = load_example(request.seed)

    # 2. Call your LLM with the prompt (interceptor handles substitution)
    prediction = await call_llm(example.query)

    # 3. Score the prediction
    correct = prediction == example.expected_label

    return {
        "metrics": {"correct": correct},
        "outcome": 1.0 if correct else 0.0
    }
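The snippet above leaves load_example and call_llm undefined. One way to sketch them is shown below; the dataset contents, model name, and endpoint details are placeholders you would adapt to your own setup:
import os
from dataclasses import dataclass

import httpx

@dataclass
class Example:
    query: str
    expected_label: str

# Illustrative dataset access: replace with however you load Banking77 examples.
DATASET: list[Example] = [
    Example(query="My card still hasn't arrived.", expected_label="card_arrival"),
    # ...
]

def load_example(seed: int) -> Example:
    return DATASET[seed % len(DATASET)]

async def call_llm(query: str) -> str:
    """Call the policy model; the interceptor substitutes the prompt under test."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://api.groq.com/openai/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
            json={
                "model": "llama-3.1-8b-instant",  # placeholder model name
                "messages": [{"role": "user", "content": f"Customer Query: {query}"}],
            },
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"].strip()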

Step 2: Deploy Your Task App

The Synth backend needs to reach your task app over the internet to send rollout requests. Deploy using a managed Cloudflare tunnel:
synth-ai deploy tunnel path/to/task_app.py \
  --env .env \
  --tunnel-mode managed \
  --tunnel-subdomain my-company
This command:
  1. Starts your task app locally on port 8000
  2. Creates a Cloudflare tunnel with a stable URL
  3. Writes TASK_APP_URL to your .env file
You’ll see output like:
Wrote TASK_APP_URL=https://my-company.usesynth.ai to /path/to/.env

Verify the Deployment

Check that your task app is accessible:
curl https://my-company.usesynth.ai/health
The tunnel URL persists across restarts. On subsequent runs, the CLI will reuse your existing tunnel:
ℹ️  Reusing existing managed tunnel for my-org: https://my-company.usesynth.ai

Keep the Tunnel Running

By default, the tunnel runs in the background. To keep it in the foreground (useful for debugging):
synth-ai deploy tunnel path/to/task_app.py \
  --env .env \
  --tunnel-mode managed \
  --wait

Step 3: Create the Configuration

Create a TOML file defining your optimization parameters. The task_app_url should match the URL from Step 2 (stored in your .env as TASK_APP_URL):
[prompt_learning]
algorithm = "gepa"
task_app_url = "https://my-company.usesynth.ai"  # From TASK_APP_URL in .env
task_app_id = "banking77"

# Training seeds (used during optimization)
evaluation_seeds = [50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]

# Validation seeds (held-out for final evaluation)
validation_seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]

# Initial prompt template
[prompt_learning.initial_prompt]
messages = [
  { role = "system", content = "You are a banking intent classification assistant." },
  { role = "user", pattern = "Customer Query: {query}\n\nClassify this query into one of 77 banking intents." }
]

# GEPA-specific configuration
[prompt_learning.gepa]
num_generations = 15              # Evolutionary cycles to run
children_per_generation = 5       # Mutations per generation
pareto_set_size = 20              # Seeds for Pareto evaluation
minibatch_size = 3                # Seeds for quick gating
rollout_budget = 1000             # Total rollouts allowed
archive_size = 64                 # Max Pareto archive size

Configuration Parameters

Parameter | Description | Default | Recommended Range
num_generations | Evolutionary cycles | 10 | 5-20
children_per_generation | Mutations per generation | 5 | 3-10
pareto_set_size | Seeds for Pareto evaluation | 20 | 15-30
minibatch_size | Seeds for gating evaluation | 3 | 2-5
rollout_budget | Total rollouts allowed | 1000 | 200-2000
archive_size | Max Pareto archive size | 64 | 32-128
feedback_fraction | Fraction of seeds for feedback | 0.3 | 0.2-0.5
proposer_mode | Proposer type (synth, gepa-ai, dspy) | synth | -

Step 4: Launch the Optimization Job

uvx synth-ai train \
  --config configs/prompt_learning/banking77_gepa.toml \
  --poll
The CLI will:
  1. Validate your TOML configuration
  2. Verify the task app is reachable
  3. Submit the job to Synth
  4. Poll for completion (with --poll)

Understanding the Output

During optimization, you’ll see progress updates:
[18:35:37]    0.0s  Status: running
[18:35:42]    5.2s  Status: running | Best: 0.500
[18:35:48]   11.4s  Status: running | Best: 0.625
[18:35:54]   17.6s  Status: running | Best: 0.750
[18:36:00]   23.8s  Status: running | Best: 0.875
...
[18:38:50]  175.9s  Status: succeeded | Best: 0.875
Your task app logs will show rollout requests:
[TASK_APP] INBOUND_ROLLOUT: run_id=prompt-learning-74-5bec8a6f seed=74
[TASK_APP] PREDICTION: expected=card_arrival predicted=card_delivery_estimate correct=False
[BANKING77_ROLLOUT] run_id=prompt-learning-74-5bec8a6f reward=0.0

Step 5: Understanding the Optimization Process

Generation-by-Generation Progress

Generation | What Happens | Expected Accuracy
0 (baseline) | Evaluate initial prompt | 60-75%
1-3 | Explore diverse mutations | 70-80%
5-10 | Convergence begins | 80-85%
10-15 | Fine-tuning best solutions | 85-90%+

How Mutations Are Generated

The proposer receives:
  1. Baseline instruction: Your current system prompt
  2. Rollout examples: Input/output pairs with feedback (correct/incorrect, error messages)
  3. Trace statistics: Tool call rate, trajectory length, etc.
  4. Feedback hints: Rule-based suggestions like “model under-utilizes tools”
It generates a new instruction following instruction typology:
[Input Description]
You will be given a customer banking query.

[Core Task Description]
Your task is to classify the query into one of 77 banking intents.

[Premises]
Banking queries often contain domain-specific terminology.
Multiple intents may seem applicable; choose the most specific.

[Heuristics]
Look for keywords indicating the customer's primary need.
Consider the emotional tone to distinguish complaints from inquiries.

[Constraints]
Avoid defaulting to generic intents when specific ones apply.

[Rules]
Output only the intent name, nothing else.

[Output Description]
Return exactly one intent from the predefined list.

Minibatch Gating

Before full evaluation, GEPA performs a quick check:
  1. Evaluate child on a small minibatch (3 seeds)
  2. Compare to parent’s score on the same seeds
  3. If child is worse → skip full evaluation (saves budget)
  4. If child is promising → proceed to full Pareto evaluation
This saves significant compute by filtering out poor mutations early.
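A minimal sketch of this gating step follows; the evaluate_on_seeds helper and the pass/fail rule are assumptions for illustration:
# Illustrative sketch of minibatch gating; `evaluate_on_seeds` is a
# hypothetical helper that runs rollouts and returns mean accuracy.
async def passes_gate(child_prompt, parent_prompt, minibatch_seeds, evaluate_on_seeds):
    child_score = await evaluate_on_seeds(child_prompt, minibatch_seeds)
    parent_score = await evaluate_on_seeds(parent_prompt, minibatch_seeds)
    # Only spend the full Pareto evaluation budget on promising children.
    return child_score >= parent_score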

Step 6: Retrieve Optimized Prompts

After completion, fetch your results using the Python SDK:
import os
from synth_ai.learning import get_prompts, get_prompt_text, get_scoring_summary

BASE_URL = os.environ.get("BACKEND_BASE_URL", "https://agent-learning.onrender.com/api")
API_KEY = os.environ["SYNTH_API_KEY"]
JOB_ID = "pl_abc123"  # From the train command output

# Get all results
results = get_prompts(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY)
print(f"Best Score: {results['best_score']:.3f}")

# Get top 5 prompts from Pareto front
for rank in range(1, 6):
    prompt = get_prompt_text(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY, rank=rank)
    print(f"Rank {rank}: {len(prompt)} chars")
    print(prompt[:200] + "...")

# Get scoring summary
summary = get_scoring_summary(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY)
print(f"Train={summary['best_train_accuracy']:.3f}")
print(f"Validation={summary.get('best_validation_accuracy', 0.0):.3f}")
print(f"Candidates Tried={summary['num_candidates_tried']}")

Understanding the Pareto Front

GEPA returns multiple prompts representing different trade-offs:
Rank | Accuracy | Token Count | Trade-off
1 | 92% | 450 | Highest accuracy
2 | 90% | 280 | Good accuracy, shorter
3 | 88% | 150 | Efficient, still performant
Choose based on your latency/cost requirements.
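For example, to pick the shortest prompt that still clears an accuracy target, you could iterate over the ranked prompts. The accuracy values in the ranked list below are placeholders; populate them from the fields your job results actually expose:
# Illustrative selection from the Pareto front: prefer the shortest prompt
# that still clears an accuracy target. The `ranked` list is a placeholder
# you would populate from your job results.
ranked = [
    {"rank": r, "accuracy": acc, "text": get_prompt_text(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY, rank=r)}
    for r, acc in [(1, 0.92), (2, 0.90), (3, 0.88)]
]

ACCURACY_FLOOR = 0.89
eligible = [p for p in ranked if p["accuracy"] >= ACCURACY_FLOOR]
chosen = min(eligible, key=lambda p: len(p["text"]))  # shortest prompt above the floor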

Step 7: Use the Optimized Prompt

Replace your baseline prompt with the optimized version:
from openai import AsyncOpenAI

client = AsyncOpenAI()

# Before: baseline prompt
system_prompt = "You are a banking intent classification assistant."

# After: optimized prompt (rank 1 from GEPA)
system_prompt = get_prompt_text(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY, rank=1)

# Use in your application
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Customer Query: {query}"}
    ]
)

In-Process Optimization

For development and testing, run everything from a single Python script:
from synth_ai.task import InProcessTaskApp
from synth_ai.learning import LearningClient

# Start task app in-process (handles tunneling automatically)
async with InProcessTaskApp(app=my_task_app) as task_app:
    # Task app is now accessible via tunnel
    task_app_url = task_app.url

    # Submit optimization job
    client = LearningClient(base_url=BACKEND_URL, api_key=API_KEY)
    job = await client.create_job(config=my_config)
    await client.start_job(job["id"])

    # Poll until complete
    result = await client.poll_until_terminal(job["id"])
    print(f"Best score: {result['best_score']}")
See In-Process Task App Walkthrough for a complete example.

Termination Conditions

GEPA stops when any condition is met:
Condition | Description | Configuration
rollout_budget | Total rollouts exhausted | rollout_budget = 1000
max_spend_usd | USD budget limit | max_spend_usd = 5.0
num_generations | Generation limit reached | num_generations = 15
patience_generations | No improvement for N generations | patience_generations = 5

Supported Models

See Supported Models for Prompt Optimization for the full list of policy models.