GEPA (Genetic-Pareto) is a reflective evolutionary optimizer that automatically improves your prompts through LLM-guided mutations and multi-objective selection. This walkthrough covers everything from setup to retrieving your optimized prompts. Reference: Agrawal et al. (2025), "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning," arXiv:2507.19457.

Why GEPA?

GEPA outperforms GRPO by 10% on average (up to 20%) while using up to 35x fewer rollouts. It also outperforms MIPROv2 by over 10% across multiple LLMs. Best for:
  • Classification tasks (Banking77, intent classification)
  • Multi-hop QA (HotpotQA)
  • Instruction-following tasks
  • When you want diverse prompt variants (Pareto front)
Typical results: 60-75% baseline accuracy → 85-90%+ after 15 generations

Prerequisites

Before starting, ensure you have:
# Required environment variables in .env
GROQ_API_KEY=gsk_...          # For policy model inference
SYNTH_API_KEY=sk_...          # For backend authentication
ENVIRONMENT_API_KEY=sk_env_... # Optional - auto-generated if not set
Install the Synth AI CLI:
pip install synth-ai
# or
uvx synth-ai --help

How GEPA Works

GEPA uses evolutionary principles to explore the prompt space. Understanding the algorithm helps you configure it effectively.

The Optimization Flow

  1. Initialize
    • Split seeds into pareto_seeds and feedback_seeds
    • Evaluate baseline transformation
    • Generate initial population via proposer
    • Evaluate & add to Pareto archive
  2. Evolve (for each generation)
    • For each child:
      • Select parent (instance-wise Pareto sampling)
      • Generate feedback from parent trace
      • Mutate via proposer (LLM-guided)
      • Minibatch gating (quick eval)
      • Full Pareto evaluation (if gating passed)
      • Update archive if non-dominated
  3. Terminate
    • Budget exhausted OR
    • Generation limit OR
    • No improvement for N generations
  4. Return the best transformation by accuracy (see the sketch below)
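
The loop above can be summarized in pseudocode. Everything below is an illustrative sketch: the class and function names (ParetoArchive, evaluate, propose_mutation, gate_on_minibatch, and so on) are hypothetical placeholders, not the Synth backend API.
# Illustrative sketch of the GEPA loop described above; all helpers are
# hypothetical placeholders, not the actual Synth backend implementation.
def gepa_optimize(baseline, pareto_seeds, feedback_seeds, config):
    archive = ParetoArchive(max_size=config.archive_size)

    # 1. Initialize: evaluate the baseline and an initial population
    archive.add(baseline, evaluate(baseline, pareto_seeds))
    for candidate in propose_initial_population(baseline, config):
        archive.add(candidate, evaluate(candidate, pareto_seeds))

    rollouts_used = 0
    for generation in range(config.num_generations):
        for _ in range(config.children_per_generation):
            # 2. Evolve: pick a parent by per-seed wins, mutate via the proposer
            parent = archive.sample_parent()            # instance-wise Pareto sampling
            feedback = summarize_trace(parent, feedback_seeds)
            child = propose_mutation(parent, feedback)  # LLM-guided mutation

            # Minibatch gating: skip the full evaluation if the child looks worse
            if not gate_on_minibatch(child, parent, config.minibatch_size):
                continue

            scores = evaluate(child, pareto_seeds)      # full Pareto evaluation
            archive.add_if_non_dominated(child, scores)
            rollouts_used += len(pareto_seeds)

        # 3. Terminate on budget, generation limit, or stagnation
        if rollouts_used >= config.rollout_budget or archive.stagnant():
            break

    # 4. Return the best transformation by accuracy
    return archive.best_by_accuracy()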

Key Components

1. Pattern-Based Transformations

GEPA represents prompt changes as transformations that can be applied to your baseline:
# A transformation replaces text in your prompt
TextTransformation(
    old_text="You are a helpful assistant.",      # Original text
    new_text="You are a banking classification expert...",  # Optimized text
    apply_to_role="system"  # Only apply to system messages
)
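To make the semantics concrete, here is a minimal, runnable sketch of what applying such a transformation to a message list amounts to. The apply_transformation helper below is illustrative, not part of the SDK:
# Illustrative only: shows the effect of a text transformation on a prompt.
# `apply_transformation` is a hypothetical helper, not a synth_ai API.
def apply_transformation(messages, old_text, new_text, apply_to_role):
    return [
        {**m, "content": m["content"].replace(old_text, new_text)}
        if m["role"] == apply_to_role else m
        for m in messages
    ]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Customer Query: {query}"},
]
optimized = apply_transformation(
    messages,
    old_text="You are a helpful assistant.",
    new_text="You are a banking classification expert...",
    apply_to_role="system",
)
# optimized[0]["content"] now starts with "You are a banking classification expert..."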

2. Pareto Archive

GEPA maintains a Pareto front of non-dominated solutions, balancing multiple objectives:
  • Accuracy (primary) – Task performance
  • Tool call rate – Function calling frequency (for agentic tasks)
Solutions are kept if they’re not dominated by any other solution across all objectives.
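As a rough illustration of the dominance rule (assuming higher is better for every objective), a non-domination check looks like this:
# Minimal sketch of the non-domination check, assuming higher is better
# for every objective (e.g. accuracy, tool call rate).
def dominates(a: dict, b: dict) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    at_least_as_good = all(a[k] >= b[k] for k in b)
    strictly_better = any(a[k] > b[k] for k in b)
    return at_least_as_good and strictly_better

def pareto_front(candidates: list[dict]) -> list[dict]:
    """Keep only candidates not dominated by any other candidate."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]

front = pareto_front([
    {"accuracy": 0.92, "tool_call_rate": 0.10},
    {"accuracy": 0.90, "tool_call_rate": 0.40},
    {"accuracy": 0.85, "tool_call_rate": 0.05},  # dominated by the first entry
])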

3. Instance-Wise Parent Selection

Unlike traditional selection that uses aggregate scores, GEPA counts how many individual seeds each prompt “wins” on:
# Parent selection weights prompts by per-seed wins
wins = count_seeds_where_prompt_is_best(prompt)
selection_weight = (wins + ε) ** selection_pressure
This favors prompts that excel on specific example types, encouraging specialization.
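A small, self-contained sketch of this weighting follows; the per-seed score table and the selection_pressure and epsilon values are made up for illustration:
import random

# Per-seed accuracy for each candidate prompt (illustrative numbers).
scores = {
    "prompt_a": {0: 1.0, 1: 0.0, 2: 1.0},
    "prompt_b": {0: 1.0, 1: 1.0, 2: 0.0},
    "prompt_c": {0: 0.0, 1: 1.0, 2: 0.0},
}

def per_seed_wins(scores):
    """Count, for each prompt, the seeds on which it achieves the best score."""
    wins = {name: 0 for name in scores}
    seeds = next(iter(scores.values())).keys()
    for seed in seeds:
        best = max(s[seed] for s in scores.values())
        for name, s in scores.items():
            if s[seed] == best:
                wins[name] += 1
    return wins

def sample_parent(scores, selection_pressure=2.0, epsilon=0.5):
    wins = per_seed_wins(scores)
    weights = [(wins[name] + epsilon) ** selection_pressure for name in scores]
    return random.choices(list(scores), weights=weights, k=1)[0]

parent = sample_parent(scores)  # prompts with more per-seed wins are sampled more often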

4. LLM-Guided Mutations

The proposer (meta-model) generates new prompts by analyzing:
  • Current instruction (baseline)
  • Rollout examples (input/output/feedback for each seed)
  • Trace feedback (e.g., “model under-utilizes tools”)
  • Dataset and program context
The proposer uses instruction typology to structure outputs with: input descriptions, core task, premises, heuristics, constraints, rules, and output descriptions.
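As a rough illustration of the information the proposer sees, a meta-prompt might be assembled along these lines. The exact template used by the backend is not shown here; the function and field names below are purely illustrative:
# Illustrative only: a rough shape of the proposer's input, not the actual template.
def build_proposer_prompt(baseline, rollout_examples, trace_feedback):
    examples = "\n".join(
        f"- input: {e['input']}\n  output: {e['output']}\n  feedback: {e['feedback']}"
        for e in rollout_examples
    )
    return (
        "Current instruction:\n"
        f"{baseline}\n\n"
        "Rollout examples:\n"
        f"{examples}\n\n"
        "Trace feedback:\n"
        f"{trace_feedback}\n\n"
        "Write an improved instruction with: input description, core task, "
        "premises, heuristics, constraints, rules, and output description."
    )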

Step 1: Create a Task App

Your task app evaluates prompts by running rollouts and returning scores. See Task App Guide for details. Example Banking77 task app structure:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutRequest(BaseModel):
    seed: int
    run_id: str
    # ... other fields

@app.post("/rollout")
async def rollout(request: RolloutRequest):
    # 1. Load example for this seed
    example = load_example(request.seed)

    # 2. Call your LLM with the prompt (interceptor handles substitution)
    prediction = await call_llm(example.query)

    # 3. Score the prediction
    correct = prediction == example.expected_label

    return {
        "metrics": {"correct": correct},
        "outcome": 1.0 if correct else 0.0
    }
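The snippet above leaves load_example and call_llm undefined. One way to sketch them is shown below; the dataset contents, model name, and endpoint details are placeholders you would adapt to your own setup:
import os
from dataclasses import dataclass

import httpx

@dataclass
class Example:
    query: str
    expected_label: str

# Illustrative dataset access: replace with however you load Banking77 examples.
DATASET: list[Example] = [
    Example(query="My card still hasn't arrived.", expected_label="card_arrival"),
    # ...
]

def load_example(seed: int) -> Example:
    return DATASET[seed % len(DATASET)]

async def call_llm(query: str) -> str:
    """Call the policy model; the interceptor substitutes the prompt under test."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://api.groq.com/openai/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
            json={
                "model": "llama-3.1-8b-instant",  # placeholder model name
                "messages": [{"role": "user", "content": f"Customer Query: {query}"}],
            },
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"].strip()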

Step 2: Deploy Your Task App

The Synth backend needs to reach your task app over the internet to send rollout requests. Deploy using a managed Cloudflare tunnel:
synth-ai deploy tunnel path/to/task_app.py \
  --env .env \
  --tunnel-mode managed \
  --tunnel-subdomain my-company
This command:
  1. Starts your task app locally on port 8000
  2. Creates a Cloudflare tunnel with a stable URL
  3. Writes TASK_APP_URL to your .env file
You’ll see output like:
Wrote TASK_APP_URL=https://my-company.usesynth.ai to /path/to/.env

Verify the Deployment

Check that your task app is accessible:
curl https://my-company.usesynth.ai/health
The tunnel URL persists across restarts. On subsequent runs, the CLI will reuse your existing tunnel:
ℹ️  Reusing existing managed tunnel for my-org: https://my-company.usesynth.ai

Keep the Tunnel Running

By default, the tunnel runs in the background. To keep it in the foreground (useful for debugging):
synth-ai deploy tunnel path/to/task_app.py \
  --env .env \
  --tunnel-mode managed \
  --wait

Step 3: Create the Configuration

Create a TOML file defining your optimization parameters. The task_app_url should match the URL from Step 2 (stored in your .env as TASK_APP_URL):
[prompt_learning]
algorithm = "gepa"
task_app_url = "https://my-company.usesynth.ai"  # From TASK_APP_URL in .env
task_app_id = "banking77"

# Training seeds (used during optimization)
evaluation_seeds = [50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]

# Validation seeds (held-out for final evaluation)
validation_seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]

# Initial prompt template
[prompt_learning.initial_prompt]
messages = [
  { role = "system", content = "You are a banking intent classification assistant." },
  { role = "user", pattern = "Customer Query: {query}\n\nClassify this query into one of 77 banking intents." }
]

# GEPA-specific configuration
[prompt_learning.gepa]
num_generations = 15              # Evolutionary cycles to run
children_per_generation = 5       # Mutations per generation
pareto_set_size = 20              # Seeds for Pareto evaluation
minibatch_size = 3                # Seeds for quick gating
rollout_budget = 1000             # Total rollouts allowed
archive_size = 64                 # Max Pareto archive size

Configuration Parameters

Parameter | Description | Default | Recommended Range
num_generations | Evolutionary cycles | 10 | 5-20
children_per_generation | Mutations per generation | 5 | 3-10
pareto_set_size | Seeds for Pareto evaluation | 20 | 15-30
minibatch_size | Seeds for gating evaluation | 3 | 2-5
rollout_budget | Total rollouts allowed | 1000 | 200-2000
archive_size | Max Pareto archive size | 64 | 32-128
feedback_fraction | Fraction of seeds for feedback | 0.3 | 0.2-0.5
proposer_mode | Proposer type (synth, gepa-ai, dspy) | synth | -

Step 4: Launch the Optimization Job

uvx synth-ai train \
  --config configs/prompt_learning/banking77_gepa.toml \
  --poll
The CLI will:
  1. Validate your TOML configuration
  2. Verify the task app is reachable
  3. Submit the job to Synth
  4. Poll for completion (with --poll)

Understanding the Output

During optimization, you’ll see progress updates:
[18:35:37]    0.0s  Status: running
[18:35:42]    5.2s  Status: running | Best: 0.500
[18:35:48]   11.4s  Status: running | Best: 0.625
[18:35:54]   17.6s  Status: running | Best: 0.750
[18:36:00]   23.8s  Status: running | Best: 0.875
...
[18:38:50]  175.9s  Status: succeeded | Best: 0.875
Your task app logs will show rollout requests:
[TASK_APP] INBOUND_ROLLOUT: run_id=prompt-learning-74-5bec8a6f seed=74
[TASK_APP] PREDICTION: expected=card_arrival predicted=card_delivery_estimate correct=False
[BANKING77_ROLLOUT] run_id=prompt-learning-74-5bec8a6f reward=0.0

Step 5: Understanding the Optimization Process

Generation-by-Generation Progress

Generation | What Happens | Expected Accuracy
0 (baseline) | Evaluate initial prompt | 60-75%
1-3 | Explore diverse mutations | 70-80%
5-10 | Convergence begins | 80-85%
10-15 | Fine-tuning best solutions | 85-90%+

How Mutations Are Generated

The proposer receives:
  1. Baseline instruction: Your current system prompt
  2. Rollout examples: Input/output pairs with feedback (correct/incorrect, error messages)
  3. Trace statistics: Tool call rate, trajectory length, etc.
  4. Feedback hints: Rule-based suggestions like “model under-utilizes tools”
It generates a new instruction following instruction typology:
[Input Description]
You will be given a customer banking query.

[Core Task Description]
Your task is to classify the query into one of 77 banking intents.

[Premises]
Banking queries often contain domain-specific terminology.
Multiple intents may seem applicable; choose the most specific.

[Heuristics]
Look for keywords indicating the customer's primary need.
Consider the emotional tone to distinguish complaints from inquiries.

[Constraints]
Avoid defaulting to generic intents when specific ones apply.

[Rules]
Output only the intent name, nothing else.

[Output Description]
Return exactly one intent from the predefined list.

Minibatch Gating

Before full evaluation, GEPA performs a quick check:
  1. Evaluate child on a small minibatch (3 seeds)
  2. Compare to parent’s score on the same seeds
  3. If child is worse → skip full evaluation (saves budget)
  4. If child is promising → proceed to full Pareto evaluation
This saves significant compute by filtering out poor mutations early.
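A minimal sketch of this gating step follows; the evaluate_on_seeds helper and the pass/fail rule are assumptions for illustration:
# Illustrative sketch of minibatch gating; `evaluate_on_seeds` is a
# hypothetical helper that runs rollouts and returns mean accuracy.
async def passes_gate(child_prompt, parent_prompt, minibatch_seeds, evaluate_on_seeds):
    child_score = await evaluate_on_seeds(child_prompt, minibatch_seeds)
    parent_score = await evaluate_on_seeds(parent_prompt, minibatch_seeds)
    # Only spend the full Pareto evaluation budget on promising children.
    return child_score >= parent_score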

Step 6: Retrieve Optimized Prompts

After completion, fetch your results using the Python SDK:
import os
from synth_ai.learning import get_prompts, get_prompt_text, get_scoring_summary

BASE_URL = os.environ.get("BACKEND_BASE_URL", "https://agent-learning.onrender.com/api")
API_KEY = os.environ["SYNTH_API_KEY"]
JOB_ID = "pl_abc123"  # From the train command output

# Get all results
results = get_prompts(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY)
print(f"Best Score: {results['best_score']:.3f}")

# Get top 5 prompts from Pareto front
for rank in range(1, 6):
    prompt = get_prompt_text(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY, rank=rank)
    print(f"Rank {rank}: {len(prompt)} chars")
    print(prompt[:200] + "...")

# Get scoring summary
summary = get_scoring_summary(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY)
print(f"Train={summary['best_train_accuracy']:.3f}")
print(f"Validation={summary.get('best_validation_accuracy', 0.0):.3f}")
print(f"Candidates Tried={summary['num_candidates_tried']}")

Understanding the Pareto Front

GEPA returns multiple prompts representing different trade-offs:
Rank | Accuracy | Token Count | Trade-off
1 | 92% | 450 | Highest accuracy
2 | 90% | 280 | Good accuracy, shorter
3 | 88% | 150 | Efficient, still performant
Choose based on your latency/cost requirements.
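For example, to pick the shortest prompt that still clears an accuracy target, you could iterate over the ranked prompts. The accuracy values in the ranked list below are placeholders; populate them from the fields your job results actually expose:
# Illustrative selection from the Pareto front: prefer the shortest prompt
# that still clears an accuracy target. The `ranked` list is a placeholder
# you would populate from your job results.
ranked = [
    {"rank": r, "accuracy": acc, "text": get_prompt_text(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY, rank=r)}
    for r, acc in [(1, 0.92), (2, 0.90), (3, 0.88)]
]

ACCURACY_FLOOR = 0.89
eligible = [p for p in ranked if p["accuracy"] >= ACCURACY_FLOOR]
chosen = min(eligible, key=lambda p: len(p["text"]))  # shortest prompt above the floor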

Step 7: Use the Optimized Prompt

Replace your baseline prompt with the optimized version:
from openai import AsyncOpenAI

client = AsyncOpenAI()

# Before: baseline prompt
system_prompt = "You are a banking intent classification assistant."

# After: optimized prompt (rank 1 from GEPA)
system_prompt = get_prompt_text(job_id=JOB_ID, base_url=BASE_URL, api_key=API_KEY, rank=1)

# Use in your application
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Customer Query: {query}"}
    ]
)

In-Process Optimization

For development and testing, run everything from a single Python script:
from synth_ai.task import InProcessTaskApp
from synth_ai.learning import LearningClient

# Start task app in-process (handles tunneling automatically)
async with InProcessTaskApp(app=my_task_app) as task_app:
    # Task app is now accessible via tunnel
    task_app_url = task_app.url

    # Submit optimization job
    client = LearningClient(base_url=BACKEND_URL, api_key=API_KEY)
    job = await client.create_job(config=my_config)
    await client.start_job(job["id"])

    # Poll until complete
    result = await client.poll_until_terminal(job["id"])
    print(f"Best score: {result['best_score']}")
See In-Process Task App Walkthrough for a complete example.

Termination Conditions

GEPA stops when any condition is met:
Condition | Description | Configuration
rollout_budget | Total rollouts exhausted | rollout_budget = 1000
max_spend_usd | USD budget limit | max_spend_usd = 5.0
num_generations | Generation limit reached | num_generations = 15
patience_generations | No improvement for N generations | patience_generations = 5

Supported Models

See Supported Models for Prompt Optimization for the full list of policy models.