This walkthrough demonstrates GEPA (Genetic Evolutionary Prompt Algorithm) on the Banking77 intent classification task. We’ll explore:
  • The baseline prompt and its limitations
  • How GEPA discovers better prompts
  • Detailed analysis of what changed and why it worked
  • Per-seed performance breakdown

The Task: Banking77 Intent Classification

Banking77 is a dataset of 77 banking-related customer intents. Given a customer query like:
“I am still waiting on my card?”
The model must classify it as one of 77 intents (e.g., card_arrival, card_delivery_estimate, lost_or_stolen_card). This is challenging because many intents are semantically similar. For example, these queries have subtly different intents:
| Query | Intent |
| --- | --- |
| “I am still waiting on my card?” | card_arrival |
| “When will my card arrive?” | card_delivery_estimate |
| “I think my card was stolen” | lost_or_stolen_card |
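
If you want to inspect the raw data yourself, here is a minimal sketch using the Hugging Face datasets library. The `PolyAI/banking77` dataset id and its `text`/`label` columns are assumptions about where the data lives; adjust if your copy differs.

```python
# Minimal sketch: peek at a few Banking77 examples and their intent labels.
# Assumes the "PolyAI/banking77" dataset id with "text"/"label" columns.
from datasets import load_dataset

ds = load_dataset("PolyAI/banking77", split="train")
label_names = ds.features["label"].names  # the 77 intent strings

for row in ds.select(range(3)):
    print(f"{row['text']!r} -> {label_names[row['label']]}")
```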

The Baseline Prompt

We start with a simple, intuitive prompt:
[[prompt_learning.initial_prompt.messages]]
role = "system"
pattern = "You are an expert banking assistant that classifies customer queries
into banking intents. Given a customer message, respond with exactly one intent
label from the provided list using the `banking77_classify` tool."

[[prompt_learning.initial_prompt.messages]]
role = "user"
pattern = "Customer Query: {query}

Available Intents:
{available_intents}

Classify this query into one of the above banking intents using the tool call."
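
To make the templating concrete, here is a minimal sketch of how these `pattern` strings might be expanded into chat messages at rollout time. `render_messages` is a hypothetical helper, not the demo's actual code.

```python
# Sketch: expand the prompt patterns above into chat messages.
# `render_messages` is a hypothetical helper, not part of the demo code.
SYSTEM_PATTERN = (
    "You are an expert banking assistant that classifies customer queries "
    "into banking intents. Given a customer message, respond with exactly one "
    "intent label from the provided list using the `banking77_classify` tool."
)
USER_PATTERN = (
    "Customer Query: {query}\n\n"
    "Available Intents:\n{available_intents}\n\n"
    "Classify this query into one of the above banking intents using the tool call."
)

def render_messages(query: str, intents: list[str]) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PATTERN},
        {"role": "user", "content": USER_PATTERN.format(
            query=query, available_intents="\n".join(intents))},
    ]

messages = render_messages(
    "I am still waiting on my card?",
    ["card_arrival", "card_delivery_estimate", "lost_or_stolen_card"],
)
```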
This is a reasonable starting point, but how well does it actually perform?

Baseline Performance: 44% Accuracy

On 120 training seeds with gpt-4.1-nano, the baseline achieves only 44% accuracy. Here’s a sample of what went wrong:
| Query | Expected | Predicted | Correct |
| --- | --- | --- | --- |
| “When did you send me my new card?” | card_arrival | card_delivery_estimate | ✗ |
| “My new card has not been delivered yet. What is going on?” | card_arrival | card_delivery_estimate | ✗ |
| “I am still waiting on my card?” | card_arrival | card_arrival | ✓ |
| “Is there a way to track the delivery of my card?” | card_arrival | card_arrival | ✓ |
The model struggles to distinguish between card_arrival (checking on a card that should have arrived) and card_delivery_estimate (asking about delivery timeline). Both involve cards and delivery, but the intents are different.
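
A quick way to quantify that confusion is to tally how often each expected intent maps to each wrong prediction. A sketch over (expected, predicted) pairs, using the four sample rows above as illustrative data:

```python
# Sketch: tally intent confusions from (expected, predicted) pairs.
# The `pairs` list mirrors the sample table above; real data would come
# from the per-seed results of a baseline evaluation run.
from collections import Counter

pairs = [
    ("card_arrival", "card_delivery_estimate"),
    ("card_arrival", "card_delivery_estimate"),
    ("card_arrival", "card_arrival"),
    ("card_arrival", "card_arrival"),
]

confusions = Counter((exp, pred) for exp, pred in pairs if exp != pred)
for (exp, pred), count in confusions.most_common():
    print(f"{exp} -> {pred}: {count}")
```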

Running GEPA Optimization

We ran GEPA with the following configuration:
[prompt_learning.gepa]
env_name = "banking77"
proposer_effort = "MEDIUM"

[prompt_learning.gepa.evaluation]
train_seeds = [0, 1, 2, ..., 119]  # 120 training seeds
val_seeds = [200, 202, ..., 598]   # 200 validation seeds

[prompt_learning.gepa.population]
initial_size = 12          # Start with 12 candidate prompts
num_generations = 20       # Run 20 generations
children_per_generation = 6

[prompt_learning.gepa.archive]
max_size = 20              # Keep top 20 candidates
pareto_set_size = 40       # Use 40 seeds for Pareto scoring
Total runtime: ~5 minutes
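
The seed lists are simple ranges, so if you adapt this config you can generate them rather than typing them out. A small sketch (not part of the demo's own tooling):

```python
# Sketch: generate the seed lists shown in the config above.
train_seeds = list(range(120))        # 0, 1, ..., 119 -> 120 training seeds
val_seeds = list(range(200, 600, 2))  # 200, 202, ..., 598 -> 200 validation seeds

assert len(train_seeds) == 120 and len(val_seeds) == 200
```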

The Optimized Prompt: 81.5% Accuracy

GEPA found a prompt that achieved 81.5% validation accuracy - nearly double the baseline. The winning prompt (candidate trans_00013) transformed the system message from:
You are an expert banking assistant that classifies customer queries...
to a structured format:
[1. Input Description]
- You will receive two input fields: `Customer Query` (the user message to
  classify) and `Available Intents` (the list of valid intent labels)

[2. Output Format]
- Return exactly ONE intent label from the provided list
- Use the banking77_classify tool to submit your classification

[3. Classification Guidelines]
- Focus on the PRIMARY intent of the customer's message
- If the query asks about card status/arrival, use card_arrival
- If the query asks about delivery timeline, use card_delivery_estimate
- When in doubt, prefer the most specific applicable intent

What Made the Difference?

1. Explicit Input/Output Structure

The baseline just says “classify customer queries.” The optimized prompt explicitly defines:
  • What inputs the model receives
  • What output format is expected
  • That only ONE intent should be returned

2. Disambiguation Rules

The optimized prompt includes explicit guidance for confusing cases:
“If the query asks about card status/arrival, use card_arrival. If the query asks about delivery timeline, use card_delivery_estimate.”
This addresses the exact failure mode we saw in the baseline.

3. Decision Priority

“When in doubt, prefer the most specific applicable intent.”
This heuristic helps the model make consistent decisions in ambiguous cases.

Candidate Comparison

GEPA generated 13 candidates during optimization. Here’s how they compared:
| Candidate | Train Accuracy | Val Accuracy | Notes |
| --- | --- | --- | --- |
| trans_00013 | 62.5% | 81.5% | Best overall - structured format |
| c7 | 47.5% | - | Second best train score |
| c0 | 45.0% | - | Good train, but didn’t generalize |
| c8 | 40.0% | - | Middle of pack |
| c3 | 32.5% | - | Structured but verbose |
| c1 | 30.0% | - | Too abstract |
| c11 | 0.0% | - | Complete failure - malformed output |
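
After a run, the full candidate list is written to results/candidates.json (see “Run It Yourself” below). Here is a sketch of ranking candidates by score; the field names (`name`, `train_accuracy`, `val_accuracy`) are assumptions about the file layout, so check your own output.

```python
# Sketch: rank candidates from results/candidates.json.
# Field names ("name", "train_accuracy", "val_accuracy") are assumed; confirm
# against the actual file before relying on them.
import json

with open("results/candidates.json") as f:
    candidates = json.load(f)

def score(candidate: dict) -> float:
    # Prefer validation accuracy when present, otherwise fall back to train accuracy.
    if candidate.get("val_accuracy") is not None:
        return candidate["val_accuracy"]
    return candidate.get("train_accuracy", 0.0)

for c in sorted(candidates, key=score, reverse=True)[:5]:
    print(c.get("name"), c.get("train_accuracy"), c.get("val_accuracy"))
```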

The Worst Candidate

Candidate c11 achieved 0% accuracy because its prompt caused the model to output malformed responses. This illustrates how GEPA’s evolutionary approach naturally filters out bad mutations.

Contrastive Examples: How Candidates Differ

The most interesting insight comes from comparing how different prompts handle the same queries. Here’s a detailed breakdown across baseline and several candidates.

Seed 5: “When did you send me my new card?” (Expected: card_arrival)
| Candidate | Prediction | Correct |
| --- | --- | --- |
| Baseline | card_delivery_estimate | ✗ |
| c0 | card_arrival | ✓ |
| c1 | card_delivery_estimate | ✗ |
| c3 | card_arrival | ✓ |
| c7 | card_delivery_estimate | ✗ |
| trans_00013 | card_delivery_estimate | ✗ |
Even the best candidate (trans_00013) gets this wrong - showing that some queries are inherently ambiguous.

Seed 67: “how to track the card you sent” (Expected: card_arrival)
| Candidate | Prediction | Correct |
| --- | --- | --- |
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ |
| c3 | card_arrival | ✓ |
| c7 | card_arrival | ✓ |
| trans_00013 | card_arrival | ✓ |
Here c0 breaks a case the baseline got right - illustrating how optimization isn’t monotonic.

Seed 13: “Is there a way to track the delivery of my card?” (Expected: card_arrival)
| Candidate | Prediction | Correct |
| --- | --- | --- |
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ |
| c3 | card_arrival | ✓ |
| c7 | card_delivery_estimate | ✗ |
| trans_00013 | card_arrival | ✓ |
trans_00013 matches baseline here, while c0 and c7 fail.

Seed 0: “I am still waiting on my card?” (Expected: card_arrival)
| Candidate | Prediction | Correct |
| --- | --- | --- |
| Baseline | card_arrival | ✓ |
| c0 | card_arrival | ✓ |
| c3 | card_delivery_estimate | ✗ |
| c7 | card_arrival | ✓ |
| c11 | “I am still waiting on my card?” | ✗ |
| trans_00013 | card_arrival | ✓ |
c11’s prompt is so broken it echoes the query back instead of classifying it.
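
Per-seed tables like the ones above can be rebuilt from results/seed_details.json. A sketch, assuming each record carries seed, candidate, expected, and predicted fields (check the actual schema of your run):

```python
# Sketch: rebuild a per-seed prediction table from results/seed_details.json.
# The record fields ("seed", "candidate", "expected", "predicted") are assumptions.
import json
from collections import defaultdict

with open("results/seed_details.json") as f:
    records = json.load(f)

predictions = defaultdict(dict)  # seed -> {candidate: prediction}
expected = {}                    # seed -> ground-truth intent
for r in records:
    predictions[r["seed"]][r["candidate"]] = r["predicted"]
    expected[r["seed"]] = r["expected"]

seed = 5
print(f"Seed {seed} (expected: {expected[seed]})")
for candidate, prediction in sorted(predictions[seed].items()):
    mark = "correct" if prediction == expected[seed] else "wrong"
    print(f"  {candidate:>12}: {prediction} ({mark})")
```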

Failure Mode Analysis

The data reveals three distinct failure modes:
  1. Intent confusion (c0, c1, c7): Correctly formatted output, wrong intent
    • “Is there a way to track the delivery?” → card_delivery_estimate instead of card_arrival
  2. Query echoing (c2, c5, c6, c9, c11): Model outputs the query instead of an intent
    • “When did you send me my new card?” → "When did you send me my new card?"
  3. Format pollution (c4, c6): Model outputs partial prompt text
    • “I am still waiting on my card?” → "Customer Query" or "available intents"
The best candidate (trans_00013) avoids all three failure modes, achieving clean intent outputs on every seed.
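
These failure modes can also be detected mechanically. A rough heuristic sketch; the string checks are my own, not the demo's analysis code:

```python
# Sketch: bucket a prediction into the three failure modes above.
# The heuristics are illustrative, not the demo's actual analysis logic.
def failure_mode(query: str, predicted: str, expected: str, valid_intents: set[str]) -> str:
    if predicted == expected:
        return "correct"
    if predicted in valid_intents:
        return "intent_confusion"   # well-formed label, wrong intent
    if predicted.strip().lower() == query.strip().lower():
        return "query_echo"         # the model repeated the input back
    return "format_pollution"       # prompt fragments or other malformed output

print(failure_mode(
    "I am still waiting on my card?",
    "card_delivery_estimate",
    "card_arrival",
    {"card_arrival", "card_delivery_estimate", "lost_or_stolen_card"},
))  # -> intent_confusion
```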

Disagreement Analysis

Comparing the baseline to the best candidate on 40 seeds:
  • 23 seeds showed disagreement (different predictions)
  • Baseline won on 8 seeds (best candidate got these wrong)
  • Best candidate won on 15 seeds (baseline got these wrong)

Seeds Where Baseline Won

| Seed | Query | Expected |
| --- | --- | --- |
| 6 | “Do you have info about the card on delivery?” | card_arrival |
| 12 | “Can you track my card for me?” | card_arrival |
| 30 | “How long before my card arrives?” | card_arrival |

Seeds Where Best Candidate Won

| Seed | Query | Expected |
| --- | --- | --- |
| 0 | “I am still waiting on my card?” | card_arrival |
| 5 | “When did you send me my new card?” | card_arrival |
| 17 | “Has my replacement card been shipped?” | card_arrival |
The net gain of +7 seeds (15 won minus 8 lost) accounts for the best candidate’s edge over the baseline on this 40-seed comparison.
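
The win/loss counts above come from a seed-by-seed comparison of two candidates. A small sketch of that bookkeeping; the inline dictionaries are illustrative data, not results from the run:

```python
# Sketch: count disagreements and per-candidate wins across a set of seeds.
# The inline data is illustrative; real predictions would come from seed_details.json.
def compare(preds_a: dict[int, str], preds_b: dict[int, str], expected: dict[int, str]):
    disagreements = a_wins = b_wins = 0
    for seed, truth in expected.items():
        a, b = preds_a.get(seed), preds_b.get(seed)
        if a is None or b is None or a == b:
            continue
        disagreements += 1
        if a == truth:
            a_wins += 1
        elif b == truth:
            b_wins += 1
    return disagreements, a_wins, b_wins

expected = {0: "card_arrival", 5: "card_arrival"}
baseline = {0: "card_arrival", 5: "card_delivery_estimate"}
best = {0: "card_arrival", 5: "card_arrival"}
print(compare(baseline, best, expected))  # -> (1, 0, 1)
```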

The Evolution Process

GEPA uses genetic algorithms to evolve prompts:
  1. Initialize: Start with 12 candidate prompts (mutations of the baseline)
  2. Evaluate: Test each candidate on a subset of training seeds
  3. Select: Keep the best performers (Pareto optimal)
  4. Mutate: Generate 6 new children per generation via LLM-guided mutations
  5. Repeat: Run for 20 generations
The Pareto frontier was updated only once during this run, settling on trans_00013 as the dominant prompt.
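
In code, the loop has roughly this shape. This is a structural sketch of the five steps above with trivial stand-ins for the LLM-guided pieces, not the actual GEPA implementation:

```python
# Sketch: the shape of the evolutionary loop described above.
# `mutate_with_llm`, `evaluate`, and `pareto_select` are trivial stand-ins
# for GEPA's real LLM-guided mutation, rollout evaluation, and Pareto archive.
import random

def mutate_with_llm(prompt: str) -> str:
    return prompt + f" [mutation {random.randint(0, 9999)}]"

def evaluate(prompt: str, seeds: list[int]) -> float:
    return random.random()  # placeholder for accuracy over rollouts

def pareto_select(scored: list[tuple[str, float]], max_size: int) -> list[tuple[str, float]]:
    return sorted(scored, key=lambda item: item[1], reverse=True)[:max_size]

def run_gepa(baseline: str, seeds: list[int], *, initial_size=12,
             num_generations=20, children_per_generation=6, archive_size=20):
    population = [mutate_with_llm(baseline) for _ in range(initial_size)]
    archive: list[tuple[str, float]] = []
    for _ in range(num_generations):
        scored = [(candidate, evaluate(candidate, seeds)) for candidate in population]
        archive = pareto_select(archive + scored, max_size=archive_size)
        population = [mutate_with_llm(parent)
                      for parent, _ in archive[:children_per_generation]]
    return max(archive, key=lambda item: item[1])

best_prompt, best_score = run_gepa("You are an expert banking assistant...", list(range(120)))
```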

Run It Yourself

# Clone the cookbooks repo
git clone https://github.com/synth-laboratories/cookbooks
cd cookbooks/demos/banking77

# Run the demo
BACKEND_URL=http://localhost:8000 uv run python run_banking77_demo.py

# Or run a quick test with smaller budget
BACKEND_URL=http://localhost:8000 uv run python run_banking77_demo.py --test
Results are saved to results/:
  • candidates.json - All candidates with scores and prompts
  • seed_details.json - Per-seed predictions and ground truth
  • seed_analysis.json - Disagreement analysis between baseline and best
  • analysis_report.md - Human-readable summary

Source Code