- The baseline prompt and its limitations
- How GEPA discovers better prompts
- Detailed analysis of what changed and why it worked
- Per-seed performance breakdown
The Task: Banking77 Intent Classification
Banking77 is a dataset of 77 banking-related customer intents. Given a customer query like “I am still waiting on my card?”, the model must classify it as one of 77 intents (e.g., card_arrival, card_delivery_estimate, lost_or_stolen_card).
This is challenging because many intents are semantically similar. For example, these queries have subtly different intents:
| Query | Intent |
|---|---|
| ”I am still waiting on my card?” | card_arrival |
| ”When will my card arrive?” | card_delivery_estimate |
| ”I think my card was stolen” | lost_or_stolen_card |
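For reference, the dataset is available on the Hugging Face Hub; here is a minimal loading sketch, assuming the PolyAI/banking77 copy with its text and label fields:

```python
# Sketch: inspect the Banking77 data via the Hugging Face `datasets` library,
# assuming the PolyAI/banking77 dataset with `text` and `label` fields.
from datasets import load_dataset

ds = load_dataset("PolyAI/banking77", split="train")
label_names = ds.features["label"].names   # the 77 intent strings

example = ds[0]
print(example["text"])                     # a customer query
print(label_names[example["label"]])       # its gold intent, e.g. "card_arrival"
```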
The Baseline Prompt
We start with a simple, intuitive prompt that just asks the model to classify customer queries into one of the 77 intents.
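A minimal sketch of what such a baseline looks like (illustrative of this style of prompt, not the verbatim text used in the run):

```python
# Illustrative baseline: a single short system message plus the raw query.
# This is a sketch, not the exact prompt from the demo.
BASELINE_SYSTEM_PROMPT = (
    "You are a banking assistant. Classify customer queries into one of the "
    "77 Banking77 intents. Respond with the intent label."
)

def build_messages(query: str) -> list[dict]:
    """Build a chat-completion-style message list for one customer query."""
    return [
        {"role": "system", "content": BASELINE_SYSTEM_PROMPT},
        {"role": "user", "content": query},
    ]
```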
Baseline Performance: 44% Accuracy
On 120 training seeds with gpt-4.1-nano, the baseline achieves only 44% accuracy. Here’s a sample of what went wrong:
| Query | Expected | Predicted | Correct |
|---|---|---|---|
| ”When did you send me my new card?” | card_arrival | card_delivery_estimate | ❌ |
| “My new card has not been delivered yet. What is going on?” | card_arrival | card_delivery_estimate | ❌ |
| “I am still waiting on my card?” | card_arrival | card_arrival | ✓ |
| “Is there a way to track the delivery of my card?” | card_arrival | card_arrival | ✓ |
The model systematically confuses card_arrival (checking on a card that should have arrived) with card_delivery_estimate (asking about the delivery timeline). Both involve cards and delivery, but the intents are different.
Running GEPA Optimization
We ran GEPA with the configuration in banking77_gepa_demo.toml.
The Optimized Prompt: 81.5% Accuracy
GEPA found a prompt that achieved 81.5% validation accuracy, nearly double the baseline. The winning prompt (candidate trans_00013) substantially restructured the system message; the key changes are broken down below.
What Made the Difference?
1. Explicit Input/Output Structure
The baseline just says “classify customer queries.” The optimized prompt explicitly defines:
- What inputs the model receives
- What output format is expected
- That only ONE intent should be returned
2. Disambiguation Rules
The optimized prompt includes explicit guidance for confusing cases: “If the query asks about card status/arrival, use card_arrival. If the query asks about delivery timeline, use card_delivery_estimate.” This addresses the exact failure mode we saw in the baseline.
3. Decision Priority
“When in doubt, prefer the most specific applicable intent.” This heuristic helps the model make consistent decisions in ambiguous cases.
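Put together, a prompt in the spirit of these three changes might be structured as follows. This is a sketch based on the points above, not the exact text of trans_00013:

```python
# Sketch of a system prompt combining the three elements above.
# Illustrative only; not the actual winning candidate.
OPTIMIZED_SYSTEM_PROMPT = """\
You receive a single customer query about a banking product.
Output exactly ONE intent label from the list of 77 Banking77 intents,
with no extra text.

Disambiguation rules:
- If the query asks about the status/arrival of a card, use card_arrival.
- If the query asks about the delivery timeline, use card_delivery_estimate.

When in doubt, prefer the most specific applicable intent.
"""
```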
Candidate Comparison
GEPA generated 13 candidates during optimization. Here’s how they compared:
| Candidate | Train Accuracy | Val Accuracy | Notes |
|---|---|---|---|
| trans_00013 | 62.5% | 81.5% | Best overall - structured format |
| c7 | 47.5% | - | Second best train score |
| c0 | 45.0% | - | Good train, but didn’t generalize |
| c8 | 40.0% | - | Middle of pack |
| c3 | 32.5% | - | Structured but verbose |
| c1 | 30.0% | - | Too abstract |
| c11 | 0.0% | - | Complete failure - malformed output |
The Worst Candidate
Candidate c11 achieved 0% accuracy because its prompt caused the model to output malformed responses. This illustrates how GEPA’s evolutionary approach naturally filters out bad mutations.
Contrastive Examples: How Candidates Differ
The most interesting insight comes from comparing how different prompts handle the same queries. Here’s a detailed breakdown across the baseline and several candidates.
Seed 5: “When did you send me my new card?” (Expected: card_arrival)
| Candidate | Prediction | Correct |
|---|---|---|
| Baseline | card_delivery_estimate | ❌ |
| c0 | card_arrival | ✓ |
| c1 | card_delivery_estimate | ❌ |
| c3 | card_arrival | ✓ |
| c7 | card_delivery_estimate | ❌ |
| trans_00013 | card_delivery_estimate | ❌ |
(Expected: card_arrival)
| Candidate | Prediction | Correct |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ❌ |
| c3 | card_arrival | ✓ |
| c7 | card_arrival | ✓ |
| trans_00013 | card_arrival | ✓ |
(Expected: card_arrival)
| Candidate | Prediction | Correct |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ❌ |
| c3 | card_arrival | ✓ |
| c7 | card_delivery_estimate | ❌ |
| trans_00013 | card_arrival | ✓ |
(Expected: card_arrival)
| Candidate | Prediction | Correct |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_arrival | ✓ |
| c3 | card_delivery_estimate | ❌ |
| c7 | card_arrival | ✓ |
| c11 | I am still waiting on my card? | ❌ |
| trans_00013 | card_arrival | ✓ |
Failure Mode Analysis
The data reveals three distinct failure modes:
- Intent confusion (c0, c1, c7): correctly formatted output, wrong intent. Example: “Is there a way to track the delivery?” → card_delivery_estimate instead of card_arrival.
- Query echoing (c2, c5, c6, c9, c11): the model outputs the query instead of an intent. Example: “When did you send me my new card?” → "When did you send me my new card?"
- Format pollution (c4, c6): the model outputs partial prompt text. Example: “I am still waiting on my card?” → "Customer Query" or "available intents".
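One practical consequence is that a scorer should treat anything that is not an exact intent label as wrong, so echoed queries and prompt fragments count as failures rather than partial credit. A minimal sketch of such a check (the function name and signature are assumptions, not part of the demo code):

```python
# Sketch: score a raw model output against the gold intent, treating echoed
# queries and prompt fragments as failures. Names here are illustrative.
def score_prediction(raw_output: str, gold_intent: str, label_names: list[str]) -> bool:
    prediction = raw_output.strip().lower()
    if prediction not in label_names:
        # Covers query echoing and format pollution: output is not a valid label.
        return False
    # Covers intent confusion: a valid label, but possibly the wrong one.
    return prediction == gold_intent
```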
Disagreement Analysis
Comparing the baseline to the best candidate on 40 seeds:
- 23 seeds showed disagreement (different predictions)
- Baseline won on 8 seeds (best candidate got these wrong)
- Best candidate won on 15 seeds (baseline got these wrong)
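This breakdown can be recomputed from the per-seed predictions. A sketch, assuming seed_details.json stores one record per seed with the expected intent and both models' predictions (the key names here are assumptions, not a documented schema):

```python
# Sketch: recompute the disagreement counts from per-seed predictions.
# The keys "baseline", "best", and "expected" are assumed, not documented.
import json

with open("results/seed_details.json") as f:
    seeds = json.load(f)

disagree = [s for s in seeds if s["baseline"] != s["best"]]
baseline_wins = sum(1 for s in disagree if s["baseline"] == s["expected"])
best_wins = sum(1 for s in disagree if s["best"] == s["expected"])
print(len(disagree), baseline_wins, best_wins)
```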
Seeds Where Baseline Won
| Seed | Query | Baseline | Best | Expected |
|---|---|---|---|---|
| 6 | ”Do you have info about the card on delivery?” | ✓ | ❌ | card_arrival |
| 12 | ”Can you track my card for me?” | ✓ | ❌ | card_arrival |
| 30 | ”How long before my card arrives?” | ✓ | ❌ | card_arrival |
Seeds Where Best Candidate Won
| Seed | Query | Baseline | Best | Expected |
|---|---|---|---|---|
| 0 | ”I am still waiting on my card?” | ❌ | ✓ | card_arrival |
| 5 | ”When did you send me my new card?” | ❌ | ✓ | card_arrival |
| 17 | ”Has my replacement card been shipped?” | ❌ | ✓ | card_arrival |
The Evolution Process
GEPA uses genetic algorithms to evolve prompts:
- Initialize: Start with 12 candidate prompts (mutations of the baseline)
- Evaluate: Test each candidate on a subset of training seeds
- Select: Keep the best performers (Pareto optimal)
- Mutate: Generate 6 new children per generation via LLM-guided mutations
- Repeat: Run for 20 generations
After 20 generations, this process converged on trans_00013 as the dominant prompt.
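A heavily simplified sketch of this generate-evaluate-select loop, with mutate and evaluate as caller-supplied placeholders for GEPA's LLM-guided mutation and per-seed scoring (this is an illustration, not the actual GEPA implementation, which uses Pareto selection over per-seed scores rather than a single mean score):

```python
# Heavily simplified sketch of the evolutionary loop; `mutate` and `evaluate`
# stand in for GEPA's LLM-guided mutation and seed evaluation.
import random
from typing import Callable

def evolve(
    baseline: str,
    seeds: list[int],
    mutate: Callable[[str], str],
    evaluate: Callable[[str, list[int]], float],
    generations: int = 20,
) -> str:
    population = [mutate(baseline) for _ in range(12)]            # Initialize
    for _ in range(generations):
        minibatch = random.sample(seeds, min(20, len(seeds)))
        ranked = sorted(population, key=lambda p: evaluate(p, minibatch), reverse=True)
        survivors = ranked[:6]                                     # Select
        population = survivors + [mutate(p) for p in survivors]    # Mutate: 6 children
    return max(population, key=lambda p: evaluate(p, seeds))       # Final winner
```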
Run It Yourself
All run artifacts are written to results/:
- candidates.json - All candidates with scores and prompts
- seed_details.json - Per-seed predictions and ground truth
- seed_analysis.json - Disagreement analysis between baseline and best
- analysis_report.md - Human-readable summary
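To explore the results programmatically, something like the following works; the field names are assumptions based on the descriptions above, not a documented schema:

```python
# Sketch: load candidates.json and list candidates by training score.
# The exact JSON layout (a list of objects with "name" and accuracy fields)
# is an assumption, not documented.
import json

with open("results/candidates.json") as f:
    candidates = json.load(f)

for cand in sorted(candidates, key=lambda c: c.get("train_accuracy", 0), reverse=True):
    print(cand["name"], cand.get("train_accuracy"), cand.get("val_accuracy"))
```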
Source Code
- Demo script: run_banking77_demo.py
- Task app: banking77_task_app.py
- Config: banking77_gepa_demo.toml