Dataset
Banking77 is a dataset of 13,083 customer queries labeled with 77 banking-related intents, for example:

| Query | Intent |
|---|---|
| “I am still waiting on my card?” | card_arrival |
| “When will my card arrive?” | card_delivery_estimate |
| “I think my card was stolen” | lost_or_stolen_card |
Two of these intents are easy to confuse:

- `card_arrival` - checking on a card that should have arrived
- `card_delivery_estimate` - asking about delivery timeline
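For reference, the dataset is publicly available on the Hugging Face Hub. A minimal loading sketch, assuming the `datasets` library and the `PolyAI/banking77` dataset ID (this run may have sourced the data differently):

```python
# Minimal sketch: load Banking77 and inspect its label space.
# Assumes the Hugging Face `datasets` library and the public
# PolyAI/banking77 dataset ID; not necessarily how this run loaded the data.
from datasets import load_dataset

banking77 = load_dataset("PolyAI/banking77")
intent_names = banking77["train"].features["label"].names   # the 77 intent labels

print(len(intent_names))                                     # 77
print(len(banking77["train"]) + len(banking77["test"]))      # 13,083 queries total

# Each example pairs a raw query with an intent label index.
example = banking77["train"][0]
print(example["text"], "->", intent_names[example["label"]])
```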
Baseline: 43.5%
Baseline Seed Performance
| Seed | Query | Baseline Output | Correct |
|---|---|---|---|
| 40 | “I need to find out where the card is that I ordered.” | card_arrival | ✓ |
| 34 | “Where is the card I ordered 2 weeks ago?” | card_arrival | ✓ |
| 56 | “How long does it take for me to get my new card?” | card_delivery_estimate | ✗ |
| 114 | “I still haven’t gotten my new card. When will it get here?” | card_arrival | ✓ |
| 93 | “WHAT IS THE SOLUTION OF THIS PROBLEM” | card_arrival | ✓ |
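The accuracy numbers throughout are exact-match scores over per-seed predictions like these. A minimal sketch of that scoring loop, where `classify` is a hypothetical wrapper around the prompt plus model call (the actual harness isn’t shown in this post):

```python
# Sketch: exact-match accuracy over evaluation seeds.
# `classify(query) -> intent` is a hypothetical wrapper around the prompt + model call.
from typing import Callable

def accuracy(seeds: list[dict], classify: Callable[[str], str]) -> float:
    """Each seed is {'query': ..., 'intent': ...}; score is exact label match."""
    correct = sum(classify(seed["query"]) == seed["intent"] for seed in seeds)
    return correct / len(seeds)

# e.g. the baseline's 43.5% is this metric computed over the 120 training seeds.
```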
Phase 1: Initial Population
12 candidate prompts were generated. Their results ranged from 2.5% to 65%:

| Candidate | Accuracy | What Happened |
|---|---|---|
| c9 | 65% | Best initial - clear input/output structure |
| c5 | 62.5% | Strong - explicit field descriptions |
| c0, c11 | 57.5% | Good - but introduced regressions |
| c8 | 55% | Decent - two-input structure |
| c7 | 47.5% | Mixed - verbose descriptions |
| c2 | 40% | Prompt leakage failures |
| c4 | 37.5% | Lowercase field names |
| c1 | 12.5% | Query echoing failures |
| c3 | 7.5% | Numbered lists misinterpreted |
| c10 | 5% | Inverted input priority |
| c6 | 2.5% | Near-total structural confusion |
Seed-Level Comparison: What Went Wrong
Seed 34: “Where is the card I ordered 2 weeks ago?” (Expected: `card_arrival`)
| Candidate | Output | Result |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ regression |
| c1 | "Where is the card I ordered 2 weeks ago?" | ✗ echoed query |
| c2 | "Where is the card I ordered 2 weeks ago?" | ✗ echoed query |
| trans_00013 | card_arrival | ✓ fixed |
Seed 93: “WHAT IS THE SOLUTION OF THIS PROBLEM” (Expected: `card_arrival`)
| Candidate | Output | Result |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_not_working | ✗ regression |
| c1 | "WHAT IS THE SOLUTION OF THIS PROBLEM" | ✗ echoed query |
| c2 | "what_is_the_solution_of_this_problem" | ✗ prompt leakage |
| trans_00013 | card_arrival | ✓ fixed |
Note c2’s output: the query rewritten in the snake_case style of the `customer_query` field name from c2’s prompt structure.
Seed 114: “I still haven’t gotten my new card. When will it get here?” (Expected: `card_arrival`)
| Candidate | Output | Result |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ regression |
| c1 | "Customer" | ✗ prompt fragment |
| trans_00013 | card_arrival | ✓ fixed |
"Customer" - a fragment of the prompt’s Customer Query field name. This indicates the model was confused about what to output.
Failure Mode Analysis
The seed data reveals three distinct failure modes:

1. Intent Confusion (c0, c4, c7)
   - Model outputs a valid intent, but the wrong one
   - Example: `card_delivery_estimate` instead of `card_arrival`
   - Cause: Prompt structure didn’t disambiguate similar intents
2. Query Echoing (c1)
   - Model repeats the input query instead of classifying
   - Example: `"Where is the card I ordered 2 weeks ago?"`
   - Cause: Prompt structure confused input vs. output expectations
3. Prompt Leakage (c2)
   - Model outputs fragments of the prompt template
   - Examples: `"customer_query"`, `"{customer_query}"`, `"Customer"`
   - Cause: Field name descriptions leaked into output space
Phase 2: Evolution
trans_00013 descended from trans_00010, inheriting successful patterns while fixing failure modes.
What trans_00013 Fixed
| Seed | Query | c0 | trans_00013 |
|---|---|---|---|
| 34 | ”Where is the card I ordered 2 weeks ago?” | ✗ card_delivery_estimate | ✓ card_arrival |
| 93 | ”WHAT IS THE SOLUTION OF THIS PROBLEM” | ✗ card_not_working | ✓ card_arrival |
| 114 | ”I still haven’t gotten my new card. When will it get here?” | ✗ card_delivery_estimate | ✓ card_arrival |
trans_00013 recovered the baseline’s correct predictions while maintaining the structural improvements that helped on other seeds.
The Winning Transformation
- Brackets like `[1. Input Description]` create clear section boundaries
- `Customer Query` matches the user message format exactly (no `customer_query` mismatch)
- “describing a banking need” frames the task more precisely than “asking about an issue”
- “complete list of 77” explicitly scopes the intent space
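The post doesn’t reproduce trans_00013 verbatim, so the template below only illustrates those ingredients (bracketed sections, the exact `Customer Query` field name, the “describing a banking need” framing); the wording is placeholder, not the actual prompt:

```python
# Illustrative template in the spirit of trans_00013 (not the actual prompt).
ALL_77_INTENTS = ["card_arrival", "card_delivery_estimate", "..."]  # full label list goes here

PROMPT_TEMPLATE = """\
[1. Input Description]
Customer Query: a message from a customer describing a banking need.

[2. Output Format]
Respond with exactly one intent label from the complete list of 77 below. Output only the label.

[3. Intent Labels]
{intents}

Customer Query: {query}
Intent:"""

def build_prompt(query: str) -> str:
    return PROMPT_TEMPLATE.format(intents="\n".join(ALL_77_INTENTS), query=query)
```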
The Sticking Point
Seed 56: “How long does it take for me to get my new card?” (Expected: `card_arrival`)
| Candidate | Output | Correct |
|---|---|---|
| Baseline | card_delivery_estimate | ✗ |
| c0 | card_delivery_estimate | ✗ |
| c9 | card_delivery_estimate | ✗ |
| trans_00013 | card_delivery_estimate | ✗ |
Every candidate, including trans_00013, got this wrong: the query reads like a delivery timing question, yet the dataset labels it `card_arrival`. This represents a labeling ambiguity in the dataset itself.
Phase 3: Validation
The top 10 candidates were validated on 200 held-out seeds (120-319):

| Metric | trans_00013 |
|---|---|
| Train (120 seeds) | 82.5% |
| Val (200 seeds) | 64% |
| Drop | -18.5 pp |
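Validation is just re-scoring the surviving candidates on the held-out seeds and comparing against training accuracy. A sketch of that step, assuming an `evaluate(candidate, seeds)` helper like the accuracy loop above:

```python
# Sketch: train vs. validation comparison for the top-K candidates.
# `evaluate(candidate, seeds)` is the hypothetical accuracy helper from earlier.
def generalization_report(candidates, train_seeds, val_seeds, evaluate):
    rows = []
    for cand in candidates:
        train_acc = evaluate(cand, train_seeds)          # e.g. 0.825 for trans_00013
        val_acc = evaluate(cand, val_seeds)              # e.g. 0.64 for trans_00013
        rows.append((cand, train_acc, val_acc, (val_acc - train_acc) * 100))  # drop in pp
    return sorted(rows, key=lambda row: row[2], reverse=True)  # rank by validation accuracy
```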
Phase 4: Pareto Selection
Net Improvement
| Metric | Baseline | trans_00013 | Δ |
|---|---|---|---|
| Train | 43.5% | 82.5% | +39 pp |
| Val | ~43% | 64% | +21 pp |
- 17 disagreement seeds
- c9 won 13, baseline won 4
- Net: +9 seeds flipped to correct
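The disagreement counts come from a per-seed comparison rather than aggregate accuracy: a disagreement seed is one where exactly one of the two candidates is correct. A sketch of that bookkeeping (the prediction-dictionary format is an assumption):

```python
# Sketch: per-seed disagreement analysis between two candidates (e.g. baseline vs. c9).
# `preds_a` / `preds_b` map seed id -> predicted intent; `gold` maps seed id -> true label.
def disagreement_stats(preds_a: dict, preds_b: dict, gold: dict):
    a_wins = b_wins = 0
    for seed, label in gold.items():
        a_ok, b_ok = preds_a[seed] == label, preds_b[seed] == label
        if a_ok == b_ok:
            continue            # agreement (both right or both wrong) isn't a disagreement seed
        if a_ok:
            a_wins += 1
        else:
            b_wins += 1
    # e.g. (4, 13, 17, +9) for the baseline-vs-c9 comparison above
    return a_wins, b_wins, a_wins + b_wins, b_wins - a_wins
```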
What GEPA Learned
- Bracketed sections - `[1. Input Description]` over `1. Input Description`
- Exact field name matching - `Customer Query`, not `customer_query`
- Precise task framing - “describing a banking need” over “asking about an issue”
- Explicit output boundaries - a `[2. Output Format]` section prevents prompt leakage
About the run
Configuration
- Model: gpt-4.1-nano
- Initial size: 12
- Number of generations: 20
- Children per generation: 6
- Validation top K: 10
- Pareto set size: 40
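For reference, the same settings written out as a plain config dict; the key names here are descriptive only and don’t correspond to any particular GEPA runner’s API:

```python
# Run configuration as a plain dict; key names are descriptive, not a real runner API.
run_config = {
    "model": "gpt-4.1-nano",
    "initial_population_size": 12,
    "num_generations": 20,
    "children_per_generation": 6,
    "validation_top_k": 10,
    "pareto_set_size": 40,
    "train_seeds": 120,
    "val_seeds": 200,          # held-out seeds 120-319
}
```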
Metrics
- Runtime: 5m 54s
- Cost: $0.47
- Training seeds: 120
- Validation seeds: 200