Dataset

Banking77 is a dataset of 77 banking-related customer intents across 13,083 customer queries, like:
| Query | Intent |
| --- | --- |
| "I am still waiting on my card?" | card_arrival |
| "When will my card arrive?" | card_delivery_estimate |
| "I think my card was stolen" | lost_or_stolen_card |
With intent classification, subtle distinctions trip up prompts, like:
  • card_arrival - checking on a card that should have arrived
  • card_delivery_estimate - asking about delivery timeline
Queries like “How long does it take for me to get my new card?” could plausibly map to either intent.
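
If you want to poke at the data yourself, Banking77 is available on the Hugging Face Hub. A minimal sketch, assuming the `datasets` library and the public `banking77` dataset id:

```python
# Minimal sketch: load Banking77 and inspect one query with its intent label.
from datasets import load_dataset

ds = load_dataset("banking77", split="train")
labels = ds.features["label"].names            # 77 intent strings, e.g. "card_arrival"

example = ds[0]
print(example["text"], "->", labels[example["label"]])
```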

Baseline: 43.5%

System: You are an expert banking assistant that classifies customer
queries into banking intents. Given a customer message, respond with
exactly one intent label from the provided list using the
`banking77_classify` tool.

User: Customer Query: {query}

Available Intents:
{available_intents}

Classify this query into one of the above banking intents using the tool call.
On 120 training seeds with gpt-4.1-nano, this prompt achieved 43.5% accuracy.
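
For reference, here is a sketch of what a tool-constrained classification call like this can look like with the OpenAI Python SDK. The tool schema and helper below are illustrative assumptions, not the demo's actual harness:

```python
# Illustrative sketch: one classification call that forces the model to answer
# via a `banking77_classify` tool whose single argument is an enum of intents.
import json
from openai import OpenAI

client = OpenAI()

def classify(query: str, intents: list[str]) -> str:
    tool = {
        "type": "function",
        "function": {
            "name": "banking77_classify",
            "description": "Classify a banking customer query into one intent.",
            "parameters": {
                "type": "object",
                "properties": {"intent": {"type": "string", "enum": intents}},
                "required": ["intent"],
            },
        },
    }
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": "You are an expert banking assistant that "
                                          "classifies customer queries into banking intents."},
            {"role": "user", "content": f"Customer Query: {query}\n\nAvailable Intents:\n"
                                        + "\n".join(intents)},
        ],
        tools=[tool],
        tool_choice={"type": "function", "function": {"name": "banking77_classify"}},
    )
    call = resp.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)["intent"]
```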

Baseline Seed Performance

| Seed | Query | Baseline Output | Correct |
| --- | --- | --- | --- |
| 40 | "I need to find out where the card is that I ordered." | card_arrival | ✓ |
| 34 | "Where is the card I ordered 2 weeks ago?" | card_arrival | ✓ |
| 56 | "How long does it take for me to get my new card?" | card_delivery_estimate | ✗ |
| 114 | "I still haven’t gotten my new card. When will it get here?" | card_arrival | ✓ |
| 93 | "WHAT IS THE SOLUTION OF THIS PROBLEM" | card_arrival | ✓ |
Seed 56 shows the core confusion: the model interprets “how long” as a delivery timing question instead of a card status check.

Phase 1: Initial Population

12 candidate prompts were generated. Their results ranged from 2.5% to 65%:
| Candidate | Accuracy | What Happened |
| --- | --- | --- |
| c9 | 65% | Best initial - clear input/output structure |
| c5 | 62.5% | Strong - explicit field descriptions |
| c0, c11 | 57.5% | Good - but introduced regressions |
| c8 | 55% | Decent - two-input structure |
| c7 | 47.5% | Mixed - verbose descriptions |
| c2 | 40% | Prompt leakage failures |
| c4 | 37.5% | Lowercase field names |
| c1 | 12.5% | Query echoing failures |
| c3 | 7.5% | Numbered lists misinterpreted |
| c10 | 5% | Inverted input priority |
| c6 | 2.5% | Near-total structural confusion |

Seed-Level Comparison: What Went Wrong

Seed 34: “Where is the card I ordered 2 weeks ago?” (Expected: card_arrival)
| Candidate | Output | Result |
| --- | --- | --- |
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ regression |
| c1 | "Where is the card I ordered 2 weeks ago?" | ✗ echoed query |
| c2 | "Where is the card I ordered 2 weeks ago?" | ✗ echoed query |
| trans_00013 | card_arrival | ✓ fixed |
c0 introduced a regression - baseline got this right, but c0’s transformation caused the model to misclassify. c1 and c2 completely failed by echoing the query back instead of classifying it.

Seed 93: “WHAT IS THE SOLUTION OF THIS PROBLEM” (Expected: card_arrival)
| Candidate | Output | Result |
| --- | --- | --- |
| Baseline | card_arrival | ✓ |
| c0 | card_not_working | ✗ regression |
| c1 | "WHAT IS THE SOLUTION OF THIS PROBLEM" | ✗ echoed query |
| c2 | "what_is_the_solution_of_this_problem" | ✗ prompt leakage |
| trans_00013 | card_arrival | ✓ fixed |
c2’s output shows prompt leakage - the model output a snake_case version of the query, mimicking the customer_query field name from c2’s prompt structure.

Seed 114: “I still haven’t gotten my new card. When will it get here?” (Expected: card_arrival)
| Candidate | Output | Result |
| --- | --- | --- |
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ regression |
| c1 | "Customer" | ✗ prompt fragment |
| trans_00013 | card_arrival | ✓ fixed |
c1 output "Customer" - a fragment of the prompt’s Customer Query field name. This indicates the model was confused about what to output.

Failure Mode Analysis

The seed data reveals three distinct failure modes:

1. Intent Confusion (c0, c4, c7)
  • Model outputs a valid intent, but the wrong one
  • Example: card_delivery_estimate instead of card_arrival
  • Cause: Prompt structure didn’t disambiguate similar intents
2. Query Echoing (c1, c3)
  • Model repeats the input query instead of classifying
  • Example: "Where is the card I ordered 2 weeks ago?"
  • Cause: Prompt structure confused input vs output expectations
3. Prompt Leakage (c2, c6, c10)
  • Model outputs fragments of the prompt template
  • Examples: "customer_query", "{customer_query}", "Customer"
  • Cause: Field name descriptions leaked into output space
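
The last two failure modes are cheap to detect automatically, since anything that is not one of the 77 known labels is malformed by definition. A sketch of such a guard, with illustrative function and bucket names:

```python
# Illustrative guard: only membership in the 77 known labels counts as a
# well-formed prediction; echoes and prompt fragments fail that test.
def check_prediction(pred: str, query: str, valid_intents: set[str]) -> str:
    cleaned = pred.strip().strip('"')
    if cleaned in valid_intents:
        return "ok"                        # valid label (may still be the wrong intent)
    if cleaned.lower() == query.lower():
        return "query_echo"                # the model repeated the input
    return "prompt_leakage_or_fragment"    # field names, braces, partial template text
```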

Phase 2: Evolution

baseline (43.5%)
    ↓ mutation
c0-c11 (2.5% - 65%)
    ↓ evolution
trans_00010 (intermediate)
    ↓ mutation
trans_00013 (82.5%)
trans_00013 descended from trans_00010, inheriting successful patterns while fixing failure modes.
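
Conceptually, the loop that produced this lineage is mutate, evaluate, select. A simplified sketch of that loop (not GEPA's actual implementation), assuming `mutate` and `evaluate` callables:

```python
# Simplified evolutionary loop: keep mutating the best-scoring prompt and
# retain every child so later generations can build on earlier wins.
def evolve(baseline_prompt, mutate, evaluate, generations=20, children=6):
    population = [(baseline_prompt, evaluate(baseline_prompt))]
    for _ in range(generations):
        parent = max(population, key=lambda p: p[1])[0]   # best prompt so far
        for _ in range(children):
            child = mutate(parent)                        # e.g. rewrite one prompt section
            population.append((child, evaluate(child)))
    return max(population, key=lambda p: p[1])
```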

What trans_00013 Fixed

| Seed | Query | c0 | trans_00013 |
| --- | --- | --- | --- |
| 34 | "Where is the card I ordered 2 weeks ago?" | card_delivery_estimate | card_arrival |
| 93 | "WHAT IS THE SOLUTION OF THIS PROBLEM" | card_not_working | card_arrival |
| 114 | "I still haven’t gotten my new card. When will it get here?" | card_delivery_estimate | card_arrival |
trans_00013 recovered the baseline’s correct predictions while maintaining the structural improvements that helped on other seeds.

The Winning Transformation

[1. Input Description]
- Input field `Customer Query`: a single user message describing a banking need or question
- Input field `Available Intents`: the complete list of 77 valid intent labels

[2. Output Format]
- Return exactly one intent label from Available Intents
- Use the banking77_classify tool
Why it worked:
  • Brackets [1. Input Description] create clear section boundaries
  • Customer Query matches the user message format exactly (no customer_query mismatch)
  • “describing a banking need” frames the task more precisely than “asking about an issue”
  • “complete list of 77” explicitly scopes the intent space

The Sticking Point

Seed 56: “How long does it take for me to get my new card?” (Expected: card_arrival)
| Candidate | Output | Correct |
| --- | --- | --- |
| Baseline | card_delivery_estimate | ✗ |
| c0 | card_delivery_estimate | ✗ |
| c9 | card_delivery_estimate | ✗ |
| trans_00013 | card_delivery_estimate | ✗ |
Every candidate, including trans_00013, got this wrong — the query reads like a delivery timing question. This represents a labeling ambiguity in the dataset itself.

Phase 3: Validation

The top 10 candidates were validated on 200 held-out seeds (120-319):
| Metric | trans_00013 |
| --- | --- |
| Train (120 seeds) | 82.5% |
| Val (200 seeds) | 64% |
| Drop | -18.5 pp |

Phase 4: Pareto Selection

Net Improvement

| Metric | Baseline | trans_00013 | Δ |
| --- | --- | --- | --- |
| Train | 43.5% | 82.5% | +39 pp |
| Val | ~43% | 64% | +21 pp |
On the 40 Pareto-scored seeds comparing c9 vs baseline:
  • 17 disagreement seeds
  • c9 won 13, baseline won 4
  • Net: +9 seeds flipped to correct
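
The accounting behind those numbers only looks at seeds where the two prompts disagree. A sketch, with illustrative dict-of-predictions inputs:

```python
# Count wins and losses on disagreement seeds only; agreement seeds carry no signal.
def net_improvement(candidate_preds: dict, baseline_preds: dict, gold: dict):
    wins = losses = 0
    for seed, expected in gold.items():
        cand, base = candidate_preds[seed], baseline_preds[seed]
        if cand == base:
            continue                       # agreement seed, skip
        if cand == expected:
            wins += 1
        elif base == expected:
            losses += 1
    return wins, losses, wins - losses     # here: 13, 4, +9
```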

What GEPA Learned

  1. Bracketed sections - [1. Input Description] over 1. Input Description
  2. Exact field name matching - Customer Query not customer_query
  3. Precise task framing - “describing a banking need” over “asking about an issue”
  4. Explicit output boundaries - [2. Output Format] section prevents prompt leakage

About the run

Configuration

  • Model: gpt-4.1-nano
  • Initial size: 12
  • Number of generations: 20
  • Children per generation: 6
  • Validation top K: 10
  • Pareto set size: 40
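
For reference, the same parameters expressed as a plain Python dict; the key names are illustrative rather than the synth-ai config schema:

```python
# Run configuration mirrored as a dict (illustrative key names).
config = {
    "model": "gpt-4.1-nano",
    "initial_population_size": 12,
    "num_generations": 20,
    "children_per_generation": 6,
    "validation_top_k": 10,
    "pareto_set_size": 40,
}
```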

Metrics

  • Runtime: 5m 54s
  • Cost: $0.47
  • Training seeds: 120
  • Validation seeds: 200

Run This Yourself with Synth AI

git clone https://github.com/synth-laboratories/cookbooks.git
cd cookbooks/code/demos/banking77
uvx --python "<3.14" synth-ai==0.3.2.dev3 setup
uv run run_banking77_demo.py

Optimize your model’s prompts for free with Synth AI