- The baseline prompt and its limitations
- How GEPA discovers better prompts
- Detailed analysis of what changed and why it worked
- Per-seed performance breakdown
The Task: Banking77 Intent Classification
Banking77 is a dataset of 77 banking-related customer intents. Given a customer query like “I am still waiting on my card?”, the model must classify it as one of 77 intents (e.g., card_arrival, card_delivery_estimate, lost_or_stolen_card).
This is challenging because many intents are semantically similar. For example, these queries have subtly different intents:
| Query | Intent |
|---|---|
| ”I am still waiting on my card?” | card_arrival |
| ”When will my card arrive?” | card_delivery_estimate |
| ”I think my card was stolen” | lost_or_stolen_card |
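For reference, the dataset is available on the Hugging Face Hub; here is a minimal loading sketch, assuming the PolyAI/banking77 copy with its text and label fields:

```python
# Sketch: inspect the Banking77 data via the Hugging Face `datasets` library,
# assuming the PolyAI/banking77 dataset with `text` and `label` fields.
from datasets import load_dataset

ds = load_dataset("PolyAI/banking77", split="train")
label_names = ds.features["label"].names   # the 77 intent strings

example = ds[0]
print(example["text"])                     # a customer query
print(label_names[example["label"]])       # its gold intent, e.g. "card_arrival"
```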
The Baseline Prompt
We start with a simple, intuitive prompt that just asks the model to classify customer queries into one of the 77 intents.
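A minimal sketch of what such a baseline looks like (illustrative of this style of prompt, not the verbatim text used in the run):

```python
# Illustrative baseline: a single short system message plus the raw query.
# This is a sketch, not the exact prompt from the demo.
BASELINE_SYSTEM_PROMPT = (
    "You are a banking assistant. Classify customer queries into one of the "
    "77 Banking77 intents. Respond with the intent label."
)

def build_messages(query: str) -> list[dict]:
    """Build a chat-completion-style message list for one customer query."""
    return [
        {"role": "system", "content": BASELINE_SYSTEM_PROMPT},
        {"role": "user", "content": query},
    ]
```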
Baseline Performance: 44% Accuracy
On 120 training seeds with gpt-4.1-nano, the baseline achieves only 44% accuracy. Here’s a sample of what went wrong:
| Query | Expected | Predicted | Correct |
|---|---|---|---|
| ”When did you send me my new card?” | card_arrival | card_delivery_estimate | ❌ |
| “My new card has not been delivered yet. What is going on?” | card_arrival | card_delivery_estimate | ❌ |
| “I am still waiting on my card?” | card_arrival | card_arrival | ✓ |
| “Is there a way to track the delivery of my card?” | card_arrival | card_arrival | ✓ |
The model systematically confuses card_arrival (checking on a card that should have arrived) with card_delivery_estimate (asking about the delivery timeline). Both involve cards and delivery, but the intents are different.
Running GEPA Optimization
We ran GEPA with the configuration in banking77_gepa_demo.toml.
The Optimized Prompt: 81.5% Accuracy
GEPA found a prompt that achieved 81.5% validation accuracy, nearly double the baseline. The winning prompt (candidate trans_00013) substantially restructured the system message; the key changes are broken down below.
What Made the Difference?
1. Explicit Input/Output Structure
The baseline just says “classify customer queries.” The optimized prompt explicitly defines:
- What inputs the model receives
- What output format is expected
- That only ONE intent should be returned
2. Disambiguation Rules
The optimized prompt includes explicit guidance for confusing cases: “If the query asks about card status/arrival, use card_arrival. If the query asks about delivery timeline, use card_delivery_estimate.” This addresses the exact failure mode we saw in the baseline.
3. Decision Priority
“When in doubt, prefer the most specific applicable intent.” This heuristic helps the model make consistent decisions in ambiguous cases.
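Put together, a prompt in the spirit of these three changes might be structured as follows. This is a sketch based on the points above, not the exact text of trans_00013:

```python
# Sketch of a system prompt combining the three elements above.
# Illustrative only; not the actual winning candidate.
OPTIMIZED_SYSTEM_PROMPT = """\
You receive a single customer query about a banking product.
Output exactly ONE intent label from the list of 77 Banking77 intents,
with no extra text.

Disambiguation rules:
- If the query asks about the status/arrival of a card, use card_arrival.
- If the query asks about the delivery timeline, use card_delivery_estimate.

When in doubt, prefer the most specific applicable intent.
"""
```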
Candidate Comparison
GEPA generated 13 candidates during optimization. Here’s how they compared:
| Candidate | Train Accuracy | Val Accuracy | Notes |
|---|---|---|---|
| trans_00013 | 62.5% | 81.5% | Best overall - structured format |
| c7 | 47.5% | - | Second best train score |
| c0 | 45.0% | - | Good train, but didn’t generalize |
| c8 | 40.0% | - | Middle of pack |
| c3 | 32.5% | - | Structured but verbose |
| c1 | 30.0% | - | Too abstract |
| c11 | 0.0% | - | Complete failure - malformed output |
The Worst Candidate
Candidate c11 achieved 0% accuracy because its prompt caused the model to output malformed responses. This illustrates how GEPA’s evolutionary approach naturally filters out bad mutations.
Contrastive Examples: How Candidates Differ
The most interesting insight comes from comparing how different prompts handle the same queries. Here’s a detailed breakdown across the baseline and several candidates.
Seed 5: “When did you send me my new card?” (Expected: card_arrival)
| Candidate | Prediction | Correct |
|---|---|---|
| Baseline | card_delivery_estimate | ❌ |
| c0 | card_arrival | ✓ |
| c1 | card_delivery_estimate | ❌ |
| c3 | card_arrival | ✓ |
| c7 | card_delivery_estimate | ❌ |
| trans_00013 | card_delivery_estimate | ❌ |
(Expected: card_arrival)
| Candidate | Prediction | Correct |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ❌ |
| c3 | card_arrival | ✓ |
| c7 | card_arrival | ✓ |
| trans_00013 | card_arrival | ✓ |
(Expected: card_arrival)
| Candidate | Prediction | Correct |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ❌ |
| c3 | card_arrival | ✓ |
| c7 | card_delivery_estimate | ❌ |
| trans_00013 | card_arrival | ✓ |
(Expected: card_arrival)
| Candidate | Prediction | Correct |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_arrival | ✓ |
| c3 | card_delivery_estimate | ❌ |
| c7 | card_arrival | ✓ |
| c11 | I am still waiting on my card? | ❌ |
| trans_00013 | card_arrival | ✓ |
Failure Mode Analysis
The data reveals three distinct failure modes:
- Intent confusion (c0, c1, c7): correctly formatted output, wrong intent. Example: “Is there a way to track the delivery?” → card_delivery_estimate instead of card_arrival.
- Query echoing (c2, c5, c6, c9, c11): the model outputs the query instead of an intent. Example: “When did you send me my new card?” → "When did you send me my new card?"
- Format pollution (c4, c6): the model outputs partial prompt text. Example: “I am still waiting on my card?” → "Customer Query" or "available intents".
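One practical consequence is that a scorer should treat anything that is not an exact intent label as wrong, so echoed queries and prompt fragments count as failures rather than partial credit. A minimal sketch of such a check (the function name and signature are assumptions, not part of the demo code):

```python
# Sketch: score a raw model output against the gold intent, treating echoed
# queries and prompt fragments as failures. Names here are illustrative.
def score_prediction(raw_output: str, gold_intent: str, label_names: list[str]) -> bool:
    prediction = raw_output.strip().lower()
    if prediction not in label_names:
        # Covers query echoing and format pollution: output is not a valid label.
        return False
    # Covers intent confusion: a valid label, but possibly the wrong one.
    return prediction == gold_intent
```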
Disagreement Analysis
Comparing the baseline to the best candidate on 40 seeds:
- 23 seeds showed disagreement (different predictions)
- Baseline won on 8 seeds (best candidate got these wrong)
- Best candidate won on 15 seeds (baseline got these wrong)
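This breakdown can be recomputed from the per-seed predictions. A sketch, assuming seed_details.json stores one record per seed with the expected intent and both models' predictions (the key names here are assumptions, not a documented schema):

```python
# Sketch: recompute the disagreement counts from per-seed predictions.
# The keys "baseline", "best", and "expected" are assumed, not documented.
import json

with open("results/seed_details.json") as f:
    seeds = json.load(f)

disagree = [s for s in seeds if s["baseline"] != s["best"]]
baseline_wins = sum(1 for s in disagree if s["baseline"] == s["expected"])
best_wins = sum(1 for s in disagree if s["best"] == s["expected"])
print(len(disagree), baseline_wins, best_wins)
```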
Seeds Where Baseline Won
| Seed | Query | Baseline | Best | Expected |
|---|---|---|---|---|
| 6 | ”Do you have info about the card on delivery?” | ✓ | ❌ | card_arrival |
| 12 | ”Can you track my card for me?” | ✓ | ❌ | card_arrival |
| 30 | ”How long before my card arrives?” | ✓ | ❌ | card_arrival |
Seeds Where Best Candidate Won
| Seed | Query | Baseline | Best | Expected |
|---|---|---|---|---|
| 0 | ”I am still waiting on my card?” | ❌ | ✓ | card_arrival |
| 5 | ”When did you send me my new card?” | ❌ | ✓ | card_arrival |
| 17 | ”Has my replacement card been shipped?” | ❌ | ✓ | card_arrival |
The Evolution Process
GEPA uses genetic algorithms to evolve prompts:
- Initialize: Start with 12 candidate prompts (mutations of the baseline)
- Evaluate: Test each candidate on a subset of training seeds
- Select: Keep the best performers (Pareto optimal)
- Mutate: Generate 6 new children per generation via LLM-guided mutations
- Repeat: Run for 20 generations
After 20 generations, this process converged on trans_00013 as the dominant prompt.
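A heavily simplified sketch of this generate-evaluate-select loop, with mutate and evaluate as caller-supplied placeholders for GEPA's LLM-guided mutation and per-seed scoring (this is an illustration, not the actual GEPA implementation, which uses Pareto selection over per-seed scores rather than a single mean score):

```python
# Heavily simplified sketch of the evolutionary loop; `mutate` and `evaluate`
# stand in for GEPA's LLM-guided mutation and seed evaluation.
import random
from typing import Callable

def evolve(
    baseline: str,
    seeds: list[int],
    mutate: Callable[[str], str],
    evaluate: Callable[[str, list[int]], float],
    generations: int = 20,
) -> str:
    population = [mutate(baseline) for _ in range(12)]            # Initialize
    for _ in range(generations):
        minibatch = random.sample(seeds, min(20, len(seeds)))
        ranked = sorted(population, key=lambda p: evaluate(p, minibatch), reverse=True)
        survivors = ranked[:6]                                     # Select
        population = survivors + [mutate(p) for p in survivors]    # Mutate: 6 children
    return max(population, key=lambda p: evaluate(p, seeds))       # Final winner
```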
Run It Yourself
All run artifacts are written to results/:
- candidates.json - All candidates with scores and prompts
- seed_details.json - Per-seed predictions and ground truth
- seed_analysis.json - Disagreement analysis between baseline and best
- analysis_report.md - Human-readable summary
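To explore the results programmatically, something like the following works; the field names are assumptions based on the descriptions above, not a documented schema:

```python
# Sketch: load candidates.json and list candidates by training score.
# The exact JSON layout (a list of objects with "name" and accuracy fields)
# is an assumption, not documented.
import json

with open("results/candidates.json") as f:
    candidates = json.load(f)

for cand in sorted(candidates, key=lambda c: c.get("train_accuracy", 0), reverse=True):
    print(cand["name"], cand.get("train_accuracy"), cand.get("val_accuracy"))
```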
Source Code
- Demo script: run_banking77_demo.py
- Task app: banking77_task_app.py
- Config: banking77_gepa_demo.toml