Polyglot Task App: Banking77
Overview
- End-to-end Banking77 intent classification Task App (Python + polyglot ports) wired for Synth prompt optimization.
- Clear Task App contract, dataset/reward, and the exact label-explicit prompt we optimized.
- Repro steps: run locally, tunnel, launch GEPA, and inspect the real results/logs.
What you’ll learn
- How Synth talks to a Task App (`/health`, `/task_info`, `/rollout`) and what `env.seed` / `policy.config` mean.
- How to constrain outputs to the Banking77 label set (langprobe-style prompt) and compute rewards deterministically.
- How to run locally, expose via Cloudflare Tunnel, launch GEPA, and read the returned best prompt.
Who this is for
Engineers and solution owners who need a production-style reference for prompt optimization against a concrete task, with reproducible commands and outputs.
How to use this guide
- Skim Sections 1–4 (concept, contract, dataset, prompt).
- Pick Python (reference) or another language (Section 5) and run locally (Section 6).
- Expose via tunnel and launch GEPA (Sections 7–8).
- Compare your run to the recorded results (Section 9) and adapt (Section 11).
1. Concept: how Synth uses your Task App
- Seeds: `env.seed` selects the dataset row. Optimizers sweep seeds to score a prompt.
- Policy config: `policy.config` carries the prompt and `inference_url` (the model endpoint).
- Your Task App: for each `/rollout`, load the sample → call the LLM → compute the reward → return `metrics.mean_return` in `[0, 1]`.
- Optimizer (MIPRO/GEPA): iterates on prompts based on the rewards you emit.
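The per-rollout loop above can be sketched as follows. This is a hedged sketch, not the actual `app.py` implementation: `run_rollout`, `call_llm`, and the simplified request shape are illustrative names, though the `env.seed` / `policy.config` keys and the `metrics.mean_return` return field mirror the contract described in this guide.

```python
def run_rollout(request: dict, dataset: list[dict], call_llm) -> dict:
    """One rollout: seed -> sample -> LLM call -> reward -> metrics."""
    seed = request["env"]["seed"]          # selects the dataset row
    cfg = request["policy"]["config"]      # carries the prompt (and inference_url)
    sample = dataset[seed % len(dataset)]

    prediction = call_llm(cfg["prompt"], sample["query"])

    # Case-insensitive exact match against the target label.
    reward = 1.0 if prediction.strip().lower() == sample["label"].lower() else 0.0
    return {"metrics": {"mean_return": reward}}
```

The optimizer only ever sees the reward you emit here, so keeping this function deterministic is what makes seed sweeps comparable across prompts.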
2. HTTP contract (inbound)
Auth: `X-API-Key` must equal `ENVIRONMENT_API_KEY` for `/task_info` and `/rollout` (no auth on `/health`).
Endpoints:
| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | /health | none | Liveness |
| GET | /task_info | X-API-Key | Describe task/dataset |
| POST | /rollout | X-API-Key | Run one rollout and return reward |
(The `prompt_template` in the `/rollout` request mirrors the langprobe baseline and the GEPA run.)
/rollout response (what you return):
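The only field the contract in this guide strictly requires is `metrics.mean_return` in `[0, 1]`; a minimal response body therefore looks like the sketch below (any additional fields your Task App returns are beyond this contract and not shown):

```json
{
  "metrics": {
    "mean_return": 1.0
  }
}
```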
3. Dataset & reward
- Data: `code/prompt-learning/polyglot/data/banking77.json` (a labels array plus a 99-sample slice).
- Reward: case-insensitive exact match; the rollout scores 1.0 if the model outputs a known label that equals the target, else 0.0. Pseudocode:
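A Python sketch of that rule (the actual helper lives in the Task App; this only mirrors the behavior described above):

```python
def compute_reward(prediction: str, target: str, labels: list[str]) -> float:
    """Case-insensitive exact match against the known Banking77 label set."""
    pred = prediction.strip().lower()
    known = {label.lower() for label in labels}
    # Score 1.0 only when the model emitted a known label equal to the target.
    if pred in known and pred == target.strip().lower():
        return 1.0
    return 0.0
```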
4. Prompt we optimized (langprobe-style)
- System: “You are an expert banking assistant… respond with exactly one intent label from the provided list…”
- User: Includes the customer query and the full Banking77 label list, then instructs: “Classify this query into one of the above banking intents… return only the label.”
- Deterministic: `temperature: 0`.
- Model: `gpt-4.1-nano` (no Groq in this pipeline).
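The prompt shape above can be sketched as a message builder. The exact wording lives in the Task App and the results file; `build_messages` and the string layout here are illustrative only.

```python
def build_messages(query: str, labels: list[str]) -> list[dict]:
    """Build the label-explicit system/user pair described above (sketch)."""
    system = (
        "You are an expert banking assistant. Respond with exactly one "
        "intent label from the provided list."
    )
    user = (
        f"Customer query: {query}\n\n"
        "Banking77 intents:\n"
        + "\n".join(f"- {label}" for label in labels)
        + "\n\nClassify this query into one of the above banking intents "
        "and return only the label."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

Enumerating the full label list in the user message is what makes the exact-match reward well-posed: the model is constrained to copy one of the listed strings.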
5. Repo layout (polyglot)
6. Run locally (Python reference)
7. Tunnel + env key (automated)
Script: `cookbooks/dev/tunnel_gepa_banking77/run_gepa_with_tunnel.sh`
- Generates `ENVIRONMENT_API_KEY` if absent and writes `.env.tunnel` (used by the backend).
- Deploys the local Task App via Cloudflare Tunnel (quick mode).
- Saves the tunnel URL in `.env.tunnel` as `TASK_APP_URL`.
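The key-generation step can be sketched like this. This is not the real `run_gepa_with_tunnel.sh`; the `openssl` invocation and file layout are assumptions that only illustrate the behavior listed above.

```shell
#!/usr/bin/env bash
set -euo pipefail
ENV_FILE=".env.tunnel"

# Generate ENVIRONMENT_API_KEY only if the caller did not already supply one.
if [ -z "${ENVIRONMENT_API_KEY:-}" ]; then
  ENVIRONMENT_API_KEY="$(openssl rand -hex 16)"
fi
printf 'ENVIRONMENT_API_KEY=%s\n' "$ENVIRONMENT_API_KEY" > "$ENV_FILE"

# The real script then starts cloudflared in quick mode and appends the
# resulting public URL to the same file as TASK_APP_URL=...
```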
8. Launch GEPA (prod backend)
- Backend: `https://agent-learning.onrender.com`
- Tunnel: `https://united-appointments-scholar-incl.trycloudflare.com`
- Job ID: `pl_6d3e035b37f04b67`
- Budget: 200 rollouts, 40 transformations, ~89 s, ~$0.05.
- Validation: baseline 0.6333 → best 0.7000 (Δ +0.0667).
- Results/logs: `cookbooks/dev/tunnel_gepa_banking77/results/` (e.g., `gepa_results_pl_6d3e035b37f04b67_20251124_164457.txt`).
9. Recorded outcomes (what we actually saw)
- Local seeds 0–9 with the label-constrained prompt: rewards `[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 0.0), (5, 1.0), (6, 0.0), (7, 1.0), (8, 1.0), (9, 1.0)]` → `mean_return` 0.80.
- GEPA run: baseline val 0.6333 → best val 0.7000. The best prompt text is in the results file above (label-only, tool-style instructions).
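The 0.80 figure follows directly from those per-seed rewards (8 hits out of 10 seeds):

```python
# Per-seed (seed, reward) pairs from the local run above.
rewards = [(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 0.0),
           (5, 1.0), (6, 0.0), (7, 1.0), (8, 1.0), (9, 1.0)]
mean_return = sum(r for _, r in rewards) / len(rewards)  # 8 / 10 = 0.8
```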
10. Job payload (copy/paste)
Use your tunnel URL and the label-explicit prompt. (Replace `<full Banking77 label list...>` with the labels array from `data/banking77.json`.)
11. Adapt this to your task
- Swap `data/banking77.json` for your dataset; adjust the loader per language.
- Change `compute_reward` to your metric (similarity, numeric score, etc.).
- Keep the contract stable (`/health`, `/task_info`, `/rollout`, `metrics.mean_return ∈ [0, 1]`, `X-API-Key`).
- Tune `inference_url` for model/temperature/max_tokens as needed.
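For instance, the exact-match rule could be swapped for a graded similarity reward. A sketch using the Python stdlib's `difflib`; this replacement `compute_reward` is hypothetical, not part of the shipped Task App:

```python
from difflib import SequenceMatcher

def compute_reward(prediction: str, target: str) -> float:
    """Graded reward: string similarity in [0, 1] instead of exact match."""
    pred = prediction.strip().lower()
    tgt = target.strip().lower()
    return SequenceMatcher(None, pred, tgt).ratio()
```

Any replacement must still land in `[0, 1]`, since that is what the optimizer reads from `metrics.mean_return`.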
12. Production notes
- Logging: request ID, seed, model, latency, success/failure; never log secrets.
- Metrics: request/error rate, rollout latency (p50/p95/p99), mean reward trend.
- Security: rotate `ENVIRONMENT_API_KEY`; keep LLM keys in secret storage; rate-limit at the tunnel edge.
- Reliability: set timeouts on LLM calls; return reward 0.0 on hard failures; distinguish 4xx from 5xx.
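The "reward = 0 on hard failures" rule can be sketched as a small wrapper. The names here (`safe_reward`, `call_llm`, `score`) are illustrative; the point is that transport errors and timeouts degrade to a zero reward instead of surfacing as a 5xx from `/rollout`.

```python
def safe_reward(call_llm, prompt: str, query: str, score) -> float:
    """Return 0.0 on any hard failure rather than propagating the exception."""
    try:
        # call_llm should enforce its own request timeout.
        prediction = call_llm(prompt, query)
    except Exception:
        return 0.0
    return score(prediction)
```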
13. Files to check
- Cookbook page: `monorepo/docs/prompt-learning/polyglot.mdx` (this file).
- Python Task App: `cookbooks/code/prompt-learning/polyglot/python/app.py`.
- GEPA script/results: `cookbooks/dev/tunnel_gepa_banking77/`.