
Polyglot Task App: Banking77

Overview

  • End-to-end Banking77 intent classification Task App (Python + polyglot ports) wired for Synth prompt optimization.
  • Clear Task App contract, dataset/reward, and the exact label-explicit prompt we optimized.
  • Repro steps: run locally, tunnel, launch GEPA, and inspect the real results/logs.

What you’ll learn

  • How Synth talks to a Task App (/health, /task_info, /rollout) and what env.seed/policy.config mean.
  • How to constrain outputs to the Banking77 label set (langprobe-style prompt) and compute rewards deterministically.
  • How to run locally, expose via Cloudflare Tunnel, launch GEPA, and read the returned best prompt.

Who this is for

Engineers and solution owners who need a production-style reference for prompt optimization against a concrete task with reproducible commands and outputs.

How to use this guide

  1. Skim Sections 1–4 (concept, contract, dataset, prompt).
  2. Pick Python (reference) or another language (Section 5) and run locally (Section 6).
  3. Expose via tunnel and launch GEPA (Sections 7–8).
  4. Compare your run to the recorded results (Section 9) and adapt (Section 11).

1. Concept: how Synth uses your Task App

  • Seeds: env.seed selects the dataset row. Optimizers sweep seeds to score a prompt.
  • Policy config: policy.config carries the prompt and inference_url (model endpoint).
  • Your Task App: For each /rollout, load sample → call LLM → compute reward → return metrics.mean_return in [0,1].
  • Optimizer (MIPRO/GEPA): Iterates on prompts based on the rewards you emit.
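The per-rollout flow above can be sketched in plain Python. Everything here is illustrative (`DATASET`, `call_llm`, and `rollout` are stand-in names, not the real handler in python/app.py):

```python
# Sketch of one /rollout: seed selects the row, policy.config carries the
# prompt, and the handler returns a reward in [0, 1].
DATASET = [{"text": "How do I activate my card?", "label": "activate_my_card"}]

def call_llm(prompt: str, inference_url: str) -> str:
    # Stand-in for the real chat-completion call to inference_url.
    return "activate_my_card"

def rollout(env: dict, policy: dict) -> dict:
    sample = DATASET[env["seed"] % len(DATASET)]      # env.seed selects the row
    prompt = policy["config"]["prompt_template"].replace("{{text}}", sample["text"])
    predicted = call_llm(prompt, policy["config"]["inference_url"])
    reward = 1.0 if predicted.strip().lower() == sample["label"] else 0.0
    return {
        "metrics": {"mean_return": reward},           # must land in [0, 1]
        "trajectories": [{"steps": [{"reward": reward}]}],
    }
```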

2. HTTP contract (inbound)

Auth: X-API-Key must equal ENVIRONMENT_API_KEY for /task_info and /rollout (no auth on /health).
Endpoints:
Method  Path        Auth       Purpose
------  ----------  ---------  ---------------------------------
GET     /health     none       Liveness
GET     /task_info  X-API-Key  Describe task/dataset
POST    /rollout    X-API-Key  Run one rollout and return reward
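A minimal key-check helper could look like this (a sketch; the shipped apps may compare differently). `hmac.compare_digest` avoids timing side channels:

```python
import hmac
import os
from typing import Optional

def check_api_key(header_value: Optional[str]) -> bool:
    """Constant-time check of the X-API-Key header against ENVIRONMENT_API_KEY."""
    expected = os.environ.get("ENVIRONMENT_API_KEY", "")
    if not expected or header_value is None:
        return False
    return hmac.compare_digest(header_value, expected)
```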
/rollout request (what Synth sends):
{
  "env": { "seed": 0 },
  "policy": {
    "config": {
      "prompt_template": "You are an expert banking assistant that classifies customer queries into banking intents. Given a customer message, respond with exactly one intent label from the provided list using the `banking77_classify` tool.\n\nCustomer Query: {{text}}\n\nAvailable Intents:\n<full Banking77 label list from data/banking77.json labels>\n\nClassify this query into one of the above banking intents using the tool call, and return only the label.",
      "inference_url": "https://api.openai.com/v1?model=gpt-4.1-nano",
      "model": "gpt-4.1-nano"
    }
  }
}
(The Python Task App also injects the full label list automatically; providing it in prompt_template mirrors the langprobe baseline and the GEPA run.)

/rollout response (what you return):
{
  "metrics": { "mean_return": 1.0 },
  "trajectories": [ { "steps": [ { "reward": 1.0 } ] } ]
}
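Assembling a request body that matches this contract is mechanical; here is a sketch (the `{{text}}` placeholder stays in the template, because the Task App fills it from the dataset row that env.seed selects):

```python
import json

def build_rollout_request(seed: int, prompt_template: str,
                          inference_url: str, model: str) -> str:
    """Assemble a /rollout request body matching the contract above."""
    body = {
        "env": {"seed": seed},
        "policy": {"config": {
            "prompt_template": prompt_template,   # keeps the {{text}} slot
            "inference_url": inference_url,
            "model": model,
        }},
    }
    return json.dumps(body)
```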

3. Dataset & reward

  • Data: code/prompt-learning/polyglot/data/banking77.json (labels array + 99-sample slice).
  • Reward: case-insensitive matching. Full credit if the target label appears in the output; if a different known label appears instead, the rollout scores 0.0; otherwise fall back to comparing the first token. Runnable version (LABEL_SET is the lowercased labels array):
def compute_reward(predicted: str, label: str) -> float:
    pred = predicted.lower().strip()
    target = label.lower().strip()
    if target in pred:
        return 1.0
    # A different known label in the output means a wrong classification.
    for lbl in LABEL_SET:
        if lbl in pred:
            return 1.0 if lbl == target else 0.0
    # Fall back to the first token; guard against empty output.
    tokens = pred.split()
    return 1.0 if tokens and tokens[0] == target else 0.0

4. Prompt we optimized (langprobe-style)

  • System: “You are an expert banking assistant… respond with exactly one intent label from the provided list…”
  • User: Includes the customer query and the full Banking77 label list, then instructs: “Classify this query into one of the above banking intents… return only the label.”
  • Deterministic: temperature: 0.
  • Model: gpt-4.1-nano (no Groq in this pipeline).

5. Repo layout (polyglot)

code/prompt-learning/polyglot
├── data/banking77.json
├── python/        # FastAPI Task App (reference)
├── rust/          # Axum
├── go/            # net/http
├── typescript/    # Hono
└── zig/           # Zig stdlib
Python is the reference; other languages share the same contract and reward logic.

6. Run locally (Python reference)

cd cookbooks/code/prompt-learning/polyglot/python
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
ENVIRONMENT_API_KEY=demo OPENAI_API_KEY=sk-... PORT=8010 \
  uvicorn app:app --host 0.0.0.0 --port 8010
# Smoke
curl http://localhost:8010/health
curl -H "X-API-Key: demo" http://localhost:8010/task_info
curl -H "X-API-Key: demo" -H 'Content-Type: application/json' \
  -d '{"env":{"seed":0},"policy":{"config":{"prompt_template":"<label-list prompt above>","inference_url":"https://api.openai.com/v1?model=gpt-4.1-nano"}}}' \
  http://localhost:8010/rollout

7. Tunnel + env key (automated)

Script: cookbooks/dev/tunnel_gepa_banking77/run_gepa_with_tunnel.sh
  • Generates ENVIRONMENT_API_KEY if absent and writes .env.tunnel (used by backend).
  • Deploys the local Task App via Cloudflare Tunnel (quick mode).
  • Saves the tunnel URL in .env.tunnel as TASK_APP_URL.

8. Launch GEPA (prod backend)

cd cookbooks/dev/tunnel_gepa_banking77
./run_gepa_with_tunnel.sh
Key run (recorded):
  • Backend: https://agent-learning.onrender.com
  • Tunnel: https://united-appointments-scholar-incl.trycloudflare.com
  • Job ID: pl_6d3e035b37f04b67
  • Budget: 200 rollouts, 40 transformations, ~89s, ~$0.05.
  • Validation: baseline 0.6333 → best 0.7000 (Δ +0.0667).
  • Results/logs: cookbooks/dev/tunnel_gepa_banking77/results/ (e.g., gepa_results_pl_6d3e035b37f04b67_20251124_164457.txt).

9. Recorded outcomes (what we actually saw)

  • Local seeds 0–9 with the label-constrained prompt: rewards [(0,1.0),(1,1.0),(2,1.0),(3,1.0),(4,0.0),(5,1.0),(6,0.0),(7,1.0),(8,1.0),(9,1.0)] → mean_return 0.80.
  • GEPA run: baseline val 0.6333 → best val 0.7000. Best prompt text is in the results file above (label-only, tool-style instructions).
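The local mean_return reported above is just the average of the per-seed rewards:

```python
# Seeds 0-9 with the label-constrained prompt (from the recorded run).
rewards = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
mean_return = sum(rewards) / len(rewards)
print(mean_return)  # 0.8
```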

10. Job payload (copy/paste)

Use your tunnel URL and the label-explicit prompt:
{
  "env": { "seed": 0 },
  "policy": {
    "config": {
      "prompt_template": "You are an expert banking assistant that classifies customer queries into banking intents. Given a customer message, respond with exactly one intent label from the provided list using the `banking77_classify` tool.\n\nCustomer Query: {{text}}\n\nAvailable Intents:\n<full Banking77 label list from data/banking77.json labels>\n\nClassify this query into one of the above banking intents using the tool call, and return only the label.",
      "inference_url": "https://api.openai.com/v1?model=gpt-4.1-nano",
      "model": "gpt-4.1-nano"
    }
  }
}
(Replace <full Banking77 label list...> with the labels array from data/banking77.json.)
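That substitution can be scripted; a sketch (the two labels in the usage note are illustrative, load the real labels array from data/banking77.json):

```python
def render_label_prompt(template: str, labels: list) -> str:
    """Splice the label list into the placeholder used in the payload above."""
    label_block = "\n".join(labels)
    return template.replace(
        "<full Banking77 label list from data/banking77.json labels>", label_block
    )
```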

11. Adapt this to your task

  1. Swap data/banking77.json for your dataset; adjust loader per language.
  2. Change compute_reward to your metric (similarity, numeric score, etc.).
  3. Keep the contract stable (/health, /task_info, /rollout, metrics.mean_return ∈ [0,1], X-API-Key).
  4. Tune inference_url for model/temperature/max_tokens as needed.
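For step 2, a graded similarity reward could look like this (a sketch using stdlib difflib; swap in whatever metric fits your task, as long as the result stays in [0, 1]):

```python
from difflib import SequenceMatcher

def compute_reward(predicted: str, target: str) -> float:
    """Graded reward in [0, 1] from string similarity; exact match scores 1.0."""
    ratio = SequenceMatcher(None, predicted.lower().strip(),
                            target.lower().strip()).ratio()
    return max(0.0, min(1.0, ratio))
```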

12. Production notes

  • Logging: request ID, seed, model, latency, success/failure; never log secrets.
  • Metrics: request/error rate, rollout latency (p50/p95/p99), mean reward trend.
  • Security: rotate ENVIRONMENT_API_KEY; keep LLM keys in secret storage; rate-limit at the tunnel edge.
  • Reliability: timeouts on LLM calls; reward=0 on hard failures; distinguish 4xx vs 5xx.
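The reward=0-on-hard-failure pattern from the last bullet, sketched (hypothetical helper; set the actual timeout on your HTTP client inside `llm_call`, e.g. requests' `timeout=` parameter):

```python
def safe_reward(llm_call, compute_reward) -> float:
    """Wrap the LLM call so hard failures yield reward 0.0 instead of a 5xx."""
    try:
        predicted = llm_call()
    except Exception:
        # Timeout, network error, or provider failure: score the rollout 0.
        return 0.0
    return compute_reward(predicted)
```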

13. Files to check

  • Cookbook page: monorepo/docs/prompt-learning/polyglot.mdx (this file).
  • Python Task App: cookbooks/code/prompt-learning/polyglot/python/app.py.
  • GEPA script/results: cookbooks/dev/tunnel_gepa_banking77/.