This walkthrough demonstrates how to run GEPA optimization on Banking77 using a deployed task app via Cloudflare Tunnel. You’ll see exactly what commands to run, what output to expect, and how to retrieve your optimized prompts.

Prerequisites

  • SYNTH_API_KEY in .env (for backend authentication)
  • GROQ_API_KEY in .env (for policy model inference)
  • uv installed (for running Python commands)
  • cloudflared binary (will be auto-installed if missing)
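A minimal `.env` might look like the following (placeholder values, not real keys — substitute your own):

```
# .env — placeholder values; replace with your actual keys
SYNTH_API_KEY=sk_synth_xxxxxxxxxxxxxxxx
GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxx
```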

Quick Start

Run the interactive script from the synth-ai repository:
cd walkthroughs/gepa/deployed
bash commands.sh
View the script: commands.sh. The script guides you through each step interactively; below is what happens at each stage.

Step-by-Step Walkthrough

Step 1: Generate ENVIRONMENT_API_KEY

What happens: The script generates a new API key for authenticating with the task app and registers it with the backend. Command executed:
ENV_KEY=$(uv run python -c "from synth_ai.learning.secrets import mint_environment_api_key; print(mint_environment_api_key())" 2>&1 | tail -1 | tr -d '\n' | tr -d '\r')
echo "ENVIRONMENT_API_KEY=$ENV_KEY" > /tmp/gepa_walkthrough/cli_env.txt
echo "TASK_APP_URL=" >> /tmp/gepa_walkthrough/cli_env.txt
Expected output:
✓ ENVIRONMENT_API_KEY generated
Key: 4b49d56ce9f3c02...
✅ Key registered with backend
What you’ll see: The script displays the first 20 characters of the generated key and confirms backend registration.

Step 2: Deploy with Python SDK

What happens: The script starts the Banking77 task app with a Cloudflare tunnel using the Python SDK. Python code:
import asyncio

from synth_ai.sdk import InProcessTaskApp
from examples.task_apps.banking77.banking77_task_app import app

async def main() -> None:
    async with InProcessTaskApp(
        app=app,
        port=8102,
        tunnel_mode="quick",
    ) as task_app:
        print(f"Task app running at: {task_app.url}")
        # Continue with optimization...

asyncio.run(main())
Expected output:
Starting task app...
INFO:     Started server process
INFO:     Uvicorn running on http://127.0.0.1:8102
✅ Task app running at: https://criteria-chains-incomplete-others.trycloudflare.com
What you’ll see: The task app starts with an automatic Cloudflare tunnel. The tunnel URL is available via task_app.url.
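Before moving on, you can sanity-check that the tunnel URL answers HTTP requests. This helper is a sketch, not part of the SDK; it treats any HTTP response (even an error status) as proof that the tunnel is up:

```python
import urllib.request
import urllib.error

def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL produces any HTTP response within the timeout."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The server answered (even with a 4xx/5xx), so the tunnel is up
        return True
    except Exception:
        return False
```

Call it with your tunnel URL, e.g. `is_reachable(task_app.url)`.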

Step 3: Extract Tunnel URL

What happens: The script reads the TASK_APP_URL that was written by the deploy command. Command executed:
TASK_URL=$(grep "^TASK_APP_URL=" /tmp/gepa_walkthrough/cli_env.txt | cut -d"=" -f2- | tr -d '"' | tr -d "'" | tr -d '\n' | tr -d '\r')
Expected output:
✓ Tunnel URL extracted: https://criteria-chains-incomplete-others.trycloudflare.com
What you’ll see: The tunnel URL is displayed. This is the public URL where your task app is accessible.
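To see the extraction pipeline in isolation, here is a self-contained demo against a throwaway file (the path and URL are illustrative):

```shell
mkdir -p /tmp/gepa_demo
printf 'ENVIRONMENT_API_KEY=abc123\nTASK_APP_URL="https://example.trycloudflare.com"\n' \
  > /tmp/gepa_demo/cli_env.txt

# Same grep/cut/tr pipeline as the script, pointed at the demo file
TASK_URL=$(grep "^TASK_APP_URL=" /tmp/gepa_demo/cli_env.txt \
  | cut -d"=" -f2- | tr -d '"' | tr -d "'" | tr -d '\n' | tr -d '\r')
echo "$TASK_URL"
```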

Step 4: Create GEPA Config

What happens: The script updates the base TOML config to use the tunnel URL and sets rollout budget to 2000 (sufficient for prompt improvement). Command executed:
cat examples/blog_posts/langprobe/task_specific/banking77/banking77_gepa.toml | \
  sed "s|task_app_url = \".*\"|task_app_url = \"$TASK_URL\"|" | \
  sed "s|budget = .*|budget = 2000|" > /tmp/gepa_walkthrough/banking77_gepa_prod.toml
Expected output:
✓ Config created: /tmp/gepa_walkthrough/banking77_gepa_prod.toml
What you’ll see: A new config file is created with your tunnel URL and increased budget.
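The two `sed` substitutions can be tried on a tiny stand-in config (field names match the base TOML; the file path is illustrative):

```shell
printf 'task_app_url = "http://127.0.0.1:8102"\nbudget = 100\n' > /tmp/gepa_demo_base.toml

sed -e 's|task_app_url = ".*"|task_app_url = "https://example.trycloudflare.com"|' \
    -e 's|budget = .*|budget = 2000|' /tmp/gepa_demo_base.toml
```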

Step 5: Run GEPA Training

What happens: The script submits the GEPA optimization job and polls for completion. Python code:
import os
from synth_ai.sdk import PromptLearningClient

async def run_training(config_path: str):
    client = PromptLearningClient(api_key=os.environ["SYNTH_API_KEY"])

    # Create and start the GEPA job from the TOML config
    job = await client.create_job_from_toml(config_path)
    await client.start_job(job["id"])

    # Poll until the job reaches a terminal state
    result = await client.poll_until_terminal(job["id"])
    return result
Expected output during training: You’ll see task app logs showing rollouts being processed. Here are examples from an actual run: Successful prediction:
[TASK_APP] INBOUND_ROLLOUT: run_id=prompt-learning-78-3d187122 seed=78 env=banking77
[TASK_APP] PROXY ROUTING with API key: sk_env_30c78...f263 (len=39)
[TASK_APP] OUTBOUND: model=llama-3.1-8b-instant temp=0.0 max=512 tools=1
[TASK_APP] RESPONSE_STATUS: 200
[TASK_APP] PREDICTION: expected=card_arrival predicted=card_arrival correct=True
[BANKING77_ROLLOUT] run_id=prompt-learning-78-3d187122 reward=1.0
INFO:     74.220.49.253:0 - "POST /rollout HTTP/1.1" 200 OK
Failed prediction (showing a case where the model returned all intents instead of one):
[TASK_APP] INBOUND_ROLLOUT: run_id=prompt-learning-77-e727cacd seed=77 env=banking77
[TASK_APP] RESPONSE_STATUS: 200
[TASK_APP] PREDICTION: expected=card_arrival predicted=card_about_to_expire card_arrival card_delivery_estimate... (all 77 intents) correct=False
[BANKING77_ROLLOUT] run_id=prompt-learning-77-e727cacd reward=0.0
INFO:     74.220.49.253:0 - "POST /rollout HTTP/1.1" 200 OK
You’ll see both correct and incorrect predictions as GEPA tests different prompt variations. The optimizer learns from these results to improve the prompts. Progress updates:
[18:51:11] Progress: 100% complete
[18:51:22] Validation Summary:
  Baseline: 0.5667
  N=2
  Candidate 1: 0.7667
  Candidate 2: 0.6333
[18:51:22] prompt.learning.optimized.scored (info): optimized[0] train_accuracy=0.375 len=636 N=6 val_accuracy=0.767
[18:51:22] prompt.learning.optimized.scored (info): optimized[1] train_accuracy=0.625 len=725 N=5 val_accuracy=0.633
[18:51:22] prompt.learning.optimized.scored (info): optimized[2] train_accuracy=0.7878787878787878 len=763 N=32
[18:51:22] prompt.learning.results.summary (info): Results: best_score=0.8125 tried=20 frontier=6
[18:51:22] prompt.learning.best.prompt (info): Best prompt (validation) score=0.8125
[18:51:22] prompt.learning.gepa.complete (info): GEPA optimisation complete — best_score=0.8125
[18:51:37] prompt.learning.completed (info): Prompt learning job completed — billed $0.14 ($0.14 sandbox + $0.00 tokens) | best_score=0.8125
Final summary:
Final status: succeeded
{
  "job_id": "pl_320a080971124f48",
  "status": "succeeded",
  "created_at": "2025-11-21T02:47:03.981174+00:00",
  "started_at": "2025-11-21T02:47:04.690663+00:00",
  "finished_at": "2025-11-21T02:51:23.30118+00:00"
}

================================================================================
FINAL SUMMARY
================================================================================
       Cost Policy: $0.0000 | Proposal: $0.0000 | Total: $0.1433
   Rollouts N: 458 | Tokens: 0.0000M
 Throughput Rollouts: 85.6/min
       Time 257.9s
Candidate 1 Accuracy: 0.7667 (Δ+0.2000 vs baseline)
================================================================================

📄 Results saved locally to: /private/tmp/gepa_walkthrough/results/gepa_results_pl_320a080971124f48_20251120_185144.txt
📋 Verbose log saved locally to: /private/tmp/gepa_walkthrough/results/gepa_log_pl_320a080971124f48_20251120_185144.log
What you’ll see:
  • Real-time rollout processing logs showing individual predictions (correct/incorrect)
  • Progress updates showing completion percentage
  • Validation summaries with candidate scores (baseline vs optimized)
  • Final job status with best score (in this example: 81.25% accuracy, up from 56.67% baseline - a +24.58 percentage point improvement)
  • Cost breakdown: $0.14 total cost for 458 rollouts
  • Throughput statistics: 85.6 rollouts/minute
  • Total time: ~4.3 minutes
  • Location of saved results files
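Since the verbose log records one `PREDICTION: ... correct=True/False` line per rollout (as in the sample output above), a quick sketch like this can tally accuracy from a saved log. The line format is assumed from the samples shown here; adjust the markers if your log differs:

```python
def accuracy_from_log(lines):
    """Tally correct=True vs correct=False across PREDICTION lines."""
    correct = total = 0
    for line in lines:
        if "PREDICTION:" not in line:
            continue
        total += 1
        if "correct=True" in line:
            correct += 1
    return correct / total if total else 0.0

# Two sample lines mirroring the log output above
sample = [
    "[TASK_APP] PREDICTION: expected=card_arrival predicted=card_arrival correct=True",
    "[TASK_APP] PREDICTION: expected=card_arrival predicted=card_delivery_estimate correct=False",
]
print(accuracy_from_log(sample))  # → 0.5
```

To use it on a real run, pass `open(log_path)` instead of the sample list.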
Key results from this run:
  • Baseline accuracy: 56.67%
  • Best optimized prompt: 81.25% accuracy
  • Improvement: +24.58 percentage points (+43% relative improvement)
  • Top candidate: 76.67% accuracy (+20% vs baseline)
  • Cost: $0.14 for complete optimization
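The improvement figures are straightforward to verify (numbers taken from the run above):

```python
baseline, best = 0.5667, 0.8125

points = (best - baseline) * 100          # absolute gain in percentage points
relative = (best - baseline) / baseline   # relative gain over baseline

print(f"+{points:.2f} pp, +{relative:.0%} relative")  # → +24.58 pp, +43% relative
```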

Retrieving Optimized Prompts

After training completes, retrieve the optimized prompts using the job ID:
import asyncio
import os
from synth_ai.sdk import PromptLearningClient

async def get_results():
    job_id = 'pl_320a080971124f48'  # Use your job ID from the output above
    client = PromptLearningClient(api_key=os.environ["SYNTH_API_KEY"])
    prompts = await client.get_prompts(job_id)

    print(f"Best score: {prompts.best_score}")
    print(f"Total candidates: {len(prompts.attempted_candidates)}")
    print(f"\nBest prompt:\n{prompts.best_prompt}")

asyncio.run(get_results())

Files Created During Execution

  • /tmp/gepa_walkthrough/cli_env.txt - Environment file with API key and tunnel URL
  • /tmp/gepa_walkthrough/banking77_gepa_prod.toml - GEPA config with tunnel URL (generated from base config)
  • /tmp/gepa_walkthrough/results/ - Results directory with logs and outputs

Troubleshooting

  • Port 8102 in use: The script automatically kills existing processes, but if issues persist, manually kill them: lsof -ti :8102 | xargs kill -9
  • Tunnel fails: Check that cloudflared is installed and network connectivity is working. The script waits 25 seconds for tunnel establishment.
  • API key errors: Ensure SYNTH_API_KEY is set in your .env file
  • Job fails with trace registration error: This is a known backend issue. The script completes successfully, but the job may fail during execution. Check backend logs for details.
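A quick way to check whether `cloudflared` is on your PATH before running the script (a generic shell check, not part of the walkthrough script itself):

```shell
if command -v cloudflared >/dev/null 2>&1; then
  echo "cloudflared found: $(command -v cloudflared)"
else
  echo "cloudflared not found; the script will attempt to auto-install it"
fi
```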

Next Steps

  • Review the optimized prompts in the results file
  • Compare different candidates’ performance
  • Adjust the rollout budget or number of generations in the config for different optimization runs
  • Try the in-process walkthrough for a fully automated approach