This walkthrough demonstrates how to run GEPA optimization on Banking77 using a deployed task app via Cloudflare Tunnel. You’ll see exactly what commands to run, what output to expect, and how to retrieve your optimized prompts.

Prerequisites

  • SYNTH_API_KEY in .env (for backend authentication)
  • GROQ_API_KEY in .env (for policy model inference)
  • uv installed (for running Python commands)
  • cloudflared binary (will be auto-installed if missing)

Quick Start

Run the interactive script from the synth-ai repository:
cd walkthroughs/gepa/deployed
bash commands.sh
View the script: commands.sh. The script guides you through each step interactively; below is what happens at each stage.

Step-by-Step Walkthrough

Step 1: Generate ENVIRONMENT_API_KEY

What happens: The script generates a new API key for authenticating with the task app and registers it with the backend. Command executed:
ENV_KEY=$(uv run python -c "from synth_ai.learning.rl.secrets import mint_environment_api_key; print(mint_environment_api_key())" 2>&1 | tail -1 | tr -d '\n' | tr -d '\r')
echo "ENVIRONMENT_API_KEY=$ENV_KEY" > /tmp/gepa_walkthrough/cli_env.txt
echo "TASK_APP_URL=" >> /tmp/gepa_walkthrough/cli_env.txt
Expected output:
✓ ENVIRONMENT_API_KEY generated
Key: 4b49d56ce9f3c02...
✅ Key registered with backend
What you’ll see: The script displays the first 20 characters of the generated key and confirms backend registration.
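The bootstrap step can be sketched in Python. This is a minimal sketch, not the script's actual implementation: `secrets.token_hex` stands in for `mint_environment_api_key`, the `write_cli_env` helper name is hypothetical, and the real script additionally registers the key with the backend.

```python
import secrets
from pathlib import Path


def write_cli_env(env_dir: str) -> Path:
    """Mirror the shell bootstrap: mint a key and seed the env file.

    secrets.token_hex stands in for synth_ai's mint_environment_api_key;
    the real script also registers the key with the backend.
    """
    env_file = Path(env_dir) / "cli_env.txt"
    env_file.parent.mkdir(parents=True, exist_ok=True)
    key = secrets.token_hex(32)  # stand-in for the minted key
    # TASK_APP_URL starts empty; the deploy step fills it in later.
    env_file.write_text(f"ENVIRONMENT_API_KEY={key}\nTASK_APP_URL=\n")
    return env_file
```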

Step 2: Deploy Cloudflare Tunnel

What happens: The script kills any existing processes on port 8102, then starts the Banking77 task app locally and creates a Cloudflare tunnel to expose it publicly. Commands executed:
pkill -f "cloudflared.*8102" 2>/dev/null || true
pkill -f "uvicorn.*8102" 2>/dev/null || true
lsof -ti :8102 2>/dev/null | xargs kill -9 2>/dev/null || true
sleep 2
uv run synth-ai deploy tunnel examples/task_apps/banking77/banking77_task_app.py --tunnel-mode quick --port 8102 --env /tmp/gepa_walkthrough/cli_env.txt &
sleep 25
Expected output:
Starting tunnel deployment in background...
Task app path: /path/to/examples/task_apps/banking77/banking77_task_app.py
Waiting for tunnel to establish...
What you’ll see: The deploy command runs in the background. After ~25 seconds, the tunnel URL is written to /tmp/gepa_walkthrough/cli_env.txt. You’ll see Cloudflare tunnel logs indicating the tunnel is ready. Task app logs (example):
INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8102
[TUNNEL] Cloudflare tunnel established: https://criteria-chains-incomplete-others.trycloudflare.com

Step 3: Extract Tunnel URL

What happens: The script reads the TASK_APP_URL that was written by the deploy command. Command executed:
TASK_URL=$(grep "^TASK_APP_URL=" /tmp/gepa_walkthrough/cli_env.txt | cut -d"=" -f2- | tr -d '"' | tr -d "'" | tr -d '\n' | tr -d '\r')
Expected output:
✓ Tunnel URL extracted: https://criteria-chains-incomplete-others.trycloudflare.com
What you’ll see: The tunnel URL is displayed. This is the public URL where your task app is accessible.
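The grep/cut/tr pipeline above can be expressed in Python if you prefer to parse the env file programmatically (the helper name here is hypothetical, not part of synth-ai):

```python
def extract_task_app_url(env_text: str) -> str:
    """Python equivalent of the grep/cut/tr pipeline: find the
    TASK_APP_URL line, keep everything after the first '=', then
    strip surrounding quotes and stray CR/LF characters."""
    for line in env_text.splitlines():
        if line.startswith("TASK_APP_URL="):
            value = line.split("=", 1)[1]
            return value.strip().strip("'\"")
    return ""
```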

Step 4: Create GEPA Config

What happens: The script updates the base TOML config to use the tunnel URL and sets rollout budget to 2000 (sufficient for prompt improvement). Command executed:
cat examples/blog_posts/langprobe/task_specific/banking77/banking77_gepa.toml | \
  sed "s|task_app_url = \".*\"|task_app_url = \"$TASK_URL\"|" | \
  sed "s|budget = .*|budget = 2000|" > /tmp/gepa_walkthrough/banking77_gepa_prod.toml
Expected output:
✓ Config created: /tmp/gepa_walkthrough/banking77_gepa_prod.toml
What you’ll see: A new config file is created with your tunnel URL and increased budget.
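The two sed substitutions can also be sketched in Python. This mirrors the shell pipeline under the same assumption it makes: the base TOML contains exactly one `task_app_url = "..."` line and one `budget = ...` line (the function name is hypothetical):

```python
import re


def patch_gepa_config(base_toml: str, task_url: str, budget: int = 2000) -> str:
    """Mirror the sed substitutions: point the config at the tunnel
    URL and raise the rollout budget."""
    out = re.sub(r'task_app_url = ".*"', f'task_app_url = "{task_url}"', base_toml)
    out = re.sub(r"budget = .*", f"budget = {budget}", out)
    return out
```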

Step 5: Run GEPA Training

What happens: The script submits the GEPA optimization job to the production backend and polls for completion. Command executed:
export BACKEND_BASE_URL="https://agent-learning.onrender.com"
uv run synth-ai train /tmp/gepa_walkthrough/banking77_gepa_prod.toml --backend "$BACKEND_BASE_URL" --env /tmp/gepa_walkthrough/cli_env.txt --poll
Expected output during training: you’ll see task app logs showing rollouts being processed. Below are examples from an actual run.
Successful prediction:
[TASK_APP] INBOUND_ROLLOUT: run_id=prompt-learning-78-3d187122 seed=78 env=banking77
[TASK_APP] PROXY ROUTING with API key: sk_env_30c78...f263 (len=39)
[TASK_APP] OUTBOUND: model=llama-3.1-8b-instant temp=0.0 max=512 tools=1
[TASK_APP] RESPONSE_STATUS: 200
[TASK_APP] PREDICTION: expected=card_arrival predicted=card_arrival correct=True
[BANKING77_ROLLOUT] run_id=prompt-learning-78-3d187122 reward=1.0
INFO:     74.220.49.253:0 - "POST /rollout HTTP/1.1" 200 OK
Failed prediction (showing a case where the model returned all intents instead of one):
[TASK_APP] INBOUND_ROLLOUT: run_id=prompt-learning-77-e727cacd seed=77 env=banking77
[TASK_APP] RESPONSE_STATUS: 200
[TASK_APP] PREDICTION: expected=card_arrival predicted=card_about_to_expire card_arrival card_delivery_estimate... (all 77 intents) correct=False
[BANKING77_ROLLOUT] run_id=prompt-learning-77-e727cacd reward=0.0
INFO:     74.220.49.253:0 - "POST /rollout HTTP/1.1" 200 OK
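The scoring visible in these logs is a simple exact match on the predicted intent. A minimal sketch of that pattern (the task app's actual implementation may differ):

```python
def banking77_reward(expected: str, predicted: str) -> float:
    """Exact-match reward as seen in the rollout logs: 1.0 only when
    the predicted intent equals the expected label, so a response that
    lists every intent scores 0.0."""
    return 1.0 if predicted.strip() == expected.strip() else 0.0
```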
You’ll see both correct and incorrect predictions as GEPA tests different prompt variations. The optimizer learns from these results to improve the prompts. Progress updates:
[18:51:11] Progress: 100% complete
[18:51:22] Validation Summary:
  Baseline: 0.5667
  N=2
  Candidate 1: 0.7667
  Candidate 2: 0.6333
[18:51:22] prompt.learning.optimized.scored (info): optimized[0] train_accuracy=0.375 len=636 N=6 val_accuracy=0.767
[18:51:22] prompt.learning.optimized.scored (info): optimized[1] train_accuracy=0.625 len=725 N=5 val_accuracy=0.633
[18:51:22] prompt.learning.optimized.scored (info): optimized[2] train_accuracy=0.7878787878787878 len=763 N=32
[18:51:22] prompt.learning.results.summary (info): Results: best_score=0.8125 tried=20 frontier=6
[18:51:22] prompt.learning.best.prompt (info): Best prompt (validation) score=0.8125
[18:51:22] prompt.learning.gepa.complete (info): GEPA optimisation complete — best_score=0.8125
[18:51:37] prompt.learning.completed (info): Prompt learning job completed — billed $0.14 ($0.14 sandbox + $0.00 tokens) | best_score=0.8125
Final summary:
Final status: succeeded
{
  "job_id": "pl_320a080971124f48",
  "status": "succeeded",
  "created_at": "2025-11-21T02:47:03.981174+00:00",
  "started_at": "2025-11-21T02:47:04.690663+00:00",
  "finished_at": "2025-11-21T02:51:23.30118+00:00"
}

================================================================================
FINAL SUMMARY
================================================================================
       Cost Policy: $0.0000 | Proposal: $0.0000 | Total: $0.1433
   Rollouts N: 458 | Tokens: 0.0000M
 Throughput Rollouts: 85.6/min
       Time 257.9s
Candidate 1 Accuracy: 0.7667 (Δ+0.2000 vs baseline)
================================================================================

📄 Results saved locally to: /private/tmp/gepa_walkthrough/results/gepa_results_pl_320a080971124f48_20251120_185144.txt
📋 Verbose log saved locally to: /private/tmp/gepa_walkthrough/results/gepa_log_pl_320a080971124f48_20251120_185144.log
What you’ll see:
  • Real-time rollout processing logs showing individual predictions (correct/incorrect)
  • Progress updates showing completion percentage
  • Validation summaries with candidate scores (baseline vs optimized)
  • Final job status with best score (in this example: 81.25% accuracy, up from the 56.67% baseline, a +24.58 percentage point improvement)
  • Cost breakdown: $0.14 total cost for 458 rollouts
  • Throughput statistics: 85.6 rollouts/minute
  • Total time: ~4.3 minutes
  • Location of saved results files
Key results from this run:
  • Baseline accuracy: 56.67%
  • Best optimized prompt: 81.25% accuracy
  • Improvement: +24.58 percentage points (+43% relative improvement)
  • Top candidate: 76.67% accuracy (+20% vs baseline)
  • Cost: $0.14 for complete optimization
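The improvement figures above follow directly from the baseline and best scores:

```python
baseline = 0.5667  # baseline validation accuracy
best = 0.8125      # best optimized prompt accuracy

absolute_gain = best - baseline            # percentage-point improvement
relative_gain = absolute_gain / baseline   # relative improvement

print(f"+{absolute_gain * 100:.2f} pp, +{relative_gain * 100:.0f}% relative")
```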

Retrieving Optimized Prompts

After training completes, retrieve the optimized prompts using the job ID:
import asyncio
from synth_ai.learning.prompt_learning_client import PromptLearningClient
from synth_ai.api.train.utils import ensure_api_base
import os

async def get_results():
    job_id = 'pl_320a080971124f48'  # Use your job ID from the output above
    backend_url = ensure_api_base('https://agent-learning.onrender.com')
    api_key = os.getenv('SYNTH_API_KEY')
    client = PromptLearningClient(backend_url, api_key)
    prompts = await client.get_prompts(job_id)
    
    print(f"Best score: {prompts.best_score}")
    print(f"Total candidates: {len(prompts.attempted_candidates)}")
    print(f"\nBest prompt:\n{prompts.best_prompt}")

asyncio.run(get_results())

Files Created During Execution

  • /tmp/gepa_walkthrough/cli_env.txt - Environment file with API key and tunnel URL
  • /tmp/gepa_walkthrough/banking77_gepa_prod.toml - GEPA config with tunnel URL (generated from base config)
  • /tmp/gepa_walkthrough/results/ - Results directory with logs and outputs

Troubleshooting

  • Port 8102 in use: The script automatically kills existing processes, but if issues persist, manually kill them: lsof -ti :8102 | xargs kill -9
  • Tunnel fails: Check that cloudflared is installed and network connectivity is working. The script waits 25 seconds for tunnel establishment.
  • API key errors: Ensure SYNTH_API_KEY is set in your .env file
  • Job fails with trace registration error: This is a known backend issue. The script completes successfully, but the job may fail during execution. Check backend logs for details.

Next Steps

  • Review the optimized prompts in the results file
  • Compare different candidates’ performance
  • Adjust the rollout budget or number of generations in the config for different optimization runs
  • Try the in-process walkthrough for a fully automated approach