This walkthrough demonstrates how to run GEPA optimization on Banking77 using a deployed task app via Cloudflare Tunnel. You’ll see exactly what commands to run, what output to expect, and how to retrieve your optimized prompts.

Prerequisites

  • SYNTH_API_KEY in .env (for backend authentication)
  • GROQ_API_KEY in .env (for policy model inference)
  • uv installed (for running Python commands)
  • cloudflared binary (will be auto-installed if missing)

Quick Start

Run the interactive script from the synth-ai repository:
cd walkthroughs/gepa/deployed
bash commands.sh
View the script: commands.sh. The script guides you through each step interactively; below is what happens at each stage.

Step-by-Step Walkthrough

Step 1: Generate ENVIRONMENT_API_KEY

What happens: The script generates a new API key for authenticating with the task app and registers it with the backend. Command executed:
ENV_KEY=$(uv run python -c "from synth_ai.learning.rl.secrets import mint_environment_api_key; print(mint_environment_api_key())" 2>&1 | tail -1 | tr -d '\n' | tr -d '\r')
echo "ENVIRONMENT_API_KEY=$ENV_KEY" > /tmp/gepa_walkthrough/cli_env.txt
echo "TASK_APP_URL=" >> /tmp/gepa_walkthrough/cli_env.txt
Expected output:
✓ ENVIRONMENT_API_KEY generated
Key: 4b49d56ce9f3c02...
✅ Key registered with backend
What you’ll see: The script displays the first 20 characters of the generated key and confirms backend registration.
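The bootstrap step can be sketched in Python. This is a minimal sketch, not the script's actual implementation: `secrets.token_hex` stands in for `mint_environment_api_key`, the `write_cli_env` helper name is hypothetical, and the real script additionally registers the key with the backend.

```python
import secrets
from pathlib import Path


def write_cli_env(env_dir: str) -> Path:
    """Mirror the shell bootstrap: mint a key and seed the env file.

    secrets.token_hex stands in for synth_ai's mint_environment_api_key;
    the real script also registers the key with the backend.
    """
    env_file = Path(env_dir) / "cli_env.txt"
    env_file.parent.mkdir(parents=True, exist_ok=True)
    key = secrets.token_hex(32)  # stand-in for the minted key
    # TASK_APP_URL starts empty; the deploy step fills it in later.
    env_file.write_text(f"ENVIRONMENT_API_KEY={key}\nTASK_APP_URL=\n")
    return env_file
```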

Step 2: Deploy Cloudflare Tunnel

What happens: The script kills any existing processes on port 8102, then starts the Banking77 task app locally and creates a Cloudflare tunnel to expose it publicly. Commands executed:
pkill -f "cloudflared.*8102" 2>/dev/null || true
pkill -f "uvicorn.*8102" 2>/dev/null || true
lsof -ti :8102 2>/dev/null | xargs kill -9 2>/dev/null || true
sleep 2
uv run synth-ai deploy tunnel examples/task_apps/banking77/banking77_task_app.py --tunnel-mode quick --port 8102 --env /tmp/gepa_walkthrough/cli_env.txt &
sleep 25
Expected output:
Starting tunnel deployment in background...
Task app path: /path/to/examples/task_apps/banking77/banking77_task_app.py
Waiting for tunnel to establish...
What you’ll see: The deploy command runs in the background. After ~25 seconds, the tunnel URL is written to /tmp/gepa_walkthrough/cli_env.txt. You’ll see Cloudflare tunnel logs indicating the tunnel is ready. Task app logs (example):
INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8102
[TUNNEL] Cloudflare tunnel established: https://criteria-chains-incomplete-others.trycloudflare.com

Step 3: Extract Tunnel URL

What happens: The script reads the TASK_APP_URL that was written by the deploy command. Command executed:
TASK_URL=$(grep "^TASK_APP_URL=" /tmp/gepa_walkthrough/cli_env.txt | cut -d"=" -f2- | tr -d '"' | tr -d "'" | tr -d '\n' | tr -d '\r')
Expected output:
✓ Tunnel URL extracted: https://criteria-chains-incomplete-others.trycloudflare.com
What you’ll see: The tunnel URL is displayed. This is the public URL where your task app is accessible.
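The grep/cut/tr pipeline above can be expressed in Python if you prefer to parse the env file programmatically (the helper name here is hypothetical, not part of synth-ai):

```python
def extract_task_app_url(env_text: str) -> str:
    """Python equivalent of the grep/cut/tr pipeline: find the
    TASK_APP_URL line, keep everything after the first '=', then
    strip surrounding quotes and stray CR/LF characters."""
    for line in env_text.splitlines():
        if line.startswith("TASK_APP_URL="):
            value = line.split("=", 1)[1]
            return value.strip().strip("'\"")
    return ""
```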

Step 4: Create GEPA Config

What happens: The script updates the base TOML config to use the tunnel URL and sets rollout budget to 2000 (sufficient for prompt improvement). Command executed:
cat examples/blog_posts/langprobe/task_specific/banking77/banking77_gepa.toml | \
  sed "s|task_app_url = \".*\"|task_app_url = \"$TASK_URL\"|" | \
  sed "s|budget = .*|budget = 2000|" > /tmp/gepa_walkthrough/banking77_gepa_prod.toml
Expected output:
✓ Config created: /tmp/gepa_walkthrough/banking77_gepa_prod.toml
What you’ll see: A new config file is created with your tunnel URL and increased budget.
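The two sed substitutions can also be sketched in Python. This mirrors the shell pipeline under the same assumption it makes: the base TOML contains exactly one `task_app_url = "..."` line and one `budget = ...` line (the function name is hypothetical):

```python
import re


def patch_gepa_config(base_toml: str, task_url: str, budget: int = 2000) -> str:
    """Mirror the sed substitutions: point the config at the tunnel
    URL and raise the rollout budget."""
    out = re.sub(r'task_app_url = ".*"', f'task_app_url = "{task_url}"', base_toml)
    out = re.sub(r"budget = .*", f"budget = {budget}", out)
    return out
```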

Step 5: Run GEPA Training

What happens: The script submits the GEPA optimization job to the production backend and polls for completion. Command executed:
export BACKEND_BASE_URL="https://agent-learning.onrender.com"
uv run synth-ai train /tmp/gepa_walkthrough/banking77_gepa_prod.toml --backend "$BACKEND_BASE_URL" --env /tmp/gepa_walkthrough/cli_env.txt --poll
Expected output during training: you’ll see task app logs showing rollouts being processed. Below are examples from an actual run.
Successful prediction:
[TASK_APP] INBOUND_ROLLOUT: run_id=prompt-learning-78-3d187122 seed=78 env=banking77
[TASK_APP] PROXY ROUTING with API key: sk_env_30c78...f263 (len=39)
[TASK_APP] OUTBOUND: model=llama-3.1-8b-instant temp=0.0 max=512 tools=1
[TASK_APP] RESPONSE_STATUS: 200
[TASK_APP] PREDICTION: expected=card_arrival predicted=card_arrival correct=True
[BANKING77_ROLLOUT] run_id=prompt-learning-78-3d187122 reward=1.0
INFO:     74.220.49.253:0 - "POST /rollout HTTP/1.1" 200 OK
Failed prediction (showing a case where the model returned all intents instead of one):
[TASK_APP] INBOUND_ROLLOUT: run_id=prompt-learning-77-e727cacd seed=77 env=banking77
[TASK_APP] RESPONSE_STATUS: 200
[TASK_APP] PREDICTION: expected=card_arrival predicted=card_about_to_expire card_arrival card_delivery_estimate... (all 77 intents) correct=False
[BANKING77_ROLLOUT] run_id=prompt-learning-77-e727cacd reward=0.0
INFO:     74.220.49.253:0 - "POST /rollout HTTP/1.1" 200 OK
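The scoring visible in these logs is a simple exact match on the predicted intent. A minimal sketch of that pattern (the task app's actual implementation may differ):

```python
def banking77_reward(expected: str, predicted: str) -> float:
    """Exact-match reward as seen in the rollout logs: 1.0 only when
    the predicted intent equals the expected label, so a response that
    lists every intent scores 0.0."""
    return 1.0 if predicted.strip() == expected.strip() else 0.0
```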
You’ll see both correct and incorrect predictions as GEPA tests different prompt variations. The optimizer learns from these results to improve the prompts. Progress updates:
[18:51:11] Progress: 100% complete
[18:51:22] Validation Summary:
  Baseline: 0.5667
  N=2
  Candidate 1: 0.7667
  Candidate 2: 0.6333
[18:51:22] prompt.learning.optimized.scored (info): optimized[0] train_accuracy=0.375 len=636 N=6 val_accuracy=0.767
[18:51:22] prompt.learning.optimized.scored (info): optimized[1] train_accuracy=0.625 len=725 N=5 val_accuracy=0.633
[18:51:22] prompt.learning.optimized.scored (info): optimized[2] train_accuracy=0.7878787878787878 len=763 N=32
[18:51:22] prompt.learning.results.summary (info): Results: best_score=0.8125 tried=20 frontier=6
[18:51:22] prompt.learning.best.prompt (info): Best prompt (validation) score=0.8125
[18:51:22] prompt.learning.gepa.complete (info): GEPA optimisation complete — best_score=0.8125
[18:51:37] prompt.learning.completed (info): Prompt learning job completed — billed $0.14 ($0.14 sandbox + $0.00 tokens) | best_score=0.8125
Final summary:
Final status: succeeded
{
  "job_id": "pl_320a080971124f48",
  "status": "succeeded",
  "created_at": "2025-11-21T02:47:03.981174+00:00",
  "started_at": "2025-11-21T02:47:04.690663+00:00",
  "finished_at": "2025-11-21T02:51:23.30118+00:00"
}

================================================================================
FINAL SUMMARY
================================================================================
       Cost Policy: $0.0000 | Proposal: $0.0000 | Total: $0.1433
   Rollouts N: 458 | Tokens: 0.0000M
 Throughput Rollouts: 85.6/min
       Time 257.9s
Candidate 1 Accuracy: 0.7667 (Δ+0.2000 vs baseline)
================================================================================

📄 Results saved locally to: /private/tmp/gepa_walkthrough/results/gepa_results_pl_320a080971124f48_20251120_185144.txt
📋 Verbose log saved locally to: /private/tmp/gepa_walkthrough/results/gepa_log_pl_320a080971124f48_20251120_185144.log
What you’ll see:
  • Real-time rollout processing logs showing individual predictions (correct/incorrect)
  • Progress updates showing completion percentage
  • Validation summaries with candidate scores (baseline vs optimized)
  • Final job status with best score (in this example: 81.25% accuracy, up from the 56.67% baseline, a +24.58 percentage point improvement)
  • Cost breakdown: $0.14 total cost for 458 rollouts
  • Throughput statistics: 85.6 rollouts/minute
  • Total time: ~4.3 minutes
  • Location of saved results files
Key results from this run:
  • Baseline accuracy: 56.67%
  • Best optimized prompt: 81.25% accuracy
  • Improvement: +24.58 percentage points (+43% relative improvement)
  • Top candidate: 76.67% accuracy (+20% vs baseline)
  • Cost: $0.14 for complete optimization
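The improvement figures above follow directly from the baseline and best scores:

```python
baseline = 0.5667  # baseline validation accuracy
best = 0.8125      # best optimized prompt accuracy

absolute_gain = best - baseline            # percentage-point improvement
relative_gain = absolute_gain / baseline   # relative improvement

print(f"+{absolute_gain * 100:.2f} pp, +{relative_gain * 100:.0f}% relative")
```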

Retrieving Optimized Prompts

After training completes, retrieve the optimized prompts using the job ID:
import asyncio
from synth_ai.learning.prompt_learning_client import PromptLearningClient
from synth_ai.api.train.utils import ensure_api_base
import os

async def get_results():
    job_id = 'pl_320a080971124f48'  # Use your job ID from the output above
    backend_url = ensure_api_base('https://agent-learning.onrender.com')
    api_key = os.getenv('SYNTH_API_KEY')
    client = PromptLearningClient(backend_url, api_key)
    prompts = await client.get_prompts(job_id)
    
    print(f"Best score: {prompts.best_score}")
    print(f"Total candidates: {len(prompts.attempted_candidates)}")
    print(f"\nBest prompt:\n{prompts.best_prompt}")

asyncio.run(get_results())

Files Created During Execution

  • /tmp/gepa_walkthrough/cli_env.txt - Environment file with API key and tunnel URL
  • /tmp/gepa_walkthrough/banking77_gepa_prod.toml - GEPA config with tunnel URL (generated from base config)
  • /tmp/gepa_walkthrough/results/ - Results directory with logs and outputs

Troubleshooting

  • Port 8102 in use: The script automatically kills existing processes, but if issues persist, manually kill them: lsof -ti :8102 | xargs kill -9
  • Tunnel fails: Check that cloudflared is installed and network connectivity is working. The script waits 25 seconds for tunnel establishment.
  • API key errors: Ensure SYNTH_API_KEY is set in your .env file
  • Job fails with trace registration error: This is a known backend issue. The script completes successfully, but the job may fail during execution. Check backend logs for details.

Next Steps

  • Review the optimized prompts in the results file
  • Compare different candidates’ performance
  • Adjust the rollout budget or number of generations in the config for different optimization runs
  • Try the in-process walkthrough for a fully automated approach