Evaluation is identical to the rollout command described in the SFT docs. For RL you typically run it (a) before training to establish a baseline, (b) during training to monitor judge metrics, and (c) afterwards to compare checkpoints. For CLI flag details, see Run Evaluations.

Baseline run

uvx synth-ai eval \
  --app-id your-task-id \
  --model Qwen/Qwen3-4B \
  --seeds 1-20 \
  --trace-db traces/v3/synth_ai.db
Save the aggregated scores so you can chart improvement during training.
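One low-tech way to do that is to tee the command output into a dated log file (the results/ directory and file name here are illustrative, not something the CLI creates for you):
mkdir -p results
uvx synth-ai eval \
  --app-id your-task-id \
  --model Qwen/Qwen3-4B \
  --seeds 1-20 \
  --trace-db traces/v3/synth_ai.db \
  | tee results/baseline-$(date +%Y%m%d).log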

Inspect judge metrics

Rollout responses include judge-specific scores (outcome, events, etc.). The CLI prints them inline; to capture the raw data for later analysis, write traces to a database:
uvx synth-ai eval \
  --app-id your-task-id \
  --model ft:latest-checkpoint \
  --seeds 21-40 \
  --trace-db traces/v3/synth_ai.db \
  --metadata difficulty=hard
The --metadata flag filters the task dataset so you can compare performance on specific splits.
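To go a level deeper, the trace database appears to be a SQLite file (judging by the .db path used above; the schema is not documented here), so you can explore whatever the trace writer stored with the standard sqlite3 shell before committing to any queries:
sqlite3 traces/v3/synth_ai.db ".tables"   # list the tables the trace writer created
sqlite3 traces/v3/synth_ai.db ".schema"   # dump their columns before writing queries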

Config-driven eval

Put the same parameters in a TOML file so repeated runs don't need the full flag list:
[eval]
app_id = "your-task-id"
model = "ft:abc123:2024-09-18-034500"
seeds = [1, 2, 3, 4, 5]
trace_db = "traces/v3/synth_ai.db"
metadata = { split = "validation" }
task_app_url = "https://my-task-app.modal.run"
Run with uvx synth-ai eval --config configs/eval.toml. CLI flags still override the config, letting agents tweak parameters dynamically.
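For example, assuming the config above, you could keep everything except the model and seed range:
uvx synth-ai eval \
  --config configs/eval.toml \
  --model ft:latest-checkpoint \
  --seeds 21-40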

Use results downstream

  • Feed high-reward trajectories back into the SFT pipeline using the filter step.
  • Compare different checkpoints by running eval with different --model values (see the sketch after this list).
  • Store judge metrics to detect regressions or plateauing performance.
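A sketch of the checkpoint comparison from the list above; the model IDs are placeholders for your own checkpoints:
# Evaluate two checkpoints on the same seeds so the scores are directly comparable.
for model in ft:checkpoint-a ft:checkpoint-b; do
  uvx synth-ai eval \
    --app-id your-task-id \
    --model "$model" \
    --seeds 1-20 \
    --trace-db traces/v3/synth_ai.db
done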
Because eval is just rollouts, you can invoke it at any point (before, during, or after RL jobs) to keep a tight feedback loop.

Tip: the CLI auto-loads the .env produced during setup. Use --env-file only when you need to override the inferred environment.