Evaluation is identical to the rollout command described in the SFT docs. For RL you typically run it (a) before training to establish a baseline, (b) during training to monitor judge metrics, and (c) afterwards to compare checkpoints. For CLI flag details, see Run Evaluations.

Baseline run

uvx synth-ai eval \
  --app-id your-task-id \
  --model Qwen/Qwen3-4B \
  --seeds 1-20 \
  --trace-db traces/v3/synth_ai.db
Save the aggregated scores so you can chart improvement during training.
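One low-tech way to do that is to tee the command output into a dated log file (the results/ directory and file name here are illustrative, not something the CLI creates for you):
mkdir -p results
uvx synth-ai eval \
  --app-id your-task-id \
  --model Qwen/Qwen3-4B \
  --seeds 1-20 \
  --trace-db traces/v3/synth_ai.db \
  | tee results/baseline-$(date +%Y%m%d).log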

Inspect judge metrics

Rollout responses include judge-specific scores (outcome, events, etc.). The CLI prints them inline; to capture the raw data for later analysis, write traces to a database:
uvx synth-ai eval \
  --app-id your-task-id \
  --model ft:latest-checkpoint \
  --seeds 21-40 \
  --trace-db traces/v3/synth_ai.db \
  --metadata difficulty=hard
The --metadata flag filters the task dataset so you can compare performance on specific splits.
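To go a level deeper, the trace database appears to be a SQLite file (judging by the .db path used above; the schema is not documented here), so you can explore whatever the trace writer stored with the standard sqlite3 shell before committing to any queries:
sqlite3 traces/v3/synth_ai.db ".tables"   # list the tables the trace writer created
sqlite3 traces/v3/synth_ai.db ".schema"   # dump their columns before writing queries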

Config-driven eval

Put the same parameters in a TOML file so repeated runs don't need the full flag list:
[eval]
app_id = "your-task-id"
model = "ft:abc123:2024-09-18-034500"
seeds = [1, 2, 3, 4, 5]
trace_db = "traces/v3/synth_ai.db"
metadata = { split = "validation" }
task_app_url = "https://my-task-app.modal.run"
Run with uvx synth-ai eval --config configs/eval.toml. CLI flags still override the config, letting agents tweak parameters dynamically.
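For example, assuming the config above, you could keep everything except the model and seed range:
uvx synth-ai eval \
  --config configs/eval.toml \
  --model ft:latest-checkpoint \
  --seeds 21-40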

Use results downstream

  • Feed high-reward trajectories back into the SFT pipeline using the filter step.
  • Compare different checkpoints by running eval with different --model values (see the sketch after this list).
  • Store judge metrics to detect regressions or plateauing performance.
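A sketch of the checkpoint comparison from the list above; the model IDs are placeholders for your own checkpoints:
# Evaluate two checkpoints on the same seeds so the scores are directly comparable.
for model in ft:checkpoint-a ft:checkpoint-b; do
  uvx synth-ai eval \
    --app-id your-task-id \
    --model "$model" \
    --seeds 1-20 \
    --trace-db traces/v3/synth_ai.db
done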
Because eval is just rollouts, you can invoke it at any point (before, during, or after RL jobs) to keep a tight feedback loop.

Tip: the CLI auto-loads the .env produced during setup. Use --env-file only when you need to override the inferred environment.