Evaluation and rollouts are the same thing in Synth. You run the agent against your task app, capture metrics, and optionally persist the traces for further training. For CLI options see Run Evaluations; if you need a refresher on starting task apps locally, check Run Task Apps Locally.

1. Choose the evaluation source

  • Local task app – run uvx synth-ai serve your-task-id --trace traces/v3.
  • Remote task app – provide --task-url https://<modal>.modal.run.
  • In-process config – pass a TOML file with an [eval] section (seeds, model, metadata filters).

2. Run the command

uvx synth-ai eval \
  --app-id your-task-id \
  --model ft:abc123:2024-09-18-034500 \
  --seeds 1-20 \
  --trace-db traces/v3/synth_ai.db
Key flags:
  • --model – model (base or fine-tuned) to roll out.
  • --seeds – comma-separated list or range syntax (1-10).
  • --metadata key=value – filter tasks by metadata in the dataset registry.
  • --trace-db none – disable trace persistence if you only need metrics.
  • --task-url – override the endpoint (useful for Modal deployments).
Output includes:
  • Official score and any rubric-specific judges exposed by the task app.
  • Per-seed outcomes (success/failure, scores, rewards).
  • Location of stored traces (both the SQLite database and any SFT JSONL dumps).
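
The --seeds flag accepts either a comma-separated list or a range. The sketch below shows one way such a spec could be expanded into individual seeds. It is not the CLI's actual parser: the function name parse_seeds is hypothetical, and it assumes that ranges are inclusive and that lists and ranges can be mixed in one value, neither of which is confirmed here.

```python
def parse_seeds(spec: str) -> list[int]:
    """Expand a seed spec such as "1-3,7" into a list of integers.

    Assumption: ranges are inclusive on both ends, so "1-3" means 1, 2, 3.
    """
    seeds: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,2,"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            seeds.extend(range(start, end + 1))
        else:
            seeds.append(int(part))
    return seeds


print(parse_seeds("1-3,7"))  # [1, 2, 3, 7]
```

Under these assumptions, --seeds 1-20 from the command above would cover twenty seeds.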

3. Use configs for reproducibility

Create configs/eval.toml:
[eval]
app_id = "your-task-id"
model = "ft:abc123:2024-09-18-034500"
seeds = [1, 2, 3, 4, 5]
trace_db = "traces/v3/synth_ai.db"
metadata = { split = "validation" }
Run:
uvx synth-ai eval --config configs/eval.toml
CLI parameters override the TOML fields, so agents can keep a canonical config and tweak inputs per run.

4. Promote or roll back based on metrics

  • Compare the aggregated results to the baseline SFT job.
  • Export traces (traces/v3/synth_ai.db) and use them as additional training data if necessary.
  • Record judge trends in your experiment tracker.
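
The comparison and selection steps above can be sketched in a few lines. Everything here is illustrative: SeedOutcome, should_promote, and seeds_worth_keeping are hypothetical names, the scores are made-up placeholders rather than real results, and the sketch does not read the trace database, whose schema is not described here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SeedOutcome:
    """Hypothetical per-seed result, copied by hand from the eval output."""
    seed: int
    score: float
    success: bool


def should_promote(outcomes: list[SeedOutcome], baseline_mean: float,
                   margin: float = 0.0) -> bool:
    """True when the mean score reaches the baseline plus a safety margin."""
    if not outcomes:
        raise ValueError("no outcomes to compare")
    return mean(o.score for o in outcomes) >= baseline_mean + margin


def seeds_worth_keeping(outcomes: list[SeedOutcome],
                        min_score: float) -> list[int]:
    """Seeds whose rollouts succeeded and scored at least min_score."""
    return [o.seed for o in outcomes if o.success and o.score >= min_score]


# Placeholder values for illustration only.
outcomes = [
    SeedOutcome(seed=1, score=0.9, success=True),
    SeedOutcome(seed=2, score=0.4, success=False),
    SeedOutcome(seed=3, score=0.8, success=True),
]

print(should_promote(outcomes, baseline_mean=0.6))   # True
print(seeds_worth_keeping(outcomes, min_score=0.75))  # [1, 3]
```

The threshold and margin are choices for your own project; a stricter margin reduces the chance of promoting a model on noise from a small seed set.
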
Because evaluation, rollouts, and data collection are all the same command, you can loop: train → eval → filter high-quality rollouts → train again.

Tip: the CLI auto-loads the .env written by uvx synth-ai setup. Pass --env-file only when you need to override or layer additional environment files.