synth-ai eval executes rollouts against a task app and summarizes judge scores.
  • Omit APP_ID to pick from discovered apps; provide one to skip the prompt. The command reuses the same registry and discovery logic as serve/deploy.
  • You can drive the run with a TOML config (--config) or let the CLI auto-discover the first matching config in the workspace. Config values are validated before any work begins.
  • When no remote URL is supplied, the CLI starts the selected task app locally, runs the requested seeds, and tears it down afterward. Supplying --url points the evaluator at an existing deployment instead.
  • Seeds can come from CLI arguments, the config, or metadata queries (--metadata, --metadata-sql). Negative seeds wrap around the dataset for convenience.
  • Results include official task scores plus scores from any judge specs configured for the task app; the CLI also prints mean values and optional Pearson correlations so you can gauge agreement between judges.
  • Traces are written to --trace-db (default traces/v3/synth_ai.db). Setting it to none skips persistence when you only care about immediate metrics.
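As a rough illustration of the agreement statistic mentioned above, the following sketch computes a Pearson correlation for two lists of per-seed scores. The function name `pearson_r` and the score values are hypothetical, and this is not taken from the CLI's own implementation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-seed scores, not output from a real run.
official = [0.2, 0.4, 0.6, 0.8]
judge = [0.3, 0.5, 0.7, 0.9]  # shifted copy of the official scores

print(round(pearson_r(official, judge), 3))
```

A value near 1 means the two scores rank seeds similarly; a value near 0 suggests they are measuring different things.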

Options

  • APP_ID — Task app identifier. Omit for the interactive picker.
  • --config PATH — Optional TOML config. If omitted, the CLI auto-discovers the first *.toml that matches the task app.
  • --url VALUE — Base URL of a running task app. When omitted, the CLI spins up a local server automatically.
  • --seeds VALUE — Comma-separated seeds or indices ("0,1,2"). Negative values wrap around the dataset.
  • --split VALUE — Dataset split name to request (train, validation, etc.).
  • --model VALUE — Override the model in the config/metadata.
  • --env-file PATH — One or more .env files holding credentials. Repeat the flag to merge multiple files.
  • --trace-db PATH — SQLite/Turso path for trace persistence. Use none to disable saving traces.
  • --metadata KEY=VALUE — Filter tasks by metadata key/value pairs (repeatable).
  • --metadata-sql QUERY — Advanced seed selection using a SQLite query (returns seeds to evaluate).
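The seed handling described for --seeds can be sketched as follows. This assumes negative seeds behave like Python's negative indices (so -1 refers to the last item in the dataset); the docs only say that negative values "wrap around", so the exact mapping is an assumption. The helper name `parse_seeds` is hypothetical.

```python
def parse_seeds(value, dataset_size):
    """Parse a comma-separated seed string and wrap negative values.

    Assumption: negative seeds wrap like Python negative indices,
    so -1 maps to dataset_size - 1.
    """
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    seeds = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "0,1,"
        seed = int(part)
        if seed < 0:
            # For a positive modulus, Python's % yields a non-negative result.
            seed %= dataset_size
        seeds.append(seed)
    return seeds


print(parse_seeds("0,5,-1", dataset_size=100))  # prints [0, 5, 99]
```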
Example invocation with a config file and explicit seeds:
uvx synth-ai eval grpo-crafter --config configs/eval.toml --seeds 0,5,9
Example session (local server auto-started):
synth@Nomans-Resolve sdk % uvx synth-ai eval grpo-crafter --seeds 0,1,2
Starting temporary grpo-crafter server on port 8765...
Waiting for server to start...
 Server started
Using env file(s): /Users/synth/qa/sdk/.env
Evaluating seeds: 0, 1, 2
Official mean: 0.742
[judge qa_accuracy] mean: 0.810
    Pearson r: 0.64
  Seed  Prompt  Official  qa_accuracy
  0     12      0.750     0.833
  1     98      0.708     0.792
  2     45      0.769     0.805
Stopping temporary server...
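The summary means in the session above can be sanity-checked from the per-seed rows. A minimal sketch, assuming the rows are whitespace-separated in the order seed, prompt, official score, judge score:

```python
# Per-seed rows copied from the example session: seed, prompt, official, qa_accuracy.
rows = """
0    12     0.750     0.833
1    98     0.708     0.792
2    45     0.769     0.805
"""

official, judge = [], []
for line in rows.strip().splitlines():
    _seed, _prompt, official_score, judge_score = line.split()
    official.append(float(official_score))
    judge.append(float(judge_score))

official_mean = sum(official) / len(official)
judge_mean = sum(judge) / len(judge)

print(f"Official mean: {official_mean:.3f}")   # prints Official mean: 0.742
print(f"qa_accuracy mean: {judge_mean:.3f}")   # prints qa_accuracy mean: 0.810
```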

Notes

  • Remote evaluations require both SYNTH_API_KEY and ENVIRONMENT_API_KEY in your env files so the CLI can forward authentication headers.
  • Trace DB paths accept both file locations and SQLAlchemy URLs (sqlite+aiosqlite:///absolute/path.db). Set --trace-db none if you want to keep runs ephemeral.
  • Metadata filters stack: you can combine --split validation, --metadata difficulty=hard, and an SQL query to target highly specific samples.
  • When the CLI starts a temporary server, it automatically chooses an open port, injects credentials, and cleans up once evaluation finishes.
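To illustrate how stacked filters narrow the set of seeds, the sketch below expresses the equivalent of --split validation plus --metadata difficulty=hard as a single SQLite query that returns seeds. The table name `tasks` and its columns are assumptions made for this example; the actual schema queried by --metadata-sql is not described here and may differ.

```python
import sqlite3

# Hypothetical metadata table; the real schema behind --metadata-sql may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (seed INTEGER, split TEXT, difficulty TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [
        (0, "train", "easy"),
        (1, "validation", "hard"),
        (2, "validation", "easy"),
        (3, "validation", "hard"),
    ],
)

# Each condition narrows the result further, mirroring stacked filters.
query = """
    SELECT seed FROM tasks
    WHERE split = 'validation' AND difficulty = 'hard'
    ORDER BY seed
"""
seeds = [row[0] for row in conn.execute(query)]
conn.close()

print(seeds)  # prints [1, 3]
```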