Run Evaluations

synth-ai eval executes rollouts against a task app and summarizes judge scores.

Omit APP_ID to pick from discovered apps; provide one to skip the prompt. The command reuses the same registry and discovery logic as serve/deploy.
You can drive the run with a TOML config (--config) or let the CLI auto-discover the first matching config in the workspace. Config values are validated before any work begins.
When no remote URL is supplied, the CLI starts the selected task app locally, runs the requested seeds, and tears the server down afterward. Supplying --url points the evaluator at an existing deployment instead.
Seeds can come from CLI arguments, the config, or metadata queries (--metadata, --metadata-sql). Negative seeds wrap around the dataset for convenience.
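The exact wrap-around rule for negative seeds is not spelled out on this page. The sketch below assumes it mirrors Python's negative indexing, so that -1 maps to the last sample. The names resolve_seed, parse_seeds, and dataset_size are hypothetical and are not part of synth-ai.

```python
def resolve_seed(seed: int, dataset_size: int) -> int:
    """Map a possibly negative seed onto a valid dataset index (assumed rule)."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    # Python's % returns a non-negative result for a positive modulus,
    # so -1 becomes dataset_size - 1, -2 becomes dataset_size - 2, and so on.
    return seed % dataset_size


def parse_seeds(raw: str, dataset_size: int) -> list[int]:
    """Parse a comma-separated string such as "0,1,-1" into dataset indices."""
    return [
        resolve_seed(int(part), dataset_size)
        for part in raw.split(",")
        if part.strip()  # ignore empty fragments such as a trailing comma
    ]


print(parse_seeds("0,1,-1", dataset_size=10))
```

Note that the modulo also wraps seeds larger than the dataset, which is a further assumption; the CLI may reject those instead.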
Results include official task scores plus any judge specs configured for the task app; the CLI also prints mean values and optional Pearson correlations so you can gauge agreement between judges.
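For readers who want to reproduce the summary statistics offline, here is a minimal sketch of the mean and the Pearson correlation between two score series. It assumes the standard sample Pearson formula; how the CLI computes its value internally is not documented here, and the scores below are made up for illustration.

```python
import math


def mean(values: list[float]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-seed scores, not taken from a real run.
official = [0.2, 0.5, 0.9]
judge = [0.3, 0.4, 0.8]

print(f"Official mean: {mean(official):.3f}")
print(f"Judge mean:    {mean(judge):.3f}")
print(f"Pearson r:     {pearson_r(official, judge):.2f}")
```

With only a handful of seeds the correlation is very noisy, so it is best read as a rough signal rather than a precise measure of agreement.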
Traces are written to --trace-db (default traces/v3/synth_ai.db). Setting it to none skips persistence when you only care about immediate metrics.

Options

APP_ID — Task app identifier. Omit for the interactive picker.
--config PATH — Optional TOML config. If omitted, the CLI auto-discovers the first *.toml that matches the task app.
--url VALUE — Base URL of a running task app. When omitted, the CLI spins up a local server automatically.
--seeds VALUE — Comma-separated seeds or indices ("0,1,2"). Negative values wrap around the dataset.
--split VALUE — Dataset split name to request (train, validation, etc.).
--model VALUE — Override the model in the config/metadata.
--env-file PATH — One or more .env files holding credentials. Repeat the flag to merge multiple files.
--trace-db PATH — SQLite/Turso path for trace persistence. Use none to disable saving traces.
--metadata KEY=VALUE — Filter tasks by metadata key/value pairs (repeatable).
--metadata-sql QUERY — Advanced seed selection using a SQLite query (returns seeds to evaluate).

uvx synth-ai eval grpo-crafter --config configs/eval.toml --seeds 0,5,9
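When evaluations are launched from a script, it can help to assemble the argument list programmatically. The helper below is a hypothetical wrapper (build_eval_command is not part of synth-ai) that uses only the flags documented in the options above and rebuilds the command shown here.

```python
import shlex


def build_eval_command(app_id, seeds, config=None, url=None, trace_db=None, metadata=None):
    """Return the argv list for a `synth-ai eval` invocation."""
    cmd = ["uvx", "synth-ai", "eval", app_id]
    if config:
        cmd += ["--config", config]
    if url:
        cmd += ["--url", url]
    # --seeds takes a single comma-separated value.
    cmd += ["--seeds", ",".join(str(seed) for seed in seeds)]
    if trace_db:
        cmd += ["--trace-db", trace_db]
    # --metadata is repeatable: one flag per KEY=VALUE pair.
    for key, value in (metadata or {}).items():
        cmd += ["--metadata", f"{key}={value}"]
    return cmd


cmd = build_eval_command("grpo-crafter", [0, 5, 9], config="configs/eval.toml")
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run rather than a shell string avoids quoting problems with values such as SQL queries.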

Example session (local server auto-started):

synth@Nomans-Resolve sdk % uvx synth-ai eval grpo-crafter --seeds 0,1,2
Starting temporary grpo-crafter server on port 8765...
Waiting for server to start...
✓ Server started
Using env file(s): /Users/synth/qa/sdk/.env
Evaluating seeds: 0, 1, 2
Official mean: 0.742
[judge qa_accuracy] mean: 0.810
    Pearson r: 0.64
  Seed Prompt Official qa_accuracy
  0    12     0.750     0.833
  1    98     0.708     0.792
  2    45     0.769     0.805
Stopping temporary server...

Notes

Remote evaluations require both SYNTH_API_KEY and ENVIRONMENT_API_KEY in your env files so the CLI can forward authentication headers.
Trace DB paths accept both file locations and SQLAlchemy URLs (sqlite+aiosqlite:///absolute/path.db). Set --trace-db none if you want to keep runs ephemeral.
Metadata filters stack: you can combine --split validation, --metadata difficulty=hard, and an SQL query to target highly specific samples.
When the CLI starts a temporary server it automatically chooses an open port, injects credentials, and cleans up once evaluation finishes.
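To illustrate how stacked filters narrow the candidate seeds, here is a small sketch that assumes the filters combine with AND semantics, meaning a task must satisfy every one of them. The task records, field names, and select_seeds function are hypothetical; the real metadata schema depends on the task app.

```python
from typing import Optional


def select_seeds(tasks: list[dict], split: Optional[str], metadata: dict[str, str]) -> list[int]:
    """Return seeds of tasks that match the split and every metadata pair."""
    selected = []
    for task in tasks:
        if split is not None and task.get("split") != split:
            continue
        task_meta = task.get("metadata", {})
        # Compare as strings because KEY=VALUE arrives as text on the command line.
        if all(str(task_meta.get(key)) == value for key, value in metadata.items()):
            selected.append(task["seed"])
    return selected


# Hypothetical task records for illustration only.
tasks = [
    {"seed": 0, "split": "train", "metadata": {"difficulty": "hard"}},
    {"seed": 1, "split": "validation", "metadata": {"difficulty": "easy"}},
    {"seed": 2, "split": "validation", "metadata": {"difficulty": "hard"}},
]

print(select_seeds(tasks, "validation", {"difficulty": "hard"}))
```

Each additional filter can only shrink the result, so an overly specific combination may select no seeds at all.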
