`synth-ai eval` executes rollouts against a task app and summarizes judge scores.
- Omit `APP_ID` to pick from discovered apps; provide one to skip the prompt. The command reuses the same registry and discovery logic as `serve`/`deploy`.
- You can drive the run with a TOML config (`--config`) or let the CLI auto-discover the first matching config in the workspace. Config values are validated before any work begins.
- When no remote URL is supplied, the CLI starts the selected task app locally, runs the requested seeds, and tears it down afterward. Supplying `--url` points the evaluator at an existing deployment instead.
- Seeds can come from CLI arguments, the config, or metadata queries (`--metadata`, `--metadata-sql`). Negative seeds wrap around the dataset for convenience.
- Results include official task scores plus any judge specs configured for the task app; the CLI also prints mean values and optional Pearson correlations so you can gauge agreement between judges.
- Traces are written to `--trace-db` (default `traces/v3/synth_ai.db`). Setting it to `none` skips persistence when you only care about immediate metrics.
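
To make the Pearson correlation in the summary concrete, here is a minimal sketch of how agreement between two score series could be computed. It is an illustration, not the CLI's own code: the function name `pearson` and the choice to return `None` for a constant series are assumptions.

```python
from math import sqrt
from typing import Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation of two equal-length score series.

    Returns None when either series has zero variance, because the
    coefficient is undefined in that case (an assumption of this sketch).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)
```

A value near 1 means the two scorers rank rollouts similarly, a value near -1 means they disagree systematically, and a value near 0 means there is little linear relationship.
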
Options
- `APP_ID`: Task app identifier. Omit for the interactive picker.
- `--config PATH`: Optional TOML config. If omitted, the CLI auto-discovers the first `*.toml` that matches the task app.
- `--url VALUE`: Base URL of a running task app. When omitted, the CLI spins up a local server automatically.
- `--seeds VALUE`: Comma-separated seeds or indices (`"0,1,2"`). Negative values wrap around the dataset.
- `--split VALUE`: Dataset split name to request (`train`, `validation`, etc.).
- `--model VALUE`: Override the model in the config/metadata.
- `--env-file PATH`: One or more `.env` files holding credentials. Repeat the flag to merge multiple files.
- `--trace-db PATH`: SQLite/Turso path for trace persistence. Use `none` to disable saving traces.
- `--metadata KEY=VALUE`: Filter tasks by metadata key/value pairs (repeatable).
- `--metadata-sql QUERY`: Advanced seed selection using a SQLite query (returns seeds to evaluate).
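
The `--seeds` format and the negative wrap-around can be sketched as follows. This is a hypothetical helper, not the CLI's implementation: the name `resolve_seeds` is invented, and it assumes that negative values wrap by modulo over the dataset size while non-negative values pass through unchanged.

```python
from typing import List


def resolve_seeds(raw: str, dataset_size: int) -> List[int]:
    """Parse a comma-separated seed string such as "0,1,-1".

    Assumption of this sketch: a negative seed counts back from the end
    of the dataset (so -1 is the last item), like Python list indexing.
    """
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    seeds: List[int] = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas and whitespace
        value = int(token)
        if value < 0:
            # Python's modulo of a negative int by a positive int is non-negative.
            value %= dataset_size
        seeds.append(value)
    return seeds
```

With a dataset of 10 items, `resolve_seeds("0,1,-1", 10)` yields `[0, 1, 9]` under these assumptions.
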
Notes
- Remote evaluations require both `SYNTH_API_KEY` and `ENVIRONMENT_API_KEY` in your env files so the CLI can forward authentication headers.
- Trace DB paths accept both file locations and SQLAlchemy URLs (`sqlite+aiosqlite:///absolute/path.db`). Set `--trace-db none` if you want to keep runs ephemeral.
- Metadata filters stack: you can combine `--split validation`, `--metadata difficulty=hard`, and an SQL query to target highly specific samples.
- When the CLI starts a temporary server, it automatically chooses an open port, injects credentials, and cleans up once evaluation finishes.