1. Choose the evaluation source
- Local task app – run `uvx synth-ai serve your-task-id --trace traces/v3`.
- Remote task app – provide `--task-url https://<modal>.modal.run`.
- In-process config – pass a TOML file with an `[eval]` section (seeds, model, metadata filters).
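For the local option, the serve command can run in a separate terminal while the evaluation in step 2 targets it:

```bash
# Serve the task app locally and persist v3 traces (command from the list above).
uvx synth-ai serve your-task-id --trace traces/v3
```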
2. Run the command
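The guide does not reproduce the full invocation; here is a minimal sketch, assuming the subcommand is `eval` and using placeholder values with the flags documented below:

```bash
# Hypothetical invocation — the subcommand name and placeholder values are
# assumptions; the flags are the ones documented in this section.
uvx synth-ai eval \
  --model "ft:your-fine-tuned-model" \
  --seeds 1-10 \
  --metadata difficulty=easy \
  --task-url "https://<modal>.modal.run"
```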
- `--model` – model (base or fine-tuned) to roll out.
- `--seeds` – comma-separated list or range syntax (`1-10`).
- `--metadata key=value` – filter tasks by metadata in the dataset registry.
- `--trace-db none` – disable trace persistence if you only need metrics.
- `--task-url` – override the endpoint (useful for Modal deployments).

Each run reports:
- Official score plus any rubric-specific judge scores exposed by the task app.
- Per-seed outcomes (success/failure, scores, rewards).
- Location of stored traces (both the SQLite database and any SFT JSONL dumps).
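Because traces land in a plain SQLite file, you can inspect them with the stock `sqlite3` CLI (the table names are not documented here, so list them first):

```bash
# List the tables inside the trace database before querying it.
sqlite3 traces/v3/synth_ai.db ".tables"
```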
3. Use configs for reproducibility
Create `configs/eval.toml`:
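The file contents are not reproduced above; a minimal sketch, assuming the `[eval]` table mirrors the CLI flags (exact key names and value types may differ):

```toml
# configs/eval.toml — sketch only; key names beyond seeds/model/metadata
# are assumptions, not the documented schema.
[eval]
model = "ft:your-fine-tuned-model"      # base or fine-tuned model to roll out
seeds = "1-10"                          # range syntax; a comma-separated list also works
task_url = "https://<modal>.modal.run"  # assumed key mirroring --task-url

[eval.metadata]                         # assumed sub-table for metadata filters
difficulty = "easy"
```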
4. Promote or regress on metrics
- Compare the aggregated results to the baseline SFT job.
- Export traces (`traces/v3/synth_ai.db`) and use them as additional training data if necessary.
- Record judge trends in your experiment tracker.
By default, the command uses the `.env` written by `uvx synth-ai setup`. Pass `--env-file` only when you need to override or layer additional environment files.
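For example (the `eval` subcommand name and the extra file name are assumptions):

```bash
# Layer a staging env file over the default .env from `uvx synth-ai setup`.
uvx synth-ai eval --env-file .env.staging --seeds 1-10
```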