## Baseline run
## Inspect judge metrics
Rollout responses include judge-specific scores (e.g. `outcome`, `events`). The CLI prints them inline; to consume the raw data:
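A minimal sketch of consuming the raw scores, assuming each rollout response is a JSON object with a `judge_scores` field (the exact payload shape and field names here are assumptions, not the documented schema):

```python
import json

# Hypothetical rollout response -- the field names ("reward",
# "judge_scores", "outcome", "events") are illustrative.
sample_response = {
    "trajectory_id": "traj-001",
    "reward": 0.82,
    "judge_scores": {"outcome": 1.0, "events": 0.6},
}

def extract_judge_scores(response: dict) -> dict:
    """Pull the judge-specific scores out of a rollout response."""
    return response.get("judge_scores", {})

scores = extract_judge_scores(sample_response)
print(json.dumps(scores, indent=2))
```

The same loop works over a file of saved responses: load each JSON object, call `extract_judge_scores`, and aggregate per metric.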
## Config-driven eval
Run `uvx synth-ai eval --config configs/eval.toml`. CLI flags still override values from the config, letting agents tweak parameters dynamically.
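A sketch of what `configs/eval.toml` might contain; the section and key names below are assumptions for illustration, so check the synth-ai documentation for the supported schema:

```toml
# Hypothetical eval config -- key names are illustrative, not authoritative.
[eval]
model = "my-checkpoint"   # can be overridden on the CLI with --model
seeds = [0, 1, 2]         # illustrative: which rollout seeds to run
```

Because CLI flags take precedence, the same config can be reused across runs while a flag such as `--model` varies the checkpoint under test.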
## Use results downstream
- Feed high-reward trajectories back into the SFT pipeline using the filter step.
- Compare different checkpoints by running eval with different `--model` values.
- Store judge metrics to detect regressions or plateauing performance.
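The filter step above can be sketched as a simple reward threshold over saved trajectories; the field names (`reward`, `trajectory_id`) and the threshold value are assumptions for illustration:

```python
# Hypothetical filter step: keep only trajectories above a reward
# threshold before feeding them back into the SFT pipeline.
def filter_high_reward(trajectories: list[dict], threshold: float = 0.8) -> list[dict]:
    """Return only the trajectories whose reward meets the threshold."""
    return [t for t in trajectories if t.get("reward", 0.0) >= threshold]

rollouts = [
    {"trajectory_id": "a", "reward": 0.92},
    {"trajectory_id": "b", "reward": 0.41},
]
kept = filter_high_reward(rollouts)
print([t["trajectory_id"] for t in kept])  # -> ['a']
```

Tracking the same metrics across eval runs (one row per checkpoint) is what makes the regression comparison in the last bullet cheap.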
By default, eval reads the `.env` produced during setup. Use `--env-file` only when you need to override the inferred environment.