Evals Demo
Run quick, local evaluations to compare models on the Crafter environment, then filter traces to prepare a fine‑tuning dataset.Final Result - Achievement frequencies
- Have
uv
installed and useuvx
/uv run
OPENAI_API_KEY
exported in your shell- Local tracing and environment service deployed with
uvx synth-ai serve
(auto‑kills port 8901)
- Side‑by‑side model comparison on Crafter with concurrency and timeouts
- Live progress with timeouts and achievements per episode
- Post‑run: filter traces → JSONL for SFT + basic stats
Run the demo
Experiment summary
Live progress (sample)
Results summary
Post‑run analysis (trace filtering)
List achievements present in your tracing DB, then filter traces to JSONL for SFT:ft_data/evals_filtered.jsonl
ready for SFT, and a DB for deeper analysis.