Overview
The typical workflow for judge-based filtering (see the command sketch after this list):
- Collect traces – Run `uvx synth-ai eval` with tracing enabled
- Grade traces – Judges score episodes based on rubrics (quality, task completion, etc.)
- Filter dataset – Use `uvx synth-ai filter` with `min_judge_scores` to keep only high-quality examples
- Train SFT – Feed the filtered JSONL into your fine-tuning job
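In command form, using only the commands covered in the sections below (the database and config paths are the examples used throughout this guide):

```bash
# 1-2: collect traces with tracing enabled; grading runs during eval or offline
uvx synth-ai eval --trace-db traces/v3/synth_ai.db

# 3: keep only high-scoring episodes
uvx synth-ai filter --config filter.toml

# 4: fine-tune on the curated JSONL
uvx synth-ai train --type sft
```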
1. Collecting Traces with Judging
Enable Tracing During Evaluation
- `TASKAPP_TRACING_ENABLED=1` – Enable trace collection
- `TURSO_NATIVE=1` – Use local SQLite/Turso database
- `SQLD_DB_PATH` – Path to trace database (e.g., `traces/v3/synth_ai.db`)
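For example (the database path is just the example path used throughout this guide):

```bash
# Enable tracing for the task app and point it at a local SQLite/Turso database
export TASKAPP_TRACING_ENABLED=1
export TURSO_NATIVE=1
export SQLD_DB_PATH=traces/v3/synth_ai.db

# Run evaluation; traces from this run are written to $SQLD_DB_PATH
uvx synth-ai eval
```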
Traces are stored with:
- Session metadata (model, timestamp, configuration)
- Event history (observations, actions, rewards)
- LM call records (prompts, completions, tool calls)
- Outcome rewards (final scores, achievements)
2. Grading Traces with Judges
Judges evaluate traces using LLM-based rubrics. Synth supports multiple judge providers:
Supported Judge Providers
| Provider | Models | Use Case |
|---|---|---|
| OpenAI | gpt-4o-mini, gpt-4o, o1-mini | High-quality scoring |
| Groq | llama-3.3-70b-versatile | Fast, cost-effective |
| Anthropic | claude-3-5-sonnet-20241022 | Detailed reasoning |
Judge Configuration in Task Apps
Task apps can define rubrics for judges to use; the Crafter task app, for example, ships its own rubric.
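The real Crafter rubric lives in the task app's source; the TOML-style sketch below only illustrates the shape such a configuration can take, and every key name in it is an assumption rather than the documented schema.

```toml
# Illustrative only -- not the actual Crafter task app rubric.
# Section and key names (judge, provider, model, rubric) are assumptions.
[judge]
provider = "openai"
model    = "gpt-4o-mini"

[judge.rubric]
quality         = "Did the agent act coherently and avoid repeated mistakes?"
task_completion = "Did the agent unlock Crafter achievements (collect wood, craft tools, ...)?"
```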
Automatic Judging During Eval
Judges can run automatically during evaluation if configured.
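A sketch of what enabling judges in an eval configuration might look like; the section and key names are assumptions, so check your eval config schema for the exact keys.

```toml
# Sketch: enable automatic judging during `uvx synth-ai eval`.
# Key names are assumptions, not the documented schema.
[judge]
enabled  = true
provider = "groq"
model    = "llama-3.3-70b-versatile"
```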
3. Filtering Traces by Judge Scores
Once traces are collected and graded, use the `filter` command to export only high-quality examples:
Basic Filtering
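A minimal sketch, assuming a `filter.toml` like the one below; only `min_judge_scores` is referenced elsewhere in this guide, and the other keys and the output path are assumptions.

```toml
# filter.toml -- minimal sketch; verify key names against `uvx synth-ai filter --help`
trace_db = "traces/v3/synth_ai.db"     # assumption: where to read graded traces
output   = "data/sft_filtered.jsonl"   # assumption: where to write the filtered JSONL

[min_judge_scores]
primary = 0.7   # keep only episodes whose primary judge score is >= 0.7
```

```bash
uvx synth-ai filter --config filter.toml
```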
Judge Score Schema
Judges return structured scores stored in the trace database.
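An illustrative shape for one stored judge result; only the `primary` score key appears elsewhere in this guide, so treat the other field names as assumptions.

```json
{
  "session_id": "episode-0042",
  "judge_model": "gpt-4o-mini",
  "scores": {
    "primary": 0.82
  },
  "rationale": "Agent completed the task with minor inefficiencies."
}
```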
Advanced Filtering Examples
Filter by multiple criteria:
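For example, combining thresholds on more than one score (the second score name is an assumption; use whatever dimensions your rubric actually emits):

```toml
# filter.toml -- multiple thresholds; an episode must clear all of them
[min_judge_scores]
primary         = 0.8   # overall quality
task_completion = 0.7   # assumption: a second rubric dimension, if your judge produces one
```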
4. Querying Judge Results
You can inspect judge scores directly from the trace database.
View All Judge Scores
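The trace database is plain SQLite, so the `sqlite3` CLI works directly. The table and column names below are assumptions; list the real schema first and adjust the query.

```bash
# Discover the actual schema before querying
sqlite3 traces/v3/synth_ai.db ".tables"
sqlite3 traces/v3/synth_ai.db ".schema"

# Assuming a judge score table keyed by session id (adjust names to the real schema)
sqlite3 traces/v3/synth_ai.db \
  "SELECT session_id, score_name, score_value FROM judge_scores ORDER BY session_id;"
```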
Filter High-Quality Traces
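Same assumptions as above; adjust the table and column names to the actual schema.

```bash
sqlite3 traces/v3/synth_ai.db \
  "SELECT session_id, score_value
   FROM judge_scores
   WHERE score_name = 'primary' AND score_value >= 0.8;"
```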
Export to CSV for Analysis
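sqlite3's built-in CSV mode handles the export; the query reuses the assumed table from the previous examples.

```bash
sqlite3 -header -csv traces/v3/synth_ai.db \
  "SELECT session_id, score_name, score_value FROM judge_scores;" > judge_scores.csv
```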
5. Complete Example: Curated SFT Pipeline
Here’s a complete workflow for building a high-quality SFT dataset using judges.
Step 1: Collect Diverse Traces
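For example, using the environment variables and flags shown earlier in this guide (repeat the run with different models or configurations so the pool covers a range of behaviors):

```bash
# Collect traces with tracing enabled; all runs accumulate in the same database
TASKAPP_TRACING_ENABLED=1 TURSO_NATIVE=1 SQLD_DB_PATH=traces/v3/synth_ai.db \
  uvx synth-ai eval --trace-db traces/v3/synth_ai.db

# Re-run with other models/configs to diversify the trace pool
```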
Step 2: Run Offline Judging (Optional)
If judges weren’t run during eval, you can grade traces separately offline via the SDK.
Step 3: Filter by Quality
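Reusing the `filter.toml` pattern from section 3:

```bash
uvx synth-ai filter --config filter.toml
```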
Step 4: Validate and Train
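A quick sanity check followed by training. The JSONL path is the assumed output path from the earlier filter sketch, and any flags beyond `--type sft` should be taken from `uvx synth-ai train --help`.

```bash
# Spot-check the curated dataset before spending on a training run
wc -l data/sft_filtered.jsonl        # how many examples survived filtering
head -n 1 data/sft_filtered.jsonl    # eyeball one record

# Launch supervised fine-tuning on the filtered data
uvx synth-ai train --type sft
```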
6. Best Practices
Judge Selection
- gpt-4o-mini – Best default choice (fast, accurate, cost-effective)
- gpt-4o – Use for critical datasets where quality matters most
- groq/llama-3.3-70b – Good for large-scale filtering (faster, cheaper)
- o1-mini – Use for complex reasoning tasks
Score Thresholds
| Use Case | Recommended Threshold |
|---|---|
| Initial filtering | primary ≥ 0.6 |
| Standard quality | primary ≥ 0.7 |
| High quality | primary ≥ 0.8 |
| Expert demonstrations | primary ≥ 0.9 |
Cost Optimization
- Sample first – Run judges on a subset to calibrate thresholds
- Use cheaper models – Start with Groq/Llama for bulk filtering
- Cache results – Store judge scores in the database to avoid re-grading
- Batch processing – Grade multiple traces in parallel
Data Quality
- Diverse filtering – Don’t only keep perfect traces; include learning examples
- Balance scores – Mix high-scoring and medium-scoring examples
- Manual review – Spot-check filtered dataset before training
- Iterate – Adjust thresholds based on downstream model performance
Summary
Judge-based filtering workflow:
- ✅ Collect traces with `uvx synth-ai eval --trace-db traces/v3/synth_ai.db`
- ✅ Grade traces automatically (during eval) or offline (via SDK)
- ✅ Filter by scores with `uvx synth-ai filter --config filter.toml`
- ✅ Train SFT on the curated dataset with `uvx synth-ai train --type sft`