Judges are LLM-based evaluators that score traces against rubrics. In the SFT pipeline, judges help you filter and curate training data by grading collected traces before you export them to JSONL datasets. For command references, see Run Evaluations, Filter Traces, and Launch Training Jobs.

Overview

The typical workflow for judge-based filtering:
  1. Collect traces – Run uvx synth-ai eval with tracing enabled
  2. Grade traces – Judges score episodes based on rubrics (quality, task completion, etc.)
  3. Filter dataset – Use uvx synth-ai filter with min_judge_scores to keep only high-quality examples
  4. Train SFT – Feed the filtered JSONL into your fine-tuning job
This ensures your SFT dataset contains only successful, high-quality demonstrations.

1. Collecting Traces with Judging

Enable Tracing During Evaluation

# Run evaluation and store traces in Turso/SQLite
TASKAPP_TRACING_ENABLED=1 \
TURSO_NATIVE=1 \
SQLD_DB_PATH="traces/v3/crafter_eval.db" \
uvx synth-ai eval \
  --app-id grpo-crafter \
  --model Qwen/Qwen3-4B \
  --seeds 1-100 \
  --env-file .env
Key environment variables:
  • TASKAPP_TRACING_ENABLED=1 – Enable trace collection
  • TURSO_NATIVE=1 – Use local SQLite/Turso database
  • SQLD_DB_PATH – Path to trace database (e.g., traces/v3/synth_ai.db)

Traces are stored with:

  • Session metadata (model, timestamp, configuration)
  • Event history (observations, actions, rewards)
  • LM call records (prompts, completions, tool calls)
  • Outcome rewards (final scores, achievements)
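
For a quick sanity check that traces actually landed in the database, you can open it directly. A minimal sketch, assuming the session_traces schema used by the SQL queries in section 4:
import sqlite3

# Open the trace database written during eval (path from SQLD_DB_PATH)
conn = sqlite3.connect("traces/v3/crafter_eval.db")

# List a few stored sessions; session_traces is the table queried in section 4
for session_id, created_at, total_reward in conn.execute(
    "SELECT session_id, created_at, total_reward FROM session_traces LIMIT 5"
):
    print(session_id, created_at, total_reward)

conn.close()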

2. Grading Traces with Judges

Judges evaluate traces using LLM-based rubrics. Synth supports multiple judge providers:

Supported Judge Providers

| Provider  | Models                       | Use Case             |
|-----------|------------------------------|----------------------|
| OpenAI    | gpt-4o-mini, gpt-4o, o1-mini | High-quality scoring |
| Groq      | llama-3.3-70b-versatile      | Fast, cost-effective |
| Anthropic | claude-3-5-sonnet-20241022   | Detailed reasoning   |

Judge Configuration in Task Apps

Task apps can define rubrics for judges to use. Example from the Crafter task app:
# In your task app (RubricBundle and GlobalRubric come from the synth-ai
# task app SDK; import them from wherever your task app defines its rubrics)
rubrics = RubricBundle(
    global_rubric=GlobalRubric(
        id="crafter/bundle@v1",
        description="Evaluate agent performance in Crafter environment",
        criteria=[
            "Achievement progression (collect wood, make tools, etc.)",
            "Survival capability (health management, avoid death)",
            "Exploration behavior (discovering new areas)",
            "Resource efficiency (optimal use of actions)"
        ]
    ),
    instance_rubrics=[]  # Optional per-task rubrics
)

Automatic Judging During Eval

Judges can run automatically during evaluation if configured:
# configs/eval_with_judges.toml
[eval]
app_id = "grpo-crafter"
model = "Qwen/Qwen3-4B"
seeds = [1, 2, 3, 4, 5]
trace_db = "traces/v3/synth_ai.db"

# Judge configuration
[eval.judge]
enabled = true
provider = "openai"              # or "groq", "anthropic"
model = "gpt-4o-mini"            # Judge model
api_key_env = "OPENAI_API_KEY"  # Environment variable for API key
rubric_id = "crafter/bundle@v1" # Task rubric to use
Run with:
uvx synth-ai eval \
  --config configs/eval_with_judges.toml \
  --env-file .env

3. Filtering Traces by Judge Scores

Once traces are collected and graded, use the filter command to export only high-quality examples:

Basic Filtering

# configs/filter_by_judges.toml
[filter]
db = "traces/v3/synth_ai.db"
output = "ft_data/crafter_filtered_sft.jsonl"

# Filter by judge scores
min_judge_scores.primary = 0.7     # Keep episodes with score ≥ 0.7
min_judge_scores.event = 0.6       # Optional: filter by event-level scores
min_judge_scores.outcome = 0.8     # Optional: filter by outcome scores

# Optional additional filters
min_official_score = 0.01          # Environment reward threshold
splits = ["train"]                 # Only training split
limit = 1000                       # Max examples to export
Run the filter:
uvx synth-ai filter \
  --config configs/filter_by_judges.toml \
  --env-file .env

Judge Score Schema

Judges return structured scores stored in the trace database:
{
  "primary": 0.85,           // Overall quality score (0-1)
  "event": 0.78,             // Per-step/event quality
  "outcome": 0.92,           // Final outcome quality
  "reasoning": "Agent successfully...",
  "metadata": {
    "achievements": 5,
    "survival_rate": 0.9,
    "exploration_coverage": 0.6
  }
}
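
To make the threshold semantics concrete, here is roughly the check that min_judge_scores applies to each episode (a sketch, not the CLI's actual implementation; it assumes judge_metadata is stored as a JSON string in the shape above):
import json

def passes_judge_thresholds(judge_metadata: str, min_scores: dict) -> bool:
    # min_scores mirrors the TOML keys, e.g. {"primary": 0.7, "outcome": 0.8}
    scores = json.loads(judge_metadata)
    return all(
        scores.get(key) is not None and scores[key] >= threshold
        for key, threshold in min_scores.items()
    )

# The example schema above passes primary >= 0.7 and outcome >= 0.8
example = '{"primary": 0.85, "event": 0.78, "outcome": 0.92}'
assert passes_judge_thresholds(example, {"primary": 0.7, "outcome": 0.8})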

Advanced Filtering Examples

Filter by multiple criteria:
[filter]
db = "traces/v3/synth_ai.db"
output = "ft_data/high_quality_sft.jsonl"

# Combine judge scores with environment rewards
min_judge_scores.primary = 0.8
min_official_score = 5.0

# Only include specific models
models = ["Qwen/Qwen3-4B", "ft:abc123:2024-09-18"]

# Limit dataset size
limit = 500
Vision-specific filtering:
[filter]
db = "traces/v3/crafter_vision.db"
output = "ft_data/vision_filtered_sft.jsonl"

# Filter vision traces
min_judge_scores.primary = 0.75
task_ids = ["crafter_classic_procedural"]

# Ensure traces have image observations
require_images = true

4. Querying Judge Results

You can inspect judge scores directly from the trace database:

View All Judge Scores

sqlite3 traces/v3/synth_ai.db \
  "SELECT 
    session_id,
    json_extract(judge_metadata, '\$.primary') as primary_score,
    json_extract(judge_metadata, '\$.reasoning') as reasoning,
    total_reward
   FROM session_traces 
   WHERE json_extract(judge_metadata, '\$.primary') IS NOT NULL
   ORDER BY primary_score DESC
   LIMIT 10;"

Filter High-Quality Traces

sqlite3 traces/v3/synth_ai.db \
  "SELECT 
    COUNT(*) as total,
    AVG(json_extract(judge_metadata, '\$.primary')) as avg_score,
    MAX(json_extract(judge_metadata, '\$.primary')) as max_score
   FROM session_traces 
   WHERE json_extract(judge_metadata, '\$.primary') >= 0.7;"

Export to CSV for Analysis

sqlite3 -header -csv traces/v3/synth_ai.db \
  "SELECT 
    session_id,
    created_at,
    json_extract(judge_metadata, '\$.primary') as score,
    total_reward,
    json_extract(judge_metadata, '\$.reasoning') as reasoning
   FROM session_traces 
   WHERE json_extract(judge_metadata, '\$.primary') >= 0.8" \
  > high_quality_traces.csv

5. Complete Example: Curated SFT Pipeline

Here’s a complete workflow for building a high-quality SFT dataset using judges:

Step 1: Collect Diverse Traces

# Collect 1000 episodes with multiple seeds
TASKAPP_TRACING_ENABLED=1 \
TURSO_NATIVE=1 \
SQLD_DB_PATH="traces/v3/crafter_collection.db" \
uvx synth-ai eval \
  --app-id grpo-crafter \
  --model Qwen/Qwen3-4B \
  --seeds 1-1000 \
  --env-file .env

Step 2: Run Offline Judging (Optional)

If judges weren’t run during eval, you can grade traces separately:
# Using the Synth SDK to grade traces
import sqlite3

from synth_ai.judge import JudgeClient

client = JudgeClient(api_key="your-key")

# Load traces from the database; fetchall() so we aren't writing to the
# table while still iterating a live cursor over it
conn = sqlite3.connect("traces/v3/crafter_collection.db")
rows = conn.execute("SELECT session_id, trace_data FROM session_traces").fetchall()

for session_id, trace_data in rows:
    # Grade the trace
    result = client.score_episode(
        trace=trace_data,
        rubric_id="crafter/bundle@v1",
        provider="openai",
        model="gpt-4o-mini"
    )

    # Update the database with judge scores
    conn.execute(
        "UPDATE session_traces SET judge_metadata = ? WHERE session_id = ?",
        (result.to_json(), session_id)
    )

# Commit once at the end rather than per row
conn.commit()
conn.close()

Step 3: Filter by Quality

# configs/filter_top_quality.toml
[filter]
db = "traces/v3/crafter_collection.db"
output = "ft_data/crafter_curated_sft.jsonl"

# Only keep top 20% by judge scores
min_judge_scores.primary = 0.8
min_judge_scores.outcome = 0.75

# Also require positive environment rewards
min_official_score = 2.0

# Cap at 200 examples
limit = 200
uvx synth-ai filter \
  --config configs/filter_top_quality.toml \
  --env-file .env

Step 4: Validate and Train

# Inspect filtered dataset
head -n 3 ft_data/crafter_curated_sft.jsonl | jq

# Run SFT training
uvx synth-ai train \
  --type sft \
  --config configs/sft_crafter.toml \
  --dataset ft_data/crafter_curated_sft.jsonl \
  --env-file .env \
  --poll

6. Best Practices

Judge Selection

  • gpt-4o-mini – Best default choice (fast, accurate, cost-effective)
  • gpt-4o – Use for critical datasets where quality matters most
  • groq/llama-3.3-70b – Good for large-scale filtering (faster, cheaper)
  • o1-mini – Use for complex reasoning tasks

Score Thresholds

| Use Case              | Recommended Threshold |
|-----------------------|-----------------------|
| Initial filtering     | primary ≥ 0.6         |
| Standard quality      | primary ≥ 0.7         |
| High quality          | primary ≥ 0.8         |
| Expert demonstrations | primary ≥ 0.9         |
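
Before committing to a threshold, look at the score distribution in your own data. A calibration sketch, again assuming the session_traces schema from section 4:
import json
import sqlite3
import statistics

conn = sqlite3.connect("traces/v3/synth_ai.db")

# Collect all recorded primary judge scores
scores = []
for (raw,) in conn.execute(
    "SELECT judge_metadata FROM session_traces WHERE judge_metadata IS NOT NULL"
):
    primary = json.loads(raw).get("primary")
    if primary is not None:
        scores.append(primary)
conn.close()

# Deciles of the score distribution; deciles[7] is the 80th percentile,
# i.e. the cutoff that keeps roughly the top 20% of episodes
deciles = statistics.quantiles(scores, n=10)
print(f"n={len(scores)}  median={deciles[4]:.2f}  top-20% cutoff={deciles[7]:.2f}")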

Cost Optimization

  1. Sample first – Run judges on a subset to calibrate thresholds
  2. Use cheaper models – Start with Groq/Llama for bulk filtering
  3. Cache results – Store judge scores in the database to avoid re-grading
  4. Batch processing – Grade multiple traces in parallel (see the sketch below)
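
A parallel-grading sketch using the JudgeClient from step 2 of the complete example (the worker count and keeping all database writes on the main thread are assumptions; tune workers to your provider's rate limits):
from concurrent.futures import ThreadPoolExecutor, as_completed
import sqlite3

from synth_ai.judge import JudgeClient

client = JudgeClient(api_key="your-key")

conn = sqlite3.connect("traces/v3/crafter_collection.db")
rows = conn.execute("SELECT session_id, trace_data FROM session_traces").fetchall()

def grade(row):
    session_id, trace_data = row
    result = client.score_episode(
        trace=trace_data,
        rubric_id="crafter/bundle@v1",
        provider="openai",
        model="gpt-4o-mini",
    )
    return session_id, result.to_json()

# Grade in worker threads, but keep all sqlite writes on the main thread
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(grade, row) for row in rows]
    for future in as_completed(futures):
        session_id, judge_json = future.result()
        conn.execute(
            "UPDATE session_traces SET judge_metadata = ? WHERE session_id = ?",
            (judge_json, session_id),
        )
conn.commit()
conn.close()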

Data Quality

  1. Diverse filtering – Don’t only keep perfect traces; include learning examples
  2. Balance scores – Mix high-scoring and medium-scoring examples (see the sketch after this list)
  3. Manual review – Spot-check filtered dataset before training
  4. Iterate – Adjust thresholds based on downstream model performance
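
One way to implement points 1 and 2 is to export tranches at different thresholds and mix them. A sketch (the tranche files and 3:1 ratio are illustrative, produced by two separate filter runs):
import json
import random

# Hypothetical tranche files from two filter runs at different thresholds
with open("ft_data/high_sft.jsonl") as f:      # e.g. primary >= 0.8
    high = [json.loads(line) for line in f]
with open("ft_data/medium_sft.jsonl") as f:    # e.g. primary >= 0.6
    medium = [json.loads(line) for line in f]

# Roughly a 3:1 high-to-medium mix, shuffled
random.seed(0)
mixed = high + random.sample(medium, k=min(len(medium), len(high) // 3))
random.shuffle(mixed)

with open("ft_data/balanced_sft.jsonl", "w") as f:
    for example in mixed:
        f.write(json.dumps(example) + "\n")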

Summary

Judge-based filtering workflow:
  1. ✅ Collect traces with uvx synth-ai eval (with TASKAPP_TRACING_ENABLED=1 and SQLD_DB_PATH set)
  2. ✅ Grade traces automatically (during eval) or offline (via SDK)
  3. ✅ Filter by scores with uvx synth-ai filter --config filter.toml
  4. ✅ Train SFT on curated dataset with uvx synth-ai train --type sft
Judges transform raw rollout traces into high-quality SFT datasets, ensuring your fine-tuned models learn from the best demonstrations.