Judges are LLM-based evaluators that score traces against rubrics. In the SFT pipeline, judges help you filter and curate training data by grading collected traces before you export them to JSONL datasets. For command references, see Run Evaluations, Filter Traces, and Launch Training Jobs.

Overview

The typical workflow for judge-based filtering:
  1. Collect traces – Run uvx synth-ai eval with tracing enabled
  2. Grade traces – Judges score episodes based on rubrics (quality, task completion, etc.)
  3. Filter dataset – Use uvx synth-ai filter with min_judge_scores to keep only high-quality examples
  4. Train SFT – Feed the filtered JSONL into your fine-tuning job
This ensures your SFT dataset contains only successful, high-quality demonstrations.

1. Collecting Traces with Judging

Enable Tracing During Evaluation

# Run evaluation and store traces in Turso/SQLite
TASKAPP_TRACING_ENABLED=1 \
TURSO_NATIVE=1 \
SQLD_DB_PATH="traces/v3/crafter_eval.db" \
uvx synth-ai eval \
  --app-id grpo-crafter \
  --model Qwen/Qwen3-4B \
  --seeds 1-100 \
  --env-file .env
Key environment variables:
  • TASKAPP_TRACING_ENABLED=1 – Enable trace collection
  • TURSO_NATIVE=1 – Use local SQLite/Turso database
  • SQLD_DB_PATH – Path to trace database (e.g., traces/v3/synth_ai.db)

Traces are stored with:

  • Session metadata (model, timestamp, configuration)
  • Event history (observations, actions, rewards)
  • LM call records (prompts, completions, tool calls)
  • Outcome rewards (final scores, achievements)
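
For a quick sanity check that traces actually landed in the database, you can open it directly. A minimal sketch, assuming the session_traces schema used by the SQL queries in section 4:
import sqlite3

# Open the trace database written during eval (path from SQLD_DB_PATH)
conn = sqlite3.connect("traces/v3/crafter_eval.db")

# List a few stored sessions; session_traces is the table queried in section 4
for session_id, created_at, total_reward in conn.execute(
    "SELECT session_id, created_at, total_reward FROM session_traces LIMIT 5"
):
    print(session_id, created_at, total_reward)

conn.close()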

2. Grading Traces with Judges

Judges evaluate traces using LLM-based rubrics. Synth supports multiple judge providers:

Supported Judge Providers

| Provider  | Models                       | Use Case             |
|-----------|------------------------------|----------------------|
| OpenAI    | gpt-4o-mini, gpt-4o, o1-mini | High-quality scoring |
| Groq      | llama-3.3-70b-versatile      | Fast, cost-effective |
| Anthropic | claude-3-5-sonnet-20241022   | Detailed reasoning   |

Judge Configuration in Task Apps

Task apps can define rubrics for judges to use. Example from the Crafter task app:
# In your task app (RubricBundle and GlobalRubric come from the synth-ai
# task app SDK; import them from wherever your task app defines its rubrics)
rubrics = RubricBundle(
    global_rubric=GlobalRubric(
        id="crafter/bundle@v1",
        description="Evaluate agent performance in Crafter environment",
        criteria=[
            "Achievement progression (collect wood, make tools, etc.)",
            "Survival capability (health management, avoid death)",
            "Exploration behavior (discovering new areas)",
            "Resource efficiency (optimal use of actions)"
        ]
    ),
    instance_rubrics=[]  # Optional per-task rubrics
)

Automatic Judging During Eval

Judges can run automatically during evaluation if configured:
# configs/eval_with_judges.toml
[eval]
app_id = "grpo-crafter"
model = "Qwen/Qwen3-4B"
seeds = [1, 2, 3, 4, 5]
trace_db = "traces/v3/synth_ai.db"

# Judge configuration
[eval.judge]
enabled = true
provider = "openai"              # or "groq", "anthropic"
model = "gpt-4o-mini"            # Judge model
api_key_env = "OPENAI_API_KEY"  # Environment variable for API key
rubric_id = "crafter/bundle@v1" # Task rubric to use
Run with:
uvx synth-ai eval \
  --config configs/eval_with_judges.toml \
  --env-file .env

3. Filtering Traces by Judge Scores

Once traces are collected and graded, use the filter command to export only high-quality examples:

Basic Filtering

# configs/filter_by_judges.toml
[filter]
db = "traces/v3/synth_ai.db"
output = "ft_data/crafter_filtered_sft.jsonl"

# Filter by judge scores
min_judge_scores.primary = 0.7     # Keep episodes with score ≥ 0.7
min_judge_scores.event = 0.6       # Optional: filter by event-level scores
min_judge_scores.outcome = 0.8     # Optional: filter by outcome scores

# Optional additional filters
min_official_score = 0.01          # Environment reward threshold
splits = ["train"]                 # Only training split
limit = 1000                       # Max examples to export
Run the filter:
uvx synth-ai filter \
  --config configs/filter_by_judges.toml \
  --env-file .env

Judge Score Schema

Judges return structured scores stored in the trace database:
{
  "primary": 0.85,           // Overall quality score (0-1)
  "event": 0.78,             // Per-step/event quality
  "outcome": 0.92,           // Final outcome quality
  "reasoning": "Agent successfully...",
  "metadata": {
    "achievements": 5,
    "survival_rate": 0.9,
    "exploration_coverage": 0.6
  }
}
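
To make the threshold semantics concrete, here is roughly the check that min_judge_scores applies to each episode (a sketch, not the CLI's actual implementation; it assumes judge_metadata is stored as a JSON string in the shape above):
import json

def passes_judge_thresholds(judge_metadata: str, min_scores: dict) -> bool:
    # min_scores mirrors the TOML keys, e.g. {"primary": 0.7, "outcome": 0.8}
    scores = json.loads(judge_metadata)
    return all(
        scores.get(key) is not None and scores[key] >= threshold
        for key, threshold in min_scores.items()
    )

# The example schema above passes primary >= 0.7 and outcome >= 0.8
example = '{"primary": 0.85, "event": 0.78, "outcome": 0.92}'
assert passes_judge_thresholds(example, {"primary": 0.7, "outcome": 0.8})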

Advanced Filtering Examples

Filter by multiple criteria:
[filter]
db = "traces/v3/synth_ai.db"
output = "ft_data/high_quality_sft.jsonl"

# Combine judge scores with environment rewards
min_judge_scores.primary = 0.8
min_official_score = 5.0

# Only include specific models
models = ["Qwen/Qwen3-4B", "ft:abc123:2024-09-18"]

# Limit dataset size
limit = 500
Vision-specific filtering:
[filter]
db = "traces/v3/crafter_vision.db"
output = "ft_data/vision_filtered_sft.jsonl"

# Filter vision traces
min_judge_scores.primary = 0.75
task_ids = ["crafter_classic_procedural"]

# Ensure traces have image observations
require_images = true

4. Querying Judge Results

You can inspect judge scores directly from the trace database:

View All Judge Scores

sqlite3 traces/v3/synth_ai.db \
  "SELECT 
    session_id,
    json_extract(judge_metadata, '\$.primary') as primary_score,
    json_extract(judge_metadata, '\$.reasoning') as reasoning,
    total_reward
   FROM session_traces 
   WHERE json_extract(judge_metadata, '\$.primary') IS NOT NULL
   ORDER BY primary_score DESC
   LIMIT 10;"

Filter High-Quality Traces

sqlite3 traces/v3/synth_ai.db \
  "SELECT 
    COUNT(*) as total,
    AVG(json_extract(judge_metadata, '\$.primary')) as avg_score,
    MAX(json_extract(judge_metadata, '\$.primary')) as max_score
   FROM session_traces 
   WHERE json_extract(judge_metadata, '\$.primary') >= 0.7;"

Export to CSV for Analysis

sqlite3 -header -csv traces/v3/synth_ai.db \
  "SELECT 
    session_id,
    created_at,
    json_extract(judge_metadata, '\$.primary') as score,
    total_reward,
    json_extract(judge_metadata, '\$.reasoning') as reasoning
   FROM session_traces 
   WHERE json_extract(judge_metadata, '\$.primary') >= 0.8" \
  > high_quality_traces.csv

5. Complete Example: Curated SFT Pipeline

Here’s a complete workflow for building a high-quality SFT dataset using judges:

Step 1: Collect Diverse Traces

# Collect 1000 episodes with multiple seeds
TASKAPP_TRACING_ENABLED=1 \
TURSO_NATIVE=1 \
SQLD_DB_PATH="traces/v3/crafter_collection.db" \
uvx synth-ai eval \
  --app-id grpo-crafter \
  --model Qwen/Qwen3-4B \
  --seeds 1-1000 \
  --env-file .env

Step 2: Run Offline Judging (Optional)

If judges weren’t run during eval, you can grade traces separately:
# Using the Synth SDK to grade traces
import sqlite3

from synth_ai.judge import JudgeClient

client = JudgeClient(api_key="your-key")

# Load traces from the database; fetchall() so we aren't writing to the
# table while still iterating a live cursor over it
conn = sqlite3.connect("traces/v3/crafter_collection.db")
rows = conn.execute("SELECT session_id, trace_data FROM session_traces").fetchall()

for session_id, trace_data in rows:
    # Grade the trace
    result = client.score_episode(
        trace=trace_data,
        rubric_id="crafter/bundle@v1",
        provider="openai",
        model="gpt-4o-mini"
    )

    # Update the database with judge scores
    conn.execute(
        "UPDATE session_traces SET judge_metadata = ? WHERE session_id = ?",
        (result.to_json(), session_id)
    )

# Commit once at the end rather than per row
conn.commit()
conn.close()

Step 3: Filter by Quality

# configs/filter_top_quality.toml
[filter]
db = "traces/v3/crafter_collection.db"
output = "ft_data/crafter_curated_sft.jsonl"

# Only keep top 20% by judge scores
min_judge_scores.primary = 0.8
min_judge_scores.outcome = 0.75

# Also require positive environment rewards
min_official_score = 2.0

# Cap at 200 examples
limit = 200
uvx synth-ai filter \
  --config configs/filter_top_quality.toml \
  --env-file .env

Step 4: Validate and Train

# Inspect filtered dataset
head -n 3 ft_data/crafter_curated_sft.jsonl | jq

# Run SFT training
uvx synth-ai train \
  --type sft \
  --config configs/sft_crafter.toml \
  --dataset ft_data/crafter_curated_sft.jsonl \
  --env-file .env \
  --poll

6. Best Practices

Judge Selection

  • gpt-4o-mini – Best default choice (fast, accurate, cost-effective)
  • gpt-4o – Use for critical datasets where quality matters most
  • groq/llama-3.3-70b – Good for large-scale filtering (faster, cheaper)
  • o1-mini – Use for complex reasoning tasks

Score Thresholds

| Use Case              | Recommended Threshold |
|-----------------------|-----------------------|
| Initial filtering     | primary ≥ 0.6         |
| Standard quality      | primary ≥ 0.7         |
| High quality          | primary ≥ 0.8         |
| Expert demonstrations | primary ≥ 0.9         |
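
Before committing to a threshold, look at the score distribution in your own data. A calibration sketch, again assuming the session_traces schema from section 4:
import json
import sqlite3
import statistics

conn = sqlite3.connect("traces/v3/synth_ai.db")

# Collect all recorded primary judge scores
scores = []
for (raw,) in conn.execute(
    "SELECT judge_metadata FROM session_traces WHERE judge_metadata IS NOT NULL"
):
    primary = json.loads(raw).get("primary")
    if primary is not None:
        scores.append(primary)
conn.close()

# Deciles of the score distribution; deciles[7] is the 80th percentile,
# i.e. the cutoff that keeps roughly the top 20% of episodes
deciles = statistics.quantiles(scores, n=10)
print(f"n={len(scores)}  median={deciles[4]:.2f}  top-20% cutoff={deciles[7]:.2f}")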

Cost Optimization

  1. Sample first – Run judges on a subset to calibrate thresholds
  2. Use cheaper models – Start with Groq/Llama for bulk filtering
  3. Cache results – Store judge scores in the database to avoid re-grading
  4. Batch processing – Grade multiple traces in parallel (see the sketch below)
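
A parallel-grading sketch using the JudgeClient from step 2 of the complete example (the worker count and keeping all database writes on the main thread are assumptions; tune workers to your provider's rate limits):
from concurrent.futures import ThreadPoolExecutor, as_completed
import sqlite3

from synth_ai.judge import JudgeClient

client = JudgeClient(api_key="your-key")

conn = sqlite3.connect("traces/v3/crafter_collection.db")
rows = conn.execute("SELECT session_id, trace_data FROM session_traces").fetchall()

def grade(row):
    session_id, trace_data = row
    result = client.score_episode(
        trace=trace_data,
        rubric_id="crafter/bundle@v1",
        provider="openai",
        model="gpt-4o-mini",
    )
    return session_id, result.to_json()

# Grade in worker threads, but keep all sqlite writes on the main thread
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(grade, row) for row in rows]
    for future in as_completed(futures):
        session_id, judge_json = future.result()
        conn.execute(
            "UPDATE session_traces SET judge_metadata = ? WHERE session_id = ?",
            (judge_json, session_id),
        )
conn.commit()
conn.close()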

Data Quality

  1. Diverse filtering – Don’t only keep perfect traces; include learning examples
  2. Balance scores – Mix high-scoring and medium-scoring examples (see the sketch after this list)
  3. Manual review – Spot-check filtered dataset before training
  4. Iterate – Adjust thresholds based on downstream model performance
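
One way to implement points 1 and 2 is to export tranches at different thresholds and mix them. A sketch (the tranche files and 3:1 ratio are illustrative, produced by two separate filter runs):
import json
import random

# Hypothetical tranche files from two filter runs at different thresholds
with open("ft_data/high_sft.jsonl") as f:      # e.g. primary >= 0.8
    high = [json.loads(line) for line in f]
with open("ft_data/medium_sft.jsonl") as f:    # e.g. primary >= 0.6
    medium = [json.loads(line) for line in f]

# Roughly a 3:1 high-to-medium mix, shuffled
random.seed(0)
mixed = high + random.sample(medium, k=min(len(medium), len(high) // 3))
random.shuffle(mixed)

with open("ft_data/balanced_sft.jsonl", "w") as f:
    for example in mixed:
        f.write(json.dumps(example) + "\n")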

Summary

Judge-based filtering workflow:
  1. ✅ Collect traces with uvx synth-ai eval (with TASKAPP_TRACING_ENABLED=1 and SQLD_DB_PATH set)
  2. ✅ Grade traces automatically (during eval) or offline (via SDK)
  3. ✅ Filter by scores with uvx synth-ai filter --config filter.toml
  4. ✅ Train SFT on curated dataset with uvx synth-ai train --type sft
Judges transform raw rollout traces into high-quality SFT datasets, ensuring your fine-tuned models learn from the best demonstrations.