This cookbook shows how to train a verifier graph to evaluate RL agent performance in the Crafter environment. The verifier learns to judge execution traces and assign scores that correlate with human evaluation.

The Task

Evaluate Crafter agent traces and produce scores:
Input: a V3 trace of an agent playing Crafter
Output: {"score": 0.85, "reasoning": "Agent achieved 4 of 22 achievements..."}

Why Train a Verifier?

Instead of calling an expensive frontier model (GPT-4, Claude) to judge every agent trace, you can:
  1. Train once on human-labeled traces
  2. Run cheaply on GPT-4o-mini or Groq
  3. Get consistent evaluation across runs
  4. Use the verifier as a reward signal for RL training

Dataset Format

A Crafter verifier dataset contains V3 traces with gold scores:
{
  "version": "1.0",
  "metadata": {
    "name": "crafter-verifier-training",
    "description": "Training data for Crafter episode judge",
    "default_rubric": {
      "outcome": {
        "criteria": [
          {
            "name": "achievement_progression",
            "description": "How many achievements did the agent unlock?",
            "weight": 2.0
          },
          {
            "name": "survival",
            "description": "Did the agent maintain health, food, and drink?",
            "weight": 1.0
          },
          {
            "name": "resource_collection",
            "description": "Did the agent collect useful resources?",
            "weight": 1.0
          }
        ]
      }
    }
  },
  "tasks": [
    {
      "id": "trace_001",
      "input": {
        "trace": {
          "session_id": "crafter_episode_001",
          "metadata": {
            "environment": "CrafterClassic",
            "difficulty": "easy",
            "seed": 1000
          },
          "session_time_steps": [
            {
              "step_id": "step_0",
              "step_index": 0,
              "events": [
                {
                  "event_id": 1,
                  "event_type": "environment",
                  "metadata": {
                    "hp": 9,
                    "food": 9,
                    "drink": 9,
                    "achievements": 0
                  }
                },
                {
                  "event_id": 2,
                  "event_type": "runtime",
                  "metadata": {
                    "action": "move_right",
                    "reasoning": "Exploring to find resources"
                  }
                }
              ]
            },
            {
              "step_id": "step_1",
              "step_index": 1,
              "events": [
                {
                  "event_id": 3,
                  "event_type": "environment",
                  "metadata": {
                    "hp": 9,
                    "food": 8,
                    "drink": 8,
                    "achievements": 1,
                    "new_achievement": "collect_wood"
                  }
                },
                {
                  "event_id": 4,
                  "event_type": "runtime",
                  "metadata": {
                    "action": "do",
                    "reasoning": "Collecting wood for crafting"
                  }
                }
              ]
            }
          ]
        }
      }
    }
  ],
  "gold_outputs": [
    {
      "task_id": "trace_001",
      "output": {
        "score": 0.133,
        "event_rewards": [
          {"event_id": 2, "value": 0.5, "annotation": {"reason": "Good exploration"}},
          {"event_id": 4, "value": 1.0, "annotation": {"reason": "Achieved collect_wood"}}
        ],
        "outcome": {
          "achievements_count": 3,
          "total_steps": 100,
          "annotation": {"summary": "Basic early-game progress"}
        }
      }
    }
  ],
  "judge_config": {
    "mode": "rubric",
    "model": "gpt-4o-mini"
  }
}
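
Building this file by hand gets error-prone past a handful of traces. Below is a minimal assembly sketch, assuming you already have an iterable of (V3 trace, gold score) pairs and a rubric dict; `build_dataset` is a hypothetical helper written for this cookbook, not part of synth-ai:

import json

def build_dataset(traces_and_scores, rubric, path):
    """Assemble a verifier dataset from (V3 trace, gold score) pairs."""
    tasks, gold_outputs = [], []
    for i, (trace, score) in enumerate(traces_and_scores, start=1):
        task_id = f"trace_{i:03d}"
        tasks.append({"id": task_id, "input": {"trace": trace}})
        gold_outputs.append({"task_id": task_id, "output": {"score": score}})

    dataset = {
        "version": "1.0",
        "metadata": {
            "name": "crafter-verifier-training",
            "description": "Training data for Crafter episode judge",
            "default_rubric": rubric,
        },
        "tasks": tasks,
        "gold_outputs": gold_outputs,
        "judge_config": {"mode": "rubric", "model": "gpt-4o-mini"},
    }
    with open(path, "w") as f:
        json.dump(dataset, f, indent=2)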

Key Dataset Requirements

Verifier datasets have specific requirements. Missing these will cause training to fail.

1. Task Inputs Must Have Traces

Each task input must contain a trace field with a V3 SessionTrace:
{
  "id": "trace_001",
  "input": {
    "trace": {
      "session_id": "...",
      "session_time_steps": [...],
      "metadata": {...}
    }
  }
}

2. Gold Outputs Must Have Scores

Every gold output must include a score field (float, 0-1):
{
  "task_id": "trace_001",
  "output": {
    "score": 0.133
  }
}

3. Events Need Integer IDs

Each event in the trace must have an integer event_id for linking rewards:
{
  "event_id": 1,
  "event_type": "runtime",
  "metadata": {"action": "move_right"}
}
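
It is worth checking all three requirements locally before submitting a job, since a malformed dataset only fails after upload. A minimal pre-flight sketch (`validate_dataset` is a hypothetical helper, not a synth-ai API):

import json

def validate_dataset(path):
    """Check the three dataset requirements described above."""
    with open(path) as f:
        data = json.load(f)
    errors = []

    # 1. Every task input must contain a V3 trace
    for task in data.get("tasks", []):
        trace = task.get("input", {}).get("trace")
        if not trace or "session_time_steps" not in trace:
            errors.append(f"task {task.get('id')}: missing input.trace with session_time_steps")
            continue
        # 3. Every event needs an integer event_id
        for step in trace["session_time_steps"]:
            for event in step.get("events", []):
                if not isinstance(event.get("event_id"), int):
                    errors.append(f"task {task.get('id')}, step {step.get('step_id')}: non-integer event_id")

    # 2. Every gold output needs a float score in [0, 1]
    for gold in data.get("gold_outputs", []):
        score = gold.get("output", {}).get("score")
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            errors.append(f"gold output {gold.get('task_id')}: score must be a float in [0, 1]")

    return errors

for error in validate_dataset("crafter_verifier_dataset.json"):
    print(error)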

Training the Verifier

from synth_ai.sdk.api.train.adas import ADASJob

job = ADASJob.from_dataset(
    "crafter_verifier_dataset.json",
    graph_type="verifier",  # Critical: specify verifier type
    policy_model="gpt-4o-mini",
    rollout_budget=200,
)
job.submit()

for event in job.stream_events():
    if event["type"] == "generation_complete":
        print(f"Gen {event['data']['generation']}: MAE {event['data']['best_score']:.3f}")

result = job.wait()
print(f"Final correlation: {result['best_score']:.3f}")

Using the Trained Verifier

Evaluate a Single Trace

verifier = ADASJob.from_existing("adas_verifier_xyz", api_key=api_key)

result = verifier.run_judge(
    session_trace={
        "session_id": "new_episode_001",
        "session_time_steps": [...],  # Your agent's trace
        "metadata": {"environment": "CrafterClassic"}
    },
    context={
        "rubric": {
            "outcome": {
                "criteria": [
                    {"name": "achievement_progression", "description": "Achievements unlocked", "weight": 2.0},
                    {"name": "survival", "description": "Health/food/drink management", "weight": 1.0}
                ]
            }
        }
    }
)

print(f"Score: {result['score']:.2f}")
print(f"Reasoning: {result['reasoning']}")
print(f"Event rewards: {result['event_rewards']}")

Use as RL Reward Signal

The verifier output integrates directly with synth-ai tracing:
from synth_ai.core.tracing_v3 import SessionTracer

# After your agent completes an episode (tracer is your SessionTracer instance)
trace = tracer.get_session_trace()

# Get verifier judgement
judgement = verifier.run_judge(
    session_trace=trace,
    context={"rubric": crafter_rubric}
)

# Attach rewards for RL training
for event_reward in judgement["event_rewards"]:
    await tracer.record_event_reward(
        event_id=event_reward["event_id"],
        reward_value=event_reward["value"]
    )

await tracer.record_outcome_reward(
    reward_value=judgement["score"],
    feedback=judgement["reasoning"]
)

Crafter Scoring Rubric

The example rubric evaluates Crafter agents on 5 dimensions:
Criterion                  Weight   Description
Achievement Progression    35%      Late-game achievements (iron tools, furnace) score higher
Resource Stockpile         20%      Inventory quality (>20 wood = high score)
Survival State             20%      Health, food, and drink above 50%
Failure Analysis           15%      How well the agent mitigated death risk
Future Readiness           10%      Preparation for next objectives
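
To use these five criteria at judge time, express them in the same criteria/weight structure as the default_rubric shown earlier. A sketch (the fractional weights mirror the percentages above; the criterion names are illustrative):

crafter_rubric = {
    "outcome": {
        "criteria": [
            {"name": "achievement_progression",
             "description": "Late-game achievements (iron tools, furnace) score higher",
             "weight": 0.35},
            {"name": "resource_stockpile",
             "description": "Inventory quality (>20 wood = high score)",
             "weight": 0.20},
            {"name": "survival_state",
             "description": "Health, food, and drink above 50%",
             "weight": 0.20},
            {"name": "failure_analysis",
             "description": "How well the agent mitigated death risk",
             "weight": 0.15},
            {"name": "future_readiness",
             "description": "Preparation for next objectives",
             "weight": 0.10},
        ]
    }
}

This dict has the shape expected by the context={"rubric": ...} argument of run_judge, and is the crafter_rubric referenced in the RL reward example above.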

Trace Compression

Crafter traces can be large (~30KB/step with images). For training, compress to essential data:
def compress_trace(trace):
    """Strip heavy data, keep essential state."""
    compressed = {
        "session_id": trace["session_id"],
        "metadata": trace["metadata"],
        "session_time_steps": []
    }

    for step in trace["session_time_steps"]:
        compressed_step = {
            "step_id": step["step_id"],
            "step_index": step["step_index"],
            "events": []
        }

        for event in step["events"]:
            compressed_event = {
                "event_id": event["event_id"],
                "event_type": event["event_type"],
                "metadata": {
                    k: v for k, v in event.get("metadata", {}).items()
                    if k not in ["image", "observation", "raw_state"]
                }
            }
            compressed_step["events"].append(compressed_event)

        compressed["session_time_steps"].append(compressed_step)

    return compressed
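
Compression composes naturally with the judge call, for example:

# Compress before judging to cut payload size and token cost
compressed = compress_trace(tracer.get_session_trace())
judgement = verifier.run_judge(
    session_trace=compressed,
    context={"rubric": crafter_rubric},
)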

Configuration Tips

For Better Correlation with Human Scores

job = ADASJob.from_dataset(
    "crafter_verifier_dataset.json",
    graph_type="verifier",
    policy_model="gpt-4o",        # Stronger reasoning
    rollout_budget=500,            # More optimization
    graph_structure="dag",         # Multi-step evaluation
)

For Lower Inference Cost

job = ADASJob.from_dataset(
    "crafter_verifier_dataset.json",
    graph_type="verifier",
    policy_model="llama-3.1-8b-instant",  # Groq for speed
    graph_structure="single_prompt",
)