This cookbook shows how to train a verifier graph to evaluate RL agent performance in the Crafter environment. The verifier learns to judge execution traces and assign scores that correlate with human evaluation.

The Task

Evaluate Crafter agent traces and produce scores:
Input: a V3 trace of an agent playing Crafter
Output: {"score": 0.85, "reasoning": "Agent achieved 4 of 22 achievements..."}

Why Train a Verifier?

Instead of calling an expensive frontier model (GPT-4, Claude) to judge every agent trace, you can:
  1. Train once on human-labeled traces
  2. Run cheaply on GPT-4o-mini or Groq
  3. Get consistent evaluation across runs
  4. Use the verifier as a reward signal for RL training

Dataset Format

A Crafter verifier dataset contains V3 traces with gold scores:
{
  "version": "1.0",
  "metadata": {
    "name": "crafter-verifier-training",
    "description": "Training data for Crafter episode judge",
    "default_rubric": {
      "outcome": {
        "criteria": [
          {
            "name": "achievement_progression",
            "description": "How many achievements did the agent unlock?",
            "weight": 2.0
          },
          {
            "name": "survival",
            "description": "Did the agent maintain health, food, and drink?",
            "weight": 1.0
          },
          {
            "name": "resource_collection",
            "description": "Did the agent collect useful resources?",
            "weight": 1.0
          }
        ]
      }
    }
  },
  "tasks": [
    {
      "id": "trace_001",
      "input": {
        "trace": {
          "session_id": "crafter_episode_001",
          "metadata": {
            "environment": "CrafterClassic",
            "difficulty": "easy",
            "seed": 1000
          },
          "session_time_steps": [
            {
              "step_id": "step_0",
              "step_index": 0,
              "events": [
                {
                  "event_id": 1,
                  "event_type": "environment",
                  "metadata": {
                    "hp": 9,
                    "food": 9,
                    "drink": 9,
                    "achievements": 0
                  }
                },
                {
                  "event_id": 2,
                  "event_type": "runtime",
                  "metadata": {
                    "action": "move_right",
                    "reasoning": "Exploring to find resources"
                  }
                }
              ]
            },
            {
              "step_id": "step_1",
              "step_index": 1,
              "events": [
                {
                  "event_id": 3,
                  "event_type": "environment",
                  "metadata": {
                    "hp": 9,
                    "food": 8,
                    "drink": 8,
                    "achievements": 1,
                    "new_achievement": "collect_wood"
                  }
                },
                {
                  "event_id": 4,
                  "event_type": "runtime",
                  "metadata": {
                    "action": "do",
                    "reasoning": "Collecting wood for crafting"
                  }
                }
              ]
            }
          ]
        }
      }
    }
  ],
  "gold_outputs": [
    {
      "task_id": "trace_001",
      "output": {
        "score": 0.133,
        "event_rewards": [
          {"event_id": 2, "value": 0.5, "annotation": {"reason": "Good exploration"}},
          {"event_id": 4, "value": 1.0, "annotation": {"reason": "Achieved collect_wood"}}
        ],
        "outcome": {
          "achievements_count": 3,
          "total_steps": 100,
          "annotation": {"summary": "Basic early-game progress"}
        }
      }
    }
  ],
  "judge_config": {
    "mode": "rubric",
    "model": "gpt-4o-mini"
  }
}
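
Building this file by hand gets error-prone past a handful of traces. Below is a minimal assembly sketch, assuming you already have an iterable of (V3 trace, gold score) pairs and a rubric dict; `build_dataset` is a hypothetical helper written for this cookbook, not part of synth-ai:

import json

def build_dataset(traces_and_scores, rubric, path):
    """Assemble a verifier dataset from (V3 trace, gold score) pairs."""
    tasks, gold_outputs = [], []
    for i, (trace, score) in enumerate(traces_and_scores, start=1):
        task_id = f"trace_{i:03d}"
        tasks.append({"id": task_id, "input": {"trace": trace}})
        gold_outputs.append({"task_id": task_id, "output": {"score": score}})

    dataset = {
        "version": "1.0",
        "metadata": {
            "name": "crafter-verifier-training",
            "description": "Training data for Crafter episode judge",
            "default_rubric": rubric,
        },
        "tasks": tasks,
        "gold_outputs": gold_outputs,
        "judge_config": {"mode": "rubric", "model": "gpt-4o-mini"},
    }
    with open(path, "w") as f:
        json.dump(dataset, f, indent=2)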

Key Dataset Requirements

Verifier datasets have specific requirements. Missing these will cause training to fail.

1. Task Inputs Must Have Traces

Each task input must contain a trace field with a V3 SessionTrace:
{
  "id": "trace_001",
  "input": {
    "trace": {
      "session_id": "...",
      "session_time_steps": [...],
      "metadata": {...}
    }
  }
}

2. Gold Outputs Must Have Scores

Every gold output must include a score field (float, 0-1):
{
  "task_id": "trace_001",
  "output": {
    "score": 0.133
  }
}

3. Events Need Integer IDs

Each event in the trace must have an integer event_id for linking rewards:
{
  "event_id": 1,
  "event_type": "runtime",
  "metadata": {"action": "move_right"}
}
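
It is worth checking all three requirements locally before submitting a job, since a malformed dataset only fails after upload. A minimal pre-flight sketch (`validate_dataset` is a hypothetical helper, not a synth-ai API):

import json

def validate_dataset(path):
    """Check the three dataset requirements described above."""
    with open(path) as f:
        data = json.load(f)
    errors = []

    # 1. Every task input must contain a V3 trace
    for task in data.get("tasks", []):
        trace = task.get("input", {}).get("trace")
        if not trace or "session_time_steps" not in trace:
            errors.append(f"task {task.get('id')}: missing input.trace with session_time_steps")
            continue
        # 3. Every event needs an integer event_id
        for step in trace["session_time_steps"]:
            for event in step.get("events", []):
                if not isinstance(event.get("event_id"), int):
                    errors.append(f"task {task.get('id')}, step {step.get('step_id')}: non-integer event_id")

    # 2. Every gold output needs a float score in [0, 1]
    for gold in data.get("gold_outputs", []):
        score = gold.get("output", {}).get("score")
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            errors.append(f"gold output {gold.get('task_id')}: score must be a float in [0, 1]")

    return errors

for error in validate_dataset("crafter_verifier_dataset.json"):
    print(error)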

Training the Verifier

from synth_ai.sdk.api.train.adas import ADASJob

job = ADASJob.from_dataset(
    "crafter_verifier_dataset.json",
    graph_type="verifier",  # Critical: specify verifier type
    policy_model="gpt-4o-mini",
    rollout_budget=200,
)
job.submit()

for event in job.stream_events():
    if event["type"] == "generation_complete":
        print(f"Gen {event['data']['generation']}: MAE {event['data']['best_score']:.3f}")

result = job.wait()
print(f"Final correlation: {result['best_score']:.3f}")

Using the Trained Verifier

Evaluate a Single Trace

verifier = ADASJob.from_existing("adas_verifier_xyz", api_key=api_key)

result = verifier.run_judge(
    session_trace={
        "session_id": "new_episode_001",
        "session_time_steps": [...],  # Your agent's trace
        "metadata": {"environment": "CrafterClassic"}
    },
    context={
        "rubric": {
            "outcome": {
                "criteria": [
                    {"name": "achievement_progression", "description": "Achievements unlocked", "weight": 2.0},
                    {"name": "survival", "description": "Health/food/drink management", "weight": 1.0}
                ]
            }
        }
    }
)

print(f"Score: {result['score']:.2f}")
print(f"Reasoning: {result['reasoning']}")
print(f"Event rewards: {result['event_rewards']}")

Use as RL Reward Signal

The verifier output integrates directly with synth-ai tracing:
from synth_ai.core.tracing_v3 import SessionTracer

# After your agent completes an episode (tracer is your SessionTracer instance)
trace = tracer.get_session_trace()

# Get verifier judgement
judgement = verifier.run_judge(
    session_trace=trace,
    context={"rubric": crafter_rubric}
)

# Attach rewards for RL training
for event_reward in judgement["event_rewards"]:
    await tracer.record_event_reward(
        event_id=event_reward["event_id"],
        reward_value=event_reward["value"]
    )

await tracer.record_outcome_reward(
    reward_value=judgement["score"],
    feedback=judgement["reasoning"]
)

Crafter Scoring Rubric

The example rubric evaluates Crafter agents on 5 dimensions:
Criterion                  Weight   Description
Achievement Progression    35%      Late-game achievements (iron tools, furnace) score higher
Resource Stockpile         20%      Inventory quality (>20 wood = high score)
Survival State             20%      Health, food, and drink above 50%
Failure Analysis           15%      How well the agent mitigated death risk
Future Readiness           10%      Preparation for next objectives
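
To use these five criteria at judge time, express them in the same criteria/weight structure as the default_rubric shown earlier. A sketch (the fractional weights mirror the percentages above; the criterion names are illustrative):

crafter_rubric = {
    "outcome": {
        "criteria": [
            {"name": "achievement_progression",
             "description": "Late-game achievements (iron tools, furnace) score higher",
             "weight": 0.35},
            {"name": "resource_stockpile",
             "description": "Inventory quality (>20 wood = high score)",
             "weight": 0.20},
            {"name": "survival_state",
             "description": "Health, food, and drink above 50%",
             "weight": 0.20},
            {"name": "failure_analysis",
             "description": "How well the agent mitigated death risk",
             "weight": 0.15},
            {"name": "future_readiness",
             "description": "Preparation for next objectives",
             "weight": 0.10},
        ]
    }
}

This dict has the shape expected by the context={"rubric": ...} argument of run_judge, and is the crafter_rubric referenced in the RL reward example above.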

Trace Compression

Crafter traces can be large (~30KB/step with images). For training, compress to essential data:
def compress_trace(trace):
    """Strip heavy data, keep essential state."""
    compressed = {
        "session_id": trace["session_id"],
        "metadata": trace["metadata"],
        "session_time_steps": []
    }

    for step in trace["session_time_steps"]:
        compressed_step = {
            "step_id": step["step_id"],
            "step_index": step["step_index"],
            "events": []
        }

        for event in step["events"]:
            compressed_event = {
                "event_id": event["event_id"],
                "event_type": event["event_type"],
                "metadata": {
                    k: v for k, v in event.get("metadata", {}).items()
                    if k not in ["image", "observation", "raw_state"]
                }
            }
            compressed_step["events"].append(compressed_event)

        compressed["session_time_steps"].append(compressed_step)

    return compressed
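
Compression composes naturally with the judge call, for example:

# Compress before judging to cut payload size and token cost
compressed = compress_trace(tracer.get_session_trace())
judgement = verifier.run_judge(
    session_trace=compressed,
    context={"rubric": crafter_rubric},
)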

Configuration Tips

For Better Correlation with Human Scores

job = ADASJob.from_dataset(
    "crafter_verifier_dataset.json",
    graph_type="verifier",
    policy_model="gpt-4o",        # Stronger reasoning
    rollout_budget=500,            # More optimization
    graph_structure="dag",         # Multi-step evaluation
)

For Lower Inference Cost

job = ADASJob.from_dataset(
    "crafter_verifier_dataset.json",
    graph_type="verifier",
    policy_model="llama-3.1-8b-instant",  # Groq for speed
    graph_structure="single_prompt",
)