Overview

This tutorial demonstrates how to evaluate an LLM agent’s shortcomings at playing Craftax, a Minecraft-inspired game environment, using the Synth SDK and platform. The agent uses a ReAct (Reasoning + Action) approach to make decisions and interact with the game world.

Game Rules and Actions

We’ll use CraftaxLM to render the game state in a text format that the LLM can engage with.

The LLM’s state and surroundings are rendered in the prompt, and it’s able to take between 1 and 8 consecutive actions like

  1. Basic movement (up, down, left, right)
  2. Resource gathering and crafting
  3. Combat and tool usage
  4. Building and construction

to make headway each step. Because it uses the ReAct framework, the agent definition is rather simple:

import uuid

from craftaxlm import CraftaxClassicACI
from synth_sdk.tracing.abstractions import RewardSignal
from synth_sdk.tracing.decorators import trace_event_async
from zyk import LM

class SimpleReActLanguageAgent:
    def __init__(self, lm: LM, mode: str = "classic"):
        self.system_instance_id = str(uuid.uuid4())
        self.system_name = "craftax_agent"
        self.lm = lm
        self.mode = mode  # "classic" or "full"
        self.instructions = ""
        # rolling histories rendered into the prompt each step
        self.react_history = []
        self.obs_history = []

    @trace_event_async(
        event_type="re-act",
    )
    async def get_actions(self):
        if self.mode == "classic":
            rules = craftax_classic_game_rules
            game_tips = crafter_game_tips
            actions = craftax_classic_action_dict
        elif self.mode == "full":
            rules = craftax_full_game_rules
            game_tips = craftax_game_tips
            actions = craftax_full_action_dict
        else:
            raise ValueError(f"Mode {self.mode} not recognized")
        system_message = f"""
# Premise
You're playing the game of Crafter.
Here is some information about this setting
<Crafter Information>
<Rules>
{rules}
</Rules>
<Tips>
{game_tips}
</Tips>
<Actions Available>
{[a for a in actions if a.lower() != 'noop']}
</Actions Available>
You'll be given your past actions/thoughts, along with recent raw observations from the environment
The environment one step in the past is your current environment

{self.instructions}
"""
        react_history, obs_history = self.render_history()
        user_message = f"""
# Recent Actions / Thoughts
{react_history}
# Recent Observations
{obs_history}

Your next actions / thought: """

        react_step = await self.lm.respond_async(
            system_message=system_message,
            user_message=user_message,
            response_model=ReAct,
        )
        # Partition the proposed actions against the action set for the
        # current mode (previously this always checked the classic dict,
        # which was wrong in "full" mode).
        illegal_actions = [
            action for action in react_step.actions if action not in actions
        ]
        legal_actions = [
            action for action in react_step.actions if action in actions
        ]
        react_info = react_step.dict()
        react_info["errors"] = {
            "illegal_actions": illegal_actions,
        }

        self.react_history.append(react_info)

        return legal_actions
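The structured response the agent parses into reasoning and actions is the `ReAct` model passed as `response_model` above. Its exact definition isn't shown here; in practice it is likely a pydantic `BaseModel` (which is what `zyk`'s structured output expects), but its shape can be sketched with a stdlib dataclass:

```python
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical sketch of the ReAct response model referenced in
# get_actions(); field names match how the agent code uses it.
@dataclass
class ReAct:
    reasoning: str      # the agent's thought for this step
    actions: List[str]  # 1-8 consecutive action names to execute

    def dict(self):
        # mirrors pydantic's .dict(), which get_actions() calls
        return asdict(self)
```

The key design point is that the model forces the LLM to commit to an explicit reasoning string before listing actions, which is what makes the trajectory breakdowns later in the tutorial interpretable.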

Configuration

We can configure what model underlies the agent, how long to give the agent before cutting off the trajectory, and the number of agents to run at once via the config. In this example, we’ll run the agent a handful of times to help the Synth platform identify some common failures.

[agent]
model_name = "gpt-4o-mini"
mode = "classic"
max_steps = 25
save_upload = false

[environment]
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

[recording]
fps = 3
enabled = true

We run a batch of agent episodes and give the Synth platform time to analyze them. Soon enough, we can find breakdowns of each trajectory, along with an analysis of which errors plague our agent most.

The agent consistently struggles with obtaining wood, although it often figures it out eventually!
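The batch run described above boils down to launching one episode per seed concurrently and capping each at `max_steps`. A minimal, self-contained sketch of that loop pattern is below; the episode body is stubbed out, since the real loop would step `CraftaxClassicACI` with the actions returned by `SimpleReActLanguageAgent.get_actions()`:

```python
import asyncio

async def run_episode(seed: int, max_steps: int) -> dict:
    # Stub: the real episode would create the environment from `seed`,
    # call the agent for actions, and apply them until done or cut off.
    steps_taken = 0
    for _ in range(max_steps):  # trajectory cutoff from the config
        steps_taken += 1
    return {"seed": seed, "steps": steps_taken}

async def run_batch(seeds, max_steps):
    # One agent per seed, run concurrently.
    return await asyncio.gather(
        *(run_episode(s, max_steps) for s in seeds)
    )

results = asyncio.run(run_batch(range(10), max_steps=25))
```

Running episodes with `asyncio.gather` is what lets the batch of ten agents finish in roughly the wall-clock time of the slowest single episode.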

For the complete implementation, including game rules, agent logic, and configuration options, check out the full source code:

And follow the walkthrough here.