Overview
This tutorial demonstrates how to use the Synth SDK and platform to evaluate an LLM agent’s shortcomings at playing Craftax, a Minecraft-inspired game environment. The agent uses a ReAct (Reasoning + Action) approach to make decisions and interact with the game world.
Game Rules and Actions
We’ll use the CraftaxLM library to render the game in a text format that the LLM can engage with.
The LLM’s state and surroundings are rendered into the prompt, and at each step it can take between 1 and 8 consecutive actions, such as:
- Basic movement (up, down, left, right)
- Resource gathering and crafting
- Combat and tool usage
- Building and construction
Because the agent uses the ReAct framework, its definition is rather simple:
import uuid

from craftaxlm import CraftaxClassicACI
from pydantic import BaseModel
from synth_sdk.tracing.abstractions import RewardSignal
from synth_sdk.tracing.decorators import trace_event_async
from zyk import LM

# The game rules, tips, and action dictionaries referenced below
# (craftax_classic_game_rules, crafter_game_tips, craftax_classic_action_dict,
# and their "full" counterparts) are constants defined in the full source.


class ReAct(BaseModel):
    # Structured output schema for the LM. `actions` is required by the code
    # below; the name of the thought field may differ in the full source.
    reasoning: str
    actions: list[str]


class SimpleReActLanguageAgent:
    def __init__(self, lm: LM, mode: str = "classic"):
        self.system_instance_id = str(uuid.uuid4())
        self.system_name = "craftax_agent"
        self.lm = lm
        self.mode = mode  # "classic" or "full"; must match the branches below
        self.instructions = ""
        self.react_history = []
        self.obs_history = []

    def render_history(self, window: int = 5):
        # Minimal rendering of the most recent ReAct steps and observations;
        # the full source formats these more carefully.
        react_history = "\n".join(str(step) for step in self.react_history[-window:])
        obs_history = "\n".join(str(obs) for obs in self.obs_history[-window:])
        return react_history, obs_history

    @trace_event_async(
        event_type="re-act",
    )
    async def get_actions(self):
        if self.mode == "classic":
            rules = craftax_classic_game_rules
            game_tips = crafter_game_tips
            actions = craftax_classic_action_dict
        elif self.mode == "full":
            rules = craftax_full_game_rules
            game_tips = craftax_game_tips
            actions = craftax_full_action_dict
        else:
            raise ValueError(f"Mode {self.mode} not recognized")
        system_message = f"""
# Premise
You're playing the game of Crafter.
Here is some information about this setting
<Crafter Information>
<Rules>
{rules}
</Rules>
<Tips>
{game_tips}
</Tips>
<Actions Available>
{[a for a in list(actions.keys()) if a.lower() not in ['noop']]}
</Actions Available>
You'll be given your past actions/thoughts, along with recent raw observations from the environment
The environment one step in the past is your current environment
{self.instructions}
"""
        react_history, obs_history = self.render_history()
        user_message = f"""
# Recent Actions / Thoughts
{react_history}
# Recent Observations
{obs_history}
Your next actions / thought: """
        react_step = await self.lm.respond_async(
            system_message=system_message,
            user_message=user_message,
            response_model=ReAct,
        )
        # Partition the proposed actions against the active action dictionary
        # so that hallucinated action names are logged rather than executed.
        illegal_actions = [
            action for action in react_step.actions if action not in actions
        ]
        legal_actions = [
            action for action in react_step.actions if action in actions
        ]
        react_info = react_step.dict()
        react_info["errors"] = {
            "illegal_actions": illegal_actions,
        }
        self.react_history.append(react_info)
        return legal_actions
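To generate trajectories, each episode pairs the agent with a CraftaxClassicACI environment and alternates between asking the agent for actions and applying them. The loop below is a minimal sketch, not the canonical runner: the CraftaxClassicACI constructor argument, the starting_obs attribute, the step method, and the LM constructor arguments are all assumptions about the craftaxlm and zyk interfaces, so consult the full source for the exact calls.

import asyncio

async def run_episode(seed: int, max_steps: int = 25):
    # Assumed interface: CraftaxClassicACI takes a seed, exposes the first
    # observation as `starting_obs`, and `step` returns a dict containing the
    # next text observation plus a `done` flag.
    env = CraftaxClassicACI(seed=seed)
    agent = SimpleReActLanguageAgent(lm=LM(model_name="gpt-4o-mini"), mode="classic")
    agent.obs_history.append(env.starting_obs)

    for _ in range(max_steps):
        for action in await agent.get_actions():
            # Map the chosen action name back to the environment's action index.
            step_info = env.step(craftax_classic_action_dict[action])
            agent.obs_history.append(step_info)
            if step_info.get("done", False):
                return

asyncio.run(run_episode(seed=0))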
Configuration
The config controls which model underlies the agent, how many steps the agent gets before the trajectory is cut off, and how many episodes to run (one per environment seed). In this example, we’ll run the agent across ten seeds to help the Synth platform identify some common failures.
[agent]
model_name = "gpt-4o-mini"
mode = "classic"
max_steps = 25
save_upload = false
[environment]
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[recording]
fps = 3
enabled = true
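Since this is plain TOML, the config can be loaded with the standard library’s tomllib (Python 3.11+) and threaded into the agent and episode runner; the file name config.toml below is just an example:

import tomllib  # Python 3.11+; use the third-party `tomli` package on older versions

with open("config.toml", "rb") as f:
    config = tomllib.load(f)

model_name = config["agent"]["model_name"]    # "gpt-4o-mini"
max_steps = config["agent"]["max_steps"]      # 25
seeds = config["environment"]["seeds"]        # one episode per seed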
We run a batch of agent episodes and give the Synth platform time to analyze them. Soon enough, we can find breakdowns of each trajectory, along with an analysis of which errors plague our agent most.
The agent seems to consistently struggle with obtaining wood, although it often figures it out eventually!
For the complete implementation, including game rules, agent logic, and configuration options, check out the full source code and follow the accompanying walkthrough.