Documentation Index

Fetch the complete documentation index at: https://docs.usesynth.ai/llms.txt

Use this file to discover all available pages before exploring further.

Use this cookbook when the target is an eval harness, benchmark runner, or scoring workflow that needs reliability, clarity, or better failure evidence.

Goal

Start a directed run that inspects the harness, makes the smallest high-impact improvement, runs the relevant check, and returns a report with artifacts.

Python path

# `client` is assumed to be an already-authenticated SDK client.
run = client.runs.start(
    "Inspect the eval harness, fix the highest-leverage reliability issue, run the relevant check, and leave evidence.",
    host_kind="daytona",            # execution host
    work_mode="directed_effort",    # bounded, goal-directed run
    providers=[{"provider": "openrouter"}],
    runbook="lite",
)
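Before sending a start request, it can help to sanity-check its shape. The sketch below is illustrative only: the field names mirror the snippet above, but the validation helper itself is not part of the SDK.

```python
# Hypothetical request validator; field names are taken from the
# snippet above, the helper itself is an illustration.
REQUIRED_FIELDS = {"prompt", "host_kind", "work_mode", "providers", "runbook"}

def validate_start_request(request: dict) -> list[str]:
    """Return a list of problems; an empty list means the request looks complete."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - request.keys())]
    providers = request.get("providers")
    if providers is not None and not isinstance(providers, list):
        problems.append("providers must be a list of {'provider': name} dicts")
    return problems

request = {
    "prompt": "Inspect the eval harness, fix the highest-leverage reliability issue, "
              "run the relevant check, and leave evidence.",
    "host_kind": "daytona",
    "work_mode": "directed_effort",
    "providers": [{"provider": "openrouter"}],
    "runbook": "lite",
}
print(validate_start_request(request))  # an empty list means ready to send
```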

MCP path

Ask your MCP client:
Start a Managed Research run to improve the eval harness. Use directed_effort, daytona, openrouter, and runbook lite. Require a final report with the command run, failures found, patch summary, and artifacts.

Expected evidence

  • changed files or a PR
  • command output or failure summary
  • artifact manifest
  • final report explaining what improved and what remains risky
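The evidence items above can be checked mechanically once a run finishes. The helper below is a hypothetical sketch: it treats the report as a plain dict, and the key names are assumptions for illustration, not the actual report schema.

```python
# Hypothetical evidence check; key names are illustrative assumptions,
# not the real report schema.
EXPECTED_EVIDENCE = ("changed_files", "command_output", "artifact_manifest", "final_report")

def missing_evidence(report: dict) -> list[str]:
    """Return which expected evidence items are absent or empty."""
    return [key for key in EXPECTED_EVIDENCE if not report.get(key)]

report = {
    "changed_files": ["harness/runner.py"],
    "command_output": "3 failures before patch, 0 after",
    "artifact_manifest": ["logs/run.txt"],
    "final_report": "Fixed flaky fixture ordering; timeout handling remains risky.",
}
print(missing_evidence(report))  # [] when every item is present
```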

Failure notes

If the run cannot launch, the preflight check usually points to one of four causes: repo access, missing credentials, provider availability, or budget state.
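Some of those causes can be checked locally before launching. The sketch below covers only the credentials case; the environment variable names are assumptions for illustration and may differ in your setup.

```python
import os

# Hypothetical local preflight for the "missing credentials" failure
# mode; the variable names below are illustrative assumptions.
CREDENTIAL_VARS = ("SYNTH_API_KEY", "OPENROUTER_API_KEY")

def preflight_credentials() -> list[str]:
    """Return human-readable reasons a launch is likely to fail."""
    return [
        f"missing credentials: {name} is not set"
        for name in CREDENTIAL_VARS
        if not os.environ.get(name)
    ]

for problem in preflight_credentials():
    print(problem)
```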