Synth Graphs

Graphs are Synth’s abstraction for multi-node LLM workflows. Like task apps, graphs are first-class artifacts you can train, download, and serve in production.

What is a Graph?

A Synth graph is a directed workflow of LLM calls and transformations. Each node can:
  • Call an LLM with a specific prompt template
  • Transform data between nodes
  • Branch conditionally based on intermediate results
  • Aggregate outputs from multiple paths
Unlike single prompts, graphs can express complex reasoning patterns: chain-of-thought, retrieval-augmented generation, self-consistency, and more.
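To make this concrete, here is a purely hypothetical sketch (not Synth's actual runtime or node schema) of a two-node chain-of-thought graph, with call_llm standing in for any LLM client:
def call_llm(prompt: str) -> str:
    """Stand-in for any LLM client call (hypothetical)."""
    raise NotImplementedError

def reasoning_node(question: str) -> str:
    # Node 1: elicit intermediate reasoning (chain-of-thought)
    return call_llm(f"Think step by step about: {question}")

def answer_node(question: str, reasoning: str) -> str:
    # Node 2: produce the final answer conditioned on the reasoning
    return call_llm(f"Question: {question}\nReasoning: {reasoning}\nAnswer:")

def run_graph(question: str) -> str:
    # A minimal two-node chain: reasoning feeds the answer node
    return answer_node(question, reasoning_node(question))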

Graph Types

Synth supports two fundamental graph types:
Type       Purpose                          Example
policy     Maps inputs to outputs           QA, classification, generation
verifier   Judges/scores existing results   Quality scoring, ranking, evaluation

Policy Graphs

Policy graphs solve tasks. They take an input and produce an output:
Input → [Graph] → Output
Examples:
  • Question answering: {question, context} → {answer}
  • Classification: {text} → {category, confidence}
  • Code generation: {spec} → {code}
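For example, once a QA policy graph is trained, calling it looks like the inference example later on this page (the input keys follow your dataset's schema; here they mirror the QA example above):
# Assumes `job` is a completed ADAS policy-graph job (see "Creating Graphs" below)
result = job.run_inference({
    "question": "What is the capital of France?",
    "context": "France is a country in Western Europe.",
})
print(result["output"])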

Verifier Graphs

Verifier graphs evaluate execution traces and produce structured scores. At inference time, they take:
  1. A V3 trace - The execution trace from synth-ai tracing
  2. A rubric - Evaluation criteria defining what to score
{trace, rubric} → [Verifier Graph] → {score, event_rewards, reasoning}
Key use case: Training custom judges. Instead of using expensive frontier models (GPT-4, Claude) to evaluate outputs, you can train a verifier graph that:
  1. Matches human evaluation quality
  3. Runs on cheaper models (e.g., GPT-4o-mini or Groq-hosted models)
  3. Provides consistent, calibrated scores
  4. Returns structured rewards (event-level and outcome-level)

Verifier Graph Dataset Format

Critical: Training a verifier graph requires a specific dataset format. The dataset must include V3 traces as inputs and gold scores as outputs.

Required Dataset Structure

A verifier-compliant ADAS dataset has this structure:
{
  "version": "1.0",
  "metadata": {
    "name": "my-verifier-dataset",
    "description": "Training data for custom judge"
  },
  "tasks": [
    {
      "id": "trace_001",
      "input": {
        "trace": { /* V3 SessionTrace object */ }
      }
    }
  ],
  "gold_outputs": [
    {
      "task_id": "trace_001",
      "output": {
        "score": 0.85
      }
    }
  ]
}
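Because a malformed file will fail training, it can help to sanity-check the dataset before submitting. A minimal sketch using only the standard library (the checks mirror the required fields described in the sections below):
import json

with open("verifier_dataset.json") as f:
    dataset = json.load(f)

# Every task needs an id and a V3 trace under input.trace
for task in dataset["tasks"]:
    assert "trace" in task["input"], f"task {task['id']} missing input.trace"

# Every gold output must reference a known task and carry a 0-1 score
task_ids = {t["id"] for t in dataset["tasks"]}
for gold in dataset["gold_outputs"]:
    assert gold["task_id"] in task_ids, f"unknown task_id {gold['task_id']}"
    assert 0.0 <= gold["output"]["score"] <= 1.0, "score must be in [0, 1]"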

The tasks[].input.trace Field (Required)

Each task input must contain a trace field with a V3 SessionTrace object:
{
  "id": "trace_001",
  "input": {
    "trace": {
      "session_id": "session_abc123",
      "session_time_steps": [
        {
          "step_id": "step_0",
          "step_index": 0,
          "events": [
            {
              "event_id": 1,
              "event_type": "runtime",
              "metadata": {
                "action": "search_database",
                "query": "capital of France"
              }
            },
            {
              "event_id": 2,
              "event_type": "environment",
              "metadata": {
                "result": "Paris",
                "source": "geography_db"
              }
            }
          ]
        },
        {
          "step_id": "step_1",
          "step_index": 1,
          "events": [
            {
              "event_id": 3,
              "event_type": "runtime",
              "metadata": {
                "action": "generate_response",
                "output": "The capital of France is Paris."
              }
            }
          ]
        }
      ],
      "metadata": {
        "environment": "qa_system",
        "model": "gpt-4o-mini"
      }
    }
  }
}
Event IDs are critical. Each event must have an integer event_id so the verifier can assign per-event rewards that link back to specific actions in the trace.
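A small helper for collecting the event IDs present in a trace is useful when writing gold labels against them (a sketch; the field names follow the V3 structure above):
def iter_event_ids(trace: dict):
    """Yield every integer event_id in a V3 SessionTrace dict."""
    for step in trace["session_time_steps"]:
        for event in step["events"]:
            yield event["event_id"]

# For the example trace above, this yields 1, 2, 3.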

The gold_outputs[].output.score Field (Required)

Every gold output must include a score field (float, 0-1):
{
  "task_id": "trace_001",
  "output": {
    "score": 0.85
  }
}

Optional: Event-Level Rewards

For fine-grained training, include event_rewards to teach the verifier which specific events were good or bad:
{
  "task_id": "trace_001",
  "output": {
    "score": 0.85,
    "event_rewards": [
      {
        "event_id": 1,
        "value": 1.0,
        "annotation": {"reason": "Good query formulation"}
      },
      {
        "event_id": 3,
        "value": 0.7,
        "annotation": {"reason": "Correct but could be more detailed"}
      }
    ]
  }
}
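Because each reward references an event_id, it is worth checking that every labeled reward points at an event that actually exists in the paired trace. A minimal sketch over the dataset file:
import json

with open("verifier_dataset.json") as f:
    dataset = json.load(f)

traces = {t["id"]: t["input"]["trace"] for t in dataset["tasks"]}

for gold in dataset["gold_outputs"]:
    # All event_ids present in this task's trace
    valid_ids = {
        event["event_id"]
        for step in traces[gold["task_id"]]["session_time_steps"]
        for event in step["events"]
    }
    for reward in gold["output"].get("event_rewards", []):
        assert reward["event_id"] in valid_ids, (
            f"{gold['task_id']}: dangling event_id {reward['event_id']}"
        )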

Optional: Outcome-Level Rewards

Include episode-level summary information:
{
  "task_id": "trace_001",
  "output": {
    "score": 0.85,
    "outcome": {
      "total_reward": 0.85,
      "achievements_count": 3,
      "annotation": {"summary": "Completed main objective"}
    },
    "outcome_feedback": "Agent found the correct answer efficiently"
  }
}

Optional: Default Rubric

Include a rubric in the dataset metadata to define evaluation criteria:
{
  "metadata": {
    "name": "my-verifier-dataset",
    "default_rubric": {
      "outcome": {
        "criteria": [
          {
            "name": "correctness",
            "description": "Is the final answer factually correct?",
            "weight": 2.0
          },
          {
            "name": "efficiency",
            "description": "Did the agent reach the answer efficiently?",
            "weight": 1.0
          }
        ]
      },
      "events": {
        "criteria": [
          {
            "name": "appropriate_action",
            "description": "Was each action appropriate for the context?",
            "weight": 1.0
          }
        ]
      }
    }
  }
}
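The weights suggest a weighted combination of per-criterion scores. The exact aggregation is internal to the verifier, but as a rough illustration, a weight-normalized average over hypothetical per-criterion scores behaves like this:
# Hypothetical per-criterion scores a judge might assign (0-1 each);
# the weights match the rubric above
criterion_scores = {"correctness": 1.0, "efficiency": 0.5}
weights = {"correctness": 2.0, "efficiency": 1.0}

# (2.0 * 1.0 + 1.0 * 0.5) / 3.0 ≈ 0.83
total = sum(weights.values())
score = sum(weights[n] * criterion_scores[n] for n in weights) / total
print(round(score, 2))  # 0.83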

Complete Verifier Dataset Example

Here’s a complete, production-ready verifier dataset:
{
  "version": "1.0",
  "metadata": {
    "name": "qa-judge-training",
    "description": "Training data for QA evaluation judge",
    "default_rubric": {
      "outcome": {
        "criteria": [
          {"name": "correctness", "description": "Factual accuracy", "weight": 2.0},
          {"name": "completeness", "description": "Answer addresses all parts", "weight": 1.0}
        ]
      }
    }
  },
  "tasks": [
    {
      "id": "trace_001",
      "input": {
        "trace": {
          "session_id": "qa_session_001",
          "session_time_steps": [
            {
              "step_id": "step_0",
              "step_index": 0,
              "events": [
                {"event_id": 1, "event_type": "environment", "metadata": {"question": "What is 2+2?"}},
                {"event_id": 2, "event_type": "runtime", "metadata": {"thought": "Simple arithmetic"}},
                {"event_id": 3, "event_type": "runtime", "metadata": {"answer": "4"}}
              ]
            }
          ],
          "metadata": {"task_type": "math"}
        }
      }
    },
    {
      "id": "trace_002",
      "input": {
        "trace": {
          "session_id": "qa_session_002",
          "session_time_steps": [
            {
              "step_id": "step_0",
              "step_index": 0,
              "events": [
                {"event_id": 1, "event_type": "environment", "metadata": {"question": "What is the capital of France?"}},
                {"event_id": 2, "event_type": "runtime", "metadata": {"answer": "London"}}
              ]
            }
          ],
          "metadata": {"task_type": "geography"}
        }
      }
    }
  ],
  "gold_outputs": [
    {
      "task_id": "trace_001",
      "output": {
        "score": 1.0,
        "event_rewards": [
          {"event_id": 2, "value": 1.0},
          {"event_id": 3, "value": 1.0}
        ],
        "outcome_feedback": "Correct answer with clear reasoning"
      }
    },
    {
      "task_id": "trace_002",
      "output": {
        "score": 0.0,
        "event_rewards": [
          {"event_id": 2, "value": 0.0}
        ],
        "outcome_feedback": "Incorrect - Paris is the capital of France"
      }
    }
  ],
  "judge_config": {
    "mode": "rubric",
    "model": "gpt-4o-mini",
    "provider": "openai"
  }
}
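Datasets like this are usually assembled programmatically from logged traces plus human labels. A minimal sketch of building one task/gold pair in this format (trace is a V3 SessionTrace dict, human_score is your 0-1 label):
def make_example(task_id: str, trace: dict, human_score: float) -> tuple[dict, dict]:
    """Build one (task, gold_output) pair in the verifier dataset format."""
    task = {"id": task_id, "input": {"trace": trace}}
    gold = {"task_id": task_id, "output": {"score": human_score}}
    return task, gold

# dataset["tasks"].append(task); dataset["gold_outputs"].append(gold)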

Training a Verifier Graph

With your dataset prepared, train via ADAS:
from synth_ai.sdk.api.train.adas import ADASJob

job = ADASJob.from_dataset(
    "verifier_dataset.json",
    graph_type="verifier",  # Critical: specify verifier type
    policy_model="gpt-4o-mini",
    rollout_budget=200,
)
job.submit()
result = job.stream_until_complete()
Or via the Graph GEPA config:
[graph]
graph_type = "verifier"

[model]
model_id = "gpt-4o-mini"
provider = "openai"

Verifier Graph Inference

At inference time, pass a V3 trace and rubric:
result = verifier_job.run_judge(
    session_trace={
        "session_id": "new_trace_001",
        "session_time_steps": [...],
        "metadata": {...}
    },
    context={
        "rubric": {
            "outcome": {
                "criteria": [
                    {"name": "correctness", "description": "Is the answer correct?", "weight": 1.0}
                ]
            }
        }
    }
)

print(result["score"])          # 0.92
print(result["reasoning"])       # "The answer is factually correct..."
print(result["event_rewards"])   # [{"event_id": 1, "value": 0.9}, ...]

Inference via cURL

curl -X POST $HOST/api/adas/graph/judge \
  -H "Authorization: Bearer $SYNTH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "adas_XXXX",
    "session_trace": {
      "session_id": "trace_to_evaluate",
      "session_time_steps": [...],
      "metadata": {}
    },
    "context": {
      "rubric": {
        "outcome": {"criteria": [{"name": "quality", "description": "...", "weight": 1.0}]}
      }
    }
  }'
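The same call from Python with the requests library (the endpoint and payload mirror the cURL example; the placeholders are yours to fill in):
import os
import requests

resp = requests.post(
    f"{os.environ['HOST']}/api/adas/graph/judge",
    headers={"Authorization": f"Bearer {os.environ['SYNTH_API_KEY']}"},
    json={
        "job_id": "adas_XXXX",  # your trained verifier job ID
        "session_trace": {
            "session_id": "trace_to_evaluate",
            "session_time_steps": [],  # your V3 trace steps go here
            "metadata": {},
        },
        "context": {
            "rubric": {
                "outcome": {
                    "criteria": [
                        {"name": "quality", "description": "...", "weight": 1.0}
                    ]
                }
            }
        },
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())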

Benefits of Trained Verifier Graphs

A trained verifier graph enables:
  • 10x cost reduction vs frontier model judging
  • Consistent evaluation across runs
  • Domain-specific scoring tuned to your criteria
  • Structured rewards at event and outcome levels
  • Integration with RL training - use verifier output as rewards (sketched below)
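The verifier's structured output maps directly onto RL reward signals: the outcome score can serve as the episode reward and event_rewards as per-step rewards. A minimal sketch, assuming verifier_job is a trained verifier and rollout_trace / rubric are defined as in the inference example above:
# Judge a freshly collected rollout trace
result = verifier_job.run_judge(
    session_trace=rollout_trace,
    context={"rubric": rubric},
)

episode_reward = result["score"]  # episode-level reward

# Per-step rewards keyed by event_id, for credit assignment
step_rewards = {r["event_id"]: r["value"] for r in result["event_rewards"]}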

Graph Structures

Structure       Description
single_prompt   One LLM call, minimal complexity
dag             Multiple nodes in sequence, no branching
conditional     Full graph with conditional branching

Creating Graphs

Graphs are created through optimization. You provide:
  1. A dataset - Examples of inputs and expected outputs
  2. Configuration - Graph type, structure constraints, models to use
  3. A budget - How much optimization to run
Synth’s Graph GEPA algorithm then evolves both the graph structure and its prompts. The simplest way to create graphs is through the ADAS API:
from synth_ai.sdk.api.train.adas import ADASJob

job = ADASJob.from_dataset(
    "my_tasks.json",
    policy_model="gpt-4o-mini",
    rollout_budget=200,
)
job.submit()
result = job.stream_until_complete()
See Workflows for the full ADAS documentation.

Using Graph GEPA Directly

For more control, use the Graph Optimization client:
from synth_ai.products.graph_gepa import GraphOptimizationConfig, GraphOptimizationClient

config = GraphOptimizationConfig.from_toml("config.toml")

async with GraphOptimizationClient(backend_url, api_key) as client:
    job_id = await client.start_job(config)
    async for event in client.stream_events(job_id):
        print(event["type"])
    result = await client.get_result(job_id)
See Graph GEPA for configuration reference.

Using Graphs

Once trained, graphs can be:

1. Run in Production

Call the /graph/completions endpoint for production inference:
result = job.run_inference({"query": "What is the capital of France?"})
print(result["output"])
See Graph Inference for details.

2. Downloaded

Export the graph for local use or inspection:
graph_export = job.download_prompt()
print(graph_export)
See Downloading Graphs for details.

Graph Artifacts

When you train a graph, Synth produces:
Artifact           Description
Graph YAML         Full graph definition with nodes and prompts
Prompt snapshots   Individual prompt versions from training
Training metrics   Scores, costs, latencies per generation