## Results Overview
| Benchmark | Synth GEPA (GEPA-AI) | Synth GEPA (DSPy) | Synth MIPRO | GEPA-AI Library | DSPy-AI MIPROv2 | DSPy GEPA Library |
|---|---|---|---|---|---|---|
| Banking77 | 87% → 100% | 86% → 100% | 90% → 84% | 75% → 95% | 60% → 71% | 49% → 49% |
| HeartDisease | 54% → 63% | 54% → 75% | 54% → 75% | 48% → 76% | 51% → 61% | 0% → 38% |
| HotpotQA | 78% → 89% | 79% → 77% | 61% → 70% | 65% → 75% | - | 0% → 52% |
| Pupa | - | - | 35% → 56% | 56% → 56% | - | 45% → 51% |
## Benchmark Implementation Details
Because Synth uses an HTTP relay boundary to separate client-side business logic from server-side optimization logic, we're unable to reuse the gepa-ai and dspy-ai implementations of GEPA and MIPROv2. Instead, we re-implemented the methods from scratch. To make sure we didn't veer wildly off course, we took the opportunity to run all three implementations of GEPA, and both implementations of MIPROv2, on a small but growing fraction of the LangProBe benchmark, with a few related benchmarks thrown in.

It's impossible to compare the methods apples-to-apples: the three differ in what constraints they place on the initial module (DSPy imposes substantial structure, GEPA-AI a bit, and Synth very little), and although the generation pipelines resemble one another and use the same reflection model (here, Llama 3.3 70B), they are not exact copies. Nonetheless, we hope the results we've obtained shed some light on what developers can expect on simple pipeline tasks with small models.

Note: to keep migration friction to a minimum, Synth's GEPA and MIPRO adapters let the user select between our own proposer and proposers closely resembling the dspy-ai and gepa-ai references. "Synth GEPA (GEPA-AI proposer)" is simply the core Synth GEPA optimizer with a proposer designed to resemble the gepa-ai proposer swapped in.

Want to try prompt optimization? See the Prompt Optimization Overview for concepts, the Train Guide for CLI usage, and the SDK Guide for programmatic access.
## Frameworks Compared
- Synth AI: Synth’s Python client, designed for CLI use by agents and in production
- GEPA-AI: Official library implementing GEPA
- DSPy: Stanford’s framework with GEPA and MIPROv2 optimizers
## LangProBe(+) Benchmark Suite
Four benchmarks testing different task types:

- Banking77 (77-class intent classification)
- HeartDisease (binary medical classification)
  - Dataset: buio/heart-disease
  - Source: UCI Heart Disease Data Set
- HotpotQA (multi-hop question answering)
  - Dataset: hotpotqa/hotpot_qa
  - Paper: HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
  - Project: hotpotqa.github.io
- Pupa (privacy-conscious delegation)
  - Dataset: Columbia-NLP/PUPA
  - Paper: PAPILLON: Privacy Preservation from Internet-based and Local Language Model Ensembles
  - Tutorial: DSPy RL for Privacy-Conscious Delegation
## Methodology
Unified Adapter Pattern: framework-specific adapters (DSPy, GEPA-AI, Synth) interface with shared task apps (FastAPI servers) that evaluate prompts on identical datasets, using the same initial prompts, models, seeds, and rollout budgets across all frameworks.
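The shared-task-app arrangement can be sketched in plain Python. This is a minimal illustration of the pattern only, not the actual Synth, DSPy, or GEPA-AI interfaces: every class and function name below is hypothetical, and the FastAPI/HTTP layer is omitted so the contract stays visible.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskApp:
    """A task app exposes one thing: score a prompt on a fixed dataset.

    In the real harness this sits behind a FastAPI endpoint; here it is a
    plain callable so the pattern is visible without HTTP plumbing.
    """
    dataset: list[tuple[str, str]]         # (input, expected_label) pairs
    run_model: Callable[[str, str], str]   # (prompt, input) -> prediction

    def evaluate(self, prompt: str) -> float:
        hits = sum(1 for x, y in self.dataset if self.run_model(prompt, x) == y)
        return hits / len(self.dataset)

class FrameworkAdapter:
    """Each framework adapter wraps its optimizer but speaks the same
    interface, so every optimizer sees identical data, seeds, and budgets."""
    def optimize(self, task: TaskApp, seed_prompt: str, budget: int) -> str:
        raise NotImplementedError

class NaiveSearchAdapter(FrameworkAdapter):
    """Stand-in for a real optimizer: try prompt variants, keep the best."""
    def optimize(self, task, seed_prompt, budget):
        candidates = [seed_prompt] + [
            f"{seed_prompt} (variant {i})" for i in range(budget)
        ]
        return max(candidates, key=task.evaluate)
```

Because the optimizer only ever sees `evaluate`, swapping in a different framework's search strategy changes nothing on the task-app side, which is the property the benchmark relies on.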
## Results Summary

### Library Reference Experiments
GEPA-AI vs DSPy GEPA:

| Benchmark | Optimizer | Model | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|---|
| banking77 | GEPA-AI | groq/llama-3.1-8b-instant | 0.7500 | 0.9500 | +26.7% | 500 | 48.5s |
| banking77 | DSPy GEPA | groq/llama-3.1-8b-instant | 0.4900 | 0.4900 | +0.0% | 500 | 8.1s |
| heartdisease | GEPA-AI | groq/openai/gpt-oss-20b | 0.4800 | 0.7600 | +58.3% | 500 | 3.6m |
| heartdisease | DSPy GEPA | groq/openai/gpt-oss-20b | 0.0000 | 0.3800 | +38.0 pts | 500 | 1.4m |
| hotpotqa | GEPA-AI | groq/llama-3.3-70b-versatile | 0.6500 | 0.7500 | +15.4% | 100 | 1.3m |
| hotpotqa | DSPy GEPA | groq/llama-3.3-70b-versatile | 0.0000 | 0.5167 | +51.7 pts | 100 | 1.4m |
| pupa | GEPA-AI | groq/openai/gpt-oss-120b | 0.5557 | 0.5601 | +0.8% | 300 | 2.5m |
| pupa | DSPy GEPA | groq/openai/gpt-oss-120b | 0.4493 | 0.5102 | +13.6% | 300 | 1.3m |
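The Lift column above mixes two conventions: relative lift where the baseline is nonzero, and absolute percentage points where the baseline is zero (a relative lift from 0 is undefined). A small helper, written here just for illustration, makes the distinction explicit:

```python
def lift(baseline: float, best: float) -> str:
    """Report relative lift when baseline > 0, else absolute points."""
    if baseline > 0:
        return f"+{100 * (best - baseline) / baseline:.1f}%"
    return f"+{100 * (best - baseline):.1f} pts"

# Rows from the table above:
print(lift(0.75, 0.95))  # GEPA-AI on banking77 -> +26.7%
print(lift(0.00, 0.38))  # DSPy GEPA on heartdisease -> +38.0 pts
```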
DSPy-AI MIPROv2:

| Benchmark | Model | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|
| heartdisease | groq/openai/gpt-oss-20b | 0.5100 | 0.6100 | +10.0% | 969 | 1.6m |
| banking77 | groq/llama-3.1-8b-instant | 0.5950 | 0.7050 | +11.0% | 791 | 4.1m |
Summary:

| Optimizer | Avg Final Score | Avg Lift (pts) | Benchmarks |
|---|---|---|---|
| GEPA-AI | 75.5% | +14.6% | 4 |
| DSPy-AI MIPROv2 | 65.8% | +10.5% | 2 |
| DSPy GEPA | 47.4% | +23.9% | 4 |
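The summary rows can be reproduced from the per-benchmark numbers above; note that Avg Lift is an average of absolute percentage points, not of the relative lifts shown in the per-run table. For example, the GEPA-AI row:

```python
# (baseline, best) pairs for GEPA-AI, taken from the reference-experiment table.
gepa_ai = [(0.7500, 0.9500), (0.4800, 0.7600),
           (0.6500, 0.7500), (0.5557, 0.5601)]

avg_final = sum(best for _, best in gepa_ai) / len(gepa_ai)
avg_lift_pts = sum(best - base for base, best in gepa_ai) / len(gepa_ai)

print(f"{100 * avg_final:.1f}%")        # prints 75.5%
print(f"+{100 * avg_lift_pts:.1f} pts") # prints +14.6 pts
```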
### Synth AI GEPA Results
Synth AI's GEPA implementation with different proposer modes:

| Task | Policy Model | Proposer | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|---|
| Banking77_synth | llama-3.1-8b-instant | gepa-ai | 0.8700 | 1.0000 | +0.1300 | 136 | 185.2s |
| Banking77_dspy | llama-3.1-8b-instant | dspy | 0.8600 | 1.0000 | +0.1400 | 352 | 228.4s |
| Banking77_gepa_ai | llama-3.1-8b-instant | gepa-ai | 0.8700 | 0.8750 | +0.0050 | 352 | 190.9s |
| HeartDisease_dspy | openai/gpt-oss-20b | dspy | 0.5400 | 0.7500 | +0.2100 | 212 | 295.0s |
| HeartDisease_gepa_ai | openai/gpt-oss-20b | gepa-ai | 0.5400 | 0.6250 | +0.0850 | 240 | 301.3s |
| HotpotQA_gepa_ai | llama-3.3-70b-versatile | gepa-ai | 0.7817 | 0.8937 | +0.1120 | 96 | 275.4s |
| HotpotQA_dspy | llama-3.3-70b-versatile | dspy | 0.7867 | 0.7688 | -0.0229 | 88 | 252.7s |
Proposer mode comparison:

| Proposer Mode | Avg Baseline | Avg Final Score | Avg Lift | Avg Time |
|---|---|---|---|---|
| DSPy Proposer | 0.7406 | 0.7979 | +0.0574 | 15.2m |
| GEPA-AI Proposer | 0.7629 | 0.8484 | +0.0855 | 21.5m |
### Synth AI MIPRO Results
Synth AI's MIPRO (using DSPy proposer mode):

| Benchmark | Status | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|
| HeartDisease | completed | 0.5400 | 0.7500 | +21.0% | N/A | N/A |
| HotpotQA | completed | 0.6050 | 0.6950 | +9.0% | 80 | 52.0s |
| Pupa | completed | 0.3466 | 0.5588 | +21.2% | 80 | 53.0s |
| Banking77 | completed | 0.9000 | 0.8400 | -6.0% | 125 | 48.0s |
- Average Baseline: 59.8%
- Average Best Score: 71.1%
- Average Lift: +11.3%
- Average Time: 51.0s
- Success Rate: 75% (3/4 benchmarks improved)
### Per-Benchmark Results
| Benchmark | Synth GEPA (GEPA-AI) | Synth GEPA (DSPy) | Synth MIPRO | GEPA-AI Lib | DSPy-AI MIPROv2 | DSPy GEPA Lib |
|---|---|---|---|---|---|---|
| Banking77 | 87% → 100% | 86% → 100% | 90% → 84% | 75% → 95% | 60% → 71% | 49% → 49% |
| HeartDisease | 54% → 63% | 54% → 75% | 54% → 75% | 48% → 76% | 51% → 61% | 0% → 38% |
| HotpotQA | 78% → 89% | 79% → 77% | 61% → 70% | 65% → 75% | failed | 0% → 52% |
| Pupa | - | - | 35% → 56% | 56% → 56% | - | 45% → 51% |
## Analysis

### Performance Rankings
| Rank | Framework | Avg Final Score | Notes |
|---|---|---|---|
| 1 | Synth GEPA (GEPA-AI proposer) | 84.8% | - |
| 2 | Synth GEPA (DSPy proposer) | 79.8% | - |
| 3 | GEPA-AI Library | 75.5% | - |
| 4 | Synth MIPRO (DSPy proposer) | 71.1% | - |
| 5 | DSPy-AI MIPROv2 | 65.8% | 2 benchmarks only |
| 6 | DSPy GEPA | 47.4% | - |
### Interpretation
It's important to keep in mind that these four benchmarks comprise only a small subset of the types of problems developers care about, the models here were all small language models, and the rollout budgets were shoestring. These scripts are chiefly intended to provide fast feedback to developers, not definitive scientific insight into which implementation is best. Still, it's impossible not to notice that all three frameworks consistently deliver improvements even on modest budgets. If you're interested in seeing what they do for your application, consider installing dspy-ai or gepa-ai, or creating a free account with Synth.

## Reproduction
Full code, configs, and instructions: synth-ai/examples/blog_posts/langprobe

## Learn More
Prompt Optimization Documentation:
- Prompt Optimization Overview - Core concepts and algorithms (GEPA, MIPRO)
- Train Your Own Prompts - CLI guide with examples
- Evaluation Guide - Evaluating optimized prompts
- Spec-Driven Optimization - Advanced: optimize with custom specifications
- Examples - Banking77 and more
- SDK Overview - Programmatic access to prompt optimization
- CLI Train Command - Command reference