Machine-aided prompt learning is a useful tool for devs writing AI software. It’s cheap, interpretable, and easy to implement and version. By far the most popular method today is GEPA, an evolutionary search routine. In contexts where demonstrations are needed, though, MIPROv2 is likely still the state of the art. We’ve used related methods extensively internally to calibrate judges and establish strong agent baselines for SFT and RL development and testing. While we eventually expect most developers to reach for our own methods and their derivatives, we’re starting by offering serverless implementations of the academic state of the art, so devs can keep using the algorithms they already know, just serverlessly via the Synth platform.
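As a rough sketch of the evolutionary idea behind GEPA (heavily simplified; the real algorithm maintains a Pareto frontier of candidates and reflects on execution traces, not just scores), the loop looks something like this, where `evaluate` and `reflect_and_rewrite` stand in for a task rollout and a reflection-model call:

```python
# Schematic of a GEPA-style loop, greatly simplified for illustration.
# `evaluate` scores a prompt on the task; `reflect_and_rewrite` asks a
# reflection model to propose an edited prompt based on the parent.
import random

def optimize_prompt(seed_prompt, evaluate, reflect_and_rewrite, budget=100):
    pool = [(seed_prompt, evaluate(seed_prompt))]  # candidate pool
    for _ in range(budget):
        parent, _ = random.choice(pool)        # sample a candidate
        child = reflect_and_rewrite(parent)    # LLM proposes a mutation
        score = evaluate(child)                # rollout on the task
        if score >= max(s for _, s in pool):   # keep if competitive
            pool.append((child, score))
    return max(pool, key=lambda ps: ps[1])     # best prompt, best score
```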

Results Overview

| Benchmark | Synth GEPA (GEPA-AI) | Synth GEPA (DSPy) | Synth MIPRO | GEPA-AI Library | DSPy-AI MIPROv2 | DSPy GEPA Library |
|---|---|---|---|---|---|---|
| Banking77 | 87% → **100%** | 86% → **100%** | 90% → 84% | 75% → 95% | 60% → 71% | 49% → 49% |
| HeartDisease | 54% → 63% | 54% → 75% | 54% → 75% | 48% → **76%** | 51% → 61% | 0% → 38% |
| HotpotQA | 78% → **89%** | 79% → 77% | 61% → 70% | 65% → 75% | failed | 0% → 52% |
| Pupa | - | - | 35% → **56%** | 56% → **56%** | - | 45% → 51% |

Format: Baseline → Final Score. Bold indicates best final score per benchmark.

Benchmark Implementation Details

Because Synth uses an HTTP relay boundary to separate client-side business logic from server-side optimization logic, we’re unable to reuse the gepa-ai and dspy-ai implementations of GEPA and MIPROv2. Instead, we re-implemented the methods from scratch. To make sure we didn’t veer wildly off course, we took the opportunity to run all three implementations of GEPA, and both implementations of MIPROv2, on a small but growing fraction of the LangProBe benchmark, with a few related benchmarks thrown in. It’s impossible to compare the methods apples-to-apples, as the three differ in what constraints they place on the initial module: DSPy imposes substantial structure, GEPA-AI a bit, and Synth very little. Moreover, although the generation pipelines resemble one another and use the same reflection model (here, Llama 3.3 70B), they are not exact copies. Nonetheless, we hope the results we’ve obtained shed a bit of light on what developers can expect to see on simple pipeline tasks with small models.

Note: to minimize migration friction, Synth’s GEPA and MIPRO adapters let the user select between our own proposer and proposers closely resembling the dspy-ai and gepa-ai references. Synth GEPA (GEPA-AI proposer) is simply the core Synth GEPA optimizer with a proposer designed to resemble the gepa-ai proposer swapped in.
Want to try prompt optimization? See Prompt Optimization Overview for concepts, Train Guide for CLI usage, and SDK Guide for programmatic access.
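For a sense of what switching proposer modes looks like in practice, here is a purely illustrative sketch: the import path, function, and every parameter name below are hypothetical stand-ins, not the actual Synth SDK surface (see the SDK Guide for the real interface).

```python
# Hypothetical sketch only: `synth_ai.optimize` and all parameters
# below are illustrative stand-ins, not the real Synth SDK surface.
from synth_ai import optimize  # hypothetical import

job = optimize(
    task_app_url="http://localhost:8001",  # your FastAPI task app
    algorithm="gepa",
    proposer="gepa-ai",  # or "dspy", or the default Synth proposer
    rollout_budget=300,
)
print(job.best_prompt, job.best_score)
```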

Frameworks Compared

  • Synth AI: Synth’s Python client, designed for CLI use by agents and in production
  • GEPA-AI: Official library implementing GEPA
  • DSPy: Stanford’s framework with GEPA and MIPROv2 optimizers

LangProBe(+) Benchmark Suite

Four benchmarks testing different task types:
  • Banking77 (77-class intent classification)
  • HeartDisease (binary medical classification)
  • HotpotQA (multi-hop question answering)
  • PUPA (privacy-aware task delegation)
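To make the task types concrete, here is what scoring looks like for the two classification benchmarks; this is illustrative only, not the LangProBe implementation.

```python
# Illustrative only, not the LangProBe code: Banking77 (77-way) and
# HeartDisease (binary) both reduce to exact-match label accuracy.
def accuracy(predictions: list[str], labels: list[str]) -> float:
    hits = sum(
        p.strip().lower() == l.strip().lower()
        for p, l in zip(predictions, labels, strict=True)
    )
    return hits / len(labels)

print(accuracy(["card_lost", "top_up_failed"], ["card_lost", "pin_blocked"]))  # 0.5
```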

Methodology

Unified Adapter Pattern: Framework-specific adapters (DSPy, GEPA-AI, Synth) interface with shared task apps (FastAPI servers) that evaluate prompts on identical datasets. Same initial prompts, models, seeds, and rollout budgets across all frameworks. Architecture:
Task App (FastAPI) -> Adapter Layer -> Optimizer -> Results
Models Used: Groq Llama-3.1-8b (Banking77), OpenAI GPT variants (HeartDisease, Pupa), Llama-3.3-70b (HotpotQA)
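A minimal sketch of the task-app half of this architecture is below. The endpoint name and payload shape are illustrative, not the exact Synth task-app contract; the loader and model call are stubbed so the file runs as-is.

```python
# Minimal illustration of the adapter pattern's "task app": a FastAPI
# server that takes a candidate prompt, runs it over the benchmark
# split, and returns a score for the optimizer to use.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutRequest(BaseModel):
    prompt: str   # candidate system prompt proposed by the optimizer
    seed: int = 0 # shared seed so every framework sees identical data

def load_split(seed: int) -> list[dict]:
    # Stand-in for the benchmark loader; returns labeled examples.
    return [{"text": "I lost my card", "label": "card_loss"}]

def call_model(prompt: str, example: dict) -> str:
    # Stand-in for the policy-model call (e.g. Groq Llama-3.1-8b).
    return "card_loss"

@app.post("/rollout")
def rollout(req: RolloutRequest) -> dict:
    examples = load_split(req.seed)
    preds = [call_model(req.prompt, ex) for ex in examples]
    hits = sum(p == ex["label"] for p, ex in zip(preds, examples))
    return {"score": hits / len(examples), "n": len(examples)}
```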

Results Summary

Library Reference Experiments

GEPA-AI vs DSPy GEPA:

| Benchmark | Optimizer | Model | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|---|
| banking77 | GEPA-AI | groq/llama-3.1-8b-instant | 0.7500 | 0.9500 | +26.7% | 500 | 48.5s |
| banking77 | DSPy GEPA | groq/llama-3.1-8b-instant | 0.4900 | 0.4900 | +0.0% | 500 | 8.1s |
| heartdisease | GEPA-AI | groq/openai/gpt-oss-20b | 0.4800 | 0.7600 | +58.3% | 500 | 3.6m |
| heartdisease | DSPy GEPA | groq/openai/gpt-oss-20b | 0.0000 | 0.3800 | +0.3800 | 500 | 1.4m |
| hotpotqa | GEPA-AI | groq/llama-3.3-70b-versatile | 0.6500 | 0.7500 | +15.4% | 100 | 1.3m |
| hotpotqa | DSPy GEPA | groq/llama-3.3-70b-versatile | 0.0000 | 0.5167 | +0.5167 | 100 | 1.4m |
| pupa | GEPA-AI | groq/openai/gpt-oss-120b | 0.5557 | 0.5601 | +0.8% | 300 | 2.5m |
| pupa | DSPy GEPA | groq/openai/gpt-oss-120b | 0.4493 | 0.5102 | +13.6% | 300 | 1.3m |
DSPy-AI MIPROv2 Library:

| Benchmark | Model | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|
| heartdisease | groq/openai/gpt-oss-20b | 0.5100 | 0.6100 | +10.0% | 969 | 1.6m |
| banking77 | groq/llama-3.1-8b-instant | 0.5950 | 0.7050 | +11.0% | 791 | 4.1m |
Average Performance:

| Optimizer | Avg Final Score | Avg Lift (abs. points) | Benchmarks |
|---|---|---|---|
| GEPA-AI | 75.5% | +14.6% | 4 |
| DSPy-AI MIPROv2 | 65.8% | +10.5% | 2 |
| DSPy GEPA | 47.4% | +23.9% | 4 |
Note: DSPy GEPA’s large average lift comes from very low baselines (0–49% starting accuracy), which likely merits further review.
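The mixed lift units deserve a note. Recomputing from the rows above, the GEPA-AI vs DSPy GEPA table appears to report relative lift per run, falling back to the absolute delta when the baseline is 0 (where relative lift is undefined), while the MIPROv2 and Avg Lift numbers are absolute deltas in percentage points. A quick sketch of the convention, checked against the tables:

```python
# Lift conventions, inferred by recomputing the tables above.
def run_lift(baseline: float, best: float) -> str:
    if baseline == 0:
        return f"+{best:.4f}"  # absolute delta from a zero baseline
    return f"+{(best - baseline) / baseline:.1%}"  # relative lift

print(run_lift(0.48, 0.76))  # +58.3%  (heartdisease, GEPA-AI)
print(run_lift(0.00, 0.38))  # +0.3800 (heartdisease, DSPy GEPA)

# GEPA-AI's "Avg Lift" of +14.6% is the mean of its absolute deltas:
deltas = [0.95 - 0.75, 0.76 - 0.48, 0.75 - 0.65, 0.5601 - 0.5557]
print(f"+{sum(deltas) / len(deltas):.1%}")  # +14.6%
```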

Synth AI GEPA Results

Synth AI’s GEPA implementation with different proposer modes:
| Task | Policy Model | Proposer | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|---|
| Banking77_synth | llama-3.1-8b-instant | gepa-ai | 0.8700 | 1.0000 | +0.1300 | 136 | 185.2s |
| Banking77_dspy | llama-3.1-8b-instant | dspy | 0.8600 | 1.0000 | +0.1400 | 352 | 228.4s |
| Banking77_gepa_ai | llama-3.1-8b-instant | gepa-ai | 0.8700 | 0.8750 | +0.0050 | 352 | 190.9s |
| HeartDisease_dspy | openai/gpt-oss-20b | dspy | 0.5400 | 0.7500 | +0.2100 | 212 | 295.0s |
| HeartDisease_gepa_ai | openai/gpt-oss-20b | gepa-ai | 0.5400 | 0.6250 | +0.0850 | 240 | 301.3s |
| HotpotQA_gepa_ai | llama-3.3-70b-versatile | gepa-ai | 0.7817 | 0.8937 | +0.1120 | 96 | 275.4s |
| HotpotQA_dspy | llama-3.3-70b-versatile | dspy | 0.7867 | 0.7688 | -0.0229 | 88 | 252.7s |
Average Performance by Proposer Mode:

| Proposer Mode | Avg Baseline | Avg Final Score | Avg Lift | Avg Time |
|---|---|---|---|---|
| DSPy Proposer | 0.7406 | 0.7979 | +0.0574 | 15.2m |
| GEPA-AI Proposer | 0.7629 | 0.8484 | +0.0855 | 21.5m |
Synth GEPA with GEPA-AI proposer: 84.8% avg final score, +8.55% avg lift. Achieved 100% on Banking77.

Synth AI MIPRO Results

Synth AI’s MIPRO (using DSPy proposer mode):
| Benchmark | Status | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|
| HeartDisease | completed | 0.5400 | 0.7500 | +21.0% | N/A | N/A |
| HotpotQA | completed | 0.6050 | 0.6950 | +9.0% | 80 | 52.0s |
| Pupa | completed | 0.3466 | 0.5588 | +21.2% | 80 | 53.0s |
| Banking77 | completed | 0.9000 | 0.8400 | -6.0% | 125 | 48.0s |
Synth MIPRO Statistics:
  • Average Baseline: 59.8%
  • Average Best Score: 71.1%
  • Average Lift: +11.3%
  • Average Time: 51.0s
  • Success Rate: 75% (3/4 benchmarks improved)
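For the curious, these summary statistics follow directly from the four rows above; the time average excludes HeartDisease, whose time was N/A:

```python
# Reproducing the Synth MIPRO summary stats from the table above.
baselines = [0.5400, 0.6050, 0.3466, 0.9000]
bests     = [0.7500, 0.6950, 0.5588, 0.8400]
times_s   = [52.0, 53.0, 48.0]  # HeartDisease time was N/A, so excluded

print(f"{sum(baselines) / 4:.1%}")  # 59.8% avg baseline
print(f"{sum(bests) / 4:.1%}")      # 71.1% avg best score
print(f"{sum(b - a for a, b in zip(baselines, bests)) / 4:+.1%}")  # +11.3%
print(f"{sum(times_s) / 3:.1f}s")   # 51.0s avg time
```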


Analysis

Performance Rankings

| Rank | Framework | Avg Final Score | Notes |
|---|---|---|---|
| 1 | Synth GEPA (GEPA-AI proposer) | 84.8% | - |
| 2 | Synth GEPA (DSPy proposer) | 79.8% | - |
| 3 | GEPA-AI Library | 75.5% | - |
| 4 | Synth MIPRO (DSPy proposer) | 71.1% | - |
| 5 | DSPy-AI MIPROv2 | 65.8% | 2 benchmarks only |
| 6 | DSPy GEPA | 47.4% | - |

Interpretation

It’s important to keep in mind that these four benchmarks comprise only a small subset of the types of problems devs care about, the models here were all small language models, and the rollout budgets were shoestring. These scripts are chiefly intended to provide fast feedback to developers, not definitive scientific insight into which implementation is best. Still, it’s hard not to notice that all three implementations consistently deliver improvements even on modest budgets. If you’re interested in seeing what they do for your application, consider installing dspy-ai or gepa-ai, or creating a free account with Synth.

Reproduction

Full code, configs, and instructions: synth-ai/examples/blog_posts/langprobe

Learn More

  • Prompt Optimization Documentation
  • SDK & CLI