Machine-aided prompt learning is a useful tool for devs writing AI software. It’s cheap, interpretable, and easy to implement and version. By far the most popular method today is GEPA, an evolutionary search routine. In contexts where demonstrations are needed, though, MIPROv2 is likely still the state of the art. We’ve used related methods extensively internally to calibrate judges and establish strong agent baselines for SFT and RL development and testing. While we eventually expect most developers to reach for our own methods and their derivatives, we’re starting by offering serverless implementations of the academic state of the art, so devs can keep using the algorithms they already know, just serverlessly via the Synth platform.
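As a rough sketch of the evolutionary idea behind GEPA (heavily simplified; the real algorithm maintains a Pareto frontier of candidates and reflects on execution traces, not just scores), the loop looks something like this, where `evaluate` and `reflect_and_rewrite` stand in for a task rollout and a reflection-model call:

```python
# Schematic of a GEPA-style loop, greatly simplified for illustration.
# `evaluate` scores a prompt on the task; `reflect_and_rewrite` asks a
# reflection model to propose an edited prompt based on the parent.
import random

def optimize_prompt(seed_prompt, evaluate, reflect_and_rewrite, budget=100):
    pool = [(seed_prompt, evaluate(seed_prompt))]  # candidate pool
    for _ in range(budget):
        parent, _ = random.choice(pool)        # sample a candidate
        child = reflect_and_rewrite(parent)    # LLM proposes a mutation
        score = evaluate(child)                # rollout on the task
        if score >= max(s for _, s in pool):   # keep if competitive
            pool.append((child, score))
    return max(pool, key=lambda ps: ps[1])     # best prompt, best score
```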

Results Overview

| Benchmark | Synth GEPA (GEPA-AI) | Synth GEPA (DSPy) | Synth MIPRO | GEPA-AI Library | DSPy-AI MIPROv2 | DSPy GEPA Library |
|---|---|---|---|---|---|---|
| Banking77 | 87% → **100%** | 86% → **100%** | 90% → 84% | 75% → 95% | 60% → 71% | 49% → 49% |
| HeartDisease | 54% → 63% | 54% → 75% | 54% → 75% | 48% → **76%** | 51% → 61% | 0% → 38% |
| HotpotQA | 78% → **89%** | 79% → 77% | 61% → 70% | 65% → 75% | failed | 0% → 52% |
| Pupa | - | - | 35% → **56%** | 56% → **56%** | - | 45% → 51% |

Format: Baseline → Final Score. Bold indicates best final score per benchmark.

Benchmark Implementation Details

Because Synth uses an HTTP relay boundary to separate client-side business logic from server-side optimization logic, we’re unable to reuse the gepa-ai and dspy-ai implementations of GEPA and MIPROv2. Instead, we re-implemented the methods from scratch. To make sure we didn’t veer wildly off course, we took the opportunity to run all three implementations of GEPA, and both implementations of MIPROv2, on a small but growing fraction of the LangProBe benchmark, with a few related benchmarks thrown in. It’s impossible to compare the methods apples-to-apples, as the three differ in what constraints they place on the initial module: DSPy imposes substantial structure, GEPA-AI a bit, and Synth very little. Moreover, although the generation pipelines resemble one another and use the same reflection model (here, Llama 3.3 70B), they are not exact copies. Nonetheless, we hope the results we’ve obtained shed a bit of light on what developers can expect to see on simple pipeline tasks with small models.

Note: to minimize migration friction, Synth’s GEPA and MIPRO adapters let the user select between our own proposer and proposers closely resembling the dspy-ai and gepa-ai references. Synth GEPA (GEPA-AI proposer) is simply the core Synth GEPA optimizer with a proposer designed to resemble the gepa-ai proposer swapped in.
Want to try prompt optimization? See Prompt Optimization Overview for concepts, Train Guide for CLI usage, and SDK Guide for programmatic access.
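For a sense of what switching proposer modes looks like in practice, here is a purely illustrative sketch: the import path, function, and every parameter name below are hypothetical stand-ins, not the actual Synth SDK surface (see the SDK Guide for the real interface).

```python
# Hypothetical sketch only: `synth_ai.optimize` and all parameters
# below are illustrative stand-ins, not the real Synth SDK surface.
from synth_ai import optimize  # hypothetical import

job = optimize(
    task_app_url="http://localhost:8001",  # your FastAPI task app
    algorithm="gepa",
    proposer="gepa-ai",  # or "dspy", or the default Synth proposer
    rollout_budget=300,
)
print(job.best_prompt, job.best_score)
```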

Frameworks Compared

  • Synth AI: Synth’s Python client, designed for CLI use by agents and in production
  • GEPA-AI: Official library implementing GEPA
  • DSPy: Stanford’s framework with GEPA and MIPROv2 optimizers

LangProBe(+) Benchmark Suite

Four benchmarks testing different task types:
  • Banking77 (77-class intent classification)
  • HeartDisease (binary medical classification)
  • HotpotQA (multi-hop question answering)
  • PUPA (privacy-aware task delegation)
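To make the task types concrete, here is what scoring looks like for the two classification benchmarks; this is illustrative only, not the LangProBe implementation.

```python
# Illustrative only, not the LangProBe code: Banking77 (77-way) and
# HeartDisease (binary) both reduce to exact-match label accuracy.
def accuracy(predictions: list[str], labels: list[str]) -> float:
    hits = sum(
        p.strip().lower() == l.strip().lower()
        for p, l in zip(predictions, labels, strict=True)
    )
    return hits / len(labels)

print(accuracy(["card_lost", "top_up_failed"], ["card_lost", "pin_blocked"]))  # 0.5
```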

Methodology

Unified Adapter Pattern: Framework-specific adapters (DSPy, GEPA-AI, Synth) interface with shared task apps (FastAPI servers) that evaluate prompts on identical datasets. Same initial prompts, models, seeds, and rollout budgets across all frameworks. Architecture:
Task App (FastAPI) -> Adapter Layer -> Optimizer -> Results
Models Used: Groq Llama-3.1-8b (Banking77), OpenAI GPT variants (HeartDisease, Pupa), Llama-3.3-70b (HotpotQA)
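A minimal sketch of the task-app half of this architecture is below. The endpoint name and payload shape are illustrative, not the exact Synth task-app contract; the loader and model call are stubbed so the file runs as-is.

```python
# Minimal illustration of the adapter pattern's "task app": a FastAPI
# server that takes a candidate prompt, runs it over the benchmark
# split, and returns a score for the optimizer to use.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutRequest(BaseModel):
    prompt: str   # candidate system prompt proposed by the optimizer
    seed: int = 0 # shared seed so every framework sees identical data

def load_split(seed: int) -> list[dict]:
    # Stand-in for the benchmark loader; returns labeled examples.
    return [{"text": "I lost my card", "label": "card_loss"}]

def call_model(prompt: str, example: dict) -> str:
    # Stand-in for the policy-model call (e.g. Groq Llama-3.1-8b).
    return "card_loss"

@app.post("/rollout")
def rollout(req: RolloutRequest) -> dict:
    examples = load_split(req.seed)
    preds = [call_model(req.prompt, ex) for ex in examples]
    hits = sum(p == ex["label"] for p, ex in zip(preds, examples))
    return {"score": hits / len(examples), "n": len(examples)}
```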

Results Summary

Library Reference Experiments

GEPA-AI vs DSPy GEPA:

| Benchmark | Optimizer | Model | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|---|
| banking77 | GEPA-AI | groq/llama-3.1-8b-instant | 0.7500 | 0.9500 | +26.7% | 500 | 48.5s |
| banking77 | DSPy GEPA | groq/llama-3.1-8b-instant | 0.4900 | 0.4900 | +0.0% | 500 | 8.1s |
| heartdisease | GEPA-AI | groq/openai/gpt-oss-20b | 0.4800 | 0.7600 | +58.3% | 500 | 3.6m |
| heartdisease | DSPy GEPA | groq/openai/gpt-oss-20b | 0.0000 | 0.3800 | +0.3800 | 500 | 1.4m |
| hotpotqa | GEPA-AI | groq/llama-3.3-70b-versatile | 0.6500 | 0.7500 | +15.4% | 100 | 1.3m |
| hotpotqa | DSPy GEPA | groq/llama-3.3-70b-versatile | 0.0000 | 0.5167 | +0.5167 | 100 | 1.4m |
| pupa | GEPA-AI | groq/openai/gpt-oss-120b | 0.5557 | 0.5601 | +0.8% | 300 | 2.5m |
| pupa | DSPy GEPA | groq/openai/gpt-oss-120b | 0.4493 | 0.5102 | +13.6% | 300 | 1.3m |
DSPy-AI MIPROv2 Library:

| Benchmark | Model | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|
| heartdisease | groq/openai/gpt-oss-20b | 0.5100 | 0.6100 | +10.0% | 969 | 1.6m |
| banking77 | groq/llama-3.1-8b-instant | 0.5950 | 0.7050 | +11.0% | 791 | 4.1m |
Average Performance:

| Optimizer | Avg Final Score | Avg Lift (abs. points) | Benchmarks |
|---|---|---|---|
| GEPA-AI | 75.5% | +14.6% | 4 |
| DSPy-AI MIPROv2 | 65.8% | +10.5% | 2 |
| DSPy GEPA | 47.4% | +23.9% | 4 |
Note: DSPy GEPA’s large average lift comes from very low baselines (0–49% starting accuracy), which likely merits further review.
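The mixed lift units deserve a note. Recomputing from the rows above, the GEPA-AI vs DSPy GEPA table appears to report relative lift per run, falling back to the absolute delta when the baseline is 0 (where relative lift is undefined), while the MIPROv2 and Avg Lift numbers are absolute deltas in percentage points. A quick sketch of the convention, checked against the tables:

```python
# Lift conventions, inferred by recomputing the tables above.
def run_lift(baseline: float, best: float) -> str:
    if baseline == 0:
        return f"+{best:.4f}"  # absolute delta from a zero baseline
    return f"+{(best - baseline) / baseline:.1%}"  # relative lift

print(run_lift(0.48, 0.76))  # +58.3%  (heartdisease, GEPA-AI)
print(run_lift(0.00, 0.38))  # +0.3800 (heartdisease, DSPy GEPA)

# GEPA-AI's "Avg Lift" of +14.6% is the mean of its absolute deltas:
deltas = [0.95 - 0.75, 0.76 - 0.48, 0.75 - 0.65, 0.5601 - 0.5557]
print(f"+{sum(deltas) / len(deltas):.1%}")  # +14.6%
```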

Synth AI GEPA Results

Synth AI’s GEPA implementation with different proposer modes:
| Task | Policy Model | Proposer | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|---|
| Banking77_synth | llama-3.1-8b-instant | gepa-ai | 0.8700 | 1.0000 | +0.1300 | 136 | 185.2s |
| Banking77_dspy | llama-3.1-8b-instant | dspy | 0.8600 | 1.0000 | +0.1400 | 352 | 228.4s |
| Banking77_gepa_ai | llama-3.1-8b-instant | gepa-ai | 0.8700 | 0.8750 | +0.0050 | 352 | 190.9s |
| HeartDisease_dspy | openai/gpt-oss-20b | dspy | 0.5400 | 0.7500 | +0.2100 | 212 | 295.0s |
| HeartDisease_gepa_ai | openai/gpt-oss-20b | gepa-ai | 0.5400 | 0.6250 | +0.0850 | 240 | 301.3s |
| HotpotQA_gepa_ai | llama-3.3-70b-versatile | gepa-ai | 0.7817 | 0.8937 | +0.1120 | 96 | 275.4s |
| HotpotQA_dspy | llama-3.3-70b-versatile | dspy | 0.7867 | 0.7688 | -0.0229 | 88 | 252.7s |
Average Performance by Proposer Mode:

| Proposer Mode | Avg Baseline | Avg Final Score | Avg Lift | Avg Time |
|---|---|---|---|---|
| DSPy Proposer | 0.7406 | 0.7979 | +0.0574 | 15.2m |
| GEPA-AI Proposer | 0.7629 | 0.8484 | +0.0855 | 21.5m |
Synth GEPA with GEPA-AI proposer: 84.8% avg final score, +8.55% avg lift. Achieved 100% on Banking77.

Synth AI MIPRO Results

Synth AI’s MIPRO (using DSPy proposer mode):
| Benchmark | Status | Baseline | Best | Lift | Rollouts | Time |
|---|---|---|---|---|---|---|
| HeartDisease | completed | 0.5400 | 0.7500 | +21.0% | N/A | N/A |
| HotpotQA | completed | 0.6050 | 0.6950 | +9.0% | 80 | 52.0s |
| Pupa | completed | 0.3466 | 0.5588 | +21.2% | 80 | 53.0s |
| Banking77 | completed | 0.9000 | 0.8400 | -6.0% | 125 | 48.0s |
Synth MIPRO Statistics:
  • Average Baseline: 59.8%
  • Average Best Score: 71.1%
  • Average Lift: +11.3%
  • Average Time: 51.0s
  • Success Rate: 75% (3/4 benchmarks improved)
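For the curious, these summary statistics follow directly from the four rows above; the time average excludes HeartDisease, whose time was N/A:

```python
# Reproducing the Synth MIPRO summary stats from the table above.
baselines = [0.5400, 0.6050, 0.3466, 0.9000]
bests     = [0.7500, 0.6950, 0.5588, 0.8400]
times_s   = [52.0, 53.0, 48.0]  # HeartDisease time was N/A, so excluded

print(f"{sum(baselines) / 4:.1%}")  # 59.8% avg baseline
print(f"{sum(bests) / 4:.1%}")      # 71.1% avg best score
print(f"{sum(b - a for a, b in zip(baselines, bests)) / 4:+.1%}")  # +11.3%
print(f"{sum(times_s) / 3:.1f}s")   # 51.0s avg time
```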


Analysis

Performance Rankings

| Rank | Framework | Avg Final Score | Notes |
|---|---|---|---|
| 1 | Synth GEPA (GEPA-AI proposer) | 84.8% | - |
| 2 | Synth GEPA (DSPy proposer) | 79.8% | - |
| 3 | GEPA-AI Library | 75.5% | - |
| 4 | Synth MIPRO (DSPy proposer) | 71.1% | - |
| 5 | DSPy-AI MIPROv2 | 65.8% | 2 benchmarks only |
| 6 | DSPy GEPA | 47.4% | - |

Interpretation

It’s important to keep in mind that these four benchmarks comprise only a small subset of the types of problems devs care about, the models here were all small language models, and the rollout budgets were shoestring. These scripts are chiefly intended to provide fast feedback to developers, not definitive scientific insight into which implementation is best. Still, it’s hard not to notice that all three implementations consistently deliver improvements even on modest budgets. If you’re interested in seeing what they do for your application, consider installing dspy-ai or gepa-ai, or creating a free account with Synth.

Reproduction

Full code, configs, and instructions: synth-ai/examples/blog_posts/langprobe

Learn More

  • Prompt Optimization Documentation
  • SDK & CLI