| Pattern | Domain | Challenge | Ground Truth |
|---|---|---|---|
| Data-Heavy | Code evaluation | Large artifacts, many test cases | Deterministic tests |
| Criteria-Heavy | Visual evaluation | Multi-dimensional assessment | Human ratings |
## Data-Heavy: Code Evaluation
When evaluating code, deterministic signals exist (compilation, tests), but they don't capture every quality dimension. Verifier optimization tunes the rubric so that scores correlate with these signals while also capturing harder-to-measure qualities.

### The Challenge
Deterministic tests provide partial ground truth:

| Signal | Tests Provide | Verifier Adds |
|---|---|---|
| Correctness | ✅ Pass/fail | Pattern adherence to reference |
| Completeness | ❌ May miss stubs | Detects todo!() placeholders |
| Code quality | ❌ Not measured | Idiomatic patterns, readability |
| Architecture | ❌ Not measured | Engine pattern compliance |
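One signal from the table, completeness, can be approximated with a simple static check: scanning generated Rust code for placeholder macros that compile but implement nothing. The function below is an illustrative sketch, not the actual verifier; the penalty value is arbitrary.

```python
import re

def completeness_check(code: str) -> float:
    """Flag unimplemented stubs that deterministic tests may miss.

    `todo!()` and `unimplemented!()` are Rust placeholder macros:
    the code compiles, but panics at runtime if the path is hit.
    """
    stubs = re.findall(r"\b(?:todo|unimplemented)!\(\)", code)
    # Start from a perfect score and penalize each stub found.
    return max(0.0, 1.0 - 0.25 * len(stubs))

print(completeness_check("fn run() { todo!() }"))  # 0.75
print(completeness_check("fn run() -> i32 { 1 }"))  # 1.0
```

A check like this complements the test suite: tests that never exercise the stubbed path would report a pass, while the verifier still flags the artifact as incomplete.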
### What Gets Optimized
GEPA evolves the verifier's rubric and evaluation prompt.

### Architecture
### Configuration
`verifier_optimization.toml`
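The referenced file might look roughly like the sketch below. Every key name here is hypothetical; only the verifier model and the criteria are taken from this page, and the real schema may differ.

```toml
# Hypothetical sketch -- key names are illustrative, not the real schema.
[verifier]
model = "gpt-5-mini"  # verifier model from the comparison table

[optimization]
criteria = ["correctness", "completeness", "code_quality", "architecture"]
correlation_target = "test_pass_rate"
```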
### Data-Heavy Characteristics
| Factor | Impact on Verifier Optimization |
|---|---|
| Many test cases | More ground truth for correlation |
| Large artifacts | Verifier must handle 200-300KB code |
| Expensive rollouts | Amortize cost across verifier candidates |
| 5+ criteria | More weights to optimize |
### Run It
View on GitHub

## Criteria-Heavy: Visual Evaluation
Visual evaluation has no deterministic ground truth; the only signal is human judgment. Verifier optimization tunes criteria to match human ratings across multiple dimensions.

### The Challenge
Visual fidelity requires multi-dimensional assessment:

| Criterion | What to Measure | Baseline Score |
|---|---|---|
| Color Scheme | Background, text, accent colors | 2.4/10 |
| Typography | Font sizes, weights, hierarchy | 3.8/10 |
| Layout | Spacing, margins, positioning | 4.0/10 |
| Visual Elements | Icons, images, decorations | 2.8/10 |
| Overall | Would it pass for the original? | 3.2/10 |
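Per-criterion scores like the baselines above are typically combined into a single verdict with a weighted average; the weights are among the knobs an optimizer can tune. The snippet below is an illustrative sketch using the table's baseline values with uniform starting weights; the names and aggregation scheme are assumptions, not this project's actual implementation.

```python
# Baseline criterion scores from the table (out of 10) and uniform
# starting weights that optimization would later tune per criterion.
criteria = {
    "color_scheme": 2.4,
    "typography": 3.8,
    "layout": 4.0,
    "visual_elements": 2.8,
    "overall": 3.2,
}
weights = {name: 1.0 for name in criteria}

def aggregate(scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores."""
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

print(f"{aggregate(criteria, weights):.2f}/10")  # uniform mean of baselines
```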
### What Gets Optimized
GEPA evolves the verifier's evaluation prompt and criteria definitions.

### Architecture
### Configuration
`visual_verifier_optimization.toml`
### Criteria-Heavy Characteristics
| Factor | Impact on Verifier Optimization |
|---|---|
| 5 evaluation dimensions | Each criterion needs calibration |
| Subjective ground truth | Human ratings have variance |
| Vision model required | Multimodal prompt optimization |
| No deterministic signals | Entirely dependent on verifier quality |
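Because human ratings have variance, a common way to stabilize the optimization target is to average multiple raters per artifact and then measure how closely the verifier tracks that average. A minimal sketch with invented numbers:

```python
from statistics import mean

# Hypothetical data: three human raters per artifact vs. one verifier score.
human_ratings = [[3, 4, 3], [7, 8, 7], [5, 5, 6]]
verifier_scores = [3.5, 7.0, 5.5]

# Average over raters to absorb per-rater variance, then measure alignment
# as mean absolute error (lower is better for a candidate verifier).
targets = [mean(rs) for rs in human_ratings]
mae = mean(abs(v - t) for v, t in zip(verifier_scores, targets))
print(f"MAE vs. mean human rating: {mae:.2f}")
```

With no deterministic signal to fall back on, this human-alignment metric is the entire fitness function, which is why verifier quality matters more here than in the code-evaluation pattern.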
### Run It
View on GitHub

## Comparison
| Aspect | Data-Heavy (Code) | Criteria-Heavy (Visual) |
|---|---|---|
| Ground truth | Deterministic tests | Human ratings |
| Optimized components | Rubric weights, criterion descriptions | Evaluation prompt, criteria definitions |
| Verifier model | gpt-5-mini | gemini-2.5-flash (vision) |
| Correlation target | Compile + test pass rate | Human similarity scores |
| Main challenge | Many criteria to weight | Subjective dimensions to calibrate |
## When to Use Each Pattern
**Data-Heavy:**

- Ground truth from deterministic signals (tests, validators)
- Many evaluation criteria to weight
- Large artifacts requiring structured evaluation
- Correlation with existing metrics

**Criteria-Heavy:**

- Subjective or creative evaluation
- No deterministic ground truth
- Multi-dimensional quality assessment
- Vision or multimodal evaluation