---
title: "LLM evals-as-code · the CI gate we run on every RAG deploy"
description: "The eval harness we actually ship: metrics, regression-blocking thresholds, gold-set curation, and the eval-as-code CI gate that catches regressions early."
date: 2026-04-22
updated: 2026-04-22
author: "Dezső Mező"
tags: "AI, LLM, RAG, Eval, CI, Production"
slug: llm-evals-as-code-production
canonical: https://dfieldsolutions.com/blog/llm-evals-as-code-production
---

# LLM evals-as-code · the CI gate we run on every RAG deploy

An eval that's not in CI is not an eval. Here's the evals-as-code workflow we run on every RAG project.
An eval you run manually before each release isn't an eval · it's hope. The RAG systems that actually hold up in production have evals-as-code: a fixed gold set, metric classes in CI, regression blocking, and diff reporting per pull request.

## Five metric classes we measure

- Faithfulness · did the answer stay inside the retrieved context?
- Context precision · did retrieval pull the right chunks?
- Answer relevance · did the answer address the question?
- Bias · did neutral context produce a neutral answer?
- Injection resistance · did the system resist 80+ known attack patterns?
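
To make one of these concrete, here is a minimal sketch of context precision in its simplest form: the share of retrieved chunks that the gold set labels as relevant. It assumes each gold-set row carries hand-labelled chunk IDs; the `RetrievalCase` shape, the function name, and the sample IDs are hypothetical, and many harnesses score this metric with an LLM judge instead.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalCase:
    """One gold-set row (hypothetical shape): a question plus labelled chunk IDs."""
    question: str
    relevant_chunk_ids: frozenset[str]


def context_precision(retrieved_ids: list[str], case: RetrievalCase) -> float:
    """Fraction of retrieved chunks that the gold set marks as relevant."""
    if not retrieved_ids:
        return 0.0  # nothing retrieved: treat as zero precision rather than divide by zero
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in case.relevant_chunk_ids)
    return hits / len(retrieved_ids)


# Illustrative data only: two of the four retrieved chunks are labelled relevant.
case = RetrievalCase(
    question="What is the refund window?",
    relevant_chunk_ids=frozenset({"policy-12", "policy-13"}),
)
score = context_precision(["policy-12", "faq-03", "policy-13", "blog-88"], case)
print(f"context precision: {score:.2f}")  # prints "context precision: 0.50"
```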

## Gold set discipline

Version the gold set in the repo · not in a spreadsheet, not on a shared Notion page. Any change to it requires a PR with reviewer sign-off. The eval harness pins to a specific gold-set version, so eval runs stay comparable over time.
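
One way to enforce the pin is a content hash: the harness refuses to run if the gold set on disk doesn't match the digest committed next to the eval config. This is a sketch under that assumption, not a description of any particular tool; `gold_set_digest`, `assert_pinned`, and the sample rows are hypothetical names and data.

```python
import hashlib
import json


def gold_set_digest(rows: list[dict]) -> str:
    """Stable SHA-256 over the gold set, independent of key order within a row."""
    canonical = "\n".join(
        json.dumps(row, sort_keys=True, ensure_ascii=False) for row in rows
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_pinned(rows: list[dict], pinned_digest: str) -> None:
    """Raise if the loaded gold set differs from the version the harness is pinned to."""
    actual = gold_set_digest(rows)
    if actual != pinned_digest:
        raise RuntimeError(
            f"Gold set changed: pinned {pinned_digest[:12]}, found {actual[:12]}"
        )


# Illustrative rows only; a real gold set would be loaded from a file in the repo.
gold_rows = [
    {"id": "q-001", "question": "What is the refund window?", "expected": "30 days"},
    {"id": "q-002", "question": "Do you ship to Canada?", "expected": "Yes"},
]
pinned = gold_set_digest(gold_rows)  # assumed to be committed alongside the eval config
assert_pinned(gold_rows, pinned)     # passes: content matches the pin
```

Because the digest changes whenever a row is added, removed, or edited, updating the pin becomes a visible line in the same PR that changes the gold set.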

## Regression blocking

CI fails the build when any metric drops more than 5 points (0.05 on a 0-1 scale) below the baseline. Baseline updates only merge when a human explicitly approves them · not automatically on every improvement. That stops the drift where 'small' regressions compound.

```python
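MAX_DROP = 0.05  # 5 points on a 0-1 score scale

def check_regression(metric: str, current_score: float, baseline_score: float) -> None:
    """Fail the build when a metric falls more than MAX_DROP below its baseline."""
    if current_score < baseline_score - MAX_DROP:
        raise AssertionError(
            f'Regression · {metric}: {current_score:.3f} < {baseline_score:.3f} - {MAX_DROP}'
        )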
```
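
The single-metric check above extends naturally to the whole suite and to the per-PR diff report mentioned earlier. The sketch below assumes baseline and current scores arrive as plain dicts on a 0-1 scale; `diff_report` and the sample numbers are hypothetical, and it collects every failure before failing the build rather than stopping at the first one.

```python
MAX_DROP = 0.05  # same threshold as the single-metric check


def diff_report(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = MAX_DROP,
) -> tuple[list[str], list[str]]:
    """Return (report lines, names of metrics that regressed or went missing)."""
    lines: list[str] = []
    failed: list[str] = []
    for metric in sorted(baseline):
        if metric not in current:
            # A metric that silently disappears should block the build too.
            lines.append(f"{metric}: missing from current run  FAIL")
            failed.append(metric)
            continue
        delta = current[metric] - baseline[metric]
        regressed = current[metric] < baseline[metric] - max_drop
        status = "FAIL" if regressed else "ok"
        lines.append(
            f"{metric}: {baseline[metric]:.3f} -> {current[metric]:.3f} ({delta:+.3f})  {status}"
        )
        if regressed:
            failed.append(metric)
    return lines, failed


# Illustrative scores only, not real results.
baseline = {"faithfulness": 0.91, "context_precision": 0.84, "answer_relevance": 0.88}
current = {"faithfulness": 0.85, "context_precision": 0.86, "answer_relevance": 0.86}

lines, failed = diff_report(baseline, current)
print("\n".join(lines))  # this text could be posted as the PR comment
if failed:
    print(f"Blocking merge, regressed metrics: {', '.join(failed)}")
```

In CI, the final step would exit non-zero when `failed` is non-empty; printing the full table first means the reviewer sees every metric's movement, not just the one that tripped the gate.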

## Weekly canary against live model

Beyond PR-blocking evals, we run the same suite weekly against live production traffic. This catches drift between what CI saw at merge time and what happens in the wild: query distribution shift, prompt-template drift, and data staleness.

> **TIP:** The fastest evals-as-code win: point promptfoo at 20 of your real customer prompts + what you'd accept as the right answer. Ship to CI. Before that, any LLM quality conversation is vibes.

---

Source: https://dfieldsolutions.com/blog/llm-evals-as-code-production
Author: Dezső Mező · Founder, DField Solutions
Site: https://dfieldsolutions.com
