An eval you run manually before each release isn't an eval · it's hope. The RAG systems that actually hold up in production have evals-as-code: a fixed gold set, metric classes in CI, regression blocking, and diff reporting per pull request.

Five metric classes we measure

  • Faithfulness · did the answer stay inside the retrieved context?
  • Context precision · did retrieval pull the right chunks?
  • Answer relevance · did the answer address the question?
  • Bias · did neutral context produce a neutral answer?
  • Injection resistance · did the system resist 80+ known attack patterns?
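
One way to give these classes a common shape in code is a shared metric interface, sketched below. The names `EvalCase`, `Metric`, and `LexicalFaithfulness` are hypothetical, and the token-overlap score is a deliberately crude stand-in: real faithfulness scoring usually relies on an LLM judge or an entailment model rather than word matching.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class EvalCase:
    """One gold-set row plus what the system produced for it (hypothetical shape)."""
    question: str
    retrieved_context: str
    answer: str


class Metric(Protocol):
    """Every metric class exposes a name and a score in [0, 1], higher is better."""
    name: str

    def score(self, case: EvalCase) -> float:
        ...


def _tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens; punctuation is ignored.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class LexicalFaithfulness:
    """Toy proxy: share of answer tokens that also appear in the retrieved context."""
    name = "faithfulness"

    def score(self, case: EvalCase) -> float:
        answer_tokens = _tokens(case.answer)
        if not answer_tokens:
            return 0.0
        context_tokens = _tokens(case.retrieved_context)
        return len(answer_tokens & context_tokens) / len(answer_tokens)
```

With every metric behind the same `score` method, the harness can loop over all five classes and hand CI a flat mapping of metric name to score.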

Gold set discipline

Version the gold set in the repo · not in a spreadsheet, not on a shared Notion page. Any change to it requires a PR with reviewer sign-off. The eval harness pins to a specific gold-set version, so runs from different points in time are scored against the same data and stay comparable.
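
A minimal sketch of pinning, assuming the gold set lives in the repo as a JSONL file and the pin is a SHA-256 digest stored next to the harness config. The file format, the function names, and the choice of a content hash (rather than a git tag or commit) are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def gold_set_digest(path: Path) -> str:
    """SHA-256 of the gold-set file's raw bytes; any edit changes the digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_pinned_gold_set(path: Path, pinned_digest: str) -> list[dict]:
    """Load a JSONL gold set, refusing to run if it differs from the pinned version."""
    actual = gold_set_digest(path)
    if actual != pinned_digest:
        raise RuntimeError(
            f"Gold set {path} has digest {actual[:12]}, expected {pinned_digest[:12]}; "
            "update the pin in a reviewed PR."
        )
    with path.open(encoding="utf-8") as fh:
        # One JSON object per non-empty line.
        return [json.loads(line) for line in fh if line.strip()]
```

Because the pin changes only through a PR, a gold-set edit and the pin update that acknowledges it show up in the same review.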

Regression blocking

CI fails the build when any metric drops more than 5 points (0.05 on a 0-1 scale) below the baseline. Baseline updates only merge when a human explicitly approves them · not automatically on every improvement. This stops the drift where 'small' regressions compound.

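def assert_no_regression(metric: str, current_score: float, baseline_score: float) -> None:
    # Scores are on a 0-1 scale, so 0.05 is the 5-point regression budget.
    if current_score < baseline_score - 0.05:
        raise AssertionError(
            f'Regression · {metric}: {current_score:.3f} < {baseline_score:.3f} - 0.05'
        )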
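
The same threshold can drive the per-pull-request diff report. The sketch below is one possible shape, not the actual harness: `diff_report` and its line format are hypothetical, and treating a metric that is missing from the current run as a score of 0.0 is an assumption.

```python
def diff_report(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.05,
) -> tuple[list[str], list[str]]:
    """Return (report_lines, regressed_metric_names) for scores on a 0-1 scale."""
    lines: list[str] = []
    regressed: list[str] = []
    for metric in sorted(baseline):
        before = baseline[metric]
        # Assumption: a metric missing from the current run counts as 0.0.
        after = current.get(metric, 0.0)
        failed = after < before - max_drop
        if failed:
            regressed.append(metric)
        status = "FAIL" if failed else "ok"
        lines.append(f"{metric:<20} {before:.3f} -> {after:.3f} ({after - before:+.3f}) {status}")
    return lines, regressed
```

A CI step could post `lines` as a PR comment and exit non-zero when `regressed` is non-empty, so reviewers see every metric's movement and not only the one that tripped the gate.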

Weekly canary against live model

Beyond PR-blocking evals, we run the same suite weekly against live production traffic. This catches drift between what CI saw at merge time and what happens in the wild (query distribution shift, prompt-template drift, data staleness).
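
A rough sketch of the two pieces such a canary needs: a reproducible sample of the week's queries, and a comparison of canary scores against the scores recorded at merge time. The function names, the seeded sampling strategy, and the reuse of 0.05 as the allowed gap are all assumptions, not details from the pipeline described here.

```python
import random


def sample_canary_queries(logged_queries: list[str], k: int, seed: int) -> list[str]:
    """Pick up to k distinct queries from one week's log, reproducibly for a given seed."""
    # Sort the de-duplicated queries so the sample does not depend on log order.
    unique = sorted(set(logged_queries))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


def drifted_metrics(
    ci_scores: dict[str, float],
    canary_scores: dict[str, float],
    max_gap: float = 0.05,
) -> dict[str, float]:
    """Map each metric whose canary score trails the CI score by more than max_gap to that gap."""
    return {
        name: ci_scores[name] - canary_scores[name]
        for name in ci_scores
        if name in canary_scores and ci_scores[name] - canary_scores[name] > max_gap
    }
```

Seeding the sampler (for example with the ISO week number) lets a surprising canary result be re-run on exactly the same queries.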

The fastest evals-as-code win: point promptfoo at 20 of your real customer prompts, each paired with what you'd accept as the right answer. Ship it to CI. Until then, any LLM quality conversation is vibes.

By Dezső Mező

Founder, DField Solutions

I've shipped production products from fintech to creator-tooling · for startups and enterprises, from Budapest to San Francisco.
