---
title: "Shipping AI agents that actually work in production"
description: "How we deliver LLM agents that don't just demo well: retrieval, evals, guardrails, and cost control as a real engineering system."
date: 2026-04-08
updated: 2026-04-15
author: "Dezső Mező"
tags: "AI, LLM, RAG, Production"
slug: shipping-ai-agents-that-work
canonical: https://dfieldsolutions.com/blog/shipping-ai-agents-that-work
---

# Shipping AI agents that actually work in production

From demo to live system: the retrieval, evals, guardrails, and cost control we run on every AI project we ship.

Most 'AI agent' projects we see start with a promising ChatGPT demo. Three months later, nobody knows why the agent hallucinates, why it's expensive, or why it falls apart under real users. The problem isn't the LLM. The problem is the missing systems thinking.

Here's how we deliver AI agents that behave like production systems: every release passes an eval suite, every token has a cost SLA, and we see in real time when behavior drifts from the trend.

## 1. Retrieval: if you only do one thing, do this

Most hallucinations aren't fixed by a bigger model; they're fixed by retrieval. When the relevant context is already in the prompt, the model has far less to invent. In our experience, hybrid retrieval (BM25 + vector search + a reranker) plus careful chunking resolves about 80% of customer-facing errors.

- Chunk size: 300-800 tokens, with 15-20% overlap.
- A reranker (bge-reranker, Cohere rerank-3) gives a dramatic quality jump.
- Always return citations. No hits → refuse.
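
The chunking rule in the list above can be sketched as follows. This is a minimal illustration, not our production chunker: `chunkWithOverlap` and its parameters are hypothetical names, and whitespace-separated words stand in for model tokens (a real pipeline would count tokens with the model's own tokenizer).

```ts
// Hypothetical helper, not from any library.
// Assumption: whitespace-separated words stand in for model tokens.
interface Chunk {
  index: number; // position of the chunk within the document
  start: number; // offset of the chunk's first token
  tokens: string[];
}

function chunkWithOverlap(text: string, chunkSize = 500, overlapRatio = 0.15): Chunk[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error("chunkSize must be a positive integer");
  }
  if (overlapRatio < 0 || overlapRatio >= 1) {
    throw new Error("overlapRatio must be in [0, 1)");
  }

  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const overlap = Math.floor(chunkSize * overlapRatio);
  const step = chunkSize - overlap; // always >= 1 given the checks above
  const chunks: Chunk[] = [];

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push({ index: chunks.length, start, tokens: tokens.slice(start, start + chunkSize) });
    if (start + chunkSize >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}
```

With the defaults, consecutive chunks share roughly 75 tokens, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.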

## 2. Evals: 'looks fine' is no longer fine

We build a golden set of 50-200 questions from the customer's data and run it in CI before every release, combining LLM-as-judge scoring with factual regression tests. If the quality trend breaks, we don't deploy.

```ts
// Eval CI step: run the golden set and fail the build if the suite does not pass.
import { runEvals } from "@dfield/eval";

const result = await runEvals({
  suite: "support-copilot",
  model: process.env.MODEL_VERSION,
  thresholds: { accuracy: 0.88, factual: 0.95, latencyP95Ms: 1800 },
});

if (!result.passed) {
  throw new Error(`Eval failed: ${result.failures.join(", ")}`);
}
```
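
To make the thresholds in that step concrete, here is a rough sketch of what such a gate might compute from per-question results. It is an illustration only and not the implementation of `@dfield/eval`: `CaseResult`, `percentile`, and `evaluateGate` are hypothetical names, and the `correct` and `factual` flags are assumed to come from the judge and the regression checks.

```ts
// Hypothetical sketch of a release gate. This is NOT the @dfield/eval implementation.
interface CaseResult {
  id: string;
  correct: boolean; // judge verdict, for example from an LLM-as-judge prompt
  factual: boolean; // factual regression check against the golden answer
  latencyMs: number;
}

interface Thresholds {
  accuracy: number;
  factual: number;
  latencyP95Ms: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
}

function evaluateGate(results: CaseResult[], t: Thresholds): { passed: boolean; failures: string[] } {
  if (results.length === 0) return { passed: false, failures: ["golden set is empty"] };

  const share = (pick: (r: CaseResult) => boolean) => results.filter(pick).length / results.length;
  const accuracy = share((r) => r.correct);
  const factual = share((r) => r.factual);
  const p95 = percentile(results.map((r) => r.latencyMs), 95);

  const failures: string[] = [];
  if (accuracy < t.accuracy) failures.push(`accuracy ${accuracy.toFixed(2)} < ${t.accuracy}`);
  if (factual < t.factual) failures.push(`factual ${factual.toFixed(2)} < ${t.factual}`);
  if (p95 > t.latencyP95Ms) failures.push(`latencyP95Ms ${p95} > ${t.latencyP95Ms}`);
  return { passed: failures.length === 0, failures };
}
```

Returning the list of failed thresholds, rather than a bare boolean, is what lets the CI log say which metric regressed.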

## 3. Guardrails: PII, prompt injection, output schema

Input side: PII scrubber, prompt-injection detector (keyword + LLM classifier). Output side: JSON schema validation, topic filters. This isn't cosmetic; it's what protects the brand.

> **TIP:** Guardrails are cheap insurance: they add little latency and, in our experience, stop about 99% of unsafe or off-brand output.
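
A simplified sketch of the cheap, deterministic parts of these guardrails follows. Everything in it is an assumption for illustration: the regexes cover only obvious emails and phone numbers, the marker list is a tiny sample of the keyword pass (the LLM classifier is not shown), and `parseAnswer` is a hand-written shape check standing in for a real JSON schema validator. The `AgentAnswer` shape is hypothetical.

```ts
// Illustrative patterns only: real PII detection needs much broader coverage.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

// Input side: replace obvious PII with placeholders before the text reaches the model.
function scrubPii(input: string): string {
  return input.replace(EMAIL, "[EMAIL]").replace(PHONE, "[PHONE]");
}

// Input side: keyword pass of the injection detector (sample phrases, not a complete list).
const INJECTION_MARKERS = [
  "ignore previous instructions",
  "ignore all previous",
  "reveal your system prompt",
];

function looksLikeInjection(input: string): boolean {
  const normalized = input.toLowerCase().replace(/\s+/g, " ");
  return INJECTION_MARKERS.some((marker) => normalized.includes(marker));
}

// Output side: hypothetical answer shape with mandatory citations.
interface AgentAnswer {
  answer: string;
  citations: string[];
}

// Returns the parsed answer, or null when the model output does not match the shape.
function parseAnswer(raw: string): AgentAnswer | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const { answer, citations } = data as Record<string, unknown>;
  if (typeof answer !== "string" || answer.length === 0) return null;
  if (!Array.isArray(citations) || citations.length === 0) return null;
  if (!citations.every((c) => typeof c === "string")) return null;
  return { answer, citations: citations as string[] };
}
```

A `null` from `parseAnswer` is the signal to retry or refuse instead of passing malformed output to the user.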

## 4. Cost control: LLM router + cache

Not every question needs a GPT-4o answer. Route by intent: simple FAQ → small model + cache; complex reasoning → big model. A 3-5x cost reduction is realistic.
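
As a rough sketch of that routing, assuming a keyword heuristic in place of a real intent classifier and an in-memory map in place of a shared cache, the flow might look like this. `classifyIntent`, `pickTier`, `AnswerCache`, and `answer` are hypothetical names, and the model calls are injected so that no specific provider SDK is implied.

```ts
type Intent = "faq" | "reasoning";
type Tier = "small" | "big";
type ModelCall = (prompt: string) => Promise<string>;

// Assumption: a keyword and length heuristic stands in for a trained intent classifier.
function classifyIntent(question: string): Intent {
  const q = question.toLowerCase();
  const needsReasoning =
    /\b(why|compare|explain|analy[sz]e|plan)\b/.test(q) || q.split(/\s+/).length > 40;
  return needsReasoning ? "reasoning" : "faq";
}

function pickTier(intent: Intent): Tier {
  return intent === "faq" ? "small" : "big";
}

// In-memory cache keyed by the normalized question text.
class AnswerCache {
  private readonly store = new Map<string, string>();

  private static key(question: string): string {
    return question.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(question: string): string | undefined {
    return this.store.get(AnswerCache.key(question));
  }

  set(question: string, answer: string): void {
    this.store.set(AnswerCache.key(question), answer);
  }
}

// Only small-tier (FAQ) answers are cached; reasoning requests always reach the big model.
async function answer(
  question: string,
  models: Record<Tier, ModelCall>,
  cache: AnswerCache,
): Promise<string> {
  const tier = pickTier(classifyIntent(question));
  if (tier === "small") {
    const hit = cache.get(question);
    if (hit !== undefined) return hit;
  }
  const text = await models[tier](question);
  if (tier === "small") cache.set(question, text);
  return text;
}
```

Exact-match caching only helps with literally repeated questions. A production cache would likely add expiry and some form of semantic matching, both left out here.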

## 5. Observability: every question measured

OpenTelemetry + our own dashboard: tokens in/out, latency P50/P95/P99, quality metrics (accuracy, refusal rate), cost per user. When a metric crosses its threshold, a pager fires and we know right away.
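
Two of these metrics, cost per user and refusal rate, can be derived from per-request records along the lines below. This is a sketch under stated assumptions and independent of OpenTelemetry: the record shape, the `Pricing` fields (USD per million tokens), and the limits are all hypothetical, and in practice the values would be emitted as metrics or span attributes, not computed in a batch.

```ts
interface RequestRecord {
  userId: string;
  tokensIn: number;
  tokensOut: number;
  refused: boolean;
}

// Hypothetical prices in USD per million tokens; substitute your provider's rates.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface Summary {
  costPerUser: Map<string, number>;
  refusalRate: number;
}

interface Limits {
  maxRefusalRate: number;
  maxCostPerUser: number;
}

function requestCost(r: RequestRecord, p: Pricing): number {
  return (r.tokensIn * p.inputPerMTok + r.tokensOut * p.outputPerMTok) / 1e6;
}

function summarize(records: RequestRecord[], pricing: Pricing): Summary {
  const costPerUser = new Map<string, number>();
  let refusals = 0;
  for (const r of records) {
    costPerUser.set(r.userId, (costPerUser.get(r.userId) ?? 0) + requestCost(r, pricing));
    if (r.refused) refusals += 1;
  }
  const refusalRate = records.length === 0 ? 0 : refusals / records.length;
  return { costPerUser, refusalRate };
}

// Returns one message per breached limit; an empty list means nothing to page about.
function breachedLimits(summary: Summary, limits: Limits): string[] {
  const alerts: string[] = [];
  if (summary.refusalRate > limits.maxRefusalRate) {
    alerts.push(`refusal rate ${summary.refusalRate.toFixed(2)} > ${limits.maxRefusalRate}`);
  }
  summary.costPerUser.forEach((cost, userId) => {
    if (cost > limits.maxCostPerUser) {
      alerts.push(`cost for ${userId} ${cost.toFixed(2)} > ${limits.maxCostPerUser}`);
    }
  });
  return alerts;
}
```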

## Closing

An AI system isn't fundamentally different from any other backend service: it needs the same engineering discipline. If you want to start this way, email us; we can show a running prototype on your data within a week.

---

Source: https://dfieldsolutions.com/blog/shipping-ai-agents-that-work
Author: Dezső Mező · Founder, DField Solutions
Site: https://dfieldsolutions.com
