Shipping AI agents that actually work in production
From demo to live system: the retrieval, eval, guardrails and cost control we run on every AI project we ship.
Reviewed by: Dezső Mező · Founder · Engineer, DField Solutions · 18 Apr 2026
Most 'AI agent' projects we see start with a promising ChatGPT demo, and three months later nobody knows why it hallucinates, why it's expensive, or why it falls apart in front of real users. The problem isn't the LLM; it's the missing systems thinking.
Here's how we deliver AI agents that behave like production systems: every release passes an eval suite, every token has a cost SLA, and we see in real time when behavior drifts from the baseline.
Most hallucinations aren't solved by a bigger model; they're solved by retrieval. If the answer is already in the prompt context, the model has nothing to invent. Hybrid retrieval (BM25 + vector search + a reranker) plus careful chunking covers roughly 80% of the errors we see on customer projects.
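A minimal sketch of the fusion step, assuming BM25 and vector search each return a ranked list. Reciprocal rank fusion (RRF) is one common way to merge them before the reranker pass; the document IDs and the `reciprocalRankFusion` helper here are illustrative, not part of any real API:

```typescript
// Hypothetical sketch: merging BM25 and vector results with
// reciprocal rank fusion (RRF) before handing the top hits to a reranker.
type Ranked = { id: string }[];

function reciprocalRankFusion(lists: Ranked[], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((doc, rank) => {
      // Each list contributes 1 / (k + rank) for a document it contains
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage: documents ranked high in *both* lists bubble to the top
const bm25 = [{ id: "doc-a" }, { id: "doc-b" }, { id: "doc-c" }];
const vector = [{ id: "doc-b" }, { id: "doc-d" }, { id: "doc-a" }];
const fused = reciprocalRankFusion([bm25, vector]);
// fused[0].id === "doc-b" (strong rank in both lists)
```

The nice property of RRF is that it needs no score normalization: BM25 and cosine-similarity scores live on different scales, but ranks are always comparable.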
We build a golden set (50–200 questions) from the customer's data and run it in CI before every release: LLM-as-judge plus factual regression tests. If the quality trend breaks, we don't deploy.
// Eval CI step
import { runEvals } from "@dfield/eval";

const result = await runEvals({
  suite: "support-copilot",
  model: process.env.MODEL_VERSION,
  thresholds: { accuracy: 0.88, factual: 0.95, latencyP95Ms: 1800 },
});

if (!result.passed) {
  throw new Error(`Eval failed: ${result.failures.join(", ")}`);
}

Input side: PII scrubber, prompt-injection detector (keyword + LLM classifier). Output side: JSON schema validation, topic filters. This isn't cosmetic: it's what protects the brand.
Guardrails are cheap insurance: they barely affect latency, and they stop 99% of unsafe / off-brand output.
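The two layers can be sketched like this; the keyword patterns and the `AgentReply` shape are illustrative stand-ins, not our production rules, and in practice a schema library (e.g. zod) would do the output validation:

```typescript
// Input side: naive prompt-injection pre-filter. A cheap keyword pass
// like this runs before the more expensive LLM classifier.
const INJECTION_PATTERNS = [/ignore (all|previous) instructions/i, /system prompt/i];

function looksLikeInjection(userInput: string): boolean {
  return INJECTION_PATTERNS.some((p) => p.test(userInput));
}

// Output side: validate the model's reply against a strict JSON shape
// before it ever reaches the user.
type AgentReply = { answer: string; sources: string[] };

function parseReply(raw: string): AgentReply | null {
  try {
    const data = JSON.parse(raw);
    if (typeof data.answer !== "string" || !Array.isArray(data.sources)) return null;
    return data as AgentReply;
  } catch {
    return null; // malformed JSON: reject, then retry or fall back
  }
}
```

A `null` from the output validator is a signal, not an error page: the caller can retry with a stricter prompt or fall back to a canned response.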
Not every question needs a GPT-4o answer. Route by intent: simple FAQ → small model + cache; complex reasoning → big model. 3–5x cost reduction is realistic.
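A routing table like the following captures the idea; the model names and the word-count heuristic are placeholders for a real intent classifier:

```typescript
// Hypothetical routing sketch: route cheap, cacheable FAQ traffic to a
// small model and reserve the large model for complex reasoning.
type Intent = "faq" | "reasoning";

const ROUTES: Record<Intent, { model: string; cacheable: boolean }> = {
  faq: { model: "small-model", cacheable: true },        // cheap + cacheable
  reasoning: { model: "large-model", cacheable: false }, // expensive, on demand
};

function classifyIntent(question: string): Intent {
  // Stand-in heuristic; in practice a small classifier model decides this.
  return question.split(/\s+/).length > 12 ? "reasoning" : "faq";
}

function route(question: string) {
  return ROUTES[classifyIntent(question)];
}
```

The cost win comes from the traffic shape: most support questions are short, repetitive, and cacheable, so the big model only sees the long tail.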
OpenTelemetry plus our own dashboard: tokens in/out, latency P50/P95/P99, quality metrics (accuracy, refusal rate), and cost per user. When a metric breaches its threshold, a pager fires immediately.
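At its core the dashboard aggregates a per-request record like this; the field names are illustrative, and in production the values flow through OpenTelemetry rather than an in-memory array:

```typescript
// Minimal sketch of the per-request record behind the dashboard.
type RequestMetric = {
  tokensIn: number;
  tokensOut: number;
  latencyMs: number;
  costUsd: number;
};

// Nearest-rank percentile over a rolling window of recent requests
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

const recent: RequestMetric[] = [
  { tokensIn: 300, tokensOut: 120, latencyMs: 900, costUsd: 0.002 },
  { tokensIn: 450, tokensOut: 200, latencyMs: 1400, costUsd: 0.004 },
  { tokensIn: 280, tokensOut: 90, latencyMs: 2600, costUsd: 0.002 },
];

const p95 = percentile(recent.map((m) => m.latencyMs), 95);
// Alert when the rolling P95 breaches the SLA threshold
if (p95 > 1800) console.warn(`latency P95 ${p95}ms over budget`);
```

The same threshold (1800 ms in the eval example above) gates both CI and the pager, so a regression that slips past one is caught by the other.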
An AI system isn't fundamentally different from any other backend service — it needs the same engineering discipline. If you want to start this way, email us — we can show a running prototype on your data within a week.

By Dezső Mező
Founder, DField Solutions
I've shipped production products from fintech to creator-tooling — for startups and enterprises, from Budapest to San Francisco.