Eval harness
DEFINITION
An eval harness is the runnable infrastructure that, on every model bump, prompt change, and release, automatically runs a fixed test set, computes metrics (accuracy, factuality, refusal rate, latency, cost), stores the results as a time series, and blocks the release if any threshold drops. Saying "we tested it" usually means a developer played a few prompts through once by hand; an eval harness is the CI-wired, regression-catching, version-comparing version of that. Without one, every model bump is flying blind. A serious LLM stack today is a dataset, a runner (Promptfoo, Inspect, or in-house), a scoring layer (LLM-as-judge plus deterministic asserts), and a dashboard where yesterday's run sits next to the new one.
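The loop above can be sketched in a few lines of TypeScript. This is a minimal illustration, not a real runner: `modelUnderTest`, the golden-case shape, and the 0.9 threshold are all hypothetical stand-ins, with a stubbed model in place of a real API call and a single string-containment check standing in for the scoring layer.

```typescript
// One golden case: a prompt plus a deterministic assert on the answer.
type GoldenCase = { prompt: string; mustContain: string };

// Hypothetical stub standing in for a real model call.
function modelUnderTest(prompt: string): string {
  return prompt.includes("capital of France")
    ? "Paris is the capital of France."
    : "I am not sure.";
}

// Run the fixed test set, compute accuracy, and gate on the threshold.
function runEval(cases: GoldenCase[], threshold: number): { accuracy: number; pass: boolean } {
  let hits = 0;
  for (const c of cases) {
    const answer = modelUnderTest(c.prompt);
    if (answer.includes(c.mustContain)) hits++; // deterministic assert
  }
  const accuracy = hits / cases.length;
  // In CI, `pass === false` is what blocks the release.
  return { accuracy, pass: accuracy >= threshold };
}

const golden: GoldenCase[] = [
  { prompt: "What is the capital of France?", mustContain: "Paris" },
  { prompt: "What is the capital of Atlantis?", mustContain: "not sure" },
];

const result = runEval(golden, 0.9);
console.log(result.accuracy, result.pass);
```

A real harness would swap the stub for an API client, add LLM-as-judge scoring next to the string asserts, and append each run's metrics to a store so regressions show up against yesterday's numbers.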
- RAG (Retrieval-Augmented Generation)→
An AI architecture where the model retrieves relevant documents from your own data before answering, and only reasons over that context. Grounding answers in retrieved sources cuts hallucinations dramatically.
- LLM (Large Language Model)→
A neural model with billions of parameters (GPT-4, Claude, Mistral) that generates text. In production we never use one bare · always wrapped in retrieval and guardrails.
- Embedding→
A vector representation of text (e.g. 1536 floats). If two embeddings are close, the meanings are close. In RAG we use this to pick relevant chunks.
- Vector database→
A database specialised for fast approximate-nearest-neighbour search over embedding vectors (pgvector, Qdrant, Weaviate). The engineering base of RAG retrieval.
- Eval (LLM evaluation)→
An automated test suite that runs ~50–200 'golden' questions against the model before every release and checks that quality metrics (accuracy, factuality, latency) clear the threshold.
- Guardrail→
An input- or output-layer that filters the model's prompt/response (PII scrubbers, prompt-injection detectors, JSON-schema validation, topic blocks). Not before/after the model · around it.
- 30 Sept 2026 · Field Q3 2026 roundup · what shifted, what we shipped, what is broken
- 01 Jul 2026 · Field Q2 2026 roundup · what shifted, what we shipped, what is broken
- 26 Apr 2026 · RAG's three failure modes · and the diagnostic table we use on every audit
- 26 Apr 2026 · We built our own LLM eval harness in 200 lines of TS · here is the file
- 26 Apr 2026 · Why your AI agent leaks money · 6 prompt-cache wins worth doing this week
- 26 Apr 2026 · OWASP LLM Top 10 v2 · what changed and what to ship
- 23 Apr 2026 · On-device LLMs in 2026 · Gemini Nano vs Apple Intelligence for mobile builds
- 22 Apr 2026 · Signed-firmware OTA pipeline · the 2026 default we ship